Business continuity remains a top priority for global companies, given that disruptions caused by natural disasters, regional network and power outages, cyberattacks and breaches, and user error (just to name a few) are a matter of when, not if.
The case for business continuity is particularly compelling for a company such as The Depository Trust & Clearing Corporation (DTCC), which is designated as a systemically important financial market utility (SIFMU), a U.S. Congress-enacted status recognizing that disruption or failure of such an organization would destabilize financial markets. This is why DTCC is committed to delivering the world’s most efficient and resilient post-trade financial market infrastructure. Snowflake on AWS supports our business resiliency initiatives and enables us to meet and scale disaster recovery with operational efficiency and confidence.
Before we go further into our Snowflake and AWS story, here’s a bit more about DTCC to help you understand what’s at stake. We settle a majority of securities transactions in the U.S., with $4.5 trillion per day in U.S. government securities and a monthly average of $8.35 trillion in mortgage-backed securities. You get the idea: business continuity is imperative for us, whether we are settling securities transactions or running internal reports, so our IT strategy is based on the three foundational pillars of security, resilience and stability.
Building resiliency into every element with Snowgrid
At DTCC, the notion of resiliency is built into all our initiatives, whether for clearing securities or offering clients the ability to perform data analytics, including how we go about modernizing our applications. Each application has a disaster recovery plan, including what we call a runbook, detailing the failover and failback schema as well as the objectives for the two main criteria in disaster recovery:
- Recovery point objective (RPO): The maximum amount of data, measured as a window of time before the incident, that you will tolerate losing in the event of disaster
- Recovery time objective (RTO): The maximum amount of time you will tolerate an application not being available in the event of disaster
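To make the two objectives concrete, here is a minimal sketch of how the outcome of a disaster recovery exercise might be checked against them. The `Objectives` class, the `evaluate_exercise` function, and all timestamps are hypothetical illustrations, not DTCC's runbook format or actual figures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    """Hypothetical per-application targets, as a runbook might record them."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable downtime


def evaluate_exercise(objectives, last_replicated, outage_start, service_restored):
    """Compare one disaster recovery exercise against its objectives."""
    # Data written after the last successful replication is what would be lost.
    data_loss_window = outage_start - last_replicated
    # Downtime runs from the start of the outage until service is back.
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Illustrative values only.
targets = Objectives(rpo=timedelta(minutes=5), rto=timedelta(minutes=15))
result = evaluate_exercise(
    targets,
    last_replicated=datetime(2024, 1, 1, 11, 58),
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 10),
)
print(result["rpo_met"], result["rto_met"])
```

Expressing both objectives as durations keeps the comparison simple: RPO is checked against the gap since the last replicated state, and RTO against the length of the outage.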
Since implementing Snowflake on AWS in June 2020 for our risk and data analytics, our organization has been incident-free. One of the reasons for this resiliency success is Snowflake’s Snowgrid capabilities. Snowgrid enables customers to replicate data across regions and clouds, unlocking greater resiliency and minimizing business disruption.
We have conducted at least 15 disaster recovery exercises using Snowgrid technology for business continuity. Our Snowflake instance handles over 700,000 queries per day across 15 applications supporting more than 400 users, and we have been able to achieve close to zero data loss and near-zero RTO using Snowflake’s account replication capabilities.
Snowflake’s built-in redundancy is a major benefit for DTCC: there is triple redundancy for all critical services, and failed parts of any query are retried automatically. Within a region, Snowflake runs across AWS availability zones, and it also offers cross-region replication and failover, which has helped us achieve our business continuity goals of close to zero data loss and near-zero recovery time. We can use the Snowflake Time Travel feature to query and restore deleted data for up to 90 days, and the Fail-safe feature provides an additional seven days of recoverability after the Time Travel retention period ends.
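The two recovery windows described above stack one after the other, which the following sketch illustrates with simple date arithmetic. The function name `recovery_status` and the sample dates are hypothetical. The sketch also assumes the configured Time Travel retention may be shorter than the 90-day maximum. Note that Fail-safe data is recovered through Snowflake rather than queried directly by users.

```python
from datetime import datetime, timedelta

TIME_TRAVEL_MAX_DAYS = 90  # upper bound on Time Travel retention mentioned above
FAIL_SAFE_DAYS = 7         # additional period after Time Travel retention ends


def recovery_status(deleted_at, now, retention_days=TIME_TRAVEL_MAX_DAYS):
    """Classify how data deleted at `deleted_at` could still be recovered at `now`."""
    if not 0 <= retention_days <= TIME_TRAVEL_MAX_DAYS:
        raise ValueError("retention_days must be between 0 and 90")
    age = now - deleted_at
    if age <= timedelta(days=retention_days):
        return "time_travel"    # still within the Time Travel retention period
    if age <= timedelta(days=retention_days + FAIL_SAFE_DAYS):
        return "fail_safe"      # past Time Travel, inside the seven-day window
    return "unrecoverable"      # both windows have elapsed


deleted = datetime(2024, 1, 1)
for days_later in (30, 95, 98):
    print(days_later, recovery_status(deleted, deleted + timedelta(days=days_later)))
```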
Snowgrid’s account replication capabilities allow each account to have one or more failover groups, so we can segregate apps by line of business. This lends a lot of flexibility to our disaster recovery process design, including the ability to fail over an app with its own connection URL intact, so the app and its connection fail over together (and can fail back together as well). We also gain the ability to rotate apps independently without impacting one another.
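As a rough illustration of per-application failover, the sketch below pairs each application with its own failover group and connection object and generates the statements that would promote a secondary account for one application only. It is not DTCC's actual runbook. The application, group, and connection names are hypothetical, and the statement forms are based on Snowflake's `ALTER FAILOVER GROUP ... PRIMARY` and `ALTER CONNECTION ... PRIMARY` commands, which should be checked against the current documentation before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppFailoverPlan:
    """Hypothetical mapping of one application to its replication objects."""
    app: str
    failover_group: str  # failover group holding the app's replicated objects
    connection: str      # connection object behind the app's connection URL


def promote_statements(plan):
    """SQL to run in the secondary account to make it primary for one app."""
    return [
        # Promote the app's replicated objects in this account.
        f"ALTER FAILOVER GROUP {plan.failover_group} PRIMARY;",
        # Redirect the app's connection URL to this account.
        f"ALTER CONNECTION {plan.connection} PRIMARY;",
    ]


# Illustrative names only.
plans = [
    AppFailoverPlan("risk_reporting", "risk_fg", "risk_conn"),
    AppFailoverPlan("data_analytics", "analytics_fg", "analytics_conn"),
]

# Fail over only the risk application; the analytics group is untouched.
for statement in promote_statements(plans[0]):
    print(statement)
```

Failing back would use the same two statements, run in the original account, which is one reason a single code path can serve both directions.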
Reaping the benefits: consistency, speed, collaboration and cost savings
We always strive for an RTO of zero. Snowflake supports this effort with many of its key features, including multi-cloud support, on-demand scalability, SOC 1 and SOC 2 compliance, replication, and failover. Over the past 9+ months we have done resiliency (chaos) testing, stress testing, and testing of P99 lags, and we feel that we have put Snowflake replication through thorough testing with good success.
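For readers unfamiliar with the term, P99 lag is the value that 99% of observed lag measurements fall at or below, so it captures the slow tail rather than the average. A minimal sketch of the calculation follows, using the nearest-rank method. The function name and the sample values are hypothetical and are not measurements from our environment.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative lag samples in seconds; one slow outlier dominates the tail.
lags = [12, 15, 11, 14, 13, 240, 16, 12, 13, 15]
print("P50:", percentile_nearest_rank(lags, 50))
print("P99:", percentile_nearest_rank(lags, 99))
```

Tracking the 99th percentile alongside the median makes it easier to notice occasional slow cycles that an average would hide.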
At DTCC, the benefits of Snowgrid replication and failover include consistency, speed and cost savings.
Consistency: Automated syncing across primary and secondary accounts and cloud providers eliminates manual migration tasks and improves operational efficiency. Each application has one runbook for disaster recovery (DR) processes globally, meaning there is only one code base for centralized management and execution of replication. We can use the same code base and process for the U.S. and EU, saving effort.
Speed: An application can be DR-enabled, tested and equipped with its runbook (detailing DR plans) in less than three days. The simplicity and elegance of design make it fast to work with Snowflake for DR.
Cost savings: Snowflake replication is inexpensive. Our previous on-premises replication solution doubled our cost because we had to double the hardware and licensing.
With Snowflake’s separation of compute and storage, highly compressed micro-partitions are replicated, which improves storage efficiency and data freshness at the replication site. Paired with the ability to spin up compute resources instantly, we are able to recover quickly while paying for the compute only when needed. Avoiding the need to dual-load and transfer data (ETL) has helped us realize savings of roughly 30%.
Four tips for business continuity success
DTCC’s partnership with the Snowflake team gives us a close and constant feedback loop and the opportunity to try out new features while in private preview. Together, we’ve made it possible to move big rocks—complex things like System for Cross-domain Identity Management (SCIM) provisioning and user replication.
As you undertake (or continue) your own business continuity initiatives, we highly recommend Snowflake as the foundation and offer this advice:
- Make sure you understand your company’s assets and identify what represents acceptable loss or downtime (if any) for each application.
- Test constantly and look for edge cases.
- Automate, automate, automate—it’s the only way to achieve the scale and efficiency needed for mission-critical applications.
- Keep measuring for continuous improvement.
At DTCC, we pride ourselves on designing our IT strategies for resiliency right from the start.
With Snowflake, and the cross-cloud abilities of Snowgrid, we know that the security and operations aspects of our architecture are covered so we can focus on optimizing the user experience and adding value to our business.
Curious about Snowgrid? Read the Operate at Global Scale with Snowgrid solution brief.