How We Built Snowflake on Azure

Author: Polita Paulus

Engineering, Snowflake Technology

Today, we announced the general availability of Snowflake on Azure. As a member of the engineering team that built Snowflake on Azure, I’m especially excited to unveil what we’ve been working on.

Snowflake on Azure has been an ambitious project. We wanted to offer the same Snowflake service customers already use on other cloud platforms but built for Azure, including all existing and new features, with a single code base, and with the same performance characteristics. In this post, I’ll tell you a little more about how we did it, and some of the Azure features and strengths we built upon.

Leveraging Azure’s Strengths

Enabling Snowflake to run on Azure involved three major areas of work: building on top of Azure Blob Storage for all internal and customer-facing persistent storage, using Azure Compute to run workloads, and securing all access using Azure Active Directory and the security features built into Azure components.

Storage

Azure Blob Storage has a two-level hierarchy. Storage accounts hold containers, and containers have a classic folder hierarchy within them. Containers can be independently secured with Shared Access Signature (SAS) tokens, which are time- and permission-scoped credentials. When accessing customer data, Snowflake uses SAS tokens scoped only to that customer’s container, which ensures that data in one customer’s container is never accessible when running within the context of another customer.
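To make the scoping idea concrete, here is a minimal sketch of a signed token that is tied to one container, one set of permissions, and one expiry time. It is a simplified model, not Azure's actual SAS format or signing scheme: the `SECRET_KEY`, the `container|permissions|expiry|signature` layout, and the function names `issue_token` and `is_allowed` are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical signing key; in Azure, a storage account key plays a similar role.
SECRET_KEY = b"demo-account-key"


def _sign(payload: str) -> str:
    """Return a hex HMAC-SHA256 signature of the payload."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(container: str, permissions: str, expires_at: int) -> str:
    """Build 'container|permissions|expiry|signature' for a single container."""
    payload = f"{container}|{permissions}|{expires_at}"
    return f"{payload}|{_sign(payload)}"


def is_allowed(token: str, container: str, operation: str, now: int) -> bool:
    """Check signature, container scope, permission, and expiry."""
    try:
        tok_container, permissions, expiry, signature = token.split("|")
        expires_at = int(expiry)
    except ValueError:
        return False  # malformed token
    expected = _sign(f"{tok_container}|{permissions}|{expiry}")
    return (
        hmac.compare_digest(signature.encode(), expected.encode())
        and tok_container == container          # scoped to one container
        and len(operation) == 1
        and operation in permissions            # e.g. "r" for read, "w" for write
        and now < expires_at                    # time-limited
    )
```

In this model, a token issued for one customer's container fails the check for any other container, and editing the container name inside the token invalidates the signature.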

Snowflake uses soft delete for Azure storage blobs to protect data from corruption and accidental deletion, and to recover data in case of a catastrophic event. Microsoft built soft delete in coordination with our team, and it allows us to offer data resiliency without building our own snapshotting feature.

Snowflake workloads tend to be storage-intensive, and high-scale storage accounts give Snowflake increased capacity and higher ingress and egress limits. Accelerated networking also gives Snowflake a boost in networking performance, which is important for communication between the machines running a query and for reads and writes to storage. These two features were critical for reaching our performance goals.

Compute

When you run your workload in Snowflake, the machines used to run your queries are dedicated to your exclusive use. To add computing power when you want it, Snowflake elastically allocates machines for your workload using Azure Resource Manager templates. Azure Compute allows us to create, manage, and deallocate those resources while ensuring that a single fault in a data center, or a forced system update, doesn’t affect your ability to run your queries.

Security

We take security very seriously at Snowflake, so building an airtight security model was crucial. Our security model is built upon the native security concepts within Azure. We use Azure Active Directory to manage identities and provide continuous security logging and monitoring. As I mentioned above, Snowflake uses SAS tokens to ensure one customer’s data is never accessible while running another customer’s query, even to internal Snowflake processes. But SAS tokens allow us to secure more than just internal data within a storage container. Snowflake also dynamically creates short-lived tokens that our Snowflake drivers use to retrieve results files, or to put and retrieve data in storage areas such as table, user, and named stages. These tokens ensure that connections requesting data are secured using TLS and originate from Snowflake’s own IP addresses. Each SAS token can be used only for the specific operations and files a customer needs, following the principle of least privilege. Finally, all SAS tokens we create expire after a limited time.
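The restrictions described above map onto fields carried in the query string of a SAS URL: `sp` (permissions), `se` (expiry), `spr` (allowed protocol), `sip` (allowed IP address or range), and `sr` (resource type). The sketch below reads those fields and checks them against a least-privilege policy. The account name, path, IP address, and signature in `EXAMPLE_URL` are placeholders, the helper names `describe_sas` and `is_acceptable` are hypothetical, and for brevity only the full `YYYY-MM-DDThh:mm:ssZ` expiry format is handled. This illustrates how such a token can be inspected, not how Snowflake validates tokens internally.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Placeholder URL: the account, path, IP address, and signature are made up.
EXAMPLE_URL = (
    "https://exampleaccount.blob.core.windows.net/stage/results/part-0.csv"
    "?sv=2018-03-28&sr=b&sp=r&se=2018-07-12T20%3A00%3A00Z"
    "&spr=https&sip=203.0.113.10&sig=PLACEHOLDER"
)


def describe_sas(url: str) -> dict:
    """Extract the scoping fields from the query string of a SAS URL."""
    query = parse_qs(urlsplit(url).query)

    def first(key):
        return query.get(key, [None])[0]

    expiry = first("se")
    expires = None
    if expiry:
        # Assumes the full-seconds UTC format; other ISO 8601 forms are not handled.
        expires = datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
    return {
        "permissions": first("sp"),  # e.g. "r" for read
        "protocol": first("spr"),    # "https" means TLS only
        "ip_range": first("sip"),    # allowed source IP address or range
        "resource": first("sr"),     # "b" for blob, "c" for container
        "expires": expires,
    }


def is_acceptable(url: str, allowed_permissions: str, now: datetime) -> bool:
    """True if the token is HTTPS-only, unexpired, and grants no extra permissions."""
    fields = describe_sas(url)
    return (
        fields["protocol"] == "https"
        and fields["permissions"] is not None
        and set(fields["permissions"]) <= set(allowed_permissions)
        and fields["expires"] is not None
        and now < fields["expires"]
    )
```

A read-only, HTTPS-only token passes this check until its expiry time; the same URL with `sp=rw` or `spr=https,http` is rejected.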

Snowflake encrypts all data at all times. Data on Snowflake’s storage accounts is encrypted at rest using Azure Storage Encryption. On top of that, we apply a second layer of encryption, using AES-256 for both data and key encryption with Snowflake-managed keys. Like SAS tokens, the encryption keys for a customer’s account are retrievable only when running queries for that account, ensuring one customer can’t decrypt another customer’s data.

External stages allow customers to import and export data they manage on their own Azure storage accounts. Because SAS tokens offer fine-grained, scoped, expiry-based control, we use customer-created SAS tokens to access data within Azure external stages. In addition to requiring scoped tokens, we strongly encourage customers to encrypt all files on their external stages using client-side encryption and a 256-bit encryption key. Snowflake then decrypts that data on load and encrypts it on unload using the customer’s keys and the same AES data encryption and AES-KW key encryption supported by the Azure Storage SDK.

How We Build It

Supporting multiple cloud platforms can impose a significant tax on your engineering and operations teams. From the beginning, we decided we needed to build support for Azure using the same code base we use for other cloud platforms. We use layers of abstraction to encapsulate interactions with cloud-specific storage, compute, and security APIs. We have a single build process and a single set of binaries that we deploy to all Snowflake regions, no matter what cloud platform they run on. As Snowflake expands to more regions, this keeps our engineering, release, and maintenance processes scalable. It also means we can deploy a new release within hours in every region and monitor all regions and cloud platforms using a single set of tools.
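As a rough illustration of what such an abstraction layer can look like for storage, the sketch below defines one cloud-neutral interface with two interchangeable adapters. Everything here is hypothetical: `BlobStore`, `AzureBlobStore`, `OtherCloudStore`, `make_store`, and `write_result` are names invented for this example, and the adapters keep data in memory where real ones would call each platform's SDK. It is not Snowflake's actual code.

```python
from typing import Dict, Protocol, Tuple


class BlobStore(Protocol):
    """Cloud-neutral storage interface that the rest of the code depends on."""

    def put(self, path: str, data: bytes) -> None: ...

    def get(self, path: str) -> bytes: ...


class AzureBlobStore:
    """In-memory stand-in; a real adapter would call the Azure Storage SDK."""

    def __init__(self, container: str) -> None:
        self.container = container
        self._blobs: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self._blobs[f"{self.container}/{path}"] = data

    def get(self, path: str) -> bytes:
        return self._blobs[f"{self.container}/{path}"]


class OtherCloudStore:
    """In-memory stand-in for another platform's bucket-and-key object store."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self._objects[(self.bucket, path)] = data

    def get(self, path: str) -> bytes:
        return self._objects[(self.bucket, path)]


def make_store(platform: str, name: str) -> BlobStore:
    """Pick the adapter at deployment time; callers never see the difference."""
    if platform == "azure":
        return AzureBlobStore(name)
    if platform == "other":
        return OtherCloudStore(name)
    raise ValueError(f"unsupported platform: {platform}")


def write_result(store: BlobStore, query_id: str, payload: bytes) -> str:
    """Platform-agnostic logic: the same code path runs on every cloud."""
    path = f"results/{query_id}.bin"
    store.put(path, payload)
    return path
```

Only `make_store` knows which platform is in use, so a function like `write_result` can ship in a single set of binaries and behave the same in every region.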

Partnering with Microsoft

We couldn’t have built this without close coordination with Microsoft. The Snowflake on Azure project included many long phone sessions and meetings in Redmond conference rooms. Several new features were built with Snowflake in mind, including Azure storage soft delete and improvements to virtual machine provisioning. A big thank you goes out to the Azure team for helping us deliver together.

What’s Next?

We’re far from done. Every new feature for Snowflake is now built in tandem for each cloud platform we support. And, as new features are added to Azure, we will leverage them in Snowflake on Azure. Today, we call East US 2 home. In the coming months, we will branch out to new Azure regions to reach more customers. But first, maybe a short vacation for the Snowflake on Azure team.