A Deep Dive into Envoy at Snowflake

Back in September 2023, we announced our intention to migrate Snowflake's edge networking stack from NGINX to Envoy Proxy. Since then, the Traffic Team has been working to make the migration a reality across Snowflake's ~50 supported cloud regions. With the migrations wrapped up and 100% of Snowflake traffic served by Envoy, we're excited to dive into Snowflake's new network architecture and share a bit about our journey to get here.
Background
Snowflake's edge networking stack routes customer requests to the correct cloud region and the backing Snowflake service across a massive global, cross-cloud footprint. The stack provides secure connectivity via Transport Layer Security (TLS), load balancing for reliability, and business-aware routing for product functionality. The Snowflake edge interface is defined by the URLs used to connect, the TLS certificates presented by the edge proxy and the TLS features supported; all other details are transparent to Snowflake customers.
Today, Snowflake’s edge networking stack is built on Envoy Proxy, and the Traffic Team maintains a fleet of Envoy Proxies in each of our supported cloud regions. Collectively at Snowflake, Envoy serves hundreds of thousands of requests per second, and facilitates query latency-sensitive workloads such as Unistore, where every network hop counts.
In 2023, Snowflake decided to migrate our legacy NGINX ingress layer to an Envoy-based stack. This decision was made to achieve uniform infrastructure across our cloud regions and accommodate Snowflake’s rapidly expanding feature set, with new products bringing new requirements. For example, Streamlit in Snowflake relies on the WebSocket protocol to serve Streamlit apps. WebSockets benefit from long-lived TCP connections, which were challenging to offer efficiently on the NGINX proxy layer. Each proxy configuration update created a new NGINX process pool. This didn’t scale as multiple generations of processes, each consuming memory proportional to the size of the routing configuration, were kept alive. In contrast, Envoy doesn’t require new processes when the configuration changes, enhancing performance and reducing hardware resource requirements. As an additional benefit, Envoy enabled both HTTP/2 and TLS 1.3 as part of the migration, bringing both performance and security improvements to our customers.
With multiple ways of connecting to Snowflake services, including private connectivity, migrating transparently was a major challenge. It required engineering new systems to seamlessly move customers between ingress stacks while working around functional limitations of the cloud service providers. All of the migrations have now been completed, and decommissioning of the legacy NGINX fleet is nearly finished. Let's dive into Snowflake's Envoy architecture and how we accomplished the mammoth task of migrating in less than 18 months.
Network architecture
Multiregion, multi-CSP (Cloud Service Provider)

Snowflake customers access their account via the ingress stack outlined in Figure 1. Customer HTTP requests traverse a CSP-managed L4 load balancer to reach Envoy, which holds the routing logic for directing requests to the appropriate backend service. Each Snowflake region is fully isolated to limit the scope of any issue, so the Traffic Team operates a dedicated fleet of Envoy Proxies in each supported cloud region to handle customer requests. We use per-account DNS records to map Snowflake accounts to the appropriate region's Envoy fleet.
Dynamic Envoy control plane

We configure Envoy Proxy at Snowflake using our xDS control plane, written in Go. This control plane receives inputs from various data sources and generates Envoy configuration resources to program Envoy to correctly route customer traffic to backend services. These backend services could be part of the service layer, running on our internal Kubernetes platform, or even a customer workload running on Snowpark Container Services. Using the xDS protocol, Envoy is able to receive these dynamic configuration updates and reprogram the data plane to rapidly adapt to changing routing requirements.
Our xDS control plane continuously receives data from the Snowflake service layer to allow Envoy to incorporate per-customer routing rules into the configuration (e.g., to which instances a particular customer should be directed, based on load and tenancy). With rapid auto-scaling at the service layer, we have strict requirements on the time to propagate Envoy configuration changes through xDS, requiring efficient processing of dynamic inputs and asynchronous computation in the control plane. The xDS control plane is also responsible for providing Envoy with TLS certificates for securing customer requests to Snowflake via the secret discovery service (SDS).
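One common way to build such a control plane in Go is the open source go-control-plane library, which lets a service publish versioned configuration snapshots that are streamed to Envoy over ADS. The sketch below is illustrative only and is not our production code: the cluster name and Envoy node ID are hypothetical, and the gRPC server wiring is omitted.

package main

import (
	"context"
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	// Snapshot cache keyed by Envoy node ID; go-control-plane's gRPC server
	// (not shown) serves the snapshots to Envoy over the ADS stream.
	snapshotCache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)

	// A hypothetical EDS-backed cluster for a backend service. In practice the
	// control plane derives clusters, routes, endpoints and secrets from
	// service-layer inputs.
	backend := &clusterv3.Cluster{
		Name:                 "snowflake-service-layer",
		ConnectTimeout:       durationpb.New(time.Second),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_EDS},
		EdsClusterConfig: &clusterv3.Cluster_EdsClusterConfig{
			EdsConfig: &corev3.ConfigSource{
				ConfigSourceSpecifier: &corev3.ConfigSource_Ads{Ads: &corev3.AggregatedConfigSource{}},
			},
		},
	}

	// Publishing a new snapshot version reprograms connected Envoys without restarts.
	snap, err := cachev3.NewSnapshot("v1", map[resourcev3.Type][]types.Resource{
		resourcev3.ClusterType: {backend},
	})
	if err != nil {
		panic(err)
	}
	if err := snapshotCache.SetSnapshot(context.Background(), "ingress-envoy", snap); err != nil {
		panic(err)
	}
}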
Envoy’s extensibility has played a big role in supporting new features like Streamlit in Snowflake; today, Lua filters are used for some request transformations, though as we look ahead to more functionality at the edge layer, we’ll be exploring in-house C++ extensions to Envoy. We look forward to sharing more in-depth information on how our dynamic xDS control plane works in a future blog post.
Migrating to Envoy
Mechanics of the migration
Public
Each Snowflake account has its own DNS record, which directs traffic for the account to the appropriate cloud region's ingress stack. During the migrations, we used these DNS records to gradually shift customer traffic from NGINX to Envoy. To make this happen, we introduced an internal concept of a "load balancer identifier," with different identifiers mapping to different CSP load balancers and edge proxy backends via DNS CNAME records. In its metadata, the Snowflake service layer associates each Snowflake account with the load balancer that should handle requests for that account. The service layer issues DNS mutations when this metadata changes, and the updates are propagated to public DNS.
At the start of our migration, all Snowflake account DNS records were backfilled to resolve via the NGINX load balancer CNAME in their region. Over the course of the migrations, we automatically updated account metadata to map accounts to the new Envoy load balancer and, via the resulting DNS change, shifted each account's traffic to the Envoy ingress stack. This mechanism allowed us to migrate traffic gradually between the ingress stacks, ramping up from 10% all the way to 100% in each supported cloud region.
Given that Snowflake account DNS records are publicly resolvable, you can actually see this for yourself:
$ dig +noall +answer s3testaccount.snowflakecomputing.com
s3testaccount.snowflakecomputing.com. 15 IN CNAME partition-05.lbid-100.prod1.traffic.snowflakecomputing.com.
partition-05.lbid-100.prod1.traffic.snowflakecomputing.com. 300 IN CNAME lbid-100.prod1.lb.snowflakecomputing.com.
lbid-100.prod1.lb.snowflakecomputing.com. 132 IN CNAME a98fad7d84389a5f7723e4c78528f5db-732bd8b7d926982c.elb.us-west-2.amazonaws.com.
a98fad7d84389a5f7723e4c78528f5db-732bd8b7d926982c.elb.us-west-2.amazonaws.com. 33 IN A 54.71.115.210
a98fad7d84389a5f7723e4c78528f5db-732bd8b7d926982c.elb.us-west-2.amazonaws.com. 33 IN A 52.42.165.140
a98fad7d84389a5f7723e4c78528f5db-732bd8b7d926982c.elb.us-west-2.amazonaws.com. 33 IN A 34.211.109.146
For the curious, the "partition" CNAME in this resolution chain provides a layer of DNS granularity between the per-account CNAME and the per-region "lbid" load balancer identifier, and facilitates changes like "redirect 10% of traffic to a specific load balancer."
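To make the partition idea concrete, here is a simplified sketch of how a partition layer could drive fractional shifts. It is purely illustrative: the partition count, the hash-based assignment and the "lbid-001" legacy identifier are hypothetical (only lbid-100 appears in the resolution above), and the real mechanism lives in account metadata and DNS rather than application code.

package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 20 // hypothetical partition count

// partitionFor deterministically assigns an account to a DNS partition.
func partitionFor(account string) int {
	h := fnv.New32a()
	h.Write([]byte(account))
	return int(h.Sum32()) % numPartitions
}

// cnameTarget returns the load balancer identifier a partition's CNAME should
// point at, given the percentage of partitions already shifted to Envoy.
func cnameTarget(partition, envoyPercent int) string {
	if partition < numPartitions*envoyPercent/100 {
		return "lbid-100.prod1.lb.snowflakecomputing.com." // Envoy fleet
	}
	return "lbid-001.prod1.lb.snowflakecomputing.com." // legacy NGINX fleet (hypothetical identifier)
}

func main() {
	p := partitionFor("s3testaccount")
	fmt.Printf("partition-%02d -> %s\n", p, cnameTarget(p, 10)) // at a 10% rollout
}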
Our migrations were gated on the health of the Envoy stack, and our automation was predicated on positive signals from our internal synthetic probing platform, which we’ll cover later.
Private connectivity

Snowflake currently runs on three major CSPs: AWS, Azure and GCP. While all three have similarly shaped private connectivity offerings (PrivateLink, Private Link and Private Service Connect, respectively), each poses unique challenges when seamlessly changing the backend that receives traffic through the service. Snowflake's architecture, with the PrivateLink Endpoint Services in one virtual network and the Envoy ingress layer in another, introduced further complications that required bespoke configurations to facilitate the migration. In AWS, two layers of CSP L4 load balancers were used because, at the time of migration, we could not register the Kubernetes Envoy targets with the NLB behind the PrivateLink Endpoint Service. Similarly, in Azure and GCP, it was not possible to register load balancer targets from a different virtual network. That restriction led us to build a fleet of Envoy TCP proxies in the service virtual network to forward traffic to the Kubernetes virtual network.
Synthetic probes
To build confidence in a major customer traffic migration, we overhauled Snowflake's monitoring at the network level and built our new monitoring stack on Cloudprober, which has since been adopted company-wide. A key principle for making production changes at Snowflake is having visibility into the health of the system. As a prerequisite for the Envoy migrations, we therefore deployed new synthetic probes to each region, verifying that HTTP requests to an internal account configured to route via Envoy were successful and served by the intended ingress stack.
Snowflake's monitoring posture gained a huge boost from this effort. By moving from an external synthetic probe provider to an internal platform built on Cloudprober, we were able to use synthetic probes to validate private connectivity for the first time. This proved to be a valuable signal for migration automation when making go/no-go decisions.
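To give a flavor of what such a probe asserts, here is a minimal hand-rolled sketch: it checks both that the request succeeds and that the expected ingress stack served it. The health path, probe account hostname and the stack-identifying response header are hypothetical, and the production probes run on Cloudprober rather than bespoke Go.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeIngress issues a synthetic HTTP request and verifies that it succeeds
// and that it was served by the expected ingress stack.
func probeIngress(url, wantStack string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("probe request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("probe got status %d", resp.StatusCode)
	}
	// Hypothetical header identifying which proxy stack handled the request.
	if got := resp.Header.Get("X-Ingress-Stack"); got != wantStack {
		return fmt.Errorf("served by %q, want %q", got, wantStack)
	}
	return nil
}

func main() {
	// Hypothetical internal probe account pinned to the Envoy stack.
	fmt.Println(probeIngress("https://probeaccount.example.snowflakecomputing.com/healthcheck", "envoy"))
}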
Major challenges
Over the course of the migration effort, Snowflake’s Traffic Team faced a wide variety of challenges. Some of these challenges were self-inflicted, while others were due to external behavioral oddities.
Implicit behavioral contracts
The behavior of Snowflake's legacy NGINX ingress stack was largely defined as "however the system currently behaves," and little was codified in explicit tests of the ingress functionality. This represented a huge barrier when approaching the migration to Envoy. We had two options to proceed, each with various trade-offs. Either we could redefine the behavior of the system from first principles, which is particularly challenging in a large distributed system, or we could mimic the existing behavior on an entirely new stack, often without knowing whether a particular "feature" is working as intended. In our case, this situation was exacerbated by the presence of business logic at the edge routing layer, as many teams at Snowflake had implicit dependencies on the status quo of NGINX functionality. Delayed discovery of gaps in feature parity had knock-on effects for new product launches dependent on the migration; the upcoming section on TLS certificate revocation list problems addresses one such gap.
Given the unbounded scope of possible behavioral dependencies, we opted to have Envoy emulate NGINX’s behavior as closely as possible by analyzing configuration differences and contrasting HTTP request logs between the stacks. Even with these precautions, several behavioral differences evaded our detection, including case sensitivity in the matching of HTTP methods for routes and behavior around empty HTTP headers (NGINX dropped these, while Envoy by default forwarded them upstream). We consciously decided to keep some notable differences — namely, no longer cycling connections every minute.
Certificate authority change in AWS
One of the major goals of the migration to Envoy was to achieve uniformity in our edge ingress footprint across the CSPs. On the NGINX stack, TLS termination differed across CSPs: in AWS, we terminated TLS at an L7 Elastic Load Balancer, with TLS certificates issued by AWS Certificate Manager (signed by Amazon Trust Services), while Azure and GCP terminated TLS on the CSP load balancers (Azure Application Gateway and GCP Application Load Balancer, respectively) with DigiCert certificates. On Envoy, TLS is terminated at the Envoy proxy with DigiCert certificates on all CSPs, bringing uniformity to Snowflake's TLS interface.
Initially, we intended to perform the migration from Amazon Trust Services to DigiCert-issued TLS certificates coincident with the Envoy migration. We quickly realized that this was a breaking change for some customers with customized TLS truststores, and so delayed the uniformity change to allow customers time to prepare for the switchover (Change of Certificate Authority and OCSP Allowlist for AWS Customers). To keep migration momentum, migrations were temporarily performed with TLS termination at the NLB fronting Envoy, decoupling the backend ingress stack migration from the changes needed to trust DigiCert-issued certificates in customer environments.
Per-customer migrations for private connectivity
When private connectivity migrations were first initiated, the migration applied to all customers in a given cloud region simultaneously. This approach was going well, though with a less-than-ideal risk posture, until individual customers in some regions experienced compatibility issues with the Envoy stack. Because our only unit of migration was a cloud region, single-customer issues forced region-wide rollbacks to mitigate impact for those customers. Unfortunately, some Snowflake product launches, including private connectivity for Streamlit in Snowflake, were dependent on this migration, and an indefinite holdback on NGINX was not an option. To move the migration forward while taking customer needs into account, we built a mechanism, sketched below, whereby individual customer accounts could be temporarily routed back to NGINX via a TCP filter in the L7 Envoy fleet, so that all but a select few hostnames would be on Envoy and able to access new products. This enabled a more measured approach to investigating and resolving single-customer issues.
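One plausible way to express such a holdback with standard Envoy building blocks is an SNI-matched filter chain that TCP-proxies held-back hostnames to a cluster pointing at the legacy fleet. The sketch below uses the go-control-plane types; the cluster name is hypothetical and this is a possible realization rather than our exact implementation.

package ingressconfig

import (
	listenerv3 "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
	tcpproxyv3 "github.com/envoyproxy/go-control-plane/envoy/extensions/filters/network/tcp_proxy/v3"
	"google.golang.org/protobuf/types/known/anypb"
)

// tcpProxyFilter builds a network filter that blindly forwards TCP bytes to a cluster.
func tcpProxyFilter(cluster string) *listenerv3.Filter {
	cfg, err := anypb.New(&tcpproxyv3.TcpProxy{
		StatPrefix:       cluster,
		ClusterSpecifier: &tcpproxyv3.TcpProxy_Cluster{Cluster: cluster},
	})
	if err != nil {
		panic(err)
	}
	return &listenerv3.Filter{
		Name:       "envoy.filters.network.tcp_proxy",
		ConfigType: &listenerv3.Filter_TypedConfig{TypedConfig: cfg},
	}
}

// holdbackFilterChain matches held-back hostnames by TLS SNI (this requires the
// tls_inspector listener filter) and sends them, unterminated, to a cluster
// pointing at the legacy NGINX load balancer; all other traffic falls through
// to the normal HTTP filter chain, which is omitted here.
func holdbackFilterChain(heldBackHostnames []string) *listenerv3.FilterChain {
	return &listenerv3.FilterChain{
		FilterChainMatch: &listenerv3.FilterChainMatch{ServerNames: heldBackHostnames},
		Filters:          []*listenerv3.Filter{tcpProxyFilter("legacy_nginx")},
	}
}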
Diamond routing in AWS

“Diamond routing” is a phenomenon that can introduce a small chance of connection failures in load balancing infrastructure. It can arise on AWS Network Load Balancers configured with both client IP preservation and cross-zone load balancing, when a client connects to the same load balancer backend target via multiple availability zones (AZs). Because each AZ exposes a different load balancer IP, the client's operating system may reuse the same source port for connections entering via different AZs; if cross-zone load balancing then sends both connections to the same target, the backend-side connection 4-tuple (client IP address, client port, destination IP address, destination port) is identical, leaving the backend's TCP session in an incoherent state. This manifests as intermittent TCP connection failures, with a probability of occurrence inversely proportional to the number of load balancer backends and the number of source ports used by the client.
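As a back-of-envelope illustration of that claim (the numbers below are made up), a collision needs a pair of concurrent cross-AZ connections from the same client to land on the same target and reuse the same source port:

package main

import "fmt"

func main() {
	backends := 30.0          // hypothetical number of Envoy targets behind the NLB
	ephemeralPorts := 28000.0 // approximate size of the default Linux ephemeral port range
	// With cross-zone load balancing spreading connections uniformly and source
	// ports chosen independently, a given pair of cross-AZ connections collides
	// with probability roughly 1/(backends * ephemeralPorts).
	fmt.Printf("per-pair collision probability: ~%.1e\n", 1.0/(backends*ephemeralPorts))
	// Rare for any one client, but non-negligible across a large connection volume.
}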
On Snowflake's NGINX stack this problem didn't exist, because we used L7 Elastic Load Balancers, which terminate client connections and open separate connections to the backends. The switch to L4 Network Load Balancers for Envoy, however, exposed some clients to this change in system behavior. After a deep investigation and close collaboration with AWS, we concluded that combining client IP preservation with cross-zone load balancing is a generally unsafe configuration, and our load balancers now enable at most one of the two options.
TLS certificate revocation lists in .NET
Another behavior change between the NGINX and Envoy stacks resulted in a major investigation. After moving from NGINX behind Azure Application Gateways to Envoy behind Azure Load Balancers, we identified a significant spike in Snowflake client memory usage. Eventually, the team discovered that the Application Gateways had been performing TLS OCSP stapling, while the Envoy stack behind the L4 load balancers was not. Without a stapled OCSP response, .NET clients on Linux fall back to downloading and parsing the Certificate Revocation List (CRL) to determine whether a TLS certificate has been revoked. Because our TLS certificate provider, DigiCert, had recently had a mass revocation incident, the CRL referenced by our TLS certificates was very large, and inefficient memory allocation in the .NET runtime on Linux resulted in out-of-memory exceptions for some clients. With a CRL containing approximately 430,000 entries, clients needed to allocate roughly 430 MiB of memory per connection (about 1 KiB per entry).
At Snowflake, we resolved this by implementing OCSP stapling in Envoy, with the stapled responses computed by our xDS control plane; this was another example of implicit system behavior that we (and our customers) depended on but had not explicitly tested before.
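For illustration, a control plane pushing certificates over SDS can attach a pre-fetched, DER-encoded OCSP response alongside the chain and key so Envoy staples it during the handshake (the listener's downstream TLS context also needs an OCSP staple policy configured). The sketch below uses the go-control-plane types with hypothetical inputs; it is not our exact implementation.

package tlscerts

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	tlsv3 "github.com/envoyproxy/go-control-plane/envoy/extensions/transport_sockets/tls/v3"
)

// inline wraps raw bytes as an Envoy inline DataSource.
func inline(b []byte) *corev3.DataSource {
	return &corev3.DataSource{Specifier: &corev3.DataSource_InlineBytes{InlineBytes: b}}
}

// tlsSecretWithStaple builds the SDS secret pushed to Envoy, bundling the
// certificate chain, private key and a pre-fetched OCSP response for stapling.
func tlsSecretWithStaple(name string, certPEM, keyPEM, ocspDER []byte) *tlsv3.Secret {
	return &tlsv3.Secret{
		Name: name,
		Type: &tlsv3.Secret_TlsCertificate{
			TlsCertificate: &tlsv3.TlsCertificate{
				CertificateChain: inline(certPEM),
				PrivateKey:       inline(keyPEM),
				OcspStaple:       inline(ocspDER),
			},
		},
	}
}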
Reflection on outages
While the migration to Envoy was transparent for the vast majority of customers, there was some adverse impact along the way as the new stack was operationalized and scaled up to meet the challenging demands of Snowflake traffic.
Upstream connection limits in Envoy
Our migrations started in the smallest AWS regions by Snowflake usage. With positive signals from customer traffic on Envoy in these regions, we moved to gradual rollouts in the larger regions. Immediately, we began experiencing elevated failure rates, which triggered a quick rollback to NGINX. We discovered that Envoy applies several circuit-breaking limits by default, including a cap of 1,024 connections to an upstream cluster. These defaults limited the throughput we could achieve on Envoy and manifested as customer-visible HTTP request delays. Reconfiguring Envoy to raise these limits in line with the capacity of the upstream clusters alleviated the problem and allowed us to re-migrate customers to Envoy.
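Concretely, these defaults live in each cluster's circuit_breakers thresholds. A hedged sketch of raising them with go-control-plane types follows; the limit value is a placeholder, not our production setting, and real limits should be sized to what the upstream cluster can actually absorb.

package ingressconfig

import (
	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// raiseCircuitBreakers overrides Envoy's default per-cluster circuit-breaking
// thresholds (1024 connections, pending requests and requests) with limits
// sized to the upstream cluster's actual capacity.
func raiseCircuitBreakers(c *clusterv3.Cluster, limit uint32) {
	c.CircuitBreakers = &clusterv3.CircuitBreakers{
		Thresholds: []*clusterv3.CircuitBreakers_Thresholds{{
			Priority:           corev3.RoutingPriority_DEFAULT,
			MaxConnections:     wrapperspb.UInt32(limit),
			MaxPendingRequests: wrapperspb.UInt32(limit),
			MaxRequests:        wrapperspb.UInt32(limit),
		}},
	}
}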
This performance degradation motivated the team to invest in better load testing, to ensure that our ingress layer can meet the scale requirements of the busiest cloud regions.
Support for diverse hostnames
Snowflake users may be familiar with the range of account identifiers that are valid for connecting to Snowflake accounts. Some older URL formats remain supported by Snowflake but were initially not supported by our Envoy configuration, which attempted to derive the set of valid URLs from well-known information about each cloud region. This gap led to cases where customers were no longer able to access their account via a subset of URLs post-migration.
These outages were a huge motivator for the Traffic Team's investment in synthetic probing, and for expanding that probing to cover all of the URLs customers use to access Snowflake today. This effort substantially improved our operational posture as migrations resumed and paved the way for fully automated, hands-off infrastructure changes at the edge proxy layer.
IP address changes
Since the migration to Envoy was driven by DNS changes directing traffic to new L4 load balancers fronting Envoy, it changed the IP addresses that Snowflake customers observe when accessing their accounts. Although Snowflake doesn't publish static ingress IP addresses, we discovered that some customers had taken strict dependencies on the legacy NGINX load balancer IP addresses (e.g., at their firewall layer). This experience has shaped future plans in which each release of Envoy configuration changes deploys net-new load balancers with new IP addresses, more strictly encoding dynamic IP addresses as the public interface to Snowflake services.
What’s next for edge networking at Snowflake
Migration from NGINX to Envoy was the first major evolution of the edge networking architecture at Snowflake, but it certainly isn't the last. Work is underway to transparently rearchitect the reverse proxy layer into a tiered architecture, with an edge layer of Envoy Proxies responsible for general concerns (such as TLS termination and efficient traffic filtering with eBPF) and a second layer of Envoy Proxies handling service-specific logic. This will allow us to get new products and services into the hands of customers faster by centralizing core networking functionality at the edge and letting Snowflake teams focus on the components important to their specific product.
Summary
Completing the migration to Snowflake’s new Envoy ingress stack has delivered huge value for the business, enabling the launch of product offerings like Streamlit in Snowflake and Snowpark Container Services via private connectivity, with more to come. Uniformity of the edge networking layer across CSPs has made it easier for teams to launch products quickly in all supported cloud regions, including via private connectivity. The migration to Envoy also allowed us to launch support for TLS 1.3 for clients connecting to Snowflake services — announced in Changes in TLS Cipher Suite Requirements.
Snowflake's Traffic Team and our edge networking stack both matured significantly over the course of the migration to Envoy. In addition to gaining a deeper understanding of our previously organically grown ingress layer, a renewed automation posture, requiring all changes to be orchestrated by automation tooling, allowed us to scale to the demands of our many-region cloud footprint and move toward uniformity. Experience with customer-impacting events moved the team toward an SLO-driven approach to making infrastructure changes, helping to maintain service quality for Snowflake customers in accordance with our SLA commitments. New product features and new engineers on the team are now onboarded exclusively onto the Envoy stack.
We're excited to consolidate our efforts on our Envoy stack and deliver new features to Snowflake customers faster, and more reliably, than ever before!
Acknowledgements
We'd like to thank the Envoy open source community for their helpful insights on this journey. We'd also like to thank Snowflake customers for their feedback over the course of the migrations and for collaborating with us to identify unique compatibility issues on the new stack. Finally, we'd like to thank all of the teams at Snowflake who worked tirelessly alongside us to execute these migrations as seamlessly as possible.
Curious about the engineering challenges described in this blog post? Interested in opportunities to drive reliability, security and performance improvements for Snowflake? Join us! The Snowflake Traffic Engineering team is hiring in Dublin, Ireland.