Optimizing Savings Plan Usage in Snowflake

Snowflake customers collectively execute several billion queries a day over exabytes of their data. These queries are part of diverse workloads, spanning analytics, AI, data engineering and more. Cloud service providers (CSPs) have pricing models that incentivize long-term commitments of compute capacity for greater discounts, but customer workloads are dynamic and scale up and down rapidly as needed.
Accurately forecasting future demand is challenging due to growth and changes in user demand, new product features with different resource requirements, and the availability of new generations of hardware with different capabilities from the CSPs. This fundamental mismatch between the drivers of supply for the CSPs and the drivers of demand for consumers of cloud resources motivates several opportunities for optimization.
Daily workload patterns
Snowflake is a consumption-based service, and compute demand follows the expected daily, weekly and seasonal patterns of the business world. Snowflake can scale compute and storage up and down independently and instantaneously depending on workload demands. We maintain a free pool of ready-to-go virtual machines (VMs) so new queries can be served instantly, and we bill customers per second, so costs immediately scale down with usage.
As seen in Figure 1, even within a single day, the maximum hourly demand for compute VMs is on average 34% higher than the daily minimum. As Snowflake expands into more global markets across time zones and continents, the effects of this variability must be carefully considered in each region. The cost-saving opportunities of longer-term capacity commitments must be balanced against the need for immediate scalability for customer workloads.
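As a rough illustration, one way to quantify this intra-day swing is to compute the ratio of each day's hourly peak to its hourly minimum. The sketch below does so over a synthetic demand series; the data and the pandas-based approach are illustrative assumptions, not our production telemetry pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic hourly VM demand with a daily cycle; purely illustrative,
# standing in for real per-region demand telemetry.
hours = pd.date_range("2024-01-01", periods=24 * 7, freq="h")
rng = np.random.default_rng(0)
vm_count = 1000 + 150 * np.sin(2 * np.pi * hours.hour / 24) + rng.normal(0, 20, len(hours))
demand = pd.Series(vm_count, index=hours)

# Ratio of each day's hourly peak to that day's hourly minimum.
daily = demand.resample("D")
peak_over_trough = daily.max() / daily.min()

# A mean near 1.34 corresponds to the ~34% intra-day swing described above.
print(f"average peak/trough ratio: {peak_over_trough.mean():.2f}")
```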

Hardware improvements
While weekly demand has a predictable pattern, as shown in Figure 1, we must also account for the steady stream of new hardware platforms and software changes that increase Snowflake's efficiency and affect the total demand for VMs across our workloads. These changes make customer workloads run faster, so resources need to be active for shorter periods of time, which in turn reduces the demand for VMs from the CSPs.
Some of the major hardware-related improvements that CSPs have introduced with their newest VMs in recent years include:
Introduction of more price-performant processors
Transition to DDR5 memory
Improvements in the performance of local SSD
Increased networking bandwidth
The combination of ARM and DDR5 memory has provided up to a 50% improvement in memory bandwidth compared to prior architectures. Memory bandwidth is critical for the hash table and Bloom filter implementations in our workloads, and this increase in bandwidth directly translates to lower query latency. These VM instances also have increased network bandwidth, which helps with data transfer between nodes (e.g., during distributed hash joins). Improvements in the performance of the local SSD translate to faster access times for ephemeral storage.
Altogether, these types of hardware improvements can provide significant performance benefits. Since such changes roll out with some regularity, they must be accounted for when making optimal long-term VM capacity commitments with the CSPs.
Software improvements
In addition to hardware changes, we are continuously making software code efficiency improvements through algorithmic changes and low-level performance optimizations to utilize hardware more effectively. A few representative examples are:
A highly parallelizable encryption mechanism that effectively doubles the network bandwidth and results in bandwidth-heavy queries executing up to 40% faster
Optimized Top-K pruning, which benefits queries that use ORDER BY and LIMIT by an average of 12.5%
Optimizations in memory management for holistic and adaptive broadcast join decisions that improve queries with joins
As with the rollout of new generations of hardware platforms above, these result in customer queries running faster, thus reducing the aggregate demand for VMs from the CSPs.
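One simple way to fold such efficiency gains into a demand forecast is to scale the projected VM hours by the fraction of the workload each improvement touches and the speedup it delivers. The sketch below illustrates that adjustment; the rollout fractions and speedup factors are made-up assumptions, and the real forecasting model is considerably more detailed.

```python
# Illustrative adjustment of a VM-demand forecast for efficiency gains.
# All figures are made-up assumptions, not Snowflake's actual numbers.

baseline_forecast_vm_hours = 1_000_000  # forecast before efficiency gains

improvements = [
    # (fraction of workload affected, speedup factor for that fraction);
    # the fractions are assumed to cover disjoint parts of the workload.
    (0.30, 1.25),  # e.g., a new VM generation serving 30% of the workload
    (0.20, 1.10),  # e.g., a software optimization benefiting 20% of queries
]

adjusted = baseline_forecast_vm_hours
for affected_fraction, speedup in improvements:
    # The affected portion now needs only 1/speedup of its original VM hours,
    # so subtract the hours it no longer requires.
    adjusted -= baseline_forecast_vm_hours * affected_fraction * (1 - 1 / speedup)

print(f"adjusted forecast: {adjusted:,.0f} VM hours")
```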
Optimal compute commitment levels
The periodic workload demands shown above, combined with the continuous improvements in hardware and software, make it challenging to choose long-term, fixed-capacity commitments that reduce compute costs. To minimize costs, we formulate a more precise optimization problem: let the demand curve f(x) represent the number (or cost) of VM instances needed over time, and let the horizontal line y = c represent the capacity commitment level. When the demand curve f(x) is above the commitment level, we must pay more expensive on-demand rates for the excess. When the demand curve f(x) is below the commitment level, we are paying for unused commitments.

The total cost as a function of the commitment level c is then

C(c) = A · ∫ max(f(x) − c, 0) dx + B · ∫ max(c − f(x), 0) dx

where:
A is the cost factor for the area above the line (e.g., on-demand capacity)
B is the cost factor for the area below the line (e.g., unused savings plans)
The attached interactive visualization shows how this cost function, C(c), changes with different commitment levels y = c. Drag the threshold slider to see how a different commitment level changes both the amount of unused commitment and the on-demand charges incurred for demand above the commitment level.
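To make the trade-off concrete, here is a minimal sketch that evaluates C(c) over a range of candidate commitment levels and keeps the cheapest one. The demand series and the cost factors A and B are placeholder values; the production formulation and solution method are described in the Shaved Ice paper referenced below.

```python
import numpy as np

# Hypothetical hourly demand f(x) and cost factors; all values are
# placeholders chosen for illustration.
rng = np.random.default_rng(1)
x = np.arange(24 * 30)  # one month of hourly samples
f = 1000 + 150 * np.sin(2 * np.pi * x / 24) + rng.normal(0, 20, x.size)

A = 1.0  # cost factor for demand served at on-demand rates above the commitment
B = 0.6  # cost factor for committed capacity that goes unused

def cost(c: float) -> float:
    """C(c) = A * area of demand above y = c + B * area of unused commitment below y = c."""
    on_demand_area = np.maximum(f - c, 0.0).sum()
    unused_area = np.maximum(c - f, 0.0).sum()
    return A * on_demand_area + B * unused_area

# Sweep candidate commitment levels and keep the cheapest.
candidates = np.linspace(f.min(), f.max(), 500)
best = min(candidates, key=cost)
print(f"optimal commitment level: {best:.1f} (total cost {cost(best):,.0f})")
```

A useful property of this piecewise-linear cost is that its minimizer is the A/(A+B) quantile of the observed demand, so the sweep above should land close to np.quantile(f, A / (A + B)).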
Summary
Our forecasting model takes into account user workload demand changes, new hardware releases and software performance improvements, allowing us to run the optimization described here continuously and fine-tune the cost-minimizing commitment level for our compute demand.
For a more detailed write-up of this process and some related optimizations we use to minimize costs, see our paper Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads, which will be presented at the International Conference on Performance Engineering (ICPE) in May.
To support further research into cloud forecasting, commitment optimization and capacity planning, we have also released a data set of normalized VM demand for 12 different machine types in four different regions over a three-year period.
Curious about the engineering challenges described in this blog post — like using data and quantitative methods to drive reliability, efficiency and performance improvements for Snowflake? Join us! Snowflake Engineering is hiring around the world.