Summary

Snowflake is adding a 99.99% target to its service-level agreement (SLA). Our data shows that our existing SLA, which targets 99.9%, has actually proven better for customers, so we’re keeping that target as well. The objective of this change is to remove one potential point of confusion in a very complex topic, make it easier to compare our SLA to those of our competitors, and put our money where our mouth is.

Background

Snowflake’s Support Policy defines our SLA to our customers. This document (which only a lawyer could love) provides a measurable and unambiguous lens on our product, which can be reliably evaluated on a per-customer basis. At a high level, the SLA is defined as a query error rate of no more than 1%, 99.9% of the time. This ensures reliable operation nearly all of the time, but acknowledges the potential for up to about 43 minutes of outage over the course of a month.

Another data warehouse offers an SLA with different parameters, which is no more than a 10% error rate, 99.99% of the time. This permits a relatively high background rate of errors (up to 10%) all of the time, but limits complete outages to about 4 minutes per month.
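The 43-minute and 4-minute figures above follow from simple arithmetic. Here is a minimal sketch, assuming a 30-day month (the function and constant names are illustrative, not taken from the Support Policy):

```python
# Allowed "bad" minutes per month for a given availability target.
# Assumption: a 30-day month (43,200 minutes); real months run 28 to 31 days.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def allowed_bad_minutes(availability_percent: float,
                        total_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the minutes per month that may fall outside the target."""
    return total_minutes * (100.0 - availability_percent) / 100.0


for target in (99.9, 99.99):
    print(f"{target}% -> {allowed_bad_minutes(target):.2f} minutes")
# 99.9% -> 43.20 minutes
# 99.99% -> 4.32 minutes
```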

In both cases, the user is offered usage credits if the SLA is violated. We have data going back several years that allows us to measure the reliability of Snowflake for our customers and calculate the impact of these different SLA thresholds. Empirically, our historical 1% error rate calculation yields more credits for users than the 10% error rate at four nines (99.99%) would. We believe it is the more stringent measure, and that it maps to what our customers actually want: for all of their queries to succeed.

The new SLA

Starting in June 2022, Snowflake will commit to a successful query execution SLA based on the more demanding of two thresholds:

  • Less than 1% error rate, 99.9% of the time
  • Less than 10% error rate, 99.99% of the time
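The two thresholds above can be checked together against a month of data. The sketch below is illustrative only and rests on assumptions not stated in the policy summary: error rates are measured per minute, a minute counts against a threshold when its error rate is at or above the limit, and violating either threshold is enough to trigger credits. The name `violated_thresholds` is hypothetical.

```python
from typing import Sequence

# (maximum error rate, fraction of minutes that must stay under it),
# taken from the two thresholds listed above.
THRESHOLDS = ((0.01, 0.999), (0.10, 0.9999))


def violated_thresholds(error_rates: Sequence[float]) -> list[tuple[float, float]]:
    """Return the thresholds violated by a month of per-minute error rates.

    Assumption: one error-rate sample per minute, each between 0.0 and 1.0.
    """
    total = len(error_rates)
    violated = []
    for max_rate, required_fraction in THRESHOLDS:
        # Minutes that fail the "less than max_rate" condition.
        bad = sum(1 for rate in error_rates if rate >= max_rate)
        allowed_bad = total * (1 - required_fraction)
        if bad > allowed_bad:
            violated.append((max_rate, required_fraction))
    return violated


MINUTES = 30 * 24 * 60  # assumed 30-day month

# A 20-minute outage at a 50% error rate: within the 99.9% budget
# (about 43 minutes) but over the 99.99% budget (about 4 minutes).
short_outage = [0.5] * 20 + [0.0] * (MINUTES - 20)
print(violated_thresholds(short_outage))   # [(0.1, 0.9999)]

# Sixty minutes at a 5% error rate: only the 1% threshold is violated.
low_level = [0.05] * 60 + [0.0] * (MINUTES - 60)
print(violated_thresholds(low_level))      # [(0.01, 0.999)]
```

The first case is the kind of month that only the new 99.99% threshold covers; the second is covered only by the existing 99.9% threshold.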

Primarily, this means we will now also offer our customers credits for months with a single short but severe outage, between 4 and 43 minutes in length, during which we returned a very high rate of errors. This outage profile is uncommon for Snowflake. Our data shows that if we had adopted only the 10% error rate, 99.99% SLA, customers would see a 40% decrease in SLA credits in a given month. By adopting the best of both SLAs, customers will see approximately a 24% increase in the number of credits issued. These figures are not symmetric because almost all outages are already covered by our existing SLA. This change acknowledges the importance of avoiding those outages, and aligns our incentives with those of our customers to ensure we continue to provide excellent service.

Technical minutiae for reliability engineers

These industry-standard SLA thresholds are not particularly good measures of user experience. The underlying service-level indicators (SLIs) focus on query execution, which misses many critical components of the actual user workflow, from client library behaviors to the correct results being served. They are also implicitly dependent on the user being able to reach and authenticate to Snowflake.

We currently monitor many of these additional dimensions of availability and incorporate them into our internal service-level objectives (SLOs). When we violate our SLOs, we conduct rigorous internal engineering postmortems to understand why, and when the violations impact our users, we publish external versions of these as root cause analyses. We aggregate the data from those postmortems by Snowflake deployment, and publish that “availability” data on our Community site. While this aggregate measure does not directly translate into SLA credits, it does give insight into the upward trend of our reliability over time.