Automatic Clustering, Materialized Views & Automatic Maintenance in Snowflake
Nov 13, 2018 | 3 Min Read
Author: Artin Avanes
Engineering, Snowflake News
Performance and scale are core to Snowflake. From day one, the founding team has focused on building a fast and zero-maintenance cloud-built data warehouse. Snowflake users focus on analytical tasks instead of continuously tuning the data warehouse or investing in complicated workload management due to concurrency limits. From today, two new automation and performance capabilities are available in Snowflake. They are the latest advancements to Snowflake’s ability to deliver the best performance and scalability without requiring administration or maintenance from our customers. They also lay the foundation for a series of new features to come in the months ahead.
Snowflake’s automatic clustering feature is now available for all regions and clouds. Automatic clustering is a standard feature customers can enable by contacting Snowflake Support.
With legacy on-premises and cloud data warehouses, it’s the user’s burden to constantly optimize the underlying data storage. This includes updating indexes and statistics, post-load vacuuming procedures, choosing the right distribution keys, dealing with slow partitions due to growing skews, and the need to manually reorder data as new data arrives or gets modified.
With Snowflake, a user can define clustered tables if the natural ingestion order is not sufficient in the presence of varying data access patterns. Now, we are introducing Snowflake’s automatic clustering, which constantly maintains optimal clustering for tables defined as clustered tables without any impact to production workloads. The benefits include:
- No need to run manual operations to re-cluster data.
- Incremental clustering as new data arrives or a larger amount of data is modified.
- ETL pipelines consisting of DML operations (INSERT, DELETE, UPDATE, MERGE) can run concurrently and are not blocked.
- Users can resume and suspend automatic clustering on a per table basis, and are billed by the second for only the compute resources used.
Materialized Views & Automatic Maintenance
Snowflake’s materialized views (MVs) are public preview on a per request basis and offered in our enterprise edition. As we looked at an age-old database and optimization challenge, we focused on a few crucial design principles we wanted to achieve to avoid the pain points of traditional approaches:
- Ensure fast DML operations in the presence of MVs – no slow downs when issuing DML statements against base tables, which is common among traditional data warehouses systems.
- Provide always, up-to-date and consistent query results when accessing MVs.
- Ensure ease-of-use by providing a maintenance service that continuously runs and updates the MVs in the background, and is entirely transparent to the end user.
Main uses cases we’ve enabled with this initial version of Snowflake’s MVs are:
- Defining numerous MVs with different clustering keys, leading to improved query performance for different access patterns.
- Materializing single table aggregations.
- Materializing query results and analysis against semi-structured data, using flatten or approximate queries (sampling).
Scalable compute service: no infrastructure to manage
Both features are built on top of our scalable, multi-cluster virtual warehouse technology and entirely managed by Snowflake. This means Snowflake efficiently and automatically executes the re-clustering or MV refresh statements in the background. No need to create, size or resize a virtual warehouse.
In case of the background clustering, the compute service continuously monitors the clustering quality of all registered clustered tables. It starts with the most unclustered micro-partitions, and iteratively performs the clustering until an optimal clustering depth is achieved. In the case of MVs, the compute service monitors the base tables and kicks off refresh statements for the corresponding MVs if significant changes are detected. This maintenance process of all depending MVs is asynchronous. In scenarios when a user is accessing an MV, which has yet to be updated, Snowflake’s query engine will perform a combined execution with the base table to always ensure consistent query results. Similar to Snowflake’s automatic clustering with the ability to resume or suspend per table, a user can resume and suspend the automatic maintenance on per MV basis.
Conclusions and Acknowledgements
We thank our numerous preview participants for providing us valuable feedback early on throughout the private beta. The early adoption in our customer base is encouraging and we will continue listening to their feedback. This is just the beginning of a number of additional features related to performance, automation and manageability we’ll be announcing in the next weeks and months ahead. Learn more about materialized views, automatic clustering, and automatic maintenance here.