An overview of recommendations and best practices from how we optimized performance on Snowflake across all our workloads.
In the first part of this series, we looked at the core architectural components and performance services and features for monitoring and optimizing performance on Snowflake. Part 2 describes some of the best practices we’ve learned that can help improve your performance and reduce costs on Snowflake.
Optimizing Storage Cost
Review your data retention period for all permanent tables in Snowflake.
Time Travel—while a great out-of-the-box feature—has an associated cost. The greater the number of days you retain history, the larger the storage cost. Hence consider your data set size, the amount of changes that your data set undergoes, and the business requirement for time travel before determining your optimal time value for data retention. It is ideal to set different time travel durations for different types of databases.
Turn off time travel when dropping a table.
Whenever you drop a table, the data will be retained based on the duration set for time travel (via the DATA_RETENTION_TIME_IN_DAYS parameter). If you know you don’t want to retain time travel for a table that is being dropped, it is better to use the ALTER TABLE command, set DATA_RETENTION_TIME_IN_DAYS=0, and then proceed with dropping the table. Time travel will be purged immediately and the associated storage costs can be saved.
Consider using transient and temporary tables rather than permanent tables.
If you have a table that doesn’t need resiliency, create a transient table to save associated costs. Snowflake maintains transient tables across sessions so they incur storage charges, but because they are not permanent, they do not incur fail-safe costs. You can also create temporary tables that get dropped automatically at the end of the session. (Temporary tables do incur storage costs, however, while they are being used in an active session.)
Take advantage of Snowflake Zero-Copy Cloning
Available out of the box, Snowflake Zero-Copy Cloning can be used to keep multiple copies of data at any given point in time with zero to minimal storage cost depending on the changes that the associated base table undergoes.
Optimizing Compute Costs
Audit warehouse sizes.
Verify occasionally that you are using the right size warehouse for your workloads. In one case, we had a single-sized warehouse being used for ingesting data from a source system to Snowflake. The ingestion involved different query types like PUT, COPY, INSERT, MERGE, SELECT, and DELETE. Since this involved different sized data sets (some were lookup tables with less than 1,000 rows whereas some were transaction tables with billions of rows), a one-size-fits all approach wasn’t optimal for all data. Our solution was to use different size warehouses for different tables dynamically depending on the underlying data size. This resulted in $375,000 savings per year. While it is not always straightforward to determine the correct size initially, you can use a trial-and-error process to find the optimal size over multiple iterations. See Query History in part 1 for more information about picking candidates for iterating to determine the optimal size.
Use efficient query techniques.
This section describes several factors that can make your queries more efficient, helping to increase performance and reduce costs.
Table join efficacy:
- Self-joins are costly, especially if the table is huge. Check queries with self-joins to find ways to avoid such situations.
- Avoid many-to-many joins. While joining two tables on key columns, make sure they are either one-to-many or many-to-one. Many-to-many will result in a cartesian join and hence an enormous number of rows in results have to be passed to a subsequent step.
- Aggregation operations (such as GROUP BY, ORDER BY, and DISTINCT) are costly, especially when the result data set is huge. Although these operations cannot be avoided completely, look at using them efficiently. For instance, if the underlying table doesn’t have any duplicates, there is no need to use DISTINCT or GROUP BY. Using such an operation will only deteriorate query performance and increase costs.
- Any inequality usage in the filter will result in a full table scan, defeating the purpose of clustering. Examples of such operations are <>, NOT IN, and NOT LIKE. A potential solution for such a use case would be using the function itself while defining the clustering key. For example, if a column employee_name has the UPPER function always on while filtering, then cluster the table using UPPER(employee_name) instead of just employee_name.
- Use of PARSE_JSON while extracting JSON data into a variant is an additional overhead and will result in expensive computations. Directly extracting JSON data into a VARIANT is supported out of the box.
- Use of the LATERAL FLATTEN function will result in optimized performance to traverse through JSON elements in a query.
Common table expressions (CTE)
- Usage of a CTE will always be more cost-effective to store an intermediate result. Use it rather than creating temporary or transient tables.
- Since CTE is an entirely memory-based operation, huge I/O savings can be reaped compared to physicalizing such data.
- At times when an inefficient table join results in billions of rows, the Snowflake query execution might fail with an out-of-memory error. If such an incident occurs, examine the data in the underlying tables, join conditions, filters used, clustering key efficiency, and the size of warehouse used.
- A UNION operator will always perform better than an OR operation. For instance, the filterorg_id=1 or bu_id=1 can be rewritten as two queries with a UNION operator in between.
- UNION ALL always performs better than UNION. Use UNION only if you want to filter the duplicates in the result set. Otherwise by default you should use “union all” in place of “union.”
- Row-by-row insert is costly. Not only is it inefficient, but it results in keeping the warehouse up with the least load, and hence results in burning credits unnecessarily Instead, leverage COPY INTO operations to do batch inserts. If COPY INTO is not possible, use multi-row inserts.
Creating Custom Monitoring Dashboards
Although Snowflake provides multiple views to check credit use, it is a best practice to build custom dashboards for easy consumption. Virtual warehouse credit usage can be queried with the WAREHOUSE_METERING_HISTORY table function for hourly usage. However, for wider consumption and further analysis, consider creating a custom dashboard.
A custom dashboard can be built on top of the data showing warehouse and date trends. It will help end users to identify their usage patterns and fine-tune usage accordingly. Individual dashboards can be built to find patterns of storage cost, clustering cost, materialized view refresh cost, overall Snowflake cost, and more. Use a query tag to group workloads logically and analyze associated costs.
Thanks for checking out our two part series on how we optimized our use of Snowflake.
- Understanding Snowflake’s Resource Optimization Capabilities
- 10 Best Practices Every Snowflake Admin Can Do to Optimize Resources
- Managing Snowflake’s Compute Resources
- Snowflake documentation
At the time of writing this blog, Manickaraja Kumarappan was an Architect for Data & Analytics at Cisco.