Using Materialized Views to Solve Multi-Clustering Performance Problems
The recent release of Snowflake materialized views (MVs) adds an exciting new performance feature to Snowflake. Materialized views support several different use cases, including performance optimization. In this blog post, I’ll focus on using materialized views to solve a specific performance problem that arises when a large data set has multiple access paths. The example presented in this post uses a 10-billion-row table and two different ways to query the data.
The Setup
Imagine having a weblog that tracks certain metrics and is queried through different attributes (IDs) for different use cases, such as by timestamp and by PAGE_ID. For this example, let’s assume that one of the metrics tracked in the weblog table is the time it takes to load a page in milliseconds (TIME_TO_LOAD_MS).
To simplify the setup, the table below uses BIGINT as the data type for the IDs. In a real-world use case, the data types can be anything Snowflake supports. To increase the size of the data set and make the use case more realistic, I’m adding additional metric columns (METRIC2 through METRIC9).
CREATE OR REPLACE TABLE WEBLOG (
CREATE_MS BIGINT,
PAGE_ID BIGINT,
TIME_TO_LOAD_MS INTEGER,
METRIC2 INTEGER,
METRIC3 INTEGER,
METRIC4 INTEGER,
METRIC5 INTEGER,
METRIC6 INTEGER,
METRIC7 INTEGER,
METRIC8 INTEGER,
METRIC9 INTEGER
);
I want to compute the average load time for two different use cases. In use case A, I'll focus on the average load time for a particular timeframe across all pages. In use case B, I'll focus on the average load time for a particular PAGE_ID. Use case A is satisfied by a query directly against the WEBLOG table, and the query returns quickly because the data is naturally clustered by the creation timestamp. Use case B requires much more time, however, because the data is filtered by PAGE_ID, yet it's clustered by the creation timestamp, CREATE_MS.
The Problem: Multiple Access Paths
To illustrate the problem, I’ll generate a big data set (10 billion rows). Assuming that CREATE_MS is measured in milliseconds, the data flows into the WEBLOG table sequentially by creation time, while the PAGE_ID values are random.
INSERT INTO WEBLOG
SELECT
(SEQ8())::BIGINT AS CREATE_MS
,UNIFORM(1,9999999,RANDOM(10002))::BIGINT PAGE_ID
,UNIFORM(1,9999999,RANDOM(10003))::INTEGER TIME_TO_LOAD_MS
,UNIFORM(1,9999999,RANDOM(10005))::INTEGER METRIC2
,UNIFORM(1,9999999,RANDOM(10006))::INTEGER METRIC3
,UNIFORM(1,9999999,RANDOM(10007))::INTEGER METRIC4
,UNIFORM(1,9999999,RANDOM(10008))::INTEGER METRIC5
,UNIFORM(1,9999999,RANDOM(10009))::INTEGER METRIC6
,UNIFORM(1,9999999,RANDOM(10010))::INTEGER METRIC7
,UNIFORM(1,9999999,RANDOM(10011))::INTEGER METRIC8
,UNIFORM(1,9999999,RANDOM(10012))::INTEGER METRIC9
FROM TABLE(GENERATOR(ROWCOUNT => 10000000000))
ORDER BY CREATE_MS;
A key indicator for estimating query performance in Snowflake is the clustering depth for the filter attribute. First, specify the clustering key with an ALTER TABLE statement; then retrieve the clustering depth for an attribute with SYSTEM$CLUSTERING_INFORMATION.
ALTER TABLE WEBLOG CLUSTER BY (CREATE_MS);
SELECT SYSTEM$CLUSTERING_INFORMATION( 'WEBLOG' , '(CREATE_MS)' );
SELECT SYSTEM$CLUSTERING_INFORMATION( 'WEBLOG' , '(PAGE_ID)' );
The clustering depth for attribute CREATE_MS is a small value (~1 in this case), which is good, but the depth for PAGE_ID is a large value, which is suboptimal. Because I created the data sorted by CREATE_MS, this is an expected outcome; PAGE_ID, on the other hand, is randomly distributed.
Running a query against WEBLOG confirms the problem. To make the problem easily visible, I used a relatively small warehouse for a 10-billion-row data set; the numbers below are based on a MEDIUM warehouse.
SELECT COUNT(*) CNT, AVG(TIME_TO_LOAD_MS) AVG_TIME_TO_LOAD
FROM WEBLOG
WHERE CREATE_MS BETWEEN 1000000000 AND 1000001000;
SELECT COUNT(*) CNT, AVG(TIME_TO_LOAD_MS) AVG_TIME_TO_LOAD
FROM WEBLOG
WHERE PAGE_ID=100000;
Both queries summarize approximately 1,000 rows. The first query runs in about 0.1 seconds, while the second requires roughly 100 seconds (assuming a cold data cache). The substantial difference occurs because the first query examines only one micro-partition, while the second query scans approximately 21,000 micro-partitions.
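To verify the pruning behavior yourself, you can check the micro-partition statistics that Snowflake records for every query. Here is a minimal sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE share (note that its QUERY_HISTORY view can lag real time by up to about 45 minutes):
SELECT QUERY_TEXT, PARTITIONS_SCANNED, PARTITIONS_TOTAL
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%FROM WEBLOG%'
ORDER BY START_TIME DESC
LIMIT 2;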
Running both queries with equal performance requires a second copy of the data that’s organized differently, thereby optimizing access for both query patterns.
The Solution: Clustered Materialized View
The solution to the problem lies with two new features in Snowflake: materialized views and auto-clustering.
When creating a materialized view in Snowflake, you can specify a clustering key, which enables Snowflake to reorganize the data during the initial creation of the materialized view.
For a source table of about 10 billion rows, a MEDIUM-sized warehouse takes about 20 minutes to create the materialized view. To reduce the time needed to create the test data, I used another Snowflake feature: the ability to elastically resize the warehouse. Running the statements below decreases the initial build time for the materialized view, making it about 16X faster than running the same statement on a MEDIUM-sized warehouse. When running the code, replace the placeholder <NAME> with the name of your own warehouse.
ALTER WAREHOUSE <NAME> SET WAREHOUSE_SIZE=XXXLARGE;
CREATE OR REPLACE MATERIALIZED VIEW MV_TIME_TO_LOAD
(CREATE_MS, PAGE_ID, TIME_TO_LOAD_MS) CLUSTER BY (PAGE_ID)
AS
SELECT CREATE_MS, PAGE_ID, TIME_TO_LOAD_MS
FROM WEBLOG;
ALTER WAREHOUSE <NAME> SET WAREHOUSE_SIZE=MEDIUM;
After the materialized view build completes, validate the data distribution by retrieving the clustering information.
SELECT SYSTEM$CLUSTERING_INFORMATION
( 'MV_TIME_TO_LOAD' , '(PAGE_ID)' );
The clustering depth should now be ~2, which indicates that a query against the materialized view by PAGE_ID should be much faster. How much faster?
To get a baseline, I first ran the query against the base table. To achieve reproducible results, you need to limit the impact of the query result cache. Result caching is another great Snowflake feature; however, when comparing performance, you want to compare apples to apples. For that reason, I disabled cached results for these tests.
ALTER SESSION SET USE_CACHED_RESULT=FALSE;
SELECT COUNT(*),AVG(TIME_TO_LOAD_MS)
FROM WEBLOG
WHERE PAGE_ID=100000;
The second query no longer takes the roughly 100 seconds it did on the first run; this time it finishes in closer to 20 seconds. Why? With the result cache disabled, you’re now seeing the impact of micro-partition caching on the execution cluster. If this were the first time running the query against a cold cluster cache, it would still require about 100 seconds.
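If you want to reproduce the cold-cache timing, one option is to suspend and resume the warehouse, since suspending a warehouse drops its local micro-partition cache. A minimal sketch; as before, replace <NAME> with the name of your warehouse:
ALTER WAREHOUSE <NAME> SUSPEND;
ALTER WAREHOUSE <NAME> RESUME;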
Now, let’s check the runtime for the same query against the materialized view. Testing shows results in about 0.1 seconds, similar to the lookup by CREATE_MS against the base table.
ALTER SESSION SET USE_CACHED_RESULT=FALSE;
SELECT COUNT(*),AVG(TIME_TO_LOAD_MS) AVG_TIME_TO_LOAD
FROM MV_TIME_TO_LOAD
WHERE PAGE_ID=100000;
This is a substantial improvement. However, the scenario above, by itself, isn’t that impressive because creating a new, materialized copy of the base table through a CREATE TABLE AS SELECT (CTAS) statement accomplishes the same thing. So, what happens when the data in the base table changes? The data in the CTAS copy immediately becomes stale.
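For comparison, such a CTAS copy might look like the sketch below (TIME_TO_LOAD_COPY is a hypothetical name). It matches the materialized view’s initial query performance, but it must be rebuilt manually every time WEBLOG changes:
CREATE OR REPLACE TABLE TIME_TO_LOAD_COPY CLUSTER BY (PAGE_ID)
AS
SELECT CREATE_MS, PAGE_ID, TIME_TO_LOAD_MS
FROM WEBLOG;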
Putting It All Together
The real power of materialized views stems from the fact that all DML changes (INSERT, UPDATE, DELETE) to the base table (WEBLOG) are automatically applied to the materialized view. Not only are the changes applied, they’re applied in a transactionally consistent way. At any point in time, the data in the materialized view matches what the view’s defining query would return when run against the source table. Lastly, having a materialized view on top of a source table doesn’t impact DML performance against the source table.
To simulate activity against the source table, I wrote a small Python program that inserts batches of rows with a configurable number of seconds in between batches.
import snowflake.connector
import time
import sys

def getSnowFlakeCntxt():
    # Set the Snowflake context; replace the <placeholders> with your credentials.
    SnowFlakectx = snowflake.connector.connect(
        user=<user>,
        password=<password>,
        account=<account>,
        warehouse=<warehouse>,
        database=<database>,
        schema=<schema>
    )
    return SnowFlakectx

def executeSnowflakeCmd(iteration, batch, seed):
    conn = getSnowFlakeCntxt()
    cs = conn.cursor()
    try:
        # Tag the inserts so they are easy to find in the query history.
        cs.execute("alter session set QUERY_TAG = %s", ("MV_INSERT_TEST",))
        # Offset CREATE_MS so each batch continues where the previous one ended.
        insSQL = """
        insert into WEBLOG
        select
        ((seq8())+""" + str(iteration*batch+seed) + """)::bigint as create_ms
        ,uniform(1,9999999,random(10002))::bigint page_id
        ,uniform(1,9999999,random(10003))::integer time_to_load_ms
        ,uniform(1,9999999,random(10005))::integer metric2
        ,uniform(1,9999999,random(10006))::integer metric3
        ,uniform(1,9999999,random(10007))::integer metric4
        ,uniform(1,9999999,random(10008))::integer metric5
        ,uniform(1,9999999,random(10009))::integer metric6
        ,uniform(1,9999999,random(10010))::integer metric7
        ,uniform(1,9999999,random(10011))::integer metric8
        ,uniform(1,9999999,random(10012))::integer metric9
        from table(generator(rowcount => """ + str(batch) + """))
        order by create_ms
        """
        cs.execute(insSQL)
    finally:
        cs.close()
        conn.close()

if __name__ == "__main__":
    # Usage: python TableInsert.py <iterations> <batch_size> <sleep_seconds> <seed>
    iterations = int(sys.argv[1])
    batch_size = int(sys.argv[2])
    sleep_time = int(sys.argv[3])
    seed = int(sys.argv[4])
    for i in range(0, iterations):
        executeSnowflakeCmd(i, batch_size, seed)
        time.sleep(sleep_time)
    print("***** COMPLETED ******")
The command below invokes the script with 20 iterations; each iteration inserts 600,000 rows, with 60 seconds between batches. The last parameter is a seed value so that new rows are created with CREATE_MS values greater than those of the existing rows.
python TableInsert.py 20 600000 60 100000000
With every batch inserted into the source table (WEBLOG), you can see the data returned by the query below change.
SELECT COUNT(*),AVG(TIME_TO_LOAD_MS)
FROM MV_TIME_TO_LOAD
WHERE PAGE_ID=100000;
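You can also observe the background maintenance itself. Here is a minimal sketch using the MATERIALIZED_VIEW_REFRESH_HISTORY table function in INFORMATION_SCHEMA (the argument name is my assumption; check the Snowflake documentation for the exact signature):
SELECT *
FROM TABLE(INFORMATION_SCHEMA.MATERIALIZED_VIEW_REFRESH_HISTORY(
MATERIALIZED_VIEW_NAME => 'MV_TIME_TO_LOAD'));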
To prove that the materialized view returns the same results as a query against the source table, let’s run the statement below, which shows the result from both the source table and the MV at the same time.
SELECT 'WEBLOG', COUNT(*),AVG(TIME_TO_LOAD_MS)
FROM WEBLOG
WHERE PAGE_ID=100000
UNION ALL
SELECT 'MV_TIME_TO_LOAD',COUNT(*),AVG(TIME_TO_LOAD_MS)
FROM MV_TIME_TO_LOAD
WHERE PAGE_ID=100000;
Conclusion
Using materialized views solves a core performance issue that occurs when accessing data in a large table through different sets of attributes. The improvement can be several orders of magnitude (about 1,000X in this example), and it doesn’t require spending time on building additional maintenance tasks, such as MV refresh jobs.
There are many more viable use cases for materialized views, including projections and selections that materialize frequently queried subsets of a large source table. Support for single-table aggregations can also replace manual processes for computing aggregations at one or many levels. Materialized views even support popular functions like FLATTEN or SAMPLING.
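As an illustration of the aggregation use case, here is a minimal sketch that maintains per-page averages automatically as WEBLOG changes (MV_AVG_TIME_TO_LOAD is a hypothetical name; it assumes COUNT and AVG are supported aggregates for materialized views in your account):
CREATE OR REPLACE MATERIALIZED VIEW MV_AVG_TIME_TO_LOAD AS
SELECT PAGE_ID, COUNT(*) CNT, AVG(TIME_TO_LOAD_MS) AVG_TIME_TO_LOAD
FROM WEBLOG
GROUP BY PAGE_ID;
Learn more about materialized views and the Snowflake auto-clustering service.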