As data volumes continue to grow across many industries, it becomes increasingly important that queries are performant enough to meet the needs of data-intensive and time-constrained analytics. One such common query type is “top-k,” which are queries that have both ORDER BY and LIMIT clauses. We are pleased to announce that optimizations resulting in potentially significant improvements in the performance of these types of queries in Snowflake are now generally available.
Top-k queries are used to retrieve the k highest- or lowest-ranked results, where k is the number of results to be retrieved. For example, let’s say a cybersecurity analyst, while investigating a potential threat, uses top-k queries to find the most recent 1,000 logs generated from a particular IP address. Or a data analyst at a bank could be using this query type to find the most recent 100 transactions from a particular merchant during dispute management.
Snowflake understands the importance and time-sensitivity of these kinds of queries. Over the past six months, we have invested in this initiative, and our teams have developed a new optimization that improves the performance of long-running top-k queries by an average of 38% and, in some cases, up to 99%.
We have several happy customers who tried this feature out in preview. One such customer is Island, a fast-growing company in the cybersecurity space. Island pioneered development of the rapidly emerging new category of enterprise browsers. The Island Enterprise Browser empowers organizations to reimagine security, user productivity, and application delivery.
According to Dan Amiga, Island’s Co-Founder and CTO: “Snowflake’s recent groundbreaking work on top-k has enabled us to develop new capabilities that can quickly present critical data to our clients, regardless of whether they are on Island’s managed Snowflake environment or their own Snowflake instances. It has been game-changing for us.”
So, how does it work?
Queries with only a LIMIT clause can be made faster by simply stopping execution as soon as enough rows have been seen. However, the problem becomes significantly harder if we include an ORDER BY clause. With this additional constraint, instead of being able to gather the first rows read directly into the result, the engine must consider all rows in the data set and find those that qualify for the top k.
To avoid the large overhead of sorting the entire table, we begin scanning rows and continuously track the k-th largest value seen so far. Any rows containing a value smaller than that can be filtered out, since they will not be part of the result. The new optimization also allows us to skip fetching and scanning entire sections of tables that we know do not satisfy the k-th largest criteria. When done in combination, these techniques greatly increase the speed of query execution for top-k queries.
We have seen several top-k queries improved by this enhancement. As illustrated in the graph below, we plotted the latency of several customer queries seen over a short period of time, both with and without the feature. The yellow dots and their values on the y-axis indicate the query execution time with the top-k pruning enhancement, while the red dots and their x-axis values correspond to the same query’s execution time without the enhancement. The execution time of complex top-k queries that would otherwise take up to 7 to 9 minutes instead take only 15 to 50 seconds with this enhancement.
As the results in this graph indicate, we see our optimization providing significant improvements—and now it’s time for you to take advantage of this speedup!
Nothing to do on your end
How can you benefit from these improvements? The good news is you don’t need to make any changes. The query optimizer will automatically apply this enhancement to any queries that would benefit from it and will improve their performance. We are confident that you will be impressed, too! See the release notes to learn more about the feature and where this enhancement comes into play.
We will continue to make enhancements in query performance, so stay tuned to hear more from us in the coming months.
For the latest feature announcements and more, subscribe to the Inside the Data Cloud Blog.