Optimizing Query Execution in Cortex AISQL

AI is redefining data analytics, and expectations to deliver impact from AI are higher than ever. Extracting and synthesizing insights across data sources now requires increasingly complex, multistep pipelines that are often fragmented across tools and teams. This results in slower insights and higher costs.

Cortex AISQL integrates AI into SQL primitives, so you can run unstructured analytics on text, images and audio with industry-leading LLMs. For example, in a single declarative SQL query, a product manager can use AI_FILTER to identify frustrated customer conversations, then use AI_SUMMARIZE_AGG to semantically generate concise summaries by product. (You can see the full function list here.)

Efficiency becomes critical as AI workloads scale. Poor query plans can lead to longer runtimes and escalating token usage, as can system-level factors like GPU contention or incorrect model choices. Many queries don’t need the most powerful LLMs; smaller, faster models can deliver comparable accuracy at a fraction of the cost.

The Snowflake AI Engineering and Research teams developed three techniques that make AISQL workloads even faster and more cost-efficient. Our benchmarks show that:

  • “AI-aware” query optimization for AI operators delivers 2x-8x more performant execution plans.

  • Adaptive model cascading routes queries between smaller and larger models, achieving 2x-6x faster inference with 90%-95% accuracy. It incorporates new insights to adapt a standard model cascades algorithm inside a distributed query execution engine. 

  • Converting semantic joins into classification tasks reduces execution time and costs by 15x-70x without impacting quality.

In this article, we dive into the research behind these techniques and share benchmark results showing how they enhance AISQL performance, scalability and efficiency.

AI-aware query optimization

In general, query optimizers try to identify the most efficient way to run a SQL query by minimizing total execution cost, including time, I/O, CPU and network usage, given the data and system architecture. However, when queries include AI operators like AI_FILTER or AI_CLASSIFY, these conventional heuristics can produce very inefficient execution plans. Unlike standard SQL operations, AI operators invoke LLMs, which are orders of magnitude more computationally expensive and higher in latency.

Our technique treats LLM inference cost as a first-class optimization objective. 

A typical SQL engine will push predicates as low in the execution plan as possible in order to reduce the number of rows that upstream operators, such as joins and aggregations, have to process. Rather than blindly pushing predicates below joins to minimize intermediate data, our optimizer explicitly considers the number and cost of the LLM invocations required by different plan alternatives. Reordering these predicates or changing their placement relative to joins can dramatically reduce total inference cost.

Let’s take an example. The sample AISQL query below is asking the database to summarize research papers whose abstracts and associated images both discuss energy efficiency in database systems, limited to those published between 2010 and 2015.

SELECT AI_SUMMARIZE_AGG(p.abstract)
FROM papers p JOIN paper_images i ON p.id = i.id
WHERE p.date BETWEEN 2010 AND 2015
  AND AI_FILTER(PROMPT('Abstract {0} discusses energy efficiency in database systems', p.abstract))
  AND AI_FILTER(PROMPT('Image {0} shows energy consumption of different systems using the TPC-H workload', i.image_file));

In Plan A, the compiler naively pushed an AI_FILTER below a join, which triggered more than 110,000 model calls.

Alternatively, in Plan B, we prioritized LLM inference cost during query optimization. This means the compiler can craft a plan that drastically reduces LLM calls. This is achieved by strategically altering the predicate evaluation order and elevating expensive AI predicates above selective joins. In Plan B, this approach led to a remarkable 300x improvement, with only 330 LLM calls. 
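To make the plan difference concrete, here is a rough sketch (ours, for illustration only) of a manually staged query whose shape mirrors Plan B: the cheap date filter and the abstract AI_FILTER run first, the join prunes nonmatching images, and only then does the expensive image filter execute. With AI-aware optimization, the compiler derives an ordering like this automatically, so no manual rewriting is needed.

-- Illustrative only: a Plan B-style evaluation order, staged by hand.
WITH relevant_papers AS (
  SELECT p.id, p.abstract
  FROM papers p
  WHERE p.date BETWEEN 2010 AND 2015
    AND AI_FILTER(PROMPT('Abstract {0} discusses energy efficiency in database systems', p.abstract))
)
SELECT AI_SUMMARIZE_AGG(rp.abstract)
FROM relevant_papers rp
JOIN paper_images i ON rp.id = i.id
WHERE AI_FILTER(PROMPT('Image {0} shows energy consumption of different systems using the TPC-H workload', i.image_file));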

Significant improvements in total execution time are also observed, as AI operators typically dominate the execution time of AISQL queries.

These optimizations can significantly enhance AISQL query performance, with the degree of improvement depending on factors like the selectivities of various query operators (e.g., predicates, joins). 

Across synthetic data sets and queries modeled after actual Snowflake customer workloads, we've measured 2x-8x more performant execution plans. By reasoning about LLM cost directly during query planning, AISQL now makes even complex multimodal workloads efficient and scalable.

Adaptive model cascading

Running a powerful LLM to evaluate every row of a large table is costly. Model cascades address this by routing simple tasks to smaller, faster models and reserving larger models for the hardest problems. While model cascading is a known technique, our innovation lies in adapting it inside the database engine, where inference occurs inline during query execution. 

AISQL now implements an adaptive, two-tier inference strategy, where: 

  1. a lightweight “proxy model” (a relatively small model that’s typically faster and cheaper to run, such as Claude Haiku 4.5) processes the majority of rows;

  2. a larger “oracle model” (a state-of-the-art large model, such as Claude Sonnet 4.5) is invoked when the proxy’s confidence is low.

The core technical challenge is learning an effective routing strategy at query runtime, without prior knowledge of the query’s data distribution. 

Our approach frames this as a statistical estimation problem. The system first executes the proxy model on all rows to obtain confidence scores, then uses importance sampling to selectively query the oracle model on a representative subset. As shown in Figure 1, these oracle-labeled samples allow the algorithm to learn two confidence thresholds that divide the data rows into three regions: (1) a reject region (high-confidence negatives), (2) an uncertainty region where rows require oracle review, and (3) an accept region (high-confidence positives).
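In notation (ours, purely as a sketch of the idea): if $s_i$ is the proxy model's confidence score for row $i$, and $\tau_{\text{low}} \le \tau_{\text{high}}$ are the two learned thresholds, each row is routed as

$$
\text{route}(i) =
\begin{cases}
\text{reject (confident negative)} & \text{if } s_i \le \tau_{\text{low}},\\
\text{send to oracle} & \text{if } \tau_{\text{low}} < s_i < \tau_{\text{high}},\\
\text{accept (confident positive)} & \text{if } s_i \ge \tau_{\text{high}}.
\end{cases}
$$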

Figure 1: Learning confidence thresholds in AISQL. The system uses proxy model scores and oracle feedback to split rows into reject, uncertainty, and accept regions.

Thresholds are computed using concentration inequalities that provide probabilistic guarantees with limited samples. The algorithm accounts for sampling bias through importance weighting, and threshold refinement occurs iteratively across batches, allowing the system to adapt dynamically as data characteristics evolve. 
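As a rough illustration of the kind of guarantee involved (our notation; not necessarily the exact bound used in the system): if $\hat{e}(\tau)$ is the error of a candidate threshold $\tau$ estimated from $n$ oracle-labeled samples, a Hoeffding-style inequality guarantees that, with probability at least $1-\delta$, the true error satisfies

$$
e(\tau) \;\le\; \hat{e}(\tau) + \sqrt{\frac{\ln(1/\delta)}{2n}}.
$$

When the oracle samples are drawn nonuniformly, the importance weights enter the estimator $\hat{e}(\tau)$, and the bound is adjusted accordingly.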

Figure 2: Performance comparison of model cascading on six public benchmark data sets. The cascade uses adaptive-threshold learning to route predictions between a proxy model (Llama3.1-8B) and oracle model (Llama3.3-70B). The left panel shows execution time with speedup factors annotated. The right panel shows model accuracy (F1) scores with exact values labeled. Model cascading reduces execution time by 1.22x-5.85x with mean F1 decreasing by 4.3%.

Our experimental results from six benchmark data sets (see Figure 2) demonstrate that model cascading achieves a mean speedup of 2.9x (65.5% reduction in execution time) while maintaining 95.7% of the oracle model's quality, with F1 scores decreasing by only 4.3 percentage points on average. 

On data sets where the proxy model's confidence correlates well with prediction difficulty, speedups reach up to 5.85x with negligible quality degradation. For example, on the NQ data set, model cascading achieves 5.85x speedup and preserves near-identical F1 scores. 

Query rewriting for semantic joins

Semantic joins using AI_FILTER enable powerful natural language-based matching between two input data sets. 

For example, let's consider a scenario where a user has two Snowflake tables: (1) “Reviews” contains product reviews, and (2) “Categories” includes names of product categories.

A user might want to associate each review with one or more relevant product categories with an AISQL query like this:

SELECT * FROM
Reviews JOIN Categories
ON AI_FILTER(PROMPT('Review {0} is mapped to category {1}',
    Reviews.review, Categories.label));

This query is highly inefficient to execute. Because of the semantic nature of the join condition, efficient join algorithms, such as hash join or sort-merge join, that work well on equi-joins are not applicable here. As a result, the Snowflake execution engine treats the semantic join as a cross join, which has quadratic complexity. This means it needs to produce all possible pairs of product reviews and categories and evaluate the AI_FILTER on each and every one of them. 

Common queries like this can result in thousands of model calls and high inference costs, even for small data sets. After analyzing a wide range of queries from customers, we discovered that, in many cases, a semantic join is equivalent to a multilabel classification problem.

For instance, in the previous example, rather than applying AI_FILTER for every review-category combination, we can reframe the semantic join as a classification problem. This means we can use multilabel AI_CLASSIFY for each product review, with the values from the Categories table serving as labels. This allows us to reduce the number of AI inference calls from 24 (4 reviews x 6 labels) to just four (one for each review), an improvement of 6x. 
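For the Reviews and Categories example, the rewritten query could look roughly like the sketch below. This is illustrative only: the actual rewrite is produced internally by the compiler, and enabling multilabel output is an AI_CLASSIFY configuration detail we gloss over here.

-- Illustrative sketch: one classification call per review, with the category
-- names from the Categories table supplied as the candidate labels.
SELECT r.review,
       AI_CLASSIFY(r.review,
                   (SELECT ARRAY_AGG(c.label) FROM Categories c)) AS matched_categories
FROM Reviews r;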

Detecting such cases automatically is not a trivial task. During the compilation of AISQL queries, an AI-based oracle examines each semantic join to determine whether it can be re-expressed as a multilabel classification problem. The oracle analyzes the natural language prompt and schema metadata (e.g., column and table names, the number of distinct values and actual values from the input sources). From this, it determines whether the semantic join should be converted into a classification task, and which input data set contains the labels for the classification. The transformation is then performed as a regular rewrite operation by the compiler. 

In addition to lower latency, this rewrite often enhances prediction quality. 

By evaluating all possible matches together, AI_CLASSIFY allows the model to compare options in context, rather than making isolated binary decisions (yes/no) for each pair. This comparative reasoning is particularly effective at reducing false positives, as the model can better discriminate between similar candidates when evaluating them jointly.

Figure 3. Comparison of cross-join rewrite strategies across eight semantic join public benchmarks. Left: Execution time with speedup factors annotated. Right: F1 scores with exact values labeled. The rewrite reduces execution time by 15.2x-69.5x with mean F1 improving by 44.7%.

Our evaluation across eight semantic join benchmarks (see Figure 3) demonstrates 15x-70x faster execution time (with a mean speedup of 31x), which translates directly into lower inference costs. On the CNN data set¹, execution time drops from 4.4 hours to 3.8 minutes, a staggering 69x improvement! Accuracy also improves, with the mean F1 score increasing by 44.7 percentage points, primarily due to higher precision from the multilabel classification approach. 

This technique is particularly effective for entity matching and document categorization workloads, where comparative reasoning significantly improves discrimination quality.

Making AISQL practical at scale

Integrating AI into SQL unlocks powerful new ways to analyze unstructured data, but it also introduces challenges in performance, scalability and cost. With these advances, Cortex AISQL takes a major step toward making AI-powered analytics practical at enterprise scale. Read the research paper to learn more.

Try Cortex AISQL in your Snowflake account to experience how simple and cost-efficient AI query execution can be.


1 Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching Machines to Read and Comprehend.
