May 28, 2026 | 14 min read | Gen AI

Smarter AI Queries: How Snowflake Is Raising the Bar on Quality and Efficiency

Most of the data that drives business decisions (support tickets, product reviews, contracts, survey responses) is unstructured text that traditional SQL cannot reason about. Snowflake Cortex AI Functions changes that by bringing LLM reasoning directly into SQL. A single query can filter, classify or summarize millions of rows of text without building a custom ML pipeline.

Adoption has been fast, and teams are now running Cortex AI Functions in production pipelines that process hundreds of millions of rows on a daily schedule. That volume is a sign that the approach works, but it also raises two questions that every team running AI at scale needs to answer:

  1. How do you know the output is correct? LLMs hallucinate, and analyses can contain claims that the data does not support.
  2. How do you make AI efficient enough to run at scale? LLM inference is resource-intensive, and overhead grows linearly with table size unless you reduce the number of calls.

Over the past year, the Snowflake AI Research team has been working with researchers at Brown University, UC Los Angeles, University of Chicago and UC Santa Barbara to address both. The result is five papers that push the state of the art on Cortex AI Functions quality and efficiency, and that shape how we build the product:

  1. Perfect classification accuracy at 3.2x lower cost when verifying benchmark claims from AI-generated summaries of real customer review data. (Evergreen)
  2. A new benchmark that tests analytical correctness, not just pipeline execution, exposing how small early errors cascade into wrong conclusions across enterprise analytics tasks. (AvalancheBench)
  3. Maximizing semantic ranking accuracy across compute budgets by dynamically selecting the near-optimal ranking strategy. (Semantic ORDER BY)
  4. Up to 19x reduction in AI query overhead compared with Palimpzest and Quest, at the same accuracy, by learning a smarter filter evaluation order in real time.
  5. F1 above 0.95 on all six test data sets by routing easy rows through a faster, smaller model and escalating only uncertain cases to the full LLM. (Streaming Model Cascades)

Those two questions — How do you ensure the output is correct? And how do you run efficiently at scale? — organize the rest of this post. For each paper, we describe the problem it addresses, how the research solves it and the key results.

How do we evaluate accuracy?

When a pipeline summarizes thousands of rows, you need to know whether the output is actually correct. Most benchmarks do not answer that question because they test isolated model performance on narrow tasks, not whether a full pipeline produces correct business answers. We tackled this at two levels: individual AI operators and end-to-end pipelines.

1. Evaluating quality on individual AI operators: Evergreen

"Evergreen: Efficient Claim Verification for Semantic Aggregates" in collaboration with Alexander Lee and Ugur Cetintemel at Brown University (View paper on arXiv | GitHub repo)

The problem

The AI_AGG operator uses an LLM to reduce a collection of rows into a natural language aggregate, such as a summary. However, the resulting LLM-generated summary can contain claims that the underlying data does not support. For example, the summary may state that "the majority of customer reviews are positive" when in fact only a minority of reviews are.

Verifying such claims is challenging since they involve quantifiers, groupings and comparisons over many rows, requiring a combination of both semantic and symbolic processing. Prior approaches to claim verification that leverage LLM-as-a-judge, retrieval-augmented generation or coding agents all fall short in this setting because they cannot scale to large data sets, lack optimized semantic and symbolic processing, and do not provide principled explanations for why claims are true or false.

Our approach

Evergreen verifies AI_AGG outputs by compiling each claim into a declarative verification query that runs against the source data on the same query engine. Given a natural language aggregate, the system first decomposes the text into individual claims and then translates each one into a query that checks whether the original rows actually support that claim, producing a provenance-backed verdict that cites the specific rows justifying the result.

On its own, this approach achieved perfect verification quality on our benchmark (evg_unopt in Figure 1). To also make verification efficient, Evergreen avoids costly LLM calls through several optimizations (evg_opt in Figure 1):

  • Early stopping: Halt verification once a claim is confirmed or refuted.
  • Relevance sorting: Evaluate the most informative rows first.
  • Estimation with confidence sequences: Statistically estimate a claim's verdict based on a sample of rows.
  • Operator fusion: Combine consecutive AI functions into a single LLM call per row.
  • Similarity filtering: Skip rows that are clearly irrelevant based on embedding distance.
  • Prompt caching: Reuse LLM responses across multiple claims with shared subqueries.
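
To make the first optimization concrete, here is a minimal sketch of early stopping for a single claim of the form "the majority of rows match." It is an illustration under our own assumptions, not Evergreen's implementation: the function name verify_majority and the is_match callback (a stand-in for one LLM call per row) are hypothetical.

```python
from typing import Callable


def verify_majority(rows: list[str], is_match: Callable[[str], bool]) -> tuple[bool, int]:
    """Check the claim 'more than half of the rows match', stopping early.

    Returns (verdict, number_of_rows_evaluated).
    """
    n = len(rows)
    needed = n // 2 + 1  # matches required for a strict majority
    matches = 0
    for seen, row in enumerate(rows, start=1):
        if is_match(row):  # hypothetical stand-in for one LLM call
            matches += 1
        remaining = n - seen
        if matches >= needed:  # claim already confirmed
            return True, seen
        if matches + remaining < needed:  # claim can no longer hold
            return False, seen
    return False, n  # only reached when rows is empty


reviews = ["bad"] * 6 + ["good"] * 4
verdict, evaluated = verify_majority(reviews, lambda r: r == "good")
print(verdict, evaluated)
```

In this toy run the claim is refuted after 5 of the 10 rows, because even if every remaining row matched, a majority would be out of reach.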

Figure 1 shows the cost-quality trade-off of different claim verification approaches across Anthropic Claude and Meta Llama models. The optimized versions of Evergreen (evg_opt) exclusively occupy the Pareto frontier (indicated by the dashed line).

Figure 1: F1 score vs. cost. evg_opt is Evergreen with all optimizations enabled; evg_unopt is Evergreen without any optimizations; rag_agent is a retrieval-augmented verification agent; and base_rm is an LLM-as-a-judge with reasoning.

Key results:

  • Obtained perfect verification quality (F1 = 1.00) with a strong LLM at 3.2x lower cost and 4.0x lower latency compared to unoptimized verification.
  • Outperformed a strong LLM-as-a-judge baseline at 48x lower cost and 2.3x lower latency using a weaker model.
  • Surpassed a retrieval-augmented agent in quality and latency at comparable cost when both use a strong LLM; when Evergreen used a weaker LLM, it achieved the same quality as the strong retrieval-augmented agent at 63x lower cost and 4.2x lower latency.

In practical terms, Evergreen automatically verifies claims generated by AI_AGG in a reliable, efficient and explainable manner.

2. Evaluating quality in end-to-end pipelines: AvalancheBench

"AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery" in collaboration with Darek Kleczek, Fuheng Zhao, Paweł Liskowski, Alexander Lee, and Ugur Cetintemel at Brown University (View on arXiv)

The problem

Real enterprise analytics tasks are goal-driven: Diagnose declining customer satisfaction or identify growth opportunities. Many valid pipelines can pursue those goals, but correctness remains grounded in the underlying data. A pipeline can execute without errors, produce a well-formatted report and still recover the wrong analytical structure entirely.

Existing benchmarks do not catch this because they measure pipeline execution or answer plausibility, not whether a system recovers the correct analytical understanding of the data. A system can produce plausible analyses while recovering the wrong analytical structure, hallucinating reasonable-sounding customer segments rather than verifying them in the data, or merging two distinct events into one, and none of that registers as a failure.

Our approach

AvalancheBench frames the problem as latent world recovery. We fully specify a hidden analytical structure before generating any data, then produce structured tables and unstructured text from that structure. Because the ground truth is known in advance, every rubric question has a verifiable correct answer.
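
The idea can be sketched in a few lines. The toy below is not the AvalancheBench generator: the LATENT_WORLD dictionary, the two rubric questions and the review templates are all hypothetical, chosen only to show how fixing the hidden structure first makes every rubric answer checkable.

```python
import random

# Hypothetical hidden structure, fixed BEFORE any data is generated.
LATENT_WORLD = {
    "personas": {"bargain_hunter": 0.6, "power_user": 0.4},  # segment shares
    "event": {"name": "firmware_bug", "month": 7},  # temporal event
}


def generate_reviews(world: dict, n: int, seed: int = 0) -> list[dict]:
    """Render the latent world as review rows; the persona label is never emitted."""
    rng = random.Random(seed)
    names = list(world["personas"])
    weights = list(world["personas"].values())
    rows = []
    for _ in range(n):
        persona = rng.choices(names, weights=weights)[0]
        month = rng.randint(1, 12)
        # The hidden signals only shape the text; they surface implicitly.
        topic = "price" if persona == "bargain_hunter" else "features"
        mood = "crashes a lot" if month == world["event"]["month"] else "works fine"
        rows.append({"month": month, "text": f"I care about {topic}; it {mood}."})
    return rows


def score(answers: dict, world: dict) -> float:
    """Fraction of rubric questions answered correctly against the known truth."""
    rubric = {
        "num_segments": len(world["personas"]),
        "event_month": world["event"]["month"],
    }
    correct = sum(answers.get(q) == truth for q, truth in rubric.items())
    return correct / len(rubric)


reviews = generate_reviews(LATENT_WORLD, 50)
print(score({"num_segments": 5, "event_month": 7}, LATENT_WORLD))
```

An agent that reports five plausible-sounding segments gets only half of this toy rubric, no matter how polished its report looks.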

Figure 2 illustrates the framework. Hidden signals like customer personas and temporal events are enforced into structured data rows, then rendered as natural language reviews where they surface only implicitly. The agent sees only the right panel and must recover the left.

Figure 2: The latent world recovery framework. The researchers specify the hidden analytical structure and generate both structured data and unstructured reviews from it. The agent sees only the right panel and must recover the left.

The first use case is e-commerce: 18 rubric questions, spanning segmentation, temporal analysis, ranking and synthesis, across a product catalog, 23,000 sales records and 10,000 customer reviews.

This design improves on existing benchmarks in three ways:

  1. Analytical understanding over pipeline completion: Systems are scored on whether they recover the segments, drivers and relationships that explain the data, not on whether they produce a plausible report.
  2. Verifiable ground truth: Every rubric question has a correct answer derived from the known latent structure.
  3. Compounding errors: The approach exposes the "avalanche effect," where mistakes in discovery or segmentation propagate into the final synthesis.

Key results:

  • On a first e-commerce use case, the strongest configuration of a leading coding agent recovered only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.
  • Those failures illustrate the avalanche effect: Wrong segmentations invalidate every downstream analysis, producing reports that look complete but are analytically wrong.
  • A pipeline-completion benchmark would record these same runs as successful, while the latent-world rubric exposes the analytical content as largely wrong.

AvalancheBench is designed to complement existing benchmarks, not replace them. Future work will extend the approach to additional modalities and enterprise domains, including customer support and marketing analytics.

3. Optimizing semantic ORDER BY accuracy

In collaboration with Fuheng Zhao, Jiayue Chen, Yiming Pan, Tahseen Rabbani, Divyakant Agrawal, Amr El Abbadi and team at UC Santa Barbara, University of Chicago and UC Los Angeles (View the paper on arXiv | GitHub repo)

The problem

Cortex AI Functions can rank rows based on semantic criteria that traditional SQL cannot express, such as "degree of positivity" or "relevance to a user question." There are multiple ways to implement semantic ranking: scoring each row independently (pointwise), comparing rows two at a time (pairwise) or ranking batches of rows together (listwise). Through extensive evaluations, we found that no single approach is always best. On some workloads, pointwise scoring is most accurate; on others, pairwise comparison or listwise ranking wins. Choosing the wrong algorithm for a given task can result in low-quality rankings.

Our approach

The paper introduces a budget-aware optimizer for the semantic ORDER BY operator that dynamically selects the near-optimal semantic sorting algorithm for each ranking query. Rather than committing to a single static implementation, the optimizer uses heuristic rules, along with consensus aggregation via Borda count or LLM-as-a-judge evaluation, to determine which access path will produce the highest-quality ranking within a given budget.
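
For readers unfamiliar with Borda count, here is the standard voting rule in a few lines. How the optimizer applies it internally is not described in this post, so treat this only as a sketch of the aggregation step; the function name borda_aggregate and the alphabetical tie-break are our assumptions.

```python
from collections import defaultdict


def borda_aggregate(rankings: list[list[str]]) -> list[str]:
    """Merge several rankings of the same items into one consensus ranking.

    An item at position i of an n-item ranking earns n - 1 - i points.
    """
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    # Highest total first; ties broken alphabetically for deterministic output.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three candidate rankings, e.g. from three different sorting algorithms.
candidates = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
print(borda_aggregate(candidates))
```

Here "a" collects 5 points, "b" 3 and "c" 1, so the consensus ranking is a, b, c.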

As Figure 3 illustrates, no single algorithm universally wins across all workloads, making this dynamic selection critical. We also observed a distinct test-time scaling effect: Throwing more compute at the problem generally improves semantic ordering accuracy, but those gains eventually plateau. To power this optimizer, we are also introducing two new semantic ranking algorithms:

  • Semantic External Merge Sort: A highly efficient approach that significantly reduces expensive LLM calls compared to existing listwise ranking methods (such as listwise sliding window bubble sort).
  • Semantic Quick Sort with Majority Voting: A robust variant of the vanilla semantic quick sort that leverages majority voting to drive higher overall accuracy.
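
To give a feel for why majority voting helps a quick sort driven by a noisy comparator, here is a generic sketch. It is not the algorithm from the paper, whose details are not covered in this post: the compare callback (a stand-in for an LLM pairwise judgment), the default of three votes and the first-element pivot are all assumptions.

```python
from typing import Callable


def majority_compare(a, b, compare: Callable[[object, object], bool], votes: int = 3) -> bool:
    """Ask a possibly noisy comparator an odd number of times; return the majority answer."""
    yes = sum(compare(a, b) for _ in range(votes))
    return yes * 2 > votes


def semantic_quick_sort(items: list, compare: Callable[[object, object], bool], votes: int = 3) -> list:
    """Sort so that items the comparator ranks higher come first."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    ahead, behind = [], []
    for item in rest:
        # Each partition decision is backed by `votes` comparator calls.
        if majority_compare(item, pivot, compare, votes):
            ahead.append(item)
        else:
            behind.append(item)
    return (
        semantic_quick_sort(ahead, compare, votes)
        + [pivot]
        + semantic_quick_sort(behind, compare, votes)
    )


# Toy comparator that gives the wrong answer on every third call.
calls = {"n": 0}


def noisy_greater(a, b) -> bool:
    calls["n"] += 1
    truth = a > b
    return (not truth) if calls["n"] % 3 == 0 else truth


print(semantic_quick_sort([3, 1, 4, 1, 5, 9, 2, 6], noisy_greater))
```

With three votes per comparison, the single wrong answer in each group is outvoted and the list still comes out in the right order. The price is three comparator calls per decision, which is the cost and accuracy trade-off a budget-aware optimizer has to weigh.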

Figure 3: Sorting accuracy vs. monetary budget ($) across multiple algorithms and LLM models (NBA height ranking and DL19 passage ranking). No single algorithm is universally optimal; accuracy scales with compute but eventually plateaus.

Key results:

  • The optimizer consistently achieved ranking accuracy on par with or superior to the best static method across all benchmarks.
  • Through extensive evaluations, we found that no single fixed ranking algorithm outperforms the others across all tasks, validating the need for adaptive selection.
  • The two proposed semantic ranking algorithms expand the optimizer's search space, unlocking better trade-offs between LLM token costs and ranking accuracy.

Queries that rank or sort rows by semantic criteria can achieve higher-quality results without requiring the user to manually choose a sorting strategy.

How we optimized efficiency

Quality is only half the equation. In a traditional SQL query, the bottleneck is scanning and joining data, but when you add LLM inference, inference dominates: It represents 80% to 90% of total query token spend and scales linearly with row count. Two research efforts tackle this from different angles.

4. Efficient semantic filter evaluation for AI function pipelines

In collaboration with Fuheng Zhao, Paweł Liskowski, Zihan Li, Puxuan Yu, Dimitris Tsirogiannis

The problem

When a query applies multiple AI_FILTER conditions to a table, each AI_FILTER on each row is a separate LLM call. For a table with millions of rows, a poorly chosen evaluation order results in many more LLM calls than necessary, because an earlier filter could have already short-circuited the row. The order in which filters are evaluated determines how many of those calls are wasted.
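
A small back-of-the-envelope calculation shows how much the order matters. The sketch below assumes the filters are combined with AND, pass independently of each other and cost the same per call; the pass rates are made up for illustration.

```python
from itertools import permutations
from typing import Sequence


def expected_calls_per_row(pass_rates: Sequence[float]) -> float:
    """Expected LLM calls for one row when AND-ed filters run in the given order
    and evaluation stops at the first filter that rejects the row."""
    calls, p_reached = 0.0, 1.0
    for p in pass_rates:
        calls += p_reached  # this filter runs only if all earlier ones passed
        p_reached *= p
    return calls


rates = [0.9, 0.5, 0.1]  # hypothetical pass rates of three filters
print(expected_calls_per_row(rates))  # least selective filter first
print(expected_calls_per_row(sorted(rates)))  # most selective filter first
print(min(permutations(rates), key=expected_calls_per_row))
```

Running the least selective filter first costs about 2.35 calls per row; running the most selective first costs about 1.15. Over millions of rows, that gap is the wasted work described above.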

Our approach

The system learns which filters are most likely to short-circuit each row and reorders evaluation so that rows are rejected as early as possible. We present two variants: a reinforcement learning agent that learns the best filter ordering through trial and error, and a supervised selectivity estimator that predicts, for each row, the probability that each filter will pass, using the row's and the predicate's embeddings as input. Either variant chooses the evaluation order that minimizes total LLM token cost.
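
The sketch below shows the general shape of learning an order online. It is a deliberately simplified stand-in, neither the reinforcement learning agent nor the embedding-based estimator: it keeps one running pass-rate estimate per filter and always runs the filter most likely to reject first. The class name OnlineFilterOrderer and the toy filters are hypothetical.

```python
from typing import Callable


class OnlineFilterOrderer:
    """Reorders AND-ed filters using pass rates observed on earlier rows."""

    def __init__(self, filters: dict[str, Callable[[str], bool]]):
        self.filters = filters
        # Smoothed counts: every filter starts at 1 pass out of 2 trials.
        self.passes = {name: 1 for name in filters}
        self.trials = {name: 2 for name in filters}
        self.llm_calls = 0

    def _order(self) -> list[str]:
        # Lowest estimated pass rate first, since it is most likely to reject.
        return sorted(self.filters, key=lambda n: self.passes[n] / self.trials[n])

    def keep(self, row: str) -> bool:
        for name in self._order():
            self.llm_calls += 1  # each filter evaluation stands in for one LLM call
            passed = self.filters[name](row)
            self.trials[name] += 1
            self.passes[name] += passed
            if not passed:
                return False  # short-circuit: skip the remaining filters
        return True


orderer = OnlineFilterOrderer({
    "mentions_product": lambda row: True,  # rarely rejects
    "is_complaint": lambda row: "bad" in row,  # rejects most rows here
})
results = [orderer.keep("fine") for _ in range(20)]
print(orderer.llm_calls)
```

On these 20 rows the sketch spends 21 calls: two on the first row, then one per row once it has learned to run the selective filter first. Evaluating the filters in their original order would spend 40.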

Training takes just 9 to 11 ms and is completely hidden behind LLM calls that take hundreds of milliseconds, so it adds no latency to the query.

Figure 4 illustrates how the system works compared to alternatives: Palimpzest (PZ) uses static selectivity ordering, Quest uses a per-row priority that combines selectivity and cost, and our approach uses online learning that adaptively picks the best filter to evaluate based on observations from previous rows.

Figure 4: How three approaches choose filter evaluation order. PZ uses static selectivity, Quest uses selectivity/cost ratio, and our approach uses a learning agent that adapts per row.

Key results:

  • Up to 19x reduction in AI_FILTER token overhead (excess above the optimal lower bound) compared to state-of-the-art alternatives (Palimpzest, Quest).
  • Saved over 164 million tokens on a 67,000-document workload vs. Palimpzest.
  • Best configuration lands within 1.7% to 3.8% of theoretical optimal.

For a team filtering a large table with multiple AI conditions, this means the query finishes in a fraction of the time and uses a fraction of the tokens, while delivering the exact same outcomes.

5. Semantic SQL efficiency: Streaming Model Cascades

"Streaming Model Cascades for Semantic SQL" in collaboration with Paweł Liskowski and Kyle Schmaus | View paper on arXiv

The problem

Not every row in a table is equally hard to classify. A one-star review saying "worst product ever" is obviously negative, but a mixed review that praises the hardware while criticizing the software requires real reasoning to categorize. Yet when AI_FILTER invokes an LLM on each qualifying row, every row gets the same expensive LLM by default. For most tables, the majority of rows are closer to the obvious case, so the system spends most of its inference budget on rows that a simpler model could handle correctly.

Our approach

Streaming Model Cascades places a small, fast model (the proxy) in front of the expensive, oracle LLM. For each row, the proxy makes a prediction and produces a confidence score. If confidence is high in either direction, that answer is used and the oracle LLM is skipped. If confidence is low, the row is escalated to the oracle LLM for an authoritative answer.

Figure 5 shows this in action on three rows from a table (x₁, x₂, x₃). The proxy scores each row on how likely it is to match the filter. Row x₁ gets a 92% confidence score, clearly a match, so it's accepted immediately. Row x₃ gets 8%, clearly not a match, so it's rejected. Only row x₂, at 55%, is ambiguous enough to need the expensive model.
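
The routing rule for those three rows can be written down directly. In this sketch the thresholds 0.2 and 0.8 are hypothetical placeholders, and the proxy and oracle callables are stand-ins for the small and large models.

```python
from typing import Callable


def cascade_filter(
    rows: list[str],
    proxy: Callable[[str], float],
    oracle: Callable[[str], bool],
    reject_below: float = 0.2,  # hypothetical threshold
    accept_above: float = 0.8,  # hypothetical threshold
) -> tuple[list[str], int]:
    """Return (rows kept by the filter, number of oracle calls made)."""
    kept, oracle_calls = [], 0
    for row in rows:
        score = proxy(row)  # cheap confidence that the row matches
        if score >= accept_above:
            match = True  # confident match: skip the oracle
        elif score <= reject_below:
            match = False  # confident non-match: skip the oracle
        else:
            oracle_calls += 1  # ambiguous: escalate to the expensive model
            match = oracle(row)
        if match:
            kept.append(row)
    return kept, oracle_calls


proxy_scores = {"x1": 0.92, "x2": 0.55, "x3": 0.08}
kept, oracle_calls = cascade_filter(
    ["x1", "x2", "x3"],
    proxy=proxy_scores.get,
    oracle=lambda row: True,  # toy oracle that accepts the ambiguous row
)
print(kept, oracle_calls)
```

Only x2 reaches the oracle, so the cascade makes one expensive call instead of three.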

In practice, the confidence thresholds are not set by hand. The algorithms learn them from the data during query execution, so how often the oracle is called reflects the actual difficulty of the data set.

Figure 5: Model cascade routing. The proxy LLM scores each row's confidence. High-confidence rows are resolved immediately; only uncertain rows escalate to the oracle LLM.

The key question is: How confident does the proxy model need to be before you trust its answer? Two algorithms set those thresholds, with deliberately different contracts:

  1. GAMCAL trades classification quality against oracle cost, and adapts automatically to how hard the data set is.
  2. SUPG-IT takes minimum precision and recall targets from the user and guarantees both with high probability.

Both adapt as the query executes, learning their thresholds from the oracle's answers on a small sample of escalated rows. The streaming design needs no upfront training, no labeled data set and no global pass that would block the query's output. Figure 5 illustrates the cascade architecture.

Key results:

  • F1 > 0.95 at its best operating point on all six test data sets in Snowflake's production Cortex AI Functions engine.
  • GAMCAL reaches that F1 with up to 58% fewer oracle calls than the strongest published cascade baseline (the SUPG cascade in LOTUS).

This means queries that classify large tables achieve near-perfect accuracy while using a fraction of the LLM inference that a single-model approach would require.

Where Cortex AI Functions is headed: Research-driven, enterprise-proven

These papers reflect a deliberate investment in making Cortex AI Functions both more accurate and efficient at enterprise scale, two things that have to be true simultaneously for AI analytics to become a reliable part of how companies operate. Every design decision in Cortex AI Functions is informed by measurement, not assumption, and that standard is only getting more rigorous as the platform grows.
