MAY 27, 2026 / 9 min read / Gen AI

Arctic-Text2SQL-R2: What It Takes to Beat Frontier Models on Enterprise SQL

Enterprise SQL is where general-purpose frontier models may still struggle. Real customer schemas are messy, Snowflake dialect details matter, and user questions are often underspecified. Correctness requires more than plausible SQL: The model has to read unfamiliar schemas, reason over business logic and avoid near-miss queries that happen to execute successfully.

Arctic-Text2SQL-R2 was built around that reality. Rather than relying on scale alone, we co-designed the model recipe, training data, reward pipeline and RL system around Snowflake SQL. The result is a compact SQL reasoning model that outperforms Gemini 3.1 Pro, Claude Opus 4.7 and every other frontier LLM we tested on the most challenging Snowflake SQL generation problems.

Figure 1: Accuracy on the Snowflake Text-to-SQL HARD benchmark, a deliberately difficult evaluation dataset where even the strongest frontier models struggle. Arctic-Text2SQL-R2 leads despite being 30–150x smaller than other high-performing models.

The sharpest comparison in Figure 1 is not just R2 versus the frontier. It is R2 versus Qwen3.6, the nearest-sized general model in the figure. Qwen3.6 trails by about 15 points and sits near the bottom. R2, in contrast, sits at the top while remaining dramatically smaller than the frontier systems it beats.

This is the second chapter of a story we started with Arctic-Text2SQL-R1.5, where we showed that a specialized, low-latency model could match frontier LLM accuracy on Snowflake SQL at a fraction of the cost. R2 extends that arc from matching frontier models to beating them. Getting there required progress across four connected fronts: Snowflake-dominant mid-training, realistic SQL from the wild, collision-resistant execution rewards and RL system innovations built for long-context SQL reasoning. The systems side is central to R2's model-system co-design: it reduces redundant computation in long-schema, multi-rollout RL training, letting us iterate quickly on both the model and the reward design. For a deeper dive, see our companion systems blog post.

Figure 2: Arctic-Text2SQL-R2 improves on R1.5 quality through a) realistic SQL data from the wild, b) collision-resistant rewards and c) Snowflake-dominant mid-training.

Snowflake-dominant mid-training

Arctic-Text2SQL-R1.5 followed a standard two-stage recipe. First, the model went through supervised fine-tuning on labeled question-SQL pairs, where it learned to imitate high-quality examples. Then came GRPO, or group relative policy optimization, where the model generated SQL for each task and received a reward when the query executed successfully and returned the correct answer.
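
To make the "group relative" part concrete, here is a minimal sketch of how GRPO is commonly formulated: each rollout's reward is normalized against the other rollouts sampled for the same task. The post does not specify the exact variant used for R1.5 or R2, so the binary reward, the `eps` constant and the function name are assumptions for illustration only.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    One group = all rollouts sampled for a single question.
    `eps` (an assumed constant) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one question. Assumed binary reward:
# 1.0 = executed and matched the gold result, 0.0 = anything else.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct rollouts get positive advantage, incorrect ones negative
```

When every rollout in a group earns the same reward, all advantages are zero and the task contributes no gradient signal, which is one reason the quality of the reward matters so much in this setup.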

That recipe worked, but most of the training signal was still non-Snowflake, with industrial Snowflake-specific data mixed in only near the end.

R2 changes the center of gravity. Instead of relying on a short supervised fine-tuning stage built mostly around labeled question-SQL pairs, we introduce Snowflake-dominant mid-training on a large curated corpus of Snowflake-related DDL, documentation, analytical scripts, stored procedures and SQL of many forms. Supervised question-SQL pairs are still included, but they are no longer the dominant ingredient.

R2 also changes the RL stage. Because the model has already been exposed to much more Snowflake data during mid-training, we can scope RL post-training to Snowflake-only examples instead of relying on the non-Snowflake-to-Snowflake progression used in R1.5.

Together, these changes move the Snowflake-specific signal from a final-stage adaptation to the dominant force across both major training phases. As a result, GRPO no longer has to teach Snowflake SQL from a shallow starting point. It can focus on the harder part: reasoning over intent, schemas, joins and dialect-specific query shapes.

SQL from the wild

The guiding principle for R2's data pipeline was simple: Training data should look like the SQL problems the model will face in production. Most easy-to-obtain SQL data does not.

LLM-authored questions are too precise. Synthetic question generation often produces unnaturally specific questions: "What is the total revenue for product category 'Electronics' in Q3 2025, excluding returns, grouped by region?" Real users are more likely to ask: "Show me revenue by region." Training too heavily on over-specified questions shifts the model away from the ambiguity distribution it will encounter in production.

Labeled question-SQL pairs are only the tip of the iceberg. They are valuable but narrow, and they rarely cover long-tail SQL features. Unannotated SQL from public code repositories, open analytics projects, Snowflake documentation, stored procedures and analytical scripts exposes the model to a much broader distribution: rare window-function patterns, unusual join shapes, nested CTEs, semi-structured data access and Snowflake syntax used by expert practitioners.

The best training material is SQL written to get work done. Practitioner SQL contains temporal joins, nullable columns, inconsistent naming conventions, multi-CTE transformations and business-specific filters. That messiness is exactly what makes it useful. Extensive ablations helped us separate corpora that measurably improved R2 from those that only appeared useful. The final R2 corpus has both more volume and more realism than the synthetic-heavy data used in R1.5.

Regrounding prevents schema memorization. Labeled question-SQL pairs often cluster on a small number of schemas, which tempts the model to memorize table names, column conventions and characteristic joins. Our regrounding pipeline breaks that dependency: the same analytical intent can be regrounded onto arbitrary schemas with different domains, naming conventions and data distributions.

Figure 3: Regrounding a single analytical intent across arbitrary schemas. The same seed question can be regrounded against different domains. In R2’s training pipeline, we reground every question on a unique schema to improve generalization.

This teaches Arctic-Text2SQL-R2 to read the schema in front of it, rather than rely on one it has seen before.
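
As a toy illustration of the idea, the sketch below renders one analytical intent ("total of a measure by a dimension") against two unrelated schemas. It is not the actual regrounding pipeline, which the post does not describe in detail; the `SchemaBinding` class, the `reground` function and all table and column names are hypothetical, and simple template filling stands in for whatever generation method is really used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaBinding:
    """Hypothetical mapping from abstract intent slots to one concrete schema."""
    table: str
    measure: str
    dimension: str

def reground(binding):
    """Render the intent 'total <measure> by <dimension>' for one schema.

    Returns a (question, sql) pair. Template filling is an assumed
    simplification used only to show the shape of the output.
    """
    measure_words = binding.measure.lower().replace("_", " ")
    dimension_words = binding.dimension.lower().replace("_", " ")
    question = f"Show me total {measure_words} by {dimension_words}."
    sql = (
        f"SELECT {binding.dimension}, SUM({binding.measure}) AS TOTAL\n"
        f"FROM {binding.table}\n"
        f"GROUP BY {binding.dimension}"
    )
    return question, sql

# Two made-up schemas from different domains.
retail = SchemaBinding(table="SALES.ORDERS", measure="NET_REVENUE", dimension="REGION")
clinic = SchemaBinding(table="OPS.VISITS", measure="BILLED_AMT", dimension="CLINIC_ID")

for binding in (retail, clinic):
    question, sql = reground(binding)
    print(question)
    print(sql)
    print()
```

The two outputs share a query shape but no identifiers, so a model trained on both cannot succeed by recalling table or column names.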

Collision-resistant execution rewards

Execution-based rewards are powerful for text-to-SQL. The model generates SQL, we run it on Snowflake, and we compare the execution result with the gold query's result.
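
A minimal sketch of such an execution-match reward is shown below. It uses Python's built-in `sqlite3` as a stand-in for Snowflake so that it runs anywhere, and it compares results as order-insensitive multisets of rows; both choices, along with the function name and the sample table, are assumptions rather than details from the R2 pipeline.

```python
import sqlite3
from collections import Counter

def execution_match(conn, candidate_sql, gold_sql):
    """Return 1.0 if both queries yield the same multiset of rows, else 0.0."""
    try:
        candidate_rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # SQL that fails to execute earns no reward
    gold_rows = conn.execute(gold_sql).fetchall()
    # Counter ignores row order but still counts duplicate rows.
    return 1.0 if Counter(candidate_rows) == Counter(gold_rows) else 0.0

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EMEA', 10.0), (2, 'EMEA', 5.0), (3, 'APAC', 7.0);
""")

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
good = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region"
bad = "SELECT region, MAX(amount) FROM orders GROUP BY region"

print(execution_match(conn, good, gold))  # 1.0
print(execution_match(conn, bad, gold))   # 0.0
```

Comparing only the returned rows is what makes this reward cheap to compute, and also what leaves it open to the failure mode described next.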

But standard execution-match has a hidden failure mode: reward collisions. A reward collision occurs when an incorrect SQL query returns the same result as the gold query on a particular database. The SQL is wrong, but the reward says it is right.

For example, DISTINCT is a no-op if the rows are already unique. An outer join and an inner join return the same result if every row happens to have a match on both sides. A top-N query using LIMIT 5 may match a rank-aware query if there are no ties at the cutoff.
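
The first two cases are easy to reproduce on a tiny database. The sketch below again uses `sqlite3` in place of Snowflake, with made-up tables in which every customer has exactly one order; under that data, each query pair returns identical rows even though the queries are not equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (10, 1), (11, 2);
""")

def rows(sql):
    """Run a query and return its rows in a canonical order."""
    return sorted(conn.execute(sql).fetchall())

# Each entry is (intended query, near-miss query).
pairs = {
    "distinct": (
        "SELECT DISTINCT customer_id FROM orders",
        "SELECT customer_id FROM orders",
    ),
    "join": (
        "SELECT c.name, o.id FROM customers c LEFT JOIN orders o ON o.customer_id = c.id",
        "SELECT c.name, o.id FROM customers c JOIN orders o ON o.customer_id = c.id",
    ),
}

# True means the near-miss is indistinguishable from the intended query here.
collisions = {name: rows(gold) == rows(miss) for name, (gold, miss) in pairs.items()}
print(collisions)  # {'distinct': True, 'join': True}
```

A second order for the same customer, or a customer with no orders, would separate each pair; the collision exists only because this particular database lacks those rows.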

Figure 4: For cases where stricter comparison alone is not sufficient, we enumerate near-miss SQL patterns and inject edge-case rows into the training database until the gold query and each near-miss diverge.

For R2, we examined these collisions systematically and tightened the reward pipeline so it pays out only when the SQL is genuinely correct. In cases where normal database contents are not enough to distinguish the gold query from likely near-misses, we inject edge-case rows that force them to diverge.
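
The injection step can be sketched as a small loop: apply candidate edge-case rows one at a time and stop as soon as the gold query and a near-miss disagree. This is an illustrative sketch on `sqlite3`, not the production pipeline; the function names, the `scores` table and the hand-written candidate rows are all assumptions, and a real system would presumably generate candidates per near-miss pattern and discard rows that do not help.

```python
import sqlite3
from collections import Counter

def same_result(conn, sql_a, sql_b):
    """Compare two queries by their multisets of returned rows."""
    return Counter(conn.execute(sql_a).fetchall()) == Counter(conn.execute(sql_b).fetchall())

def inject_until_divergent(conn, gold_sql, near_miss_sql, candidate_inserts):
    """Apply edge-case INSERTs until gold and near-miss disagree.

    Returns the list of inserts applied, or None if the queries still
    collide after every candidate has been tried.
    """
    applied = []
    if not same_result(conn, gold_sql, near_miss_sql):
        return applied  # already distinguishable, nothing to inject
    for insert_sql in candidate_inserts:
        conn.execute(insert_sql)
        applied.append(insert_sql)
        if not same_result(conn, gold_sql, near_miss_sql):
            return applied
    return None

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (player TEXT, points INTEGER);
    INSERT INTO scores VALUES ('a', 90), ('b', 80), ('c', 70);
""")

# Gold: top 2 by points, keeping every player tied at the cutoff.
gold = """
    SELECT player FROM scores
    WHERE points >= (SELECT MIN(points) FROM
                     (SELECT points FROM scores ORDER BY points DESC LIMIT 2))
"""
# Near-miss: plain LIMIT, which silently drops ties.
near_miss = "SELECT player FROM scores ORDER BY points DESC LIMIT 2"

edge_rows = [
    "INSERT INTO scores VALUES ('d', 10)",  # low score: does not separate the queries
    "INSERT INTO scores VALUES ('e', 80)",  # tie at the cutoff: separates them
]
applied = inject_until_divergent(conn, gold, near_miss, edge_rows)
print(applied)
```

The expected answer is recomputed from the gold query after injection, so the reward stays consistent with the modified database while the near-miss no longer matches it.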

The impact is visible in the benchmark diff. Among tasks that R2 passes and R1.5 fails, a meaningful slice comes from exactly the patterns this pipeline targets: top-N queries that mishandle tied rows at the cutoff, a missing DISTINCT that goes unnoticed when rows are already unique, outer-vs-inner join mistakes that stay hidden when every row has a match, and AND-vs-OR filters that happen to select the same rows.

Removing those false positives makes RL more useful. The model is no longer reinforced for queries that merely work by accident. Instead, it learns query shapes that remain correct when the database exposes the edge case.

Together, these three quality improvements change the kinds of errors R2 avoids. The gains are not limited to complex queries; many come from structurally simple SQL where correctness depends on Snowflake-specific semantics or subtle query-shape choices.

Figure 5: Structurally simple SQL examples where Arctic-Text2SQL-R2 gets the Snowflake-specific answer right while frontier models stumble. These examples illustrate the long tail of Snowflake idioms and quirks that general-purpose LLMs often miss.

This is the accuracy side of R2's model-system co-design: a Snowflake-native foundation, realistic schema-diverse data and rewards that reject accidental correctness. The remaining question is how to run that recipe efficiently at scale.

RL system innovations for efficient long-context SQL reasoning

The training recipe and reward design only matter if we can run them at scale. Enterprise SQL makes that difficult: each prompt can include long schemas, DDL, documentation, sample values and business context, while RL requires multiple rollout completions per example.

Arctic-Text2SQL-R2 is trained using the Arctic RL backend, featuring ZoRRo, or Zero Redundancy Rollouts. In standard RL training, the same long prompt is redundantly processed for every rollout. ZoRRo mitigates that inefficiency through split attention and prompt deduplication, allowing the model to compute the prompt representation once and reuse it across 8–64 completions.
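
A back-of-the-envelope count shows why the redundancy matters when prompts are long. The sketch below only counts tokens pushed through the model per training example, with and without prompt deduplication; the prompt and completion lengths are hypothetical, and token counts are not the same as wall-clock speedup, since completion tokens still attend over the shared prompt.

```python
def tokens_processed(prompt_len, completion_len, num_rollouts, dedup):
    """Count tokens processed for one training example.

    Without deduplication, the prompt is re-processed for every rollout.
    With deduplication, it is processed once and shared.
    """
    if dedup:
        return prompt_len + num_rollouts * completion_len
    return num_rollouts * (prompt_len + completion_len)

# Hypothetical lengths: a long schema-heavy prompt and a short SQL completion.
prompt_len, completion_len = 60_000, 2_000

for num_rollouts in (8, 64):  # the range of completions per prompt cited above
    baseline = tokens_processed(prompt_len, completion_len, num_rollouts, dedup=False)
    shared = tokens_processed(prompt_len, completion_len, num_rollouts, dedup=True)
    print(num_rollouts, baseline, shared, round(baseline / shared, 1))
```

The longer the prompt is relative to the completion, and the more rollouts per prompt, the larger the share of work that is pure repetition.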

For R2, this was not just a systems optimization; it changed what we could train. By eliminating prompt redundancy, ZoRRo made training 3.5x faster and enabled us to scale context length from 20K to 64K tokens, a 3.2x increase that would have been out of reach with standard RL systems due to GPU memory constraints. In practice, this compressed training cycles from roughly five days to under 36 hours, letting us iterate much faster on reward formulations, architectural variations and the overall training recipe.

The longer context window also mattered directly for SQL quality. It allowed the model to process more complete database schemas alongside complex user questions, reducing truncation in the long-schema settings that are common in enterprise SQL. Arctic RL further accelerates rollout generation through Forest Cascade Attention and speculative decoding, improving throughput during the active reasoning phase where the model generates and evaluates candidate SQL solutions.

This is the system half of R2's model-system co-design: faster iteration and longer context made it practical to refine the model and reward design at the scale required for enterprise SQL. For the full details of the systems behind ZoRRo and Arctic RL, see our companion systems blog post.

Conclusion

Closing the gap on enterprise SQL required four pieces compounding together: Snowflake-dominant mid-training, realistic SQL designed to get work done, collision-resistant execution rewards and RL infrastructure designed for long-context schema reasoning.

Each piece mattered. None was sufficient on its own. Better data with a noisy reward signal still teaches the wrong lessons. A clean reward signal on a weak Snowflake foundation hits a lower ceiling. A strong recipe without the right RL infrastructure is too slow to iterate on at the scale enterprise SQL requires.

Arctic-Text2SQL-R2 shows that enterprise SQL is not solved by generic scale alone. The winning recipe is specialization: a model trained on Snowflake-native data, optimized with rewards that distinguish true correctness from accidental execution matches and supported by RL systems built for long-schema reasoning.

The result is a model small enough to deploy at production scale and accurate enough to beat the frontier — the combination that makes Arctic-Text2SQL-R2 a natural choice for Snowflake Cortex Analyst customers.
