Smaller Models, Smarter SQL: Arctic-Text2SQL-R1 Tops BIRD and Wins Broadly

Natural language interfaces have made it easier than ever to “talk to your data,” but turning a question into working SQL remains a surprisingly hard problem. SQL’s power lies in its structure and precision, but that same rigidity makes it inaccessible to most users. And while large language models (LLMs) offer a promising bridge, they often falter under real-world demands: multi-table joins, nested logic, ambiguous intent and messy schemas.
What if natural language interfaces could do more than simply translate your question? What if they could reason through the structure of your data, understand the intent behind your question and generate SQL that works?
Introducing Arctic-Text2SQL-R1 — a new family of reasoning-first models from Snowflake AI Research that rethinks how natural language gets translated into structured, executable SQL. Instead of mimicking what SQL “should” look like, it learns what actually works — with simple reward signals grounded in execution accuracy.
These models are No. 1 on the BIRD benchmark: Arctic-Text2SQL-R1 32B achieves the highest execution accuracy of any open or proprietary model.1
Beyond BIRD, Arctic-Text2SQL-R1 models demonstrate strong generalization and efficiency:
State-of-the-art results at every scale across benchmarks: Consistent performance across six diverse and challenging text-to-SQL benchmarks.
Breakthrough in parameter efficiency: The 7B model outperforms open source giants like DeepSeek-v3 (671B) and even commercial models like GPT-4o, while using up to 95x fewer parameters.
The result: better accuracy, smaller models and faster inference — built to meet the quality and efficiency demands of enterprise AI.
In this blog, we’ll break down why Text-to-SQL accuracy is so elusive, walk through our design approach and share key learnings from building Arctic-Text2SQL-R1 to achieve these results. We’ll also show how to get started with the open source release.
The lingering challenge: Why Text-to-SQL is still hard
Turning a simple question into a simple SQL query is easy. But real-world data analysis? That’s a different story. It’s full of edge cases and complexity that break most models.
Traditional LLMs can generate SQL that appears correct, but the code breaks down when it matters. That’s because they’re trained to mimic how SQL looks, not trained to produce code that actually works. The result may seem plausible, but it’s frequently invalid, incomplete or just wrong.
Here’s why:
Complex relationships: Ask for “total sales of eco-friendly products to customers in New York last quarter,” and you’re touching five or more tables — products, categories, orders, order items, customer addresses — all linked through fragile joins. A single misjoined table can lead to wildly inaccurate results, or no results at all.
Nested logic: Take a question like: “Which employees earned more than the average salary in their department and joined in the last two years?” Answering this means first calculating average salaries per department — typically with a subquery or CTE — then filtering employees by that dynamic threshold and hire date. It’s multi-step reasoning, and models often miss the full chain. (A runnable sketch of exactly this kind of query follows this list.)
Language ambiguity: A request like “Show me our top-performing products” sounds simple — but what does “top-performing” actually mean? Is it based on revenue, profit margin, units sold or customer ratings? The model has to infer intent, often without context, and map it to the right metric — like `sales_volume`, `profit_margin` or `review_score`. Without that understanding, the query may technically work but deliver the wrong answer.
Messy, opaque schemas: In real-world databases, column names aren’t always helpful: `trn_val_01` might actually mean `transaction_value`. Relationships between tables may be undocumented or loosely defined, and data types can be inconsistent. To write valid SQL in this environment, a model needs to reason through ambiguity and pick up on subtle clues across the schema.
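To make the nested-logic example above concrete, here is a minimal, self-contained sketch using Python’s built-in sqlite3 module. The schema, data and query are invented purely for illustration; they show the kind of multi-step SQL (a per-department aggregate in a CTE, plus two filters layered on top of it) that a model has to produce for that question.

```python
import sqlite3

# Toy schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department TEXT,
        salary REAL,
        hire_date TEXT
    );
    INSERT INTO employees VALUES
        (1, 'Ava',    'Sales',       90000,  '2024-03-01'),
        (2, 'Ben',    'Sales',       60000,  '2020-01-15'),
        (3, 'Cara',   'Engineering', 120000, '2024-07-01'),
        (4, 'Dmitri', 'Engineering', 100000, '2019-05-20');
""")

# "Which employees earned more than the average salary in their department
# and joined in the last two years?" requires a per-department aggregate
# (the CTE) plus two filters on top of it.
query = """
WITH dept_avg AS (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name, e.department, e.salary
FROM employees AS e
JOIN dept_avg AS d ON e.department = d.department
WHERE e.salary > d.avg_salary
  AND e.hire_date >= DATE('now', '-2 years');
"""

for row in conn.execute(query):
    print(row)
```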
In short, Text-to-SQL isn’t just a language task. It’s a reasoning problem over often-imperfect data.
Our approach to Arctic-Text2SQL-R1: Simple rewards, strong reasoning
To tackle these challenges, we designed Arctic-Text2SQL-R1 around a simple philosophy: Focus on what matters most — whether the query works.
As shown in Figure 1, our approach centers on execution correctness, with a training pipeline that combines data filtering, step-by-step reasoning and a direct reward signal tied to real query results. So, instead of optimizing for how SQL looks, we train models to generate SQL that runs — and returns the right answer.

This led us to three key design principles:
1. Execution-aligned reinforcement learning (RL): The power of real-world feedback
Many Text-to-SQL models rely on complex, handcrafted reward functions that guess whether a query might be correct. That often backfires: The model learns to chase the proxy signal, not actual performance.
We skipped the guesswork. Our reward is simple and grounded: Did the SQL run? Did it return the right answer? This direct feedback forces the model to learn what actually works — against a real, live database — and to produce SQL you can trust, not just SQL that looks right.
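The released training code defines the actual reward; the sketch below only illustrates the principle, with names and shaping values of our own choosing: run the candidate query and a reference query against the same database, give full credit when the results match, and (as one possible choice) a small partial credit when the candidate at least executes.

```python
import sqlite3

def execution_reward(candidate_sql, gold_sql, db_path):
    """Execution-grounded reward sketch: 1.0 if the candidate runs and returns
    the same rows as the reference query, a small partial credit (our choice,
    for illustration) if it at least executes, 0.0 otherwise."""
    conn = sqlite3.connect(db_path)
    try:
        gold_rows = set(conn.execute(gold_sql).fetchall())
        try:
            cand_rows = set(conn.execute(candidate_sql).fetchall())
        except sqlite3.Error:
            return 0.0  # the query does not even execute
        return 1.0 if cand_rows == gold_rows else 0.1
    finally:
        conn.close()
```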
To make this work in a sparse-reward setting — where many early queries fail — we use Group Relative Policy Optimization (GRPO), a reinforcement learning technique built for noisy environments with limited successes. It’s stable, effective and keeps the model focused on one thing: generating queries that work in the real world.
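Without reproducing the full GRPO training loop, the core group-relative idea can be sketched in a few lines: sample several candidate queries per question, score each with the execution reward, then normalize every reward against its own group so that even sparse 0/1 feedback yields a usable advantage signal. This is a simplified illustration, not the production implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """For candidate SQL queries sampled for the same question, score each
    one relative to its group: (reward - group mean) / group std. A query
    that beats its siblings gets a positive advantage even when most of the
    group failed, which keeps sparse rewards informative."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled queries for one question; only the third was correct.
print(group_relative_advantages([0.0, 0.1, 1.0, 0.0]))
```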
2. Data-centric training: Curate, don’t accumulate
The old adage holds: More data doesn’t mean better data — especially in Text-to-SQL.
For Arctic-Text2SQL-R1, we didn’t just collect data; we curated it. We started by filtering established benchmarks like BIRD and SPIDER, targeting not just errors but noise — queries that return empty results or take ages to run. Those examples don’t teach useful reasoning and just slow everything down. Cleaner data meant a clearer signal for learning.
But we didn’t stop there. To scale complexity and coverage, we turned to synthetic data — generating new Text-to-SQL examples with models like GPT-4o. Then came the crucial step: model-based filtering. We used Arctic-Text2SQL itself to vet the generated queries, keeping only the ones that executed correctly and matched the expected results.
It’s a virtuous cycle: Use better models to get better data, then use that data to train even better models. The end result? It gives Arctic-Text2SQL-R1 the range to generalize across the messy, multistep and often ambiguous questions analysts face every day.
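As a rough illustration of that model-based filtering step, the loop below keeps a synthetic example only if the model’s own prediction for the question executes and reproduces the result of the synthetically generated query. The helper names, dict keys and the reuse of the `execution_reward` sketch from earlier are our assumptions, not the released pipeline.

```python
def vet_synthetic_examples(examples, generate_sql, db_path):
    """Model-based filtering sketch. `examples` is assumed to be an iterable
    of dicts with 'question', 'schema' and 'sql' keys, and `generate_sql` is
    a hypothetical callable wrapping the current Text-to-SQL model."""
    kept = []
    for ex in examples:
        predicted_sql = generate_sql(ex["question"], ex["schema"])
        # Keep the example only if the model's query runs and matches the
        # result of the synthetic reference query (execution_reward above).
        if execution_reward(predicted_sql, ex["sql"], db_path) == 1.0:
            kept.append(ex)
    return kept
```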
3. Systematic exploration of training strategy: Learn faster by doing it right
Our training strategy for Arctic-Text2SQL-R1 began with a careful evaluation of optimization algorithms. We compared GRPO and Proximal Policy Optimization (PPO), and found GRPO consistently outperformed PPO for Text2SQL tasks.
Choosing the right starting model proved equally important. Models such as Qwen-2.5-Coder-Instruct or supervised fine-tuned variants such as OmniSQL — already strong in instruction-following and accuracy — provided a significant advantage. In contrast, models that started with lower accuracy, even those trained on other reasoning tasks, failed to catch up during RL fine-tuning.
We also found that the mode of reinforcement learning (RL) interaction matters. Online RL — where the model interacts continuously with the environment — led to stronger results than batch RL. We attribute this to online RL’s ability to adapt dynamically and learn from diverse, complex negative examples encountered in live training. This aligns with observations in math and programming domains and now extends to Text2SQL as well.
Prompt structure emerged as another critical factor. Switching from a generic prompt to a carefully adapted version of the OmniSQL prompt — designed for RL — resulted in significant gains. Key improvements came from structuring the prompt effectively, including “thinking” instructions and optimizing how the database schema is serialized.
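The exact prompt used for Arctic-Text2SQL-R1 is described in the paper; the sketch below only illustrates the structure discussed here (a serialized schema, an explicit instruction to think first, and a fixed output format), with wording entirely of our own invention.

```python
def build_prompt(question, schema_ddl):
    """Illustrative prompt layout only; the real OmniSQL-derived prompt
    differs in wording and detail."""
    return (
        "You are a Text-to-SQL assistant.\n\n"
        "Database schema (serialized as CREATE TABLE statements):\n"
        f"{schema_ddl}\n\n"
        f"Question: {question}\n\n"
        "First think step by step about which tables, joins and filters are "
        "needed, then output the final SQLite query inside a fenced sql code block."
    )
```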
Interestingly, while we experimented with more fine-grained reward functions, these often backfired — encouraging "lazy" behaviors where models optimized for short-term gains rather than global correctness. In contrast, our simpler, execution-driven reward signal proved more effective, reinforcing desired behaviors without introducing perverse incentives.
Breakthrough performance: Smaller, smarter and state-of-the-art
The combination of execution-based training, clean data and strong initialization paid off. Arctic-Text2SQL-R1 doesn’t just improve on the status quo — it sets a new bar for performance, efficiency and reliability in Text-to-SQL.
Arctic-Text2SQL-R1 leads the BIRD benchmark
BIRD, one of the most rigorous Text-to-SQL benchmarks available, evaluates real execution accuracy on complex, multitable, real-world queries. Arctic-Text2SQL-R1 doesn’t just perform well on BIRD; it dominates, with best-in-class results across every model size.

Arctic-Text2SQL-R1-32B sets a new state-of-the-art, achieving 71.83% execution accuracy on BIRD, outperforming all other open and proprietary models.
Arctic-Text2SQL-R1-14B scores 70.04%, making it the top-performing model under 30B parameters. It outperforms larger systems like XiYanSQL-QwenCoder-32B and Arctic-ExCoT-70B, proving that size isn’t the only path to accuracy.
Arctic-Text2SQL-R1-7B stands out as the most compact model in the Top 10. At 68.47%, it matches the performance of ExCoT-70B despite being one-tenth the size, and it is the only 7B model to break into the leaderboard’s top tier, showcasing best-in-class parameter efficiency.
Across all size categories, Arctic-Text2SQL-R1 models are not just competing; they are setting the pace on the BIRD benchmark, delivering top performance with remarkable efficiency.

This performance isn’t just about leaderboard rankings, however; it’s about efficiency that scales. As shown in Figure 3, Arctic-Text2SQL-R1 models sit firmly in the top-left corner of the cost-for-performance curve: smaller models with higher accuracy.
While many competitors require 70B+ or even 100B+ parameters to push past 65%, Arctic’s 7B, 14B and 32B models achieve best-in-class results through reasoning-first training and execution-aligned optimization.
Winning broadly: State-of-the-art across diverse data sets
Beyond the BIRD benchmark, we evaluated Arctic-Text2SQL-R1 across six diverse data sets, including BIRD-dev, Spider-test, Spider2.0-SQLite, Spider-DK, Science Benchmark and EHRSQL. This comprehensive testing, covering everything from multihop reasoning to domain-specific SQL, confirms that our models deliver consistent state-of-the-art accuracy, regardless of size.

7B model: This compact powerhouse not only outperforms other leading specialized 7B SQL models such as SQL-R1-7B and OmniSQL-7B, but even surpasses DeepSeek-V3 (a massive 671B MoE model), highlighting the remarkable efficiency achieved with just 7 billion parameters.
14B model: This model reinforces its strength by being the top performer under 30 billion parameters. It consistently tops other reasoning-optimized models in its class, including Reasoning-SQL-14B and OmniSQL-14B.
32B model: This flagship model achieves the highest average accuracy across this broad suite of benchmarks. It even outperforms leading commercial models like GPT-4o, demonstrating its superior capability in generating accurate SQL.
The story is one of exceptional parameter efficiency. Despite having dramatically fewer parameters, Arctic-Text2SQL-R1 models consistently outperform both open-source and commercial leaders. For example, they surpass models like DeepSeek-v3 (671B) and even GPT-4o, all while using up to 95x fewer parameters.
These results are a testament to our approach: Arctic-Text2SQL-R1 proves that smaller, smarter and task-optimized models can not only compete with but decisively win against much larger counterparts, delivering exceptional Text-to-SQL capabilities.
Key learnings on the path to robust Text-to-SQL
Our journey with Arctic-Text2SQL-R1 was an iterative process of experimentation and refinement. We didn't just build a model; we uncovered crucial insights along the way that we believe can significantly benefit the broader AI and data community.
1. Data quality and strategic filtering are paramount
In the era of big data, it's tempting to believe that "more is always better." Our experiments painted a more nuanced picture.
When we added a large volume of unfiltered synthetic data (Gretel-Synth-NonFiltered) alongside our BIRD and SPIDER benchmarks, performance actually dipped. For the 14B model, execution accuracy on BIRD-dev dropped from 64.9% to 64.6%, and SPIDER-test fell from 86.8% to 86.4%.
The magic happened when we applied model-based filtering — using our own Arctic-Text2SQL models to vet the synthetic queries and retain only high-quality, correct examples. With this curated “Gretel-Synth-Filtered” set, performance jumped to 66.5% on BIRD-dev and 88.3% on SPIDER-test.
The takeaway: Strategically filtering for correctness turned noisy synthetic data into a meaningful training signal.
Base Model | Training Data | BIRD-dev | SPIDER-test |
---|---|---|---|
Qwen-coder-14B-Inst | BIRD, SPIDER | 64.9 | 86.8 |
Qwen-coder-14B-Inst | BIRD, SPIDER, Gretel-NonFiltered | 64.6 | 86.4 |
Qwen-coder-14B-Inst | BIRD, SPIDER, Gretel-Filtered | 66.5 | 88.3 |
2. The power of a good foundation
Reinforcement learning (RL) success depends heavily on where you begin. Our experiments showed that starting from a strong, instruction-tuned model that is already fluent in code and SQL consistently led to better results than using a general-purpose or mismatched base.
When we started from a general-purpose model, BIRD-dev accuracy reached 64.4%. Switching to an instruction-tuned variant gave us a modest boost to 64.9%. Then we pushed the idea further.
By initializing RL with OmniSQL-32B — a model already fine-tuned for SQL — we saw another leap in performance: 67.9% on BIRD-dev and 88.2% on SPIDER-test.
The lesson is simple: The stronger your foundation, the faster and farther RL can go.
Base Model (32B) | Training Strategy (Online RL) | Optimization | BIRD-dev | SPIDER-test |
---|---|---|---|---|
QwQ-32B | --- | GRPO | 55.2 | 79.3 |
Qwen2.5-Coder-32B | --- | GRPO | 64.4 | 87.3 |
Qwen2.5-Coder-32B-Inst | --- | GRPO | 64.9 | 87.7 |
Qwen2.5-Coder-32B-Inst* | Online RL | GRPO | 66.6 | -- |
OmniSQL-32B | Online RL | GRPO | 67.9 | 88.2 |
3. The right RL setup makes a big difference
Beyond the foundational model, the specifics of the RL strategy itself are critical.
Keep the reward signal simple: Did the SQL execute and return the correct result? This simplicity focused the model on generating correct, executable SQL — and avoided "reward hacking," a common pitfall where models learn to exploit complex proxy rewards without genuinely improving on the desired task.
RL algorithm choice matters: Our comparative experiments indicated that GRPO (Group Relative Policy Optimization) offered a consistent performance advantage over the more traditional PPO (Proximal Policy Optimization) for our specific task and data.
Online learning advantage: We found that online RL, where the model learns from live SQL execution feedback, significantly outperformed batch RL. On BIRD-dev, switching from batch to online training (using the same model and GRPO optimizer) improved accuracy from 64.9% to 66.6%. For dynamic, reasoning-intensive tasks like Text-to-SQL, this immediacy helped the model adapt faster and generalize better — highlighting that RL setup isn’t one-size-fits-all, and that choosing the right learning strategy can have a measurable impact.
Base Model (Qwen2.5-Coder-32B-Inst) | Training Strategy | Optimization | BIRD-dev | SPIDER-test |
---|---|---|---|---|
Qwen2.5-Coder-32B-Inst | Batch RL | PPO | 63.0 | 85.7 |
Qwen2.5-Coder-32B-Inst | Batch RL | GRPO | 64.9 | 87.7 |
Qwen2.5-Coder-32B-Inst* | Online RL | GRPO | 66.6 | -- |
Prompt engineering is non-negotiable (even for specialized models): In our experiments, switching from a self-defined prompt to a refined version of the OmniSQL prompt boosted BIRD-dev performance from 67.9% to 70.5%. The model, training setup and optimization algorithm stayed the same — only the prompt changed. The takeaway: Prompt design remains a critical lever for performance, even in advanced, fine-tuned systems.
Base Model (OmniSQL-32B) | Training Strategy (Online RL) | Optimization | BIRD-dev | SPIDER-test |
---|---|---|---|---|
OmniSQL-32B | Online RL + Self-defined Prompt | GRPO | 67.9 | 88.2 |
OmniSQL-32B | Online RL + Modified OmniSQL Prompt | GRPO | 70.5 | 88.7 |
Together, these insights show that high-performing Text-to-SQL isn’t about a single breakthrough. It’s about combining good data, smart initialization, the right RL strategy and careful prompt design into a cohesive system. Please read our paper for more technical details.
Get started: Arctic-Text2SQL-R1 is now open source
We’ve open sourced the full Arctic-Text2SQL-R1 7B model on Hugging Face, along with evaluation code at ArcticTraining. Whether you're experimenting with new prompts, benchmarking on custom data or integrating Text-to-SQL into your stack, we invite you to explore, reproduce and build on our work. We look forward to seeing what you do!
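For a quick first experiment, a standard Hugging Face transformers loading snippet along the lines below should be enough. The repository id, prompt wording and generation settings here are assumptions for illustration, so check the model card on Hugging Face for the exact name and recommended prompt format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed for illustration; verify it on Hugging Face.
model_id = "Snowflake/Arctic-Text2SQL-R1-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "You are a Text-to-SQL assistant.\n\n"
    "Database schema:\n"
    "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);\n\n"
    "Question: Which employees earn more than their department's average salary?\n\n"
    "Think step by step, then output the final SQLite query."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```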
Join us in the Snowflake AI Community to connect with researchers, developers and engineers building the next wave of intelligent data tools.
Contributors
Snowflake AI Research: Zhewei Yao, Lukasz Borchmann, Bohan Zhai, Hao Zhang and Yuxiong He
Academic Collaborators: Guoheng Sun, Zheyu Shen and Ang Li (University of Maryland), Minghang Deng (UC San Diego)
We are especially grateful to our academic partners for their thoughtful collaboration and impactful contributions throughout this work.
1 Based on BIRD Leaderboard Top 10 models, as of May 20, 2025.