Bridging the Gap Between LLMs and Real-World Challenges: Snowflake at ACL 2025

The Association for Computational Linguistics (ACL) conference is the premier natural language processing research event, where the field’s most significant advances are presented and debated.
Snowflake AI Research brought four papers to ACL 2025 that demonstrate that efficiency and practicality don't require sacrificing performance. Each paper addresses a fundamental challenge facing enterprises today. Below is a brief summary of our most impactful findings.
1. How do we deploy increasingly large models efficiently?
The challenge: Large mixture-of-experts (MoE) models achieve impressive performance by routing tokens to specialized sub-networks, but their size makes deployment challenging. Models like GPT-4 (rumored to have eight large experts), DeepSeek R1 (256 experts) or Snowflake Arctic (128 experts) can require terabytes of memory. Traditional compression methods can degrade the performance to an unacceptable degree, forcing a difficult trade-off between efficiency and capability.
Our solution: STUN (structured-then-unstructured) pruning removes entire experts before pruning individual weights — the opposite of conventional wisdom. This counterintuitive approach works because of how MoE models are trained: with only a few experts active per token, the model learns redundancy, and many experts develop similar specializations, especially for common tokens. By removing redundant experts first, we preserve the model's weight-distribution properties, making subsequent unstructured pruning more effective.

We developed an algorithm that identifies redundant experts by analyzing their router weights and activation patterns across thousands of examples. Experts that consistently activate on similar inputs likely perform similar functions. By computing activation correlation matrices, we can cluster experts and keep only one representative per cluster.
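The clustering step can be sketched as follows. This is a simplified illustration, not the STUN algorithm itself: it considers activation correlations only (ignoring router weights), uses a single correlation threshold as a stand-in for a proper clustering criterion, and the `prune_redundant_experts` function and `threshold` parameter are our own naming for this sketch.

```python
import numpy as np

def prune_redundant_experts(activations: np.ndarray, threshold: float = 0.9):
    """Greedy sketch: `activations` is an (num_experts, num_tokens) matrix of
    per-expert activation scores collected over sample inputs. Experts whose
    activation patterns correlate above `threshold` are grouped together, and
    one representative per group is kept."""
    corr = np.corrcoef(activations)          # (E, E) correlation matrix
    num_experts = activations.shape[0]
    kept, assigned = [], set()
    for e in range(num_experts):
        if e in assigned:
            continue
        kept.append(e)                       # representative of a new cluster
        # fold in every yet-unclustered expert highly correlated with e
        for other in range(e + 1, num_experts):
            if other not in assigned and corr[e, other] >= threshold:
                assigned.add(other)
    return kept

# Toy example: experts 0 and 1 behave almost identically, expert 2 differs,
# so only experts 0 and 2 survive.
rng = np.random.default_rng(0)
a = rng.normal(size=(1, 1000))
acts = np.vstack([a,
                  a + 1e-6 * rng.normal(size=(1, 1000)),
                  rng.normal(size=(1, 1000))])
print(prune_redundant_experts(acts))
```

In practice the surviving experts' weights would then be handed off to a standard unstructured (per-weight) pruning pass, which is the second stage STUN's name refers to.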
This innovation makes large MoE models more practical for deployment, reducing infrastructure costs while maintaining the capabilities enterprises rely on.
Snowflake’s pursuit of efficiency is multifaceted. This is exemplified by our work on Arctic Inference, an open source library of advanced optimizations that makes the serving of large models faster and more cost-effective. Together, these efforts showcase our deep commitment to making large-scale AI practical.
Resources: Paper
2. How can we process real-world documents without massive resources?
The challenge: Enterprises process vast quantities of complex documents, such as contracts, financial reports, or technical manuals with mixed text and visual content. While large vision-language models achieve strong zero-shot performance, they require billions of parameters and massive computational resources, leading to high costs, whether self-hosted or accessed via APIs. They struggle with long documents and may still need fine-tuning for specialized domains or document types. The key challenge is achieving comparable document understanding with models that are economical to operate at scale.
Our solution: Arctic-TILT is a compact encoder-decoder model whose performance matches or exceeds that of models thousands of times its size. Through a combination of architectural design and optimization techniques, it processes documents of up to 400k tokens (approximately 500 pages) on a single 24GB GPU, making enterprise deployment both practical and cost-effective.

The architecture progressively integrates textual, visual and layout information using tensor product-inspired fusion applied throughout the encoder layers. For long-context processing, we employ chunked attention that reduces complexity from quadratic to linear, combined with gradient checkpointing and CPU offloading to minimize memory usage. The model underwent two-stage training: extensive self-supervised pretraining on large PDF corpora, followed by supervised fine-tuning on diverse document understanding tasks.
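The linear-cost idea behind chunked attention can be illustrated with a minimal single-head NumPy sketch (our own simplification, not the Arctic-TILT implementation): each position attends only to keys within its fixed-size chunk, so total cost scales with `seq_len * chunk` rather than `seq_len ** 2`.

```python
import numpy as np

def chunked_attention(q, k, v, chunk: int = 128):
    """Single-head attention restricted to fixed-size chunks.
    q, k, v: arrays of shape (seq_len, dim). Each position attends only to
    keys in its own chunk, making the cost linear in seq_len."""
    seq_len, dim = q.shape
    out = np.empty_like(v)
    for start in range(0, seq_len, chunk):
        end = min(start + chunk, seq_len)
        scores = q[start:end] @ k[start:end].T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax per row
        out[start:end] = weights @ v[start:end]
    return out
```

With `chunk` set to the full sequence length this reduces to ordinary attention; shrinking `chunk` trades global context within a layer for memory and compute, which is why it is paired with techniques like gradient checkpointing and CPU offloading for very long documents.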
The effect? Arctic-TILT establishes state-of-the-art results on seven document understanding benchmarks. Moreover, it demonstrates exceptional few-shot adaptation; with just 5-10 examples, it can surpass the zero-shot performance of much larger models like GPT-4o on domain-specific tasks. The model also provides outstanding confidence calibration, which is crucial for enterprise deployments where reliability matters.
This research directly impacts our product capabilities. A variant of Arctic-TILT serves as a key component within Snowflake Document AI, helping to process richly formatted enterprise documents. In line with our commitment to the community, we have also released the model as open source, allowing developers everywhere to build their own state-of-the-art document-understanding applications.
3. Can we make natural language database queries reliable enough for production?
The challenge: Business users want to query databases using natural language, but current LLMs applied to the text-to-SQL task lack the precise reasoning required for real-world use. Traditional solutions rely on expensive, human-annotated data or on simple prompting techniques like chain-of-thought (CoT), which provide only marginal gains. This creates a difficult trade-off between the high cost of training data and the low reliability of existing models.
Our solution: ExCoT (execution-guided chain-of-thought) introduces a framework that iteratively optimizes a model's reasoning by relying solely on execution feedback. This approach combines CoT with direct preference optimization (DPO). For each question, the model generates multiple reasoning paths and SQL queries. A query is labeled a "winner" only if its executed result matches the ground-truth result; all others are "losers." DPO then trains the model to prefer the winning reasoning paths. This works specifically for text-to-SQL because the CoT steps expose the model's internal logic, allowing DPO to correct the logical flaws that lead to incorrect queries, rather than just penalizing the final output.
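The winner/loser labeling above can be sketched with SQLite. This is an illustrative simplification under our own naming (`build_preference_pairs`, the candidate dictionaries with `"cot"` and `"sql"` keys), not the ExCoT code: it executes each candidate query, compares result sets with the gold query order-insensitively, and pairs every winner with every loser for DPO-style training.

```python
import sqlite3
from itertools import product

def build_preference_pairs(conn, question, candidates, gold_sql):
    """Label each candidate {"cot": ..., "sql": ...} by executing its SQL and
    comparing the result set with the gold query's (order-insensitive).
    Matches are winners; mismatches and invalid SQL are losers. Every
    (winner, loser) combination becomes one preference pair."""
    def run(sql):
        try:
            return sorted(conn.execute(sql).fetchall())
        except sqlite3.Error:
            return None                       # invalid SQL counts as a loss
    gold = run(gold_sql)
    winners = [c for c in candidates if run(c["sql"]) == gold]
    losers = [c for c in candidates if run(c["sql"]) != gold]
    return [{"prompt": question, "chosen": w, "rejected": l}
            for w, l in product(winners, losers)]
```

Because the chosen/rejected texts include the full reasoning chain rather than just the final SQL, DPO training on these pairs pushes the model toward sound intermediate logic, not merely toward strings that happen to execute correctly.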

We developed a multi-stage process that begins with an initial fine-tuning using verified data from a strong teacher model. The system then enters an iterative on-policy loop, generating its candidate solutions, verifying their results against the database and creating new preference pairs from its outputs. The model continuously refines its ability to handle complex joins, aggregations and nested queries by repeatedly training on its successes and failures.
This innovation makes text-to-SQL systems far more accurate without requiring additional human annotations, reducing development costs and increasing the reliability of natural language database interfaces.
While ExCoT is a research exploration, it tackles the core challenge of reliability that is fundamental to the text-to-SQL task. Improving the precision of natural language querying is a key focus for Snowflake, as this capability is foundational for tools like Snowflake Cortex Analyst that empower business users to interact directly with their enterprise data.
4. Are we measuring LLMs’ progress correctly?
The challenge: The AI community uses standardized benchmarks to track progress, but benchmark design choices can dramatically affect results. One overlooked detail is how multiple-choice questions are presented to models. This isn't about making benchmarks easier — it's about ensuring they test what we think they're testing. When evaluation methods don't match task requirements, we get misleading signals about model capabilities.
Our solution: We systematically compared two evaluation methods across multiple benchmarks. In the "separation" approach, models score each answer choice in isolation without seeing alternatives. In the "options" approach, models see all choices together — mirroring how humans actually take tests. This seemingly minor difference has profound implications, especially for questions requiring comparative reasoning.
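The contrast between the two protocols can be made concrete with a small sketch. The `score` and `pick_letter` callables are hypothetical stand-ins for model calls (a per-completion log-likelihood and a letter-generating prompt, respectively); real evaluation harnesses differ in detail.

```python
def separation_eval(question, choices, score):
    """'Separation': score each choice in isolation, never showing the model
    the alternatives, and pick the highest-scoring one.
    `score(prompt, completion)` stands in for a log-likelihood call."""
    return max(range(len(choices)),
               key=lambda i: score(question, choices[i]))

def options_eval(question, choices, pick_letter):
    """'Options': present all choices together, as on a human test, and ask
    the model for the letter of its answer.
    `pick_letter(prompt)` stands in for a generation call returning e.g. 'B'."""
    letters = "ABCDEFGH"
    prompt = question + "\n" + "\n".join(
        f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return letters.index(pick_letter(prompt).strip())
```

The key difference is visible in the signatures: `separation_eval` can never perform comparative reasoning, because no single call ever contains more than one candidate answer.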

The results are striking. Simply switching from "separation" to "options" evaluation can improve scores by 30+ percentage points. On ARC Challenge, Llama 3.1 70B jumps from 64% to 93% accuracy. Similar dramatic improvements appear across other benchmarks: OpenBookQA sees 40+ point gains that push performance past human baselines, while the gap between LLMs and humans on Social IQA — previously cited as evidence that LLMs lack social understanding — largely disappears.
These findings suggest that some benchmarks we consider "challenging" may already be effectively solved when evaluated properly. More broadly, this work calls for a fundamental reassessment of how we evaluate AI systems and what our benchmarks actually measure.
A commitment to rigorous, objective measurement is a theme that runs through both our internal research and our open source contributions. Just as this paper questions the methodology behind standardized benchmarks, our open source library TruLens empowers developers to question the performance of their own applications. It provides the programmatic "unit tests" for LLMs, enabling the kind of deep, reliable evaluation necessary to move AI from prototype to production.
Resources: Paper
Research with purpose
The path forward isn't just about building bigger models; it's about building more practical ones. Our research at ACL 2025 explores different approaches to efficiency, from architectural choices to training methods to evaluation practices. These contributions reflect a practical philosophy: AI systems need to work within real-world constraints.
Join us at ACL 2025 to discuss this research with our team. Visit the Snowflake booth to explore collaboration opportunities or learn more about how these innovations power our AI products.