
Arctic-Extract: Compact, Efficient and State-of-the-Art Vision-Language Processing

Extracting structured information from business documents, such as invoices, contracts and scanned records, has long been a challenging problem in applied AI. Traditionally, this has required complex multistage pipelines: most document AI systems first parse text with OCR, then pass the result to a language model.

In a previous engineering blog post, the Snowflake AI Research team introduced the initial version of Arctic-Extract, a vision-language model (VLM) for high-fidelity document extraction.

Here, we present the next evolution of that work. This new version of Arctic-Extract has been refined into a compact 6.6 GiB VLM that performs document understanding in a single step, jointly reasoning over visual and textual information without relying on external OCR. We benchmarked Arctic-Extract against leading proprietary and open source models, and despite its small size, it achieved state-of-the-art accuracy at scale, processing up to 125 A4 pages on a single 24 GB A10 GPU.

This end-to-end design powers the AI_EXTRACT feature in Snowflake, enabling users to turn unstructured documents into structured data that can be queried directly through Cortex AI Functions. While earlier models like Arctic-TILT established the groundwork for state-of-the-art document understanding at Snowflake, Arctic-Extract now represents a major step forward in accuracy, efficiency and capability.

In this post, we explore the research and engineering behind Arctic-Extract, including its architecture, optimization techniques, training data and benchmark results that show how efficiency and top-tier performance can coexist in large-scale document AI. Read the full technical paper on arXiv.

Architecture: Built for scale and high-fidelity understanding

Arctic-Extract is built upon the Qwen 2.5-VL architecture and optimized for the structure and complexity of real business documents. 

A core innovation lies in its efficient token compression, which merges every four vision tokens into one. This technique allows a standard A4 page to be represented by approximately 1,000 tokens, fully leveraging the model's vast 128,000-token context window for robust multipage reasoning.
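A quick back-of-envelope sketch makes the token budget concrete. The figures below come from this post (4-to-1 compression, ~1,000 tokens per page, a 128,000-token context); the 4,000-token raw page size is an assumption implied by those numbers, not a published constant:

```python
# Back-of-envelope token budget for Arctic-Extract's 4-to-1 vision-token
# compression, using the figures quoted in this post.

CONTEXT_WINDOW = 128_000     # model context length, in tokens
TOKENS_PER_PAGE_RAW = 4_000  # assumed raw vision tokens per A4 page (4 x ~1,000)
COMPRESSION_RATIO = 4        # 4 visual tokens merged into 1

tokens_per_page = TOKENS_PER_PAGE_RAW // COMPRESSION_RATIO  # ~1,000 per page
max_pages_by_context = CONTEXT_WINDOW // tokens_per_page    # pages that fit

print(tokens_per_page)       # 1000
print(max_pages_by_context)  # 128
```

The context window alone would admit roughly 128 pages; the 125-page figure quoted for a single 24 GB A10 reflects the practical memory-bound limit.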

| Aspect | Details |
| --- | --- |
| Base architecture | Qwen 2.5-VL |
| Compression | 4 visual tokens → 1 token |
| Context length | 128,000 tokens |
| Number of parameters | 7 billion |
| Hardware | 8 × NVIDIA H200 (141 GB each) |
| Optimizer / Precision | AdamW / bf16 |
| Quantization | 4-bit AWQ |
| Final model size | 6.6 GiB |

To push efficiency even further, the model incorporates two pivotal optimization techniques:

  1. LoRA fine-tuning enables rapid adaptation to domain-specific document tasks without modifying the full model weights. It reduces compute and memory overhead while maintaining accuracy across a wide range of document types.

  2. 4-bit AWQ quantization reduces memory usage while preserving high fidelity.
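A rough weight-only estimate illustrates why 4-bit quantization matters. This is a naive calculation, not the actual packaging math: real AWQ checkpoints add per-group scales and zero points, and the reported 6.6 GiB artifact presumably also includes components kept at higher precision:

```python
# Rough weight-only memory estimate: bf16 vs. 4-bit quantized weights
# for a 7B-parameter model. Real AWQ checkpoints add quantization
# metadata and may keep some layers (e.g. the vision encoder) in bf16,
# which is why the shipped model is larger than the naive 4-bit figure.

PARAMS = 7_000_000_000
GIB = 1024 ** 3

bf16_gib = PARAMS * 2 / GIB    # 2 bytes per parameter
int4_gib = PARAMS * 0.5 / GIB  # 4 bits = 0.5 bytes per parameter

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB")   # ~13.0 GiB
print(f"4-bit weights: ~{int4_gib:.1f} GiB")   # ~3.3 GiB
```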

Training data: Focus on business extraction

Arctic-Extract was trained on a rich, diverse corpus of 372,544 data points spanning 35 data sets, specifically curated to cover question answering (QA), table extraction and multilingual understanding.

| Task category | Data sets | Samples |
| --- | --- | --- |
| Table extraction (TE) | 13 | 33,916 |
| Question answering (QA) | 13 | 112,901 |
| Multilingual understanding | 9 | 225,735 |
| Total | 35 | 372,544 |

The Table Extraction (TE) data sets are a key novelty of this work. We created new data sets that map unstructured business documents to normalized tabular formats. Their creation required iterative manual annotation, continuous feedback loops and synthetic augmentation to ensure coverage of the complex and rare table structures found in real-world business scenarios.

Together, these architectural and data decisions form the foundation of a highly efficient VLM. To assess how well this efficiency translates into real-world performance, we benchmarked Arctic-Extract across a comprehensive suite of document-understanding tasks.

Benchmark results across four essential document tasks

Arctic-Extract was rigorously benchmarked against leading models, including large proprietary systems like GPT-5 and Claude 4 Sonnet, as well as other open source models like Qwen 2.5-VL and Llama 3.1 (405B). All evaluations used ANLS* as the primary metric and covered four core dimensions of document understanding: visual reasoning, multilingual question answering, table extraction and English-text comprehension.

Across these benchmarks, Arctic-Extract delivers performance competitive with and often exceeding significantly larger multimodal systems.
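For readers unfamiliar with the metric: ANLS (average normalized Levenshtein similarity) scores a predicted answer by its string similarity to the reference, zeroing out anything below a 0.5 threshold; the ANLS* variant used here extends this to lists and structured outputs. Below is a minimal sketch of the base metric written for this post, not the evaluation code used in the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, gold: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold."""
    p, g = prediction.strip().lower(), gold.strip().lower()
    if not p and not g:
        return 1.0
    nls = 1.0 - levenshtein(p, g) / max(len(p), len(g))
    return nls if nls >= threshold else 0.0

print(anls("Snowflake Inc.", "Snowflake Inc"))  # ~0.93: edit distance 1 over length 14
print(anls("abc", "xyz"))                       # 0.0: below the 0.5 threshold
```

The threshold makes the metric forgiving of minor OCR-style noise while still rejecting answers that are substantively wrong.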

Visual understanding

Developing next-generation solutions for complex document understanding is challenging. It encompasses tasks like visual question answering (VQA) and information extraction from documents with highly complex and varied layouts, and it demands systems that can accurately interpret both the textual and visual information within a document to derive comprehensive, structured data.

Figure 1: Arctic-Extract achieved the highest visual understanding score among all models evaluated, outperforming GPT-5, Claude 4 Sonnet, Qwen 2.5 VL, Arctic-TILT and Pixtral 12B.

As shown in Figure 1, Arctic-Extract demonstrated superior capability across a diverse range of document-understanding tasks on our internal visual data sets, achieving the highest average score of any model evaluated. Its lead was most pronounced on the most complex extraction and comprehension challenges inherent in real-world document processing, where it outperformed competing models in both accuracy and operational efficiency.

This performance reflects the effectiveness of its end-to-end architecture and its robust ability to interpret both the structure and content of real-world enterprise documents.

Multilingual understanding

Multilingual QA presents a significant challenge for natural-language processing models, requiring them to accurately understand and process information across a wide array of languages with distinct grammatical structures and linguistic nuances. Effectively addressing this complexity is crucial for global applications.

Figure 2: Arctic-Extract achieved the highest multilingual accuracy among all evaluated models, outperforming larger systems such as LLaMA 3.1 405B, GPT-5, Claude 4 Sonnet and Qwen 2.5 VL.

As shown in Figure 2, Arctic-Extract demonstrates superior multilingual performance across a diverse range of languages, achieving the highest average score among all models tested.

The model exhibited particularly strong capabilities in major European languages, including French, German and Italian QA tasks. Arctic-Extract also showed robust competency in Asian languages, leading in Korean and proving highly competitive in Japanese. The model consistently secured the top rank across the majority of languages tested, including English, Spanish, Chinese, Greek and Romanian, highlighting its consistent global strength in QA.

Table extraction

Transforming unstructured document content into structured tables is a difficult task that requires understanding spatial relationships, contextual cues and the organizational patterns found in business documents such as invoices, contracts and reports.

Earlier models like Arctic-TILT established an initial baseline for this capability, and Arctic-Extract builds on that foundation. Arctic-Extract delivers TE performance that closely matches Arctic-TILT while surpassing significantly larger multimodal models.

Figure 3: Arctic-Extract achieves near-best-in-class performance on table extraction, closely matching Arctic-TILT and outperforming larger models, including Claude 4 Sonnet, GPT-5, Qwen 2.5 VL and Pixtral 12B.

As shown in Figure 3, Arctic-Extract performs near the top on table extraction benchmarks, accurately identifying and converting complex layouts, including multipage and nested tables, into clean and structured data. This capability simplifies downstream processing, supports advanced analytics workflows, and enables reliable extraction from a wide range of enterprise documents.

English-text understanding

The SQuAD2.0 data set serves as a widely used benchmark for measuring how effectively large language models can understand and answer questions. It builds upon the original SQuAD data set by including both answerable questions and intentionally unanswerable ones, challenging models not only to provide accurate responses when possible but also to recognize when no valid answer exists. This makes SQuAD2.0 a rigorous tool for assessing a model’s comprehension, reasoning ability and robustness in real-world QA scenarios.

Figure 4: Arctic-Extract achieves English text performance that is highly competitive with much larger models while operating with greater resource efficiency.

As Figure 4 shows, Arctic-Extract achieves ANLS* scores highly competitive with those of significantly larger models while maintaining superior resource efficiency, a notable result given its compact size.

Simplicity of use

The example below shows how simple Arctic-Extract is to use through the AI_EXTRACT function in Snowflake Cortex AI.

In a single command, a user passes a document file and a list of natural-language questions, and the model returns structured JSON output with the extracted key-value pairs. For more details, see the documentation.

Extract information from an input string

SELECT AI_EXTRACT(
    text => 'John Smith lives in San Francisco and works for Snowflake',
    responseFormat => {
        'name': 'What is the first name of the employee?',
        'city': 'Where does the employee live?'
    }
);

Extract information from a file

AI_EXTRACT(<file>, <responseFormat>)

SELECT AI_EXTRACT(
    file => TO_FILE('@db.schema.files', 'document.pdf'),
    responseFormat => [
        ['name', 'What is the first name of the employee?'],
        ['city', 'Where does the employee live?']
    ]
);

A new standard for efficient document AI

The Arctic-Extract technical paper demonstrates that strategic architectural design and targeted optimization can achieve state-of-the-art document understanding without the massive computational overhead of typical large multimodal models.

By seamlessly merging visual and textual reasoning into a highly compact and efficient design, Arctic-Extract sets a new standard for performance in multilingual, tabular and visual document tasks. This work lays crucial groundwork for the next generation of efficient, scalable document AI research and deployment at Snowflake.
