MAY 29, 2026 / 7 min read / Core Platform

Achieving Human-Level Document Extraction with AI_EXTRACT Fine-Tuning

AI_EXTRACT is powered by Arctic-Extract, Snowflake's proprietary model purpose-built to deliver best-in-class key information extraction from documents. With Arctic-Extract Fine-Tuning now generally available, you can train it to adapt to your specific layouts, field names, internal terminology and abbreviations. In this post, we'll show you how to fine-tune Arctic-Extract to achieve human-level accuracy, enabling you to automate your document workflows and significantly reduce the need for manual reviews.

Key takeaways

You can make AI_EXTRACT even more accurate by fine-tuning the underlying Arctic-Extract model and realize the following benefits:

Automate more work with near-human accuracy: Arctic-Extract fine-tuned models reach 93.8% Mean ANLS*, close to human performance, enabling reliable extraction across complex documents.
Reduce manual review with high-confidence outputs: 89% of extractions are perfect, making it practical to automate downstream workflows without constant human validation.
Lower cost per document with broader coverage: With 94.7% answer coverage, only a small fraction of cases require fallback handling or human intervention.
Maintain performance across diverse document types: Strong generalization helps maintain consistent results, even as formats and layouts vary.
Operate reliably at scale: Low variability in outputs means predictable performance in high-volume production environments.

Where fine-tuning delivers the most value

Arctic-Extract is a proprietary, vision-based large language model purpose-built for high-quality key information extraction across unstructured data, ranging from documents and images to raw text. It is optimized for document intelligence use cases, such as understanding complex layouts, interpreting checkboxes, parsing tables and reading handwriting like a person would.

For certain domain-specific workloads, fine-tuning can further improve extraction accuracy beyond even strong baseline LLM performance:

Workload Category	Example Applications	The Challenge	The Fine-Tuning Advantage
Regulated compliance	Insurance, Banking and Fintech onboarding pipelines.	Every data point must be "evidence-grade" and traceable for audits.	Aligns the model to your exact document packets and compliance field schemas.
High-volume operations	Back-office processing of thousands of invoices/manifests.	Small error rates compound at scale, creating massive manual bottlenecks.	Recognizes specific vendor formats and tables to maximize "first-time" accuracy.
Domain-specific niche	Legal (Patents), Healthcare (Charts) or Internal Finance.	Proprietary shorthand, abbreviations and nonstandard visual layouts.	Adapts the model's vocabulary and visual awareness to your industry's "language."
Human-in-the-loop	Enterprises scaling automation while minimizing manual review.	Inaccurate confidence scores lead to either too much manual work or missed errors.	Recalibrates confidence scoring so only true edge cases are routed to human reviewers.

Quality gains from fine-tuning

To quantify the value of fine-tuning for high-impact customer workloads, we evaluated a domain-adapted Arctic-Extract model against two baseline approaches on a held-out test set:

Fine-tuned Arctic-Extract: fine-tuned on domain-specific training data via Cortex Fine-Tuning
Arctic-Extract (zero-shot): the base model with no fine-tuning, as a reference
Claude Sonnet 4.6: a strong general-purpose frontier multimodal LLM, included as a broader point of comparison

Methodology

Under the hood, Cortex Fine-Tuning uses LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) technique. Rather than retraining the full model, LoRA inserts small trainable rank-decomposition matrices into the model's attention layers. Only these lightweight adapters are updated during training — the base model weights stay frozen. This keeps fine-tuning fast and cost-efficient while allowing the model to deeply internalize your document schema and layout patterns.
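To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-style update on a single weight matrix. It is an illustration of the general technique, not Cortex Fine-Tuning's implementation: the matrix sizes, the rank of 8 and the helper names (`matvec`, `lora_param_counts`, `lora_forward`) are all assumptions chosen for readability, and Arctic-Extract's actual adapter configuration is not described in this post.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA adapter on one matrix."""
    full = d_out * d_in              # every entry of W would be updated
    adapter = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, adapter


def lora_forward(x, w_frozen, a, b, scale=1.0):
    """Compute W x + scale * B (A x); W is only read, never modified."""
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


# Illustrative sizes only (assumed): a 4096 x 4096 projection with rank 8.
full, adapter = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")

# Tiny forward pass. B starts at zero (the usual LoRA initialization),
# so the adapted layer initially reproduces the frozen layer exactly.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen base weights (2 x 3)
a = [[0.1, -0.2, 0.3]]                   # trainable adapter, rank 1 (1 x 3)
b = [[0.0], [0.0]]                       # trainable adapter, zero-initialized (2 x 1)
x = [1.0, 1.0, 1.0]
print(lora_forward(x, w, a, b))          # [6.0, 15.0], identical to W x
```

With these assumed sizes, the adapter holds well under 1% of the parameters of the matrix it adapts, which is why training only the adapters is much cheaper than retraining the full model.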

Data sets

We evaluated fine-tuning across multiple internal data sets covering a variety of document types and tasks — form understanding, invoice processing, contract analysis, patent documents, pay stubs and signature detection. Each data set consists of scanned documents or PDFs with corresponding annotations for entity extraction, classification and question answering.

Evaluation used ANLS*, a variant of Average Normalized Levenshtein Similarity (ANLS), a metric widely used in document understanding benchmarks. Scores range from 0.0 (complete mismatch) to 1.0 (exact match).
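For intuition about how such a score behaves, the sketch below implements the classic string-level form of the metric: a normalized Levenshtein similarity that is cut to zero once the edit distance exceeds a threshold. This is a simplification and not the evaluation code behind the results in this post; ANLS* additionally handles structured outputs such as lists and dictionaries, and the lower-casing, whitespace stripping and 0.5 threshold used here are conventional choices that we assume for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(prediction: str, truth: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, set to 0.0 when the strings differ too much."""
    pred, gold = prediction.strip().lower(), truth.strip().lower()
    if not pred and not gold:
        return 1.0
    distance = levenshtein(pred, gold) / max(len(pred), len(gold))
    return 1.0 - distance if distance < threshold else 0.0


def anls(pairs) -> float:
    """Average similarity over (prediction, ground truth) pairs."""
    return sum(similarity(p, t) for p, t in pairs) / len(pairs)


# Hypothetical predictions for an invoice-number field.
pairs = [
    ("INV-1001", "INV-1001"),  # exact match         -> 1.0
    ("INV-1001", "INV-1007"),  # one wrong character -> 0.875
    ("", "INV-1001"),          # no answer           -> 0.0
]
print(anls(pairs))
```

The threshold is what makes the score strict: a prediction that shares only a few characters with the ground truth earns nothing, rather than partial credit.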

Data set	Zero-shot Arctic-Extract	Claude Sonnet 4.6 (for comparison)	Fine-tuned Arctic-Extract	Fine-tuning gain over zero-shot
GHEGA-based data set	0.442	0.488	0.837	+39.5pp
Financial data set	0.648	0.587	0.983	+33.5pp
Signatures data set	0.520	0.336	0.674	+15.4pp
Agreements data set	0.459	0.665	0.704	+24.5pp

Fine-tuning delivers significant gains over the zero-shot model, up to +39.5pp on the GHEGA-based data set, and reaches near-perfect extraction (0.983) on the Financial data set.
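The gain column is the difference between the fine-tuned and zero-shot scores, expressed in percentage points. As a quick check, the following sketch recomputes it from the values in the table above; the dictionary layout and the `gain_pp` helper are our own illustrative choices.

```python
# (zero-shot, fine-tuned) ANLS* scores copied from the table above.
scores = {
    "GHEGA-based": (0.442, 0.837),
    "Financial": (0.648, 0.983),
    "Signatures": (0.520, 0.674),
    "Agreements": (0.459, 0.704),
}


def gain_pp(zero_shot: float, fine_tuned: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round((fine_tuned - zero_shot) * 100, 1)


for name, (zero_shot, fine_tuned) in scores.items():
    print(f"{name}: +{gain_pp(zero_shot, fine_tuned)}pp")
```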

Metric	Zero-shot Arctic-Extract	Claude Sonnet 4.6 (for comparison)	Fine-tuned Arctic-Extract
Mean ANLS*	0.8361	0.8952	0.9379
% Perfect (= 1.0)	72.5%	85.8%	89.0%
% Zero (= 0.0)	14.0%	9.6%	5.3%
Macro Mean (doc-level)	0.881	0.927	0.965
Std deviation	0.347	0.297	0.228

In this evaluation, the fine-tuned Arctic-Extract model achieved the strongest results across the measured dimensions.

Key observations are:

1. 93.8% Mean ANLS* on domain-specific documents. The fine-tuned model achieves a Mean ANLS* of 0.9379, well above both reference points and near the human evaluation ceiling of 0.9811.

2. 89% of extractions are perfect. 89% of questions receive a perfect score (ANLS* = 1.0). This is the rate that drives automation confidence: When extraction accuracy is sufficiently high, organizations can reduce manual review requirements in certain downstream workflows.

3. Only 5.3% of questions receive a zero score. The fine-tuned model returns a useful answer (ANLS* above 0.0) for 94.7% of all questions, compared with 86.0% for zero-shot Arctic-Extract and 90.4% for Claude Sonnet 4.6. In production, this reduces the human-in-the-loop review queue and lowers operational cost per document.

4. Strong generalization across documents. The doc-level Macro Mean reaches 0.965, confirming that fine-tuning improvements hold consistently across diverse documents in the test set and not just high-volume document types.

5. Highly consistent predictions. A standard deviation of 0.228, down from 0.347 for the zero-shot model, reflects more consistent output across the full document set, which makes the fine-tuned model's behavior more predictable at scale.
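The summary metrics in the table above can all be derived from per-question scores grouped by document. The sketch below shows one plausible way to compute them on toy data. The scores are invented for illustration, and the use of a population standard deviation is an assumption, since the post does not say which variant was used.

```python
from statistics import mean, pstdev


def summarize(scores_by_doc):
    """Aggregate per-question scores (grouped by document) into summary metrics."""
    all_scores = [s for doc_scores in scores_by_doc.values() for s in doc_scores]
    doc_means = [mean(doc_scores) for doc_scores in scores_by_doc.values()]
    total = len(all_scores)
    return {
        "mean": mean(all_scores),        # every question weighted equally
        "pct_perfect": 100 * sum(s == 1.0 for s in all_scores) / total,
        "pct_zero": 100 * sum(s == 0.0 for s in all_scores) / total,
        "macro_mean": mean(doc_means),   # every document weighted equally
        "std": pstdev(all_scores),       # population standard deviation (assumed)
    }


# Hypothetical scores: one long document with two misses, one short perfect one.
toy_scores = {
    "doc_a": [1.0, 1.0, 1.0, 1.0, 0.5, 0.0],
    "doc_b": [1.0, 1.0],
}
summary = summarize(toy_scores)
print(summary)
```

In this toy example the macro mean (0.875) is higher than the question-level mean (0.8125) because the errors are concentrated in the document with the most questions. The doc-level figure therefore shows whether quality holds across documents rather than being dominated by a few long ones.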

Calibrated confidence scores for human-in-the-loop routing

Beyond raw accuracy, fine-tuning changes how the model assigns confidence to its predictions. In particular, it increases the separation between confidence scores for correct and incorrect answers, which is a key property for threshold-based routing.

Across 630 predictions on the same domain-specific data sets, we measured the average confidence on correct vs. incorrect predictions for both the zero-shot and fine-tuned models:

Model	Mean ANLS*	Accuracy (ANLS ≥ 0.95)	Confidence Score (Correct responses)	Confidence Score (Incorrect responses)	Gap
Zero-shot Arctic-Extract	0.517	38.6%	0.755	0.625	0.130
Fine-tuned Arctic-Extract	0.774	65.1%	0.895	0.646	0.249

While fine-tuning increases confidence for both correct and incorrect predictions, the separation between them widens substantially — from 0.130 to 0.249. This improved separation makes it easier to define effective confidence thresholds: Lower-confidence extractions can be routed to manual review, while higher-confidence extractions are processed automatically.
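Threshold-based routing of this kind can be sketched in a few lines. The example below is illustrative only: the `Prediction` record, the field names, the confidence values and the 0.8 threshold are all hypothetical, and in practice the threshold would be tuned on a labeled validation set for your own documents.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Prediction:
    field: str
    value: str
    confidence: float
    correct: bool  # known only on labeled evaluation data


def confidence_gap(predictions):
    """Mean confidence on correct answers minus mean confidence on incorrect ones."""
    correct = [p.confidence for p in predictions if p.correct]
    incorrect = [p.confidence for p in predictions if not p.correct]
    return mean(correct) - mean(incorrect)


def route(predictions, threshold):
    """Split predictions into auto-processed and manual-review queues."""
    auto = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return auto, review


# Hypothetical extractions from a single invoice.
predictions = [
    Prediction("invoice_number", "INV-1001", 0.95, True),
    Prediction("total", "1,250.00", 0.90, True),
    Prediction("due_date", "2026-07-01", 0.85, True),
    Prediction("vendor", "Acme Corp", 0.70, False),
    Prediction("po_number", "PO-77", 0.60, False),
]

auto, review = route(predictions, threshold=0.8)
print(f"gap: {confidence_gap(predictions):.3f}")
print("auto:", [p.field for p in auto])
print("review:", [p.field for p in review])
```

The wider the gap between the two confidence distributions, the easier it is to pick a threshold that lets correct extractions pass while still catching most errors.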

In Snowflake internal benchmarking, useful gains were observed starting with approximately 20 labeled documents, with the strongest gains on data sets exceeding 150 training data points.

Unlock quality gains with minimal effort

Figure 1: The three-phase fine-tuning pipeline. Labeled documents on a Snowflake Stage are assembled into a Snowflake data set, a single FINETUNE SQL call launches the training job, and the resulting model is stored in the Model Registry for inference.

Getting started with fine-tuning Arctic-Extract is straightforward. You bring your own labeled examples — document files paired with the extraction questions and correct answers — packaged as a Snowflake data set.

From there, a single SQL call launches the job and stores the resulting model directly in your Snowflake Model Registry. Fine-tuning Arctic-Extract happens entirely within Snowflake and your data stays within your account.

SELECT SNOWFLAKE.CORTEX.FINETUNE(
 'CREATE',
 'ARCTIC_EXTRACT_DEMO.PUBLIC.invoice_extractor_v1',           -- resulting model FQN
 'arctic-extract',                                            -- base model
 'snow://dataset/ARCTIC_EXTRACT_DEMO.PUBLIC.DS_INVOICES/versions/v1',  -- training dataset
 'snow://dataset/ARCTIC_EXTRACT_DEMO.PUBLIC.DS_INVOICES/versions/v1_eval' -- (optional) validation
);

After the job completes, you can confirm that the model was registered by listing the models in your account:

-- Lists all models in the Model Registry for the account:

SHOW MODELS IN ACCOUNT;

Figure 2: Achieving Human-Level Document Extraction with AI_EXTRACT Fine-Tuning

The fine-tuned model remembers the schema you trained it on, so there is no need to specify the schema during inference.

SELECT AI_EXTRACT(
  model => 'db.schema.my_tuned_model',
  file => TO_FILE('@db.schema.files','document.pdf'));

Conclusion

Fine-tuning Arctic-Extract takes an already strong extraction model and makes it yours. In our benchmarks, fine-tuned models reached 93.8% Mean ANLS*, 89% perfect extractions and 94.7% answer coverage, with useful gains starting at approximately 20 labeled documents.

Because the entire workflow runs securely inside Snowflake, your documents never leave your account. Once trained, your fine-tuned model resides in your Model Registry, ready to handle high-volume production inference with just a single SQL call.

To learn more and start fine-tuning, check out the Arctic-Extract fine-tuning documentation today.

Learn more about the authors

Mateusz Chilinski

Jerzy Toeplitz

Piotr Storozenko

Neeraj Jain
