
Cortex AI Parse Document: Text Extraction That Combines High Quality, Speed and Simplicity

Every enterprise workflow that touches documents — from regulatory compliance to knowledge management — depends on one fundamental capability: accurately extracting text from those documents. When optical character recognition (OCR) fails, however, everything downstream fails too: search returns irrelevant results, AI chatbots hallucinate answers and compliance checks miss vital information.

Our research team at Snowflake developed a solution to help address this challenge. We trained an OCR model that outperforms popular open source and commercial solutions on the most frequently occurring enterprise documents (standard document formats). And to make the solution scalable and easy to implement, we exposed it as a simple SQL function call. No external services to configure. No complex integrations or separate orchestration tools to maintain.

We also tested performance in a retrieval-augmented generation (RAG) solution using financial services documents and again saw better results than commercial solutions. With parse_document(), organizations can now transform unstructured documents into valuable, structured data assets within their existing Snowflake environments.

Figure 1. The results of our real-world documents OCR benchmark tests, which were performed on diverse public documents for different file formats (e.g., PDF, DOCX, PPTX, TIFF) with manually annotated ground truth. The tests measure how accurately an OCR system extracts text. See the results section for more details.

Engineering a better solution

Many customers told us that existing OCR approaches fell short. Open source solutions like Tesseract struggled with accuracy, especially for non-English content. Cloud providers offered better accuracy but required sending documents to external services, creating data governance challenges and adding complexity.

Our goal was to deliver the high quality expected from a top-tier provider while making it as easy as SQL, so that a data engineer can implement it without assistance from developers. This led us to a simple-to-use but powerful multilayered approach.

OCR in the modern digital landscape

First, we created a specialized digital-born extraction system. Most enterprise documents today are created digitally as PDF, DOCX or PPTX files. Rather than treating these like scanned images, our system intelligently extracts the machine-readable content directly from the file structure. This preserves special characters, punctuation and diacritics with high fidelity, while handling embedded images with sophisticated deduplication logic.

Digital-born extraction refers to the process of retrieving machine-readable content that was originally encoded in the file structure itself. This method optimizes processing by bypassing traditional OCR steps for content that doesn't require them.
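
To make this concrete, here is a minimal sketch of digital-born extraction using the open source PyMuPDF library. It is illustrative only, not our production extractor:

import fitz  # PyMuPDF: reads text directly from the PDF structure

def extract_digital_text(path: str) -> list[str]:
    """Return the machine-readable text of each page, with no OCR involved."""
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # get_text() walks the PDF content stream, preserving
            # punctuation and diacritics exactly as encoded in the file.
            pages.append(page.get_text())
    return pages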

The system also employs sophisticated handling for embedded images within documents:

  1. Intelligent image filtering: Images are filtered based on minimum width/height requirements and a maximum count per page to prevent processing decorative elements or irrelevant graphics.

  2. Coverage analysis: We calculate the total image coverage of each page and use this metric to make intelligent processing decisions. Pages that are primarily images rather than text are handled differently.

  3. Hybrid processing: For pages containing both text and images, we run our OCR model on each qualifying image. The extracted text from these images is then carefully merged with the digital-born text lines.

  4. Deduplication logic: Our system implements sophisticated algorithms to remove any duplicated text that might appear as overlap between text and image content, enabling clean results (see the sketch after this list).
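
To illustrate, the filtering and deduplication heuristics might look like the sketch below. The thresholds and helper names are ours for illustration, not the values used in production:

from difflib import SequenceMatcher

MIN_WIDTH, MIN_HEIGHT = 64, 64   # px; skip tiny decorative graphics (illustrative)
MAX_IMAGES_PER_PAGE = 10         # cap OCR work per page (illustrative)

def filter_images(images):
    """Keep only embedded images large enough to plausibly contain text.

    `images` are objects with width/height attributes (e.g., PIL Images).
    """
    kept = [im for im in images if im.width >= MIN_WIDTH and im.height >= MIN_HEIGHT]
    return kept[:MAX_IMAGES_PER_PAGE]

def dedup_lines(digital_lines, ocr_lines, threshold=0.9):
    """Merge OCR output into digital-born text, dropping near-duplicate lines."""
    unique = []
    for line in ocr_lines:
        duplicated = any(
            SequenceMatcher(None, line.lower(), d.lower()).ratio() >= threshold
            for d in digital_lines
        )
        if not duplicated:
            unique.append(line)
    return digital_lines + unique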

The system includes intelligent routing logic to determine the optimal processing path:

  • Pages primarily covered by images bypass digital extraction and are processed entirely by the OCR model.

  • If no text lines are detected during digital extraction, the whole page is automatically directed to OCR processing.

  • For mixed content pages, our hybrid approach ensures complete text capture while maintaining efficiency.

This approach allows us to get the best of both worlds: the precision and efficiency of direct text extraction for digital content, combined with the power of our custom OCR model for embedded images and scanned documents.
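
In pseudocode terms, the routing decision boils down to something like the following; the coverage threshold is illustrative:

def route_page(image_coverage, digital_lines):
    """Pick a processing path for one page.

    image_coverage: fraction of the page area covered by images (0..1).
    digital_lines: text lines recovered from the file structure.
    """
    if image_coverage >= 0.9 or not digital_lines:
        return "ocr_only"      # essentially a scan, or nothing machine-readable
    if image_coverage > 0.0:
        return "hybrid"        # digital text plus OCR on embedded images
    return "digital_only"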

Training a custom OCR model

For scanned documents and images, we developed a custom-trained OCR model. We chose to use a traditional OCR system rather than a visual language model due to significant advantages in processing speed, throughput and deterministic behavior that prevents hallucinations — critical factors for enterprise document processing at scale.

We first performed a character-coverage analysis, which resulted in a vocabulary of 245 unique characters that fully covers our languages of interest: English, Polish, German, Italian, French, Spanish, Portuguese, Swedish and Norwegian.

This character set includes all necessary diacritical marks and additional special characters. By design, we excluded mathematical symbols, chemical notation and non-Latin alphabets to maintain focus on our core use cases.
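
Conceptually, this kind of coverage analysis takes only a few lines of Python. The cutoff below is illustrative, not our exact criterion:

from collections import Counter

def build_vocabulary(corpus_lines, coverage=0.9999):
    """Pick the smallest character set covering `coverage` of all occurrences."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line)
    total = sum(counts.values())
    vocab, covered = [], 0
    for char, n in counts.most_common():
        vocab.append(char)
        covered += n
        if covered / total >= coverage:
            break
    return vocab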

The quality of training data is paramount for OCR performance. We implemented a sophisticated dual-corpus approach:

  1. Curated language samples: We leveraged the open source Oscar corpus, carefully sampling 100,000 sequences of up to 25 characters in length for each target language. We implemented intelligent upsampling of rare characters (sketched after this list) to help ensure the model could recognize less-common diacritical marks and special characters.

  2. Real-world document mining: We extracted text directly from PDFs identified in the Common Crawl data set. Using language detection models, we filtered for our target languages and sampled 100,000 examples from each, creating a diverse collection of 900,000 real-world text samples.
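
The rare-character upsampling mentioned above can be sketched as follows; the duplication factor is illustrative:

import random

def upsample_rare(sequences, rare_chars, factor=5, max_len=25):
    """Repeat sequences containing rare characters in the training mix."""
    out = []
    for seq in sequences:
        seq = seq[:max_len]
        copies = factor if any(c in rare_chars for c in seq) else 1
        out.extend([seq] * copies)
    random.shuffle(out)
    return out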

To help ensure our model could handle the diverse text presentations found in enterprise documents, we created three distinct training data sets, each approximately 900,000 images in size:

  1. Controlled rendering: Using the Oscar corpus samples, we employed the text_renderer library with open-source fonts and varied backgrounds. We added sophisticated augmentations that simulated potential errors from text detection, such as cropped images and fragments of overlapping text lines.

  2. Scene-like text generation: We utilized the synthtiger library with the same Oscar corpus to generate images mimicking text as it appears in real-world scenarios, with varied lighting, perspective and environmental factors.

  3. Authentic document extractions: We directly utilized text line images extracted from Common Crawl PDFs, carefully sampling for diversity in font style, size and color. These images were further enhanced with the augraphy library to simulate various document conditions.

Each approach offered distinct advantages. The synthetic methods (i.e., controlled rendering and scene-like generation) provided clean, error-free training data with flexible augmentation options. The real-world extractions, while more resource-intensive to produce and occasionally containing errors, offered authentic examples of document text in its natural context.

By combining these approaches, we created a training data set that balanced idealized text representations with the messiness of real-world documents. This comprehensive training strategy enabled our model to achieve exceptional accuracy across multiple languages and document types.
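
To give a flavor of controlled rendering, here is a bare-bones text-line renderer built on Pillow. It stands in for what libraries like text_renderer do and omits the varied backgrounds and augmentations described above:

from PIL import Image, ImageDraw, ImageFont

def render_line(text, font_path, size=32, pad=8):
    """Render one text line onto a plain white background."""
    font = ImageFont.truetype(font_path, size)
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("RGB", (right - left + 2 * pad, bottom - top + 2 * pad), "white")
    ImageDraw.Draw(img).text((pad - left, pad - top), text, font=font, fill="black")
    return img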

Evaluating the results

Engineering claims are meaningless without data to back them up. That's why we conducted extensive benchmarking against leading commercial and open source alternatives.

We created a comprehensive evaluation framework using three data sets: a controlled synthetic benchmark, a collection of real-world business documents across multiple languages and artificially degraded versions simulating low-quality scans. We measured performance using character n-gram F-scores (chrF) and character-weighted accuracy, metrics that capture the nuances of OCR quality better than simple word error rates.
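
chrF itself is easy to reproduce with the open source sacrebleu package. Note that sacrebleu reports chrF on a 0-100 scale, so we divide by 100 to match the scores quoted below:

import sacrebleu

def chrf_score(ocr_outputs, ground_truths):
    """Corpus-level chrF between OCR output and annotated ground truth."""
    result = sacrebleu.corpus_chrf(ocr_outputs, [ground_truths])
    return result.score / 100.0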

The results speak for themselves (see Figure 2). On our real-world document benchmark, Snowflake's OCR achieved a chrF score of 0.936, outperforming AWS Textract (0.902) and open source Tesseract (0.847). For languages with complex diacritics like Polish, the difference was even more dramatic: Our solution correctly handled characters that competitors consistently misrecognized.

Figure 2. Results of our real-world documents OCR benchmark test - chrF scores by language.

Most importantly, we evaluated end-to-end performance using LLM-based question answering on financial documents, such as annual reports and other statements. When we fed OCR results from different systems into an LLM and asked it to answer questions about the documents, Snowflake's OCR results produced more accurate answers: an average normalized Levenshtein similarity (ANLS) of 0.974 vs. 0.969 for AWS.

Figure 3. End-to-end financial documents RAG benchmark.
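
For reference, ANLS scores each predicted answer by its normalized Levenshtein distance to the ground truth, zeroing out weak matches. A minimal version of the standard formulation, using the open source Levenshtein package:

import Levenshtein

def anls(predictions, references, tau=0.5):
    """Average normalized Levenshtein similarity over answer pairs."""
    scores = []
    for pred, ref in zip(predictions, references):
        nl = Levenshtein.distance(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
        scores.append(1 - nl if nl < tau else 0.0)  # weak matches score 0
    return sum(scores) / len(scores)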

Beyond accuracy: Engineering for scale

Document processing at enterprise scale presents unique challenges. Our engineering team built a robust, scalable system that handles these demands with ease.

Our autoscaling architecture dynamically adjusts resources based on workload, facilitating consistent performance even during peak usage. We implemented page-by-page processing to reduce memory utilization for large documents, allowing the system to handle documents with 500+ pages efficiently. And we built support for multiple file formats (PDF, DOCX, PPTX) directly into the system, eliminating the need for format conversion.
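
The page-by-page idea is simple to illustrate with PyMuPDF: render and hand off one page at a time, so a 500-page PDF never has to sit fully rasterized in memory. This sketch is illustrative, not our production pipeline:

import fitz  # PyMuPDF

def iter_page_images(path, dpi=200):
    """Yield each page as PNG bytes, one at a time, to bound memory use."""
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            yield pix.tobytes("png")  # hand off to OCR, then release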

Seamless integration: OCR as a SQL function

Perhaps the most elegant aspect of our solution is how seamlessly it integrates into existing Snowflake workflows. Using our OCR capabilities is as simple as calling a SQL function:

-- Extract text from a PDF
SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@documents, 'financial_report.pdf', {'mode': 'OCR'});

-- Extract text with layout preservation
SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@documents, 'quarterly_results.pdf', {'mode': 'LAYOUT'});

No external services to configure. No complex integrations to maintain. No data leaving Snowflake’s security perimeter. Just powerful OCR capabilities available wherever you need them.

The future of document understanding

The general availability of parse_document() OCR mode represents a significant milestone in our document understanding journey, but it's just the beginning. Our engineering team continues to push the boundaries of what's possible, exploring advanced layout understanding, table extraction improvements and integration with multimodal AI models. The parse_document() LAYOUT mode is still in public preview, and we continue working to deliver accurate and reliable results on documents with complex layouts or a challenging reading order.

What sets Snowflake's approach apart is our commitment to solving the entire document understanding problem within the Snowflake environment. While competitors offer point solutions that address pieces of the puzzle, we're building a comprehensive platform that handles the complete workflow — from document storage to text extraction to semantic understanding — all within your secure AI data cloud.

Experience the difference

By combining digital-born extraction, custom-trained OCR models and intelligent processing pipelines, we've created a best-in-class text extraction solution that delivers industry-leading accuracy and performance across diverse document types.

Most importantly, this functionality is delivered as a simple SQL function within your existing Snowflake environment without depending on external services or complex integrations. Experience the power of Snowflake's OCR capabilities today and unlock the full potential of your document repositories.
