Accelerate End-to-End RAG Development in Snowflake with New SQL Functions for Document Preprocessing
As organizations increasingly seek to enhance decision-making and drive operational efficiencies by making the knowledge in documents accessible via conversational applications, retrieval-augmented generation (RAG) has quickly become the most efficient and scalable application framework. As RAG-based application development continues to grow, the solutions for processing and managing the documents that power these applications need to evolve with scalability and efficiency in mind. Until now, document preparation for RAG (e.g., extraction and chunking) relied on developing and deploying functions built on Python libraries, which can become hard to manage and scale.
To accelerate generative AI app development, we are now offering SQL functions to make PDFs and other documents AI-ready. Following the announcement of the general availability of Cortex Search, we are excited to announce two new document preprocessing functions:
- PARSE_DOCUMENT for layout-aware document text extraction (Public Preview)
- SPLIT_TEXT_RECURSIVE_CHARACTER for text chunking (Private Preview)
These functions streamline the preparation of documents, such as PDFs, making them AI-ready. AI-ready data is key to delivering value via a RAG application: once documents are prepared, they can be fed into a RAG engine, improving the overall quality of the AI application.
Imagine that you want to provide a sales team with a conversational app that uses a large language model (LLM) to answer questions about your company’s product portfolio. Since a pre-trained LLM alone will lack deep expertise in your company’s products, the answers generated are likely to be incorrect and of no value. To provide accurate answers, developers can use a RAG-based architecture, where the LLM retrieves relevant internal knowledge from documents, wikis or FAQs before generating a response. However, for these documents to enhance RAG quality, content must be extracted, split into smaller blocks of content (chunks) such as paragraphs or document sections, and embedded as vectors for semantic retrieval. Once the pre-processing is complete, the RAG engine can be initiated.
In other words, your RAG application is only as good as its search capabilities, your search is only as good as the data chunks it indexes, and high-quality text extraction is foundational to all of it.
Deliver the most relevant results
Cortex Search is a fully managed service that includes integrated embedding generation and vector management, making it a critical component of enterprise-grade RAG systems. As a hybrid search solution that combines exact keyword matching with semantic understanding, it enhances retrieval precision, capturing relevant information even when queries are phrased differently.
This hybrid approach enables RAG systems to deliver more accurate and contextually relevant responses, whether the query is narrowly focused on specific terms or explores more abstract concepts. For example, a hybrid query like “headphones SKU: ABC123” will prioritize results with an exact match on “ABC123” while also returning related results about headphones, electronics, music and more. This means each query can give semantically similar results as well as precise matches for specific terms like product SKUs or company IDs.
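To make this concrete, here is a minimal sketch of running such a hybrid query from SQL. The service name is hypothetical (creating a service is covered in Step 3 below), and SNOWFLAKE.CORTEX.SEARCH_PREVIEW is used here simply as a convenient way to try out queries from a worksheet:

```sql
-- Hypothetical service name; a fully qualified name may be required.
-- SEARCH_PREVIEW returns the matching rows as a JSON result.
SELECT SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    'product_docs_search',
    '{"query": "headphones SKU: ABC123", "columns": ["chunk", "file_name"], "limit": 5}'
  ) AS results;
```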
This capability is particularly valuable when documents are prepared using layout-aware text extraction and chunking, helping ensure that the content is optimally structured for retrieval. By simplifying document preprocessing through short SQL functions, data engineers can efficiently prepare PDFs and other documents for gen AI without the need to write complex, lengthy Python functions. This streamlined process significantly reduces the time and effort required to make documents AI-ready.
Document preprocessing is foundational for building successful RAG applications, with PARSE_DOCUMENT and SPLIT_TEXT_RECURSIVE_CHARACTER serving as important steps in this process. These new functions significantly reduce the complexity and time required for document preprocessing. This makes it faster and simpler to get documents ready for use in RAG chatbots, helping organizations quickly build and improve their AI-powered solutions all within Snowflake.
In this blog post, we will show how Snowflake’s integrated functionality simplifies building and deploying RAG-based applications.
Preparing documents for a RAG system
The responses of an LLM in a RAG app are only as good as the data available to it, which is why proper data preparation is fundamental to building a high-performing RAG system. The process starts with selecting the right data sources, such as internal documents, external databases or industry-specific content. The goal is to ensure that the data is relevant, up-to-date and reliable.
Step 1: Extract text and layout
Once the necessary data is gathered, the first step is cleaning and preprocessing. In the vast majority of enterprise use cases, optimizing search systems requires extracting text from PDF documents in combination with advanced document structure analysis. To make this easy, we built PARSE_DOCUMENT, a layout-aware extraction function that parses documents along paragraph boundaries, images and tables. Organizations can get started quickly by pointing the PARSE_DOCUMENT SQL function at PDF documents in a cloud storage service accessible via an external stage (e.g., Amazon S3), without copying the original files into Snowflake.
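As a sketch, a stage over an existing S3 bucket might be declared as follows. The bucket, integration and stage names are all hypothetical, and the example assumes a storage integration for the bucket already exists; note that the PARSE_DOCUMENT documentation calls for server-side encryption on external stages:

```sql
-- Hypothetical names throughout. PARSE_DOCUMENT reads files directly
-- from the stage, so the original PDFs are never copied into Snowflake.
CREATE OR REPLACE STAGE docs_stage
  URL = 's3://my-bucket/product-docs/'
  STORAGE_INTEGRATION = my_s3_integration   -- assumed to exist
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'AWS_SSE_S3');       -- server-side encryption, so Cortex can read the files
```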
Since not all documents are the same (some PDFs contain large volumes of text split into paragraphs, while others feature complex layouts with tables), tailored parsing strategies help ensure that the extracted text is useful for the RAG engine. The PARSE_DOCUMENT SQL function supports two modes: OCR mode, which ignores layout, and LAYOUT mode, which preserves document structure, including tables. OCR mode is typically best suited for flat documents or text files that lack a defined layout structure, such as tables, images and sections. For documents with rich layout structure, such as technical manuals, business reports and legal documents, LAYOUT mode is recommended to improve retrieval accuracy. The function automatically scales processing across machines to optimize throughput, allowing developers to seamlessly scale their apps to process millions of documents.
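Here is a minimal sketch of both modes, using the hypothetical stage above. PARSE_DOCUMENT returns an object whose content field holds the extracted text (markdown when LAYOUT mode is used); the file paths are illustrative:

```sql
-- Layout-aware extraction: tables and sections are preserved as markdown.
SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @docs_stage,
    'manuals/user_guide.pdf',   -- hypothetical file on the stage
    {'mode': 'LAYOUT'}
  ):content::string AS extracted_text;

-- OCR extraction: plain text only, best suited for flat documents.
SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @docs_stage,
    'scans/invoice_001.pdf',
    {'mode': 'OCR'}
  ):content::string AS extracted_text;
```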
At launch, the function supports text extraction for languages and characters used in English, Spanish, French, Portuguese, German, Swedish and Norwegian. For the latest details and document limitations, refer to the Cortex Parse Document documentation.
Step 2: Chunking or splitting
After text and layout are extracted from documents, the text must be split into smaller, manageable chunks, a step that is crucial for indexing and retrieval. The two most common techniques are rule-based splitting (e.g., by sentence, paragraph or section) and semantic splitting (e.g., based on topic or context), both of which seek to keep semantically related content together.
To streamline rule-based splitting, developers can now use the new, user-friendly SPLIT_TEXT_RECURSIVE_CHARACTER SQL function. Proper chunking is key to maintaining context and relevance during information retrieval. The size of each chunk directly impacts how well the system retrieves data: chunks that are too small may lack context, while those that are too large may dilute relevance, so striking the right balance is essential. The SQL function lets developers quickly test and evaluate multiple chunking strategies via its easy-to-adapt chunk-size and overlap parameters. To access this function during private preview, reach out to your sales team. In the meantime, developers can continue to run custom Python libraries like LangChain as Snowpark Python UDFs, as demonstrated in the quickstart.
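Here is a minimal sketch of chunking extracted text, assuming a hypothetical parsed_docs table that holds PARSE_DOCUMENT output. The function returns an array of chunks, flattened here into one row per chunk:

```sql
-- Chunk size and overlap are illustrative starting points to tune.
SELECT
    d.file_name,
    c.value::string AS chunk
FROM parsed_docs d,
     LATERAL FLATTEN(
       INPUT => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
         d.extracted_text,
         'markdown',   -- honors the markdown structure from LAYOUT mode
         512,          -- target chunk size, in characters
         64            -- overlap between consecutive chunks
       )
     ) c;
```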
In some cases, semantic chunking (grouping text into meaningful, semantically complete chunks) can be more effective than character-based splitting. For example, when extracting text from a document with tables, semantic chunking helps ensure that the content of an entire table remains within a single chunk. Extracting structural elements with PARSE_DOCUMENT's LAYOUT mode, which outputs markdown, can yield higher consistency between chunks and better retrieval accuracy, and this strategy enables targeted summaries per section of a document. You can use your preferred embedding model, such as Arctic Embed (the embedding model used in Cortex Search, downloadable from Hugging Face). An example of running semantic splitting as part of your data preparation pipeline is available in the "Build a RAG-based LLM assistant" quickstart.
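As a sketch of the underlying idea, adjacent chunks can be scored with the Arctic Embed model available through Cortex, and a pipeline can then merge neighbors whose similarity exceeds a threshold. The table and column names here are hypothetical:

```sql
-- Embed neighboring chunks and measure how semantically close they are;
-- a downstream step could merge pairs above a chosen similarity threshold.
SELECT
    chunk_id,
    VECTOR_COSINE_SIMILARITY(
      SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text),
      SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', next_chunk_text)
    ) AS neighbor_similarity
FROM candidate_chunks;
```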
Step 3: Vectorize (embed) and index
Once the text has been split or chunked, Cortex Search handles the rest. Each document chunk is converted into a vector representation (embedding) using the Snowflake Arctic Embed model. These embeddings capture the semantic meaning of the text and enable effective similarity matching. The embeddings are then indexed for quick retrieval based on query similarity. The service automatically indexes and embeds your data in an incremental fashion, processing only changed rows from the underlying data source.
All of the operational complexity of building a search service is abstracted into a single SQL statement for service creation. This eliminates the burden of creating and managing multiple processes for ingestion, embedding and serving — freeing up time to focus on developing cutting-edge AI applications.
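For example, a service over the chunks produced above might be created as follows. The service, warehouse, table and column names are hypothetical, and the target lag controls how quickly changed rows are re-embedded and re-indexed:

```sql
-- One statement stands up ingestion, embedding, indexing and serving.
CREATE OR REPLACE CORTEX SEARCH SERVICE product_docs_search
  ON chunk                   -- the text column to embed and index
  ATTRIBUTES file_name       -- extra columns available for filtering
  WAREHOUSE = rag_wh         -- hypothetical warehouse for index refreshes
  TARGET_LAG = '1 hour'      -- how fresh the incremental index is kept
  AS (
    SELECT chunk, file_name
    FROM doc_chunks          -- hypothetical table of chunked text
  );
```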
Start building smarter RAG chatbots for enterprises today
Mastering data preparation is a prerequisite for optimizing a RAG system. That's why we are excited to introduce PARSE_DOCUMENT, which simplifies document text and layout extraction, and the upcoming SPLIT_TEXT_RECURSIVE_CHARACTER function for efficient text chunking. Together with Cortex Search, already trusted by organizations for enterprise-grade retrieval, these functions let customers streamline their end-to-end RAG workflows entirely within Snowflake, helping enterprises unlock the full potential of their data.
Start building smarter, high-performing RAG applications, all within Snowflake. To see how easy it is to build a complete RAG chatbot with Cortex Search, check out our RAG quickstart.