
Long-Context Isn't All You Need: Impact of Retrieval and Chunking on Finance RAG

As very long-context large language models (LLMs) have arrived on the scene, some have claimed that retrieval is no longer needed for a high-quality AI answer generation pipeline — rather, simply fit all documents into a large context window and let the LLM pick out the relevant pieces. Through a case study on financial document analysis, we disprove this claim and show that retrieval and chunking strategies remain major determinants of AI answer generation quality, even more important than the quality of the generating model itself.

Retrieval-augmented generation (RAG) systems face unique challenges when processing long, complex documents — particularly financial filings such as SEC 10-K and 10-Q forms. For example, consider these queries:

  • “What are the key factors contributing to the $2.1 trillion increase in assets under management (AUM) at BNY as of the third quarter of 2024?”

  • “Which filings mention regulatory changes that impact capital requirements?”

In these scenarios, accurately pinpointing relevant information is critical. Overlooking key details can lead to costly missteps. Financial teams need a RAG system that doesn't just retrieve information but retrieves relevant information with precision from thousands of PDFs, each spanning hundreds of pages.

A typical RAG pipeline, in a nutshell, involves the following steps:

  1. Parsing documents: Extracting text via file parsers or OCR models for high-fidelity retrieval across varied document formats.

  2. Chunking: Segmenting text into meaningful units that balance retrieval effectiveness with contextual integrity.

  3. Retrieval: Searching and returning the most relevant chunks using advanced techniques such as vector search.

  4. Generation: Synthesizing retrieved chunks into a coherent, factually consistent response.
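
To make these steps concrete, here is a minimal sketch of the flow in Python. The parse, chunk, embed and generate callables are placeholders for whichever parser, splitter, embedding model and LLM a given deployment uses; this illustrates the shape of the pipeline, not the implementation behind our experiments.

import numpy as np

def rag_answer(query, documents, parse, chunk, embed, generate, top_k=10):
    """Minimal RAG flow: parse -> chunk -> retrieve -> generate."""
    # 1. Parsing: extract plain text (or markdown) from each raw document.
    texts = [parse(doc) for doc in documents]

    # 2. Chunking: split each parsed document into retrieval units.
    chunks = [c for text in texts for c in chunk(text)]

    # 3. Retrieval: rank chunks by cosine similarity to the query embedding.
    chunk_vecs = np.array([embed(c) for c in chunks])
    query_vec = np.array(embed(query))
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top_chunks = [chunks[i] for i in np.argsort(-scores)[:top_k]]

    # 4. Generation: ask the LLM to answer using only the retrieved chunks.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(top_chunks) + f"\n\nQuestion: {query}"
    )
    return generate(prompt)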

Figure 1. A standard RAG pipeline over a document corpus.

Each of these components can be tuned individually with the overall goal of improving the quality of the output text generated by a chatbot. Our experiments on a hand-curated data set of SEC filings focus on the chunking, retrieval and generation components of the pipeline. Our results are powerful and somewhat counterintuitive — even with the advent of long-context models, chunking and retrieval strategies are far more impactful on output quality than the raw computational power of the generative language model. 

Specifically, based on our high-quality, hand-curated data set of SEC filings, we discovered that:

  • Global document context is critical: Appending LLM‑generated global document context (for example, company name or filing date) boosts response accuracy, as seen in Figure 2. This simple strategy outperforms LLM‑generated chunk-specific context.

  • Optimal chunking matters: Using moderate chunk sizes (~1,800 characters) and retrieving more chunks improves accuracy, while using overly large chunks (for example, 14,400 characters) can dilute relevance and drop performance by ~10%-20%.

  • Optimized retrieval narrows the gap: A robust retrieval pipeline with moderate 1,800‑character chunks and top‑50 retrieval significantly elevates performance — narrowing the quality gap between generation models (for example, bringing Llama 3.3 70B quality close to that of Claude 3.5 Sonnet).

  • Markdown-aware chunking has benefits: Without document context, markdown‑aware chunking (using section headers) can boost accuracy by 5%-10% over fixed splits; however, its advantage lessens when global context is added.

Snowflake Cortex Search — and the broader Snowflake Cortex AI ecosystem — is engineered to tackle these challenges head-on, offering a flexible, production-ready solution that extracts precisely what you need from even the most extensive financial filings. 

Figure 2. Different strategies improve RAG accuracy. Using more chunks as contexts, adding contexts to chunks and employing structure-aware chunking all push performance further. All settings (except “No RAG”) use Arctic-Embed 2.0 M for retrieval.

For all these plots, the y-axis is LLM-judged accuracy.

Chunking strategies explored

Our SEC filings experiments focused on a select set of chunking techniques that significantly boosted retrieval accuracy — without overcomplicating the pipeline. We concentrated on three "vanilla" approaches and enhanced them with LLM-based metadata injection.

“Vanilla” chunking strategies

  • Recursive chunking: This strategy takes a hierarchical approach based on predefined delimiters. It begins by splitting text at larger, semantically meaningful boundaries (for example, double newlines for paragraphs) and then recurses into finer splits (for example, single newlines or spaces) as needed. This method is available as a user-defined function (UDF) in Snowflake and in LangChain (see the sketch after this list). In our experiments, we convert parsed documents from markdown to plaintext before applying recursive split-text chunking.

  • Semantic chunking: This strategy finds natural breakpoints in the provided text by embedding each sentence and calculating the cosine distance between embeddings of consecutive sentences. If the distance between two sentence embeddings exceeds a specified threshold, that point is considered a good place to split. The breakpoint type (for example, percentile or standard deviation) and threshold can be specified in the input. The LangChain implementation can be found here. For these experiments, we use the plaintext version of the parsed documents with the percentile breakpoint type.

  • Markdown-header-based chunking: This approach splits text based on markdown headers, preserving structural cues that help reconstruct document context (here is a LangChain example). For example, headers such as “Risk Factors” or “MD&A” serve as valuable context during retrieval, and the markdown headers themselves can be prepended to each chunk as context. We plan to offer this chunking strategy soon as a UDF in Snowflake.
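
For reference, the sketch below shows one way to instantiate these three “vanilla” strategies with LangChain’s text splitters (packages: langchain-text-splitters, langchain-experimental, langchain-openai). The OpenAIEmbeddings class is only a stand-in for whichever embedding model backs the semantic chunker — our retrieval experiments use Arctic-Embed 2.0 M — and the chunk sizes and header levels are illustrative.

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # stand-in embedding model for this sketch

plaintext = "..."      # parsed filing converted from markdown to plain text
markdown_text = "..."  # parsed filing kept as markdown

# Recursive chunking: split on paragraphs, then lines, then spaces.
recursive_chunks = RecursiveCharacterTextSplitter(
    chunk_size=1800, chunk_overlap=300
).split_text(plaintext)

# Semantic chunking: split where consecutive sentence embeddings diverge
# beyond the percentile breakpoint threshold.
semantic_chunks = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
).split_text(plaintext)

# Markdown-header-based chunking: split on section headers and keep the
# headers as per-chunk metadata.
markdown_docs = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
).split_text(markdown_text)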

LLM-enhanced context-based chunking

  • LLM-based document-level metadata: Here, a language model extracts or generates a concise summary of the entire document — covering key details such as company name, filing date, form type, financial metrics and section highlights. This global metadata is then prepended to every chunk, providing a consistent context that improves retrieval precision. See the appendix for more details and a simple SQL to add this context to all chunks.

  • LLM-based chunk-level metadata: This strategy improves retrieval accuracy by constructing chunk-specific contextual explanations prior to indexing (as exemplified here) and prepending them to the chunk. This approach ensures that each chunk not only carries its immediate content but also benefits from additional context derived from the broader document. See appendix for more details.

Comparison of chunking strategies

This table summarizes the pros and cons of each of the chunking strategies described above.

Recursive chunking
  Pros: Preserves natural language boundaries; adapts flexibly to document structure.
  Cons: May produce very small or uneven chunks; each chunk carries no global context.

Semantic chunking
  Pros: Finds natural split points based on the semantic similarity of consecutive sentence embeddings.
  Cons: May produce uneven chunks; embedding the whole corpus can be slow and costly.

Markdown-header-based chunking
  Pros: Aligns with the author’s logical organization; maintains coherent context within sections.
  Cons: Can miss content spanning multiple headers; header usage is inconsistent across filings.

LLM-based document-level metadata
  Pros: Embeds global context with every chunk; improves retrieval precision; reduces redundancy.
  Cons: Requires extra computation; more complex to implement.

LLM-based chunk-level metadata
  Pros: Provides tailored, context-rich summaries for each chunk; enhances disambiguation in complex filings.
  Cons: Significant computational overhead; risk of LLM “hallucinations” and increased storage needs.

Results

Figure 3. Different strategies improve RAG accuracy. Using more chunks as contexts, adding contexts to chunks and employing structure-aware chunking all push performance further. All settings (except “No RAG”) use Arctic-Embed 2.0 M for retrieval.

Document context outperforms chunk-level context

One of the most notable findings is that appending document-level context to each chunk provides consistent gains across both generation models (Claude 3.5 Sonnet and Llama 70B). For instance, with 1,800‑character chunks, injecting global metadata (such as company name, filing date and form type) boosts QA accuracy from around 50%-60% to the 72%-75% range.

By contrast, heavily augmenting each chunk with a unique LLM-generated summary (i.e., chunk-level context), though recommended by some sources such as Anthropic's "contextual retrieval" approach, is not quite as effective. This method resulted in 5.8- and 0.4-point losses on Llama 3 and Claude Sonnet, respectively, despite requiring much more computation. Simply including global document context — once per chunk — proved more robust and efficient.

Choice of retrieval and chunking can matter more than the generative model

A striking outcome of our experiments is that strong retrieval and chunking can sometimes outweigh the strength of the generative model itself. While Claude 3.5 Sonnet does outperform Llama 70B in nearly every setting, the gap narrows significantly when we introduce higher‑quality retrieval pipelines (for example, top‑50 ranked chunks) and well‑structured chunking strategies (for example, 1,800‑character splits with appended doc metadata).

  • No RAG baseline: Both models perform poorly on the query sets without RAG, scoring merely 5%-10% accuracy.

  • High‑quality retrieval and chunking: Llama 70B’s accuracy can climb from 40%-50% to the 70%+ range — nearly matching Claude 3.5 Sonnet in some configurations.

This highlights a crucial insight for finance RAG:

  • Even a strong generative model can yield subpar results if the retrieval stage is weak (for example, with large or unfocused chunks or with few relevant chunks retrieved).

  • Conversely, a carefully tuned retrieval pipeline (with moderate chunk size, a robust chunking strategy and appended doc metadata) can substantially boost a weaker model’s performance, in some cases closing most of the gap to a more powerful model.

In other words, the interplay of retrieval and chunking can be as important as — if not more important than — the raw capacity of the LLM, particularly when dealing with large, specialized documents such as SEC filings.

Figure 4. Effect of “more contexts” — through more chunks or larger chunks — without and with document contexts. All settings use Arctic-Embed 2.0 M for retrieval.

Adding large chunks is worse than adding more granular chunks

Many have argued recently that in the era of powerful long-context models, we can simply throw entire documents or pages at the generative model and let it take care of extracting relevant information. However, our experiments show that bigger chunks (for example, ~14,400 characters) can degrade retrieval performance by bundling too much text, making it harder for vector searches to pinpoint the most relevant sections. Even Claude 3.5 Sonnet, with a 200,000-token window, sometimes suffers from “context confusion” if large chunks contain too many irrelevant details. On closer examination of the errors, we found that the extra noise added by larger chunks confuses the generation model:

  • In one case, the question was about “driver compensation,” but the model got confused and included details of “driver commission.”

  • In another case, it picked up information from the wrong year (2023 vs. 2024).

By contrast, a moderate chunk size (1,800 characters) is generally more effective, especially when retrieving multiple chunks (for example, top‑50). However, “more is not always better”: Returning too many chunks or using overly large chunk sizes can saturate the context window and muddy the model’s focus. In practice, balancing chunk size and retrieval depth is key to maximizing QA accuracy.
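
As a rough illustration of tuning retrieval depth, the snippet below queries a Cortex Search service for the top 50 chunks through the Snowflake Python API (snowflake.core). The database, schema, service and column names are hypothetical, and the connection parameters and exact client calls should be checked against your environment and the current API documentation.

from snowflake.core import Root
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
}
session = Session.builder.configs(connection_parameters).create()
root = Root(session)

# Hypothetical service indexed over 1,800-character contextualized chunks.
svc = (
    root.databases["MY_DB"]
    .schemas["MY_SCHEMA"]
    .cortex_search_services["FILINGS_SEARCH_SERVICE"]
)

resp = svc.search(
    query="Key factors behind the $2.1 trillion AUM at BNY in Q3 2024",
    columns=["chunk"],
    limit=50,  # retrieval depth; deeper retrieval helped up to a point in our tests
)
top_chunks = [r["chunk"] for r in resp.results]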

Figure 5. Three "vanilla" chunking approaches and how LLM-based metadata injection enhances them. All settings use Arctic-Embed 2.0 M for retrieval and use top-10 retrieved chunks for RAG.

Of all the “vanilla” chunking strategies, markdown-aware chunking performs the best

When document-level metadata is not appended, markdown-aware chunking (i.e., splitting on section headers or structured boundaries) tends to outperform naive fixed‑size splits as well as semantic chunking by 5-10 percentage points. This is because it preserves natural thematic breaks — important for financial documents that have standard sections such as “Management Discussion and Analysis” or “Risk Factors.”

However, once document-level contexts are appended, the differences between markdown and plaintext chunking strategies shrink. Having a global “this is from XYZ Corp 10-Q” label on every chunk helps the model keep track of the bigger picture, reducing the need for chunk‑level structural awareness. Hence, teams looking to avoid the overhead of generating doc contexts might consider markdown‑aware chunking. Those who can afford to generate or store doc metadata often find the chunking approach becomes less critical once global context is injected.

 


Appendix

Details of metrics and data overview

Metrics

To rigorously assess RAG performance, we compare the final generated outputs against a gold standard using two key metrics:

  • Average Normalized Levenshtein Similarity (ANLS): A soft-matching metric that quantifies similarity between the generated answer and the gold standard using Levenshtein distance.

  • LLM-based quality scores: An approach where a second LLM (or the same LLM in a separate pass) judges the correctness and completeness of the generated response using a rubric-like prompt. If the generated response is deemed to cover the golden answer for the query, it is counted as accurate; if it states “no answer” or fails to cover the golden answer, it is counted as inaccurate.
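
For concreteness, a minimal ANLS implementation is sketched below, using the common thresholded formulation in which similarities below 0.5 are zeroed out; the exact normalization and threshold in our evaluation harness may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def anls(predictions: list[str], references: list[str], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of QA pairs."""
    scores = []
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip().lower(), ref.strip().lower()
        nl = levenshtein(pred, ref) / (max(len(pred), len(ref)) or 1)
        scores.append(1 - nl if nl < tau else 0.0)  # discard low-similarity matches
    return sum(scores) / max(len(scores), 1)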

In a future post, we will correlate metrics at document level, chunk level and final generation level. Next, we detail our data set's structure and annotation scheme, designed to support rigorous evaluation.

A closer look at our data

Document corpus

We compiled five years of SEC 10-K and 10-Q filings for the Fortune 1000 companies, amounting to approximately 23,000 PDFs. These files were parsed with Snowflake Cortex PARSE_DOCUMENT in ‘LAYOUT’ mode to yield markdown text. Subsequent chunking (using SPLIT_TEXT_RECURSIVE_CHARACTER with an 1,800-character chunk size and 300-character overlap) produced roughly 3.2 million chunks.
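
As a rough sketch of the parsing step, the Snowpark snippet below runs PARSE_DOCUMENT in LAYOUT mode over a stage of PDFs. It assumes an existing Snowpark session; the stage and table names are placeholders, and the function options should be checked against the current Cortex documentation.

# Assumes an existing Snowpark `session`; @FILINGS_STAGE and RAW_FILINGS are placeholders.
session.sql("""
    CREATE OR REPLACE TABLE RAW_FILINGS AS
    SELECT
        RELATIVE_PATH AS DOC_ID,
        TO_VARCHAR(
            SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
                @FILINGS_STAGE, RELATIVE_PATH, {'mode': 'LAYOUT'}
            ):content
        ) AS TEXT
    FROM DIRECTORY(@FILINGS_STAGE)
""").collect()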

Annotations 

We worked with an annotation agency to collect very high-quality annotations for around 500 queries. This is by far the best finance RAG data set in terms of both the extensiveness of its annotations and the realistic, difficult nature of its queries (the closest alternative is FinanceBench, but the information-seeking queries in that data set are highly artificial, often specifying which table or section the query should be answered from).

Designed to capture every nuance of financial filings, each annotated record includes:

  • Category: The type of query (e.g., contextual, fact-based)

  • Query: The actual question (e.g., “What are the key factors contributing to the $2.1 trillion increase in AUM at BNY?”)

  • Answer: A synthesized answer based on the aggregated evidence

  • Document identifiers (Doc Ids): Unique IDs linking to specific SEC filings

  • Span texts: The specific segments of text (or “spans”) where relevant information is located

  • Evidence metadata: For each evidence snippet, we record the page number, a detailed excerpt of the evidence text and a qualitative rating indicating the confidence or relevance

  • Types: The type of evidence (e.g., text or table)

For instance, one annotated example in our data set might include the following elements:

Category: Contextual  

Query: What are the key factors contributing to the $2.1 trillion increase in assets under management (AUM) at BNY as of the third quarter of 2024?  

Answer: The AUM of $2.1 trillion reflects an 18% YoY growth driven by higher market values and the favorable impact of a weaker U.S. dollar.  

Doc Id 1: Bank of New York Mellon Corp_10-Q_2024-11-01  

Span 1: "higher market values and the favorable impact of a weaker U.S. dollar"  

Evidence Page Number 1: 5  

Evidence Text 1: "AUM of $2.1trn up 18% YoY, primarily reflecting higher market values and the favorable impact of a weaker U.S. dollar"  

Rating 1: 2  

Type: Text  

... [further evidence fields follow] ...

This comprehensive annotation scheme enables evaluation at multiple levels:

  • Generation level: Assessing the overall accuracy and coherence of the generated answer.

  • Document level: Ensuring the right document (by Doc Id) is retrieved.

  • Chunk level: Verifying that the specific relevant snippets of text are accurately captured, despite challenges such as split spans or redundant information.
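
As an illustration, simple document- and chunk-level checks against these annotations might look like the following; the field names mirror the schema above, and the substring matching is deliberately simplified.

def document_level_hit(retrieved_doc_ids: list[str], gold_doc_ids: list[str]) -> bool:
    """Document level: was at least one annotated filing (by Doc Id) retrieved?"""
    gold = set(gold_doc_ids)
    return any(doc_id in gold for doc_id in retrieved_doc_ids)

def chunk_level_recall(retrieved_chunks: list[str], gold_spans: list[str]) -> float:
    """Chunk level: fraction of annotated span texts covered by some retrieved chunk.

    Uses whitespace-normalized substring matching; real evaluation needs fuzzier
    matching because spans can be split across chunks or altered by parsing.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    chunks = [norm(c) for c in retrieved_chunks]
    covered = sum(1 for span in gold_spans if any(norm(span) in c for c in chunks))
    return covered / max(len(gold_spans), 1)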

Details of adding LLM-based context 

Adding document-level context to each chunk

  1. Metadata generation: An LLM is used to extract or generate a concise summary of the document. For SEC filings, the generated metadata might include:¹

    • Company name: "Bank of New York Mellon Corp"

    • Filing date: "2024-11-01"

    • Form type: "10-Q" or "10-K"

    • Key financial metrics: Brief notes such as "18% YoY AUM growth driven by higher market values and favorable currency trends"

    • Section highlights: Identifiers for sections such as "Risk Factors" or "Management Discussion and Analysis"

  2. Context injection into chunks: The generated metadata is then prepended to each chunk produced via a vanilla strategy, ensuring that every chunk carries both its immediate content and a snapshot of the document’s broader context.

-- STEP 1: GENERATE METADATA WITH LLM BASED ON PROVIDED KEYS
CREATE OR REPLACE TABLE MY_DOC_METADATA AS (
    SELECT
        DOC_ID,
        TEXT,
        SNOWFLAKE.CORTEX.COMPLETE(
            'llama3.3-70b',
            'I am going to provide a document which will be indexed by a retrieval system containing many similar documents. I want you to provide key information associated with this document that can help differentiate this document in the index. Follow these instructions:
    1. Do not dwell on low level details. Only provide key high level information that a human might be expected to provide when searching for this doc.
    2. Do not use any formatting, just provide keys and values using a colon to separate key and value. Have each key and value be on a new line.
    3. Only extract at most the following information. If you are not confident with pulling any one of these keys, then do not include that key:\n'
            || ARRAY_TO_STRING(
                   ARRAY_CONSTRUCT(<INSERT KEYS AS LIST OF STRINGS HERE>),
                   '\t\t* ')
            || '\n\nDoc starts here:\n' || SUBSTR(TEXT, 0, 4000) || '\nDoc ends here\n\n'
        ) AS METADATA
    FROM
        MY_DOC_TABLE
);

-- STEP 2: GENERATE CHUNKS AND PREPEND CONTEXT TO CHUNK
CREATE OR REPLACE TABLE MY_CONTEXTUALIZED_CHUNKS AS (
    WITH SPLIT_TEXT_CHUNKS AS (
        SELECT
            DOC_ID,
            C.VALUE::VARCHAR AS CHUNK -- CAST FLATTEN OUTPUT TO TEXT
        FROM
            MY_DOC_METADATA,
            LATERAL FLATTEN( input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER (
                TEXT,
                'none', -- INPUT FORMAT
                1800,   -- SET CHUNK SIZE
                300     -- SET CHUNK OVERLAP
            )) C
    )
    SELECT
        M.DOC_ID,
        CONCAT(M.METADATA, '\n\n', C.CHUNK) AS CONTEXTUALIZED_CHUNK
    FROM
        SPLIT_TEXT_CHUNKS C
    JOIN
        MY_DOC_METADATA M ON C.DOC_ID = M.DOC_ID
);

Adding a per-chunk context

This method consists of two key steps:

  1. Context generation: A language model creates a concise explanation for each chunk, situating it within the broader document context. For instance, for SEC filings, this metadata may include:

    • Document reference: Identifies the broader document or section it belongs to (e.g., "This chunk is from the ‘Management Discussion and Analysis’ section of XYZ Corp’s Q3 2023 10-Q filing").

    • Historical context: Adds reference points from the document (e.g., "In Q2 2023, operating expenses were reported at $150 million").

    • Key details: Extracts crucial information that might not be immediately present in the chunk itself.

  2. Context injection into chunks: The generated contextual metadata is prepended to each chunk before embedding and indexing, ensuring that the retrieval model understands how the chunk relates to the entire document.

Illustrative example for SEC filings

Consider a chunk from an SEC filing that originally states: "The company reported a 12% increase in operating expenses compared to the prior quarter."

Using contextual retrieval, this chunk would be transformed into: "This chunk is from the ‘Management Discussion and Analysis’ section of XYZ Corp’s Q3 2023 10-Q filing. In Q2 2023, operating expenses were reported at $150 million. The company reported a 12% increase in operating expenses compared to the prior quarter."
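
A minimal sketch of this two-step process is shown below. The llm_complete callable is a placeholder for whatever completion API is used (for example, a wrapper around Cortex COMPLETE), and the prompt is illustrative rather than the exact one from our experiments.

CHUNK_CONTEXT_PROMPT = """Here is a document:
<document>
{document}
</document>

Here is a chunk taken from that document:
<chunk>
{chunk}
</chunk>

Write one or two sentences situating this chunk within the overall document so it
can be retrieved more accurately. Answer with only that context."""

def contextualize_chunks(document, chunks, llm_complete):
    """Step 1: generate per-chunk context; step 2: prepend it before indexing."""
    contextualized = []
    for chunk in chunks:
        context = llm_complete(
            # Truncate very long filings for the prompt (illustrative limit).
            CHUNK_CONTEXT_PROMPT.format(document=document[:8000], chunk=chunk)
        )
        contextualized.append(f"{context.strip()}\n\n{chunk}")
    return contextualized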


¹ In some cases, metadata might be readily available even without using LLMs.
