
Accuracy at What Cost? Benchmarking Agentic Reasoning with MADQA

The best AI agents can now match human accuracy in document intelligence tasks, such as retrieving and reasoning over information in complex, multimodal documents.

But they work 5 times as hard to solve the problem. And both humans and machines hit a performance ceiling, leaving many tasks unsolved.

Figure 1: The MADQA benchmark requires the agent to iteratively search, gather and reason over visual and textual evidence in a large document collection. We measure not only the final answer and citation accuracy, but also computational efficiency.

We introduce MADQA (multimodal agentic document question answering), a benchmark created in collaboration with academic and industrial partners, including the University of Oxford, UNC-Chapel Hill, CVC, and Hugging Face. Unlike traditional benchmarks, MADQA evaluates full search trajectories, allowing us to measure how systems navigate documents rather than just what they output.

When we applied MADQA to leading architectural paradigms, four findings stood out:

  • Even the best agents "brute force" their way to answers, but lack the ability to invest effort rationally.
  • Humans solved about 50% of questions on their first query, compared to roughly 12% for the best-performing agent (Gemini 3 Pro). To reach the same ~82% final accuracy, agents required up to 9 rounds of search, whereas humans typically succeeded in just 1-2 rounds, meaning the agents had to work roughly 5x as hard to reach parity.
  • A simple agent with a state-of-the-art LLM achieved 82.2% accuracy, outperforming a strong static RAG baseline by 4.6%. This favors agents when accuracy is the only criterion, yet it also exposes a gap of nearly 20%: questions that neither humans nor LLMs currently solve.
  • An unconstrained recursive language model (RLM) processed 270 million tokens at a cost of roughly $850, yet failed to surpass the cheaper constrained agent built on the same base model.

The takeaway is clear: Enterprise AI systems must be evaluated not only on whether they get the right answer, but on how efficiently and strategically they arrive at it.

Below, we outline the design of MADQA and the results of our architectural standoff. For a deeper dive into the methodology and findings, read the full paper.

Designed for disruption: What makes MADQA unique

MADQA is designed to measure the enterprise readiness of agentic document intelligence systems.

Maximizing discriminative power. We moved beyond random sampling by applying Classical Test Theory to curate the data set. We selected questions that reliably distinguish between strong and weak models to ensure the leaderboard reflects genuine capability differences rather than noise or luck.
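To make the selection criterion concrete, here is a minimal sketch of an upper-lower discrimination index, one standard Classical Test Theory statistic. The response matrix, the 27% split, and the keep-threshold are illustrative assumptions, not the paper's exact curation procedure:

```python
# Hypothetical sketch: selecting discriminative questions via Classical
# Test Theory. responses[m][q] is 1 if model m answered question q correctly.

def discrimination_index(responses, q, top_frac=0.27):
    """Upper-lower discrimination: accuracy gap on question q between
    the strongest and weakest models, ranked by total score."""
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    upper = sum(m[q] for m in ranked[:k]) / k
    lower = sum(m[q] for m in ranked[-k:]) / k
    return upper - lower

# Toy run: questions 0 and 2 separate strong from weak models; question 1
# (answered correctly by everyone) carries no signal and is dropped.
responses = [
    [1, 1, 1],  # strong model
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 0],  # weak model
]
kept = [q for q in range(3) if discrimination_index(responses, q) > 0.3]
```

Questions that every model gets right (or wrong) contribute nothing to a leaderboard, so filtering on this index concentrates the benchmark's budget on items that actually separate systems.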

Benchmarking search trajectories, not just answers. Unlike standard QA benchmarks that only check the final output, MADQA captures the whole "search trajectory." We logged every search query, page view and time stamp for human annotators, enabling the first direct comparison of how agents navigate documents versus how humans do it.

Confronting the "visual reality" of enterprise data. Most of the MADQA questions require understanding visual structures — such as correlating rows in a financial table, interpreting a checkbox on a form or reading a stamped date. Systems that rely solely on text extraction (OCR) cannot solve these tasks.


Enforcing strict grounding. Every question is human-authored and strictly grounded in the document set, spanning types from financial reports to technical diagrams. We explicitly designed the task to be "closed-world," meaning agents cannot rely on external training data to guess answers; they must find the evidence in the provided files.

We applied this rigorous framework to evaluate the leading architectural paradigms, revealing critical trade-offs between raw performance and computational cost.

Architectural standoff: RAG vs. agents vs. RLMs

A critical question for enterprise document intelligence is the selection of system architecture. We rigorously evaluated three distinct paradigms to understand the trade-offs between raw performance and computational cost.


Static RAG (“The Baseline”). Standard retrieval-augmented generation (RAG) pipelines, such as Google's Gemini File Search, serve as a strong baseline. They are fast and cost-effective but brittle. While Gemini 3 Pro File Search achieved a respectable 78.6% accuracy, it lacks the iterative planning required for complex, multihop queries. RAG systems often suffer from "last-mile" navigation failures — they find the proper document but fail to pinpoint the specific page or table containing the evidence.
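The single-pass nature of static RAG can be sketched in a few lines. The term-overlap retriever and the `generate` function are toy stand-ins for illustration, not the Gemini File Search API:

```python
# Toy sketch of a static RAG pipeline: one retrieval pass, then one
# generation pass, with no opportunity to reformulate on a miss.

def retrieve(query, pages, top_k=2):
    """Score each page by naive term overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(pages,
                    key=lambda p: len(q_terms & set(p.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def generate(query, context):
    # Hypothetical stand-in for an LLM call over the retrieved context.
    return f"answer({query!r}) from {len(context)} pages"

def static_rag(query, pages):
    # Single retrieve-then-generate pass: if retrieval returns the wrong
    # pages, there is no loop in which to recover.
    return generate(query, retrieve(query, pages))

pages = ["annual revenue table", "safety policy"]
top = retrieve("annual revenue figures", pages, top_k=1)
```

The brittleness described above lives in that single `retrieve` call: a last-mile miss propagates straight into the answer.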

Agentic systems (“The Winner”). Our results show that constrained agents (systems equipped with search tools and an iterative loop) significantly outperform static RAG. The Gemini 3 Pro Agent achieved state-of-the-art accuracy of 82.2%. Why do they win? Agents can decompose problems. When an initial search fails, effective agents reformulate their queries and navigate through the document rather than giving up.
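The iterative loop can be sketched with toy stand-ins. The `search` tool, the reformulation list, and the round cap are illustrative assumptions, not a real agent framework:

```python
# Hypothetical sketch of a constrained agent loop: search, inspect the
# hits, reformulate on failure, and halt once evidence is found.

def search(query, corpus):
    """Toy search tool: return pages sharing any term with the query."""
    terms = set(query.lower().split())
    return [p for p in corpus if terms & set(p.lower().split())]

def agent_answer(question, corpus, reformulations, max_rounds=9):
    """Try the original question first, then fall back to reformulated
    queries, stopping at the first round that yields evidence."""
    queries = [question] + reformulations
    for round_no, q in enumerate(queries[:max_rounds], start=1):
        hits = search(q, corpus)
        if hits:  # evidence found: answer and stop searching
            return hits[0], round_no
    return None, max_rounds

corpus = ["quarterly revenue table", "employee handbook"]
page, rounds = agent_answer("What was Q3 income?", corpus,
                            ["revenue figures"])
```

The first query misses, the reformulated one hits, and the loop reports both the evidence and the number of rounds spent, which is exactly the effort signal MADQA logs.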

Recursive language models (“The Cautionary Tale”). We also evaluated RLMs, which allow the LLM to decompose context without the constraints of a fixed tool. While theoretically flexible, RLMs proved to be an efficiency catastrophe in practice. Without the constraints of a search tool, RLMs fall into an "illusion of infinite budget." For example, the Claude Sonnet 4.5 RLM processed more than 270 million input tokens at a cost of $850, yet failed to match the accuracy of the much cheaper agent relying on the same model. Unconstrained reasoning often leads to inefficient information processing without resulting in performance gains.

"Cold start" problem

While agents achieve high accuracy, they are less efficient learners than humans equipped with the same search tools. We analyzed the "Accuracy @ N Steps" to see how quickly different systems converge on the correct answer.

Humans demonstrate strong zero-shot strategic calibration, achieving ~50% accuracy on their very first query. Agents suffer from a severe "cold start." Gemini 3 Pro starts at only ~12% accuracy, relying on a steep, compute-intensive recovery to eventually reach parity.
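The convergence comparison reduces to a simple "Accuracy @ N Steps" computation. The trajectories below are illustrative toy data shaped like the reported curves, not the benchmark's actual numbers:

```python
# Accuracy @ N Steps: for each step budget N, the fraction of questions
# answered correctly within N search rounds.
# trajectories holds (solved, rounds_used) per question (toy data).

def accuracy_at_n(trajectories, n):
    solved_within = sum(1 for ok, rounds in trajectories
                        if ok and rounds <= n)
    return solved_within / len(trajectories)

# Toy data echoing the shape of the reported curves (illustrative only):
# humans front-load success, agents recover over many rounds.
human = [(True, 1)] * 5 + [(True, 2)] * 3 + [(False, 2)] * 2
agent = [(True, 1)] * 1 + [(True, 5)] * 4 + [(True, 9)] * 3 + [(False, 9)] * 2

curve_human = [accuracy_at_n(human, n) for n in (1, 2, 9)]
curve_agent = [accuracy_at_n(agent, n) for n in (1, 2, 9)]
```

Plotting these curves side by side makes the cold start visible: the human curve starts high and flattens early, while the agent curve starts low and climbs for many rounds before the two meet.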


To rigorously quantify this trade-off between eventual success and the computational cost required to achieve it, we need a new metric.

Accuracy-to-effort trade-off

In agentic workflows, final accuracy is a deceptive metric. It tells you if the agent solved the problem, but not at what cost. Did the agent solve the query in one elegant search, or did it flail through 20 expensive steps only to get lucky?

To rigorously measure this accuracy-to-effort trade-off, we adapted the Kuiper Statistic (derived from the Cumulative Difference method). It doesn't just reward the right answer; it penalizes the "flail": the moment an agent begins burning tokens on a lost cause. A high Kuiper score exposes a lack of strategic introspection, where the system grinds through complex queries it will ultimately never solve.

How does it work? We sort all test questions by the "effort" required (e.g., number of agent searches) and analyze the performance curve. In a well-calibrated system, the easiest questions (lowest effort) have the highest accuracy, and accuracy declines only gently as effort increases. A steep drop instead indicates the agent is burning compute on lost causes. The Kuiper Statistic quantifies the magnitude of this misalignment.
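One plausible construction of such a score is sketched below, assuming a cumulative-difference formulation in which each outcome is compared against the mean accuracy over effort-sorted questions. The paper's exact weighting and normalization may differ; this only illustrates why a steep accuracy drop inflates the statistic:

```python
# Hedged sketch of a Kuiper-style accuracy-to-effort score. Questions are
# sorted by effort; we accumulate (outcome - mean accuracy) and report the
# maximum excursion above zero plus the maximum excursion below zero.
# A steep accuracy drop at high effort creates a large excursion; a flat
# accuracy curve keeps the running sum near zero.

def kuiper_score(items):
    """items: (effort, correct) pairs, with correct in {0, 1}."""
    ordered = sorted(items)                       # ascending effort
    mean_acc = sum(c for _, c in ordered) / len(ordered)
    cum = d_plus = d_minus = 0.0
    for _, correct in ordered:
        cum += correct - mean_acc                 # cumulative difference
        d_plus = max(d_plus, cum)
        d_minus = max(d_minus, -cum)
    return d_plus + d_minus

# Steep drop: every low-effort question solved, every high-effort one failed.
steep = [(e, 1) for e in range(5)] + [(e, 0) for e in range(5, 10)]
# Flat: successes spread evenly across the effort range.
flat = [(e, e % 2) for e in range(10)]
```

Under this toy construction the steep-drop system scores several times higher than the flat one, matching the intuition that a high score flags effort poured into questions the system never solves.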


Low score (e.g., humans ~14.6). Highly calibrated. Humans invest effort rationally — solving easy tasks quickly and only spending their "effort budget" on problems they are likely to solve.

High score (e.g., best agents ~22.9). Poorly calibrated. Agents currently lack strategic introspection. They persist in "stochastic search" loops, burning tokens on problems where more search does not yield better answers.

Employing this framework shows that while the best agents can match human accuracy, they are significantly less efficient. They effectively "brute force" their way to the answer, whereas humans navigate strategically.

Conclusion

Scaling context windows gives models a bigger flashlight, but it doesn't teach them to read the map. The future belongs to agents that can navigate with purpose rather than luck, recognizing when they are lost and halting the moment the answer is found.

We are releasing MADQA to force a shift from brute-force retrieval to calibrated, cost-effective intelligence.

