Enterprise evaluation needs to be hybrid
One of the highest-value applications of AI today is transforming complex business workflows. Yet, while enterprise AI has become the industry's most critical frontier, realistic benchmarks for it remain surprisingly underexplored. Specifically, current evaluations are missing the "hybrid" piece: the reality that real-world business intelligence rarely lives in a single silo. Consider a business user who asks which customers had a revenue drop according to the internal data warehouse, and what caused it. To answer questions like this, agents must combine complex structured data (like SQL databases) with ambiguous unstructured text (like web search or documents) to form a complete answer.

Existing benchmarks focus almost entirely on a single modality. Text-to-SQL benchmarks such as Spider and BIRD drive progress in complex database querying, while browsing benchmarks like BrowseComp and GAIA measure open-ended web retrieval. However, the connection between these two worlds is largely unstudied.
To bridge this gap, we introduce HybridDeepResearch, a benchmark designed specifically to test hybrid SQL and text deep research. It asks a simple question: Can an agent reliably move evidence between the rigidity of SQL and the ambiguity of Search without dropping constraints? Answering this requires designing tasks that expose whether an agent can coordinate both modalities end-to-end to form a single, verifiable answer.
Where agents break: The anatomy of a handoff failure
To understand why this end-to-end coordination is so difficult, we need to recognize a two-fold problem. First, current evaluations rarely capture the intersection between SQL and text. Second, because that intersection is not directly measured, current agent systems are not consistently designed to preserve constraints across it. HybridDeepResearch evaluates three task families, shown in Figure 2, to make one failure mode visible: the handoff. Each modality may look correct in isolation, but the workflow breaks when evidence must be transferred, cross-checked and validated across systems.

Here are three examples, one per task family, showing how agents typically fail on supply chain analyst tasks:
Task 1 — The salience trap (SQL-to-search failure): "Find our primary supplier of lithium-ion batteries, and check external news to see if their main manufacturing hub is currently facing a port strike."
- The correct path: The agent queries the database and identifies A-company as the primary supplier. It then searches the web specifically for A-company and port strikes.
- The common failure: The agent correctly finds A-company in the database. But when it goes to the web, it searches generally for "lithium-ion battery port strikes." It sees a massive, highly ranked news article about B-company's supply chain being halted. The agent answers: "B-company is affected by strikes," even though the question was about A-company.
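One way to guard against this failure is to carry the SQL-derived entity into the search step as a hard constraint. The sketch below is a minimal illustration, not how any evaluated framework works: `SearchHit`, `anchored_query` and `keep_anchored_hits` are hypothetical names, and the hits are made-up stand-ins for ranked search results.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    """Hypothetical container for one ranked search result."""
    title: str
    snippet: str


def anchored_query(anchor: str, topic: str) -> str:
    """Build a web query that keeps the SQL-derived entity as a quoted term."""
    return f'"{anchor}" {topic}'


def keep_anchored_hits(hits: list[SearchHit], anchor: str) -> list[SearchHit]:
    """Drop hits that never mention the anchor, however highly they rank."""
    needle = anchor.casefold()
    return [h for h in hits if needle in f"{h.title} {h.snippet}".casefold()]


# Made-up ranked results: the salient B-company story comes first.
hits = [
    SearchHit("B-company supply chain halted by port strike", "Shipments stopped."),
    SearchHit("Port strike update", "A-company's main hub is operating normally."),
]

query = anchored_query("A-company", "port strike")
relevant = keep_anchored_hits(hits, "A-company")
```

The substring filter is deliberately crude. Real entity resolution has to handle aliases and subsidiaries, but the point stands: the anchor from the database decides which evidence counts, not the search ranking.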
Task 2 — Dropping the anchor (Search-to-SQL failure): "Identify the battery manufacturer currently striking in Rotterdam, and calculate our total Q3 spend with them."
- The correct path: The agent searches the news and identifies A-company as the striking company. It then writes a SQL query to calculate Q3 spend specifically where vendor_name = 'A-company'.
- The common failure: The agent successfully identifies A-company from the news. But when it turns to the database, it fails to translate that text discovery into a strict SQL WHERE clause. Instead, it writes a generic query that returns the Q3 spend for all battery manufacturers, or it just checks if A-company exists in the database without calculating the spend.
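The correct path here amounts to binding the entity found in the news to a strict filter. The following sketch uses Python's built-in `sqlite3` module with an in-memory table; the `purchases` schema, its rows and the `q3_spend` helper are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (vendor_name TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("A-company", "Q3", 120.0),
        ("A-company", "Q3", 80.0),
        ("A-company", "Q2", 50.0),
        ("B-company", "Q3", 300.0),
    ],
)


def q3_spend(conn: sqlite3.Connection, vendor: str) -> float:
    """Total Q3 spend for exactly one vendor, bound as a query parameter."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM purchases "
        "WHERE vendor_name = ? AND quarter = 'Q3'",
        (vendor,),
    ).fetchone()
    return row[0]


vendor_from_search = "A-company"  # entity resolved in the news step
spend = q3_spend(conn, vendor_from_search)
```

Dropping the `vendor_name = ?` predicate would reproduce the failure described above: the query would still run, but it would return spend for every battery manufacturer.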
Task 3 — Skipping the intersection (parallel fusion failure): "Which of our active, Tier-1 suppliers are currently affected by the Rotterdam port strike?"
- The correct path: The agent queries the database to get a list of Tier-1 suppliers. Independently, it searches the web to get a list of companies affected by the Rotterdam strike. The final answer is the intersection of those two lists.
- The common failure: The agent searches the web and finds an article listing three companies affected by the strike (A-company, B-company and C-company). It immediately returns all three to the user as the final answer, without checking which of them are active Tier-1 suppliers in the database.
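The fusion step in Task 3 reduces to an intersection of two independently produced lists. A minimal sketch, assuming both lists have already been retrieved; the `normalize` and `affected_tier1` helpers and the sample names are hypothetical:

```python
def normalize(name: str) -> str:
    """Crude entity normalization: trim and case-fold (real matching is harder)."""
    return name.strip().casefold()


def affected_tier1(tier1_from_sql: list[str], affected_from_web: list[str]) -> list[str]:
    """Return the Tier-1 suppliers that also appear in the web-derived list."""
    affected = {normalize(n) for n in affected_from_web}
    return [n for n in tier1_from_sql if normalize(n) in affected]


tier1 = ["A-company", "D-company"]                    # hypothetical SQL result
affected = ["a-company ", "B-company", "C-company"]   # hypothetical web result
answer = affected_tier1(tier1, affected)
```

Returning `affected` directly is the failure mode: B-company and C-company may be on strike, but they are not in the Tier-1 list, so they do not belong in the answer.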
These examples illustrate the core issue: agents optimized for isolated tasks fail at the handoff because they aren't designed to enforce cross-tool agreement. The need is clear, but building such a benchmark is nontrivial, as each task must relentlessly force that agreement while still remaining objectively verifiable.
How we built it: From design philosophy to execution
To capture the architectural gaps at these seams, HybridDeepResearch introduces two core design shifts:
- From isolated tasks to cross-system orchestration: Traditional benchmarks test whether a foundational model can write SQL or browse the web in a vacuum. HybridDeepResearch evaluates the seams between these tools. We test the surrounding agent architecture to see if it can successfully transfer evidence, enforce cross-modal validation and prevent web salience from overriding database constraints.
- From flat scoreboards to diagnostic targets: Benchmarks like SWE-bench and 𝜏-bench have shown that evaluation can help builders debug their systems. Because every task in HybridDeepResearch requires a verifiable intersection of modalities, a failure tells you where the orchestration broke: whether the agent dropped a schema anchor, fell for a salience trap or skipped a cross-check.
To translate these design goals into reality, we needed a custom construction pipeline. Forcing an agent to find agreement between modalities while keeping the task objectively verifiable is a difficult balancing act, and the vast majority of candidate tasks fail at least one of these requirements. Some collapse into database-only tasks. Others are too easily bypassed using text alone or internal model memory. Even seemingly perfect tasks often unravel due to ambiguous entity matches, unstable web evidence, accidental answer leakage or database outputs that are simply too broad to serve as a clean pivot. Finding the true hybrid lock requires meticulously filtering out the noise.
HybridDeepResearch scales this process through a construction pipeline that starts from database-grounded seed entities and builds paired SQL and text constraints around them. The database provides the source of truth, the text side provides external evidence, and the task is retained only if both sides combine into a unique, objectively gradable answer.

- Select a database-grounded seed: We start from entities in structured tables and use schema context to determine what each entity means inside the database, preventing ambiguous names or overloaded values from becoming unreliable benchmark targets.
- Build nontrivial SQL constraints: Using LiveSQLBench-lite, we create database-side constraints that require joins, aggregations, filters, nested logic, rankings or grouped statistics. The SQL side produces either a precise pivot entity or a compact candidate set: useful enough to guide the task, but not sufficient to solve it alone.
- Add text evidence (simulating enterprise data): To simulate unstructured enterprise context (such as news, internal reports or regulatory filings) in a reproducible, open source way, we attach evidence drawn from publicly available online sources. We filter weak associations, unsupported links and clues that leak the answer too directly, so the task requires true evidence gathering and entity resolution rather than simple keyword matching.
- Validate the hybrid lock: We retain only tasks where both sides are necessary. A valid task cannot be solved from model memory, SQL alone or text alone, and must have a unique, stable answer. After automated filtering, especially difficult cases that no evaluated agent architecture can solve are sent to human reviewers to distinguish genuinely hard examples from flawed, ambiguous or unsolvable ones. The retained tasks have a hybrid lock: SQL and text each narrow the space, but the final answer emerges only when the agent combines them, cross-checks the constraints and validates the final output correctly.
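The retention rule in the last step can be expressed as a simple predicate. This is a sketch under stated assumptions, not the actual pipeline: the `Candidate` fields are hypothetical, and it assumes each side's constraints can be summarized as a set of candidate answers, with a separate tool-free probe for model memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical summary of one candidate task."""
    gold: str                      # intended unique answer
    memory_answer: Optional[str]   # what a tool-free model answered, if anything
    sql_candidates: set[str]       # entities satisfying the SQL constraints
    text_candidates: set[str]      # entities satisfying the text constraints


def has_hybrid_lock(task: Candidate) -> bool:
    """Keep a task only if both sides are needed and the answer is unique."""
    solvable_alone = (
        task.memory_answer == task.gold
        or task.sql_candidates == {task.gold}
        or task.text_candidates == {task.gold}
    )
    combined = task.sql_candidates & task.text_candidates
    return combined == {task.gold} and not solvable_alone


good = Candidate("A", None, {"A", "B"}, {"A", "C"})
sql_only = Candidate("A", None, {"A"}, {"A", "C"})
ambiguous = Candidate("A", None, {"A", "B"}, {"A", "B"})
```

Here `good` is retained, `sql_only` is rejected because the database pins the answer by itself, and `ambiguous` is rejected because the combined constraints still leave two candidates. Answer stability over time and human review are not modeled.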
With this benchmark in place, we can evaluate whether agent systems actually preserve that hybrid lock in practice.
Benchmark results: Orchestration matters
A useful benchmark should not be saturated on launch day. To establish a baseline and validate that HybridDeepResearch remains challenging, we evaluated three representative agentic research systems: ArcticSwarm (our multi-agent system for long-horizon search and hybrid reasoning), smolagents (a lightweight code-agent framework) and MiroFlow (an open source research-agent framework).
We evaluated all frameworks using Claude Sonnet 4.6 and report mean@3 and pass@3 accuracy. Because each task has a unique answer and a strict evidence chain, we do not assign partial credit. A run is correct only if the agent resolves the full chain and returns the right final answer. Benchmark methodology, evaluated task sets, scoring criteria and runtime configurations are documented internally and available upon request.
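For readers unfamiliar with the two metrics, here is a small sketch of how they can be computed from binary run outcomes. The definitions are an assumption, since they are not spelled out here: mean@3 is taken as the accuracy averaged over three independent runs per task, and pass@3 as the fraction of tasks solved by at least one of those three runs. The task IDs and outcomes are made up.

```python
def mean_at_k(runs: dict[str, list[bool]]) -> float:
    """Average per-run accuracy across tasks (each task has k runs)."""
    return sum(sum(r) / len(r) for r in runs.values()) / len(runs)


def pass_at_k(runs: dict[str, list[bool]]) -> float:
    """Fraction of tasks solved by at least one of the k runs."""
    return sum(any(r) for r in runs.values()) / len(runs)


# Made-up outcomes: True means the run returned the exact final answer.
results = {
    "task-1": [True, True, False],
    "task-2": [False, False, False],
    "task-3": [False, True, False],
}

mean3 = mean_at_k(results)
pass3 = pass_at_k(results)
```

Because there is no partial credit, every entry is a strict boolean. Under these definitions, the gap between the two numbers indicates how consistent an agent is across repeated attempts.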

As shown in Figure 4, ArcticSwarm achieved the highest scores among the evaluated frameworks in this benchmark configuration. The figure also reveals a clear difficulty hierarchy across the three task families, highlighting the gap between simple execution and orchestration:
- Parallel fusion (most tractable): This pattern offers the strongest foothold. Because the rigid database shortlist acts as an immediate, structured filter on the fuzzy web results, it creates a built-in safety net that prevents single-tool errors from derailing the entire run.
- SQL-to-search (intermediate difficulty): Having a clean database anchor provides a stable, highly specific guide before the agent enters noisy web browsing. However, agents still struggle with the salience trap, easily drifting from the anchor when highly ranked but irrelevant search results grab their attention.
- Search-to-SQL (most challenging): This remains the hardest pattern by a wide margin. Starting in a highly unconstrained web search space and trying to distill ambiguous narratives into a rigid, syntactically perfect SQL query introduces severe formatting and grounding friction. Without strong orchestration, agents fail to translate their text discoveries into clean database filters, resulting in flat-out query failures.
These performance gaps reveal why system design is the deciding factor. ArcticSwarm's multi-agent architecture, which powers Snowflake's CoWork Deep Research feature, was designed to help reduce these handoff failures through gated execution and constraint preservation. Its gated execution pattern keeps database exploration and web search initially insulated from one another.
Toward enterprise-ready hybrid reasoning
Enterprise agents will not be judged only by whether they can write a SQL query or fetch a webpage. They will be judged by whether they can move between governed data and external evidence while preserving the constraints that make an answer correct.
At Snowflake, we designed HybridDeepResearch not merely as a passive scoreboard, but as an active engine driving the co-evolution of our enterprise agent architectures. In developing collaborative agent systems like ArcticSwarm, the benchmark failures directly inform system design. This continuous feedback loop is the core idea: Evaluation should not trail system building; it should actively shape it.
Authors
Snowflake AI Research: Yite Wang, Xiaodong Yu, Boyi Liu, Yuxiong He, Zhewei Yao
Academic Collaborators: Ruofan Wu (University of Houston), Peiran Xu (University of California, Los Angeles), Xiaolong Li (The University of Hong Kong), Fan Shu (University of Houston), Soyoung Yoon (Seoul National University)

