
Building Reliable Data Science Agents with DARE-Bench and PRISM-DS

Large language models are rapidly evolving into autonomous agents capable of executing complex, multistep workflows. One of the most important frontiers in this evolution is the data science agent — an AI system that can autonomously ingest data, engineer features, train models and produce reproducible predictions through code execution.

The two missing ingredients for autonomous data science

Despite rapid advancements in AI, building reliable data science agents remains a massive challenge. Real-world data science is fundamentally different from answering trivia or writing isolated code snippets. It requires end-to-end execution, absolute reproducibility, strict metric awareness and unyielding instruction fidelity.

Under these rigorous conditions, even the strongest frontier models remain brittle. They frequently ignore explicit constraints, execute data-processing steps in the wrong order or omit critical function arguments, such as random seeds. These issues matter especially in enterprise environments, where data science workflows must be reproducible, governed and closely tied to real production data systems. We cannot rely on systems that succeed through lucky guesses or hallucinated code.

Why do these agents continue to struggle? Our research indicates that the field is currently blocked by two major missing ingredients:

  • The infrastructure gap: Current evaluation and training setups are insufficient. Most existing benchmarks are too small, measure only final-answer accuracy or rely on subjective LLM-as-a-judge grading. Training a robust agent — especially through reinforcement learning — requires deterministic, programmatic ground truth. Noisy reward signals simply cannot support stable training.
  • The agent design gap: Many current agents still rely on generic reasoning patterns. When faced with complex data sets, they draw from broad parametric memory or unconstrained web searches rather than employing task-aware workflow designs that align specific modeling choices with the actual data at hand.

To build autonomous data science agents that actually work in production, we need to solve both. DARE-Bench is our answer to the infrastructure problem, and PRISM-DS is our answer to the agent design problem.

DARE-Bench: A verifiable infrastructure for training and evaluation

To address the lack of reliable training and evaluation infrastructure, our team at Snowflake AI Research introduces DARE-Bench (Data Agent Reinforcement Evaluation Benchmark).

DARE-Bench provides 6,300 deterministic, Kaggle-derived data science tasks with fully programmatic ground truth. It is designed to serve a dual role: both as a standardized evaluation harness and as a large-scale training resource. This focus on executable, deterministic workflows aligns perfectly with how Snowflake thinks about enterprise AI systems operating reliably over real data pipelines.

To rigorously probe distinct agent capabilities, we designed specialized task variants covering classification, regression and time-series forecasting:

  • Process vs. outcome (for classification and regression)
    • Instruction following (IF): These tasks simulate a scenario where an agent must strictly execute a senior data scientist's detailed design. We evaluate process fidelity by verifying the agent's final predictions against a simulated reference output obtained from ground-truth code.
    • ML modeling (MM): These tasks reflect an outcome-driven scenario where the LLM has full freedom to explore and model the data. Performance is evaluated directly against the data set's original ground truth using reproducible metrics, such as macro-F1 and clipped R^2 (sketched in code after this list).
  • Classical vs. multivariate (for time-series forecasting): For time-series tasks, the distinction is more nuanced to reflect real-world forecasting challenges.
    • Exogenous features (XF): In this variant, we retain all exogenous features from the original data set alongside the timestamp and entity identification columns. This tests the agent's ability to leverage rich, multivariate signals for prediction.
    • Canonical forecasting (CF): This setup mimics classical forecasting. While exogenous features remain available for training, the test set is strictly constrained to only the timestamp and entity columns. This forces the agent to rely heavily on extracting historical and temporal patterns.
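
To make this evaluation contract concrete, here is a minimal sketch of how such deterministic scores could be computed. The function names are illustrative, and we assume "clipped R^2" means clipping the score into [0, 1] so degenerate models cannot produce unbounded negative values; the benchmark's exact definitions may differ.

```python
# Minimal sketch of deterministic, programmatic scoring (assumptions noted).
import numpy as np
from sklearn.metrics import f1_score, r2_score

def score_classification(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Macro-F1 weights every class equally, so rare classes still count.
    return f1_score(y_true, y_pred, average="macro")

def score_regression(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Assumed clipping: R^2 in [0, 1], so a model worse than the
    # mean-prediction baseline scores exactly 0.
    return float(np.clip(r2_score(y_true, y_pred), 0.0, 1.0))

def score_instruction_following(reference: np.ndarray, agent: np.ndarray) -> float:
    # IF tasks check process fidelity: the agent's predictions must exactly
    # reproduce the reference output of the ground-truth code.
    return float(np.array_equal(reference, agent))
```
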
Figure 1: DARE-Bench defines each task by providing a natural-language question and structured files (metadata and train/test splits). An LLM agent executes code within a sandbox to generate predictions, which are compared against ground truth for automatic and reproducible evaluation.
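
As a rough illustration of the loop in Figure 1, the sketch below runs an agent's solution in an isolated subprocess and scores its predictions against ground truth. File names such as predictions.csv and the scorer hook are assumptions for illustration, not the actual DARE-Bench interface.

```python
# Hypothetical sketch of a sandboxed task-evaluation loop.
import pathlib
import subprocess
import tempfile

import pandas as pd

def run_task(agent_script: str, task_dir: pathlib.Path, scorer,
             timeout_s: int = 600) -> float:
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(agent_script)
    try:
        # Execute the agent's code in an isolated working directory; a real
        # harness would add containerization, network and resource limits.
        proc = subprocess.run(
            ["python", "solution.py", str(task_dir)],
            cwd=workdir, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if proc.returncode != 0:
        return 0.0  # failed executions earn zero, keeping the signal deterministic
    preds = pd.read_csv(workdir / "predictions.csv")
    truth = pd.read_csv(task_dir / "ground_truth.csv")
    return scorer(truth["target"].to_numpy(), preds["target"].to_numpy())
```
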

Unleashing model performance with verifiable rewards

When we evaluate leading models on DARE-Bench using a standardized sandboxed environment, the infrastructure reveals a stark reality: Complex data science is still a major hurdle. While Anthropic's Claude-Sonnet-3.7 leads the pack in modeling tasks, and GPT-5 leads in instruction following, smaller open source models (like the Qwen3 family) struggle significantly out of the box.

Model              class-IF  class-MM  reg-IF  reg-MM  time-XF  time-CF
GPT-4o                32.88     40.45   20.28   40.60    35.54     4.77
GPT-4.1               55.82     57.83   52.17   58.62    40.78     6.60
GPT-5                 69.81     43.40   57.24   56.29    36.83    10.13
GPT-o4-mini           67.56     57.89   53.62   57.60    42.29     9.67
Claude-Sonnet-3.7     61.48     61.03   46.37   63.20    49.88    13.70
Claude-Sonnet-4       16.21     18.27   15.21   11.33     4.80     0.01
Qwen3-32B             17.11     30.71   15.21   35.86    26.96     0.00
Qwen3-4B               3.60      5.23    0.72    3.29     6.97     0.00

Table 1: Main evaluation results on DARE-Bench test set (data as of Sept. 19, 2025).

However, because DARE-Bench provides verifiable, programmatic ground truth rather than subjective human or LLM judgment, it unlocks massive training potential for supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR).

  • SFT via rejection sampling: We generated high-quality supervision signals using strict rejection sampling across multiple trajectories. By experimenting with distinct filtering strategies (Fastest-valid, All-valid, Best-valid and Duo-valid; sketched in code after this list), SFT alone improved Qwen3-32B's accuracy by 1.83x.
  • RLVR: To push the limits of smaller models like Qwen3-4B, we applied reinforcement learning using GRPO (Group Relative Policy Optimization). Instead of subjective preference, we engineered objective programmatic signals (e.g., rewarding 1.1 for an exact match in IF tasks and dynamic scaling based on actual metric scores in MM tasks), as sketched below.
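
The filtering strategies named above can be pictured with a small sketch. The Trajectory fields and the semantics of each filter (especially Duo-valid) are assumptions inferred from the strategy names, not the exact implementation.

```python
# Hedged sketch of rejection-sampling filters for building SFT data.
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str         # full agent trace used as an SFT target
    valid: bool       # executed successfully and passed verification
    score: float      # task metric (MM) or exact-match flag (IF)
    runtime_s: float  # wall-clock execution time

def filter_trajectories(trajs: list[Trajectory], strategy: str) -> list[Trajectory]:
    valid = [t for t in trajs if t.valid]
    if not valid:
        return []
    if strategy == "fastest_valid":   # keep the quickest verified solution
        return [min(valid, key=lambda t: t.runtime_s)]
    if strategy == "best_valid":      # keep the highest-scoring solution
        return [max(valid, key=lambda t: t.score)]
    if strategy == "all_valid":       # keep every verified solution
        return valid
    if strategy == "duo_valid":       # assumption: keep the best two
        return sorted(valid, key=lambda t: t.score, reverse=True)[:2]
    raise ValueError(f"unknown strategy: {strategy}")
```
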
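
Likewise, the verifiable reward itself reduces to a small deterministic function. The 1.1 exact-match value for IF tasks comes from the description above; how MM scores are scaled is an assumption (here, the raw metric score passed through unchanged).

```python
# Minimal sketch of a verifiable reward for GRPO-style training.
def verifiable_reward(task_type: str, *, exact_match: bool = False,
                      metric_score: float = 0.0, executed_ok: bool = True) -> float:
    if not executed_ok:
        return 0.0                       # broken code earns nothing
    if task_type == "IF":
        return 1.1 if exact_match else 0.0
    if task_type == "MM":
        # Assumed dynamic scaling: the reward tracks the actual metric, so
        # group-relative advantages rank rollouts by real solution quality.
        return metric_score
    raise ValueError(f"unknown task type: {task_type}")
```
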

Using these verifiable rewards on DARE-Bench delivered an 8x improvement for the Qwen3-4B model (jumping from a score of 4.39 to 37.40) while reducing code errors by 48%. This proves that having the right training infrastructure, not just scaling up parameters, determines whether autonomous data science agents succeed.

Table 2: Fine-tuning and reinforcement learning improve performance over baselines. Superscripts denote absolute gains compared to the baseline of the same model (data as of Sept. 19, 2025).

PRISM-DS: Grounding agent design in expert workflows

With the infrastructure established, we must address how the agents themselves operate. To solve the second missing ingredient, we introduce PRISM-DS, an advanced agent framework designed to overcome the "generic design" limitations of current systems.

PRISM-DS introduces a "task-profile-to-skill loop." Instead of guessing approaches based on surface-level similarities, the system first acts as a profiler to extract structured task characteristics — such as data modality, target metrics and specific challenges evidenced in the data, like extreme target skewness, high cardinality or systematic missing data.
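
A structured profile of this kind might look like the following sketch; the field names are hypothetical and do not reflect the actual PRISM-DS schema.

```python
# Hypothetical structure for a profiler's output.
from dataclasses import dataclass, field

@dataclass
class TaskProfile:
    modality: str                       # e.g., "tabular", "time-series"
    task_type: str                      # e.g., "classification"
    target_metric: str                  # e.g., "macro-F1"
    challenges: list[str] = field(default_factory=list)
    # e.g., ["extreme target skewness", "high cardinality",
    #        "systematic missing data"]
```
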

These characteristics act as precise queries for an agentic retriever to navigate a highly curated knowledge base of 4,421 stage-specific skills, distilled from 1,504 medal-winning Kaggle notebooks. This helps ensure the agent retrieves targeted, challenge-aware techniques for initial code generation, and then repeats this retrieval process during block-wise refinement to iteratively improve specific pipeline stages.
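
The retrieval step can then be pictured as a similarity search over stage-filtered skills. The skill record layout and the embed function are placeholders for illustration; the actual agentic retriever is more sophisticated.

```python
# Sketch of challenge-aware skill retrieval over a curated knowledge base,
# assuming a simple embedding index (embed maps text to a numpy vector).
import numpy as np

def retrieve_skills(profile, stage: str, skills: list[dict],
                    embed, top_k: int = 5) -> list[dict]:
    # Restrict to skills for the pipeline stage being (re)generated,
    # e.g., "feature_engineering" or "modeling".
    candidates = [s for s in skills if s["stage"] == stage]
    query = embed(f"{profile.task_type} {profile.target_metric} "
                  + " ".join(profile.challenges))

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(candidates,
                    key=lambda s: cosine(query, embed(s["description"])),
                    reverse=True)
    return ranked[:top_k]
```
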

Figure 2: Overall workflow of PRISM-DS.

When evaluated on the DARE-Bench modeling subset, PRISM-DS demonstrated the immense power of this structured agent design. Using GPT-5 as the backbone, PRISM-DS achieved a score of 48.38, significantly outperforming leading generic agent baselines, such as MLE-STAR (44.24), ML-Master (42.23) and DS-Agent (41.17).

Table 3: Main results on DARE-Bench modeling tasks (data as of March 31, 2025).

These results, alongside the clear performance gaps shown in the benchmarks, underscore that success in real-world data science workflows depends on more than just scaling laws. It requires systems capable of rigorous task diagnosis and the targeted application of expert skills — capabilities that DARE-Bench is uniquely equipped to measure and validate.

Combining infrastructure and design: Pushing the frontier of enterprise AI

The results from our research paint a clear picture: You cannot solve data science automation with just larger models. Progress requires both verifiable environments and better workflow structures.

DARE-Bench removes the infrastructure bottleneck by providing deterministic, large-scale training and evaluation. PRISM-DS removes the agent design bottleneck by grounding model reasoning in structured, expert-level skills. For Snowflake, this points toward a future where data-centric agents can be trained and evaluated with the same rigor and reliability expected of production data systems.

Conclusions

Our results show that reliable data science agents require both rigorous training and evaluation signals, and agent designs grounded in expert workflows. DARE-Bench provides that foundation through verifiable tasks, while PRISM-DS shows how structured task profiling and skill retrieval can improve agent performance on real modeling workflows.

Together, they move us closer to reliable, reproducible and instruction-faithful data science agents for enterprise AI systems. Looking ahead, we see opportunities to extend this work to more complex enterprise settings, including multitable database workflows, unstructured data processing and advanced SQL-based data engineering tasks.

We are releasing DARE-Bench together with its associated training resources.

Additionally, we plan to share more details about the PRISM-DS framework and its associated benchmark results in the near future. Stay tuned to the Snowflake Engineering Blog for more updates on our research into agentic reasoning.
