JUN 03, 2026 / 11 min read / Gen AI

CoCoEvolve: What If a Coding Agent Could Optimize Your AI Systems?

Tuning AI systems by hand doesn't scale. Whether you're improving a data agent, a dbt project or an LLM pipeline that classifies, filters and summarizes rows (such as Snowflake's AI Functions), the loop looks the same: Run an eval → diagnose failures → edit → rerun → repeat.

While teams are exploring evolutionary optimization techniques (e.g. AlphaEvolve and OpenEvolve) to address this problem, we spent the past few months asking this research question: what if we used a coding agent instead of an LLM to power these techniques?

Our research resulted in CoCoEvolve, an optimization harness that uses evolutionary optimization techniques along with CoCo, Snowflake's AI coding agent, to automatically propose, test and keep changes that improve your AI systems. In our evaluations, coding-agent-powered optimization outperformed the tested LLM-only approaches.

Here is what we found:

  • A stock Cortex Agent reached near the top of the DABStep leaderboard. In our internal evaluation using the DABStep Hard benchmark configuration as of June 2026, CoCoEvolve improved the evaluated Cortex Agent configuration from 22% to 89.9%, with no human-in-the-loop tuning (see footnote 1). LLM-only approaches like OpenEvolve reached only 45.5% in the same evaluation, because this task demands a high level of interactivity (e.g., running SQL for verification, selectively exploring massive artifacts), a capability that coding agents like CoCo possess in spades.
  • dbt task pass rates rose by roughly 8.5 percentage points (31.7% to 40.2%). The same harness optimized a data engineering pipeline with no task-specific engineering.
  • Cortex AI Functions accuracy improved by roughly 41.3 percentage points. CoCoEvolve improved AI Function accuracy from 49.4% to 90.7% on the PII redaction benchmark. We saw similar improvements in other use cases such as sentiment analysis and spam detection, a strong signal that the optimization strategy generalizes.

In this post, we break down why hand-tuning AI systems hits a ceiling, explain how CoCoEvolve works, show benchmark results across three AI applications and detail what coding agents bring that LLMs alone cannot.

Tuning AI systems manually doesn't scale

Every team building on top of LLMs has hit the same ceiling: the system works in demos but breaks down on real workloads, and every fix risks regressing something else. This manual edit-eval loop has three structural problems:

  1. Sequential: One person makes one edit at a time, so progress is linear regardless of how large the evaluation surface is.
  2. No memory: Each edit overwrites the last, so you lose track of what was working before, rediscover the same tradeoffs repeatedly, and have no way to combine strategies that each solve different parts of the problem.
  3. Locally biased: You fix whatever failure is most visible, not necessarily where improvement would have the most impact.

We built CoCoEvolve to address all three gaps.

How does CoCoEvolve work?

CoCoEvolve is an evolutionary optimization framework, validated through our internal research prototype, which wraps around an AI artifact and repeatedly makes it better. The premise is simple:

  1. Start with a candidate "program" (could be code, a prompt, a config, an agent blueprint).
  2. Propose a change to it (a "mutation").
  3. Evaluate whether the change helped.
  4. Keep what works, discard what doesn't, repeat.
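
The four steps above can be sketched as a minimal loop. This is an illustration only, not CoCoEvolve's implementation: `evolve`, `toy_mutate` and `toy_evaluate` are hypothetical names, the loop keeps a single surviving candidate rather than a population, and the toy "program" is a list of integers rather than an AI artifact.

```python
import random
from typing import Callable, List, Tuple

Program = List[int]  # stand-in for code, a prompt, a config or an agent blueprint


def evolve(
    seed: Program,
    mutate: Callable[[Program, random.Random], Program],
    evaluate: Callable[[Program], float],
    iterations: int,
    rng: random.Random,
) -> Tuple[Program, float]:
    """Hypothetical propose/evaluate/keep loop with one surviving candidate."""
    best, best_score = seed, evaluate(seed)        # step 1: start with a candidate
    for _ in range(iterations):
        child = mutate(best, rng)                  # step 2: propose a mutation
        child_score = evaluate(child)              # step 3: evaluate the change
        if child_score > best_score:               # step 4: keep what works
            best, best_score = child, child_score
    return best, best_score


def toy_mutate(program: Program, rng: random.Random) -> Program:
    """Nudge one randomly chosen entry up or down by 1 (returns a copy)."""
    child = list(program)
    index = rng.randrange(len(child))
    child[index] += rng.choice([-1, 1])
    return child


def toy_evaluate(program: Program) -> float:
    """Toy objective: higher is better, with the optimum at all entries equal to 7."""
    return -float(sum(abs(value - 7) for value in program))


seed_program = [0, 0, 0]
final_program, final_score = evolve(
    seed_program, toy_mutate, toy_evaluate, iterations=500, rng=random.Random(0)
)
print(final_program, final_score)
```

In this sketch the mutation operator is a random nudge. The frameworks discussed here differ in what fills that slot: an LLM call in one case, a coding-agent session in the other.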

Existing frameworks in this family (such as OpenEvolve, AlphaEvolve, GEPA and ADRS) use an LLM as the component that proposes each change (the "mutation operator"): Feed in the parent program, ask for a better version. But an LLM alone in that role has three critical gaps:

  1. No interactivity: It proposes text diffs but cannot interact with a live artifact to verify a change actually works.
  2. Shallow domain knowledge: It knows the environment the way it knows anything from training data: surface-level, sometimes stale and without built-in validation.
  3. No self-validation: Every unvalidated mutation burns an expensive eval cycle, and without targeted hypotheses the search wastes budget on low-value changes.

CoCoEvolve replaces the LLM with Snowflake CoCo, Snowflake's AI coding agent, which addresses all three:

  1. Interact with live objects: CoCoEvolve's mutations are not text diffs, but are sequences of executed, verified actions against a real Snowflake account: inspecting an agent's configuration, querying tables, creating stored procedures and confirming the change works before returning it to the harness.
  2. Self-test before consuming an expensive eval: By pre-testing candidates against the target question before delivery, the agent identifies and eliminates obvious regressions early. In our internal testing on the agent optimization example, this approach alone reduced the need for full evaluations by 34%. Additionally, since failing candidates are often only slightly off-target, CoCoEvolve can gracefully diagnose and recover these iterations, preventing outright failure.
  3. Apply verified, environment-specific skills: CoCo ships with packaged workflows that encode expert knowledge (e.g., how to author a semantic view, modify agents, inspect query history). Each skill includes its own validation step, so every mutation entering the population has already been tested.

In practice, this means an engineer can point CoCoEvolve at an AI artifact and an eval set, start it and walk away. CoCoEvolve also tracks performance per question rather than reducing everything to a single accuracy score. This catches regressions automatically and focuses effort where it will actually move the needle.
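
To see why per-question tracking matters, consider two candidates with identical accuracy. The sketch below is illustrative only; `find_regressions`, `accuracy` and the question IDs are made up for the example.

```python
from typing import Dict, List


def accuracy(results: Dict[str, bool]) -> float:
    """Fraction of questions passed: the single scalar a naive harness would track."""
    return sum(results.values()) / len(results)


def find_regressions(parent: Dict[str, bool], child: Dict[str, bool]) -> List[str]:
    """Questions the parent passed but the child fails."""
    return sorted(
        question
        for question, passed in parent.items()
        if passed and not child.get(question, False)
    )


parent_results = {"q1": True, "q2": True, "q3": False, "q4": False}
child_results = {"q1": True, "q2": False, "q3": True, "q4": False}

# Both candidates score 0.5, so the scalar hides the trade between q2 and q3.
print(accuracy(parent_results), accuracy(child_results))

# The per-question view shows that q2 regressed.
print(find_regressions(parent_results, child_results))
```

A harness that keeps the per-question view can reject the child, or keep both candidates and later try to combine what each one gets right.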

Benchmarks by AI artifact

We tested CoCoEvolve against three benchmarks spanning different artifact types to validate that the approach generalizes.

Optimizing Cortex Agents on DABStep

DABStep is a 450-question public benchmark for data agents, built by Adyen and Hugging Face on a payments/fintech domain. When the benchmark was published, even the best frontier models stayed below 20% on the hard set: o3-mini at 16%, DeepSeek R1 at 13%, Claude 3.5 Sonnet at 12%. Several reasoning models scored 0% out of the box without a custom prompt. Questions range from straightforward aggregations to multistep reasoning across a 138k-row transaction ledger, 1,000 fee-rule structures and a domain manual, where the human reference solution runs 220 lines of code across four sequential reasoning steps.

We took a stock Cortex Agent using claude-sonnet-4-6 generated by Snowflake CoCo (empty instructions, default semantic view, default tools) and pointed CoCoEvolve at it. The "program" being evolved is the agent itself. Each iteration:

  1. The graph picks a set of high-value target questions.
  2. An ensemble of CoCo instances, equipped with skills, mutates a parent agent to produce a child.
  3. The child is gated against the targeted questions and evaluated against a regression and generalization check.
  4. Survivors enter the archive and feed the next generation.

Approach                        Score on DABStep Hard
Stock Cortex Agent (baseline)   22.0%
Snowflake CoCo only             35.2%
LLM-based OpenEvolve            45.5%
CoCoEvolve                      89.9%

Using CoCoEvolve optimization, the evaluated Cortex Agent configuration achieved one of the top reported DABStep Hard benchmark results at the time of testing.

Other leaderboard approaches are bespoke, multimodel pipelines designed by hand for this specific benchmark, often with custom code interpreters running alongside the LLM.

CoCoEvolve uses no benchmark-specific engineering: The agent does everything through its own tool calls, and the harness is universal enough to optimize any artifact with an eval set.

CoCoEvolve built entirely new capabilities

In our testing, the LLM-based OpenEvolve approach discovered only surface-level instruction tuning, whereas CoCoEvolve uses CoCo to perform structural mutations. As a result, the agent gains more than a refined prompt. It evolves entirely new, functional tools and capabilities:

  • New verified queries that encode domain-specific reasoning paths.
  • New stored procedures that pre-encode common reasoning patterns (e.g., a MERCHANT_PERIOD_FEES UDF that lets the agent ask one question instead of doing arithmetic across 26 tool calls).
  • New Dynamic Tables that pre-compute frequently-joined columns, eliminating flaky on-the-fly joins.
  • Refined orchestration and response guidelines that encode lessons from observed failure modes as tight, situational guardrails.

We never told the system to make any of these changes. They are the kind of fixes an expert in modifying Cortex Agents would propose after a week of failure analysis. CoCoEvolve produced them in hours, and the harness logs a per-question audit trail showing exactly how it arrived at each one.

Optimizing data pipelines on dbt-bench

Here the "program" being evolved is a dbt project: a directed acyclic graph of SQL transformations, tests and dependencies.

We ran CoCoEvolve against dbt-bench, an internal benchmark that measures whether an automated system can author and maintain a dbt project end-to-end (models, tests, refs, lineage). CoCoEvolve evolves the meta-prompt that generates task-specific knowledge graphs (KGs). It iteratively mutates the prompt's extraction guidance, structural constraints, and quality bars, so the KGs it produces give the coding agent better grounding on schemas, column names, join keys, and business rules.

Configuration         Score on dbt-bench tasks
Baseline              26/82 (31.7%)
With CoCoEvolve KGs   33/82 (40.2%)

On the standard dbt-bench tasks, our initial experiments showed an 8.5-percentage-point improvement in pass rate over the baseline without knowledge graphs.
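
The dbt-bench figures can be checked directly from the task counts. The snippet below separates the absolute change (percentage points) from the relative change; the variable names are ours.

```python
baseline_passed, optimized_passed, total_tasks = 26, 33, 82

baseline_rate = baseline_passed / total_tasks
optimized_rate = optimized_passed / total_tasks

# Absolute change in pass rate, in percentage points.
absolute_gain_pp = (optimized_rate - baseline_rate) * 100
# Relative change in the number of passing tasks.
relative_gain_pct = (optimized_passed - baseline_passed) / baseline_passed * 100

print(f"baseline:      {baseline_rate:.1%}")            # 31.7%
print(f"optimized:     {optimized_rate:.1%}")           # 40.2%
print(f"absolute gain: {absolute_gain_pp:.1f} pp")      # 8.5 pp
print(f"relative gain: {relative_gain_pct:.1f}%")       # 26.9%
```

Seven additional tasks pass out of 82, which is 8.5 percentage points in absolute terms and about 27% more passing tasks than the baseline.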

Qualitatively, the prompt specializes as it evolves:

  • It replaces abstract extraction guidance with a structured pre-flight checklist.
  • It adds explicit schema-routing verification after discovering that some failures come from models landing in the wrong schema.
  • It learns to suppress generic infrastructure boilerplate that diluted attention from task-specific guidance driving accuracy.

This is the same harness that optimized a Cortex Agent, with no dbt-specific modifications, applied to a graph of SQL transformations with interdependent tests and lineage. We take that as evidence that CoCoEvolve works across artifact types, rather than being limited to a single benchmark.

Optimizing AI Functions on Personally Identifiable Information (PII) Redaction

Snowflake Cortex AI Functions (AI_FILTER, AI_AGG, AI_CLASSIFY, etc.) are a third natural target. The artifact being evolved is the function body, which includes prompts, model choices, and SQL scripts for preprocessing and postprocessing.

AI Functions face the same optimization challenge as agents and dbt projects: their small parameterization (prompts, model choices, and SQL scripts) must be tested across a large evaluation surface, such as a real production use case with massive datasets. This often makes manual tuning directionless and difficult to scale. CoCoEvolve's combined approach of targeted mutation and stratified evaluation makes the search over function-body candidates significantly more efficient, which makes optimization feasible at scale.

When applied to a PII redaction dataset by ai4privacy, CoCoEvolve took a basic AI function using claude-haiku-4-5 scoring 49.4% and boosted it to 90.7% by developing a two-step extraction → substitution pipeline. This significant performance leap stems in part from the CoCo mutator's feedback loop, which enables it to run AI Functions and perform real-time error correction during the mutation process.

Approach                              Accuracy on PII dataset
AI Function (baseline)                49.4%
AI Function optimized by CoCoEvolve   90.7%
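
The post does not show the evolved function, so the following is only a structural sketch of what a two-step extraction → substitution pipeline can look like. Every name here is hypothetical, and the regular expressions merely stand in for the model call that an AI Function would make in the extraction step, so that the example runs on its own.

```python
import re
from typing import List, Tuple

# Stand-in detectors (assumption): in an AI Function, step 1 would be a model
# call that returns the PII spans; regexes are used only to keep this runnable.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """Step 1 (extraction): list (label, span) pairs found in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


def substitute_entities(text: str, entities: List[Tuple[str, str]]) -> str:
    """Step 2 (substitution): replace each extracted span with its label."""
    # Longer spans go first so a short span cannot split a longer one.
    for label, span in sorted(entities, key=lambda e: len(e[1]), reverse=True):
        text = text.replace(span, f"[{label}]")
    return text


def redact(text: str) -> str:
    return substitute_entities(text, extract_entities(text))


sample = "Contact Ana at ana.silva@example.com or +1 415-555-0100."
print(redact(sample))  # Contact Ana at [EMAIL] or [PHONE].
```

Separating the steps means the extraction output can be inspected and scored on its own, which is one plausible reason a two-step design is easier to debug than a single "redact this text" prompt.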

In other common AI Function use cases, such as sentiment analysis, spam detection, insurance routing, and content moderation, CoCoEvolve consistently discovers higher-quality function implementations. This provides strong evidence of CoCoEvolve's generalizability for optimizing AI Functions.

The math behind how CoCoEvolve decides what to optimize next

The core loop (propose a change, evaluate it, keep what works) runs continuously, but each coding-agent session takes minutes and a full eval pass can take hours. When the benchmark has hundreds of failing cases, deciding where to spend that time matters as much as the mutation itself.

CoCoEvolve addresses this challenge with two mechanisms:

1. Per-question fitness, propagated across a similarity manifold. Instead of compressing performance into a single scalar, every question in the benchmark maintains a pass probability. We pre-compute a similarity matrix over the questions and propagate observed pass/fail signals from tested questions to untested ones. The fitness function is a graph-weighted lower bound on this surface.

2. Information-gain target selection with conditional decay. Each iteration picks the question whose solution would yield the largest expected lift in fitness, modulated by a decay factor that suppresses questions that have repeatedly resisted the search. The decay resets the moment any candidate in the population cracks the question, at which point that candidate's configuration is injected into the next mutation as a "donor." This is what lets a single solution propagate quickly through the entire population.

The combined effect is that CoCoEvolve spends its coding-agent budget on the questions where evidence suggests it can actually make progress, rather than re-attacking impossible frontiers or re-solving easy ones.
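
The post gives no formulas for these two mechanisms, so the sketch below is one possible reading and not CoCoEvolve's actual math. It assumes that observed results spread to other questions in proportion to similarity on top of a weak prior, that the lower bound is the mean minus one standard error, and that "expected lift" can be approximated by the fitness gain if a question flipped to a pass. All function names and constants are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Similarity = List[List[float]]  # similarity[i][j] in [0, 1], 1.0 on the diagonal


def propagate(
    similarity: Similarity,
    observed: Dict[int, bool],
    prior: float = 0.5,
    prior_weight: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return (pass probability, evidence weight) for every question."""
    estimates = []
    for row in similarity:
        weight = prior_weight + sum(row[j] for j in observed)
        passes = prior * prior_weight + sum(row[j] for j, ok in observed.items() if ok)
        estimates.append((passes / weight, weight))
    return estimates


def lower_bound_fitness(estimates: List[Tuple[float, float]], z: float = 1.0) -> float:
    """Average of per-question lower bounds (mean minus z standard errors)."""
    bounds = [
        max(0.0, mean - z * math.sqrt(mean * (1.0 - mean) / weight))
        for mean, weight in estimates
    ]
    return sum(bounds) / len(bounds)


def lift_if_solved(similarity: Similarity, observed: Dict[int, bool], question: int) -> float:
    """Fitness gain if `question` flipped to a pass (a proxy for expected lift)."""
    solved = {**observed, question: True}
    before = lower_bound_fitness(propagate(similarity, observed))
    return lower_bound_fitness(propagate(similarity, solved)) - before


def select_target(
    similarity: Similarity,
    observed: Dict[int, bool],
    failed_attempts: Dict[int, int],
    decay: float = 0.5,
) -> int:
    """Pick the failing question with the largest decayed lift."""
    failing = [q for q, ok in observed.items() if not ok]
    return max(
        failing,
        key=lambda q: lift_if_solved(similarity, observed, q)
        * decay ** failed_attempts.get(q, 0),
    )


similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
observed = {0: True, 1: False, 2: False}

first_pick = select_target(similarity, observed, failed_attempts={})
pick_after_failure = select_target(similarity, observed, failed_attempts={1: 1})
print(first_pick, pick_after_failure)
```

With these toy numbers the search first targets question 1, and after one failed attempt on it the decay shifts attention to question 2. Resetting a question's counter to zero once any candidate solves it would correspond to the decay reset described above. Donor injection is not modeled here.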

What one iteration looks like in practice:

  1. The harness selects a failing question and picks a parent candidate that passes semantically similar questions.
  2. CoCo receives both, proposes a change and tests it against the target question.
  3. If it passes, the harness runs a regression check on previously-passing questions.
  4. Survivors enter the population, and multiple iterations run in parallel.
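
The four steps above can be sketched as a single function. This is a toy model under stated assumptions: a candidate is represented only by the set of question IDs it passes, `propose` stands in for a coding-agent session, and `pick_parent` and `run_iteration` are hypothetical names. Parallel execution is left out.

```python
from typing import Callable, FrozenSet, List, Optional, Set

Candidate = FrozenSet[str]  # toy stand-in: the question IDs a candidate passes


def pick_parent(population: List[Candidate], similar_questions: Set[str]) -> Candidate:
    """Choose the candidate that passes the most questions similar to the target."""
    return max(population, key=lambda candidate: len(candidate & similar_questions))


def run_iteration(
    population: List[Candidate],
    target: str,
    similar_questions: Set[str],
    propose: Callable[[Candidate, str], Candidate],
) -> Optional[Candidate]:
    parent = pick_parent(population, similar_questions)  # step 1: pick a parent
    child = propose(parent, target)                      # step 2: propose and test
    if target not in child:                              # gate on the target question
        return None
    if not parent <= child:                              # step 3: regression check
        return None
    population.append(child)                             # step 4: survivor joins
    return child


population: List[Candidate] = [frozenset({"q1"}), frozenset({"q1", "q2"})]

# A well-behaved mutation: fixes the target and keeps everything else passing.
accepted = run_iteration(
    population, "q3", {"q2"}, lambda parent, target: parent | {target}
)

# A mutation that fixes the target but breaks previously passing questions.
rejected = run_iteration(
    population, "q4", {"q3"}, lambda parent, target: frozenset({target})
)

print(sorted(accepted), rejected, len(population))
```

The first proposal survives both checks and joins the population. The second passes its target but fails the regression check, so it is discarded and the population is unchanged.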

Concluding thoughts

Our research on CoCoEvolve introduces a new technical idea: powering evolutionary optimization techniques with a coding agent rather than an LLM can lead to substantial gains in optimization results. Our empirical results on three case studies of AI systems (improving Cortex Agent performance on the DABStep benchmark, dbt pipelines on an internal data engineering benchmark, and an AI Function for PII redaction on a public dataset) support the thesis that this is a promising approach for optimizing production AI systems. Stay tuned for more research results in this space.

1. Although this score would rank second on the public DABStep leaderboard as of May 2026, it remains unvalidated despite our efforts to contact the DABStep team.
