Self-Improving Agents with CoCo
Overview
Building AI agents is just the beginning — understanding how well they perform and systematically improving them is what separates prototypes from production systems. In this guide, you'll build a marketing analytics agent, deploy it to production, stress-test it with hard queries, then use Snowflake CoCo to mine failures from logs, evaluate with Agent GPA, and optimize the agent's instructions.
By the end, you'll have an agent with measurably better performance — and a repeatable workflow for continuous improvement.
| Step | What You'll Do |
|---|---|
| Setup | Deploy a production agent with 5 tools |
| Stress Test | Run hard queries in Snowflake CoWork to generate failure traces |
| Install Snowflake CoCo | Install the CLI while traces propagate (~10 min) |
| Evaluate | Mine logs, curate an eval dataset, run Agent GPA baseline |
| Optimize | Analyze failures, generate improved instructions, validate with a second eval |
Architecture
The MARKETING CAMPAIGNS AGENT exposes five tools:
- Tool 1: query_performance_metrics (Cortex Analyst)
- Tool 2: search_campaign_content (Cortex Search)
- Tool 3: generate_campaign_report (Stored Proc)
- Tool 4: web_search (Web Search)
- Tool 5: data_to_chart (Visualization)
The agent feeds an improvement loop, Evaluate (Agent GPA) → Analyze (failures) → Optimize (AI-driven), and the optimized result flows back into the agent.
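The Evaluate, Analyze, Optimize loop in the architecture can be sketched as a small control loop. This is a conceptual illustration only. `evaluate`, `analyze`, and `optimize` are hypothetical Python callables that stand in for the Snowflake CoCo prompts used later in this guide. Averaging all metrics into one keep/discard score is a simplifying assumption.

```python
from collections.abc import Callable

# Hypothetical stand-ins for the three loop stages. In this guide each stage is
# a Snowflake CoCo prompt; here they are plain Python callables for illustration.
Scores = dict[str, float]                   # metric name -> score on a 0-1 scale
Evaluate = Callable[[str], Scores]          # instructions -> scores
Analyze = Callable[[Scores], list[str]]     # scores -> failure notes
Optimize = Callable[[str, list[str]], str]  # (instructions, failures) -> new instructions


def mean_score(scores: Scores) -> float:
    """Average all metric scores into one number for a keep/discard decision."""
    return sum(scores.values()) / len(scores)


def improvement_loop(
    instructions: str,
    evaluate: Evaluate,
    analyze: Analyze,
    optimize: Optimize,
    max_rounds: int = 3,
) -> tuple[str, float]:
    """Evaluate -> analyze -> optimize, keeping a candidate only if it scores higher."""
    best, best_scores = instructions, evaluate(instructions)
    for _ in range(max_rounds):
        failures = analyze(best_scores)
        if not failures:
            break  # nothing left to fix
        candidate = optimize(best, failures)
        candidate_scores = evaluate(candidate)
        if mean_score(candidate_scores) <= mean_score(best_scores):
            break  # no measurable improvement, keep the previous version
        best, best_scores = candidate, candidate_scores
    return best, mean_score(best_scores)
```

This guide runs a single optimize round and then validates it. The `max_rounds` parameter only shows how the same loop could repeat.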
What You'll Learn
- How to build a Cortex Agent with multiple tool types (Cortex Analyst, Cortex Search, stored procedures, web search, data-to-chart)
- How to use Snowflake CoWork to interact with your agent and generate observability traces
- How to use Snowflake CoCo to mine agent logs and curate evaluation datasets
- How to run Agent GPA evaluations with built-in metrics
- How to analyze failure patterns and generate improved orchestration instructions
- How to validate improvements by comparing evaluation scores across agent iterations
What You'll Build
A complete agent optimization workflow:
- A marketing campaigns agent with 5 tools
- An evaluation dataset curated from real agent interaction logs
- An optimized agent with improved orchestration instructions
- Before/after evaluation results demonstrating measurable improvement
What You'll Need
- A Snowflake account with ACCOUNTADMIN access
- Cross-region inference enabled (required for evaluation LLM judge models)
- About 5 minutes for the setup script to complete
Prerequisites
- Basic familiarity with Snowflake SQL and Cortex Agents
Run Setup
Download the setup.sql file from the repository.
Open a Snowflake worksheet in Snowsight and run the entire setup.sql file. This creates:
- Database SELF_IMPROVING_AGENT_DB with schema AGENTS
- 4 data tables (25 campaigns, ~1578 performance records, content, feedback)
- Semantic view, Cortex Search service, and report generation procedure
- The production agent MARKETING_CAMPAIGNS_AGENT
Verify setup succeeded — the final statement should print a success banner.
Stress-Test the Agent in Snowflake CoWork
Open the agent in Snowflake CoWork:
- In Snowsight, select AI & ML > Agents
- Select MARKETING_CAMPAIGNS_AGENT
- Query your newly created agent in the Agent Admin View, or click Preview to open it in Snowflake CoWork for the full front-end experience
- Alternatively, go to ai.snowflake.com
Your goal is to generate a mix of successful and failing traces by asking progressively harder questions.
Note: If you'd prefer a quicker, less interactive route, you can run agent_requests.sql instead. While the script executes (~3-5 minutes), visit the AI & ML > Agents > MARKETING_CAMPAIGNS_AGENT > Monitoring tab to inspect traces of all the calls made to your agent.
Otherwise, copy and paste these queries into the agent one at a time:
Simple queries
What is the total spend across all campaigns?
Generate a report for our holiday gift guide
What content format generated the most revenue per dollar spent?
Compare the A/B test performance for our email vs social media campaigns
Multi-tool queries
Which campaign had the highest ROI and what did customers say about it? Generate a report for that campaign too.
Which audience segment responded best to our promotions and what was their average spend?
Complex synthesis queries
Which campaigns had the best A/B test lift but the worst customer sentiment?
Show me campaigns where customer sentiment was negative but ROI was still positive — what made them work financially?
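If you script the stress test instead of pasting queries by hand, keep the tiers explicit so each trace can be tied back to its difficulty. The Python sketch below does this under one assumption: you supply your own way of sending requests, and the `print` call is only a placeholder. It is not a reconstruction of agent_requests.sql.

```python
# The stress-test queries above, grouped by tier in increasing difficulty.
STRESS_QUERIES: dict[str, list[str]] = {
    "simple": [
        "What is the total spend across all campaigns?",
        "Generate a report for our holiday gift guide",
        "What content format generated the most revenue per dollar spent?",
        "Compare the A/B test performance for our email vs social media campaigns",
    ],
    "multi_tool": [
        "Which campaign had the highest ROI and what did customers say about it? "
        "Generate a report for that campaign too.",
        "Which audience segment responded best to our promotions and what was their average spend?",
    ],
    "complex_synthesis": [
        "Which campaigns had the best A/B test lift but the worst customer sentiment?",
        "Show me campaigns where customer sentiment was negative but ROI was still "
        "positive — what made them work financially?",
    ],
}


def ordered_queries(tiers: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flatten tiers into (tier, query) pairs; dicts preserve insertion order."""
    return [(tier, query) for tier, queries in tiers.items() for query in queries]


if __name__ == "__main__":
    for tier, query in ordered_queries(STRESS_QUERIES):
        # Placeholder: send `query` to the agent with whatever client you use.
        print(f"[{tier}] {query}")
```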
Note: It may take a few minutes for agent interaction traces to appear in the observability logs. If you just finished running the queries above, now is a good time to install Snowflake CoCo (next step) while the traces propagate.
Install Snowflake CoCo
Snowflake CoCo is an AI-powered CLI that you'll use to mine agent logs, run evaluations, analyze failures, and generate improved agent instructions.
Linux / macOS / WSL:
curl -LsS https://ai.snowflake.com/static/cc-scripts/install.sh | sh
Windows (PowerShell):
irm https://ai.snowflake.com/static/cc-scripts/install.ps1 | iex
After installation, run cortex to launch the setup wizard — it will guide you through connecting to your Snowflake account.
For detailed setup instructions, see the Snowflake CoCo CLI docs.
Curate an Eval Dataset from Logs
Open Snowflake CoCo and enter /bypass to enable bypass mode, then enter the following prompt:
Use the cortex agent dataset-curation skill to pull all available production traces for SELF_IMPROVING_AGENT_DB.AGENTS.MARKETING_CAMPAIGNS_AGENT and curate an evaluation dataset. Ground truth should always include sections for key figures, curated suggestions and sources referenced and be specific enough to accurately evaluate agent quality. Expected tool invocations are not needed. Store the evalset in SELF_IMPROVING_AGENT_DB.AGENTS and register it as a new evaluation dataset.
Snowflake CoCo will:
- Query the observability traces generated during the stress test
- Help you select and annotate queries with ground truth
- Create an eval table and register it via SYSTEM$CREATE_EVALUATION_DATASET
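The curation prompt asks for ground truth with three sections: key figures, curated suggestions, and sources referenced. Before you register the dataset, a quick check that every row mentions all three can catch incomplete annotations. The sketch below assumes a hypothetical local copy of the eval rows with `query` and `ground_truth` fields. The table CoCo actually creates may use different column names.

```python
from dataclasses import dataclass

# Section names taken from the curation prompt. Matching is a simple
# case-insensitive substring check, not a parser.
REQUIRED_SECTIONS = ("key figures", "curated suggestions", "sources referenced")


@dataclass(frozen=True)
class EvalRecord:
    """One row of a hypothetical local copy of the eval dataset."""
    query: str
    ground_truth: str


def missing_sections(record: EvalRecord) -> list[str]:
    """Return the required sections that the ground truth never mentions."""
    text = record.ground_truth.lower()
    return [section for section in REQUIRED_SECTIONS if section not in text]


def review(records: list[EvalRecord]) -> dict[str, list[str]]:
    """Map each incomplete query to its missing sections; complete rows are omitted."""
    report = {}
    for record in records:
        missing = missing_sections(record)
        if missing:
            report[record.query] = missing
    return report
```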
Run Baseline Evaluation
Run Agent GPA on your curated dataset. Enter this prompt in Snowflake CoCo:
Run an evaluation for SELF_IMPROVING_AGENT_DB.AGENTS.MARKETING_CAMPAIGNS_AGENT against the registered dataset. Compute Answer Correctness, Logical Consistency, Execution Efficiency, Plan Quality and Plan Adherence as metrics. All metrics should use a 0-1 scale where 1 is optimal behavior. Once the eval completes, show me the evaluation results. Break down scores by metric and identify which queries scored lowest. What are the common failure patterns?
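The prompt asks CoCo for two views of the results: scores broken down by metric, and the lowest-scoring queries. If you export the results, both views are easy to reproduce. This sketch assumes a hypothetical flat shape of `(query, metric, score)` rows on the 0-1 scale requested above. The real result schema may differ.

```python
from collections import defaultdict
from statistics import mean

Row = tuple[str, str, float]  # hypothetical shape: (query, metric, score)


def scores_by_metric(rows: list[Row]) -> dict[str, float]:
    """Average each metric across all queries."""
    by_metric: dict[str, list[float]] = defaultdict(list)
    for _query, metric, score in rows:
        by_metric[metric].append(score)
    return {metric: mean(values) for metric, values in by_metric.items()}


def lowest_queries(rows: list[Row], n: int = 3) -> list[tuple[str, float]]:
    """Rank queries by their mean score across metrics, lowest first."""
    by_query: dict[str, list[float]] = defaultdict(list)
    for query, _metric, score in rows:
        by_query[query].append(score)
    ranked = sorted(((q, mean(v)) for q, v in by_query.items()), key=lambda pair: pair[1])
    return ranked[:n]
```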
Common failure patterns you'll see:
- Wrong tool selection for multi-tool queries
- Redundant tool calls
- Incomplete summaries missing key data
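Redundant tool calls, the second pattern, can also be spotted mechanically. The sketch below flags a call as redundant when the same tool is invoked more than once with identical arguments within a single response. The `(tool_name, arguments)` shape is a hypothetical simplification of the observability traces, and the argument names in any input are illustrative.

```python
import json
from collections import Counter


def redundant_calls(calls: list[tuple[str, dict]]) -> list[str]:
    """Return tool names invoked more than once with identical arguments."""
    # json.dumps(sort_keys=True) gives a stable, hashable key even for nested arguments.
    counts = Counter((tool, json.dumps(args, sort_keys=True)) for tool, args in calls)
    return sorted({tool for (tool, _args), n in counts.items() if n > 1})
```

Calls to the same tool with different arguments are not flagged, because a multi-part question can legitimately need several lookups.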
Improve the Agent
Enter this prompt in Snowflake CoCo:
Based on the failure analysis, generate improved orchestration and response instructions for SELF_IMPROVING_AGENT_DB.AGENTS.MARKETING_CAMPAIGNS_AGENT that fix the identified issues. The instructions should tell the agent what format to respond in, when to use multiple tools and in what order, and encourage efficient tool calling. Only make updates to the instructions - do not make any updates to the tool configuration or other areas of the agent spec. Apply the changes.
Snowflake CoCo will:
- Draft improved orchestration and response instructions with explicit tool routing rules and response guidelines
- Apply them via ALTER AGENT ... SET SPECIFICATION = ...
What changes: Only the instructions.orchestration and instructions.response fields. Tools, tool_resources, and the orchestration model stay the same, so improved instructions are the only lever.
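You can check the constraint that only the two instruction fields change before you apply the new specification. This minimal sketch assumes the specification is available as a Python dict with top-level keys such as `instructions`, `tools`, and `tool_resources`. That layout is an assumption, and applying the change still goes through ALTER AGENT.

```python
import copy


def with_new_instructions(spec: dict, orchestration: str, response: str) -> dict:
    """Return a copy of the spec with only instructions.orchestration/response replaced."""
    updated = copy.deepcopy(spec)  # never mutate the caller's spec
    instructions = updated.setdefault("instructions", {})
    instructions["orchestration"] = orchestration
    instructions["response"] = response
    return updated


def changed_top_level_keys(before: dict, after: dict) -> set[str]:
    """List top-level keys whose values differ between two specs."""
    return {key for key in before.keys() | after.keys() if before.get(key) != after.get(key)}
```

If `changed_top_level_keys` returns anything other than `{"instructions"}`, the edit touched more than the instructions.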
Validate Agent Improvement
Enter this prompt in Snowflake CoCo:
Run the evaluation of SELF_IMPROVING_AGENT_DB.AGENTS.MARKETING_CAMPAIGNS_AGENT against the same dataset again. Compare the results against the baseline — show me a side-by-side comparison of scores by metric and highlight what improved.
What to look for:
- Overall score improvement: The optimized agent should score higher across the evaluated metrics
- No regressions: The optimized agent should still handle simple queries just as well as the baseline
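You can reproduce CoCo's side-by-side comparison from exported scores. This sketch assumes the baseline and optimized results are hypothetical dicts that map a name to a 0-1 score. Pass per-metric averages for the overall comparison, or per-query scores to check for regressions on simple queries. `tolerance` is an illustrative parameter that keeps tiny score noise from being labeled a change.

```python
def compare(
    baseline: dict[str, float], optimized: dict[str, float], tolerance: float = 0.0
) -> list[tuple[str, float, float, float, str]]:
    """Side-by-side rows of (name, baseline, optimized, delta, status) for shared keys."""
    rows = []
    for name in sorted(baseline.keys() & optimized.keys()):
        delta = optimized[name] - baseline[name]
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        rows.append((name, baseline[name], optimized[name], delta, status))
    return rows


def print_table(rows: list[tuple[str, float, float, float, str]]) -> None:
    """Print the comparison as a fixed-width table."""
    print(f"{'name':<24}{'baseline':>10}{'optimized':>11}{'delta':>8}  status")
    for name, before, after, delta, status in rows:
        print(f"{name:<24}{before:>10.2f}{after:>11.2f}{delta:>+8.2f}  {status}")
```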
Conclusion and Resources
Congratulations! You've built a self-improving AI agent workflow — deploying a production agent, stress-testing it, mining failures from logs, evaluating with Agent GPA, and validating that improved instructions lead to measurably better performance.
What You Learned
- How to build a multi-tool Cortex Agent with Cortex Analyst, Cortex Search, stored procedures, web search, and data-to-chart capabilities
- How to generate observability traces by stress-testing your agent in Snowflake CoWork
- How to use Snowflake CoCo to mine agent logs and curate evaluation datasets
- How to run Agent GPA evaluations with built-in metrics
- How to analyze failure patterns and generate improved orchestration instructions
- How to validate improvements by comparing baseline vs optimized evaluation scores
- That better instructions — not more tools — can be the key lever for agent improvement
Key Concepts
| Concept | Description |
|---|---|
| Agent GPA | Evaluation framework with built-in metrics, such as answer correctness and logical consistency, for scoring agent responses |
| Orchestration Instructions | Natural language instructions telling the agent how to route queries and coordinate tools — the key lever for improvement |
| Eval Dataset | Frozen snapshot of queries + ground truth used to score agent performance |
| Snowflake CoCo | AI-powered CLI that mines agent logs, runs evaluations, identifies failures, and generates improved agent instructions |
Cleanup
USE ROLE ACCOUNTADMIN;
DROP DATABASE IF EXISTS SELF_IMPROVING_AGENT_DB;
DROP ROLE IF EXISTS SELF_IMPROVING_AGENT_ROLE;
This content is provided as is, and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances.