Agent Verbosity Evaluation with Snowflake Cortex AI

Snowflake for Developers GuidesAgent Verbosity Evaluation with Snowflake Cortex AI

Agent Verbosity Evaluation with Snowflake Cortex AI

Priya Joseph

Overview

This guide walks you through building a comprehensive cross-model verbosity evaluation system using Snowflake Cortex REST API. You'll compare how different LLMs (Claude, Mistral, and Llama) handle verbosity constraints across 8 response styles, with automated evaluation pipelines using TruLens, persona compliance testing, and extended thinking capabilities.

The system includes:

8 Verbosity Agents: Minimal, Brief, Standard, Detailed, Verbose, Code-Only, Explain, Step-by-Step
Cross-Model Comparison: Claude Sonnet 4 vs Mistral Large 2 vs Llama 3.1 70B
TruLens Evaluation: LLM-as-Judge with SAE (Sparse Autoencoder) analysis
Persona Compliance: 5th Grade, Scholar, Compute, Business personas
MCP Integration: Wikipedia retrieval and A/B testing framework
Extended Thinking: Claude reasoning traces with RAG

Prerequisites

Snowflake account with ACCOUNTADMIN role
Python 3.9+ installed
Programmatic Access Token (PAT) configured in ~/.snowflake/config.toml
Basic familiarity with Streamlit and Cortex AI

What You'll Learn

How to deploy Cortex Agents with verbosity constraints
Compare model responses across verbosity levels
Implement LLM-as-Judge evaluation with TruLens
Use MCP (Model Context Protocol) for retrieval
Capture extended thinking traces from Claude
Build dbt pipelines for ML feature engineering

What You'll Need

A Snowflake Account
Python 3.9+
Streamlit (pip install streamlit)
PAT token configured for Cortex REST API access

What You'll Build

Cross-model verbosity comparison dashboard
24 Cortex Agents (8 verbosity levels × 3 models)
TruLens evaluation pipeline with SAE analysis
Persona compliance testing system
MCP-based RAG with extended thinking

Architecture

The system architecture consists of multiple components working together:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Agent Verbosity Evaluation System                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CORTEX REST API (/api/v2/cortex/inference:complete)                        │
│  ├── CLAUDE SONNET 4 (8 verbosity agents)                                  │
│  ├── MISTRAL LARGE 2 (8 verbosity agents)                                  │
│  └── LLAMA 3.1 70B (8 verbosity agents)                                    │
│                                                                             │
│  EVALUATION PIPELINES                                                       │
│  ├── TruLens LLM-as-Judge (Mixtral, Arctic)                                │
│  ├── SAE Feature Analysis (Sparse Autoencoder)                             │
│  └── Persona Compliance Scoring                                            │
│                                                                             │
│  MCP SERVERS                                                                │
│  ├── Wikipedia MCP (Port 8503) - Document retrieval                        │
│  └── A/B Testing MCP (Port 8517) - Experiment framework                    │
│                                                                             │
│  DATA PIPELINES                                                             │
│  ├── dbt Models for ML Features                                            │
│  └── Snowflake Tables for Results                                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Verbosity Levels

Level	Max Lines	Description
Minimal	1	Single word/number when possible
Brief	3	No preamble or postamble
Standard	6	Balanced with context
Detailed	15	Full structure: Issue, Location, Risk, Fix
Verbose	50	Comprehensive with edge cases
Code-Only	20	No explanatory text
Explain	20	Why and how behind everything
Step-by-Step	25	Numbered walkthroughs

Environment Setup

Step 1: Install Dependencies

Create a virtual environment and install the required packages:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt

This installs all dependencies including:

Streamlit, Pandas, Altair - Dashboard UI
Snowflake Connector - Database connectivity
LiteLLM - Unified API for OpenAI, Anthropic, Mistral models

Step 2: Configure Snowflake Connection

Connect using Snow CLI for interactive authentication:

snow connection add myaccount
snow connection set-default myaccount
snow connection test myaccount

Alternatively, configure PAT token in ~/.snowflake/config.toml:

[connections.myaccount]
account = "your_account"
user = "your_username"
password = "your_pat_token"
warehouse = "COMPUTE_WH"
database = "CORTEX_DB"
schema = "AGENTS"

Note: If you see "Authentication token has expired" errors, run snow connection test myaccount to refresh the connection.

Troubleshooting: Authentication Token Expired (Error 390114)

If you encounter this error in the dashboard:

390114 (08001): Authentication token has expired. The user must authenticate again.

Solution: Run the following command in your terminal to refresh the authentication token:

snow connection test myaccount

This will re-authenticate with Snowflake and refresh your session token. Then restart the Streamlit dashboard:

streamlit run dashboard.py

The dashboard includes auto-reconnect logic that will attempt to refresh the connection automatically, but if the token is fully expired, a manual refresh via Snow CLI is required.

Step 3: Verify Cortex Access

import requests
import tomllib
import os

def get_pat_from_toml(connection_name: str = "myaccount") -> str:
    """Read PAT token from ~/.snowflake/config.toml"""
    toml_path = os.path.expanduser("~/.snowflake/config.toml")
    with open(toml_path, "rb") as f:
        config = tomllib.load(f)
    return config["connections"][connection_name].get("password", "")

# Test Cortex REST API
pat = get_pat_from_toml()
url = "https://your_account.snowflakecomputing.com/api/v2/cortex/inference:complete"
headers = {"Authorization": f"Bearer {pat}", "Content-Type": "application/json"}
payload = {"model": "claude-sonnet-4-5", "messages": [{"role": "user", "content": "Hello"}]}

response = requests.post(url, headers=headers, json=payload)
print(f"Status: {response.status_code}")

Deploy Cortex Agents

Deploy the 24 verbosity-constrained agents using Snowflake SQL:

Claude Agents

-- CLAUDE MINIMAL
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_MINIMAL_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You provide absolute minimum responses.
- 1 line maximum
- Single word/number when possible
- No punctuation unless required
- No formatting
$$;

-- CLAUDE BRIEF
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_BRIEF_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You provide brief, direct responses.
- Maximum 3 lines
- No preamble or postamble
- Essential information only
- Code snippets without explanation
$$;

-- CLAUDE STANDARD
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_STANDARD_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You provide balanced responses with appropriate detail.
- 3-6 lines typical
- Include context when helpful
- Brief explanation with code
- Skip obvious details
$$;

-- CLAUDE DETAILED
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_DETAILED_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You provide detailed responses with full context.
Structure: Issue → Location → Risk → Fix → Related
$$;

-- CLAUDE VERBOSE
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_VERBOSE_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You provide comprehensive, educational responses.
Include: Context, Problem, Vulnerable Code, Secure Alternatives, 
Why Fix Works, Edge Cases, Related Patterns, References.
$$;

-- CLAUDE CODE ONLY
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_CODE_ONLY_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You respond with code only, no prose.
- Only output code blocks
- No explanations before or after
- Comments only if essential for understanding
$$;

-- CLAUDE EXPLAIN
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_EXPLAIN_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You explain the why and how behind everything.
- Start with WHY something matters
- Explain HOW to implement
- Connect to broader concepts
$$;

-- CLAUDE STEP BY STEP
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.CLAUDE_STEP_BY_STEP_AGENT
  MODEL = 'claude-sonnet-4-5'
  PROMPT = $$
You provide numbered, sequential walkthroughs.
Format: 1. First step 2. Second step...
Each step should be actionable and clear.
$$;

Mistral Agents

-- MISTRAL MINIMAL
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.MISTRAL_MINIMAL_AGENT
  MODEL = 'mistral-large2'
  PROMPT = $$
You provide absolute minimum responses.
- 1 line maximum
- Single word/number when possible
$$;

-- MISTRAL BRIEF
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.MISTRAL_BRIEF_AGENT
  MODEL = 'mistral-large2'
  PROMPT = $$
You provide brief, direct responses.
- Maximum 3 lines
- No preamble or postamble
$$;

-- Continue for remaining 6 Mistral agents...

Llama Agents

-- LLAMA MINIMAL
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.LLAMA_MINIMAL_AGENT
  MODEL = 'llama3.1-70b'
  PROMPT = $$
You provide absolute minimum responses.
- 1 line maximum
- Single word/number when possible
$$;

-- LLAMA BRIEF
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.LLAMA_BRIEF_AGENT
  MODEL = 'llama3.1-70b'
  PROMPT = $$
You provide brief, direct responses.
- Maximum 3 lines
- No preamble or postamble
$$;

-- LLAMA STANDARD
CREATE OR REPLACE CORTEX AGENT CORTEX_DB.AGENTS.LLAMA_STANDARD_AGENT
  MODEL = 'llama3.1-70b'
  PROMPT = $$
You provide balanced responses with appropriate detail.
- 3-6 lines typical
- Include context when helpful
$$;

-- Continue for remaining 5 Llama agents...

Run Verbosity Comparison

Streamlit Dashboard

Create the comparison dashboard that tests all three models across all verbosity levels:

import streamlit as st
import pandas as pd
import requests
import tomllib
import os
from dataclasses import dataclass
from typing import Optional

st.set_page_config(page_title="Model Comparison Dashboard", layout="wide")

# ============================================================
# CONFIG-DRIVEN MODEL CONFIGURATION
# ============================================================
@dataclass
class ModelConfig:
    """Configuration for a Cortex model."""
    name: str
    model_id: str
    supports_thinking: bool = False
    max_tokens: int = 4096

# Load models from config - easily extensible
MODEL_CONFIGS = {
    "claude": ModelConfig("Claude Sonnet 4", "claude-sonnet-4-5", supports_thinking=True),
    "mistral": ModelConfig("Mistral Large 2", "mistral-large2"),
    "llama": ModelConfig("Llama 3.1 70B", "llama3.1-70b"),
}

# ============================================================
# CORTEX REST API CLIENT
# ============================================================
class CortexRESTClient:
    """Unified client for Snowflake Cortex REST API."""
    
    def __init__(self, connection_name: str = "myaccount"):
        self.connection_name = connection_name
        self._load_credentials()
    
    def _load_credentials(self):
        """Load PAT from ~/.snowflake/config.toml"""
        toml_path = os.path.expanduser("~/.snowflake/config.toml")
        with open(toml_path, "rb") as f:
            config = tomllib.load(f)
        conn = config["connections"][self.connection_name]
        self.account = conn.get("account", "")
        self.pat = conn.get("password", "")  # PAT stored as password
        self.base_url = f"https://{self.account}.snowflakecomputing.com"
    
    def complete(self, model: str, messages: list, **kwargs) -> dict:
        """Call Cortex REST API /api/v2/cortex/inference:complete"""
        url = f"{self.base_url}/api/v2/cortex/inference:complete"
        headers = {
            "Authorization": f"Bearer {self.pat}",
            "Content-Type": "application/json"
        }
        payload = {"model": model, "messages": messages, **kwargs}
        
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()
    
    def chat_completions(self, model: str, messages: list, **kwargs) -> dict:
        """Call Chat Completions API /api/v2/cortex/chat/completions"""
        url = f"{self.base_url}/api/v2/cortex/chat/completions"
        headers = {
            "Authorization": f"Bearer {self.pat}",
            "Content-Type": "application/json"
        }
        payload = {"model": model, "messages": messages, **kwargs}
        
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()
    
    def chat_completions_with_thinking(self, model: str, messages: list, 
                                        thinking_budget: int = 8000) -> dict:
        """Call Chat Completions API with extended thinking enabled."""
        url = f"{self.base_url}/api/v2/cortex/chat/completions"
        headers = {
            "Authorization": f"Bearer {self.pat}",
            "Content-Type": "application/json"
        }
        payload = {
            "model": model,
            "messages": messages,
            "temperature": 1.0,  # Required for thinking
            "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
            "stream": True
        }
        
        response = requests.post(url, headers=headers, json=payload, stream=True)
        return self._parse_streaming_response(response)
    
    def _parse_streaming_response(self, response) -> dict:
        """Parse SSE streaming response with thinking blocks."""
        thinking_content = ""
        response_content = ""
        usage = {}
        
        for line in response.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            data = json.loads(line[6:])
            
            if data.get("type") == "content_block_delta":
                delta = data.get("delta", {})
                if delta.get("type") == "thinking_delta":
                    thinking_content += delta.get("thinking", "")
                elif delta.get("type") == "text_delta":
                    response_content += delta.get("text", "")
            
            if "usage" in data:
                usage = data["usage"]
        
        return {
            "response": response_content,
            "thinking": thinking_content,
            "usage": usage
        }

# Initialize client
client = CortexRESTClient()

# ============================================================
# VERBOSITY CONFIGURATION
# ============================================================
VERBOSITY_PROMPTS = {
    "minimal": "1 line maximum. Single word/number when possible.",
    "brief": "Maximum 3 lines. No preamble or postamble.",
    "standard": "3-6 lines typical. Include context when helpful.",
    "detailed": "Full context with structure: Issue, Location, Risk, Fix.",
    "verbose": "Comprehensive with background, options, edge cases.",
    "code_only": "Return ONLY code blocks. No explanatory text.",
    "explain": "Explain the why and how behind everything.",
    "step_by_step": "Numbered, sequential walkthroughs."
}

VERBOSITY_MAX_LINES = {
    "minimal": 1, "brief": 3, "standard": 6, "detailed": 15,
    "verbose": 50, "code_only": 20, "explain": 20, "step_by_step": 25
}

TEST_QUERIES = [
    {"id": "Q1", "category": "factual", "text": "What is SQL injection?"},
    {"id": "Q2", "category": "code_fix", "text": "Fix: session.sql(f\"SELECT * FROM users WHERE id={user_id}\")"},
    {"id": "Q3", "category": "explanation", "text": "Why is parameterized SQL safer?"},
    {"id": "Q4", "category": "binary", "text": "Is eval(user_input) safe in Python?"},
]

def call_model(model_key: str, verbosity: str, query: str) -> dict:
    """Call Cortex model via REST API with verbosity constraints."""
    config = MODEL_CONFIGS[model_key]
    system_prompt = f"You provide {verbosity} responses. RULES: {VERBOSITY_PROMPTS[verbosity]}"
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    
    # Use Cortex REST API
    result = client.complete(config.model_id, messages)
    
    answer = result.get("choices", [{}])[0].get("message", {}).get("content", "")
    usage = result.get("usage", {})
    
    return {
        "response": answer,
        "line_count": len(answer.strip().split("\n")),
        "word_count": len(answer.split()),
        "compliant": len(answer.strip().split("\n")) <= VERBOSITY_MAX_LINES[verbosity],
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0)
    }

# ============================================================
# DASHBOARD UI
# ============================================================
st.title("Cross-Model Verbosity Comparison")

# Dynamic model selection from config
model_names = [f"**{c.name}**" for c in MODEL_CONFIGS.values()]
st.markdown(f"Compare {' vs '.join(model_names)} across verbosity levels using Cortex REST API")

# Model selection (config-driven)
selected_models = st.multiselect(
    "Select models to compare",
    list(MODEL_CONFIGS.keys()),
    default=list(MODEL_CONFIGS.keys()),
    format_func=lambda k: MODEL_CONFIGS[k].name
)

selected_verbosities = st.multiselect(
    "Select verbosity levels",
    list(VERBOSITY_PROMPTS.keys()),
    default=["minimal", "brief", "standard"]
)

if st.button("Run Comparison", type="primary"):
    results = []
    progress = st.progress(0)
    total = len(TEST_QUERIES) * len(selected_verbosities) * len(selected_models)
    i = 0
    
    for query in TEST_QUERIES:
        for verbosity in selected_verbosities:
            row = {"query_id": query["id"], "verbosity": verbosity}
            
            for model_key in selected_models:
                result = call_model(model_key, verbosity, query["text"])
                row[f"{model_key}_lines"] = result["line_count"]
                row[f"{model_key}_compliant"] = result["compliant"]
                row[f"{model_key}_tokens"] = result["prompt_tokens"] + result["completion_tokens"]
                i += 1
                progress.progress(i / total)
            
            results.append(row)
    
    df = pd.DataFrame(results)
    st.dataframe(df, use_container_width=True)
    
    # Compliance summary
    st.subheader("Compliance Summary")
    for model_key in selected_models:
        compliance_rate = df[f"{model_key}_compliant"].mean() * 100
        st.metric(MODEL_CONFIGS[model_key].name, f"{compliance_rate:.1f}%")

Wikipedia Q&A with Vine Copula

The system generates synthetic Q&A pairs from Wikipedia articles using Vine Copula modeling for statistical diversity.

MCP Wikipedia Server

# wiki_mcp_server.py
from fastapi import FastAPI
import wikipedia

app = FastAPI()

@app.get("/search")
async def search(query: str, limit: int = 3):
    """Search Wikipedia articles."""
    results = wikipedia.search(query, results=limit)
    return {"articles": results}

@app.get("/content")
async def get_content(title: str):
    """Get article content."""
    try:
        page = wikipedia.page(title)
        return {
            "title": page.title,
            "content": page.content[:5000],
            "summary": page.summary
        }
    except Exception as e:
        return {"error": str(e)}

# Run: uvicorn wiki_mcp_server:app --port 8503

Vine Copula Q&A Generator

class VineCopula:
    """Generate statistically diverse Q&A pairs using Vine Copula."""
    
    def __init__(self, dimensions: int = 3):
        self.dimensions = dimensions
    
    def generate_samples(self, n_samples: int) -> np.ndarray:
        """Generate correlated samples via Gaussian copula."""
        # Correlation matrix for question difficulty, length, complexity
        correlation = np.array([
            [1.0, 0.3, 0.5],
            [0.3, 1.0, 0.4],
            [0.5, 0.4, 1.0]
        ])
        
        # Generate multivariate normal samples
        samples = np.random.multivariate_normal(
            mean=[0, 0, 0],
            cov=correlation,
            size=n_samples
        )
        
        # Transform to uniform via CDF
        from scipy.stats import norm
        return norm.cdf(samples)

def generate_wiki_qa(articles: list, n_questions: int = 10) -> list:
    """Generate Q&A pairs from Wikipedia articles."""
    copula = VineCopula()
    samples = copula.generate_samples(n_questions)
    
    qa_pairs = []
    for i, sample in enumerate(samples):
        difficulty = "easy" if sample[0] < 0.33 else "medium" if sample[0] < 0.66 else "hard"
        article = articles[i % len(articles)]
        
        qa_pairs.append({
            "article": article,
            "difficulty": difficulty,
            "question": f"Based on {article}, explain...",
            "copula_params": sample.tolist()
        })
    
    return qa_pairs

Persona Compliance Testing

Test how well models adapt responses to different audience personas:

Persona Definitions

Persona	Target Metric	Pass Criteria
5th Grade	Flesch Reading Ease	80-90 (Easy)
Scholar	Flesch Reading Ease	20-50 (Difficult)
Compute	SQL Syntax Check	Contains SELECT or ```sql
Business	Business Term Count	ROI, KPI, stakeholder, metric

Compliance Scoring

def compute_flesch_score(text: str) -> float:
    """Calculate Flesch Reading Ease score."""
    sentences = text.count('.') + text.count('!') + text.count('?')
    words = len(text.split())
    syllables = sum(count_syllables(word) for word in text.split())
    
    if sentences == 0 or words == 0:
        return 0
    
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def check_sql_presence(text: str) -> bool:
    """Check if response contains SQL code."""
    return "SELECT" in text.upper() or "```sql" in text.lower()

def count_business_terms(text: str) -> int:
    """Count business terminology."""
    terms = ["roi", "kpi", "stakeholder", "metric", "revenue", "profit", "margin"]
    text_lower = text.lower()
    return sum(1 for term in terms if term in text_lower)

def evaluate_persona_compliance(response: str, persona: str) -> float:
    """Score persona compliance 0-1."""
    if persona == "5th_grade":
        score = compute_flesch_score(response)
        return 1.0 if 80 <= score <= 90 else 0.7 if 70 <= score <= 100 else 0.3
    
    elif persona == "scholar":
        score = compute_flesch_score(response)
        return 1.0 if 20 <= score <= 50 else 0.7 if 10 <= score <= 60 else 0.3
    
    elif persona == "compute":
        return 1.0 if check_sql_presence(response) else 0.3
    
    elif persona == "business":
        count = count_business_terms(response)
        return min(1.0, count / 5.0)
    
    return 0.5

TruLens Evaluation Pipeline

Implement LLM-as-Judge evaluation with multiple judge models and SAE (Sparse Autoencoder) analysis for model interpretability.

Judge Configuration

import json

# Judge models for LLM-as-Judge evaluation via Cortex REST API
JUDGE_MODELS = {
    "mixtral": "mixtral-8x7b",
    "arctic": "snowflake-arctic"
}

JUDGE_PROMPT = """You are an expert evaluator assessing if an AI agent's response adheres to its verbosity constraints.

**Verbosity Level**: {verbosity}
**Expected Constraints**: {constraints}
**Response to Evaluate**: {response}

Evaluate whether the response adheres to the verbosity criteria above.

Score each dimension 1-5:
1. **Length Compliance**: Does the response length match the expected verbosity level?
2. **Information Density**: Is the information appropriately dense for this level?
3. **Content Appropriateness**: Is the content appropriate for this verbosity level?

Return JSON:
{{"length_score": X, "density_score": X, "content_score": X, "overall": X, "reasoning": "..."}}
"""

def evaluate_with_judge(response: str, verbosity: str, judge_key: str) -> dict:
    """Run LLM-as-Judge evaluation via Cortex REST API."""
    prompt = JUDGE_PROMPT.format(
        verbosity=verbosity,
        constraints=VERBOSITY_PROMPTS[verbosity],
        response=response
    )
    
    # Use Cortex REST API for judge evaluation
    messages = [{"role": "user", "content": prompt}]
    result = client.complete(JUDGE_MODELS[judge_key], messages)
    
    answer = result.get("choices", [{}])[0].get("message", {}).get("content", "")
    return json.loads(answer)

def run_multi_judge_evaluation(response: str, verbosity: str) -> dict:
    """Run evaluation with multiple judge models via Cortex REST API."""
    results = {}
    for judge_name in JUDGE_MODELS:
        try:
            results[judge_name] = evaluate_with_judge(response, verbosity, judge_name)
        except Exception as e:
            results[judge_name] = {"error": str(e)}
    
    # Aggregate scores across judges
    valid_scores = [r["overall"] for r in results.values() if "overall" in r]
    avg_score = sum(valid_scores) / len(valid_scores) if valid_scores else 0
    
    return {"judges": results, "aggregate_score": avg_score}

SAE Feature Analysis

Sparse Autoencoder (SAE) analysis decomposes LLM activations into interpretable features:

class SAEAnalyzer:
    """Sparse Autoencoder for model interpretability."""
    
    def __init__(self, hidden_dim: int = 4096, sparsity_target: float = 0.05):
        self.hidden_dim = hidden_dim
        self.sparsity_target = sparsity_target
    
    def analyze(self, model: str, layer: str, response: str) -> dict:
        """Analyze response activations through SAE."""
        # Simulated SAE metrics
        return {
            "model": model,
            "layer": layer,
            "feature_activation": np.random.uniform(0.3, 0.5),
            "feature_sparsity": np.random.uniform(0.04, 0.12),
            "reconstruction_loss": np.random.uniform(0.02, 0.15),
            "dead_features": np.random.uniform(0.1, 0.3),
            "composite_score": np.random.uniform(0.7, 0.85),
            "status": "HEALTHY" if np.random.random() > 0.3 else "NEEDS_TUNING"
        }

# SAE Results Example
# MODEL     LAYER      ACTIVATION   SPARSITY   RECON_LOSS   STATUS
# claude    layer_24   0.3596       0.0507     0.0212       HEALTHY
# mistral   layer_24   0.4185       0.1142     0.1026       NEEDS_TUNING

Extended Thinking and RAG

Capture Claude's reasoning process with extended thinking using the Chat Completions API and combine with retrieval-augmented generation.

Extended Thinking with Chat Completions

Use the /api/v2/cortex/chat/completions endpoint for extended thinking with full usage statistics:

import requests
import json

def run_extended_thinking_chat_completions(
    query: str, 
    context: str, 
    thinking_budget: int = 8000
) -> dict:
    """Execute extended thinking via Chat Completions API."""
    
    # Chat Completions endpoint for extended thinking
    url = f"https://{ACCOUNT}.snowflakecomputing.com/api/v2/cortex/chat/completions"
    
    headers = {
        "Authorization": f"Bearer {PAT}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": "claude-sonnet-4-5",  # Extended thinking supported
        "messages": [
            {"role": "system", "content": f"Use this context to answer:\n\n{context}"},
            {"role": "user", "content": query}
        ],
        "temperature": 1.0,  # Required for extended thinking
        "thinking": {
            "type": "enabled",
            "budget_tokens": thinking_budget
        },
        "stream": True
    }
    
    response = requests.post(url, headers=headers, json=payload, stream=True)
    
    thinking_content = ""
    response_content = ""
    usage = {}
    
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        
        if line.strip() == "data: [DONE]":
            break
            
        data = json.loads(line[6:])
        
        # Parse streaming chunks
        for choice in data.get("choices", []):
            delta = choice.get("delta", {})
            
            # Capture thinking content
            if "thinking" in delta:
                thinking_content += delta.get("thinking", "")
            
            # Capture response content  
            if "content" in delta:
                response_content += delta.get("content", "")
        
        # Capture usage stats (comes in final chunk)
        if "usage" in data:
            usage = data["usage"]
    
    return {
        "answer": response_content,
        "thinking": thinking_content,
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "reasoning_tokens": usage.get("reasoning_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0)
    }

# Example usage
result = run_extended_thinking_chat_completions(
    query="What are the security implications of SQL injection?",
    context="SQL injection is a code injection technique...",
    thinking_budget=8000
)

print(f"Thinking: {result['thinking'][:500]}...")
print(f"Answer: {result['answer']}")
print(f"Tokens - Prompt: {result['prompt_tokens']}, Completion: {result['completion_tokens']}, Reasoning: {result['reasoning_tokens']}")

Non-Streaming Chat Completions

For simpler use cases without streaming:

def chat_completion_simple(query: str, model: str = "claude-sonnet-4-5") -> dict:
    """Simple chat completion without streaming."""
    
    url = f"https://{ACCOUNT}.snowflakecomputing.com/api/v2/cortex/chat/completions"
    
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": query}],
        "max_tokens": 4096
    }
    
    headers = {"Authorization": f"Bearer {PAT}", "Content-Type": "application/json"}
    response = requests.post(url, headers=headers, json=payload)
    result = response.json()
    
    return {
        "content": result["choices"][0]["message"]["content"],
        "usage": result.get("usage", {})
    }

Chat Completions Usage Display

The dashboard displays token usage from the Chat Completions API:

Metric	Description
Prompt Tokens	Input tokens sent to model
Completion Tokens	Output tokens generated
Reasoning Tokens	Tokens used for extended thinking
Total Tokens	Combined usage
Est. Cost	Approximate API cost

# Display usage in Streamlit
def display_chat_completions_usage(result: dict):
    """Display Chat Completions usage metrics."""
    st.markdown("#### 📊 Chat Completions Usage")
    
    col1, col2, col3, col4 = st.columns(4)
    with col1:
        st.metric("Prompt Tokens", f"{result['prompt_tokens']:,}")
    with col2:
        st.metric("Completion Tokens", f"{result['completion_tokens']:,}")
    with col3:
        st.metric("Reasoning Tokens", f"{result['reasoning_tokens']:,}")
    with col4:
        total = result['prompt_tokens'] + result['completion_tokens']
        est_cost = (result['prompt_tokens'] * 0.003 + result['completion_tokens'] * 0.015) / 1000
        st.metric("Est. Cost", f"${est_cost:.4f}")

dbt Pipelines for ML

Build data pipelines for ML feature engineering with dbt models running on Snowflake.

Pipeline Lineage

ML Features Model

-- models/ml_features/ml_file_embeddings.sql
{{ config(
    materialized='incremental',
    unique_key='file_hash',
    on_schema_change='sync_all_columns'
) }}

WITH source_files AS (
    SELECT 
        file_path,
        file_content,
        MD5(file_content) as file_hash,
        LENGTH(file_content) as content_length,
        CURRENT_TIMESTAMP() as processed_at
    FROM {{ ref('stg_source_files') }}
    {% if is_incremental() %}
    WHERE processed_at > (SELECT MAX(processed_at) FROM {{ this }})
    {% endif %}
),

embeddings AS (
    SELECT
        file_hash,
        file_path,
        -- Generate embeddings via Cortex
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', file_content) as embedding_vector,
        content_length,
        processed_at
    FROM source_files
)

SELECT * FROM embeddings

Evaluation Results Model

-- models/trulens_evals/persona_evaluations.sql
{{ config(
    materialized='incremental',
    unique_key='eval_id'
) }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['model', 'persona', 'query_id', 'timestamp']) }} as eval_id,
    model,
    persona,
    query_id,
    response,
    
    -- Persona-specific compliance scoring
    CASE persona
        WHEN '5th_grade' THEN 
            CASE WHEN flesch_score BETWEEN 80 AND 90 THEN 1.0
                 WHEN flesch_score BETWEEN 70 AND 100 THEN 0.7
                 ELSE 0.3 END
        WHEN 'scholar' THEN
            CASE WHEN flesch_score BETWEEN 20 AND 50 THEN 1.0
                 WHEN flesch_score BETWEEN 10 AND 60 THEN 0.7
                 ELSE 0.3 END
        WHEN 'compute' THEN
            CASE WHEN CONTAINS(response, 'SELECT') OR CONTAINS(response, '```sql') THEN 1.0 
                 ELSE 0.3 END
        WHEN 'business' THEN
            LEAST(1.0, REGEXP_COUNT(LOWER(response), 'roi|kpi|stakeholder|metric|revenue') / 5.0)
    END as persona_compliance,
    
    timestamp
FROM {{ ref('stg_persona_responses') }}

A/B Testing with LangGraph

Implement experiment frameworks using LangGraph and MCP for A/B testing model configurations.

A/B MCP Server

# ab_mcp_server.py
from fastapi import FastAPI
from pydantic import BaseModel
import random

app = FastAPI()

class Experiment(BaseModel):
    name: str
    variants: list[str]
    traffic_split: list[float]

experiments = {}

@app.post("/experiment/create")
async def create_experiment(exp: Experiment):
    """Create new A/B experiment."""
    experiments[exp.name] = exp
    return {"status": "created", "experiment": exp.name}

@app.get("/experiment/assign/{name}/{user_id}")
async def assign_variant(name: str, user_id: str):
    """Assign user to experiment variant."""
    exp = experiments.get(name)
    if not exp:
        return {"error": "Experiment not found"}
    
    # Deterministic assignment based on user_id hash
    hash_val = hash(user_id) % 100 / 100
    cumulative = 0
    for variant, split in zip(exp.variants, exp.traffic_split):
        cumulative += split
        if hash_val < cumulative:
            return {"variant": variant, "user_id": user_id}
    
    return {"variant": exp.variants[-1], "user_id": user_id}

@app.post("/experiment/{name}/record")
async def record_result(name: str, user_id: str, metric: str, value: float):
    """Record experiment metric."""
    return {"status": "recorded", "experiment": name, "user_id": user_id}

# Run: uvicorn ab_mcp_server:app --port 8517

LangGraph Workflow

from langgraph.graph import StateGraph, END
from typing import TypedDict

class ExperimentState(TypedDict):
    user_id: str
    query: str
    variant: str
    response: str
    metrics: dict

def assign_variant(state: ExperimentState) -> ExperimentState:
    """Assign user to experiment variant via MCP."""
    response = requests.get(f"{AB_MCP_URL}/experiment/assign/verbosity_test/{state['user_id']}")
    state["variant"] = response.json()["variant"]
    return state

def run_model(state: ExperimentState) -> ExperimentState:
    """Run model based on assigned variant."""
    verbosity = state["variant"]  # e.g., "minimal", "brief", "standard"
    result = call_model("claude-sonnet-4-5", verbosity, state["query"])
    state["response"] = result["response"]
    state["metrics"] = {"line_count": result["line_count"], "compliant": result["compliant"]}
    return state

def record_metrics(state: ExperimentState) -> ExperimentState:
    """Record experiment results."""
    requests.post(
        f"{AB_MCP_URL}/experiment/verbosity_test/record",
        params={"user_id": state["user_id"], "metric": "compliance", "value": state["metrics"]["compliant"]}
    )
    return state

# Build graph
workflow = StateGraph(ExperimentState)
workflow.add_node("assign", assign_variant)
workflow.add_node("run", run_model)
workflow.add_node("record", record_metrics)

workflow.set_entry_point("assign")
workflow.add_edge("assign", "run")
workflow.add_edge("run", "record")
workflow.add_edge("record", END)

app = workflow.compile()

Multimodal Vision with Cortex

Use the Cortex Chat Completions API for image understanding with Claude and GPT-4o vision models. This enables analysis of egocentric frames from AR devices like Project Aria.

Vision API Call

The Cortex Chat Completions API supports OpenAI-compatible format with base64-encoded images:

import requests
import base64
import tomllib
import os

def call_vision_model(model: str, prompt: str, image_path: str):
    """Call Cortex vision model with an image."""
    # Load credentials
    with open(os.path.expanduser("~/.snowflake/config.toml"), "rb") as f:
        config = tomllib.load(f)
    pat = config["connections"]["myaccount"]["password"]
    account = config["connections"]["myaccount"]["account"].lower().replace("_", "-")
    
    # Encode image to base64
    with open(image_path, "rb") as img_file:
        image_base64 = base64.b64encode(img_file.read()).decode("utf-8")
    
    url = f"https://{account}.snowflakecomputing.com/api/v2/cortex/v1/chat/completions"
    
    payload = {
        "model": model,  # claude-sonnet-4-5, gpt-4o, etc.
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_base64}"
                }}
            ]
        }],
        "max_completion_tokens": 1024
    }
    
    headers = {"Authorization": f"Bearer {pat}", "Content-Type": "application/json"}
    response = requests.post(url, headers=headers, json=payload)
    return response.json()["choices"][0]["message"]["content"]

# Analyze an egocentric image
result = call_vision_model(
    model="claude-sonnet-4-5",
    prompt="This is an egocentric view from AR glasses. Describe the scene.",
    image_path="egocentric_frame.jpg"
)
print(result)

Project Aria Integration

Project Aria is Meta's AR research glasses with a 12MP RGB camera. Extract frames from Aria VRS recordings for vision analysis:

# pip install projectaria-tools[all]
from projectaria_tools.core import data_provider

# Load VRS recording
provider = data_provider.create_vrs_data_provider("recording.vrs")
rgb_stream = provider.get_stream_id_from_label("camera-rgb")

# Extract RGB frame
image_data = provider.get_image_data_by_index(rgb_stream, 0)
image_array = image_data[0].to_numpy_array()

# Save frame for vision analysis
from PIL import Image
Image.fromarray(image_array).save("aria_frame.jpg")

Supported Vision Models

Model	Provider	Use Case
claude-sonnet-4-5	Cortex	Scene understanding, detailed analysis
claude-sonnet-4-6	Cortex	Latest Claude vision capabilities
gpt-4o	Cortex	Fast, accurate image understanding
gpt-4o-mini	Cortex	Cost-effective vision tasks

Dashboard Walkthrough

Launch the full cross-model verbosity dashboard:

streamlit run compare_models_dashboard.py --server.port 8501

Verbosity Comparison

The main tab lets you compare Claude, Mistral, and Llama across all 8 verbosity levels with compliance scoring and token usage metrics.

Live Testing

Run live model comparisons with custom prompts and see real-time results from the Cortex REST API.

Results Analysis

Analyze compliance rates, token efficiency, and response quality across models and verbosity levels.

Persona Comparison

Test persona compliance across all tabs — 5th Grade, Scholar, Compute, and Business personas evaluated against each model.

RAG with Extended Thinking

Mini RAG pipeline with Wikipedia retrieval and Claude extended thinking traces.

SAE & LangChain Integration

Sparse Autoencoder feature analysis with LangChain orchestration and LangTrace event-driven hooks for observability.

SAE LangChain LangTrace Event-Driven Hooks

LangGraph Experiments

A/B testing framework using LangGraph workflows for experiment-driven model evaluation.

Evaluation & Batch Testing

TruLens evaluation demo with LLM-as-Judge scoring and batch test execution across all model-verbosity combinations.

Conclusion and Resources

Congratulations! You've built a comprehensive cross-model verbosity evaluation system using Snowflake Cortex REST API that:

Deploys 24 Cortex Agents with verbosity constraints (8 levels × 3 models)
Compares Claude Sonnet 4, Mistral Large 2, and Llama 3.1 70B across 8 response styles
Uses config-driven model management for easy extensibility
Implements TruLens LLM-as-Judge evaluation with SAE analysis
Tests persona compliance (5th Grade, Scholar, Compute, Business)
Integrates MCP for Wikipedia retrieval and A/B testing
Captures extended thinking traces from Claude via Cortex REST API
Builds dbt pipelines for ML feature engineering

What You Learned

Calling Cortex models via REST API (/api/v2/cortex/inference:complete)
Config-driven multi-model comparison methodology
LLM-as-Judge evaluation patterns with multiple judge models
MCP (Model Context Protocol) integration
Extended thinking with streaming response parsing
dbt pipelines for ML on Snowflake

Related Resources

Next Steps

Add more models to MODEL_CONFIGS (e.g., GPT-4 via external functions)
Add custom evaluation metrics
Deploy as Streamlit in Snowflake app
Integrate with Snowflake ML Model Registry

Updated 2026-03-24

This content is provided as is, and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances