AI & ML

Unlocking Trust in Enterprise AI with AI Observability and Evaluations in Snowflake Cortex AI

The widespread integration of large language models (LLMs) and generative AI in mission-critical business processes has created a need for robust AI observability to address the inherent “black box” and nondeterministic nature of these systems and applications. 

The true opportunity for the teams leveraging platforms such as Snowflake Cortex AI lies in transforming generative AI prototypes into dependable, efficient and trustworthy production-ready applications. 

The process of selecting the right LLM and refining prompts requires: 

  1. Constant experimentation and evaluation to improve response accuracy

  2. Systematic testing for, and mitigation of, a variety of failure modes

  3. Simultaneous monitoring and optimization of crucial operational metrics such as response latency and token usage

Without an integrated solution to continuously evaluate, debug and track these factors directly within their AI data environment, organizations cannot confidently deploy generative AI solutions that are both effective and efficient.

 

What is AI observability?

AI observability enables developers to monitor, analyze and visualize the internal states, inputs and outputs of generative AI applications, increasing accuracy, trust, efficiency and regulatory compliance in real-world environments. AI observability spans all stages of application development, including development, testing and production, and anchors on three key pillars:

  1. Tracing: As developers build and customize their applications, tracing enables them to visualize the inputs, outputs and intermediate states of the application. This provides granular information about each component within the application, enabling better debugging and explainability of application behavior.

  2. Evaluations: Once an initial version of the application is ready, developers conduct systematic evaluations to assess its performance and proactively improve response accuracy. This allows them to test and compare different models and prompts and finalize the configuration for production deployment.

  3. Monitoring: Once the application is deployed in production, developers need to constantly monitor the performance of their application to ensure operational reliability and avoid performance drift. Continuous monitoring also enables them to fine-tune the application by eliminating failure points and accommodating data drift.
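As an illustration of the tracing pillar, the sketch below records the inputs, outputs and latency of each application component — the kind of span data a tracing backend captures. The `retrieve` and `generate` functions are hypothetical stand-ins for real retrieval and LLM calls:

```python
import functools
import time

# In-memory trace store; a real observability backend would persist spans.
TRACE_LOG: list = []

def trace(component):
    """Record the inputs, output and latency of one application component."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE_LOG.append({
                "component": component,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return wrapper
    return decorator

@trace("retrieve")
def retrieve(query):
    return ["Paris is the capital of France."]  # stand-in for a vector search

@trace("generate")
def generate(query, context):
    return f"Answer based on {len(context)} document(s)."  # stand-in for an LLM call

answer = generate("capital of France?", retrieve("capital of France?"))
print([span["component"] for span in TRACE_LOG])  # → ['retrieve', 'generate']
```

Because every span carries its inputs and output, a developer can replay exactly what each step saw when debugging an unexpected final response.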

     

AI Observability in Snowflake Cortex

Snowflake supports a comprehensive set of AI Observability capabilities that enable developers to effectively evaluate and monitor their generative AI apps. AI Observability can be enabled across custom-designed gen AI apps as well as Snowflake native generative AI services.


AI Observability on custom gen AI apps

AI Observability for custom gen AI apps is now generally available, offering AI engineers and developers the capability to effortlessly evaluate and trace their generative AI applications. Through AI Observability, users can measure the performance of their AI applications by conducting systematic evaluations and can iterate on application configurations to enhance performance. Furthermore, it allows for the logging of application traces to facilitate debugging. This functionality enhances the trust and transparency of generative AI applications and agents, enabling comprehensive benchmarking and performance measurement before application deployment.

  • End-to-end evaluation: AI Observability can evaluate the performance of agents and apps, using techniques such as LLM-as-a-judge. It can report metrics such as relevance, groundedness and harmfulness, giving customers the ability to quickly iterate and refine the agent for improved performance. 

  • Comparison: Users can compare evaluation runs side by side and assess the quality and accuracy of responses across different LLM configurations to identify the best configuration for production deployments.

  • Comprehensive tracing: Customers can enable logging for every step of agent executions across input prompts, tool use and final response generation using OpenTelemetry traces. This allows easy debugging and refinement for accuracy, latency and cost. 
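To make the comparison step concrete, here is a minimal sketch of ranking evaluation runs by their LLM-as-a-judge metrics, preferring higher relevance and groundedness and lower harmfulness. The configuration names and scores are invented for illustration:

```python
# Hypothetical evaluation runs: each configuration maps to LLM-as-a-judge
# metric scores in [0.0, 1.0]; these numbers are invented for illustration.
runs = {
    "mistral-large / prompt-v1": {"relevance": 0.78, "groundedness": 0.81, "harmfulness": 0.02},
    "mistral-large / prompt-v2": {"relevance": 0.86, "groundedness": 0.88, "harmfulness": 0.01},
    "llama3-70b / prompt-v2":    {"relevance": 0.84, "groundedness": 0.89, "harmfulness": 0.01},
}

def best_config(runs, maximize=("relevance", "groundedness"), minimize=("harmfulness",)):
    """Pick the configuration with the best aggregate score: quality metrics
    count positively, harmfulness counts against."""
    def score(metrics):
        return sum(metrics[m] for m in maximize) - sum(metrics[m] for m in minimize)
    return max(runs, key=lambda name: score(runs[name]))

winner = best_config(runs)
print(winner)  # → mistral-large / prompt-v2
```

In practice the weighting between metrics is a product decision — a customer-facing app might veto any configuration whose harmfulness exceeds a hard threshold before ranking on quality.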

 

AI observability across Cortex AI services


Snowflake Intelligence and Cortex Agents

Snowflake Intelligence provides AI-generated insights using natural language that users can trust by offering verifiable explainability and transparency. This new agentic experience, accessible via a dedicated portal, allows all users to securely converse with their trusted enterprise data, derive meaningful insights from it and initiate actions from a unified, intuitive interface. 

With native observability, Snowflake Intelligence users can easily see the "why" behind every answer generated by the agent, tracking whether the data came from verified sources or curated queries, and tracing the lineage. Data administrators can soon gain visibility into the questions being asked and the relevance scores of the answers, allowing for continuous improvement and fine-tuning with centralized control.

Additionally, for agents built using Cortex Agents, engineers will soon gain the ability to effortlessly evaluate, trace and monitor their agents with native observability capabilities. 

Agent observability will allow developers to trace agent interactions in real time, enabling them to have better visibility into agent planning, tool selection, execution and response generation steps. Developers will be able to log and monitor every interaction on the agent to systematically debug, improve and iterate the agent’s performance.

This native observability accelerates the development cycle, enhancing the trustworthiness and transparency of generative AI applications and agents before deployment. 

 

Cortex Search

In an AI agent or application performing retrieval-augmented generation (RAG), the quality of the final output is fundamentally dependent on the precision of the initial retrieval. 

To measure and continuously improve retrieval quality, Cortex Search now provides a native suite of evaluation and tuning tools. Now, users have access to a dedicated Evaluation UI for Cortex Search that allows them to:

  • Create high-quality evaluation sets 

  • Run experiments

  • Automatically tune search parameters to optimize performance for their specific business use case

This UI leverages LLMs to speed up the search evaluation process, including for query generation and relevance judgment. 

Using the Evaluation UI, users can quickly run and compare experiments to measure retrieval quality against human- and LLM-labeled data sets, ensuring that downstream search and chat apps receive the most relevant context for users' queries.
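A common way to score such experiments is recall@k against the labeled evaluation set. The sketch below compares two hypothetical search configurations; the document IDs and relevance judgments are invented for illustration:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the judged-relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mean_recall(results, eval_set, k=5):
    """Average recall@k over every query in the evaluation set."""
    return sum(recall_at_k(results[q], rel, k) for q, rel in eval_set.items()) / len(eval_set)

# Hypothetical evaluation set: query -> document IDs judged relevant
# (labels could come from humans or from an LLM relevance judge).
eval_set = {"q1": {"doc_3", "doc_7"}, "q2": {"doc_1"}}

# Hypothetical top-5 results from two search parameter configurations.
config_a = {"q1": ["doc_3", "doc_2", "doc_9", "doc_7", "doc_4"],
            "q2": ["doc_8", "doc_1", "doc_5", "doc_2", "doc_6"]}
config_b = {"q1": ["doc_2", "doc_9", "doc_4", "doc_3", "doc_8"],
            "q2": ["doc_5", "doc_8", "doc_2", "doc_6", "doc_9"]}

print(mean_recall(config_a, eval_set))  # → 1.0
print(mean_recall(config_b, eval_set))  # → 0.25
```

Automated parameter tuning then reduces to searching the configuration space for the settings that maximize this kind of aggregate retrieval metric.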

 

Cortex Analyst

Cortex Analyst translates natural language prompts into precise SQL queries, enabling users to extract critical insights from complex data sets. 

To help ensure continuous improvement and accuracy, administrators and engineers have access to historical logs of all past interactions. By analyzing these logs, engineers can make informed adjustments to the underlying semantic model, refining its ability to generate highly accurate responses.

To quantitatively measure performance, Cortex Analyst provides an open-source Streamlit tool that uses an LLM as a judge: it compares the model's responses against a golden set of ideal request-response pairs and calculates an aggregate percentage of correctness, providing a benchmark for the model's accuracy.
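The aggregate-correctness calculation can be sketched as follows. The golden set, model responses and text-normalizing judge are simplified stand-ins; a real LLM-as-a-judge setup would ask a model whether two SQL queries are semantically equivalent:

```python
# Hypothetical golden set of natural-language requests and ideal SQL responses.
golden_set = [
    {"question": "Total revenue in 2024?",
     "expected": "SELECT SUM(revenue) FROM sales WHERE year = 2024"},
    {"question": "How many active customers?",
     "expected": "SELECT COUNT(*) FROM customers WHERE status = 'active'"},
    {"question": "Average order value?",
     "expected": "SELECT AVG(order_total) FROM orders"},
]

def judge(generated_sql, expected_sql):
    """Stand-in judge: compares normalized text. A real judge would ask an
    LLM whether the two queries are semantically equivalent."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(generated_sql) == norm(expected_sql)

def accuracy_pct(responses, golden_set):
    """Aggregate percentage of golden-set questions answered correctly."""
    correct = sum(judge(responses[g["question"]], g["expected"]) for g in golden_set)
    return 100.0 * correct / len(golden_set)

# Hypothetical model responses: two match the golden set, one does not.
responses = {
    "Total revenue in 2024?": "select sum(revenue) from sales where year = 2024",
    "How many active customers?": "SELECT COUNT(*) FROM customers WHERE status = 'active'",
    "Average order value?": "SELECT SUM(order_total) FROM orders",  # wrong aggregate
}
print(round(accuracy_pct(responses, golden_set), 1))  # → 66.7
```

Tracking this percentage across semantic-model revisions gives engineers a concrete regression signal when they refine the model.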

 

Document AI

Observability in Document AI is achieved through Attention Spans for explainability and Confidence Scores for reliability. 

Attention Spans provide a direct method for validating the output extracted from documents. This feature enhances explainability by using a secondary LLM to present the specific evidence from the source text that supports each result. This is particularly useful during preproduction stages, such as inference and training, as it allows for the continuous validation of output quality to confirm it meets expectations.


Additionally, the system generates built-in Confidence Scores for every extracted value. These scores are calculated via an algorithm that aggregates the individual probabilities of each word token in the answer. While a high confidence score does not guarantee a correct answer, it significantly increases the likelihood of accuracy. This acts as a powerful tool for responsible AI, enabling workflows to automatically filter or flag answers with low scores. Constant monitoring of these scores ensures the timely detection and remediation of any model performance deterioration over time.
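As an illustration of how token-level probabilities can be aggregated into a single score and used to gate a workflow, here is a sketch. The geometric-mean aggregation and the 0.8 threshold are assumptions for illustration, not the documented Document AI algorithm:

```python
import math

def confidence_score(token_probs):
    """One plausible aggregation: the geometric mean of per-token probabilities,
    so a single uncertain token drags down the whole answer's score."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def needs_review(extractions, threshold=0.8):
    """Flag low-confidence extracted values for human review."""
    return {field: confidence_score(probs) < threshold
            for field, probs in extractions.items()}

flags = needs_review({
    "invoice_total": [0.99, 0.98, 0.97],  # uniformly confident tokens
    "due_date": [0.95, 0.40, 0.90],       # one uncertain token
})
print(flags)  # → {'invoice_total': False, 'due_date': True}
```

Routing only the flagged fields to a human reviewer keeps the automated pipeline fast while catching the extractions most likely to be wrong.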

 

Build trust in your AI with Observability in Snowflake Cortex AI

The journey from generative AI prototypes to dependable, production-ready applications hinges on trust and transparency. Snowflake Cortex AI delivers the essential toolkit of AI observability and evaluations to make this transition seamless, enabling developers to move beyond the "black box" nature of AI systems.

By integrating observability into their AI development lifecycle, developers can continuously validate, debug and refine their work, ensuring that AI solutions are not only effective and efficient but also fully explainable and reliable. 

Ultimately, Snowflake Cortex AI empowers you to build generative AI applications that are not only powerful but also transparent and worthy of enterprise trust.

 

