Getting Started with Governing your Lakehouse for AI use-cases
This guide walks you through setting up a governed data lakehouse with Apache Iceberg tables in Snowflake, implementing data governance policies, establishing data quality monitoring, and creating an AI agent for natural language data exploration.
Overview
What You'll Build
In this guide, you will build an end-to-end governed data lakehouse for Amazon Product Reviews Analytics.
We will cover key aspects of building a production-ready, AI-enabled data lakehouse that combines open table formats with enterprise governance. You'll create a complete solution that:
- Connects to external Iceberg data stored in Amazon S3
- Implements fine-grained access controls with row-level security and data masking
- Monitors data quality with automated metric functions
- Enables natural language queries through a Cortex AI Agent

What You'll Learn
In this guide, you will learn:
- How to configure Apache Iceberg tables with external cloud storage
- How to implement enterprise-grade data governance on open table formats
- Row Access Policies for role-based row filtering
- Dynamic Data Masking for sensitive column protection
- Object Tagging for data classification and compliance
- How to monitor data quality with Data Metric Functions (DMFs)
- Creating custom quality metrics
- Scheduling automated quality checks
- How to build AI-powered data access with Cortex Agents
- Semantic model design for text-to-SQL
- Agent creation and configuration
What is Apache Iceberg?
Apache Iceberg is an open table format designed for huge analytic datasets. It provides:
- ACID transactions - Reliable operations with atomicity, consistency, isolation, and durability
- Schema evolution - Add, rename, or remove columns without rewriting data
- Hidden partitioning - Automatic partition management without user-visible partition columns
- Time travel - Query data at any point in time using snapshots
What are Iceberg Tables in Snowflake?
Iceberg tables in Snowflake combine the performance and query semantics of native Snowflake tables with external cloud storage that you manage. They are ideal for:
- Existing data lakes you cannot (or choose not to) migrate
- Multi-engine architectures where multiple compute engines access the same data
- Cost optimization by leveraging your own cloud storage
Iceberg Tables - Key Concepts
External Volume:
- Account-level object connecting Snowflake to your cloud storage
- Stores IAM credentials for secure access
- Supports S3, GCS, and Azure Blob Storage
Catalog Integration:
- Defines how Snowflake interacts with Iceberg metadata
- Options: Snowflake-managed, AWS Glue, or object store catalogs
Metadata Management:
- Iceberg uses manifest files to track data files
- Snowflake reads metadata to understand table structure
- Enables features like schema evolution and time travel
What is Snowflake Data Governance?
Snowflake provides a comprehensive suite of governance features that work seamlessly with Iceberg tables:
Row Access Policies
Row-level security that filters rows at query time based on user context:
- ✅ Dynamic filtering based on
CURRENT_ROLE(),CURRENT_USER(), or custom logic - ✅ No data duplication - single table serves multiple audiences
- ✅ Transparent to queries - users write normal SQL
Dynamic Data Masking
Column-level security that transforms sensitive data at query time:
- ✅ Role-based visibility - privileged users see real data
- ✅ Multiple masking patterns - full mask, partial mask, tokenization
- ✅ Applied dynamically - original data unchanged
Object Tagging
Classification system for organizing and governing data:
- ✅ Hierarchical tags - apply to databases, schemas, tables, columns
- ✅ Tag-based policies - link masking policies to tags
- ✅ Compliance tracking - identify PII, sensitivity levels
What are Data Metric Functions?
Data Metric Functions (DMFs) measure the state and integrity of your data. They enable:
Automated Quality Monitoring:
- Schedule quality checks to run automatically
- Results stored in dedicated event tables
- Alert on quality threshold violations
Built-in and Custom Metrics:
- ✅ System DMFs - NULL count, duplicate count, uniqueness, freshness
- ✅ Custom DMFs - Define your own quality rules with SQL
Serverless Execution:
- No warehouse management required
- Snowflake handles compute automatically
- Pay only for scheduled executions
What are Cortex Agents?
Cortex Agents are AI-powered assistants that orchestrate across multiple tools to answer questions and perform tasks.
Semantic Model:
- YAML definition of your data's business meaning
- Dimensions, measures, and time dimensions
- Synonyms for natural language understanding
- Verified queries for accuracy
Prerequisites
Snowflake Requirements
| Requirement | Details |
|---|---|
| Snowflake Account | Snowflake Free Trial Account |
| Edition | Enterprise Edition or higher (required for governance features) |
| Role | ACCOUNTADMIN or equivalent privileges |
| Cloud Region | AWS US-WEST-2 (Snowflake instance should be in the same region as your AWS instance) |
AWS Requirements (for custom external volumes)
| Requirement | Details |
|---|---|
| S3 Bucket | Bucket containing Iceberg data files |
| IAM Permissions | Ability to create IAM policies and roles |
Skills & Knowledge
- Basic SQL proficiency
- Familiarity with cloud storage (AWS S3 in particular for this demo)
Download SQL files
Download the SQL, JSON and YAML files from this git repository into your local machine.
File Reference
| File | Purpose |
|---|---|
iam_policy.json | AWS IAM policy for S3 bucket access |
1-setup.sql | Database, warehouse, external volume, and Iceberg table creation |
2-governance.sql | Role creation, row access policies, masking policies, and tagging |
3-data-quality.sql | Custom data metric functions for quality monitoring |
4-create-agent.sql | Cortex Agent with Cortex Analyst for natural language queries |
product_reviews_semantic_model.yaml | Semantic model defining business context for AI |
Setup: External Iceberg Tables
AWS IAM Configuration
Before creating an external volume, you need to configure AWS IAM to grant Snowflake access to your S3 bucket.
Step 1: Create an IAM Policy
Create a policy that grants Snowflake the necessary S3 permissions. The policy in iam_policy.json provides:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "AllowIcebergObjectAccess", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:GetObjectVersion", "s3:DeleteObject", "s3:DeleteObjectVersion" ], "Resource": "arn:aws:s3:::<bucket-name-and-prefix>/*" }, { "Sid": "AllowIcebergBucketAccess", "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": "arn:aws:s3:::<bucket-name>", "Condition": { "StringLike": { "s3:prefix": [ "<prefix>/*" ] } } } ] }
Key Permissions Explained:
| Permission | Purpose |
|---|---|
s3:PutObject | Write data files to S3 |
s3:GetObject | Read data and metadata files |
s3:GetObjectVersion | Access versioned objects (required for Time Travel) |
s3:DeleteObject | Remove data files during compaction |
s3:ListBucket | List bucket contents |
s3:GetBucketLocation | Verify bucket region |
Step 2: Create an IAM Role
- Create an AWS IAM role with a trust policy for Snowflake
- Attach the IAM policy created above
- Record the Role ARN for use in the external volume configuration
Step 3: Update Trust Policy
After creating the external volume, retrieve the Snowflake IAM user ARN using DESC EXTERNAL VOLUME and update the IAM role's trust policy:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": "<snowflake_user_arn>" }, "Action": "sts:AssumeRole", "Condition": { "StringEquals": { "sts:ExternalId": "<your_external_id>" } } } ] }
Snowflake Setup Script (1-setup.sql)
USE ROLE ACCOUNTADMIN; CREATE DATABASE LAKEHOUSE_VHOL; CREATE WAREHOUSE VHOL_XS WAREHOUSE_SIZE = XSMALL; USE LAKEHOUSE_VHOL.PUBLIC; USE WAREHOUSE VHOL_XS;
External Volume
An external volume is an account-level Snowflake object that connects Snowflake to your external cloud storage for Iceberg tables. It stores the IAM entity that Snowflake uses to securely access your storage.
CREATE OR REPLACE EXTERNAL VOLUME amazon_reviews_iceberg_volume STORAGE_LOCATIONS = ( ( NAME = 'amazon-reviews-s3' STORAGE_PROVIDER = 'S3' STORAGE_BASE_URL = 's3://amazon-product-reviews-dataset/curated/product_reviews/product_reviews_db/product_reviews_100k/' STORAGE_AWS_ROLE_ARN = '<>' STORAGE_AWS_EXTERNAL_ID = 'ext_id' ) ) ALLOW_WRITES = TRUE;
Parameters:
| Parameter | Description |
|---|---|
STORAGE_PROVIDER | Cloud provider (S3, GCS, AZURE) |
STORAGE_BASE_URL | Base URL for your storage location |
STORAGE_AWS_ROLE_ARN | ARN of the IAM role with S3 access |
STORAGE_AWS_EXTERNAL_ID | External ID for cross-account access security |
ALLOW_WRITES | Enable write operations (required for Snowflake-managed tables) |
Verify the external volume configuration:
DESC EXTERNAL VOLUME amazon_reviews_iceberg_volume; SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('amazon_reviews_iceberg_volume');
Catalog Integration
A catalog integration tells Snowflake how to interact with Iceberg metadata. For externally managed tables, use OBJECT_STORE:
CREATE OR REPLACE CATALOG INTEGRATION amazon_reviews_catalog_int CATALOG_SOURCE = OBJECT_STORE TABLE_FORMAT = ICEBERG ENABLED = TRUE;
Create the Iceberg Table
Create an Iceberg table pointing to existing metadata:
CREATE OR REPLACE ICEBERG TABLE product_reviews_iceberg EXTERNAL_VOLUME = 'amazon_reviews_iceberg_volume' CATALOG = 'amazon_reviews_catalog_int' METADATA_FILE_PATH = 'metadata/00000-00a5b7be-48a2-4966-918f-99af062f8e85.metadata.json';
Verify Your Data
SELECT * FROM product_reviews_iceberg LIMIT 10; SELECT COUNT(*) AS total_reviews, ROUND(AVG(OVERALL), 2) AS avg_rating, COUNT(DISTINCT ASIN) AS unique_products, COUNT(DISTINCT REVIEWERID) AS unique_reviewers FROM product_reviews_iceberg;
Governance: Access Control, Tagging & Masking
Overview
Snowflake provides comprehensive governance features for Iceberg tables, enabling you to:
- Control row-level access based on user roles
- Mask sensitive data dynamically
- Classify and tag data for compliance
Role-Based Access Control (2-governance.sql)
Create Custom Roles
USE ROLE ACCOUNTADMIN; CREATE ROLE IF NOT EXISTS DATA_ADMIN; CREATE ROLE IF NOT EXISTS ANALYST; SET my_user = CURRENT_USER(); GRANT ROLE DATA_ADMIN TO USER IDENTIFIER($my_user); GRANT ROLE ANALYST TO USER IDENTIFIER($my_user);
Grant Permissions
GRANT USAGE ON DATABASE LAKEHOUSE_VHOL TO ROLE DATA_ADMIN; GRANT USAGE ON DATABASE LAKEHOUSE_VHOL TO ROLE ANALYST; GRANT USAGE ON SCHEMA LAKEHOUSE_VHOL.PUBLIC TO ROLE DATA_ADMIN; GRANT USAGE ON SCHEMA LAKEHOUSE_VHOL.PUBLIC TO ROLE ANALYST; GRANT SELECT ON ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg TO ROLE DATA_ADMIN; GRANT SELECT ON ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg TO ROLE ANALYST; GRANT USAGE ON WAREHOUSE VHOL_XS TO ROLE DATA_ADMIN; GRANT USAGE ON WAREHOUSE VHOL_XS TO ROLE ANALYST;
Row Access Policies
Row Access Policies enable row-level security by determining which rows are visible based on the user's role and context. The policy is evaluated at query runtime, creating a dynamic secure view.
Create a Time-Based Row Access Policy
This policy restricts the ANALYST role to only see reviews from 2010 onwards, while DATA_ADMIN and ACCOUNTADMIN can see all data:
CREATE OR REPLACE ROW ACCESS POLICY LAKEHOUSE_VHOL.PUBLIC.time_based_review_policy AS (UNIXREVIEWTIME NUMBER) RETURNS BOOLEAN -> CASE WHEN CURRENT_ROLE() IN ('ACCOUNTADMIN', 'DATA_ADMIN') THEN TRUE WHEN CURRENT_ROLE() = 'ANALYST' AND TO_TIMESTAMP(UNIXREVIEWTIME) >= '2010-01-01'::TIMESTAMP THEN TRUE ELSE FALSE END;
Apply the Policy
ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg ADD ROW ACCESS POLICY LAKEHOUSE_VHOL.PUBLIC.time_based_review_policy ON (UNIXREVIEWTIME);
Verify Role-Based Access
-- DATA_ADMIN: should see all 100,000 rows (1997-2014) USE ROLE DATA_ADMIN; SELECT 'DATA_ADMIN' AS role, COUNT(*) AS visible_rows, MIN(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS earliest, MAX(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS latest FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg; -- ANALYST: should see fewer rows (2010-2014 only) USE ROLE ANALYST; SELECT 'ANALYST' AS role, COUNT(*) AS visible_rows, MIN(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS earliest, MAX(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS latest FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg;
Dynamic Data Masking
Dynamic Data Masking is a column-level security feature that masks data at query time based on the user's role. The original data remains unchanged; masking is applied dynamically.
Create a Masking Policy
USE ROLE ACCOUNTADMIN; CREATE OR REPLACE MASKING POLICY LAKEHOUSE_VHOL.PUBLIC.mask_reviewer_name AS (val STRING) RETURNS STRING -> CASE WHEN CURRENT_ROLE() IN ('ACCOUNTADMIN', 'DATA_ADMIN') THEN val ELSE '***MASKED***' END;
Apply the Masking Policy
ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg MODIFY COLUMN REVIEWERNAME SET MASKING POLICY LAKEHOUSE_VHOL.PUBLIC.mask_reviewer_name;
Verify Masking Behavior
-- DATA_ADMIN sees real names USE ROLE DATA_ADMIN; SELECT REVIEWERID, REVIEWERNAME, SUMMARY FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg LIMIT 5; -- ANALYST sees masked names USE ROLE ANALYST; SELECT REVIEWERID, REVIEWERNAME, SUMMARY FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg LIMIT 5;
Object Tagging
Tags enable data classification and can be used to organize data governance across your organization. Tags can be applied to tables, columns, and other database objects.
Create Classification Tags
USE ROLE ACCOUNTADMIN; CREATE OR REPLACE TAG LAKEHOUSE_VHOL.PUBLIC.pii_type ALLOWED_VALUES 'name', 'email', 'id', 'none'; CREATE OR REPLACE TAG LAKEHOUSE_VHOL.PUBLIC.data_sensitivity ALLOWED_VALUES 'high', 'medium', 'low';
Apply Tags to Columns
ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg MODIFY COLUMN REVIEWERNAME SET TAG LAKEHOUSE_VHOL.PUBLIC.pii_type = 'name'; ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg MODIFY COLUMN REVIEWERID SET TAG LAKEHOUSE_VHOL.PUBLIC.pii_type = 'id'; ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg MODIFY COLUMN REVIEWTEXT SET TAG LAKEHOUSE_VHOL.PUBLIC.data_sensitivity = 'medium';
Apply Tags to the Table
ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg SET TAG LAKEHOUSE_VHOL.PUBLIC.data_sensitivity = 'medium';
View All Tags
SELECT * FROM TABLE( INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS( 'LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg', 'TABLE' ) );
Data Quality: Monitoring with Data Metric Functions
Overview
Data Metric Functions (DMFs) measure the state and integrity of your data. Snowflake provides built-in system DMFs and supports custom DMFs for specialized quality measurements. DMFs can be scheduled to run automatically and results are stored in a dedicated event table.
Benefits of DMFs:
- Automated measurement - Schedule quality checks without manual intervention
- Compliance facilitation - Document data quality for regulatory requirements
- SLA monitoring - Track freshness and accuracy metrics
- Consistent evaluation - Apply the same metrics across tables
Create Custom DMFs (3-data-quality.sql)
Empty Review Count DMF
Checks for NULL or empty review text:
USE ROLE ACCOUNTADMIN; CREATE OR REPLACE DATA METRIC FUNCTION LAKEHOUSE_VHOL.PUBLIC.empty_review_count( ARG_T TABLE(ARG_C STRING) ) RETURNS NUMBER AS 'SELECT COUNT(*) FROM ARG_T WHERE ARG_C IS NULL OR TRIM(ARG_C) = \'\'';
Invalid Rating Count DMF
Validates ratings are within the expected 1-5 range:
CREATE OR REPLACE DATA METRIC FUNCTION LAKEHOUSE_VHOL.PUBLIC.invalid_rating_count( ARG_T TABLE(ARG_C FLOAT) ) RETURNS NUMBER AS 'SELECT COUNT(*) FROM ARG_T WHERE ARG_C < 1 OR ARG_C > 5 OR ARG_C IS NULL';
Schedule DMF Execution
ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg SET DATA_METRIC_SCHEDULE = '5 MINUTE';
Run DMFs Directly
-- Check for empty reviews SELECT LAKEHOUSE_VHOL.PUBLIC.empty_review_count( SELECT REVIEWTEXT FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg ) AS empty_review_count; -- Check for invalid ratings SELECT LAKEHOUSE_VHOL.PUBLIC.invalid_rating_count( SELECT OVERALL FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg ) AS invalid_rating_count;
DMF Execution Model
| Feature | Description |
|---|---|
| Serverless compute | DMFs use Snowflake-managed compute resources |
| Event logging | Results are logged to a dedicated event table |
| Billing | Charged under "Data Quality Monitoring" category |
| Scheduling | All DMFs on a table follow the same schedule |
AI Agent: Natural Language Data Access
Overview
Snowflake Cortex Agents are AI-powered assistants that orchestrate across multiple tools to answer questions and perform tasks. In this lab, we create an agent that uses Cortex Analyst for text-to-SQL capabilities, allowing users to query product reviews using natural language.
Semantic Model
The semantic model (product_reviews_semantic_model.yaml) defines the business meaning of your data, enabling the AI to generate accurate SQL queries.
Key Components:
Dimensions - Categorical attributes for grouping and filtering:
dimensions: - name: asin synonyms: - product_id - amazon_product_id description: Amazon Standard Identification Number - unique product identifier expr: ASIN data_type: VARCHAR
Time Dimensions - Date/time fields for temporal analysis:
time_dimensions: - name: review_timestamp synonyms: - review_date - timestamp description: Unix timestamp when the review was posted expr: TO_TIMESTAMP(UNIXREVIEWTIME) data_type: TIMESTAMP
Measures - Quantitative metrics for aggregation:
measures: - name: overall_rating synonyms: - rating - stars - score description: Overall rating given by the reviewer (typically 1-5 scale) expr: OVERALL data_type: FLOAT default_aggregation: avg
Verified Queries - Pre-validated query patterns:
verified_queries: - name: average_rating question: What is the average overall rating? sql: | SELECT AVG(OVERALL) as average_rating FROM LAKEHOUSE_VHOL.PUBLIC.PRODUCT_REVIEWS_ICEBERG
Create the Agent (4-create-agent.sql)
Upload the Semantic Model
USE LAKEHOUSE_VHOL.PUBLIC; USE WAREHOUSE VHOL_XS; USE ROLE ACCOUNTADMIN; CREATE STAGE IF NOT EXISTS LAKEHOUSE_VHOL.PUBLIC.SEMANTIC_MODELS; -- Manually upload product_reviews_semantic_model.yaml to the stage
Setup Snowflake Intelligence Database
For agents to be discoverable in Snowflake Intelligence, they must be created in the SNOWFLAKE_INTELLIGENCE.AGENTS schema:
CREATE DATABASE IF NOT EXISTS SNOWFLAKE_INTELLIGENCE; GRANT USAGE ON DATABASE SNOWFLAKE_INTELLIGENCE TO ROLE PUBLIC; CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_INTELLIGENCE.AGENTS; GRANT USAGE ON SCHEMA SNOWFLAKE_INTELLIGENCE.AGENTS TO ROLE PUBLIC; GRANT CREATE AGENT ON SCHEMA SNOWFLAKE_INTELLIGENCE.AGENTS TO ROLE ACCOUNTADMIN;
Create the Agent
CREATE OR REPLACE AGENT SNOWFLAKE_INTELLIGENCE.AGENTS.AMAZON_PRODUCT_REVIEWS_AGENT COMMENT = 'Agent for querying Amazon product reviews data' PROFILE = '{"display_name": "Amazon Product Reviews Assistant"}' FROM SPECIFICATION $$ { "models": { "orchestration": "claude-4-sonnet" }, "instructions": { "orchestration": "Use the product reviews tool to answer questions about reviews, ratings, and customer feedback.", "response": "Be concise and data-driven. When showing ratings, include counts for context." }, "tools": [ { "tool_spec": { "type": "cortex_analyst_text_to_sql", "name": "product_reviews", "description": "Query product reviews data including ratings, review text, summaries, and reviewer information" } } ], "tool_resources": { "product_reviews": { "semantic_model_file": "@LAKEHOUSE_VHOL.PUBLIC.SEMANTIC_MODELS/product_reviews_semantic_model.yaml", "execution_environment": { "type": "warehouse", "warehouse": "VHOL_XS" } } } } $$;
Agent Specification Components
| Component | Description |
|---|---|
models.orchestration | LLM for planning and tool selection (e.g., claude-4-sonnet, llama3.3-70B) |
instructions.orchestration | Guidance for tool selection and query planning |
instructions.response | Tone and formatting for responses |
tools | Array of tool definitions (Cortex Analyst, Cortex Search, custom tools) |
tool_resources | Configuration for each tool (semantic model, warehouse, etc.) |
Grant Access
GRANT USAGE ON AGENT SNOWFLAKE_INTELLIGENCE.AGENTS.AMAZON_PRODUCT_REVIEWS_AGENT TO ROLE PUBLIC;
Using the Agent
After creation, the agent is accessible via:
- Snowflake Intelligence UI (AI & ML > Snowflake Intelligence)
- REST API for programmatic access
- SQL using agent functions
Example questions the agent can answer:
- "What is the average rating for all products?"
- "How many reviews are there for each star rating?"
- "Show me the most recent reviews"
- "What products have the highest ratings?"
Conclusion and Resources
What did you build?
In this guide, you built an end-to-end governed data lakehouse for Amazon Product Reviews Analytics.
You have covered key aspects of building a production-ready, AI-enabled data lakehouse that combines open table formats with enterprise governance. You created a complete solution that:
- Connects to external Iceberg data stored in Amazon S3
- Implements fine-grained access controls with row-level security and data masking
- Monitors data quality with automated metric functions
- Enables natural language queries through a Cortex AI Agent
Additional Resources
This content is provided as is, and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances