
Interoperable Lakehouse with AWS Glue, Snowflake Iceberg Tables, and Snowpark Connect for Apache Spark

Apache Iceberg
Parag Jain, Vino Duraisamy

Overview

Apache Iceberg is an open table format for huge analytical datasets that enables high-performance analytics on open data formats with ACID compliance. Broadly adopted across major data platforms, Iceberg is designed for vendor-neutral interoperability: you can write data once and query it from any compatible engine, with no proprietary formats and no data migration required.

This guide demonstrates how to build a modern, interoperable lakehouse architecture that seamlessly integrates AWS Glue, Apache Iceberg, and Snowflake to enable data processing workloads across multiple compute engines while maintaining a single source of truth.

The solution showcases how organizations can leverage best-of-breed tools without vendor lock-in, allowing Spark code written for AWS Glue to run unmodified on Snowflake using Snowpark Connect.

Lakehouse Architecture

Your organization's data lives everywhere -- across tools, platforms, and cloud providers. If you use both AWS and Snowflake, you can build a connected and interoperable architecture that focuses on getting value from your data without costly and complex data movement.

For this guide, we will use a Sales Analytics use case where 1M+ transaction records are processed. We'll load sales and customer data into Iceberg tables via AWS Glue, then demonstrate how the exact same Spark code can run on both AWS Glue and Snowflake to generate identical Top 10 product sales reports.

Prerequisites

  • Familiarity with Snowflake, basic SQL, Snowsight UI, and Snowflake Objects
  • Familiarity with AWS Services (S3, Glue, CloudFormation, IAM) and the Management Console
  • Basic knowledge of Python and PySpark

What You'll Learn

  • How to create Iceberg tables in the AWS Glue Data Catalog using Glue ETL jobs
  • How to configure a Snowflake Catalog-Linked Database with the Glue Data Catalog via Iceberg REST Catalog APIs
  • How to use Snowpark Connect to run identical Spark code on Snowflake
  • How to achieve true code portability between AWS Glue and Snowflake with zero data movement

What You'll Need

  • A Snowflake Account with ACCOUNTADMIN access (Enterprise Edition or higher recommended)
  • An AWS Account with administrative access
  • AWS CLI installed and configured (optional, for command-line deployment)
  • Snowflake Notebook feature enabled (requires Snowflake Runtime 2.0+)
  • The accompanying GitHub repository cloned or downloaded to your local machine

What You'll Build

  • An AWS Glue Database with Iceberg tables stored on S3
  • CloudFormation-deployed infrastructure including IAM roles and Glue ETL jobs
  • Snowflake External Volume and Catalog Integration with AWS Glue
  • Snowflake Catalog-Linked Database for automatic table discovery
  • Sales analytics pipeline demonstrating 95%+ code reuse between platforms

Setup

This guide has an accompanying GitHub repository that contains all the files, scripts, data, and resources you need to complete this quickstart:

Repository Contents:

glue_cld_snowflake/
├── cloudformation/
│   └── iceberg_glue_stack_cloudformation_template.yaml  # AWS infrastructure template
├── data/
│   ├── customers.json                                    # Customer data
│   └── sales_data_1m.json.gz                            # 1M sales records (compressed)
├── scripts/
│   ├── iceberg_cld_demo_sales_table_script_1.py         # Load sales data
│   ├── iceberg_cld_demo_customer_table_script_2.py      # Load customer data
│   ├── iceberg_cld_demo_sales_report_table_script_3.py  # Generate sales report
│   └── iceberg_cld_demo_snow_report_table_script_4.py   # Create Snowflake report table
├── snowflake_notebook/
│   └── 01_GLUE_LH_SNOWPARK_CONNECT_DEMO.ipynb           # Snowflake notebook
├── diagrams/                                             # Architecture diagrams
└── trusted_policy/
    └── IAM_trust_policy.json                             # IAM trust policy template

Clone or download this repository before starting the quickstart to have all necessary files readily available.

Architecture and Data Flow

This solution implements an open lakehouse architecture that separates storage from compute and uses open standards (Apache Iceberg) to enable seamless interoperability.

Key Components

1. AWS Glue with Apache Iceberg

  • Infrastructure as Code via CloudFormation template
  • Glue Database for Iceberg table metadata
  • 4 pre-configured Glue ETL jobs (Glue 4.0 with Spark 3.5)
  • IAM roles with appropriate S3 and Glue permissions

2. Snowflake Catalog-Linked Database

  • Automatically syncs with AWS Glue Data Catalog
  • Discovers Iceberg tables without manual configuration
  • Supports both read and write operations to the remote catalog
  • Sync interval configurable (default 60 seconds)

3. Snowpark Connect

  • Apache Spark DataFrame API implementation for Snowflake
  • Enables existing Spark applications to run unchanged
  • Provides Spark-compatible API with Snowflake's optimized query engine

Data Flow Diagram

┌─────────────────────────────────────────────────────────────────┐
│                     S3 Data Bucket                              │
│  • raw_data/sales_data_1m.json.gz (1M records)                 │
│  • raw_data/customers.json                                      │
│  • iceberg-warehouse/ (Iceberg table storage)                   │
└────────────┬──────────────────────────────────┬─────────────────┘
             │                                  │
             ▼                                  ▼
┌─────────────────────────┐        ┌──────────────────────────────┐
│   AWS Glue ETL Jobs     │        │  AWS Glue Data Catalog       │
│  (Spark on AWS)         │◄──────►│  (Iceberg REST Catalog)      │
│                         │        │  • Metadata for all tables   │
│  Script 1: Load Sales   │        │  • Schema definitions        │
│  Script 2: Load Customers│       │  • Partition info            │
│  Script 3: Sales Report │        └──────────┬───────────────────┘
│  Script 4: Snow Report  │                   │
└─────────────────────────┘                   │ Iceberg REST API
                                              │ (Catalog Integration)
                                              ▼
                                   ┌──────────────────────────────┐
                                   │ Snowflake Catalog-Linked DB  │
                                   │  (Snowpark Connect)          │
                                   │                              │
                                   │  • Auto-discovers tables     │
                                   │  • Notebook with Spark code  │
                                   │  • Generates same report     │
                                   │  • Zero data movement        │
                                   └──────────────────────────────┘

Upload Data into S3

In this step, we will create an S3 bucket and upload the data files and Glue scripts needed for the demo.

Create S3 Bucket

  • Navigate to AWS S3 Console and create a new bucket
  • Choose a unique name (e.g., my-iceberg-lakehouse-demo-<your-account-id>)
  • Select the same region as your Snowflake account (to minimize egress charges)
  • Keep default settings for encryption and versioning
# Option: Using AWS CLI
export DATA_BUCKET=my-iceberg-lakehouse-demo-<your-account-id>
export AWS_REGION=us-west-2  # Match with your Snowflake region

aws s3 mb s3://$DATA_BUCKET --region $AWS_REGION

Create Folder Structure

Create the required folders in your bucket:

  • raw_data/ - for source data files
  • scripts/ - for Glue ETL scripts
  • iceberg-warehouse/ - for Iceberg table storage
# Create folder structure
aws s3api put-object --bucket $DATA_BUCKET --key raw_data/
aws s3api put-object --bucket $DATA_BUCKET --key scripts/
aws s3api put-object --bucket $DATA_BUCKET --key iceberg-warehouse/

Upload Data Files

Upload the data files from the data/ folder in the repository:

# Navigate to the cloned repository directory
# (https://github.com/sfc-gh-pjain/glue_cld_snowflake)
cd glue_cld_snowflake

# Upload data files from the cloned repository
# (https://github.com/sfc-gh-pjain/glue_cld_snowflake/tree/main/data)
aws s3 cp data/customers.json s3://$DATA_BUCKET/raw_data/customers.json
aws s3 cp data/sales_data_1m.json.gz s3://$DATA_BUCKET/raw_data/sales_data_1m.json.gz

# Verify upload
aws s3 ls s3://$DATA_BUCKET/raw_data/

Upload Glue Scripts

Upload all Glue ETL scripts from the scripts/ folder in the repository:

# Upload all Glue scripts from the cloned repository
# (https://github.com/sfc-gh-pjain/glue_cld_snowflake/tree/main/scripts)
aws s3 sync scripts/ s3://$DATA_BUCKET/scripts/ --exclude "*" --include "*.py"

# Verify scripts upload
aws s3 ls s3://$DATA_BUCKET/scripts/

Expected Output:

iceberg_cld_demo_customer_table_script_2.py
iceberg_cld_demo_sales_table_script_1.py
iceberg_cld_demo_sales_report_table_script_3.py
iceberg_cld_demo_snow_report_table_script_4.py

Create Snowflake External Volume

In this step, we will create the Snowflake database and External Volume to enable S3 access.

Run Initial Setup Cells

Run the first cells of the Snowflake notebook to create the database and external volume.

Cell 0: Create Database and Schema

CREATE OR REPLACE DATABASE ICEBERG_LAKE;
CREATE OR REPLACE SCHEMA ICEBERG_LAKE.DEMO;
USE SCHEMA ICEBERG_LAKE.DEMO;

Cell 2: Create External Volume

Update the following parameters in Cell 2 before running:

  • <Your dataBucket name> → Your S3 bucket name from Step 1
  • <AWS ROLE ARN> → Leave as placeholder (update after CloudFormation)
  • <Any secret word you want> → Choose a secret external ID (e.g., my_iceberg_ext_id_2024)

CREATE OR REPLACE EXTERNAL VOLUME extvol_iceberg_demo
STORAGE_LOCATIONS =
      (
         (
            NAME = 'Iceberg-Table-Demo'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://<Your dataBucket name>/'
            STORAGE_AWS_ROLE_ARN = '<AWS ROLE ARN>'
            STORAGE_AWS_EXTERNAL_ID = '<Any secret word you want>'
         )
      );

Important: Save the External ID you choose - you'll need it for CloudFormation deployment.

Get Snowflake Credentials

Run Cell 3 to retrieve the Snowflake-generated IAM User ARN and External ID:

DESC EXTERNAL VOLUME extvol_iceberg_demo;

SELECT b.KEY, b.VALUE 
FROM 
    TABLE(RESULT_SCAN(LAST_QUERY_ID())) a, 
    TABLE(FLATTEN(INPUT => PARSE_JSON(a."property_value"))) b 
WHERE 
    a."parent_property" = 'STORAGE_LOCATIONS'
    AND a."property" = 'STORAGE_LOCATION_1'
    AND (b.KEY='STORAGE_AWS_EXTERNAL_ID' OR b.KEY ='STORAGE_AWS_IAM_USER_ARN');

Output:

KEY                          | VALUE
-----------------------------+----------------------------------------
STORAGE_AWS_IAM_USER_ARN     | arn:aws:iam::123456789012:user/abc...
STORAGE_AWS_EXTERNAL_ID      | XYZ12345_SFCRole=1_AbCdEfGh...

IMPORTANT: Copy and save these values!

  • STORAGE_AWS_IAM_USER_ARN → Use as SnowflakeUserArn in CloudFormation
  • STORAGE_AWS_EXTERNAL_ID → Use as SnowflakeExternalId in CloudFormation
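The SQL above flattens the JSON stored in the `property_value` column of the `DESC EXTERNAL VOLUME` output. As a sanity check, here is the same extraction in plain Python (the sample JSON values are illustrative, not real credentials):

```python
import json

# Illustrative sample of the "property_value" JSON that
# DESC EXTERNAL VOLUME returns for STORAGE_LOCATION_1.
property_value = json.dumps({
    "NAME": "Iceberg-Table-Demo",
    "STORAGE_PROVIDER": "S3",
    "STORAGE_AWS_IAM_USER_ARN": "arn:aws:iam::123456789012:user/abc",
    "STORAGE_AWS_EXTERNAL_ID": "XYZ12345_SFCRole=1_AbCdEfGh",
})

# Mirror of the FLATTEN + filter in the SQL: keep only the two keys
# needed as CloudFormation parameters.
wanted = {"STORAGE_AWS_IAM_USER_ARN", "STORAGE_AWS_EXTERNAL_ID"}
creds = {k: v for k, v in json.loads(property_value).items() if k in wanted}
print(creds)
```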

Deploy AWS CloudFormation Stack

In this step, we will deploy the CloudFormation template that creates all AWS infrastructure including the Glue database, IAM role, and ETL jobs.

Prepare CloudFormation Parameters

Gather these parameters before deploying:

  • GlueDatabaseName → Name for the Glue database (e.g., snowcldtest)
  • GlueRoleName → Name for the IAM role (e.g., snowcldtest)
  • DataBucketName → S3 bucket from Step 1 (e.g., snowcldtestbucket)
  • S3ScriptLocation → Path to any script (e.g., s3://snowcldtestbucket/scripts/iceberg_cld_demo_sales_table_script_1.py)
  • SnowflakeUserArn → From Step 4 (e.g., arn:aws:iam::123456789012:user/abc...)
  • SnowflakeExternalId → From Step 4 (e.g., XYZ12345_SFCRole=1_...)
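For scripted deployments, the parameter list above maps one-to-one onto the `--parameters` arguments of `aws cloudformation create-stack`. A small hypothetical helper makes that mapping explicit (the parameter names come from the table; the values are illustrative):

```python
# Hypothetical helper: render a dict of stack parameters into the
# ParameterKey=...,ParameterValue=... arguments that
# `aws cloudformation create-stack --parameters` expects.
def to_cfn_parameters(params: dict) -> list:
    return [f"ParameterKey={k},ParameterValue={v}" for k, v in params.items()]

stack_params = {
    "GlueDatabaseName": "snowcldtest",
    "GlueRoleName": "snowcldtest",
    "DataBucketName": "snowcldtestbucket",
    "S3ScriptLocation": "s3://snowcldtestbucket/scripts/iceberg_cld_demo_sales_table_script_1.py",
    "SnowflakeUserArn": "arn:aws:iam::123456789012:user/abc",
    "SnowflakeExternalId": "XYZ12345_SFCRole=1_example",
}
for arg in to_cfn_parameters(stack_params):
    print(arg)
```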

Deploy via AWS Console

Create Stack

  • In the CloudFormation console, choose Create stack and upload the template file cloudformation/iceberg_glue_stack_cloudformation_template.yaml from the cloned repository

Enter Stack Parameters

  • Stack name: iceberg-glue-snowflake-demo
  • Fill in all parameters from the table above
  • Click Next

Configure Stack Options

  • Tags (optional): Add tags for cost tracking
  • Permissions: Leave as default
  • Stack failure options: Select Delete newly created resources during rollback
  • Click Next

Review and Deploy

  • Review all parameters
  • Check the box: "I acknowledge that AWS CloudFormation might create IAM resources with custom names"
  • Click Submit

Monitor Stack Creation

  • Wait for stack status to change to CREATE_COMPLETE (typically 2-5 minutes)
  • Monitor the Events tab for progress
  • If any errors occur, check the Events tab for details

Deploy via AWS CLI (Alternative)

# Set variables
export GLUE_DB_NAME=iceberg_demo_db
export GLUE_ROLE_NAME=GlueIcebergRole
export DATA_BUCKET=my-iceberg-lakehouse-demo-123456789012
export SNOWFLAKE_USER_ARN="arn:aws:iam::123456789012:user/abc..."
export SNOWFLAKE_EXT_ID="XYZ12345_SFCRole=1_..."

# Deploy stack using CloudFormation template from the cloned repository
# (https://github.com/sfc-gh-pjain/glue_cld_snowflake/tree/main/cloudformation)
aws cloudformation create-stack \
  --stack-name iceberg-glue-snowflake-demo \
  --template-body file://cloudformation/iceberg_glue_stack_cloudformation_template.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --on-failure DELETE \
  --parameters \
      ParameterKey=GlueDatabaseName,ParameterValue=$GLUE_DB_NAME \
      ParameterKey=GlueRoleName,ParameterValue=$GLUE_ROLE_NAME \
      ParameterKey=DataBucketName,ParameterValue=$DATA_BUCKET \
      ParameterKey=S3ScriptLocation,ParameterValue=s3://$DATA_BUCKET/scripts/iceberg_cld_demo_sales_table_script_1.py \
      ParameterKey=SnowflakeUserArn,ParameterValue="$SNOWFLAKE_USER_ARN" \
      ParameterKey=SnowflakeExternalId,ParameterValue="$SNOWFLAKE_EXT_ID"

# Check stack status
aws cloudformation describe-stacks \
  --stack-name iceberg-glue-snowflake-demo \
  --query 'Stacks[0].StackStatus'

After CloudFormation completes, retrieve the IAM Role ARN and update the Snowflake notebook.

Get the Glue IAM Role ARN

Using AWS Console:

  • Go to CloudFormation → Stacks → iceberg-glue-snowflake-demo → Outputs tab
  • Look for "GlueRole" key → Copy the role name

Using AWS CLI:

aws cloudformation describe-stacks \
  --stack-name iceberg-glue-snowflake-demo \
  --query 'Stacks[0].Outputs[?OutputKey==`GlueRole`].OutputValue' \
  --output text

Construct Full Role ARN

arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<GlueRoleName>

Example:

arn:aws:iam::123456789012:role/GlueIcebergRole

Update Snowflake Notebook Parameters

Go back to your Snowflake notebook and update the AWS Role ARN in Cell 2:

CREATE OR REPLACE EXTERNAL VOLUME extvol_iceberg_demo
STORAGE_LOCATIONS = (
    (
        NAME = 'Iceberg-Table-Demo'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://<Your dataBucket name>/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<role-name>'  -- UPDATE THIS
        STORAGE_AWS_EXTERNAL_ID = '<your-external-id>'
    )
);

Re-run Cell 2 to recreate the external volume with the correct IAM role.

Verify AWS Glue Resources

Before running the ETL jobs, verify that CloudFormation created all required resources.

Check Glue Database

# Using AWS CLI
aws glue get-database --name $GLUE_DB_NAME

# Using AWS Console
# Navigate to: AWS Glue → Data Catalog → Databases
# Verify your database exists

Check Glue ETL Jobs

Navigate to AWS Glue Console → ETL Jobs

You should see 4 Glue jobs created:

  • iceberg-sales-table-<db-name> → Load 1M sales records to Iceberg
  • iceberg-customer-table-<db-name> → Load customer data to Iceberg
  • iceberg-sales-report-<db-name> → Generate Top 10 products report (Spark)
  • iceberg-snow-report-<db-name> → Create placeholder table for Snowflake

Verify Job Configuration

Each job should have:

  • Script location: s3://<your-bucket>/scripts/iceberg_cld_demo_*.py
  • IAM role: The role created by CloudFormation
  • Glue version: 4.0
  • Worker type: G.1X
  • Number of workers: 2
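This checklist can be verified programmatically against the JSON that `aws glue get-job` (or `boto3.client("glue").get_job(JobName=...)["Job"]`) returns. The sketch below uses a hypothetical checker and a hand-written sample config so it runs without AWS access:

```python
# Expected values from the checklist above.
EXPECTED = {"GlueVersion": "4.0", "WorkerType": "G.1X", "NumberOfWorkers": 2}

def check_job_config(job: dict) -> list:
    """Return a list of mismatches between a Glue job config and EXPECTED."""
    problems = []
    for key, want in EXPECTED.items():
        got = job.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    if not job.get("Command", {}).get("ScriptLocation", "").startswith("s3://"):
        problems.append("ScriptLocation is not an s3:// path")
    return problems

# Sample config standing in for the boto3 get_job response.
sample_job = {
    "GlueVersion": "4.0", "WorkerType": "G.1X", "NumberOfWorkers": 2,
    "Command": {"ScriptLocation": "s3://my-bucket/scripts/iceberg_cld_demo_sales_table_script_1.py"},
}
print(check_job_config(sample_job))  # [] means the configuration matches
```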

Run Glue ETL Jobs

In this step, we will run the Glue ETL jobs to load data into Iceberg tables.

Run Customer Table Load Job

Using AWS Console:

  • Go to AWS Glue → ETL Jobs
  • Select iceberg-customer-table-<db-name>
  • Click Run button
  • Monitor the Runs tab for status

Using AWS CLI:

aws glue start-job-run --job-name iceberg-customer-table-$GLUE_DB_NAME

Expected Duration: 2-3 minutes

Run Sales Table Load Job

Using AWS Console:

  • Select iceberg-sales-table-<db-name>
  • Click Run
  • Monitor progress (will take longer due to 1M records)

Using AWS CLI:

aws glue start-job-run --job-name iceberg-sales-table-$GLUE_DB_NAME

Expected Duration: 3-5 minutes (processing 1M gzipped records)

Wait for Both Jobs to Complete

Both jobs must show SUCCEEDED status before proceeding.
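If you prefer to wait programmatically rather than refresh the console, the pattern is a simple poll until the run reaches a terminal state. In real use, `get_state` would call `boto3.client("glue").get_job_run(JobName=..., RunId=...)["JobRun"]["JobRunState"]`; here a stubbed status sequence stands in so the sketch is self-contained:

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}

def wait_for_job(get_state, poll_seconds=0, max_polls=100):
    """Poll get_state() until it returns a terminal Glue job run state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state")

# Stubbed run: RUNNING twice, then SUCCEEDED.
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_job(lambda: next(states))
print(final)  # SUCCEEDED
```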

Verify Iceberg Tables in Glue Catalog

# List tables in the Glue database
aws glue get-tables --database-name $GLUE_DB_NAME \
  --query 'TableList[*].[Name,StorageDescriptor.Location]' \
  --output table

Expected Output:

-----------------------------------------------------------
|                       GetTables                         |
+--------------------+-----------------------------------+
|  customers_info    |  s3://my-bucket/iceberg-warehouse/...|
|  raw_sales_data    |  s3://my-bucket/iceberg-warehouse/...|
-----------------------------------------------------------

Generate Sales Report

In this step, we run the sales report job that performs the same analytics that we'll later replicate in Snowflake.

Run Sales Report Job

This job reads from the raw_sales_data Iceberg table and generates the Top 10 products report.

Using AWS Console:

  • Select iceberg-sales-report-<db-name> job
  • Click Run
  • Monitor the run (check logs for detailed output)

Using AWS CLI:

aws glue start-job-run --job-name iceberg-sales-report-$GLUE_DB_NAME

Expected Duration: 3-5 minutes

What the Report Job Does

The sales report script performs these operations:

  1. Reads from raw_sales_data Iceberg table
  2. Flattens the nested purchase structure
  3. Aggregates by product_id calculating:
    • Total sales amount
    • Total quantity sold
    • Unique customers
    • Transaction counts
    • Average transaction metrics
  4. Generates Top 10 products by sales amount
  5. Saves to Iceberg table: top_10_products_report_spark
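To make the flatten → filter → aggregate → top-N steps concrete, here is the same logic in plain Python on a toy dataset. Field names mirror the Iceberg table; the records themselves are made up:

```python
from collections import defaultdict

raw = [
    {"customer_id": 1, "purchases": {"prodid": 10, "purchase_amount": 50.0, "quantity": 2}},
    {"customer_id": 2, "purchases": {"prodid": 10, "purchase_amount": 30.0, "quantity": 1}},
    {"customer_id": 1, "purchases": {"prodid": 20, "purchase_amount": 200.0, "quantity": 4}},
    {"customer_id": 3, "purchases": {"prodid": 10, "purchase_amount": -5.0, "quantity": 1}},  # dropped by filter
]

# Steps 1-2: flatten the nested purchase struct, dropping invalid rows.
rows = [
    {"product_id": r["purchases"]["prodid"],
     "amount": r["purchases"]["purchase_amount"],
     "qty": r["purchases"]["quantity"],
     "customer_id": r["customer_id"]}
    for r in raw
    if r["purchases"]["prodid"] > 0
       and r["purchases"]["purchase_amount"] > 0
       and r["purchases"]["quantity"] > 0
]

# Step 3: aggregate per product.
agg = defaultdict(lambda: {"total_sales_amount": 0.0, "total_quantity_sold": 0,
                           "total_transactions": 0, "customers": set()})
for row in rows:
    a = agg[row["product_id"]]
    a["total_sales_amount"] += row["amount"]
    a["total_quantity_sold"] += row["qty"]
    a["total_transactions"] += 1
    a["customers"].add(row["customer_id"])

# Step 4: top N products by total sales amount.
top = sorted(agg.items(), key=lambda kv: kv[1]["total_sales_amount"], reverse=True)[:10]
print(top[0][0])  # product 20 leads with 200.0
```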

Run Snowflake Report Table Creation Job

Create the placeholder table for the Snowflake-generated report:

aws glue start-job-run --job-name iceberg-snow-report-$GLUE_DB_NAME

Expected Duration: 1-2 minutes

Verify All Report Tables

aws glue get-tables --database-name $GLUE_DB_NAME \
  --query 'TableList[?starts_with(Name, `top_10`)].Name'

Expected Output:

[
    "top_10_products_report_spark",
    "top_10_products_report_snow"
]

Create Snowflake Catalog Integration

Now we'll create the Snowflake Catalog Integration to connect to the AWS Glue Data Catalog.

Update Catalog Integration Parameters

In Cell 4 of the Snowflake notebook, update these parameters:

  • <Glue database Name> → Your CloudFormation parameter
  • <AWS REGION> → Your AWS region (e.g., us-west-2)
  • <AWS ACCOUNT NUMBER> → Your AWS account ID
  • <AWS ROLE ARN> → From CloudFormation outputs
  • <External ID> → Optional (see note below)

Note on SIGV4_EXTERNAL_ID: This is the external ID used to secure the AWS IAM trust relationship for the catalog integration — it is not the same as STORAGE_AWS_EXTERNAL_ID from your external volume. You can either provide your own string value here (which you then add to your IAM role's trust policy), or omit SIGV4_EXTERNAL_ID entirely and let Snowflake auto-generate one. After creating the integration, run DESC CATALOG INTEGRATION glue_rest_cat_int_demo and copy the GLUE_AWS_EXTERNAL_ID value — that is what you add to the IAM trust policy condition.

CREATE OR REPLACE CATALOG INTEGRATION glue_rest_cat_int_demo
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = '<Glue database Name>'
  REST_CONFIG = (
    CATALOG_URI = 'https://glue.<AWS REGION>.amazonaws.com/iceberg'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '<AWS ACCOUNT NUMBER>'
  )
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = '<AWS ROLE ARN>'
    SIGV4_SIGNING_REGION = '<AWS REGION>'
    SIGV4_EXTERNAL_ID = '<External ID>'
  )
  ENABLED = TRUE;

Run Cell 4 to create the catalog integration.

Create Catalog-Linked Database

Update Cell 6 with your Glue database name:

CREATE DATABASE IF NOT EXISTS glue_lake_int_db
  LINKED_CATALOG = (
    CATALOG = 'glue_rest_cat_int_demo',
    ALLOWED_NAMESPACES = ('<Glue database name>'),
    NAMESPACE_MODE = FLATTEN_NESTED_NAMESPACE,
    NAMESPACE_FLATTEN_DELIMITER = '-',
    SYNC_INTERVAL_SECONDS = 60
  )
  EXTERNAL_VOLUME = 'extvol_iceberg_demo';

Run Cell 6 to create the catalog-linked database.

Verify Catalog Link Status

Run Cell 7 to check the sync status:

SELECT SYSTEM$CATALOG_LINK_STATUS('glue_lake_int_db');

Expected output should show successful sync status:

{"failureDetails":[],"executionState":"RUNNING","lastLinkAttemptStartTime":"2025-01-27T16:32:42.426Z"}

Verify Tables Are Discovered

Run Cell 8 to see the automatically discovered tables:

USE DATABASE glue_lake_int_db;
USE SCHEMA "<glue database name>";
SHOW TABLES;

You should see all 4 Iceberg tables that were created by Glue:

  • customers_info
  • raw_sales_data
  • top_10_products_report_spark
  • top_10_products_report_snow

Run Apache Spark Code in Snowflake

This is where the magic happens! We'll run the exact same Spark code in Snowflake that we ran in AWS Glue.

Install Snowpark Connect for Apache Spark

First, create a Snowflake notebook and configure the required packages for Snowpark Connect.

Access Snowflake Notebooks:

  • Log into your Snowflake account
  • Navigate to Projects → Notebooks in the left sidebar
  • Click + Notebook button to create a new notebook

Import the Notebook:

  • From the + Notebook dropdown, choose Import .ipynb file and select snowflake_notebook/01_GLUE_LH_SNOWPARK_CONNECT_DEMO.ipynb from the cloned repository

Configure Notebook Runtime:

  • Click on Notebook settings (gear icon)
  • Select Snowflake Runtime: Choose 2.0 or later
  • Select Warehouse: Choose any warehouse for compute (e.g., COMPUTE_WH)
    • Recommended: Small or Medium warehouse for this demo
  • Click Apply

Add Snowpark Connect Package:

  • Click on the Packages tab in the notebook
  • Search for snowpark-connect
  • Select the latest version of Snowpark Connect
  • Click Add to install the package

Note: The notebook will initialize with Snowpark Connect capabilities after adding this package.

Initialize Snowpark Connect

Run Cell 13 to initialize Snowpark Connect:

from snowflake import snowpark_connect
import traceback

spark = snowpark_connect.server.init_spark_session()
spark = snowpark_connect.get_session()

# Use lowercase table identifiers (Glue default)
spark.conf.set("snowpark.connect.sql.identifiers.auto-uppercase", "none")

# Catalog configuration
catalog = "GLUE_LAKE_INT_DB"
database = "<Glue database Name>"  # UPDATE THIS
source_sales_table = "raw_sales_data"  
report_table = "top_10_products_report_snow"

Read Sales Data (Same as Glue Script)

sales_iceberg_df = spark.table(f"{catalog}.{database}.{source_sales_table}")
record_count = sales_iceberg_df.count()
print(f"Sales data loaded successfully from Iceberg table. Total records: {record_count:,}")
sales_iceberg_df.show(3, truncate=False)

Flatten and Clean Data (Same as Glue Script)

from pyspark.sql.functions import col

flattened_sales = sales_iceberg_df.select(
    col("customer_id").alias("customer_id"),
    col("customer_name").alias("customer_name"),
    col("purchases.prodid").alias("product_id"),
    col("purchases.purchase_amount").alias("purchase_amount"),
    col("purchases.quantity").alias("quantity"),
    col("purchases.purchase_date").alias("purchase_date")
)

clean_sales = flattened_sales.filter(
    col("product_id").isNotNull() & 
    (col("product_id") > 0) & 
    col("purchase_amount").isNotNull() & 
    (col("purchase_amount") > 0) &
    col("quantity").isNotNull() & 
    (col("quantity") > 0)
)

Aggregate by Product (Same as Glue Script)

from pyspark.sql.functions import sum as spark_sum, count, avg, max as spark_max, min as spark_min, countDistinct, desc

product_sales_summary = clean_sales.groupBy("product_id").agg(
    spark_sum("purchase_amount").alias("total_sales_amount"),
    spark_sum("quantity").alias("total_quantity_sold"),
    count("*").alias("total_transactions"),
    avg("purchase_amount").alias("avg_transaction_amount"),
    spark_max("purchase_amount").alias("max_transaction_amount"),
    spark_min("purchase_amount").alias("min_transaction_amount"),
    avg("quantity").alias("avg_quantity_per_transaction"),
    spark_max("quantity").alias("max_quantity_per_transaction"),
    countDistinct("customer_id").alias("unique_customers"),
    spark_min("purchase_date").alias("first_sale_date"),
    spark_max("purchase_date").alias("last_sale_date")
)

Generate Top 10 Report (Same as Glue Script)

top_10_products = product_sales_summary.orderBy(desc("total_sales_amount")).limit(10)
print("\n=== TOP 10 HIGHEST SELLING PRODUCTS BY SALES AMOUNT ===")
top_10_products.show(10, truncate=False)

Save Results to Iceberg Table

from pyspark.sql.functions import current_date, lit
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

sales_window = Window.partitionBy(lit(1)).orderBy(desc("total_sales_amount"))

detailed_report = product_sales_summary.withColumn(
    "sales_amount_rank", row_number().over(sales_window)
).filter(col("sales_amount_rank") <= 10).orderBy("sales_amount_rank")

final_report = detailed_report.withColumn("report_date", current_date()) \
                             .withColumn("source_table", lit(source_sales_table)) \
                             .withColumn("job_name", lit("snowflake_notebook"))

# Truncate and insert
spark.sql(f"TRUNCATE TABLE {catalog}.{database}.{report_table}")
target_df = spark.read.table(f"{catalog}.{database}.{report_table}")
final_report.select(*target_df.columns).write.insertInto(f"{catalog}.{database}.{report_table}")

Verify Code Interoperability

Both AWS Glue and Snowflake have now generated the same Top 10 products report using identical Spark code.

Compare Results

Query both reports to verify they produce identical results:

# Get Glue-generated report
print("=== GLUE-GENERATED REPORT ===")
spark.sql(f"SELECT * FROM {catalog}.{database}.top_10_products_report_spark ORDER BY sales_amount_rank").show(10, truncate=False)

# Get Snowflake-generated report
print("=== SNOWFLAKE-GENERATED REPORT ===")
spark.sql(f"SELECT * FROM {catalog}.{database}.{report_table} ORDER BY sales_amount_rank").show(10, truncate=False)
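Beyond eyeballing the two outputs, you can diff them programmatically. In the notebook you would build each list from the corresponding DataFrame, e.g. `[(r["product_id"], round(r["total_sales_amount"], 2)) for r in df.collect()]`; the values below are illustrative stand-ins:

```python
# Illustrative collected rows from the Glue- and Snowflake-generated reports,
# already ordered by sales_amount_rank.
glue_report = [(20, 1200.50), (10, 980.00), (35, 770.25)]
snow_report = [(20, 1200.50), (10, 980.00), (35, 770.25)]

# Row-by-row diff: any pair that disagrees ends up in mismatches.
mismatches = [(g, s) for g, s in zip(glue_report, snow_report) if g != s]
assert len(glue_report) == len(snow_report) and not mismatches, mismatches
print("Reports match row for row")
```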

Verify in AWS Glue Console

Go to AWS Glue Console → Data Catalog → Tables

Check the top_10_products_report_snow table:

  • It should now have data (populated by Snowflake)
  • Metadata automatically synced back to Glue catalog

Key Observations

  • Same Code: 95%+ of the Spark code is identical between platforms
  • Same Results: Both reports show identical Top 10 products
  • Same Data: Both platforms query the same Iceberg files in S3
  • Bi-directional: Both platforms can create and modify tables
  • Zero Data Movement: No ETL between platforms required

Troubleshooting

CloudFormation Stack Creation Failed

Error: IAM permissions denied

Solution:

  • Ensure your AWS user has permissions to create IAM roles
  • Add iam:CreateRole, iam:PutRolePolicy, iam:AttachRolePolicy permissions
  • Or use an admin user for initial deployment

Error: S3 bucket not found

Solution:

  • Verify the bucket exists: aws s3 ls s3://$DATA_BUCKET
  • Ensure bucket name in parameters matches exactly
  • Check bucket is in the same region as CloudFormation stack

Glue Job Fails

Error: "Access Denied" to S3 bucket

Solution:

# Check IAM role permissions
aws iam get-role-policy --role-name $GLUE_ROLE_NAME --policy-name GlueIcebergS3Access

Error: "No JSON object could be decoded"

Solution:

  • Verify data files uploaded correctly to S3
  • Check file paths match: s3://<bucket>/raw_data/customers.json

Tables Not Appearing in Snowflake

Error: SHOW TABLES returns empty in catalog-linked database

Solution:

  1. Check catalog link status:

SELECT SYSTEM$CATALOG_LINK_STATUS('glue_lake_int_db');

  2. Verify ALLOWED_NAMESPACES includes your Glue database

  3. Verify tables exist in Glue:

aws glue get-tables --database-name $GLUE_DB_NAME

  4. If tables still don't appear, drop and recreate the catalog-linked database

Snowpark Connect Not Working

Error: "Module 'snowpark_connect' not found"

Solution:

  • Ensure Snowpark Connect package is added in Packages tab
  • Restart notebook kernel
  • Check Snowflake Runtime is 2.0 or later

External ID Mismatch

Error: "External ID mismatch"

Solution:

  • The external ID in Catalog Integration must match:
    • The External ID from DESC EXTERNAL VOLUME output
    • The SnowflakeExternalId parameter in CloudFormation
  • All values must be identical

Cleanup

Follow these steps to remove all deployed resources.

Delete Snowflake Resources

-- Drop catalog-linked database
DROP DATABASE IF EXISTS glue_lake_int_db;

-- Drop catalog integration
DROP CATALOG INTEGRATION IF EXISTS glue_rest_cat_int_demo;

-- Drop external volume
DROP EXTERNAL VOLUME IF EXISTS extvol_iceberg_demo;

-- Drop demo database
DROP DATABASE IF EXISTS ICEBERG_LAKE;

Delete AWS CloudFormation Stack

# Delete stack (will remove all Glue jobs, database, IAM role)
aws cloudformation delete-stack --stack-name iceberg-glue-snowflake-demo

# Wait for deletion to complete
aws cloudformation wait stack-delete-complete --stack-name iceberg-glue-snowflake-demo

# Verify deletion
aws cloudformation describe-stacks --stack-name iceberg-glue-snowflake-demo
# Should return error: "Stack does not exist"

Delete S3 Data

# Delete all Iceberg data
aws s3 rm s3://$DATA_BUCKET/iceberg-warehouse/ --recursive

# Delete scripts and raw data
aws s3 rm s3://$DATA_BUCKET/scripts/ --recursive
aws s3 rm s3://$DATA_BUCKET/raw_data/ --recursive

# Optional: Delete the bucket itself
aws s3 rb s3://$DATA_BUCKET --force

Verify Cleanup

# Check Glue database is gone
aws glue get-database --name $GLUE_DB_NAME
# Should return error

# Check S3 bucket
aws s3 ls s3://$DATA_BUCKET
# Should be empty or not exist

Continue Building with Cortex Code

Cortex Code is Snowflake's AI coding agent — built into the Snowflake CLI. It can guide you through setting up, verifying, and troubleshooting your Glue catalog integration interactively, without switching between AWS console tabs and documentation.

Set up or troubleshoot your Glue catalog integration

Use these prompts to get started. Cortex Code will ask for your AWS account ID, region, IAM role, and access delegation preference — then generate and execute the SQL, retrieve the Snowflake IAM user ARN, and walk you through the trust policy update.

Create a new catalog integration:

Help me create a catalog integration for AWS Glue Iceberg REST

Verify an existing integration is working:

Verify my Glue catalog integration and list the namespaces and tables it can see

Troubleshoot a broken integration:

My Glue catalog integration is failing — help me diagnose and fix it

Set up a Catalog-Linked Database after the integration is ready:

Create a catalog-linked database from my Glue catalog integration

Run Spark code against your Glue-managed Iceberg tables using Snowpark Connect:

Help me set up Snowpark Connect for Apache Spark to read and write Iceberg tables managed by AWS Glue

Conclusion and Resources

You've successfully built an interoperable lakehouse architecture that demonstrates true code portability between AWS Glue and Snowflake using Apache Iceberg.

What You Learned

  • How to create Iceberg tables using AWS Glue ETL jobs
  • How to configure Snowflake External Volume and Catalog Integration with AWS Glue
  • How Snowflake Catalog-Linked Databases automatically discover and sync Iceberg tables
  • How Snowpark Connect enables identical Spark code to run on both platforms
  • How to achieve zero data movement with bi-directional read/write capabilities

Related Resources

Updated 2026-05-05

This content is provided as is, and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances