Getting Started with Multimodal Analysis on Snowflake Cortex AI
Overview
In this quickstart, you'll learn how to build an end-to-end application for multimodal analysis using AI models through Snowflake Cortex AI. This application uses AI_COMPLETE with models like Claude 4 Sonnet and Pixtral-large to extract insights, detect emotions, and generate descriptions from images, plus uses AI_TRANSCRIBE to transcribe audio with speaker identification - all within the Snowflake ecosystem.
Note: AI_COMPLETE multimodal capability and AI_TRANSCRIBE are currently in Public Preview.
What You'll Learn
- Setting up a Snowflake environment for multimodal processing
- Creating storage structures for image and audio data
- Using AI_COMPLETE to analyze images with AI models
- Implementing audio transcription with AI_TRANSCRIBE
What You'll Build
A multimodal analysis system that enables users to:
- Upload and store images and audio files in Snowflake
- Extract detailed insights from images using AI models
- Identify scenes, objects, text, and emotions in images
- Transcribe audio with speaker identification and precise timestamps
- Generate custom descriptions based on specific prompts
- Process media files individually
- Combine image analysis with audio transcription for comprehensive content understanding
Prerequisites
- Snowflake account in a supported region for AI functions
- Account must have these features enabled:
Setup Environment
To set up your Snowflake environment for multimodal analysis:
- Download the setup.sql file
- Open a new worksheet in Snowflake
- Paste the contents of setup.sql or upload and run the file
- The script will create:
- A new database and schema for your project
- Image and audio storage stages
Upload Media Files
After running the setup script:
- Download the data.zip and unzip for sample images and audio files
- For images:
- Navigate to Data > Databases > MULTIMODAL_ANALYSIS > MEDIA > IMAGES > Stages
- Click "Upload Files" button in top right
- Select your image files
- For audio files:
- Navigate to Data > Databases > MULTIMODAL_ANALYSIS > MEDIA > AUDIO > Stages
- Click "Upload Files" button in top right
- Select your audio files
- Verify upload success:
-- Check uploaded images LS @MULTIMODAL_ANALYSIS.MEDIA.IMAGES; -- Check uploaded audio files LS @MULTIMODAL_ANALYSIS.MEDIA.AUDIO;
You should see your uploaded files listed with their sizes.
TIP: For best results, ensure your media files are: Images:
- Clear and well-lit
- In common formats (PNG, JPEG, WEBP)
- No larger than 20MB
- Have sufficient resolution for details you want to analyze
Audio:
- Clear audio quality
- Supported formats: FLAC, MP3, OGG, WAV, WEBM
- Maximum 2 hours for text transcription, 1 hour for speaker segmentation
- Maximum file size: 700 MiB
Create Snowflake Notebook
Let's create a notebook to explore multimodal analysis techniques:
- Navigate to Projects > Notebooks in Snowflake
- Click "+ Notebook" button in the top right
- To import the existing notebook:
- Click the dropdown arrow next to "+ Notebook"
- Select "Import .ipynb" from the dropdown menu
- Upload the multimodal_analysis_notebook.ipynb file
- In the Create Notebook popup:
- Select your MULTIMODAL_ANALYSIS database and MEDIA schema
- Choose an appropriate warehouse
- Click "Create" to finish the import
The notebook includes:
- Setup code for connecting to your Snowflake environment
- Functions for analyzing images with different AI models
- Audio transcription with various modes (text, word-level, speaker identification)
- Example analysis with various prompt types
- Comparison between Claude 4 Sonnet and Pixtral-large models for analyzing images
Conclusion And Resources
Congratulations! You've successfully built an end-to-end multimodal analysis system using AI models via Snowflake Cortex. This solution allows you to extract valuable insights from both images and audio content, perform transcription with speaker identification, detect emotions, analyze scenes, and generate rich descriptions - all within the Snowflake environment using AI_COMPLETE and AI_TRANSCRIBE functions.
The combination of visual and audio analysis capabilities opens up powerful possibilities for content understanding, customer experience analysis, compliance monitoring, and automated content processing workflows.
What You Learned
- How to set up Snowflake for multimodal content storage and processing
- How to use AI_COMPLETE with AI models like Claude 4 Sonnet and Pixtral-large for comprehensive image analysis
- How to implement audio transcription with AI_TRANSCRIBE including speaker identification and timestamps
- How to create custom prompts for specialized analysis tasks
- How to implement batch processing for multiple media files
- How to combine image and audio analysis for enhanced content understanding
Related Resources
This content is provided as is, and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances