Build Your First Machine Learning Project
Overview
This project demonstrates how to build a complete machine learning pipeline for classifying bear species using physical characteristics and image analysis. The project is structured as a 5-part learning journey that covers the entire ML lifecycle from data ingestion to model deployment.
This guide provides a high-level overview of the 5-part series, which is available as a set of Snowflake Notebooks and Python files.
What You'll Learn
- Data Engineering: Working with Snowpark and Modin for large-scale data processing.
- Data Analysis: Exploratory data analysis with statistical summaries and visualizations.
- Machine Learning: Training and comparing multiple ML algorithms (Logistic Regression, Random Forest, SVM) using scikit-learn.
- Model Optimization: Hyperparameter tuning and experiment tracking using Snowflake ML ExperimentTracking.
- AI Integration: Using Snowflake Cortex AI for image analysis and feature extraction.
- Data Visualization: Creating interactive charts with Altair and Streamlit.
- ML Operations: Registering models in Snowflake Model Registry and deploying as services.
What You'll Build
You will build a comprehensive end-to-end machine learning project for bear species classification, broken into five parts:
- Data Operations: Ingest data, perform data wrangling, and use Cortex AI for feature extraction.
- Exploratory Data Analysis (EDA): Explore the data with summary statistics and visualizations.
- Machine Learning (ML): Prepare data and features to build models using different ML algorithms.
- Experiment Tracking: Track experiments while training models with different hyperparameters.
- Data App: Build a shareable Streamlit application for model deployment.
Prerequisites
- Snowflake Account: Access to a Snowflake account. Sign up for a 30-day free trial account.
- Snowflake Notebooks: Access to Snowflake's notebook environment.
- Database Permissions: Ability to create tables, stages, and models.
- Compute Resources: CPU compute pool access for model training.
- External Access: Policy allowing access to external data sources (GitHub).
- Python Knowledge: Basic understanding of Python programming.
Setup
What You'll Need
To follow this tutorial, you will use the files from the following GitHub repositories:
- Project Files: The full set of Snowflake Notebooks (.ipynb) and the Streamlit app (.py) are available at:
- Dataset: The bear dataset (CSV and images) is available at:
Notebook Configuration
Before running the notebooks, ensure you configure your Snowflake Notebook environment:
- Compute Pool: In "Notebook settings," under "Run on container," select a CPU compute pool.
- External Access: In the "External access" tab, select a policy that allows the notebook to access external data from GitHub.
- Packages: Add the following packages via the "Packages" dropdown: modin, scikit-learn, and snowflake-ml-python.
The Bear Data Set
About the Data Set
The bear dataset is composed of two main parts:
- Tabular Data (bear_raw_data.csv): The first portion of the data contains physical measurements for each bear, described by the following columns:
| Column | Description |
|---|---|
| id | A unique identifier for each bear instance. |
| species | The species of the bear. |
| body_mass_kg | The body mass of the bear, measured in kilograms (kg). |
| shoulder_hump_height_cm | The height of the bear's shoulder hump, measured in centimeters (cm). |
| claw_length_cm | The length of the bear's claws, measured in centimeters (cm). |
| snout_length_cm | The length of the bear's snout, measured in centimeters (cm). |
| forearm_circumference_cm | The circumference of the bear's forearm, measured in centimeters (cm). |
| ear_length_cm | The length of the bear's ears, measured in centimeters (cm). |
- Image Data (images/ folder): The second portion is a collection of images, where each image corresponds to a unique ID from the tabular data (e.g., ABB_01, EUR_01, GRZ_01, and KDK_01). This allows for visual analysis and the extraction of image-based features (a minimal sketch of this id-to-image mapping follows below).
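The snippet below is a minimal local sketch (plain pandas, not the Snowflake notebook code) of how the two parts of the data set are assumed to line up. The file paths and the .jpg extension are illustrative assumptions.

```python
# Check that every bear id in the tabular data has a matching image file.
# Paths and the .jpg extension are assumptions; adjust to the actual repository layout.
from pathlib import Path

import pandas as pd

df = pd.read_csv("bear_raw_data.csv")   # tabular part of the data set
image_dir = Path("images")              # image part of the data set

missing = [bear_id for bear_id in df["id"] if not (image_dir / f"{bear_id}.jpg").exists()]
print(f"{len(df)} rows, {len(missing)} rows without a matching image")
```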
Identifying Bear Species
These physical measurements are key differentiators between bear species. For example, the presence of a prominent shoulder hump (shoulder_hump_height_cm) and a "concave" or "dished" facial profile are characteristic of brown bears (Grizzly, Kodiak, Eurasian).
In contrast, the absence of a shoulder hump and a straight "Roman nose" facial profile are key features of the American Black Bear. Additionally, features related to paw structure, like the shorter, more curved claws (claw_length_cm) of an American Black Bear, are adapted for climbing, distinguishing them from the long, straighter claws of brown bears, which are built for digging.
These and other defining features can also be surfaced during the machine learning workflow through exploratory data analysis (EDA) and feature importance analysis.
Part 1: Data Operations and Feature Extraction
This part of the project focuses on data ingestion, wrangling, and feature extraction using Snowflake's native capabilities.
High-Level Steps:
- Load Tabular Data: The notebook starts by loading the bear_raw_data.csv file from GitHub directly into a modin.pandas DataFrame.
- Create Stage: You will create a Snowflake stage (@bear) to store the corresponding bear images.
- Cortex AI Feature Extraction: The core of this notebook is using the AI_COMPLETE SQL function (from Snowflake Cortex AI) to perform image analysis. You will write Python code to loop through all images in the stage and extract three new features: Fur Color, Facial Profile, and Paw Pad Texture.
- Combine and Save: Finally, you will merge the original tabular data with these new AI-extracted features and save the complete, enriched dataset to a new Snowflake table named BEAR (a condensed sketch follows this list).
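The sketch below condenses the Part 1 flow; it is not the notebook verbatim. The raw GitHub URL, the image file extension, the Cortex model name, and the prompt wording are illustrative assumptions.

```python
# Condensed Part 1 sketch: load the CSV, query Cortex AI about a staged image,
# and persist the enriched data. URL, file extension, model, and prompt are assumptions.
import pandas as native_pd
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # enables the Snowpark pandas backend for Modin
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# 1. Load the tabular data from GitHub into a Snowpark pandas (Modin) DataFrame.
csv_url = "https://raw.githubusercontent.com/<org>/<repo>/main/bear_raw_data.csv"  # placeholder
bears = pd.DataFrame(native_pd.read_csv(csv_url))

# 2. Ask Cortex AI to describe one image in the @bear stage; the notebook loops over
#    all images and asks for fur color, facial profile, and paw pad texture.
result = session.sql("""
    SELECT AI_COMPLETE(
        'claude-3-5-sonnet',
        PROMPT('In two words, describe the fur color of the bear in image {0}.',
               TO_FILE('@bear', 'GRZ_01.jpg'))
    ) AS fur_color
""").collect()
print(result[0]["FUR_COLOR"])

# 3. After merging the AI-extracted columns back into `bears`,
#    persist the enriched dataset as the BEAR table.
bears.to_snowflake("BEAR", if_exists="replace", index=False)
```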
Get the Code:
Part 2: Exploratory Data Analysis (EDA)
With a clean dataset in Snowflake, the next step is to explore it. This notebook demonstrates how to perform rich, interactive EDA directly within a Snowflake Notebook.
High-Level Steps:
- Load Data: Connect to your Snowflake account and load the BEAR table (created in Part 1) using pd.read_snowflake.
- Interactive Visualizations: This notebook heavily features Streamlit and Altair to build interactive charts inside the notebook cell output.
- Analyze Features: You will build several plots to understand the data (a minimal chart sketch follows this list):
  - Feature Distributions: An interactive histogram to see the distribution of numeric features (e.g., Body_Mass_kg) across the different bear species.
  - Correlation Heatmap: A heatmap to quickly visualize the correlation between all numeric features.
  - Feature Relationships: Interactive scatter plots to explore the relationship between any two selected features (e.g., Claw_Length_cm vs. Snout_Length_cm).
  - Categorical Analysis: Bar and pie charts to see the distribution of categorical features, including the target variable species.
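Here is a minimal sketch of one such interactive chart, assuming the BEAR table from Part 1 exists and an active Snowpark session is available. Column-name casing (shown uppercase) depends on how the table was written in Part 1.

```python
# Interactive feature-distribution histogram inside a Snowflake Notebook cell.
import altair as alt
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # Snowpark pandas backend for Modin
import streamlit as st

# Load the enriched table from Part 1 and convert to native pandas for charting.
bears = pd.read_snowflake("BEAR").to_pandas()

# Let the user pick a numeric feature; column names/casing must match your table.
numeric_cols = ["BODY_MASS_KG", "CLAW_LENGTH_CM", "SNOUT_LENGTH_CM"]  # subset for brevity
feature = st.selectbox("Feature", numeric_cols)

chart = (
    alt.Chart(bears)
    .mark_bar(opacity=0.7)
    .encode(
        x=alt.X(feature, bin=alt.Bin(maxbins=30)),
        y="count()",
        color="SPECIES",
        tooltip=[feature, "SPECIES"],
    )
    .interactive()
)
st.altair_chart(chart, use_container_width=True)
```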
Get the Code:
Part 3: Machine Learning Model Training
This notebook walks you through the classical machine learning workflow: data preparation, model training, and performance comparison.
High-Level Steps:
- Data Preparation: Load the BEAR dataset and split it into features (X) and the target variable (y).
- Data Splitting: Use scikit-learn to split the data into 80% training and 20% testing sets.
- Feature Scaling: Apply preprocessing to the data. Numerical features are scaled using StandardScaler, and categorical features (like 'Fur Color') are transformed using OneHotEncoder.
- Train Models: Train three different classification algorithms on the prepared data (a condensed sketch follows this list):
  - Logistic Regression
  - Random Forest
  - Support Vector Machine (SVM)
- Benchmark Models: Compare the models by examining their performance metrics (Accuracy, MCC) on both the training and testing sets to assess for overfitting.
- Interpretability: Analyze the feature importance from the Random Forest model to understand which features contribute most to the predictions.
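The sketch below condenses this workflow. It assumes the BEAR table contains the numeric columns listed earlier plus the three AI-extracted categorical columns; exact column names and casing may differ from your table.

```python
# Condensed Part 3 sketch: preprocessing, training three classifiers, and benchmarking.
import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

bears = pd.read_snowflake("BEAR").to_pandas()

numeric_features = ["BODY_MASS_KG", "SHOULDER_HUMP_HEIGHT_CM", "CLAW_LENGTH_CM",
                    "SNOUT_LENGTH_CM", "FOREARM_CIRCUMFERENCE_CM", "EAR_LENGTH_CM"]
categorical_features = ["FUR_COLOR", "FACIAL_PROFILE", "PAW_PAD_TEXTURE"]

X = bears[numeric_features + categorical_features]
y = bears["SPECIES"]

# 80/20 train/test split, stratified by species to keep class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# Compare train vs. test accuracy and MCC to spot overfitting.
for name, model in models.items():
    pipeline = Pipeline([("prep", preprocess), ("model", model)]).fit(X_train, y_train)
    for split, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        preds = pipeline.predict(X_)
        print(f"{name} [{split}] accuracy={accuracy_score(y_, preds):.3f} "
              f"MCC={matthews_corrcoef(y_, preds):.3f}")
```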
Get the Code:
Part 4: Experiment Tracking and Model Registry
This notebook introduces the MLOps capabilities of Snowflake, demonstrating how to track experiments and version models.
High-Level Steps:
- Setup Experiment: After preparing the data (as in Part 3), you will initialize the ExperimentTracking client from the snowflake-ml-python library to create a new experiment named "Bear_Classification_Experiment".
- Log Baseline Model: Train a single baseline Random Forest model and log its hyperparameters and performance metrics to the experiment tracker.
- Hyperparameter Tuning: Perform a grid search by training multiple Random Forest models with different combinations of hyperparameters (n_estimators, max_depth, etc.). Each run, with its parameters and resulting metrics (accuracy, precision, recall), is logged as a separate run within the experiment.
- Analyze Results: You will analyze the logged results in the experiment to identify the best-performing model configuration.
- Register Model: After training the final model with the best parameters, you will register it in the Snowflake Model Registry, giving it a name and version. This makes the model available for deployment and inference (a condensed sketch follows this list).
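The sketch below shows the grid-search and registration steps, assuming the preprocessing and train/test split objects from the Part 3 sketch (preprocess, X_train, X_test, y_train, y_test) are in scope. It omits the ExperimentTracking logging calls; see the notebook for the snowflake-ml-python experiment-tracking API. The parameter grid and the version name are examples.

```python
# Grid search over Random Forest hyperparameters, then register the best pipeline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from snowflake.ml.registry import Registry
from snowflake.snowpark.context import get_active_session

param_grid = {"n_estimators": [100, 200, 400], "max_depth": [None, 5, 10]}

best_score, best_model, best_params = -1.0, None, None
for params in ParameterGrid(param_grid):
    # One "run" per hyperparameter combination; in the notebook each run's
    # parameters and metrics are logged to the experiment tracker.
    candidate = Pipeline([("prep", preprocess),
                          ("model", RandomForestClassifier(random_state=42, **params))])
    candidate.fit(X_train, y_train)
    score = accuracy_score(y_test, candidate.predict(X_test))
    if score > best_score:
        best_score, best_model, best_params = score, candidate, params

print(f"Best params: {best_params} (test accuracy={best_score:.3f})")

# Register the best pipeline in the Snowflake Model Registry so it can be
# loaded later (e.g., by the Streamlit app in Part 5). Version name is an example.
registry = Registry(session=get_active_session())
registry.log_model(
    best_model,
    model_name="BEAR_SPECIES_CLASSIFIER",
    version_name="V1",
    sample_input_data=X_train.head(),
    metrics={"test_accuracy": best_score},
)
```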
Get the Code:
Part 5: Building a Streamlit Data App
The final step is to create a user-facing application that uses the model you just trained and registered.
High-Level Steps:
- Streamlit Application: This file (CHANINN_BEAR_PREDICTION_APP.py) is a standalone Streamlit application, not a notebook. It's designed to be run as a Streamlit in Snowflake (SiS) app.
- Load Model: The application loads the BEAR_SPECIES_CLASSIFIER model directly from the Snowflake Model Registry.
- User Interface: It builds a simple UI with sliders and dropdowns that allow a user to input the physical measurements (body mass, claw length, etc.) and categorical features (fur color, etc.) for a bear.
- Make Predictions: When the user submits the features, the app sends this data to the registered model and displays the returned prediction for the bear species (a stripped-down sketch follows this list).
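Below is a stripped-down sketch of such an app, not the actual CHANINN_BEAR_PREDICTION_APP.py. The input column names, slider ranges, category options, and default model version are assumptions and must match the model registered in Part 4.

```python
# Minimal Streamlit in Snowflake prediction app sketch.
import pandas as pd
import streamlit as st
from snowflake.ml.registry import Registry
from snowflake.snowpark.context import get_active_session

st.title("Bear Species Predictor")

# Load the default version of the registered model from the Model Registry.
session = get_active_session()
model_version = Registry(session=session).get_model("BEAR_SPECIES_CLASSIFIER").default

# Collect feature inputs; names and ranges are illustrative.
numeric_inputs = {
    "BODY_MASS_KG": st.slider("Body mass (kg)", 50.0, 700.0, 250.0),
    "SHOULDER_HUMP_HEIGHT_CM": st.slider("Shoulder hump height (cm)", 0.0, 30.0, 10.0),
    "CLAW_LENGTH_CM": st.slider("Claw length (cm)", 2.0, 12.0, 6.0),
    "SNOUT_LENGTH_CM": st.slider("Snout length (cm)", 10.0, 40.0, 25.0),
    "FOREARM_CIRCUMFERENCE_CM": st.slider("Forearm circumference (cm)", 20.0, 80.0, 45.0),
    "EAR_LENGTH_CM": st.slider("Ear length (cm)", 5.0, 20.0, 12.0),
}
categorical_inputs = {
    "FUR_COLOR": st.selectbox("Fur color", ["Black", "Brown", "Blonde"]),
    "FACIAL_PROFILE": st.selectbox("Facial profile", ["Straight", "Concave"]),
    "PAW_PAD_TEXTURE": st.selectbox("Paw pad texture", ["Smooth", "Rough"]),
}

if st.button("Predict species"):
    features = pd.DataFrame([{**numeric_inputs, **categorical_inputs}])
    # Send the features to the registered model and display the prediction.
    # The output column name depends on the model signature; grab the last column.
    prediction = model_version.run(features, function_name="predict")
    st.success(f"Predicted species: {prediction.iloc[0, -1]}")
```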
Get the Code:
Conclusion And Resources
Congratulations! You've successfully built a complete end-to-end machine learning pipeline in Snowflake. You've ingested and wrangled data using Modin, performed image analysis with Cortex AI, explored data with Streamlit, trained multiple models using scikit-learn, and tracked your experiments using Snowflake ML.
What You Learned
- How to ingest and process data using the Snowpark pandas API (Modin).
- How to use Snowflake Cortex AI (AI_COMPLETE) for powerful feature extraction from images.
- How to build interactive Exploratory Data Analysis (EDA) apps directly in a Snowflake Notebook using Streamlit and Altair.
- How to pre-process data and train multiple scikit-learn models (Logistic Regression, Random Forest, SVM) for classification.
- How to use Snowflake ML Experiment Tracking to log, compare, and manage hyperparameter tuning experiments.
- How to register the best-performing model into the Snowflake Model Registry.
Related Resources
- Snowpark Documentation: https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html
- Snowflake ML Documentation: https://docs.snowflake.com/en/developer-guide/snowflake-ml/overview
- scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
- Altair Documentation: https://altair-viz.github.io/user_guide/api.html
- Streamlit Documentation: https://docs.streamlit.io/
- YouTube Playlist on Snowflake Notebooks: https://www.youtube.com/watch?v=YB1B6vcMaGE&list=PLavJpcg8cl1Efw8x_fBKmfA2AMwjUaeBI
This content is provided as is and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances.