Data Engineering Pipeline with pandas on Snowflake
Overview
Through this quickstart, you will learn how to use Snowflake Notebooks, pandas on Snowflake, Git integration, and Serverless Tasks to build an end-to-end data engineering pipeline.
You will create a customer profile using the Snowflake Sample TPC-H dataset, save it to a Snowflake table, and create a Serverless Task to schedule the feature engineering pipeline.

What You’ll Learn
- How to build a Python data pipeline with Snowflake Notebooks
- How to aggregate and transform data to create new features with pandas on Snowflake
- How to version control and collaborate with Git integration
- How to create a Serverless Task to schedule the feature engineering pipeline
What is pandas on Snowflake?
pandas on Snowflake lets you run your pandas code directly on your data in Snowflake in a distributed, scalable, and secure manner. By changing just the import statement and a few lines of code, you get the same pandas-native experience you know and love, with the scalability and security benefits of Snowflake. With pandas on Snowflake, you can work with much larger datasets and avoid the time and expense of porting your pandas pipelines to other big data frameworks or provisioning large and expensive machines. It runs workloads natively in Snowflake through transpilation to SQL, enabling it to take advantage of parallelization and the data governance and security benefits of Snowflake.
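As a minimal sketch (assuming the snowflake-snowpark-python and modin packages are installed and your Snowflake connection parameters are already configured), switching from stock pandas is largely a matter of the imports:

```python
# Instead of `import pandas as pd`, import the modin pandas API together with
# the Snowpark pandas plugin, which backs it with Snowflake.
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # noqa: F401  (registers the Snowflake backend)
from snowflake.snowpark import Session

# Create (or reuse) a Snowpark session; inside a Snowflake Notebook you would
# typically call snowflake.snowpark.context.get_active_session() instead.
session = Session.builder.create()

# Familiar pandas code now runs as SQL pushed down to Snowflake.
# The TPC-H CUSTOMER table is used purely as an example.
customers = pd.read_snowflake("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER")
print(customers.shape)
print(customers.head())
```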
Why use pandas on Snowflake?
pandas is the go-to data processing library for millions worldwide, including countless Snowflake users. However, pandas was never built to handle data at the scale organizations operate at today: running pandas code requires transferring and loading all of the data into a single in-memory process, which becomes unwieldy on moderate-to-large datasets and breaks down completely on datasets that grow beyond what a single node can handle. With pandas on Snowflake, you can run the same pandas code, but with all of the pandas processing pushed down to run in a distributed fashion in Snowflake. Your data never leaves Snowflake, and your pandas workflows run far more efficiently on the Snowflake elastic engine. This brings the power of Snowflake to pandas developers everywhere.
pandas on Snowflake is delivered through the Snowpark pandas API, part of the Snowpark Python library, which enables scalable data processing of Python code within the Snowflake platform. This quickstart focuses on getting started with the Snowpark pandas API and shows how to perform common pandas operations on large volumes of data while leveraging the power of Snowflake.
Learn more about the Snowpark pandas API.
Prerequisites
- Git installed
- The git repo cloned or downloaded to your machine
- A Snowflake account. If you do not have a Snowflake account, you can register for a free trial account.
- A Snowflake account login with a role that can create databases, schemas, tables, warehouses, and tasks. If your role cannot, register for a free trial or use a different role.
What You’ll Build
A customer profile table built with the Snowpark pandas API, and a Serverless Task that runs the feature engineering pipeline on a schedule.
Setup The Snowflake Environment
Overview
This section covers cloning the GitHub repository and creating the needed Snowflake objects (e.g., role, warehouse, database, and schema).
Clone The Git Repository
The very first step is to clone the GitHub repository. This repository contains all the code you will need to successfully complete this QuickStart Guide.
Using HTTPS:
git clone https://github.com/Snowflake-Labs/sfguide-data-engineering-pipelines-with-pandas-on-snowflake.git
OR, using SSH:
git clone [email protected]:Snowflake-Labs/sfguide-data-engineering-pipelines-with-pandas-on-snowflake.git
You can also use the Git integration feature of Snowflake Notebooks; to do that, you need to fork the GitHub repository so that you are allowed to commit changes. For instructions on how to set up Git integration for your Snowflake account, see here, and for using it with Snowflake Notebooks, see this page. You can refer to this video for a walkthrough of setting up Git integration and using it with Snowflake Notebooks.
Setup Snowflake
During this step you will verify that the Snowflake Sample TPC-H dataset is available in your account and, if it is not, add the share.
Verify That The Snowflake Sample TPC-H Dataset Is Available
In Snowsight, check that the SNOWFLAKE_SAMPLE_DATA database is visible in your list of databases. If it is not, an ACCOUNTADMIN can add it by creating a database from the SFC_SAMPLES.SAMPLE_DATA share.
Create Database, Schema And Warehouse To Be Used
USE ROLE ACCOUNTADMIN;

CREATE DATABASE SNOW_PANDAS_DE_QS;
CREATE SCHEMA SNOW_PANDAS_DE_QS.NOTEBOOKS;
CREATE SCHEMA SNOW_PANDAS_DE_QS.DATA;

CREATE WAREHOUSE SNOW_PANDAS_DE_QS_WH;
Create Snowflake Notebook
Navigate To Snowflake Notebooks
- Navigate to the Notebooks section by clicking Projects and then Notebooks

- Click on the down arrow next to + Notebook

- If you have set up Git integration, choose Create from repository. If not, choose Import .ipynb file.

Import .ipynb File
- Navigate to where you cloned the GitHub repository, select Customer Profile Creation Pipeline.ipynb, and click Open

- Give the notebook a name, select SNOW_PANDAS_DE_QS and NOTEBOOKS for Notebook location and SNOW_PANDAS_DE_QS_WH for Notebook warehouse, and click Create

Create From Repository
If you have forked the GitHub repository and created a Git integration for it in Snowflake, you can open the notebook directly from the repository. See here for instructions on how to set up Git integration.
- In the Create Notebook from Repository dialog, click on Select .ipynb file

- In the Select file from Git Repository dialog, click the repository integration you are using, select the notebook file, and click Select File. If you do not see the file, click Fetch to refresh with the latest changes from the repository.
- Back in the Create Notebook from Repository dialog, give the notebook a name, select SNOW_PANDAS_DE_QS and NOTEBOOKS for Notebook location and SNOW_PANDAS_DE_QS_WH for Notebook warehouse, and click Create
Add Required Python Libraries
Before you run the notebook you need to add the following Python libraries:
- modin
- snowflake
- matplotlib
- seaborn
- In the Notebook, click Packages
- Search for modin and select modin in the list

- Do the same for snowflake, matplotlib, and seaborn. When done, you should have the same packages as in the list below (the versions might differ)

Data Engineering with pandas on Snowflake
During this step you will learn how to use pandas on Snowflake to:
- Create DataFrames from Snowflake tables
- Aggregate data to create new features
- Join dataframes
- Save the result into a Snowflake table
- Create a serverless task to schedule the feature engineering pipeline (Currently in Private Preview)
Follow along and run each of the cells in the Notebook.
Within this Notebook, we will use the Snowpark pandas API to create DataFrames, join them, create new features, and create a serverless task to schedule the feature engineering pipeline.
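For reference, the sketch below condenses those steps into one hypothetical pipeline. The exact feature names, aggregations, and the way the notebook creates the serverless task may differ from what you will see in the notebook cells; the table, task, and notebook names used here are illustrative.

```python
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # noqa: F401  (registers the Snowflake backend)
from snowflake.snowpark.context import get_active_session

session = get_active_session()  # the active session inside a Snowflake Notebook

# Create DataFrames from Snowflake tables
customers = pd.read_snowflake("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER")
orders = pd.read_snowflake("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS")

# Aggregate order data per customer to create new features
order_features = (
    orders.groupby("O_CUSTKEY")
    .agg({"O_ORDERKEY": "count", "O_TOTALPRICE": "sum"})
    .reset_index()
)
order_features.columns = ["O_CUSTKEY", "NUMBER_OF_ORDERS", "TOTAL_SPENT"]

# Join the aggregated features onto the customer DataFrame
customer_profile = customers.merge(
    order_features, left_on="C_CUSTKEY", right_on="O_CUSTKEY", how="left"
)

# Save the result into a Snowflake table
customer_profile.to_snowflake(
    "SNOW_PANDAS_DE_QS.DATA.CUSTOMER_PROFILE", if_exists="replace", index=False
)

# Schedule the pipeline as a serverless task (note: no WAREHOUSE clause), here by
# re-running the notebook itself once a day. The task and notebook names are
# illustrative, and your role needs the relevant task privileges.
session.sql("""
CREATE OR REPLACE TASK SNOW_PANDAS_DE_QS.DATA.CUSTOMER_PROFILE_PIPELINE
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  EXECUTE NOTEBOOK SNOW_PANDAS_DE_QS.NOTEBOOKS."Customer Profile Creation Pipeline"()
""").collect()
session.sql(
    "ALTER TASK SNOW_PANDAS_DE_QS.DATA.CUSTOMER_PROFILE_PIPELINE RESUME"
).collect()
```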
Conclusion And Resources
Congratulations, you have successfully completed this quickstart! It showcased how you can use Snowflake Notebooks, pandas on Snowflake, Git integration, and Serverless Tasks to build an end-to-end data engineering pipeline.
What You Learned
- How to build a Python data pipeline with Snowflake Notebooks
- How to aggregate and transform data to create new features with pandas on Snowflake
- How to version control and collaborate with Git integration
- How to create a Serverless Task to schedule the feature engineering pipeline
Related Resources
This content is provided as is, and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances