Run pandas on 1TB+ Enterprise Data Directly In Snowflake

As one of the most widely used libraries in the Python ecosystem, pandas helps developers load, analyze and transform data across data science, data engineering and machine learning. The flexibility and ease of use of the pandas API have driven rapid growth in popularity: pandas is used by one in every five developers, according to the Stack Overflow 2024 Developer Survey.

But pandas was initially designed around in-memory data structures, which limits its ability to operate on large data sets. Developers can often work only with the amount of data that fits in memory on their machines. These scale challenges slow development and present roadblocks for data teams that need to operate on large volumes of data. As a result, data teams have had to rewrite pandas code in other frameworks to operate on larger-scale data. Until now.

Today, we are excited to announce the general availability of pandas on Snowflake, which brings the best of the Snowflake AI Data Cloud to Python developers by enabling scalable, distributed pandas operations within Snowflake.

Figure: Bar chart of benchmark results for pandas on Snowflake, showing up to 30x faster performance.

Our benchmark studies have shown that pandas on Snowflake scales to more than a terabyte of data, whereas the standard pandas library runs out of memory on data sets of even less than 100GB. On average across representative workloads, we find that pandas on Snowflake performs around 6x faster at 1GB scale and around 30x faster at 10GB scale than vanilla in-memory pandas.

Minimal tuning or rewriting required

With the introduction of pandas on Snowflake, users can work with their familiar pandas API and semantics. This feature enables developers to run pandas directly on their data in Snowflake, while queries are translated to SQL to run natively in Snowflake. 
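To make the idea of translating dataframe operations into SQL concrete, here is a toy sketch. It is not Snowflake's actual translation layer: the `LazyFrame` class, its methods and the `ORDERS` table are hypothetical names invented for illustration, and conditions are passed as raw strings with no escaping. The sketch only shows the general shape of the approach, which is to record operations lazily and render them as one SQL statement that the database executes, instead of pulling rows into local memory.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical lazy frame: records operations instead of loading rows."""

    table: str
    filters: Tuple[str, ...] = ()
    group_key: Optional[str] = None
    agg: Optional[Tuple[str, str]] = None  # (SQL function, column)

    def where(self, condition: str) -> "LazyFrame":
        # Record a filter; nothing is executed yet.
        return replace(self, filters=self.filters + (condition,))

    def groupby_sum(self, key: str, column: str) -> "LazyFrame":
        # Record a group-by with a SUM aggregate.
        return replace(self, group_key=key, agg=("SUM", column))

    def to_sql(self) -> str:
        # Render all recorded operations as a single SQL statement.
        if self.agg is not None:
            func, column = self.agg
            select = f"{self.group_key}, {func}({column}) AS {column}"
        else:
            select = "*"
        sql = f"SELECT {select} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.group_key is not None:
            sql += f" GROUP BY {self.group_key}"
        return sql


query = LazyFrame("ORDERS").where("AMOUNT > 100").groupby_sum("REGION", "AMOUNT")
print(query.to_sql())
```

In a real system the generated statement would be sent to the warehouse, and only the (usually much smaller) result would travel back to the client.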

pandas on Snowflake is part of the Snowpark Python library, which enables scalable processing of Python code within the Snowflake platform. By changing just a few import statements, developers get the same pandas experience they know and love with the scalability and security benefits of Snowflake. As a result, migrations to Snowflake are easy, and data teams avoid the time and expense of rewriting their pandas pipelines in other big data frameworks or provisioning expensive high-memory machines.
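As a rough sketch of what the import change can look like, the example below keeps the analysis logic in one function and runs it two ways. The Snowflake path follows the import pattern from the Snowflake documentation as we understand it (`modin.pandas` plus the `snowflake.snowpark.modin.plugin` import); check the current docs for exact package and version requirements. The function names, the table name and the `REGION` and `AMOUNT` columns are assumptions made for illustration, and the Snowflake path only works with a configured default connection.

```python
import pandas as pd  # standard, in-memory pandas


def top_regions(frame, n=2):
    """Plain pandas-API logic; nothing here is Snowflake-specific."""
    totals = frame.groupby("REGION")["AMOUNT"].sum()
    return totals.sort_values(ascending=False).head(n)


def top_regions_in_snowflake(table_name, n=2):
    """Same logic against a Snowflake table (needs a configured connection)."""
    # The import swap: Modin's pandas module plus the Snowpark plugin.
    import modin.pandas as spd
    import snowflake.snowpark.modin.plugin  # noqa: F401 (registers the backend)
    from snowflake.snowpark.session import Session

    # Assumption: a default connection is configured for this environment.
    Session.builder.create()
    # Assumption: table_name has REGION and AMOUNT columns (hypothetical table).
    frame = spd.read_snowflake(table_name)
    return top_regions(frame, n)


# Local run with in-memory pandas on made-up sample data.
local = pd.DataFrame(
    {"REGION": ["east", "west", "east", "north"], "AMOUNT": [10, 5, 7, 3]}
)
result = top_regions(local)
print(result)
```

The point of the design is that `top_regions` does not change between the two paths; only the module that supplies the DataFrame does.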

Secure access within Snowflake removes sensitive data risks on local machines

The in-memory design of pandas has created problems for organizations, notably the security and governance concerns that result from pulling enterprise data to laptops to process with pandas. Because pandas on Snowflake is part of the Snowpark Python library, compute is pushed down to Snowflake and runs directly within Snowflake’s secure, governed perimeter.

Built on the Modin open source project

At Snowflake, we are committed to meeting developers where they are by integrating open source tools and standards with the powerful capabilities of the Snowflake AI Data Cloud. pandas on Snowflake is built on the Modin open source project. Modin is a distributed pandas library that joined the family of open source projects at Snowflake through an acquisition in October 2023. Modin is used by hundreds of thousands of data scientists and developers to seamlessly scale their pandas workflows. Snowflake actively contributes to and supports both the open source project and its vibrant community.

Figure: Technology stack diagram of the Snowflake Python developer ecosystem, covering ingestion, transformation and delivery, along with developer experience and DevOps elements.

pandas on Snowflake is an integral part of Snowflake’s Python developer ecosystem, which also includes Snowpark Python, Snowflake Python API, Streamlit in Snowflake and Snowflake Notebooks. These latest product innovations bring the power of the Snowflake AI Data Cloud to Python developers and empower data teams to efficiently scale enterprise data pipelines and applications.

To learn more, visit the Snowflake Documentation, or get started with this quickstart in Snowflake Notebooks.

 
