Beyond toPandas(): Streaming PySpark Data to DuckDB via Apache Arrow

This blog post is part of an ongoing series showcasing our team's latest contributions to open source projects. You can explore more of our work on Open Source at Snowflake.
If you’ve ever tried to take a massive PySpark DataFrame and move it into another Python tool to do some quick ad-hoc analysis, you’re familiar with the challenge. You probably tried to use .toPandas() and watched your driver’s memory spike, waited while everything materialized, and maybe even saw an out-of-memory exception (OOM). The worst part about running into an OOM is that you basically have three options if you want to get past it: (1) use a driver machine with more memory, (2) sample your results, or (3) write your data to Parquet/CSV.
Scale and interoperability should not be mutually exclusive, so open source Snowflake developer Devin Petersohn added support for the PyCapsule Interface (via the Arrow C Stream protocol) to Apache Spark™.
The problem: The toPandas bottleneck
Historically, moving data out of Spark often meant fully materializing it. When you call toPandas(), Spark collects every single row from every executor, ships it to the driver, and constructs a monolithic pandas DataFrame in memory.
This works for small data, but it breaks the fundamental promise of distributed computing: It forces you to be constrained by the RAM of a single machine.
from pyspark.sql import SparkSession
import duckdb
import pyarrow as pa
spark = SparkSession.builder.getOrCreate()
spark_df = spark.range(1_000_000_000)
# Old way: Wait for 1 billion rows to materialize on the driver
pandas_df = spark_df.toPandas()
# Alternatively: Write the data to Parquet

Because of driver memory limitations, most of the time you would instead write an intermediate Parquet file (or CSV if you prefer a human-readable format). Regardless of whether you use <code>toPandas</code> or go through Parquet, the interoperability is broken: Spark must completely finish its computation before we can move on to using some other tool.
The solution: Arrow PyCapsule
With commit ecf179c, we have added support for the Arrow PyCapsule protocol. While this might sound super technical, the impact is simple: PySpark can now expose data as a stream of Arrow batches that other libraries can consume directly.
This means you can hand a PySpark DataFrame to DuckDB or Polars, and they can pull data incrementally, one Arrow batch at a time. We no longer have to use <code>.toPandas</code> or <code>.toArrow</code> to materialize the full data set in memory in order to get it into the Python ecosystem.
Here is what this looks like in practice:
from pyspark.sql import SparkSession
import duckdb
import pyarrow as pa
spark = SparkSession.builder.getOrCreate()
spark_df = spark.range(1_000_000_000)
# Arrow RecordBatchReader created from the Spark DataFrame.
# This operation is lazy; computation is not triggered
# until the stream is consumed.
stream = pa.RecordBatchReader.from_stream(spark_df)
# DuckDB consumes the stream batches as Spark produces them.
result = duckdb.sql("SELECT avg(id) FROM stream").fetchall()
print(result)

From isolated workflows to connected ecosystems
By implementing this protocol, we are effectively "dropping the walls" around PySpark. You can use Spark for the heavy lifting — the distributed ETL and massive joins — and then seamlessly hand off that data to DuckDB for a lightning-fast analytical query, or to Polars for some final data wrangling.
Check out the full Apache Spark pull request and look for the change in the upcoming Spark 4.2 release!

