Announcing Snowpark Connect for Apache Spark™ in Public Preview. Your Spark Client, Now Powered by Snowflake.

In version 3.4, the Apache Spark™ community introduced Spark Connect. Its decoupled client-server architecture separates the user's code from the Spark cluster where the work is done. This architecture makes it possible to run your Spark code in a Snowflake warehouse, eliminating the need to provision and maintain Spark clusters.

We’re excited to announce the public preview of Snowpark Connect for Spark. With Snowpark Connect, customers can take advantage of the powerful Snowflake vectorized engine for their Spark code while avoiding the complexity of maintaining or tuning separate Spark environments — including managing dependencies, version compatibility and upgrades. You can now run all modern Spark DataFrame, Spark SQL and user-defined function (UDF) code with Snowflake.

Using Snowflake elastic compute runtime with virtual warehouses, Snowpark Connect for Spark delivers the best of both worlds: the power of Snowflake's engine and the familiarity of Spark code, all while lowering costs and accelerating development. Organizations no longer need dedicated Spark clusters. Write new pipelines or onboard existing, compatible Spark SQL, DataFrame and UDF code to run directly on the Snowflake platform. Snowflake handles all the performance tuning and scaling automatically, freeing your developers from the operational overhead of managing Spark. Furthermore, by bringing data processing into Snowflake, you establish a single, robust governance framework upstream, which helps ensure data consistency and security across the entire lifecycle without redundant effort.

Figure 1: Snowpark Connect, the latest addition to Snowpark, expands the developer experience by enabling Spark code to run on Snowflake without having to migrate to Snowpark DataFrames. For new pipelines or to take advantage of platform-specific integrations such as Snowflake SQL, AI or pandas, Snowpark continues to offer a suite of easy-to-use tools for developers.

Customers using Snowpark Client to author data pipelines in Python, Java and Scala are seeing, on average:¹

  • 5.6x faster performance over managed Spark

  • 41% cost savings over managed Spark

With the launch of Snowpark Connect for Spark, you can get the same benefits of Snowpark execution without converting your code to the Snowpark Client or learning its APIs, as long as you're already familiar with Spark.

Figure 2: Connect your PySpark client (from VSCode, Jupyter Notebooks, Apache Airflow™, Snowflake Notebooks and Spark Submit) to run Spark jobs on the Snowflake platform.

“VideoAmp has a long history of leveraging both Spark and Snowflake. We've migrated a large portion of our workloads to Snowpark directly, but Snowpark Connect takes us one step further in achieving code interoperability. Having Snowflake meet us where our code already resides is nothing but a clear win and the early results we've seen are extremely promising. The best part is that we didn't have to sacrifice critical engineering time to migrate workloads with Snowpark Connect — they just worked.”

John Adams
SVP of Architecture at VideoAmp

Built on Spark Connect

The release of Spark Connect, which decouples the Spark client and server, was designed to make it easier to use Spark from any application. Before Spark Connect, your application and the main Spark driver had to run together; now they can be separate. Your application, whether it's a Python script or a data notebook, simply sends the unresolved logical plan to a remote Spark cluster. This improves Spark's connectivity with different tools and lets it fit better into modern application development.
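To see what this looks like in stock Spark, here is a minimal sketch of a Spark Connect client session; the sc:// endpoint below is a placeholder, not a real server:

from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of starting a local driver.
# The endpoint is a placeholder; 15002 is the default Spark Connect port.
spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()

# The DataFrame API is unchanged; only where the plan executes differs.
df = spark.range(10).filter("id % 2 = 0")
df.show()  # The unresolved logical plan is sent to the server for execution.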

Figure 3: Snowpark was originally built on the same premise of client-server separation as Spark Connect. In this architectural view, you can see how, paired with Spark Connect, we’re able to transparently bring the ease of use, performance benefits and reliability of the Snowflake platform to Spark workloads.

Snowpark was originally built on this same premise of client-server separation. Now, paired with Spark Connect, we’re able to bring the ease of use, performance benefits and reliability of the Snowflake platform to Spark workloads. Snowpark Connect enables you to run your Spark code in a Snowflake warehouse, which does all the heavy lifting, eliminating the need to provision and maintain Spark clusters. Snowpark Connect currently supports Spark 3.5.x versions, enabling compatibility with the features and improvements in those versions.

 

Bringing Spark code to Snowflake data

Until now, many organizations using Snowflake have chosen to use the Spark Connector to process Snowflake data with Spark code, but this introduced data movement, resulting in additional costs, latency and governance complexity. While moving to Snowpark improved performance, scaled governance, and saved money, it still often meant rewriting code, slowing down development. With Snowpark Connect, organizations have a fresh opportunity to revisit these workloads and do the data processing directly in Snowflake without code conversion while removing data movement and latency.
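As a sketch of the difference (assuming an existing SparkSession named spark; connection options are placeholders and credentials are omitted): with the Spark Connector, each read copies Snowflake data out into a Spark cluster, whereas with Snowpark Connect the same read executes inside Snowflake:

# Before: the Spark Connector pulls table data out of Snowflake into a
# Spark cluster. Option values here are placeholders; credentials omitted.
orders = (
    spark.read.format("snowflake")
    .option("sfURL", "<account>.snowflakecomputing.com")
    .option("sfDatabase", "SNOWFLAKE_SAMPLE_DATA")
    .option("sfSchema", "TPCH_SF1")
    .option("dbtable", "ORDERS")
    .load()
)

# After: with Snowpark Connect, the same table reference is processed
# in place on Snowflake, with no data movement.
orders = spark.read.table("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS")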

 

Working with an open data lakehouse 

Snowpark Connect for Spark works with Apache Iceberg™ tables, including externally managed Iceberg tables and catalog-linked databases. With this, you can leverage the power, performance, ease of use and governance of the Snowflake platform without having to move your data or rewrite your Spark code.
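As a rough sketch, a table in a catalog-linked database reads like any other table from the Spark side; the database, schema and table names below are hypothetical:

# Hypothetical names: a catalog-linked database exposing an externally
# managed Iceberg table. Assumes an existing Snowpark Connect session `spark`.
events = spark.read.table("MY_LINKED_CATALOG_DB.ANALYTICS.EVENTS_ICEBERG")
events.groupBy("event_type").count().show()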

 

How to get started

It's simple to try if your data is in, or accessible to, Snowflake. From the Spark Connect client environment where your Spark DataFrame code currently runs, point the session at Snowflake like this:

import os

from snowflake import snowpark_connect

# Start the Spark Connect session.
os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
snowpark_connect.start_session()
spark = snowpark_connect.get_session()

# Display data from tables in Snowflake.
orders = spark.read.table("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS")
orders.show()
customers = spark.read.table("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER")
customers.show()

# Top 10 most frequent buyers and their order counts.
frequent_buyers = orders.join(customers, orders.o_custkey == customers.c_custkey, "inner") \
    .groupBy(customers.c_name) \
    .count() \
    .orderBy("count", ascending=False) \
    .limit(10)

frequent_buyers.show()

# Read from a managed Iceberg table that you created using the tutorial.
iceberg_table = "iceberg_tutorial_db.PUBLIC.customer_iceberg"
df = spark.sql(f"SELECT * FROM {iceberg_table}")
df.show()

You can now run Spark DataFrame, SQL and UDF code on Snowflake via Snowflake Notebooks, Jupyter notebooks, Snowflake stored procedures, VSCode, Airflow or Snowpark Submit, with seamless access to data stored in Snowflake, in Iceberg tables (Snowflake-managed or externally managed) and in cloud storage.
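For instance, a Python UDF registered through the standard PySpark API should carry over unchanged. The UDF below is a made-up illustration that reuses the customers DataFrame from the snippet above:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# A made-up UDF for illustration: bucket customers by account balance.
@udf(returnType=StringType())
def balance_tier(balance):
    return "premium" if balance is not None and balance > 5000 else "standard"

customers.select("c_name", balance_tier("c_acctbal").alias("tier")).show()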

Considerations and limitations

Snowpark Connect currently works with Spark 3.5.x versions only. This includes support for the Spark DataFrame APIs and Spark SQL. There are, however, some distinctions to note regarding API coverage: the RDD, Spark ML, MLlib, Streaming and Delta APIs are not currently supported. Additionally, for the supported APIs, there can be some semantic differences to consider, as specified in the Snowpark Connect documentation. Snowpark Connect is currently available for Python environments only; Java and Scala support is under development.

Join today’s Data Engineering Connect event for a special segment featuring Snowpark Connect for Spark. Think this might be a good solution for your organization? Talk with your account team to learn more. And mark your calendar: register for our September 10 webinar, where we’ll review the feature in more detail.

 


¹ Based on customer production use cases and proof-of-concept exercises comparing the speed and cost for Snowpark versus managed Apache Spark services between November 2022 and May 2025. All findings summarize actual customer outcomes with real data and do not represent fabricated data sets used for benchmarks.
