
Build Better Data Pipelines: Constructing and Orchestrating with SQL and Python in Snowflake

Data transformations are the engine room of modern data operations — powering innovations in AI, analytics and applications. As the core building blocks of any effective data strategy, these transformations are crucial for constructing robust and scalable data pipelines. Today, we're excited to announce the latest product advancements in Snowflake to build and orchestrate data pipelines.

In today’s fast-paced AI era, pipelines are the bedrock of downstream data success. This puts data engineers in a critical position. Yet many find themselves constantly juggling competing priorities: 

  • Configuring and managing compute resources and infrastructure

  • Debugging across disparate stacks

  • Tracking and reacting to upstream data changes

  • Ensuring development agility and security

  • Navigating complexities associated with growing volumes of data — especially unstructured data

Addressing these points of friction is historically where Snowflake shines. For years, Snowflake has been laser-focused on reducing these complexities, designing a platform that streamlines organizational workflows and empowers data teams to concentrate on what truly matters: driving innovation. By moving deeper into the raw data layer to shepherd data from its source to its destination as curated data sets, we are empowering data engineers to spend less time bogged down in operational overhead and more time driving innovation.

To do this, we’re excited to announce new and improved features that simplify complex workflows across the entire data engineering landscape — from SQL workflows that support collaboration to more complex pipelines in Python.

Figure 1: Snowflake supports building data pipelines with both SQL and Python transformations, as well as flexible orchestration options to streamline the data lifecycle and support a wide range of use cases and data engineering personas.

Accessible data pipelines in SQL

For many organizations, SQL pipelines offer the most accessible entry into data transformation, empowering a wider range of team members, such as data analysts, and thereby easing the burden on data engineers. The modular nature of these pipelines, which can be built by users of varying SQL skills, allows for scalable and reliable execution of hundreds of workflows. This democratized approach helps ensure a strong and adaptable foundation.

Introducing dbt Projects on Snowflake (in public preview soon)

Data teams all over the world love dbt because it brings software engineering best practices and efficiency to SQL and Snowpark data transformation workflows within their data warehouses. By integrating dbt directly into Snowflake's automation and managed services, data engineers can now focus on building, deploying and monitoring these pipelines rather than managing infrastructure or stitching together observability across multiple systems.

With this integration, users will be able to seamlessly create, upload, edit and run dbt Projects natively in Snowflake within a new Workspaces interface, streamlining development and accelerating the delivery of transformed data.
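As a small illustration of the kind of transformation dbt manages on Snowflake, here is a hedged sketch of a dbt Python model that runs on Snowpark; the model and referenced source names (daily_revenue, stg_orders) and columns are hypothetical, not taken from the article.

```python
# models/daily_revenue.py -- sketch of a dbt Python model executed on Snowpark.
# The referenced upstream model (stg_orders) and column names are hypothetical.
from snowflake.snowpark.functions import col, sum as sum_


def model(dbt, session):
    # Materialize the result as a table in the target schema.
    dbt.config(materialized="table")

    # dbt.ref() returns a Snowpark DataFrame for the upstream model.
    orders = dbt.ref("stg_orders")

    # The returned DataFrame becomes the model's contents.
    return (
        orders.filter(col("AMOUNT") > 0)
              .group_by("ORDER_DATE")
              .agg(sum_("AMOUNT").alias("REVENUE"))
    )
```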

Dynamic Tables updates

Dynamic Tables provides a declarative processing framework for batch and streaming pipelines. This approach simplifies pipeline configuration, offering automatic orchestration and continuous, incremental data processing. Users gain comprehensive visibility through DAG visualization, receive real-time alerts and benefit from integrated data quality features, leading to more efficient and reliable data pipeline management. Updates include the following (a short example sketch follows the list):

  • Apache Iceberg support (now generally available): Dynamic Tables now supports open table formats, including Apache Iceberg™. Users can build batch and stream processing pipelines on Iceberg tables (using Snowflake or an external catalog) with declarative definitions, automatic orchestration and incremental processing. The resulting data can be queried by any Iceberg engine.

  • Lower latency (private preview): Create real-time pipelines with end-to-end latency (from ingestion to transformation) of ~15 seconds. 

  • Performance enhancements (generally available): Use improved incremental refreshes of OUTER JOINs, QUALIFY RANK() = 1, window functions and clustered tables, along with new incremental optimizations for CURRENT_TIMESTAMP and IS_ROLE_IN_SESSION.

  • Define completeness (generally available): New SQL extensions, IMMUTABLE WHERE and INSERT ONLY, offer more control over data completeness, allowing users to prevent updates or deletions, restrict data modifications based on conditions and backfill data from existing pipelines for migrations.
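To make the declarative model concrete, here is a minimal sketch that defines a Dynamic Table from Snowpark Python. The table, source and warehouse names (daily_revenue, raw_orders, transform_wh) are hypothetical, and a default connection is assumed to be configured; the same pattern applies when the source is an Iceberg table.

```python
# Minimal sketch of a declarative Dynamic Table definition issued from
# Snowpark Python. Object names are hypothetical placeholders.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()  # assumes a configured default connection

# Snowflake orchestrates and incrementally refreshes the table to stay
# within the declared target lag; no external scheduler is required.
session.sql("""
    CREATE OR REPLACE DYNAMIC TABLE daily_revenue
      TARGET_LAG = '15 minutes'
      WAREHOUSE = transform_wh
      AS
        SELECT order_date, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY order_date
""").collect()
```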

Enterprise-grade Python development

Snowpark enables enterprise-grade Python development for building and scaling data pipelines directly in Snowflake. Developers write familiar Python syntax and pandas DataFrame operations, while complex transformations execute seamlessly on Snowflake's elastic engine, eliminating data movement for efficient large-scale data processing. Snowpark handles growing data volumes and processing demands without infrastructure overhead, offering a powerful and scalable Python solution.

pandas on Snowflake updates

pandas on Snowflake integrates the flexibility of pandas with Snowflake's scalability, simplifying the development of robust Python data pipelines. Users can now:

  • Integrate with various data sources, including accessing and saving to Snowflake tables, views, Iceberg tables, Dynamic Tables and common file formats (CSV, Parquet, Excel, XML)

  • Develop pandas pipelines that scale from initial prototypes to full production deployments without code changes

  • Utilize familiar pandas syntax to leverage Snowflake's analytical capabilities for flexible data transformation, including Snowflake Cortex AI LLM functions for developing AI-powered workflows

Using pandas on Snowflake, developers can build end-to-end Python data pipelines by reading from an Iceberg table, transforming the data with pandas and saving the result as a dynamic Iceberg table.
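As a rough sketch of that flow: the table and column names (RAW_ORDERS, DAILY_REVENUE, ORDER_DATE, AMOUNT) are hypothetical, a default connection is assumed, and the result is written back as a regular table to keep the example minimal.

```python
# Sketch of a pandas on Snowflake pipeline. Table and column names are
# hypothetical placeholders.
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # noqa: F401  activates the Snowflake backend
from snowflake.snowpark import Session

Session.builder.getOrCreate()  # assumes a configured default connection

# Reads push work down to Snowflake instead of pulling data into memory.
df = pd.read_snowflake("RAW_ORDERS")

# Familiar pandas syntax for the transformation itself.
daily = (
    df[df["AMOUNT"] > 0]
    .groupby("ORDER_DATE")["AMOUNT"]
    .sum()
    .reset_index()
    .rename(columns={"AMOUNT": "REVENUE"})
)

# Persist the result back to Snowflake as a table.
daily.to_snowflake("DAILY_REVENUE", if_exists="replace", index=False)
```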

To support pandas pipelines across all data scales, we are introducing pandas on Snowflake with Hybrid Execution (private preview). This new capability intelligently determines the optimal backend for running your pandas queries, either pushing them down to Snowflake for large data sets or running them in memory with standard pandas, to support rapid interactive testing and development.

Figure 2: Hybrid execution for pandas on Snowflake intelligently determines whether to run queries by pushing down to Snowflake or locally in-memory with vanilla pandas.

Snowpark updates

Snowpark speeds up data development by enabling data transformation with Python and other languages within Snowflake. This extensibility is tightly integrated with the security and scalability of the Snowflake platform, allowing developers to use familiar tools without data movement or separate infrastructure management.
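A minimal, hypothetical sketch of this pattern with the Snowpark DataFrame API (table and column names are placeholders; a default connection is assumed):

```python
# Minimal Snowpark sketch: transformations are expressed in Python but
# execute inside Snowflake. Table and column names are hypothetical.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

session = Session.builder.getOrCreate()  # assumes a configured default connection

orders = session.table("RAW_ORDERS")

revenue_by_region = (
    orders.filter(col("STATUS") == "COMPLETE")
          .group_by("REGION")
          .agg(sum_("AMOUNT").alias("TOTAL_REVENUE"))
)

# Lazily evaluated; the result is written as a table without moving data out.
revenue_by_region.write.save_as_table("REVENUE_BY_REGION", mode="overwrite")
```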

With Snowpark execution, customers have seen an average of 5.6x faster performance and 41% cost savings compared to traditional Spark.[1]

Snowpark now offers enhanced capabilities for bringing code to data securely and efficiently across languages, with expanded support across data integration, package management and secure connectivity. Updates include:

  • Data integration: With Python DB-API Support (private preview), developers can now use Snowpark to pull data from external relational databases directly into Snowflake. Python XML RowTag Reader (private preview) allows loading large, nested XML files using a simple rowTag option. Users can ingest only the relevant parts of an XML document and receive structured tabular output for downstream processing. 

  • Package management: Artifact Repository (generally available) simplifies package management, making it easy to download and install packages from PyPI within Snowpark user-defined functions (UDFs) and Stored Procedures. For custom packages, you can now upload packages with native code and import them in your UDFs or Stored Procedures.

  • File writes from Python UD(T)F (now generally available): This feature expands Snowpark’s capabilities for data engineering use cases, especially where parallel writes of custom files are required within UDFs. Examples include writing custom files (such as model files, unstructured files like PDFs and images, or semi-structured files such as JSON) from a function to a stage, and transforming files on the stage as part of data pipelines. For instance, you can now convert row-oriented Avro files to JSON and split large files into smaller files for downstream applications.

We have also made it easier to access external data sources and endpoints from Snowpark, with capabilities such as wildcard support in network rules, an Allow All option to reach any endpoint in network rules, and integration with AWS IAM to simplify connectivity to AWS resources. Additionally, External Access Outbound Private Connectivity is now available in additional regions, including AWS Gov (generally available), Azure Gov (generally available) and Google Cloud Platform (private preview).
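For context, here is a rough sketch of the DDL involved, issued from Snowpark Python. The rule, integration and host names are hypothetical, and the wildcard host is only meant to illustrate the newly announced wildcard support in network rules.

```python
# Sketch: granting Snowpark code outbound access to an external API using a
# network rule with a wildcard host. Object names and the host pattern are
# hypothetical placeholders.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()  # assumes a configured default connection

session.sql("""
    CREATE OR REPLACE NETWORK RULE partner_api_rule
      MODE = EGRESS
      TYPE = HOST_PORT
      VALUE_LIST = ('*.api.example.com:443')
""").collect()

session.sql("""
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION partner_api_integration
      ALLOWED_NETWORK_RULES = (partner_api_rule)
      ENABLED = TRUE
""").collect()
```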

Automating pipelines

Automated orchestration is embedded into transformation features such as Dynamic Tables, while Snowflake Tasks provide additional native orchestration, giving teams a reliable and scalable framework for consistent execution without the operational overhead.

Tasks and serverless tasks updates

Snowflake Tasks and serverless tasks shine for orchestration because they allow you to define complex workflows as a series of dependent SQL statements or Python code executed directly within Snowflake, eliminating the need for external orchestration tools. This tight integration simplifies management and leverages Snowflake's robust compute resources for reliable and cost-effective automation. Over the past year, we’ve been making continuous improvements to these native orchestration capabilities (a brief example sketch follows the list), including:

  • Task Graph enhancements: Define richer workflows to model data pipelines with new views and notifications. You can now send notifications to cloud messaging services upon the successful completion of a Task Graph (which can trigger downstream action) and view the graph representation of task execution dependencies with metadata information for tasks.

  • Triggered tasks: Immediately run tasks when new data arrives in source tables with event-based processing for SQL and Snowpark. You can now also create a task without needing to specify a schedule or virtual warehouse. Additionally, you can automatically run tasks when data arrives from a data share or in directory tables (in addition to previous support for tables, views, Dynamic Tables and Iceberg). 

  • Low-latency task scheduler: Reliably orchestrate data pipelines with 10-second schedules to frequently process data.

  • Optimization and governance controls: Control for cost and performance optimizations on serverless tasks. 

  • Edit tasks in Snowsight: Edit existing tasks from the action menu to alter schedule, compute, parameters or comment. 

  • Python/JVM automation: Automate UDFs (Python/JVM) and Stored Procedures with serverless tasks.
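As an illustrative sketch, the following creates a triggered, serverless task that runs whenever a stream on a source table has new rows. The stream, table and task names are hypothetical, and a default connection is assumed.

```python
# Sketch: a triggered, serverless task that processes new rows as they
# arrive. Stream, table and task names are hypothetical placeholders.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()  # assumes a configured default connection

# Track changes on the source table with a stream.
session.sql("CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders").collect()

# No SCHEDULE and no WAREHOUSE: the task is triggered by new data in the
# stream and runs on serverless compute.
session.sql("""
    CREATE OR REPLACE TASK load_new_orders
      WHEN SYSTEM$STREAM_HAS_DATA('raw_orders_stream')
      AS
        INSERT INTO curated_orders (order_id, order_date, amount)
        SELECT order_id, order_date, amount FROM raw_orders_stream
""").collect()

# Tasks are created suspended; resume to start processing.
session.sql("ALTER TASK load_new_orders RESUME").collect()
```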

A more comprehensive pipeline experience with Snowflake

Snowflake continues to evolve as the central engine for modern data operations, providing a comprehensive suite of tools to build and orchestrate data pipelines with ease and efficiency. From the accessibility of SQL and the power of dbt to the flexibility of Python through Snowpark and pandas, these latest advancements empower data engineers to overcome operational complexities and focus on driving innovation. By bringing code closer to data, streamlining workflows and enhancing performance across diverse use cases and skill sets, Snowflake is committed to enabling data teams to unlock the full potential of their data in today's fast-paced, AI-driven landscape.

If you’d like to learn more about these features, join us at Data Engineering Connect on July 29, 2025.

 


Forward Looking Statements:

This article contains forward-looking statements, including about our future product offerings, which are not commitments to deliver any product offerings. Actual results and offerings may differ and are subject to known and unknown risks and uncertainties. See our latest 10-Q for more information.

 

1 Based on customer production use cases and proof-of-concept exercises comparing the speed and cost for Snowpark versus managed Spark services between November 2022 and May 2025. All findings summarize actual customer outcomes with real data and do not represent fabricated data sets used for benchmarks.
