Author: Michael Gregory
July 22, 2022

Snowpark for Python: Bringing Efficiency and Governance to Polyglot ML Pipelines


PLEASE NOTE: This post was originally published in May. It has been updated in July to reflect currently available features and functionality.

Machine learning (ML), more than any other workflow, has imposed the most stress on modern data architectures. Its success is often contingent on the collaboration of polyglot data teams stitching together SQL- and Python-based pipelines to execute the many steps that take place from data ingestion to ML model inference.  

This polyglot nature of data teams is one of the largest impediments to an organization’s ability to operationalize the machine learning workflow and create sustainable return on investment from its data.

The polyglot impediment

For over a decade data professionals have been touting, building, and striving for the utopia of data democratization: a future state where anyone, regardless of role or skills, can leverage the power of data in their daily work. 

Yet, as more people from diverse backgrounds join the conversation, it is unrealistic to expect them all to interact with data in the same programming language. Over time, different languages have emerged to meet the needs of different communities. While SQL has long been the mainstay of large-scale data transformation and management, languages like Python have gained ground by offering more flexible functional constructs, greater expressiveness, and extensibility. Today, a massive number of Python frameworks simplify everything from application development to quantitative analysis and ML.

Specific to ML, many of the challenges of machine learning operations (MLOps) stem directly from this polyglot impediment. Often, the most effective tool for any particular task in a complex training or inference pipeline may be written in SQL or Python. The multitude of frameworks (such as TensorFlow and PyTorch), along with the specialized compute infrastructure needed to support them, exacerbates this complexity even further. MLOps and DevOps teams are left with the unenviable job of building and maintaining efficient, scalable pipelines across multiple platforms supporting different languages and frameworks.

The multi-platform approach

Different platforms have emerged to support these different languages as a way to overcome the polyglot impediment. For example, data platforms have traditionally been the domain of data engineers and analysts, but because these platforms don’t always meet the needs of data scientists, who sometimes require different languages and frameworks, some data scientists opt to build their own separate platforms. On top of that, ML engineers often build their own MLOps platforms to support capabilities such as monitoring, orchestration, and version control.

Figure: Platforms used to develop and deliver ML-powered applications

To bridge multiple processing steps across these platforms, frameworks such as Apache Airflow and dbt have emerged to simplify orchestration by acting as an integration hub. But these separate platforms still add technical debt and risk as data is copied and moved between them, and until now they have not evolved to overcome the polyglot impediment at the data layer.

While technical teams struggle to maintain a data infrastructure that is fragile, overly complex, and language- and workload-specific, CIOs and CDOs are constantly dealing with rising costs and security risks from duplicate pipelines and the massive amounts of data moving across these platforms.

The polyglot platform approach

As the polyglot nature of the data world is not likely to change (think of the emergence of newer languages such as Julia), and data teams continue to run into the challenges and risks associated with data movement across multi-platform architectures, it becomes increasingly apparent that multi-language platforms will play a vital role. Rather than moving data across various single-language platforms, multi-language platforms can support the processing needs of multiple teams and languages, reducing the need to move data outside of its governed boundaries.

To streamline architectures, enhance collaboration between different teams, and provide consistent governance across all data, the world needs more polyglot platforms with seamless integration with best-of-breed orchestration frameworks.

Snowpark: The polyglot answer for modern data teams

Snowflake introduced Snowpark as an extensibility framework to create a polyglot platform that bridges the gaps between data engineers, data scientists, ML engineers, application developers, and the MLOps and DevOps teams that support them.

Launched first with support for popular languages such as Java, Scala, and JavaScript, Snowpark makes it possible to simplify architectures while reducing the costs and governance risks associated with data duplication. Snowpark allows users to talk to their data in the language of their choice while leveraging the performance, scalability, simplicity, security, and governance they have come to expect from Snowflake. Best of all, Snowpark was designed to make it easy to integrate custom functions written in other languages as part of a SQL query or processing step, as illustrated in the sketch below.
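As a rough illustration of that idea, the minimal sketch below registers a Python function as a Snowpark UDF and calls it from plain SQL. The connection parameters, the `fahrenheit_to_celsius` function, and the `readings` table are hypothetical examples, not part of the original post.

```python
# Minimal sketch: register a Python function as a Snowpark UDF and call it from SQL.
# Connection details, the UDF, and the "readings" table are hypothetical examples.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# The decorated function is shipped to Snowflake and runs where the data lives.
@udf(name="fahrenheit_to_celsius", return_type=FloatType(),
     input_types=[FloatType()], replace=True, session=session)
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0

# The registered UDF is now callable from a SQL query alongside existing SQL logic.
session.sql(
    "SELECT sensor_id, fahrenheit_to_celsius(temp_f) AS temp_c FROM readings"
).show()
```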

And now, Snowpark for Python (in public preview) takes it to a whole new level by embracing a massive community of developers, data engineers, data scientists, and ML engineers. Unsurprisingly, Snowpark for Python also exposes much-needed surface area for integration with orchestration frameworks, and the Snowflake partnership with Anaconda makes it possible to tap into a huge ecosystem of frameworks including TensorFlow, PyTorch, Keras, and many more.
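To make that concrete, here is a hedged sketch of how packages from the Anaconda channel might be declared for a Snowpark Python UDF. The `connection_parameters` dict, the `predict_score` function, and the toy model are hypothetical and for illustration only.

```python
# Minimal sketch: declaring Anaconda-provided packages for a Snowpark Python UDF.
# connection_parameters, the UDF, and the toy model are hypothetical examples.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

session = Session.builder.configs(connection_parameters).create()  # assumes a credentials dict

# Resolve third-party dependencies from the Snowflake Anaconda channel.
session.add_packages("scikit-learn", "numpy")

@udf(name="predict_score", return_type=FloatType(), input_types=[FloatType(), FloatType()],
     packages=["scikit-learn", "numpy"], replace=True, session=session)
def predict_score(feature_a: float, feature_b: float) -> float:
    # Imports live inside the UDF so they resolve on the warehouse, not on the client.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy model fit inline purely for illustration; a real pipeline would load a trained model.
    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
    y = np.array([0.0, 1.0, 2.0])
    model = LinearRegression().fit(X, y)
    return float(model.predict([[feature_a, feature_b]])[0])
```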

Apache Airflow: An orchestration framework for a multilingual workflow

Simultaneously, Astronomer and the Airflow community continue to add great support for Python, including the TaskFlow API introduced in Airflow 2.0. TaskFlow provides a comfortable, Pythonic interface for data teams while encouraging good software development practices. In conjunction with Snowpark, TaskFlow makes it possible to define complex data transformations in Python and to integrate non-SQL tasks such as ML into a DAG, with no data movement; a sketch of such a DAG follows below.
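Under stated assumptions (an Airflow 2 deployment, a Snowpark `connection_parameters` dict, and illustrative table names, none of which appear in the original post), a minimal TaskFlow sketch might look like this:

```python
# Minimal sketch: an Airflow 2 TaskFlow DAG whose task pushes feature engineering into
# Snowflake via Snowpark, so data stays inside its governed boundary.
# connection_parameters and the table names are hypothetical examples.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@daily", start_date=datetime(2022, 7, 1), catchup=False)
def snowpark_feature_pipeline():

    @task
    def build_features() -> str:
        from snowflake.snowpark import Session

        session = Session.builder.configs(connection_parameters).create()  # assumed credentials dict
        # Feature engineering expressed as Snowpark DataFrame operations, executed in Snowflake.
        features = session.table("RAW_TRIPS").group_by("PICKUP_ZONE").count()
        features.write.save_as_table("TRIP_FEATURES", mode="overwrite")
        return "TRIP_FEATURES"

    @task
    def train_model(feature_table: str) -> None:
        # Downstream training or inference steps would read the same governed table.
        print(f"Training against {feature_table}")

    train_model(build_features())


snowpark_feature_pipeline()
```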

Together, Snowpark and Airflow allow data teams to run entire pipelines, from ELT and feature engineering through experimentation, model training, inference, and monitoring, and even to power visual applications in Streamlit, all without moving or copying data. Complex, scalable pipelines can be managed with confidence because this approach reduces architectural complexity and governance risk while remaining open to best-of-breed frameworks.

Empowering polyglot teams

Thanks to the democratization of data, today’s teams require platforms that support many different languages and frameworks. Snowpark empowers these polyglot teams with one platform supporting open integrations with the world’s leading orchestration frameworks, enabling operational simplicity while reinforcing good data governance practices. 

Snowpark is already in general availability for Java/Scala and currently in public preview for Python. To build your own ML workflow using Snowpark, Anaconda, and Apache Airflow, check out this step-by-step code guide.
