Snowflake Data Pipelines

Author: Dinesh Kulkarni


Businesses work with massive amounts of data today, and to analyze all of it they need a single view into the entire data set. The challenge is that the data resides in multiple systems and services, yet it must be combined in ways that make sense for deep analysis. The data flow itself can be unreliable: at many points during transport from one system to another, data can be corrupted or hit bottlenecks that add latency. As the breadth and scope of the role data plays increases, these problems grow in scale and impact.

That is why data pipelines are critical. They eliminate many manual steps from the process, enabling a smooth, automated flow of data from one step to the next. Data pipelines also support real-time analytics, which helps organizations make faster, data-driven decisions. They’re particularly important for organizations that:

  • Rely on real-time data analysis
  • Store data in the cloud
  • House data in multiple sources

To further augment Snowflake’s focus on data pipelines, we released a public preview of the Auto-Ingest, Streams and Tasks, and Snowflake Connector for Kafka features. Together they provide customers with continuous, automated, and cost-effective services that load data efficiently and without manual effort.

New Enhancements to Snowflake Data Pipelines:

  • Auto-Ingest
    AWS and Azure provide mechanisms that notify users whenever an object is created. Auto-Ingest layers these mechanisms over the ingest service so that the service can automatically detect files created under a stage, retrieve them, and ingest them into their appropriate tables. This is important because it reduces latency for queries by ingesting and transforming data as it arrives.
  • Streams and Tasks
    The Streams and Tasks feature is fundamental to building end-to-end data pipelines and orchestration in Snowflake. Customers can use Snowpipe or their ELT provider of choice, but that approach only loads data into Snowflake. Streams and Tasks aims to provide a task scheduling mechanism so customers no longer have to resort to external jobs for their most common scheduling needs for Snowflake SQL jobs. The feature also lets customers connect their staging tables and downstream target tables with regularly executed logic that picks up new data from the staging table and transforms it into the shape required for the target table.
  • Snowflake Connector for Kafka
    Apache Kafka is a platform for building pipelines to handle continuous streams of records, and this connector makes it fast and easy to reliably publish these records to your Snowflake instance for storage and analysis.

The Snowflake Connector for Kafka is available via the Maven Central Repository. After you install the connector on a Kafka Connect cluster, you can create connector instances via a simple JSON configuration or via the Confluent Control Center. After you configure the connector for a set of topics, it creates and manages stages, pipes, and files on the user’s behalf to reliably ingest messages into Snowflake tables.
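To make the "simple JSON configuration" concrete, here is a minimal sketch in Python that assembles a payload of the shape Kafka Connect accepts for a new connector instance. The property names follow the connector's documentation as we understand it, so verify them against the version you install. The helper name `build_connector_config` and every value passed to it (account URL, user, key, database, schema, topics) are hypothetical placeholders.

```python
import json


def build_connector_config(name, topics, account_url, user, private_key,
                           database, schema):
    """Return a Kafka Connect payload for one Snowflake sink connector.

    Hypothetical helper for illustration. Check each property name against
    the documentation for the connector version you deploy.
    """
    return {
        "name": name,
        "config": {
            "connector.class":
                "com.snowflake.kafka.connector.SnowflakeSinkConnector",
            "tasks.max": "1",
            # Kafka Connect expects a comma-separated list of topics.
            "topics": ",".join(topics),
            "snowflake.url.name": account_url,
            "snowflake.user.name": user,
            "snowflake.private.key": private_key,
            "snowflake.database.name": database,
            "snowflake.schema.name": schema,
            "key.converter":
                "org.apache.kafka.connect.storage.StringConverter",
            "value.converter":
                "com.snowflake.kafka.connector.records.SnowflakeJsonConverter",
        },
    }


# Placeholder values only; none of these refer to a real account.
payload = build_connector_config(
    name="snowflake_sink_example",
    topics=["orders", "clicks"],
    account_url="myaccount.snowflakecomputing.com:443",
    user="kafka_loader",
    private_key="<private key goes here>",
    database="raw",
    schema="kafka",
)

print(json.dumps(payload, indent=2))
```

The printed JSON is what you would submit to the Kafka Connect cluster. In practice the private key should come from a secrets store rather than from source code.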

There is no additional charge for the use of the Snowflake Connector for Kafka, which is freely available under an Apache 2.0 license. The connector makes use of tables, stages, files, and pipes, which are all charged at normal rates.

When we made this set of features available to a select few customers for private preview, those customers saw tremendous benefit. Leaders at Blackboard, Inc. said, “Snowpipe and Streams and Tasks enabled us to build an ingestion platform for most of our data pipelines hydrating our data lake. These pipelines serve over a thousand clients/sites with hundreds of tables per site and growing, resulting in a significant reduction of our infrastructure management and costs, and a streamlined architecture with less complexity and handoff points.”

How can you get started?

If you have files regularly created in a blob store such as Amazon S3 or Microsoft Azure Blob Storage, you can create a Snowpipe with the Auto-Ingest option and specify the appropriate prefix for the files you want Snowpipe to ingest. Once the pipe is created, you can configure the corresponding blob creation notifications, through Amazon SQS or Microsoft Azure Event Grid, to go to the pipe. Once you connect the notifications, the pipe starts automatically ingesting newly created files that match the specified prefix.
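As a rough sketch of the first step, the Python helper below renders a `CREATE PIPE` statement with Auto-Ingest enabled and a file prefix on the stage. The helper name `create_pipe_sql`, the object names, and the JSON file format are all assumptions made for illustration. The rendered statement can be run through whichever client you use to execute Snowflake SQL.

```python
def create_pipe_sql(pipe, table, stage, prefix=""):
    """Render a CREATE PIPE statement with Auto-Ingest enabled.

    Hypothetical helper: the names are interpolated directly, so pass only
    trusted identifiers, never user input.
    """
    # Only files under this stage location (stage plus optional prefix)
    # are picked up by the pipe.
    location = f"@{stage}/{prefix}" if prefix else f"@{stage}"
    return (
        f"CREATE PIPE {pipe} AUTO_INGEST = TRUE AS\n"
        f"  COPY INTO {table}\n"
        f"  FROM {location}\n"
        f"  FILE_FORMAT = (TYPE = 'JSON');"  # assumed file format
    )


# Placeholder object names for illustration.
statement = create_pipe_sql(
    pipe="raw_events_pipe",
    table="raw_events",
    stage="events_stage",
    prefix="events/2019/",
)

print(statement)
```

The sketch covers only the pipe definition. Wiring the blob store's creation notifications to the pipe is a separate step on the cloud provider's side, and an Azure setup typically needs additional notification settings that are omitted here.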

Table Streams can be used independently of Snowpipe. You can create a stream any time on any table and start consuming it using tasks or any other scheduled activity. A stream can be used just like a view – in DML and query statements.
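One way to picture a stream is as a bookmark on a table's change history. The toy model below is plain Python, not Snowflake code, and it tracks inserts only, whereas real streams also record updates and deletes. It illustrates the view-like behavior described above: reading the stream in a query leaves the bookmark in place, while consuming it in a DML statement moves the bookmark forward. The class and method names (`Table`, `Stream`, `select`, `consume`) are invented for this sketch.

```python
class Table:
    """Toy append-only table: a list of rows in insertion order."""

    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)


class Stream:
    """Toy stream: remembers how much of the table has been consumed."""

    def __init__(self, table):
        self.table = table
        # A new stream starts at the table's current state, so rows that
        # already exist are not reported as changes.
        self.offset = len(table.rows)

    def select(self):
        """Query the pending changes without advancing the offset."""
        return list(self.table.rows[self.offset:])

    def consume(self):
        """Model a DML statement that reads the stream: return the pending
        changes and advance the offset past them."""
        changes = self.select()
        self.offset += len(changes)
        return changes


staging = Table()
staging.insert({"id": 1})       # exists before the stream is created
stream = Stream(staging)
staging.insert({"id": 2})
staging.insert({"id": 3})

peek = stream.select()          # query only: offset unchanged
moved = stream.consume()        # DML-style read: offset advances
after = stream.select()         # nothing left to process

print(peek, moved, after)
```

In this model a scheduled job that calls `consume()` sees each new row once, which is the property that makes streams a natural input for tasks.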

Tasks can also be used independently of Snowpipe and Table Streams. Specify a schedule of one minute or longer and resume the task, and it will start running on that schedule. Streams and tasks can also be used together for continuous data pipelines that run periodically based on changes in a table.
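The combination of a stream and a task might look like the following sketch, which renders the two statements involved: a `CREATE TASK` that runs on a schedule only when the stream has data, and the `ALTER TASK ... RESUME` that starts it. The helper name `create_task_sql`, the warehouse, and all object and column names are hypothetical, and the statement body is just one example of a transformation into a target table.

```python
def create_task_sql(task, warehouse, minutes, stream, body):
    """Render the statements that define and start a scheduled task.

    Hypothetical helper: names are interpolated directly, so pass only
    trusted identifiers.
    """
    if minutes < 1:
        raise ValueError("the schedule must be one minute or longer")
    create = (
        f"CREATE TASK {task}\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  SCHEDULE = '{minutes} MINUTE'\n"
        # Skip a scheduled run when the stream has no new changes.
        f"WHEN SYSTEM$STREAM_HAS_DATA('{stream}')\n"
        f"AS\n"
        f"  {body};"
    )
    # The task does not run until it is resumed.
    resume = f"ALTER TASK {task} RESUME;"
    return [create, resume]


# Placeholder names for illustration.
statements = create_task_sql(
    task="load_orders",
    warehouse="transform_wh",
    minutes=5,
    stream="raw_orders_stream",
    body=("INSERT INTO orders (id, amount) "
          "SELECT id, amount FROM raw_orders_stream"),
)

for sql in statements:
    print(sql)
    print()
```

Because the task body reads the stream in a DML statement, each run picks up only the rows that arrived since the previous run.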

Read more about Snowflake Data Pipelines.