Public preview of the data lake export feature is now available.
Snowflake announced a private preview of data lake export at the Snowflake virtual summit in June 2020. Data lake export is one of the key features of the data lake workload in the Snowflake Data Cloud. The feature makes Snowflake data accessible to the external data lake, and it enables customers to take advantage of Snowflake’s reliable and performant processing capabilities.
This feature is often used by customers who use Snowflake to augment an existing data lake, for example, parquet files on cloud storage services like Amazon S3, Azure data lake storage, or Google cloud storage.
A common data lake architecture is shown in Figure 1. Raw data from various sources lands into the raw zone, after which it goes through the ETL process and it is converted into parquet files, which in turn are used to build tables and views for data consumers in the modeled zone. Consumers such as data scientists and data analysts consume data from the modeled zone.
Customers love Snowflake’s simplicity and ease of use. They love the ability to spin up and spin down virtual warehouses instantly and the elastic processing power that can efficiently process data at a low cost.
The data lake export feature is available via an enhancement to the existing COPY INTO <location> command in Snowflake.
An example of the data lake export job is shown below.
copy into @my_data_lake/processed_parquet/rnfromrn(select * from my_processed_table)rnpartition by ('year=' || year_col || '/date=' || date_col)rnheader = truernfile_format = (type = parquet)rnmax_file_size = 512000000;rn
The launch of the data lake export feature includes a few new functionalities. The first one is the partition by keyword. In a data lake architecture, hive style partitioning is commonly used to organize data on the data lake storage. When the partition by keyword is provided, Snowflake will export the parquet files in organized folders defined by the provided partition expression.
The second functionality is the max_file_size. Small files on data lakes lead to extremely inefficient queries. For faster performance, it is important to compact the data sets into highly compressed large parquet files. When max_file_size is used in a data lake export job, Snowflake will attempt to create parquet files as close to the provided file size as possible.
Lastly, Snowflake will include job query IDs in the file names. This will help customers easily identify all the files that were created using a particular job, thus improving data consistency on the data lake.
A lot of customers who participated in the private preview have implemented end-to-end pipelines like the one shown in Figure 2. This pipeline converts raw JSON or CSV data to parquet files, using Snowflake for performant processing with features like external tables, table streams, tasks, and data lake export.
Figure 2: End-to-end pipeline
For more information about this pipeline, see: Demo: How to use Snowflake as a transformation engine.
Try this functionality today.