Data pipelines are the lifeblood of modern analytics, which are a key enabler for making faster, data-driven decisions. Snowflake is in the business of enabling organizations to be data-driven to capture competitive advantage, and data ingestion is a key piece of the puzzle.
Over the last several years, we have seen the continued adoption of semi-structured data, which has flexible schemas that contain n-level hierarchies of nested information of different attributes. One of Snowflake’s highly compelling features is its native support for semi-structured data, either through copying the file data into Snowflake relational tables or by querying the file data in place using Snowflake external tables. In both options, a schema of the data sets that are included in the files must eventually be provided, either through Schema-on-Write or Schema-on-Read. We sought to improve Snowflake’s Schema-on-Write capabilities such that data from various sources can be brought into Snowflake quicker in the final schematized form that reporting tools or legacy applications require. Schema-on-Write should be completed without the need for external tools to resolve column definitions or repetitive manual schema resolution. Today I am happy to announce the public preview of the Schema Detection feature with support for Parquet, Avro, and ORC semi-structured file formats.
With schema detection, Snowflake will automatically determine and return the schema of your staged files for review. You can also have Snowflake generate a DDL so that you can easily review it and use it to create a table, external table, or view. And last but not least, you can have Snowflake automatically create a table for you with the detected schema. This is all made possible by three new functions that we added for the Schema Detection feature’s public preview.
- INFER_SCHEMA retrieves and returns the schema from a set of staged files.
- GENERATE_COLUMN_DESCRIPTION returns the list of columns necessary to create a table, external table, or view. Here you can modify the columns or data types before you complete the creation of the object.
- CREATE TABLE … USING TEMPLATE expands upon Snowflake’s CREATE TABLE functionality to automatically create the structured table using the detected schema from the staged files with no additional input.
Figure 1: Sample Parquet file schema
Figure 2: Detected schema converted to Snowflake table
While Snowflake supports both Schema-on-Read and Schema-on-Write, the public preview of the Schema Detection feature improves Snowflake’s Schema-on-Write capabilities and can greatly decrease the amount of effort at the beginning of data ingestion. Its functionality provides benefits such as:
- Easing the pain for handling the schemas of large multicolumn files manually
- Getting more insight about the files natively before starting the loading process
- Ingesting new data from multiple sources faster into the schematized tables necessary for visualization or legacy applications
Try Schema Detection Today
Schema detection is just the beginning of our investments to improve the Schema-on-Write experience; support for source file schema changes and other data formats such as CSV and JSON data is planned to be included, respectively, in our schema evolution and inference capabilities. You can find more details in this demo session, which was presented at Snowflake Summit 2021.
Give the Schema Detection feature a try today to streamline your data loading process, and let us know your feedback about the feature or about our roadmap. To get started with the feature, see our documentation. Generally, previews are not supported for production use, but they are a great way to get ready for production use when the General Availability release is available.
This post contains express and implied forwarding-looking statements, including statements regarding (i) Snowflake’s business strategy, (ii) Snowflake’s products, services, and technology offerings, including those that are under development, (iii) market growth, trends, and competitive considerations, and (iv) the integration, interoperability, and availability of Snowflake’s products with and on third-party platforms. These forward-looking statements are subject to a number of risks, uncertainties and assumptions, including those described under the heading “Risk Factors” and elsewhere in the Quarterly Report on Form 10-Q for the fiscal quarter ended April 30, 2021 that Snowflake has filed with the Securities and Exchange Commission. In light of these risks, uncertainties, and assumptions, actual results could differ materially and adversely from those anticipated or implied in the forward-looking statements. As a result, you should not rely on any forwarding-looking statements as predictions of future events.
© 2021 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s).