Unstructured Data Now Generally Available in Snowflake, Processing with Snowpark in Public Preview

We’re excited to announce the general availability of the unstructured data management functionality in Snowflake. We launched public preview of this functionality in September 2021, and since then we have seen adoption by customers across industries for a variety of use cases. These use cases include storing and securing call center recordings, securely sharing PDF documents in Snowflake Data Marketplace, storing medical images and extracting data from them, and many more.

The partner ecosystem for unstructured data continues to grow. We have ML partners like Clarifai, Impira, Labelbox, Semantic Health, and Veritone that can help customers derive valuable insights from their unstructured data. We also have partners like Hammerspace that can provide additional data management capabilities.

Snowpark for unstructured data now in public preview

Customers who want a single, centralized repository for multiple types of data can store and govern unstructured data in Snowflake. They can also process that data externally with External Functions, or natively with Snowpark and Java, now in public preview.

Snowpark is Snowflake’s new developer framework that natively supports Scala, Java, and Python (in preview) with full control over libraries, eliminating the need for separate processing engines. By enabling teams to collaborate on the same data in a single platform, customers can streamline their architecture and enable a wide variety of new use cases.

“We have a project to apply text analytics on vast amounts of email data with attachments. We were storing email body separately and attachments as binary in the database, which came with challenges. Attachments can exceed column storage limitations and the original email needs to be stored on disk to be accessed again. Snowflake’s support for unstructured data allows us to have all of the data and processing in one place and build rich datasets for machine learning in various use cases. We can store email files in their original format in a Snowflake-managed stage and process them using Snowflake’s engine with Java UDFs.”

— Eranga, VP Data Science in a leading software company

Processing unstructured data using Snowpark

Using directory tables together with Streams and Tasks, customers can build continuous data pipelines to process unstructured data. The files themselves can be processed on Snowflake compute using Java functions and Snowpark.

Data engineers, data scientists, and developers can create Java user-defined functions to:

Extract text from documents.
Process emails, extract metadata, and extract attachments.
Process medical images to extract patient information stored in them.
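As a rough illustration of the email case above, here is a minimal sketch of handler logic that pulls header metadata out of a raw email file. It uses only the Java standard library and assumes the function receives the file contents as a java.io.InputStream. The class name EmailHeaderExtractor and the methods parseHeaders and execute are hypothetical names chosen for this example, not part of any Snowflake API. A production pipeline would more likely use a full mail-parsing library to handle attachments and encodings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical handler class; the names are illustrative only.
public class EmailHeaderExtractor {

    // Reads the header section of a raw email (everything before the first
    // blank line) and returns header names (lowercased) mapped to values.
    // If a header repeats, the last occurrence wins.
    public static Map<String, String> parseHeaders(InputStream in) {
        Map<String, String> headers = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            String currentName = null;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    break; // a blank line ends the header section
                }
                boolean continuation = line.startsWith(" ") || line.startsWith("\t");
                if (continuation && currentName != null) {
                    // Folded header: append the continuation to the previous value.
                    headers.put(currentName, headers.get(currentName) + " " + line.trim());
                    continue;
                }
                int colon = line.indexOf(':');
                if (colon <= 0) {
                    continue; // skip lines that are not "Name: value"
                }
                currentName = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(currentName, line.substring(colon + 1).trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return headers;
    }

    // Assumed entry point for a function call: returns the subject line,
    // or an empty string when the email has none.
    public String execute(InputStream in) {
        return parseHeaders(in).getOrDefault("subject", "");
    }
}
```

Returning a single scalar from execute keeps the sketch simple. A real function might return several fields at once so that a single call can populate multiple columns.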

Imagine a healthcare provider has doctor notes stored in PDF or image format, and they need to extract fields from these files into structured tables. Within Snowflake, this healthcare provider can create a Java function to extract data from PDFs or image files, which can be called in SQL queries or pipelines for continuous processing with Snowflake’s engine.
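To make the doctor-notes scenario a little more concrete, the sketch below shows the field-extraction step only. It assumes the text has already been pulled out of the PDF or image (that step needs a third-party PDF or OCR library and is not shown) and that each note uses simple "Label: value" lines. The labels Patient, Date, and Diagnosis, along with the class name NoteFieldExtractor and its methods, are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical handler class; labels and names are illustrative only.
public class NoteFieldExtractor {

    // Matches lines such as "Patient: Jane Doe" (case-insensitive, one per line).
    // [ \t]* is used instead of \s* so a match never spills onto the next line.
    private static final Pattern FIELD = Pattern.compile(
            "(?im)^[ \\t]*(patient|date|diagnosis)[ \\t]*:[ \\t]*(.+?)[ \\t]*$");

    // Returns the recognized fields keyed by lowercased label.
    // The first occurrence of each label is kept.
    public static Map<String, String> extractFields(String noteText) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (noteText == null) {
            return fields;
        }
        Matcher m = FIELD.matcher(noteText);
        while (m.find()) {
            fields.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }

    // Assumed entry point for a function call: returns one field's value,
    // or null when the note does not contain that field.
    public String execute(String noteText, String fieldName) {
        if (fieldName == null) {
            return null;
        }
        return extractFields(noteText).get(fieldName.toLowerCase(Locale.ROOT));
    }
}
```

Called once per field from a query, a function like this could populate the columns of a structured table. Real clinical notes are rarely this regular, so the pattern is a placeholder for whatever parsing the documents actually require.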

Here’s an example of what such an architecture would look like:

Get started

Try out these new capabilities by following along with this quickstart, which walks you step by step through storing, governing, processing, and sharing unstructured data with Snowflake. You can also explore the product documentation here. We’re always looking for ways to improve, so if you have feedback about Snowpark, share it with us in our dedicated discussion forum.

Author

Saurin Shah
