We’re excited to announce the general availability of unstructured data management in Snowflake. We launched the public preview of this functionality in September 2021, and since then customers across industries have adopted it for a variety of use cases. These include storing and securing call center recordings, securely sharing PDF documents in Snowflake Data Marketplace, and storing medical images and extracting data from them.
The partner ecosystem for unstructured data continues to grow. We have ML partners like Clarifai, Impira, Labelbox, Semantic Health, and Veritone that can help customers derive valuable insights from their unstructured data. We also have partners like Hammerspace that can provide additional data management capabilities.
Snowpark for unstructured data now in public preview
Customers who want a single, centralized repository for multiple types of data can store and govern unstructured data in Snowflake. They can also process that data, either externally with External Functions or natively with Snowpark and Java, now in public preview.
Snowpark is Snowflake’s new developer framework that natively supports Scala, Java, and Python (in preview) with full control over libraries, eliminating the need for separate processing engines. Because teams can collaborate on the same data in a single platform, customers can streamline their architecture and enable a wide variety of new use cases.
“We have a project to apply text analytics on vast amounts of email data with attachments. We were storing email body separately and attachments as binary in the database, which came with challenges. Attachments can exceed column storage limitations and the original email needs to be stored on disk to be accessed again. Snowflake’s support for unstructured data allows us to have all of the data and processing in one place and build rich datasets for machine learning in various use cases. We can store email files in their original format in a Snowflake-managed stage and process them using Snowflake’s engine with Java UDFs.”
— Eranga, VP Data Science in a leading software company
Processing unstructured data using Snowpark
Using directory tables together with Streams and Tasks, customers can build continuous data pipelines to process unstructured data. The files themselves can be processed on Snowflake compute using Java functions and Snowpark.
Data engineers, data scientists, and developers can create Java user-defined functions to:
- Extract text from documents.
- Process emails, extract metadata, and extract attachments.
- Process medical images to extract patient information stored in them.
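As a rough illustration of the email case, here is a minimal sketch of the kind of handler class a Java user-defined function could point at. It uses only the Java standard library and assumes the function receives the staged file's contents as a java.io.InputStream. The class name EmailHeaderExtractor, the method names, and the pipe-separated return format are hypothetical choices for this example, not part of any Snowflake API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical handler class; names and output format are illustrative only.
class EmailHeaderExtractor {

    // Reads the header block of an RFC 5322-style message, i.e. every line
    // up to the first blank line. Header names are lower-cased. Folded
    // continuation lines (starting with a space or tab) are appended to the
    // preceding header. If a header repeats, the last occurrence wins.
    static Map<String, String> parseHeaders(InputStream in) {
        Map<String, String> headers = new LinkedHashMap<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        try {
            String current = null;
            String line;
            while ((line = reader.readLine()) != null && !line.isEmpty()) {
                boolean folded = line.charAt(0) == ' ' || line.charAt(0) == '\t';
                if (folded && current != null) {
                    headers.put(current, headers.get(current) + " " + line.trim());
                    continue;
                }
                int colon = line.indexOf(':');
                if (colon <= 0) {
                    continue; // not a "Name: value" line; skip it
                }
                current = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(current, line.substring(colon + 1).trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return headers;
    }

    // Entry point for the function: returns "from|to|subject", with an empty
    // string for any header that is missing.
    public static String extract(InputStream email) {
        Map<String, String> h = parseHeaders(email);
        return String.join("|",
                h.getOrDefault("from", ""),
                h.getOrDefault("to", ""),
                h.getOrDefault("subject", ""));
    }
}
```

A real pipeline would more likely return structured output and use a mail-parsing library to handle attachments and encoded headers; this sketch only shows the shape of the handler.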
Imagine a healthcare provider has doctor notes stored in PDF or image format, and they need to extract fields from these files into structured tables. Within Snowflake, this healthcare provider can create a Java function to extract data from PDFs or image files, which can be called in SQL queries or pipelines for continuous processing with Snowflake’s engine.
Here’s an example of what such an architecture would look like:
Try out these new capabilities by following this quickstart, which walks you step by step through storing, governing, processing, and sharing unstructured data with Snowflake. You can also explore the product documentation here. We’re always looking for ways to improve, so if you have feedback about Snowpark, share it with us in our dedicated discussion forum.