Mobilize your data with the Snowflake Data Cloud and its new enhancements for data pipelines, governance, and performance.
If you missed November’s Data Cloud Summit 2020, you missed hearing about many new products, capabilities, and features, currently in preview or development, that help organizations mobilize their data: easily obtaining data from almost anywhere in almost any format; gaining lightning-fast performance; and exploring, analyzing, and sharing data insights via a range of analytics tools.
First there’s the Snowflake Data Cloud itself, a network of thousands of organizations mobilizing data as data consumers, data providers, and data service providers. Built on Snowflake’s platform, the Data Cloud is a single location where organizations can unify their siloed data.
“In the Data Cloud, silos are eliminated, and our vision is to bring the world’s data within reach of every organization,” Snowflake Senior Vice President of Product, Christian Kleinerman, said in his overview presentation. “The Data Cloud enables seamless collaboration, control, and access to data via Snowflake Data Marketplace,” which now includes more than 100 data providers.
“A critical aspect of the Data Cloud is that we envision organizations collaborating, not just in terms of data but also in terms of data-powered applications and services,” said Kleinerman. “Think of instances where a provider doesn’t want to open access to an entire data set but wants to make available business logic that can access and leverage that data set. That is what we call data services. And we want Snowflake to be the platform of choice for developing, discovering, and consuming such building blocks.”
To that end, several of the Data Cloud Summit 2020 announcements focused on features that extend the reach and impact of the Data Cloud, including extensible pipelines, advanced data governance, and Snowflake platform performance and capabilities.
EXTENSIBLE DATA PIPELINES
Now in public preview, the Snowflake External Functions feature lets teams use services or business logic hosted outside of Snowflake to work with data stored in Snowflake by bringing the data to where the computation occurs. Snowflake announced this feature in June for calling regional endpoints via Amazon API Gateway. It is now also in public preview with support for Azure API Management and will soon support the Google API gateway and AWS private endpoints.
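To make the round trip concrete, here is a minimal sketch of the kind of remote service an external function could call. It assumes the batch format Snowflake documents for external functions, in which rows arrive as {"data": [[row_number, arg, ...], ...]} and results return as {"data": [[row_number, result], ...]}. The names handle_request and score_sentiment are hypothetical, and the word-counting "sentiment" logic is only a placeholder for real business logic.

```python
import json


def score_sentiment(text):
    """Hypothetical stand-in for business logic hosted outside Snowflake."""
    positive = {"good", "great", "love"}
    negative = {"bad", "poor", "hate"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)


def handle_request(body):
    """Process one batch of rows and return one result per row.

    Assumed request shape:  {"data": [[row_number, text], ...]}
    Assumed response shape: {"data": [[row_number, score], ...]}
    """
    rows = json.loads(body)["data"]
    # Echo each row number back so results can be matched to input rows.
    results = [[row_number, score_sentiment(text)] for row_number, text in rows]
    return json.dumps({"data": results})


request = json.dumps({"data": [[0, "great product love it"], [1, "bad support"]]})
response = handle_request(request)
print(response)
```

In a real deployment, a function like this would sit behind the API gateway, and the gateway would pass the request body to it.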
A second extensibility mechanism does the converse: It brings the computation to Snowflake to run closer to the data by enabling the creation of functions and procedures. Snowflake has long supported this in SQL, and at the summit it announced a new feature called Snowpark, which enables developers to build pipelines and create functions and procedures in Java, Scala, or Python.
Snowpark is a family of libraries that enable developers to write code directly against Snowflake in a way that is deeply integrated into common programming languages, using familiar concepts such as data frames. Snowpark is designed to leverage the Snowflake engine and optimize its performance, reliability, and scalability with near-zero maintenance.
“Think of the power of a declarative SQL statement available through a well-known API in Scala, Java, or Python,” said Kleinerman. “All these are applied against data governed in your core data platform. We believe Snowpark will be transformative for data programmability.”
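The idea of a declarative SQL statement exposed through a data frame API can be illustrated with a toy example. The sketch below is not the Snowpark API; LazyFrame and its methods are hypothetical names invented for illustration. It shows only the general pattern: method calls record the query, and nothing runs until the accumulated operations are rendered as a single SQL statement that the engine could then optimize and execute.

```python
class LazyFrame:
    """Toy data frame that records operations and renders them as one SQL statement.

    Hypothetical illustration only; this is not the Snowpark API.
    """

    def __init__(self, table, columns=("*",), predicates=()):
        self.table = table
        self.columns = tuple(columns)
        self.predicates = tuple(predicates)

    def select(self, *columns):
        # Return a new frame with the chosen columns; nothing is executed yet.
        return LazyFrame(self.table, columns, self.predicates)

    def filter(self, predicate):
        # Return a new frame with one more condition appended.
        return LazyFrame(self.table, self.columns, self.predicates + (predicate,))

    def to_sql(self):
        # Render every recorded operation as a single declarative statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql


calls = LazyFrame("call_transcripts")
negative = (
    calls.filter("sentiment < 0")
    .filter("region = 'EMEA'")
    .select("call_id", "sentiment")
)
print(negative.to_sql())
```

Because each call returns a new frame rather than modifying the old one, intermediate frames such as calls remain reusable as starting points for other queries.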
After organizations have their data in Snowflake, he explained, they can simplify the flow even further by using Snowflake’s Streams and Tasks capabilities to call an external function to transcribe those files. Snowflake also plans to introduce a serverless execution model for Tasks, he said, by which Snowflake can automatically size and manage resources for its customers. After implementing this model, organizations can use the same serverless task to execute sentiment scoring and surface the sentiment score, either via Snowsight or through any tool they use for sharing insights throughout their organization.
Snowpark will be available in private preview in a future release.
ADVANCED DATA GOVERNANCE
Snowflake is working on a set of product capabilities to simplify data collaboration while complying with privacy regulations. Earlier this year, Snowflake acquired a company called CryptoNumerics to accelerate its efforts on this front, including the identification and anonymization of sensitive data. Although that work is not yet ready to be announced, Snowflake revealed two important new data governance capabilities at Data Cloud Summit 2020: object tagging and row access policies.
Object Tagging
Snowflake’s new object tagging feature helps users better understand and organize their data by allowing them to attach user-defined metadata to a variety of objects, including tables, views, and columns. Think of the ability to annotate warehouses with cost-center information for tracking, or annotate tables and columns with sensitivity classifications that enable organizations to track sensitive data for compliance reasons. Flexible admin models allow for either centralized governance or decentralized tag assignment controlled by privileges. Object tagging should be available in private preview early next year.
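As a rough mental model, tagging amounts to attaching key-value metadata to named objects and then looking objects up by tag. The sketch below is a hypothetical in-memory model, not Snowflake's implementation or syntax; TagRegistry, the object names, and the tag names are all invented for illustration.

```python
from collections import defaultdict


class TagRegistry:
    """Hypothetical in-memory model of user-defined tags on named objects."""

    def __init__(self):
        self._tags = defaultdict(dict)  # object name -> {tag name: tag value}

    def set_tag(self, obj, tag, value):
        self._tags[obj][tag] = value

    def get_tags(self, obj):
        # Return a copy so callers cannot modify the registry by accident.
        return dict(self._tags.get(obj, {}))

    def find(self, tag, value):
        """Return the sorted names of all objects carrying tag=value."""
        return sorted(o for o, tags in self._tags.items() if tags.get(tag) == value)


registry = TagRegistry()
registry.set_tag("warehouse:ETL_WH", "cost_center", "finance")
registry.set_tag("table:CUSTOMERS", "sensitivity", "confidential")
registry.set_tag("column:CUSTOMERS.EMAIL", "sensitivity", "pii")
registry.set_tag("column:CUSTOMERS.PHONE", "sensitivity", "pii")

# Which objects would a compliance review of PII need to cover?
print(registry.find("sensitivity", "pii"))
```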
Row Access Policies
Another key aspect of data governance in Snowflake is a framework in which organizations specify data policies to be enforced by Snowflake. For example, Snowflake announced Dynamic Data Masking earlier this year, and it is now available in public preview. Dynamic Data Masking allows organizations to mask sensitive information, such as PII column data, at query time. Depending on the masking policy conditions, the SQL execution context, and role hierarchy, Snowflake query results will show the plain-text value, a partially masked value, or a fully masked value.
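The three possible outcomes (plain text, partially masked, fully masked) can be sketched as a function of the value and the querying role. This is a simplified illustration, not Snowflake's masking policy syntax; the role names PII_ADMIN, ANALYST, and PUBLIC and the masking rules are assumptions chosen for the example.

```python
def mask_email(value, role):
    """Return an email address as a given role would see it at query time.

    The role names and rules are hypothetical; a real policy is defined in Snowflake.
    """
    if role == "PII_ADMIN":
        return value  # plain-text value
    if role == "ANALYST":
        local, _, domain = value.partition("@")
        return "*" * len(local) + "@" + domain  # partially masked value
    return "*" * 8  # fully masked value


email = "jane@example.com"
for role in ("PII_ADMIN", "ANALYST", "PUBLIC"):
    print(role, mask_email(email, role))
```

The stored value never changes; only the result returned to each role differs.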
In addition, at Data Cloud Summit 2020, Snowflake announced new row-access policies that complement Dynamic Data Masking. The new row-access policies allow users to define rules that determine which rows each user can access in the Data Cloud. Similar to masking policies, row-access policies will be integrated seamlessly across all of Snowflake. Whether accessing data stored in external tables or semi-structured JSON data, building data pipelines via streams, or leveraging Snowflake’s data-sharing functionality, organizations will be able to implement complex row-access policies for diverse use cases and workloads within Snowflake, and instantly apply these policies consistently to all of their Snowflake accounts, sharing governance across regions and clouds. These new row-level security capabilities should be available in private preview early next year.
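Conceptually, a row-access policy is a predicate evaluated for every row against the current execution context, and only rows for which it returns true appear in results. The sketch below illustrates that idea with a hypothetical mapping of roles to regions. ROLE_REGIONS, row_is_visible, and the role names are assumptions made for the example, not Snowflake syntax.

```python
# Hypothetical mapping table: which regions each role may see.
ROLE_REGIONS = {
    "SALES_EMEA": {"EMEA"},
    "SALES_GLOBAL": {"EMEA", "AMER", "APAC"},
}


def row_is_visible(row, role):
    """Policy predicate: True when the role is entitled to the row's region."""
    return row["region"] in ROLE_REGIONS.get(role, set())


def query(rows, role):
    """Apply the policy to every row before returning results."""
    return [row for row in rows if row_is_visible(row, role)]


orders = [
    {"order_id": 1, "region": "EMEA"},
    {"order_id": 2, "region": "AMER"},
    {"order_id": 3, "region": "APAC"},
]

print([r["order_id"] for r in query(orders, "SALES_EMEA")])
print([r["order_id"] for r in query(orders, "SALES_GLOBAL")])
print([r["order_id"] for r in query(orders, "INTERN")])  # unknown role sees nothing
```

Keeping the entitlements in a mapping table, rather than hard-coding them in the predicate, means access can change without rewriting the policy itself.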
PLATFORM PERFORMANCE AND CAPABILITIES
Search Optimization Service
Snowflake announced a search optimization service earlier this year. This service can dramatically accelerate lookup queries on any column, particularly those not used as clustering columns. Currently in public preview, the search optimization service can be enabled on a table-by-table basis. Initially, the service supported equality comparisons only, but at November’s Data Cloud Summit 2020, Snowflake announced expanded support for searches, including pattern matching within strings. This expanded support is currently being validated by a few customers in private preview before being made broadly available.
Query Acceleration Service
Snowflake also announced a new query acceleration service at Data Cloud Summit 2020 that automatically identifies and scales out parts of a query that could benefit from additional resources and parallelization. This service, planned for private preview in a future release, enables organizations to realize dramatic improvements in performance—improvements that will be especially effective for data science and other scan-intensive workloads. And, importantly, it will be easy to use. Organizations simply define a maximum amount of additional resources that a warehouse can use for acceleration, and the service decides when it would be beneficial to use those resources. Given enough resources, a query over a massive data set can see significant performance improvement. When Snowflake used the service, it saw a common query execute 15 times faster, without changing the warehouse size.
SUPPORT FOR UNSTRUCTURED DATA
Snowflake took advantage of Data Cloud Summit 2020 to announce that it is adding support for unstructured data, now available in private preview, enabling customers to store all their data, in all its forms, on the same Snowflake platform. With as much as 90 percent of data defined as unstructured, including images, text files, social media content, audio files, and call center transcripts, think of the new insights your organization might gain and share by leveraging the power of SQL to analyze those unstructured data sets.
IN CASE YOU MISSED IT
Snowsight, Snowflake’s next-generation web user interface designed to support data analyst activities, already features many ease-of-use improvements for analysts, data engineers, and business users. In September, Snowflake introduced two additional Snowsight features in preview mode: the Current Role and Warehouse drop-down menu and the Current Database drop-down menu.
Previously, Snowsight users selected their current role, warehouse, and database from a single drop-down menu in the query editor. With the September release, users select their session context from two separate drop-down menus: Current Role and Warehouse, and Current Database. The Current Role and Warehouse menu is available from the upper-right corner of Snowsight, making it more visible for users. This menu also includes a new Suspend/Resume and Resize Warehouses menu that displays details about status, size, scaling possibilities, and more for the selected warehouse. Users can resume or suspend the warehouse or change its size from within the new menu.