Unlock the Value of Your Sensitive Data with Differential Privacy, Now Generally Available
The Snowflake AI Data Cloud has democratized data for thousands of customers, removing data silos and powering data sharing and collaboration use cases. Many customers have been able to unlock enormous value from their data with Snowflake, including safely collaborating on sensitive data using Snowflake Data Clean Rooms and Data Governance features. However, some highly sensitive data has remained off-limits due to regulatory requirements and privacy concerns — until now.
To address these challenges and truly democratize even highly sensitive data, we are excited to announce the general availability of differential privacy policies in Snowflake. These policies are built on differential privacy, a technique widely regarded in the academic literature as the gold standard for private data analytics. Differential privacy brings mathematical rigor to privacy protection, enabling customers to leverage previously inaccessible data. With that data unlocked, Snowflake customers can power collaboration use cases like cross-organization and cross-geography data sharing, and even create new revenue streams through data monetization.
What is differential privacy?
Differential privacy is a privacy-enhancing technology that helps minimize the risk of sensitive information leakage by protecting the identity of individual entities in a dataset, such as people, organizations or locations. It has been used in many high-profile applications involving sensitive data, including the 2020 U.S. Census data release and Apple’s user data collection, and it is highlighted in the 2023 AI Executive Order.
With differential privacy, data consumers can run analytical queries on the full dataset, but they cannot see the row-level data nor can they reverse engineer sensitive information. It complements data privacy methods that protect data at rest, in motion and in use.
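For readers who want the formal guarantee behind these claims, the standard textbook definition is that a randomized mechanism M is ε-differentially private if, for any two datasets D and D′ that differ in one individual’s records, and any set of possible outputs S:

\[ \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] \]

In plain terms, every query answer is almost equally likely whether or not any single individual’s data is included, which is what makes reverse engineering infeasible.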
To do this, Snowflake differential privacy policies dynamically add noise to query results. The amount of noise added depends on how sensitive the query is, as determined by the mathematical techniques of differential privacy. For example, if the query is calculating broad aggregates, the amount of noise will be relatively small, potentially negligible. If the query concerns a small group or even one individual, the noise will be large enough to obscure their identities and protect your most sensitive data against privacy attacks.
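To make the noise-calibration idea concrete, here is a minimal sketch of one classic textbook approach, the Laplace mechanism, in which the noise scale grows with query sensitivity and shrinks with the privacy budget (epsilon). This is an illustration of the general technique, not a description of Snowflake’s internal implementation, and the numbers are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private estimate of true_value.

    sensitivity: the most any single individual can change the query result.
    epsilon: the privacy budget; smaller epsilon means stronger privacy and more noise.
    """
    scale = sensitivity / epsilon  # noise scale grows with sensitivity, shrinks with budget
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# A broad aggregate: a count over millions of rows.
# One person changes the count by at most 1, so the relative noise is negligible.
noisy_count = laplace_mechanism(true_value=2_500_000, sensitivity=1, epsilon=1.0)

# A query about a tiny group: the same absolute noise now dominates the answer,
# obscuring any individual's contribution.
noisy_small_count = laplace_mechanism(true_value=3, sensitivity=1, epsilon=1.0)
```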
Typically, implementing a differentially private system requires significant investment and expertise: open source differential privacy libraries often do not provide end-to-end differential privacy, and they may lack the features that make differential privacy useful for real-world use cases. Snowflake differential privacy policies, however, are ready to use out of the box, without these downsides.
Differential privacy unlocks rather than devalues data
Differential privacy is a significant improvement over existing approaches, which protect privacy largely by devaluing the data itself. To illustrate this, let’s examine a healthcare use case built on a dataset of patient visits to healthcare providers. In this example, the data provider needs to protect patient identities to comply with privacy regulations.
For the basic version of this use case, let’s assume we have a dataset where each row represents one visit between a patient and a healthcare provider. Without differential privacy, the data provider would typically mask the fields that could be used to identify the patient, like the date of the visit. Fields like these would be masked to a coarser level of granularity, such as removing the month and day and leaving just the year. While this approach may appear to make sense from a privacy point of view, it drastically reduces the value of the data. For example, data consumers can no longer ask questions like, “On average, how long is drug X taken for condition Y?”
With differential privacy, the data provider does not need to mask or remove any fields, enabling data consumers to ask these kinds of detailed questions and receive useful answers.
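As a hypothetical illustration of why unmasked dates matter, the sketch below computes the average number of days between a patient’s first and last visit for a given drug and condition, then adds noise to the aggregate before releasing it. The table, column names and noise scale are invented for this example, and the final noise step merely stands in for the protection a differential privacy policy would apply automatically.

```python
import numpy as np
import pandas as pd

# Hypothetical visit-level data: one row per patient visit.
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "drug":       ["X", "X", "X", "X", "X"],
    "condition":  ["Y", "Y", "Y", "Y", "Y"],
    "visit_date": pd.to_datetime(
        ["2024-01-05", "2024-03-10", "2024-02-01", "2024-02-20", "2024-04-02"]
    ),
})

# "On average, how long is drug X taken for condition Y?"
subset = visits[(visits["drug"] == "X") & (visits["condition"] == "Y")]
per_patient_days = (
    subset.groupby("patient_id")["visit_date"].agg(lambda d: (d.max() - d.min()).days)
)
true_average = per_patient_days.mean()

# Release only a noisy aggregate; if visit_date were masked to just the year,
# this question could not be answered at all.
noisy_average = true_average + np.random.laplace(scale=5.0)  # illustrative noise scale
```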
The difference between existing approaches and differential privacy becomes even starker when we look at a more realistic version of this use case. In datasets like patient-doctor visits, the condition and the drug prescribed are often not cleanly available as columns. Instead, the data has an unstructured text field containing the doctor’s notes from the visit.
It is difficult to safely redact sensitive data from fields like this, and in many cases the redaction devalues the data. For example, the notes field often contains identifying information such as the patient’s name, because providers frequently copy and paste fields from their intake system into the notes.
In many cases, approaches to redacting this information also redact personal health information like the condition diagnosed or the drug prescribed, which, while sensitive on a per-visit or per-patient basis, also carries the data’s main analytical value. Because of this devaluation, these types of use cases are often not possible with today’s commonly used privacy techniques.
Differential privacy unlocks these use cases and enables data consumers to query against unstructured text fields in aggregate. Analysts and researchers can now ask and get answers to questions like, “Based on the doctor’s notes for drug Y, what percentage of patients are experiencing side effects?”
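In the same hypothetical spirit, an aggregate over the unstructured notes field might look like the sketch below: the analyst never reads individual notes, only a noisy percentage is released, and the keyword list, column names and noise scale are all assumptions made for illustration rather than part of any Snowflake API.

```python
import numpy as np
import pandas as pd

# Hypothetical visit records with free-text doctor's notes.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "drug":       ["Y", "Y", "Y", "Y"],
    "notes": [
        "Patient tolerating drug Y well, no complaints.",
        "Reports nausea and headache since starting drug Y.",
        "Follow-up visit; mild dizziness noted.",
        "No side effects reported.",
    ],
})

# "Based on the doctor's notes for drug Y, what percentage of patients
#  are experiencing side effects?"
side_effect_terms = ["nausea", "headache", "dizziness", "rash"]
pattern = "|".join(side_effect_terms)

on_drug = visits[visits["drug"] == "Y"]
affected = on_drug.groupby("patient_id")["notes"].apply(
    lambda notes: notes.str.contains(pattern, case=False).any()
)

true_pct = 100 * affected.mean()
noisy_pct = true_pct + np.random.laplace(scale=2.0)  # illustrative noise scale
```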
Use cases for differential privacy span all industries
While the previous example involves a use case in healthcare, we have seen the same story play out again and again across industries. Here is a small sampling of use cases:
Advertising, media and entertainment: Protect individuals in event-level data on ad outcomes for ad targeting, optimization and measurement
Financial services: Consumer banking data monetization; capital markets data provisioning including Central Risk Book; prime brokerage; asset management
Healthcare and life sciences: Unlock datasets like electronic medical records, social determinants of health, genomic data, medical and pharmacy claims, and clinical trials for applications in research, data monetization and drug development
Manufacturing: Product telemetry on equipment for predictive maintenance
Public sector: Cross-agency/cross-office data sharing; publishing statistics outside of classification zones or to the public; broaden access to PII and PHI for research
Retail and consumer goods: Cross-organization Customer 360, cross-outlet consumer behavior data monetization, and supply chain benchmarking
Technology: External data sharing for research and policy makers; product telemetry analytics
Telecom: User device movement data monetization
While some of these use cases are possible today, they are not reaching their full potential because information is redacted from data in an attempt to protect privacy. With differential privacy, these use cases are fully unlocked, and the data remains protected from privacy leaks and targeted privacy attacks.
Realize the full value of your sensitive data
Differential privacy policies are now available to all Snowflake customers with Enterprise edition accounts or higher. To get started and learn more about differential privacy in Snowflake, check out this demo video and read the Snowflake documentation. And don’t miss our on-demand webinar on privacy-enhancing technologies that can help you balance the trade-offs between data privacy and utility.