Secure Joins: How to Join Data Between Companies While Respecting PII

Author: Justin Langseth | Contributing Authors: Daniel Freundel

How to Use Snowflake, Snowflake Technology

In our previous blog post, “The Power of Secure UDFs for Protecting Shared Data,” we introduced Snowflake’s Secure Shared User-Defined Functions (Secure UDFs) feature and demonstrated how a data consumer can run a secure, shared function to ask specific questions about a provider’s data. In this post, we’ll consider various ways to join data from different parties, while preventing the sharing of PII data (personally identifiable information).

Consider this question, which companies often ask each other: “How many customers do we have in common?” This question comes up frequently between buyers and sellers of advertising, between healthcare payers and providers, and even between members of the Data Sharing team at Snowflake as we try to match data supply and demand to spark the emerging Information Economy. It is a challenging question to answer without one party exposing its entire customer list to the other party.

Semi-Secure Joins Risk PII Probing

Without a secure join, one company, which we will call the Provider, could share a secure UDF that allows it to check whether a specific customer exists in its database and then return a 1 if there is a match or a 0 if there is not.

-- PROVIDER ACCOUNT

-- setup db, schema, share

use role accountadmin;
create database if not exists SJ_DEMO;
create schema if not exists SJ_DEMO.PUBLIC;
create share if not exists SJ_DEMO_SHARE;
grant usage on database SJ_DEMO to share SJ_DEMO_SHARE;
grant usage on schema SJ_DEMO.PUBLIC to share SJ_DEMO_SHARE;
alter share SJ_DEMO_SHARE add accounts=JTEST3;
use database SJ_DEMO;
use schema PUBLIC;

--create provider customer file as a 20% sample of the TPCDS dataset's customer table


CREATE DATABASE if not exists "SNOWFLAKE_SAMPLE_DATA" FROM SHARE 
SFC_SAMPLES."SAMPLE_DATA";
GRANT IMPORTED PRIVILEGES ON DATABASE "SNOWFLAKE_SAMPLE_DATA" TO ROLE "PUBLIC";
create or replace table provider_customers as
(select * from snowflake_sample_data.TPCDS_SF10TCL.customer tablesample 
bernoulli (20.1));

create or replace secure function secure_direct_lookup (input_email_address 
string)
 returns numeric(10)
 as 'select count(*) as count_custs from
  provider_customers c
  where c.c_email_address = input_email_address';

grant usage on function secure_direct_lookup(string) to share SJ_DEMO_SHARE;

Then the other company, which we will call the Consumer, could connect to the provider’s share, create its own customer table as a random sample from the TPC-DS benchmark data set, and count common customers using the provider’s shared secure_direct_lookup function.

-- CONSUMER ACCOUNT

--setup

use role accountadmin;
create or replace database SJ_TEST from share jtest2.SJ_DEMO_SHARE;
grant imported privileges on database SJ_TEST to role PUBLIC;
create or replace database SJ_TEST_JOIN;
use database SJ_TEST_JOIN;
use schema PUBLIC;

--create consumer database as 4.5% sample of TPCDS customer table

CREATE DATABASE if not exists "SNOWFLAKE_SAMPLE_DATA" FROM SHARE 
SFC_SAMPLES."SAMPLE_DATA";
GRANT IMPORTED PRIVILEGES ON DATABASE "SNOWFLAKE_SAMPLE_DATA" TO ROLE 
"PUBLIC"; 

create or replace table consumer_customers as
(select * from snowflake_sample_data.TPCDS_SF10TCL.customer tablesample 
bernoulli (4.5));
-- calculate the number of overlapping customers

select sum(sj_test.public.secure_direct_lookup(c_email_address)) 
overlapping_customers, count(*) total_customers, ((overlapping_customers / 
total_customers ) * 100 ) overlap_pct;
from consumer_customers;


Because the secure function can securely access the provider’s data without exposing it to the consumer, and the provider can’t see who the consumer looks up, this is relatively secure. The main risk is that the consumer cannnot know with certainty if any person the consumer knows about is a customer of the provider. A better approach would be to provide the overlapping customer count without providing the consumer with the ability to gain visibility into the provider’s individual customers.  

Introducing Secure Joins

Here is how to implement the better approach using a secure join. First, the provider defines a secure UDF called get_secure_key and shares it with the consumer, in addition to sharing a SQL secure UDF to perform the count of overlapping customers.  

create or replace secure function
get_secure_join_key(customer_key_string string)
returns string as
$$
sha2(customer_key_string || sha2(current_account() || 'provider secret 
salt'))
$$;

create or replace secure function secure_lookup(secure_join_key string, consumer_salt string)
returns numeric(10)
as 
$$
select count(*) as count_custs 
  from provider_customers c
  where secure_join_key = sha2(nvl(C_EMAIL_ADDRESS,to_char(random())) || 
consumer_salt || sha2(current_account() || 'provider secret salt'))
$$;
grant usage on function get_secure_join_key(string) to share SJ_DEMO_SHARE;
grant usage on function secure_lookup(string, string) to share 
SJ_DEMO_SHARE;

Doing that allows the provider to define a secure UDF that uses a native Snowflake SHA-256 one-way hash to convert the consumer’s customer PII into a meaningless string. Because this function runs inside the consumer’s Snowflake compute nodes, the provider never knows that it was run or what data was passed into it. In this example, the provider also includes the consumer’s Snowflake account ID as part of the hash, using the CURRENT_ACCOUNT() function. This way, the hashes cannot be used to measure data overlap between any databases except this provider’s database and this consumer’s database. The consumer can also add its own “salt,” which makes the data in the reverse share (discussed below) meaningless to the provider, because the provider doesn’t know the consumer’s salt value.

The consumer can then use this hash approach for one-row-at-a-time or bulk matching. Although doing that doesn’t add any additional security benefit, it might appear more secure to people who do not understand that the provider has no visibility into how its secure UDFs are called by consumers, nor can the provider see what data a consumer runs through the provider. Later in this post, we will introduce the concept of a reverse share. With a reverse share, the provider can see the values that the consumer wants to match, in which case the hashing becomes critical.

The SQL script below shows how the consumer can set up the share from the provider and perform a count of the consumer’s customers that overlap with those of the provider.

-- CONSUMER ACCOUNT

-- create customer key string function

create or replace function
get_customer_key_string(c_email_address string, consumer_salt string)
returns string as
$$
nvl(C_EMAIL_ADDRESS,to_char(random()))|| consumer_salt
$$;

--make table of secure join keys (this is needed in some cases UDF’s can not 
call other UDFs within them).

create or replace temporary table joinkeys as 
select 
sj_test.public.get_secure_join_key(get_customer_key_string(C_EMAIL_ADDRESS,'
consumersalt')) secure_join_key
from consumer_customers c;

--run secure join overlap count

select sum(sj_test.public.secure_lookup(secure_join_key,'consumersalt')) 
overlapping_customers, 
count(*) total_customers, ((overlapping_customers / total_customers ) * 100 
) overlap_pct
from joinkeys;


--your numbers may vary if you run this due to the random selection function used to create the consumer and provider customer tables 

The Reverse Share

The above approach works well, but still allows the Consumer to know with certainty whether any customer known to the Consumer is or is not also present in the Provider’s database.  In some cases, the provider may not want to allow such specific certainty on an individual customer level, but instead to provide only overlapping customer counts on a collection of potential customers.

To help achieve this, the consumer can create a secure view of its customer data after running the data values through the hash function, and share only the salted hash values with the provider by using a reverse share.

In the following reverse share example, note that the consumer has added some additional hash values that don’t represent anyone at all, which helps prevent the provider from being able to determine anything about the size of the consumer’s customer base by inspecting the contents of the reverse share. This technique is called noise injection and doesn’t harm the analytics.

-- CONSUMER ACCOUNT 

--CREATE REVERSE SHARE
CREATE SHARE SJ_DEMO_REVERSE_SHARE;
GRANT USAGE ON DATABASE SJ_TEST_JOIN TO SHARE SJ_DEMO_REVERSE_SHARE;
GRANT USAGE ON SCHEMA SJ_TEST_JOIN.PUBLIC TO SHARE SJ_DEMO_REVERSE_SHARE;
ALTER SHARE SJ_DEMO_REVERSE_SHARE ADD ACCOUNTS = demo28;   

--CREATE NOISE INJECTION TO OBFUSCATE DATA SET SIZE
CREATE OR REPLACE temporary table noiseinjection as
SELECT 
sj_test.public.get_secure_join_key(get_customer_key_string(randstr
(20, random()),'consumersalt')) as secure_join_key FROM 
table(generator(rowCount => 10000));

 
CREATE OR REPLACE TABLE  consumer_customers_hashed as select 
sj_test.public.get_secure_join_key(get_customer_key_string(C_EMAIL
_ADDRESS,'consumersalt')) secure_join_key
from consumer_customers c
UNION
SELECT * FROM noiseinjection; 

GRANT SELECT ON TABLE CONSUMER_CUSTOMERS_HASHED TO SHARE 
SJ_DEMO_REVERSE_SHARE;

Now that the consumer has shared the hashed customer IDs with the provider, the consumer can call another SQL secure UDF that performs the secure join, providing its salt value, which returns the distinct count of joint customers.

--PROVIDER ACCOUNT

CREATE DATABASE SJ_DEMO_REVERSE_SHARE FROM SHARE 
DEMO28_TF.SJ_DEMO_REVERSE_SHARE;

USE DATABASE SJ_DEMO;
CREATE OR REPLACE TABLE consumer_customers_hashed as SELECT * FROM 
SJ_DEMO_REVERSE_SHARE.PUBLIC.consumer_customers_hashed;
create or replace secure function 
secure_lookup_shared(consumer_salt string)
returns numeric(10)
as
$$
with joinkeys as (select secure_join_key from 
consumer_customers_hashed)
select count(*) as count_custs
  from provider_customers c
  join joinkeys
  on joinkeys.secure_join_key = 
sha2(nvl(C_EMAIL_ADDRESS,to_char(random())) || consumer_salt || 
sha2(current_account() || 'provider secret salt'))
$$;;

GRANT usage on function secure_lookup_shared (string) to share 
SJ_DEMO_SHARE;

--CONSUMER ACCOUNT

select sj_test.public.secure_lookup_shared('consumersalt') 
overlapping_customers,
count(*) total_customers, ((overlapping_customers / 
total_customers ) * 100 ) overlap_pct
from consumer_customers_hashed;

Because the secure UDF runs on the consumer’s compute resources, the provider does not know the query was run, nor does the provider see the results. Likewise, the consumer is able to find out only how many overlapping customers it and the provider have; the consumer cannot determine anything else about the provider’s non-overlapping customer base, including its size.

Fuzzy Secure Joins

A potential risk the provider faces is the consumer intentionally probing the provider’s data set with successive requests and altering only one row of each request, in order to determine if a specific customer is present in the provider’s data set. If this is a concern, the provider can add additional logic to its counting secure UDF, as follows:

--PROVIDER ACCOUNT 

create or replace secure function 
secure_lookup_shared_fuzzy(consumer_salt string)
returns numeric(10)
 as
 $$
 with joinkeys as (select distinct(secure_join_key) from 
consumer_customers_hashed)
 select CASE WHEN (round((count(*) + random() % 10), -1)) > 10 then (round((count(*) + random()% 10), -1))
            ELSE 0
       END as count_custs
     from provider_customers c
     join joinkeys
     on joinkeys.secure_join_key = 
sha2(nvl(C_EMAIL_ADDRESS,to_char(random())) || consumer_salt || 
sha2(current_account() || 'provider secret salt'))
$$;

grant usage on function secure_lookup_shared_fuzzy(string) to 
share SJ_DEMO_SHARE;

--CONSUMER ACCOUNT 
select sj_test.public.secure_lookup_shared_fuzzy('consumersalt') 
overlapping_customers,
count(*) total_customers, ((overlapping_customers / 
total_customers ) * 100 ) overlap_pct
from consumer_customers_hashed;

The provider’s function now does five things to protect against raw-row probing:

  1. Performs a distinct row reduction on the reverse share input set and enforces a minimum number of distinct rows on the reverse share inputs
  2. Uses a random fuzz-factor that jitters the returned result by +/- 10
  3. Rounds off the distinct overlapping result to the nearest 10
  4. Enforces a minimum number of overlap “hits” for the function to return a non-zero result

These actions inject a fair amount of provider-side noise into the result, making row-probing much more difficult, although not eliminating all risk. And although the resulting numbers are fuzzed and rounded a bit, they still provide the consumer a good estimation of the relative magnitude of the overlap of the customer bases.

Real-World Secure Joins

In this example we use an exact match on the email address.  In the real world, however, joining is not usually as easy since people frequently have multiple email addresses or physical addresses. Additionally, company names often have different naming conventions and multiple companies might have similar names. The good news is that companies like LiveRamp and Full Contact provide services that combine with Secure Data Sharing to provide for broader and fuzzier matching, as well as enrichment of data about joined entities.

This Is Just the Beginning

To help convey the basic concepts and spark your imagination, we crafted these example secure joins to be simple yet powerful. We’d love to hear what you come up with, and we encourage you to continue the conversation with us in the Snowflake Lodge. We hope these tools will turbo-charge your data sharing and monetization aspirations as we all “think big” to build the Information Economy.

Find the complete code set shown above on GitHub.

Additional Links