Yesterday’s physicians practiced medicine in a silo, gathering information and filing handwritten notes into patient charts. Although this basic form of a “single source of truth” served the patient and the doctor, it stifled any possibility of collaboration with other healthcare practitioners. Yet collaboration among practitioners is a highly effective model of care.

Combining the expertise of multiple minds seems like common sense, right? Unfortunately, bringing massive data sets together into a single source of truth is beyond the current capabilities of many organizations. Yesterday’s on-premises data warehouses simply cannot scale beyond a few data sets.

Cloud computing as a data access equalizer 

The arrival of cloud computing brings elasticity and concurrency at scale, and it opens the door to a new level of collaboration. Some of these capabilities have already shown great promise. For example, organizations such as IQVIA and PRA combine electronic health records (EHRs) and medical claims data to uncover significant insights, such as the effectiveness of new drugs as they come to market.

But what about large genomic data files, which are stored in formats such as VCF, FASTA, BAM, and SAM? Is it possible, and performant, to bring this type of data into a relational database? With Snowflake, the cloud-built data warehouse, it is. And if you’d rather not do all of your querying and modeling in SQL, that’s not a problem. Snowflake lets you run a simple SQL query against only the data you need, then push the result set directly into your Jupyter notebook. Likewise, you can run SQL through Spark and push the data directly into your data science platform. From there, you can use tools such as R and Python on subsets of the data to create predictive models.
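
To make that concrete, here is a minimal sketch of the “query only what you need” step. It assumes the VCF records have already been loaded into a hypothetical VARIANTS table keyed by patient, with a hypothetical COHORT_PATIENTS table defining the study population; both names are illustrative, not part of any standard schema.

-- Pull only the variants needed for this analysis: records on
-- chromosome 17, above a quality threshold, for one patient cohort.
SELECT v.patient_id,
       v.chrom,
       v.pos,
       v.ref,
       v.alt,
       v.qual
FROM   variants v
JOIN   cohort_patients c
  ON   c.patient_id = v.patient_id
WHERE  v.chrom = '17'
  AND  v.qual >= 30;

The result set of a query like this, rather than the full VCF corpus, is what lands in the notebook or Spark session for modeling.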

What’s the value in that, though? Couldn’t you just take the genomic data straight from a collection of VCF files? You could, but once you build a predictive model and obtain results, how do you get those results back to the point of care? And for each subsequent query, you have to go back to the VCF files and build your next data set from scratch. Don’t forget, too, that having genomic assets buried in a stack of VCF files prevents the on-demand, interactive data exploration and analysis that is so critical for research and assessment applications.

Secure data sharing

This is where the value of a modern, cloud-built data warehouse comes in. By keeping genomic data about patients or populations live and queryable, and by pushing analytical results back into the data warehouse, you democratize valuable data, making it available to business analysts using data visualization tools as well as to other data science teams. You no longer have to export the data to Excel files and share them over email.
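
As a sketch of what that write-back might look like, the following assumes a hypothetical RISK_SCORES table for publishing model output; the table, columns, and values are illustrative only.

-- Hypothetical landing table for model output, queryable alongside
-- the source data by analysts and other data science teams.
CREATE TABLE IF NOT EXISTS risk_scores (
    patient_id  VARCHAR,
    model_name  VARCHAR,
    risk_score  FLOAT,
    scored_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Scores computed in R or Python are inserted (or bulk loaded)
-- instead of being emailed around as Excel attachments.
INSERT INTO risk_scores (patient_id, model_name, risk_score)
VALUES ('P-1001', 'variant_risk_model_v1', 0.87);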

In addition, instead of performing a simple query against ingested VCF files, imagine a query that joins data from your EHRs, your medical claims, your social determinants of health, and your genomic data, then pulls the result into your data science environment for analysis. You could also share the results with a partner organization without resorting to Excel exports. Snowflake allows you to securely share data without copying it or moving it over SFTP. You can read more about Snowflake and HIPAA on the Snowflake website.
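
Here is a hedged sketch of both steps. The table names (EHR_ENCOUNTERS, CLAIMS, SDOH, VARIANTS), the view, the share, and the partner account identifier are all hypothetical placeholders; the sharing commands follow Snowflake’s standard Secure Data Sharing DDL.

-- Join clinical, claims, social, and genomic data in one secure view.
CREATE SECURE VIEW analytics.public.cohort_360 AS
SELECT e.patient_id,
       e.diagnosis_code,
       c.claim_amount,
       s.housing_status,
       v.chrom, v.pos, v.ref, v.alt
FROM   ehr_encounters e
JOIN   claims   c ON c.patient_id = e.patient_id
JOIN   sdoh     s ON s.patient_id = e.patient_id
JOIN   variants v ON v.patient_id = e.patient_id;

-- Share the view with a partner account: no copies, no SFTP.
CREATE SHARE cohort_share;
GRANT USAGE ON DATABASE analytics TO SHARE cohort_share;
GRANT USAGE ON SCHEMA analytics.public TO SHARE cohort_share;
GRANT SELECT ON VIEW analytics.public.cohort_360 TO SHARE cohort_share;
ALTER SHARE cohort_share ADD ACCOUNTS = partner_account;

The partner organization queries the shared view live, against your single copy of the data, rather than receiving a stale extract.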

For genomic data, how do modern data warehouses handle all of the annotation data that accompanies the phenotype and genotype data? Snowflake allows you to store both structured and semi-structured data in the same database while taking advantage of modern, compressed columnar storage. The ability to co-locate structured phenotype data and semi-structured annotation data (formatted as JSON, for example) opens up many possibilities.
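
A minimal sketch of that co-location, using a hypothetical PATIENT_GENOMICS table: structured phenotype columns sit next to a VARIANT column holding the JSON annotations, and Snowflake’s path syntax and FLATTEN function query the JSON directly. The column names and JSON shape are assumptions for illustration.

-- Structured phenotype columns and a semi-structured JSON payload
-- side by side in one table (hypothetical schema).
CREATE TABLE patient_genomics (
    patient_id  VARCHAR,
    phenotype   VARCHAR,
    annotations VARIANT  -- JSON annotation document per patient
);

-- Reach into the JSON with path syntax, casting fields as needed...
SELECT patient_id,
       phenotype,
       annotations:clinvar:significance::STRING AS clinical_significance
FROM   patient_genomics;

-- ...or unnest an array of per-transcript annotations with FLATTEN.
SELECT p.patient_id,
       f.value:gene::STRING   AS gene,
       f.value:impact::STRING AS impact
FROM   patient_genomics p,
       LATERAL FLATTEN(input => p.annotations:transcripts) f;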

Having all of those data sources together in one place, curated through collaboration among practitioners, holds the promise of taking Big Data to the next level. The question is, will you be ready?