BUILD: The Dev Conference for AI & Apps (Nov. 12-14)

Hear the latest product announcements and push the limits of what can be done in the AI Data Cloud.

Industry Solutions

Leveraging Snowflake to Enable Genomic Analytics at Scale

Leveraging Snowflake to Enable Genomic Analytics at Scale

Genomic data, which is the DNA data of organisms, is essential to life sciences companies. For population studies, anonymized data sets can link long-term health histories with treatment patterns and genomic variations, making it possible to analyze effective approaches for subpopulations. In clinical trials and drug discovery, pharmaceutical research that combines patient health data, drug effectiveness, and genomic variations can improve outcomes and speed time to market. For individual care, a doctor who knows a patient’s genetic profile can make more informed decisions, advancing personalized medicine.

But life sciences companies face several challenges with querying and analyzing genomic data at scale. Dedicated bioinformatics toolkits often complicate the ability to link genomic data to patient records or annotation data. Population-wide studies often require curating hundreds of thousands of genomes, each one represented by about 500 megabytes of compressed whole-genome sequences. And analytic workflows involve periods of intense computation followed by relatively low utilization. Life sciences organizations are continually sharing data—with collaborators, clinical partners, and pharmaceutical industry data services. But legacy systems and data silos prevent easy and secure data sharing.

Snowflake can help life sciences companies query and analyze data easily, efficiently, and securely. With Snowflake as a powerful analytic relational database, there’s no limit to joining genomes, patient history, variant research, and annotation—even the characteristics of the sequencing process. Snowflake’s ability to scale compute resources easily and dynamically without limits, but only when needed, combines performance with cost-effectiveness. Snowflake’s Secure Data Sharing allows authorized organizations to exchange directly queryable data without copying it, and with the level of governance and masking expected by the industry. Snowflake also allows users to easily subset data for iterative analysis and bring compute power directly to the data.

Cloud computing environments offer a rich collection of solutions for deploying sequencing pipelines—software that takes raw output from next generation sequencing platforms and performs alignment, variant calling, and related data-curation tasks. Examples include the Illumina DRAGEN pipeline on AWS and Azure, and Broad Institute’s GATK on AWS, Azure and GCP. Alignment of sequence data with a reference genome and variant-calling algorithms are key elements of primary and secondary genomic data analysis. The next step—tertiary analysis—involves analyzing large and dynamic collections of this preprocessed data, frequently packaged and distributed as compressed VCF files. Snowflake is positioned to accelerate and scale this type of tertiary analysis in the cloud environment of your choice.  

Here we walk through an end-to-end example of ingesting and querying a real-world genomic data set in Snowflake. You can run this example in your own Snowflake account (or a trial account) in less than 15 minutes.

Getting ready to work with public genomic data sets in Snowflake

As an example of using Snowflake in genomic applications, we can work with publicly available variant and annotation data already hosted by AWS and available in the Open Data Registry. We will use the DRAGEN output from the publicly available 1000-Genomes project sequence collection—curated by Illumina—for 3,202 individuals. And we complement this with annotation data from ClinVar, sourced from NCBI and curated by AWS

In our Snowflake environment, we will work with an Extra Small (XS) warehouse (cluster) to process a sample subset of sequences, but illustrate how to easily scale up to handle the entire collection of genomes in the 1000-Genome data set.

To work with the VCF data, we first need to define an ingestion and parsing function in Snowflake to apply to the raw data files. The function uses Java streaming methods to handle the rows and specialized column formatting defined by the VCF specification—converting the zipped VCF files into an easy-to-query structured and semi-structured data representation inside Snowflake.   

To create the VCF Ingestion function, please see the appendix below and copy and execute the 3 CREATE OR REPLACE FUNCTION statements provided there. You can run these directly by pasting the code into the Snowflake UI, within the database you plan to use.

Ingesting the VCF files

Now let’s ingest the VCF data representing DRAGEN pipeline variant calls from the 3,202 member population of the 1000-genomes project.

First, we define the s3 data location for the Open Data Repository collection as a Snowflake Stage—a location defining source data files. We point to the collection of DRAGEN sequencing files based on the GRCh38 Reference Genome:

-- Data location from 1000-genomes files
-- Create stage including Directory Table
create or replace stage dragen_all
   directory = (enable = true)
   url = 's3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based'
   file_format = (type = CSV compression = AUTO)
;

Defining this stage using the directory directive above lets us query the names of available files that meet some criteria. For example, to work with a random set of 8 genomes represented as “hard filtered” variants, we can get a file listing with a query like this:

-- Sample query from Directory Table showing a pattern selection of 8 Dragan files
SELECT * FROM DIRECTORY( @dragen_all )
where relative_path rlike '/HG0011[0-7].*.hard-filtered.vcf.gz'
;

Each of these VCF files hold approx 5M rows. To take a quick view of the contents of the VCF file, we can retrieve the first 200 data rows using a query like this, referencing the stage and path, and calling the Java function we defined earlier to retrieve and parse the VCF:

-- Quick Peek (200 rows) at a single VCF file:
select * from table(ingest_vcf(BUILD_SCOPED_FILE_URL(
    @dragen_all, '/HG00116/HG00116.hard-filtered.vcf.gz'), 200));

You will see a structured result containing the well-defined columns Chrom, Pos, Ref, etc, including the specific SampleID. A variable set of characteristics of the genome call at that location are in the semi-structured fields INFO and SAMPLEVALS. Desired values from these semi-structured fields can be easily selected and manipulated as illustrated in the ingestion query below.

First we create a table to store the genomic variant data, and represent variants using 1-row per variant per sample.   

create or replace table GENOTYPES_BY_SAMPLE (
   CHROM       varchar,
   POS         integer,
   ID          varchar,
   REF         varchar,
   ALT         array  ,
   QUAL        integer,
   FILTER      varchar,
   INFO        variant,
   SAMPLE_ID   varchar,
   VALS        variant,
   ALLELE1     varchar,
   ALLELE2     varchar,
   FILENAME    varchar
);

And finally we populate the table with a query that uses the VCF parsing function we defined earlier, applied to each desired file via a JOIN to the (possibly filtered) directory listing:

insert into GENOTYPES_BY_SAMPLE (
   CHROM       ,
   POS         ,
   ID          ,
   REF         ,
   ALT         ,
   QUAL        ,
   FILTER      ,
   INFO        ,
   SAMPLE_ID   ,
   VALS        ,
   ALLELE1     ,
   ALLELE2     ,
   FILENAME    )
with file_list as (
   SELECT file_url, relative_path FROM DIRECTORY( @dragen_all )
   --where relative_path rlike '.*.hard-filtered.vcf.gz'  //select all 3202 genomes
   where relative_path rlike '/HG0011[0-7].*.hard-filtered.vcf.gz'  //selection of 8 genomes
   --where relative_path rlike '/HG00[0-2][0-9][0-9].*.hard-filtered.vcf.gz'  //selection of 122 genomes
 order by random()  //temporary hint to maximize parallelism  
)
select
   replace(CHROM, 'chr','')         , 
   POS           ,
   ID            ,
   REF           ,
   split(ALT,','),
   QUAL          ,
   FILTER        ,
   INFO          , 
   SAMPLEID      ,
   SAMPLEVALS    ,
   allele1(ref, split(ALT,','), SAMPLEVALS:GT::varchar) as ALLELE1,
   allele2(ref, split(ALT,','), SAMPLEVALS:GT::varchar) as ALLELE2,
   split_part(relative_path, '/',3)
from file_list,
    table(ingest_vcf(BUILD_SCOPED_FILE_URL(@dragen_all, relative_path), 0)) vcf
;

A few example timings we achieved for this ingestion in Snowflake are:

– 8 min to consume 8 genomes on an XS warehouse

– 8 min to consume 122 genomes on an XL warehouse

– 50 min to consume 3,202 genomes on a 3XL warehouse 

(The commented lines in the file_list section above are alternative filter conditions to apply when ingesting all 3,202 genomes, or a subset of 122 genomes, instead of this starter sample of 8 genomes.)

A few example queries of data properties

We can retrieve some interesting initial characteristics from the data. For example, the following query summarizes the distribution of sampling depth (DP) values for variant calls per sample—showing 5th, 10th, 50th, 90th, and 95th percentile values of this important quality metric:

-- Percentile distributions of Sampling Depths
select distinct sample_id,
  percentile_cont(.05) within group (order by vals:DP::number) over (partition by(sample_id)),
  percentile_cont(.1) within group (order by vals:DP::number) over (partition by(sample_id)),
  percentile_cont(.5) within group (order by vals:DP::number) over (partition by(sample_id)),
  percentile_cont(.9) within group (order by vals:DP::number) over (partition by(sample_id)),
  percentile_cont(.95) within group (order by vals:DP::number) over (partition by(sample_id))
from genotypes_by_sample
;

The DP is one of the numeric elements in the VALS column of the genotypes table, originating in a semi-structured VCF data element, and can be accessed in a Snowflake query using the expression VALS:DP :: number in a Select or Filter condition as shown above. All other variable elements in these semi-structured columns can be queried in a similar way.

Note that our ingestion script has also explicitly decoded the values of ALLELE1 and ALLELE2 for each variant present in the samples. That is because these single sample VCF files do not contain comprehensive information about all possible ALT values that may be present at a location across a population, so a genotype call (GT value) of, say, 1/0 for one sample may have a different meaning from 1/0 at that same location in another sample. Working directly with explicit allele values is simpler for handling data that has not yet been normalized across a population.

In another example below, we characterize the volume of variant calls observed in each chromosome by whether they are homozygous or heterozygous variants, and we use the ALLELE1 and ALLELE2 values to determine it:

-- Total Count of variant calls by type, per chrom
select chrom,
   sum(case when allele1 = ref and allele2 = ref then 1 else 0 end) as homozref,
   sum(case when allele1 != allele2 then 1 else 0 end) as heterozalt,
   sum(case when allele1 != ref and allele2 = allele1 then 1 else 0 end) as homozalt
from Genotypes_by_sample
where try_to_number(chrom) is not null
group by 1 order by try_to_number(chrom);

Since the files we ingested contained only variants (non-reference) called per sample, homozygous reference counts are all 0.

Incorporating annotation and phenotype data sources

Queries are even more interesting when we can join genotype data to annotations and phenotype information. Let’s bring some examples of those into Snowflake now. A current copy of ClinVar data is maintained by AWS in Parquet format and we can quickly ingest it. We first define a stage pointing to the s3 location, then ingest the Parquet into a structured table:

create or replace stage clinvar
   url = 's3://aws-roda-hcls-datalake/clinvar_summary_variants'
   file_format = (type = PARQUET)
;


create or replace table CLINVAR (
 chrom                  varchar(2),
 pos                    number    ,
 ref                    varchar   ,
 alt                    varchar   ,
 ALLELEID               number    ,
 chromosomeaccession    varchar   ,        
 CLNSIG                 varchar   ,
 clinsigsimple          number    ,  
 cytogenetic            varchar   ,
 geneid                 number    ,
 genesymbol             varchar   ,
 guidelines             varchar   ,
 hgnc_id                varchar   ,
 lastevaluated          varchar   ,  
 name                   varchar   ,   
 numbersubmitters       number    ,     
 origin                 varchar   ,
 originsimple           varchar   , 
 otherids               varchar   ,
 CLNDISB                array     ,
 CLNDN                  array     ,
 rcvaccession           array     , 
 CLNREVSTAT             varchar   ,
 RS                     number    ,
 startpos               number    ,
 stoppos                number    ,
 submittercategories    number    ,        
 testedingtr            varchar   ,
 type                   varchar   ,
 variationid            number    ,
 full_annotation        variant  
);


insert into CLINVAR
select
   $1:chromosome                 ::varchar(2)  chrom,
   $1:positionvcf                ::number      pos,
   $1:referenceallelevcf         ::varchar     ref,
   $1:alternateallelevcf         ::varchar     alt,
   $1:alleleid                   ::number      ALLELEID,
   $1:chromosomeaccession        ::varchar     chromosomeaccession,
   $1:clinicalsignificance       ::varchar     CLNSIG,
   $1:clinsigsimple              ::number      clinsigsimple,
   $1:cytogenetic                ::varchar     cytogenetic,
   $1:geneid                     ::number      geneid,
   $1:genesymbol                 ::varchar     genesymbol,
   $1:guidelines                 ::varchar     guidelines,
   $1:hgnc_id                    ::varchar     hgnc_id,
   $1:lastevaluated              ::varchar     lastevaluated,
   $1:name                       ::varchar     name,   
   $1:numbersubmitters           ::number      numbersubmitters,
   $1:origin                     ::varchar     origin,
   $1:originsimple               ::varchar     originsimple,
   $1:otherids                   ::varchar     otherids,
   split($1:phenotypeids::varchar,  '|')       CLNDISB,
   split($1:phenotypelist::varchar, '|')       CLNDN,
   split($1:rcvaccession::varchar,  '|')       rcvaccession,
   $1:reviewstatus               ::varchar     CLNREVSTAT,
   $1:rsid_dbsnp                 ::number      RS,
   $1:start                      ::number      startpos,
   $1:stop                       ::number      stoppos,
   $1:submittercategories        ::number      submittercategories,
   $1:testedingtr                ::varchar     testedingtr,
   $1:type                       ::varchar     type,
   $1:variationid                ::number      variationid,
   $1                            ::variant     full_annotation
FROM '@clinvar/variant_summary/'
where $1:assembly = 'GRCh38'
order by chrom, pos
;

Note that we ingest data associated with the GRCh38 reference genome, to match with the data in the 1000 Genomes data set that we ingested earlier.

For phenotype data, we only have access to a small “panel” table describing the geographic origin and gender of each sample, and any family linkages among samples. Still it is useful for illustrating the joining of genomes to properties of the samples.

create or replace stage population
   url = 's3://1000genomes/1000G_2504_high_coverage/additional_698_related/'
   file_format = (type = CSV compression = AUTO field_delimiter=' ' skip_header=1)
   ;


create or replace table panel (
  Sample_ID        varchar,
  Family_ID        varchar,
  Father_ID        varchar,
  Mother_ID        varchar,
  Gender           varchar,
  Population       varchar,
  Superpopulation  varchar
);


insert into panel
select $2, $1, $3, $4, case when $5 = 1 then 'male' else 'female' end, $6, $7
from '@population/20130606_g1k_3202_samples_ped_population.txt'
;

Putting it all together

Now we are ready to quickly and easily extract variants of interest based on a combination of population characteristics and genome location. The example below collects a slice representing female samples from Britain, for variants on a narrow range of Chromosome 10:

-- Example Query:  Britain Female Samples for subset of chrom 10
select g.sample_id, chrom, pos, ref, alt, vals:GT::varchar gt,
   allele1,
   allele2
from genotypes_by_sample g
join panel p on p.sample_id = g.sample_id
where chrom = '10' and pos between 100000 and 500000
   and population = 'GBR'
   and gender = 'female'
limit 100;

The example highlights the ability to easily join the genomic  data, keyed on Sample_ID, to phenotype data (or more complex data such as EMR or lab sources) that are associated with the same ID. In Snowflake, these IDs and selected characteristics can also be masked based on user roles accessing the data—effectively hiding IDs from query results to protect privacy, but still readily used as join conditions to allow collecting population statistics. By joining to ClinVar, we can explore data at genome locations that are annotated as being associated with a specific clinical condition or disease of interest. Here we look at all locations associated with hereditary colon cancer. Note that we do not need to explicitly specify chromosomes or positions—that filtering is applied implicitly by the join to ClinVar:

-- Example Query -- locations associated with hereditary colon cancer
select   g.chrom, g.pos, g.ref, allele1, allele2, count (sample_id)
from genotypes_by_sample g
join clinvar c
   on c.chrom = g.chrom and c.pos = g.pos and c.ref = g.ref
where
   array_contains('Hereditary nonpolyposis colorectal neoplasms'::variant, CLNDN)
group by 1,2,3,4,5
order by 1,2,5
;

When working with ClinVar’s CLNDN column, because the column may contain multiple values, we use Snowflake’s ARRAY_CONTAINS function in the filter condition to specify the desired indicator. Next, we can refine the query by revealing only those variant values where ALLELE1 or ALLEL2 of a variant call matches the annotation. We also return the label of the gene involved at those positions:

-- Example Query -- specific variants associated with hereditary colon cancer
select   g.chrom, g.pos, g.ref, c.alt clinvar_alt, genesymbol, allele1, allele2, count (sample_id)
from genotypes_by_sample g
join clinvar c
   on c.chrom = g.chrom and c.pos = g.pos and c.ref = g.ref
   and ((Allele1= c.alt) or (Allele2= c.alt))
where
   array_contains('Hereditary nonpolyposis colorectal neoplasms'::variant, CLNDN)
group by 1,2,3,4,5,6,7
order by 1,2,5
;

Other ways to unlock value of genomic data

These examples illustrate the power and flexibility of Snowflake to ingest, combine, and analyze genomic data originating in the industry-standard VCF format, coupled with patient and annotation data. Snowflake includes many features and capabilities that make it easy to expand these use cases, including:

  • Secure, role-governed sharing of selected data subsets with collaborators without copying
  • Incrementally ingesting new data, sequence-by-sequence, to accumulate population-wide data collections
  • Transforming raw genomic data sets into “normalized” collections 
  • Handling GVCF data sources and deriving reference calls from high-quality range reads
  • Indexing / clustering techniques to enable fast searching by sample_ID or by location in the genome
  • Generating dataframes as inputs to training or scoring machine learning models, based on combined variant and phenotype data

I hope these illustrations help you explore the potential of combining cloud-based life science pipelines and data sources with Snowflake’s analytics at scale. For more information on how Snowflake can help your life sciences organization unlock the value of genomic data, visit Snowflake’s Healthcare & Life Sciences website.

APPENDIX – Sample Functions for VCF File Data Ingestion:

-- Copyright (c) 2022 Snowflake Inc.  All Rights Reserved


-- UDTF to ingest gzipped vcf file.  If num_lines > 0, only first N data lines are processed
create or replace function ingest_vcf(vcfFile string, num_lines integer)
returns table (chrom string, pos integer, id string, ref string, alt string, qual string,
                   filter string, info variant, sampleVals variant, sampleId string)
language java
handler='App'
target_path='@~/ingest_vcf.jar'
as
$$
import java.io.*;
import java.util.*;
import java.util.zip.GZIPInputStream;
import java.util.stream.Stream;


class OutputRow {
   public String chrom; 
   public int    pos;      
   public String id; 
   public String ref; 
   public String alt;  
   public String qual;  
   public String filter;
   public String info;
   public String sampleVals;
   public String sampleId;


   public OutputRow(String chrom, int pos, String id, String ref, String alt, String qual,
                     String filter, String info, String sampleVals, String sampleId) {


       this.chrom      = chrom; 
       this.pos        = pos;   
       this.id         = id;  
       this.ref        = ref;   
       this.alt        = alt;  
       this.qual       = qual;  
       this.filter     = filter;
       this.info       = info;
       this.sampleVals = sampleVals;
       this.sampleId   = sampleId;


   }
}


class DictEntry {
   public String number;
   public String type;
   public String description; 
  
   public DictEntry(String number, String type, String description) {
       this.number = number;
       this.type = type;
       this.description = description;
   }
}


public class App {


   public static Class getOutputClass() {
       return OutputRow.class;
   }


   // Return quoted values unless numeric, to populate JSON values
   // Also replace "dots" with empty strings
   public static String delimitedValue(String rawValue, String type) {
       String theValue = (rawValue.equals(".")) ? "" : rawValue;
       return ((type.equals("Integer") || type.equals("Float")) ? theValue : "\"" + theValue + "\"");
   }


   // Assemble a Json Output object element from a VCF element
   public static String createJsonElement(String theKey, String unparsedValue, Map<String, DictEntry> dict) {


       // Lookup metadata in the dictionary
       String theNumber = dict.get(theKey).number;
       String theType = dict.get(theKey).type;


       String theValue = (theNumber.equals("0") ? "" : unparsedValue);


       if (theNumber.equals("0")) { // Boolean
           return("\"" + theKey + "\": true");
       }
       else if (theNumber.equals("1")) { // Singleton
           return("\"" + theKey + "\": " + delimitedValue(theValue, theType));
       }
       else { // Array of values
           List<String> vals = new ArrayList<String>();
               for (String e : theValue.split(",")) {
                   vals.add(delimitedValue(e, theType));
               }
           return("\"" + theKey + "\": [" + String.join(",", vals) + "]");
       }
   }


   // Parse content of Info fields into JSON
   public static String createInfoJson(Map<String, DictEntry> dict, String infoText) {
       if (infoText.equals(".")) {
           return null;
       }
       List<String> elements  = new ArrayList<String>();
       String[] infoArray = infoText.split(";");
       for (String item : infoArray) {
          String theKey = item.split("=")[0].trim();
          String unparsedValue = (item.contains("=") ? item.split("=")[1].trim() : "");
          elements.add(createJsonElement(theKey, unparsedValue, dict));
       }
       return ("{" + String.join(",", elements) + "}");
   }


   // Parse genomic values based on Format field into Json
   public static String createFormatJson(Map<String, DictEntry> dict, String formatString, String formatVals) {
       List<String> elements = new ArrayList<String>();
       String[] formatArray = formatString.split(":");  // available fields
       String[] valsArray = formatVals.split(":"); // the values -- may be truncated relative to format entries
       for (int i = 0 ; i < valsArray.length; i++ ) {
           String theKey = formatArray[i];
           String unparsedValue = valsArray[i];
           // add an entry to the JSON so long as the field is not empty (!== ".")
           if (! unparsedValue.equals(".")) {
               elements.add(createJsonElement(theKey, unparsedValue, dict));                   
           }  
       }                  
       return ("{" + String.join(",", elements) + "}");
   }


   // Parse a line from header INFO or FORMAT entries and populate the respective dictionaries
   public static void addDictionaryElement(Map<String, DictEntry> dict, String descriptor) {
       String infoData = descriptor.split("=<")[1];
       dict.put(infoData.split("ID=")[1].split(",")[0],
               new DictEntry(
                   infoData.split("Number=")[1].split(",")[0],
                   infoData.split("Type=")[1].split(",")[0],
                   infoData.split("Description=")[1].split(">$")[0]
               ));       
   }


   public static Stream<OutputRow> process(InputStream vcfFile, long num_lines) throws IOException {


       Map<String, DictEntry> infos = new HashMap<>();
       Map<String, DictEntry> formats = new HashMap<>();


       try {
           GZIPInputStream gzStream = new GZIPInputStream(vcfFile);
           Reader vcf = new InputStreamReader(gzStream, "US-ASCII");
           BufferedReader vcfBuf = new BufferedReader(vcf);
           String thisLine = null;
           while ((thisLine = vcfBuf.readLine()) != null) {
               if (! thisLine.startsWith("##")) {
                   break;
               }
               else if (thisLine.startsWith("##INFO")) {
                   addDictionaryElement(infos, thisLine);
               }
               else if (thisLine.startsWith("##FORMAT")) {
                   addDictionaryElement(formats, thisLine);
               }               
           }
          
           if (! thisLine.startsWith("#CHROM")) {
               throw new IOException("Unknown VCF Layout, expected '#CHROM' header line");
           }


           final String[] columns = thisLine.split("#")[1].split("\t");


           // Map the position of all "Well Known Columns"
           Map<String,Integer> columnIndex = new HashMap<>(10);
           String knownCols = "CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT";
           for (int i = 0; i < columns.length; i++) {
               columnIndex.put(columns[i],i);
               if (knownCols.indexOf(columns[i]) == -1) {
                   break;
               }
           }


           // 0-based index of first sample positon for single or multisample VCF
           // Add some validation...
           final int firstSampleCol = (columns.length > 9) ? 9 :
           (columns.length == 9 && columnIndex.get("FORMAT") == null) ? 8 : -1;


           // Handle the data lines
          
           return ((num_lines == 0) ? vcfBuf.lines() : vcfBuf.lines().limit(num_lines)).flatMap( dataLine -> {
               String[] vals = dataLine.split("\t");
               // Get the sample-independent data
               String chrom  =  vals[columnIndex.get("CHROM")];
               int pos       =  Integer.parseInt(vals[columnIndex.get("POS")]);
               String id     =  vals[columnIndex.get("ID")];
               String ref    =  vals[columnIndex.get("REF")];
               String alt    =  vals[columnIndex.get("ALT")];     
               String qual   =  vals[columnIndex.get("QUAL")];  
               String filter =  vals[columnIndex.get("FILTER")];  
               String info   =  createInfoJson(infos, vals[columnIndex.get("INFO")]);
               return java.util.stream.IntStream.range((firstSampleCol >= 0 ? firstSampleCol : columns.length - 1), columns.length)
                   .mapToObj( i -> {
                       String format =  (columnIndex.get("FORMAT") == null) ? null : vals[columnIndex.get("FORMAT")];
                       String sampleVals = (format == null) ?      // case where there is no FORMAT column,
                                               null :              // e.g. no populated per-sample column of data in the VCF
                                           (firstSampleCol >= 0) ? // case where there is at least one populated sample column
                                           createFormatJson(formats, format, vals[i]) :
                                               null;               // case where no sample column is defined at all (annotation VCF)
                       String sampleId   = (firstSampleCol >= 0) ? columns[i] : "";


                       return new OutputRow(chrom, pos, id, ref, alt, qual, filter, info, sampleVals, sampleId);     
                   });
           });
       }
       finally{}
   }
}
$$
;


-- UDFs to help parse out allele values and reduce query complexity!
CREATE OR REPLACE SECURE FUNCTION allele1 (ref varchar, alt array, gt varchar)
RETURNS varchar as
$$
case when contains(gt, '|') then
   case when strtok(gt, '|', 1) = '0' then ref else alt[strtok(gt, '|', 1)::integer-1] end
else
   case when strtok(gt, '/', 1) = '0' then ref else alt[strtok(gt, '/', 1)::integer-1] end
end
$$
;


CREATE OR REPLACE SECURE FUNCTION allele2 (ref varchar, alt array, gt varchar)
RETURNS varchar as
$$
case when contains(gt, '|') then
   case when strtok(gt, '|', 2) = '0' then ref else alt[strtok(gt, '|', 2)::integer-1] end
else
   case when strtok(gt, '/', 2) = '0' then ref else alt[strtok(gt, '/', 2)::integer-1] end
end
$$
;
Share Article

Start your 30-DayFree Trial

Try Snowflake free for 30 days and experience the AI Data Cloud that helps eliminate the complexity, cost and constraints inherent with other solutions.