Since I have joined Snowflake, I have been asked multiple times what data warehouse modeling approach does Snowflake support best. Well, the cool thing is that we support multiple data modeling approaches equally.
Turns out we have a few customers who have existing data warehouses built using a particular approach known as the Data Vault modeling approach and they have decided to move into Snowflake (yeah!).
So the conversation often goes like this:
Customer: “Can you do Data Vault on Snowflake?”
Me: “Yes you can! Why do you ask?”
Customer: “Well, your name is “snowflake” so we thought that might mean you only support snowflake-type schemas.”
Me: “Well, yes I can see your confusion in that case, but the name has nothing to do with data warehouse design really. In fact, we support any type of relational design, including Data Vault.”
What is Data Vault Modeling?
For those of you have not yet heard of the Data Vault System of Business Intelligence or simply Data Vault (DV) modeling, it provides (among other things) a method and approach to modeling your enterprise data warehouse (EDW) that is agile, flexible, and scalable.
The formal definition as written by the inventor Dan Linstedt:
The DV is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.
It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.
The main point here is that DV was developed specifically to address agility, flexibility, and scalability issues found in the other main stream data modeling approaches used in the data warehousing space. It was built to be a granular, non-volatile, auditable, historical repository of enterprise data.
At its core is a repeatable modeling technique that consists of just three main types of tables:
- Hubs = Unique list of Business Keys
- Links = Unique List of Associations / Transactions
- Satellites = Descriptive Data for Hubs and Links (Type 2 with history)
Hubs make it business driven and allow for semantic integration across systems.
Links give you the flexibility to absorb structural and business rule changes without re-engineering (and therefore without reloading any data).
Satellites give you the adaptability to record history at any interval you want plus unquestionable auditability and traceability to your source systems.
Snowflake Features to use in a Data Vault
As we have said in the past, Snowflake is an ANSI SQL RDBMS with pay-as-you-go pricing. We support tables and views like all the relational solutions on the market today. Since, from a data modeling perspective, DV is specific way and pattern for designing tables for your data warehouse, there are no issues implementing one in Snowflake.
In fact, with our combination of MPP compute clusters, optimized columnar storage format, and our patent-pending Adaptive Data Warehouse technology, I think you will get better results with your DV loads and queries with less effort than you get today on your legacy data warehouse solutions. Remember that with Snowflake you don’t need to pre-plan partitioning or distribution keys, or build indexes to get great performance. That is all handled as part of our Dynamic Query Optimization feature that uses our secure cloud-based metadata store and sophisticated feedback loop to monitor and tune your queries based on data access patterns and resource availability among other things.
Because our customers are getting such great query performance (some up to 100x improvement), I think Snowflake will be a great place to try virtualizing your information mart (i.e., reporting) layer that you expose to your BI tools.
Data Vault 2.0
For those of you interested in implementing the DV 2.0 specification, Snowflake can handle that as well. We have a built in MD5 hash function so you can implement MD5-based keys and do your change data capture using the DV 2.0 HASH_DIFF concept.
Not only does Snowflake support DV 2.0 use of hash functions, you can also take advantage of Snowflake’s Multi-table Insert (MTI) when loading your main DV tables. With this feature you can load multiple tables in parallel from a single source table.
For example you can take the data from a stage table and load it into a Hub and Satellite with one statement. Assuming your source table is in a schema called STAGE and your vault tables in a schema called DV, the first pass at loading a Hub and associated Satellite might look like this:
INSERT ALL INTO DV.HUB_COUNTRY (HUB_COUNTRY_KEY, COUNTRY_ABBRV, HUB_LOAD_DTS, HUB_REC_SRC) VALUES (stg.HASH_KEY, stg.COUNTRY_ABBRV, stg.LOAD_DTS, stg.REC_SRC) INTO DV.SAT_COUNTRIES (HUB_COUNTRY_KEY, SAT_LOAD_DTS, HASH_DIFF, SAT_REC_SRC, COUNTRY_NAME) VALUES (stg.HASH_KEY, stg.LOAD_DTS, stg.HASH_DIFF, stg.REC_SRC, stg.COUNTRY_NAME) SELECT HASH_KEY, COUNTRY_ABBRV, COUNTRY_NAME, HASH_DIFF, CURRENT_TIMESTAMP AS LOAD_DTS, STAGE_REC_SRC AS REC_SRC from STAGE.COUNTRY stg;
Likewise you can use an MTI to load Links and their associated Satellites.
If you are interested in learning more about DV, there is a website dedicated to DV, and a few books on the subject that you might want to peruse:
- Introduction to Agile Data Engineering by Kent Graziano (me)
- Intro to DV – free white paper on my personal blog
- Free DV Intro Videos – from the inventor Dan Linstedt
- Building a Scalable Data Warehouse with DV 2.0 by Dan Linstedt
- Dan Linstedt’s Blog
So hopefully this answers the basic questions about doing DV on Snowflake (yes you can). If you want to start using Snowflake, feel free to take advantage of our On-Demand offer with $400 in free usage. In the future, as our customers roll out their solutions, I hope we will be able to give you some real-world case studies on how they implemented their DVs on the Snowflake Elastic Data Warehouse.