Author: Kent Graziano
Contributing Author: Keith Hoyle
February 3, 2020 · 7 min read

Tips for Optimizing the Data Vault Architecture on Snowflake


Data Vault is an architectural approach that includes a specific data model design pattern and methodology developed specifically to support a modern, agile approach to building an enterprise data warehouse and analytics repository. If you are not familiar with Data Vault (DV) and want to learn a bit more, check out this introductory post first. 

Snowflake Cloud Data Platform was built to be design pattern agnostic. That means you can use 3NF models, dimensional (star) schemas, DV, or any hybrid of these with equal efficiency.

Snowflake supports DV designs and handles several DV design variations very well with excellent performance. This series of blog posts will present some tips and recommendations that have evolved over the last few years for implementing a DV-style warehouse in Snowflake.

Modeling Considerations

The world of DV has evolved in the last decade. Some folks who started years ago may be doing what we now refer to as DV 1.0, while others use the more current DV 2.0 patterns. Both work well in Snowflake and, in fact, we have customers doing both today.

So let’s take a quick look at what these types of models look like.

Sequential keys (DV 1.0 style)

Using sequence-generated integer values for primary keys (PKs) is a characteristic of the modeling style from when DV was first released in the early 2000s (since superseded by DV 2.0). For customers who have current DVs built this way, these will easily port to Snowflake and, in fact, will usually provide the fastest query join performance because the joins are on integer columns (generally the fastest in all RDBMSs).

This style of DV works best for customers who give priority to querying directly against the DV and have more tolerance for slower data loading. Using this style means you will have dependencies between Hubs and Satellites (Sats), and between Hubs and Links, during load operations, but join performance when extracting the data will be optimal because all the join keys are integers.

The figure below depicts a typical DV 1.0 model.

Here is a depiction of the typical load pattern for a DV 1.0 model (more discussion on this later):
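To make that dependency concrete, here is a minimal in-memory Python sketch (the table and column names are hypothetical; a real load would use Snowflake SQL): Hub rows receive sequence-generated surrogate keys, so Satellite rows cannot be loaded until those keys exist.

```python
import itertools

# DV 1.0 sketch: the Hub's surrogate key comes from a sequence,
# so Sats (and Links) must wait for the Hub load to resolve it.

hub_seq = itertools.count(1)      # stands in for a database sequence
hub_customer = {}                 # business key -> integer surrogate key

def load_hub(stage_rows):
    # Step 1: assign a sequence value to each new business key.
    for row in stage_rows:
        bk = row["customer_id"]
        if bk not in hub_customer:
            hub_customer[bk] = next(hub_seq)

def load_sat(stage_rows):
    # Step 2: each Sat row must join back to the Hub to look up
    # the integer surrogate key -- this is the load dependency.
    return [(hub_customer[row["customer_id"]], row["name"])
            for row in stage_rows]

stage = [{"customer_id": "C-100", "name": "Acme"},
         {"customer_id": "C-200", "name": "Globex"}]
load_hub(stage)                   # Hubs first (then Links)
sat_rows = load_sat(stage)        # Sats only after the Hub keys exist
```

The upside of this ordering cost is that every downstream join is integer-to-integer.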

Natural (business) keys (logical DV style)

Using natural or “business” keys (BKs) for the PKs was the design approach Dan Linstedt originally taught for building a logical DV model. Unfortunately, most legacy databases could not adequately support joins using this approach, so surrogate keys were introduced into the physical model to improve query performance. With Snowflake, this is no longer a concern.

This approach works best for customers who place a higher priority on being able to load their DV objects in parallel. Query performance for this approach is still excellent on Snowflake, but it may be slightly slower than joins using integers (depending on how many attributes make up the BK). With Snowflake’s advanced metadata-driven optimization engine, you can still achieve excellent query performance using BKs because Snowflake prunes on character-based column values as well. Based on some internal testing, if the BK is made of only one or two attributes, the join performance may be equivalent to that of joining on integers.

The diagram below is a model using multicolumn BKs:

The advantage of this style of DV is that all the objects can be loaded in parallel because the primary keys for the Hubs do not have to be calculated during the load (as they do with the DV 1.0 style) but are simply mapped from the stage tables. This eliminates the dependencies in the load process, so everything can be loaded in parallel. The next diagram depicts a typical load pattern for this approach:

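The parallelism is easy to sketch in Python (hypothetical names again; in practice these would be concurrent Snowflake load statements): because every object's key is simply the BK mapped from the stage table, no load has to wait on another.

```python
from concurrent.futures import ThreadPoolExecutor

# BK-style sketch: the PK of every DV object comes straight from the
# stage rows, so Hubs and Sats can be loaded at the same time.

stage = [{"customer_id": "C-100", "name": "Acme"},
         {"customer_id": "C-200", "name": "Globex"}]

def load_hub(rows):
    # The Hub key IS the business key -- nothing to generate or look up.
    return sorted({r["customer_id"] for r in rows})

def load_sat(rows):
    # The Sat's key is the same BK, so no join back to the Hub is needed.
    return [(r["customer_id"], r["name"]) for r in rows]

with ThreadPoolExecutor() as pool:
    hub_future = pool.submit(load_hub, stage)   # both submitted at once:
    sat_future = pool.submit(load_sat, stage)   # no load-order dependency

hub_rows, sat_rows = hub_future.result(), sat_future.result()
```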
Hashed primary keys (DV 2.0 style)

Hashed PKs were introduced as part of DV 2.0. The goal was to be able to load all objects in parallel by eliminating the bottleneck encountered by most database sequence generators. At scale, this, combined with the load dependencies, turned out to be a limiting factor for many who were trying to implement a DV-style warehouse on legacy platforms. This DV pattern works well for customers who place more priority on data loading performance and who may want to use data warehouse automation tools that have built-in templates for generating hashed PKs (for example, using MD5) based on BKs. 

Because of the power of the Snowflake optimizer, queries against a DV that uses this approach are still very fast (compared to legacy platforms) but will be slower than queries using BKs (character-based, multicolumn) or integer surrogate keys. This is the nature of all RDBMSs: integer joins are the fastest, followed by character string joins, followed by dense character string joins (such as UUIDs and hashed key values). The advantage Snowflake provides over other data warehouse systems is that its unique elastic compute architecture achieves high performance on these types of joins, though it may require more compute power (that is, a larger virtual warehouse).

The diagram below includes MD5 hash-based PKs:

Today, there are Snowflake customers using MD5 hashes as well as SHA-256 hashes for PKs. So if you are using this approach today with your current DV on a legacy platform, there is no need to change the approach in order to migrate over to Snowflake.

A word of caution: Although hash-based joins may be slightly slower than other joins, it is possible to have a highly complex (multipart) BK with so many attributes that using a hash-based join may actually still be faster. In particular, this may occur with a central Link table that has many Hubs, which in turn have multipart business keys. Determining how many attributes is too many is hard because the size (length) of the attribute values is also a factor in determining how well the result sets can be pruned.

The typical load pattern for DV 2.0 with hash keys looks the same as for the BK style: all objects can be loaded in parallel.

In the next post, I will show you how to maximize load throughput of a DV using Snowflake.

In the meantime, be sure to follow us on Twitter @SnowflakeDB and @kentgraziano to keep up on all the latest news and innovations on DV and Snowflake Cloud Data Platform.
