Getting Started with DCDF Data Architecture Incremental Processing & Logical Partitions

Snowflake for Developers GuidesGetting Started with DCDF Data Architecture Incremental Processing & Logical Partitions

Quickstart

Getting Started with DCDF Data Architecture Incremental Processing & Logical Partitions

Transformation

melindawebster

Overview

The topics of covered in this guide were originally presented in Episode 2 of Snowflake's Data Cloud Deployment Framework (DCDF) webinar series. DCDF Webinar Episode 2 focused on the ELT implementation patterns to operationalize data loading, centralize the management of data transformations and restructure the data for optimal reporting and analysis.

In this quickstart, we will focus on the actual SQL code templates for ingesting, transforming, and restructuring data into the presentation layer using incremental processing and logical partition definitions.

Caveat: This DCDF Data Architecture quickstart and template scripts are for illustrative purposes only. These scripts can be run in any Snowflake account in order to reinforce the concepts and patterns presented in DCDF Webinar Episode 2. They can be customized for your own business use cases.

Prerequisites

A Snowflake account. Existing or a free Trial Account.
ACCOUNTADMIN role (Note: Free Trial accounts provide this automatically)
Working knowledge of Snowflake database objects and the Snowflake Web UI/Snowsight.
Familiarity with Snowflake and Snowflake objects.
SnowSQL installed and configured to your Snowflake account.
Familiarity and understanding of SQL syntax

What You Will Learn

In this quickstart, we will build on the topics discussed in the webinar by loading and executing the SQL code used in the presentation.

DCDF Data Architecture
Data Ingestion
Data Transformation
Incremental processing and Logical Partitions

What You Will Need

A Snowflake Account or Trial Account, any edition will do as the scripts will run on any edition.
Login to Snowflake account that has ACCOUNTADMIN role access (Note: Free Trial accounts provide this automatically)
SnowSQL installed
Snowsight or Classic Ui will be used in these examples.

What You Will Build

An extensible ELT data pipeline, using logical partitions, that employs repeatable patterns for ingestion, transformation and consumable data assets.

DCDF Data Architecture

Let's review the DCDF Data Architecture processing layers and the purpose of each layer. This was discussed in detail in the DCDF Webinar Series Episode 1.

Raw Layer

The Raw Layer represents the first processing layer within Snowflake. It facilitates data ingestion into Snowflake and will manage the data as it exists in the source system, with no applied transformations. The following are attributes of the Raw Layer:

One database per data source.
Hash generation for surrogate keys.
Replication of data as it exists in the source system along with the creation of a surrogate key or control column for downstream transformations.

Integration Layer

The Integration Layer is used to centralize all business rules applied to the data. This layer performs the transformation, application of business rules, and materialization of data from the Raw Layer into one central location. The following are attributes of the Integration Layer:

Centralized business rules.
Intermediate results persisted as tables.
Units of work in one place with reusable components.

Presentation Layer

The Presentation Layer performs the organization of data from the raw and integration layers into various purpose-built solutions for reporting and analytics. The following are attributes of the Presentation Layer:

Conforming Dimensions
Atomic Aggregate Fact Tables
Flat data sets for data science model building
Views for governing exposure of data for consumption, analytics and sharing.

Note Business rules should not be implemented within the Presentation Layer. Using permanent tables for incrementally updated tables is recommended.

Common Database

The Common Database is a single database with one or more schemas to manage objects that span the breadth of the data architecture layers. The following are attributes of the Common Database:

User Defined Functions
File Formats
External and Internal Stages
Audit Tables
Common dimensions such as Date, Zip Code, State, Country, etc.
Common table data that is utilized across the DCDF layers.

Workspace Databases

A Workspace is a sandbox environment where individual teams can persist data for their own development and testing. These workspaces can be a database and related virtual warehouse for each department or team within a business entity. One example can be a Data Science team that clones data from the Presentation Layer into their workspace and run models on that data to determine actionable insights. The following are attributes of a Workspace Database:

Persist intermediate results
Non-production datasets for testing purposes
Compute isolation to mitigate contention and facilitate cost governance.

Incremental Processing Concepts

"The best way to process big data, is to turn it into small data"

Three steps for Incremental Processing

Step 1 - Define Logical Partitions

The volume of data being processed from the source system will drive the need to create logical partitions. Logical partitions are commonly defined as logical periods of time, or time series data, such as day, week, or month based on a business event represented in the data.

In this quickstart order date will be used as the logical partition. The order date will define how to identify new or changed data.

Step 2 - Identify Impacted Partitions

Next we want to identify the impacted logical partitions that are represented in the delta feed from the sources system. As the delta data is ingested into our staging tables in the raw layer, the impacted logical partitions (orderdate) can be identified.

Step 3 - Incrementally Process those Partitions

We will utilize the logical partitions identified in Step 2, to incrementally process the partitions.

In our example above we implement the aforementioned three steps.

Define Logical Partitions

We will use order date from the LINE_ITEM table.
This identifies the business event in our data that represents lifecycle changes over time for a given period.

Identify Impacted Partitions

We will query the LINE_ITEM_STG table in the Raw Layer for distinct order dates.
We will insert these dates into the Common Database table DELTA_DATE because this table will be utilized across the different layers.

Incrementally Process those Partitions

We will utilize the impacted partitions and process those dates of data through the raw, integration and presentation layers.

Let's see how this works!

Lab Overview

Below is an overview diagram of what we will be building. Each step builds upon what was produced in the prior step. We will only build the tables in this diagram.

Included in the code repository are additional table scripts that can be used to build out a dimensional model using these same concepts.

Lab Structure

Explain Code Snippets
- We will work through the code for each DCDF data architecture layer, explaining what is happening in each code snippet.
Execute Code and Verify Results
- In these sections, you will execute the code in your environment.
- We will validate that the data was loaded using our example tables line_items, orders, part and partsupp.

Lab summary

We are loading data for 17 days(logical partition) between 1998-06-18 and 1998-07-02 representing three distinct weeks.
Generate CSV files for logical partition range from Snowflake Sample tables.
Save this date range in a separate common delta table.
For Raw data layer, use delta table to load data in Line Item History and Line Item table.
Generate surrogate key and hash value to ensure only unique data is merged in Raw table.
Use same Delta table to load data in subsequent layers(Integration and Presentation)

DCDF Data Architecture Layers

Raw Layer
- In this layer, we will build the tables in the Raw Layer.
- Transient Staging tables used for most recent loaded data and having a suffix of _stg.
- Permanent History tables to track the history of changes and having a suffix of _hist.
Integration Layer
- In this layer, we will identify a unit of work to derive the margin (retail price minus cost) at the line item level.
- We will step through the code and extract these intermediate results by applying incremental processing.
Presentation Layer
- In this layer, we will create the ORDER_LINE_FACT table as well as the PART_DM dimension table as part of a dimensional model that can be used for consumption.
- We will walk through how to incrementally process the new data coming in.
Common database
- In this layer, we will create the DW_DELTA_DATE table that will be used for incremental processing.
- A table function named DW_DELTA_DATE_RANGE_F will be created in the Common database and utilized for the logical partitions. This will take place in the Quickstart Setup section.

Quickstart Setup

Clone Repository for Quickstart

The sample code templates provided will be used to demonstrate incremental processing and logical partitions. This code is written using SQL Scripting. The code is tool ignostic and can be easily implemented into your tool set.

To clone the repository for this Quickstart, visit the DCDF Incremental Processing associated GitHub Repository. Click on the green "Code" icon near the top of the page to obtain the "HTTPS" link needed to clone the repository. For more information on cloning a git repository please refer to the GitHub Docs.

NOTE: This sample code contains additional tables that we will not go through as part of the quickstart. They are just a continuation of the patterns we will be discussing to build out a purpose built dimensional model sample.

Sample Code Information

These code templates are written using SQL Scripting.
- Purpose of these scripts is to illustrate incremental processing and logical partitions.
- These scripts are tool agnostic.
- These templates can be easily modified to your environment and implemented in your toolset.
Naming Conventions of scripts
- Names of the folders are prefixed with a number to identify the order they will run as part of the data pipeline.
- Table Creation Scripts - Files with postfix of _tbl.sql
- ELT Process Scripts - Files with postfix of _ld.sql
- ddl_orch.sql is the parent script that will execute all the database, schema, and table creation scripts.
- dml_orch.sql is the parent script that will execute the data pipeline.

Snowflake Setup

Sample Data Set

Login to your account and verify that you have access to the SNOWFLAKE_SAMPLE_DATA database. See screenshot below.
SNOWFLAKE_SAMPLE_DATA data share is created by default in newer accounts. If you are not seeing this data share, you might need to create the data share in your account. Using Sample Data

Creating Example Databases, Schemas, Tables and Warehouse

Let's create the databases, tables and warehouse using the default names.

Make sure you are in the folder on your laptop where you cloned the sample code. You want to be at the top level where you see the sub-folders of 000_admin, 100_raw, etc.
Run Snowsql from the command line. This will create all the databases, schemas, tables and a warehouse that are needed for this lab.

snowsql -a <account_name> -u <username> -r sysadmin -D l_env=dev -f ddl_orch.sql -o output_file=ddl_orch.out

End of the output should show success like this screenshot.

Example Line Items

As part of the quickstart, we will monitor specific line item records.

Login to your Snowflake account and open a worksheet.
Copy and paste this query into a worksheet.

-- Sample Order
select
              row_number() over(order by uniform( 1, 60, random() ) ) as seq_no
             ,l.l_orderkey
             ,o.o_orderdate
             ,l.l_partkey
             ,l.l_suppkey
             ,l.l_linenumber
             ,l.l_quantity
             ,l.l_extendedprice
             ,l.l_discount
             ,l.l_tax
             ,l.l_returnflag
             ,l.l_linestatus
             ,l.l_shipdate
             ,l.l_commitdate
             ,l.l_receiptdate
             ,l.l_shipinstruct
             ,l.l_shipmode
             ,l.l_comment
        from
            snowflake_sample_data.tpch_sf1000.orders o
            join snowflake_sample_data.tpch_sf1000.lineitem l
              on l.l_orderkey = o.o_orderkey
        where
                o.o_orderdate >= to_date('7/1/1998','mm/dd/yyyy')
            and o.o_orderdate  < to_date('7/2/1998','mm/dd/yyyy')
            and l_orderkey = 5722076550
            and l_partkey in ( 105237594, 128236374);

Output should look like this.

Data Acquistion

During this step we will acquiring the data from the SNOWFLAKE_SAMPLE_DATA to load in the next step. We will use the SNOWFLAKE_SAMPLE_DATA data set for tables lineitem, orders, part, and partsupp to generate the data files to load into our raw layer.

Step 1 - Explain code snippets

LINE_ITEM_ACQ.SQL

Using the Snowsight UI, select Worksheets from the left hand menu.
Click the ellipsis next to the blue +Worksheets button.
Select "create worksheet from SQL file" and load the 100_acquisition/line_item_acq.sql.
In the first few lines of the script we are setting the context for this script. The defaults for database are DEV_WEBINAR_ORDERS_RL_DB and TPCH for the schema. This will be refered to as the Raw Layer going forward.

use database DEV_WEBINAR_ORDERS_RL_DB;
use schema   TPCH;

Next we are setting the date range for the data we want to acquire by setting the l_start_dt and l_end_dt variables. For the purposes of this lab, we are using a 17 day date range.

-- Set variables for this sample data for the time frame to acquire
set l_start_dt = dateadd( day, -16, to_date( '1998-07-02', 'yyyy-mm-dd' ) );
set l_end_dt   = dateadd( day,   1, to_date( '1998-07-02', 'yyyy-mm-dd' ) );

The "copy into" statement is where we are copying (Unloading) data, in CSV format, from the database SNOWFLAKE_SAMPLE_DATA into an Internal Table Stage. As part of this "copy into" statement we are modifying the data to show changes in l_line_status over time.

-- run this 2 or 3 times to produce overlapping files with new and modified records.
copy into
    @~/line_item
from
(
    with l_line_item as
    (
        select
              row_number() over(order by uniform( 1, 60, random() ) ) as seq_no
             ,l.l_orderkey
             ,o.o_orderdate
             ,l.l_partkey
             ,l.l_suppkey
             ,l.l_linenumber
             ,l.l_quantity
             ,l.l_extendedprice
             ,l.l_discount
             ,l.l_tax
             ,l.l_returnflag
             ,l.l_linestatus
             ,l.l_shipdate
             ,l.l_commitdate
             ,l.l_receiptdate
             ,l.l_shipinstruct
             ,l.l_shipmode
             ,l.l_comment
        from
            snowflake_sample_data.tpch_sf1000.orders o
            join sample_data.tpch_sf1000.lineitem l
              on l.l_orderkey = o.o_orderkey
        where
                o.o_orderdate >= $l_start_dt
            and o.o_orderdate  < $l_end_dt
    )
    select
         l.l_orderkey
        ,l.o_orderdate
        ,l.l_partkey
        ,l.l_suppkey
        ,l.l_linenumber
        ,l.l_quantity
        ,l.l_extendedprice
        ,l.l_discount
        ,l.l_tax
        ,l.l_returnflag
        -- simulate modified data by randomly changing the status
        ,case uniform( 1, 100, random() )
            when  1 then 'A'
            when  5 then 'B'
            when 20 then 'C'
            when 30 then 'D'
            when 40 then 'E'
            else l.l_linestatus
         end                            as l_linestatus
        ,l.l_shipdate
        ,l.l_commitdate
        ,l.l_receiptdate
        ,l.l_shipinstruct
        ,l.l_shipmode
        ,l.l_comment
        ,current_timestamp()            as last_modified_dt -- generating a last modified timestamp as part of data acquisition.
    from
        l_line_item l
    order by
        l.l_orderkey
)
file_format      = ( type=csv field_optionally_enclosed_by = '"' )
overwrite        = false
single           = false
include_query_id = true
max_file_size    = 16000000
;

Step 2 - Execute the code and Verify results

In this step we will unload data for the LINE_ITEM, PART and ORDERS tables.

LINE_ITEM_ACQ.SQL

If you haven't done so already, click the ellipsis next to the blue +Worksheets button.
Select "create worksheet from SQL file" and load the 100_acquisition/line_item_acq.sql.
Setting the context of your script. Highlight these in your worksheet, and run them to set the context.


use database DEV_WEBINAR_ORDERS_RL_DB;
use schema   TPCH;
use warehouse dev_webinar_wh;

Highlight the code to set the variables for l_start_dt, l_end_dt and run them.

-- Set variables for this sample data for the time frame to acquire
set l_start_dt = dateadd( day, -16, to_date( '1998-07-02', 'yyyy-mm-dd' ) );
set l_end_dt   = dateadd( day,   1, to_date( '1998-07-02', 'yyyy-mm-dd' ) );

Let's verify those variables have been set. Run the following statement in your worksheet.

select $l_start_dt, $l_end_dt;

Set your cursor on the "copy into" command and run it.
This might take around 2-3 minutes on a small warehouse. If you want to increase the size of your warehouse, it will run faster. The output should be similar to this.
Let's verify the number of files created. Paste this SQL into your worksheet and run it. Output should be similar to this.

list @~/line_item;

PART_ACQ.SQL

Click the ellipsis next to the blue +Worksheets button.
Select to "create worksheet from SQL file" and load the 100_acquisition/part_acq.sql.
Setting the context of your script. Highlight these in your worksheet, and run them to set the context.

use database DEV_WEBINAR_ORDERS_RL_DB;
use schema   TPCH;
use warehouse dev_webinar_wh;

Set your cursor on the "copy into" command and run it. This might take a few minutes. The output should be similar to this.

ORDERS_ACQ.SQL

Click the ellipsis next to the blue +Worksheets button.
Select to "create worksheet from SQL file" and load the 100_acquisition/orders_acq.sql.
Setting the context of your script. Highlight these in your worksheet, and run them to set the context.

use database DEV_WEBINAR_ORDERS_RL_DB;
use schema   TPCH;
use warehouse dev_webinar_wh;

Set your cursor on the "copy into" command and run it. This might take a few minutes. The output should be similar to this.

PARTSUPP_ACQ.SQL

Click the ellipsis next to the blue +Worksheets button.
Select to "create worksheet from SQL file" and load the 100_acquisition/partsupp_acq.sql.
Setting the context of your script. Highlight these in your worksheet, and run them to set the context.

use database DEV_WEBINAR_ORDERS_RL_DB;
use schema   TPCH;
use warehouse dev_webinar_wh;

Set your cursor on the "copy into" command and run it. This might take a few minutes. The output should be similar to this.

Raw Layer

Staging the data

In this section, we will take the acquired data from the Internal Table Stage mentioned in the previous section and load it into the staging tables in the Raw layer. We will load LINE_ITEM_STG, ORDER_STG, PART_STG, and PARTSUPP_STG tables.

This diagram illustrates the first steps of loading the new data files into the raw layer tables.

Incremental Processing Step 1 is to define the logical partitions. In the line item data we have identified the orderdate as the logical partition.

New data files from the Online Orders data source are placed into cloud storage.

A Truncate/Reload pattern is used here to load the LINE_ITEM_STG table.

First we truncate the prior data, and then load the new data into the LINE_ITEM_STG table.

Step 1 - Explain code snippets

LINE_ITEM_STG_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 200_raw/line_item_stg_ld.sql
In the code, after setting the context, the next step is to truncate the line_item_stg table to remove any old ddata from the previous run.

truncate table line_item_stg;

Below in the "copy into" statement, the data from the files produced in the acquisition steps will be loaded in one statement. This is a bulk load.
The purge parameter is set to true so that the files will be purged from the Internal Table Stage once they have been loaded. This saves on storage usage and cost since these files are no longer needed.

-- perform bulk load
copy into
    line_item_stg
from
    (
    select
         s.$1                                            -- l_orderkey
        ,s.$2                                            -- o_orderdate
        ,s.$3                                            -- l_partkey
        ,s.$4                                            -- l_suppkey
        ,s.$5                                            -- l_linenumber
        ,s.$6                                            -- l_quantity
        ,s.$7                                            -- l_extendedprice
        ,s.$8                                            -- l_discount
        ,s.$9                                            -- l_tax
        ,s.$10                                           -- l_returnflag
        ,s.$11                                           -- l_linestatus
        ,s.$12                                           -- l_shipdate
        ,s.$13                                           -- l_commitdate
        ,s.$14                                           -- l_receiptdate
        ,s.$15                                           -- l_shipinstruct
        ,s.$16                                           -- l_shipmode
        ,s.$17                                           -- l_comment
        ,s.$18                                           -- last_modified_dt
        ,metadata$filename                               -- dw_file_name
        ,metadata$file_row_number                        -- dw_file_row_no
        ,current_timestamp()                             -- dw_load_ts
    from
        @~ s
    )
purge         = true
pattern       = '.*line_item/data.*\.csv\.gz'
file_format   = ( type=csv field_optionally_enclosed_by = '"' )
on_error      = skip_file
--validation_mode = return_all_errors
;

Step 2 - Execute code and Verify Results

In this step we will load 3 _stg tables: LINE_ITEM_STG, ORDERS_STG, PART_STG and PARTSUPP_STG.

LINE_ITEM_STG_LD.SQL

First, we will load the Line Item data into the LINE_ITEM_STG table.
Make sure you have 200_raw/line_item_stg_ld.sql script open in Snowsight.
Setting the context of our session, highlight these SQL statements in your worksheet, and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Highlight the truncate command in the script and run it.

truncate table line_item_stg;

Set your cursor on the "copy into" command and run it. On a small warehouse this will take approximately 1 minute to load the files. The results should look like this.
Let's verify the data was loaded. Highlight the following SQL statement in your worksheet and run it.

select 
    *
from 
    table(information_schema.copy_history(table_name=>'LINE_ITEM_STG', start_time=> dateadd(hours, -1, current_timestamp())))
where
    status = 'Loaded'
order by
    last_load_time desc
;

The results will look similar to this. The file name might differ.
Let's verify that the lines that we are monitoring are loaded into the LINE_ITEM_STG table. Highlight this SQL statement in your worksheet and run it.

select * 
from dev_webinar_orders_rl_db.tpch.line_item_stg 
where l_orderkey = 5722076550
and l_partkey in ( 105237594, 128236374); -- 2 lines

PART_STG_LD.SQL

Next, we will load the Part data into the PART_STG table.
Select "create worksheet from SQL file" and load the 200_raw/part_stg_ld.sql.
Setting the context of our script. Highlight these in your worksheet, and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Highlight the truncate command in the script and run it.

truncate table part_stg;

Set your cursor on the "copy into" command and run it. On a small warehouse this will take approximately 2 minutes to load the files. The results should look like this.

ORDERS_STG_LD.SQL

Finally, we will load the Orders data into the ORDERS_STG table.
Select to "create worksheet from SQL file" and load the 200_raw/orders_stg_ld.sql.
Setting the context of our script. Highlight these in your worksheet, and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Highlight the truncate command in the script and run it.

truncate table orders_stg;

Set your cursor on the "copy into" command and run it. On a small warehouse this will take approximately 2 minutes to load the files. The results should look like this.

PARTSUPP_STG_LD.SQL

Finally, we will load the Orders data into the PARTSUPP_STG table.
Select to "create worksheet from SQL file" and load the 200_raw/partsupp_stg_ld.sql.
Setting the context of our script. Highlight these in your worksheet, and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Highlight the truncate command in the script and run it.

truncate table partsupp_stg;

Set your cursor on the "copy into" command and run it. On a small warehouse this will take approximately 2 minutes to load the files. The results should look like this.

Identify Impacted Partitions

In this section we will identify the impacted partitions that were loaded into the staging tables in the Raw Layer and persist those identified partitions in a table for use in subsequent steps.

In this diagram illustrates the concept of identifying the impacted partitions, persisting them to be utilized in all the layers of the data architecture to incrementally process the data.

Incremental Processing Step 2 is to identify the impacted partitions.

In this diagram, we select all the distinct orderdates from the LINE_ITEM_STG table that are impacted by the new data that was loaded.

Then we persist those dates into a table called "dw_delta_date" so that we can utilize those impacted partitions to do our incremental processing at each layer (Raw, Integration and Presentation).

Step 1 - Explain code snippets

DW_DELTA_DATE_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 200_raw/dw_delta_date_ld.sql
After setting the context there is an "insert" statement. As part of the "insert" statement, there is a CTE (Common Table Expression) identified by the "with" statement inside the "insert" statement. This "select" identifies all the orderdates that were impacted with the load into the _STG table.

insert overwrite into dw_delta_date
with l_delta_date as
(
    select distinct
        o_orderdate as event_dt
    from
        dev_webinar_orders_rl_db.tpch.line_item_stg 
)
select
     event_dt
    ,current_timestamp()            as dw_load_ts
from
    l_delta_date
order by
    1
;

DW_DELTA_DATE_RANGE_F.SQL (Additional Information)

As part of the objects we created back in the Getting Started Section, we created a table function called DW_DELTA_DATE_RANGE_F. This table function will take a type of time period such as day, week, month, quarter and year as a parameter, and return rows with a start_date and end_date of that period. We want the end_dt to be one greater than the actual max when returned from the function. This is so in our code for incremental processing we can use less than that end_date, to catch all data up through the end of that date.
To review the code in Snowsight, "create worksheet from SQL file", select the 000_admin/dw_delta_date_range_f.sql

use schema   &{l_common_schema};;

create or replace function dw_delta_date_range_f
(
    p_period_type_cd   varchar
)
returns table( start_dt timestamp_ltz, end_dt timestamp_ltz )
as
$$
    select
         start_dt
        ,end_dt
    from
        (
        select
             case lower( p_period_type_cd )
                 when 'all'     then current_date()
                 when 'day'     then date_trunc( day, event_dt )
                 when 'week'    then date_trunc( week, event_dt )
                 when 'month'   then date_trunc( month, event_dt )
                 when 'quarter' then date_trunc( quarter, event_dt )
                 when 'year'    then date_trunc( year, event_dt )
                 else current_date()
             end                as partition_dt
            ,min( event_dt ) as start_dt
            ,dateadd( day, 1, max( event_dt ) ) as end_dt  -- To provide end date 
        from
            dw_delta_date
        group by
            1
        )
    order by
        1
$$
;

Step 2 - Execute code and Verify Results

DW_DELTA_DATE_LD.SQL

Setting the context of our script. Highlight these in your worksheet, and run them to set the context.

use database dev_webinar_common_db;
use schema util;
use warehouse dev_webinar_wh;

Set the cursor on the "insert overwrite" statement and run it. The output should look like the following.

To verify that the order dates for 17 day window were loaded into the DW_DELTA_DATE table, highlight this query into your worksheet and run it.

select * 
from dev_webinar_common_db.util.dw_delta_date
order by event_dt;

DW_DELTA_DATE_RANGE_F.SQL (Query the table function)

Verify the table function DW_DELTA_DATE_RANGE_F returns results as well. Highlight the following query your worksheet and run it. This will return 3 rows representing 3 weeks of data. We will utilize this table function in future scripts.

select start_dt, end_dt 
FROM table(dev_webinar_common_db.util.dw_delta_date_range_f('week')) 
order by 1;

Incrementally Process

In this section we will incrementally process the data and load it into the persistent tables in the Raw layer by utilizing the impacted partitions that were identified in the prior step. We will load LINE_ITEM, LINE_ITEM_HIST, PART, ORDERS, and PARTSUPP tables.

In this diagram we are illustrating the concept of "looping" through the impacted partitions and processing only that data impacted into the persistent tables in the raw layer.

Incremental Processing Step 3 is to utilize the impacted partitions to process the data through each of the layers.

Here we are using the order dates stored in the DW_DELTA_DATE table to identify the dates that need to be processed into the LINE_ITEM_HIST and LINE_ITEM tables.

In the LINE_ITEM_HIST table we are capturing the life cycle of the rows as they change. This is for audit purposes or tracking.

In the LINE_ITEM table, we are peristing the current state of that line_item row.

Step 1 - Explain code snippets

LINE_ITEM_HIST_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 200_raw/line_item_hist_ld.sql
After we set the context, there is the "execute immediate" statement. Since we are using SQL Scripting this statement is used to create an anonymous block.

execute immediate $$

In the declaration section, we are defining 2 variables (l_start_dt, and l_end_dt) that we will use to process the logical partition start and end dates.

declare
  l_start_dt date;
  l_end_dt   date;

Then we are declaring the cursor which holds the result of the DW_DELTA_DATE_RANGE_F table function that returns a start and end date for a week.

declare
  ...
  c1 cursor for select start_dt, end_dt FROM table(dev_webinar_common_db.util.dw_delta_date_range_f('week')) order by 1;

The "for" loop is where the script incrementally processes the data by looping through a week at a time.

begin
  --
  -- Loop through the dates to incrementally process based on the logical partition definition.
  -- In this example, the logical partitions are by week.
  --
  for record in c1 do
    l_start_dt := record.start_dt;
    l_end_dt   := record.end_dt;
    ...
  end for;

end;

Scroll down to the "insert" statement.
Inside the "with" statement the script is creating a CTE named l_stg.
The surrogate key and hash diff columns (_shk) being created in this CTE will be used to assist in identifying the rows that have changed.

insert into line_item_hist
with l_stg as
    (
        --
        -- Driving CTE to identify all records in the logical partition to be processed.
        select
            -- generate hash key and hash diff to streamline processing
             sha1_binary( concat( s.l_orderkey, '|', s.l_linenumber ) )  as dw_line_item_shk
            --
            -- note that last_modified_dt is not included in the hash diff since it only represents recency of the record versus an 
            -- actual meaningful change in the data
            ,sha1_binary( concat( s.l_orderkey
                                         ,'|', coalesce( to_char( s.o_orderdate, 'yyyymmdd' ), '~' )
                                         ,'|', s.l_linenumber
                                         ,'|', coalesce( to_char( s.l_partkey ), '~' )
                                         ,'|', coalesce( to_char( s.l_suppkey ), '~' )
                                         ,'|', coalesce( to_char( s.l_quantity ), '~' )
                                         ,'|', coalesce( to_char( s.l_extendedprice ), '~' )
                                         ,'|', coalesce( to_char( s.l_discount ), '~' )
                                         ,'|', coalesce( to_char( s.l_tax ), '~' )
                                         ,'|', coalesce( to_char( s.l_returnflag ), '~' )
                                         ,'|', coalesce( to_char( s.l_linestatus ), '~' )
                                         ,'|', coalesce( to_char( s.l_shipdate, 'yyyymmdd' ), '~' )
                                         ,'|', coalesce( to_char( s.l_commitdate, 'yyyymmdd' ), '~' )
                                         ,'|', coalesce( to_char( s.l_receiptdate, 'yyyymmdd' ), '~' )
                                         ,'|', coalesce( s.l_shipinstruct, '~' )
                                         ,'|', coalesce( s.l_shipmode, '~' )
                                         ,'|', coalesce( s.l_comment, '~' )
                                )
        
                        )               as dw_hash_diff
            ,s.*
        from
            line_item_stg s

In the CTE, the "where" clause is filtering the data from the LINE_ITEM_STG table to select only the rows that have an orderdate that is in the logical "week" partiton.

insert into line_item_hist
with l_stg as
    (
        --
        -- Driving CTE to identify all records in the logical partition to be processed.
        select
            ...
        from
            line_item_stg s
        where
                s.o_orderdate >= :l_start_dt
            and s.o_orderdate  < :l_end_dt
    )

The next CTE named l_deduped will go through and dedupe the records in the l_stg CTE, using the hash_diff to identify duplicate rows. This eliminates duplicates from getting loaded into the permanent Raw layer LINE_ITEM_HIST table.

,l_deduped as
    (
        --
        -- Dedupe the records from the staging table.
        -- This assumes that there may be late arriving or duplicate data that were loaded
        -- Need to identify the most recent record and use that to update the Current state table.
        -- as there is no reason to process each individual change in the record, the last one would have the most recent updates
        select
            *
        from
            l_stg
        qualify
            row_number() over( partition by dw_hash_diff order by last_modified_dt desc, dw_file_row_no )  = 1
    )

The final outside "select" statement will select data from the LINE_ITEM_HIST table using the same logical "week" partition range, and compare it to the deduped staging records to identify what needs to be inserted into the LINE_ITEM_HIST table.

    select
        ...
    from
        l_deduped s
    where
        s.dw_hash_diff not in
        (
            -- Select only the rows in that logical partition from the final table.
            select dw_hash_diff from line_item_hist 
            where
                    o_orderdate >= :l_start_dt
                and o_orderdate  < :l_end_dt
        )

Important: To optimize query performance in Snowflake, and ensure proper clustering of data in each micropartition, the "order by" clause sorts the rows by orderdate as they are inserted into the table.

    order by
        o_orderdate  -- physically sort rows by a logical partitioning date
    ;

LINE_ITEM_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 200_raw/line_item_ld.sql
The script is very similar to line_item_hist_ld.sql except this is a merge pattern. The script has the same anonymous block, variable declarations, and cursor definition using the DW_DELTA_DATE_RANGE_F table function to loop through the logical partitions, as well as the same "for" loop.

execute immediate $$

declare
  l_start_dt date;
  l_end_dt   date;
  -- Grab the dates for the logical partitions to process
  c1 cursor for select start_dt, end_dt FROM table(dev_webinar_common_db.util.dw_delta_date_range_f('week')) order by 1;

begin

  --
  -- Loop through the dates to incrementally process based on the logical partition definition.
  -- In this example, the logical partitions are by week.
  --
  for record in c1 do
    l_start_dt := record.start_dt;
    l_end_dt   := record.end_dt;
    ...

The script also has the same l_stg CTE to identify the LINE_ITEM_STG records within that week of logical partitions. It is also doing the same surrogate key and hash diff derivations.

       with l_stg as
        (
            --
            -- Driving CTE to identify all records in the logical partition to be processed
            --
            select
                -- generate hash key and hash diff to streamline processing
                 sha1_binary( concat( s.l_orderkey, '|', s.l_linenumber ) )  as dw_line_item_shk
                --
                -- note that last_modified_dt is not included in the hash diff since it only represents recency of the record versus an 
                -- actual meaningful change in the data
                --
                ,sha1_binary( concat( s.l_orderkey
                                     ,'|', coalesce( to_char( s.o_orderdate, 'yyyymmdd' ), '~' )
                                     ,'|', s.l_linenumber
                                     ,'|', coalesce( to_char( s.l_partkey ), '~' )
                                     ,'|', coalesce( to_char( s.l_suppkey ), '~' )
                                     ,'|', coalesce( to_char( s.l_quantity ), '~' )
                                     ,'|', coalesce( to_char( s.l_extendedprice ), '~' )
                                     ,'|', coalesce( to_char( s.l_discount ), '~' )
                                     ,'|', coalesce( to_char( s.l_tax ), '~' )
                                     ,'|', coalesce( to_char( s.l_returnflag ), '~' )
                                     ,'|', coalesce( to_char( s.l_linestatus ), '~' )
                                     ,'|', coalesce( to_char( s.l_shipdate, 'yyyymmdd' ), '~' )
                                     ,'|', coalesce( to_char( s.l_commitdate, 'yyyymmdd' ), '~' )
                                     ,'|', coalesce( to_char( s.l_receiptdate, 'yyyymmdd' ), '~' )
                                     ,'|', coalesce( s.l_shipinstruct, '~' )
                                     ,'|', coalesce( s.l_shipmode, '~' )
                                     ,'|', coalesce( s.l_comment, '~' )
                                    )
            
                            )               as dw_hash_diff
                ,s.*
            from
                line_item_stg s
            where
                    s.o_orderdate >= :l_start_dt
                and s.o_orderdate  < :l_end_dt
        )

The script has the same dedupe logic.

,l_deduped as
    (
        --
        -- Dedupe the records from the staging table.
        -- This assumes that there may be late arriving or duplicate data that were loaded
        -- Need to identify the most recent record and use that to update the Current state table.
        -- as there is no reason to process each individual change in the record, the last one would have the most recent updates
        select
            *
        from
            l_stg
        qualify
            row_number() over( partition by dw_hash_diff order by last_modified_dt desc, dw_file_row_no )  = 1
    )

However...the script does have an additional CTE l_tgt. This CTE is important for partition pruning efficiencies because it is selecting only those rows from the LINE_ITEM table that are in the logical partition range we are processing.

,l_tgt as
        (
            --
            -- Select the records in the logical partition from the current table. 
            -- Its own CTE, for partition pruning efficiencies
            select *
            from line_item
            where
                    o_orderdate >= :l_start_dt
                and o_orderdate  < :l_end_dt
        )

Now let's look at the "merge" statement. In the "select" statement below, the l_deduped CTE and l_tgt CTE are joined together with a left join to identify the rows that are in the LINE_ITEM_STG table that might not be in the LINE_ITEM table; or where the hash_diff is different and the modified date is after what already exists in the table.

-- Merge Pattern 
    --
    merge into line_item tgt using
    (
      ...
        select
             current_timestamp()        as dw_version_ts
            ,s.*
        from
            l_deduped s
            left join l_tgt t on
                t.dw_line_item_shk = s.dw_line_item_shk
        where
            -- source row does not exist in target table
            t.dw_line_item_shk is null
            -- or source row is more recent and differs from target table
            or (
                    t.last_modified_dt  < s.last_modified_dt
                and t.dw_hash_diff     != s.dw_hash_diff
               )
        order by
            s.o_orderdate  -- physically sort rows by logical partitioning date
    ) src

Important: Please note the "on" clause of the "merge" statement. The logical partition dates are used to prevent a full table scan of the LINE_ITEM table.

-- Merge Pattern 
    --
    merge into line_item tgt using
    (
      ...
    on
    (
            tgt.dw_line_item_shk = src.dw_line_item_shk
        and tgt.o_orderdate     >= :l_start_dt
        and tgt.o_orderdate      < :l_end_dt
    )

Step 2 - Execute code and Verify Results

LINE_ITEM_HIST_LD.SQL

Select "create worksheet from SQL file", select the 200_raw/line_item_hist_ld.sql. Set the context of our script. Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Put your cursor on the "execute immediate" command at the top of the script and run it.
Let's verify that the data was loaded into the LINE_ITEM_HISTORY table. Highlight the following query in your worksheet. If you have run these acquisition and load scripts multiple times you may see history changes in this table.

If you are using the Snowflake Extension for VS Code to run the scripts, the SHA1_BINARY fields will not be displayed properly. To confirm that the SHA1_BINARY was created properly, run the following query in Snowflake.

select * 
from dev_webinar_orders_rl_db.tpch.line_item_hist 
where l_orderkey = 5722076550 
and l_partkey in ( 105237594, 128236374)
order by 1;

LINE_ITEM_LD.SQL

Select "create worksheet from SQL file" and open the the 200_raw/line_item_ld.sql file.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Put your cursor on the "execute immediate" command and run it.
Let's verify that the data was loaded into the LINE_ITEM table. Highlight the following query in your worksheet and run it.

select * 
from dev_webinar_orders_rl_db.tpch.line_item 
where l_orderkey = 5722076550 
and l_partkey in ( 105237594, 128236374)
order by 1;

PART_LD.SQL

Now we want to load the Part data into the PART table.
Select "create worksheet from SQL file" and open the worksheet 200_raw/part_ld.sql.
Highlight the following SQL statements in the worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Put your cursor on the "execute immediate" command and run it.
Verify the PART table was loaded. Highlight the following query and run it.

select *
from dev_webinar_orders_rl_db.tpch.part
where p_partkey in ( 105237594, 128236374);

ORDER_LD.SQL

Now we want to load the Order data into the ORDERS table.
Select "create worksheet from SQL file" and open the 200_raw/order_ld.sql file.
Highlight the following SQL statements in the worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Put your cursor on the "execute immediate" command and run it.
Verify the ORDERS table was loaded with the order data. Highlight the following query and run it.

select * 
from dev_webinar_orders_rl_db.tpch.orders 
where o_orderkey = 5722076550;

PARTSUPP_LD.SQL

Now we want to load the partsupp data into the PARTSUPP table.
Select "create worksheet from SQL file" and open the 200_raw/partsupp_ld.sql file.
Highlight the following SQL statements in the worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_orders_rl_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Put your cursor on the "execute immediate" command and run it.
Verify the PARTSUPP table was loaded with the partsupp data. Highlight the following query and run it.

select * 
from dev_webinar_orders_rl_db.tpch.partsupp
where ps_partkey in ( 105237594, 128236374);

Integration Layer

In this step we will incrementally process an isolated unit of work, deriving certain business rules from the impacted partitions that we previously identified.

This diagram illustrates the steps for incremental processing units of work in the integration layer of the DCDF data architecture.

A unit of work has been identified where the line item margin needs to be derived/calculated for each line item.

This is an isolated unit of work with business rules that should be incrementally processed each time new or changed line items are processed.

Extensible, ease of maintenance for that unit of work. As rules change to derive the margin, then they can be added/changed here in one place.

The impacted logical partitions have been identified and will be incrementally processed.

Step 1 - Explain code snippets

LINE_ITEM_MARGIN_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 310_derivation/line_item_margin_ld.sql
This script is also using the anonymous block in SQL Scripting.

execute immediate $$

Again, the cursor is being declared using the result of the DW_DELTA_DATE_RANGE_F table function that returns a start and end date for a given week. We are going to loop through the weeks and process a week at a time. This is the incremental processing.

declare
  l_start_dt date;
  l_end_dt   date;
  -- Grab the dates for the logical partitions to process
  c1 cursor for select start_dt, end_dt FROM table(dev_webinar_common_db.util.dw_delta_date_range_f('week')) order by 1;

begin

  --
  -- Loop through the dates to incrementally process based on the logical partition definition.
  -- In this example, the logical partitions are by week.
  --
  for record in c1 do
    l_start_dt := record.start_dt;
    l_end_dt   := record.end_dt;

Just like the line_item_ld.sql file in the previous section, the merge processing pattern is being utilized here to select data by using the logical partitions. Important: Another note is the "on" clause of the "merge" statement.

merge into line_item_margin t using
    (
        with l_src as
        (
            -- 
            -- Driving CTE to identify all the records in the logical partition to be process
            --
            select
                 s.dw_line_item_shk
                ,s.o_orderdate
                ,s.l_extendedprice - (s.l_quantity * p.ps_supplycost ) as margin_amt
                ,s.last_modified_dt
            from
                dev_webinar_orders_rl_db.tpch.line_item s
                join dev_webinar_orders_rl_db.tpch.partsupp p
                  on ( p.ps_partkey = s.l_partkey
                       and p.ps_suppkey = s.l_suppkey )
            where
                    s.o_orderdate >= :l_start_dt
                and s.o_orderdate  < :l_end_dt
        )
        ,l_tgt as
        (
            -- 
            -- Select the records in the logical partition from the current table.
            -- Its own CTE, for partition pruning efficiencies
            select *
            from line_item_margin
            where
                    o_orderdate >= :l_start_dt
                and o_orderdate  < :l_end_dt
        )
        select
             current_timestamp()        as dw_update_ts
            ,s.*
        from
            l_src s
            left join l_tgt t on
                t.dw_line_item_shk = s.dw_line_item_shk
        where
            -- source row does not exist in target table
            t.dw_line_item_shk is null
            -- or source row is more recent and differs from target table
            or (
                    t.last_modified_dt  < s.last_modified_dt
                and t.margin_amt       != s.margin_amt
               )
        order by
            s.o_orderdate
    ) s
    on
    (
        t.dw_line_item_shk = s.dw_line_item_shk
        and t.o_orderdate >= :l_start_dt 
        and t.o_orderdate  < :l_end_dt
    )

Step 2 - Execute code and Verify Results

LINE_ITEM_MARGIN_LD.SQL

In Snowsight, "create worksheet from SQL file" and open the line_item_margin_ld.sql file.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_il_db;
use schema   main;
use warehouse dev_webinar_wh;

Place your cursor on the "execute immediate" command and run it.
Let's verify that the data was loaded into the LINE_ITEM_MARGIN table. Highlight the following SQL query in your worksheet. If you have run these load scripts multiple times you may see history changes in this table.

-- Integration
select m.*
from dev_webinar_il_db.main.line_item_margin m
    join dev_webinar_orders_rl_db.tpch.line_item l
where l.l_orderkey = 5722076550 
and l.l_partkey in ( 105237594, 128236374)
and m.dw_line_item_shk = l.dw_line_item_shk;

Presentation Layer

In this step we will incrementally process the data that was loaded in the previous section, and re-organizing the data for consumption from the Presentation layer, utilizing the identified impacted partitions.

In this diagram illustrates the steps for incremental processing for the Presentation layer of the DCDF data architecture.

Order_line_fact is a purpose built solution for consumption and analytics.

Dimensional model as an example to provide a solution around the orders lines and consumption at that atomic level of data.

The impacted logical partitions have been identified and will be incrementally processed in the presentation layer.

Step 1 - Explain code snippets

ORDER_LINE_FACT_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 410_fact_atomic/order_line_fact_ld.sql
This script uses the anonymous block in SQL Scripting.

execute immediate $$

Again, the cursor is being declared using the result of the DW_DELTA_DATE_RANGE_F table function that returns a start and end date for a given week. We are going to loop through the weeks and process a week at a time. This is the incremental processing.

declare
  l_start_dt date;
  l_end_dt   date;
  -- Grab the dates for the logical partitions to process
  c1 cursor for select start_dt, end_dt FROM table(dev_webinar_common_db.util.dw_delta_date_range_f('week')) order by 1;

begin

  --
  -- Loop through the dates to incrementally process based on the logical partition definition.
  -- In this example, the logical partitions are by week.
  --
  for record in c1 do
    l_start_dt := record.start_dt;
    l_end_dt   := record.end_dt;

Within the "for" loop is a delete/insert processing pattern. The "delete" statement will delete from the ORDER_LINE_FACT table any rows within the logical partition. This logical partition is the week of order dates in our example. Then the "insert" will insert all the rows for that logical partition into the ORDER_LINE_FACT table.
We first apply deletes and then inserts in order to handle situations such as late arriving data. For instance if the supplier wasn't available when the order line item was first ordered, that value would be some default value. Then as the status of the line item changes, and the supplier was updated, the original defaulted row would need to be removed.

-- Delete the records using the logical partition 
     -- Very efficient when all the rows are in the same micropartitions.  Mirrors a truncate table in other database platforms.
     delete from order_line_fact
     where orderdate >= :l_start_dt
       and orderdate <  :l_end_dt;
 
     -- Insert the logical partitioned records into the table
     -- Inserts data from same order date into the same micropartitions
     -- Enables efficient querying of the data for consumption
     insert into order_line_fact
     select
         li.dw_line_item_shk
        ,o.o_orderdate
        ,o.dw_order_shk
        ,p.dw_part_shk
        ,s.dw_supplier_shk
        ,li.l_quantity      as quantity
        ,li.l_extendedprice as extendedprice
        ,li.l_discount      as discount
        ,li.l_tax           as tax
        ,li.l_returnflag    as returnflag
        ,li.l_linestatus    as linestatus
        ,li.l_shipdate
        ,li.l_commitdate
        ,li.l_receiptdate
        ,lim.margin_amt
        ,current_timestamp() as dw_load_ts
     from
         webinar_rl_db.tpch.line_item li
         --
         join webinar_rl_db.tpch.orders o
           on o.o_orderkey = li.l_orderkey
         --
         join webinar_il_db.main.line_item_margin lim
           on lim.dw_line_item_shk = li.dw_line_item_shk
         --
         -- Left outer join in case the part record is late arriving
         --
         left outer join webinar_rl_db.tpch.part p
           on p.p_partkey = li.l_partkey
         --
         -- left outer join in case the supplier record is late arriving
         --
         left outer join webinar_rl_db.tpch.supplier s
           on s.s_suppkey = li.l_suppkey
     where 
             li.o_orderdate >= :l_start_dt
         and li.o_orderdate <  :l_end_dt
     order by o.o_orderdate;

Let's look at a dimension table load to illlustrate a different processing pattern "insert overwrite".
In Snowsight, "create worksheet from SQL file", select the 400_dimension/part_dm_ld.sql
This script is also using the anonymous block in SQL Scripting.

execute immediate $$

This script uses an "insert overwrite" pattern where the code will essentially be doing a truncate/insert. The advantage of doing "insert overwrite", versus a truncate/insert, is that the table will never be empty. With a truncate/insert pattern there could be a time period in which the table is empty if users were to query the data.

insert overwrite into part_dm
   select
       p.dw_part_shk
      ,p.p_partkey
      ,p.p_name as part_name
      ,p.p_mfgr as mfgr
      ,p.p_brand as brand
      ,p.p_type as type
      ,p.p_size as size
      ,p.p_container as container
      ,p.p_retailprice as retail_price
      ,p.p_comment as comment
      ,d.first_orderdate
      ,p.last_modified_dt
      ,p.dw_load_ts
      ,p.dw_update_ts
   from
       dev_webinar_orders_rl_db.tpch.part p
       left join dev_webinar_il_db.main.part_first_order_dt d
         on d.dw_part_shk = p.dw_part_shk;

Step 2 - Execute code and Verify Results

ORDER_LINE_FACT_LD.SQL

In Snowsight, "create worksheet from SQL file" and open the worksheet for the 410_fact_atomic/order_line_fact_ld.sql.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_pl_db;
use schema   main;
use warehouse dev_webinar_wh;

3. Place the cursor on the "execute immediate" command and run it.

Let's verify that the data was loaded into the ORDER_LINE_FACT table. Highlight the following query in your worksheet. If you have run these load scripts multiple times you may see history changes in this table.

select olf.*
from dev_webinar_pl_db.main.order_line_fact olf
  join dev_webinar_orders_rl_db.tpch.line_item l
    on l.dw_line_item_shk = olf.dw_line_item_shk
where l.l_orderkey = 5722076550 
and l.l_partkey in ( 105237594, 128236374);

PART_DM_LD.SQL

In Snowsight, "create worksheet from SQL file" and open the worksheet for the 400_dimension/part_dm_ld.sql.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_pl_db;
use schema   main;
use warehouse dev_webinar_wh;

Place the cursor on the "execute immediate" command and run it.
Let's verify that the data was loaded into the PART_DM table. Highlight the following query in your worksheet. If you have run these load scripts multiple times you may see history changes in this table.

select *
from dev_webinar_pl_db.main.part_dm p
where p_partkey in ( 105237594, 128236374);

BONUS -Slowly Changing Dimension

Slowly changing dimension type 2 changes add a new row in the dimension with the updated attribute values. This requires generalizing the primary key of the dimension beyond the natural or durable key because there will potentially be multiple rows describing each member. When a new row is created for a dimension member, a new primary surrogate key is assigned and used as a foreign key in all fact tables from the moment of the update until a subsequent change creates a new dimension key and updated dimension row. - Kimball Group.

In this section we will go through incremental processing of a Type 2, slowly changing dimension; customer. We will go through running each layer in the DCDF for this customer data.

Step 1 - Acquistion

CUSTOMER_ACQ.SQL

In Snowsight, "create worksheet from SQL file", select the 100_acquisition/customer_acq.sql.
In the first few lines of the script we are setting the context for this script.
Here you will see a "union all" statement where we are creating some modified data for a specific customer to illustrate changes over time to the data.
The "copy into" statement is where we are copying (Unloading) the data from the SNOWFLAKE_SAMPLE_DATA into CSV formatted files into an internal table stage.

Step 2 - Raw Layer

CUSTOMER_STG_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 100_acquisition/customer_stg_ld.sql.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_rl_orders_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Place the cursor on the "truncate table " command and run it.
Place the cursor on the "execute immediate" command and run it.
Let's verify that the data was loaded into the CUSTOMER_STG table. Highlight the following query at the bottom of your worksheet and run it.

select *
from dev_webinar_orders_rl_db.tpch.customer_stg
where c_custkey in (50459048);

DW_DELTA_DATE_LD_BONUS.SQL

In Snowsight, "create worksheet from SQL file", select the 200_raw/dw_delta_date_ld_bonus.sql.
This script has been updated to identify logical partitions in the CUSTOMER_STG table as well as the LINE_ITEM_STG.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_rl_orders_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Place the cursor on the "insert overwrite" statement and run it.

CUSTOMER_HIST_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 100_acquisition/customer_hist_ld.sql.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_rl_orders_db;
use schema   tpch;
use warehouse dev_webinar_wh;

Scroll down to the dw_customer_shk surrogate key generation in the "with" statement. We are generating a surrogate key for each combination of the c_custkey and the change_date. We want to use this in the presentation layer to match an order to the correct version of the customer at the time of the order.

    select
        -- generate hash key and hash diff to streamline processing
        sha1_binary( s.c_custkey || to_char( s.change_date, 'yyyymmdd' ) )  as dw_customer_shk
        --
        -- note that last_modified_dt is not included in the hash diff since it only represents recency of the record versus an 
        -- actual meaningful change in the data
        ,sha1_binary( concat( s.c_custkey || to_char( s.change_date, 'yyyymmdd' )
                                     ,'|', coalesce( to_char( s.change_date, 'yyyymmdd'), '~' )
                                     ,'|', coalesce( s.c_name, '~' )
                                     ,'|', coalesce( s.c_address, '~' )
                                     ,'|', coalesce( to_char( s.c_nationkey ), '~' )
                                     ,'|', coalesce( s.c_phone, '~' )
                                     ,'|', coalesce( to_char( s.c_acctbal ), '~' )
                                     ,'|', coalesce( s.c_mktsegment, '~' )
                                     ,'|', coalesce( s.c_comment, '~' )
                            )
     
                    )               as dw_hash_diff
        ,s.*
    from
        customer_stg s
    where
            s.change_date >= :l_start_dt
        and s.change_date  < :l_end_dt

Place the cursor on the "execute immediate" command and run it. Here we are loading the CUSTOMER_HIST table using the same repeatable pattern as we did with the LINE_ITEM_HIST table.
Let's verify that the data was loaded into the CUSTOMER_HIST table. Highlight the following query at the bottom of your worksheet and run it.

select *
from dev_webinar_orders_rl_db.tpch.customer_hist
where c_custkey in (50459048)
order by 5;

Step 3 - Integration Layer

There are no steps here for the integration layer as there aren't any business rules being defined for customer.

Step 4 - Presentation Layer

CUSTOMER_DM.SQL

In Snowsight, "create worksheet from SQL file", select the 400_dimension/customer_dm_ld.sql.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_pl_db;
use schema   main;
use warehouse dev_webinar_wh;

Place the cursor on the "execute immediate" command and run it. Here we are loading the CUSTOMER_DM table.
Let's veriy that the data was loaded into the CUSTOMER_DM table. Highlight the query below in your worksheet and run it. There should be two rows, where the values of the c_acctbal are different.

select *
from dev_webinar_pl_db.main.customer_dm
where c_custkey in (50459048)
order by 5;

ORDER_LINE_FACT_BONUS_LD.SQL

In Snowsight, "create worksheet from SQL file", select the 410_fact_atomic/order_line_fact_bonus_ld.sql.
Highlight the following SQL statements in your worksheet and run them to set the context.

use role     sysadmin;
use database dev_webinar_pl_db;
use schema   main;
use warehouse dev_webinar_wh;

Place the cursor on the "execute immediate" command and run it. Here we are loading the ORDER_LINE_FACT_BONUS table.
Let's verify that the data was loaded into the ORDER_LINE_FACT_BONUS table and that we can see the one order with multiple line_items and the c_acctbal column is 1,293. Highlight the query below in your worksheet and run it.

select 
     c.c_custkey
    ,c.c_name
    ,c.c_acctbal
    ,olf.*
from dev_webinar_pl_db.main.customer_dm c
   join dev_webinar_pl_db.main.order_line_fact_bonus olf
     on olf.dw_customer_shk = c.dw_customer_shk
where c.c_custkey in (50459048);
order by olf.orderdate;

Cleanup

This step is to cleanup and drop all the objects we created as part of this quickstart.

Select to "create worksheet from SQL file" and open the worksheet for the 000_admin/cleanup.sql.

-- Cleanup all the objects we created

use role sysadmin;

drop database dev_webinar_orders_rl_db;
drop database dev_webinar_il_db;
drop database dev_webinar_pl_db;
drop database dev_webinar_common_db;

Run all the SQL statements to drop all the objects that were created.

Conclusion & Next Steps

This tutorial was designed as a hands-on introduction to the Data Cloud Deployment Framework (DCDF) data architecture incremental processing and logical partitions.

We encourage you to continue learning about the Data Cloud Deployment Framework, by watching the Data Cloud Deployment Framework Series Webinars either on-demand on register for upcoming episodes.

Also the github repo contains more scripts than what was covered in this lab. It's a full, working template model taking source data from the Raw layer, through the Integration layer, and finally to the Presentation layer dimension model, ready for consumption. Please take the time to go through each one of these scripts and slowly work through the examples. Feel free to use these as code templates to be implemented in your own environments and accounts for your data processing.

During this quickstart, our hope is that you noticed the repeatable patterns in these scripts which can facilitate an Agile Development Process.

What we have covered

A data pipeline utilizing incremental processing and logical partition definitions
Introduced the concepts of the Raw, Integration and Presentation layers and how each utilizes incremental processing and logical partition definitions.
Walked through code examples used for different processing patterns such as truncate/reload, insert overwrite, merge and delete/insert.

Potential Next Steps

Create a stored procedure for the LINE_ITEM_MARGIN business rules logic
Create streams on the LINE_ITEM raw layer table.
Create a Snowflake Task to execute the LINE_ITEM_MARGIN stored procedure when there is data in the LINE_ITEM stream.

Updated 2025-12-20

This content is provided as is, and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances