RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse
Jan 10, 2017
Over the years, I have had numerous conversations about the value of having referential integrity (RI) constraints, such as primary and foreign keys, in a relational data warehouse or data mart.
Many DBAs object that RI constraints slow the load process. This is a valid point if you are talking about enforced constraints that are checked in real time during the load. But this is not an issue if you define the constraints as disabled.
Which then leads to this common question:
Is there any reason to maintain a permanently disabled FK in the data model? If it is not going to be enabled, then from my perspective, it doesn’t make any sense to define the FK. Instead, the relationship can be described in the comment of the child column.
So, why would I want RI constraints in my data warehouse?
Mostly it has to do with good design and development best practices. Here is my rationale for why you should consider including RI constraints in your data warehouse design.
#1 – Design Metadata
RI constraints are valuable metadata/documentation. If somebody reverse engineers the database (say with ERWin or Oracle Data Modeler), the PKs and FKs show up in the diagram (much better than having to read a column comment to discover a relationship). This is quite valuable for new people on your project to orient themselves to the existing schema design and understand how the various tables in your data warehouse are related to each other.
#2 – BI Metadata
If you want to use any sort of reporting or BI tool against the database (it is a data warehouse, after all), most modern business intelligence and visualization tools import the foreign key definitions with the tables and build the proper join conditions. This is much better than having someone guess what the join will be and then manually adding it to the metadata layer in the reporting tool. This also ensures that different developers don’t interpret the joins differently.
Examples of tools that can read the Snowflake data dictionary include Looker, Tableau, COGNOS, MicroStrategy, and many others. Some of these tools actually use the FK definitions for join culling to provide better query performance.
#3 – QA your ETL/ELT code
I know you think your ETL code is perfect.
But does every developer test to the same standards? Do you maybe have a QA team who separately validates that the ETL is doing what you expect?
If so, having declared primary, unique, and foreign key constraints in your data warehouse gives the team more information they can use to ensure the quality of the data. In fact, using the Snowflake Information Schema, a QA engineer can potentially generate SQL to test that the loaded data conforms to the defined constraints.
Defining RI Constraints in Snowflake
You, of course, can (and IMHO should) define RI constraints in Snowflake. You can define primary keys, unique keys, foreign keys, and NOT NULL constraints. Because Snowflake is specifically engineered for data warehousing, only the NOT NULL constraints are enforced. The rest are always created as disabled.
The syntax is standard SQL. You can define the constraints both inline and out-of-line.
Here is a simple example of inline constraints:
Create or replace TABLE SAT_REGIONS ( HUB_REGION_KEY NUMBER(38,0) NOT NULL, SAT_LOAD_DTS DATE NOT NULL, REGION_COMMENT VARCHAR(152), HASH_DIFF VARCHAR(32) NOT NULL, SAT_REC_SRC VARCHAR(50) NOT NULL, constraint SAT_REGIONS_PK primary key (HUB_REGION_KEY, SAT_LOAD_DTS), constraint SAT_REGIONS_FK1 foreign key (HUB_REGION_KEY) references KENT_DB.DATAVAULT_MAIN.HUB_REGION(HUB_REGION_KEY) );
Example of an out-of-line constraint:
ALTER TABLE SAT_REGIONS ADD constraint SAT_REGIONS_PK primary key (HUB_REGION_KEY, SAT_LOAD_DTS);
It’s that easy. So why don’t you have constraints in your data warehouse?
Did you hear? Snowflake was declared the #1 Cloud Data Warehouse in a recent GigaOM analyst report. Go here to get your copy of the report.