Updating Type II Slowly Changing Dimensions
In this blog post I will provide an example of how I implement Type II dimensions in my data warehouses.
The process I go through when refreshing data into a Type II dimension is to first land the data from the source system. I always create a separate schema to hold the data, as it is presented from the source system with no transformations. The tables and columns are named exactly as they are in the source system. The basic idea here is to quickly off-load the data from the source system, thereby eliminating any long running or resource intensive processes from impacting the source application while providing a link back to the source system.
Once all of the data is landed, I can now proceed to the transformation and staging step. In this phase, I create staging tables for all of my dimensions. These tables look similar to the dimension table but they do not have a primary key or a majority of audit columns. Here is an example of a regular dimension table and its corresponding staging dimension.
If the staging environment is a full pull of all active customers, I will retrieve the information from the source landing tables and make the appropriate transformations to the data before landing it in the staging table. As you can see from the example above there is a data hash audit column in both the staging and data warehouse dimension. This column is a numeric value based on the hashing of all of the non-audit columns within the dimension. This allows me to use this column to help me quickly find the dimensional changes happening in the source system that I care about in the data warehouse.
Again, assuming a full retrieval of all active customers from the source system, I can quickly find my adds and (soft) deletes by comparing the natural key (CSTMR_ID) and the data hash found in the staging dimension against the natural key (CSTMR_ID) and the data hash of the active record (CSTMR_ACTV_RCRD_FL = 1) in the DW dimension via two simple SQL statements. Below are the examples of the SQL statements for a full refresh.
If I receive only the data that changes from the source system then I would follow the steps for the full list of active customers listed above but now I only need to worry about the records that are currently in the staging dimension. Below is the updated SQL statement for soft deleting a dimension record when the staging tables which contain only the data which has changed. There is no change in the SQL statement for the inserting of records the staging tables which contain only change data captured data.
I hope that you found this article useful and continue to use it as a reference as you implement your data warehouse.