Overview:
Although quite prevalent within the world of Business Intelligence, certain situations can still make handling slowly changing dimensions quite challenging. Just recently I had been tasked to deal with slowly changing dimensions in order to allow data within a dimension table to properly update while making sure to retain the history of the changes within the table.
For me the main challenge with handling the slowly changing dimension in this case was performance. No matter what method I tried I was not getting the performance I needed in order to complete all of the updates in a timely fashion within the SQL Server Integration Services (SSIS) ETL package.
In order to shed light on some of my findings I have developed this two part blog in the hopes that it may help someone else with a similar issue or scenario.
Part 1 of this blog offers a general overview of the two most popular types of slowly changing dimensions, Type 1 and Type 2. Part 2 of the blog will go into further detail and discuss one of my failed attempts at tackling slowly changing dimensions as well as cover a new method made available within SQL Server 2008.
Type 1 Slowly Changing Dimension:
In a Type 1 slowly changing dimension a change to the data simply overwrites the original information. Essentially what this means is that no history is kept within the dimension table.
In an oversimplified case, let’s assume we have the following dimension table. The table is composed of a surrogate key (Employee_ID), a business key (Employee_Num), the employee name (Employee_Name), and the employee state of residency (Employee_State).
Employee_ID |
Employee_Num |
Employee_Name |
Employee_State |
001 |
123456 |
Michael Burger |
IL |
Now, let’s assume that I (Michael Burger) win the lottery and move to California to avoid the cold winters in Illinois. By using the logic for a Type 1 slowly changing dimension the previous record would be replaced with the record shown below.
Employee_ID |
Employee_Num |
Employee_Name |
Employee_State |
001 |
123456 |
Michael Burger |
CA |
If anyone were to view this record within the table after the change had occurred they would be unable to tell that I had previously lived in Illinois. This is one of the key disadvantages to the Type 1 slowly changing dimension.
Type 1 Disadvantage:
The disadvantage to the Type 1 slowly changing dimension is that all history within the dimension is lost whenever data changes and is updated. It is impossible to perform analysis on historical information.
Suppose a manager wanted to add all of the sales of each employee, then aggregate the sales by the Employee_State. Without the history of my move from Illinois to California, all sales would inaccurately be allocated to California even though some of my sales did in fact occur when I lived in Illinois!
Type 1 Advantage:
If there is no need to record history, the Type 1 slowly changing dimension is superior to other options due to its simplicity. All data can simply update existing records without having to put anymore extra thought into the ETL process and development.
Type 2 Slowly Changing Dimension:
Unlike the Type 1 slowly changing dimension, in a Type 2 slowly changing dimension a new record is added to the table instead of replacing the old data. Since a new record is added instead of replacing the old record, both the old and the new record will always exist within the table. In the table below, the surrogate key (Employee_ID), the Employee_State, and the Status are updated in the new record.
Employee_ID |
Employee_Num |
Employee_Name |
Employee_State |
Status |
001 |
123456 |
Michael Burger |
IL |
0 |
002 |
123456 |
Michael Burger |
CA |
1 |
It is important to point out that without the surrogate key (Employee_ID), adding a new record for the same employee within the dimension table would not be possible. Prior to the addition of the surrogate key the primary key for the table was the business key, shown as Employee_Num. When Employee_Num was the primary key, a new row could not have been added since the key "123456" would need to be duplicated in order to allow the table to create a new record to maintain history. The surrogate key allows us to maintain the history within the table, while still allowing the business key to be used for analysis and reporting purposes.
The Status in the table is simply updated to show which record is in fact the "Active" record.
Another similar way to handle the Type 2 slowly changing dimension would be to include timestamps within the dimension table. In the table below a new record would be added with a new surrogate key, Employee_State, and Start_Date. The old record would have the End_Date updated with the date for which the record became "Inactive". Any record with a NULL End_Date would be considered "Active".
Employee_ID |
Employee_Num |
Employee_Name |
Employee_State |
Start_Date |
End_Date |
001 |
123456 |
Michael Burger |
IL |
28-Aug-2008 |
15-Dec-2008 |
002 |
123456 |
Michael Burger |
CA |
16-Dec-2008 |
NULL |
Type 2 Disadvantage:
In large data sets with dynamic data the Type 2 slowly changing dimension could potentially cause the size of the table to grow fast. The Type 2 slowly changing dimension is hard to make a case for when storage and performance are more important than keeping history. The Type 2 slowly changing dimension also adds some complexity to the ETL process.
Type 2 Advantage:
The main advantage to the Type 2 slowly changing dimension is the ability to maintain the historical information of the changes within the dimension table. History of the changes allows users to perform analysis on the history, as well as gives the system the ability to use the timestamp to roll back changes if unnecessary updates were to occur.
Conclusion:
In conclusion, both Type 1 and Type 2 slowly changing dimensions can be useful in the proper scenario. It is important to determine if it is necessary to keep track of the changes that occur within the dimension table. If history is not needed, then Type 1 is definitely the way to go. If there is a potential to need a history of all the changes within the dimension table then Type 2 would be the best bet. However, since that is somewhat of a generality, keep in mind that the choice can always be different for each client, business process, or business case involved.