In Part 1 of this blog I gave a theoretical overview of how to deal with slowly changing dimensions as well as some of the general advantages and disadvantages of each technique.
Moving away from the theoretical overview and more towards the actual practice of the theories, my first attempt to handle slowly changing dimensions was within SQL Server Integration Services (SSIS) using Business Intelligence Development Studio (BIDS). Microsoft developed a Data Flow Transformation to handle both Type 1 and Type 2 slowly changing dimensions called the Slowly Changing Dimension Transformation.
While attempting to use the Slowly Changing Dimension Transformation I quickly began to realize that there were a few limitations. Although the Wizard was very easy to use and configure, it only supports connections to SQL Server. If a connection to a database other than a SQL Server database is needed, forget it. Of course temporary tables or additional staging tables could be setup in a SQL Server database, but this adds another step within the ETL and is not always an option.
Furthermore, as I began to read more about the Slowly Changing Dimension Data Flow Transformation I found out that the transformation performs row by row comparisons between the source data and the target dimension table. On top of that, when a row has been identified as an update, it is also individually updated within the target dimension table on a row by row basis. All of the performance advantages of set based processing and fast loading cannot be leveraged using the Slowly Changing Dimension Transformation.
When using the Slowly Changing Dimension Transformation I often found my SSIS packages hanging or performing at levels well beyond what would be considered acceptable. This could have partially been due to the fact that the transformation queries the dimension table to look for matches, inserts new records, and updates records all at the same time, potentially locking itself out of the target dimension table!
Since the Slowly Changing Dimension Transformation was not providing the performance needed in order to perform the Type 2 slowly changing dimension updates I began searching for other answers. I knew that any solution that I might find would be required to perform the following tasks within the SSIS package:
- Identify the difference between New and Existing Records
- Differentiate between Existing records that contain changes and Existing records that did not change
- Identify records that no longer exist
- Insert, update, and delete records using record sets rather than using row by row based processing
After a lot of surfing I finally came across an interesting design tip from the Kimball Group, Design Tip #107. In short, the article explains how to use the SQL Server 2008 Merge command to imitate the behaviour of the Slowly Changing Dimension Data Flow Transformation for both Type 1 and Type 2 slowly changing dimensions.
The Merge command can be used to insert, update, and delete records within the dimension table based on the results of a join to the source table. Not only can the Merge command offer all of the previously mentioned functionality, but it can also perform all of those tasks on record sets. This ability to update entire sets of data provides the potential for drastically increased performance. For a great example and code showing how to use the new Merge command please visit the Kimball Group design tip mentioned above.
Conclusion
When trying to handle slowly changing dimensions the Slowly Changing Dimension Transformation within SSIS has some limitations. The inability of the transformation to analyze and update sets of data or use the SQL Server fast load can cause huge performance issues on large and / or dynamic dimensions. The new SQL Server 2008 Merge command offers a new method to handle slowly changing dimensions that can perform multiple actions (i.e. insert, update, and delete) using a single pass through the data set rather than using separate row by row inserts and updates.