Skip to main content

Development

Handle Slowly Changing Dimensions with Pentaho Kettle – Part2

We examine SCD Type I and Type II here.

Overview of Kettle transformation

Here is overview of transformation. It will be loaded from excel source file and loaded into database.

01

Target Table Structure

Here is the target table structure we build for SCD Type II. Search_Terms is business key from source. Because Type II slowly changing dimension tables keep tracking of history, there may be multiple rows for one particular business key, each represents a version that is valid from Start_Date to End_Date.

Note: We added technical keys and historical maintenance fields manually. Kettle will be added for you if those columns are not there.

02

Initial load

Data will loaded into database after clicking run button.

This dimension entry (sock_search_terms_ID = 1) is added automatically to the dimension table when the update firstly run. If you have “NOT NULL” fields in your table, adding this empty row and then the entire step will fail! So make sure that you have a record with the ID field = 0 or 1 in your table if you don’t want PDI to insert a potentially invalid empty record.

03

Incremental Load

We are testing SCD Type I and Type II here. Now we update Search_Popularity to 99999 for search_terms “socks” in source file, and update Search_Index to 6666 for search terms “socks for children” in source file. The type of dimension updated for Search_Popularity is insert, and for Search_Index is update. So the except result is, it will insert a new row with version 2 if changes happened and it will update row directly for Search_Index.

04

Let’s see the magic:

05

Now regarding updates option – the search terms “socks for children” was just updated, while for search terms “socks”, a new version with ID = 250 was added. This explains why the transformation returns IDs 2 and 250. Also we need to note that how the old version is considered to be invalid since the day of execution and the new version is valid from that day on.

Conclusion

We can handle slowly change dimension by Dimension Lookup/Update step with kettle. The steps integrate several options, it implement SCD Type I, Type II and others. It will look up business key in target table first to see if the record exist or not. If exist, it will compare stream fields with target fields. If changes happened, it will update target table with option set in field.

Reference

http://wiki.pentaho.com/display/EAI/Dimension+Lookup-Update

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Follow Us