Handle Slowly Changing Dimensions with Pentaho Kettle – Part1 - Perficient Blogs
Blog
  • Topics
  • Industries
  • Partners

Explore

Topics

Industries

Partners

Handle Slowly Changing Dimensions with Pentaho Kettle – Part1

In this blog we will talk about how to implement various types of slowly changing dimensions (SCDs) with Kettle in details. And I will introduce the examination of SCD Type I and Type II in Part2.

Types of Slowly Changing Dimensions

Following Kimball, we distinguish two main types of slowly changing dimensions: Type I, Type II:

  • Type I: Changes in the source system result in corresponding overwriting in the target dimension.
  • Type II: Changes in the source system result in inserting tamped versions of dimension rows multiple times in the target dimension.. This allows you to find whichever version of the dimension rows at any given point timely.

Configuration for Dimension Lookup/Update step

The Dimension Lookup/Update step allows you to implement Ralph Kimball’s slowly changing dimension for both Type I (update) and Type II (insert) together with some additional functions. You can use this step not only to update a dimension table, but may also to look up values in a dimension.

1.General Configuration

Connection, Target schema, Target table, and Commit size properties. Note that the “Commit size” is only applicable in the lookup mode.

2.Properties we should set for Dimension Lookup/Update step

Option Description
Technical key Technical key field is surrogate key in the dimension table.

Note: The use auto increment field option means you can choose it if your database supports auto_increment or IDENTITY columns and auto increment have been set for “Technical key field”.

Version field Record the version of the dimension entry (a revision number). E.g. 1, 2, 3, etc.
Start of date range Specify start date of version. The default mini year is 1900.
End of date range Specify end date of version. The default max year is 2199.
Keys Specify the names of the business keys in the stream and in the dimension table.

Note: Those are used to map the business key of the dimension table to the stream. It will specify if the row exist in target or not. If not, it will insert the new record.

Lookup / Update Fields Specify whether you want the values to be updated or you want to have the values inserted into the dimension as a new version.

Note: The column would be NULL if you haven’t put it into list, put the fields you must have in the dimension.

 

01

3.Lookup / Update Fields

In the Fields tab page, you can specify the mapping between the columns of the dimension table and the fields from the incoming stream. The values of these stream fields are compared to the values retrieved from the corresponding database columns. If the value of specify columns can’t match, it will perform the action set in “Type of dimension update” options.

The available options are:

  • Insert: This option implements a Type II slowly changing dimension policy. If the difference is detected for one or more mappings that have the Insert option, then a row is added to the dimension table.
  • Update: This option simply updates the matched row. It can be used to implement a Type I slowly changing dimension.
  • Punch through: The punch through option also performs an update. But instead of only updating the matched dimension row, it will update all versions of the row in a Type II slowly changing dimension.
  • Date of last insert or update (without stream field as source): Use this option to let the step automatically maintain a date field that records the date of the insert or update using the system date field as source.
  • Date of last insert (without stream field as source): Use this option to let the step automatically maintain a date field that records the date of the last insert using the system date field as source.
  • Date of last update (without stream field as source): Use this option to let the step automatically maintain a date field that records the date of the last update using the system date field as source.
  • Last version (without stream field as source): Use this option to let the step automatically maintain a flag that indicates if the row is the last version.

02

 

 

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Subscribe to the Weekly Blog Digest:

Sign Up