Overview
One issue that I have run into in every BI project is data quality and data cleansing. Oftentimes data cleansing can be excluded from the scope of a BI project since the client is often afraid to even tackle the problem. On top of that, data cleansing can often be a project all by itself. There are a ton of Master Data Management (MDM) software solutions on the market, and recently with the release of SQL Server 2008 R2, Microsoft has joined the space with the release of Master Data Services (MDS).
On one of my recent projects we decided to go with MDS as a backend for MDM, however, the client wanted the management of the data to be more intuitive and easy for the users without having to learn the new functionality and interface of MDS. For this reason, and since all the users had experience using SharePoint, we developed a way to use SQL Server Integration Services (SSIS) and SharePoint (SharePoint 2010 in this case but the solution would work with 2007 as well). By using SharePoint we can also leverage some out of the box functionality such as security, email alerts, and content approval (these will be shown later).
The key to this solution is that the business users manage the data that feeds that data warehouse, not IT. Oftentimes the decision of how to fix a data quality issue falls on the business. They use the data on a day to day basis so they should be an integral part in deciding how the data should be fixed. Below is an overview of the solution that we came up with.
Don’t feel like reading? Check out the webinar recording!
Solution Setup
In the screenshot below is a very simple example of an SSIS package that could be used to load a data warehouse. The package loads the lookup tables used to load the fact tables, loads the dimension tables, and then loads the facts. In reality, this package is over simplified. However, for the purposes of this blog it helps convey how SSIS and SharePoint can be used to manage data in a BI solution.
The goal of creating any MDM solution is to come up with “one version of the truth”. Using a well-built BI solution everyone should walk into meetings with the same numbers, the same list of products, the same sales regions, the same list of customers, etc. In this blog I am going to use a SharePoint list to make sure all the data that feeds into the data warehouse conforms to a list of preapproved country codes.
Using countries is a good example since a very accessible list of standard country codes can quickly be found by using a standard list of ISO Country Codes. As shown below, I have simply imported a list of ISO Countries, their codes, and other attributes into a SharePoint list. For the purposes of this blog this REF ISO Country list will be our “Master” country list. Any country code from a source system that does not exist in this “Master” list should not load into the data warehouse since it is considered bad data.
Let’s say a source system in fact does have valid country codes that don’t conform exactly to the ISO list shown above. Well we can handle this situation using a second SharePoint list. In the screenshot below we have a Country Mappings list. This list helps identify the source system (using the DataSourceName column), and allows us to map values from that source system to a preapproved value from the REF ISO Country (the MapToCountry column in the Country Mapping list below is the Lookup Value column from the REF ISO Country list). It is important to point out that both the REF ISO Country list and the Country Mappings list are setup with content approval. Confused? Keep reading and I’ll explain this in more detail.
Before explaining how these lists are used to manage the data, let me first explain how the data from these lists is used. In the data warehouse there are two tables, one for each SharePoint list. The lkpCountry table is used to store the mappings from the Country Mappings list, and the dimCountry table is used to store the preapproved country values from the REF ISO country list. In the screenshot below you can see both tables (empty for now until we run the ETL).
In order to demonstrate how these lists are used to map data while entering the data warehouse I have also created a simple set of sample source data. This data is shown below and is mocked up to represent some data that you would find in a typical source system, in our case a SQL Server database. This example use fake sales data, relevant dates, the country of sale, and associated quantities and amounts. The data set for this example is small, only 15 rows.
Solution Demonstration
Now that I have covered all of the setup, we can now start loading data into the data warehouse. In order to easily connect SSIS to the SharePoint lists I first downloaded and installed the SharePoint List Source and Destination. This needs to be installed on the server with SSIS and BIDS. I have used this on multiple projects and this is by far the easiest way I have ever found to read and write data to a SharePoint list. This SSIS add in can be found at the following site:
http://sqlsrvintegrationsrv.codeplex.com/releases/view/17652
In the screenshot below, the SharePoint List Source is the “SP_SRC_Country_Mappings” task. As you can see, I have added a conditional split to the Data Flow. This conditional split only allows “Approved” items from the SharePoint list to load into our data warehouse tables. This is important since this allows us to setup workflows on the SharePoint lists so that items must be approved before they are valid candidates to be used in the data warehouse. This will help ensure that the quality of data that loads into the data warehouse is maintained over time.
Similarly to the screenshot above for the lkpCountry table, I have added the same logic for the dimCountry table as shown below.
After running the ETL, we can now see the result below. The four records from the Country Mappings list are now loaded into the lkpCountry table, and countries from the REF ISO Country list are shown in the dimCountry table (only a sample of all data is shown for this).
Additionally, the data was loaded from the stgfactSales table into the factSales table. In a production system the factSales table would serve as the final data table that would provide the data for an SSAS cube. As you can see in the screenshot below, the table almost looks identical to the stgfactSales table with one exception; the CountryCode from the staging table has been replaced with the CountryID in the fact table. The CountryID is simply the surrogate key from the dimCountry table. It is important to point out that all of the country codes from the staging tables were codes that existed in our Country Mapping SharePoint list. For this reason, all 15 records loaded successfully into the fact table.
Well what happens if we see a record come through from the source with a country code that doesn’t exist in our Country Mapping SharePoint list or lkpCountry table? In the screenshot below you can see that I have now added an additional record to the staging table with a country code of “CAN”. “CAN” is also a value that doesn’t exist in our Country Mapping SharePoint list.
I will now run the ETL one more time without changing anything else. As we can see in the screenshot below, the “CAN” record did not load into our fact table! Rather, the record fell out and landed into a falloutsfactSales table. I would also like to note that each time the ETL runs the data is pulled from the source system as well as the fallout table. This allows us to pull deltas (only new transactions) from the source and attempt to load any fallout records that had previously failed to load into the fact table. The falloutsfactSales table is identical to the staging table with the exception of the DataSourceName and FalloutReason columns. These columns simply give us more visibility into where the record came from and why it fell out during the ETL process.
Well how did this work? As you can see in the screenshot below there is a lookup task for Country Code. This takes the data source and source system value, and looks for a match in our lkpCountry table (the table populated from our Country Mapping SharePoint list). Since a combination didn’t exist for a data source of “SQL Server” and a Country Code of “CAN”, the lookup has no match.
This causes the record to flow to the “Lookup No Match Output”. Here is where the DataSourceName and FalloutReason are assigned, and the record is loaded into the falloutfactSales table instead of the factSales table.
For records that have a match during the Country Code lookup, they flow to the “Lookup Match Output”. Additionally another lookup is performed to assign the record with a CountryID from the dimCountry table before it is loaded into the factSales table.
Not only did the record fallout into the falloutsfactSales table, but it was also loaded into the Country Mapping SharePoint list! In the screenshot below you can see that a record was added for the DataSourceName of “SQL Server” and the SourceSystemValue of “CAN”. The MapToCountry column is empty since a user has not managed this fallout, and the Approval Status is “Pending”. Since the status is “Pending” the record will not be loaded into the data warehouse until someone manages this fallout.
The record was loaded into the SharePoint list with the Data Flow shown below. All records in the fallout table are compared to the records that already exist in the Country Mappings list using a Left outer join to avoid loading duplicate records into the SharePoint list.
Note: An alternative to avoid the left outer join would be to pull “Pending” records into the lkpCountry table and just exclude those records from any lookups that were performed during the data warehouse load.
The last step of the Data Flow is the SharePoint List Destination task which loads the record into the SharePoint list.
Well this is all great, but how will the business user responsible for managing this list know to go map the newly added list item? This can easily be taken care of with some simple out of the box SharePoint functionality, alerts. If the end user is setup to receive an alert each time a new record is added to the list, they will receive and email similar to the one shown below.
The user then goes to the Country Mapping list, assigns a MapToCountry, and approves the list item.
The next time the ETL runs, we can now see the mapping for “CAN”, or Canada in the lkpCountry table.
Since the mapping now exists in the lkpCountry table, we can also see that the factSales table now has all 16 records! The valid mapping managed by the user now allows the record to successfully load into the fact table.
Final Thoughts
I have often been asked, “What if a new country is created, how would a user map to it if it didn’t exist in the REF ISO Country list?” Simply put, the user would add the country to the REF ISO Country list, approve it, and then map it in the Country Mapping list. That simple!
The last point I would like to make is that as an alternative to having the record fallout and sit in a fallout table until the mapping is added, the record could be allowed to load into the fact table as an inferred member. Later when the mapping is added the fact table record could be updated with a CountryID which mapped to the proper country in the dimCountry table.
One issue that I have run into in every BI project is data quality and data cleansing. Oftentimes data cleansing can be excluded from the scope of a BI project since the client is often afraid to even tackle the problem. On top of that, data cleansing can often be a project all by itself. There are a ton of Master Data Management (MDM) software solutions on the market, and recently with the release of SQL Server 2008 R2, Microsoft has joined the space with the release of Master Data Services (MDS).
On one of my recent projects we decided to go with MDS as a backend for MDM, however, the client wanted the management of the data to be more intuitive and easy for the users without having to learn the new functionality and interface of MDS. For this reason, and since all the users had experience using SharePoint, we developed a way to use SQL Server Integration Services (SSIS) and SharePoint (SharePoint 2010 in this case but the solution would work with 2007 as well). By using SharePoint we can also leverage some out of the box functionality such as security, email alerts, and content approval (these will be shown later).
The key to this solution is that the business users manage the data that feeds that data warehouse, not IT. Oftentimes the decision of how to fix a data quality issue falls on the business. They use the data on a day to day basis so they should be an integral part in deciding how the data should be fixed. Below is an overview of the solution that we came up with.
Don’t feel like reading? Check out the webinar recording!
Solution Setup
In the screenshot below is a very simple example of an SSIS package that could be used to load a data warehouse. The package loads the lookup tables used to load the fact tables, loads the dimension tables, and then loads the facts. In reality, this package is over simplified. However, for the purposes of this blog it helps convey how SSIS and SharePoint can be used to manage data in a BI solution.
The goal of creating any MDM solution is to come up with “one version of the truth”. Using a well-built BI solution everyone should walk into meetings with the same numbers, the same list of products, the same sales regions, the same list of customers, etc. In this blog I am going to use a SharePoint list to make sure all the data that feeds into the data warehouse conforms to a list of preapproved country codes.
Using countries is a good example since a very accessible list of standard country codes can quickly be found by using a standard list of ISO Country Codes. As shown below, I have simply imported a list of ISO Countries, their codes, and other attributes into a SharePoint list. For the purposes of this blog this REF ISO Country list will be our “Master” country list. Any country code from a source system that does not exist in this “Master” list should not load into the data warehouse since it is considered bad data.
Let’s say a source system in fact does have valid country codes that don’t conform exactly to the ISO list shown above. Well we can handle this situation using a second SharePoint list. In the screenshot below we have a Country Mappings list. This list helps identify the source system (using the DataSourceName column), and allows us to map values from that source system to a preapproved value from the REF ISO Country (the MapToCountry column in the Country Mapping list below is the Lookup Value column from the REF ISO Country list). It is important to point out that both the REF ISO Country list and the Country Mappings list are setup with content approval. Confused? Keep reading and I’ll explain this in more detail.
Before explaining how these lists are used to manage the data, let me first explain how the data from these lists is used. In the data warehouse there are two tables, one for each SharePoint list. The lkpCountry table is used to store the mappings from the Country Mappings list, and the dimCountry table is used to store the preapproved country values from the REF ISO country list. In the screenshot below you can see both tables (empty for now until we run the ETL).
In order to demonstrate how these lists are used to map data while entering the data warehouse I have also created a simple set of sample source data. This data is shown below and is mocked up to represent some data that you would find in a typical source system, in our case a SQL Server database. This example use fake sales data, relevant dates, the country of sale, and associated quantities and amounts. The data set for this example is small, only 15 rows.
Solution Demonstration
Now that I have covered all of the setup, we can now start loading data into the data warehouse. In order to easily connect SSIS to the SharePoint lists I first downloaded and installed the SharePoint List Source and Destination. This needs to be installed on the server with SSIS and BIDS. I have used this on multiple projects and this is by far the easiest way I have ever found to read and write data to a SharePoint list. This SSIS add in can be found at the following site:
http://sqlsrvintegrationsrv.codeplex.com/releases/view/17652
In the screenshot below, the SharePoint List Source is the “SP_SRC_Country_Mappings” task. As you can see, I have added a conditional split to the Data Flow. This conditional split only allows “Approved” items from the SharePoint list to load into our data warehouse tables. This is important since this allows us to setup workflows on the SharePoint lists so that items must be approved before they are valid candidates to be used in the data warehouse. This will help ensure that the quality of data that loads into the data warehouse is maintained over time.
Similarly to the screenshot above for the lkpCountry table, I have added the same logic for the dimCountry table as shown below.
After running the ETL, we can now see the result below. The four records from the Country Mappings list are now loaded into the lkpCountry table, and countries from the REF ISO Country list are shown in the dimCountry table (only a sample of all data is shown for this).
Additionally, the data was loaded from the stgfactSales table into the factSales table. In a production system the factSales table would serve as the final data table that would provide the data for an SSAS cube. As you can see in the screenshot below, the table almost looks identical to the stgfactSales table with one exception; the CountryCode from the staging table has been replaced with the CountryID in the fact table. The CountryID is simply the surrogate key from the dimCountry table. It is important to point out that all of the country codes from the staging tables were codes that existed in our Country Mapping SharePoint list. For this reason, all 15 records loaded successfully into the fact table.
Well what happens if we see a record come through from the source with a country code that doesn’t exist in our Country Mapping SharePoint list or lkpCountry table? In the screenshot below you can see that I have now added an additional record to the staging table with a country code of “CAN”. “CAN” is also a value that doesn’t exist in our Country Mapping SharePoint list.
I will now run the ETL one more time without changing anything else. As we can see in the screenshot below, the “CAN” record did not load into our fact table! Rather, the record fell out and landed into a falloutsfactSales table. I would also like to note that each time the ETL runs the data is pulled from the source system as well as the fallout table. This allows us to pull deltas (only new transactions) from the source and attempt to load any fallout records that had previously failed to load into the fact table. The falloutsfactSales table is identical to the staging table with the exception of the DataSourceName and FalloutReason columns. These columns simply give us more visibility into where the record came from and why it fell out during the ETL process.
Well how did this work? As you can see in the screenshot below there is a lookup task for Country Code. This takes the data source and source system value, and looks for a match in our lkpCountry table (the table populated from our Country Mapping SharePoint list). Since a combination didn’t exist for a data source of “SQL Server” and a Country Code of “CAN”, the lookup has no match.
This causes the record to flow to the “Lookup No Match Output”. Here is where the DataSourceName and FalloutReason are assigned, and the record is loaded into the falloutfactSales table instead of the factSales table.
For records that have a match during the Country Code lookup, they flow to the “Lookup Match Output”. Additionally another lookup is performed to assign the record with a CountryID from the dimCountry table before it is loaded into the factSales table.
Not only did the record fallout into the falloutsfactSales table, but it was also loaded into the Country Mapping SharePoint list! In the screenshot below you can see that a record was added for the DataSourceName of “SQL Server” and the SourceSystemValue of “CAN”. The MapToCountry column is empty since a user has not managed this fallout, and the Approval Status is “Pending”. Since the status is “Pending” the record will not be loaded into the data warehouse until someone manages this fallout.
The record was loaded into the SharePoint list with the Data Flow shown below. All records in the fallout table are compared to the records that already exist in the Country Mappings list using a Left outer join to avoid loading duplicate records into the SharePoint list.
Note: An alternative to avoid the left outer join would be to pull “Pending” records into the lkpCountry table and just exclude those records from any lookups that were performed during the data warehouse load.
The last step of the Data Flow is the SharePoint List Destination task which loads the record into the SharePoint list.
Well this is all great, but how will the business user responsible for managing this list know to go map the newly added list item? This can easily be taken care of with some simple out of the box SharePoint functionality, alerts. If the end user is setup to receive an alert each time a new record is added to the list, they will receive and email similar to the one shown below.
The user then goes to the Country Mapping list, assigns a MapToCountry, and approves the list item.
The next time the ETL runs, we can now see the mapping for “CAN”, or Canada in the lkpCountry table.
Since the mapping now exists in the lkpCountry table, we can also see that the factSales table now has all 16 records! The valid mapping managed by the user now allows the record to successfully load into the fact table.
Final Thoughts
I have often been asked, “What if a new country is created, how would a user map to it if it didn’t exist in the REF ISO Country list?” Simply put, the user would add the country to the REF ISO Country list, approve it, and then map it in the Country Mapping list. That simple!
The last point I would like to make is that as an alternative to having the record fallout and sit in a fallout table until the mapping is added, the record could be allowed to load into the fact table as an inferred member. Later when the mapping is added the fact table record could be updated with a CountryID which mapped to the proper country in the dimCountry table.