I was recently introduced to SQL Server 2012 and discovered Data Quality Services (DQS); a new feature of SQL Server 2012. I wanted to use this blog as an introduction to DQS, define key terms, and present a simple example of the tool. According to MSDN,
The data-quality solution provided by Data Quality Services (DQS) enables a data steward or IT professional to maintain the quality of their data and ensure that the data is suited for its business usage. DQS is a knowledge-driven solution that provides both computer-assisted and interactive ways to manage the integrity and quality of your data sources. DQS enables you to discover, build, and manage knowledge about your data. You can then use that knowledge to perform data cleansing, matching, and profiling. You can also leverage the cloud-based services of reference data providers in a DQS data-quality project.
(Click on each image to enlarge it.)
The below illustration displays the DQS process:
The installation was a very simple process and with just a few clicks everything was setup. Refer to Pinal Dave’s blog for installation instructions – “Installing Data Quality Services (DQS) on SQL Server 2012.” During the installation, you are prompted to type a password for the database master key. The database master key is required to encrypt the reference data service provider keys that will be stored in the DQS_MAIN database when you set up reference data providers in DQS later. NOTE: if you receive the below error message when trying to connect to DQS, you simply need to make sure the SQL Server (MSSQLSERVER) service is running – “How to: Start an Instance of SQL Server (SQL Server Configuration Manager).”
The installation will create three databases in the SQL Server instance
- DQS_MAIN: contains DQS stored procedures, the DQS engine, and published knowledge bases
- DQS_PROJECTS: contains the data quality project information
- DQS_STAGING_DATA: staging area where you can copy your source data to perform DQS operations, and then export your processed data
Below is a list of the key terms which I encountered during my research
- DQS Server: installed on top of the SQL Server 2012 Database; contains the three databases mentioned above
- Data Quality Client: enables you to perform data quality operations using a standalone tool
- Data Cleansing: the modification, removal, or enrichment of data that is incorrect or incomplete, using both computer-assisted and interactive processes
- In DQS, Data Cleansing includes a computer-assisted process that analyzes how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done
- Knowledge Base: knowledge-driven solution that analyzes data based upon knowledge that you build with DQS. This enables you to create data quality processes that continually enhances the knowledge about your data and in so doing, continually improves the quality of your data
- Data Quality Project: proposes changes to the source data based on the knowledge in the knowledge base
- Domain Management: enables the data steward to interactively change and augment the metadata that is generated by the computer-assisted knowledge discovery activity
- Domain Properties: define and drive the processing that will be applied to the associated values
- Domain Rules: ensures the accuracy of data, ranging from a basic constraint, such as the possible terms that a string value can be, to a more complex regular expression, such as the valid forms of an email address
The below example will demonstrate two fundamental steps in DQS:
- Building a New Knowledge Base
- Creating a New Data Quality Project
Building a New Knowledge Base
We will use an Excel file as the source of our Knowledge Base. Your source can be a table from a SQL Server database or an Excel file.
Click “New Knowledge Base”
Clicking on the icon will bring up the below screen. Here you will enter the name of the new Knowledge Base. In our example, it is DQS_Example.
Clicking Next will bring up the following screen where it will allow you to select the Excel file and let you select the source column(s). I have selected Month as a source column.
Clicking on “Create a domain” will bring up the following screen. I have created the Months domain.
Clicking OK in the Create Domain screen will return you back to the below screen.
Clicking Next will bring you to the below screen where you will begin to analyze the source for knowledge discovery. NOTE: make sure the Domain column is set correctly in the Mappings.
Click Start to begin the analysis. Upon completion of the analysis, the below screen will appear. This screen contains the results of the data profiling.
Clicking Next will generate the below screen. Here you can specify which values from the source to be considered valid values for the domain that you created or correct invalid values. You can also add more values.
Clicking Finish will generate a confirmation page as shown below. Here you can either select No to just save the Knowledge Base or Publish to publish the Knowledge Base so that it can be used by the data quality project for data quality improvement.
We have successfully created and published a Knowledge Base by using the data discovery technique and now that Knowledge Base is available for use in data quality projects for data cleansing.
Creating a New Data Quality Project
We will now use the Knowledge Base that we previously created and create a new Data Quality Project to cleanse the data within an Excel file. Again, your source can be a table from a SQL Server database or an Excel file. Below is the file which contains our data to cleanse. I have highlighted the incorrect values.
Click on “New Data Quality Project” to start DQS Project.
In the below screen, you will specify the Name of the Data Quality Project, a Description, and which Knowledge Base to use for data cleansing. In our example, I have named the project “DQS_Project_Months.” I also selected the Knowledge Base that we previously created, “DQS_Example.” Selecting a Knowledge Base will display the domains in the tree view on the right side. Select Cleansing from the activity section. Finally click Next to create the Data Quality Project.
On the next screen of the Data Quality Project, specify the data source (it could be SQL Server or an Excel file) and map the source column with the Domain from the Knowledge Base and finally click the Next button to move ahead.
Next, click on Start and it will make the cleaning process and shows various results.
Below are the results of our cleansing process. In our example, the process found 3 suggested values. Click Next to proceed.
Below are the details of those suggested values.
Once you click on the “Approve” radio button, the values are then moved to the “Corrected” tab – see below screenshot. Click Next to proceed.
On the next screen DQS shows the summary of all the activities. It also demonstrates how the correction of the quality of the data was performed. You can export the data to a SQL Server Table, CSV file or Excel.
From the above screen, I am exporting the cleansed data to an Excel file. As you can see below all the incorrect and incomplete data has been fixed or cleansed.
In this blog, I demonstrated how you can create a Knowledge Base and complete a data cleansing project from a source using the interactive mode in the Data Quality Client tool. In my next blog, I will show how you can use a SSIS transformation component to complete data cleansing in a batch/automated mode.
Questions to the readers:
- Are you using this tool? If so, how is it implemented and what is the business case?
- Does anyone have any thoughts on some potential future uses?