Recently I listened in on a webinar on “Best Practices in Ensuring Data Quality” and I kept thinking about all the data quality projects I have been on. One thing that stood out was that my previous and current clients have all had different standards for their data quality needs. I have had clients who needed their data so clean that you would not be able to find duplicate vendors, fat-finger mistakes, misspellings and so on. But I have also had clients who were more relaxed about their data quality and only wanted to ensure a few data attributes were cleaned and up to par for their business needs. Each scenario has its own challenges and commitments, which need to be laid out up front. In today’s blog post, I want to discuss the steps that will help ensure you meet your client’s expectations for their data quality needs.
Before you can get started, you first need to understand, and explain to your client, the main types of activities/phases that are part of a data quality initiative. These are the following:
- Data Discovery
- Data Profiling
- Data Quality
- Data Matching
- Data Requirements
- Data Cleansing/Correction
- Pre-Data Validation
- Data Upload/Updates
- Data Validation
Data Discovery: This is usually the simplest and easiest part of the data initiative. During the data discovery phase you want to identify the data attributes that have data concerns and data issues. Now 99% of the time this is already known by the business, as otherwise they wouldn’t have called you in for the data quality project. Here your client should be able to list out all the data problems they are seeing within the current system.
Data Profiling: Here you will analyze the source system data, validating the data pain points that have already been identified and uncovering any additional data issues.
Data Quality: During the data profiling phase you will also need to start the data quality phase. Here you evaluate the major issues with the data against the characteristics that make data appropriate for a specific use: completeness, validity, consistency, timeliness and accuracy.[1]
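To make two of those characteristics concrete, here is a minimal sketch of how completeness and validity could be measured for a single column. The sample rows, the column names and the five-digit ZIP pattern are my own assumptions for illustration, not something from a real client system.

```python
import re

# Hypothetical sample rows; real data would come from the source system.
rows = [
    {"vendor": "Acme Corp", "zip": "30301"},
    {"vendor": "", "zip": "3030"},
    {"vendor": "Globex", "zip": None},
]

def _is_filled(value):
    """True when the value is neither missing nor blank."""
    return bool(str(value if value is not None else "").strip())

def completeness(rows, column):
    """Share of rows where the column is present and non-blank."""
    filled = sum(1 for r in rows if _is_filled(r.get(column)))
    return filled / len(rows)

def validity(rows, column, pattern):
    """Share of non-blank values that fully match the given regex."""
    values = [str(r[column]).strip() for r in rows if _is_filled(r.get(column))]
    if not values:
        return 0.0
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

print(completeness(rows, "vendor"))       # share of rows with a vendor name
print(validity(rows, "zip", r"\d{5}"))    # share of ZIP values with 5 digits
```

Numbers like these give you something specific to show the client when you talk about which attributes need work.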
Data Matching: This describes efforts to compare two sets of collected data. It can be done in many different ways, but the process is often based on algorithms or programmed loops, where processors perform sequential analyses of each individual piece of a data set, matching it against each individual piece of another data set, or comparing complex variables like strings for particular similarities.[2]
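As a rough sketch of the "programmed loop" idea, the snippet below compares every name in one list against every name in another and keeps the pairs whose strings are similar enough. It uses Python's standard difflib module; the normalization steps, the 0.85 threshold and the sample vendor names are assumptions I picked for illustration, and a real project would tune them with the business.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop periods and commas, and collapse extra spaces."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_records(left, right, threshold=0.85):
    """Compare each item in left against each item in right."""
    matches = []
    for a in left:
        for b in right:
            score = similarity(a, b)
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return matches

# Hypothetical vendor names from two systems.
print(match_records(["Acme Corp.", "Globex Inc"], ["ACME Corp", "Initech"]))
```

Comparing every pair is fine for small lists but gets slow quickly, so larger data sets usually need some way to narrow down the candidates first.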
Data Requirements: Here you will need to gather the business rules for transforming the dirty data back into its cleaned form. Some things, such as names or addresses, will need to be updated manually. In addition, it is important to understand any data regulations that apply to the data elements a client would like to have cleansed. Note: Due to the many compliance rules and regulations, changing data elements is not as easy as 1, 2, 3. You will need to make sure you are approved to update any data element and that you follow any necessary compliance rules and regulations.
Data Cleansing/Correction: Here you will apply the data requirement rules that will correct and clean the dirty data.
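One way to picture this is to write each business rule as a small function tied to a column and run every record through them. The rules and column names below are hypothetical examples of my own; the actual rules come out of the data requirements phase.

```python
# Hypothetical rules: each maps a column name to a function that cleans one value.
RULES = {
    "vendor": lambda v: " ".join(v.split()).title(),  # fix spacing and casing
    "state": lambda v: v.strip().upper(),             # standardize state codes
}

def cleanse(record, rules=RULES):
    """Return a cleaned copy of the record; the original is left untouched."""
    cleaned = dict(record)
    for column, rule in rules.items():
        value = cleaned.get(column)
        if isinstance(value, str):
            cleaned[column] = rule(value)
    return cleaned

dirty = {"id": 7, "vendor": "  acme   corp ", "state": " ga"}
print(cleanse(dirty))
```

Returning a copy instead of changing the record in place keeps the original values around, which is handy when the business wants to compare before and after.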
Pre-Data Validation: Once the data has been corrected you will need the business to validate and sign off on the data changes. I always do a pre-data check by outputting the data to a flat file so the business can pre-validate the changes, before and after.
Data Upload/Updates: Once you have received sign-off on your pre-data validation you will then need to update the source data within the target system.
Data Validation: Once the data has been updated and cleansed, you will need to have the business validate the changes one last time for final approval.
Following these steps and bringing your client up to speed on the data-cleansing activities will help ensure a successful data-cleansing initiative.
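A before-and-after flat file can be as simple as a CSV with one row per changed value. The sketch below is one possible layout, not a standard; the function name, the column headers and the sample records are assumptions of mine, and it expects the two lists to be in the same order.

```python
import csv
import io

def write_change_report(before, after, key, out):
    """Write one CSV row per changed value and return the number of changes."""
    writer = csv.writer(out)
    writer.writerow([key, "column", "before", "after"])
    changes = 0
    for old, new in zip(before, after):
        for column in old:
            if old[column] != new.get(column):
                writer.writerow([old[key], column, old[column], new.get(column)])
                changes += 1
    return changes

# Hypothetical records before and after cleansing.
before = [{"id": 1, "vendor": "acme corp"}, {"id": 2, "vendor": "Globex"}]
after = [{"id": 1, "vendor": "Acme Corp"}, {"id": 2, "vendor": "Globex"}]

report = io.StringIO()  # swap in open("changes.csv", "w", newline="") for a real file
print(write_change_report(before, after, "id", report))
print(report.getvalue())
```

Listing only the values that changed keeps the file short enough for the business to actually review before signing off.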
[1] http://www.cio.gov.bc.ca/other/daf/IRM_Glossary.htm
[2] http://www.techopedia.com/definition/28041/data-matching