Healthcare Blog

Data Profiling: The First Step in Data Quality

When I think of data quality, I think of three primary components: data profiling, data correction, and data monitoring. Data profiling is the act of analyzing your data contents. Data correction is the act of correcting your data content when it falls below your standards. And data monitoring is the ongoing act of establishing data quality standards as a set of metrics meaningful to the business, reviewing the results on a recurring basis, and taking corrective action whenever the data exceeds the acceptable thresholds of quality.

Today, I want to focus on data profiling. Data profiling is the analysis of data content in conjunction with every new and existing application effort. We can profile batch data, near/real-time data, structured and unstructured data, or any data asset meaningful to the organization. Data profiling gives your organization a methodical, repeatable, consistent, and metrics-based means to analyze large amounts of data quickly. And because data is dynamic, you should evaluate it continually, not just once.

I like to break down data profiling into the following categories:

  • Column Profiling, where all the values within each column or attribute are analyzed. The objective is to discover the true metadata and uncover data content quality problems.
  • Dependency Profiling, where each attribute is compared against every other attribute within a table to look for dependency relationships. The focus is on discovering functional dependencies, primary keys, and quality problems caused by the data structure.
  • Redundancy Profiling, where data is compared between tables to determine which attributes contain overlapping or identical sets of values. The purpose is to identify duplicate data across systems, along with foreign keys, synonyms, and homonyms. Any values corrupting data integrity should be identified.
  • Transformation Profiling, where our processes (business rules) are examined to determine our data’s source(s), what transformation(s) are applied to the data, and the data’s target(s).
  • Security Profiling, where we determine who (or what roles) have access to the data and what they are authorized to do with it (add, update, delete, etc.).
  • Custom Profiling, where our data is analyzed in a fashion meaningful to our organization. For example, an organization might analyze data consumption to determine whether data is accessed more through web services, direct queries, or some other channel. One large organization improved system throughput after determining how the business and its customers accessed their information.
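As a minimal sketch, column profiling (the first category above) can be as simple as tallying nulls, distinct values, and character patterns. Here digits are masked as 9 and letters as A, a convention many profilers use to surface domain violations; the ZIP-code sample data is hypothetical:

```python
import re
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, nulls, distinct values,
    min/max, and the most common character patterns."""
    non_null = [v for v in values if v not in (None, "", "NULL")]
    # Mask digits as 9 and letters as A so "30301" and "3O301" differ.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_patterns": patterns.most_common(3),
    }

zips = ["30301", "30302", "3O301", None, "30301-1234"]
print(profile_column(zips))  # the "9A999" pattern exposes the letter O typo
```

The pattern counts alone often reveal the true metadata: a "ZIP code" column with patterns 99999, 99999-9999, and 9A999 tells you both the real formats in use and where the typos live.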

Most times, you’ll find IT and the business hold a few false assumptions about data content and its quality. The cost of those assumptions is real: the business risks its future solvency, or at minimum fails to reach its maximum revenue potential. Leadership often has difficulty assessing the need for a data quality program because the cost of poor data is hard to measure. Too often, action is taken only after a bug is discovered at midnight or a customer feels their report is wrong. Data profiling allows your organization to be proactive and creates self-awareness.

The Two ‘Flavors’ of Data Profiling

There are two methods of data profiling: one based on a sample and another based on profiling data “in place.” Sample-based profiling involves performing your analysis on a subset of the data. For example, I might want to profile a 100-million-row table; to be efficient, my sample might be 30% of the rows, taken either by selecting every third row (a systematic sample) or by drawing rows at random. Sample-based profiling requires me to store my sample in some temporary medium, and it requires you to ensure you have a representative sample of your data. From a statistical standpoint, if my sample is too small, I can easily miss data patterns or fail to properly identify the column’s domain.
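A sketch of sample-based profiling in Python, illustrating the representativeness risk described above: rare domain values can simply vanish from the sample. The state column and 30% fraction are hypothetical:

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Draw a simple random sample; a fixed seed keeps the profile repeatable."""
    random.seed(seed)
    k = max(1, int(len(rows) * fraction))
    return random.sample(rows, k)

# 97 common values and 3 rare ones, standing in for a skewed column
rows = [{"state": s} for s in ["GA"] * 97 + ["PR", "GU", "VI"]]
sample = sample_rows(rows, 0.30)

full_domain = {r["state"] for r in rows}
sample_domain = {r["state"] for r in sample}
print(full_domain - sample_domain)  # domain values the sample missed, if any
```

Comparing the sample's domain against the full column (or against a second sample) is a cheap check that your sample is large enough to see the rare values.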

The second method involves profiling my data in place. It’s treated as just another query against my database. Generally, you will be profiling PROD, and given the contention for resources, you’ll want to run your queries when they have the least impact on the database.
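A minimal "in place" example: the profile really is just another aggregate query. This sketch uses an in-memory SQLite table for illustration; on PROD the same SQL would run against your actual schema (table and column names here are hypothetical):

```python
import sqlite3

# Stand-in table; in practice you query the production database directly.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE members (member_id TEXT, state TEXT)")
con.executemany("INSERT INTO members VALUES (?, ?)",
                [("1001", "GA"), ("1002", "GA"), ("10O3", None)])

# One pass over the table yields the core column-profile counts.
row = con.execute("""
    SELECT COUNT(*)              AS total_rows,
           COUNT(state)          AS non_null_state,
           COUNT(DISTINCT state) AS distinct_state
    FROM members
""").fetchone()
print(row)  # (3, 2, 1)
```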

Data Profiling Toolsets

You might be asking what toolsets are available to perform data profiling. You have lots of options. Most ETL toolsets, like Informatica and IBM DataStage, offer built-in data profilers. There are stand-alone data profiling alternatives. And if your budget is zero, you can write your own scripts to perform the analysis.

Data Profiling Insights

What data should I profile first? I like to focus on mission-critical data first, like customer or product information. If I have a data warehouse, data mart, or OLAP cube, I’ll focus on their data sources. Your OLTP environment is a good starting point, since most analytic data stores pull from these sources.

Once you have performed your data profiling effort, what next? I like to map the results to my applications’ outstanding bug-fix reports. You will find a high correlation between the known errors and what your data profile reveals, and you can be proactive in discovering errors that may reside in your data now. If I know my data contents, I can create better and smaller test data sets for QA purposes. I like to share my findings with QA to develop a better test database and improve our test plans.

I can also be proactive in my transformations by identifying misalignments where my data sources contain values that are not being handled properly. And if there are anomalies where the same set of values is stored in multiple locations, we can address our data structure if needed.
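One way to sketch that transformation check: compare the source column's actual domain against the values the business rule knows how to map. Anything outside the map will be mishandled, usually silently. The gender codes and mapping here are hypothetical:

```python
# Hypothetical transformation rule: source gender codes -> target labels
gender_map = {"M": "Male", "F": "Female", "U": "Unknown"}

# Actual values found by profiling the source column
source_values = ["M", "F", "f", "U", "X", None]

# Values the transformation would mishandle: lowercase, unexpected codes, nulls
unmapped = {v for v in source_values if v not in gender_map}
print(sorted(unmapped, key=str))
```

Running this kind of set difference before each load turns "mystery rejects at midnight" into a known, reviewable list of values the mapping needs to handle.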

Another useful insight comes from the data modeling structure. Do my tables reflect the business at hand? Every organization has tables that are processed each night yet used by no one. When I profile, I like to match my data to my Business Intelligence environment; when I identify a set of tables and reports that no one uses, we can remove them from PROD to improve performance. I can also match my data sources to my staging area to determine whether my processes are optimal.
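That table-usage match can be sketched as a simple set difference, assuming you can list your PROD tables and the tables your BI reports actually read (all names below are hypothetical):

```python
# Tables loaded nightly in PROD (hypothetical inventory)
prod_tables = {"claims", "members", "providers", "legacy_rates", "tmp_load_2019"}

# Tables actually referenced by BI reports and queries
report_sources = {"claims", "members", "providers"}

# Candidates for retirement: processed every night, consumed by no one
unused = prod_tables - report_sources
print(sorted(unused))
```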

There are so many great uses for data profiling. To start, I recommend looking at your business strategy and assessing your data quality cost. Once you’ve assessed the cost, determine whether your current data quality strategy aligns with your business needs. A good data profiling strategy should complement your business strategy and provide the business tangible bottom-line results.

What issues have you overcome in data profiling? How did you work through any issues? 


4 thoughts on “Data Profiling: The First Step in Data Quality”

  1. Michael Anderson Post author


    >> 1. The organization had a data quality framework developed back in 2008, but it is no longer being followed or implemented. How can I revive the enthusiasm for data quality?

    I suggest selecting a batch process that runs daily, where you could identify a few metrics such as the number of records processed, the number rejected, and the number accepted. And if you have time, break down the error types. For example, if you have alpha characters where you should have numerics, column profiling would assist your organization in being proactive. Next, you could share the column-profiling findings with the source record team so they can clean up their source data. Thereby, your loads would run with fewer errors, and the data would be loaded correctly the first time.
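Those batch metrics can be sketched in a few lines of Python; the member-ID check and sample batch are hypothetical, standing in for whatever validation your load applies:

```python
from collections import Counter

def load_metrics(records, is_valid):
    """Tally accepted vs. rejected records and break rejects down by error type."""
    errors = Counter()
    accepted = 0
    for rec in records:
        err = is_valid(rec)          # None if clean, else an error label
        if err is None:
            accepted += 1
        else:
            errors[err] += 1
    return {"processed": len(records), "accepted": accepted,
            "rejected": sum(errors.values()), "by_error": dict(errors)}

def check_member_id(rec):
    # Example rule: member IDs must be all digits
    return None if rec["member_id"].isdigit() else "alpha_in_numeric"

batch = [{"member_id": "1001"}, {"member_id": "10O2"}, {"member_id": "1003"}]
print(load_metrics(batch, check_member_id))
```

Publishing these four numbers daily, even by hand at first, is often enough to restart the conversation about data quality.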

    >> 2. There is no definite list or accountability for data stewardship. Do you have any suggestions on how I can gain buy-in from the business to get involved and become data stewards?

    Yes. I would suggest setting up a billing code to measure how much DBA, ETL, and Business Analyst time is used each day or week to clean up the recurring errors defined in question 1. The maintenance cost, multiplied by 365 (if daily) or 52 (if weekly), would be a good metric for determining the cost of quality. Also, ask the business what it costs them when a portion of their data is not loaded. Maybe the rejected records (in item 1) reflect a major account, and if those records are not loaded, the business has to pull records manually. Look at total cost of ownership when thinking about a data quality framework.
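A back-of-the-envelope version of that calculation; the hours and blended rate below are assumptions for illustration, not benchmarks:

```python
# Assumed daily cleanup effort spent re-running rejected loads (hypothetical)
hours_per_day = {"DBA": 1.5, "ETL": 2.0, "Business Analyst": 1.0}
blended_rate = 75.00  # assumed fully loaded hourly cost, USD

daily_cost = sum(hours_per_day.values()) * blended_rate
annual_cost = daily_cost * 365  # a daily batch runs every day of the year
print(f"${daily_cost:,.2f}/day -> ${annual_cost:,.2f}/year")
```

Even modest daily cleanup effort annualizes into a six-figure cost of quality, which is usually a persuasive number to bring to leadership.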

    >> 3. What are your suggestions for determining the impact of poor data? I.e., how can I quantify the cost of quality or the cost of poor data quality?

    I mentioned some metrics in items 1 and 2. Poor data quality impacts the business’s ability to compete.


  2. Aristotle Castro

    Hi Michael,

    Great article.

    I am working as an intern for a health insurance company and am currently involved in evaluating the data profiling functionality of a software vendor they currently use. However, I have some concerns and would like to ask for some suggestions or input.

    1. The organization had a data quality framework developed back in 2008, but it is no longer being followed or implemented. How can I revive the enthusiasm for data quality?

    2. There is no definite list or accountability for data stewardship. Do you have any suggestions on how I can gain buy-in from the business to get involved and become data stewards?

    3. What are your suggestions for determining the impact of poor data? I.e., how can I quantify the cost of quality or the cost of poor data quality?

    Looking forward to some advice and guidance.


  3. Michael Anderson

    When I was at a large Telecom, I added data profiling jobs to my ETL (Informatica) job streams so we could profile our data to discover potential errors and any possible anomalies. The effort was extremely successful: we generated daily reports, posted to a SharePoint site, that the Data Stewards could review, and they could request additional data validations when needed. As a result, our business rules improved dramatically, and after the first year, our ETL and DBA staff never received a midnight call for batch process errors. Downstream, the BI consumers of our Financial Data Mart became huge fans of the numbers (a single source of truth) due to the diligence of our team. Now I hear my old team is applying the same technique in real time, profiling specific assets like key tables in their Customer Master. I think ingraining data quality into your processes while keeping it cost effective is a win-win for IT, the business, and our customers.

