When I think of data quality, I think of three primary components: data profiling, data correction, and data monitoring. Data profiling is the act of analyzing your data contents. Data correction is the act of correcting your data content when it falls below your standards. And data monitoring is the ongoing act of establishing data quality standards in a set of metrics meaningful to the business, reviewing the results in a re-occurring fashion and taking corrective action whenever we exceed the acceptable thresholds of quality.
Today, I want to focus on data profiling. Data profiling is the analysis of data content in conjunction with every new and existing application effort. We can profile batch data, near/real time data, structured and non-structured data, or any data asset meaningful to the organization. Data profiling provides organizations the ability to analyze large amounts of data quickly in a systematic and repeatable process. Data profiling will provide your organization with a methodical, repeatable, consistent, and metrics-based means to evaluate your data. You should constantly evaluate your data given its dynamic nature.
I like to break down data profiling into the following categories:
- Column Profiling, where all the values are analyzed within each column or attribute. The objective is to discover the true metadata and uncover data content quality problems
- Dependency Profiling, where each attribute is compared in relation to every other attribute within a table where we’re looking for dependency relationships. The focus is on the discovery of functional dependencies; primary keys; and quality problems due to data structure.
- Redundancy Profiling, where data is compared between tables in determining which attributes contain overlapping or identical sets of values. The purpose is to identify duplicate data across systems; foreign keys, synonyms, and homonyms. All values corrupting data integrity should be identified.
- Transformation Profiling, where our processes (business rules) are examined to determine our data’s source(s); what transformation(s) are applied to data; and explore data target(s).
- Security Profiling, where it is determined who (or what roles) have access to the data and what are they authorized to do with the data (add, update, delete, etc.).
- Custom Profiling, where our data is analyzed in a fashion that is meaningful to our Organization. For example, an organization might want to analyze data consumption to determine if data is accessed more by web services, direct queries or in some other fashion. For example, a large organization, improved system throughput after determining how the business and its customer accessed their information.
Most times, you’ll find IT and Business may have a few false assumptions concerning data content and its quality. I believe the cost to the business is the risk of their future solvency or failure to reach their maximum revenue potential. Sometimes leadership has difficulty assessing their need for a data quality program due to an inability to assess the cost. Sometimes, action is taken after a bug is discovered at midnight or a customer feels their report is wrong. Data profiling allows your organization to be proactive and creates self-awareness.
The Two ‘Flavors’ of Data Profiling
There are two methods of data profiling: One based on sample and another based on profiling data “in place.” Sample based profiling involves performing your analysis on a random sample of data. For example, I might want to profile a 100 million row table. In my effort to be efficient, my sample might be 30% of rows where I select every third row. Sample base profiling requires me to store my sample in some temporary medium. Also, sample based profiling requires you to ensure you have a representative sample of your data. From a statistic standpoint, if my samples are too small, I can easily miss data patterns or not properly identify the column’s domain.
The Future of Big Data
With some guidance, you can craft a data platform that is right for your organization’s needs and gets the most return from your data capital.
The second type of profiling involves profiling my data in place. It’s treated as just another query of my database. Generally, you will be profiling PROD and given the contention for resources, you’ll want to run your queries when it has the least impact to the database.
Data Profiling Toolsets
You might be asking what toolsets are available to perform Data Profiling. You have lots of options. Most of the ETL toolsets like Informatica and Data Stage offer built in Data Profilers. There are stand-alone Data Profiling alternatives. And if your budget is zero, you can write your own scripts to perform the analysis.
Data Profiling Insights
What data should I profile first? I like to focus on mission critical data first, like customer or product information. If I have a data warehouse, data mart, or OLAP cube, I’ll focus on their Data Sources. Your OLTP environment is a good starting point since most analytic data stores will pull from these sources.
Once you have performed your data profiling effort, what next? I like to map the results to my outstanding application’s bug fix reports. You can find a high correlation between the known errors and what your data profile informs you of. And you can be proactive in the discovery of errors that may reside in your data now. If I know my data contents, I can create better and smaller test data sets for QA purposes. I like to share my findings with QA, and develop a better test database and improve our test plans.
I can be proactive in my transformations where I can identify data misalignments where my data sources contain values that are not being handled properly. And if there are data anomalies where we have the same set of values stored in multiple locations, we can address our data structure if needed.
Another useful insight comes in the data modeling structure. Do my tables reflect the business at hand? Every organization will have tables that are processed each night, and not used by anyone. When I profile, I like to match my data to my Business Intelligence environment. When I identify a set of tables and reports that are not used by anyone we can remove them from PROD to improve our performance. Also, I can match my data sources to my staging area to determine if my processes are optimal.
There are so many great uses for data profiling. To start, I recommend looking at your business strategy and assessing your data quality cost. Once you’ve assessed the cost, determine if your current data quality strategy aligns to your business needs. A good data profile strategy should complement your business strategy and provide the business tangible bottom-line results.
What issues have you overcome in data profiling? How did you work through any issues?