
From Data Swamp to Data Lake: Data Quality

This is the third blog in a series that explains how organizations can prevent their Data Lake from becoming a Data Swamp, with insights and strategy from Perficient’s Senior Data Strategist and Solutions Architect, Dr. Chuck Brooks. Read the previous blog here.

In the first article in this series, I explained the five components necessary to prevent a Data Lake from becoming a Data Swamp. In this blog, we discuss the third capability: implementing data quality analysis and reporting.

A data lake will contain whatever source application system and third-party data you put into it. That usually means your data will have quality issues such as the following:

  1. Missing values
  2. Typos
  3. Data entered in the wrong field
  4. Data type mismatches
  5. Obsolete or invalid lookup values

For most companies, the data could be anywhere from a little dirty to completely covered in dirt. Most likely it’s somewhere in between, and different for each source system. When you mix data from all of your systems together, that dirt starts to build up, and more and more dirty data keeps flowing into the data lake. Think of the insights you’re hoping to get out of your data. Do you want those insights to come from a clean Data Lake full of accurate, consistent, high-quality data, or do you want to take your chances with “insight” based on a dirty Data Lake?

Understand the quality of data in the Data Lake

Often companies do not have the ability to analyze or understand data quality problems until the data has been ingested into the Data Lake. Source systems often do not have the computing or storage capacity to perform iterative, detailed quality analysis on the data. In the Data Lake, however, we can implement simple Python- and SQL-based tools to analyze the data, or we can purchase third-party tools like Talend, Collibra, and others. I am partial to implementing three simple but powerful homemade tools. The first is pandas-profiling, the second collects statistical values on each column in the data lake, and the third provides the ability to run SQL statements that expose specific quality issues in the data.

Pandas-profiling is an open-source Python module that can perform exploratory data analysis with just a few lines of code. It also generates interactive reports in a web format that can be shared with anyone, even if they don’t know programming.
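As an illustration, here is a minimal sketch of how such a report might be generated; the file names are placeholders, and in newer releases the package has been renamed ydata-profiling, though the usage is essentially the same:

```python
# Minimal sketch: profile a dataset and write a shareable HTML report.
# "customers.csv" is only a placeholder file name.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("customers.csv")

# Build the profile (missing values, distributions, correlations, ...).
profile = ProfileReport(df, title="Customer Data Profile")

# Write an interactive report that non-programmers can open in a browser.
profile.to_file("customer_data_profile.html")
```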

The second is a simple Python program that performs statistics-based profiling. The statistics profiling tool should capture the following statistics (a sketch follows the list):

  1. Nulls: Number of missing values in a column
  2. Cardinality: Number of unique values in a column
  3. Selectivity: Ratio of cardinality to the number of rows, which provides uniqueness of data in a column
  4. Density: Number of values in a column relative to the number of rows, that is, the number of non-NULL values in a column
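As a rough illustration, the following is a minimal sketch of how these four statistics could be computed with pandas; the function name and input file are hypothetical, not the author’s actual program:

```python
# Minimal sketch: column-level statistics profiling with pandas.
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Capture nulls, cardinality, selectivity, and density for each column."""
    rows = len(df)
    stats = []
    for col in df.columns:
        non_null = int(df[col].notna().sum())
        cardinality = int(df[col].nunique(dropna=True))
        stats.append({
            "column": col,
            "nulls": rows - non_null,                          # missing values
            "cardinality": cardinality,                        # unique values
            "selectivity": cardinality / rows if rows else 0,  # uniqueness of the column
            "density": non_null / rows if rows else 0,         # share of non-NULL values
        })
    return pd.DataFrame(stats)

# Example usage (hypothetical file):
# print(profile_columns(pd.read_csv("customers.csv")))
```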

The third tool should allow data workers to feed SQL statements to an automated process that stores them and runs them at a user-specified frequency. When run, the SQL statements produce output tables, stored in an internal database, that contain data quality metrics or results. This tool can produce specific quality results.
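A minimal sketch of such a runner is shown below, assuming a SQLite database as the internal metrics store; the check name and the orders table are hypothetical examples, not part of the author’s tool:

```python
# Minimal sketch: run registered SQL quality checks and store the results.
import sqlite3
from datetime import datetime

def run_quality_checks(conn, checks):
    """Each check is a SQL statement that returns a single numeric value."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dq_metrics "
        "(check_name TEXT, run_at TEXT, metric_value REAL)"
    )
    for name, sql in checks.items():
        value = conn.execute(sql).fetchone()[0]
        conn.execute(
            "INSERT INTO dq_metrics VALUES (?, ?, ?)",
            (name, datetime.utcnow().isoformat(), value),
        )
    conn.commit()

# Example: count orders with a missing customer reference (hypothetical table).
checks = {
    "orders_missing_customer": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
}
# run_quality_checks(sqlite3.connect("dq_metrics.db"), checks)
```

A scheduler such as cron or Airflow would then invoke the process at the user-specified frequency.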

Improving Data Quality  

While quality analysis is most often performed in the Data Lake, it is important to point out that data should never be changed or fixed in the Data Lake. It should be fixed in the source systems, and the fixes will then be propagated to the Data Lake. Fixing the data in the source systems will improve data quality across your organization and keep your Data Lake from becoming a Data Swamp.

Read the next blog here.

Perficient’s Cloud Data Expertise

The world’s leading brands choose to partner with us because we are large enough to scale major cloud projects, yet nimble enough to provide focused expertise in specific areas of your business. Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals. Learn more about our Google Data capabilities here.

Download the guide, Becoming a Data-Driven Organization with Google Cloud Platform, to learn more about Dr. Chuck’s GCP data strategy.

Chuck Brooks

Dr. Chuck is a Senior Data Strategist / Solution Architect. He is a technology leader and visionary in big data, data lakes, analytics, and data science. Over a career that spans more than 40 years, Dr. Chuck has developed many large data repositories based on advancing data technologies and has helped many companies become data-driven and develop comprehensive data strategies. The cloud is the modern ecosystem for data and data lakes. Dr. Chuck’s expertise lies in the Google Cloud Platform, Advanced Analytics, Big Data, SQL and NoSQL Databases, Cloud Data Management Engines, and Business Management Development technologies such as SQL, Python, Data Studio, Qlik, PowerBI, Talend, R, Data Robot, and more. His sales enablement and data strategy guidance is the result of 40 years in the data space. For more information or to engage Dr. Chuck, contact him at chuck.brooks@perficient.com.
