From Data Swamp to Data Lake

For several years now, I have heard people who want to slow a company's progress toward becoming data-driven use the term “Data Swamp”, usually without much understanding of what a Data Swamp is. In this series of blogs, I will define what a Data Swamp really is and why it may not be a bad thing. More importantly, I will explain how straightforward it is to keep a Data Lake from becoming a Data Swamp, or to turn your Data Swamp into a Data Lake.

 

By definition, a Data Swamp is an unmanaged Data Lake that is either inaccessible to corporate knowledge workers or provides little value. Data Swamps occur when adequate data quality and data governance measures are not implemented.

 

Given this definition, I want to point out that a Data Swamp is not necessarily bad. For many companies, the journey to becoming data-driven begins by collecting data from across the company into a centralized repository. This effort breaks down organizational silos of data, often clarifies differences between similar data domains, and often exposes duplication of data across the company. For a company trying to become data-driven, a Data Swamp may be a significant improvement over having no data environment, or over the fragmented and inaccessible silos of data that existed before it. It is also often an eye-opening experience for management and knowledge workers.

 

 

 

In a previous blog, we discussed the need to make data and the Data Lake accessible to corporate knowledge workers. As I described there, making data accessible is, in my opinion, an obvious prerequisite. In this series of blogs, assuming data is accessible, I want to discuss the five capabilities necessary to keep your Data Lake from becoming a Data Swamp, or to turn your Data Swamp back into a Data Lake. The five capabilities are:

  1. Create a Data Catalog
  2. Create a Data Governance organization
  3. Implement data quality analysis and reporting
  4. Implement classification-based security in the Data Lake
  5. Have multiple data zones inside the Data Lake

 

That is all it takes! When these five capabilities are implemented, they will guarantee your Data Lake is not a Data Swamp. They make sure that knowledge workers can find and understand what data is in the Data Lake and where it came from, and that they can trust that data. They make sure that knowledge workers understand the quality of the data and, by extension, the quality of the business decisions that can be made from it. They make sure that access to the Data Lake meets all corporate and regulatory requirements for security. They help knowledge workers understand the maturity of the data. Finally, they guide knowledge workers on how to join and use data effectively in the Data Lake to make data-driven business decisions.
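To make this a little more concrete, here is a minimal Python sketch of how several of these capabilities might meet in a single catalog record. Everything in it is an assumption for illustration and not taken from any particular product: the `CatalogEntry` fields, the zone names (`RAW`, `CURATED`, `TRUSTED`), the three classification levels, and the simple completeness metric are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Zone(Enum):
    # Hypothetical zone names; real Data Lakes use many different schemes.
    RAW = "raw"
    CURATED = "curated"
    TRUSTED = "trusted"


class Classification(Enum):
    # Hypothetical sensitivity levels; a higher value means more restricted.
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


@dataclass
class CatalogEntry:
    name: str
    source_system: str               # where the data came from
    steward: str                     # owner assigned by the governance organization
    classification: Classification   # drives classification-based security
    zone: Zone                       # signals how mature the data is
    completeness: Optional[float] = None  # filled in by data quality analysis


def completeness(rows, required_fields):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )
    return complete / len(rows)


def can_read(clearance, entry):
    """Allow access only if the user's clearance is at least the entry's level."""
    return clearance.value >= entry.classification.value


# Illustrative data only.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": None},
]

entry = CatalogEntry(
    name="customers",
    source_system="crm",
    steward="sales-ops",
    classification=Classification.RESTRICTED,
    zone=Zone.RAW,
)
entry.completeness = completeness(rows, ["id", "email"])
```

In this sketch, a knowledge worker browsing the catalog could see at a glance that `customers` came from the CRM system, sits in the raw zone, is only half complete, and is readable only by users with `RESTRICTED` clearance. A real implementation would rely on a catalog and access-control service and not on in-memory objects.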

In the next five blogs, I will discuss each of the five capabilities individually. I look forward to these discussions and hope you will find them useful in the journey to becoming data-driven.

 

Read the next blog in the series, here.

 

Perficient’s Cloud Data Expertise

Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals.

Download the guide, Becoming a Data-Driven Organization With Google Cloud Platform, to learn more about Dr. Chuck’s GCP data strategy.


Chuck Brooks

Dr. Chuck is a Senior Data Strategist / Solution Architect. He is a technology leader and visionary in big data, data lakes, analytics, and data science. Over a career that spans more than 40 years, Dr. Chuck has developed many large data repositories based on advancing data technologies. Dr. Chuck has helped many companies become data-driven and develop comprehensive data strategies. The cloud is the modern ecosystem for data and data lakes. Dr. Chuck’s expertise lies in the Google Cloud Platform, Advanced Analytics, Big Data, SQL and NoSQL Databases, Cloud Data Management Engines, and Business Management Development technologies such as SQL, Python, Data Studio, Qlik, PowerBI, Talend, R, Data Robot, and more. His sales enablement and data strategy work results from 40 years in the data space. For more information or to engage Dr. Chuck, contact him at chuck.brooks@perficient.com.
