This is the final blog in a series that explains how organizations can prevent their Data Lake from becoming a Data Swamp, with insights and strategy from Perficient’s Senior Data Strategist and Solutions Architect, Dr. Chuck Brooks. Read the previous blog, here.
In the first article in this series, I explained the five components necessary to prevent a Data Lake from becoming a Data Swamp. In this blog, I discuss the fifth capability: having multiple data zones inside the Data Lake.
A data lake is typically defined as a centralized and scalable storage repository that holds large volumes of raw data from multiple sources and systems in its native format. Data lakes work on the concept of load first and use later, which means the data stored in the repository doesn’t necessarily have to be used immediately for a specific purpose. It can be dumped as-is and used all together (or in parts) at a later stage as business needs arise. This flexibility, combined with the vast variety and amount of data stored, makes data lakes ideal for data experimentation as well as machine learning and advanced analytics applications within an enterprise.
Typically, data lands in its raw format in what I call the discovery zone. As we begin to use, curate, and integrate data from multiple tables and sources into new, wide, use-case-specific tables, that data lands in the strategic zone. Any data that is mastered across the organization is stored in the optimization zone. Multiple zones allow for improved data integration, maturity, and quality.
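As a rough illustration of the three zones, here is a minimal Python sketch that models them as an ordered set of storage prefixes. The zone names come from this article; the `zone_path` and `next_zone` helpers, the `/lake` root, and the dataset names are hypothetical and only show one way such a layout could be expressed.

```python
from enum import Enum
from typing import Optional


class Zone(Enum):
    """The three zones described in this article, in order of data maturity."""
    DISCOVERY = "discovery"        # raw data in its native format
    STRATEGIC = "strategic"        # curated, integrated, use-case-specific tables
    OPTIMIZATION = "optimization"  # data mastered across the organization


# Assumed promotion order: data matures from left to right.
PROMOTION_ORDER = [Zone.DISCOVERY, Zone.STRATEGIC, Zone.OPTIMIZATION]


def zone_path(lake_root: str, zone: Zone, dataset: str) -> str:
    """Build a storage path for a dataset inside a zone (hypothetical layout)."""
    return f"{lake_root.rstrip('/')}/{zone.value}/{dataset.strip('/')}"


def next_zone(zone: Zone) -> Optional[Zone]:
    """Return the next, more mature zone, or None for the optimization zone."""
    position = PROMOTION_ORDER.index(zone)
    if position + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[position + 1]
    return None


raw_path = zone_path("/lake", Zone.DISCOVERY, "crm/customers")
wide_path = zone_path("/lake", next_zone(Zone.DISCOVERY), "customer_360")
```

Keeping each zone under its own prefix makes it easy to apply different access rules and quality expectations per zone, though a real Data Lake may organize its storage quite differently.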
The Discovery Zone
The discovery zone is the landing area for disparate sources of data in their native format. Data is not structured or curated on its way into the Data Lake, which reduces the upfront cost of ingestion, especially transformation. Once data is in the Data Lake, it can be made available to anyone. You don’t need to understand how the data is related when it is ingested; instead, data engineers and end users define those relationships as they consume it. Data curation happens on the way out instead of on the way in, which makes this zone very efficient at processing huge volumes of data.
Another benefit is that this zone allows for data exploration and discovery: you can find out whether data is useful, or simply leave it here while you wait to see how you might eventually use it. Data in the discovery zone can also serve people who don’t require top-quality cleansed data and need to address time-critical business requirements. Such use cases can be as simple as one-time operational reports or as complicated as using the Data Lake to offload data-refinement workloads. It is important to mention that data in this zone is fully auditable, to the same level that it is auditable in the source system it comes from.
Once this data lands in the Data Lake discovery zone, the baton is handed to data scientists, data analysts, or business analysts, who work with it using data discovery, analytic, and predictive modeling tools. From a data preparation view, the ideal ingestion system will have landed the data so that it is ready for exploration and insight into business needs.
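The idea that curation happens on the way out rather than on the way in can be sketched in a few lines of Python. In this hedged example, two raw payloads are kept exactly as they landed, and structure is applied only when a consumer reads them. The payloads, field names, and the `read_crm` and `read_billing` functions are invented for illustration; they are not part of any particular ingestion tool.

```python
import csv
import io
import json

# Raw payloads as they landed from two hypothetical sources, left untouched.
RAW_CRM_JSON = (
    '{"id": "17", "name": "Ada Lopez", "signup": "2021-03-04"}\n'
    '{"id": "18", "name": "Bo Chen"}\n'
)
RAW_BILLING_CSV = "cust_id,amount\n17,120.50\n18,n/a\n"


def read_crm(raw):
    """Apply a schema to raw JSON lines at read time, tolerating missing fields."""
    for line in raw.splitlines():
        if line.strip():
            record = json.loads(line)
            yield {
                "customer_id": int(record["id"]),
                "name": record.get("name"),
                "signup": record.get("signup"),  # None when the source omitted it
            }


def read_billing(raw):
    """Apply a schema to raw CSV at read time; unparseable amounts become None."""
    for row in csv.DictReader(io.StringIO(raw)):
        try:
            amount = float(row["amount"])
        except ValueError:
            amount = None
        yield {"customer_id": int(row["cust_id"]), "amount": amount}


crm_rows = list(read_crm(RAW_CRM_JSON))
billing_rows = list(read_billing(RAW_BILLING_CSV))
```

Because the raw payloads are never modified, a different consumer could read the same data with a different schema, and the original records remain available for audit.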
The Strategic Zone
In the strategic zone, data is enhanced. New derived data is created, and data aggregation is pervasive. Preprocessed data is created for statistical modeling and advanced analytics applications. The strategic zone offers a place where knowledge workers can denormalize raw data from the discovery zone and integrate data from source systems across multiple lines of business. The data in this zone is curated and is of higher quality than the data in the discovery zone. Multiple yet consistent business views of the data exist here. There is no single view of the data, because different views of the same data are created to satisfy different use cases across different enterprise business units. However, to the best of everyone’s ability, a similar analysis of the data in this zone should produce the same results. Data cleansing, standardization, transformation, and aggregation begin in this zone.
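To make denormalization and aggregation concrete, here is a small Python sketch that joins two already-parsed datasets into one wide, use-case-specific table. The `customers` and `orders` records and the `build_wide_table` function are assumptions made up for this example; in practice this work would usually be done with a SQL engine or a data processing framework.

```python
from collections import defaultdict

# Hypothetical records already read from the discovery zone.
customers = [
    {"customer_id": 17, "name": "Ada Lopez", "region": "West"},
    {"customer_id": 18, "name": "Bo Chen", "region": "East"},
]
orders = [
    {"customer_id": 17, "amount": 120.5},
    {"customer_id": 17, "amount": 30.0},
    {"customer_id": 18, "amount": 75.0},
]


def build_wide_table(customers, orders):
    """Denormalize customers and orders into one row per customer with aggregates."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
        counts[order["customer_id"]] += 1
    return [
        {
            **customer,
            "order_count": counts[customer["customer_id"]],
            "total_spend": totals[customer["customer_id"]],
        }
        for customer in customers
    ]


wide_table = build_wide_table(customers, orders)
```

A different business unit could build its own view from the same inputs, for example spend by region, and as long as both views apply the same rules the totals should agree.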
The Optimization Zone
This zone contains corporate mastered data. Mastered data has been deduplicated, matched, and merged, and has had rules applied, such as entity consolidation, related entities, and survivorship rules. This zone will include customer data, product data, agent data, and household data for all business units of the company. This is without question the highest quality of data in the Data Lake.
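The matching, merging, and survivorship steps can be illustrated with a deliberately simple Python sketch. It assumes two rules that are not from this article: records match when their normalized email addresses are equal, and for each field the most recently updated non-empty value survives. Real master data management tools use far richer matching and survivorship logic, so treat the `match_key`, `merge`, and `master` functions as hypothetical.

```python
def match_key(record):
    """Assumed matching rule: records with the same normalized email match."""
    return record["email"].strip().lower()


def merge(records):
    """Assumed survivorship rule: the latest non-empty value wins for each field."""
    golden = {}
    # ISO-8601 date strings sort chronologically, so later records overwrite earlier ones.
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if value not in (None, ""):
                golden[field] = value
    return golden


def master(records):
    """Group matching records, then merge each group into one golden record."""
    groups = {}
    for record in records:
        groups.setdefault(match_key(record), []).append(record)
    golden_records = []
    for key, group in groups.items():
        golden = merge(group)
        golden["email"] = key  # store the normalized form on the golden record
        golden_records.append(golden)
    return golden_records


source_records = [
    {"email": "Ada@Example.com ", "name": "Ada Lopez", "phone": "", "updated": "2021-03-04"},
    {"email": "ada@example.com", "name": "A. Lopez", "phone": "555-0100", "updated": "2020-11-20"},
    {"email": "bo@example.com", "name": "Bo Chen", "phone": None, "updated": "2021-01-15"},
]
golden_records = master(source_records)
```

In this example the two Ada records collapse into one golden record that keeps the newer name and the only non-empty phone number, which is the kind of outcome deduplication and survivorship rules are meant to produce.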
Late to the series? Read the first blog, here.
Perficient’s Cloud Data Expertise
The world’s leading brands choose to partner with us because we are large enough to scale major cloud projects, yet nimble enough to provide focused expertise in specific areas of your business. Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals.
Download the guide, Becoming a Data-Driven Organization with Google Cloud Platform, to learn more about Dr. Chuck’s GCP data strategy.