The Data Lake and Data Zones

I believe there is confusion about what a Data Lake is, rooted in a 10-year-old definition that is no longer accurate. In this post, I want to clarify what a Data Lake is in terms of today’s technology.

A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. A Data Lake stores relational data from line-of-business applications, as well as non-relational data from applications, IoT devices, social media, and any other data of interest to your organization. The structure of the data, or schema, does not need to be defined when data is captured. This means you can store data without careful structure design and without knowing what questions you might need answers to in the future. Different types of analytics, such as SQL queries, full-text search, real-time analytics, and machine learning, can be used to uncover insights. However, the fact that a Data Lake does not force you to define structure in order to land data does not mean that data cannot be structured in a Data Lake. Data can be landed in its source structure, or unstructured, and then integrated and restructured later as it becomes clearer how the data can and should be used. We accomplish this in a Data Lake by using data zones. An organization can have as many data zones as needed, but I prefer to keep it to three zones named Discovery, Strategic, and Optimization.
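The idea that structure does not have to be defined when data is captured is often called schema-on-read. The following is a minimal sketch of that idea, not a description of any particular platform: the records, field names, and the `read_with_schema` helper are all hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
# No schema was enforced when they were written.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0"}',
    '{"device": "sensor-3", "humidity": 40}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select the listed fields and cast them.

    Fields missing from a record come back as None instead of failing,
    because the raw data was never required to contain them.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# The question ("what temperature did each device report?") is decided
# long after the data landed, and only now is a structure imposed.
temperature_view = read_with_schema(RAW_EVENTS, {"device": str, "temp_c": float})
```

A different question later on could read the same raw records with a different schema, without reloading or reshaping what was landed.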

The Discovery Zone

The discovery zone is the landing area for disparate sources of data in their native format. Data is not restructured or curated on its way into the Data Lake, which reduces the upfront costs of data ingestion, especially transformation. Data is stored in the same structured or unstructured format as in its source system. Once data is in the Data Lake, it can be made available to anyone. This zone is very efficient at processing huge volumes of data. Another benefit is that this zone allows for data exploration and discovery: you can find out whether data is useful, or simply leave it in this zone while you wait to see how you might eventually use it. Data in the discovery zone can be used by people who don’t require top-quality cleansed data and who need to address time-critical business requirements. Such use cases can be as simple as one-time operational reports, or as complicated as using the Data Lake to offload workloads for refining data. It is important to mention that data in this zone is fully auditable to the same level that it is auditable in the operational system it comes from. Once data lands in the discovery zone, the baton is handed to data scientists, data analysts, and business analysts, collectively known as knowledge workers, who use the data for discovery, analytics, and predictive modeling. From a data preparation view, the ideal ingestion system will have landed the data so that it is ready for exploration and insight into business needs.
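As a rough illustration of landing data untouched, the sketch below writes a source file byte-for-byte into a discovery folder and records a checksum so the landed copy can later be compared with the source. The directory layout, the `land_raw` function, and the manifest fields are assumptions made for this example; a real Data Lake would typically sit on cloud object storage rather than a local temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def land_raw(payload: bytes, source: str, name: str, lake_root: Path) -> dict:
    """Write the payload unchanged into the discovery zone.

    Returns a manifest entry (hypothetical format) with a SHA-256 checksum
    that supports auditing the landed copy against the source system.
    """
    target_dir = lake_root / "discovery" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing, no transformation
    return {
        "source": source,
        "path": str(target),
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


# A local temp directory stands in for the lake's storage in this sketch.
lake_root = Path(tempfile.mkdtemp())
payload = b"id,amount\n1,9.99\n2,15.00\n"
entry = land_raw(payload, "billing", "invoices.csv", lake_root)

# The landed file is identical to what the source system produced.
landed_bytes = Path(entry["path"]).read_bytes()
```

Because nothing is parsed on the way in, ingestion stays cheap, and the checksum gives a simple way to show that the landed file matches its source.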

The Strategic Zone

In the strategic zone, data is enhanced. The strategic zone integrates and restructures data from the discovery zone and the optimization zone. New derived data is created, and data aggregation is pervasive in this zone. This zone takes advantage of integrated, flat, and wide tables. Preprocessed data is created for statistical modeling and advanced analytics applications. The strategic zone offers a place where data workers can denormalize raw data from the discovery zone and integrate data from multiple line-of-business source systems. The data in this zone may be curated and is of higher quality than the data in the discovery zone. The strategic zone is the primary consumption zone for business analysts and knowledge workers across the organization. Multiple yet consistent business views of the data exist in this zone. There is no single view of the data, because different views of the same data are created to satisfy different use cases across different enterprise business units. However, a similar analysis of the data in this zone will produce the same results. Data cleansing, standardization, transformation, and aggregation happen in this zone. The strategic zone makes data easy for knowledge workers to work with.
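To make the "flat and wide" idea concrete, here is a small sketch that denormalizes two hypothetical source tables into one wide row per order and then derives an aggregate from it. The table contents, column names, and helper functions are invented for illustration; in practice this work would usually be done in SQL or a data processing engine.

```python
from collections import defaultdict

# Hypothetical discovery-zone data from two line-of-business systems.
customers = [
    {"customer_id": 1, "name": "Acme", "region": "East"},
    {"customer_id": 2, "name": "Globex", "region": "West"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 100.0},
    {"order_id": 11, "customer_id": 1, "amount": 50.0},
    {"order_id": 12, "customer_id": 2, "amount": 75.0},
]


def build_wide_orders(orders, customers):
    """Denormalize: copy customer attributes onto every order row."""
    by_id = {c["customer_id"]: c for c in customers}
    wide = []
    for order in orders:
        customer = by_id.get(order["customer_id"], {})
        wide.append({
            **order,
            "customer_name": customer.get("name"),
            "region": customer.get("region"),
        })
    return wide


def total_by_region(wide_rows):
    """Derive an aggregate from the wide table, with no joins needed."""
    totals = defaultdict(float)
    for row in wide_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


wide_orders = build_wide_orders(orders, customers)
region_totals = total_by_region(wide_orders)
```

Once the wide table exists, a knowledge worker can answer regional questions from a single structure, and anyone repeating the same analysis on it gets the same totals.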

The Optimization Zone

The optimization zone holds the data that the business requires to be mastered. Master data management is applied to this data before it is ingested and made available in this zone. This includes removing duplicate data and applying match and merge rules, entity consolidation, related entities, and survivorship rules. This zone might include customer data, product data, agent data, and household data for all business units of your organization. This is, without question, the highest-quality data in the Data Lake.
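The match, merge, and survivorship steps can be sketched in a few lines. This is only a toy example under stated assumptions: records are matched on a normalized email address, and the survivorship rule keeps the newest non-empty value for each field. The records, field names, and both rules are hypothetical; real master data management tools support far richer matching and survivorship logic.

```python
from collections import defaultdict

# Hypothetical customer records arriving from two source systems.
records = [
    {"source": "crm", "email": "Ann@Example.com ", "name": "Ann Lee",
     "phone": "", "updated": "2023-01-10"},
    {"source": "billing", "email": "ann@example.com", "name": "A. Lee",
     "phone": "555-0100", "updated": "2024-03-02"},
    {"source": "crm", "email": "bob@example.com", "name": "Bob Ray",
     "phone": "555-0111", "updated": "2022-07-19"},
]


def match_key(record):
    """Match rule (assumed): same email after trimming and lowercasing."""
    return record["email"].strip().lower()


def merge_group(group):
    """Survivorship rule (assumed): newest non-empty value wins per field."""
    # ISO dates sort correctly as plain strings.
    newest_first = sorted(group, key=lambda r: r["updated"], reverse=True)
    golden = {"email": match_key(group[0])}
    for field in ("name", "phone"):
        golden[field] = next((r[field] for r in newest_first if r[field]), None)
    golden["sources"] = sorted({r["source"] for r in group})
    return golden


def master(records):
    """Group matching records, then consolidate each group into one entity."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [merge_group(group) for group in groups.values()]


golden_records = master(records)
```

Here the two records for Ann collapse into a single consolidated entity that takes its phone number and name from the more recent billing record, while Bob's single record passes through unchanged.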

Summary

Using multiple data zones allows data to be integrated and structured as knowledge of the data increases. It allows for creating flat and wide data structures that are easy for knowledge workers to use and that perform at the highest level on modern data management platforms. It helps knowledge workers understand the quality and current state of the data they are working with, and it makes it possible to integrate and structure data for easy use in Analytics, Data Visualization, and AI/ML. There is no need for a separate Data Lake, Data Warehouse, and Data Lakehouse; you can do it all in the Data Lake with the use of multiple Data Zones.

Perficient’s Cloud Data Expertise

Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals.

Download the guide, Becoming a Data-Driven Organization with Google Cloud Platform, to learn more about Dr. Chuck’s GCP data strategy.

Chuck Brooks

Dr. Chuck is a Senior Data Strategist / Solution Architect. He is a technology leader and visionary in big data, data lakes, analytics, and data science. Over a career that spans more than 40 years, Dr. Chuck has developed many large data repositories based on advancing data technologies. Dr. Chuck has helped many companies become data-driven and develop comprehensive data strategies. The cloud is the modern ecosystem for data and data lakes. Dr. Chuck’s expertise lies in the Google Cloud Platform, Advanced Analytics, Big Data, SQL and NoSQL Databases, Cloud Data Management Engines, and Business Management Development technologies such as SQL, Python, Data Studio, Qlik, PowerBI, Talend, R, Data Robot, and more. His sales enablement and data strategy work draws on 40 years of experience in the data space. For more information or to engage Dr. Chuck, contact him at chuck.brooks@perficient.com.
