Data Lake, Data Warehouse, Data Lakehouse?

I have read several articles that explain that the Data Lake, Data Warehouse, and Data Lakehouse are different repositories and that you may need several or all of them. Frankly, I don’t agree, and I thought it was time to provide a different point of view. In this post, I will explain why I believe you don’t need anything other than a modern Data Lake.

 

Data Lake, Data Warehouse, and Data Lakehouse

The definition of a classic Data Warehouse and Data Mart typically involves star schemas and dimensional models; you can find my three-part series on the topic here. If you are still using a classic Data Warehouse, you are not fully enabling the data workers in your organization, not taking advantage of modern data management technology, and not practicing financial responsibility. In my opinion, the classic Data Warehouse should be replaced with the modern Data Lake.

 
As for the Data Lakehouse, it is built on the false assumption that you must create another data repository to make the data friendlier, more integrated, and easier to work with for AI/ML and data visualization tools. In my opinion, the premise is absurd. Data can be integrated, restructured, and denormalized without ever leaving the modern Data Lake, and flatter, wider tables can be created right in the lake. Modern Data Lakes allow AI/ML tools and data visualization tools to access and work with the data directly.
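To make that concrete, here is a minimal sketch of producing a flatter, wider table with a single statement, without the data ever leaving the lake. The dataset and column names (`lake.orders`, `lake.customers`, `lake.products`, and their fields) are hypothetical, used for illustration only:

```sql
-- Build a wide, denormalized table inside the lake itself
-- (all table and column names below are hypothetical).
CREATE OR REPLACE TABLE lake.sales_wide AS
SELECT
  o.order_id,
  o.order_date,
  c.customer_name,
  c.region,
  p.product_name,
  o.quantity * p.unit_price AS revenue
FROM lake.orders o
JOIN lake.customers c ON o.customer_id = c.customer_id
JOIN lake.products  p ON o.product_id  = p.product_id;
```

The result is the kind of denormalized, analysis-ready table a lakehouse promises, created and stored in the same repository as the source data.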

Why would anyone want to move data out of the Data Lake into another repository? Doing so leads to more data silos and more data security issues, which are two of the biggest problems the Data Lake was created to solve in the first place. You don't need a Data Lakehouse; you can do everything a lakehouse does in the modern Data Lake. I will go into more detail in the next blog in this series.

 

 

Modern Data Lake

When I speak of the modern Data Lake, I am not talking about the Hadoop stack, and I am not talking about an HDFS file system. I am talking about a centralized repository that allows you to store all your structured and unstructured data at any scale. My modern Data Lake repository of choice is Google's BigQuery. BigQuery is serverless and scales automatically without user intervention, performing queries on billion-row tables in seconds. It also supports a standard SQL dialect that is ANSI SQL:2011-compliant, reducing the need for code rewrites and letting you take advantage of advanced SQL features. BigQuery provides free ODBC and JDBC drivers to ensure your current applications can interact with BigQuery's powerful engine.
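As a small illustration of those standard SQL features, here is an ANSI-style window function that should run unchanged when ported from another SQL:2011-compliant engine (the table and columns are hypothetical):

```sql
-- Hypothetical table; ANSI window functions work as-is in BigQuery standard SQL.
SELECT
  region,
  order_date,
  revenue,
  SUM(revenue) OVER (PARTITION BY region ORDER BY order_date) AS running_revenue
FROM lake.sales_wide
ORDER BY region, order_date;
```

Because the dialect is standards-based, analytics code like this ports into the lake with little or no rewrite.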

BigQuery provides a flexible, powerful foundation for machine learning and artificial intelligence. Users can run ML directly in SQL queries with BigQuery ML, and BigQuery is integrated with Google's AI platform, Vertex AI. BigQuery also forms the Data Lake backbone for BI solutions, enabling seamless data integration, transformation, analysis, and visualization: it is integrated with Looker and works with Tableau, Power BI, Qlik, and others. If you want to use data to make better business decisions, you don't need a Data Warehouse and you don't need a Data Lakehouse; you just need to build a BigQuery Data Lake.
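As a sketch of what "ML in SQL queries" looks like in practice, here is BigQuery ML's `CREATE MODEL` / `ML.PREDICT` pattern. The feature tables, column names, and churn use case are hypothetical, chosen only to illustrate the syntax:

```sql
-- Train a model directly over data in the lake
-- (lake.customer_features and its columns are hypothetical).
CREATE OR REPLACE MODEL lake.churn_model
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM lake.customer_features;

-- Score new rows with ML.PREDICT, still in SQL:
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL lake.churn_model,
                (SELECT * FROM lake.customer_features_new));
```

Training and scoring both happen inside the lake, so no separate repository or export step is needed to support ML workloads.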

 

Read the next blog in the series, here.

 

Perficient’s Cloud Data Expertise

Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals.

Download the guide, Becoming a Data-Driven Organization with Google Cloud Platform, to learn more about Dr. Chuck's GCP data strategy.

Chuck Brooks

Dr. Chuck is a Senior Data Strategist / Solution Architect. He is a technology leader and visionary in big data, data lakes, analytics, and data science. Over a career that spans more than 40 years, Dr. Chuck has developed many large data repositories based on advancing data technologies and has helped many companies become data-driven and develop comprehensive data strategies. The cloud is the modern ecosystem for data and data lakes. Dr. Chuck's expertise lies in the Google Cloud Platform, Advanced Analytics, Big Data, SQL and NoSQL Databases, Cloud Data Management Engines, and Business Management Development technologies such as SQL, Python, Data Studio, Qlik, PowerBI, Talend, R, Data Robot, and more. This sales-enablement and data-strategy content is the result of Dr. Chuck's 40-year career in the data space. For more information or to engage Dr. Chuck, contact him at chuck.brooks@perficient.com.
