I have read several articles that explain that the Data Lake, Data Warehouse, and Data Lakehouse are different repositories and that you may need several or all of them. Frankly, I don’t agree, and I thought it was time to provide a different point of view. In this post, I will explain why I believe you don’t need anything other than a modern Data Lake.
Data Lake, Data Warehouse, and Data Lakehouse
The definition of a classic Data Warehouse and Data Mart typically involves star schemas and dimensional models. You can find my three-part series on that topic here. If you are still using a classic data warehouse, you are not fully enabling data workers within your organization, taking advantage of modern data management technology, or practicing financial responsibility. In my opinion, the classic Data Warehouse should be replaced with the modern Data Lake.
As for the Data Lakehouse, it is built on the false assumption that you must create another data repository to make the data friendlier, more integrated, and easier to work with for AI/ML and data visualization tools. In my opinion, the premise is absurd. Data can be integrated, restructured, and denormalized without ever leaving the modern Data Lake. Flatter and wider tables can be created in the Data Lake itself. All modern Data Lakes allow AI and ML tools to access data directly, and they allow data visualization tools to work with data directly.
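As a sketch of what that restructuring can look like in practice, the following BigQuery Standard SQL builds a flat, wide table directly in the lake from a nested source table. The dataset, table, and column names here are hypothetical, used only for illustration:

```sql
-- Hypothetical example: denormalize orders and customers into one wide,
-- analysis-friendly table without moving data out of the Data Lake.
CREATE OR REPLACE TABLE lake.orders_wide AS
SELECT
  o.order_id,
  o.order_date,
  c.customer_name,
  c.customer_region,
  i.product_sku,
  i.quantity,
  i.quantity * i.unit_price AS line_total
FROM lake.orders AS o
JOIN lake.customers AS c
  ON o.customer_id = c.customer_id,
  UNNEST(o.line_items) AS i;  -- flatten the nested, repeated line_items field
```

The result is the kind of flatter, wider table that visualization and ML tools prefer, created in place rather than in a separate lakehouse repository.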
Why would anyone want to move data out of the Data Lake to another repository? Doing so leads to more data silos and more data security issues, which are the two biggest reasons for creating the Data Lake in the first place. You don’t need a Data Lakehouse; you can do everything that a lakehouse does in the modern Data Lake. I will go into more detail in the next blog in this series.
Modern Data Lake
When I speak of the modern Data Lake, I am not talking about the Hadoop stack, and I am not talking about an HDFS file system. I am talking about a centralized repository that allows you to store all your structured and unstructured data at any scale. My modern Data Lake repository of choice is Google’s BigQuery. BigQuery is serverless; it scales automatically without user intervention and performs queries on billion-row tables in seconds. It also supports a standard SQL dialect that is ANSI 2011 compliant, reducing the need for code rewrites and allowing you to take advantage of advanced SQL features. BigQuery provides free ODBC and JDBC drivers to ensure your current applications can interact with BigQuery’s powerful engine.
BigQuery provides a flexible, powerful foundation for machine learning and artificial intelligence. Users can perform ML directly in SQL queries with BigQuery ML, and BigQuery is integrated with Google’s AI platform, Vertex AI. BigQuery forms the Data Lake backbone for BI solutions and enables seamless data integration, transformation, analysis, and visualization. It is integrated with Looker and can be used with Tableau, Power BI, Qlik, and others. If you want to use data to make better business decisions, you don’t need a Data Warehouse and you don’t need a Data Lakehouse; you just need to build a BigQuery Data Lake.
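To make the BigQuery ML point concrete, here is a minimal sketch of training and using a model with nothing but SQL. The dataset, table, and column names are hypothetical; the statements follow BigQuery ML’s CREATE MODEL and ML.PREDICT syntax:

```sql
-- Hypothetical example: train a logistic regression model in place,
-- inside the Data Lake, using BigQuery ML.
CREATE OR REPLACE MODEL lake.churn_model
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM lake.customer_features;

-- Score new rows with the trained model, still in SQL.
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL lake.churn_model,
                (SELECT customer_id, tenure_months,
                        monthly_spend, support_tickets
                 FROM lake.customer_features_new));
```

No data leaves the lake at any point: the training data, the model, and the predictions all live in BigQuery.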
Read the next blog in the series, here.
Perficient’s Cloud Data Expertise
Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals.
Download the guide, Becoming a Data-Driven Organization with Google Cloud Platform, to learn more about Dr. Chuck’s GCP data strategy.