LinkedIn open sources a lot of code. Kafka, of course, but also Samza and Voldemort and a bunch of Hadoop tools like DataFu and Gobblin. Open-source projects tend to be created by developers to solve engineering problems while commercial products … Anyway, LinkedIn has a new open-source data offering called OpenHouse, which is billed as a control plane for tables in open data lakehouses. For the unfamiliar, a lakehouse is a portmanteau of data lake and data warehouse. Lots of companies offer data lakehouses. It's very popular in the data space right now because it really is a very useful concept. It seemed like LinkedIn was using Cloudera's terminology with "open" data lakehouse. This makes sense because LinkedIn uses a lot of Hadoop internally, and Hadoop takes a lot of work. Control planes take a lot of work. When a company with LinkedIn's history decides to open source a control plane for lakehouses, I'm going to dive a little deeper to see where I might want to use it.
Data Plane
LinkedIn uses blob storage and HDFS for their data storage. HDFS is Hadoop-specific, while most people have interacted with blob storage more recently through Azure Blob Storage and may even think it's a Microsoft-specific technology. Binary Large Objects (BLOBs) are a typical mechanism for storing unstructured data. The Hadoop Distributed File System (HDFS) breaks files into chunks and distributes the storage across multiple (usually three) nodes.
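To make the storage side concrete, here is a minimal Spark SQL sketch of my own (the database, table names and paths are made up, not LinkedIn's): with an open table layout, the same external table definition can sit on top of HDFS or cloud blob storage just by changing the location.

-- hypothetical external tables; only the storage path differs
CREATE TABLE demo.events_hdfs (id bigint, payload string)
USING parquet
LOCATION 'hdfs://namenode:8020/data/events';

CREATE TABLE demo.events_blob (id bigint, payload string)
USING parquet
LOCATION 'abfss://lake@myaccount.dfs.core.windows.net/data/events';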
They use a couple of different open-source table formats for accessing the data. Hudi was open-sourced by Uber in 2017 and is now considered the first data lakehouse. I added that caveat because when Hudi first came out it called itself an "incremental data lake"; the term "data lakehouse" wasn't around for another three years, until Databricks coined it. Databricks' offering in the space is Delta Lake. Apache Iceberg is a table format that allows for high-performance SQL analytics on huge tables. OpenHouse offers a feature comparison of the three formats that's pretty comprehensive. I get the feeling they like Hudi over there, even though LinkedIn is a well-known Iceberg shop.
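To give a feel for what a table format adds on top of raw files, here is a hedged Spark SQL sketch using Iceberg (the catalog and table names are illustrative, not from OpenHouse): the table is partitioned with a transform, and the schema can be evolved in place as a metadata-only operation, which plain Parquet files on a data lake can't do safely.

-- create a partitioned Iceberg table through a Spark catalog (names are hypothetical)
CREATE TABLE lake.db.page_views (
  user_id bigint,
  url string,
  viewed_at timestamp
)
USING iceberg
PARTITIONED BY (days(viewed_at));

-- schema evolution is a metadata-only change in Iceberg
ALTER TABLE lake.db.page_views ADD COLUMN referrer string;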
For compute, they use Spark, Trino and Flink. As a general rule of thumb, Trino is going to be as performant as you can get when running SQL across massive datasets. Spark is a more general distributed execution framework, so it can do more than just SQL: machine learning, streaming and general data processing. I've seen benchmarks that argue otherwise, but they are usually run on smaller datasets, and microbenchmarking is really, really hard, so take the results with a grain of salt. Flink is great for event-driven applications, stream and batch analytics, and data pipelines. Most comparisons between Flink and Spark focus on their respective support for native streaming versus micro-batching.
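As an illustration of why sharing a table format across engines matters, a routine analytical query (reusing the hypothetical page_views table from the sketch above) can be pointed at Trino for interactive use or run inside a Spark job without rewriting it.

-- works unchanged in Spark SQL and Trino when both read the same Iceberg table
SELECT url,
       count(*) AS views
FROM lake.db.page_views
WHERE viewed_at >= DATE '2024-01-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;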
Control Plane
Control planes offer unified security, governance, discovery and observability. Common options for a control plane that can manage the different storage layers, table formats and compute options listed above are …
Nothing.
If you are a Databricks shop or a Snowflake shop, you don't have this issue: you have Unity Catalog, or you leverage the cloud provider. This is a great example of open-source projects representing developer pain points. I'm getting anxiety just writing this. The result was exactly what you would think it would be: everyone basically doing their best to get the work done with the skills and tools they have. The LinkedIn team started with a basic abstraction: tables. They took the lowest layer, blobs and files, and called them tables, which is fair because that's what almost every user would call them.
The table is the only API abstraction available to end users. These tables are stored in a protected storage namespace accessed only by the control plane. This gives the control plane full control over the implementation, including security, HA, DR, quotas, organization, etc. It also allows for enforcing company standards. With great power comes great responsibility, so now there is one team responsible for optimization, clustering, partitioning, vacuuming, garbage collection, etc. Your house; your rules. This opens up the possibility for opinionated interfaces.
RESTful API
OpenHouse provides a RESTful table service for users to interact with using declarative semantics. Users define schemas, tables and associated metadata declaratively through this service, and it is the job of OpenHouse to reconcile the observed state of the tables with the desired state expressed by the user. I'm going to pull some sample code from the LinkedIn engineering blog on OpenHouse because of how well I think the UX has been implemented.
-- create table in openhouse
CREATE TABLE openhouse.db.table (id bigint COMMENT 'unique id', data string);

-- manipulate table metadata
ALTER TABLE openhouse.db.table_partitioned SET POLICY ( RETENTION=30d );
ALTER TABLE openhouse.db.table ALTER COLUMN measurement TYPE double;
ALTER TABLE openhouse.db.table SET TBLPROPERTIES ('key1' = 'value1');

-- manipulate table data
INSERT INTO openhouse.db.table VALUES ('1', 'a');

-- share table
ALTER TABLE openhouse.db.table_partitioned SET POLICY ( SHARING=true );
GRANT SELECT ON TABLE openhouse.db.table TO user;
Again, if you are using a single commercial application with an integrated data and control plane, this looks perfectly normal. And that’s the whole point. Go back and look at all the different technologies they are supporting. According to this same article, it takes two to three weeks to onboard a new table without OpenHouse. OpenHouse is self-service.
Conclusion
There are typically two approaches we take to data systems to maintain reliability, stability and good governance. One approach is standardisation: if you use a single tool that takes care of all these issues natively, you get one throat to choke. This is a management-centric approach, and it works well in the sense that all developers are equally unhappy. The other approach is to use the best tool for the job, but be prepared to pay the steep price of ownership. OpenHouse is a very elegant solution that sets the bar very high for those who think that option two ultimately delivers the best business value. This was a multi-year initiative built on incremental milestones and executive support, and that's what it realistically takes to deliver this level of technical impact.
Get in touch with us if you want to know more about how OpenHouse might fit into your strategic data initiatives!