
Databricks Lakehouse Federation Public Preview

Sometimes it's nice to be able to skip a step. Most data projects involve data movement before data access, and usually this is not an issue; everyone agrees that the data must be moved before it can be consumed. But there are use cases where the data movement itself is the blocker because of time, cost, complexity, resource availability, or any number of other reasons. Business users may need to perform ad-hoc reporting across disparate datasets for a short time for an isolated use case. Mergers and acquisitions often create windows of time where data in different systems needs to be made available before the official migration is complete. Some POC or exploratory work legitimately has no budget yet because the business case can't be made without at least preliminary insights. There are workarounds, of course, but they usually aren't compliant with standard corporate security and governance restrictions. You can't have unauthorized access to PII just because it's temporary and convenient. This is the gap Databricks Lakehouse Federation is looking to fill.

Overview

Databricks Lakehouse Federation allows you to execute queries across multiple data sources without moving the data, while still providing data governance, data lineage, and fine-grained access control through Unity Catalog. As a side note, it's interesting how much innovation is possible in a regulated environment once you have a centralized metastore capable of fine-grained access control, lineage, and discovery. Databricks uses two components to enable data federation: connections and foreign catalogs. The concept of a connection is straightforward enough; in any implementation you need to provide path and credential information to access a database. The foreign catalog is the innovation.
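To make the connection half concrete, here is a sketch of creating a connection to a PostgreSQL source; the connection name, host, and secret scope/key names below are hypothetical, and credentials would normally be stored in a secret scope rather than inline:

```sql
-- Create a securable connection object holding path and credential information
-- (names, host, and secret scope/keys are illustrative)
CREATE CONNECTION IF NOT EXISTS postgres_conn TYPE postgresql
OPTIONS (
  host 'example-host.example.com',
  port '5432',
  user secret('example_scope', 'example_user'),
  password secret('example_scope', 'example_password')
);
```

The connection is itself a securable object in Unity Catalog, so access to it can be granted and audited like anything else in the metastore.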

Unity (foreign) Catalog

The Unity Catalog metastore is composed of securable objects in a hierarchy (Catalog -> Schema -> [Table, View, Volume, Model, Function]). A foreign catalog is a securable object at the same level as Catalog in the hierarchy, except that it's a read-only mirror of an external database. I'm going to include the code snippet used to create a foreign catalog just to show how straightforward Databricks made the implementation.

CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS (database '<database-name>');

You may have noticed that materialized views were first supported when Delta Live Tables became available. A materialized view precomputes results from the latest version of a source table on a defined schedule, rather than streaming changes continuously. Once again, here is a code snippet just to show the ease of implementation.

CREATE MATERIALIZED VIEW xyz AS
SELECT * FROM federated_catalog.federated_schema.federated_table;
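If you want the precomputation to happen on a defined schedule rather than on manual refresh, the CREATE statement also accepts a SCHEDULE clause. A sketch, with a hypothetical view name and an illustrative cron expression:

```sql
-- Materialized view refreshed on a schedule (daily at 06:00 in this sketch);
-- the view name and cron expression below are illustrative
CREATE MATERIALIZED VIEW sales_snapshot
SCHEDULE CRON '0 0 6 * * ? *'
AS SELECT * FROM federated_catalog.federated_schema.federated_table;
```

This is a common pattern with federation: query the foreign catalog live for ad-hoc work, and materialize a scheduled local copy when the same federated query is hit repeatedly.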

You can leverage your experience with Unity Catalog and securable objects when working with foreign catalogs and materialized views. You can view catalog details, manage privileges, and capture and view data lineage the same way you would with a regular catalog. The only difference is that you are performing read-only queries on data without having moved it into Databricks.
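Because the foreign catalog is governed like any other catalog, privileges use the same GRANT syntax; a sketch with hypothetical catalog and group names:

```sql
-- Grant read access on a foreign catalog exactly as you would on a regular one
-- (catalog and group names are hypothetical)
GRANT USE CATALOG, USE SCHEMA, SELECT
ON CATALOG federated_catalog
TO `analysts`;
```

The grants are read-only in effect regardless, since foreign catalogs do not support writes.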

Conclusion

Lakehouse Federation is still in Public Preview, so there are implementation restrictions that may change over time. Private Link and static IP range support are not available on Serverless SQL warehouses. The single-user access model is only available for users who own the connection. And Unity Catalog has naming rules and limitations that may differ from the source database, such as enforcing lower-case table names. Even so, we can give our business partners the ability to very quickly access data in most databases without moving it and without sacrificing security and governance. I can remember a number of times when I wished I had access to this kind of conscientious flexibility for time-boxed analytics, and I'm sure it will come up again very soon.

Contact us to learn more about how to apply Databricks solutions for your organization.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain, and cloud, coding in Java, Scala, and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB, and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud, with an emphasis on integrating modern data products such as HBase, Cassandra, and Neo4J as the off-blockchain repository.
