Sometimes, it’s nice to be able to skip a step. Most data projects involve data movement before data access. Usually this is not an issue; everyone agrees that the data must be made available before it can be used. There are use cases, however, where the data movement part is a blocker because of time, cost, complexity, resource availability or any number of other reasons. Business users may need to perform ad-hoc reporting on disparate datasets for a short period of time for an isolated use case. Mergers and acquisitions often create windows of time where data in different systems needs to be made available before the official migration is complete. Some POC or exploratory work legitimately does not have a budget yet because the business case can’t be made without at least preliminary insights. There are workarounds, of course, but they usually aren’t compliant with standard corporate security and governance restrictions. You can’t have unauthorized access to PII just because it’s temporary and convenient. This is the gap that Databricks Lakehouse Federation is looking to fill.
Overview
Databricks Lakehouse Federation allows you to run queries across multiple data sources without moving the data, while still providing data governance, data lineage and fine-grained access control through Unity Catalog. As a side note, it’s interesting how much innovation you can do in a regulated environment once you have a centralized metastore capable of fine-grained access control, lineage and discovery. There are two components that Databricks uses to enable data federation: connections and foreign catalogs. The concept of a connection is straightforward enough; in any implementation, you need to be able to provide path and credential information to access a database. The foreign catalog is the innovation.
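As a minimal sketch of the first component, a connection to a PostgreSQL source might look something like the following. The connection name, host, and secret scope and key names are hypothetical placeholders, and the available options vary by source type, so treat this as an illustration rather than a template.

```sql
-- Hypothetical names throughout: postgres_sales_conn, the host, and the
-- secret scope and keys are placeholders, not values from a real workspace.
-- Credentials are read from a Databricks secret scope instead of being hard-coded.
CREATE CONNECTION IF NOT EXISTS postgres_sales_conn TYPE postgresql
OPTIONS (
  host 'sales-db.example.com',
  port '5432',
  user secret('federation_scope', 'pg_user'),
  password secret('federation_scope', 'pg_password')
);
```

Keeping credentials in a secret scope means the connection definition can be shared and reviewed without exposing the underlying username and password.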
Unity (foreign) Catalog
The Unity Catalog metastore is composed of securable objects in a hierarchy (Catalog -> Schema -> [Table, View, Volume, Model, Function]). A foreign catalog is a securable object that sits at the same level as a Catalog in the hierarchy, except that it’s a read-only mirror of an external database. I’m going to include the code snippet used to create a foreign catalog just to show how straightforward Databricks made the implementation.
CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name> OPTIONS (database '<database-name>');
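To make the template concrete, here is a hedged sketch with the placeholders filled in, followed by a query against the mirrored database. Every name here (sales_federated, postgres_sales_conn, the sales database, and the public.orders table with its columns) is a hypothetical stand-in for whatever exists in your own source system.

```sql
-- Hypothetical names: sales_federated, postgres_sales_conn and the
-- 'sales' database are illustrative only.
CREATE FOREIGN CATALOG IF NOT EXISTS sales_federated
USING CONNECTION postgres_sales_conn
OPTIONS (database 'sales');

-- Once the foreign catalog exists, external tables are addressed with the
-- usual three-level namespace: catalog.schema.table.
SELECT order_id, order_total
FROM sales_federated.public.orders
WHERE order_date >= '2023-01-01'
LIMIT 100;
```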
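Foreign catalogs also pair well with materialized views, which you may have noticed were first supported when Delta Live Tables was made available. A materialized view precomputes its results from the latest version of a source table and refreshes them on a defined schedule, rather than processing changes as a stream. Defined over a federated table, it lets repeated reads use the precomputed results instead of going back to the external database each time. Once again, here is a code snippet just to show the ease of implementation.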
CREATE MATERIALIZED VIEW xyz AS SELECT * FROM federated_catalog.federated_schema.federated_table;
Your existing experience with Unity Catalog and securable objects carries over to foreign catalogs and materialized views. You can view catalog details, manage privileges, and capture and view data lineage the same way you would with a regular catalog. The only difference is that you are running read-only queries on data that has not been moved into Databricks.
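For instance, granting a group read access to a single federated table uses the same Unity Catalog GRANT statements as any other catalog. This is a sketch under assumed names: the sales_federated catalog, the public.orders table and the analysts group are all hypothetical.

```sql
-- Hypothetical catalog, schema, table and group names.
-- Each level of the hierarchy needs its own privilege before SELECT works.
GRANT USE CATALOG ON CATALOG sales_federated TO `analysts`;
GRANT USE SCHEMA ON SCHEMA sales_federated.public TO `analysts`;
GRANT SELECT ON TABLE sales_federated.public.orders TO `analysts`;

-- Review what has been granted on the foreign catalog.
SHOW GRANTS ON CATALOG sales_federated;
```

Scoping the SELECT grant to one table rather than the whole catalog fits the time-boxed, isolated use cases described earlier, since access can be revoked just as narrowly when the work is done.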
Conclusion
Lakehouse Federation is still in Public Preview, so there are some implementation restrictions that may change over time. Private Link and static IP range support on Serverless SQL warehouses are not available. Single-user access mode is only available for users that own the connection. There are naming rules and limitations in Unity Catalog that may differ from the source database, such as enforcing lower-case table names. Even so, we can give our business partners the ability to very quickly access data in most databases without moving data and without sacrificing security and governance. I can remember a number of times when I wished I had access to this kind of conscientious flexibility for time-boxed analytics, and I’m sure it will come up again very soon.
Contact us to learn more about Databricks solutions for your organization.