
Data Lake Governance with Tagging in Databricks Unity Catalog


The goal of Databricks Unity Catalog is to provide centralized security and management for data and AI assets across the data lakehouse. Unity Catalog provides fine-grained access control for all the securable objects in the lakehouse: databases, tables, files and even models. Gone are the limitations of the Hive metastore. The Unity Catalog metastore manages all data and AI assets across different workspaces and storage locations. Providing this level of access control substantially increases the quality of governance while reducing the workload involved. Tagging presents an additional target of opportunity.

Tagging Overview

Tags are metadata elements structured as key-value pairs that can be attached to any asset in the lakehouse. Tagging can make these assets more searchable, manageable and governable. A well-structured, well-executed tagging strategy can enhance data classification, enable regulatory compliance and streamline data lifecycle management. The first step is to identify a use case that could serve as a Proof of Value in your organization. A well-structured tagging strategy means that you will need buy-in and participation from multiple stakeholders, including technical resources, SMEs and a sponsor. Here are five common use cases for tagging that might find some traction in a regulated enterprise because they can usually be piggy-backed off an existing or upcoming initiative:

  • Data Classification and Security
  • Data Lifecycle Management
  • Data Cataloging and Discovery
  • Compliance and Regulation
  • Project Management and Collaboration

Data Classification and Security

There is always room for an additional mechanism to help safely manage PII (personally identifiable information). A basic initial implementation of tagging could be as simple as applying a PII tag to classify data based on sensitivity. These tags can then be integrated with access control policies in Unity Catalog to automatically grant or restrict access to sensitive data. Balancing the promise of data access in the lakehouse with the regulatory realities surrounding sensitive data is always difficult. Additional tools are always welcome here.
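As a minimal sketch of what that initial implementation could look like, the snippet below builds the Unity Catalog SQL for tagging a column as PII. The catalog, table and column names are hypothetical; in a Databricks notebook you would execute the generated statement with spark.sql(...).

```python
# Sketch: generate Unity Catalog SQL that tags a single column as PII.
# The three-level table name and the tag key/value are assumptions
# for illustration.

def pii_tag_sql(table: str, column: str) -> str:
    """Build an ALTER TABLE statement that tags one column as PII."""
    return (
        f"ALTER TABLE {table} "
        f"ALTER COLUMN {column} "
        f"SET TAGS ('classification' = 'pii')"
    )

# In Databricks: spark.sql(pii_tag_sql("main.crm.customers", "email"))
print(pii_tag_sql("main.crm.customers", "email"))
```

Once columns carry a classification tag like this, access control policies can key off the tag rather than off individual column names.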

Data Lifecycle Management

Some organizations struggle with the concept of managing different environments in Databricks. This is particularly true when they are moving from a data landscape where there were specific servers for each environment. Tags can be used to identify stages (e.g., dev, test and prod). These tags can then be leveraged to implement policies and practices around moving data through different lifecycle stages. For example, masking policies or transformation steps may differ between environments. Tags can also be used to facilitate rules around deliberate destruction of sensitive data. Geo-coding data with tags to comply with European regulations is also a possible target of opportunity.
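To act on lifecycle-stage tags, you first need to find the assets that carry them. Unity Catalog exposes applied tags through information schema tables such as table_tags; the sketch below builds a query against it. The catalog name and the 'environment' tag key are assumptions.

```python
# Sketch: build a query that lists every table tagged with a given
# lifecycle stage, using the information_schema.table_tags view
# exposed by Unity Catalog. Names are illustrative.

def tables_in_stage(catalog: str, stage: str) -> str:
    """Build a query for tables whose 'environment' tag matches a stage."""
    return (
        f"SELECT schema_name, table_name "
        f"FROM {catalog}.information_schema.table_tags "
        f"WHERE tag_name = 'environment' AND tag_value = '{stage}'"
    )

# In Databricks: spark.sql(tables_in_stage("main", "dev")).show()
```

A promotion or retention job could iterate over the result set and apply the stage-appropriate masking or destruction policy to each table.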

Data Cataloging and Discovery

There can be a benefit in attaching descriptive tags directly to the data for cataloging and discovery even if you are already using an external tool. Adding descriptive tags like ‘customer’ or ‘marketing’ directly to the data assets themselves makes searches more convenient for analysts and data scientists, and therefore makes the catalog more likely to actually be used.

Compliance and Regulation


This is related to, and can be used in conjunction with, data classification and security. Applying tags such as ‘GDPR’ or ‘HIPAA’ can make performing audits for regulators much simpler, and these tags complement the security tags described earlier. In an increasingly regulated data environment, it pays to make your data assets easy to regulate.

Project Management and Collaboration

This tagging strategy can be used to organize data assets by project, team or department. This can facilitate project management and improve collaboration by identifying which organizational unit owns or is working with a particular data asset.


There are some practical considerations when implementing a tagging program:

  • each securable object has a limit of twenty tags
  • the maximum length of a tag is 255 characters, with no special characters allowed
  • you can only search by using exact match (pattern-matching would have really been nice here)
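Because of those limits, it can be worth validating a tag set before trying to apply it, so a bulk tagging job fails fast with a readable message instead of erroring mid-run. The sketch below checks the two numeric limits listed above; the exact character restrictions vary, so consult the current Databricks documentation before hardening this.

```python
# Sketch: validate a proposed tag set against the documented Unity
# Catalog limits (at most 20 tags per securable object, values
# capped at 255 characters) before applying it.

MAX_TAGS = 20
MAX_LEN = 255

def validate_tags(tags: dict) -> list:
    """Return a list of problems; an empty list means the tags look safe."""
    problems = []
    if len(tags) > MAX_TAGS:
        problems.append(f"{len(tags)} tags exceeds the limit of {MAX_TAGS}")
    for key, value in tags.items():
        if len(value) > MAX_LEN:
            problems.append(f"value for '{key}' exceeds {MAX_LEN} characters")
    return problems
```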

A well-executed tagging strategy will involve some level of automation. It is possible to manage tags in the Catalog Explorer. This can be a good way to kick the tires in the very beginning but automation is critical for a consistent, comprehensive application of the tagging strategy. Good governance is automated. While tagging is available to all securable objects, you will likely start out applying tags to tables.

Tag information is exposed through the information schema tables. However, Databricks Runtime 13.3 and above allows tag management through SQL commands. This is the preferred mechanism because it is much easier than working through the information schema directly. Regardless of the mechanism used, a user must have the APPLY TAG privilege on the object, the USE SCHEMA privilege on the object’s parent schema and the USE CATALOG privilege on the object’s parent catalog. This is typical of Unity Catalog’s three-tiered hierarchy. If you are using SQL commands to manage tags, you can use the SET TAGS and UNSET TAGS clauses in the ALTER TABLE command.
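The table-level SET TAGS and UNSET TAGS clauses can be sketched as follows; the table name and tag keys are hypothetical, and in Databricks each generated statement would be run with spark.sql(...).

```python
# Sketch: build the ALTER TABLE ... SET TAGS / UNSET TAGS statements
# described above. Table and tag names are illustrative.

def set_tags_sql(table: str, tags: dict) -> str:
    """Build an ALTER TABLE statement applying key-value tags to a table."""
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
    return f"ALTER TABLE {table} SET TAGS ({pairs})"

def unset_tags_sql(table: str, keys: list) -> str:
    """Build an ALTER TABLE statement removing tags by key."""
    names = ", ".join(f"'{k}'" for k in keys)
    return f"ALTER TABLE {table} UNSET TAGS ({names})"

print(set_tags_sql("main.sales.orders", {"environment": "prod"}))
print(unset_tags_sql("main.sales.orders", ["environment"]))
```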

You can use a fairly straightforward PySpark script to loop through a set of tables, look for a certain set of column names and then apply tags as appropriate. This can be done as an initial one-time run and then automated, either as a distinct job that checks for new tables and/or columns or as part of existing ingestion processes. There is a lot to be gained by augmenting this pipeline over time: from a script that checks for columns named ‘ssn’ to an ML job that looks for fields that actually contain social security numbers.
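The column-scanning loop could be sketched like this. Everything here is an assumption for illustration: in a real Databricks job the table-to-column mapping would come from spark.catalog.listColumns(...) (or the information schema) and each statement would be executed with spark.sql(...).

```python
# Sketch: given each table's column names, emit tagging statements
# for columns whose names look sensitive. The sensitive-name set,
# table names and tag key/value are assumptions.

SENSITIVE = {"ssn", "social_security_number", "tax_id"}

def tag_statements(tables: dict) -> list:
    """Return ALTER TABLE statements for columns with sensitive names."""
    stmts = []
    for table, columns in tables.items():
        for col in columns:
            if col.lower() in SENSITIVE:
                stmts.append(
                    f"ALTER TABLE {table} ALTER COLUMN {col} "
                    f"SET TAGS ('classification' = 'pii')"
                )
    return stmts

# In Databricks: for stmt in tag_statements(discovered): spark.sql(stmt)
```

Swapping the name-based check for a content-based classifier is the ML upgrade path mentioned above; the surrounding loop and tagging statements stay the same.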


I’ve seen a lot of companies struggle with populating their Databricks Lakehouse with sensitive data. In their previous environments, databases had a very limited set of users, so only people authorized to see certain data, like PII, had access to the database that stored it. However, the utility of a lakehouse is dramatically reduced if you don’t allow sensitive data; in most cases, it just won’t get any enterprise traction. Leveraging all of the governance and security features of Unity Catalog is a great, if not mandatory, first step. Enhancing governance and security, as well as utility, with tagging is probably going to be necessary to one degree or another in your organization to get broad usage and acceptance.

Contact us to learn more about how to build robustly governed solutions in Databricks for your organization.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating Modern Data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
