

Feature Engineering with Databricks and Unity Catalog

Feature engineering is the preprocessing step that makes raw data usable as input to an ML model through transformation, aggregation, enrichment, joining, normalization and other processes. Sometimes feature engineering operates on the output of another model rather than on the raw data, as in transfer learning. At a high level, feature engineering has a lot in common with data engineering; we use these same steps going from the Bronze to Silver to Gold layers in a Databricks medallion architecture. The goal of data engineering is to produce high-quality data that is easily consumed by the business, while the goal of feature engineering is to create a dataset that can be used to train a machine learning model and then reproduced later for inference. Differences between a feature engineering pipeline and a data transformation pipeline emerge when you look closer.
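To make that concrete, here is a minimal sketch of a feature computation in PySpark. The Silver-layer table and column names are assumptions I'm using for illustration; the same aggregation could just as easily live in a Silver-to-Gold data engineering job, and the difference lies in how the output is managed and reused.

```python
from pyspark.sql import functions as F

# "spark" is the ambient SparkSession in a Databricks notebook.
# Hypothetical Silver-layer table of raw transactions (assumed to exist).
raw_txns = spark.table("silver.transactions")

# Aggregate raw events into per-customer features: counts, averages, recency.
customer_features_df = (
    raw_txns
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("txn_count"),
        F.avg("amount").alias("avg_txn_amount"),
        F.max("txn_ts").alias("last_txn_ts"),
    )
)
```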

Feature Engineering

Both feature and data engineering pipelines need to prioritize data quality. However, machine learning models use features twice: first to train the model and again later when the model is used for inference. Reproducible feature computations are a core, and difficult, component of the ML pipeline, and much of the business value of feature engineering lies in making those computations reusable. An effective feature engineering pipeline enables a more reliable and performant model. Managing features and enabling their reuse saves time but, more importantly, improves the quality of any model built on robust, performant and well-tested features. There is an obvious time savings in reusing features, as there is in reuse generally. More significantly, shared features help prevent a model from seeing different feature data at training time than at inference time, which mitigates online/offline skew. Reusability at the enterprise level requires discoverability (so users can find a feature) and lineage (so engineers know where and how features are being used). Feature stores are typically used to provide both.

Feature Store

Feature stores offer a centralized repository where features can be discovered and shared and their lineage can be tracked. They also enable reproducible feature computation by ensuring the same feature is used during training and inference. Databricks provides additional value with its Feature Store. Discoverability is simplified through the Feature Store UI, which supports browsing and searching. Lineage is robust: feature tables track their own data sources, and each feature in a feature table provides access to the models, notebooks, jobs and endpoints that use it. Databricks also integrates the Feature Store with model scoring and serving. Feature metadata is packaged with models that are trained using features from the Databricks Feature Store, and those features are retrieved automatically during batch scoring or online inference, which simplifies model deployment and updates.
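As a rough sketch of how that flow looks in code, the snippet below uses the feature engineering client (covered later in this post) to build a training set from a feature table, log a model with its feature metadata, and score a batch. The table, column and model names are assumptions for illustration.

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# Declare which features to look up by primary key at training time.
feature_lookups = [
    FeatureLookup(
        table_name="ml.churn.customer_features",   # assumed feature table
        lookup_key="customer_id",
        feature_names=["txn_count", "avg_txn_amount"],
    )
]

# Assumed label table holding customer_id plus the label column.
label_df = spark.table("ml.churn.labels")

training_set = fe.create_training_set(
    df=label_df,
    feature_lookups=feature_lookups,
    label="churned",
)
train_pdf = training_set.load_df().toPandas()

model = RandomForestClassifier().fit(
    train_pdf.drop(columns=["customer_id", "churned"]),
    train_pdf["churned"],
)

# Logging through the client packages the feature metadata with the model ...
fe.log_model(
    model=model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="ml.churn.churn_model",
)

# ... so batch scoring only needs the lookup keys; features are fetched automatically.
predictions = fe.score_batch(
    model_uri="models:/ml.churn.churn_model/1",
    df=spark.table("ml.churn.customers_to_score"),
)
```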

In Databricks, this centralized repository was originally backed by the Hive metastore, which has limitations. Databricks has been moving its functionality past the legacy Hive metastore to the more robust Unity Catalog, and the Feature Store has recently made that move as well. Unity Catalog automatically serves as your feature store if you are using Databricks Runtime 13.2 or above and your workspace is enabled for Unity Catalog.

Unity Catalog

Unity Catalog is the unified governance layer for the Databricks Data Intelligence Platform. The primary enterprise driver for adoption tends to be its simplified access management: Unity Catalog provides a single permission model for all assets. The Unity Catalog object model is a hierarchy with the metastore as the top-level container. Under the metastore are Catalogs, which contain Schemas, and the securable objects under a Schema include Tables, Views, Volumes, Models and Functions. This allows every securable object in Unity Catalog to be referenced using a three-level namespace: catalog.schema.asset. Models are registered in the MLflow Model Registry.
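As a hedged illustration of the single permission model and the three-level namespace, the commands below build out a catalog and schema and grant a hypothetical group access to a table; all of the names are assumptions.

```python
# Build out the hierarchy: metastore -> catalog -> schema -> securable objects.
spark.sql("CREATE CATALOG IF NOT EXISTS ml")
spark.sql("CREATE SCHEMA IF NOT EXISTS ml.churn")

# One permission model covers tables, views, volumes, functions and models.
spark.sql("GRANT USE CATALOG ON CATALOG ml TO `data-scientists`")
spark.sql("GRANT USE SCHEMA ON SCHEMA ml.churn TO `data-scientists`")
spark.sql("GRANT SELECT ON TABLE ml.churn.customer_features TO `data-scientists`")

# Every securable object is addressed with the three-level namespace.
df = spark.table("ml.churn.customer_features")
```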

Databricks has come up with what I consider a very elegant approach to incorporating feature engineering into its Unity Catalog governance umbrella, in a way that should be simple to implement at an enterprise level. Under the hood, Databricks leveraged how Unity Catalog already works to provide a path for using it as a feature store. As far as Unity Catalog is concerned, any Delta table with a primary key constraint can be a feature table. If you want a time-series feature table, just add a time column to the primary key with the keyword TIMESERIES (Databricks Runtime 13.3 and above). If you are on a non-ML Databricks runtime, you need to pip install the databricks-feature-engineering package; it is pre-installed on Databricks Runtime 13.2 ML and above.
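Here is a minimal sketch of what that looks like, with hypothetical catalog, schema and column names: the DDL creates a time-series feature table, and the client then upserts computed feature values into it.

```python
# On a non-ML runtime, install the client first:
#   %pip install databricks-feature-engineering
from databricks.feature_engineering import FeatureEngineeringClient

# Any Delta table with a primary key constraint is a feature table in Unity
# Catalog; tagging the time column with TIMESERIES makes it a time-series table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS ml.churn.customer_features (
        customer_id STRING NOT NULL,
        ts TIMESTAMP NOT NULL,
        txn_count BIGINT,
        avg_txn_amount DOUBLE,
        CONSTRAINT customer_features_pk PRIMARY KEY (customer_id, ts TIMESERIES)
    )
""")

fe = FeatureEngineeringClient()

# Upsert feature values computed earlier in the pipeline (assumed DataFrame).
fe.write_table(
    name="ml.churn.customer_features",
    df=customer_features_df,
    mode="merge",
)
```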

Conclusion

The intent of Databricks has always been to provide a unified platform for data and AI. However, actually realizing that vision once hands are on the keyboard is harder than it seems; Python and SQL have made data and AI the modern equivalent of “two nations separated by a common language”. Bringing feature engineering under the data governance model with such a light touch provides a very practical mechanism for building ML and AI solutions with the same level of security and governance as data solutions.

Contact us to learn more about how to build robust ML and AI solutions in Databricks for your organization.


