

How Automatic Liquid Clustering Supports Databricks FinOps at Scale

Perficient has a FinOps mindset with Databricks, so the Automatic Liquid Clustering announcement grabbed my attention.

I’ve mentioned Liquid Clustering before when discussing the advantages of Unity Catalog beyond governance use cases. Unity Catalog: come for the data governance, stay for the predictive optimization. I am usually a fan of being able to tune the dials of Databricks. In this case, Liquid Clustering addresses the data management and query optimization aspects of cost control so simply and elegantly that I’m happy to take my hands off the controls.

Manual Tuning: The Struggle Is Real

Experienced Databricks data engineers are familiar with partitioning and data-skipping strategies to increase performance and reduce costs for their workloads. These topics are even in the certification exams.

  • Partitioning involves taking a very large table (1TB or greater) and breaking it into smaller chunks (roughly 1GB each) based on one or more columns – best for low-cardinality columns.
  • Data-skipping uses statistics stored in the metadata of a table to intelligently find relevant data.
  • Z-Ordering goes further than data-skipping alone by co-locating related values of high-cardinality columns in the same files, improving I/O efficiency.

Partitioning is set on table creation, while Z-Order columns are applied with the OPTIMIZE command.
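
For illustration, here’s a minimal sketch of that manual workflow in Databricks SQL (the sales table and its columns are made up for the example):

CREATE TABLE sales (
  event_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
PARTITIONED BY (event_date);   -- low-cardinality column, chosen at creation

-- Later, as a scheduled maintenance job:
OPTIMIZE sales
ZORDER BY (customer_id);       -- co-locate a high-cardinality filter column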

Simple in theory; frustrating in practice.

In all fairness, I think most of us were partitioning wrong. In my case, I had originally approached partitioning a Delta table as if it were a Hive table or a Parquet file. This made intuitive sense to me as an early Spark developer, and I had deep knowledge of both architectures. Yet time and time again, I’d find myself staring wistfully into the middle distance through the ashes of another failed optimization attempt.

  • Queries slowed as access patterns evolved.
  • Optimization efforts produced inconsistent benefits.
  • Z-Ordering introduced write amplification and higher compute costs, since it isn’t incremental or on-write.

Databricks clearly saw that manual tuning didn’t scale. So they introduced a better way.

Ingestion Time Clustering: A Step in the Right Direction

Ingestion Time Clustering was introduced to address the issues with custom partitioning and Z-Ordering. The approach rests on Databricks’ observation that 51% of tables are partitioned on date/time keys. Now we have a solution for about half of our workloads, which is great. But what about the other half?

Liquid Clustering: Smarter, Broader Optimization

Liquid Clustering addresses use cases well beyond date/time partitioning, and removing partitioning’s limitations around concurrent writes was a big step forward in reliability. It is a better fit for:

  • Tables where access patterns change over time.
  • Tables where candidate keys would not produce well-sized partitions.
  • Tables filtered on high-cardinality columns, like Z-Ordering but without the extra write and compute costs.
  • Tables with significant skew or rapid growth.

Databricks recommends enabling Liquid Clustering for all Delta tables, including materialized views and streaming tables. The syntax is very straightforward:

CLUSTER BY (col1)

Seems pretty simple: use liquid clustering everywhere and just identify the column on which to cluster. How much simpler could it get?
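
Here’s a minimal sketch of what that looks like in practice, reusing the illustrative sales table from above; an existing unpartitioned Delta table can also be converted with ALTER TABLE:

-- New table clustered on a single key
CREATE TABLE sales (
  event_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
CLUSTER BY (customer_id);

-- Existing unpartitioned Delta table: switch it over, then cluster the data
ALTER TABLE sales CLUSTER BY (customer_id);
OPTIMIZE sales;   -- clustering is applied incrementally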

Automatic Liquid Clustering: Supporting Databricks FinOps at Scale

Now we find ourselves at a logical conclusion.

Unity Catalog collects statistics on managed tables and automatically identifies when OPTIMIZE, VACUUM, and ANALYZE maintenance operations should be run. Historical workloads for a managed table are analyzed asynchronously as an additional maintenance operation to inform candidates for clustering keys.
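
Predictive optimization is controlled at the catalog or schema level for Unity Catalog managed tables; a quick sketch of turning it on (the catalog and schema names here are illustrative):

-- Enable for everything in a catalog...
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

-- ...or scope it to a single schema
ALTER SCHEMA main.sales_db ENABLE PREDICTIVE OPTIMIZATION;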

You may have noticed from the syntax (CLUSTER BY (col1)) that Liquid Clustering is still vulnerable to changing access patterns invalidating the initial clustering key selection. With automatic key selection, clustering keys are changed when the predicted cost savings from data skipping outweigh the cost of re-clustering the data.

In other words,

CLUSTER BY AUTO
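
Applied to the illustrative sales table, that looks like this; Databricks chooses the clustering keys and revises them as the workload evolves:

-- New managed table, keys selected automatically
CREATE TABLE sales (
  event_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
CLUSTER BY AUTO;

-- Existing table: hand key selection over to Databricks
ALTER TABLE sales CLUSTER BY AUTO;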

Final Thoughts: Keep Calm and Cluster by Auto

Data is in a very exciting, but very tough, place right now. Mainstream corporate acceptance of AI/ML means data engineers need to work harder than ever to make large volumes of data from disparate sources available to everything from SQL warehouses to ML models to RAG and agentic solutions, all while maintaining and improving security and governance. Add in the downward pressure on budgets as cloud costs are perceived as too high, and hand-tuning table layouts is no longer where engineers add value.

Keep Calm and Cluster by Auto.

Want help implementing this in your Databricks environment?

Get in touch with us if you want to know more about how Automatic Liquid Clustering in Databricks could help you improve performance and bring costs down.

 


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating Modern Data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
