

How Automatic Liquid Clustering Supports Databricks FinOps at Scale

Perficient has a FinOps mindset with Databricks, so the Automatic Liquid Clustering announcement grabbed my attention.

I’ve mentioned Liquid Clustering before when discussing the advantages of Unity Catalog beyond governance use cases. Unity Catalog: come for the data governance, stay for the predictive optimization. I am usually a fan of being able to tune the dials of Databricks. In this case, Liquid Clustering addresses the data management and query optimization aspects of cost control so simply and elegantly that I’m happy to take my hands off the controls.

Manual Tuning: The Struggle Is Real

Experienced Databricks data engineers are familiar with partitioning and data-skipping strategies to increase performance and reduce costs for their workloads. These topics are even in the certification exams.

  • Partitioning involves taking a very large table (1TB or greater) and breaking it into smaller chunks (roughly 1GB each) based on one or more columns – best for low-cardinality columns.
  • Data-skipping uses statistics stored in the metadata of a table to intelligently find relevant data.
  • Z-Ordering goes further than data-skipping alone by co-locating related values of high-cardinality columns in the same files, improving I/O efficiency.

Partitioning is set on table creation, while Z-Order columns are applied with the OPTIMIZE command.
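
For illustration, here’s a minimal sketch of that manual workflow in Databricks SQL (the sales table and its columns are made up for the example):

CREATE TABLE sales (
  event_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
PARTITIONED BY (event_date);   -- low-cardinality column, chosen at creation

-- Later, as a scheduled maintenance job:
OPTIMIZE sales
ZORDER BY (customer_id);       -- co-locate a high-cardinality filter column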

Simple in theory; frustrating in practice.

In all fairness, I think most of us were partitioning wrong. In my case, I had originally approached partitioning a Delta table as if it were a Hive table or a Parquet file. This made intuitive sense to me as an early Spark developer, and I had deep knowledge of both architectures. Yet time and time again, I’d find myself staring wistfully into the middle distance through the ashes of another failed optimization attempt.

  • Queries slowed as access patterns evolved.
  • Optimization efforts produced inconsistent benefits.
  • Z-Ordering introduced write amplification and higher compute costs, since it isn’t incremental or on-write.

Databricks clearly saw that manual tuning didn’t scale. So they introduced a better way.

Ingestion Time Clustering: A Step in the Right Direction

Ingestion Time Clustering was introduced to address the issues with custom partitioning and Z-Ordering. The approach rests on Databricks’ observation that 51% of tables are partitioned on date/time keys. Now we have a solution for about half of our workloads, which is great. But what about the other half?

Liquid Clustering: Smarter, Broader Optimization

Liquid Clustering addresses use cases well beyond date/time partitioning, and removing partitioning’s limitations around concurrent writes was a big step forward in reliability. It is a better fit for:

  • Tables where access patterns change over time.
  • Tables where candidate keys would not produce well-sized partitions.
  • Tables filtered on high-cardinality columns, like Z-Ordering but without the extra write and compute costs.
  • Tables with significant skew or rapid growth.

Databricks recommends enabling Liquid Clustering for all Delta tables, including materialized views and streaming tables. The syntax is very straightforward:

CLUSTER BY (col1)

Seems pretty simple: use liquid clustering everywhere and just identify the column on which to cluster. How much simpler could it get?
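
Here’s a minimal sketch of what that looks like in practice, reusing the illustrative sales table from above; an existing unpartitioned Delta table can also be converted with ALTER TABLE:

-- New table clustered on a single key
CREATE TABLE sales (
  event_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
CLUSTER BY (customer_id);

-- Existing unpartitioned Delta table: switch it over, then cluster the data
ALTER TABLE sales CLUSTER BY (customer_id);
OPTIMIZE sales;   -- clustering is applied incrementally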

Automatic Liquid Clustering: Supporting Databricks FinOps at Scale

Now we find ourselves at a logical conclusion.

Unity Catalog collects statistics on managed tables and automatically identifies when OPTIMIZE, VACUUM, and ANALYZE maintenance operations should be run. Historical workloads for a managed table are analyzed asynchronously as an additional maintenance operation to inform candidates for clustering keys.
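
Predictive optimization is controlled at the catalog or schema level for Unity Catalog managed tables; a quick sketch of turning it on (the catalog and schema names here are illustrative):

-- Enable for everything in a catalog...
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

-- ...or scope it to a single schema
ALTER SCHEMA main.sales_db ENABLE PREDICTIVE OPTIMIZATION;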

You may have noticed from the syntax (CLUSTER BY (col1)) that Liquid Clustering is still vulnerable to changing access patterns invalidating the initial clustering key selection. With automatic key selection, clustering keys are changed when the predicted cost savings from data skipping outweigh the cost of re-clustering the data.

In other words,

CLUSTER BY AUTO
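
Applied to the illustrative sales table, that looks like this; Databricks chooses the clustering keys and revises them as the workload evolves:

-- New managed table, keys selected automatically
CREATE TABLE sales (
  event_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
CLUSTER BY AUTO;

-- Existing table: hand key selection over to Databricks
ALTER TABLE sales CLUSTER BY AUTO;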

Final Thoughts: Keep Calm and Cluster by Auto

Data is in a very exciting, but very tough, place right now. Mainstream corporate acceptance of AI/ML means data engineers need to work harder than ever to make large volumes of data from disparate sources available to everything from SQL warehouses to ML models to RAG and agentic solutions, all while maintaining and improving security and governance. Add in the downward pressure on budgets as cloud costs are perceived as too high, and hand-tuning table layouts is no longer where engineers add value.

Keep Calm and Cluster by Auto.

Want help implementing this in your Databricks environment?

Get in touch with us if you want to know more about how Automatic Liquid Clustering in Databricks could help you improve performance and bring costs down.

 


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating Modern Data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
