

Unity Catalog, the Well-Architected Lakehouse and Cost Optimization


I have written about the importance of migrating to Unity Catalog as an essential component of your Data Management Platform. Any migration exercise implies movement from a current to a future state. A migration from the Hive Metastore to Unity Catalog will require planning around workspaces, catalogs and user access. This is also an opportunity to realign some of your current practices that may be less than optimal with newer, better practices. In fact, some of these improvements might be easier to fund than a straight governance play. One comprehensive model to use for guidance is the Databricks well-architected lakehouse framework. I have discussed the seven pillars of the well-architected lakehouse framework in general and now I want to focus on cost optimization.

Cost Optimization Overview

I placed cost optimization as the first pillar to tackle because I have seen many governance-focused projects, like Unity Catalog migrations, fizzle out for lack of a clear funding imperative. I can usually get much more long-term support around cost reduction, particularly in the cloud. The principles of cost optimization in Databricks are relatively straightforward: start with a resource that aligns with the workload, dynamically scale as needed, and monitor and manage closely. Straightforward principles rarely mean simple implementations. I usually see workloads that were lifted and shifted from on-premises without being rewritten for the new environment, as well as new jobs that look like they were written from an on-premises perspective. Compute resources created a year and a half ago for one type of job get reused as a “best practice.” And there is no effective monitoring. Still, there are best practices that can be implemented fairly easily and can have a real impact on your bottom line.

Understand Your Resource Options

First, start using Delta as your storage format. Second, use up-to-date runtimes for your workloads. Third, use job compute instead of all-purpose compute for your jobs, and use SQL warehouses for your SQL workloads. You probably want serverless services for your BI workloads and for ML and AI model serving. You probably don’t need GPUs unless you are doing deep learning. Photon will probably help your more complex queries, and you just need to turn it on to find out. Using the most up-to-date instance types will usually give you a price/performance boost (e.g., AWS Graviton2 instances). Further optimization of instance types takes a little more thinking, but honestly not much more. By default, just use the latest general-purpose instance type. Use memory-optimized instances for ML workloads; some, but by no means all, ML and AI jobs might benefit from a GPU, so be careful. Use storage-optimized instances for ad-hoc and interactive data analysis, and compute-optimized instances for Structured Streaming and maintenance jobs.
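To make the job-compute and Photon advice concrete, here is a minimal sketch of creating a job through the Databricks Jobs API 2.1. The workspace URL, token, notebook path, runtime version and node type are all placeholders; list the valid values for your own workspace before using anything like this.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder

# Job compute (a "new_cluster" defined per task) is billed at the cheaper
# jobs rate, unlike an always-on all-purpose cluster.
job_spec = {
    "name": "nightly-delta-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},  # hypothetical
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",  # a recent LTS runtime
                "node_type_id": "m6gd.xlarge",        # AWS Graviton2 example
                "num_workers": 4,
                "runtime_engine": "PHOTON",           # turn Photon on to find out
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id
```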

The real work is in choosing the most efficient compute size. The first thing I usually recommend is to get as much work off the driver node as possible and push it to the worker nodes. As I mentioned before, executors perform transformations (map, filter, groupBy, sortBy, sample, randomSplit, union, distinct, coalesce, repartition), while actions (reduce, collect, count, min, max, sum, mean, stddev, variance, saveAs) trigger the job and return results to the driver. Once you make sure the workers are doing the right work, you need to know what they are working on. This involves understanding how much data you are consuming, how it is parallelized and partitioned, and how complex the processing is. My best advice here is to measure your jobs and identify the ones that need to be re-evaluated. Then limit your technical (and actual) debt going forward by enforcing best practices.
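As an illustration of keeping work off the driver, here is a small PySpark sketch; the Delta paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("/mnt/sales/transactions")  # hypothetical path

# Anti-pattern: collect() hauls every row back to the driver, which then
# aggregates single-threaded in Python.
# totals = {}
# for row in df.collect():
#     totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]

# Better: express the aggregation as transformations so the executors do
# the work in parallel; only the small summary ever touches the driver.
summary = (
    df.groupBy("region")
      .agg(F.sum("amount").alias("total_amount"),
           F.count("*").alias("txn_count"))
)

# Persist the result instead of collecting it.
summary.write.format("delta").mode("overwrite").save("/mnt/sales/region_totals")
```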

Make sure to dynamically allocate resources as much as possible: enable autoscaling, consider whether fixed resources can use spot instances, and set clusters to auto-terminate. Look into cluster pools, since Databricks does not charge DBUs while instances sit idle in the pool (though your cloud provider still bills for the running instances).
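Here is a sketch of what those settings look like in a Clusters API spec. The runtime version and node type are placeholders again, and the pool ID is hypothetical.

```python
# POST this to /api/2.0/clusters/create the same way as the job example above.
cluster_spec = {
    "cluster_name": "adhoc-analysis",
    "spark_version": "14.3.x-scala2.12",                # placeholder runtime
    "node_type_id": "i3.xlarge",                        # placeholder node type
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with the workload
    "autotermination_minutes": 30,                      # stop paying for idle clusters
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # spot pricing, on-demand fallback
        "first_on_demand": 1,                  # keep the driver on-demand
    },
    # Or draw instances from a pool for fast start-up
    # (omit node_type_id and set): "instance_pool_id": "<pool-id>",
}
```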

Monitor and Chargeback

The account console is your friend. Tag everyone that has their own checkbook. Consider cost control when you are setting up workspaces and clusters, and tag accordingly. You can tag clusters, SQL warehouses and pools. Some organizations don’t implement a chargeback model. That’s fine; start today. Seriously, no one scales without accountability. There is effort involved in designing cost-effective workloads, and you will see substantially optimized workloads once people start getting a bill. You’ll be surprised at how many “real-time” workflows can use the AvailableNow trigger once a dollar amount comes into play. A cost-optimization strategy is not the same as an executive blanket cost-reduction edict. Don’t take my word for it: if you don’t adopt the former, you will experience the latter.
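If Unity Catalog billing system tables are enabled in your account, a first-cut chargeback report can be as simple as grouping usage by a cluster tag. The "cost_center" tag below is hypothetical; the table and columns follow the documented system.billing.usage schema, but verify them in your own workspace.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

# Monthly DBU consumption per cost center, broken out by SKU.
monthly_by_team = spark.sql("""
    SELECT
        custom_tags['cost_center']       AS cost_center,   -- hypothetical tag
        date_trunc('MONTH', usage_date)  AS usage_month,
        sku_name,
        SUM(usage_quantity)              AS dbus
    FROM system.billing.usage
    GROUP BY 1, 2, 3
    ORDER BY usage_month, dbus DESC
""")
monthly_by_team.show(truncate=False)
```

And for the record, that AvailableNow trigger is spelled .trigger(availableNow=True) on a PySpark stream writer, so converting an always-on stream to a scheduled batch is often a one-line change.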

Conclusion

Cost optimization is typically the best pillar to focus on first, since it can have an immediate impact on the budget, even though the importance of a chargeback model gives it a steep political hill to climb. Re-evaluating workspace models is a big part of Unity Catalog migration preparation, and cost optimization through tagging can be an important part of that conversation.

 


