
Top 5 Mistakes That Make Your Databricks Queries Slow (and How to Fix Them)


I wanted to discuss the top five mistakes that make your Databricks queries slow as a prequel to some of my FinOps blogs. Premature optimization may or may not be the root of all evil, but we can all agree that optimization without a solid foundation is not an effective use of time and resources. Predictive Optimization cannot currently address data skew, select the best join strategy (although Photon can), optimize merge operations, or optimize most streaming operations. Databricks is a system with a lot of dials. Let's look at the top five mistakes I regularly see in practice.

1. Ignoring Data Skew

Mistake: Uneven distribution of data leading to some tasks taking significantly longer than others.

Solution: Monitor stages in the Spark UI to detect straggler tasks, and inspect join and aggregation keys for hot values or frequent NULLs that concentrate data in a few partitions.
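
For example, a quick way to surface skew is to count rows per join key, and Adaptive Query Execution (AQE) can split oversized partitions at join time. A minimal sketch in PySpark, assuming a hypothetical orders table keyed by customer_id:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()  # predefined in Databricks notebooks

    # Let AQE detect and split skewed partitions during sort-merge joins.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # Quick skew check: a handful of hot keys (or NULLs) dominating the counts
    # is the same pattern that shows up as straggler tasks in the Spark UI.
    orders = spark.table("orders")
    (orders.groupBy("customer_id")
           .count()
           .orderBy(F.desc("count"))
           .show(10))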

2. Suboptimal Join Strategies

Mistake: Using expensive join techniques without optimization, especially with large datasets or streaming data.

Solution: Match the join strategy to the data: broadcast small tables to avoid shuffles, let Adaptive Query Execution adjust plans at runtime, and use join hints when the optimizer chooses poorly.
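
As a concrete example, broadcasting a small dimension table avoids shuffling the large side entirely. A sketch, assuming hypothetical sales_fact and store_dim tables (spark is the ambient session, as in a Databricks notebook):

    from pyspark.sql.functions import broadcast

    fact = spark.table("sales_fact")   # large
    dim = spark.table("store_dim")     # small enough to fit on each executor

    # The hint forces a broadcast hash join instead of a shuffle-heavy
    # sort-merge join.
    joined = fact.join(broadcast(dim), "store_id")

    # Alternatively, raise the size threshold below which Spark broadcasts
    # automatically (value in bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))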

3. Inefficient Streaming Joins

Mistake: Improper handling of stream-stream and stream-static joins, leading to increased state management and latency.

Solution: Set appropriate watermarks to prevent unbounded state growth.
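
For instance, a stream-stream join needs a watermark on each side plus a time-range join condition so Spark knows when buffered state can be purged. A sketch, assuming hypothetical impressions and clicks streaming tables with an event_time column:

    from pyspark.sql import functions as F

    impressions = (spark.readStream.table("impressions")
        .selectExpr("ad_id AS impression_ad_id", "event_time AS impression_time")
        .withWatermark("impression_time", "10 minutes"))

    clicks = (spark.readStream.table("clicks")
        .selectExpr("ad_id AS click_ad_id", "event_time AS click_time")
        .withWatermark("click_time", "20 minutes"))

    # The time-range bound lets Spark drop impression state once no matching
    # click can still arrive, keeping state stores from growing without bound.
    joined = impressions.join(
        clicks,
        F.expr("""
            click_ad_id = impression_ad_id AND
            click_time BETWEEN impression_time AND
                               impression_time + interval 15 minutes
        """))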

4. Suboptimal Merge Operations

Mistake: Triggering excessive shuffle during merge operations by not using low-shuffle techniques.

Solution: Use Low Shuffle Merge, preferably by switching over to Delta Live Tables (DLT) or revisiting merge operations built before Databricks Runtime 10.4, where Low Shuffle Merge became the default. A minimal sketch follows the list below.

  • Merges only the changed data, reducing I/O and shuffle.
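
Here is a sketch of a routine upsert with Low Shuffle Merge, assuming hypothetical orders and orders_staging tables; on Databricks Runtime 10.4+ the feature is on by default, so the config line matters only on older runtimes:

    from delta.tables import DeltaTable

    # Opt in explicitly on runtimes before 10.4 (default thereafter).
    spark.conf.set("spark.databricks.delta.merge.enableLowShuffle", "true")

    target = DeltaTable.forName(spark, "orders")
    updates = spark.table("orders_staging")

    (target.alias("t")
        .merge(updates.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

Tightening the match condition with a partition or date predicate narrows the set of files the merge has to rewrite even further.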

5. Ignoring Join Performance Best Practices

Mistake: Using default settings without leveraging advanced features.

Solution: Use Photon to dynamically select the best join type; Unity Catalog helps maintain statistics but does not always handle join order effectively.

  • Enable Photon for vectorized execution.

  • Optimize Join Order: Always join smaller tables first and avoid cross joins.

  • Maintain Fresh Statistics: Use ANALYZE TABLE to help the optimizer make better decisions. Or, better yet, automate (see the sketch after this list).
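
A sketch of the statistics refresh, assuming a hypothetical orders table; in Unity Catalog, Predictive Optimization can schedule this kind of maintenance for you:

    # Column-level statistics feed the cost-based optimizer's join ordering.
    spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR ALL COLUMNS")

    # Sanity check: Photon operators appearing in the physical plan confirm
    # the query is running on the vectorized engine.
    (spark.table("orders")
          .join(spark.table("customers"), "customer_id")
          .explain())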

Conclusion

As an Elite Databricks Partner, we are here to help organizations keep costs under control as they get meaningful value from their data and AI assets.

Contact us to explore how we can help build performance and cost optimization tools and techniques into your data and AI pipeline.


David Callaghan, Senior Solutions Architect

Databricks Champion | Center of Excellence Lead | Data Privacy & Governance Expert | Speaker & Trainer | 30+ Yrs in Enterprise Data Architecture
