
Top 5 Mistakes That Make Your Databricks Queries Slow (and How to Fix Them)


I wanted to discuss the top five mistakes that make your Databricks queries slow as a prequel to some of my FinOps blogs. Premature optimization may or may not be the root of all evil, but we can all agree that optimization without a solid foundation is not an effective use of time and resources. Predictive Optimization cannot currently address data skew, select the best join strategy (although Photon can), optimize merge operations, or optimize most streaming operations. Databricks is a system with a lot of dials. Let's look at the top five mistakes I regularly see in practice.

1. Ignoring Data Skew

Mistake: Uneven distribution of data leading to some tasks taking significantly longer than others.

Solution: Monitor stages in the Spark UI to detect straggler tasks, and inspect join and aggregation keys for hot values or frequent NULLs that concentrate data in a few partitions.
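
For example, a quick way to surface skew is to count rows per join key, and Adaptive Query Execution (AQE) can split oversized partitions at join time. A minimal sketch in PySpark, assuming a hypothetical orders table keyed by customer_id:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()  # predefined in Databricks notebooks

    # Let AQE detect and split skewed partitions during sort-merge joins.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # Quick skew check: a handful of hot keys (or NULLs) dominating the counts
    # is the same pattern that shows up as straggler tasks in the Spark UI.
    orders = spark.table("orders")
    (orders.groupBy("customer_id")
           .count()
           .orderBy(F.desc("count"))
           .show(10))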

2. Suboptimal Join Strategies

Mistake: Using expensive join techniques without optimization, especially with large datasets or streaming data.

Solution: Match the join strategy to the data: broadcast small tables to avoid shuffles, let Adaptive Query Execution adjust plans at runtime, and use join hints when the optimizer chooses poorly.
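
As a concrete example, broadcasting a small dimension table avoids shuffling the large side entirely. A sketch, assuming hypothetical sales_fact and store_dim tables (spark is the ambient session, as in a Databricks notebook):

    from pyspark.sql.functions import broadcast

    fact = spark.table("sales_fact")   # large
    dim = spark.table("store_dim")     # small enough to fit on each executor

    # The hint forces a broadcast hash join instead of a shuffle-heavy
    # sort-merge join.
    joined = fact.join(broadcast(dim), "store_id")

    # Alternatively, raise the size threshold below which Spark broadcasts
    # automatically (value in bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))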

3. Inefficient Streaming Joins

Mistake: Improper handling of stream-stream and stream-static joins, leading to increased state management and latency.

Solution: Set appropriate watermarks to prevent unbounded state growth.
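
For instance, a stream-stream join needs a watermark on each side plus a time-range join condition so Spark knows when buffered state can be purged. A sketch, assuming hypothetical impressions and clicks streaming tables with an event_time column:

    from pyspark.sql import functions as F

    impressions = (spark.readStream.table("impressions")
        .selectExpr("ad_id AS impression_ad_id", "event_time AS impression_time")
        .withWatermark("impression_time", "10 minutes"))

    clicks = (spark.readStream.table("clicks")
        .selectExpr("ad_id AS click_ad_id", "event_time AS click_time")
        .withWatermark("click_time", "20 minutes"))

    # The time-range bound lets Spark drop impression state once no matching
    # click can still arrive, keeping state stores from growing without bound.
    joined = impressions.join(
        clicks,
        F.expr("""
            click_ad_id = impression_ad_id AND
            click_time BETWEEN impression_time AND
                               impression_time + interval 15 minutes
        """))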

4. Suboptimal Merge Operations

Mistake: Triggering excessive shuffle during merge operations by not using low-shuffle techniques.

Solution: Use Low Shuffle Merge, preferably by switching over to Delta Live Tables (DLT) or revisiting merge operations built before Databricks Runtime 10.4, where Low Shuffle Merge became the default. A minimal sketch follows the list below.

  • Merges only the changed data, reducing I/O and shuffle.
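
Here is a sketch of a routine upsert with Low Shuffle Merge, assuming hypothetical orders and orders_staging tables; on Databricks Runtime 10.4+ the feature is on by default, so the config line matters only on older runtimes:

    from delta.tables import DeltaTable

    # Opt in explicitly on runtimes before 10.4 (default thereafter).
    spark.conf.set("spark.databricks.delta.merge.enableLowShuffle", "true")

    target = DeltaTable.forName(spark, "orders")
    updates = spark.table("orders_staging")

    (target.alias("t")
        .merge(updates.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

Tightening the match condition with a partition or date predicate narrows the set of files the merge has to rewrite even further.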

5. Ignoring Join Performance Best Practices

Mistake: Using default settings without leveraging advanced features.

Solution: Use Photon to dynamically select the best join type; Unity Catalog helps maintain statistics but does not always handle join order effectively.

  • Enable Photon for vectorized execution.

  • Optimize Join Order: Always join smaller tables first and avoid cross joins.

  • Maintain Fresh Statistics: Use ANALYZE TABLE to help the optimizer make better decisions. Or, better yet, automate (see the sketch after this list).
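
A sketch of the statistics refresh, assuming a hypothetical orders table; in Unity Catalog, Predictive Optimization can schedule this kind of maintenance for you:

    # Column-level statistics feed the cost-based optimizer's join ordering.
    spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR ALL COLUMNS")

    # Sanity check: Photon operators appearing in the physical plan confirm
    # the query is running on the vectorized engine.
    (spark.table("orders")
          .join(spark.table("customers"), "customer_id")
          .explain())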

Conclusion

As an Elite Databricks Partner, we are here to help organizations keep costs under control as they get meaningful value from their data and AI assets.

Contact us to explore how we can help build performance and cost optimization tools and techniques into your data and AI pipeline.


David Callaghan, Senior Solutions Architect

Databricks Champion | Center of Excellence Lead | Data Privacy & Governance Expert | Speaker & Trainer | 30+ Yrs in Enterprise Data Architecture
