Skip to main content

Analytics

Take advantage of windows in your Spark data science pipeline

InForm

Windows can perform calculations across a certain time frame around the current record in your Spark data science pipeline. Windows are SQL functions that allow you to access data before and after the current record to perform calculations. They can be broken down into ranking and analytic functions and, like aggregate functions. Spark provides the following:

Ranking Functions

Command Return x value within a window partition
row_number() sequential number starting from 1
rank() rank of rows with gaps
dense_rank() rank of rows without any gaps.
percent_rank() percentile rank of rows
ntile(int) distributed rows into specified number of roughly equal buckets
cume_dist() cumulative distribution of values

Analytic Functions

Command Return x value within a window partition
lag() ‘offset’ rows before the current row or null
lead() ‘offset’ rows after the current row or null

Tuning

Most tuning considerations around Spark are concerned with actions done on the cluster by the executors. These transformations include map, filter, groupBy, sortBy, sample, randomSplit, union, distinct, coalesce, repartition. The driver represents a single machine and can be a bottleneck. Unfortunately, the the driver performs actions that include ranking, analytic and aggregation functions. A good practice is to perform these action in your data mungine pipeline before the data scientists use them in their modeling.

Summary

Tuning Spark for your machine learning pipeline can be a complex and time consuming process. Store and compute play a different role for your Spark cluster in different stages of your machine learning pipeline. In this post, I discuss how to optimize your Spark machine learning pipeline for aggregations. The performance considerations for windowing are similar.

 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

David Callaghan, Senior Solutions Architect

Databricks Champion | Center of Excellence Lead | Data Privacy & Governance Expert | Speaker & Trainer | 30+ Yrs in Enterprise Data Architecture

More from this Author

Follow Us