Windows let you perform calculations across a frame of rows around the current record in your Spark data science pipeline. Window functions are SQL functions that give you access to the rows before and after the current record so you can compute over them. They break down into ranking and analytic functions, and, as with aggregate functions, Spark provides the following:
Ranking Functions
| Command | Returns (within a window partition) |
| --- | --- |
| row_number() | sequential number, starting from 1 |
| rank() | rank of rows, with gaps after ties |
| dense_rank() | rank of rows, without gaps |
| percent_rank() | percentile rank of rows |
| ntile(n) | distributes rows into n roughly equal buckets |
| cume_dist() | cumulative distribution of values |
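As a quick illustration, here is a minimal PySpark sketch of the ranking functions over a window partitioned by department. The DataFrame and column names are hypothetical, chosen just for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-ranking").getOrCreate()

# Hypothetical toy data: (department, salary).
df = spark.createDataFrame(
    [("sales", 3000), ("sales", 4600), ("sales", 4100),
     ("hr", 3900), ("hr", 3900), ("hr", 3300)],
    ["department", "salary"],
)

# Partition by department, order by salary descending within each partition.
w = Window.partitionBy("department").orderBy(F.col("salary").desc())

ranked = (df
    .withColumn("row_number", F.row_number().over(w))      # 1, 2, 3, ...
    .withColumn("rank", F.rank().over(w))                   # ties share a rank, gaps follow
    .withColumn("dense_rank", F.dense_rank().over(w))       # ties share a rank, no gaps
    .withColumn("percent_rank", F.percent_rank().over(w))   # percentile rank in [0, 1]
    .withColumn("ntile", F.ntile(2).over(w))                # split each partition into 2 buckets
    .withColumn("cume_dist", F.cume_dist().over(w)))        # cumulative distribution

ranked.show()
```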
Analytic Functions
| Command | Returns (within a window partition) |
| --- | --- |
| lag(col, offset) | value of col from offset rows before the current row, or null |
| lead(col, offset) | value of col from offset rows after the current row, or null |
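A similar sketch for the analytic functions, this time ordering a small hypothetical price series by day (again, the data and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-analytic").getOrCreate()

# Hypothetical toy data: (day, price).
df = spark.createDataFrame(
    [("2024-01-01", 10.0), ("2024-01-02", 12.5), ("2024-01-03", 11.0)],
    ["day", "price"],
)

# An unpartitioned, ordered window; Spark warns about this on real data
# because it pulls everything into one partition, but it is fine here.
w = Window.orderBy("day")

shifted = (df
    .withColumn("prev_price", F.lag("price", 1).over(w))    # null on the first row
    .withColumn("next_price", F.lead("price", 1).over(w))   # null on the last row
    .withColumn("day_over_day", F.col("price") - F.lag("price", 1).over(w)))

shifted.show()
```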
Tuning
Most tuning considerations in Spark concern work done on the cluster by the executors. Transformations such as map, filter, groupBy, sortBy, sample, randomSplit, union, distinct, coalesce, and repartition are distributed across those executors. The driver, however, is a single machine and can become a bottleneck. Unfortunately, the actions that materialize the results of ranking, analytic, and aggregation functions funnel through the driver. A good practice is to perform these operations in your data munging pipeline before the data scientists use the results in their modeling, as sketched below.
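For example, a munging job might compute the expensive window columns once and persist them, so that downstream modeling code reads ready-made features instead of recomputing them. This is a sketch of that practice; the paths, columns, and schema below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("precompute-features").getOrCreate()

# Hypothetical input: an event log with user_id and a timestamp event_time.
events = spark.read.parquet("s3://example-bucket/events/")

w = Window.partitionBy("user_id").orderBy("event_time")

features = (events
    .withColumn("event_rank", F.row_number().over(w))       # order of events per user
    .withColumn("secs_since_prev",                           # gap to the previous event
                F.col("event_time").cast("long") -
                F.lag("event_time", 1).over(w).cast("long")))

# Write once in the munging pipeline; modeling jobs read this instead
# of re-running the window computation.
features.write.mode("overwrite").parquet("s3://example-bucket/features/")
```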
Summary
Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Storage and compute play different roles in your Spark cluster at different stages of your machine learning pipeline. In this post, I discuss how to optimize your Spark machine learning pipeline for aggregations; the performance considerations for windowing are similar.