Windows let you perform calculations across a frame of rows around the current record in your Spark data science pipeline. Window functions are SQL functions that give you access to the rows before and after the current record so you can compute over them. They break down into ranking and analytic functions, and, as with aggregate functions, Spark provides the following:
Ranking Functions
| Command | Returns (within a window partition) |
| --- | --- |
| row_number() | sequential number, starting from 1 |
| rank() | rank of rows, with gaps after ties |
| dense_rank() | rank of rows, without gaps |
| percent_rank() | percentile rank of rows |
| ntile(n) | distributes rows into n roughly equal buckets |
| cume_dist() | cumulative distribution of values |
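As a quick illustration, here is a minimal PySpark sketch of the ranking functions over a window partitioned by department. The DataFrame and column names are hypothetical, chosen just for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-ranking").getOrCreate()

# Hypothetical toy data: (department, salary).
df = spark.createDataFrame(
    [("sales", 3000), ("sales", 4600), ("sales", 4100),
     ("hr", 3900), ("hr", 3900), ("hr", 3300)],
    ["department", "salary"],
)

# Partition by department, order by salary descending within each partition.
w = Window.partitionBy("department").orderBy(F.col("salary").desc())

ranked = (df
    .withColumn("row_number", F.row_number().over(w))      # 1, 2, 3, ...
    .withColumn("rank", F.rank().over(w))                   # ties share a rank, gaps follow
    .withColumn("dense_rank", F.dense_rank().over(w))       # ties share a rank, no gaps
    .withColumn("percent_rank", F.percent_rank().over(w))   # percentile rank in [0, 1]
    .withColumn("ntile", F.ntile(2).over(w))                # split each partition into 2 buckets
    .withColumn("cume_dist", F.cume_dist().over(w)))        # cumulative distribution

ranked.show()
```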
Analytic Functions
| Command | Returns (within a window partition) |
| --- | --- |
| lag(col, offset) | value of col from offset rows before the current row, or null |
| lead(col, offset) | value of col from offset rows after the current row, or null |
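A similar sketch for the analytic functions, this time ordering a small hypothetical price series by day (again, the data and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-analytic").getOrCreate()

# Hypothetical toy data: (day, price).
df = spark.createDataFrame(
    [("2024-01-01", 10.0), ("2024-01-02", 12.5), ("2024-01-03", 11.0)],
    ["day", "price"],
)

# An unpartitioned, ordered window; Spark warns about this on real data
# because it pulls everything into one partition, but it is fine here.
w = Window.orderBy("day")

shifted = (df
    .withColumn("prev_price", F.lag("price", 1).over(w))    # null on the first row
    .withColumn("next_price", F.lead("price", 1).over(w))   # null on the last row
    .withColumn("day_over_day", F.col("price") - F.lag("price", 1).over(w)))

shifted.show()
```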
Tuning
Most tuning considerations in Spark concern work done on the cluster by the executors. Transformations such as map, filter, groupBy, sortBy, sample, randomSplit, union, distinct, coalesce, and repartition are distributed across those executors. The driver, however, is a single machine and can become a bottleneck. Unfortunately, the actions that materialize the results of ranking, analytic, and aggregation functions funnel through the driver. A good practice is to perform these operations in your data munging pipeline before the data scientists use the results in their modeling, as sketched below.
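For example, a munging job might compute the expensive window columns once and persist them, so that downstream modeling code reads ready-made features instead of recomputing them. This is a sketch of that practice; the paths, columns, and schema below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("precompute-features").getOrCreate()

# Hypothetical input: an event log with user_id and a timestamp event_time.
events = spark.read.parquet("s3://example-bucket/events/")

w = Window.partitionBy("user_id").orderBy("event_time")

features = (events
    .withColumn("event_rank", F.row_number().over(w))       # order of events per user
    .withColumn("secs_since_prev",                           # gap to the previous event
                F.col("event_time").cast("long") -
                F.lag("event_time", 1).over(w).cast("long")))

# Write once in the munging pipeline; modeling jobs read this instead
# of re-running the window computation.
features.write.mode("overwrite").parquet("s3://example-bucket/features/")
```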
Summary
Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Storage and compute play different roles in your Spark cluster at different stages of your machine learning pipeline. In this post, I discuss how to optimize your Spark machine learning pipeline for aggregations; the performance considerations for windowing are similar.