
Take advantage of windows in your Spark data science pipeline


Window functions perform calculations across a frame of rows around the current record in your Spark data science pipeline. These SQL functions let you access data before and after the current record, and, like aggregate functions, they come built into Spark. They break down into ranking and analytic functions. Spark provides the following:

Ranking Functions

Command          Returns (within a window partition)
row_number()     sequential row number, starting from 1
rank()           rank of rows, with gaps after ties
dense_rank()     rank of rows, without gaps after ties
percent_rank()   percentile rank of rows
ntile(n)         rows distributed into n roughly equal buckets
cume_dist()      cumulative distribution of values
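
As a minimal PySpark sketch of the ranking functions above, assuming a small hypothetical DataFrame with dept and amount columns (not from the original post):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-ranking").getOrCreate()

df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("a", 20), ("b", 5), ("b", 15)],
    ["dept", "amount"],
)

# Partition by department and order by amount; each function ranks rows
# within its own department only.
w = Window.partitionBy("dept").orderBy("amount")

df.select(
    "dept",
    "amount",
    F.row_number().over(w).alias("row_number"),      # 1, 2, 3, ...
    F.rank().over(w).alias("rank"),                  # ties share a rank; gaps follow
    F.dense_rank().over(w).alias("dense_rank"),      # ties share a rank; no gaps
    F.percent_rank().over(w).alias("percent_rank"),
    F.ntile(2).over(w).alias("ntile"),               # two roughly equal buckets
    F.cume_dist().over(w).alias("cume_dist"),
).show()
```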

Analytic Functions

Command   Returns (within a window partition)
lag()     the value 'offset' rows before the current row, or null
lead()    the value 'offset' rows after the current row, or null
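
A minimal sketch of lag() and lead(), assuming a hypothetical DataFrame of daily sensor readings (the column names are illustrative):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-analytic").getOrCreate()

df = spark.createDataFrame(
    [("s1", 1, 10.0), ("s1", 2, 12.5), ("s1", 3, 11.0)],
    ["sensor", "day", "reading"],
)

w = Window.partitionBy("sensor").orderBy("day")

df.select(
    "sensor",
    "day",
    "reading",
    # Value one row back; null for the first row in the partition.
    F.lag("reading", 1).over(w).alias("prev_reading"),
    # Value one row ahead; null for the last row in the partition.
    F.lead("reading", 1).over(w).alias("next_reading"),
).show()
```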

Tuning

Most tuning considerations around Spark concern transformations, which the executors run in parallel across the cluster: map, filter, groupBy, sortBy, sample, randomSplit, union, distinct, coalesce and repartition. The driver, by contrast, is a single machine and can become a bottleneck. Unfortunately, actions that collect the results of ranking, analytic and aggregation functions pull work back to the driver. A good practice is to perform these actions in your data munging pipeline before the data scientists use the results in their modeling.
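
As a hedged sketch of that practice, the PySpark snippet below computes the expensive window columns once during data munging and persists them, so modeling jobs read precomputed features instead of re-running the windows. The paths, column names and schema are hypothetical placeholders, not from the original post:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("munging-pipeline").getOrCreate()

# Hypothetical input: raw events with user_id, event_time and value columns.
raw = spark.read.parquet("/data/raw/events")

w = Window.partitionBy("user_id").orderBy("event_time")

features = raw.select(
    "user_id",
    "event_time",
    "value",
    F.lag("value", 1).over(w).alias("prev_value"),
    F.row_number().over(w).alias("event_seq"),
)

# Repartition for downstream readers and write once; data scientists then
# load this table rather than recomputing the window functions.
features.repartition("user_id").write.mode("overwrite").parquet(
    "/data/features/events_windowed"  # hypothetical output path
)
```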

Summary

Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Storage and compute play different roles in your Spark cluster at different stages of your machine learning pipeline. In this post, I discuss how to optimize your Spark machine learning pipeline for aggregations; the performance considerations for windowing are similar.


