
Take advantage of windows in your Spark data science pipeline


Window functions perform calculations across a frame of rows around the current record in your Spark data science pipeline. They are SQL functions that allow you to access data before and after the current record to perform calculations. They can be broken down into ranking and analytic functions, and, as with aggregate functions, Spark provides the following:

Ranking Functions

Command         Returns (within a window partition)
row_number()    sequential number starting from 1
rank()          rank of rows, with gaps for ties
dense_rank()    rank of rows, without gaps
percent_rank()  percentile rank of rows
ntile(n)        rows distributed into n roughly equal buckets
cume_dist()     cumulative distribution of values
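To make the difference between rank() and dense_rank() concrete, here is a minimal pure-Python sketch of their semantics (not Spark code; in Spark these are functions from pyspark.sql.functions applied over a Window spec):

```python
from itertools import groupby

def rank_with_gaps(values):
    """rank(): tied values share a rank; the next rank skips ahead (gaps)."""
    ranks, r = {}, 1
    for v, group in groupby(sorted(values)):
        ranks[v] = r
        r += len(list(group))          # skip past the tied rows
    return [ranks[v] for v in values]

def dense_rank(values):
    """dense_rank(): tied values share a rank; no gaps afterwards."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]

scores = [80, 90, 90, 100]
print(rank_with_gaps(scores))  # [1, 2, 2, 4]  <- gap: no rank 3
print(dense_rank(scores))      # [1, 2, 2, 3]  <- no gap
```

The two 90s tie at rank 2 in both cases; only rank() leaves the gap at 3.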

Analytic Functions

Command   Returns (within a window partition)
lag()     the value 'offset' rows before the current row, or null
lead()    the value 'offset' rows after the current row, or null
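The lag() and lead() semantics can likewise be sketched in plain Python (for illustration only; in Spark you would call pyspark.sql.functions.lag/lead over a Window ordered by a time column):

```python
def lag(values, offset=1, default=None):
    """lag(): value `offset` rows before the current row, else default (null)."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    """lead(): value `offset` rows after the current row, else default (null)."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

daily_sales = [10, 12, 9, 15]
print(lag(daily_sales))   # [None, 10, 12, 9]
print(lead(daily_sales))  # [12, 9, 15, None]
```

Rows at the edge of the window have no neighbor at the given offset, so they get null (None here).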

Tuning

Most tuning considerations in Spark concern work done on the cluster by the executors. Transformations such as map, filter, groupBy, sortBy, sample, randomSplit, union, distinct, coalesce, and repartition run there. The driver, by contrast, is a single machine and can become a bottleneck. Unfortunately, the driver performs actions that include ranking, analytic, and aggregation functions. A good practice is to perform these actions in your data munging pipeline before the data scientists use the results in their modeling.
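One way to follow that advice is to materialize window-derived features once, during munging, so each modeling experiment reads the precomputed column instead of re-running the window. A hypothetical pure-Python sketch of the idea (the function name and column naming are illustrative, not a Spark API):

```python
def add_lag_feature(rows, key, offset=1):
    """Enrich each row with a `<key>_lag<offset>` column, computed once
    so downstream models reuse it rather than recomputing the window."""
    out, prev = [], [None] * offset   # sliding buffer of prior values
    for row in rows:
        enriched = dict(row)
        enriched[f"{key}_lag{offset}"] = prev[0]
        prev = prev[1:] + [row[key]]
        out.append(enriched)
    return out

rows = [{"sales": 10}, {"sales": 12}, {"sales": 9}]
print(add_lag_feature(rows, "sales"))
# [{'sales': 10, 'sales_lag1': None},
#  {'sales': 12, 'sales_lag1': 10},
#  {'sales': 9, 'sales_lag1': 12}]
```

In a real pipeline the equivalent step would be a single withColumn over a Window, written out to storage before the modeling stage begins.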

Summary


Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Storage and compute play different roles in your Spark cluster at different stages of your machine learning pipeline. In this post, I discussed how to optimize your Spark machine learning pipeline for aggregations. The performance considerations for windowing are similar.

About the Author

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain, and cloud, coding in Java, Scala, and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB, and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud, with an emphasis on integrating modern data products such as HBase, Cassandra, and Neo4j as the off-chain repository.
