Tuning Spark for your machine learning pipeline can be a complex and time-consuming process. Store and compute play different roles in your Spark cluster at different stages of your machine learning pipeline, and the Spark defaults are never the right way to go. It makes more sense to know which settings are most effective at each stage of your pipeline: data munging, training, and modeling.
How to start
Identify what type of task you are performing. Specifically, will Spark run the work on the executors or on the driver? Transformations (map, filter, groupBy, sortBy, sample, randomSplit, union, distinct, coalesce, repartition) run on the executors, while actions (reduce, collect, count, min, max, sum, mean, stddev, variance, saveAs) trigger a job and return results to the driver. Make sure to run the explain plan at different stages to verify that what you think is happening is actually happening, and pay careful attention to the size of the data being moved around at each stage. Once you move off the defaults, Spark settings are typically calculated from a formula based on your data and cluster sizes, so you need confidence in your numbers.
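As a minimal sketch of that habit, here is how you might inspect the plan of a transformation chain before triggering an action; the dataset path and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-check").getOrCreate()

# Hypothetical raw dataset; nothing below executes until an action is called.
events = spark.read.parquet("/data/raw/events")

daily_counts = (
    events
    .filter(F.col("event_type") == "click")   # transformation, runs on executors
    .groupBy("event_date")
    .count()
)

# Print the parsed, analyzed, optimized, and physical plans. Look for the
# Exchange operators to see where shuffles will happen before an action
# such as .collect() pulls results back to the driver.
daily_counts.explain(True)
```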
Data Munging
Data munging seems to break the most rules. The standard advice is to avoid operations that pull work onto the driver, but those operations are the heart of data science. Rather than tell your team to avoid aggregations, build them into your pipeline: in our data munging steps we precalculate the resource-intensive aggregations and store the results as Parquet in the raw zone.
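A hedged sketch of that pattern, with hypothetical paths and column names, might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("precompute-aggregations").getOrCreate()

# Hypothetical raw-zone dataset.
transactions = spark.read.parquet("/data/raw/transactions")

# Resource-intensive aggregation computed once, up front.
customer_stats = (
    transactions
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("txn_count"),
        F.sum("amount").alias("total_amount"),
        F.avg("amount").alias("avg_amount"),
    )
)

# Persist the result as Parquet so downstream notebooks read the precomputed
# table instead of re-running the aggregation every time.
customer_stats.write.mode("overwrite").parquet("/data/raw/customer_stats")
```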
Shuffling is considered bad since it involves disk I/O, data serialization, and network I/O. Spark organizes the data using a set of map tasks and then uses a set of reduce tasks to aggregate it. Results from individual map tasks are kept in memory until they no longer fit, at which point they spill to disk. While in memory they are sorted based on the target partition and written to a single file, which is then read by the reduce tasks. The trick becomes ensuring the data doesn't spill to disk. Spark splits the data into partitions, and each partition is computed on a single executor thread.
Identify the size of your cluster and the size of your data. In the data munging stage, where shuffling is common, it makes sense to create a salting strategy that produces partitions that fit in your cluster, assuming one partition per executor thread. Salting the dataset to provide predictable partitioning is a useful rule of thumb; there is no reason to work out whether the data will be skewed on a natural partition key. Once the data size is known, set the appropriate Spark config settings, such as spark.reducer.maxSizeInFlight and spark.reducer.maxReqsInFlight, and repartition the data so all values for the same key land in the same partition on one executor before a shuffling operation.
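The sketch below shows one way to apply that advice, assuming a cluster with roughly 200 executor threads; the salt count, config values, and column names are illustrative assumptions, not prescriptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("salted-shuffle")
    # Tune shuffle fetch sizes once you know how much data will move
    # (values here are placeholders, not recommendations).
    .config("spark.reducer.maxSizeInFlight", "96m")
    .config("spark.reducer.maxReqsInFlight", "256")
    .getOrCreate()
)

NUM_SALTS = 200  # roughly one partition per executor thread, per the rule of thumb

orders = spark.read.parquet("/data/raw/orders")   # hypothetical dataset

# Add a random salt so a skewed key is spread across NUM_SALTS partitions.
salted = orders.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

aggregated = (
    salted
    # Move all values for the same (key, salt) pair onto one executor
    # before the shuffle-heavy aggregation runs.
    .repartition(NUM_SALTS, "customer_id", "salt")
    .groupBy("customer_id", "salt")
    .agg(F.sum("amount").alias("partial_amount"))
    # Second pass collapses the salted partials back to one row per key.
    .groupBy("customer_id")
    .agg(F.sum("partial_amount").alias("total_amount"))
)
```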
There are also some hardware considerations. You really need to put /var/lib/spark on a big, independent SSD and map spark.local.dir there. This is the temporary storage location Spark uses for long-running jobs, and most data munging jobs fall into that category. Another consideration is that bigger servers, especially for the driver, are in order.
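A minimal sketch of wiring that up in the session builder follows; note that in cluster deployments the cluster manager's local-directory setting (for example SPARK_LOCAL_DIRS or YARN's local dirs) can override this value, so spark-defaults.conf or the environment is often the better place for it:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-dir-on-ssd")
    # Point Spark's scratch space (shuffle spills, temp files) at the SSD mount.
    .config("spark.local.dir", "/var/lib/spark")
    .getOrCreate()
)
```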
Training
We want to leave the data arranged nicely for the next step in the data science pipeline: training. Training is the most storage-intensive step. Spread the data evenly across the cluster, aiming for a partitioning and bucketing strategy that lets Spark process the data in 128MB chunks. Partitioning distributes the data horizontally, while bucketing decomposes it into more manageable chunks based on the hashed value of a column. Store the output on less expensive storage. Hopefully, we have already done the CPU-intensive work while munging the data; more, smaller servers are better at this stage, where parallelism is key.
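A sketch of writing the munged data out partitioned and bucketed for training; the table name, columns, and bucket count are assumptions you would size against your own data so files land near the 128MB target:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("training-layout")
    .enableHiveSupport()   # a metastore is assumed for a persistent bucketed table
    .getOrCreate()
)

features = spark.read.parquet("/data/munged/features")   # hypothetical path

(
    features
    .write
    .mode("overwrite")
    .partitionBy("event_date")        # horizontal split across the cluster
    .bucketBy(64, "customer_id")      # hash of the column into a fixed number of buckets
    .sortBy("customer_id")
    .format("parquet")
    # bucketBy only works with saveAsTable, not with a plain save() to a path.
    .saveAsTable("training_features")
)
```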
Modeling
Finally, we arrive at the modeling stage. Here, I have to leave you with the tools you learned in the prior two steps; honestly, the models are too different to give general advice. Deep learning likes GPUs, H2O.ai tends to need three times the storage I usually allocate, and random forests play nicely in the same cluster we set up for training. You have learned enough from analyzing execution plans and staring at memory management reports to have a feel for your cluster. Use a limited number of MLlib algorithms at first. As a general rule of thumb, verify that store is feeding compute through proper parallelization before spending more on compute.
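As a hedged example of starting small with MLlib, here is a single random forest trained on the bucketed table from the previous sketch; the feature and label column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

spark = SparkSession.builder.appName("first-model").getOrCreate()

# Hypothetical training table written in the training-layout sketch above,
# with a precomputed "label" column assumed to exist.
training = spark.table("training_features")

assembler = VectorAssembler(
    inputCols=["txn_count", "total_amount", "avg_amount"],
    outputCol="features",
)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

# One algorithm, one pipeline: enough to watch how the partitioned data
# feeds the executors before adding more models or more compute.
model = Pipeline(stages=[assembler, rf]).fit(training)
```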
Conclusion
Distributed processing is hard. Machine learning is hard. Distributed machine learning is quite hard. Data scientists are used to working on a single machine with a small dataset, so make sure there are as few speed bumps as possible during the transition. Correctly sizing your Spark cluster to accommodate the different store and compute needs of each part of the machine learning pipeline is a good start.