Frequently, the “crawl, walk, run, fly” metaphor is used when describing the path to implementing a scalable data science practice. There are a lot of problems with this concept, not the least of which is the fact that there is already motion involved. People are already doing BI work, often complex work enabling high-value results. Picture someone running on a treadmill that suddenly switches to a walk. Also, the complexity and scale of the algorithm are not necessarily in direct proportion to the business value. Sometimes regulators demand that report in SAS and that’s how we keep the lights on. Is mission-critical work considered crawling? Oh, and people can’t fly. A better model is the law of conservation of angular momentum.
Formally model your path
How does the law of conservation of angular momentum apply to scaling your data science practice? The law states:
When the net external torque acting on a system about a given axis is zero, the total angular momentum of the system about that axis remains constant.
“But wait, there is funding and executive backing to get us out of the dark ages and doing modern analytics!” Not really. You still pretty much have the same people, the same budget and the same organization. You just have a new, hard-to-use tool that most of the users didn’t ask for. Don’t kid yourself; you have no torque. But you have a formula: L = mvr
L = angular momentum, m = mass, v = velocity, r = radius
With no external torque, L stays constant, so mass, velocity and radius trade off against one another: raise one and at least one of the others has to come down. Basically, you have some limiting factors in your organization: number of skilled users, system resources, potential high-value targets. You don’t have to move the speed of the whole system from crawl to walk to run. You need to identify which values you can move up and which you should leave untouched or even scale back.
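To make the trade-off concrete, here is a minimal sketch in Python. The function name velocity_for and every number in it are illustrative assumptions, not measurements from a real organization; the only thing taken from the text is the relationship L = mvr with L held constant.

```python
def velocity_for(angular_momentum: float, mass: float, radius: float) -> float:
    """Solve L = m * v * r for v, holding L fixed (no external torque)."""
    if mass <= 0 or radius <= 0:
        raise ValueError("mass and radius must be positive")
    return angular_momentum / (mass * radius)


# Hypothetical starting point: 10 users (mass), velocity 6, radius 2.
L = 10 * 6 * 2  # 120, the budget that stays constant

# Onboarding more users at the same radius forces velocity down.
print(velocity_for(L, mass=20, radius=2))  # 3.0

# Keeping 10 users but pushing out to radius 4 also halves velocity.
print(velocity_for(L, mass=10, radius=4))  # 3.0

# Scaling back to 5 users buys velocity 12 at the original radius.
print(velocity_for(L, mass=5, radius=2))  # 12.0
```

The units are arbitrary. The point is only that, with L fixed, every gain in one factor has to be paid for by another.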
This is a user-centric model: users are measured in units of technical debt. Your first targets to onboard to the new platform are typically your current data scientists. The idea is that if people are already using Python or R on their workstation, they will find it easy to adapt to a big data platform. This is not exactly how it works in practice. The reality is that data science in Spark has a learning curve, for data scientists, administrators and the business. There are ways to minimize the impact.
Mass is the measure of resistance to acceleration when a net force is applied.
We can measure mass by the number of users actively generating value from the platform. Assuming we can’t control that number or, honestly, the net force applied, we are left minimizing the resistance.
There is a learning curve when moving from working with datasets that fit into a laptop’s memory to working with large, distributed datasets. The first habit that must be forcibly broken is thinking locally, which in Python means relying on pandas. When you process a dataset in pandas, you are using the resources of the edge node. This is a shared resource, so a large pandas job can make other people’s jobs fail. Spark DataFrames have a lot in common with pandas DataFrames, which flattens the learning curve for basic operations, but concepts like coalesce vs repartition and partitionBy vs bucketBy are unique to distributed datasets. R has the same problems, but more people use Python in Spark because it’s better supported in MLlib and other technologies like Spark Streaming.
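The distributed-only concepts named above can be sketched in PySpark as follows. This is a rough illustration, not a tuning guide: the helper names (target_partitions, rebalance, shrink, write_by_directory, write_bucketed) are hypothetical, and the figure of roughly 128 MiB per partition is an assumed rule of thumb. The DataFrame methods themselves (repartition, coalesce, write.partitionBy, write.bucketBy) are standard PySpark calls.

```python
import math
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pyspark is only needed for the type hints below
    from pyspark.sql import DataFrame


def target_partitions(total_bytes: int, bytes_per_partition: int = 128 * 1024**2) -> int:
    """Assumed rule of thumb: about one partition per 128 MiB of data."""
    return max(1, math.ceil(total_bytes / bytes_per_partition))


def rebalance(df: "DataFrame", n: int, key: str) -> "DataFrame":
    """repartition() does a full shuffle: it can raise or lower the
    partition count and places rows that share `key` together."""
    return df.repartition(n, key)


def shrink(df: "DataFrame", n: int) -> "DataFrame":
    """coalesce() merges existing partitions without a full shuffle,
    so it can only lower the partition count."""
    return df.coalesce(n)


def write_by_directory(df: "DataFrame", path: str, key: str) -> None:
    """partitionBy() writes one directory per distinct value of `key`,
    which suits low-cardinality columns such as a date."""
    df.write.mode("overwrite").partitionBy(key).parquet(path)


def write_bucketed(df: "DataFrame", table: str, key: str, buckets: int) -> None:
    """bucketBy() hashes `key` into a fixed number of buckets, which suits
    high-cardinality join keys. It only works with saveAsTable()."""
    df.write.mode("overwrite").bucketBy(buckets, key).sortBy(key).saveAsTable(table)
```

A typical use might be shrink(df, target_partitions(estimated_bytes)) just before a write, so that a job does not leave thousands of tiny files behind. None of this has a pandas equivalent, which is exactly why it has to be learned.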
Limit the number of users that you onboard to the platform. Once they aren’t using up all the resources on the edge node or getting out of memory errors on the cluster, you are ready to expand. When there are enough data scientists who understand and embrace the distributed model that they can overcome the resistance of other data scientists in the organization, you can open up the environment. You have a limited amount of mass that you can move in terms of technical debt so you need to transfer that load first.
Velocity is the speed in a given direction and speed is the distance traveled over a certain time.
You need to be able to carefully measure and manage the resources of your machine learning pipeline in order to control the velocity.
First, consider access control. If your users need to ssh into a terminal session on the edge node to kinit every 24 hours, you are limited. If there is SSO through LDAP, but their data ACL did not migrate to the new system, you get frustration.
Next, we need to minimize the amount of time it takes users to complete a task in the new system as opposed to the old system by minimizing speed bumps. Data is usually as much the problem as the solution here. People can quickly get data if they are already familiar with the schema and purpose or are one person removed from that knowledge. Getting access to the data people are used to getting must be just as easy in the new system. In distributed systems, it’s best to think column-oriented when you store data and row-oriented when you present data. Don’t skimp on the presentation layer here. People will come for current state data. They will stay for the additional data.
Finally, you need to get control of your Spark environment. This is a complex and time consuming process I describe here:
Once jobs are going through the entire data science pipeline without unplanned admin intervention, you are ready to service new business use cases on demand.
The radius is the measurement from center to the perimeter.
Here is where we get into what some consider data science democratization. I consider this to be the question of how you move from business as usual to more advanced cases: how you move from the center, where all the current work is being done, to the edge. I consider this to be a tools issue far more than mass or velocity are. It’s a good idea to have a GUI for everyone.
For business analysts, give them the tools they are used to, pointed at the new location. For example, Tableau can point to Athena, which is reading Parquet files on S3. Just make sure that the data is easy to access and to share with senior managers who do not have the tool. I would much rather limit enterprise licenses to just a few departments, so that analysts there can interact with the data live in meetings with senior management, than have everyone paste graphics into PowerPoint.
For data scientists, it’s best to give them a tool they are NOT used to working with, like SageMaker Studio in AWS. There can be a lot of frustration in having to go back and retool existing code to work on a new platform. I’ve gotten better results by using a new tool on the new platform and getting some wins going from the center to the edge. Once they want more control than a tool offers, the perception around making the changes to the underlying R or Python code is completely different.
Make sure you have your infrastructure issues under control through managing mass and velocity before you turn the tools loose. But keep in mind this is where the business wants to be. If you are experiencing too many issues converting Python to PySpark or building a custom data science pipeline, you may want to consider just using cloud-native tools. Remember, the formula is a relationship: you can get so bogged down in the technical implications of increasing mass and velocity that you forget radius is the key business metric.
Adding a distributed data science environment to your enterprise is a slow, methodical process. The task is not to bring the whole enterprise along in lockstep. This just leads to failed POCs and enterprise-grade science projects. Respect the scarcity of resources. Change, measure, repeat.