I help companies build out, manage and hopefully get value from large data stores. Or at least, I try. In order to get value from these petabyte-scale datastores, I need the data scientists to be able to easily apply their statistical and domain knowledge. There’s one fundamental problem: large datasets are always distributed and data scientists work on single-machine datasets. With pandas. Always.
Most data scientists take pandas for granted. Loading a CSV file into a dataframe using pandas may have been their first line of code ever. But pandas is a single-node library. This means it runs on the driver in a Spark cluster rather than on all the worker nodes. So it doesn’t scale. Then it becomes my job to try to take their pandas away.
Typically, I try to convince them that PySpark is really easy and does everything that pandas can do, but distributed across all the nodes. A statement both true and unconvincing. The unfortunate result is typically a stalemate that turns a once-promising initiative into a science project. My position was that their code just needed some small rewrites in order to work. Their position was that their existing code not only needed to be rewritten, but all future code would need to be written using an unfamiliar tool. In the case of a draw, the status quo usually wins. And I get that. But I’ve seen enough successful large-scale data deployments to know that the gains in value and insight more than outweigh the short-term pain.
Mining massive datasets is more interesting than mining small datasets. Statistically significant size differences matter. Reproducible research is more meaningful when the experiment provides a parameter-driven data extraction process rather than just the same dataset. p-hacking is harder to hide.
All that’s behind me now. (Updated: Koalas is officially included in PySpark as of Apache Spark 3.2. For Apache Spark 3.2 and above, please use PySpark directly. Otherwise, install with pip as shown below:)
pip install koalas
import databricks.koalas as ks
Getting past pandas
I am a big fan of Databricks. I like their platform but I am passionate about open-source code that makes my life measurably easier. They have more than one project like this, but I’m talking about Koalas. Koalas is an (almost) drop-in replacement for pandas. There are some differences, but these are mainly around the fact that you are working on a distributed system rather than a single node. For example, the sort order is not guaranteed. Once you are more familiar with distributed data processing, this is not a surprise.
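To make "drop-in" a bit more concrete, here is a minimal sketch rather than anything official. The helper name average_by_group and the sample records are made up for illustration. The function takes the dataframe library as an argument, so the same body should work whether you hand it pandas or, on a machine with Spark available, Koalas. It sorts explicitly because, as noted above, a distributed backend does not promise row order.

```python
import pandas as pd


def average_by_group(lib, records):
    """Mean of `value` per `group`, built with any pandas-compatible module."""
    df = lib.DataFrame(records)
    means = df.groupby("group")["value"].mean()
    # Ask for the ordering explicitly; a distributed backend does not
    # promise row order unless you sort.
    return means.sort_index().to_dict()


# Hypothetical sample data, small enough to run anywhere.
records = {"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]}

# Single-node run with pandas.
result = average_by_group(pd, records)
print(result)

# Assumed swap on a machine with Spark available (not run here):
#   import databricks.koalas as ks      # Spark below 3.2
#   import pyspark.pandas as ks         # Spark 3.2 and above
#   average_by_group(ks, records)
```

The only line a data scientist would have to change is the import, which is the whole argument in miniature.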
The point is, now people can become familiar with distributed data processing.
PySpark is still the best tool to use to develop Spark applications in Python. However, data scientists were wasting cycles needlessly learning a new library that didn’t directly bring any value. By providing a drop-in replacement for their most common data management tool, we can short-circuit a frustrating and fruitless conversation and start getting to work.
Just go play with it here.