
Koalas are better than Pandas (on Spark)


I help companies build out, manage and hopefully get value from large data stores. Or at least, I try. In order to get value from these petabyte-scale data stores, I need the data scientists to be able to easily apply their statistical and domain knowledge. There’s one fundamental problem: large datasets are always distributed, and data scientists work on single-machine datasets. With pandas. Always.

Most data scientists take pandas for granted. Loading a CSV file into a dataframe with pandas may have been their first line of code ever. But pandas is a single-node library: on a Spark cluster, it runs only on the driver, not across the worker nodes, so it doesn’t scale. Then it becomes my job to try to take their pandas away.

Typically, I try to convince them that PySpark is really easy and does everything pandas can do, only distributed across all the nodes. A statement both true and unconvincing. The unfortunate result is typically a stalemate that turns a once-promising initiative into a science project. My position was that their code just needed some small rewrites in order to work. Their position was that their existing code not only needed to be rewritten, but all future code would need to be written using an unfamiliar tool. In the case of a draw, the status quo usually wins. And I get that. But I’ve seen enough successful large-scale data deployments to know the gains in value and insight more than outweigh the short-term pain.

Mining massive datasets is more interesting than mining small ones. Statistically significant size differences matter. Reproducible research is more meaningful when the experiment provides a parameter-driven data extraction process rather than just the same dataset. p-hacking is harder to hide.

All that’s behind me now. (Update: Koalas is officially included in PySpark as of Apache Spark 3.2. For Spark 3.2 and above, use PySpark’s pandas API directly. Otherwise, install with pip as shown below.)

pip install koalas 

import databricks.koalas as ks
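
If you’re on Spark 3.2 or later, skip the pip install; the same API ships inside PySpark itself as the pandas API on Spark, and only the import changes:

import pyspark.pandas as ps  # the pandas API on Spark, Koalas' successor bundled with Spark 3.2+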

Getting past pandas

I am a big fan of Databricks. I like their platform, but I am passionate about open-source code that makes my life measurably easier. They have more than one project like this, but I’m talking about Koalas. Koalas is an (almost) drop-in replacement for pandas. There are some differences, but these are mainly around the fact that you are working on a distributed system rather than a single node. For example, the sort order is not guaranteed. Once you are more familiar with distributed data processing, this is not a surprise.
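
To make “drop-in” concrete, here’s a minimal sketch; the file path and column names are hypothetical, but every call has the same shape as its pandas counterpart:

import databricks.koalas as ks

# Reads the CSV in parallel across the cluster, not on a single machine
df = ks.read_csv("/data/transactions.csv")  # hypothetical path

# The familiar pandas groupby/aggregate pattern, executed on Spark
totals = df.groupby("customer_id")["amount"].sum()

# Row order is not guaranteed on a distributed system, so sort explicitly
print(totals.sort_index().head(10))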

The point is, now people can become familiar with distributed data processing.

PySpark is still the best tool for developing Spark applications in Python. For data scientists, however, learning a new library up front was wasted effort that didn’t directly bring any value. By providing a drop-in replacement for their most common data management tool, we can short-circuit a frustrating and fruitless conversation and get straight to work.
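
And when part of a pipeline genuinely needs the full PySpark API, Koalas converts to and from Spark DataFrames directly. A sketch, again with a hypothetical path and columns:

import databricks.koalas as ks

kdf = ks.read_csv("/data/transactions.csv")  # hypothetical path

# Drop down to a native Spark DataFrame for PySpark-specific work
sdf = kdf.to_spark()
sdf = sdf.filter(sdf.amount > 100)

# Then return to the pandas-style API; importing koalas adds
# to_koalas() to Spark DataFrames
kdf = sdf.to_koalas()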

Getting started

Just pip install koalas and go play with it.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud, with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4j as the off-blockchain repository.
