
Koalas are better than Pandas (on Spark)


I help companies build out, manage and hopefully get value from large data stores. Or at least, I try. In order to get value from these petabyte-scale data stores, I need the data scientists to be able to easily apply their statistical and domain knowledge. There’s one fundamental problem: large datasets are always distributed, and data scientists work on single-machine datasets. With pandas. Always.

Most data scientists take pandas for granted. Loading a CSV file into a dataframe with pandas may have been their first line of code ever. But pandas is a single-node library: on a Spark cluster, it runs only on the driver, not across the worker nodes, so it doesn’t scale. Then it becomes my job to try to take their pandas away.

Typically, I try to convince them that PySpark is really easy and does everything pandas can do, only distributed across all the nodes. A statement both true and unconvincing. The unfortunate result is typically a stalemate that turns a once-promising initiative into a science project. My position was that their code just needed some small rewrites in order to work. Their position was that their existing code not only needed to be rewritten, but all future code would need to be written using an unfamiliar tool. In the case of a draw, the status quo usually wins. And I get that. But I’ve seen enough successful large-scale data deployments to know the gains in value and insight more than outweigh the short-term pain.

Mining massive datasets is more interesting than mining small ones. Statistically significant size differences matter. Reproducible research is more meaningful when the experiment provides a parameter-driven data extraction process rather than just the same dataset. p-hacking is harder to hide.

All that’s behind me now. (Update: Koalas is officially included in PySpark as of Apache Spark 3.2. For Spark 3.2 and above, use PySpark’s pandas API directly. Otherwise, install with pip as shown below.)

pip install koalas 

import databricks.koalas as ks
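
If you’re on Spark 3.2 or later, skip the pip install; the same API ships inside PySpark itself as the pandas API on Spark, and only the import changes:

import pyspark.pandas as ps  # the pandas API on Spark, Koalas' successor bundled with Spark 3.2+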

Getting past pandas

I am a big fan of Databricks. I like their platform, but I am passionate about open-source code that makes my life measurably easier. They have more than one project like this, but I’m talking about Koalas. Koalas is an (almost) drop-in replacement for pandas. There are some differences, but these are mainly around the fact that you are working on a distributed system rather than a single node. For example, the sort order is not guaranteed. Once you are more familiar with distributed data processing, this is not a surprise.
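
To make “drop-in” concrete, here’s a minimal sketch; the file path and column names are hypothetical, but every call has the same shape as its pandas counterpart:

import databricks.koalas as ks

# Reads the CSV in parallel across the cluster, not on a single machine
df = ks.read_csv("/data/transactions.csv")  # hypothetical path

# The familiar pandas groupby/aggregate pattern, executed on Spark
totals = df.groupby("customer_id")["amount"].sum()

# Row order is not guaranteed on a distributed system, so sort explicitly
print(totals.sort_index().head(10))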

The point is, now people can become familiar with distributed data processing.

PySpark is still the best tool for developing Spark applications in Python. For data scientists, however, learning a new library up front was wasted effort that didn’t directly bring any value. By providing a drop-in replacement for their most common data management tool, we can short-circuit a frustrating and fruitless conversation and get straight to work.
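
And when part of a pipeline genuinely needs the full PySpark API, Koalas converts to and from Spark DataFrames directly. A sketch, again with a hypothetical path and columns:

import databricks.koalas as ks

kdf = ks.read_csv("/data/transactions.csv")  # hypothetical path

# Drop down to a native Spark DataFrame for PySpark-specific work
sdf = kdf.to_spark()
sdf = sdf.filter(sdf.amount > 100)

# Then return to the pandas-style API; importing koalas adds
# to_koalas() to Spark DataFrames
kdf = sdf.to_koalas()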

Getting started

Just pip install koalas and go play with it.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud, with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4j as the off-blockchain repository.
