Doug Cutting – the creator of Hadoop – is reported to have explained how the name for his Big Data technology came about:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.”
The term, of course, evolved over time and almost took on a life of its own… this little elephant kept on growing, and growing… to the point that, nowadays, the term Hadoop is often used to refer to a whole ecosystem of projects, such as:
- Common – components and interfaces for distributed filesystems and general I/O
- Avro – serialization system for RPC and persistent data storage
- MapReduce – distributed data processing model and execution environment running on large clusters of commodity machines
- HDFS – distributed filesystem running on large clusters of commodity machines
- Pig – data flow language / execution environment to explore huge datasets (running on HDFS and MapReduce clusters)
- Hive – distributed data warehouse that manages data stored in HDFS and provides an SQL-based language for querying it
- HBase – distributed, column-oriented database that uses HDFS for its underlying storage, supporting both batch-style computations and random reads
- ZooKeeper – distributed, highly available coordination service, providing primitives to build distributed applications
- Sqoop – tool for bulk data transfer between structured data stores and HDFS
- Oozie – service to run and schedule workflows for Hadoop jobs
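To make the MapReduce model listed above a bit more concrete, here is a minimal local sketch of its three phases – map, shuffle, reduce – applied to the classic word-count problem. This is plain Python run on one machine purely for illustration; the function names and sample documents are my own, not part of Hadoop's API, and a real Hadoop job would distribute these same phases across a cluster of commodity machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values – here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

Pig and Hive, in turn, let you express this same kind of computation declaratively (in Pig Latin or SQL-like queries) and compile it down to MapReduce jobs for you.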
This is a sizable portion of the Big Data ecosystem… an ecosystem that keeps on growing almost by the day. In fact, we could spend a considerable amount of time describing additional technologies out there that play an important part in the Big Data symphony – DataStax, Sqrrl, Hortonworks, Cloudera, Accumulo, Apache, Ambari, Cassandra, Chukwa, Mahout, Spark, Tez, Flume, Fuse, YARN, Whirr, Grunt, HiveQL, Nutch, Java, Ruby, Python, Perl, R, NoSQL, PigLatin, Scala, etc.
Interestingly enough, most of the aforementioned technologies are used in the realm of Data Science as well, largely because the main goal of Data Science is to make sense of, and generate value from, all data in its many forms, shapes, structures and sizes.
In my next blog post, we’ll see how Big Data and Data Science are actually two sides of the same coin, and how whoever does Big Data is, to some extent, also doing Data Science – wittingly or unwittingly.