Doug Cutting – the creator of Hadoop – is reported to have explained how the name for his Big Data technology came about:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.”
The term, of course, evolved over time and almost took on a life of its own… this little elephant kept on growing, and growing… to the point that, nowadays, the term Hadoop is often used to refer to a whole ecosystem of projects, such as:
- Common – components and interfaces for distributed filesystems and general I/O
- Avro – serialization system for RPC and persistent data storage
- MapReduce – distributed data processing model and execution environment running on large clusters of commodity machines
- HDFS – distributed filesystem running on large clusters of commodity machines
- Pig – data flow language / execution environment to explore huge datasets (running on HDFS and MapReduce clusters)
- Hive – distributed data warehouse that manages data stored in HDFS and provides an SQL-based language for querying it
- HBase – distributed, column-oriented database that uses HDFS for its underlying storage, supporting both batch-style computations and random reads
- ZooKeeper – distributed, highly available coordination service, providing primitives to build distributed applications
- Sqoop – tool for bulk data transfer between structured data stores and HDFS
- Oozie – service to run and schedule workflows for Hadoop jobs
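To make the MapReduce model listed above a bit more concrete, here is a minimal local sketch of its three phases – map, shuffle, reduce – applied to the classic word-count problem. This is plain Python run on one machine purely for illustration; the function names and sample documents are my own, not part of Hadoop's API, and a real Hadoop job would distribute these same phases across a cluster of commodity machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values – here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

Pig and Hive, in turn, let you express this same kind of computation declaratively (in Pig Latin or SQL-like queries) and compile it down to MapReduce jobs for you.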
This is a sizable portion of the Big Data ecosystem… an ecosystem that keeps on growing almost by the day. In fact, we could spend a considerable amount of time describing additional technologies out there that play an important part in the Big Data symphony – DataStax, Sqrrl, Hortonworks, Cloudera, Accumulo, Apache, Ambari, Cassandra, Chukwa, Mahout, Spark, Tez, Flume, Fuse, YARN, Whirr, Grunt, HiveQL, Nutch, Java, Ruby, Python, Perl, R, NoSQL, PigLatin, Scala, etc.
Interestingly enough, most of the aforementioned technologies are used in the realm of Data Science as well, largely because the main goal of Data Science is to make sense of, and generate value from, all data in its many forms, shapes, structures and sizes.
In my next blog post, we’ll see how Big Data and Data Science are actually two sides of the same coin, and how whoever does Big Data is, to some extent, also doing Data Science – wittingly or unwittingly.