We have talked about the significance of Hadoop for Big Data, but it remains a mystery to many people; even among IT experts, interpretations of what it actually is often conflict. An article by Derrick Harris on GigaOM.com does not unravel every complexity, but it gives a good overview of the Hadoop environment: the components of Hadoop itself, and then the software that makes use of Hadoop, whether by enabling developers to write Hadoop applications or by helping to analyze the data stored within it.
Harris says that Hadoop, as an Apache Software Foundation project, consists of two essential components: “Hadoop MapReduce and the Hadoop Distributed File System. MapReduce is the parallel-processing engine that allows Hadoop to churn through large data sets in relatively short order. HDFS is the distributed file system that lets Hadoop scale across commodity servers and, importantly, store data on the compute nodes in order to boost performance (and potentially save money).”
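To make the MapReduce idea more concrete, here is a minimal word-count sketch in the style of the standard Hadoop tutorial, not anything drawn from Harris's article: each mapper reads its slice of the input (typically an HDFS block stored on the same node) and emits a count of one per word, and the reducers sum those counts in parallel. The class and path names are illustrative assumptions.

// Illustrative sketch of a Hadoop MapReduce word-count job; names are assumptions.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper processes one block of input (usually the block
  // HDFS stored on that same compute node) and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for a given word arrive at one reducer,
  // which sums them; many reducers run in parallel across the cluster.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations are HDFS paths supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even in this small example, the division of labor Harris describes is visible: HDFS spreads the data across commodity servers, and MapReduce pushes the map work out to wherever those blocks live rather than pulling the data to the computation.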
Harris then details the other Apache projects related to Hadoop, some of which are built on MapReduce or HDFS; these include query languages and databases for Hadoop. He also points out that “many Hadoop distributions integrate with various data warehouses, databases and other data-management products, with the goal of moving data between Hadoop clusters and other environments so each might process or query data stored in the other.”
There is also Hadoop management software that makes it easier to administer and troubleshoot a cluster, along with products that help developers write Hadoop applications and others that perform data analysis outside of traditional MapReduce jobs.
The number of products surrounding and integrating with Hadoop will only continue to grow as the Big Data challenge grows. (See Harris's Hadoop article and related feedback at http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/.)