In my previous post I covered the basic concepts, business value, and some real-world facts of big data applications. In today's IT world, big data technologies and products are being developed and adopted at a dramatic pace. The list of companies and organizations involved in this wave of big data solutions is long, which indicates that many leaders have realized its importance and expect these new solutions to deliver true business value. When we talk about big data, one term should immediately jump to mind: Hadoop. Many commentators have claimed that Hadoop is NOT the only bullet for meeting the challenges of big data management and analysis in the enterprise. Nevertheless, it solves most big data problems well and has become the foundation infrastructure for big data appliances from leading vendors such as Oracle, Informatica, IBM, and Microsoft.
What is Hadoop? Jeff Hammerbacher, chief scientist at Cloudera, defines it this way: Hadoop is a scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license). Hadoop, with its elephant logo, is now an Apache project. Its core components are Common, HDFS (Hadoop Distributed File System), and MapReduce; the latter two make up a scalable data processing engine:
Common: The common utilities that support the other Hadoop subprojects.
HDFS: Self-healing, high-bandwidth clustered storage.
MapReduce: Fault-tolerant distributed processing.
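The MapReduce model behind that processing engine can be illustrated with a tiny word-count sketch in plain Python. This is only a single-machine simulation of the map, shuffle, and reduce phases, not a real Hadoop job, and the function names are my own:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big hadoop", "hadoop data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 2}
```

In a real cluster, HDFS stores the input blocks across nodes and each node runs the map and reduce functions on its local data; the logic, however, is exactly this simple.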
In addition, several subprojects are also key to the big data practitioner, such as Hive, HBase, ZooKeeper, and Hue. Cloudera has defined the functionality and features of each layer and of the major projects around Hadoop. This is what I call the Hadoop family.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: A high-performance coordination service for distributed applications.
Well, since we speak of Hadoop as a silver bullet, what kinds of problems can Hadoop solve?
- Identifying true risk. Identifying customer and business transaction risk is common in the insurance and finance industries, for example at banks and credit card companies. They want to know more about their customers and determine how much risk they would take on by issuing a loan or a credit card. The Hadoop family can pull in disparate data sources and parse and aggregate them to build a comprehensive data picture. These sources can be semi-structured, such as credit card records, call recordings, chat sessions, and emails.
- Online targeted advertising. In the Internet era, companies make maximum use of online ads to market their products and services. Web search engines such as Google, Yahoo, Bing, and Baidu are hugely popular, and the problem they face is how to ensure ad ROI. As many know, targeted advertising is an effective answer, and Hadoop can help by conducting data analysis in parallel, reducing processing times from days to hours. With a Hadoop solution, the only cost that grows as data volumes expand is hardware: adding more nodes does not degrade performance.
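The "comprehensive data picture" idea from the risk example can be sketched in a few lines of Python. This is a hypothetical illustration of merging records from disparate sources into one per-customer view, the kind of aggregation a Hadoop job would run at scale; all field names and values here are invented for the example:

```python
from collections import defaultdict

# Two illustrative semi-structured sources: card transactions and call logs.
card_records = [
    {"customer": "C1", "amount": 120.0},
    {"customer": "C1", "amount": 80.0},
    {"customer": "C2", "amount": 500.0},
]
call_logs = [
    {"customer": "C1", "minutes": 12},
    {"customer": "C2", "minutes": 45},
]

# Aggregate both sources into a single profile per customer.
profile = defaultdict(dict)
for rec in card_records:
    p = profile[rec["customer"]]
    p["total_spent"] = p.get("total_spent", 0.0) + rec["amount"]
for rec in call_logs:
    profile[rec["customer"]]["call_minutes"] = rec["minutes"]

print(dict(profile))
```

A real risk pipeline would of course add parsing of unstructured inputs (call recordings, emails) and a scoring model on top, but the join-and-aggregate core is the same.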
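The scale-out claim in the advertising example can also be made concrete. In the sketch below (my own toy example, using threads in place of cluster nodes), an ad-click log is split into chunks, each "worker" counts its chunk independently, and the partial counts are merged, which is why adding workers keeps per-worker load flat as data grows:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

clicks = ["ad1", "ad2", "ad1", "ad3", "ad1", "ad2"] * 3

def count_chunk(chunk):
    """Each worker counts clicks per ad in its own slice of the log."""
    return Counter(chunk)

def targeted_ad_counts(log, workers=3):
    # Split the log into roughly equal chunks, one per worker ("node").
    chunks = [log[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_chunk, chunks)
    # Merge the partial counts, as a reduce step would.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

print(targeted_ad_counts(clicks))
```

Doubling the worker count halves each worker's share of the data, which is the same property that lets a Hadoop cluster grow by simply adding nodes.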
And there are actually many more examples…
An interesting story I read is that the New York police adopted big data technology to analyze Internet sources (the web, Twitter, etc.) to find potential criminals and to determine whether there was a pattern of racial profiling. To date, Hadoop plays a key, fundamental role in a wide range of big data products.