Previously, I reviewed why Spark will not by itself replace Hadoop, and how Spark combined with other data storage and resource management technologies creates other options for managing Big Data. Today we will investigate how an enterprise should proceed in this new, “Hadoop is not the only option” world.
Hadoop, Spark, Cassandra, Oh My! Open source Hadoop and NoSQL are moving pretty fast. No wonder some companies might feel like Dorothy in the Land of Oz. To make sense of things and find the yellow brick road, we first need to understand Hadoop’s market position.
HDFS can store just about any type of data, and YARN manages the cluster resources needed to process it. With Microsoft’s help, it is now possible to run Hadoop on Windows and leverage .NET, although most deployments still run on Linux and Java. HBase and Hive run on top of Hadoop, and Cassandra can integrate with it. Hadoop and Hive support is quickly becoming ubiquitous across data discovery, BI, and analytics tool sets. Hadoop is also maturing fast from a security perspective. Thanks to Hortonworks, Apache Knox and Ranger have delivered enterprise security capabilities. Cloudera and IBM both have their own stories on enterprise security as well. WANdisco provides robust multi-site data replication and state-of-the-art data protection. The bottom line is that Hadoop has matured and continues to mature, AND there is extensive support from most vendors and related Apache open-source projects. Hadoop is not going anywhere but up! Hadoop is the de facto standard (aka the safe bet) for Big Data management and processing. It will meet the requirements of most enterprises, and its ability to support many different execution frameworks, like Storm, MapReduce, Tez, and Spark, will assure support for almost any application processing scenario.
Although Hadoop may be a safe choice, it isn’t always the correct choice for an enterprise. There are other options. I worked at a mid-size company recently, and based on their requirements, they did not need Hadoop to meet their 2-3 year needs. A solution built on an RDBMS was sufficient because of the characteristics of their data. Meanwhile, a large retailer has recently deployed a very large (100s of TBs) Cassandra system to replace a relational operational data store. Requirements, willingness to accept risk, budget, and in-house skill sets all play a part in the overall decision.
When making a platform decision for Big Data, enterprises should:
- Make a decision based on the current market: do not “bet” on a technology that does not exist yet. If you have a business case today, the last thing you need to do is wait. That is, a good decision today is better than a great decision tomorrow. Time is money.
- Do not lock yourself into any single technology by investing a significant amount of time and expense in writing code that would need to be rewritten if something better comes along. Rely on product vendors to isolate you from changes in technology. Using a third-party tool like SnapLogic or Informatica for data transformation will help isolate you from underlying platform changes.
- Stick to more mature open-source offerings. New open-source projects offer a lot of promise and excitement, but they need to mature first. (Note that IBM’s announcement implies that they believe Spark needs to mature more before it is enterprise ready.)
- Perform a proof of concept, but do not fall into the analysis paralysis trap of testing every different technology combination. Pick one or two deployment scenarios to prove. Time is money.
Lastly, consider partnering with an experienced consulting firm with hands-on experience in operationalizing Big Data (Perficient is an excellent choice, by the way!). A good consultant will, in the long run, save you money by providing objective advice and speeding you through solution design and implementation by focusing your organization on the decisions that are truly important.