I have seen a number of articles asking whether Apache Spark will replace Hadoop. This is the wrong question! It is like asking whether your DVD player will replace your entire home theater system, which is pretty absurd. Just as a home theater system has many components (a TV or projector, a receiver, a DVD player, and speakers), Hadoop has different components. Specifically, Hadoop has three main components:
- File System – Hadoop Distributed File System (HDFS) – the scalable and reliable data storage layer
- Resource Management – YARN – the resource manager, which schedules and allocates compute resources (CPU and memory) across the cluster
- Cluster Computing Framework – Originally this was MapReduce; now there are other options like Tez, Storm, and Spark.
Spark’s core provides a cluster computing framework and a basic resource management layer. What it is missing is the data storage layer (e.g., HDFS). So Spark, by itself, will not replace Hadoop for any workload that requires data storage services.
The conversation does not end here, though. Spark’s main benefit is that it keeps data for intermediate processing steps in memory, whereas MapReduce writes intermediate data to disk. MapReduce works well for batch processing, where data can be read and written to disk in bulk, but for streaming and many analytical workloads, Spark’s memory-centric architecture provides significant benefits. That architecture also makes Spark a very versatile framework, supporting machine learning, graph analytics, SQL data access and management, and stream processing.

Spark’s versatility is not limited to the types of workloads it runs. For cluster management, it supports Mesos, YARN, and its own standalone resource manager. For data storage, Spark is compatible with HDFS, S3, Cassandra, and others. So, Spark alone will not replace Hadoop. However, Spark combined with other resource managers and data storage solutions can provide a holistic system that replaces Hadoop.
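To make the in-memory versus on-disk contrast concrete, here is a toy sketch in plain Python (not actual Spark or Hadoop APIs; the function names and pipeline are hypothetical) comparing the two execution models: a MapReduce-style pipeline that writes every intermediate result to disk and reads it back, versus a Spark-style pipeline that keeps the intermediate dataset in memory between steps.

```python
import json
import os
import tempfile

def mapreduce_style(data, steps):
    """MapReduce-style execution: each intermediate result is written to
    disk and read back before the next step runs."""
    tmpdir = tempfile.mkdtemp()
    current = data
    for i, step in enumerate(steps):
        result = [step(x) for x in current]
        path = os.path.join(tmpdir, f"step_{i}.json")
        with open(path, "w") as f:        # intermediate result hits disk
            json.dump(result, f)
        with open(path) as f:             # next step reads it back from disk
            current = json.load(f)
    return current

def spark_style(data, steps):
    """Spark-style execution: the intermediate dataset stays in memory
    across steps, avoiding the disk round-trip entirely."""
    current = data
    for step in steps:
        current = [step(x) for x in current]  # stays in memory
    return current

# Both models compute the same answer; the difference is where the
# intermediate data lives between steps.
steps = [lambda x: x * 2, lambda x: x + 1]
data = [1, 2, 3]
print(mapreduce_style(data, steps))  # [3, 5, 7]
print(spark_style(data, steps))      # [3, 5, 7]
```

For iterative workloads like machine learning, where the same dataset is transformed many times, skipping that disk round-trip on every step is where Spark's speedup comes from.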
In summary, just as a DVD player can be used in a system with either a plasma, LCD, or DLP projector for the display, Spark will open the door to storing and processing data residing in systems other than Hadoop.
If you like today’s blog entry, please follow me @bigdata73.