Will Spark Replace Hadoop?

I have seen a number of articles asking whether Apache Spark will replace Hadoop. This is the wrong question. Spark does not equal Hadoop! It is like asking if your DVD player will replace your entire home theater system, which is pretty absurd. Just as a home theater system has many components (a TV or projector, a receiver, a DVD player, and speakers), Hadoop has different components. Specifically, Hadoop has three main components:

File System – Hadoop Distributed File System (HDFS) – the scalable and reliable information storage layer.
Resource Management – YARN – the resource manager, which manages processing across the cluster.
Cluster Computing Framework – Originally this was MapReduce; now there are other options like Tez, Storm, and Spark.

Spark’s core processing framework provides a cluster computing framework and has a basic resource management layer. What it is missing is the data storage layer (e.g. HDFS). So Spark, by itself, will not replace Hadoop (for any workload that requires data storage services).

The conversation does not end here, though. Spark’s main benefit is that it keeps data for intermediate processing steps in memory, whereas MapReduce writes that data to disk. MapReduce works well for batch processing, where data can be read from and written to disk in bulk, but for streaming and many analytical workloads, Spark’s in-memory-centric architecture provides significant benefits. It also makes the framework very versatile, supporting machine learning, graph analytics, SQL data access and management, and stream processing. Spark’s versatility is not limited to the types of workloads. It supports Mesos, YARN, Ring Master (Cassandra’s resource manager), and its own resource management. For data storage, Spark is compatible with Hadoop (HDFS), S3, Cassandra, and others. So, Spark alone will not replace Hadoop. However, Spark combined with other resource managers and data storage solutions can provide a holistic system that can replace Hadoop.
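
To make the in-memory versus on-disk difference concrete, here is a toy sketch in plain Python. It is not Spark or MapReduce code and does not use either API. The function names (`tokenize`, `count`, `run_via_disk`, `run_in_memory`) are made up for illustration, and a single temporary JSON file stands in for the intermediate data that a disk-based framework would write between stages.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Stage 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Stage 2: count how often each word occurs."""
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals


def run_via_disk(lines):
    """Disk-based style: write the stage 1 output to a file, then read it back for stage 2."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")  # hypothetical intermediate file
        with open(path, "w", encoding="utf-8") as f:
            json.dump(tokenize(lines), f)
        with open(path, encoding="utf-8") as f:
            words = json.load(f)
    return count(words)


def run_in_memory(lines):
    """In-memory style: pass the stage 1 output straight to stage 2."""
    return count(tokenize(lines))


if __name__ == "__main__":
    sample = ["Spark and Hadoop", "Spark on YARN"]
    # Both pipelines compute the same answer; only the handling of
    # the intermediate data differs.
    print(run_via_disk(sample) == run_in_memory(sample))
```

Both functions return the same counts. The only difference is the extra write and read in the middle, which is the cost that grows when a job chains many stages together or revisits the same data repeatedly.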

In summary, just as a DVD player can be used in a system that has a plasma TV, an LCD TV, or a DLP projector for the display, Spark opens the door to storing and processing data that resides in systems other than Hadoop.

If you like today’s blog entry, please follow me @bigdata73.

by Bill Busch on June 15th, 2015
