I have seen a number of articles asking whether Apache Spark will replace Hadoop. This is the wrong question! It is like asking whether your DVD player will replace your entire home theater system, which is pretty absurd. Just as a home theater system has many components (a TV or projector, a receiver, a DVD player, and speakers), Hadoop has different components. Specifically, Hadoop has three main components:
- File System – Hadoop Distributed File System (HDFS) – the scalable and reliable data storage layer
- Resource Management – YARN – the resource manager, which schedules and allocates compute resources (CPU and memory) across the cluster
- Cluster Computing Framework – Originally this was MapReduce; now there are other options like Tez, Storm, and Spark.
Spark’s core provides a cluster computing framework and a basic resource management layer. What it is missing is the data storage layer (e.g., HDFS). So Spark, by itself, will not replace Hadoop for any workload that requires data storage services.
The conversation does not end here, though. Spark’s main benefit is that it keeps data for intermediate processing steps in memory, whereas MapReduce writes intermediate data to disk. MapReduce works well for batch processing, where data can be read and written to disk in bulk, but for streaming and many analytical workloads, Spark’s memory-centric architecture provides significant benefits. That architecture also makes Spark a very versatile framework, supporting machine learning, graph analytics, SQL data access and management, and stream processing.

Spark’s versatility is not limited to the types of workloads it runs. For cluster management, it supports Mesos, YARN, and its own standalone resource manager. For data storage, Spark is compatible with HDFS, S3, Cassandra, and others. So, Spark alone will not replace Hadoop. However, Spark combined with other resource managers and data storage solutions can provide a holistic system that replaces Hadoop.
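To make the in-memory versus on-disk contrast concrete, here is a toy sketch in plain Python (not actual Spark or Hadoop APIs; the function names and pipeline are hypothetical) comparing the two execution models: a MapReduce-style pipeline that writes every intermediate result to disk and reads it back, versus a Spark-style pipeline that keeps the intermediate dataset in memory between steps.

```python
import json
import os
import tempfile

def mapreduce_style(data, steps):
    """MapReduce-style execution: each intermediate result is written to
    disk and read back before the next step runs."""
    tmpdir = tempfile.mkdtemp()
    current = data
    for i, step in enumerate(steps):
        result = [step(x) for x in current]
        path = os.path.join(tmpdir, f"step_{i}.json")
        with open(path, "w") as f:        # intermediate result hits disk
            json.dump(result, f)
        with open(path) as f:             # next step reads it back from disk
            current = json.load(f)
    return current

def spark_style(data, steps):
    """Spark-style execution: the intermediate dataset stays in memory
    across steps, avoiding the disk round-trip entirely."""
    current = data
    for step in steps:
        current = [step(x) for x in current]  # stays in memory
    return current

# Both models compute the same answer; the difference is where the
# intermediate data lives between steps.
steps = [lambda x: x * 2, lambda x: x + 1]
data = [1, 2, 3]
print(mapreduce_style(data, steps))  # [3, 5, 7]
print(spark_style(data, steps))      # [3, 5, 7]
```

For iterative workloads like machine learning, where the same dataset is transformed many times, skipping that disk round-trip on every step is where Spark's speedup comes from.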
In summary, just as a DVD player can be used in a system with either a plasma, LCD, or DLP projector for the display, Spark will open the door to storing and processing data residing in systems other than Hadoop.
If you like today’s blog entry, please follow me @bigdata73.