Will Spark Replace Hadoop?

I have seen a number of articles asking whether Apache Spark will replace Hadoop. This is the wrong question! It is like asking if your DVD player will replace your entire home theater system, which is pretty absurd. Just as a home theater system has many components (a TV or projector, a receiver, a DVD player, and speakers), Hadoop has different components. Specifically, Hadoop has three main components:

  1. File System – Hadoop Distributed File System (HDFS) – the scalable and reliable information storage layer.
  2. Resource Management – YARN – the resource manager, which schedules and allocates processing resources across the cluster.
  3. Cluster Computing Framework – Originally this was MapReduce; now there are other options such as Tez, Storm, and Spark.

Spark’s core processing framework provides a cluster computing framework and has a basic resource management layer. What it is missing is the data storage layer (e.g. HDFS). So Spark, by itself, will not replace Hadoop (for any workload that requires data storage services).
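
To make the layering concrete, here is a minimal Python sketch of the three components listed above. The `Stack` class, its field names, and the `missing_layers` helper are hypothetical names invented for this illustration; none of them come from Hadoop or Spark.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three layers named in the list above.
LAYERS = ("storage", "resource_management", "compute")


@dataclass(frozen=True)
class Stack:
    """Hypothetical description of a data platform, one field per layer."""
    name: str
    storage: Optional[str] = None
    resource_management: Optional[str] = None
    compute: Optional[str] = None

    def missing_layers(self) -> List[str]:
        """Return the layers this stack does not provide."""
        return [layer for layer in LAYERS if getattr(self, layer) is None]


hadoop = Stack("Hadoop", storage="HDFS",
               resource_management="YARN", compute="MapReduce")
# Spark on its own: compute plus its basic built-in resource management.
spark_alone = Stack("Spark alone",
                    resource_management="Spark standalone", compute="Spark")

for stack in (hadoop, spark_alone):
    print(stack.name, "is missing:", stack.missing_layers() or "nothing")
```

In this toy model, Spark on its own reports `storage` as its only missing layer, which is the gap the paragraph above describes.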

The conversation does not end here, though. Spark’s main benefit is that it keeps data for intermediate processing steps in memory, whereas MapReduce writes it to disk. MapReduce works well for batch processing, where data can be read and written to disk in bulk, but for streaming and many analytical workloads, Spark’s in-memory-centric architecture provides significant benefits. It also makes the framework very versatile, supporting machine learning, graph analytics, SQL data access and management, and stream processing.

Spark’s versatility is not limited to the types of workloads. It supports Mesos, YARN, Ring Master (Cassandra’s resource manager), and its own resource management. For data storage, Spark is compatible with Hadoop, S3, Cassandra, and others. So Spark alone will not replace Hadoop. However, Spark combined with other resource managers and data storage solutions can provide a holistic system that can replace Hadoop.
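
The memory-versus-disk difference described above can be sketched in a few lines of plain Python. This is a toy illustration only, not Spark or MapReduce code: the function names `run_with_disk_handoff` and `run_in_memory` and the three arithmetic steps are assumptions made up for the example. It shows only where the intermediate results live between processing steps.

```python
import json
import os
import tempfile


def run_with_disk_handoff(records, steps):
    """Apply each step, writing the intermediate result to a file and
    reading it back before the next step (a MapReduce-style handoff)."""
    disk_round_trips = 0
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "stage.json")
        for step in steps:
            records = [step(r) for r in records]
            with open(path, "w") as f:
                json.dump(records, f)      # persist the intermediate result
            with open(path) as f:
                records = json.load(f)     # the next step re-reads it from disk
            disk_round_trips += 1
    return records, disk_round_trips


def run_in_memory(records, steps):
    """Apply each step while keeping intermediate results in memory."""
    for step in steps:
        records = [step(r) for r in records]
    return records, 0


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
data = list(range(5))

disk_result, disk_trips = run_with_disk_handoff(data, steps)
mem_result, mem_trips = run_in_memory(data, steps)

print("same answer:", disk_result == mem_result)
print("disk round trips:", disk_trips, "vs", mem_trips)
```

Both versions produce the same answer; the difference is the extra write and read between every pair of steps, which is the cost that grows with multi-step and iterative workloads.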

In summary, just as a DVD player can be used in a system that has a plasma, LCD, or DLP projector for the display, Spark opens the door to storing and processing data that resides in systems other than Hadoop.

If you like today’s blog entry, please follow me @bigdata73.

Bill Busch

Bill is a Director and Senior Data Strategist leading Perficient's Big Data Team. Over his 27 years of professional experience he has helped organizations transform their data management, analytics, and governance tools and practices. As a veteran in analytics, Big Data, data architecture and information governance, he advises executives and enterprise architects on the latest pragmatic information management strategies. He is keenly aware of how to advise and lead companies through developing data strategies, formulating actionable roadmaps, and delivering high-impact solutions. As one of Perficient’s prime thought leaders for Big Data, he provides the visionary direction for Perficient’s Big Data capability development and has led many of our clients' largest Data and Cloud transformation programs. Bill is an active blogger and can be followed on Twitter @bigdata73.
