Last modified: September 6, 2017
MapReduce and Spark are both used to perform large-scale data processing on Hadoop. While MapReduce is native to Hadoop and the traditional option for batch processing, Spark is the "new kid on the block" and offers a significant performance boost, particularly for real-time data processing. In this article, we'll discuss the advantages of Spark over MapReduce, what makes Spark faster, and how the two compare on performance. We'll also address reasons for using MapReduce over Spark and vice versa.
MapReduce has always been a fundamental piece to the Hadoop puzzle. It's responsible for processing large data sets in parallel and works directly with HDFS. For more on the concepts surrounding Hadoop MapReduce, see MapReduce Quick Explanation.
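To make the model concrete, here is a toy sketch of the MapReduce pattern in plain Python (no Hadoop involved, and the function names are ours): a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop stores data", "spark and hadoop process data"]
word_counts = reduce_phase(shuffle(map_phase(docs)))
```

In real Hadoop MapReduce, each phase runs in parallel across the cluster and the intermediate pairs are written to disk between phases, which is exactly the cost discussed below.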
Apache Spark is a general-purpose compute engine that became a top-level Apache project in 2014. It supports both batch and real-time streaming jobs on a distributed cluster and is significantly faster than Hadoop MapReduce.
Unlike MapReduce, Spark keeps its processing in memory wherever possible. While MapReduce persists results back to disk after its map and reduce functions, Spark doesn't rely on the same costly I/O operations to process data.
Additionally, Spark can perform both batch and real-time processing. This makes Spark a potential "one stop shop" for all your data processing needs. While MapReduce can approximate real-time processing with frequent small batch jobs, it's significantly slower than Spark.
MapReduce is notorious for being difficult to code. Spark, on the other hand, provides easy-to-understand APIs for popular languages like Java, Scala, and Python. It also includes Spark SQL, making conventional SQL queries possible with Spark.
MapReduce requires a lot of reading/writing to the hard drive to process data. Since Spark can process data in-memory, it drastically reduces the latency experienced with these operations.
There are also fewer stages involved with Spark processing. Instead of forcing every job through rigid map and reduce phases with a disk write in between, Spark builds a graph (DAG) of operations, which makes repeated access to the same data much faster than MapReduce.
Spark uses Resilient Distributed Datasets (RDDs) to persist intermediate results in memory. RDDs are immutable, partitioned collections of objects that allow for both easy manipulation and fast recovery after failures. For more on RDDs, see Apache Spark: What is RDD?.
There are many benchmarks and case studies out there that compare the speed of MapReduce to Spark. In a nutshell, Spark is hands down much faster than MapReduce: the Spark project itself claims speedups of up to 100x for in-memory workloads, with smaller but still substantial gains for on-disk processing.
Despite the clear performance advantage of using Spark over MapReduce, there are still reasons to use MapReduce over Spark.
Remember that Spark does its processing in memory. This means Spark is only as effective as the amount of memory it has to work with in the cluster.
Memory isn't cheap. Since hard disk space is much cheaper, it may be more cost effective to use MapReduce in certain situations. While Spark can do "more with less", you need to evaluate the cost of adding memory to your system versus using hard disk space for processing.
Spark plays well with Hadoop. In fact, you can read/write directly from HDFS with Spark. Despite this relationship, Spark processes don't necessarily stop running once a job completes. With MapReduce, processes are killed as soon as the job is finished. Because MapReduce releases its resources right away, it can coexist with other services in the cluster without jeopardizing overall performance.
Spark's superior performance makes it one of the most popular options for distributed data processing today. This is not to say that traditional MapReduce is dead, especially when considering the limitations memory introduces to your cluster. While Spark is best for iterative workloads, MapReduce is still a good option for batch processing and data integration.