Apache Spark: What is RDD?
Last modified: September 6, 2017
Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects that supports both batch and near-real-time (streaming) processing, typically on a Hadoop cluster. In this article, we cover the basics of RDD: what it is and how it improves on classic MapReduce.
What is RDD?
RDD stands for Resilient Distributed Dataset. RDD is the primary data abstraction in Spark. It is:
Resilient: RDD is fault tolerant. It uses a lineage graph to track how each piece of data was derived, so anything stored in an RDD remembers how it came to be and can be recomputed if it is lost in a failure.
Distributed: The data in an RDD is spread across the nodes of a cluster. This allows it to be processed in parallel, taking full advantage of a distributed environment.
Dataset: RDD is a collection of partitioned records. Each partition holds a subset of the records, and the partitions are distributed across the cluster.
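The three properties above can be sketched with a toy model. This is plain Python, not Spark's API: the class `ToyRDD` and its methods `from_list`, `map`, `partition`, `lose_partition`, and `collect` are hypothetical names invented for illustration, and the round-robin split is an arbitrary assumption. The sketch only shows the idea that each partition can be rebuilt from its recorded lineage.

```python
# Toy model only: plain Python, not Spark's API. All names are hypothetical.

class ToyRDD:
    """A partitioned dataset that remembers how each partition is derived."""

    def __init__(self, num_partitions, compute, lineage):
        self.num_partitions = num_partitions
        self._compute = compute   # function: partition index -> list of records
        self.lineage = lineage    # ordered list of the steps that produced this dataset
        self._store = {}          # partition index -> materialized records

    @classmethod
    def from_list(cls, data, num_partitions):
        data = tuple(data)  # freeze the source so it cannot change later

        def compute(i):
            # Round-robin split: partition i takes every num_partitions-th record.
            return list(data[i::num_partitions])

        return cls(num_partitions, compute, ["from_list"])

    def map(self, fn):
        parent = self

        def compute(i):
            # A child partition is derived from the matching parent partition.
            return [fn(x) for x in parent.partition(i)]

        return ToyRDD(self.num_partitions, compute, self.lineage + ["map"])

    def partition(self, i):
        # Materialize on demand; a missing partition is recomputed from lineage.
        if i not in self._store:
            self._store[i] = self._compute(i)
        return self._store[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one materialized partition.
        self._store.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list(range(6), num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # pretend the node holding partition 1 failed
after = squares.collect()   # partition 1 is rebuilt from its lineage
```

Here `before` and `after` hold the same records even though one partition was dropped in between, because the lost partition is recomputed from its parent rather than restored from a replica.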
How RDD works
RDD stores data as immutable objects, which makes them easy to share safely across different jobs. The elements can be objects of any type available in Python, Scala, or Java, including user-defined classes.
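As a rough illustration of why immutability makes sharing safe, the sketch below uses a Python tuple as a stand-in for an RDD's records. It is plain Python, not Spark's API, and the helper `transform` is a hypothetical name. Each transformation returns a new collection and leaves its input untouched, so two consumers can derive different results from the same base data.

```python
# Toy illustration in plain Python, not Spark's API. Names are hypothetical.

base = (1, 2, 3, 4)  # a tuple is immutable, like the records of an RDD


def transform(records, fn):
    """Return a new immutable collection; the input is never modified."""
    return tuple(fn(r) for r in records)


# Two independent "jobs" derive different datasets from the same base.
doubled = transform(base, lambda x: x * 2)
labeled = transform(base, lambda x: ("n", x))  # elements can be any objects
```

After both calls, `base` still holds its original records, so neither job can interfere with the other.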
Unlike MapReduce, Spark doesn't have to read from and write to disk every time a processing step needs data. Instead, an RDD can be held in memory temporarily and reused by later steps. As a loose analogy, think of a cached RDD as an in-memory cache in front of HDFS.
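The difference between re-reading storage for every step and reusing data held in memory can be sketched with a toy model. This is plain Python, not Spark's API: `CountingSource` stands in for slow storage such as HDFS, and `CachedDataset` and `run_steps` are hypothetical names used only for this illustration.

```python
# Toy model in plain Python, not Spark's API. All names are hypothetical.

class CountingSource:
    """Stands in for slow storage and counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


class CachedDataset:
    """Loads from the source on first use, then serves requests from memory."""

    def __init__(self, source):
        self._source = source
        self._cached = None

    def records(self):
        if self._cached is None:
            self._cached = self._source.read()
        return self._cached


def run_steps(get_records, steps):
    """Apply each step in turn, requesting the records again for every step."""
    return [step(get_records()) for step in steps]


steps = [sum, max, len]

# Without caching, every step goes back to storage.
uncached_source = CountingSource([3, 1, 4, 1, 5])
uncached_results = run_steps(uncached_source.read, steps)

# With caching, only the first step touches storage.
cached_source = CountingSource([3, 1, 4, 1, 5])
cached = CachedDataset(cached_source)
cached_results = run_steps(cached.records, steps)
```

Both runs produce the same results, but `uncached_source.reads` ends up equal to the number of steps while `cached_source.reads` stays at one.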
Why is RDD better than MapReduce?
RDD avoids much of the reading from and writing to HDFS that MapReduce performs between processing steps. By significantly reducing I/O operations, RDD offers a much faster way to retrieve and process data in a Hadoop cluster. In fact, it's estimated that Hadoop MapReduce apps spend more than 90% of their time performing reads/writes to HDFS.
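To get a feel for what that estimate implies, the sketch below applies an Amdahl's-law style calculation: only the I/O share of the runtime gets faster, so the non-I/O share limits the overall gain. The function name `overall_speedup` and the I/O speedup factors of 10 and 100 are assumptions chosen for illustration, not measurements; the 0.9 I/O fraction is the article's estimate.

```python
def overall_speedup(io_fraction, io_speedup):
    """Amdahl-style estimate: only the I/O share of the runtime gets faster."""
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


# Hypothetical scenarios: I/O takes 90% of the time and becomes 10x or 100x cheaper.
tenfold_io = overall_speedup(0.9, 10)
hundredfold_io = overall_speedup(0.9, 100)

# Upper bound if I/O cost vanished entirely: 1 / (1 - 0.9), roughly 10x.
upper_bound = 1.0 / (1.0 - 0.9)
```

Under these assumptions the overall gain is a bit over 5x when I/O becomes 10x cheaper and a bit over 9x when it becomes 100x cheaper, approaching but never exceeding roughly 10x.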
In short, RDD is the core data abstraction of Apache Spark. By keeping data in memory, it reduces I/O operations and can improve performance substantially. For more information on Apache Spark and how it compares to classic MapReduce, see Hadoop MapReduce vs Spark.