MapReduce Quick Explanation

MapReduce is central to Apache Hadoop and distributed data processing. By leveraging data locality, MapReduce allows functions to run in parallel across multiple server nodes in a cluster. In this article, we discuss what MapReduce is and how it works, and walk through a basic example to get you started.

What is MapReduce?

MapReduce is responsible for running parallel, distributed algorithms across a data cluster. To improve performance, it processes data near where that data is stored. Rather than pulling data sets across the network to a central place for processing and aggregation, MapReduce brings the processing TO the data itself.

MapReduce by itself isn't inherently faster than other processing methods. It becomes an advantage only when applied to distributed data sets, such as those stored in HDFS (the Hadoop Distributed File System).

How MapReduce works

The MapReduce process can be summarized in three generalized steps:

1) Map

A master node (the job tracker in classic Hadoop) schedules map tasks on the data nodes that hold the input. Each task applies the user-defined map() function to its portion of the data, running in parallel with the others and producing a list of key/value pairs.
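To make this concrete, here is a minimal sketch of a word-count mapper written against Hadoop's Java MapReduce API. The class name and the tokenizing logic are just one reasonable way to do it; the Mapper base class and context.write() call are the standard org.apache.hadoop.mapreduce API.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Runs once per input split, on the node where that split is stored.
// For every word it sees, it emits the key/value pair <word, 1>.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1>
        }
    }
}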

2) Shuffle

These key/value pairs are then redistributed across the worker nodes. All pairs sharing the same key are grouped together on the same worker node in preparation for the reduce stage.
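The shuffle itself is handled by the framework, but the routing decision is easy to picture. The sketch below mirrors the logic of Hadoop's default HashPartitioner (the class name here is illustrative): the key is hashed to pick a reduce task, so every pair with the same key ends up on the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives a given key/value pair.
// Hashing on the key guarantees that all values for one key
// are grouped on the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the index is non-negative, then bucket by key.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}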

3) Reduce

Each group of values for a given key is then fed to a reduce() function. The reduce() function runs across the worker nodes in parallel, and the aggregated output is stored (typically in HDFS) or returned to the client.
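Here is a matching reducer sketch for the word-count case, again using the standard org.apache.hadoop.mapreduce.Reducer API. It receives one key together with all of the values the shuffle grouped under it, and sums them.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Runs once per key, receiving every value the shuffle grouped under that key.
// For word count, that means summing a list of 1s into a single total.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result);   // emit <word, total>
    }
}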

Since HDFS and the MapReduce framework run on the same set of nodes, Hadoop can leverage data locality to run map jobs on data residing on a particular node. Running these "local" jobs in parallel across the cluster speeds things up significantly. In fact, it's estimated that a large cluster can sort a petabyte of data in a few hours because of MapReduce!

A Basic Example

The word count problem is an easy way to demonstrate how MapReduce works. Let's say you have two files (on different nodes) with the following text:

Node 1

Hello from Hadoop

Node 2

Goodbye from Hadoop

These could represent two simple text files on two separate data nodes in HDFS. If we use MapReduce to get an aggregated word count, here is how it would play out.

First, the map function is applied on each node, producing the following output:

Node 1

<Hello, 1>
<from, 1>
<Hadoop, 1>

Node 2

<Goodbye, 1>
<from, 1>
<Hadoop, 1>

Now that the map function has processed both data nodes, the shuffle step groups pairs with the same key across the cluster. The reduce step then aggregates the results like so:

<Hello, 1>
<Goodbye, 1>
<from, 2>
<Hadoop, 2>

Notice how the reduce step has aggregated the results to give us an accurate word count on the entire data set.
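If you want to run this word count yourself, a minimal job driver looks roughly like the sketch below. It assumes the TokenizerMapper and IntSumReducer classes sketched earlier are on the classpath, and takes the input and output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Wires the mapper and reducer into a single MapReduce job and submits it.
public class WordCount {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this is typically submitted with something along the lines of: hadoop jar wordcount.jar WordCount /input /output. The files in the output directory then contain the aggregated <word, count> pairs.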

Conclusion

The benefits of MapReduce are only realized when it is applied to a distributed cluster of servers. This creates a "multithreaded" environment of sorts, where the same function runs on many machines at the same time. While the map stage applies a function to data sets across the cluster, the reduce stage takes the shuffled and sorted key/value pairs and aggregates the results.

Your thoughts?