Quick Start | Apache Hadoop

Hadoop is an open-source framework for distributed computing. It's reliable, scalable, and widely used as a big data platform. In this article, we'll cover the basics of Hadoop, including its major components and architecture.

What is Hadoop?

Hadoop is a framework designed for big data. Specifically, the Hadoop "ecosystem" is a collection of libraries, scripts, and engines that work together to store large amounts of data on a distributed cluster of computers.

Hadoop's base framework includes four major modules. Below is a description of each module and its role in the larger Hadoop ecosystem.

Hadoop Common

This includes all the common utilities and libraries that support other Hadoop modules.

Hadoop Distributed File System (HDFS)

Data stored in a Hadoop cluster is broken down into smaller units called blocks. These blocks are distributed across DataNodes, one of which runs on each worker node in the cluster. The DataNodes perform block-level operations to read and write data in the Hadoop Distributed File System (HDFS).

By distributing data across these DataNodes, the cluster can execute map and reduce functions more efficiently on smaller chunks of data. It also allows blocks to be replicated across multiple servers to prevent data loss.

In addition to these DataNodes, each cluster has a central NameNode that manages the file system namespace, regulates client access, and handles metadata operations such as opening, closing, and renaming files and directories.

Clusters also have a Secondary NameNode, which takes periodic checkpoints of the primary NameNode's namespace metadata.
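To make this concrete, here's a minimal sketch of writing and reading a file through the Hadoop FileSystem Java API. It assumes a client configured to reach the cluster's NameNode (via fs.defaultFS on the classpath), and the path used is purely hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS should point at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/quickstart.txt"); // hypothetical path

        // Write a small file. Behind the scenes, HDFS splits larger files
        // into blocks and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same FileSystem API.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```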

Hadoop MapReduce

MapReduce provides an API for applying functions in parallel to smaller chunks of data across the cluster. This distributes the workload and allows for faster processing of large data sets. After a map task completes its work, its outputs become the inputs to a reduce task, which performs operations on the sorted key/value pairs.

It should be noted that MapReduce itself isn't inherently faster than other approaches. If you ran MapReduce on a single server, it wouldn't be much faster than running a function over the whole data set.

MapReduce's advantage comes from leveraging a distributed cluster of server nodes by bringing the functions to the data. By running a given function in parallel across many servers, MapReduce makes processing data on a shared cluster much faster.
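To illustrate the map and reduce steps, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are passed as arguments and are hypothetical; the mapper emits (word, 1) pairs and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit (word, 1) for every word in this task's chunk of input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: receives sorted (word, [counts]) pairs and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```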

Hadoop YARN

YARN separates resource management from the processing components. In Hadoop 1.0, a single JobTracker was responsible for assigning MapReduce tasks to different TaskTrackers. While the TaskTrackers ran independently on different nodes, the centralized JobTracker became a bottleneck for large jobs.

To solve this, YARN was introduced with Hadoop 2.0 to spread the JobTracker's responsibilities across the cluster. With YARN, a central ResourceManager allocates cluster resources, NodeManagers manage the resources on their individual nodes, and per-application ApplicationMasters coordinate task execution inside containers granted by the NodeManagers. This distributes the workload and improves scalability.

Hadoop also supports other modules that extend or build on its base framework, including:

HBase

An open source, non-relational database that runs on top of HDFS

Flume

A service for collecting, aggregating, and moving large amounts of log data

Spark

An in-memory data processing engine that can run on top of YARN and read and write data in HDFS. Spark is often used alongside, or in place of, MapReduce to improve performance.

Hive

Hive runs on top of Hadoop and provides a SQL-like interface for querying data stored in HDFS.
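As a rough illustration of that SQL-like interface, here's a minimal sketch that queries Hive over JDBC from Java. It assumes a HiveServer2 instance listening on the default port, a hypothetical table named logs, and the hive-jdbc driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 connection string; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // "logs" is a hypothetical table used only for illustration.
             ResultSet rs = stmt.executeQuery(
                     "SELECT level, COUNT(*) AS total FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString("level") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```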

Hadoop Architecture

Hadoop runs on a cluster of host machines called nodes. These nodes can be separated into different racks.

A centralized ResourceManager is rack aware: it knows where the slave nodes are located and runs a scheduler that dictates how resources are allocated.

Each slave node runs a NodeManager. These NodeManagers communicate with the ResourceManager and offer containers of memory and CPU for running applications.

For each application, one of these containers hosts an ApplicationMaster, which coordinates that application's tasks. By delegating per-application coordination to ApplicationMaster instances, YARN reduces the load on the ResourceManager and improves performance.

While YARN provides resources for running apps, HDFS provides fault-tolerant and reliable storage of data. The MapReduce framework runs on top of YARN to load balance data processing across the shared cluster.
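As a small illustration of the ResourceManager's view of the cluster, the sketch below uses the YarnClient Java API to list the running NodeManagers and the resources each one offers. It assumes a yarn-site.xml on the classpath that points at the cluster's ResourceManager.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodesExample {
    public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml from the classpath; it should point at
        // the cluster's ResourceManager.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Each report describes one NodeManager: its node ID, rack, and the
        // memory/vcores it has registered with the ResourceManager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " rack=" + node.getRackName()
                    + " capacity=" + node.getCapability());
        }

        yarnClient.stop();
    }
}
```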

Conclusion

The whole idea behind Hadoop is to leverage a shared cluster of computers to process and store big data. Through data locality, the MapReduce framework is able to bring processing to the data. HDFS reliably stores data in a distributed, fault-tolerant way via data blocks. YARN decouples resource management from processing to further leverage the shared cluster. Additional modules like Spark and Hive work on top of Hadoop for added functionality and performance.

Your thoughts?