Hadoop YARN Introduction

YARN stands for Yet Another Resource Negotiator. It was introduced with Hadoop 2.0 to address the limitations of classic MapReduce, improving cluster performance and allowing Hadoop to run processing models other than MapReduce. In this article, we discuss what YARN is and why it makes Hadoop better.

Problems with MapReduce

Classic MapReduce relied on a single job tracker to coordinate all jobs running on the cluster. With Hadoop v1, this single job tracker assigned map/reduce tasks to the various task trackers on the cluster.

While this approach made effective use of a distributed environment, relying on a single process to delegate and monitor every job created a performance bottleneck. Additionally, the tight coupling between MapReduce and HDFS made it difficult to run other programming models (besides MapReduce) on Hadoop. The batch-oriented nature of MapReduce also made real-time processing and stream processing difficult to support.

Introducing YARN

YARN was introduced with Hadoop 2.0 to address the issues experienced with classic MapReduce. YARN decouples resource management from data processing, both of which classic MapReduce handled itself. This not only eliminates the bottleneck of a single job tracker but also allows Hadoop to support more programming models and paradigms.

YARN Architecture

YARN splits the roles of the traditional job tracker between a ResourceManager and per-application AppMasters (application masters). While the ResourceManager manages the usage of resources across the cluster, each AppMaster manages the lifecycle of a single app. The major YARN components are explained in more detail below:

ResourceManager

A single ResourceManager acts primarily as the cluster-wide scheduler for Hadoop. It receives resource requests from the different AppMasters and brokers the resources offered by the NodeManagers.

NodeManager

NodeManagers are more efficient and generic versions of task trackers. Each NodeManager monitors and manages the operations of an individual cluster node. NodeManagers provide computational resources in the form of containers.

Containers

Containers are launched and managed by NodeManagers to execute tasks. A container bundles memory, CPU, disk, and networking resources, and gives the scheduler a unit for matching work to available resources.
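As a rough illustration of the two sections above, the following Python sketch models a container as a bundle of memory and CPU that a NodeManager carves out of its node's capacity. The class names mirror the YARN concepts, but every field and method here (`launch`, `release`, `free_memory_mb`, and so on) is hypothetical and is not the Hadoop API. Disk and network are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Container:
    """A slice of one node's resources (hypothetical model, not the Hadoop API)."""
    container_id: int
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Tracks the free capacity of a single node and hands out containers."""
    total_memory_mb: int
    total_vcores: int
    running: dict = field(default_factory=dict)

    @property
    def free_memory_mb(self) -> int:
        return self.total_memory_mb - sum(c.memory_mb for c in self.running.values())

    @property
    def free_vcores(self) -> int:
        return self.total_vcores - sum(c.vcores for c in self.running.values())

    def launch(self, container_id: int, memory_mb: int, vcores: int):
        """Return a new Container, or None if the node lacks capacity."""
        if memory_mb > self.free_memory_mb or vcores > self.free_vcores:
            return None
        container = Container(container_id, memory_mb, vcores)
        self.running[container_id] = container
        return container

    def release(self, container_id: int) -> None:
        """Free the resources held by a finished container."""
        self.running.pop(container_id, None)
```

In this sketch a request that exceeds the node's remaining memory or cores is simply refused, which is the sense in which containers give scheduling decisions a concrete unit to reason about.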

App Master

AppMasters are framework-specific entities that negotiate resources with the ResourceManager and the working nodes. AppMasters themselves run as containers on specified nodes. For every app or job submitted to YARN, an AppMaster is created for that job.
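To make "framework-specific" a little more concrete, here is a minimal Python sketch in which two different kinds of application share one interface for telling the ResourceManager what they need. The base class, the two subclasses, and the container sizes are all invented for illustration; real AppMasters are written against the Hadoop client libraries.

```python
from abc import ABC, abstractmethod


class AppMaster(ABC):
    """Hypothetical base class: one instance per submitted application."""

    @abstractmethod
    def resource_requests(self) -> list:
        """Return (memory_mb, vcores) tuples to ask the ResourceManager for."""


class BatchAppMaster(AppMaster):
    """Batch-style framework: one small container per input split."""

    def __init__(self, num_splits: int):
        self.num_splits = num_splits

    def resource_requests(self) -> list:
        # Container size is an arbitrary illustrative value.
        return [(1024, 1)] * self.num_splits


class StreamingAppMaster(AppMaster):
    """Streaming-style framework: a fixed pool of larger, long-running workers."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers

    def resource_requests(self) -> list:
        # Container size is an arbitrary illustrative value.
        return [(2048, 2)] * self.num_workers
```

The ResourceManager only ever sees the list of requests, so in this model it does not need to know which framework produced them.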

YARN introduces the ResourceManager and the AppMaster to divide up the responsibilities formerly held by a centralized job tracker. While the ResourceManager schedules jobs and manages cluster resources, an AppMaster exists for each submitted job to manage the lifecycle of that app or process.

Such a design better leverages a distributed environment because job monitoring is decoupled from processing. AppMasters can also be written for different frameworks, bringing greater processing flexibility to the Hadoop ecosystem.

YARN Job Process

The YARN process is as follows:

  • A user submits an application (job) to the ResourceManager.
  • The ResourceManager determines what resources should be used for the app.
  • The ResourceManager chooses a container for the job's AppMaster to run in.
  • The AppMaster for the submitted application starts.
  • The AppMaster requests the necessary resources from the ResourceManager.
  • The ResourceManager delegates the resources (containers) that satisfy the requirements requested by the AppMaster.
  • The AppMaster works with the NodeManager and delegated container(s) to monitor progress and task completion.
  • Once the process is complete, the AppMaster shuts itself down.
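
The steps above can be sketched as a small, self-contained Python simulation. This is a toy model under simplifying assumptions, not the Hadoop API: the `Node` and `ResourceManager` classes and the `run_application` function are hypothetical, only memory is tracked, allocation is first fit, and a request that cannot be satisfied is dropped rather than queued.

```python
class Node:
    """One cluster node with a fixed memory budget (hypothetical model)."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Grants containers from the first node that has room (first fit)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return (node, memory_mb)  # a "container" is just (node, size) here
        return None  # no node can satisfy the request right now

    def release(self, container):
        node, memory_mb = container
        node.free_mb += memory_mb


def run_application(rm, am_memory_mb, task_memory_mb, num_tasks, log):
    # Bullets 1-3: the job is submitted and the RM picks a container for its AppMaster.
    am_container = rm.allocate(am_memory_mb)
    if am_container is None:
        log.append("rejected: no room for AppMaster")
        return False

    # Bullet 4: the AppMaster starts inside that container.
    log.append(f"AppMaster started on {am_container[0].name}")

    # Bullets 5-6: the AppMaster requests task containers and the RM grants them.
    granted = []
    for _ in range(num_tasks):
        container = rm.allocate(task_memory_mb)
        if container is not None:
            granted.append(container)
    log.append(f"granted {len(granted)} of {num_tasks} task containers")

    # Bullet 7: tasks run to completion (not simulated); their containers are freed.
    for container in granted:
        rm.release(container)

    # Bullet 8: the AppMaster shuts down and its own container is freed.
    rm.release(am_container)
    log.append("AppMaster finished")
    return len(granted) == num_tasks
```

For example, with `nodes = [Node("node-a", 2048), Node("node-b", 4096)]`, calling `run_application(ResourceManager(nodes), 1024, 1024, 4, log)` places the AppMaster on `node-a`, spreads the four task containers across both nodes, and leaves both nodes at full capacity once the application finishes.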

Conclusion

YARN enhances the performance of a Hadoop cluster through a more flexible architecture. By splitting the job tracker's work between the ResourceManager and per-application AppMasters, YARN removes the single job tracker bottleneck, uses cluster resources more effectively, and better isolates the applications and processes running on Hadoop.

Your thoughts?