Hadoop YARN Introduction

YARN stands for Yet Another Resource Negotiator. It was introduced with Hadoop 2.0 to address the limitations of classic MapReduce, improving cluster performance and opening Hadoop to other processing frameworks. In this article, we discuss what YARN is and why it makes Hadoop better.

Problems with MapReduce

Classic MapReduce relied on a single job tracker to coordinate all jobs running on the cluster. With Hadoop v1, this single job tracker assigned map/reduce tasks to the various task trackers on the cluster.

While this approach effectively leveraged a distributed environment, relying on a single process to delegate and monitor every job created a performance bottleneck. Additionally, because resource management was built into MapReduce itself, it was difficult to run other programming models (besides MapReduce) on a Hadoop cluster. The batch-oriented nature of MapReduce also made real-time processing and stream processing difficult to support.

Introducing YARN

YARN was introduced with Hadoop 2.0 to address the issues experienced with classic MapReduce. YARN decouples resource management from data processing, so MapReduce becomes one of several frameworks that can run on the cluster. YARN not only eliminates the bottleneck of a single job tracker, but also allows Hadoop to support more programming models and paradigms.

YARN Architecture

YARN splits the responsibilities of the traditional job tracker between a ResourceManager and per-application AppMasters. While the ResourceManager manages the usage of resources across the cluster, each AppMaster manages the lifecycle of a single app. The major YARN components are explained in more detail below:

ResourceManager

A single ResourceManager acts primarily as the scheduler for the cluster. It receives resource requests from the application masters and arbitrates the resources offered by the NodeManagers among the running applications.
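
To make the brokering role concrete, here is a minimal sketch in Python. It is a toy model and not YARN's actual scheduler: the function name schedule, the node names, and the first-fit policy are all assumptions made for illustration, and only memory is tracked.

```python
from collections import deque


def schedule(requests, free_mb):
    """Grant container requests in arrival order using first-fit placement.

    requests: iterable of (app_id, memory_mb) tuples sent by application masters.
    free_mb:  dict mapping node name -> free memory in MB (updated in place).
    Returns (granted, pending); granted holds (app_id, node, memory_mb) tuples.
    """
    queue = deque(requests)
    granted, pending = [], []
    while queue:
        app_id, memory_mb = queue.popleft()
        # First fit: pick the first node that still has enough free memory.
        node = next((n for n, free in free_mb.items() if free >= memory_mb), None)
        if node is None:
            # No node can host this request right now, so it stays pending.
            pending.append((app_id, memory_mb))
            continue
        free_mb[node] -= memory_mb
        granted.append((app_id, node, memory_mb))
    return granted, pending


# Hypothetical cluster of two nodes and three requests from two apps.
nodes = {"node-1": 4096, "node-2": 2048}
requests = [("app-1", 3072), ("app-2", 2048), ("app-1", 2048)]
granted, pending = schedule(requests, nodes)
print(granted)
print(pending)
```

In this run the first two requests are placed and the third stays pending until capacity frees up. A real cluster would use one of YARN's pluggable schedulers, which also account for queues and fairness.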

NodeManager

NodeManagers are more efficient and generic versions of task trackers. One NodeManager runs on each cluster node, where it launches and monitors work and reports the node's resource usage to the ResourceManager. NodeManagers provide computational resources in the form of containers.

Containers

Containers are managed by NodeManagers and are where tasks execute. A container bundles resources such as memory, CPU, disk, and networking, and gives the scheduler a unit for matching work to the resources available.
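
The idea of a container as a bundle of resources can be sketched as follows. This is a simplified model, not YARN code: the Resource and NodeManager classes here are hypothetical, and only memory and virtual cores are modeled, leaving out disk and networking.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """A bundle of resources (assumed here: memory in MB and virtual cores)."""
    memory_mb: int
    vcores: int

    def fits_in(self, other):
        # A request fits only if every dimension fits.
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores


@dataclass
class NodeManager:
    """Toy stand-in for a NodeManager tracking containers on one node."""
    name: str
    capacity: Resource
    running: dict = field(default_factory=dict)  # container_id -> Resource

    def available(self):
        used_mb = sum(r.memory_mb for r in self.running.values())
        used_vcores = sum(r.vcores for r in self.running.values())
        return Resource(self.capacity.memory_mb - used_mb,
                        self.capacity.vcores - used_vcores)

    def launch(self, container_id, size):
        # Refuse the container if it would exceed the node's free resources.
        if not size.fits_in(self.available()):
            return False
        self.running[container_id] = size
        return True

    def release(self, container_id):
        self.running.pop(container_id, None)


nm = NodeManager("node-1", Resource(memory_mb=8192, vcores=4))
first = nm.launch("container_01", Resource(4096, 2))
second = nm.launch("container_02", Resource(4096, 3))  # memory fits, vcores do not
nm.release("container_01")
third = nm.launch("container_02", Resource(4096, 3))   # fits once resources are freed
print(first, second, third, nm.available())
```

Treating a container as a multi-dimensional resource request is what lets the scheduler reason about "does this work fit on that node" without knowing anything about the framework that will run inside it.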

App Master

AppMasters are framework-specific entities that negotiate resources with the ResourceManager and work with the NodeManagers to execute and monitor tasks. AppMasters themselves run in containers on cluster nodes. For every app or job submitted to YARN, an AppMaster is created for that job.

YARN introduces the ResourceManager and the AppMaster to divide the responsibilities formerly held by a centralized job tracker. While the ResourceManager schedules jobs and manages cluster resources, an AppMaster exists for each submitted job to manage the lifecycle of that app or process.

Such a design better leverages a distributed environment because job monitoring is spread across the AppMasters rather than handled by one central process. Because each framework can supply its own AppMaster, different frameworks can run on the same cluster, bringing greater processing flexibility to the Hadoop ecosystem.

YARN Job Process

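The YARN process is as follows:

1. A client submits an application to the ResourceManager.
2. The ResourceManager allocates a container, and a NodeManager launches the application's AppMaster in it.
3. The AppMaster registers with the ResourceManager and requests containers for its tasks.
4. The AppMaster works with the NodeManagers to launch and monitor the tasks in the granted containers.
5. When the work is done, the AppMaster unregisters from the ResourceManager and its resources are released.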
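
The lifecycle of one application can be pictured with a toy simulation. This is a sketch under simplifying assumptions, not the real protocol: the function run_application is hypothetical, all containers are assumed to be the same size, and tasks run in waves that finish together.

```python
def run_application(app_id, task_count, free_containers):
    """Return the ordered events for one application in a toy YARN-like flow.

    free_containers is the number of equally sized container slots that are
    free in the cluster (a simplifying assumption).
    """
    events = [f"client submits {app_id} to the ResourceManager"]
    if free_containers < 1:
        events.append("application waits: no container free for the AppMaster")
        return events

    # The first container of every application hosts its AppMaster.
    free_containers -= 1
    events.append("ResourceManager allocates a container for the AppMaster")
    events.append("NodeManager launches the AppMaster in that container")
    events.append("AppMaster registers with the ResourceManager")

    remaining = task_count
    while remaining > 0:
        if free_containers == 0:
            events.append("AppMaster waits: no containers free for tasks")
            return events
        # The ResourceManager grants as many containers as are currently free.
        granted = min(remaining, free_containers)
        events.append(
            f"AppMaster requests {remaining} containers, ResourceManager grants {granted}"
        )
        events.append(f"NodeManagers launch {granted} task containers")
        events.append(f"{granted} tasks finish and their containers are released")
        remaining -= granted

    events.append("AppMaster unregisters and its container is released")
    return events


events = run_application("app-1", task_count=5, free_containers=4)
for step, event in enumerate(events, start=1):
    print(f"{step}. {event}")
```

With five tasks and four free slots, one slot goes to the AppMaster and the tasks run in two waves of three and two. The point of the sketch is the division of labor: the ResourceManager only hands out containers, while the AppMaster decides what to run in them and tracks progress.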

Conclusion

YARN improves the performance and flexibility of a Hadoop cluster by separating cluster-wide resource management from per-application job management. By utilizing cluster resources more effectively, YARN removes the single job tracker bottleneck and better isolates the applications and processes running on Hadoop.
