YARN stands for Yet Another Resource Negotiator. YARN was introduced with Hadoop 2.0 to address the limitations of MapReduce and improve performance and compatibility with Hadoop. In this article, we discuss what YARN is and why it makes Hadoop better.
Classic MapReduce relied on a single job tracker to coordinate all jobs running on the cluster. With Hadoop v1, this single job tracker assigned map/reduce tasks to the various task trackers on the cluster.
While this approach effectively leveraged a distributed environment, relying on a single process to delegate and monitor every job created a performance bottleneck. Additionally, the relationship between MapReduce and HDFS made it difficult to accommodate other programming models (besides MapReduce) with Hadoop. The batch-oriented nature of MapReduce made supporting real-time processing and stream processing difficult.
YARN was introduced with Hadoop 2.0 to address the issues experienced with classic MapReduce. YARN effectively decouples MapReduce resource management from data processing. YARN not only eliminates the bottleneck issue with a single job tracker, it also allows Hadoop to support more programming models and paradigms.
YARN separates the roles of the traditional job tracker with the introduction of a ResourceManager and concept of application masters. While the ResourceManager manages the usage of resources across the cluster, AppMasters manage the lifecycle of the app. Following is an explanation of the major YARN components in more detail:
A single ResourceManager acts primarily as a job scheduler for Hadoop. It monitors the different application masters for job requests and brokers the resources consumed by NodeManagers and agents.
NodeManagers are more efficient and generic versions of task trackers. They monitor and process the operations of individual cluster nodes. NodeManagers provide computational resources in the form of containers.
Containers are handled by NodeManagers to execute tasks. Containers are a way of organizing memory, cpu, disk, networking resources and give context to scheduling work with available resources.
YARN introduces the idea of the ResourceManager and AppMaster to alleviate the former responsibilities of a centralized job tracker. While the ResourceManager schedules jobs and manages cluster resources, the AppMaster exists for each submitted job to manage the lifecycle of the app or process.
Such a design better leverages a distributed environment as job monitoring is decoupled from processing. The ApplicationMaster also run with different frameworks, bringing greater processing flexibility to the Hadoop ecosystem.
The YARN process is as follows:
YARN enhances the performance of a Hadoop cluster through superior architecture and flexibility. By more effectively utilizing cluster resources, YARN eliminates performance bottlenecks and better isolates applications and processes running on Hadoop.