Yarn Interview Questions

1) What is YARN?
- YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop; it manages cluster resources and provides an execution environment for processes.
- YARN is a powerful and efficient feature rolled out as part of Hadoop 2.0. It is a large-scale distributed system for running big data applications.
- ResourceManager:
- It receives processing requests and passes parts of the requests to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs.
- It consists of 2 components:
- ApplicationManager: accepts job submissions, negotiates the container for the ApplicationMaster and handles failures while executing MapReduce jobs.
- Scheduler: allocates the resources required by the various applications running on the Hadoop cluster.
- NodeManager:
- NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.

2) Is YARN a replacement of Hadoop MapReduce?
- YARN is not a replacement for Hadoop MapReduce but a more powerful and efficient technology that supports MapReduce; it is also referred to as Hadoop 2.0 or MapReduce 2.

3) What are the additional benefits YARN brings in to Hadoop?
- Effective utilization of resources, as multiple applications can run in YARN while sharing a common resource pool.
- In Hadoop MapReduce there are separate slots for Map and Reduce tasks, whereas in YARN there are no fixed slots.
- The same container can be used for Map and Reduce tasks, leading to better utilization.
- YARN is backward compatible, so all existing MapReduce jobs continue to run on it.
- Using YARN, one can even run applications that are not based on the MapReduce model.

4) How can native libraries be included in YARN jobs?
- There are two ways to include native libraries in YARN jobs:
1) By setting -Djava.library.path on the command line, but in this case there is a chance that the native libraries are not loaded correctly and errors may occur.
2) The better option is to set LD_LIBRARY_PATH in the .bashrc file.

5) Explain the differences between Hadoop 1.x and Hadoop 2.x
- In Hadoop 1.x, MapReduce is responsible for both processing and cluster management.
- In Hadoop 2.x, processing is handled by other processing models while YARN is responsible for cluster management.
- Hadoop 2.x scales better than Hadoop 1.x, to close to 10,000 nodes per cluster.
- Hadoop 1.x has a single-point-of-failure problem: whenever the NameNode fails, it has to be recovered manually.
- In Hadoop 2.x, a Standby NameNode overcomes the SPOF problem; whenever the active NameNode fails, it can be configured to recover automatically.
- Hadoop 1.x works on the concept of slots, whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.

6) What are the core changes in Hadoop 2.0?
- Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling and the manner in which execution occurs.
- In Hadoop 2.x the cluster resource management capabilities work in isolation from the MapReduce specific programming logic.
- This helps Hadoop to share resources dynamically between multiple parallel processing frameworks like Impala and the core MapReduce component.
- Hadoop 2.x allows workable and fine-grained resource configuration, leading to better cluster utilization so that applications can scale to process a larger number of jobs.

7) Differentiate between NFS, Hadoop NameNode and JournalNode.
- NFS (Network File System) is a protocol for accessing files over a network; in an HDFS high-availability setup it can serve as the shared storage for the NameNode's edit log.
- The Hadoop NameNode is the master daemon that manages the HDFS namespace and the file-to-block metadata.
- Standby and Active NameNodes communicate with a group of lightweight nodes to keep their state synchronized. These are known as JournalNodes.

8) What are the modules that constitute the Apache Hadoop 2.0 framework?
- Hadoop 2.0 contains four important modules of which 3 are inherited from Hadoop 1.0 and a new module YARN is added to it.
- Hadoop Common – This module consists of all the basic utilities and libraries required by the other modules.
- HDFS- Hadoop Distributed file system that stores huge volumes of data on commodity machines across the cluster.
- MapReduce- Java based programming model for data processing.
- YARN- This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.

9) How is the distance between two nodes defined in Hadoop?
- Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree.
- The distance between two nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping.
- The distance is equal to the sum of the distances from each node to their closest common ancestor.
- The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent node is always 1.
