Yarn Interview Questions

1) What is YARN?
- YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop; it manages cluster resources and provides an execution environment for processes.
- YARN is a powerful and efficient feature rolled out as part of Hadoop 2.0. It is a large-scale distributed system for running big data applications.
- ResourceManager:
- It receives processing requests and passes parts of the requests to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs.
- It consists of 2 components:
- ApplicationManager: accepts job submissions, negotiates the container for the ApplicationMaster and handles failures while executing MapReduce jobs.
- Scheduler: allocates the resources required by the various applications running on the Hadoop cluster.
- NodeManager:
- NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.

2) Is YARN a replacement of Hadoop MapReduce?
- YARN is not a replacement for Hadoop MapReduce but a more powerful and efficient technology that supports MapReduce; it is also referred to as Hadoop 2.0 or MapReduce 2.

3) What are the additional benefits YARN brings in to Hadoop?
- Effective utilization of resources, as multiple applications can run in YARN while sharing a common resource pool.
- In Hadoop MapReduce there are separate slots for Map and Reduce tasks, whereas in YARN there are no fixed slots.
- The same container can be used for Map and Reduce tasks, leading to better utilization.
- YARN is backward compatible, so all existing MapReduce jobs continue to run on it.
- Using YARN, one can even run applications that are not based on the MapReduce model.

4) How can native libraries be included in YARN jobs?
- There are two ways to include native libraries in YARN jobs:
1) By setting -Djava.library.path on the command line, but in this case there is a chance that the native libraries are not loaded correctly and errors may occur.
2) The better option is to set LD_LIBRARY_PATH in the .bashrc file.

5) Explain the differences between Hadoop 1.x and Hadoop 2.x
- In Hadoop 1.x, MapReduce is responsible for both processing and cluster management.
- In Hadoop 2.x, processing is handled by other processing models while YARN is responsible for cluster management.
- Hadoop 2.x scales better than Hadoop 1.x, to close to 10,000 nodes per cluster.
- Hadoop 1.x has a single-point-of-failure problem: whenever the NameNode fails, it has to be recovered manually.
- In Hadoop 2.x, a Standby NameNode overcomes the SPOF problem; whenever the active NameNode fails, it can be configured to recover automatically.
- Hadoop 1.x works on the concept of slots, whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.

6) What are the core changes in Hadoop 2.0?
- Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling and the manner in which execution occurs.
- In Hadoop 2.x the cluster resource management capabilities work in isolation from the MapReduce specific programming logic.
- This helps Hadoop to share resources dynamically between multiple parallel processing frameworks like Impala and the core MapReduce component.
- Hadoop 2.x allows workable and fine-grained resource configuration, leading to better cluster utilization so that applications can scale to process a larger number of jobs.

7) Differentiate between NFS, Hadoop NameNode and JournalNode.
- NFS (Network File System) is a protocol for accessing files over a network; in an HDFS high-availability setup it can serve as the shared storage for the NameNode's edit log.
- The Hadoop NameNode is the master daemon that manages the HDFS namespace and the file-to-block metadata.
- Standby and Active NameNodes communicate with a group of lightweight nodes to keep their state synchronized. These are known as JournalNodes.

8) What are the modules that constitute the Apache Hadoop 2.0 framework?
- Hadoop 2.0 contains four important modules of which 3 are inherited from Hadoop 1.0 and a new module YARN is added to it.
- Hadoop Common – This module consists of all the basic utilities and libraries required by the other modules.
- HDFS- Hadoop Distributed file system that stores huge volumes of data on commodity machines across the cluster.
- MapReduce- Java based programming model for data processing.
- YARN- This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.

9) How is the distance between two nodes defined in Hadoop?
- Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree.
- The distance between two nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping.
- The distance is equal to the sum of the distances from each node to their closest common ancestor.
- The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent node is always 1.
