Hadoop Cluster and HDFS Interview Questions
1) What do the four V’s of Big Data denote?
a) Volume – Scale of data
b) Velocity – Analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data
2) On what concept does the Hadoop framework work?
- The Hadoop framework works on the following two core components:
1) HDFS (Storage Unit)
- Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets.
- Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.
2) MapReduce (Processing and Cluster Management Unit)
- This is the Java-based programming paradigm of the Hadoop framework that provides scalability across Hadoop clusters.
- MapReduce distributes the workload into various tasks that can run in parallel. A Hadoop job performs two separate tasks:
- Map: The map job breaks the data sets down into key-value pairs (tuples).
- Reduce: The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples.
- The reduce job is always performed after the map job has completed (see the word-count sketch below).
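- For illustration, here is a minimal word-count sketch (class and variable names are our own) using the standard org.apache.hadoop.mapreduce API; the map phase emits (word, 1) pairs and the reduce phase sums them per word:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: break each input line into (word, 1) key-value pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }
    // Reduce: combine all (word, 1) pairs for the same word into one (word, count) pair
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}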
3) What are the main components of a Hadoop Application?
1) Storage Unit: HDFS (NameNode and DataNode)
2) Processing Framework: YARN(Resource Manager, Node Manager)-Yet Another Resource Negotiator
3) Data Access Components are - Pig and Hive
4) Data Storage Component is - HBase
5) Data Integration Components are - Apache Flume, Sqoop, Chukwa
6) Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper.
7) Data Serialization Components are - Thrift and Avro
8) Data Intelligence Components are - Apache Mahout and Drill.
4) What is a block and block scanner in HDFS?
- Block:
- The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS.
- The default size of a block in HDFS is 64MB in Hadoop 1.x
- The default size of a block in HDFS is 128MB in Hadoop 2.x
- which is much larger as compared to the Linux system where the block size is 4KB.
- The reason for having such a large block size is to minimize seek cost and to reduce the metadata generated per block.
- The block size can be configured with the dfs.block.size parameter (dfs.blocksize in Hadoop 2.x) in the hdfs-site.xml file, as shown below.
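- For example, assuming we want a 128 MB block size, a property like the following can be added to hdfs-site.xml (the value is in bytes):
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description>HDFS block size of 128 MB</description>
</property>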
- Block Scanner:
- Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors.
- Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
- Block scanner runs periodically on every DataNode to verify whether the data blocks stored are correct or not.
- The following steps will occur when a corrupted data block is detected by the block scanner:
- First, the DataNode will report about the corrupted block to the NameNode.
- Then, NameNode will start the process of creating a new replica using the correct replica of the corrupted block present in other DataNodes.
- The corrupted data block will not be deleted until the replication count of the correct replicas matches with the replication factor (3 by default).
- Finally, you can see the Block Scanner status report in the DataNode web UI:
localhost:50075/blockScannerReport
5) Explain the difference between NameNode, DataNode, Checkpoint Node and Backup Node.
Name Node:
- NameNode is at the heart of the HDFS file system; it manages the metadata about the actual blocks residing in the DataNodes in the cluster.
- NameNode uses two files for the namespace
- fsimage file - It keeps track of the latest checkpoint of the namespace.
- It contains the complete state of the file system namespace since the start of the NameNode.
- edits file - It is a log of changes that have been made to the namespace since the last checkpoint.
- It contains all the recent modifications made to the file system with respect to the recent FsImage.
- As a rule of thumb, metadata for a file, block or directory takes about 150 bytes.
Data Node:
- This is where the actual data resides in the cluster, in the form of blocks.
- Every 3 seconds it sends a heartbeat to the NameNode; otherwise the NameNode considers that DataNode dead.
- It is responsible for serving read and write requests from clients.
Checkpoint Node(Secondary Name Node):
- Checkpoint Node keeps track of the latest checkpoint in a directory that has same structure as that of NameNode’s directory.
- Checkpoint node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage file from the NameNode and merging it locally.
- The new image is then uploaded back to the active NameNode.
- It is performed by the Secondary NameNode.
BackupNode:
- The Backup Node also provides checkpointing functionality like that of the Checkpoint Node,
- but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
- It can be thought of as a kind of passive NameNode.
6) What is commodity hardware?
- Commodity hardware refers to inexpensive systems that do not have high availability or high-end quality.
- Commodity hardware includes RAM, because there are specific services that need to execute in RAM.
- Hadoop can be run on any commodity hardware and does not require any super computers or high end hardware configuration to execute jobs.
7) Ports of Hadoop Eco System:
Refer to the Word doc.
8) Explain about the process of inter cluster data copying.
- HDFS provides a distributed data copying facility through DistCp, which copies data from a source to a destination.
- When this data copying happens between two Hadoop clusters, it is referred to as inter-cluster data copying.
- DistCp requires the source and destination to run the same or a compatible version of Hadoop.
- Command: hadoop distcp hdfs://<source NameNode> hdfs://<target NameNode>
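- For example (the NameNode hosts, port and paths below are hypothetical):
hadoop distcp hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir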
9) How can you overwrite the replication factors in HDFS?
- Using the Hadoop FS shell, on a per-file basis: hadoop fs -setrep -w 2 /my/test_file
- Using the Hadoop FS shell, for all files under a given directory: hadoop fs -setrep -w 2 /my/test_dir
- To check the replication factor, use one of the following commands:
hadoop fsck /sample/test.xml -files
or
hadoop fs -ls /sample
- By adding this property to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>5</value>
<description>Block Replication</description>
</property>
10) Explain about the indexing process in HDFS.
- Indexing process in HDFS depends on the block size.
- HDFS stores the last part of the data, which points to the address where the next part of the data chunk is stored.
11) What is a rack awareness and on what basis is data stored in a rack?
- All the DataNodes put together form a storage area; the physical location of a group of DataNodes is referred to as a rack in HDFS.
- The rack information, i.e. the rack ID of each DataNode, is acquired by the NameNode, typically through a configured topology script (see the sketch after this list).
- The process of selecting closer DataNodes based on the rack information is known as Rack Awareness.
- The Rack Awareness algorithm says that the first replica of a block is stored on the local rack, and the next two replicas are stored on a
different (remote) rack, but on different DataNodes within that remote rack.
- There are two reasons for using Rack Awareness:
1) To improve the network performance:
2) To prevent loss of data:
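- The NameNode typically learns the rack ID of each DataNode from an administrator-supplied topology script configured in core-site.xml; the script path below is hypothetical:
<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf/topology.sh</value>
</property>
- (In Hadoop 1.x the equivalent property is topology.script.file.name.)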
12) What happens when a user submits a Hadoop job while the NameNode is down - does the job go on hold or does it fail?
- The Hadoop job fails when the NameNode is down.
13) What happens when a user submits a Hadoop job while the JobTracker is down - does the job go on hold or does it fail?
- The Hadoop job fails when the Job Tracker is down.
14) Whenever a client submits a hadoop job, who receives it?
- NameNode receives the Hadoop job which then looks for the data requested by the client and provides the block information.
- JobTracker takes care of resource allocation of the hadoop job to ensure timely completion.
15) What is a daemon?
- A daemon is a process that runs in the background in the UNIX environment. The Windows equivalent is a 'service' and in DOS it is a 'TSR'.
16) What is meant by heartbeat in HDFS?
- Data nodes and task trackers send heartbeat signals to Name node and Job tracker respectively to inform that they are alive.
- If the signal is not received it would indicate problems with the node or task tracker.
17) Is it necessary that Name node and job tracker should be on the same host?
- No! They can be on different hosts.
18) How is a DataNode identified as saturated?
- When a DataNode is full and has no space left, the NameNode identifies it as saturated.
19) How does the client communicate with the NameNode and DataNode in HDFS?
- Clients communicate with the NameNode over Hadoop RPC (TCP) and transfer data to/from DataNodes over a TCP streaming protocol; SSH is used only by the cluster start/stop scripts, not for client communication.
20) What is the difference between NAS (Network Attached Storage) and HDFS?
NAS:
- It is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients.
- NAS can be either hardware or software that provides a service for storing and accessing files.
- In NAS, data is stored on dedicated hardware.
- NAS is not suitable for MapReduce
HDFS:
- Hadoop Distributed File System (HDFS), on the other hand, is a distributed file system that stores data on commodity hardware.
- In HDFS, data blocks are distributed across all the machines in a cluster.
- HDFS is designed to work with MapReduce paradigm, where computation is moved to the data.
21) What is throughput? How does HDFS provide good throughput?
- Throughput is the amount of work done in unit time. HDFS provides good throughput because:
- HDFS is based on the Write Once, Read Many model, which simplifies data coherency issues since data written once cannot be modified, and therefore
provides high-throughput data access.
- In Hadoop, the computation is moved towards the data, which reduces network congestion and therefore enhances the overall system throughput.
- The process of moving the computation to the data, rather than the data to the computation, is known as data locality.
22) How do you copy a file into HDFS with a block size different from the existing block size configuration?
- One can copy a file into HDFS with a different block size by using '-Ddfs.blocksize=block_size', where block_size is specified in bytes:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/edureka/test.txt /sample_hdfs
- To check the HDFS block size associated with this file:
hadoop fs -stat %o /sample_hdfs/test.txt
23) Can you change the block size of HDFS files?
- Yes, I can change the block size of HDFS files by changing the default size parameter present in hdfs-site.xml.
- But, I will have to restart the cluster for this property change to take effect.
24) How does HDFS ensure the fault tolerance capability of the system?
- HDFS provides fault tolerance by replicating the data blocks and distributing them among different DataNodes across the cluster.
- By default, the replication factor is set to 3, which is configurable.
25) Can you modify the file present in HDFS?
- No, I cannot modify the files already present in HDFS, as HDFS follows Write Once Read Many model.
- But, I can always append data to an existing HDFS file, as shown below.
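- For example, using the shell (the local and HDFS paths are hypothetical):
hdfs dfs -appendToFile /home/edureka/more_data.txt /sample_hdfs/test.txt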
26) Can multiple clients write into an HDFS file concurrently?
- No, multiple clients can’t write into an HDFS file concurrently.
- Because HDFS follows single writer multiple reader model.
27) Does HDFS allow a client to read a file which is already opened for writing?
- Yes, one can read a file that is already open for writing.
- But the problem with reading a file that is currently being written lies in the consistency of the data,
- i.e. HDFS does not guarantee that data written to the file will be visible to a new reader before the file has been closed.
- To address this, one can call the hflush operation explicitly, which pushes all the buffered data into the write pipeline and
then waits for acknowledgements from the DataNodes.
- Hence, the data written to the file before the hflush call is guaranteed to be visible to readers.
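- A minimal Java sketch of this behaviour (the file path is hypothetical), using FSDataOutputStream.hflush():
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/sample_hdfs/inflight.txt"))) {
            out.writeBytes("first batch of records\n");
            // Push buffered data through the write pipeline and wait for DataNode acknowledgements;
            // everything written above is now visible to new readers even though the file is still open.
            out.hflush();
            out.writeBytes("second batch of records\n");
        } // close() flushes the rest and finalizes the file
    }
}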
28) What do you mean by the High Availability of a NameNode? How is it achieved?
- The NameNode used to be a single point of failure in Hadoop 1.x, where the whole Hadoop cluster became unavailable as soon as the NameNode went down,
- and NameNode recovery was a manual process that took a long time.
- To solve this single-point-of-failure problem, the HA feature was introduced in Hadoop 2.x, where we have two NameNodes in the HDFS cluster in
an active/passive configuration.
- Hence, if the active NameNode fails, the passive (standby) NameNode can take over its responsibilities and keep HDFS up and running.
- ZooKeeper coordinates the failover process across the cluster.
29) Define Hadoop Archives? What is the command for archiving a group of files in HDFS.
- Hadoop Archives were introduced to cope with the problem of increasing NameNode memory usage for storing metadata, caused by
too many small files.
- Basically, it allows us to pack a number of small HDFS files into a single archive file and therefore, reducing the metadata information.
- The final archived file follows the .har extension and one can consider it as a layered file system on top of HDFS.
- Command:
hadoop archive -archiveName edureka_archive.har /input/location /output/location
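- Note: recent Hadoop releases expect a -p (parent path) argument, and the resulting archive can be browsed through the har:// scheme, for example:
hadoop archive -archiveName edureka_archive.har -p /input/location /output/location
hadoop fs -ls har:///output/location/edureka_archive.har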
30) Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.
- NameNode:
- It is the master node which is responsible for storing the metadata of all the files and directories.
- It has information about the blocks that make up each file and where those blocks are located in the cluster.
- Datanode:
- It is the slave node that contains the actual data.
- Secondary NameNode:
- It periodically merges the changes (edit log) with the FsImage (Filesystem Image), present in the NameNode.
- It stores the modified FsImage in persistent storage, which can be used in case of NameNode failure.
- ResourceManager:
- It is the central authority that manages resources and schedules applications running on top of YARN.
- NodeManager:
- It runs on slave machines, and is responsible for launching the application’s containers (where applications execute their part),
- monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
- JobHistoryServer:
- It maintains information about MapReduce jobs after the Application Master terminates.
31) Commissioning and Decommissioning Nodes in a Hadoop Cluster:
- Steps for Decommissioning process:
1) Update the network address in exclude files: dfs.exclude and mapred.exclude
2) Update the Name Node: hadoop dfsadmin -refreshNodes
3) Update the Job Tracker: hadoop mradmin -refreshNodes
4) Check the web UI for the status of the decommission process: 'Decommission in progress' or 'Decommissioned'.
5) Remove the nodes from the include file and refresh the nodes.
6) Remove the nodes from the slaves file.
Here is the sample configuration for the exclude file in hdfs-site.xml and mapred-site.xml:
hdfs-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
mapred-site.xml
<property>
<name>mapred.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
- Steps for commissioning process:
1) Update the network address in the include file.
2) Refresh the NameNode: hadoop dfsadmin -refreshNodes
3) Refresh the JobTracker: hadoop mradmin -refreshNodes
4) Update the slaves file.
5) Start the DataNode and TaskTracker on the added node: hadoop-daemon.sh start datanode and hadoop-daemon.sh start tasktracker
6) Check the web UI for successful addition.
7) Run the balancer to move HDFS blocks to the new DataNode: hadoop balancer -threshold 40
Note: Balancer attempts to provide a balance to a certain threshold among data nodes by copying block data from older nodes to newly commissioned nodes.
32) What is configured in /etc/hosts and what is its role in setting Hadoop cluster?
- The /etc/hosts file contains the hostnames and the IP addresses of the hosts.
- It maps IP addresses to hostnames.
- In a Hadoop cluster, we store all the hostnames (master and slaves) with their IP addresses in /etc/hosts
so that we can easily use hostnames instead of IP addresses.
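- A hypothetical example (the IP addresses and hostnames are only illustrative):
192.168.1.100   master
192.168.1.101   slave1
192.168.1.102   slave2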
33) What are the main Hadoop configuration files?
- core-site.xml: - core-site.xml informs Hadoop daemon where NameNode runs on the cluster.
- It contains configuration settings of Hadoop core such as I/O settings that are common to HDFS & MapReduce.
- hdfs-site.xml: - hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode).
- It also includes the replication factor and block size of HDFS.
- mapred-site.xml: - mapred-site.xml contains configuration settings of the MapReduce framework, such as the number of JVMs that can run in parallel,
the memory size of the mappers and reducers, the CPU cores available for a process, etc.
- yarn-site.xml: - yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as the memory available for containers,
the maximum and minimum container sizes, auxiliary services, etc.
34) How does the Hadoop CLASSPATH play a vital role in starting or stopping Hadoop daemons?
- CLASSPATH includes all the directories containing jar files required to start/stop Hadoop daemons.
- The CLASSPATH is set inside /etc/hadoop/hadoop-env.sh file.
35) What is the full form of fsck?
- The full form of fsck is File System Check. HDFS supports the fsck (filesystem check) command to check for various inconsistencies.
- It is designed for reporting the problems with the files in HDFS, for example, missing blocks of a file or under-replicated blocks.
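- For example, to report the files, blocks and block locations for the whole namespace:
hadoop fsck / -files -blocks -locations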
36) Which are the main hdfs-site.xml properties?
- dfs.name.dir:
- specifies the location where the NameNode stores its metadata (FsImage and edit logs)
and where DFS data is located - on local disk or on a remote directory.
- dfs.data.dir:
- specifies the location on the DataNodes where the actual data is stored.
- fs.checkpoint.dir:
- is the directory on the filesystem where the Secondary NameNode stores the temporary edit logs and FsImage
that are to be merged for the checkpoint.
37) How can we view the compressed files via HDFS command?
- We can view compressed files in HDFS using hadoop fs -text /filename command.
38) What is the command to move into safe mode and exit safe mode?
- Safe Mode in Hadoop is a maintenance state of the NameNode during which NameNode doesn’t allow any changes to the file system.
- During Safe Mode, HDFS cluster is read-only and doesn’t replicate or delete blocks.
- To know the status of safe mode, you can use the command: hadoop dfsadmin -safemode get
- To exit safe mode: hadoop dfsadmin -safemode leave
- To enter safe mode: hadoop dfsadmin -safemode enter
39) In HADOOP_PID_DIR, what does PID stand for? What does it do?
- PID stands for ‘Process ID’. This directory stores the Process ID of the servers that are running.
40) What does hadoop-metrics.properties file do?
- hadoop-metrics.properties is used for 'Performance Reporting' purposes.
- It controls the reporting for Hadoop.
- The API is abstract so that it can be implemented on top of a variety of metrics client libraries. The choice of client library is a configuration option,
- and different modules within the same application can use different metrics implementation libraries. This file is stored inside /etc/hadoop.
41) What are the network requirements for Hadoop?
- Hadoop core uses SSH to communicate with the slaves and to launch the server processes on the slave nodes.
- It requires a password-less SSH connection between the master, all the slaves, and the secondary machines,
- so that it does not have to ask for authentication every time, since the master and slaves communicate frequently.
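- A typical way to set this up (the user and hostname are hypothetical):
ssh-keygen -t rsa -P ""        # generate a key pair with an empty passphrase on the master
ssh-copy-id hadoop@slave1      # append the public key to the slave's authorized_keys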
42) You have a directory DeZyre that has the following files – HadoopTraining.txt, _SparkTraining.txt, #DataScienceTraining.txt, .SalesforceTraining.txt.
If you pass the DeZyre directory to the Hadoop MapReduce jobs, how many files are likely to be processed?
- Only HadoopTraining.txt and #DataScienceTraining.txt will be processed by the MapReduce job. When Hadoop processes files
(either in a directory or individually) using any FileInputFormat such as TextInputFormat, KeyValueTextInputFormat or SequenceFileInputFormat,
files whose names start with a hidden-file prefix such as "_" or "." are skipped, because FileInputFormat by default uses the hiddenFileFilter
class to ignore all files with these prefixes in their names.
43) What is the Fair Scheduler?
- Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time.
- When there is a single job running, that job uses the entire cluster.
- When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time.
- Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs.
- It is also an easy way to share a cluster between multiple users.
- The fair scheduler organizes jobs into pools, and divides resources fairly between these pools.
- By default, there is a separate pool for each user, so that each user gets an equal share of the cluster.
- It is also possible to set a job's pool based on the user's Unix group or any jobconf property.
- Fair Scheduler allows assigning guaranteed minimum shares to pools, which is useful for ensuring that certain users,
groups or production applications always get sufficient resources.
- You will also need to set the following property in the Hadoop config file HADOOP_CONF_DIR/mapred-site.xml to have Hadoop use the fair scheduler:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
- You can check that the Fair Scheduler is running by going to http://<jobtracker URL>/scheduler on the JobTracker's web UI.
- The Fair Scheduler is configured in two places:
- HADOOP_CONF_DIR/mapred-site.xml, for scheduler-wide options,
- HADOOP_CONF_DIR/fair-scheduler.xml, which is used to configure pools (see the sketch below).
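- A minimal fair-scheduler.xml sketch (the pool and user names are hypothetical), following the classic MR1 allocation-file format:
<?xml version="1.0"?>
<allocations>
<pool name="analytics">
<minMaps>5</minMaps>
<minReduces>5</minReduces>
<maxRunningJobs>25</maxRunningJobs>
<weight>2.0</weight>
</pool>
<user name="edureka">
<maxRunningJobs>6</maxRunningJobs>
</user>
<userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>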
44) What is the Capacity Scheduler?
- The CapacityScheduler is designed to run Hadoop Map-Reduce as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput
and the utilization of the cluster while running Map-Reduce applications.
- Traditionally, each organization has its own private set of compute resources that has sufficient capacity to meet the organization's
SLA under peak or near-peak conditions.
- This generally leads to poor average utilization and the overhead of managing multiple independent clusters, one per each organization.
- Sharing clusters between organizations is a cost-effective manner of running large Hadoop installations since this allows them to reap benefits of
economies of scale without creating private clusters.
- The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee.
- The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations
who collectively fund the cluster based on computing needs.
- The primary abstraction provided by the CapacityScheduler is the concept of queues.
- The CapacityScheduler is available as a JAR file in the Hadoop tarball under the contrib/capacity-scheduler directory.
- The name of the JAR file would be on the lines of hadoop-capacity-scheduler-*.jar.
- To run the CapacityScheduler in your Hadoop installation, you need to put it on the CLASSPATH. The easiest way is to copy hadoop-capacity-scheduler-*.jar
from the contrib directory to HADOOP_HOME/lib. Alternatively, you can modify HADOOP_CLASSPATH in conf/hadoop-env.sh to include this jar.
- Then set the following property in HADOOP_CONF_DIR/mapred-site.xml to have Hadoop use the CapacityScheduler:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
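- A minimal sketch of the related queue configuration (the queue names and capacities are hypothetical): declare the queues in mapred-site.xml and give each queue a capacity in conf/capacity-scheduler.xml, with the capacities summing to 100:
<property>
<name>mapred.queue.names</name>
<value>default,research</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.default.capacity</name>
<value>70</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.research.capacity</name>
<value>30</value>
</property>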