Map Reduce Interview Questions

***************************************************Map Reduce********************************************************************
1) Explain the usage of Context Object.
- The Context object helps the mapper (or reducer) interact with the rest of the Hadoop system.
- It can be used for updating counters, reporting progress, and providing application-level status updates.
- The Context object carries the configuration details for the job, and also the interfaces that help the task generate its output (see the sketch below).
- It was introduced in the Hadoop new API.
- In the old API, the OutputCollector and the Reporter objects were used to communicate with the MapReduce system.
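A minimal sketch of how a map() method might use the Context object; the counter group/name, status text, and configuration key below are made-up for illustration:

// inside map() of a Mapper<LongWritable, Text, Text, IntWritable> (new API)
context.getCounter("Records", "Malformed").increment(1);           // update a user-defined counter
context.progress();                                                 // report progress to the framework
context.setStatus("finished parsing header records");               // application-level status update
String delim = context.getConfiguration().get("my.delimiter", ","); // read the job configuration
context.write(new Text("someKey"), new IntWritable(1));             // emit output through the context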

2) What are the core methods of a Reducer?
1) run:
- Receives a Context containing the job's configuration as well as interfacing methods that return data from the reducer back to the framework.
- This method is responsible for calling setup() once, reduce() once for each key assigned to the reduce task, and cleanup() once at the end.
2) setup:
- This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
- protected void setup(Context context)
3) reduce:
- The heart of the reducer; it is called once per key with all the values associated with that key (a skeleton is sketched below).
- protected void reduce(KEY key, Iterable<VALUE> values, Context context)
4) cleanup:
- This method is called only once, at the end of the reduce task, for clearing temporary files and releasing resources.
- protected void cleanup(Context context)
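A minimal new-API Reducer skeleton showing all four lifecycle methods; the summing logic is only an illustrative assumption:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // one-time initialization, e.g. reading parameters from context.getConfiguration()
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {   // called once per key with all values for that key
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // one-time teardown at the end of the reduce task
    }

    // run() is inherited from Reducer; it calls setup(), then reduce() per key, then cleanup()
}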

3) What are the core methods of a Mapper?
1) run:
- Receives a Context containing the job's configuration as well as interfacing methods that return data from the mapper back to the framework.
- This method is responsible for calling setup() once, map() once for each key/value pair in the input split, and cleanup() once at the end.
2) setup:
- This method of the mapper is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
- protected void setup(Context context)
3) map:
- The heart of the Mapper; this is where the actual key/value processing happens, and the intermediate data is written to the context, ready to be sent to the reducers (a skeleton is sketched below).
- protected void map(KEY key, VALUE value, Context context)
4) cleanup:
- This method is called only once, at the end of the map task, for clearing temporary files and releasing resources.
- protected void cleanup(Context context)
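A matching new-API Mapper skeleton; the word-count style tokenizing is only an illustrative assumption:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // one-time initialization, e.g. loading a lookup file from the distributed cache
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // called once per input record; emit intermediate (key, value) pairs via the context
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    @Override
    protected void cleanup(Context context) {
        // one-time teardown at the end of the map task
    }
}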

4) Explain about the partitioning, shuffle and sort phase?
Shuffle Phase:
- As the first map tasks complete, the nodes continue to run the remaining map tasks and, in parallel, start exchanging the intermediate map outputs with the reducers as required.
- This process of moving the intermediate outputs of the map tasks to the reducers is referred to as shuffling.
Sort Phase:
- Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.
Partitioning Phase:
- The process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning.
- The destination partition is the same for a given key irrespective of the mapper instance that generated it.
- Number of partitions = number of reducers.

5) What is Hadoop streaming API?
- The Hadoop distribution provides a generic application programming interface for writing the Map and Reduce functions in any desired programming language, such as Python, Perl, Ruby, etc.
- This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the Mapper or Reducer (an example invocation is shown below).
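A typical streaming invocation looks like the following; the streaming jar path and the Python script names are placeholders for your environment:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/data/input \
    -output /user/data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py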

6) What are the most commonly defined input formats in Hadoop?
- Text Input Format
- Key Value Input Format
- Sequence File Input Format
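The input format is chosen on the Job in the driver; for example, assuming the new-API classes under org.apache.hadoop.mapreduce.lib.input (in a real driver you would pick exactly one of these):

// TextInputFormat (default): key = byte offset of the line, value = the line itself
job.setInputFormatClass(TextInputFormat.class);
// KeyValueTextInputFormat: each line is split into key and value by a separator (tab by default)
job.setInputFormatClass(KeyValueTextInputFormat.class);
// SequenceFileInputFormat: reads Hadoop's binary sequence files
job.setInputFormatClass(SequenceFileInputFormat.class);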

7) How to write a custom partitioner for a Hadoop MapReduce job?
- A new class must be created that extends the pre-defined Partitioner class.
- The getPartition() method of the Partitioner class must be overridden.
- The custom partitioner can be supplied to the job as a config file in the wrapper that runs Hadoop MapReduce,
- or it can be set on the job programmatically using the set method (see the sketch below):
- job.setPartitionerClass(CustomPartitioner.class);
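A sketch of such a custom partitioner; the routing rule (keys starting with "US" go to reducer 0) is just an illustrative assumption:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                                   // nothing to partition
        }
        if (key.toString().startsWith("US")) {
            return 0;                                   // pin this key range to reducer 0
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;  // hash the rest
    }
}

In the driver, set it together with enough reducers, e.g. job.setPartitionerClass(CustomPartitioner.class); and job.setNumReduceTasks(3);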

8)  What is “speculative execution” in Hadoop?
- If a node appears to be executing a task more slowly than expected, the master node can redundantly execute another instance of the same task on another node.
- The task that finishes first is accepted and the other instance is killed. This process is called “speculative execution”.

9) What is the difference between an “HDFS Block” and an “Input Split”?
- The “HDFS Block” is the physical division of the data, while the “Input Split” is the logical division of the data.
- HDFS divides the data into blocks for storage,
- Input Split:
- whereas for processing, MapReduce divides the data into input splits and assigns each split to a mapper function.
- The number of blocks may or may not be equal to the number of mappers, but the number of splits always equals the number of mappers.
- Adjusting the split size is a common MapReduce tuning technique.
- By default, split size = block size.
- The split size is configurable to be larger or smaller than the block size.
- Decreasing the number of splits: merge 2 blocks into one split to reduce resource consumption (64 + 64 = 128).
- Increasing the number of splits: divide 1 block into 2 splits and use more resources to complete the job faster.
- Whether a file can be split at all is controlled by the InputFormat's isSplitable() method, whose return type is boolean (a split-size tuning sketch follows below).
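A hedged sketch of tuning the split size from the driver (the sizes are illustrative), using the new-API FileInputFormat helpers and their underlying properties:

// fewer, larger splits (e.g. treat two 64 MB blocks as one 128 MB split):
FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
// or more, smaller splits (e.g. process one 64 MB block as two 32 MB splits):
// FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

// equivalent configuration properties:
//   mapreduce.input.fileinputformat.split.minsize
//   mapreduce.input.fileinputformat.split.maxsize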

10) Name the three modes in which Hadoop can run.
- Standalone (local) mode:
- This is the default mode if we don’t configure anything.
- In this mode, all the components of Hadoop, such as the NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process.
- It uses the local filesystem.
- This mode is mainly used for debugging purposes, and it does not support the use of HDFS.
- Further, in this mode there is no custom configuration required in the mapred-site.xml, core-site.xml, or hdfs-site.xml files.
- Pseudo-distributed mode:
- A single-node Hadoop deployment is considered to be running in pseudo-distributed mode.
- In this mode, all the Hadoop services, including both the master and the slave services, run on a single compute node.
- Fully distributed mode:
- A Hadoop deployment in which the Hadoop master and slave services run on separate nodes is said to be in fully distributed mode.

11) What are the main configuration parameters in a “MapReduce” program?
- Job’s input locations in the distributed file system
- Job’s output location in the distributed file system
- Input format of data
- Output format of data
- Class containing the map function
- Class containing the reduce function
- JAR file containing the mapper, reducer and driver classes
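A driver wiring all of these parameters together might look like this sketch, reusing the TokenMapper and SumReducer skeletons shown earlier (the class and path names are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);                 // JAR with mapper, reducer, driver

        FileInputFormat.addInputPath(job, new Path(args[0]));     // job's input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // job's output location in HDFS

        job.setInputFormatClass(TextInputFormat.class);           // input format of the data
        job.setOutputFormatClass(TextOutputFormat.class);         // output format of the data

        job.setMapperClass(TokenMapper.class);                    // class containing the map function
        job.setReducerClass(SumReducer.class);                    // class containing the reduce function

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}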

12) State the reason why we can’t perform “aggregation” (addition) in mapper? Why do we need the “reducer” for this?
- We cannot perform aggregation (e.g. addition) in the mapper because sorting does not occur in the “mapper” function.
- Sorting occurs only on the reducer side, and without sorting, aggregation across keys cannot be done.
- During aggregation we need the output of all the mapper functions, which may not be possible to collect in the map phase, as the mappers may be running on different machines where the data blocks are stored.
- Lastly, if we try to aggregate data at the mapper, it requires communication between all the mapper functions, which may be running on different machines.
- So it would consume high network bandwidth and could cause a network bottleneck.

13) What is the purpose of “RecordReader” in Hadoop?
- The “InputSplit” defines a slice of work, but does not describe how to access it.
- The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task.
- The “RecordReader” instance is defined by the “Input Format”.
- The "RecordReader" is responsible for providing the information regarding record boundaries in an input split.

14) Explain “Distributed Cache” in a “MapReduce Framework”.
- Distributed Cache is a facility provided by the MapReduce framework to cache files needed by applications.
- Once you have cached a file for your job, the Hadoop framework makes it available on every data node where your map/reduce tasks are running.
- You can then access the cached file as a local file in your Mapper or Reducer.
- The map-side join in MapReduce is typically achieved using the distributed cache (a usage sketch follows below).
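A hedged sketch of new-API distributed cache usage; the file path and the "#lookup" alias are placeholders:

// Driver: register the file (the "#lookup" fragment becomes the local symlink name)
job.addCacheFile(new java.net.URI("/user/data/lookup.txt#lookup"));

// Mapper (or Reducer) setup(): read the cached copy as a plain local file
@Override
protected void setup(Context context) throws IOException {
    try (java.io.BufferedReader reader =
             new java.io.BufferedReader(new java.io.FileReader("lookup"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // e.g. load the line into an in-memory map used for a map-side join
        }
    }
}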

15) How do “reducers” communicate with each other?
- The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.


16) What does a “MapReduce Partitioner” do?
- A “MapReduce Partitioner” makes sure that all the values of a single key go to the same “reducer”,
- thus allowing the map output to be distributed over the “reducers”.
- It redirects the “mapper” output to the “reducer” by determining which “reducer” is responsible for a particular key.
- Number of partitions = number of reducers.

17) What is a “Combiner”?
- A “Combiner” is a mini “reducer” that performs the local “reduce” task.
- It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”.
- “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
- A combiner is optional; it may or may not be configured for a job.
- A combiner runs on the output of each mapper (roughly one combiner per mapper, although the framework may invoke it zero or more times).
- job.setCombinerClass(MyReducer.class); // the reducer class is often reused as the combiner when the operation is associative and commutative

18) What do you know about “SequenceFileInputFormat”?
- “SequenceFileInputFormat” is an input format for reading sequence files.
- It is a specific compressed binary file format which is optimized for passing the data between the outputs of one “MapReduce” job to the
 input of some other “MapReduce” job.
- Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from
 one MapReduce job to another.

19) What do you mean by data locality?
- Data locality means moving the computation to the data rather than moving the data to the computation.
- The MapReduce framework achieves data locality by processing data locally,
- i.e. the processing of the data happens on the very node (via its NodeManager) where the data blocks are present.

20) Is it mandatory to set input and output type/format in MapReduce?
- No, It is not mandatory to set input and output type/format in MapReduce.
- By default, the input and output types are taken as ‘text’ (TextInputFormat and TextOutputFormat).

21) Can we rename the output file?
- Yes, we can control the names of the output files by using the MultipleOutputs class (a sketch follows below).
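A rough sketch using the MultipleOutputs helper (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs); the named output "summary" is illustrative:

// Driver:
MultipleOutputs.addNamedOutput(job, "summary", TextOutputFormat.class, Text.class, IntWritable.class);

// Inside the reducer:
private MultipleOutputs<Text, IntWritable> mos;

protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
}

protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
        sum += v.get();
    }
    mos.write("summary", key, new IntWritable(sum));   // output files named summary-r-00000, ...
}

protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();                                       // flush and close the extra outputs
}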

22) Why is the output of map tasks stored (spilled) on the local disk and not in HDFS?
- The map output is intermediate and temporary; storing it in HDFS and replicating it would create unnecessary overhead.

23) What is an identity Mapper(or)Empty Mapper and Identity Reducer(or)Empty Reducer?
Identity Mapper:
- Identity mapper is the default mapper provided by the Hadoop framework.
- It runs when no mapper class has been defined in the MapReduce program, and it simply passes the input key-value pairs on to the reducer phase.

Identity Reducer:
- Identity Reducer is the default reducer class provided by Hadoop,
- which is automatically executed if no reducer class has been defined.
- It performs no computation or processing; it simply writes the input key-value pairs from the mapper into the specified output directory.
- Note that sorting still happens with the Identity Reducer, whereas it does not happen when the number of reducers is set to zero.

24) What is a map side join?
- Map side join is a process where two data sets are joined by the mapper.
- Map-side join helps in minimizing the cost that is incurred for sorting and merging in the shuffle and reduce stages.
- Map-side join also helps in improving the performance of the task by decreasing the time to finish the task.
- Typically, the smaller dataset is made available locally on every node (e.g. via the distributed cache), while the other dataset is read from HDFS as the map input.

25) What is reduce side join in MapReduce?
- As the name suggests, in the reduce side join, the reducer is responsible for performing the join operation.
- It is comparatively simple and easier to implement than the map side join as the sorting and shuffling phase sends
the values having identical keys to the same reducer and therefore, by default, the data is organized for us.
- Both files need to be present in HDFS.

26) Is it legal to set the number of reducer task to zero? Where the output will be stored in this case?
- Yes, it is legal to set the number of reduce tasks to zero if there is no need for a reducer.
- In this case, the output of the map tasks is stored directly in HDFS, at the location specified by setOutputPath(Path).

27) How do you stop a running job gracefully?
- One can gracefully stop a MapReduce job by using the command: hadoop job -kill JOBID

28) How does inputsplit in MapReduce determines the record boundaries correctly?
- RecordReader is responsible for providing the information regarding record boundaries in an input split.

29) What is a spill factor with respect to the RAM?
- The map output is stored in an in-memory buffer; when this buffer is almost full, the spill phase starts, moving data to a temporary folder on the local disk.
- The map output is first written to this buffer, whose size is decided by the mapreduce.task.io.sort.mb property. By default, it is 100 MB.
- When the buffer reaches a certain threshold, it starts spilling the buffered data to disk. This threshold is specified by mapreduce.map.sort.spill.percent (0.80, i.e. 80%, by default).

30) Why do we need MapReduce?
- MapReduce lets us process very large datasets in parallel across a cluster of commodity machines.
- The framework takes care of splitting the input, scheduling and monitoring the tasks, re-executing failed tasks, and moving data between the map and reduce phases, so the developer only has to write the map and reduce logic.

31) What is the default reducer count and how do you change it?
- By default, the reducer count is 1.
- To change it, use the following code:
job.setNumReduceTasks(2);

32) What is the default number of task attempts if a task fails?
- Default task attempts -> 4
- At runtime: -Dmapreduce.map.maxattempts=<value> or -Dmapreduce.reduce.maxattempts=<value>
- OR set mapreduce.map.maxattempts=<value> or mapreduce.reduce.maxattempts=<value> in the Java code
- OR, most commonly, change mapreduce.map.maxattempts and mapreduce.reduce.maxattempts in mapred-site.xml

33) What is the default number of map task slots (threads) per TaskTracker?
- By default, a TaskTracker can run 2 map tasks concurrently (mapred.tasktracker.map.tasks.maximum = 2).

34) How to increase the maximum heap size and buffer size in MapReduce?
- To increase the heap size, use: mapred.child.java.opts = -Xmx400m
- To increase the sort buffer size, use: mapreduce.task.io.sort.mb = 200

35) What is the buffer in the Mapper?
- To smooth out the spill threshold of the Mapper, each map task has an in-memory output buffer.
- By default, the Mapper has an I/O (sort) buffer of 100 MB (out of the map task JVM size of 200 MB in older defaults) to store the intermediate data.

36) Explain Hash Partitioning Algorithm:
- It is the default partitioning algorithm in MapReduce.
- It decides that all the values associated with a single key go to the same reducer.
- The hash partitioning formula needs 2 inputs: the hash of the key and the number of reducers.
- Hash partition formula:
  partition number = (hash of the key) mod (number of reducers)  -> gives the index of the target reducer
- You cannot easily predict which key goes to which reducer (see the sketch below).
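This is essentially what Hadoop's default HashPartitioner does; its getPartition() boils down to:

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the modulo result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}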

37) Difference between the old and new API:

Old API -> No naming difference between mapper output and reducer output files (part-nnnnn); Hadoop 0.2x.
- The old API defined Mapper and Reducer as interfaces.
- The old API can still be found in org.apache.hadoop.mapred.
- JobConf, the OutputCollector, and the Reporter object were used to communicate with the MapReduce system.
- Mappers could be controlled by writing a MapRunnable, but no equivalent exists for reducers.
- Job control was done through JobClient (which does not exist in the new API).
- The JobConf object, an extension of the Configuration class, was used for job configuration.
- In the old API, both map and reduce outputs are named part-nnnnn.
- In the old API, the reduce() method passes values as a java.lang.Iterator.

New API -> Mapper and reducer output files are named differently (part-m-00001 and part-r-00001); Hadoop 1.x.
- The new API defines Mapper and Reducer as classes, so a method (with a default implementation) can be added to an abstract class without breaking old implementations of the class.
- The new API is in the org.apache.hadoop.mapreduce package.
- It uses the “context” object to communicate with the MapReduce system.
- The new API allows both mappers and reducers to control the execution flow by overriding the run() method.
- Job control is done through the Job class in the new API.
- Job configuration is done through the Configuration class, via some of the helper methods on Job.
- In the new API, map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
- In the new API, the reduce() method passes values as a java.lang.Iterable.

38) How to achieve parallelism in MapReduce?
- conf.set("mapred.map.tasks", "10")
- The above setting hints the framework to run up to 10 map tasks in parallel; the actual number of map tasks is ultimately determined by the number of input splits.
Note:
Sqoop runs on the MapReduce framework but it is a map-only job.
Sqoop is the best example of a job with 0 reducers.
