
Oozie Interview Questions

1) Types of Oozie Jobs
- Coordinator Job (periodical): These are recurrent jobs which run at a particular time or can be configured to run when their input data becomes available.
- Coordinator jobs can manage multiple workflow jobs, where the output of one workflow can be the input for another workflow.
- This chained behavior is known as a "data application pipeline".
- Oozie Workflow: A workflow is a Directed Acyclic Graph (DAG) consisting of a collection of actions.
- Control nodes decide the chronological order, set rules, make execution-path decisions, and fork and join the flow.
- Action nodes trigger the actual execution of tasks.
- Oozie Bundle: A bundle is a collection of coordinator jobs which can be started, suspended and stopped together.
- The jobs in a bundle are usually dependent on each other.

2) Oozie Architecture
- The Oozie architecture has a web server and a database for storing all the jobs.
- The default web server is Apache Tomcat (see the submission sketch below).
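
For illustration only, here is a minimal sketch of submitting a workflow job through the Oozie Java client API. The server URL, HDFS paths and the nameNode/jobTracker property names are placeholders for whatever the workflow.xml actually references, not values from the post.

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            // Point the client at the Oozie server (placeholder host/port).
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            // Build the job configuration; APP_PATH is the HDFS directory
            // containing workflow.xml (placeholder path).
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hadoop/app");
            // Parameters a typical workflow.xml might reference (hypothetical names).
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "resourcemanager:8032");

            // Submit and start the workflow, then read back its status.
            String jobId = client.run(conf);
            WorkflowJob.Status status = client.getJobInfo(jobId).getStatus();
            System.out.println(jobId + " -> " + status);
        }
    }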

Sqoop Interview Questions

1) Compare Sqoop and Flume
- Sqoop is used for importing data from structured data sources like an RDBMS; Flume is used for moving bulk streaming data into HDFS.
- Sqoop has a connector-based architecture; Flume has an agent-based architecture.
- Data import in Sqoop is not event driven; data load in Flume is event driven.
- In Sqoop, HDFS is the destination for imported data; in Flume, data flows into HDFS through one or more channels.

2) What is the default file format to import data using Apache Sqoop?
1) Delimited Text File Format
- This is the default file format for importing data with Sqoop.
- It can be specified explicitly using the --as-textfile argument to the Sqoop import command (see the sketch below).
- Passing this argument produces a string-based representation of all records in the output files, with delimiter characters between rows and columns.
2) Sequence File Format
- This is a binary file format in which records are stored in custom record-specific data types that are shown as Java classes.
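
As a rough sketch (not from the original post), the snippet below launches a Sqoop import from Java with the default text format made explicit via --as-textfile. The JDBC URL, credentials, table name and target directory are placeholders.

    import java.util.Arrays;
    import java.util.List;

    public class SqoopTextImport {
        public static void main(String[] args) throws Exception {
            // Placeholder JDBC URL, table and credentials; adjust for a real cluster.
            List<String> cmd = Arrays.asList(
                    "sqoop", "import",
                    "--connect", "jdbc:mysql://dbhost:3306/sales",
                    "--username", "sqoop_user",
                    "--password", "secret",
                    "--table", "orders",
                    "--as-textfile",                       // default format, stated explicitly
                    "--target-dir", "/user/hadoop/orders"  // HDFS output directory
            );

            // Run the sqoop CLI and forward its output to this process.
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            System.exit(p.waitFor());
        }
    }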

Yarn Interview Questions

1) What is YARN?
- YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop; it manages cluster resources and provides an execution environment to the processes.
- YARN is a powerful and efficient feature rolled out as part of Hadoop 2.0, and it is a large-scale distributed system for running big data applications.
- ResourceManager: It receives the processing requests and passes the parts of each request to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs (a small client sketch that queries the ResourceManager follows below). It consists of 2 components:
  - ApplicationManager: It accepts job submissions, negotiates the container for the ApplicationMaster and handles failures while executing MapReduce jobs.
  - Scheduler: It allocates the resources required by the various MapReduce applications running on the Hadoop cluster.
- NodeManager: It is installed on every DataNode and is responsible for the execution of tasks on that node.
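
For illustration only, here is a minimal sketch that uses the YARN client API to ask the ResourceManager for the applications it knows about. It assumes a yarn-site.xml is available on the classpath rather than any specific cluster.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            // Picks up the ResourceManager address from yarn-site.xml on the classpath.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for all applications and print their state.
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + "  "
                        + app.getName() + "  "
                        + app.getYarnApplicationState());
            }

            yarnClient.stop();
        }
    }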

Pig Interview Questions

1) What are the different modes of execution in Apache Pig?
- Apache Pig runs in 2 modes: the "Pig (Local Mode) Command Mode" and the "Hadoop MapReduce (Java) Command Mode" (see the sketch below).
- Local Mode requires access to only a single machine, where all files are installed and executed on the local host.
- MapReduce Mode requires access to the Hadoop cluster.

2) Explain about co-group in Pig.
- The COGROUP operator in Pig is used to work with multiple tuples.
- It is applied to statements that contain or involve two or more relations.
- The COGROUP operator can be applied to up to 127 relations at a time.
- When using COGROUP on two tables at once, Pig first groups both tables and then joins the two tables on the grouped columns.
- COGROUP is therefore more like a combination of GROUP and JOIN.

3) What are the different relational operations in "Pig Latin" you worked with?
- COGROUP: Joins two or more tables and then performs a GROUP operation on the joined result.
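
As an illustrative sketch, the snippet below runs a Pig Latin query through the PigServer Java API in local mode; switching the ExecType to MAPREDUCE would target the Hadoop cluster instead. The input file name and schema are placeholders.

    import java.util.Iterator;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class PigModesExample {
        public static void main(String[] args) throws Exception {
            // LOCAL runs against the local file system on a single machine;
            // ExecType.MAPREDUCE would submit the same script to the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Placeholder input file and schema.
            pig.registerQuery("orders = LOAD 'orders.csv' USING PigStorage(',') "
                    + "AS (id:int, customer:chararray, amount:double);");
            pig.registerQuery("big = FILTER orders BY amount > 100.0;");

            // Iterate over the result tuples of the 'big' alias.
            Iterator<Tuple> it = pig.openIterator("big");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
            pig.shutdown();
        }
    }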

HBase Interview Questions

1) When should you use HBase and what are the key components of HBase?
- HBase should be used when the big data application has: 1) a variable schema, 2) data stored in the form of collections, or 3) a need for key-based access to data while retrieving.
- Key components of HBase are:
  - HRegion: This component contains the in-memory data store and the HFile; the default region size is 1024 MB.
  - HRegionServer: This monitors the regions.
  - HBase Master: It is responsible for monitoring the region servers.
  - ZooKeeper: It takes care of the coordination between the HBase Master component and the client.
  - Catalog Tables: The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.

2) What are the different operational commands in HBase at record level and table level?
- Record-level operational commands in HBase are put, get, increment, scan and delete (see the Java client sketch below).
- Table-level operational commands in HBase are describe, list, drop, disable and scan.
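
As a hedged illustration of the record-level commands, here is a minimal sketch using the HBase Java client; the table name, column family and row keys are placeholders, and the connection settings are assumed to come from hbase-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRecordOps {
        public static void main(String[] args) throws Exception {
            // Reads the ZooKeeper quorum etc. from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) { // placeholder table

                // put: insert or update a cell
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);

                // get: key-based read of a single row
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

                // increment: atomically bump a counter cell
                table.incrementColumnValue(Bytes.toBytes("row1"),
                        Bytes.toBytes("info"), Bytes.toBytes("visits"), 1L);

                // delete: remove the row
                table.delete(new Delete(Bytes.toBytes("row1")));
            }
        }
    }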

Hive Interview Questions

1) Explain about the SMB Join in Hive.
- SMB stands for Sort Merge Bucket join in Hive.
- In an SMB join, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge sort join is performed.
- It is mainly used because there is no limit on file, partition or table size for the join.
- SMB join is best used when the tables are large.
- In an SMB join the columns are bucketed and sorted using the join columns.
- All tables should have the same number of buckets in an SMB join.

2) How can you connect an application, if you run Hive as a server?
- ODBC Driver: This supports the ODBC protocol.
- JDBC Driver: This supports the JDBC protocol (see the JDBC sketch below).
- Thrift Client: This client can be used to make calls to all Hive commands using different programming languages like PHP, Python, Java, C++ and Ruby.

3) What does the overwrite keyword denote in Hive load statement?
- The overwrite keyword in a Hive load statement deletes the contents of the target table and replaces them with the data from the files referred to by the file path.
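
To illustrate the JDBC option, here is a minimal sketch that connects to HiveServer2 through the Hive JDBC driver; the host, port, database, credentials and query are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver class; connection details are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hive-host:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Run a simple query and print the rows.
                try (ResultSet rs = stmt.executeQuery("SELECT id, amount FROM orders LIMIT 10")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1) + "\t" + rs.getDouble(2));
                    }
                }
            }
        }
    }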