Pig Interview Questions

1) What are different modes of execution in Apache Pig?
- Apache Pig runs in two modes: Local Mode and MapReduce Mode (Hadoop Mode).
- Local Mode requires access to only a single machine; all files are installed and executed on the local host.
- MapReduce Mode requires access to a Hadoop cluster.
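- For example, the execution mode can be selected with the -x flag when launching Pig (a minimal sketch; Tez support exists in newer Pig releases):

    pig -x local        # Local Mode: runs against the local file system
    pig -x mapreduce    # MapReduce Mode: runs against a Hadoop cluster (the default)
    pig -x tez          # use Tez as the execution engine (Pig 0.14+)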

2) Explain about co-group in Pig.
- COGROUP operator in Pig is used to work with multiple tuples.
- COGROUP operator is applied on statements that contain or involve two or more relations.
- The COGROUP operator can be applied on up to 127 relations at a time.
- When using the COGROUP operator on two tables at once, Pig first groups both tables and then joins them on the grouped columns.
- COGROUP is more like a combination of GROUP and JOIN.
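- A minimal sketch of COGROUP on two relations (the file names and schemas are hypothetical):

    A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);
    B = LOAD 'scores.txt' USING PigStorage(',') AS (name:chararray, score:int);
    C = COGROUP A BY name, B BY name;  -- one tuple per key, holding a bag of A tuples and a bag of B tuples
    DUMP C;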

3) What are the different relational operations in “Pig Latin” you worked with?
COGROUP: Joins two or more relations and then performs a GROUP operation on the joined result.
CROSS: Computes the cross product (Cartesian product) of two or more relations.
DISTINCT: Removes duplicate tuples in a relation.
FILTER: Selects a set of tuples from a relation based on a condition.
FOREACH: Iterates over the tuples of a relation, generating a data transformation.
GROUP: Groups the data in one or more relations.
JOIN: Joins two or more relations (inner or outer join).
LIMIT: Limits the number of output tuples.
LOAD: Loads data from the file system.
ORDER: Sorts a relation based on one or more fields.
SPLIT: Partitions a relation into two or more relations.
STORE: Stores data in the file system.
UNION: Merges the contents of two relations. To perform a UNION on two relations, their columns and domains must be identical.
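A minimal sketch chaining several of these operators (the file name and schema are hypothetical):

    A = LOAD 'employees.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, salary:int);
    B = FILTER A BY salary > 50000;
    C = GROUP B BY dept;
    D = FOREACH C GENERATE group AS dept, COUNT(B) AS headcount;
    E = ORDER D BY headcount DESC;
    STORE E INTO 'output' USING PigStorage(',');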

4) What is a UDF?
- If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) to bring that functionality, using other languages like Java, Python, Ruby, etc., and embed them in the script file.
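- A minimal sketch of registering and invoking a Java UDF (the jar name and class are hypothetical):

    REGISTER 'myudfs.jar';                     -- jar containing the compiled UDF
    DEFINE ToUpper com.example.pig.ToUpper();  -- hypothetical Java eval function
    A = LOAD 'names.txt' AS (name:chararray);
    B = FOREACH A GENERATE ToUpper(name);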

5) Explain the need for MapReduce while programming in Apache Pig.
- Apache Pig programs are written in a query language known as Pig Latin, which is similar to SQL.
- To execute a query, there is a need for an execution engine.
- The Pig engine converts the queries into MapReduce jobs; thus MapReduce acts as the execution engine and is needed to run the programs.
- Tez can also be used as the execution engine for Pig to improve performance.

6) Explain about the BloomMapFile.
- BloomMapFile is a class that extends the MapFile class.
- It is used in the HBase table format to provide a quick membership test for keys, using dynamic Bloom filters.

7) What do you mean by a bag in Pig?
- A collection of tuples is referred to as a bag in Apache Pig.
- A bag is one of the data models present in Pig.
- It is an unordered collection of tuples with possible duplicates.
- Bags are used to store collections of tuples while grouping.
- A bag can grow up to the size of the local disk, which means its size is limited.
- When a bag is too large to fit in memory, Pig spills part of it to the local disk and keeps only a portion of the bag in memory.
- There is no necessity that the complete bag fit into memory. We represent bags with "{}".
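- A minimal sketch showing how GROUP produces bags (the file name and schema are hypothetical):

    A = LOAD 'employees.txt' AS (name:chararray, dept:chararray);
    B = GROUP A BY dept;
    DESCRIBE B;  -- B: {group: chararray, A: {(name: chararray, dept: chararray)}}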

8) What is the usage of foreach operation in Pig scripts?
- The FOREACH operation in Apache Pig is used to apply a transformation to each element in a data bag, so that the respective action is performed to generate new data items.
- Syntax: FOREACH data_bagname GENERATE exp1, exp2;
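- A minimal sketch (the file name and schema are hypothetical):

    A = LOAD 'employees.txt' USING PigStorage(',') AS (name:chararray, salary:double);
    B = FOREACH A GENERATE name, salary * 1.10 AS raised_salary;  -- transform each tuple
    DUMP B;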

9) Explain about the different complex data types in Pig.
Maps- Key-value pairs, where the key and value are joined together using #.
Tuples- Similar to a row in a table, where different items are separated by a comma. Tuples can have multiple attributes.
Bags- Unordered collection of tuples. A bag allows duplicate tuples.
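Illustrative literals for each complex type:

    ['name'#'pig', 'lang'#'piglatin']  -- map: keys and values joined by #
    (john, 25)                         -- tuple: a comma-separated row of fields
    {(john,25), (mary,30), (john,25)}  -- bag: unordered tuples, duplicates allowed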

10) What does Flatten do in Pig?
- Sometimes data is stored in a tuple or a bag, and if we want to remove the level of nesting from that data, the FLATTEN operator in Pig can be used.
- FLATTEN un-nests bags and tuples.
- For tuples, the FLATTEN operator substitutes the fields of a tuple in place of the tuple itself,
- whereas un-nesting bags is a little more complex because it requires creating new tuples.
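- A minimal sketch (the schema is hypothetical):

    A = LOAD 'data.txt' AS (grp:chararray, t:tuple(x:int, y:int));
    B = FOREACH A GENERATE grp, FLATTEN(t);  -- (grp, (x, y)) becomes (grp, x, y)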

11) How do users interact with the shell in Apache Pig?
- Using Grunt, i.e. Apache Pig’s interactive shell, users can interact with HDFS or the local file system.
- To exit the Grunt shell, press CTRL+D or just type quit.
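- A minimal sketch of a Grunt session (the paths are hypothetical):

    grunt> fs -ls /user/data;   -- run HDFS shell commands from within Grunt
    grunt> A = LOAD '/user/data/input.txt' AS (line:chararray);
    grunt> DUMP A;
    grunt> quit;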

12) What are the debugging tools used for Apache Pig scripts?
- DESCRIBE and EXPLAIN are the important debugging utilities in Apache Pig.
- The EXPLAIN utility is helpful to Hadoop developers when trying to debug errors or optimize Pig Latin scripts.
- EXPLAIN can be applied to a particular alias in the script, or it can be applied to the entire script in the Grunt interactive shell.
- The EXPLAIN utility produces several graphs in text format, which can be printed to a file.

- The DESCRIBE debugging utility is helpful to developers when writing Pig scripts, as it shows the schema of a relation in the script.
- Beginners who are trying to learn Apache Pig can use the DESCRIBE utility to understand how each operator alters the data.
- A Pig script can have multiple DESCRIBEs.
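- A minimal sketch of both utilities (the file name and schema are hypothetical):

    A = LOAD 'people.txt' AS (name:chararray, age:int);
    B = FILTER A BY age > 25;
    DESCRIBE B;  -- prints the schema: B: {name: chararray, age: int}
    EXPLAIN B;   -- prints the logical, physical, and MapReduce plans for B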

13) What is illustrate used for in Apache Pig?
- Executing Pig scripts on large data sets usually takes a long time.
- To tackle this, developers run Pig scripts on sample data, but there is a possibility that the sample data selected might not exercise the script properly.
- ILLUSTRATE takes a sample of the data, and whenever it comes across operators like JOIN or FILTER that remove data, it ensures that some records pass through and some do not, by modifying records so that they meet the condition.
- ILLUSTRATE just shows the output of each stage but does not run any MapReduce task.
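- A minimal sketch (the file name and schema are hypothetical):

    A = LOAD 'people.txt' AS (name:chararray, age:int);
    B = FILTER A BY age > 25;
    ILLUSTRATE B;  -- shows how sample records flow through each operator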

14) Explain about the execution plans of a Pig Script OR Differentiate between the logical and physical plan of an Apache Pig script
- Logical and physical plans are created during the execution of a Pig script.
- Pig scripts are checked by the interpreter before execution.
- The logical plan is produced after semantic checking and basic parsing; no data processing takes place during the creation of a logical plan.
- For each line in the Pig script, a syntax check is performed for the operators, and a logical plan is created.
- Whenever an error is encountered within the script, an exception is thrown and program execution ends;
- otherwise, each statement in the script has its own logical plan.
- A logical plan contains a collection of operators in the script but does not contain the edges between the operators.
- After the logical plan is generated, script execution moves to the physical plan,
- which describes the physical operators Apache Pig will use to execute the script.
- A physical plan is more or less like a series of MapReduce jobs, but the plan does not have any reference to how it will be executed in MapReduce.
- During the creation of the physical plan, the COGROUP logical operator is converted into 3 physical operators, namely Local Rearrange, Global Rearrange, and Package.
- Load and store functions usually get resolved in the physical plan.

15) What do you know about the case sensitivity of Apache Pig?
- It is difficult to say whether Apache Pig is case sensitive or case insensitive.
- For instance, user defined functions, relations, and field names in Pig are case sensitive.
- On the other hand, keywords in Apache Pig are case insensitive.
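- A minimal sketch of the distinction:

    A = LOAD 'data.txt' AS (f1:chararray);  -- LOAD and AS are keywords
    B = foreach A generate f1;              -- keywords are case insensitive, so this also works
    -- but 'A' and 'a' would be two different aliases, and 'f1' and 'F1' two different fields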

16) What are some of the Apache Pig use cases you can think of?
- Apache Pig is used in particular for iterative processing,
- research on raw data, and traditional ETL data pipelines.
- As Pig can operate in circumstances where the schema is not known, inconsistent, or incomplete,
- it is widely used by researchers who want to make use of the data before it is cleaned and loaded into the data warehouse.
- It can also be used to build behaviour prediction models, for instance by a website tracking the responses of visitors to various types of ads, images, articles, etc.

17) Is PigLatin a strongly typed language? If yes, then how did you come to the conclusion?
- In a strongly typed language, the user has to declare the type of all variables upfront.
- In Apache Pig, when you describe the schema of the data, it expects the data to come in the same format you mentioned.
- However, when the schema is not known, the script adapts to the actual data types at runtime.
- So, it can be said that Pig Latin is strongly typed in most cases, but in rare cases it is "gently" typed.

18) What do you understand by an inner bag and outer bag in Pig?
- A relation inside a bag is referred to as an inner bag, while an outer bag is just a relation in Pig.
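- Illustrative output of a GROUP, where the relation itself is the outer bag and each nested {...} is an inner bag (the data is hypothetical):

    (dept1, {(alice,dept1), (bob,dept1)})
    (dept2, {(carol,dept2)})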

19) Differentiate between GROUP and COGROUP operators.
- Both the GROUP and COGROUP operators are identical and can work with one or more relations.
- The GROUP operator is generally used to group the data in a single relation, for better readability,
- whereas COGROUP is used to group the data in two or more relations.
- COGROUP is more like a combination of GROUP and JOIN.

20) Explain the difference between COUNT_STAR and COUNT functions in Apache Pig?
- The COUNT function does not include NULL values when counting the number of elements in a bag,
- whereas the COUNT_STAR() function includes NULL values while counting.
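- A minimal sketch (the file name and schema are hypothetical):

    A = LOAD 'data.txt' AS (f1:int);
    B = GROUP A ALL;
    C = FOREACH B GENERATE COUNT(A), COUNT_STAR(A);  -- COUNT skips tuples whose field is NULL; COUNT_STAR counts every tuple
    DUMP C;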

21) What are the various diagnostic operators available in Apache Pig?
Dump Operator- It is used to display the output of Pig Latin statements on the screen, so that developers can debug the code.
Describe Operator - Return the schema of a relation.
Explain Operator - Display the logical, physical, and MapReduce execution plans.
Illustrate Operator - Gives the step-by-step execution of a sequence of statements.

22) How will you merge the contents of two or more relations and divide a single relation into two or more relations?
- This can be accomplished using the UNION and SPLIT operators.
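- A minimal sketch (the file names and schemas are hypothetical):

    A = LOAD 'a.txt' AS (name:chararray, age:int);
    B = LOAD 'b.txt' AS (name:chararray, age:int);
    C = UNION A, B;                              -- merge two relations
    SPLIT C INTO D IF age < 30, E IF age >= 30;  -- divide one relation into two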

23) I have a relation R. How can I get the top 10 tuples from the relation R?
- The TOP() function returns the top N tuples from a bag of tuples or a relation.
- N is passed as a parameter to TOP(), along with the column whose values are to be compared and the relation R.
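- A minimal sketch (the file name and schema are hypothetical):

    R = LOAD 'scores.txt' AS (name:chararray, score:int);
    G = GROUP R ALL;
    T = FOREACH G GENERATE FLATTEN(TOP(10, 1, R));  -- top 10 tuples by column index 1 (score)
    DUMP T;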

24) What are the commonalities between Pig and Hive?
- HiveQL and Pig Latin both convert the commands into MapReduce jobs.
- Neither can be used for low-latency transactional (OLTP) workloads, as it is difficult to execute low latency queries with them.

25) What are the different types of UDFs in Java supported by Apache Pig?
- Algebraic, Eval, and Filter functions are the various types of UDFs supported in Pig.

26) You have a file employee.txt in the HDFS directory with 100 records. You want to see only the first 10 records from the employee.txt file. How will you do this?
- The first step is to load the file employee.txt into a relation, say employee.
- The first 10 records of the employee data can then be obtained using the LIMIT operator, as in the sketch below.
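- A minimal sketch of the full sequence (the HDFS path and schema are assumptions):

    employee = LOAD '/user/data/employee.txt' USING PigStorage(',') AS (id:int, name:chararray);
    result = LIMIT employee 10;
    DUMP result;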

27) Explain about the scalar datatypes in Apache Pig.
- int, long, float, double, chararray, and bytearray are the available scalar data types in Apache Pig.

28) Can you join multiple fields in Apache Pig scripts?
- Yes, it is possible to join multiple fields in Pig scripts, because the JOIN operation takes records from one input and joins them with records from another input.
- This is achieved by specifying the keys for each input; two rows are joined when the keys are equal,
- just like a normal join operation in SQL.
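- A minimal sketch of a multi-field join (the file names and schemas are hypothetical):

    A = LOAD 'a.txt' AS (k1:int, k2:chararray, v:int);
    B = LOAD 'b.txt' AS (k1:int, k2:chararray, w:int);
    C = JOIN A BY (k1, k2), B BY (k1, k2);  -- rows match when both key fields are equal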

29) Does Pig support multi-line commands?
Yes. A Pig Latin statement can span multiple lines and is terminated by a semicolon.

30) Components of Pig?
- It has two main components – the Pig Latin language and the Pig Run-time Environment, in which Pig Latin programs are executed.

31) Notes:
- Apache Pig follows the ETL (Extract, Transform, Load) process. It can handle inconsistent schema (in case of unstructured data).
- Apache Pig automatically optimizes the tasks before execution, i.e. automatic optimization. Apache Pig handles all kinds of data.

32) Differences between MapReduce and Apache Pig.
- Apache Pig is a high-level data flow platform, whereas MapReduce is a low-level data processing paradigm.
- Without writing complex Java implementations in MapReduce, programmers can achieve the same implementations very easily using Pig Latin.
- Apache Pig provides nested data types like tuples, bags, and maps that are missing from MapReduce.
- Pig provides many built-in operators to support data operations like joins, filters, ordering, and sorting,
 whereas performing the same operations in MapReduce is a humongous task.

33) What are the components of Pig Execution Environment?
Pig Scripts: Pig scripts are written in Pig Latin using built-in operators (UDFs can be embedded in them) and are submitted to the Apache Pig execution environment.
Parser: The Parser does the type checking and checks the syntax of the script. The parser outputs a DAG (directed acyclic graph),
which represents the Pig Latin statements and logical operators.
Optimizer: The Optimizer performs the optimization activities like split, merge, transform, reorder operators, etc.
  The optimizer provides the automatic optimization feature to Apache Pig. The optimizer basically aims to reduce the amount of data in the pipeline.
Compiler: The Apache Pig compiler converts the optimized code into MapReduce jobs automatically.
Execution Engine: Finally, the MapReduce jobs are submitted to the execution engine. Then, the MapReduce jobs are executed and the required result is produced.

34) What are the different ways of executing Pig script?
Grunt Shell: This is Pig’s interactive shell provided to execute all Pig Scripts.
Script File: Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig Server.
Embedded Script: If some functionality is unavailable in built-in operators, we can programmatically create User Defined Functions (UDFs) to bring that
functionality using other languages like Java, Python, Ruby, etc., embed them in the Pig Latin script file, and then execute that script file.
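A minimal sketch of invoking the first two approaches from the command line (the script name is hypothetical):

    pig                        # start the Grunt interactive shell
    pig myscript.pig           # run a Pig Latin script file
    pig -x local myscript.pig  # the same, in local mode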

35) How Apache Pig deals with the schema and schema-less data?
- Apache Pig handles both schema as well as schema-less data.
- If the schema only includes the field name, the data type of the field is considered to be bytearray.
- If you assign a name to a field, you can access the field by both the field name and positional notation,
 whereas if the field name is missing, we can only access it by positional notation, i.e. $ followed by the index number.
- If you perform any operation which is a combination of relations (like JOIN, COGROUP, etc.)
 and any of the relations is missing its schema, the resulting relation will have a null schema.
- If the schema is null, Pig will treat the fields as bytearray, and the real data type of a field will be determined dynamically.
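- A minimal sketch of named versus positional access (the file name and schema are hypothetical):

    A = LOAD 'data.txt' AS (name:chararray, age:int);
    B = FOREACH A GENERATE name, $1;  -- with a schema, both the field name and $1 work
    C = LOAD 'data.txt';
    D = FOREACH C GENERATE $0, $1;    -- without a schema, only positional notation is available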

36) Is the keyword ‘DEFINE’ like a function name?
- Yes, the keyword ‘DEFINE’ is like a function name.
- The DEFINE statement is used to assign a name (alias) to a UDF or to a streaming command.
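- A minimal sketch of both uses (the jar, class, and shell script names are hypothetical):

    REGISTER 'myudfs.jar';
    DEFINE ToUpper com.example.pig.ToUpper();    -- alias for a UDF
    DEFINE cleanup `clean.sh` SHIP('clean.sh');  -- alias for a streaming command
    A = LOAD 'data.txt' AS (line:chararray);
    B = FOREACH A GENERATE ToUpper(line);
    C = STREAM A THROUGH cleanup;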

37) What is a MapFile?
- MapFile is a class that serves as a file-based map from keys to values.
- A map is a directory containing two files: the data file, containing all keys and values in the map, and a smaller index file,
 containing a fraction of the keys.
- The fraction is determined by MapFile.Writer.getIndexInterval().
- The index file is read entirely into memory, so key implementations should try to keep themselves small. Map files are created by adding entries in order.

38) What is Pig Statistics?
- Pig Statistics is a framework for collecting and storing script-level statistics for Pig Latin.
- Characteristics of Pig Latin scripts and the resulting MapReduce jobs are collected while the script is executed.
- These statistics are then available for Pig users and tools using Pig (such as Oozie) to retrieve after the job is completed.
- The stats classes are in the package org.apache.pig.tools.pigstats:
1) PigStats
2) JobStats
3) OutputStats
4) InputStats.

39) What are the limitations of the Pig?
- As the Pig platform is designed for ETL-type use cases, it is not a good choice for real-time scenarios.
- Apache Pig is not a good choice for pinpointing a single record in huge data sets.
- Apache Pig is built on top of MapReduce, which is batch-processing oriented.
