Hive Interview Questions

1) Explain about the SMB Join in Hive.
- SMB stands for Sort Merge Bucket join in Hive.
- In an SMB join, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed.
- It is mainly used because it places no limit on file, partition, or table size for the join.
- An SMB join works best when the tables are large.
- In an SMB join, the tables are bucketed and sorted on the join columns.
- All tables must have the same number of buckets for an SMB join (see the sketch below).
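- A minimal sketch of enabling an SMB join (table and column names are illustrative):

SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

-- both tables bucketed and sorted on the join key, with the same bucket count
CREATE TABLE orders (order_id INT, amount DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 8 BUCKETS;
CREATE TABLE refunds (order_id INT, refund DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 8 BUCKETS;

SELECT o.order_id, o.amount, r.refund
FROM orders o JOIN refunds r ON o.order_id = r.order_id;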

2) How can you connect an application, if you run Hive as a server?
- ODBC Driver - supports the ODBC protocol.
- JDBC Driver - supports the JDBC protocol.
- Thrift Client - can be used to make calls to all Hive commands from different programming languages such as PHP, Python, Java, C++, and Ruby.

3) What does the overwrite keyword denote in Hive load statement?
- The OVERWRITE keyword in a Hive LOAD statement deletes the contents of the target table (or partition) and replaces them with the data in the files referred to by the file path (see the example below).
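- For example (the path and table name are illustrative):

LOAD DATA INPATH '/tmp/new_txns' OVERWRITE INTO TABLE transactions;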

4) What is SerDe in Hive? How can you write your own custom SerDe?
- SerDe stands for Serializer/Deserializer.
- Hive uses a SerDe to read data from and write data to tables.
- To write a custom SerDe, one implements Hive's SerDe interface (or extends an existing SerDe) and references the class in the table definition, as sketched below. Generally, users prefer to write only a Deserializer, as they want to read their own data format rather than write to it.
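- A custom (or non-default) SerDe is attached to a table via the ROW FORMAT SERDE clause; a sketch using the built-in RegexSerDe (the table and regex are illustrative):

CREATE TABLE apache_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(\\S+) (.+)")
STORED AS TEXTFILE;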

5) Define the difference between Hive and HBase?
- Apache Hive queries are executed as MapReduce jobs internally, whereas HBase operations run in real time on its database rather than through MapReduce.

HBase | Hive
HBase does not allow execution of SQL queries. | Hive allows execution of most SQL-like queries.
HBase runs on top of HDFS. | Hive runs on top of Hadoop MapReduce.
HBase is a NoSQL database. | Hive is a data warehouse framework.
Supports record-level insert, update, and delete operations through versioning. | Does not support record-level insert, update, and delete.
HBase is ideal for real-time querying of big data. | Hive is ideal for analytical querying of data collected over a period of time.

6) Where does the data of a Hive table gets stored?
- By default, Hive table data is stored in the HDFS directory /user/hive/warehouse.
- One can change it by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.

7) What is a metastore in Hive?
- The metastore in Hive stores the metadata using an RDBMS and an open-source ORM (Object Relational Mapping) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.
- In simple terms, the Hive metastore is a central repository that stores metadata in an external database.
- By default, Hive comes with an embedded Derby metastore, which does not support concurrent multi-user access.
- It is therefore optionally configurable in hive-site.xml to use a standalone database such as PostgreSQL or MySQL, as sketched below.
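- A sketch of the hive-site.xml properties for a MySQL-backed metastore (host, database name, and credentials are illustrative):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>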

8) Why Hive does not store metadata information in HDFS?
- Hive stores metadata information in the metastore using RDBMS instead of HDFS.
- The reason for choosing an RDBMS is to achieve low latency, as HDFS read/write operations are time-consuming.

9) What is the difference between local and remote metastore?
- Local Metastore:
- In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and
 connects to a database running in a separate JVM, either on the same machine or on a remote machine.
- Remote Metastore:
- In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM.
- Other processes communicate with the metastore server using Thrift Network APIs.
- You can have one or more metastore servers in this case to provide higher availability.

10) What is the difference between external table and managed table?
- In case of a managed table, if one drops the table, the metadata along with the table data is deleted from the Hive warehouse directory.
- On the contrary, in case of an external table, Hive deletes only the metadata of the table and leaves the table data in HDFS untouched.
- When performing join operations with HBase-integrated tables, external tables provide better performance than managed tables (see the sketch below).
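- The difference shows up on DROP; a minimal sketch (table names and paths are illustrative):

CREATE TABLE managed_logs (id INT, msg STRING);
CREATE EXTERNAL TABLE ext_logs (id INT, msg STRING) LOCATION '/data/ext_logs';

DROP TABLE managed_logs;   -- removes both the metadata and the data files
DROP TABLE ext_logs;       -- removes the metadata only; /data/ext_logs stays in HDFS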

11) Is it possible to change the default location of a managed table?
- Yes, it is possible to change the default location of a managed table.
- It can be achieved by using the LOCATION '<hdfs_path>' clause, as shown below.
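- For example (the path is illustrative):

CREATE TABLE students (id INT, name STRING)
LOCATION '/custom/warehouse/students';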

12) When should we use SORT BY instead of ORDER BY ?
- We should use SORT BY instead of ORDER BY when we have to sort huge datasets,
- because the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer.
- Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute (see the comparison below).
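- A comparison sketch (table and column names are illustrative):

SELECT * FROM transactions SORT BY amount;   -- each reducer sorts its own share of the data
SELECT * FROM transactions ORDER BY amount;  -- a single reducer produces a total order; slow on large inputs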

13) What is a partition in Hive?
- Hive organizes tables into partitions to group similar data together based on a column or partition key.
- Each table can have one or more partition keys to identify a particular partition.
- Physically, a partition is nothing but a sub-directory under the table directory.
- The partition column must not be present in the table schema; it is declared with the PARTITIONED BY clause, as shown below.
- Two types of partitioning are available: static partitioning and dynamic partitioning (see question 32).
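- A minimal sketch (names are illustrative):

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);
-- each country value becomes a sub-directory, e.g. .../sales/country=US/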

14) Why do we perform partitioning in Hive?
- Partitioning provides granularity in a Hive table and therefore,
- reduces the query latency by scanning only relevant partitioned data instead of the whole data set.

15) What is dynamic partitioning and when is it used?
- In dynamic partitioning, the values of the partition columns become known only at runtime, i.e. while the data is being loaded.
- It is used when loading data from an existing non-partitioned table, to improve sampling and thereby decrease query latency.
- It is also used when one does not know all the partition values beforehand, since finding these values manually in a huge dataset is a tedious task.
- Note that dynamic partitioning reads the complete input to determine each record's partition, which can be slow for huge datasets; in such cases static partitioning is more effective. A dynamic-load sketch follows below.
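- A minimal sketch of a dynamic-partition load (table names are illustrative):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE sales PARTITION (country)
SELECT id, amount, country FROM sales_raw;  -- the partition column comes last in the SELECT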

16) How can you add a new partition for the month December in the above partitioned table?
- Use the ALTER TABLE ... ADD PARTITION statement.
- For adding a new partition to the partitioned_transaction table, issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';

17) What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it?
- By default, the maximum number of dynamic partitions that can be created per mapper or reducer node is 100. It can be changed with:
SET hive.exec.max.dynamic.partitions.pernode = <value>;
- On the other hand, the total number of dynamic partitions that one statement can create (1000 by default) is set with:
SET hive.exec.max.dynamic.partitions = <value>;

18) I am inserting data into a table based on partitions dynamically. But, I received an error – FAILED ERROR IN SEMANTIC ANALYSIS:
Dynamic partition strict mode requires at least one static partition column. How will you remove this error?

- To remove this error, one has to execute the following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Things to remember:
- By default, the hive.exec.dynamic.partition configuration property is set to false in Hive versions prior to 0.9.0.
- hive.exec.dynamic.partition.mode is set to strict by default; only in nonstrict mode does Hive allow all partition columns to be dynamic.

19) Why do we need buckets?
- There are two main reasons for bucketing a partition:
- Scenario 1: A map-side join requires the data belonging to a unique join key to be present in the same partition. Bucketing on the join key handles the cases where the partition key differs from the join key.
- Scenario 2: Bucketing makes the sampling process more efficient and therefore allows us to decrease the query time (see the sketch below).
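- A minimal sketch of a bucketed table and bucket sampling (names are illustrative):

CREATE TABLE users_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

SELECT * FROM users_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON id);  -- reads only 1 of the 32 buckets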

20) How Hive distributes the rows into buckets?
- Hive determines the bucket number for a row using the formula:
hash_function(bucketing_column) modulo (num_of_buckets)
- hash_function depends on the column data type. For an integer column, it is simply the column value:
hash_function(int_type_column) = value of int_type_column
- For example, with 4 buckets, a row whose integer key is 11 lands in bucket 11 mod 4 = 3.

21) What will happen in case you have not issued the command:  ‘SET hive.enforce.bucketing=true;’ before bucketing a table in Hive in Apache Hive 0.x or 1.x?
- The command SET hive.enforce.bucketing=true; makes Hive use the correct number of reducers when the CLUSTER BY clause is used to bucket a column.
- If it is not issued, the number of files generated in the table directory may not equal the number of buckets.
- As an alternative, one may set the number of reducers equal to the number of buckets with SET mapred.reduce.tasks = num_buckets;.

22) What is indexing and why do we need it?
- Hive indexing is one of the Hive query optimization methods.
- A Hive index speeds up access to a column or set of columns, because with an index the database system does not need to read all rows of the table to find the selected data (see the sketch below).
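- A sketch of creating and building an index (names are illustrative; note that indexes were removed in Hive 3.0):

CREATE INDEX idx_salary ON TABLE employee (salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX idx_salary ON employee REBUILD;  -- builds/refreshes the index data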

23) Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following entries:
id first_name last_name email gender ip_address
1 Hugh Jackman hughjackman@cam.ac.uk Male 136.90.241.52
2 David Lawrence dlawrence1@gmail.com Male 101.177.15.130
3 Andy Hall andyhall2@yahoo.com Female 114.123.153.64
4 Samuel Jackson samjackson231@sun.com Male 89.60.227.31
5 Emily Rose rose.emily4@surveymonkey.com Female 119.92.21.19
How will you consume this CSV file into the Hive warehouse using a built-in SerDe?
- SerDe stands for serializer/deserializer. A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive.
-  SerDes are implemented using Java. Hive comes with several built-in SerDes and many other third-party SerDes are also available.

CREATE EXTERNAL TABLE sample (id INT, first_name STRING, last_name STRING, email STRING, gender STRING, ip_address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/temp';
- Now, we can run any query against the table 'sample':
SELECT first_name FROM sample WHERE gender = 'Male';

24) Suppose, I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive table corresponding to these files.
The data in these files are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for lots of small files without degrading the performance of the system?

- One can use the SequenceFile format, which groups these small files together into a single sequence file.
- The steps to be followed are:
- Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- Load the data into temp_table:
LOAD DATA INPATH '/input' INTO TABLE temp_table;
- Create a table that will store data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
- Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;

25) Diference Between Pig and Hive:

Criteria | Pig | Hive
Type of data | Apache Pig is usually used for semi-structured data. | Used for structured data.
Schema | Schema is optional. | Hive requires a well-defined schema.
Language | A procedural data-flow language. | Follows the SQL dialect; a declarative language.
Purpose | Mainly used for programming. | Mainly used for reporting.
General usage | Usually used on the client side of the Hadoop cluster. | Usually used on the server side of the Hadoop cluster.
Coding style | Verbose. | More like SQL.

26) How can I delete the above index named index_bonuspay?
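- The index can be dropped with the DROP INDEX statement (employee being the table the index was created on):

DROP INDEX IF EXISTS index_bonuspay ON employee;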


27) Can you list few commonly used Hive services?
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hiveserver)
- rcfilecat (a tool for printing the contents of an RC file)
- Jar
- Metastore

28) Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state.
    Can this be achieved using Apache Hive?
- Hive 0.13.0 and later support the SHOW TRANSACTIONS command, which helps administrators monitor open and aborted Hive transactions along with their transaction IDs and states.

29) What is the use of Hcatalog?
- HCatalog can be used to share data structures with external systems.
- HCatalog provides access to the Hive metastore so that users of other Hadoop tools can read and write data to Hive's data warehouse.

30) Write a query to rename a table Student to Student_New.
- ALTER TABLE Student RENAME TO Student_New;

31) Explain the difference between partitioning and bucketing.
- Partitioning and bucketing of tables are both done to improve query performance.
- Partitioning helps execute queries faster only if the partitioning scheme matches common range filtering, e.g. by timestamp ranges or by location.
- Partitioning helps eliminate data when used in a WHERE clause.
- Bucketing is not applied by default; it has to be enforced (hive.enforce.bucketing) and organizes the data inside a partition into multiple files, so that the same set of key values is always written to the same bucket.
- Bucketing helps when joining on the bucketed columns.
- With partitioning, a partition is created for every unique value of the partition column, so a huge number of tiny partitions may end up being created.
- With bucketing, one can limit the data to a specific number of buckets into which it is decomposed.
- Basically, a bucket is a file in Hive, whereas a partition is a directory.

32) Explain about the different types of partitioning in Hive?
- Partitions are created when data is inserted into the table.
- The choice of partitioning type depends on how data is loaded into the table, the requirements on the data, and the format in which the data is produced at the source.
- Static partitioning:
- The name of the partition is hardcoded in the load or insert statement.
- When loading data from huge files, static partitions are preferred over dynamic partitions, as they save time in loading data.
- The partition is added to the table and then the file is moved into that static partition.
- The partition column value can be obtained from the file name without having to read the complete file.
- Dynamic partitioning:
- Hive automatically determines the partition based on the value of the partition field.
- The complete data in the file is read and partitioned, through a MapReduce job, into the tables based on a particular field in the file.
- Dynamic partitions are usually helpful during ETL flows in the data pipeline (a static-load sketch follows below).
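- A minimal sketch of a static-partition load (names are illustrative; a dynamic-partition load is sketched in question 15):

LOAD DATA INPATH '/staging/us.csv' INTO TABLE sales PARTITION (country='US');
-- the partition value is hardcoded, and the file is simply moved into .../sales/country=US/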

33) When executing Hive queries in different directories, why is metastore_db created in all places from where Hive is launched?
- When running Hive in embedded mode, it creates a local metastore.
- When you run the query, it first checks whether a metastore already exists or not.
- The property javax.jdo.option.ConnectionURL defined in hive-site.xml has the default value jdbc:derby:;databaseName=metastore_db;create=true.
- Since this Derby connection URL is relative, a metastore_db directory is created (or reused) in whatever working directory Hive is launched from, which is why one appears in every launch location.

34) How will you read and write HDFS files in Hive?
- TextInputFormat - used to read data in plain text file format.
- HiveIgnoreKeyTextOutputFormat - used to write data in plain text file format.
- SequenceFileInputFormat - used to read data in the Hadoop SequenceFile format.
- SequenceFileOutputFormat - used to write data in the Hadoop SequenceFile format.
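- A table can name these classes explicitly; a minimal sketch (table name is illustrative):

CREATE TABLE plain_text_table (id INT, msg STRING)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';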

35) What are the components of a Hive query processor?
- The query processor in Apache Hive converts the SQL into a graph of MapReduce jobs, along with the execution-time framework, so that the jobs can be executed in the order of their dependencies. Its components are:
- Parser
- Semantic Analyser
- Type Checking
- Logical Plan Generation
- Optimizer
- Physical Plan Generation
- Execution Engine
- Operators
- UDFs and UDAFs

36) Differentiate between describe and describe extended.
- DESCRIBE database/schema - displays the name of the database, its root location on the file system, and comments, if any.
- DESCRIBE EXTENDED database/schema - gives the details of the database or schema in a more detailed manner (see the examples below).
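- For example (the database name is illustrative):

DESCRIBE DATABASE sales_db;
DESCRIBE DATABASE EXTENDED sales_db;  -- additionally shows database properties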

37) Is it possible to overwrite Hadoop MapReduce configuration in Hive?
- Yes, Hadoop MapReduce configuration can be overridden globally in the Hive configuration files, or per session from the Hive prompt with SET (see below).
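- For example, a session-level override of the number of reducers (the value is illustrative):

SET mapreduce.job.reduces = 32;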

38) I want to see the present working directory in UNIX from hive. Is it possible to run this command from hive?
- Hive allows execution of UNIX commands with the use of the exclamation mark (!). Just put the ! symbol before the command to be executed at the hive prompt.
- To see the present working directory in UNIX from Hive, run !pwd; at the hive prompt.

39) What is the use of explode in Hive?
- explode in Hive is used to convert complex data types into a flat table format.
- The explode UDTF emits each element of an array as a separate row (see the sketch below).
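- A minimal sketch using LATERAL VIEW (table and column names are illustrative; items is an ARRAY column):

SELECT o.id, item
FROM orders o
LATERAL VIEW explode(o.items) item_table AS item;  -- one output row per array element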

40) Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
SORT BY - data is ordered at each of the N reducers, where the reducers can have overlapping ranges of data.
ORDER BY - similar to ORDER BY in SQL; a total ordering of the data takes place by passing it all to a single reducer.
DISTRIBUTE BY - used to distribute the rows among the reducers; rows that have the same DISTRIBUTE BY columns go to the same reducer.
CLUSTER BY - a combination of DISTRIBUTE BY and SORT BY, where each of the N reducers gets a non-overlapping range of data which is then sorted by those ranges at the respective reducers.
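- A sketch of the equivalence (table and column names are illustrative):

SELECT * FROM txns DISTRIBUTE BY country SORT BY country;
-- is equivalent to
SELECT * FROM txns CLUSTER BY country;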

41) How can you prevent a large job from running for a long time?
- This can be achieved by setting MapReduce jobs to execute in strict mode: SET hive.mapred.mode = strict;
- Strict mode ensures that queries on partitioned tables cannot execute without a WHERE clause that filters on the partition column.

42) Are multiline comments supported in Hive?
- No.

43) What is the ObjectInspector functionality?
- ObjectInspector is used to analyse the structure of individual columns and the internal structure of the row objects.
- ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.

44) Explain the different types of joins in Hive.
- JOIN- Similar to Join in SQL
- FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
- LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
- RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.
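- For example, a LEFT OUTER JOIN sketch (table and column names are illustrative):

SELECT c.name, o.order_id
FROM customers c
LEFT OUTER JOIN orders o ON c.id = o.customer_id;  -- keeps customers even with no matching orders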

45) How can you configure remote metastore mode in Hive?
- The hive-site.xml file has to be configured with the below property (the host shown is illustrative):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node1:9083</value>
  <description>Host name (or IP address) and port of the metastore server</description>
</property>

46) How does data transfer happen from HDFS to Hive?
- If the data is already present in HDFS, the user need not use LOAD DATA, which would move the files into /user/hive/warehouse/.
- The user just has to define the table using the EXTERNAL keyword, which creates the table definition in the Hive metastore while leaving the data files where they are.

47) In case of embedded Hive, can the same metastore be used by multiple users?
- No, the embedded metastore cannot be used in shared mode, as it allows only one connection at a time. It is suggested to use a standalone database such as PostgreSQL or MySQL instead.

48) The partition of a Hive table has been modified to point to a new directory location.
    Do I have to move the data to the new location, or will the data be moved automatically?
- Changing the partition location will not move the data to the new location.
- The data has to be moved manually from the old location to the new one.

49) What will be the output of CAST('XYZ' AS INT)?
- It will return a NULL value.

50) What are the different components of a Hive architecture?
User Interface - the UI component of the Hive architecture calls the execute interface of the driver.
Driver - creates a session handle for the query and sends the query to the compiler to generate an execution plan for it.
Metastore - sends the metadata to the compiler for the execution of the query on receiving the sendMetaData request.
Compiler - generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
Execution Engine - responsible for submitting each of these stages to the relevant components, managing the dependencies between the various stages in the execution plan generated by the compiler.

51) What happens on executing the below query? After executing the below query, if you modify the column, how will the changes be tracked?
- CREATE INDEX index_bonuspay ON TABLE employee (bonus) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
- The query creates an index named index_bonuspay which points to the bonus column in the employee table.
- Changes to the bonus column are not tracked automatically; the index has to be rebuilt with ALTER INDEX index_bonuspay ON employee REBUILD; for the changes to be reflected.

52) How do you print the current database name in the CLI?
- SET hive.cli.print.current.db = true;

53) How do you display column headers in query results?
- SET hive.cli.print.header = true;

54) Hive complex data types:
1) STRUCT - e.g. struct<a:INT, b:INT, c:STRING>; fields are accessed with dot notation, e.g. col.a
2) MAP - key/value pairs; elements are accessed by key, e.g. M['group']
3) ARRAY - an ordered list of elements, e.g. ['a', 'b', 'c']; elements are accessed by index, e.g. A[0]
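- A minimal sketch (table and column names are illustrative):

CREATE TABLE complex_demo (
  s   STRUCT<a:INT, b:STRING>,
  m   MAP<STRING, INT>,
  arr ARRAY<STRING>
);

SELECT s.a, m['group'], arr[0] FROM complex_demo;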
