HBase Interview Questions

1) When should you use HBase and what are the key components of HBase?
- HBase should be used when the big data application has
1) A variable schema
2) Data stored in the form of collections
3) A requirement for key-based access to data while retrieving.
- Key components of HBase are
HRegion - contains the in-memory data store (MemStore) and the HFiles for a contiguous range of rows; the default maximum region size is 1024 MB (10 GB in recent versions).
HRegion Server - hosts and serves the regions.
HBase Master - responsible for monitoring the region servers.
Zookeeper - takes care of the coordination between the HBase Master component and the clients.
Catalog Tables - the two important catalog tables are ROOT and META.
ROOT tracks where the META table is, and
META stores the locations of all the regions in the system.

2) What are the different operational commands in HBase at record level and table level?
- Record-level operational commands in HBase are put, get, increment, scan and delete.
- Table-level operational commands in HBase are describe, list, drop, disable and enable.

3) What is Row Key?
- Every row in an HBase table has a unique identifier known as the RowKey.
- It is used for grouping cells logically and it ensures that all cells with the same RowKey are co-located on the same server.
- Internally, a RowKey is simply a byte array, as in the sketch below.
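Since a RowKey is just bytes, composite keys are built by concatenating byte arrays. A minimal sketch, assuming the HBase 1.x client API (the user id and reverse timestamp are made-up illustrative values):

byte[] rowKey = Bytes.add(
    Bytes.toBytes("user123"),                                    // entity id (hypothetical)
    Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis())); // reverse timestamp so newest rows sort first
Put put = new Put(rowKey); // every operation addresses the row through this byte array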

4) Explain the difference between RDBMS data model and HBase data model.
- RDBMS is a schema-based database whereas HBase has a schema-less data model (only column families are fixed up front).
- RDBMS has no built-in support for partitioning whereas HBase partitions tables into regions automatically.
- RDBMS stores normalized data whereas HBase stores de-normalized data.

5) What is column families? What happens if you alter the block size of ColumnFamily on an already populated database?
- The logical division of data is represented through a key known as the column family.
- Column families are the basic unit of physical storage, and features such as compression are applied at this level.
- In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas new data that comes in takes the new block size.
- During major compaction the old data is rewritten with the new block size, so the existing data is read correctly.

6) Explain the process of row deletion in HBase.
- On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells; rather, the cells are made invisible by setting a tombstone marker.
- The deleted cells are removed at regular intervals during compaction.

7) What are the different types of tombstone markers in HBase for deletion?
- There are 3 different types of tombstone markers in HBase for deletion, each of which maps to a call on the client's Delete API (see the sketch below):
1) Family Delete Marker - marks all columns of a column family.
2) Version Delete Marker - marks a single version of a column.
3) Column Delete Marker - marks all the versions of a column.
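A minimal sketch of how the three markers map to the Java client's Delete API (HBase 1.x method names assumed; the row, family, qualifier, and timestamp are hypothetical):

Delete d = new Delete(Bytes.toBytes("row1"));
d.addFamily(Bytes.toBytes("cf"));                             // writes a Family Delete Marker
d.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), 42L);  // Version Delete Marker for the version at timestamp 42
d.addColumns(Bytes.toBytes("cf"), Bytes.toBytes("col"));      // Column Delete Marker covering all versions
table.delete(d);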

8) Explain about HLog and WAL in HBase.
- All edits in the HStore are recorded in the HLog.
- Every region server has one HLog.
- The HLog contains entries for the edits of all regions served by that region server.
- HLog is HBase's implementation of the Write Ahead Log (WAL), to which every edit is written immediately.
- With deferred log flush, WAL edits remain in memory until the flush period; the durability of an individual write can be controlled as in the sketch below.
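A minimal sketch of requesting deferred WAL syncing for a single mutation, assuming the HBase 1.x client API (row, family, and qualifier are hypothetical):

Put p = new Put(Bytes.toBytes("row1"));
p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
p.setDurability(Durability.ASYNC_WAL); // write to the WAL asynchronously instead of syncing on every edit
table.put(p);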

9) What are the hallmark features of HBase?
- Schema Flexibility
- Scalability
- High Reliability

10) What is a coprocessor in HBase?
- A coprocessor in HBase is a framework that lets users run their own custom code on the Region Server, as in the sketch below.
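A minimal sketch of a region observer coprocessor, assuming the HBase 1.x coprocessor API (AuditObserver is a hypothetical class name; the relevant classes live in org.apache.hadoop.hbase.coprocessor and related packages):

public class AuditObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // custom code that the Region Server runs before every put on this region
        System.out.println("About to write row " + Bytes.toString(put.getRow()));
    }
}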

11) What do you understand by CAP theorem and which features of CAP theorem does HBase follow?
- CAP stands for Consistency, Availability and Partition Tolerance.
Consistency –At a given point of time, all nodes in a cluster will be able to see the same set of data.
Availability- Every request generates a response, regardless of whether it is a success or a failure.
Partition Tolerance – System continues to work even if there is failure of part of the system or intermittent message loss.
- HBase is a column-oriented database that provides the consistency and partition tolerance properties of CAP.

12) Name few other popular column oriented databases like HBase.
- Cassandra, Hypertable and Accumulo. (CouchDB and MongoDB are often named here as well, but they are document-oriented rather than column-oriented databases.)

13) What do you understand by Filters in HBase?
- HBase filters enhance the effectiveness of working with large data stored in tables by allowing users to add limiting selectors to a query and
 eliminate the data that is not required.
- Filters have access to the complete row to which they are applied. HBase ships with 18 built-in filters, listed below with example shell syntax; a Java usage sketch follows the list.
1) KeyOnlyFilter - takes no arguments. Returns the key portion of each key-value pair. <KeyOnlyFilter ()>
2) FirstKeyOnlyFilter - takes no arguments. Returns the key portion of the first key-value pair.<FirstKeyOnlyFilter ()>
3) PrefixFilter - takes a single argument, a prefix of a row key. It returns only those key-values present in a row that start with the specified
 row prefix.<PrefixFilter (‘Row’)>
4) ColumnPrefixFilter - takes a single argument, a column prefix. It returns only those key-values present in a column that starts with the specified
column prefix.<ColumnPrefixFilter (‘Col’)>
5) MultipleColumnPrefixFilter - takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the
specified column prefixes.<MultipleColumnPrefixFilter (‘Col1’, ‘Col2’)>
6) ColumnCountGetFilter - takes one argument, a limit. It returns the first limit number of columns in the table.<ColumnCountGetFilter (4)>
7) PageFilter - takes one argument, a page size. It returns page size number of rows from the table.<PageFilter (2)>
8) ColumnPaginationFilter - takes two arguments, a limit and offset. It returns limit number of columns after offset number of columns.
   It does this for all the rows.<ColumnPaginationFilter (3, 5)>
9) InclusiveStopFilter - takes one argument, a row key on which to stop scanning. It returns all key-values present in rows up to and including the
 specified row.<InclusiveStopFilter (‘Row2’)>
10) TimeStampsFilter - takes a list of timestamps. It returns those key-values whose timestamp matches any of the specified timestamps.
<TimeStampsFilter (5985489, 48895495, 58489845945)>
11) RowFilter - takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the
comparison returns true, it returns all the key-values in that row.<RowFilter (<=, ‘binary:xyz’)>
12) FamilyFilter - takes a compare operator and a comparator. It compares each family name with the comparator using the compare operator and if the
  comparison returns true, it returns all the key-values in that family.<FamilyFilter (>=, ‘binaryprefix:FamilyB’)>
13) QualifierFilter - takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and
     if the comparison returns true, it returns all the key-values in that column.<QualifierFilter (=, ‘substring:Column1’)>
14) ValueFilter - takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the
 comparison returns true, it returns that key-value.<ValueFilter (!=, ‘binary:Value’)>
15) DependentColumnFilter - takes two required arguments, a family and a qualifier. It tries to locate this column in each row and returns
   all key-values in that row that have the same timestamp.
   DependentColumnFilter (‘conf’, ‘blacklist’, false, >=, ‘zebra’)
           DependentColumnFilter (‘conf’, ‘blacklist’, true)
           DependentColumnFilter (‘conf’, ‘blacklist’)
16) SingleColumnValueFilter - takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found,
     all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true,
     all the columns of the row will be emitted. If the condition fails, the row will not be emitted.
<Example: SingleColumnValueFilter (‘FamilyA’, ‘Column1’, <=, ‘abc’, true, false)
Example: SingleColumnValueFilter ('FamilyA’, ‘Column1’, <=, ‘abc’)>
17) SingleColumnValueExcludeFilter - takes the same arguments and behaves same as SingleColumnValueFilter. However, if the column is found and the
    condition passes, all the columns of the row will be emitted except for the tested column value.
<Example: SingleColumnValueExcludeFilter (‘FamilyA’, ‘Column1’, ‘<=’, ‘abc’, ‘false’, ‘true’)
Example: SingleColumnValueExcludeFilter (‘FamilyA’, ‘Column1’, ‘<=’, ‘abc’)>
18) ColumnRangeFilter - takes either minColumn, maxColumn, or both. Returns only those keys with columns that are between minColumn and maxColumn.
It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.
If you don’t want to set the minColumn or the maxColumn, you can pass in an empty argument.
<ColumnRangeFilter (‘abc’, true, ‘xyz’, false)>
19) CustomFilter - You can create a custom filter by extending the FilterBase class.
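A minimal sketch of applying one of these filters from the Java client (HBase 1.x API assumed; the table handle and the "Row" prefix are hypothetical):

Scan scan = new Scan();
scan.setFilter(new PrefixFilter(Bytes.toBytes("Row"))); // same effect as PrefixFilter ('Row') in the shell
ResultScanner results = table.getScanner(scan);
for (Result r : results) {
    System.out.println(r); // each Result holds the filter-passed cells of one row
}
results.close();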

14) Explain about the data model operations in HBase.
Put Method - To store data in HBase.
Get Method - To retrieve data stored in HBase.
Delete Method - To delete data from HBase tables.
Scan Method - To iterate over the data with larger key ranges or the entire table (see the sketch below).
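A minimal sketch of the four operations from the Java client, assuming a Table handle obtained as in question 36 (row, family, and qualifier names are hypothetical):

Put put = new Put(Bytes.toBytes("row1"));                      // Put: store data
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
table.put(put);

Result result = table.get(new Get(Bytes.toBytes("row1")));     // Get: retrieve one row
byte[] name = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));

ResultScanner scanner = table.getScanner(new Scan());          // Scan: iterate over a key range or the table
for (Result r : scanner) { /* process each row */ }
scanner.close();

table.delete(new Delete(Bytes.toBytes("row1")));               // Delete: tombstone the row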

15) How will you back up an HBase cluster?
- HBase cluster backups are performed in 2 ways-
1) Live Cluster Backup:
- The CopyTable utility is used to copy data from one table to another on the same cluster, or to a table on another cluster.
- The Export utility can also be used to dump the contents of a table onto HDFS on the same cluster.
2) Full Shutdown Backup:
- A periodic complete shutdown of the HBase cluster is performed so that the Master and Region Servers go down and there is no risk of losing in-flight changes to metadata or StoreFiles.
- However, this kind of approach can be used only for back-end analytic capacity and not for applications that serve front-end webpages.

16) Does HBase support SQL like syntax?
- Native SQL support in HBase is not yet available.
- With Apache Phoenix, users can retrieve data from HBase through SQL queries.
- Users can also integrate HBase with Hive for SQL-like support.

17) Is it possible to iterate through the rows of HBase table in reverse order?
- Historically, no. Column values are put on disk with the length of the value written first and then the actual value, so iterating through these values in reverse order would require the bytes of the actual value to be written twice.
- Note, however, that HBase 0.98 and later support reverse scans through Scan.setReversed(true).

18) Should the region server be located on all DataNodes?
- Yes. Region Servers run on the same machines as DataNodes so that reads and writes benefit from data locality.

19) Suppose that your data is stored in collections, for instance some binary data,
    message data or metadata is all keyed on the same value. Will you use HBase for this?
- Yes, it is ideal to use HBase whenever key based access to data is required for storing and retrieving.

20) Assume that an HBase table Student is disabled. How will you access the Student table using the scan command once it is disabled?
- Any HBase table that is disabled cannot be accessed using the scan command; the table has to be enabled first.

21) What do you understand by compaction?
- During periods of heavy incoming writes, it is not possible to achieve optimal performance with one file per store.
- Thus, HBase merges these HFiles to reduce the number of disk seeks for every read.
- This process is referred to as compaction, and it can also be requested manually as in the sketch below.
- Minor Compaction: HBase automatically picks some of the smaller HFiles and rewrites them into fewer, bigger HFiles.
- Major Compaction: HBase merges all the HFiles of a store in a region into a single new HFile, dropping deleted and expired cells in the process.
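A minimal sketch of requesting a major compaction through the admin API (HBase 1.x Admin interface assumed; the table name is hypothetical):

Admin admin = connection.getAdmin();               // connection from ConnectionFactory.createConnection(conf)
admin.majorCompact(TableName.valueOf("employee")); // asynchronously requests a major compaction
admin.close();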

22) Explain about the various table design approaches in HBase.
- Tall-Narrow and Flat-Wide are the two HBase table design approaches that can be used.
- Which approach to use depends on what you want to achieve and how you want to access the data.
- The performance of HBase depends heavily on RowKey design, and hence directly on how data is accessed.
- On a high level, the major difference between the flat-wide and tall-narrow approaches is similar to the difference between get and scan.
- Full scans are costly in HBase because of the ordered RowKey storage policy.
- The tall-narrow approach pairs well with a composite RowKey, so that focused scans can be performed on a logical group of entries (see the sketch below).
- A tall-narrow design has a large number of rows and a small number of columns,
- whereas a flat-wide design has a small number of rows and a large number of columns.
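A minimal sketch of a focused scan over one logical group in a tall-narrow table, assuming hypothetical composite row keys of the form "userId#timestamp":

Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("user123#")); // first possible key of the group
scan.setStopRow(Bytes.toBytes("user123$"));  // '$' sorts just after '#', so this bounds the prefix
ResultScanner rs = table.getScanner(scan);   // touches only the rows of this logical group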

23) Which one would you recommend for HBase table design approach – tall-narrow or flat wide?
- There are several factors to be considered when deciding between flat-wide (millions of columns and limited keys) and
 tall-narrow (millions of keys with limited columns),
- however, a tall-narrow approach is often recommended for the following reasons:
- Under extreme scenarios, a flat-wide approach might end up with a single row per region, resulting in poor performance and scalability.
- Table scans over many small rows are often more efficient than multiple reads of very wide rows.
- Considering that usually only a subset of the row data is required, a tall-narrow layout lets a scan read just the relevant rows instead of pulling in an entire wide row.

24) What is the best practice on deciding the number of column families for HBase table?
- It is best to keep the number of column families per HBase table small (commonly cited as no more than 15, and in practice two or three) because every column family is stored in its own set of files,
- so a large number of column families means reads have to open and merge many files.

25) How will you implement joins in HBase?
- HBase does not support joins directly but by using MapReduce jobs join queries can be implemented to retrieve data from various HBase tables.

26) What is the difference between HBase and HDFS?
- HDFS is the distributed file system in Hadoop for storing large files, but it does not provide a tabular form of storage.
- HDFS exposes a file-level interface, much as a local file system (NTFS or FAT) does, rather than a record-level one.
- Data in HDFS is accessed through MapReduce jobs and is well suited for high-latency batch processing operations.

- HBase is a column oriented database on Hadoop that runs on top of HDFS and stores data in tabular format.
- HBase is like a database management system that communicates with HDFS to write logical tabular data to physical file system.
- HBase provides access to single rows out of billions of records and is well suited for low-latency operations.
- HBase puts data in indexed StoreFiles present on HDFS for high-speed lookups.
27) You want to fetch data from HBase to create a REST API. Which is the best way to read HBase data using a Spark Job or a Java program?
28) Design an HBase table for a many-to-many relationship between two entities, for example employee and department.
29) Explain an example that demonstrates good de-normalization in HBase with consistency.
30) Should your HBase and MapReduce cluster be the same or they should be run on separate clusters?

31) What are the components of a Region Server?
- WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment.
      The WAL stores new data that hasn't yet been persisted or committed to permanent storage.
- Block Cache: The Block Cache resides at the top of the Region Server. It keeps frequently read data in memory.
      Its default size is 40% of the heap.
- MemStore: The write cache. It stores all incoming data before committing it to disk or permanent storage.
      There is one MemStore for each column family in a region.
      The default MemStore flush size is 128 MB.
- HFile: HFiles are stored in HDFS and hold the actual cells on disk.

32) Explain “WAL” in HBase?
- Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment.
- The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.

33) Explain the data model of HBase.
- Set of tables.
- Each table contains column families and rows.
- Row key acts as a Primary key in HBase.
- Any access to HBase tables uses this Primary Key.
- Each column qualifier present in HBase denotes attributes corresponding to the object which resides in the cell.

34) Define standalone mode in HBase?
- It is a default mode of HBase. In standalone mode, HBase does not use HDFS—it uses the local filesystem instead
- and it runs all HBase daemons and a local ZooKeeper in the same JVM process.

35) What is decorating Filters?
- It is often useful to modify, or extend, the behavior of a filter to gain additional control over the returned data.
- These types of filters are known as decorating filters. They include SkipFilter and WhileMatchFilter (see the sketch below).
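A minimal sketch of a decorating filter (HBase 1.x filter API assumed; the "row-5" key is hypothetical): WhileMatchFilter wraps an inner filter and aborts the whole scan the first time the inner filter rejects a row.

Filter inner = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
                             new BinaryComparator(Bytes.toBytes("row-5")));
Scan scan = new Scan();
scan.setFilter(new WhileMatchFilter(inner)); // scan stops entirely once a row key exceeds "row-5"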

36) Which code is used to open a connection in HBase?
Configuration myConf = HBaseConfiguration.create();
HTable table = new HTable(myConf, "users");
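The HTable constructor above is the legacy (pre-1.0) API. A sketch of the Connection-based API introduced in HBase 1.0:

Configuration conf = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(conf); // heavyweight, shared, thread-safe
Table table = connection.getTable(TableName.valueOf("users"));    // lightweight, per-use handle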

37) HBase blocksize is configured on which level?
- The block size is configured per column family and the default value is 64 KB. This value can be changed as per requirements, as in the sketch below.
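A minimal sketch of changing the block size for one column family (HBase 1.x API assumed; the family name and new size are hypothetical):

HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setBlocksize(128 * 1024); // 128 KB instead of the 64 KB default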

38) What is the full form of MSLAB?
- MSLAB stands for MemStore-Local Allocation Buffer.
- Whenever a request thread needs to insert data into a MemStore, instead of allocating the space for that data from the heap at large, it allocates it from a memory arena dedicated to the target region.

39) Define LZO?
- Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that focuses on decompression speed.

40) What is HBase Fsck?
- HBase comes with a tool called hbck, which is implemented by the HBaseFsck class.
- HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase.
- It works in two basic modes – a read-only inconsistency identifying mode and a multi-phase read-write repair mode.

41) What is REST?
- REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources.
- It also provides support for different message formats, offering many choices for a client application to communicate with the server.

42) What is Thrift?
- Apache Thrift is a cross-language RPC framework. It is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.

43) What is Nagios?
- Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status.
- It polls current metrics on a regular basis and compares them with given thresholds.

44) What is the use of ZooKeeper?
- ZooKeeper maintains configuration information and the communication between region servers and clients.
- It also provides distributed synchronization.
- It helps maintain server state inside the cluster by communicating through sessions.
- Every Region Server, along with the HMaster server, sends a continuous heartbeat to ZooKeeper at a regular interval, which lets ZooKeeper determine which servers are alive and available.
- It also provides server-failure notifications so that recovery measures can be executed.

45) What is the use of HColumnDescriptor class?
- HColumnDescriptor stores information about a column family, such as compression settings, the number of versions, etc.
- It is used as input when creating a table or adding a column.

46) Which filter accepts the pagesize as the parameter in hBase?
- PageFilter accepts the page size as its parameter.
- It is an implementation of the Filter interface that limits results to a specific page size.
- It terminates scanning once the number of filter-passed rows is greater than the given page size. Syntax: PageFilter (<page_size>)

47) How will you design or modify schema in HBase programmatically?

Creating table schema:
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config); // execute commands through admin

// Instantiating table descriptor class
HTableDescriptor t1 = new HTableDescriptor(TableName.valueOf("employee"));

// Adding column families to t1
t1.addFamily(new HColumnDescriptor("professional"));
t1.addFamily(new HColumnDescriptor("personal"));

// Create the table through admin
admin.createTable(t1);

For modification:
String table = "myTable";
admin.disableTable(table); // a table must be disabled before its schema can be altered
HColumnDescriptor cf2 = new HColumnDescriptor("personal"); // the existing column family to modify
cf2.setMaxVersions(5); // an illustrative change: keep up to five versions
admin.modifyColumn(table, cf2); // modifying existing ColumnFamily
admin.enableTable(table);

48) How does HBase handle write failures?
- Failures are common in large distributed systems, and HBase is no exception.
- If the server hosting a MemStore that has not yet been flushed crashes, the data that was in memory but not yet persisted would be lost.
- HBase safeguards against this by writing to the WAL before the write completes.
- Every server that is part of the HBase cluster keeps a WAL to record changes as they happen.
- The WAL is a file on the underlying file system. A write isn't considered successful until the new WAL entry is successfully written.
- This guarantee makes HBase as durable as the file system backing it. Most of the time, HBase is backed by the Hadoop Distributed File System (HDFS).
- If HBase goes down, the data that was not yet flushed from the MemStore to an HFile can be recovered by replaying the WAL.

49) While reading data from HBase, from which three places data will be reconciled before returning the value?
- When reading data, the scanner first looks for the row's cells in the Block Cache, where all recently read key-value pairs are stored.
- If the scanner fails to find the required result there, it moves to the MemStore, the write cache, and searches for the most recently written cells, which have not yet been flushed to an HFile.
- Finally, it uses Bloom filters and the Block Cache to load the data from the HFiles.

50) Can you explain data versioning?
- In addition to being a schema-less database, HBase is also versioned.
- Every time you perform an operation on a cell, HBase implicitly stores a new version.
- Creating, modifying and deleting a cell are all treated identically; they all create new versions.
- When a cell exceeds the maximum number of versions, the extra versions are dropped during major compaction.
- Instead of operating on an entire cell, you can operate on a specific version within that cell.
- Values within a cell are versioned, and each version is identified by its timestamp.
- If no version is specified, the current timestamp is used when writing and the latest version is returned when reading (see the sketch below).
- The default number of cell versions is three (HBase 2.0 changed this default to one).
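A minimal sketch of reading versions explicitly (HBase 1.x client API assumed; the row name and timestamp are made-up values):

Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3);               // return up to three versions instead of only the latest
// alternatively, pin the read to one explicit version:
// get.setTimeStamp(1408562416000L);
Result result = table.get(get);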

51) What is a Bloom filter and how does it help in searching rows?
- HBase supports Bloom filters to improve the overall throughput of the cluster.
- An HBase Bloom filter is a space-efficient mechanism to test whether an HFile contains a specific row or row-column cell.
- Without a Bloom filter, the only way to decide if a row key is present in an HFile is to check the HFile's block index,
- which stores the start row key of each block in the HFile.
- Since many rows can fall between two start keys,
- HBase has to load the block and scan the block's keys to figure out whether that row key actually exists. Bloom filters are enabled per column family (see the sketch below).
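A minimal sketch of enabling a Bloom filter on a column family (HBase 1.x API assumed; the family name is hypothetical):

HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setBloomFilterType(BloomType.ROWCOL); // index row+column cells instead of the default row-only Bloom filter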
