Last modified: February 23, 2018
HBase is a non-relational database that's perfect for performing fast lookups on big tables. While HBase is a cornerstone of the Hadoop ecosystem, there are many misconceptions as to what exactly HBase is and the role it plays in managing big data with HDFS. In this article, we discuss the three most important things to know about HBase before using it with your Hadoop cluster.
HBase is a distributed, column-oriented database built on top of HDFS. It's open source and horizontally scalable, meaning capacity can be increased by adding more hardware. Using HBase, you can both read and write data to HDFS. Through its column-oriented architecture, HBase allows for superior random access to information, meaning it's easier to find a data needle in a haystack.
HBase leverages the power of HDFS to provide fault tolerance, linear scalability, consistent reads/writes, and data replication. Following are the three most important things to understand about HBase before using it with your Hadoop cluster:
Similar to MongoDB or Cassandra, HBase is considered a "NoSQL" data store. Unlike a traditional RDBMS, which is characterized by normalized data and transactions, HBase is schemaless. The data stored in HBase is not normalized, meaning there is no logical connection or relationship connecting different tables of data. For example, you won't find a primary key that links to another row of data in some other table like you would with a traditional SQL database.
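To make the contrast concrete, here is a minimal Python sketch of the same information stored normalized across two related tables (RDBMS style) versus denormalized in a single HBase-style row. The table and field names here are invented for illustration, not anything HBase defines:

```python
# Normalized (RDBMS style): two tables linked by a foreign key.
customers = {"c42": {"name": "Ada"}}
orders = {"o1": {"customer_id": "c42", "item": "keyboard"}}

# Reading an order's customer name requires a join-like lookup
# across tables via the foreign key.
order = orders["o1"]
name_via_join = customers[order["customer_id"]]["name"]

# Denormalized (HBase style): one row carries everything it needs,
# so a single row-key lookup answers the question -- no join.
hbase_orders = {
    "o1": {"customer_name": "Ada", "item": "keyboard"},
}
name_direct = hbase_orders["o1"]["customer_name"]
```

The denormalized row duplicates data, but that duplication is exactly what lets HBase avoid the distributed joins that bottleneck relational systems at scale.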
While RDBMSs have been used successfully for years, they don't scale well. Specifically, the distributed joins and transactional nature of relational tables create a bottleneck that does not play well with "big data." This is why denormalized data stores like HBase work best with HDFS.
There is a misconception that HBase is independent of HDFS. HBase can be thought of as a layer that sits on top of HDFS. All of the reads/writes performed on HBase ultimately go through the underlying HDFS. When data producers write to HBase, the information gets stored in HDFS. When data consumers read from HBase, the information is pulled from HDFS.
By working directly on top of HDFS, HBase can both leverage things that make HDFS so great (data replication, fault tolerance) and also address its pitfalls (batch processing limitations, etc.).
HBase is able to achieve random access through a column-oriented architecture. Unlike a traditional RDBMS table, where a collection of different columns makes up a single row, a column-oriented table uses a row key to access different column families. These column families contain the actual columns, holding different versions of the data. This results in a four-dimensional data model where accessing a single value requires knowing the row key, column family, column, and version.
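The four-dimensional model described above can be sketched as nested maps. This is a hedged, in-memory illustration only, not the HBase client API; the row key, family names (`personal`, `contact`), columns, and timestamps are all made up:

```python
# Sketch of HBase's four-dimensional data model:
# row key -> column family -> column -> version (timestamp) -> value.
table = {
    "row-001": {                        # 1. row key
        "personal": {                   # 2. column family
            "name": {                   # 3. column
                1518000000: "Ada",      # 4. version -> value
                1519000000: "Ada Lovelace",
            },
        },
        "contact": {
            "email": {
                1519000000: "ada@example.com",
            },
        },
    },
}

def get_cell(table, row_key, family, column, version=None):
    """Fetch one cell; default to the latest version, as HBase does."""
    versions = table[row_key][family][column]
    if version is None:
        version = max(versions)         # newest timestamp wins
    return versions[version]

latest_name = get_cell(table, "row-001", "personal", "name")
```

Note that asking for a cell without a version returns the newest one, while older versions remain addressable by timestamp.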
This may be confusing, but it's what makes HBase so powerful. By organizing tables like this, HBase ultimately creates a key/value store where rows are the keys and values are the column families. This makes accessing individual data values a lot faster since HBase doesn't have to sequentially process everything in the data set (batch processing) to fetch results. While the direct access to data is largely dictated by the row-key design, this architecture is what allows you to access single rows within billions of records using HBase tables.
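One way to picture why a row-key lookup beats sequential batch processing: HBase stores rows sorted by row key, so a read can jump straight to the target row rather than touching every record. The sketch below uses Python's `bisect` binary search as a rough stand-in for HBase's sorted storage and indexing; the `user-NNNNNN` row keys are hypothetical:

```python
import bisect

# Rows kept sorted by row key, as HBase stores them.
row_keys = [f"user-{i:06d}" for i in range(100_000)]

def point_lookup(keys, target):
    """Binary search over sorted keys: O(log n), loosely like an HBase Get."""
    i = bisect.bisect_left(keys, target)
    return keys[i] if i < len(keys) and keys[i] == target else None

def full_scan(keys, target):
    """Read every row in order: O(n), like a batch scan over the data set."""
    for k in keys:
        if k == target:
            return k
    return None

hit = point_lookup(row_keys, "user-099999")
```

Both functions find the same row, but the point lookup inspects a handful of keys where the scan inspects all of them, which is the difference row-key design makes at the scale of billions of records.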
This column-oriented structure combined with sophisticated caching and indexing is what makes HBase so fast.
It's important to remember that HBase sits on top of HDFS. It does not compete with HDFS. HBase is a non-relational data store that uses a column-oriented table architecture to provide random access to data stored in HDFS. Using HBase, you can perform fast and reliable reads and writes not inherently supported by HDFS batch processing.