Why HBase?
HBase is a distributed, scalable big data store for Hadoop. It runs on top of HDFS and provides a NoSQL database for real-time reads and writes. This article discusses how HBase works and the advantages of using it with Hadoop.
What is HBase?
HBase is a NoSQL, random-access database for Hadoop. It runs on top of HDFS and provides real-time access to distributed data through its key/value store.
How HBase Works
HBase is considered a column-oriented database (more precisely, a column-family-oriented one): values are grouped and stored by column family rather than as whole rows. Its schema defines column families, which are collections of individual columns called column qualifiers. The column qualifiers hold the data values, and each one can keep multiple timestamped versions of a value.
HBase tables are organized by rows of column families. While each row must have the same column families, the column qualifiers within these column families can vary from row to row. This gives HBase a more flexible schema, as new column qualifiers can be added on the fly.
By storing data in rows of column families, HBase achieves a four-dimensional data model: a value is addressed by row key, column family, column qualifier, and version (timestamp). For example, you can retrieve a specific value directly by specifying the table, row key, column family, and column qualifier, without scanning the rest of the table. This makes finding "needles" in data "haystacks" practical with HBase.
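The data model described above can be sketched in a few lines of Python. This is a toy in-memory illustration, not the HBase client API: the class name ToyTable, its methods, and the sample row keys are all hypothetical. It assumes the four coordinates of a value are row key, column family, column qualifier, and timestamp.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (illustration only, not the HBase API)."""

    def __init__(self, column_families, max_versions=3):
        # Column families are fixed by the schema; qualifiers are not.
        self.column_families = set(column_families)
        self.max_versions = max_versions
        # row key -> family -> qualifier -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row][family].setdefault(qualifier, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, row, family, qualifier, timestamp=None):
        """Return the newest value, or the newest value at or before `timestamp`."""
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier, [])
        for ts, value in versions:
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = ToyTable(column_families=["info", "stats"])
table.put("user#42", "info", "email", "old@example.com", timestamp=1)
table.put("user#42", "info", "email", "new@example.com", timestamp=2)
# A different row can use a different qualifier in the same family.
table.put("user#7", "info", "nickname", "sam", timestamp=1)

latest = table.get("user#42", "info", "email")
as_of_1 = table.get("user#42", "info", "email", timestamp=1)
```

Here latest holds the newer email and as_of_1 holds the older one, which shows how one qualifier can carry several versions of a value. Note that "user#7" has no "email" qualifier at all, even though both rows share the "info" family.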
Note that the design of the row keys dictates the level of real-time, direct access you can achieve with HBase. Rows are stored sorted by row key and split by key range into regions that are spread across the cluster, so the distribution of the row keys determines how evenly data and requests are spread.
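To see why row key design matters, the following sketch models a table split into key ranges and compares sequential keys with keys that carry a hashed prefix (often called salting). It is a simplified model, not HBase code: the split points, the region_for and salted helpers, and the bucket count are assumptions made for illustration.

```python
import bisect
import zlib
from collections import Counter

# Hypothetical split points: four key ranges, standing in for four regions.
SPLIT_POINTS = ["2", "4", "6"]


def region_for(row_key):
    """Return the index of the key range that a row key sorts into."""
    return bisect.bisect_right(SPLIT_POINTS, row_key)


def salted(row_key, buckets=8):
    """Prefix the key with a bucket number derived from a checksum of the key."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % buckets
    return f"{bucket}-{row_key}"


# Timestamp-like keys sort next to each other, so they share one range.
sequential_keys = [f"2024-01-01T00:00:{s:02d}" for s in range(60)]

plain = Counter(region_for(k) for k in sequential_keys)
spread = Counter(region_for(salted(k)) for k in sequential_keys)

print("plain :", dict(plain))
print("salted:", dict(spread))
```

In this model all sequential keys land in a single range, while the salted keys are divided among several. The trade-off is that a scan over a time range must then query every bucket, so salting suits write-heavy workloads better than range-scan-heavy ones.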
HBase relies on ZooKeeper to coordinate cluster nodes. Updates to the database are recorded in a write-ahead log (WAL) and cached in a MemStore. Once the MemStore reaches its flush threshold, the buffered changes are written to HFiles, which are stored in HDFS.
HBase speeds up reads by consulting the MemStore before checking HFiles. The WAL also acts as a backup: edits lost from the MemStore in a failure can be replayed from it. When data is flushed from the MemStore to HFiles, HDFS automatically replicates the HFiles to other data nodes.
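The write and read path described above can be sketched as a small Python class. This is a minimal model under stated assumptions, not HBase's implementation: the class ToyRegionStore and its methods are hypothetical, the flush threshold counts entries rather than bytes, and HDFS replication is left out.

```python
class ToyRegionStore:
    """Toy model of the WAL, in-memory store, and flushed-file flow (illustration only)."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumption: counted in entries
        self.wal = []        # append-only log of edits not yet flushed
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable flushed files, newest last

    def put(self, key, value):
        self.wal.append((key, value))   # 1. record the edit in the log first
        self.memstore[key] = value      # 2. then cache it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a sorted, immutable file, then clear it.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        self.wal = []  # flushed edits no longer need to be replayed

    def get(self, key):
        # Check memory first, then flushed files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None

    def crash_and_recover(self):
        # Simulate losing the in-memory buffer and rebuilding it from the log.
        self.memstore = {}
        for key, value in self.wal:
            self.memstore[key] = value


store = ToyRegionStore(flush_threshold=3)
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    store.put(k, v)          # the third put triggers a flush
store.put("a", 10)           # newer value lives only in memory and the log

before = store.get("a")
store.crash_and_recover()
after = store.get("a")
```

Both before and after are 10 in this model: the newer value of "a" survives the simulated crash because it can be replayed from the log, while "b" and "c" are served from the flushed file.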
Advantages of HBase
HBase provides a dual approach to data access. Its row-key-based lookups and table scans provide consistent, real-time reads and writes, and it also integrates with Hadoop MapReduce for batch jobs. This makes it well suited to both real-time querying and batch analytics. HBase also automatically manages sharding and failover.
Conclusion
HBase provides a NoSQL layer on top of HDFS. It uses a four-dimensional data model to run table scans quickly and perform real-time reads and writes. HBase provides highly available, consistent reads and writes while also leveraging MapReduce for batch processing.