Hive vs HBase

Hive and HBase are both great additions to the Hadoop ecosystem. While Hive provides a SQL-like interface for Hadoop, HBase acts as a NoSQL layer for HDFS. The two are quite different but can work well together with Hadoop. In this article, we discuss the key differences between Hive and HBase. We'll look at what makes them different, as well as when to use each and why.

What is Hive?

Apache Hive is a data warehouse built on top of Hadoop. It allows you to easily run MapReduce jobs on a Hadoop cluster while using a SQL-like syntax.

What is a data warehouse anyway?

A data warehouse is a system for reporting and data analysis. While implementations of a data warehouse may vary, it typically serves as a central repository of data fed by multiple sources.

A SQL Abstraction

Hive provides its own query language, HiveQL (HQL), for running MapReduce jobs on a Hadoop cluster. It layers a relational schema over the files in HDFS so you can express traditionally complicated MapReduce jobs as more familiar SQL-like queries.

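When you run a Hive query, it is compiled into batch jobs that run on Hadoop to aggregate data. Hive takes an RDBMS-style approach to reads and writes on HDFS, with tables, schemas, and SQL-like statements, but it is not built for frequent row-level changes: older releases don't support updates at all, and newer ones allow them only on transactional tables.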
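To make the abstraction concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps that a simple GROUP BY query stands in for. The `visits` table, its sample rows, and every function name are hypothetical, and Hive's real query planner is far more involved than this. It only illustrates the kind of work the SQL-like syntax hides.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a Hive table visits(user, page).
VISITS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/pricing"),
    ("carol", "/home"),
]

# The query being imitated, written in HiveQL:
#   SELECT page, COUNT(*) FROM visits GROUP BY page;


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Collect all values under their key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values; COUNT(*) is just a sum of ones."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three phases in order, like one batch job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(VISITS))
```

The one-line query replaces three hand-written functions, which is the whole appeal for anyone who knows SQL but not MapReduce.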

When to use Hive

Hive is great for SQL-savvy developers who want to run MapReduce jobs without knowing how to implement MapReduce. This could include data analysts or anyone who is familiar with SQL or an RDBMS. Remember that the key point of Hive is to provide a SQL-like abstraction for running MapReduce jobs on Hadoop. This makes it good for analytical queries (OLAP).

What is HBase?

HBase is a non-relational, distributed database that runs on top of HDFS. It brings the benefits of NoSQL to Hadoop. For more on NoSQL, see choosing the right database.

Through its NoSQL key/value store, HBase is great for real-time querying on big data. This makes it a strong fit for lightning-fast reads and writes on live data streams, and it adds row-level, transaction-style operations that HDFS alone does not offer.
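The key/value access pattern can be sketched with a toy in-memory store. This is not the HBase client API. The class name, the method names, and the sample row keys are all hypothetical. It only shows the idea of rows kept sorted by key, with values addressed by "family:qualifier" column names, so that single-row lookups and key-range scans avoid reading the whole data set.

```python
import bisect


class ToyWideColumnStore:
    """In-memory stand-in for a sorted key/value table (not the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; new row keys are inserted in sorted position."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point lookup by row key; returns an empty dict if the row is absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, dict(self._rows[key])) for key in self._keys[lo:hi]]


def build_demo_store():
    """Populate a store with a few hypothetical rows."""
    store = ToyWideColumnStore()
    store.put("user#002", "info:name", "bob")
    store.put("user#001", "info:name", "alice")
    store.put("user#001", "stats:visits", 7)
    store.put("video#001", "info:title", "intro")
    return store


if __name__ == "__main__":
    demo = build_demo_store()
    print(demo.get("user#001"))
    print(demo.scan("user#", "user#~"))
```

Because the keys are sorted, fetching one user or every key with the `user#` prefix touches only the matching rows, which is a very different cost profile from a batch job that reads everything.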

When to use HBase

Use HBase for real-time queries and fast lookups. HBase is a good choice for quickly storing and processing individual records on top of HDFS, whose files are otherwise written once and read in bulk.

Hive vs HBase

Remember that HBase is a database and Hive is a database engine, that is, a SQL query layer over data stored elsewhere. Comparing the two is like comparing apples and oranges.

Despite their differences, Hive and HBase actually work well together. For example, you can run Hive queries on top of HBase. This couples the convenience of a SQL-like syntax with the benefits of a non-relational data store for HDFS.
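One way to picture querying HBase from Hive is as a column mapping that flattens key/value rows into relational records. The sketch below is plain Python with hypothetical data and function names. The mapping string imitates the style of the `hbase.columns.mapping` property used by Hive's HBase storage handler, where `:key` stands for the row key, but nothing here calls Hive or HBase.

```python
# Hypothetical rows as they might sit in a key/value table:
# row key -> {"family:qualifier": value}
HBASE_ROWS = {
    "user#001": {"info:name": "alice", "stats:visits": 7},
    "user#002": {"info:name": "bob"},
}

# One mapping entry per SQL-side column; ":key" means "use the row key".
COLUMNS = ["id", "name", "visits"]
MAPPING = ":key,info:name,stats:visits"


def project(rows, columns, mapping):
    """Flatten key/value rows into relational records using the mapping."""
    sources = mapping.split(",")
    if len(sources) != len(columns):
        raise ValueError("mapping must have one entry per column")
    table = []
    for row_key in sorted(rows):
        cells = rows[row_key]
        record = {}
        for column, source in zip(columns, sources):
            # A missing cell becomes None, playing the role of SQL NULL.
            record[column] = row_key if source == ":key" else cells.get(source)
        table.append(record)
    return table


def total_visits(table):
    """Imitate SELECT SUM(visits), which skips NULL values."""
    return sum(r["visits"] for r in table if r["visits"] is not None)


if __name__ == "__main__":
    flat = project(HBASE_ROWS, COLUMNS, MAPPING)
    print(flat)
    print(total_visits(flat))
```

Once the rows look like a table, the analytical side is ordinary SQL-style aggregation, while the data itself stays in the store that handles the fast reads and writes.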

Conclusion

It is a mistake to think that Hive and HBase compete within the Hadoop ecosystem. While Hive improves the analytical side of HDFS, HBase improves transactions in a real-time environment. For these reasons, the two are often best used together to enhance Hadoop.

Join the conversation...

Posted by apachenutz
August 18, 2017

also can you give more examples as to how you would actually implement hbase on a cluster?
Posted by apachenutz
August 18, 2017

great read.
Posted by stackchief
August 16, 2017

from a performance standpoint, it's important to understand what each is doing. hive runs traditional map-reduce (batch) jobs on HDFS. this leaves it subject to the same performance issues you experience with regular MapReduce.

HBase is running a NoSQL db on top of HDFS. This makes it extremely fast from a performance standpoint.

Key takeaways are that both work well together but Hive is a DB engine and HBase is a DB.