Hive vs Impala

Hive and Impala both provide SQL-like interfaces for querying large data sets in Hadoop. Hive traditionally translates queries into MapReduce jobs, while Impala uses an MPP (massively parallel processing) engine to run low-latency queries directly against data in HDFS, HBase, and other supported storage. In this article, we discuss the key differences between Hive and Impala, including how their performance compares and the best use cases for each.

What is Hive?

Hive is an abstraction over Hadoop MapReduce. Using a SQL-like query language (HiveQL), Hive lets users query data in HDFS or HBase without writing MapReduce jobs by hand. For more on Hive, check out this article.

What is Impala?

Impala was developed at Cloudera and provides a similar SQL-like syntax for querying data in Hadoop. Unlike Hive, Impala executes queries on its own MPP engine rather than on MapReduce.

Differences Between Hive and Impala

Hive queries translate to MapReduce jobs. Impala instead executes queries on a memory-bound MPP engine, which is typically much faster than disk-based MapReduce. Unlike Hive, Impala queries data where it is stored, without first transforming or moving it. Impala also avoids startup overhead by starting its daemon processes at boot time. This makes Impala more readily available to process queries than Hive, which has to start new processes for every query you run.
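To make the startup-overhead point concrete, here is a small back-of-the-envelope model. The function names and the timings are illustrative assumptions, not measurements of either engine.

```python
def total_time_per_query_startup(num_queries, startup_s, work_s):
    """Every query pays the process startup cost again (Hive on MapReduce style)."""
    return num_queries * (startup_s + work_s)


def total_time_long_lived_daemon(num_queries, startup_s, work_s):
    """Startup is paid once when the daemon boots (Impala style)."""
    return startup_s + num_queries * work_s


# Illustrative numbers only: 100 short queries, 10 s startup, 1 s of real work each.
per_query = total_time_per_query_startup(100, 10.0, 1.0)
daemon = total_time_long_lived_daemon(100, 10.0, 1.0)
print(per_query, daemon)
```

Under these assumed numbers, the startup cost dominates the per-query model, which is why the difference matters most for many short, interactive queries.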

Because Hive is MapReduce-based, it inherits a level of fault tolerance that Impala can't match: when a task fails, MapReduce reruns it on another node. Impala makes querying a lot faster, but it gives up that advantage. If a node fails in the middle of an Impala query, the query fails and has to be run again.
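The fault-tolerance trade-off can be sketched with a simplified probability model. It assumes every task or fragment fails independently with the same probability, which real clusters do not guarantee, and the function names and numbers are made up for illustration.

```python
def success_with_task_retries(num_tasks, fail_prob, max_attempts):
    """MapReduce-style: each task may be retried on its own up to max_attempts times."""
    # A task is lost only if every one of its attempts fails.
    per_task_success = 1.0 - fail_prob ** max_attempts
    return per_task_success ** num_tasks


def success_all_or_nothing(num_fragments, fail_prob):
    """MPP-style: a single failed fragment aborts the whole query."""
    return (1.0 - fail_prob) ** num_fragments


# Illustrative numbers only: 1000 units of work, 0.1% failure chance each,
# and an assumed limit of 4 attempts per task.
print(success_with_task_retries(1000, 0.001, 4))
print(success_all_or_nothing(1000, 0.001))
```

With these assumed inputs, the all-or-nothing query finishes only about 37% of the time, while the job with per-task retries almost always finishes. The gap grows with the number of tasks, which is why long-running queries benefit most from retries.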

Impala vs Hive Performance

Impala is generally much faster than Hive, although the line has become more blurred with the introduction of Hive 2.0 and LLAP support. Impala's advantage comes largely from avoiding classic MapReduce. Because it uses MPP instead, it doesn't suffer from the startup overhead or heavy disk I/O seen with Hive. Impala also avoids the need to migrate huge data sets or convert data formats before it runs queries, which gives it a further performance edge over Hive.

When to Use Impala vs Hive

Impala is better for low-latency, interactive queries and is most appropriate for interactive computing with multiple users. Impala sacrifices reliability and fault tolerance for performance. This makes Hive the better option for situations where fault tolerance is valued, such as with extremely long-running queries. If a data node goes down while a Hive query is running, the failed tasks are retried and the query still completes. This is not the case with Impala, where the query fails and must be resubmitted. The choice between Hive and Impala really boils down to compatibility and fault tolerance vs performance.
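The rule of thumb above can be written down as a tiny helper. This is only a sketch of the guidance in this section. The function and its parameters are hypothetical, and a real decision would also weigh cluster resources, data formats, and engine versions.

```python
def choose_engine(long_running: bool, needs_fault_tolerance: bool) -> str:
    """Pick an engine using the rule of thumb from this section."""
    if long_running or needs_fault_tolerance:
        # Hive can retry failed tasks, so a long job survives a node failure.
        return "Hive"
    # Short, interactive queries favor Impala's lower latency.
    return "Impala"


print(choose_engine(long_running=True, needs_fault_tolerance=True))
print(choose_engine(long_running=False, needs_fault_tolerance=False))
```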

Conclusion

Hive vs Impala shouldn't be looked at as one versus the other. Instead, the two should be considered complements in the database querying space. It's important to remember that Hive and Impala use the same metastore and can work off the same schemas. Use Hive for long-running queries where compatibility and fault tolerance are emphasized. Use Impala for running faster queries where performance is valued.
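Because both engines read table definitions from the same metastore, the same query can usually be sent to either one. The sketch below only builds the command lines for the two command-line clients and prints them; it does not run anything. The table name web_logs is a made-up example, and the commands assume both clients are installed and pointed at the same cluster.

```python
import shlex


def hive_command(query):
    # `hive -e` runs a query string passed on the command line.
    return ["hive", "-e", query]


def impala_command(query):
    # `impala-shell -q` runs a single query and then exits.
    return ["impala-shell", "-q", query]


QUERY = "SELECT COUNT(*) FROM web_logs"  # hypothetical table name

for command in (hive_command(QUERY), impala_command(QUERY)):
    print(shlex.join(command))
```

One caveat: a table created through Hive may not be visible in Impala until Impala's cached metadata is refreshed, for example with an INVALIDATE METADATA statement.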

Your thoughts?