The Apache Hive Metastore

Hive sits on top of Hadoop and makes querying big data easier. Using Hive, you can apply a SQL-like relational context to Hadoop and perform MapReduce operations without all the Java. While Hive doesn't store any of the data it processes, it relies on a metastore database for storing schema information for the Hive tables you create. In this article, we discuss the basics of the Hive metastore including what it is, how it works, and configuring it for your Hadoop environment.

What is the Hive Metastore?

The Hive metastore is simply a relational database. It stores metadata related to the tables/schemas you create to easily query big data stored in HDFS. When you create a new Hive table, the information related to the schema (column names, data types) is stored in the Hive metastore relational database. Other information like input/output formats, partitions, HDFS locations are all stored in the metastore.

How does the Hive Metastore work?

The Hive metastore acts as a central repo for Hive metadata. The Hive driver connects to the metastore and queries information related to the schema you create so that you can run SQL like queries (HQL) against the underlying Hadoop cluster. Remember that Hive simply sits on top of HDFS, it is not an actual form of storage for the big data you store via HDFS, HBase, etc.

By default, Hive uses Derby SQL server for the metastore database. This runs as a single process and is not recommended for production. Using other supported relational engines like MySQL or Postgres is necessary for running Hive on a production cluster.

Configuring the Hive Metastore

The hive-site.xml file is used for specifying the locations and type of database used for the metastore. Specifically, the javax.jdo.option.Connection url property specifies the JDBC connection string for the metastore. While Derby is the default engine, Hive supports MySql, MS Sql Server, Oracle, and PostGres for production grade implementations.

There are three modes for implementing Hive:

Embedded Mode:

This is largely used for experimental purposes and is the default configuration for Hive. Using embedded mode, both the metastore database and the service API run as a single process. This means only one active user can be using the service at a time. Embedded mode requires minimal configuration, but has severe limitations and is not suitable for production environments.

Local Mode

In local mode, the metastore service runs as a separate process from the metastore database. The metastore database is a standalone instance, allowing multiple users to query against it simultaneously. It should be noted that the metastore service (API) and main driver still run as a single process in local mode.

What is the driver anyways?

The driver is responsible for receiving the SQL like queries and monitoring their execution. It tracks and stores the necessary metadata generated from queries. Users submit queries to the driver through the CLI.

Remote Mode

This is the recommended implementation for production grade environments. The main difference between remote and local is that users connect to a separate metastore server JVM residing outside the metstastore service. In other words, the Hive metastore service runs as a separate process from the main driver. This provides better availability and scalability over a network.

Conclusion

The metastore provides the infrastructure for the relational tables you create with Hive. While Hive simply sits on top of Hadoop, it stores schema information in a relational database that can be configured in several ways. Implementing Hive in remote mode is recommended since the Hive metastore service runs separately from the main driver.

Your thoughts?