Setting Up Apache Hive Using the Command Line

Using Hive data in HBase is a common task. See Importing Data Into HBase.

For information about Hive on Spark, see Running Apache Hive on Spark in CDH.

Apache Hive is a powerful data warehousing application for Hadoop. It enables you to access your data using HiveQL, a language similar to SQL.

Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster. As of CDH 5, Hive supports HCatalog which must be installed separately.

Configuring Heap Size and Garbage Collection

To configure the heap size for HiveServer2 and Hive metastore, set the -Xmx parameter in the HADOOP_OPTS variable to the desired maximum heap size in /etc/hive/hive-env.sh.

To configure the heap size for the Beeline CLI, set the HADOOP_HEAPSIZE environment variable in /etc/hive/hive-env.sh before starting the Beeline CLI.

The following example shows a configuration with the following settings:
  • HiveServer2 uses 12 GB heap.
  • Hive metastore uses 12 GB heap.
  • Hive clients use 2 GB heap.
The settings to change are in bold. All of these lines are commented out (prefixed with a # character) by default.
if [ "$SERVICE" = "cli" ]; then
  if [ -z "$DEBUG" ]; then
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms12288m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
  else
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms12288m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
  fi
fi

export HADOOP_HEAPSIZE=2048

You can use either the Concurrent Collector or the new Parallel Collector for garbage collection by passing -XX:+UseConcMarkSweepGC or -XX:+UseParNewGC in the HADOOP_OPTS lines above. To enable the garbage collection overhead limit, remove the -XX:-UseGCOverheadLimit setting or change it to -XX:+UseGCOverheadLimit.

Set the PermGen space for Java garbage collection to 512M for all in the JAVA-OPTS environment variable. For example:

set JAVA_OPTS="-Xms256m -Xmx1024m -XX:PermSize=512m -XX:MaxPermSize=512m"

HiveServer2

HiveServer2 is an improved version of HiveServer that supports a Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency. The CLI for HiveServer2 is Beeline.