Installing Hive

Install the Hive packages you need using the command for your distribution.
OS                  Command
RHEL-compatible     $ sudo yum install <pkg1> <pkg2> ...
SLES                $ sudo zypper install <pkg1> <pkg2> ...
Ubuntu or Debian    $ sudo apt-get install <pkg1> <pkg2> ...
The packages are:
  • hive – base package that provides the complete language and runtime
  • hive-metastore – provides scripts for running the metastore as a standalone service (optional)
  • hive-server2 – provides scripts for running HiveServer2
  • hive-hbase – install this package if you want to use Hive with HBase (optional)
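As a minimal sketch of how the commands and package names above combine, the helper below prints (but does not run) an install command for a given distribution. The function name hive_install_cmd is hypothetical, the package set shown is only one possible choice, and the distribution IDs are assumed to match the ID field of /etc/os-release.

```shell
#!/usr/bin/env bash
# Print (do not run) the Hive install command for a distribution ID.
# Assumption: IDs follow the ID field in /etc/os-release.
# The package set is an example; add hive-hbase if you use HBase.
hive_install_cmd() {
  local os_id="$1"
  local pkgs="hive hive-metastore hive-server2"
  case "$os_id" in
    rhel|centos|ol) echo "sudo yum install $pkgs" ;;
    sles)           echo "sudo zypper install $pkgs" ;;
    ubuntu|debian)  echo "sudo apt-get install $pkgs" ;;
    *)              echo "unsupported distribution: $os_id" >&2; return 1 ;;
  esac
}
```

To use it on the current host, source /etc/os-release and pass the ID, for example: . /etc/os-release; hive_install_cmd "$ID"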

Heap Size and Garbage Collection for Hive Components

This section provides guidelines for setting HiveServer2 and Hive metastore memory and garbage-collection properties.

Memory and Hardware Recommendations

HiveServer2 and the Hive metastore require sufficient memory to run correctly. The default heap size of 256 MB for each component is inadequate for production workloads. The table below contains guidelines for sizing the heap for each component, based on the number of open connections. In the table, connections means the number of open connections to HiveServer2, which Cloudera Manager reports as the hive_open_connections metric. Cloudera Manager charts this metric on the HiveServer2 Status page.


Component: HiveServer2

  Java Heap:
    Single connection     4 GB
    2-10 connections      4-6 GB
    11-20 connections     6-12 GB
    21-40 connections     12-16 GB
    41-80 connections     16-24 GB

    Cloudera recommends splitting HiveServer2 into multiple instances and load balancing them once you start allocating more than 16 GB to HiveServer2. The objective is to adjust the size to reduce the impact of Java garbage collection on active processing by the service.

    Set this value using the Java Heap Size of HiveServer2 in Bytes Hive configuration property.

    For more information, see Tuning Hive in CDH.

  CPU: Minimum 4 dedicated cores

  Disk: Minimum 1 disk

    This disk is required for the following:

      • HiveServer2 log files
      • stdout and stderr output files
      • Configuration files
      • Operation logs stored in the operation_logs_dir directory, which is configurable
      • Any temporary files that might be created by local map tasks under the /tmp directory

Component: Hive Metastore

  Java Heap:
    Single connection     4 GB
    2-10 connections      4-10 GB
    11-20 connections     10-12 GB
    21-40 connections     12-16 GB
    41-80 connections     16-24 GB

    Set this value using the Java Heap Size of Hive Metastore Server in Bytes Hive configuration property.

    For more information, see Tuning Hive in CDH.

  CPU: Minimum 4 dedicated cores

  Disk: Minimum 1 disk

    This disk is required so that the Hive metastore can store the following artifacts:

      • Logs
      • Configuration files
      • Backend database that is used to store metadata if the database server is also hosted on the same node

Component: Beeline CLI

  Java Heap: Minimum 2 GB
  CPU: N/A
  Disk: N/A

In addition, set the PermGen space for Java garbage collection to 512 MB for all three components.
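The heap guidelines in the table above can be expressed as a simple lookup. The following is only an illustrative sketch: the function name hive_heap_range and the component keywords hiveserver2 and metastore are hypothetical, and the ranges are copied from the table rather than derived from any tool.

```shell
#!/usr/bin/env bash
# Look up the suggested heap range for a component and connection count.
# Ranges are transcribed from the sizing table; names are illustrative.
hive_heap_range() {
  local component="$1" conns="$2"
  if [ "$conns" -lt 1 ] || [ "$conns" -gt 80 ]; then
    echo "connection count $conns is outside the table (1-80)" >&2
    return 1
  fi
  case "$component" in
    hiveserver2)
      if   [ "$conns" -eq 1 ];  then echo "4 GB"
      elif [ "$conns" -le 10 ]; then echo "4-6 GB"
      elif [ "$conns" -le 20 ]; then echo "6-12 GB"
      elif [ "$conns" -le 40 ]; then echo "12-16 GB"
      else                           echo "16-24 GB"
      fi ;;
    metastore)
      if   [ "$conns" -eq 1 ];  then echo "4 GB"
      elif [ "$conns" -le 10 ]; then echo "4-10 GB"
      elif [ "$conns" -le 20 ]; then echo "10-12 GB"
      elif [ "$conns" -le 40 ]; then echo "12-16 GB"
      else                           echo "16-24 GB"
      fi ;;
    *)
      echo "unknown component: $component" >&2
      return 1 ;;
  esac
}
```

For example, hive_heap_range hiveserver2 15 prints 6-12 GB, which matches the 11-20 connections row.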

Configuring Heap Size and Garbage Collection

Using Cloudera Manager

To configure heap size and garbage collection for HiveServer2:

  1. To set heap size, go to Home > Hive > Configuration > HiveServer2 > Resource Management.
  2. Set Java Heap Size of HiveServer2 in Bytes to the desired value, and click Save Changes.
  3. To set garbage collection, go to Home > Hive > Configuration > HiveServer2 > Advanced.
  4. Set the PermGen space for Java garbage collection to 512M, the type of garbage collector used (ConcMarkSweepGC or ParNewGC), and enable or disable the garbage collection overhead limit in Java Configuration Options for HiveServer2.

    The following example sets the PermGen space to 512M, uses the new Parallel Collector, and disables the garbage collection overhead limit:

    -XX:MaxPermSize=512M -XX:+UseParNewGC -XX:-UseGCOverheadLimit
    
  5. From the Actions drop-down menu, select Restart to restart the HiveServer2 service.
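Because the heap properties are named "in Bytes", it can help to convert a target size before entering it. This is a small sketch, assuming binary units (1 GiB = 1024 x 1024 x 1024 bytes) and a shell with 64-bit arithmetic; the helper name gib_to_bytes is hypothetical.

```shell
#!/usr/bin/env bash
# Convert a whole number of GiB to bytes.
# Assumes binary units and 64-bit shell arithmetic.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}
```

For example, gib_to_bytes 12 prints 12884901888, and gib_to_bytes 2 prints 2147483648.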

To configure heap size and garbage collection for the Hive metastore:

  1. To set heap size, go to Home > Hive > Configuration > Hive Metastore Server > Resource Management.
  2. Set Java Heap Size of Hive Metastore Server in Bytes to the desired value, and click Save Changes.
  3. To set garbage collection, go to Home > Hive > Configuration > Hive Metastore Server > Advanced.
  4. Set the PermGen space for Java garbage collection to 512M, the type of garbage collector used (ConcMarkSweepGC or ParNewGC), and enable or disable the garbage collection overhead limit in Java Configuration Options for Hive Metastore Server. For an example of this setting, see step 4 above for configuring garbage collection for HiveServer2.
  5. From the Actions drop-down menu, select Restart to restart the Hive Metastore service.

To configure heap size and garbage collection for the Beeline CLI:

  1. To set heap size, go to Home > Hive > Configuration > Gateway > Resource Management.
  2. Set Client Java Heap Size in Bytes to at least 2 GiB and click Save Changes.
  3. To set garbage collection, go to Home > Hive > Configuration > Gateway > Advanced.
  4. Set the PermGen space for Java garbage collection to 512M in Client Java Configuration Options.

    The following example sets the PermGen space to 512M and specifies IPv4:

    -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true
    
  5. From the Actions drop-down menu, select Restart to restart the client service.

Using the Command Line

To configure the heap size for HiveServer2 and the Hive metastore, edit /etc/hive/hive-env.sh and set the -Xmx parameter in the HADOOP_OPTS variable to the desired maximum heap size.

To configure the heap size for the Beeline CLI, set the HADOOP_HEAPSIZE environment variable in /etc/hive/hive-env.sh before starting the Beeline CLI.

The following example shows a configuration with these settings:
  • HiveServer2 uses 12 GB heap.
  • Hive metastore uses 12 GB heap.
  • Hive clients use 2 GB heap.
The settings to change are the heap values (-Xmx, -Xms, and HADOOP_HEAPSIZE). All of these lines are commented out (prefixed with a # character) by default.
if [ "$SERVICE" = "cli" ]; then
  if [ -z "$DEBUG" ]; then
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms12288m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
  else
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms12288m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
  fi
fi

export HADOOP_HEAPSIZE=2048

You can use either the Concurrent Collector or the new Parallel Collector for garbage collection by passing -XX:+UseConcMarkSweepGC or -XX:+UseParNewGC in the HADOOP_OPTS lines above. To enable the garbage collection overhead limit, remove the -XX:-UseGCOverheadLimit setting or change it to -XX:+UseGCOverheadLimit.
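To illustrate how the heap and collector choices fit together, the sketch below assembles an option string from a heap size in GB and a collector keyword. The function name build_hive_jvm_opts and the keywords cms and parnew are hypothetical; only the JVM flags themselves come from the text above, and the other tuning flags from the example are left out for brevity.

```shell
#!/usr/bin/env bash
# Build a JVM option string for HADOOP_OPTS from a heap size (GB)
# and a collector keyword. Keywords and function name are illustrative.
build_hive_jvm_opts() {
  local heap_gb="$1" collector="$2"
  local heap_mb=$(( heap_gb * 1024 ))   # 12 GB -> 12288 MB
  local gc_flag
  case "$collector" in
    cms)    gc_flag="-XX:+UseConcMarkSweepGC" ;;
    parnew) gc_flag="-XX:+UseParNewGC" ;;
    *)      echo "unknown collector: $collector" >&2; return 1 ;;
  esac
  # Initial and maximum heap are set to the same value, as in the example.
  echo "-Xmx${heap_mb}m -Xms${heap_mb}m $gc_flag -XX:-UseGCOverheadLimit"
}
```

For example, build_hive_jvm_opts 12 parnew prints -Xmx12288m -Xms12288m -XX:+UseParNewGC -XX:-UseGCOverheadLimit, which you could append to HADOOP_OPTS in hive-env.sh.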

Set the PermGen space for Java garbage collection to 512M for all three components in the JAVA_OPTS environment variable. For example:

export JAVA_OPTS="-Xms256m -Xmx1024m -XX:PermSize=512m -XX:MaxPermSize=512m"

Configuration for WebHCat

If you want to use WebHCat, set the PYTHON_CMD variable in /etc/default/hive-webhcat-server after installing Hive. For example:
export PYTHON_CMD=/usr/bin/python
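If the interpreter is not at /usr/bin/python on your host, you can look it up before editing the file. The helper below is a sketch only: first_available_cmd is a hypothetical name, and the candidate interpreter names you pass to it (for example python and python2) are assumptions about your system, not something WebHCat prescribes.

```shell
#!/usr/bin/env bash
# Print the path of the first command name that exists on PATH.
# Returns non-zero if none of the candidates is found.
first_available_cmd() {
  local name
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      command -v "$name"
      return 0
    fi
  done
  return 1
}
```

For example, echo "export PYTHON_CMD=$(first_available_cmd python python2)" prints a line you can review and then place in /etc/default/hive-webhcat-server.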