Managing Hive
Use the following procedures to manage HiveServer2 and the Hive metastore. To configure high availability for the Hive metastore, see Hive Metastore High Availability.
Heap Size and Garbage Collection for Hive Components
Hive Component Memory Recommendations
Number of Concurrent Connections | HiveServer2 Heap Size Minimum Recommendation | Hive Metastore Heap Size Minimum Recommendation |
---|---|---|
Up to 40 concurrent connections (Cloudera recommends splitting HiveServer2 into multiple instances and load balancing once you start allocating more than 12 GB to HiveServer2; the objective of this sizing is to reduce the impact of Java garbage collection on active processing by the service.) | 12 GB | 12 GB |
Up to 20 concurrent connections | 6 GB | 10 GB |
Up to 10 concurrent connections | 4 GB | 8 GB |
Single connection | 2 GB | 4 GB |
In addition, the Beeline CLI should use a heap size of at least 2 GB.
The PermGen space (the -XX:MaxPermSize JVM option) should be set to 512M for all of these components.
Configuring Heap Size and Garbage Collection for Hive Components
To configure the heap size for HiveServer2 and the Hive metastore, set the -Xmx parameter in the HADOOP_OPTS variable to the desired maximum heap size. If you use Cloudera Manager, make this change in the hive-env.sh advanced configuration snippet; otherwise, edit /etc/hive/hive-env.sh directly.
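For example, a minimal sketch of such a hive-env.sh fragment, assuming the 12 GB recommendation and the 512M PermGen setting from the table above (adjust -Xmx to your target heap size):

# Sketch of a hive-env.sh fragment: append a 12 GB maximum heap and a 512M
# PermGen size to the options the Hive service JVMs start with.
export HADOOP_OPTS="$HADOOP_OPTS -Xmx12288m -XX:MaxPermSize=512M"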
To configure the heap size for the Beeline CLI, set the HADOOP_HEAPSIZE environment variable in the same file before starting the Beeline CLI.
The following example shows a hive-env.sh configuration with these settings:

- HiveServer2 uses 12 GB heap
- Hive metastore uses 12 GB heap
- Hive clients use 2 GB heap
if [ "$SERVICE" = "cli" ]; then if [ -z "$DEBUG" ]; then export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit" else export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit" fi fi export HADOOP_HEAPSIZE=2048
You can choose between the Concurrent Collector and the New Parallel Collector for garbage collection by passing -XX:+UseConcMarkSweepGC or -XX:+UseParNewGC, respectively, in the HADOOP_OPTS lines above. The -XX:-UseGCOverheadLimit setting disables the garbage collection overhead limit; to enable the limit, remove that setting or change it to -XX:+UseGCOverheadLimit.
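For instance, a sketch of the same HADOOP_OPTS line switched to the Concurrent Collector with the overhead limit left enabled (only the GC-related options differ from the example above):

# Sketch: same options as above, but using the Concurrent Collector
# (-XX:+UseConcMarkSweepGC). The -XX:-UseGCOverheadLimit flag is omitted,
# so the JVM default (+UseGCOverheadLimit) keeps the overhead limit enabled.
export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseConcMarkSweepGC"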
Configuration for WebHCat
If you run WebHCat, set the PYTHON_CMD environment variable to the path of the Python binary before starting the WebHCat server, for example:

export PYTHON_CMD=/usr/bin/python
Transaction (ACID) Support in Hive
The CDH distribution of Hive does not support transactions (HIVE-5317). Currently, transaction support in Hive is an experimental feature that works only with the ORC file format. Cloudera recommends using the Parquet file format, which works across many tools. Instead of relying on ACID transactions, merge updates into Hive tables using existing functionality, including the INSERT, INSERT OVERWRITE, and CREATE TABLE AS SELECT statements.
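As an illustration, here is a minimal sketch of a merge-style update expressed with CREATE TABLE AS SELECT over a full outer join of a base table and an updates table; the table names, columns, and JDBC URL are hypothetical placeholders:

# Sketch of merging updates without ACID: build a merged copy of a base table,
# preferring rows from an updates table where they exist.
# Table names, columns, and the JDBC URL are placeholders.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE TABLE customers_merged AS
SELECT
  COALESCE(u.id,    c.id)    AS id,
  COALESCE(u.name,  c.name)  AS name,
  COALESCE(u.email, c.email) AS email
FROM customers c
FULL OUTER JOIN customer_updates u
  ON c.id = u.id;
"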