Cloudera Observability Hive cluster metrics

Lists the Hive cluster health check tests that are performed by Cloudera Observability at the end of a Hive job. The list includes the severity conditions and thresholds, and what actions you should consider to resolve the problem.

Table 1.
Health Test Description Severity Condition Recommendation
Hive on Tez JVM Pause Rate Analyzer

This health test checks the time taken to free up memory by the Java garbage collector.

Where, a high value for the JVM pause rate indicates that the Java garbage collection took longer than the threshold to complete its work.

It uses the hive_on_tez_jvm_pause_time_rate metric to check the Hive on Tez JVM pause rate.

  • A Good result implies that there were no pauses greater than 300ms and no occurrences of more than five pauses between 100ms and 300ms.
  • A Concerning result implies that there were occurrences of more than five pauses between 100ms and 300ms.
  • A Bad result implies that there was at least one pause that was greater than 300ms.

To address this condition, consider increasing the allocated heap size for the HiveServer2 instance.

Hive Metastore JVM Pause Rate Analyzer

This health test checks the time taken to free up memory by the Java garbage collector.

Where, a high value for the JVM pause rate indicates that the Java garbage collection took longer than the threshold to complete its work.

It uses the hive_jvm_pause_time_rate metric to check the Hive Metastore JVM pause rate.

  • A Good result implies that there were no pauses greater than 300ms and no occurrences of more than five pauses between 100ms and 300ms.
  • A Concerning result implies that there were occurrences of more than five pauses between 100ms and 300ms.
  • A Bad result implies that there was at least one pause that was greater than 300ms.
To address this condition consider increasing the allocated heap size for the Hive Metastore.
Hive on Tez Waiting Compile Ops Analyzer

This health test counts the number of Hive On Tez operations waiting to compile.

Where, if the number of operations waiting to compile is greater than 0, then the HiveServer2 instance is likely overloaded.

It uses the hive_on_tez_waiting_compile_ops metric to count the number of Hive On Tez operations waiting to compile.

  • A Good result implies that there were zero operations waiting to compile.
  • A Bad result implies that the number of operations waiting to compile is consistently greater than zero.

If the number of operations waiting to compile is consistently greater than zero, then to address this condition, consider restarting the HiveServer2 instance.

HiveServer2 Memory Usage Analyzer

This health test calculates the percentage of Hive On Tez heap memory utilization for the input period.

Where, if the percentage of heap memory utilization is above the threshold, there is a possibility of running out of heap space.

It uses the hive_on_tez_memory_heap_used and the hive_on_tez_memory_heap_max metrics to calculate the percentage of heap memory utilization for the input period.

  • A Good result implies that the maximum heap utilization is less than 80% of the available heap.
  • A Concerning result implies that the maximum heap utilization is between 80% and 95% of the available heap.
  • A Bad result implies that the maximum heap utilization exceeded 95% of the available heap.

To address this condition, consider increasing the allocated heap size for the HiveServer2 instance.

Hive Metastore Memory Usage Analyzer

This health test calculates the percentage of Hive Metastore heap memory utilization for the input period.

Where, if the percentage of heap memory utilization is above the threshold, there is a possibility of running out of heap space.

It uses the hive_memory_heap_used and the hive_memory_heap_max metrics to calculate the percentage of heap memory utilization for the input period.

  • A Good result implies that the maximum heap utilization is less than 80% of the available heap.
  • A Concerning result implies that the maximum heap utilization is between 80% and 95% of the available heap.
  • A Bad result implies that the maximum heap utilization exceeded 95% of the available heap.

To address this condition, consider increasing the allocated heap size for the Hive Metastore.