Workload XM Cluster Services Health Checks

This topic lists the ZooKeeper health checks that are performed on your Workload XM cluster services. The health checks provide insight into processing performance, such as messaging queue bottlenecks and delays that can cause workload scheduling issues. You can find the ZooKeeper Queue and Processing Timers metric charts in the Workload XM Charts Library tab, and the following health checks in the Health Tests section of the page for the related Workload XM cluster service.

Analytic Database Server

Table 1. Analytic Database Server Processing health check
Health Check Description
Impala Query Processing Time: This health test raises an alert when more than 25% of the Impala Queries do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_adb_service_impala_query_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.
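As a rough illustration of how a processing time check of this kind could be evaluated, the following sketch classifies a 75th percentile value against the 30 and 60 second thresholds. The function name, the status strings, the strict greater-than comparison, and the assumption that the metric value is expressed in seconds are all hypothetical and are not taken from the product.

```python
# Thresholds from the health check description, in seconds.
CONCERNING_SECONDS = 30.0
BAD_SECONDS = 60.0


def processing_time_status(p75_seconds: float) -> str:
    """Classify a 75th percentile processing time (hypothetical helper).

    Assumes the metric value is in seconds and that a value must be
    strictly greater than a threshold to cross it.
    """
    if p75_seconds > BAD_SECONDS:
        return "BAD"
    if p75_seconds > CONCERNING_SECONDS:
        return "CONCERNING"
    return "GOOD"


# Example values are made up for illustration.
for value in (12.5, 45.0, 75.0):
    print(value, processing_time_status(value))
```

If the 75th percentile is above a threshold, then more than 25% of the calls took longer than that threshold, which is the condition the health test describes.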

Pipeline Server

Table 2. Pipeline Server Processing health checks
Health Check Description
Spark Event Log Processing Time: This health test raises an alert when more than 25% of the Spark Event Logs do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_spark_application_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

MR Jhist Processing Time: This health test raises an alert when more than 25% of the MR Jhist payloads do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_mr_job_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Hive Audit Processing Time: This health test raises an alert when more than 25% of the Hive Audit payloads do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_hive_query_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Oozie Workflow Processing Time: This health test raises an alert when more than 25% of the Oozie Workflows do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_oozie_workflow_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Yarn App Processing Time: This health test raises an alert when more than 25% of the Yarn Apps do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_yarn_application_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Tez Dag Event Processing Time: This health test raises an alert when more than 25% of the Tez Dag Events do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_tez_dag_event_log_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Hive Tez Processing Time: This health test raises an alert when more than 25% of the Hive Tez Applications do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_hive_query_event_log_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

MR Task Log Processing Time: This health test raises an alert when more than 25% of the MR Task Logs do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_mr_task_log_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Spark Task Log Processing Time: This health test raises an alert when more than 25% of the Spark Task Logs do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_spark_task_log_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Yarn Container Log Processing Time: This health test raises an alert when more than 25% of the Yarn Container Logs do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_yarn_container_log_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.

Hive HDP26 Log Processing Time: This health test raises an alert when more than 25% of the Hive HDP26 Logs do not finish processing within the threshold run time. The Concerning and Bad run time thresholds are 30 and 60 seconds, respectively.

It uses the wxm_pipelines_service_hive_hdp_26_processing_timer_75th_percentile metric that collects the time in which 75% of the calls are processed.
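The alert condition shared by the Pipeline Server checks ("more than 25% of the payloads do not finish within the threshold") can also be expressed directly on a set of processing times. The sketch below is only an illustration: the helper name and the sample durations are made up, and the product itself evaluates the published 75th percentile metric rather than raw samples.

```python
def fraction_over(durations_seconds, threshold_seconds):
    """Return the share of payloads that took longer than the threshold."""
    if not durations_seconds:
        return 0.0
    slow = sum(1 for d in durations_seconds if d > threshold_seconds)
    return slow / len(durations_seconds)


# Made-up sample: 10 payloads, 3 slower than 30 seconds, 1 slower than 60.
sample = [4, 8, 11, 15, 19, 22, 27, 35, 48, 70]

over_concerning = fraction_over(sample, 30)
over_bad = fraction_over(sample, 60)

print(over_concerning > 0.25)  # more than 25% exceed the 30 second limit
print(over_bad > 0.25)         # no more than 25% exceed the 60 second limit
```

For this sample the Concerning condition holds and the Bad condition does not.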

Admin API Server

Table 3. Admin API Server elevated queue health checks
Health Check Name Description
Spark Event Log Zookeeper Queue Size: This health test raises an alert when the size of the Spark Event Log ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 100K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 200K threshold size.

It uses the wxm_admin_api_service_queue_spark_event_log_items_total metric, which collects the number of items in the queue.

Spark Task Log Zookeeper Queue Size: This health test raises an alert when the size of the Spark Task Log ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_spark_task_log_items_total metric, which collects the number of items in the queue.

Yarn App Zookeeper Queue Size: This health test raises an alert when the size of the Yarn App ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_yarn_app_items_total metric, which collects the number of items in the queue.

Hive Audit Zookeeper Queue Size: This health test raises an alert when the size of the Hive Audit ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_hive_audit_items_total metric, which collects the number of items in the queue.

MR Jhist Zookeeper Queue Size: This health test raises an alert when the size of the MR Jhist ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_mr_jhist_items_total metric, which collects the number of items in the queue.

MR Task Log Zookeeper Queue Size: This health test raises an alert when the size of the MR Task Log ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_mr_task_log_items_total metric, which collects the number of items in the queue.

Oozie Workflow Zookeeper Queue Size: This health test raises an alert when the size of the Oozie Workflow ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_oozie_workflow_items_total metric, which collects the number of items in the queue.

Pse Zookeeper Queue Size: This health test raises an alert when the size of the Pse ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 400K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 800K threshold size.

It uses the wxm_admin_api_service_queue_pse_items_total metric, which collects the number of items in the queue.

Sdx Details Zookeeper Queue Size: This health test raises an alert when the size of the Sdx Details ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_sdx_details_items_total metric, which collects the number of items in the queue.

Impala Query Zookeeper Queue Size: This health test raises an alert when the size of the Impala Query ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_impala_query_profile_items_total metric, which collects the number of items in the queue.

Yarn App Metric Zookeeper Queue Size: This health test raises an alert when the size of the Yarn App Metric ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_yarn_app_metrics_items_total metric, which collects the number of items in the queue.

Hive On MR Table Zookeeper Queue Size: This health test raises an alert when the size of the Hive On MR Table ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_hive_on_mr_table_items_total metric, which collects the number of items in the queue.

Tez History Protobuf Zookeeper Queue Size: This health test raises an alert when the size of the Tez History Protobuf ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_tez_history_protobuf_items_total metric, which collects the number of items in the queue.

Hive History Protobuf Zookeeper Queue Size: This health test raises an alert when the size of the Hive History Protobuf ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_hive_history_protobuf_items_total metric, which collects the number of items in the queue.

Llap History Protobuf Zookeeper Queue Size: This health test raises an alert when the size of the Llap History Protobuf ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_llap_history_protobuf_items_total metric, which collects the number of items in the queue.

Hive HDP26 Log Zookeeper Queue Size: This health test raises an alert when the size of the Hive HDP26 Log ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_hivehdp26btc_items_total metric, which collects the number of items in the queue.

Cluster Metrics Zookeeper Queue Size: This health test raises an alert when the size of the Cluster Metrics ZooKeeper queue exceeds the Concerning or Bad threshold size, where:
  • A Good result implies that there are no processing delays.
  • A Concerning result implies that, due to a slight delay in processing, the number of items in the queue is higher than the 200K threshold size.
  • A Bad result implies that, due to a significant delay in processing, the number of items in the queue is higher than the 400K threshold size.

It uses the wxm_admin_api_service_queue_cluster_metrics_items_total metric, which collects the number of items in the queue.
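The queue size checks in Table 3 all follow the same pattern and differ only in their threshold values. The following sketch shows one way the comparison could be expressed, assuming that "K" means thousands of items and that a queue must hold strictly more items than a threshold to cross it. The function name, the status strings, and the lookup structure are hypothetical; only the metric names and threshold values come from the table.

```python
# (Concerning, Bad) item counts. Most queues in Table 3 use 200K and 400K.
DEFAULT_THRESHOLDS = (200_000, 400_000)

# Queues whose thresholds differ from the common values in Table 3.
QUEUE_THRESHOLDS = {
    "wxm_admin_api_service_queue_spark_event_log_items_total": (100_000, 200_000),
    "wxm_admin_api_service_queue_pse_items_total": (400_000, 800_000),
}


def queue_status(metric_name: str, items: int) -> str:
    """Classify a queue size for the given metric (hypothetical helper)."""
    concerning, bad = QUEUE_THRESHOLDS.get(metric_name, DEFAULT_THRESHOLDS)
    if items > bad:
        return "BAD"
    if items > concerning:
        return "CONCERNING"
    return "GOOD"


# Example queue sizes are made up for illustration.
print(queue_status("wxm_admin_api_service_queue_spark_event_log_items_total", 150_000))
print(queue_status("wxm_admin_api_service_queue_yarn_app_items_total", 150_000))
print(queue_status("wxm_admin_api_service_queue_pse_items_total", 900_000))
```

The same queue size of 150,000 items is Concerning for the Spark Event Log queue but Good for the Yarn App queue, because the Spark Event Log queue has lower thresholds.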