Cloudera Observability on premises cluster services health checks

Lists the ZooKeeper health check tests that are performed on your Cloudera Observability on premises cluster services. They provide processing performance insights, such as messaging queue bottlenecks and delays that can cause workload scheduling issues. You can find the ZooKeeper Queue and Processing Timers metric charts in the Cloudera Observability on premises Charts Library tab and the following Health checks on the Cloudera Observability on premises related cluster service's page in the Health Tests section.

Analytic Database server🔗

Table 1. Analytic Database Server Processing health check
Health Check	Description
Impala Query Processing Time	This health test raises an alert when more than 25% of the Impala Queries do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_adb_service_impala_query_processing_timer_75th_percentile metric` that collects the time in which 75% of the calls are processed.

Pipeline server🔗

Table 2. Pipeline Server Processing health checks
Health Check	Description
Spark Event Log Processing Time	This health test raises an alert when more than 25% of the Spark Event Logs do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_spark_application_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
MR Jhist Processing Time	This health test raises an alert when more than 25% of the MR Jhist payloads do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_mr_job_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Hive Audit Processing Time	This health test raises an alert when more than 25% of the Hive Audit payloads do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_hive_query_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Oozie Workflow Processing Time	This health test raises an alert when more than 25% of the Oozie Workflows do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_oozie_workflow_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Yarn App Processing Time	This health test raises an alert when more than 25% of the Yarn Apps do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_yarn_application_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Tez Dag Event Processing Time	This health test raises an alert when more than 25% of the Tez Dag Events do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_tez_dag_event_log_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Hive Tez Processing Time	This health test raises an alert when more than 25% of the Hive Tez Applications do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_hive_query_event_log_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
MR Task Log Processing Time	This health test raises an alert when more than 25% of the MR task logs do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_mr_task_log_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Spark Task Log Processing Time	This health test raises an alert when more than 25% of the Spark Task Logs do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_spark_task_log_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Yarn Container Log Processing Time	This health test raises an alert when more than 25% of the Yarn Container Logs do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_yarn_container_log_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.
Hive HDP26 Log Processing Time	This health test raises an alert when more than 25% of the Hive HDP26 Logs do not finish processing within the threshold's run time value. Where the defined Concerning and Bad runtime threshold limits are 30 and 60 seconds, respectively. It uses the `wxm_pipelines_service_hive_hdp_26_processing_timer_75th_percentile` metric that collects the time in which 75% of the calls are processed.

Admin API server🔗

Table 3. Admin API Server elevated queue health checks
Health Check Name	Description
Spark Event Log Zookeeper Queue Size	This health test raises an alert when the size of the Spark Event Log ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 100K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 200K threshold size. It uses the `wxm_admin_api_service_queue_spark_event_log_items_total` metric, which collects the number of items in the queue.
Spark Task Log Zookeeper Queue Size	This health test raises an alert when the size of the Spark Task Log ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_spark_task_log_items_total` metric, which collects the number of items in the queue.
Yarn App Zookeeper Queue Size	This health test raises an alert when the size of the Yarn App ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_yarn_app_items_total` metric, which collects the number of items in the queue.
Hive Audit Zookeeper Queue Size	This health test raises an alert when the size of the Hive Audit ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_hive_audit_items_total` metric, which collects the number of items in the queue.
MR Jhist Zookeeper Queue Size	This health test raises an alert when the size of the MR Jhist ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_mr_jhist_items_total` metric, which collects the number of items in the queue.
MR Task Log Zookeeper Queue Size	This health test raises an alert when the size of the MR Task Log ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_mr_task_log_items_total` metric, which collects the number of items in the queue.
Oozie Workflow Zookeeper Queue Size	This health test raises an alert when the size of the Oozie Workflow ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_oozie_workflow_items_total` metric, which collects the number of items in the queue.
Pse Zookeeper Queue Size	This health test raises an alert when the size of the Pse ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 400K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 800K threshold size. It uses the `wxm_admin_api_service_queue_pse_items_total` metric, which collects the number of items in the queue.
Sdx Details Zookeeper Queue Size	This health test raises an alert when the size of the Sdx Details ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_sdx_details_items_total` metric, which collects the number of items in the queue.
Impala Query Zookeeper Queue Size	This health test raises an alert when the size of the Impala Query ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_impala_query_profile_items_total` metric, which collects the number of items in the queue.
Yarn App Metric Zookeeper Queue Size	This health test raises an alert when the size of the Yarn App Metric ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_yarn_app_metrics_items_total` metric, which collects the number of items in the queue.
Hive On MR Table Zookeeper Queue Size	This health test raises an alert when the size of the Hive On MR Table ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_hive_on_mr_table_items_total` metric, which collects the number of items in the queue.
Tez History Protobuf Zookeeper Queue Size	This health test raises an alert when the size of the Tez History Protobuf ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_tez_history_protobuf_items_total` metric, which collects the number of items in the queue.
Hive History Protobuf Zookeeper Queue Size	This health test raises an alert when the size of the Hive History Protobuf ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_hive_history_protobuf_items_total` metric, which collects the number of items in the queue.
Llap History Protobuf Zookeeper Queue Size	This health test raises an alert when the size of the Llap History Protobuf ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_llap_history_protobuf_items_total` metric, which collects the number of items in the queue.
Hive HDP26 Log Zookeeper Queue Size	This health test raises an alert when the size of the Hive HDP26 Log ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_hivehdp26btc_items_total` metric, which collects the number of items in the queue.
Cluster Metrics Zookeeper Queue Size	This health test raises an alert when the size of the Cluster Metrics ZooKeeper queue is above the Concerning and Bad threshold size. Where: A Good result implies that there are no processing delays. A Concerning result implies that due to a slight delay in processing the number of items in the queue is higher than the 200K threshold size. A Bad result implies that due to a significant delay in processing the number of items in the queue is higher than the 400K threshold size. It uses the `wxm_admin_api_service_queue_cluster_metrics_items_total` metric, which collects the number of items in the queue.