Understanding the Workload XM Services Health Check Alerts
Describes the Workload XM cluster service health check alerts and thresholds, and the actions required to resolve a reported problem.
Understanding the Health Check Alert Threshold Colors
Workload XM Health Check Alerts and suggested actions are located on the Workload XM service's health test page. When a health check completes, its result is compared against the defined thresholds, which determine whether the service element is Good, Concerning, or Bad. For example, when a service's ZooKeeper queue size exceeds the Critical threshold (Red Alert) limit, the health check triggers an alert and displays an alert message, the cause, and corrective actions.
For descriptions of the health checks performed on each Workload XM cluster service, click the Related Information link below.
To help you recognize the severity level of a Workload XM health check, the health check results are displayed in the following colors:
Alert Color | Severity |
---|---|
Green | Good - The health check result is normal and within the acceptable range. |
Yellow | Concerning - The health check result has exceeded the Warning threshold limit and indicates a potential problem, which must eventually be resolved but does not require immediate action. See the corrective actions in the Actions and Advice sections. |
Red | Bad - The health check result has exceeded the Critical threshold limit and indicates a serious problem, which must be resolved immediately. See the corrective actions in the Actions and Advice sections. For example, when the Hive Audit ZooKeeper queue size has exceeded the Critical threshold limit and can no longer process messages, possible actions are listed in the Elevated Queue Count section below. |
Elevated Queue Count
A Workload XM health check alert is triggered when the number of messages in the workload queue exceeds the defined yellow and red threshold limits.
Queue Name | Default Yellow Alert Threshold (messages) | Default Red Alert Threshold (messages) |
---|---|---|
SparkEventLog | 100K | 200K |
PSE | 400K | 800K |
Other services | 200K | 400K |
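To see where a queue stands relative to these limits, the following Python sketch counts a queue's pending messages and maps the count to an alert color. It is a minimal sketch only: it assumes the Workload XM queues are ZooKeeper znodes whose children are the pending messages, and the host and znode path shown are hypothetical placeholders for your deployment.

```python
# Minimal sketch: count pending messages in a ZooKeeper-backed queue and
# compare the count against the alert thresholds from the table above.
# The host and znode path are hypothetical placeholders.
from kazoo.client import KazooClient

# Default thresholds (yellow, red) in messages, from the table above.
THRESHOLDS = {
    "SparkEventLog": (100_000, 200_000),
    "PSE": (400_000, 800_000),
}
DEFAULT_THRESHOLDS = (200_000, 400_000)  # all other services

def queue_severity(zk: KazooClient, queue: str, path: str) -> str:
    """Map a queue's pending-message count to an alert color."""
    pending = len(zk.get_children(path))   # one child znode per message (assumption)
    yellow, red = THRESHOLDS.get(queue, DEFAULT_THRESHOLDS)
    if pending >= red:
        return f"{queue}: RED ({pending:,} messages; red limit {red:,})"
    if pending >= yellow:
        return f"{queue}: YELLOW ({pending:,} messages; yellow limit {yellow:,})"
    return f"{queue}: GREEN ({pending:,} messages)"

zk = KazooClient(hosts="zk-host-1:2181")   # hypothetical ZooKeeper quorum
zk.start()
print(queue_severity(zk, "SparkEventLog", "/wxm/queues/SparkEventLog"))  # hypothetical path
zk.stop()
```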
- Check the status of Telemetry Publisher; specifically, whether it restarted after a long pause, as this creates a sudden influx of pending workload data records and increases the size of the queue.
- Check whether any pipelines or ADB services are down, as this prevents the queues from clearing and workloads from being processed.
- Check whether any new environments, clusters, or workloads are now publishing to your Workload XM cluster, as this could result in new jobs sending large amounts of data at the same time as your existing jobs.
- Check the health of the ZooKeeper service.
- Check whether the maximum number of ZooKeeper connections is configured correctly for your environment. A sketch for checking the last two items follows this list.
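For the last two checks, ZooKeeper's standard four-letter admin commands (ruok and mntr) report service health and connection counts. The sketch below sends them over a plain socket; it assumes the commands are whitelisted (4lw.commands.whitelist on recent ZooKeeper versions), and the host and port are placeholders.

```python
# Minimal sketch: query ZooKeeper health and connection statistics with
# the standard four-letter admin commands. Host and port are placeholders.
import socket

def four_letter(host: str, port: int, cmd: bytes) -> str:
    """Send a four-letter admin command to ZooKeeper and return its reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:        # ZooKeeper closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode()

host, port = "zk-host-1", 2181                      # hypothetical
print("health:", four_letter(host, port, b"ruok"))  # prints "imok" when healthy

# mntr reports one tab-separated key/value pair per line.
stats = dict(
    line.split("\t", 1)
    for line in four_letter(host, port, b"mntr").splitlines()
    if "\t" in line
)
print("alive connections:", stats.get("zk_num_alive_connections"))
print("outstanding requests:", stats.get("zk_outstanding_requests"))
# Compare zk_num_alive_connections against the maxClientCnxns setting in
# zoo.cfg to judge whether the connection limit suits your environment.
```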
If none of the above corrects the problem, contact Cloudera Support and create a support ticket.
Slower Payload Processing Times
A health check alert is triggered when less than 75 percent of the service's audit payloads are processed within the defined yellow and red timer threshold limits.
Payload Type | Default Yellow Alert Threshold | Default Red Alert Threshold |
---|---|---|
All services | 30 seconds | 60 seconds |
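To make the trigger condition concrete, the following sketch applies the 75 percent rule to a window of payload processing times. The durations are made-up sample data; how you collect the real measurements is specific to your deployment.

```python
# Minimal sketch of the trigger condition: an alert fires when less than
# 75 percent of audit payloads are processed within the threshold.
YELLOW_S, RED_S = 30, 60      # default thresholds from the table above
REQUIRED_FRACTION = 0.75      # at least 75% must finish within the limit

def payload_severity(durations_s: list[float]) -> str:
    """Map a window of payload processing times (seconds) to an alert color."""
    n = len(durations_s)
    within_yellow = sum(d <= YELLOW_S for d in durations_s) / n
    within_red = sum(d <= RED_S for d in durations_s) / n
    if within_red < REQUIRED_FRACTION:
        return f"RED: only {within_red:.0%} processed within {RED_S}s"
    if within_yellow < REQUIRED_FRACTION:
        return f"YELLOW: only {within_yellow:.0%} processed within {YELLOW_S}s"
    return "GREEN"

print(payload_severity([5, 12, 33, 70, 8, 95, 40, 22]))  # made-up sample data
```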
- Check the number of items in the ZooKeeper queue, as too many items can slow down processing.
- Check that the HBase Region Servers are in good health.
- Check that the Phoenix Query Server (PQS) instances are up and running.
- Check that the Pipeline server instances are up and running.
- Check the Pipeline Server payload size metric, which indicates the size of each job and how much data is being sent; an increase in the average payload size leads to longer processing times. A sketch for spotting this trend follows this list.
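As noted in the last item above, payload size growth is easiest to spot by comparing a recent window of the Pipeline Server payload size metric against a longer baseline. This sketch assumes you have exported the metric as a list of byte sizes; the sample values and the 1.25 alerting ratio are illustrative only.

```python
# Minimal sketch: flag growth in the average payload size by comparing a
# recent window against the preceding baseline. Sample values are made up.
def payload_size_ratio(sizes_bytes: list[int], recent: int = 10) -> float:
    """Return the recent average payload size as a ratio of the baseline average."""
    baseline, window = sizes_bytes[:-recent], sizes_bytes[-recent:]
    return (sum(window) / len(window)) / (sum(baseline) / len(baseline))

samples = [2_000_000] * 50 + [3_500_000] * 10   # hypothetical sizes in bytes
ratio = payload_size_ratio(samples)
if ratio > 1.25:                                # illustrative alerting ratio
    print(f"recent average payload size is {ratio:.0%} of baseline; "
          "expect longer processing times")
```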
If none of the above corrects the problem, contact Cloudera Support and create a support ticket.