Understanding the Workload XM Services Health Check Alerts

Describes the Workload XM cluster services health check alerts and thresholds, and the actions required to resolve problems.

Understanding the Health Check Alert Threshold Colors

Workload XM health check alerts and suggested actions are located on the Workload XM service's health test page. When a health check completes, its result is compared to defined thresholds that determine whether the service element is Good, Concerning, or Bad. For example, when a service's ZooKeeper queue size exceeds the Critical threshold (Red alert) limit, the health check triggers an alert and displays an alert message, the cause, and corrective actions.
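
The comparison logic is simple to picture. The following is a minimal sketch of how a result maps to an alert color; the function name and the threshold values are illustrative, not Workload XM defaults:

  # Minimal sketch: mapping a health check result to an alert color.
  # Threshold values here are illustrative, not Workload XM defaults.
  def classify(result, warning_threshold, critical_threshold):
      """Compare a health check result to its Warning and Critical limits."""
      if result >= critical_threshold:
          return "Red (Bad)"            # Critical threshold exceeded
      if result >= warning_threshold:
          return "Yellow (Concerning)"  # Warning threshold exceeded
      return "Green (Good)"             # within the acceptable range

  # Example: 250,000 queued messages against a 200K Warning limit and a
  # 400K Critical limit is flagged as Concerning.
  print(classify(250_000, 200_000, 400_000))  # Yellow (Concerning)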

For descriptions of the health checks performed on each Workload XM cluster service, click the Related Information link below.

To help you recognize the severity level of a Workload XM health check, results are displayed in the following colors:

Table 1. Health Check Alert Colors
Alert Color   Severity
Green         Good - The health check result is normal and within the acceptable range.
Yellow        Concerning - The health check result has exceeded the Warning threshold limit and indicates a potential problem that should eventually be resolved but does not require immediate attention. See the corrective actions in the Actions and Advice sections.
Red           Bad - The health check result has exceeded the Critical threshold limit and indicates a serious problem that must be resolved immediately. See the corrective actions in the Actions and Advice sections.

For example, if the Hive Audit ZooKeeper queue size has exceeded the Critical threshold limit and the queue can no longer process messages, possible actions are:
  • Change the Hive Audit ZooKeeper queue size for this role instance, which reduces the number of messages in the queue.
  • View the log for the role instance at the time of the health test to see what changed.

Elevated Queue Count

A Workload XM health check alert is triggered when the number of messages in the workload queue exceeds the defined Yellow and Red threshold limits.

Table 2. ZooKeeper Elevated Queue Count
Queue Name       Default Yellow Alert Threshold   Default Red Alert Threshold
SparkEventLog    100K                             200K
PSE              400K                             800K
Other services   200K                             400K
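
One way to watch this yourself is to count the pending entries in a queue and compare the count to the Table 2 defaults. The following Python sketch uses the kazoo client library; the ZooKeeper address and the queue znode path are hypothetical and depend on your deployment:

  from kazoo.client import KazooClient

  # (Yellow, Red) default thresholds from Table 2
  DEFAULTS = {
      "SparkEventLog": (100_000, 200_000),
      "PSE":           (400_000, 800_000),
  }
  OTHER_SERVICES = (200_000, 400_000)

  def queue_alert(zk, queue_name, queue_path):
      pending = len(zk.get_children(queue_path))  # one child znode per pending message (assumed layout)
      yellow, red = DEFAULTS.get(queue_name, OTHER_SERVICES)
      if pending >= red:
          return f"{queue_name}: {pending} pending - Red alert"
      if pending >= yellow:
          return f"{queue_name}: {pending} pending - Yellow alert"
      return f"{queue_name}: {pending} pending - OK"

  zk = KazooClient(hosts="zk-host:2181")  # assumed quorum address
  zk.start()
  print(queue_alert(zk, "SparkEventLog", "/wxm/queues/sparkeventlog"))  # hypothetical path
  zk.stop()
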
To address an alert, consider the following:
  • Check the status of Telemetry Publisher, specifically whether it restarted after a long pause, as this creates a sudden influx of pending workload data records and increases the size of the queue.
  • Check whether any pipelines or ADB services are down, as this prevents the queues from clearing and workloads from being processed.
  • Check whether any new environments, clusters, or workloads are now publishing to your Workload XM cluster, as this could result in new jobs sending large amounts of data at the same time as your existing jobs.
  • Check the health of the ZooKeeper service.
  • Check whether the maximum number of ZooKeeper connections is configured correctly for your environment; a sketch that probes both of these last two checks follows this list.
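
ZooKeeper's four-letter administrative commands report server state and connection counts. The sketch below sends the mntr command over a raw socket; it assumes mntr is whitelisted on the server (4lw.commands.whitelist) and that the host and port match your environment:

  import socket

  def zk_mntr(host="zk-host", port=2181):
      """Return ZooKeeper's mntr output as a dict of metric name -> value."""
      with socket.create_connection((host, port), timeout=5) as sock:
          sock.sendall(b"mntr")
          data = b""
          while chunk := sock.recv(4096):  # server closes the connection when done
              data += chunk
      return dict(line.split("\t", 1) for line in data.decode().splitlines() if "\t" in line)

  stats = zk_mntr()
  # Compare the live connection count to your configured maximum
  # (the maxClientCnxns setting limits connections per client host).
  print(stats.get("zk_server_state"), stats.get("zk_num_alive_connections"))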

If none of the above corrects the problem, contact Cloudera Support and create a support ticket.

Slower Payload Processing Times

A health check alert is triggered when fewer than 75 percent of the service's audit payloads are processed within the defined Yellow and Red time threshold limits.

Table 3. Slower Payload Processing Times
Payload Type   Default Yellow Alert Threshold   Default Red Alert Threshold
All services   30 seconds                       60 seconds
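
The alert condition is easy to reproduce against your own metrics. The following sketch applies the 75 percent rule with the Table 3 defaults; the list of processing times is a hypothetical stand-in for your audit payload measurements:

  def payload_alert(processing_times_sec, threshold_sec):
      """Alert fires when fewer than 75% of payloads finish within the threshold."""
      within = sum(1 for t in processing_times_sec if t <= threshold_sec)
      return within / len(processing_times_sec) < 0.75

  times = [12, 25, 31, 64, 70, 18, 90, 45]          # sample payload timings (seconds)
  print("Yellow alert:", payload_alert(times, 30))  # 30-second Yellow limit -> True
  print("Red alert:", payload_alert(times, 60))     # 60-second Red limit -> True
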
To address an alert, consider the following:
  • Check the number of items in the ZooKeeper queue, as too many items can slow down processing.
  • Check that the HBase Region Servers are in good health.
  • Check that the Phoenix Query Server (PQS) instances are up and running.
  • Check that the Pipeline Server instances are up and running.
  • Check the Pipeline Server payload size metric, which denotes the size of each job and how much data is being sent; an increase in the average payload size leads to longer processing times. A sketch for spotting this trend follows this list.
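
For the payload size check, comparing a recent average to a longer baseline is one simple way to spot growth. In this sketch the metric series and the 1.5x growth factor are assumptions, not Workload XM defaults:

  def payload_size_growing(sizes_bytes, recent_window=10, growth_factor=1.5):
      """Flag when the recent average payload size outgrows the overall baseline."""
      baseline = sum(sizes_bytes) / len(sizes_bytes)
      recent = sizes_bytes[-recent_window:]
      return sum(recent) / len(recent) > growth_factor * baseline

  history = [2_000_000] * 50 + [6_000_000] * 10  # hypothetical payload sizes (bytes)
  print(payload_size_growing(history))           # True: recent payloads are much larger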

If none of the above corrects the problem, contact Cloudera Support and create a support ticket.