Understanding the Workload XM Services Health Check Alerts
Describes the Workload XM cluster service health check alerts and thresholds, and the actions required to resolve a reported problem.
Understanding the Health Check Alert Threshold Colors
Workload XM Health Check Alerts and suggested actions are located on the Workload XM service's health test page. When a health check completes, its result is compared against the defined thresholds, which determine whether the service element is Good, Concerning, or Bad. For example, when a service's ZooKeeper queue size exceeds the Critical threshold (Red Alert) limit, the health check triggers an alert and displays an alert message, the cause, and corrective actions.
For descriptions of the health checks performed on each Workload XM cluster service, click the Related Information link below.
To help you recognize the severity level of a Workload XM health check, the health check results are displayed in the following colors:
Alert Color | Severity |
---|---|
Green | Good - The health check result is normal and within the acceptable range. |
Yellow | Concerning - The health check result has exceeded the Warning threshold limit and indicates a potential problem, which must eventually be resolved but does not require immediate action. See the corrective actions in the Actions and Advice sections. |
Red | Bad - The health check result has exceeded the Critical threshold limit and indicates a serious problem, which must be resolved immediately. See the corrective actions in the Actions and Advice sections. For example, when the Hive Audit ZooKeeper queue size has exceeded the Critical threshold limit and can no longer process messages, possible actions are listed in the Elevated Queue Count section below. |
Elevated Queue Count
A Workload XM health check alert is triggered when the number of messages in the workload queue exceeds the defined yellow and red threshold limits.
Queue Name | Default Yellow Alert Threshold (messages) | Default Red Alert Threshold (messages) |
---|---|---|
SparkEventLog | 100K | 200K |
PSE | 400K | 800K |
Other services | 200K | 400K |
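To see where a queue stands relative to these limits, the following Python sketch counts a queue's pending messages and maps the count to an alert color. It is a minimal sketch only: it assumes the Workload XM queues are ZooKeeper znodes whose children are the pending messages, and the host and znode path shown are hypothetical placeholders for your deployment.

```python
# Minimal sketch: count pending messages in a ZooKeeper-backed queue and
# compare the count against the alert thresholds from the table above.
# The host and znode path are hypothetical placeholders.
from kazoo.client import KazooClient

# Default thresholds (yellow, red) in messages, from the table above.
THRESHOLDS = {
    "SparkEventLog": (100_000, 200_000),
    "PSE": (400_000, 800_000),
}
DEFAULT_THRESHOLDS = (200_000, 400_000)  # all other services

def queue_severity(zk: KazooClient, queue: str, path: str) -> str:
    """Map a queue's pending-message count to an alert color."""
    pending = len(zk.get_children(path))   # one child znode per message (assumption)
    yellow, red = THRESHOLDS.get(queue, DEFAULT_THRESHOLDS)
    if pending >= red:
        return f"{queue}: RED ({pending:,} messages; red limit {red:,})"
    if pending >= yellow:
        return f"{queue}: YELLOW ({pending:,} messages; yellow limit {yellow:,})"
    return f"{queue}: GREEN ({pending:,} messages)"

zk = KazooClient(hosts="zk-host-1:2181")   # hypothetical ZooKeeper quorum
zk.start()
print(queue_severity(zk, "SparkEventLog", "/wxm/queues/SparkEventLog"))  # hypothetical path
zk.stop()
```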
- Check the status of Telemetry Publisher; specifically, whether it restarted after a long pause, as this creates a sudden influx of pending workload data records and increases the size of the queue.
- Check whether any pipelines or ADB services are down, as this prevents the queues from clearing and workloads from being processed.
- Check whether any new environments, clusters, or workloads are now publishing to your Workload XM cluster, as this could result in new jobs sending large amounts of data at the same time as your existing jobs.
- Check the health of the ZooKeeper service.
- Check whether the maximum number of ZooKeeper connections is configured correctly for your environment. A sketch for checking the last two items follows this list.
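For the last two checks, ZooKeeper's standard four-letter admin commands (ruok and mntr) report service health and connection counts. The sketch below sends them over a plain socket; it assumes the commands are whitelisted (4lw.commands.whitelist on recent ZooKeeper versions), and the host and port are placeholders.

```python
# Minimal sketch: query ZooKeeper health and connection statistics with
# the standard four-letter admin commands. Host and port are placeholders.
import socket

def four_letter(host: str, port: int, cmd: bytes) -> str:
    """Send a four-letter admin command to ZooKeeper and return its reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:        # ZooKeeper closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode()

host, port = "zk-host-1", 2181                      # hypothetical
print("health:", four_letter(host, port, b"ruok"))  # prints "imok" when healthy

# mntr reports one tab-separated key/value pair per line.
stats = dict(
    line.split("\t", 1)
    for line in four_letter(host, port, b"mntr").splitlines()
    if "\t" in line
)
print("alive connections:", stats.get("zk_num_alive_connections"))
print("outstanding requests:", stats.get("zk_outstanding_requests"))
# Compare zk_num_alive_connections against the maxClientCnxns setting in
# zoo.cfg to judge whether the connection limit suits your environment.
```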
If none of the above corrects the problem, contact Cloudera Support and create a support ticket.
Slower Payload Processing Times
A health check alert is triggered when less than 75 percent of the service's audit payloads are processed within the defined yellow and red timer threshold limits.
Payload Type | Default Yellow Alert Threshold | Default Red Alert Threshold |
---|---|---|
All services | 30 seconds | 60 seconds |
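To make the trigger condition concrete, the following sketch applies the 75 percent rule to a window of payload processing times. The durations are made-up sample data; how you collect the real measurements is specific to your deployment.

```python
# Minimal sketch of the trigger condition: an alert fires when less than
# 75 percent of audit payloads are processed within the threshold.
YELLOW_S, RED_S = 30, 60      # default thresholds from the table above
REQUIRED_FRACTION = 0.75      # at least 75% must finish within the limit

def payload_severity(durations_s: list[float]) -> str:
    """Map a window of payload processing times (seconds) to an alert color."""
    n = len(durations_s)
    within_yellow = sum(d <= YELLOW_S for d in durations_s) / n
    within_red = sum(d <= RED_S for d in durations_s) / n
    if within_red < REQUIRED_FRACTION:
        return f"RED: only {within_red:.0%} processed within {RED_S}s"
    if within_yellow < REQUIRED_FRACTION:
        return f"YELLOW: only {within_yellow:.0%} processed within {YELLOW_S}s"
    return "GREEN"

print(payload_severity([5, 12, 33, 70, 8, 95, 40, 22]))  # made-up sample data
```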
- Check the number of items in the ZooKeeper queue, as too many items can slow down processing.
- Check that the HBase Region Servers are in good health.
- Check that the Phoenix Query Server (PQS) instances are up and running.
- Check that the Pipeline server instances are up and running.
- Check the Pipeline Server payload size metric, which indicates the size of each job and how much data is being sent; an increase in the average payload size leads to longer processing times. A sketch for spotting this trend follows this list.
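As noted in the last item above, payload size growth is easiest to spot by comparing a recent window of the Pipeline Server payload size metric against a longer baseline. This sketch assumes you have exported the metric as a list of byte sizes; the sample values and the 1.25 alerting ratio are illustrative only.

```python
# Minimal sketch: flag growth in the average payload size by comparing a
# recent window against the preceding baseline. Sample values are made up.
def payload_size_ratio(sizes_bytes: list[int], recent: int = 10) -> float:
    """Return the recent average payload size as a ratio of the baseline average."""
    baseline, window = sizes_bytes[:-recent], sizes_bytes[-recent:]
    return (sum(window) / len(window)) / (sum(baseline) / len(baseline))

samples = [2_000_000] * 50 + [3_500_000] * 10   # hypothetical sizes in bytes
ratio = payload_size_ratio(samples)
if ratio > 1.25:                                # illustrative alerting ratio
    print(f"recent average payload size is {ratio:.0%} of baseline; "
          "expect longer processing times")
```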
If none of the above corrects the problem, contact Cloudera Support and create a support ticket.