YARN Health Tests

YARN Container Usage Aggregation

This health test checks whether Cloudera Manager can successfully aggregate YARN container usage metrics by reading them from HDFS and running a MapReduce job to aggregate them and store them in Service Monitor storage. A failure of this health test indicates that Cloudera Manager cannot generate cluster usage metrics for YARN jobs. If this test fails, look at Service Monitor logs for more details. This test can be enabled or disabled using the Enable Container Usage Metrics Collection YARN monitoring setting. .

Short Name: YARN Container Usage Aggregation

Enable Container Usage Metrics Collection

Description
Enables storing YARN container usage metrics in HDFS and periodically running a MapReduce job from Cloudera Service Monitor to aggregate them into per-application metrics. This is required for YARN usage reporting to work.
Template Name
cm_yarn_enable_container_usage_aggregation
Default Value
CDH=[[CDH 5.7.0..CDH 8.0.0)=false]
Unit(s)
no unit

YARN JobHistory Server Health

This YARN service-level health test checks for the presence of a running, healthy JobHistory Server. The test returns "Bad" health if the service is running and the JobHistory Server is not running. In all other cases it returns the health of the JobHistory Server. A failure of this health test indicates a stopped or unhealthy JobHistory Server. Check the status of the JobHistory Server for more information. This test can be enabled or disabled using the JobHistory Server Role Health Test JobHistory Server service-wide monitoring setting.

Short Name: JobHistory Server Health

JobHistory Server Role Health Test

Description
When computing the overall YARN health, consider JobHistory Server's health
Template Name
yarn_jobhistoryserver_health_enabled
Default Value
true
Unit(s)
no unit

YARN NodeManager Health

This is a YARN service-level health test that checks that enough of the NodeManagers in the cluster are healthy. The test returns "Concerning" health if the number of healthy NodeManagers falls below a warning threshold, expressed as a percentage of the total number of NodeManagers. The test returns "Bad" health if the number of healthy and "Concerning" NodeManagers falls below a critical threshold, expressed as a percentage of the total number of NodeManagers. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 NodeManagers, this test would return "Good" health if 95 or more NodeManagers have good health. This test would return "Concerning" health if at least 90 NodeManagers have either "Good" or "Concerning" health. If more than 10 NodeManagers have bad health, this test would return "Bad" health. A failure of this health test indicates unhealthy NodeManagers. Check the status of the individual NodeManagers for more information. This test can be configured using the YARN YARN service-wide monitoring setting.

Short Name: NodeManager Health

Healthy NodeManager Monitoring Thresholds

Description
The health test thresholds of the overall NodeManager health. The check returns "Concerning" health if the percentage of "Healthy" NodeManagers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" NodeManagers falls below the critical threshold.
Template Name
yarn_nodemanagers_healthy_thresholds
Default Value
critical:90.0, warning:95.0
Unit(s)
PERCENT

YARN ResourceManager Health (HA)

This is a YARN service-level health test that checks for and active, healthy ResourceManager. The test returns "Bad" health if the service is running and an active ResourceManager cannot be found. If an active ResourceManager is found, then the test checks the health of that ResourceManager as well as the health of any standby ResourceManager configured. A "Good" health result will only be returned if both the active and Standby ResourceManagers are healthy. A failure of this health test may indicate stopped or unhealthy ResourceManager roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the ResourceManagers. Check the status of the YARN service's ResourceManager roles and look in the Cloudera Manager Service Monitor's log files for more information when this test fails. This test can be enabled or disabled using the Active ResourceManager Role Health Check YARN service-wide monitoring setting. The check for a healthy standby ResourceManager can be enabled or disabled with Standby ResourceManager Health Check. In addition, the Active ResourceManager Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active ResourceManager before this health test fails, and the ResourceManager Activation Startup Tolerance can be used to adjust the amount of time around ResourceManager startup that the test allows for a ResourceManager to be made active.

Short Name: ResourceManager Health

Active ResourceManager Detection Window

Description
The tolerance window used in YARN service tests that depend on detection of the active ResourceManager.
Template Name
yarn_active_resourcemanager_detection_window
Default Value
CDH=[[CDH 5.0.0..CDH 8.0.0)=3]
Unit(s)
MINUTES

Active ResourceManager Role Health Check

Description
When computing the overall YARN service health, whether to consider the active ResourceManager's health.
Template Name
yarn_resourcemanagers_health_enabled
Default Value
CDH=[[CDH 5.0.0..CDH 8.0.0)=true]
Unit(s)
no unit

ResourceManager Activation Startup Tolerance

Description
The amount of time after ResourceManager(s) start that the lack of an active ResourceManager will be tolerated. This is an advanced option that does not often need to be changed.
Template Name
yarn_resourcemanager_activation_startup_tolerance
Default Value
CDH=[[CDH 5.0.0..CDH 8.0.0)=180]
Unit(s)
SECONDS

Standby ResourceManager Health Check

Description
When computing the overall YARN service health, whether to consider the health of the standby ResourceManager.
Template Name
yarn_standby_resourcemanager_health_enabled
Default Value
CDH=[[CDH 5.0.0..CDH 8.0.0)=true]
Unit(s)
no unit