YARN Health Tests
YARN Container Usage Aggregation
This health test checks whether Cloudera Manager can successfully aggregate YARN container usage metrics by reading them from HDFS and running a MapReduce job to aggregate them and store them in Service Monitor storage. A failure of this health test indicates that Cloudera Manager cannot generate cluster usage metrics for YARN jobs. If this test fails, look at Service Monitor logs for more details. This test can be enabled or disabled using the Enable Container Usage Metrics Collection YARN monitoring setting. .
Short Name: YARN Container Usage Aggregation
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Enable Container Usage Metrics Collection | Enables storing YARN container usage metrics in HDFS and periodically running a MapReduce job from Cloudera Service Monitor to aggregate them into per-application metrics. This is required for YARN usage reporting to work. | cm_yarn_enable_container_usage_aggregation | CDH=[[CDH 5.7.0‥CDH 8.0.0)=false] | no unit |
YARN JobHistory Server Health
This YARN service-level health test checks for the presence of a running, healthy JobHistory Server. The test returns "Bad" health if the service is running and the JobHistory Server is not running. In all other cases it returns the health of the JobHistory Server. A failure of this health test indicates a stopped or unhealthy JobHistory Server. Check the status of the JobHistory Server for more information. This test can be enabled or disabled using the JobHistory Server Role Health Test JobHistory Server service-wide monitoring setting.
Short Name: JobHistory Server Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
JobHistory Server Role Health Test | When computing the overall YARN health, consider JobHistory Server's health | yarn_jobhistoryserver_health_enabled | true | no unit |
YARN NodeManager Health
This is a YARN service-level health test that checks that enough of the NodeManagers in the cluster are healthy. The test returns "Concerning" health if the number of healthy NodeManagers falls below a warning threshold, expressed as a percentage of the total number of NodeManagers. The test returns "Bad" health if the number of healthy and "Concerning" NodeManagers falls below a critical threshold, expressed as a percentage of the total number of NodeManagers. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 NodeManagers, this test would return "Good" health if 95 or more NodeManagers have good health. This test would return "Concerning" health if at least 90 NodeManagers have either "Good" or "Concerning" health. If more than 10 NodeManagers have bad health, this test would return "Bad" health. A failure of this health test indicates unhealthy NodeManagers. Check the status of the individual NodeManagers for more information. This test can be configured using the YARN YARN service-wide monitoring setting.
Short Name: NodeManager Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Healthy NodeManager Monitoring Thresholds | The health test thresholds of the overall NodeManager health. The check returns "Concerning" health if the percentage of "Healthy" NodeManagers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" NodeManagers falls below the critical threshold. | yarn_nodemanagers_healthy_thresholds | critical:90.0, warning:95.0 | PERCENT |
YARN ResourceManager Health (HA)
This is a YARN service-level health test that checks for and active, healthy ResourceManager. The test returns "Bad" health if the service is running and an active ResourceManager cannot be found. If an active ResourceManager is found, then the test checks the health of that ResourceManager as well as the health of any standby ResourceManager configured. A "Good" health result will only be returned if both the active and Standby ResourceManagers are healthy. A failure of this health test may indicate stopped or unhealthy ResourceManager roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the ResourceManagers. Check the status of the YARN service's ResourceManager roles and look in the Cloudera Manager Service Monitor's log files for more information when this test fails. This test can be enabled or disabled using the Active ResourceManager Role Health Check YARN service-wide monitoring setting. The check for a healthy standby ResourceManager can be enabled or disabled with Standby ResourceManager Health Check. In addition, the Active ResourceManager Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active ResourceManager before this health test fails, and the ResourceManager Activation Startup Tolerance can be used to adjust the amount of time around ResourceManager startup that the test allows for a ResourceManager to be made active.
Short Name: ResourceManager Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Active ResourceManager Detection Window | The tolerance window used in YARN service tests that depend on detection of the active ResourceManager. | yarn_active_resourcemanager_detection_window | CDH=[[CDH 5.0.0‥CDH 8.0.0)=3] | MINUTES |
Active ResourceManager Role Health Check | When computing the overall YARN service health, whether to consider the active ResourceManager's health. | yarn_resourcemanagers_health_enabled | CDH=[[CDH 5.0.0‥CDH 8.0.0)=true] | no unit |
ResourceManager Activation Startup Tolerance | The amount of time after ResourceManager(s) start that the lack of an active ResourceManager will be tolerated. This is an advanced option that does not often need to be changed. | yarn_resourcemanager_activation_startup_tolerance | CDH=[[CDH 5.0.0‥CDH 8.0.0)=180] | SECONDS |
Standby ResourceManager Health Check | When computing the overall YARN service health, whether to consider the health of the standby ResourceManager. | yarn_standby_resourcemanager_health_enabled | CDH=[[CDH 5.0.0‥CDH 8.0.0)=true] | no unit |