Spark (Standalone) Health Tests

Spark (Standalone) History Server Health

This Spark (Standalone) service-level health test checks for the presence of a running, healthy History Server. The test returns "Bad" health if the service is running and the History Server is not running. In all other cases it returns the health of the History Server. A failure of this health test indicates a stopped or unhealthy History Server. Check the status of the History Server for more information. This test can be enabled or disabled using the History Server Role Health Test service-wide monitoring setting.

Short Name: History Server Health

Property Name: History Server Role Health Test
Description: When computing the overall SPARK health, consider History Server's health
Template Name: SPARK_HISTORY_SERVER_health_enabled
Default Value: true
Unit: no unit
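
The decision rule described for this test can be sketched in a few lines of Python. This is an illustrative sketch only, not Cloudera Manager's implementation; the function name and its parameters are hypothetical, and health values are represented as plain strings.

```python
def history_server_presence_health(service_running, role_running, role_health):
    """Sketch of the service-level History Server health rule.

    All names here are hypothetical; they only mirror the prose description.
    service_running: True if the Spark (Standalone) service is running.
    role_running:    True if the History Server role is running.
    role_health:     the History Server's own health, e.g. "Good".
    """
    # "Bad" only when the service is up but the History Server is not.
    if service_running and not role_running:
        return "Bad"
    # In all other cases, pass through the role's own health.
    return role_health


print(history_server_presence_health(True, False, "Good"))
print(history_server_presence_health(True, True, "Concerning"))
```

The first call models a running service with a stopped History Server; the second models a running History Server whose own health is passed through unchanged.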

Spark (Standalone) Master Health

This Spark (Standalone) service-level health test checks for the presence of a running, healthy Master. The test returns "Bad" health if the service is running and the Master is not running. In all other cases it returns the health of the Master. A failure of this health test indicates a stopped or unhealthy Master. Check the status of the Master for more information. This test can be enabled or disabled using the Master Role Health Test service-wide monitoring setting.

Short Name: Master Health

Property Name: Master Role Health Test
Description: When computing the overall SPARK health, consider Master's health
Template Name: SPARK_MASTER_health_enabled
Default Value: true
Unit: no unit

Spark (Standalone) Worker Health

This Spark (Standalone) service-level health test checks that enough of the Workers in the cluster are healthy. The test returns "Concerning" health if the number of healthy Workers falls below a warning threshold, expressed as a percentage of the total number of Workers. The test returns "Bad" health if the number of healthy and "Concerning" Workers falls below a critical threshold, expressed as a percentage of the total number of Workers. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 Workers, it returns "Good" health if 95 or more Workers have "Good" health. It returns "Concerning" health if fewer than 95 Workers have "Good" health but at least 90 Workers have either "Good" or "Concerning" health. If more than 10 Workers have "Bad" health, it returns "Bad" health. A failure of this health test indicates unhealthy Workers. Check the status of the individual Workers for more information. This test can be configured using the Healthy Worker Monitoring Thresholds Spark (Standalone) service-wide monitoring setting.

Short Name: Worker Health

Property Name: Healthy Worker Monitoring Thresholds
Description: The health test thresholds of the overall Worker health. The check returns "Concerning" health if the percentage of "Healthy" Workers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" Workers falls below the critical threshold.
Template Name: SPARK_WORKER_healthy_thresholds
Default Value: critical:80.0, warning:90.0
Unit: PERCENT
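
The threshold arithmetic for this test can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not Cloudera Manager's implementation: the function name and parameters are hypothetical, the critical check is assumed to take precedence over the warning check, and an empty cluster is rejected because the documentation does not describe that case.

```python
def worker_health(good, concerning, bad, warning_pct=90.0, critical_pct=80.0):
    """Sketch of the aggregate Worker health rule (hypothetical names).

    good, concerning, bad: counts of Workers in each health state.
    warning_pct, critical_pct: thresholds as percentages of all Workers;
    the defaults mirror the documented default of critical:80.0, warning:90.0.
    """
    total = good + concerning + bad
    if total == 0:
        # Assumption: behavior with no Workers is not documented.
        raise ValueError("no Workers to evaluate")

    good_pct = 100.0 * good / total
    good_or_concerning_pct = 100.0 * (good + concerning) / total

    # Assumption: the critical threshold is checked first.
    if good_or_concerning_pct < critical_pct:
        return "Bad"
    if good_pct < warning_pct:
        return "Concerning"
    return "Good"


# The 100-Worker example with warning 95% and critical 90%.
print(worker_health(95, 0, 5, warning_pct=95.0, critical_pct=90.0))
print(worker_health(92, 3, 5, warning_pct=95.0, critical_pct=90.0))
print(worker_health(85, 4, 11, warning_pct=95.0, critical_pct=90.0))
```

With those thresholds, the three calls correspond to 95 "Good" Workers, 92 "Good" plus 3 "Concerning" Workers, and 11 "Bad" Workers respectively.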