ZooKeeper Health Tests

ZooKeeper Canary

This is a ZooKeeper service-level health test that checks that basic client operations are working and complete in a reasonable amount of time. The test reports the results of a periodic "canary" test that performs the following sequence of operations. First, it connects to the ZooKeeper service, establishes a session (the root session), and creates a permanent znode to serve as the root of all canary operations. It then connects to and establishes a session (a child session) with each ZooKeeper server of the service. Each child session is used to create an ephemeral child znode under the canary root. After the child znodes have been created, each child session registers watches that await deletion events on each of the child znodes. The canary test then deletes each of the child znodes and verifies that each child session has received a deletion notification for each of the child znodes. Finally, the canary test closes all the child sessions, deletes the root znode, and closes the root session.

The test returns "Bad" health if establishing the root session with the ZooKeeper service fails, creating a znode (permanent or ephemeral) fails, deleting a znode fails, or retrieving the child znodes of the root znode fails. The test returns "Concerning" health when the canary test succeeds but one or more servers could not participate in the canary test operations, or when the canary test runs too slowly.

A failure of this health test may indicate that ZooKeeper is failing to satisfy client requests correctly or in a timely fashion. Check the status of the ZooKeeper servers, and look in the ZooKeeper server logs for more details. This test can be enabled or disabled using the ZooKeeper Canary Health Check ZooKeeper service monitoring setting. The ZooKeeper Canary Root Znode Path, ZooKeeper Canary Connection Timeout, ZooKeeper Canary Session Timeout, and ZooKeeper Canary Operation Timeout settings control the operation of the canary.
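
The mapping from canary outcomes to health states described above can be sketched as a small decision function. This is an illustrative sketch only, not the product's implementation: the CanaryRun record, its field names, and canary_health are hypothetical, the "Good" result for a fully successful run is inferred rather than stated, and "too slow" is taken as a precomputed flag because the text does not say how slowness is measured.

```python
from dataclasses import dataclass


@dataclass
class CanaryRun:
    """Hypothetical summary of one canary run (all field names are assumptions)."""
    root_session_ok: bool    # root session was established
    znode_create_ok: bool    # permanent and ephemeral znodes were created
    znode_delete_ok: bool    # znodes were deleted
    list_children_ok: bool   # child znodes of the root could be retrieved
    servers_skipped: int     # servers that could not participate
    too_slow: bool           # run exceeded an (unspecified) duration limit


def canary_health(run: CanaryRun) -> str:
    """Return "Good", "Concerning", or "Bad" for a canary run."""
    # Any failed core operation makes the result "Bad".
    if not (run.root_session_ok and run.znode_create_ok
            and run.znode_delete_ok and run.list_children_ok):
        return "Bad"
    # The run succeeded, but with missing servers or poor timing.
    if run.servers_skipped > 0 or run.too_slow:
        return "Concerning"
    # Assumed: a clean, timely run reports "Good".
    return "Good"


if __name__ == "__main__":
    print(canary_health(CanaryRun(True, True, True, True, 0, False)))
    print(canary_health(CanaryRun(True, True, True, True, 1, False)))
    print(canary_health(CanaryRun(True, True, False, True, 0, False)))
```

The "Bad" conditions are checked first, so a run that both fails an operation and has non-participating servers is reported as "Bad" rather than "Concerning".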

Short Name: ZooKeeper Canary

ZooKeeper Canary Connection Timeout

Description
Configures the timeout used by the canary for connection establishment with ZooKeeper servers
Template Name
zookeeper_canary_connection_timeout
Default Value
10000
Unit(s)
MILLISECONDS

ZooKeeper Canary Health Check

Description
Enables the health check that a client can connect to ZooKeeper and perform basic operations
Template Name
zookeeper_canary_health_enabled
Default Value
true
Unit(s)
no unit

ZooKeeper Canary Operation Timeout

Description
Configures the timeout used by the canary for ZooKeeper operations
Template Name
zookeeper_canary_operation_timeout
Default Value
30000
Unit(s)
MILLISECONDS

ZooKeeper Canary Root Znode Path

Description
Configures the path of the root znode under which all canary updates are performed
Template Name
zookeeper_canary_root_path
Default Value
/cloudera_manager_zookeeper_canary
Unit(s)
no unit

ZooKeeper Canary Session Timeout

Description
Configures the timeout used by the canary for sessions with ZooKeeper servers
Template Name
zookeeper_canary_session_timeout
Default Value
30000
Unit(s)
MILLISECONDS

ZooKeeper Server Health

This is a ZooKeeper service-level health test that checks that enough of the Servers in the cluster are healthy. The test returns "Concerning" health if the number of healthy Servers falls below a warning threshold, expressed as a percentage of the total number of Servers. The test returns "Bad" health if the number of healthy and "Concerning" Servers falls below a critical threshold, expressed as a percentage of the total number of Servers. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 Servers, it returns "Good" health if 95 or more Servers have "Good" health. It returns "Concerning" health if fewer than 95 Servers have "Good" health but at least 90 Servers have either "Good" or "Concerning" health. It returns "Bad" health if more than 10 Servers have "Bad" health. A failure of this health test indicates unhealthy Servers. Check the status of the individual Servers for more information. This test can be configured using the Healthy Server Monitoring Thresholds ZooKeeper service-wide monitoring setting.
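
The threshold rules above can be worked through with a short function. This is a minimal sketch, not the product's implementation: the function name server_health and its parameters are hypothetical, and it assumes every Server is in exactly one of the "Good", "Concerning", or "Bad" states (other states are not modeled).

```python
def server_health(good: int, concerning: int, bad: int,
                  warning_pct: float, critical_pct: float) -> str:
    """Classify overall Server health from per-Server state counts.

    Thresholds are percentages of the total number of Servers.
    """
    total = good + concerning + bad
    if total == 0:
        raise ValueError("at least one Server is required")

    good_pct = 100.0 * good / total
    good_or_concerning_pct = 100.0 * (good + concerning) / total

    # "Bad": too few Servers are "Good" or "Concerning".
    if good_or_concerning_pct < critical_pct:
        return "Bad"
    # "Concerning": too few Servers are "Good".
    if good_pct < warning_pct:
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    # The 100-Server example with warning 95% and critical 90%.
    print(server_health(95, 0, 5, warning_pct=95.0, critical_pct=90.0))
    print(server_health(92, 0, 8, warning_pct=95.0, critical_pct=90.0))
    print(server_health(85, 4, 11, warning_pct=95.0, critical_pct=90.0))
```

With thresholds of critical 51% and warning 99%, a three-Server ensemble with one "Bad" Server would be classified as "Concerning" under this sketch (about 66.7% "Good"), and as "Bad" with two "Bad" Servers (about 33.3%).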

Short Name: Server Health

Healthy Server Monitoring Thresholds

Description
The health test thresholds of the overall Server health. The check returns "Concerning" health if the percentage of "Healthy" Servers falls below the warning threshold. The check returns "Bad" health if the combined percentage of "Healthy" and "Concerning" Servers falls below the critical threshold.
Template Name
zookeeper_servers_healthy_thresholds
Default Value
critical:51.0, warning:99.0
Unit(s)
PERCENT