Host Health Tests

Host Agent Log Directory

This is a host health test that checks that the filesystem containing the Cloudera Manager Agent's log directory has sufficient free space. This test can be configured using the Cloudera Manager Agent Log Directory Free Space Monitoring Absolute Thresholds and Cloudera Manager Agent Log Directory Free Space Monitoring Percentage Thresholds host monitoring settings.

Short Name: Agent Log Directory

Cloudera Manager Agent Log Directory Free Space Monitoring Absolute Thresholds

Description
The health check thresholds for monitoring of free space on the filesystem that contains the Cloudera Manager Agent's log directory.
Template Name
host_agent_log_directory_free_space_absolute_thresholds
Default Value
critical:1.073741824E9, warning:2.147483648E9
Unit(s)
BYTES

Cloudera Manager Agent Log Directory Free Space Monitoring Percentage Thresholds

Description
The health check thresholds for monitoring of free space on the filesystem that contains the Cloudera Manager Agent's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Log Directory Free Space Monitoring Absolute Thresholds setting is configured.
Template Name
host_agent_log_directory_free_space_percentage_thresholds
Default Value
critical:never, warning:never
Unit(s)
PERCENT

Host Agent Parcel Directory

This is a host health test that checks whether the filesystem containing the Cloudera Manager Agent's parcel directory has sufficient free space. This test can be configured using the Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute Thresholds and Cloudera Manager Agent Parcel Directory Free Space Monitoring Percentage Thresholds host monitoring settings.

Short Name: Agent Parcel Directory

Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute Thresholds

Description
The health check thresholds for monitoring of free space on the filesystem that contains the Cloudera Manager Agent's parcel directory.
Template Name
host_agent_parcel_directory_free_space_absolute_thresholds
Default Value
critical:5.36870912E9, warning:1.073741824E10
Unit(s)
BYTES

Cloudera Manager Agent Parcel Directory Free Space Monitoring Percentage Thresholds

Description
The health check thresholds for monitoring of free space on the filesystem that contains the Cloudera Manager Agent's parcel directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute Thresholds setting is configured.
Template Name
host_agent_parcel_directory_free_space_percentage_thresholds
Default Value
critical:never, warning:never
Unit(s)
PERCENT

Host Agent Process Directory

This is a host health test that checks that the filesystem containing the Cloudera Manager Agent's process directory has sufficient free space. The process directory contains the configuration files for the processes which the Cloudera Manager Agent starts. This test can be configured using the Cloudera Manager Agent Process Directory Free Space Monitoring Absolute Thresholds and Cloudera Manager Agent Process Directory Free Space Monitoring Percentage Thresholds host monitoring settings.

Short Name: Agent Process Directory

Cloudera Manager Agent Process Directory Free Space Monitoring Absolute Thresholds

Description
The health check thresholds for monitoring of free space on the filesystem that contains the Cloudera Manager Agent's process directory.
Template Name
host_agent_process_directory_free_space_absolute_thresholds
Default Value
critical:1.048576E8, warning:2.097152E8
Unit(s)
BYTES

Cloudera Manager Agent Process Directory Free Space Monitoring Percentage Thresholds

Description
The health check thresholds for monitoring of free space on the filesystem that contains the Cloudera Manager Agent's process directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Process Directory Free Space Monitoring Absolute Thresholds setting is configured.
Template Name
host_agent_process_directory_free_space_percentage_thresholds
Default Value
critical:never, warning:never
Unit(s)
PERCENT

Host Agent Status

This is a host health test that checks that the host's Cloudera Manager Agent is heartbeating correctly to the Cloudera Manager Server, is in contact with the Host Monitor, and has the correct software version. A failure of this health test may indicate a lack of connectivity between the host's Cloudera Manager Agent and the Cloudera Manager Server, a lack of connectivity betwen the host's Cloudera Manager Agent and the Host Monitor, or that the Cloudera Manager Agent or Host Monitor software is out of date. Check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the host, or look in the host's Cloudera Manager Agent logs for more details. If this test reports a software version mismatch between the Cloudera Manager Agent and the Host Monitor, check the version of each component by consulting the appropriate logs or the appropriate status web pages. This test can be enabled or disabled using the Host Process Health Test host configuration setting.

Short Name: Agent Status

Host Process Health Test

Description
Enables the health test that the host's process state is consistent with the role configuration
Template Name
host_scm_health_enabled
Default Value
true
Unit(s)
no unit

Host Certificate Expiration

This health test checks the expiry time of the TLS certificate for the Cloudera Manager Agent on the given host. This test can be configured using the Cloudera Manager Agent TLS Certificate Expiry Thresholds host monitoring setting. This test alerts when the certificate is close to expiry.

Short Name: Certificate Expiration

Cloudera Manager Agent TLS Certificate Expiry Thresholds

Description
The health test thresholds for monitoring the certificate of Cloudera Manager Agent.
Template Name
host_agent_certificate_expiry_thresholds
Default Value
critical:7.0, warning:60.0
Unit(s)
DAYS

Host Clock Offset

This is a host health test that checks whether the host's system clock appears to be out-of-sync with its NTP server(s). The test uses either the 'ntpq -np' (if ntpd is running) or 'chronyc sources' (if chronyd is running) command to check that whether the host is synchronized to an NTP peer and that the absolute value of the host's clock offset from that peer is not too large. The test returns "Bad" health If the command fails to find a synchronized NTP peer or an offset between the peer and the host time is too large. To troubleshoot, run 'ntpq -np' or 'chronyc sources' on the host. The output should contain a row for each of the host's NTP servers. The row starting with a '*' (if ntpdc) or '^*' (if chronyc) contains the peer to which the host is currently synchronized. If no row starts with a '*' or '^*', this indicates that the host is not currently synchronized. Make sure you have a pool of valid ntp servers in your ntp.conf (if ntpdc) or chrony.conf (if chronyc). Make sure that UDP port 123 is open on any firewall that is in use. Check the system log for ntpd or chronyd messages related to configuration errors. If the issue persists or the offset between the peer and the host time is too large, please seek assistance from your system administrator. If NTP is not in use on the host, disable this check by setting both 'Warning' and 'Critical' thresholds to 'NEVER' using the configuration options shown below for the host. Cloudera recommends using NTP for time synchronization of Hadoop clusters. This test can be configured using the Host Clock Offset Thresholds host configuration setting.

Short Name: Clock Offset

Host Clock Offset Thresholds

Description
The thresholds for the host clock offset health test. The test compares this threshold against the absolute value of the clock offset reported by the host's NTP service from the 'ntpdc -np' or 'chronyc sources' command. Setting both the warning and critical threshold values to never turns off collection of the clock offset by the Cloudera Manager Agent. If no time synchronization is in use, both threshold values should be set to never. Cloudera recommends using NTP for time synchronization of Hadoop clusters.
Template Name
host_clock_offset_thresholds
Default Value
critical:10000.0, warning:3000.0
Unit(s)
MILLISECONDS

Host DNS Resolution

This is a host health test that checks that the host's hostname and canonical name are consistent when checked from a Java process, and that the DNS resolution completes in a timely manner. The DNS resolution duration is calculated by measuring the time taken to call getLocalHost in a Java process on this host. Note that DNS information may be cached on the host, and this caching can affect the reported resolution duration. A failure of this health test may indicate that the host's DNS configuration is not correct or the host's DNS server is responding slowly. Check the Cloudera Manager Agent log for the names that were detected by this test or for errors running the Java process. The hostname and canonical name are considered to be consistent if the hostname or the hostname plus a domain name is the same as the canonical name. This health test uses domain names from the domain and search lines in /etc/resolv.conf. This health test does not consult /etc/nsswitch.conf and may give incorrect results if /etc/resolv.conf is not used by the host. There may be a delay of up to 5 minutes before this health test picks up changes to /etc/resolv.conf. This test can be configured using the Hostname and Canonical Name Health Check host configuration setting.

Short Name: DNS Resolution

Hostname and Canonical Name Health Check

Description
Whether the hostname and canonical names for this host are consistent when checked from a Java process.
Template Name
host_dns_resolution_enabled
Default Value
true
Unit(s)
no unit

Host Entropy

This is a health test that checks that the host in a cluster has sufficient entropy. A failure of this health may indicate that entropy pool is depleted, which might cause applications waiting for the randomness to pause until more entropy is available. Random connection timeout issues usually happen when the available entropy count reaches zero. Increase the entropy using rng-tools for the host. This test can be configured using the Host Entropy Thresholds host configuration settings.

Short Name: Entropy

Host Entropy Thresholds

Description
The health check thresholds for the available entropy on the host.
Template Name
host_available_entropy_thresholds
Default Value
critical:50.0, warning:100.0
Unit(s)
no unit

Host Frame Errors

This is a host health test that checks for network frame errors across all network interfaces. A failure of this health test may indicate a problem with network hardware (e.g. switches) and can potentially cause other service or role-level performance problems. Check the host and network hardware logs for more details. This test can be configured using the Host Network Frame Error Percentage Thresholds, Host Network Frame Error Check Window, Host Network Frame Error Test Minimum Required Packets host configuration settings.

Short Name: Frame Errors

Host Network Frame Error Check Window

Description
The amount of time over which the host frame error checks for frame errors.
Template Name
host_network_frame_errors_window
Default Value
15
Unit(s)
MINUTES

Host Network Frame Error Percentage Thresholds

Description
The health check thresholds for the percentage of received packets that are frame errors.
Template Name
host_network_frame_errors_thresholds
Default Value
critical:0.5, warning:any
Unit(s)
PERCENT

Host Network Frame Error Test Minimum Required Packets

Description
The minimum number of received packets that must be received within the test window for this test to return "Bad" health. If less that this number of packets is received during the test window, the health check will never return "Bad" health.
Template Name
host_network_frame_errors_floor
Default Value
0
Unit(s)
no unit

Host Network Interface Speed

This is a host health test that checks for network interfaces that appear to be operating at less than full speed. The Cloudera Manager Agent uses ethtool ioctl to determine the network interface speed and duplex mode. A failure of this health test may indicate that network interface(s) may be configured incorrectly and may be causing performance problems. Use the ethtool command to check and configure the host's network interfaces to use the fastest available link speed and duplex mode. For inactive interfaces or for interfaces that do not support ethtool ioctl, the Cloudera Manager Agent cannot collect speed metrics. If the Cloudera Manager Agent fails to collect speed metrics for all network interfaces on the host, the test is disabled. This test can be configured using the Host's Network Interfaces Slow Link Modes Thresholds, Network Interface Expected Link Speed and Network Interface Expected Duplex Mode host configuration settings.

Short Name: Network Interface Speed

Host's Network Interfaces Slow Link Modes Thresholds

Description
The thresholds for the health check of the number of network interfaces that appear to be operating at less than full speed.
Template Name
host_network_interfaces_slow_mode_thresholds
Default Value
critical:never, warning:any
Unit(s)
no unit

Network Interface Expected Duplex Mode

Description
The expected duplex mode for network interfaces.
Template Name
host_nic_expected_duplex_mode
Default Value
Full
Unit(s)
no unit

Network Interface Expected Link Speed

Description
The expected network interface link speed.
Template Name
host_nic_expected_speed
Default Value
1000
Unit(s)
no unit

Host Swapping

This is a health test that checks that the host has not swapped out more than a certain number of pages over the last fifteen minutes. A failure of this health test may indicate misconfiguration of the host operating system, or too many processes running on the host. Try reducing vm.swappiness, or add more memory to the host. This test can be configured using the Host Memory Swapping Thresholds, Host Memory Swapping Check Window host configuration settings.

Short Name: Swapping

Host Memory Swapping Check Window

Description
The amount of time over which the memory swapping test checks for pages swapped.
Template Name
host_memswap_window
Default Value
15
Unit(s)
MINUTES

Host Memory Swapping Thresholds

Description
The health test thresholds of the number of pages swapped out on the host in the last 15 minutes
Template Name
host_memswap_thresholds
Default Value
critical:never, warning:200.0
Unit(s)
PAGES