HDFS Health Tests

HDFS Canary

This is an HDFS service-level health test that checks that basic client operations are working and are completing in a reasonable amount of time. This test reports the results of a periodic "canary" test that performs the following sequence of operations. First, it creates a file. By default, the path is /tmp/.cloudera_health_monitoring_canary_timestamp. The canary test then writes a small amount of data to that file, reads that data back, and verifies that the data is correct. Lastly, the canary test removes the created file. The test returns "Bad" health if any of the basic operations fail. The test returns "Concerning" health if the canary test runs too slowly. A failure of this health test may indicate that the cluster is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the NameNode and other service-level health tests. Look in the Service Monitor logs for log messages from the canary test. Look in the NameNode logs for more details about the processing of the canary test requests. This test can be enabled or disabled using the HDFS Canary Health Check HDFS service-wide monitoring setting.

Short Name: HDFS Canary

HDFS Canary Health Check

Description
Enables the health check that verifies that a client can create, read, write, and delete files.
Template Name
hdfs_canary_health_enabled
Default Value
true
Unit(s)
no unit
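
The canary sequence described above (create, write, read back, verify, remove, with a time limit) can be sketched as follows. This is only an illustration under stated assumptions: it runs against a local directory as a stand-in for HDFS, and the function name run_canary, the payload, and the slow_seconds limit are hypothetical, not the Service Monitor's actual implementation or settings.

```python
import os
import tempfile
import time


def run_canary(directory, payload=b"canary", slow_seconds=5.0):
    """Create, write, read back, verify, and remove a small file.

    Local-filesystem stand-in for the HDFS canary; the names and the
    slow_seconds limit are hypothetical.
    """
    path = os.path.join(directory, ".canary_%d" % time.time_ns())
    start = time.monotonic()
    try:
        # Create the file (mode "x" fails if it already exists) and write.
        with open(path, "xb") as f:
            f.write(payload)
        # Read the data back and check that it matches what was written.
        with open(path, "rb") as f:
            data_ok = f.read() == payload
        # Remove the created file.
        os.remove(path)
    except OSError:
        return "Bad"  # a basic operation failed
    if not data_ok:
        return "Bad"  # data read back did not match
    elapsed = time.monotonic() - start
    return "Concerning" if elapsed > slow_seconds else "Good"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_canary(d))
```

Any failed operation maps to "Bad", while a run that succeeds but exceeds the time limit maps to "Concerning", mirroring the two outcomes in the test description.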

HDFS Corrupt Blocks

This is an HDFS service-level health test that checks that the number of corrupt blocks does not rise above some percentage of the cluster's total blocks. A block is called corrupt by HDFS if it has at least one corrupt replica along with at least one live replica. As such, a corrupt block does not indicate unavailable data, but it does indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS, not a corrupt block. HDFS automatically fixes corrupt blocks in the background. A failure of this health test may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Blocks With Corrupt Replicas Monitoring Thresholds HDFS service-wide monitoring setting.

Short Name: Corrupt Blocks

Blocks With Corrupt Replicas Monitoring Thresholds

Description
The health check thresholds of the number of blocks that have at least one corrupt replica. Specified as a percentage of the total number of blocks.
Template Name
hdfs_blocks_with_corrupt_replicas_thresholds
Default Value
critical:1.0, warning:0.5
Unit(s)
PERCENT
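
As a worked illustration of how a threshold pair such as critical:1.0, warning:0.5 might be applied, the sketch below converts block counts to a percentage and compares it against both thresholds. The function names are hypothetical, and treating a value exactly equal to a threshold as not exceeding it is an assumption based on the "does not rise above" wording, not a documented rule.

```python
def parse_thresholds(spec):
    """Parse a string like 'critical:1.0, warning:0.5' into a dict of floats."""
    thresholds = {}
    for part in spec.split(","):
        name, value = part.strip().split(":")
        thresholds[name] = float(value)
    return thresholds


def corrupt_block_health(corrupt, total, spec="critical:1.0, warning:0.5"):
    """Classify health from the percentage of blocks with corrupt replicas.

    Assumption: a threshold is exceeded only when the percentage is
    strictly greater than it.
    """
    thresholds = parse_thresholds(spec)
    pct = 100.0 * corrupt / total if total else 0.0
    if pct > thresholds["critical"]:
        return "Bad"
    if pct > thresholds["warning"]:
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    # 6 of 1000 blocks is 0.6%, above the warning threshold only.
    print(corrupt_block_health(6, 1000))
```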

HDFS DataNode Health

This is an HDFS service-level health test that checks that enough of the DataNodes in the cluster are healthy. The test returns "Concerning" health if the number of healthy DataNodes falls below a warning threshold, expressed as a percentage of the total number of DataNodes. The test returns "Bad" health if the number of healthy and "Concerning" DataNodes falls below a critical threshold, expressed as a percentage of the total number of DataNodes. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 DataNodes, this test would return "Good" health if 95 or more DataNodes have good health. This test would return "Concerning" health if fewer than 95 DataNodes have good health but at least 90 DataNodes have either "Good" or "Concerning" health. If more than 10 DataNodes have bad health, this test would return "Bad" health. A failure of this health test indicates unhealthy DataNodes. Check the status of the individual DataNodes for more information. This test can be configured using the Healthy DataNode Monitoring Thresholds HDFS service-wide monitoring setting.

Short Name: DataNode Health

Healthy DataNode Monitoring Thresholds

Description
The health test thresholds of the overall DataNode health. The check returns "Concerning" health if the percentage of DataNodes with "Good" health falls below the warning threshold. The check returns "Bad" health if the total percentage of DataNodes with "Good" or "Concerning" health falls below the critical threshold.
Template Name
hdfs_datanodes_healthy_thresholds
Default Value
critical:90.0, warning:95.0
Unit(s)
PERCENT
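
The 100-DataNode example above can be reproduced with a short sketch. The function name datanode_health is hypothetical; the default thresholds are the documented defaults (warning 95.0, critical 90.0), and the comparison treats a percentage equal to a threshold as satisfying it, as in the example.

```python
def datanode_health(good, concerning, total, warning=95.0, critical=90.0):
    """Classify service-level DataNode health from per-DataNode health counts.

    good / concerning: number of DataNodes in each state; every other
    DataNode is counted as having bad health.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct_good = 100.0 * good / total
    pct_good_or_concerning = 100.0 * (good + concerning) / total
    if pct_good >= warning:
        return "Good"
    if pct_good_or_concerning >= critical:
        return "Concerning"
    return "Bad"


if __name__ == "__main__":
    print(datanode_health(95, 0, 100))   # 95 of 100 DataNodes have good health
    print(datanode_health(80, 10, 100))  # 90 have good or concerning health
    print(datanode_health(80, 9, 100))   # 11 have bad health
```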

HDFS Erasure Coding Policy Verification Test

There is a high risk of losing data if the offending erasure coding policy is used. For more information, see Best Practices for Rack and Node Setup for EC. Do one of the following to clear this health issue:
  • Add more DataNodes or Racks to your Cluster, if you want to use all the enabled erasure coding policies.
  • Stop using the offending erasure coding policy:
    • Change the default erasure coding policy, if the offending policy is set as the default policy: Go to the HDFS configuration page, search for 'Fallback Erasure Coding Policy', and select one that is compliant with your existing cluster setup.
    • Disable the offending erasure coding policy: Use the 'hdfs ec -disablePolicy -policy <policyName>' command.
  • Disable the health test: Go to the HDFS configuration page, search for 'Erasure Coding Policy Verification Health Check', and unselect it. This is not recommended, because there is a high risk of losing data if the offending erasure coding policy is used.

Short Name: Erasure Coding Policy Verification Test

Erasure Coding Policy Verification Health Check

Description
Enables the health test for verifying if the cluster topology supports all the enabled erasure coding policies.
Template Name
hdfs_verify_ec_with_topology_enabled
Default Value
false
Unit(s)
no unit

Fallback Erasure Coding Policy

Description
The fallback Erasure Coding policy that HDFS uses if no policy is specified when you run the -setPolicy command.
Template Name
erasure_coding_default_policy
Default Value
CDH=[[CDH 6.0.0..CDH 8.0.0)=RS_6_3_1024K]
Unit(s)
no unit
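
Policy names such as RS_6_3_1024K encode the codec, the number of data units, the number of parity units, and the cell size. The sketch below splits such a name into its parts; the function name parse_ec_policy and the returned field names are hypothetical. Reading the sum of data and parity units as the smallest number of DataNodes a policy needs is an assumption here; see Best Practices for Rack and Node Setup for EC for the authoritative requirements.

```python
import re


def parse_ec_policy(name):
    """Split a policy name like 'RS_6_3_1024K' into its components."""
    match = re.fullmatch(r"(.+)_(\d+)_(\d+)_(\d+)[kK]", name)
    if match is None:
        raise ValueError("unrecognized policy name: %r" % name)
    codec, data, parity, cell = match.groups()
    data_units, parity_units = int(data), int(parity)
    return {
        "codec": codec,
        "data_units": data_units,
        "parity_units": parity_units,
        "cell_size_kib": int(cell),
        # Assumption: each stripe spans data + parity units, so at least
        # this many DataNodes are needed to place one block group.
        "stripe_width": data_units + parity_units,
    }


if __name__ == "__main__":
    print(parse_ec_policy("RS_6_3_1024K"))
```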

HDFS Failover Controllers Health

This is an HDFS service-level health test that checks that all the Failover Controllers associated with this service are healthy and running. The test returns "Bad" health if any of the Failover Controllers that the service depends on is unhealthy or not running. Check the Failover Controller logs for more details. This test can be enabled or disabled using the Failover Controllers Healthy service-wide monitoring setting.

Short Name: Failover Controllers Health

Failover Controllers Healthy

Description
Enables the health check that verifies that the failover controllers associated with this service are healthy and running.
Template Name
failover_controllers_healthy_enabled
Default Value
true
Unit(s)
no unit

HDFS Free Space

This is an HDFS service-level health test that checks that the amount of free space in the HDFS cluster does not fall below some percentage of total configured capacity. A failure of this health test may indicate a capacity planning problem, or a loss of DataNodes. An unfinalized HDFS metadata upgrade can also significantly decrease free space. This test can be configured using the HDFS Free Space Monitoring Thresholds HDFS service-wide monitoring setting.

Short Name: Free Space

HDFS Free Space Monitoring Thresholds

Description
The health check thresholds of free space in HDFS. Specified as a percentage of total HDFS capacity.
Template Name
hdfs_free_space_thresholds
Default Value
critical:10.0, warning:20.0
Unit(s)
PERCENT
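
Unlike the block-count tests, the free space thresholds are lower bounds: health degrades as the percentage falls. A minimal sketch with the documented defaults (critical 10.0, warning 20.0) is shown below. The function name is hypothetical, and treating a value exactly equal to a threshold as still acceptable is an assumption based on the "does not fall below" wording.

```python
def free_space_health(free_bytes, capacity_bytes, warning=20.0, critical=10.0):
    """Classify health from free space as a percentage of total capacity.

    Assumption: health degrades only when the percentage is strictly
    below a threshold.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    pct_free = 100.0 * free_bytes / capacity_bytes
    if pct_free < critical:
        return "Bad"
    if pct_free < warning:
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    # 15 TiB free out of 100 TiB is 15%, between the two thresholds.
    tib = 1024 ** 4
    print(free_space_health(15 * tib, 100 * tib))
```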

HDFS Missing Blocks

This is an HDFS service-level health test that checks that the number of missing blocks does not rise above some percentage of the cluster's total blocks. A missing block is a block with no live replicas: all of its replicas are either missing or corrupt. This may happen because of corruption or because DataNodes are offline or being decommissioned. A failure of this health test may indicate the loss of several DataNodes at once. If there are files stored in the cluster with a replication factor of 1, you may see missing blocks with the loss or malfunction of a single DataNode. Use the HDFS fsck command to identify which files contain missing blocks. This test can be configured using the Missing Block Monitoring Thresholds HDFS service-wide monitoring setting.

Short Name: Missing Blocks

Missing Block Monitoring Thresholds

Description
The health check thresholds of the number of missing blocks. Specified as a percentage of the total number of blocks.
Template Name
hdfs_missing_blocks_thresholds
Default Value
critical:any, warning:never
Unit(s)
PERCENT
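
The default value critical:any, warning:never uses keywords rather than numbers. The sketch below shows one plausible reading, stated here as an assumption rather than documented behavior: "any" is exceeded by any nonzero value, and "never" is never exceeded. Under that reading, a single missing block yields "Bad" health and there is no "Concerning" state. The function names are hypothetical.

```python
def exceeds(value, threshold):
    """Return True if value exceeds threshold.

    threshold may be a number or one of the keywords 'any' / 'never'.
    Assumed semantics: 'any' trips on any nonzero value, 'never' never trips.
    """
    if threshold == "never":
        return False
    if threshold == "any":
        return value > 0
    return value > float(threshold)


def missing_block_health(missing, total, critical="any", warning="never"):
    """Classify health from the percentage of missing blocks."""
    pct = 100.0 * missing / total if total else 0.0
    if exceeds(pct, critical):
        return "Bad"
    if exceeds(pct, warning):
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    print(missing_block_health(0, 1_000_000))
    print(missing_block_health(1, 1_000_000))
```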

HDFS NameNode Health

This is an HDFS service-level health test that checks for an active, healthy NameNode. The test returns "Bad" health if the service is running and an active NameNode cannot be found. If an active NameNode is found, then the test checks the health of that NameNode as well as the health of any standby NameNode configured. A "Good" health result is returned only if both the active and standby NameNodes are healthy. A failure of this health test may indicate stopped or unhealthy NameNode roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the NameNodes. Check the status of the HDFS service's NameNode roles and look in the Cloudera Manager Service Monitor's log files for more information when this test fails. This test can be enabled or disabled using the Active NameNode Role Health Check HDFS service-wide monitoring setting. The check for a healthy standby NameNode can be enabled or disabled with Standby NameNode Health Check. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active NameNode before this health test fails, and the NameNode Activation Startup Tolerance can be used to adjust the amount of time around NameNode startup that the test allows for a NameNode to be made active.

Short Name: NameNode Health

Active NameNode Detection Window

Description
The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode.
Template Name
hdfs_active_namenode_detection_window
Default Value
3
Unit(s)
MINUTES

Active NameNode Role Health Check

Description
When computing the overall HDFS cluster health, consider the active NameNode's health.
Template Name
hdfs_namenode_health_enabled
Default Value
true
Unit(s)
no unit

NameNode Activation Startup Tolerance

Description
The amount of time after NameNode(s) start that the lack of an active NameNode will be tolerated. This is intended to allow either the auto-failover daemon to make a NameNode active, or a specifically issued failover command to take effect. This is an advanced option that does not often need to be changed.
Template Name
hdfs_namenode_activation_startup_tolerance
Default Value
180
Unit(s)
SECONDS

Standby NameNode Health Check

Description
When computing the overall HDFS cluster health, consider the health of the standby NameNode.
Template Name
hdfs_standby_namenodes_health_enabled
Default Value
true
Unit(s)
no unit
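
The way these settings interact can be sketched as below. This is an illustration under several assumptions, not the Service Monitor's logic: the function name is hypothetical, returning no verdict (None) while inside the startup tolerance is an assumption, reporting the worse of the active and standby health when they differ is an assumption, and the Active NameNode Detection Window is not modeled.

```python
# Ordering used to pick the worse of two health values (assumption).
SEVERITY = {"Good": 0, "Concerning": 1, "Bad": 2}


def namenode_health(active, standby, seconds_since_start,
                    startup_tolerance=180, check_standby=True):
    """Combine active and standby NameNode health into one result.

    active / standby: "Good", "Concerning", or "Bad", or None when no
    such NameNode was found.
    """
    if active is None:
        if seconds_since_start <= startup_tolerance:
            # Lack of an active NameNode is tolerated shortly after start.
            return None
        return "Bad"
    considered = [active]
    if check_standby and standby is not None:
        considered.append(standby)
    # "Good" results only if every considered NameNode is "Good".
    return max(considered, key=SEVERITY.get)


if __name__ == "__main__":
    print(namenode_health("Good", "Good", 600))
    print(namenode_health(None, None, 600))
```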

HDFS Under-Replicated Blocks

This is an HDFS service-level health test that checks that the number of under-replicated blocks does not rise above some percentage of the cluster's total blocks. A failure of this health test may indicate a loss of DataNodes. Use the HDFS fsck command to identify which files contain under-replicated blocks. This test can be configured using the Under-replicated Block Monitoring Thresholds HDFS service-wide monitoring setting.

Short Name: Under-Replicated Blocks

Under-replicated Block Monitoring Thresholds

Description
The health check thresholds of the number of under-replicated blocks. Specified as a percentage of the total number of blocks.
Template Name
hdfs_under_replicated_blocks_thresholds
Default Value
critical:40.0, warning:10.0
Unit(s)
PERCENT