HDFS Health Tests
HDFS Canary
This is an HDFS service-level health test that checks that basic client operations are working and are completing in a reasonable amount of time. This test reports the results of a periodic "canary" test that performs the following sequence of operations. First, it creates a file. By default, the path is /tmp/.cloudera_health_monitoring_canary_timestamp. The canary test then writes a small amount of data to that file, reads that data back, and verifies that the data is correct. Lastly, the canary test removes the created file. The test returns "Bad" health if any of the basic operations fail. The test returns "Concerning" health if the canary test runs too slowly. A failure of this health test may indicate that the cluster is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the NameNode and other service-level health tests. Look in the Service Monitor logs for log messages from the canary test. Look in the NameNode logs for more details about the processing of the canary test requests. This test can be enabled or disabled using the HDFS Canary Health Check HDFS service-wide monitoring setting.
Short Name: HDFS Canary
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
HDFS Canary Health Check | Enables the health check that a client can create, read, write, and delete files | hdfs_canary_health_enabled | true | no unit |
HDFS Corrupt Blocks
This is an HDFS service-level health test that checks that the number of corrupt blocks does not rise above some percentage of the cluster's total blocks. A block is called corrupt by HDFS if it has at least one corrupt replica along with at least one live replica. As such, a corrupt block does not indicate unavailable data, but they do indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS, not a corrupt block. HDFS automatically fixes corrupt blocks in the background. A failure of this health test may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Blocks With Corrupt Replicas Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Corrupt Blocks
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Blocks With Corrupt Replicas Monitoring Thresholds | The health check thresholds of the number of blocks that have at least one corrupt replica. Specified as a percentage of the total number of blocks. | hdfs_blocks_with_corrupt_replicas_thresholds | critical:1.0, warning:0.5 | PERCENT |
HDFS DataNode Health
This is a HDFS service-level health test that checks that enough of the DataNodes in the cluster are healthy. The test returns "Concerning" health if the number of healthy DataNodes falls below a warning threshold, expressed as a percentage of the total number of DataNodes. The test returns "Bad" health if the number of healthy and "Concerning" DataNodes falls below a critical threshold, expressed as a percentage of the total number of DataNodes. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 DataNodes, this test would return "Good" health if 95 or more DataNodes have good health. This test would return "Concerning" health if at least 90 DataNodes have either "Good" or "Concerning" health. If more than 10 DataNodes have bad health, this test would return "Bad" health. A failure of this health test indicates unhealthy DataNodes. Check the status of the individual DataNodes for more information. This test can be configured using the HDFS HDFS service-wide monitoring setting.
Short Name: DataNode Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Healthy DataNode Monitoring Thresholds | The health test thresholds of the overall DataNode health. The check returns "Concerning" health if the percentage of "Healthy" DataNodes falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" DataNodes falls below the critical threshold. | hdfs_datanodes_healthy_thresholds | critical:90.0, warning:95.0 | PERCENT |
HDFS Erasure Coding Policy Verification Test
- Add more DataNodes or Racks to your Cluster, if you want to use all the enabled erasure coding policies.
- Disable the offending erasure coding policy:
- Change the default erasure coding policy, if the offending policy is set as the default policy: Go to the HDFS configuration page, search for 'Fallback Erasure Coding Policy', and select one that is compliant with your existing cluster setup.
- Disable the offending erasure coding policy: Use the 'hdfs ec -disablePolicy -policy <policyName>' command.
- Disable the health test: Go to the HDFS configuration page, search for 'Erasure Coding Policy Verification Health Check', and unselect it. This is not recommended, because there is a high risk of losing data if the offending erasure coding policy is used.
Short Name: Erasure Coding Policy Verification Test
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Erasure Coding Policy Verification Health Check | Enables the health test for verifying if the cluster topology supports all the enabled erasure coding policies. | hdfs_verify_ec_with_topology_enabled | false | no unit |
Fallback Erasure Coding Policy | The fallback Erasure Coding policy that HDFS uses if no policy is specified when you run the -setPolicy command. | erasure_coding_default_policy | CDH=[[CDH 6.0.0‥CDH 8.0.0)=RS_6_3_1024K] | no unit |
HDFS Failover Controllers Health
This is a HDFS service-level health test that checks that all the Failover Controllers associated with this service are healthy and running. The test returns "Bad" health if any of Failover Controllers that the service depends on is unhealthy or not running. Check the Failover Controllers logs for more details This test can be enabled or disabled using the Failover Controllers Healthy service-wide monitoring setting.
Short Name: Failover Controllers Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Failover Controllers Healthy | Enables the health check that verifies that the failover controllers associated with this service are healthy and running. | failover_controllers_healthy_enabled | true | no unit |
HDFS Free Space
This is an HDFS service-level health test that checks that the amount of free space in the HDFS cluster does not fall below some percentage of total configured capacity. A failure of this health test may indicate a capacity planning problem, or a loss of DataNodes. An unfinalized HDFS metadata upgrade can also significantly decrease free space. This test can be configured using the HDFS Free Space Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Free Space
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
HDFS Free Space Monitoring Thresholds | The health check thresholds of free space in HDFS. Specified as a percentage of total HDFS capacity. | hdfs_free_space_thresholds | critical:10.0, warning:20.0 | PERCENT |
HDFS Missing Blocks
This is an HDFS service-level health test that checks the number of missing blocks does not rise above some percentage of the cluster's total blocks. A missing block is a block with no live replicas. All replicas are either missing or corrupt. This may happen because of corruption or because DataNodes are offline or being decommissioned. A failure of this health test may indicate the loss of several DataNodes at once. If there are files stored in the cluster with a replication factor value of 1, you may see missing blocks with the loss or malfunction of a single DataNode. Use the HDFS fsck command to identify which files contain missing blocks. This test can be configured using the Missing Block Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Missing Blocks
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Missing Block Monitoring Thresholds | The health check thresholds of the number of missing blocks. Specified as a percentage of the total number of blocks. | hdfs_missing_blocks_thresholds | critical:any, warning:never | PERCENT |
HDFS NameNode Health
This is an HDFS service-level health test that checks for and active, healthy NameNode. The test returns "Bad" health if the service is running and an active NameNode cannot be found. If an active NameNode is found, then the test checks the health of that NameNode as well as the health of any standby NameNode configured. A "Good" health result will only be returned if both the active and Standby NameNodes are healthy. A failure of this health test may indicate stopped or unhealthy NameNode roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the NameNodes. Check the status of the HDFS service's NameNode roles and look in the Cloudera Manager Service Monitor's log files for more information when this test fails. This test can be enabled or disabled using the Active NameNode Role Health Check HDFS service-wide monitoring setting. The check for a healthy standby NameNode can be enabled or disabled with Standby NameNode Health Check. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active NameNode before this health test fails, and the NameNode Activation Startup Tolerance can be used to adjust the amount of time around NameNode startup that the test allows for a NameNode to be made active.
Short Name: NameNode Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Active NameNode Detection Window | The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode. | hdfs_active_namenode_detection_window | 3 | MINUTES |
Active NameNode Role Health Check | When computing the overall HDFS cluster health, consider the active NameNode's health | hdfs_namenode_health_enabled | true | no unit |
NameNode Activation Startup Tolerance | The amount of time after NameNode(s) start that the lack of an active NameNode will be tolerated. This is intended to allow either the auto-failover daemon to make a NameNode active, or a specifically issued failover command to take effect. This is an advanced option that does not often need to be changed. | hdfs_namenode_activation_startup_tolerance | 180 | SECONDS |
Standby NameNode Health Check | When computing the overall HDFS cluster health, consider the health of the standby NameNode. | hdfs_standby_namenodes_health_enabled | true | no unit |
HDFS Under-Replicated Blocks
This is an HDFS service-level health test that checks that the number of under-replicated blocks does not rise above some percentage of the cluster's total blocks. A failure of this health test may indicate a loss of DataNodes. Use the HDFS fsck command to identify which files contain under-replicated blocks. This test can be configured using the Under-replicated Block Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Under-Replicated Blocks
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Under-replicated Block Monitoring Thresholds | The health check thresholds of the number of under-replicated blocks. Specified as a percentage of the total number of blocks. | hdfs_under_replicated_blocks_thresholds | critical:40.0, warning:10.0 | PERCENT |