HDFS Health Tests

HDFS Canary

This is an HDFS service-level health test that checks that basic client operations are working and are completing in a reasonable amount of time. This test reports the results of a periodic "canary" test that performs the following sequence of operations. First, it creates a file. By default, the path is /tmp/.cloudera_health_monitoring_canary_timestamp. The canary test then writes a small amount of data to that file, reads that data back, and verifies that the data is correct. Lastly, the canary test removes the created file. The test returns "Bad" health if any of the basic operations fail. The test returns "Concerning" health if the canary test runs too slowly. A failure of this health test may indicate that the cluster is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the NameNode and other service-level health tests. Look in the Service Monitor logs for log messages from the canary test. Look in the NameNode logs for more details about the processing of the canary test requests. This test can be enabled or disabled using the HDFS Canary Health Check HDFS service-wide monitoring setting.

Short Name: HDFS Canary
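
The canary sequence described above (create, write, read back, verify, delete, with a time limit) can be sketched as follows. This is only an illustration: it runs against a local directory as a stand-in for HDFS, and the function name `run_canary` and the `slow_seconds` limit are hypothetical, not part of Cloudera Manager.

```python
import os
import tempfile
import time
import uuid


def run_canary(directory, slow_seconds=1.0):
    """Create, write, read back, verify, and delete a small file.

    Returns "Bad" if any step fails, "Concerning" if the whole sequence
    takes longer than slow_seconds, and "Good" otherwise.
    """
    path = os.path.join(directory, ".canary_" + uuid.uuid4().hex)
    payload = b"canary"
    start = time.monotonic()
    try:
        with open(path, "wb") as f:   # create the file and write the data
            f.write(payload)
        with open(path, "rb") as f:   # read the data back
            data = f.read()
        os.remove(path)               # remove the created file
        if data != payload:           # verify that the data is correct
            return "Bad"
    except OSError:
        return "Bad"
    elapsed = time.monotonic() - start
    return "Concerning" if elapsed > slow_seconds else "Good"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_canary(d))
```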

HDFS Canary Health Check

Description
Enables the health check that verifies that a client can create, read, write, and delete files.
Template Name
hdfs_canary_health_enabled
Default Value
true
Unit(s)
no unit

HDFS Corrupt Blocks

This is an HDFS service-level health test that checks that the number of corrupt blocks does not rise above some percentage of the cluster's total blocks. A block is called corrupt by HDFS if it has at least one corrupt replica along with at least one live replica. As such, a corrupt block does not indicate unavailable data, but it does indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS, not a corrupt block. HDFS automatically fixes corrupt blocks in the background. A failure of this health test may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Blocks With Corrupt Replicas Monitoring Thresholds HDFS service-wide monitoring setting.

Short Name: Corrupt Blocks

Blocks With Corrupt Replicas Monitoring Thresholds

Description
The health check thresholds of the number of blocks that have at least one corrupt replica. Specified as a percentage of the total number of blocks.
Template Name
hdfs_blocks_with_corrupt_replicas_thresholds
Default Value
critical:1.0, warning:0.5
Unit(s)
PERCENT
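
As a worked example of how the default thresholds apply, the sketch below converts a corrupt-block count into a percentage of total blocks and compares it with the warning and critical values. The function name is hypothetical, and treating a threshold as exceeded only when the percentage is strictly above it is an assumption based on the phrase "does not rise above".

```python
def corrupt_block_health(corrupt_blocks, total_blocks, warning=0.5, critical=1.0):
    """Classify health from the share of blocks that have a corrupt replica."""
    if total_blocks == 0:
        return "Good"  # nothing stored, so nothing can be corrupt
    percent = 100.0 * corrupt_blocks / total_blocks
    if percent > critical:
        return "Bad"
    if percent > warning:
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    # 6,000 of 1,000,000 blocks is 0.6%, above warning but below critical.
    print(corrupt_block_health(6_000, 1_000_000))
```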

HDFS DataNode Health

This is an HDFS service-level health test that checks that enough of the DataNodes in the cluster are healthy. The test returns "Concerning" health if the number of healthy DataNodes falls below a warning threshold, expressed as a percentage of the total number of DataNodes. The test returns "Bad" health if the number of healthy and "Concerning" DataNodes falls below a critical threshold, expressed as a percentage of the total number of DataNodes. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 DataNodes, this test would return "Good" health if 95 or more DataNodes have "Good" health. It would return "Concerning" health if fewer than 95 DataNodes have "Good" health but at least 90 DataNodes have either "Good" or "Concerning" health. If more than 10 DataNodes have "Bad" health, this test would return "Bad" health. A failure of this health test indicates unhealthy DataNodes. Check the status of the individual DataNodes for more information. This test can be configured using the Healthy DataNode Monitoring Thresholds HDFS service-wide monitoring setting.

Short Name: DataNode Health

Healthy DataNode Monitoring Thresholds

Description
The health test thresholds of the overall DataNode health. The check returns "Concerning" health if the percentage of "Healthy" DataNodes falls below the warning threshold. The check returns "Bad" health if the total percentage of "Healthy" and "Concerning" DataNodes falls below the critical threshold.
Template Name
hdfs_datanodes_healthy_thresholds
Default Value
critical:90.0, warning:95.0
Unit(s)
PERCENT
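
The two-level rule and the 100-DataNode example above can be expressed as a small function. This is a sketch of the documented rule only; the function name and the counts passed to it are illustrative.

```python
def datanode_service_health(good, concerning, bad,
                            warning_pct=95.0, critical_pct=90.0):
    """Apply the warning and critical thresholds to per-DataNode health counts."""
    total = good + concerning + bad
    if total == 0:
        raise ValueError("at least one DataNode is required")
    good_pct = 100.0 * good / total
    good_or_concerning_pct = 100.0 * (good + concerning) / total
    # Critical threshold: counts both "Good" and "Concerning" DataNodes.
    if good_or_concerning_pct < critical_pct:
        return "Bad"
    # Warning threshold: counts only "Good" DataNodes.
    if good_pct < warning_pct:
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    print(datanode_service_health(95, 0, 5))    # 95 good of 100
    print(datanode_service_health(85, 5, 10))   # 90 good or concerning
    print(datanode_service_health(85, 4, 11))   # more than 10 bad
```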

HDFS Erasure Coding Policy Verification Test

There is a high risk of losing data if the offending erasure coding policy is used. For more information, see Best Practices for Rack and Node Setup for EC. Do one of the following to clear this health issue:
  • Add more DataNodes or Racks to your Cluster, if you want to use all the enabled erasure coding policies.
  • Disable the offending erasure coding policy:
    • Change the default erasure coding policy, if the offending policy is set as the default policy: Go to the HDFS configuration page, search for 'Fallback Erasure Coding Policy', and select one that is compliant with your existing cluster setup.
    • Disable the offending erasure coding policy: Use the 'hdfs ec -disablePolicy -policy <policyName>' command.
  • Disable the health test: Go to the HDFS configuration page, search for 'Erasure Coding Policy Verification Health Check', and unselect it. This is not recommended, because there is a high risk of losing data if the offending erasure coding policy is used.

Short Name: Erasure Coding Policy Verification Test

    Erasure Coding Policy Verification Health Check

    Description
    Enables the health test for verifying if the cluster topology supports all the enabled erasure coding policies.
    Template Name
    hdfs_verify_ec_with_topology_enabled
    Default Value
    false
    Unit(s)
    no unit

    Fallback Erasure Coding Policy

    Description
    The fallback Erasure Coding policy that HDFS uses if no policy is specified when you run the -setPolicy command.
    Template Name
    erasure_coding_default_policy
    Default Value
    CDH=[[CDH 6.0.0..CDH 8.0.0)=RS_6_3_1024K]
    Unit(s)
    no unit
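
To illustrate why a policy can be "offending" for a given topology, the sketch below derives the minimum DataNode count from a policy name such as RS_6_3_1024K: a Reed-Solomon policy with 6 data units and 3 parity units needs at least 9 DataNodes to place every unit of a stripe on a different node. The function names are hypothetical, and this covers the node count only; the rack requirements discussed in the best-practices guide are not modeled, and the exact checks performed by the health test are not described here.

```python
import re

# Matches names such as RS_6_3_1024K, RS-3-2-1024k, or XOR_2_1_1024K.
_POLICY = re.compile(r"^[A-Z]+(?:[-_]LEGACY)?[-_](\d+)[-_](\d+)[-_]\d+[kK]$")


def min_datanodes(policy_name):
    """Return data units + parity units for an erasure coding policy name."""
    match = _POLICY.match(policy_name)
    if match is None:
        raise ValueError("unrecognized policy name: " + policy_name)
    data_units, parity_units = int(match.group(1)), int(match.group(2))
    return data_units + parity_units


def unsupported_policies(enabled_policies, num_datanodes):
    """List the enabled policies that need more DataNodes than are available."""
    return [p for p in enabled_policies if min_datanodes(p) > num_datanodes]


if __name__ == "__main__":
    enabled = ["RS_6_3_1024K", "RS_3_2_1024K", "XOR_2_1_1024K"]
    print(unsupported_policies(enabled, num_datanodes=6))
```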

    HDFS Failover Controllers Health

    This is an HDFS service-level health test that checks that all the Failover Controllers associated with this service are healthy and running. The test returns "Bad" health if any of the Failover Controllers that the service depends on is unhealthy or not running. Check the Failover Controller logs for more details. This test can be enabled or disabled using the Failover Controllers Healthy service-wide monitoring setting.

    Short Name: Failover Controllers Health

    Failover Controllers Healthy

    Description
    Enables the health check that verifies that the failover controllers associated with this service are healthy and running.
    Template Name
    failover_controllers_healthy_enabled
    Default Value
    true
    Unit(s)
    no unit

    HDFS Free Space

    This is an HDFS service-level health test that checks that the amount of free space in the HDFS cluster does not fall below some percentage of total configured capacity. A failure of this health test may indicate a capacity planning problem, or a loss of DataNodes. An unfinalized HDFS metadata upgrade can also significantly decrease free space. This test can be configured using the HDFS Free Space Monitoring Thresholds HDFS service-wide monitoring setting.

    Short Name: Free Space

    HDFS Free Space Monitoring Thresholds

    Description
    The health check thresholds of free space in HDFS. Specified as a percentage of total HDFS capacity.
    Template Name
    hdfs_free_space_thresholds
    Default Value
    critical:10.0, warning:20.0
    Unit(s)
    PERCENT
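
Unlike the block-count tests, the free space thresholds are lower bounds: health degrades as the percentage falls. The sketch below shows that direction with the default values; the function name is hypothetical, and the strict "below" comparison is an assumption based on the wording "does not fall below".

```python
def free_space_health(free_bytes, capacity_bytes, warning=20.0, critical=10.0):
    """Classify health from free space as a percentage of configured capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    percent_free = 100.0 * free_bytes / capacity_bytes
    if percent_free < critical:
        return "Bad"
    if percent_free < warning:
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    TIB = 2 ** 40
    # 15 TiB free out of 100 TiB is 15%, between the two thresholds.
    print(free_space_health(15 * TIB, 100 * TIB))
```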

    HDFS Missing Blocks

    This is an HDFS service-level health test that checks that the number of missing blocks does not rise above some percentage of the cluster's total blocks. A missing block is a block with no live replicas: all of its replicas are either missing or corrupt. This may happen because of corruption or because DataNodes are offline or being decommissioned. A failure of this health test may indicate the loss of several DataNodes at once. If there are files stored in the cluster with a replication factor of 1, you may see missing blocks with the loss or malfunction of a single DataNode. Use the HDFS fsck command to identify which files contain missing blocks. This test can be configured using the Missing Block Monitoring Thresholds HDFS service-wide monitoring setting.

    Short Name: Missing Blocks

    Missing Block Monitoring Thresholds

    Description
    The health check thresholds of the number of missing blocks. Specified as a percentage of the total number of blocks.
    Template Name
    hdfs_missing_blocks_thresholds
    Default Value
    critical:any, warning:never
    Unit(s)
    PERCENT
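
The default value here uses the keywords "any" and "never" instead of numbers. The sketch below parses a threshold string in the "critical:..., warning:..." form and evaluates it, reading "any" as "triggered by any nonzero value" and "never" as "never triggered". That reading is an assumption based on the keyword names, and all function names are hypothetical.

```python
def parse_thresholds(spec):
    """Parse 'critical:any, warning:never' or 'critical:1.0, warning:0.5'."""
    thresholds = {}
    for part in spec.split(","):
        level, _, value = part.strip().partition(":")
        thresholds[level] = value if value in ("any", "never") else float(value)
    return thresholds


def exceeds(percent, threshold):
    """Return True if the percentage crosses an upper-bound threshold."""
    if threshold == "never":
        return False
    if threshold == "any":
        return percent > 0
    return percent > threshold


def missing_block_health(missing, total, spec="critical:any, warning:never"):
    thresholds = parse_thresholds(spec)
    percent = 100.0 * missing / total if total else 0.0
    if exceeds(percent, thresholds["critical"]):
        return "Bad"
    if exceeds(percent, thresholds["warning"]):
        return "Concerning"
    return "Good"


if __name__ == "__main__":
    # With the defaults, a single missing block is enough for "Bad".
    print(missing_block_health(1, 1_000_000))
```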

    HDFS NameNode Health

    This is an HDFS service-level health test that checks for an active, healthy NameNode. The test returns "Bad" health if the service is running and an active NameNode cannot be found. If an active NameNode is found, then the test checks the health of that NameNode as well as the health of any standby NameNode configured. A "Good" health result is returned only if both the active and standby NameNodes are healthy. A failure of this health test may indicate stopped or unhealthy NameNode roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the NameNodes. Check the status of the HDFS service's NameNode roles and look in the Cloudera Manager Service Monitor's log files for more information when this test fails. This test can be enabled or disabled using the Active NameNode Role Health Check HDFS service-wide monitoring setting. The check for a healthy standby NameNode can be enabled or disabled with Standby NameNode Health Check. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active NameNode before this health test fails, and the NameNode Activation Startup Tolerance can be used to adjust the amount of time around NameNode startup that the test allows for a NameNode to be made active.

    Short Name: NameNode Health
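
The decision logic described above can be sketched as a small function. The text states only when the result is "Bad" (no active NameNode) and when it is "Good" (active and standby both healthy); reporting the worse of the two role healths in all other cases is an assumption, as are the function and parameter names. The detection window and startup tolerance are not modeled.

```python
_SEVERITY = {"Good": 0, "Concerning": 1, "Bad": 2}


def namenode_service_health(active, standby=None, check_standby=True):
    """Combine NameNode role healths for a running HDFS service.

    active is the health of the active NameNode, or None if no active
    NameNode was found. standby is the health of the standby NameNode,
    or None if none is configured.
    """
    if active is None:
        return "Bad"  # service is running but no active NameNode was found
    healths = [active]
    if check_standby and standby is not None:
        healths.append(standby)
    # Assumption: the overall result is the worst individual health.
    return max(healths, key=_SEVERITY.get)


if __name__ == "__main__":
    print(namenode_service_health("Good", "Good"))
    print(namenode_service_health("Good", "Concerning"))
    print(namenode_service_health(None, "Good"))
```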

    Active NameNode Detection Window

    Description
    The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode.
    Template Name
    hdfs_active_namenode_detection_window
    Default Value
    3
    Unit(s)
    MINUTES

    Active NameNode Role Health Check

    Description
    When computing the overall HDFS cluster health, consider the active NameNode's health.
    Template Name
    hdfs_namenode_health_enabled
    Default Value
    true
    Unit(s)
    no unit

    NameNode Activation Startup Tolerance

    Description
    The amount of time after NameNode(s) start that the lack of an active NameNode will be tolerated. This is intended to allow either the auto-failover daemon to make a NameNode active, or a specifically issued failover command to take effect. This is an advanced option that does not often need to be changed.
    Template Name
    hdfs_namenode_activation_startup_tolerance
    Default Value
    180
    Unit(s)
    SECONDS

    Standby NameNode Health Check

    Description
    When computing the overall HDFS cluster health, consider the health of the standby NameNode.
    Template Name
    hdfs_standby_namenodes_health_enabled
    Default Value
    true
    Unit(s)
    no unit

    HDFS Under-Replicated Blocks

    This is an HDFS service-level health test that checks that the number of under-replicated blocks does not rise above some percentage of the cluster's total blocks. A failure of this health test may indicate a loss of DataNodes. Use the HDFS fsck command to identify which files contain under-replicated blocks. This test can be configured using the Under-replicated Block Monitoring Thresholds HDFS service-wide monitoring setting.

    Short Name: Under-Replicated Blocks

    Under-replicated Block Monitoring Thresholds

    Description
    The health check thresholds of the number of under-replicated blocks. Specified as a percentage of the total number of blocks.
    Template Name
    hdfs_under_replicated_blocks_thresholds
    Default Value
    critical:40.0, warning:10.0
    Unit(s)
    PERCENT