Minimum Required Role: Operator (also provided by Configurator, Cluster Administrator, Full Administrator)

How NameNode Manages Blocks on a Failed DataNode

After a period without any heartbeats (which by default is 10.5 minutes), a DataNode is assumed to be failed. The following describes how the NameNode manages block replication in such cases.
  1. NameNode determines which blocks were on the failed DataNode.
  2. NameNode locates other DataNodes with copies of these blocks.
  3. The DataNodes with block copies are instructed to copy those blocks to other DataNodes to maintain the configured replication factor.
  4. Follow the procedure in Replacing a Disk on a DataNode Host to bring a repaired DataNode back online.

Replacing a Disk on a DataNode Host

If one of your DataNode hosts experiences a disk failure, follow this process to replace the disk:
  1. Stop managed services.
  2. Decommission the DataNode role instance.
  3. Replace the failed disk.
  4. Recommission the DataNode role instance.
  5. Run the HDFS fsck utility to validate the health of HDFS. The utility normally reports over-replicated blocks immediately after a DataNode is reintroduced to the cluster, which is automatically corrected over time.
  6. Start managed services.