How NameNode manages blocks on a failed DataNode
A DataNode is considered dead after a set period without any heartbeats (10.5 minutes by default).
When this happens, the NameNode performs the following actions to maintain the configured
replication factor (3x replication by default):
- The NameNode determines which blocks were on the failed DataNode.
- The NameNode locates other DataNodes with copies of these blocks.
- The DataNodes with block copies are instructed to copy those blocks to other DataNodes to maintain the configured replication factor.
Keep the following in mind when working with dead DataNodes:
- If the DataNode failed due to a disk failure, follow the procedure in Replace a disk on a DataNode host or Perform a disk hot swap for DataNodes using Cloudera Manager to bring a repaired DataNode back online. If a DataNode failed to heartbeat for other reasons, they need to be recommissioned to be added back to the cluster.
- If a DataNode fails to heartbeat for reasons other than disk failure, it needs to be recommissioned to be added back to the cluster.
- If a DataNode rejoins the cluster, there is a possibility for surplus replicas of blocks that were on that DataNode. The NameNode will randomly remove excess replicas adhering to Rack-Awareness policies.