How NameNode manages blocks on a failed DataNode
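The NameNode considers a DataNode dead once it has received no heartbeats from it for a set period. The default is 10.5 minutes, calculated as 2 * dfs.namenode.heartbeat.recheck-interval (5 minutes) + 10 * dfs.heartbeat.interval (3 seconds).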
When this happens, the NameNode performs the following actions to maintain the configured
replication factor (3x replication by default):
- The NameNode determines which blocks were on the failed DataNode.
- The NameNode locates other DataNodes with copies of these blocks.
- The NameNode instructs the DataNodes that hold those copies to replicate the blocks to other DataNodes until the configured replication factor is restored.
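
The steps above can be sketched as a small simulation. This is a toy model under stated assumptions, not the NameNode's actual code: the block_map dictionary, the DataNode names, and the functions dead_timeout_seconds and plan_rereplication are all hypothetical. Targets are picked in alphabetical order for readability, whereas a real cluster applies its block placement policy. Only the two property names and their default values come from HDFS itself.

```python
# Toy model of dead-node detection and re-replication planning.
# All function and variable names here are hypothetical.

RECHECK_INTERVAL_MS = 300_000  # default of dfs.namenode.heartbeat.recheck-interval
HEARTBEAT_INTERVAL_S = 3       # default of dfs.heartbeat.interval


def dead_timeout_seconds(recheck_ms=RECHECK_INTERVAL_MS,
                         heartbeat_s=HEARTBEAT_INTERVAL_S):
    """Seconds without heartbeats after which a DataNode is marked dead."""
    return 2 * recheck_ms / 1000 + 10 * heartbeat_s


def plan_rereplication(block_map, live_nodes, dead_node, replication=3):
    """Return (block, source, target) copy tasks for blocks on dead_node.

    block_map maps a block id to the set of DataNodes holding a replica.
    """
    live = set(live_nodes)
    tasks = []
    for block, holders in sorted(block_map.items()):
        # Step 1: consider only blocks that were on the failed DataNode.
        if dead_node not in holders:
            continue
        # Step 2: locate live DataNodes that still hold a copy.
        survivors = sorted((holders - {dead_node}) & live)
        if not survivors:
            continue  # no live replica remains, so nothing can be copied
        # Step 3: copy from a survivor to nodes that lack the block.
        candidates = sorted(live - set(survivors))
        missing = replication - len(survivors)
        for target in candidates[:max(missing, 0)]:
            tasks.append((block, survivors[0], target))
    return tasks


if __name__ == "__main__":
    print(dead_timeout_seconds() / 60)  # minutes
    blocks = {
        "blk_1": {"dn1", "dn2", "dn3"},
        "blk_2": {"dn2", "dn3", "dn4"},
    }
    print(plan_rereplication(blocks, ["dn2", "dn3", "dn4", "dn5"], "dn1"))
```

In this example only blk_1 had a replica on the failed node dn1, so the plan contains a single copy task for that block; blk_2 already has three live replicas and is left alone.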
Keep the following in mind when working with dead DataNodes:
- If a DataNode stops sending heartbeats for reasons other than disk failure, it must be recommissioned before it can be added back to the cluster.
- If a DataNode rejoins the cluster, blocks that were on that DataNode may end up with surplus replicas. The NameNode randomly removes the excess replicas while adhering to rack-awareness policies.
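
The surplus-replica cleanup can be illustrated the same way. The sketch below is a simplified assumption of what "random removal that respects rack awareness" could look like: it picks a random victim, but prefers DataNodes on racks that hold more than one replica so that the number of racks covering the block does not shrink. The function remove_excess_replicas and the node and rack names are hypothetical, and the selection logic in a real NameNode is more involved.

```python
import random
from collections import defaultdict


def remove_excess_replicas(replica_racks, replication=3, rng=None):
    """Return the DataNodes whose replica of one block should be deleted.

    replica_racks maps a DataNode name to its rack, for every replica
    of a single block. Hypothetical helper, not an HDFS API.
    """
    if rng is None:
        rng = random.Random()
    remaining = dict(replica_racks)
    removed = []
    while len(remaining) > replication:
        by_rack = defaultdict(list)
        for node, rack in remaining.items():
            by_rack[rack].append(node)
        # Prefer nodes on racks that hold more than one replica, so that
        # a deletion does not reduce the number of racks covered.
        crowded = sorted(
            node
            for nodes in by_rack.values()
            if len(nodes) > 1
            for node in nodes
        )
        victim = rng.choice(crowded or sorted(remaining))
        removed.append(victim)
        del remaining[victim]
    return removed


if __name__ == "__main__":
    # dn4 rejoined the cluster, leaving four replicas of a 3x block.
    replicas = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackA"}
    print(remove_excess_replicas(replicas))
```

Here the single replica on rackB is never chosen, because removing it would leave all copies of the block on one rack.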