How the NameNode manages blocks on a failed DataNode

A DataNode is considered dead after a set period without any heartbeats (10.5 minutes by default).
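
The 10.5-minute default can be reproduced from two HDFS settings. The sketch below assumes the commonly documented rule that the timeout is twice dfs.namenode.heartbeat.recheck-interval plus ten times dfs.heartbeat.interval, with the stock defaults of 300000 ms and 3 s. The function name is hypothetical and is used only for illustration.

```python
# Stock defaults from hdfs-default.xml (assumed unchanged on the cluster).
HEARTBEAT_INTERVAL_S = 3        # dfs.heartbeat.interval, in seconds
RECHECK_INTERVAL_MS = 300_000   # dfs.namenode.heartbeat.recheck-interval, in milliseconds


def dead_node_timeout_ms(recheck_interval_ms=RECHECK_INTERVAL_MS,
                         heartbeat_interval_s=HEARTBEAT_INTERVAL_S):
    """Return the time without heartbeats after which a DataNode is marked dead."""
    # Two recheck intervals plus ten missed heartbeats, converted to milliseconds.
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


timeout_ms = dead_node_timeout_ms()
print(timeout_ms / 60_000)  # 10.5 (minutes)
```

Lowering either setting shortens the detection time, at the cost of marking slow but healthy DataNodes dead more readily.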

When this happens, the NameNode performs the following actions to maintain the configured replication factor (3x replication by default):
  1. The NameNode determines which blocks were on the failed DataNode.
  2. The NameNode locates other DataNodes with copies of these blocks.
  3. The NameNode instructs those DataNodes to copy the blocks to other DataNodes until each block is back at the configured replication factor.
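
The three steps above can be sketched as a small planning function. This is a simplified model and not the NameNode's actual code: the function name, block IDs, and DataNode names are hypothetical, targets are picked in sorted order for readability, and rack awareness and node load are ignored.

```python
def plan_rereplication(block_locations, live_nodes, failed_node, replication=3):
    """Return (block, source, target) copy instructions for under-replicated blocks.

    block_locations maps a block ID to the set of DataNodes that held a replica
    before the failure; live_nodes is the set of DataNodes still heartbeating.
    """
    plan = []
    for block in sorted(block_locations):
        holders = block_locations[block]
        # Step 1: only blocks that had a replica on the failed DataNode are affected.
        if failed_node not in holders:
            continue
        # Step 2: find the DataNodes that still hold a copy.
        survivors = sorted(holders - {failed_node})
        if not survivors:
            continue  # no live copy remains, so nothing can be copied
        # Step 3: copy from a survivor to nodes that do not yet hold the block.
        candidates = sorted(set(live_nodes) - holders)
        missing = replication - len(survivors)
        for target in candidates[:missing]:
            plan.append((block, survivors[0], target))
    return plan


block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn2", "dn3", "dn4", "dn5"}
print(plan_rereplication(block_locations, live_nodes, failed_node="dn1"))
# [('blk_1', 'dn2', 'dn4')]
```

Only blk_1 needs a new replica here, because blk_2 never had a copy on the failed DataNode.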

Keep the following in mind when working with dead DataNodes:
  • If a DataNode stopped sending heartbeats for a reason other than disk failure, it must be recommissioned before it can rejoin the cluster.
  • When a DataNode rejoins the cluster, blocks that were stored on it may have more replicas than the configured replication factor. The NameNode removes the excess replicas at random, subject to the rack-awareness policy.
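
Excess-replica removal can be illustrated in the same simplified way. The sketch below assumes that "adhering to rack-awareness policies" means a replica is taken at random from a rack that holds more than one copy whenever such a rack exists, so the number of racks covering the block does not shrink. The function name and the node and rack names are hypothetical, and the real block placement policy considers further factors.

```python
import random
from collections import defaultdict


def trim_excess_replicas(replica_racks, replication=3, rng=random):
    """Return the DataNodes whose replica of one block would be removed.

    replica_racks maps each DataNode holding the block to its rack.
    """
    remaining = dict(replica_racks)
    removed = []
    while len(remaining) > replication:
        by_rack = defaultdict(list)
        for node, rack in remaining.items():
            by_rack[rack].append(node)
        # Prefer nodes on racks with more than one replica to keep rack spread.
        crowded = [n for nodes in by_rack.values() if len(nodes) > 1 for n in nodes]
        victim = rng.choice(sorted(crowded) if crowded else sorted(remaining))
        removed.append(victim)
        del remaining[victim]
    return removed


replicas = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackC"}
removed = trim_excess_replicas(replicas)
print(removed)  # one of dn1 or dn2, chosen at random
```

In this example the block ends up with three replicas on three different racks, whichever of the two candidates is chosen.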