Node repair
Cloudera Data Hub monitors clusters so that host-level failures are reported as soon as they occur. You can then resolve them by performing a manual repair, which deletes and replaces the failed nodes and reattaches their disks.
For each Cloudera Data Hub cluster, Cloudera checks for a Cloudera Manager agent heartbeat on every cluster node. If the heartbeat is lost on a node, a failure is reported for that node. This may happen for the following reasons:
- The Cloudera Manager agent process exited on a node
- An instance crashed
- An instance was terminated
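The heartbeat check described above can be illustrated with a minimal sketch. It is not Cloudera's implementation: the function name `nodes_with_lost_heartbeat`, the 60-second timeout, and the node names are all assumptions made for illustration. It only shows the general idea that a node is reported as failed once its agent has been silent for longer than some threshold, whatever the underlying cause.

```python
from datetime import datetime, timedelta

# Assumed threshold for illustration only; the real value is not stated here.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def nodes_with_lost_heartbeat(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the sorted names of nodes whose last heartbeat is older than `timeout`.

    `last_heartbeat` maps a (hypothetical) node name to the time its agent
    last reported in. An exited agent, a crashed instance, and a terminated
    instance all look the same from here: the heartbeat simply stops.
    """
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )
```

For example, with one node that reported 10 seconds ago and another that has been silent for 5 minutes, only the second node is returned as failed.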
Once a failure is reported, you can resolve it manually in one of the following ways:
- Repair the failed nodes: (1) All non-ephemeral disks are detached from the failed nodes. (2) Failed nodes are removed. (3) New nodes of the same type are provisioned. (4) The disks are attached to the new nodes, preserving the data.
- Delete the failed nodes: Failed nodes are deleted with their attached volumes.
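The difference between the two options can be sketched with a small in-memory model. This is an illustrative sketch only, not Cloudera code: the `Disk` and `Node` classes, the `repair_node` and `delete_node` functions, and the replacement naming scheme are hypothetical. The point it shows is that a repair carries the non-ephemeral disks over to a new node of the same type, while a delete discards the node together with its volumes.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Disk:
    disk_id: str
    ephemeral: bool = False


@dataclass
class Node:
    node_id: str
    instance_type: str
    disks: list = field(default_factory=list)


# Hypothetical counter used only to give replacement nodes unique names.
_replacement_counter = count(1)


def repair_node(cluster, failed_id):
    """Replace a failed node while preserving its non-ephemeral disks."""
    failed = cluster[failed_id]
    # (1) Detach all non-ephemeral disks from the failed node.
    preserved = [d for d in failed.disks if not d.ephemeral]
    failed.disks = [d for d in failed.disks if d.ephemeral]
    # (2) Remove the failed node (its ephemeral disks go with it).
    del cluster[failed_id]
    # (3) Provision a new node of the same type.
    replacement = Node(
        node_id=f"{failed_id}-replacement-{next(_replacement_counter)}",
        instance_type=failed.instance_type,
    )
    # (4) Attach the preserved disks, keeping the data.
    replacement.disks.extend(preserved)
    cluster[replacement.node_id] = replacement
    return replacement


def delete_node(cluster, failed_id):
    """Delete a failed node together with all of its attached volumes."""
    del cluster[failed_id]
```

In this model `cluster` is a plain dictionary keyed by node ID. After `repair_node`, the failed ID is gone and the replacement holds the original persistent disks; after `delete_node`, nothing of the node remains.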
The following table describes how repair operations behave when a cluster has instances in the STOPPED state.
| Repair Action | Allowed | Result |
|---|---|---|
| Repair a subset of STOPPED instances (with other STOPPED instances on the cluster) | No | System returns the following message: Either select all Stopped nodes, or Start the nodes before attempting 'Repair'. |
| Repair CM master (with STOPPED instances on the cluster) | No | System returns the following message: Either select all Stopped nodes, or Start the nodes before attempting 'Repair'. |
| Repair a subset of RUNNING instances (with STOPPED instances on the cluster in different or same host groups) | No | System returns the following message: Either select all Stopped nodes, or Start the nodes before attempting 'Repair'. |
| Repair all STOPPED instances | Yes | The operation completes and the cluster becomes AVAILABLE after the repair, but it eventually enters the Node Failure state after the syncer has run. Nodes are not removed from Cloudera Manager. Because these STOPPED instances are marked as DECOMMISSIONED and are in Maintenance mode, they remain in the same state when they come back online. |
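The rules in the table can be summarized as a small pre-check. The sketch below is a hypothetical model, not part of any Cloudera API: `Instance`, `check_repair_selection`, and the instance names are assumptions. It also makes two assumptions about cases the table does not cover: a cluster with no STOPPED instances is treated as unrestricted, and a selection that mixes all STOPPED instances with additional RUNNING ones is conservatively rejected.

```python
from dataclasses import dataclass

# Message text taken from the table above.
STOPPED_MESSAGE = (
    "Either select all Stopped nodes, or Start the nodes before attempting 'Repair'."
)


@dataclass(frozen=True)
class Instance:
    instance_id: str
    state: str  # "RUNNING" or "STOPPED" in this sketch


def check_repair_selection(cluster, selected_ids):
    """Return (allowed, message) for a proposed repair selection.

    While the cluster contains STOPPED instances, the only selection this
    model accepts is exactly the full set of STOPPED instances.
    """
    stopped = {i.instance_id for i in cluster if i.state == "STOPPED"}
    selected = set(selected_ids)
    if not stopped:
        # Assumption: the table's restrictions do not apply in this case.
        return True, ""
    if selected == stopped:
        return True, ""
    return False, STOPPED_MESSAGE
```

With a cluster that has two STOPPED workers, selecting only one of them, or only a RUNNING node such as the Cloudera Manager master, returns the message from the table; selecting both STOPPED workers is accepted.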
