Data Hub monitors clusters so that host-level failures are reported as soon as they occur. You can then resolve them quickly with a manual repair, which deletes and replaces the failed nodes and reattaches their disks.
For each Data Hub cluster, CDP checks for a Cloudera Manager agent heartbeat on every cluster node. If the heartbeat is lost on a node, a failure is reported for that node. A heartbeat may be lost for the following reasons:
- The Cloudera Manager agent process exited on a node
- An instance crashed
- An instance was terminated
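The heartbeat check described above can be illustrated with a minimal sketch. It is not CDP's implementation: the function name `find_failed_nodes`, the node names, and the 60-second timeout are all hypothetical, since the source does not say how long a heartbeat may be missing before a failure is reported.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: the real interval used by CDP is not stated in the source.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def find_failed_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the sorted names of nodes whose last heartbeat is older than `timeout`."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )


# Illustrative data: worker-2 has been silent for five minutes.
now = datetime(2024, 1, 1, 12, 0, 0)
heartbeats = {
    "worker-1": now - timedelta(seconds=5),
    "worker-2": now - timedelta(seconds=300),
}
failed = find_failed_nodes(heartbeats, now)
print(failed)
```

Note that all three causes listed above look the same from this vantage point: the check only sees that the heartbeat stopped, not why.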
Once a failure is reported, you can resolve it manually with one of the following options:
- Repair the failed nodes: (1) All non-ephemeral disks are detached from the failed nodes. (2) The failed nodes are removed. (3) New nodes of the same type are provisioned. (4) The disks are attached to the new nodes, preserving the data.
- Delete the failed nodes: Failed nodes are deleted with their attached volumes.
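The difference between the two options can be sketched with a small in-memory model. This is only an illustration of the sequence described above, not a CDP API: the `Node` class, the `repair_node` and `delete_node` functions, and all node, type, and disk names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    instance_type: str
    disks: list = field(default_factory=list)  # non-ephemeral disks only


def repair_node(cluster, failed_name, new_name):
    """Replace a failed node while keeping its non-ephemeral disks."""
    failed = cluster[failed_name]
    detached = list(failed.disks)  # (1) detach the non-ephemeral disks
    failed.disks = []
    del cluster[failed_name]  # (2) remove the failed node
    replacement = Node(new_name, failed.instance_type)  # (3) provision the same type
    replacement.disks = detached  # (4) reattach the disks, preserving the data
    cluster[new_name] = replacement
    return replacement


def delete_node(cluster, failed_name):
    """Delete a failed node together with its attached disks."""
    return cluster.pop(failed_name)


cluster = {
    "worker-2": Node("worker-2", "example-type", ["disk-a", "disk-b"]),
    "worker-3": Node("worker-3", "example-type", ["disk-c"]),
}
repaired = repair_node(cluster, "worker-2", "worker-2-new")
deleted = delete_node(cluster, "worker-3")
print(sorted(cluster), repaired.disks)
```

After both calls, only the repaired replacement remains in the model, and it carries the original disks. The deleted node's disks are gone with it, which is the practical reason to prefer repair when the data on the node matters.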