Data Hub monitors clusters so that host-level failures are reported right away and can be
quickly resolved with a manual repair, which deletes and replaces the failed nodes and
reattaches their disks.
For each Data Hub cluster, CDP checks for the Cloudera Manager agent heartbeat on all
cluster nodes. If the Cloudera Manager agent heartbeat is lost on a node, a failure is reported
for that node. This may happen for the following reasons:
The Cloudera Manager agent process exited on a node
An instance crashed
An instance was terminated
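The heartbeat check described above can be illustrated with a minimal Python sketch. It is not CDP's implementation: the function name find_failed_nodes, the data layout, and the 60-second timeout are all assumptions made for illustration, since this page does not state how long CDP waits before treating a heartbeat as lost.

```python
from typing import Dict, List

# Hypothetical threshold: the documentation does not state how long CDP
# waits before it considers an agent heartbeat lost.
HEARTBEAT_TIMEOUT_S = 60.0


def find_failed_nodes(last_heartbeat: Dict[str, float], now: float,
                      timeout_s: float = HEARTBEAT_TIMEOUT_S) -> List[str]:
    """Return the names of nodes whose last heartbeat is older than timeout_s."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


# Illustrative timestamps in seconds: worker-1 last reported 300 seconds ago.
heartbeats = {"master-0": 995.0, "worker-0": 990.0, "worker-1": 700.0}
print(find_failed_nodes(heartbeats, now=1000.0))  # prints ['worker-1']
```

Any of the three causes listed above looks the same to such a check: the agent simply stops reporting, so the node is flagged regardless of why the heartbeat stopped.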
Once a failure is reported, options for repairing it become available.
Manual repair is enabled for all clusters by default and covers all nodes, including the
Cloudera Manager server node (by default, Cloudera Manager is installed on the master node).
When a node fails, a notification about the failure appears in the Event History, the
affected node is marked as unhealthy, and an option to repair the cluster becomes available from
the Actions menu. From the Hardware tab you can choose to repair
a single node or specific nodes within a host group. There are two ways to repair the cluster:
Repair the failed nodes: (1) All non-ephemeral disks are detached from the failed nodes.
(2) The failed nodes are removed. (3) New nodes of the same type are provisioned. (4) The disks
are attached to the new nodes, preserving the data.
Delete the failed nodes: Failed nodes are deleted with their attached volumes.
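The difference between the two options can be modeled in a few lines of Python. This is only a toy model of the steps listed above, not how CDP performs them; the Node class and the repair_node and delete_node functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node and its non-ephemeral disks."""
    name: str
    node_type: str
    disks: List[str] = field(default_factory=list)


def repair_node(failed: Node, new_name: str) -> Node:
    """Replace a failed node while preserving its non-ephemeral disks."""
    disks = list(failed.disks)        # (1) detach the disks from the failed node
    failed.disks.clear()
    # (2) the failed node is removed (in this model it is simply discarded)
    replacement = Node(new_name, failed.node_type)  # (3) provision the same type
    replacement.disks.extend(disks)   # (4) attach the preserved disks
    return replacement


def delete_node(failed: Node) -> None:
    """Delete a failed node together with its attached volumes."""
    failed.disks.clear()


old = Node("worker-1", "worker", ["vol-a", "vol-b"])
new = repair_node(old, "worker-1-new")
print(new.node_type, new.disks)  # prints: worker ['vol-a', 'vol-b']
```

In this model, repair carries the disks over to the replacement node, while delete leaves no replacement and no volumes behind, which is why only the repair option preserves the data.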