Node repair

Data Hub monitors clusters so that host-level failures are reported as soon as they occur and can be quickly resolved through manual repair, which deletes and replaces the failed nodes and reattaches their disks.

For each Data Hub cluster, CDP checks for the Cloudera Manager agent heartbeat on all cluster nodes. If the heartbeat is lost on a node, a failure is reported for that node. This may happen for any of the following reasons:

  • The Cloudera Manager agent process exited on a node

  • An instance crashed

  • An instance was terminated

Once a failure is reported, options are available for you to repair the failure manually.
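The detection logic described above can be sketched as a simple heartbeat-staleness check. This is an illustrative model only, not CDP's implementation; the node names and the timeout value are hypothetical, since the actual heartbeat interval is not stated here.

```python
import time

# Hypothetical threshold for illustration; CDP's real heartbeat
# timeout is not documented in this section.
HEARTBEAT_TIMEOUT_SECONDS = 300

def unhealthy_nodes(last_heartbeat, now=None):
    """Return node IDs whose Cloudera Manager agent heartbeat is stale.

    last_heartbeat maps node ID -> UNIX timestamp of the most recent
    agent heartbeat. A lost heartbeat may mean the agent process exited,
    or the instance crashed or was terminated.
    """
    now = time.time() if now is None else now
    return sorted(
        node for node, ts in last_heartbeat.items()
        if now - ts > HEARTBEAT_TIMEOUT_SECONDS
    )
```

For example, a node whose last heartbeat is older than the timeout is flagged, while recently heartbeating nodes are not.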

Manual repair is enabled for all clusters by default and covers all nodes except the Cloudera Manager server node (by default, Cloudera Manager is installed on the master node). When a worker or compute node fails, a node failure notification is printed in the Event History, the affected node is marked as unhealthy, and a repair option becomes available from the Actions menu. There are two ways to repair the cluster:
  • Repair the failed nodes: (1) All non-ephemeral disks are detached from the failed nodes. (2) The failed nodes are removed. (3) New nodes of the same instance type are provisioned. (4) The disks are attached to the new nodes, preserving the data.
  • Delete the failed nodes: Failed nodes are deleted together with their attached volumes.
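The four repair steps above can be sketched as follows. This is a conceptual model under stated assumptions, not a CDP API: `FakeCloud`, `Node`, and all method names are hypothetical stand-ins for the cloud provider operations that CDP performs.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    type: str                               # cloud instance type
    disks: list = field(default_factory=list)  # non-ephemeral disks

class FakeCloud:
    """Minimal in-memory stand-in for a cloud provider, for illustration only."""
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self._counter = 0

    def detach_disks(self, node):
        disks, node.disks = node.disks, []
        return disks

    def remove_node(self, node):
        del self.nodes[node.name]

    def provision(self, instance_type):
        self._counter += 1
        new = Node(f"replacement-{self._counter}", instance_type)
        self.nodes[new.name] = new
        return new

    def attach_disks(self, node, disks):
        node.disks.extend(disks)

def repair_failed_nodes(cloud, failed_nodes):
    """Repair flow: detach non-ephemeral disks, remove the failed node,
    provision a replacement of the same type, reattach the disks so the
    data is preserved."""
    replacements = []
    for node in failed_nodes:
        disks = cloud.detach_disks(node)       # (1) detach non-ephemeral disks
        cloud.remove_node(node)                # (2) remove the failed node
        new_node = cloud.provision(node.type)  # (3) provision same instance type
        cloud.attach_disks(new_node, disks)    # (4) reattach disks, preserving data
        replacements.append(new_node)
    return replacements
```

The key design point the sketch illustrates is that only the compute instance is replaced: because the non-ephemeral disks are detached before the node is removed and reattached afterwards, the data on those volumes survives the repair.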