Node repair

Data Hub monitors clusters so that host-level failures are reported as soon as they occur and can be quickly resolved through manual repair, which deletes and replaces the failed nodes and reattaches their disks.

For each Data Hub cluster, CDP checks for the Cloudera Manager agent heartbeat on all cluster nodes. If the heartbeat is lost on a node, a failure is reported for that node. This may happen for any of the following reasons:

  • The Cloudera Manager agent process exited on a node

  • An instance crashed

  • An instance was terminated

Once a failure is reported, options are available for you to repair the failure manually.
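The detection logic described above can be sketched as a simple heartbeat-staleness check. This is an illustrative model only, not CDP's implementation; the node names and the timeout value are hypothetical, since the actual heartbeat interval is not stated here.

```python
import time

# Hypothetical threshold for illustration; CDP's real heartbeat
# timeout is not documented in this section.
HEARTBEAT_TIMEOUT_SECONDS = 300

def unhealthy_nodes(last_heartbeat, now=None):
    """Return node IDs whose Cloudera Manager agent heartbeat is stale.

    last_heartbeat maps node ID -> UNIX timestamp of the most recent
    agent heartbeat. A lost heartbeat may mean the agent process exited,
    or the instance crashed or was terminated.
    """
    now = time.time() if now is None else now
    return sorted(
        node for node, ts in last_heartbeat.items()
        if now - ts > HEARTBEAT_TIMEOUT_SECONDS
    )
```

For example, a node whose last heartbeat is older than the timeout is flagged, while recently heartbeating nodes are not.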

Manual repair is enabled for all clusters by default and covers all nodes except the Cloudera Manager server node (by default, Cloudera Manager is installed on the master node). When a worker or compute node fails, a node failure notification is printed in the Event History, the affected node is marked as unhealthy, and a repair option becomes available from the Actions menu. There are two ways to repair the cluster:
  • Repair the failed nodes: (1) All non-ephemeral disks are detached from the failed nodes. (2) The failed nodes are removed. (3) New nodes of the same instance type are provisioned. (4) The disks are attached to the new nodes, preserving the data.
  • Delete the failed nodes: Failed nodes are deleted together with their attached volumes.
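The four repair steps above can be sketched as follows. This is a conceptual model under stated assumptions, not a CDP API: `FakeCloud`, `Node`, and all method names are hypothetical stand-ins for the cloud provider operations that CDP performs.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    type: str                               # cloud instance type
    disks: list = field(default_factory=list)  # non-ephemeral disks

class FakeCloud:
    """Minimal in-memory stand-in for a cloud provider, for illustration only."""
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self._counter = 0

    def detach_disks(self, node):
        disks, node.disks = node.disks, []
        return disks

    def remove_node(self, node):
        del self.nodes[node.name]

    def provision(self, instance_type):
        self._counter += 1
        new = Node(f"replacement-{self._counter}", instance_type)
        self.nodes[new.name] = new
        return new

    def attach_disks(self, node, disks):
        node.disks.extend(disks)

def repair_failed_nodes(cloud, failed_nodes):
    """Repair flow: detach non-ephemeral disks, remove the failed node,
    provision a replacement of the same type, reattach the disks so the
    data is preserved."""
    replacements = []
    for node in failed_nodes:
        disks = cloud.detach_disks(node)       # (1) detach non-ephemeral disks
        cloud.remove_node(node)                # (2) remove the failed node
        new_node = cloud.provision(node.type)  # (3) provision same instance type
        cloud.attach_disks(new_node, disks)    # (4) reattach disks, preserving data
        replacements.append(new_node)
    return replacements
```

The key design point the sketch illustrates is that only the compute instance is replaced: because the non-ephemeral disks are detached before the node is removed and reattached afterwards, the data on those volumes survives the repair.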