Managing Clusters

Repairing a cluster

Cloudbreak monitors clusters, ensuring that when host-level failures occur, they are quickly resolved by deleting and replacing failed nodes along with their attached volumes.

For each cluster, Cloudbreak checks for Ambari agent heartbeat on all cluster nodes. If the Ambari agent heartbeat is lost on a node, a failure is reported for that node. This may happen for the following reasons:

  • The ambari-agent process exited on a node (see the check after this list)

  • An instance crashed

  • An instance was terminated
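
If the heartbeat was lost because the ambari-agent process exited, you can check (and, if appropriate, restart) the agent on the affected node before deciding whether a repair is needed. This is a minimal sketch, assuming you have SSH access to the node; the host name and the cloudbreak SSH user are placeholders that depend on how the instance was provisioned:

  # Check whether the Ambari agent process is running on the node
  ssh cloudbreak@<failed-node-host> "sudo ambari-agent status"

  # If the process has simply exited, restarting it may restore the heartbeat
  ssh cloudbreak@<failed-node-host> "sudo ambari-agent restart"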

Once a failure is reported, it is repaired automatically (if auto repair is enabled), or options are available for you to repair the failure manually (if auto repair is disabled).

When a cluster is fault-tolerant (HA cluster), all nodes can be repaired, including the Ambari server node. In case of a cluster without fault tolerance (non-HA cluster), all nodes can be repaired except the Ambari server node.

Auto repair

If auto repair is enabled, once a failure of a worker or compute node is reported, it is repaired automatically by removing and replacing the failed node. The flow is:

  1. A notification about node failure is printed in the UI.
  2. The recovery flow is triggered. Cluster status changes to 'REPAIR'.
  3. Downscale: Remove failed nodes, copy data from volumes attached to the failed nodes to other volumes, and then remove the attached volumes.
  4. Upscale: New nodes and attached volumes of the same type are added in place of the failed nodes.
  5. The recovery flow is completed. The cluster status changes to 'RUNNING'.

Corresponding events are written to the cluster's event history.
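
If you prefer to follow the recovery flow from the command line rather than the UI, the Cloudbreak CLI can show the current cluster status. This is a minimal sketch, assuming the cb CLI is already configured against your Cloudbreak instance; the cluster name is a placeholder and the exact output fields may differ between CLI versions:

  # Show cluster details, including the current status (for example 'REPAIR' or 'RUNNING')
  cb cluster describe --name my-test-cluster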

Manual repair

Manual repair is enabled for all clusters by default. When manual repair is enabled and a worker or compute node fails:
  1. A notification about node failure is printed in the UI:
    • The cluster tile on the cluster dashboard shows unhealthy nodes
    • Nodes are marked as "UNHEALTHY" in the Hardware section
    • The cluster's event history shows "Manual recovery is needed for the following failed nodes"
  2. You can choose to repair or delete the failed nodes. This option is available from the cluster details in the UI (Actions > Repair) and from the CLI (the cluster repair command; see the example at the end of this section).
  3. If repair was chosen: (1) Failed nodes are removed and data from volumes attached to the failed nodes is copied to other volumes, and then the attached volumes are removed. (2) New nodes and attached volumes of the same type are added in place of the failed nodes.
  4. If delete was chosen, failed nodes are deleted with their attached volumes.
  5. Once the recovery flow is completed, the cluster status changes to 'RUNNING'.

Corresponding events are written to the cluster's event history.
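
As referenced in step 2 above, a repair can also be triggered from the CLI with the cluster repair command. This is a minimal sketch, assuming the cb CLI is configured against your Cloudbreak instance; the cluster name is a placeholder, and you should check the built-in help for the node and host group selectors supported by your CLI version:

  # Trigger the repair of failed nodes in the named cluster
  cb cluster repair --name my-test-cluster

  # List the options supported by your CLI version (for selecting specific nodes or host groups)
  cb cluster repair --help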