Managing Clusters

Repairing a cluster

Cloudbreak monitors clusters, ensuring that when host-level failures occur, they are quickly resolved by deleting and replacing failed nodes along with their attached volumes.

For each cluster, Cloudbreak checks for Ambari agent heartbeat on all cluster nodes. If the Ambari agent heartbeat is lost on a node, a failure is reported for that node. This may happen for the following reasons:

  • The ambari-agent process exited on a node

  • An instance crashed

  • An instance was terminated
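
If you suspect the first case, you can check from the node itself whether the agent is still alive. A minimal sketch, assuming root access to the node; exact service-management commands depend on the OS image:

    # Check whether the Ambari agent process is running and heartbeating
    ambari-agent status

    # If the process exited, restarting it may restore the heartbeat
    # before a repair is triggered
    ambari-agent restart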

Once a failure is reported, it is repaired automatically (if auto repair is enabled), or options are available for you to repair the failure manually (if auto repair is disabled).

Note

On regular (non-HA) clusters, all nodes are repairable except the Ambari node. The Ambari node can only be repaired on fault-tolerant clusters, where the Ambari server is configured to be highly available ("HA clusters"). This is because HA clusters have two gateway nodes in the gateway group, each with Ambari installed, so if the primary node fails, the secondary node can take over.

Warning

To use the repair feature, you must set up custom hostnames. This is required because Ambari stores hostname-related metadata in its database, and custom hostnames keep that metadata consistent when nodes are replaced. You can set this up by using either Custom internal hostnames for cluster hosts or Custom hostnames based on DNS on AWS.
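
As an illustration only, custom hostname settings are supplied in the cluster creation request. The field names below are assumptions based on the Cloudbreak cluster request schema and may differ by release; follow the linked topics for the authoritative setup:

    # Hypothetical excerpt of a CLI cluster request with custom internal
    # hostnames; field names are assumptions and other required fields
    # are omitted for brevity
    cat > cluster-request.json <<'EOF'
    {
      "customDomain": {
        "customDomain": "mycluster.internal",
        "hostgroupNameAsHostname": true
      }
    }
    EOF
    cb cluster create --cli-input-json cluster-request.json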

Auto repair

If auto repair is enabled, once a failure of a worker or compute node is reported, it is repaired automatically by removing and replacing the failed node. The flow is:

  1. A notification about node failure is printed in the UI.
  2. The recovery flow is triggered. The cluster status changes to 'REPAIR'.
  3. Downscale: Remove failed nodes, copy data from volumes attached to the failed nodes to other volumes, and then remove the attached volumes.
  4. Upscale: New nodes and attached volumes of the same type are added in place of the failed nodes.
  5. The recovery flow is completed. The cluster status changes to 'RUNNING'.

Corresponding events are written to the cluster’s event history.
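
Auto repair is enabled per host group when the cluster is created. As a hedged sketch, a worker host group that opts into auto repair might be declared like this in a CLI cluster request (recoveryMode accepts AUTO or MANUAL; treat the surrounding structure as illustrative):

    # Illustrative excerpt: the worker host group opts into auto repair
    # via "recoveryMode"; other required fields are omitted
    cat > cluster-request.json <<'EOF'
    {
      "instanceGroups": [
        {
          "group": "worker",
          "nodeCount": 3,
          "recoveryMode": "AUTO"
        }
      ]
    }
    EOF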

Manual repair

Manual repair is enabled for all clusters by default. With manual repair enabled, the following happens when a worker or compute node fails:
  1. A notification about node failure is printed in the UI:
    • Cluster tile on the cluster dashboard shows unhealthy nodes
    • Nodes are marked as "UNHEALTHY" in the Hardware section
    • Cluster's event history shows "Manual recovery is needed for the following failed nodes"
  2. You have an option to repair or delete the failed nodes. This option is available from the cluster details in the UI (Actions > Repair) and from the CLI (the cluster repair command; see the sketch at the end of this section).
  3. If repair was chosen: (1) Failed nodes are removed and data from volumes attached to the failed nodes is copied to other volumes, and then the attached volumes are removed. (2) New nodes and attached volumes of the same type are added in place of the failed nodes.
  4. If delete was chosen, failed nodes are deleted with their attached volumes.
  5. Once the recovery flow is completed, the cluster status changes to 'RUNNING'.

Corresponding events are written to the cluster’s event history.
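
To trigger the repair from the CLI, a minimal sketch follows. The cluster repair command is named in this documentation; the flags shown are assumptions and may vary by Cloudbreak release:

    # Repair the failed nodes of an existing cluster
    # (the --name flag is an assumption; check cb cluster repair --help)
    cb cluster repair --name my-cluster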