Repairing a cluster
Cloudbreak monitors clusters, ensuring that when host-level failures occur, they are quickly resolved by deleting and replacing failed nodes along with their attached volumes.
For each cluster, Cloudbreak checks for Ambari agent heartbeat on all cluster nodes. If the Ambari agent heartbeat is lost on a node, a failure is reported for that node. This may happen for the following reasons:
- The ambari-agent process exited on a node
- An instance crashed
- An instance was terminated
Once a failure is reported, it is repaired automatically (if auto repair is enabled), or options are available for you to repair the failure manually (if auto repair is disabled).
When a cluster is fault-tolerant (HA cluster), all nodes can be repaired, including the Ambari server node. For a cluster without fault tolerance (non-HA cluster), all nodes except the Ambari server node can be repaired.
Warning: In order to be able to use the repair feature, you must set up custom hostnames. This is required because Ambari stores hostname-related metadata in its database, and custom hostnames are needed to keep that metadata consistent. You can set this up by using either Custom internal hostnames for cluster hosts or Custom hostnames based on DNS on AWS.
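As a hedged illustration of the first option (custom internal hostnames), the configuration is made in the Cloudbreak deployer's Profile file before the cluster is created. The variable names and the example domain below are assumptions based on the custom internal hostnames documentation; verify them against your Cloudbreak version.

```
# Assumed Profile entries for custom internal hostnames (variable names and the
# example domain are assumptions; confirm them for your Cloudbreak version).
export CB_HOST_DISCOVERY_CUSTOM_HOSTNAME_ENABLED=true
export CB_HOST_DISCOVERY_CUSTOM_DOMAIN=cluster.internal

# Restart the Cloudbreak deployer so the updated Profile is picked up:
cbd restart
```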
Auto repair
If auto repair is enabled, once a failure of a worker or compute node is reported, it is repaired automatically by removing and replacing the failed node. The flow is:
- A notification about node failure is printed in the UI.
- The recovery flow is triggered. Cluster status changes to 'REPAIR'.
- Downscale: Remove failed nodes, copy data from volumes attached to the failed nodes to other volumes, and then remove the attached volumes.
- Upscale: New nodes and attached volumes of the same type are added in place of the failed nodes.
- The recovery flow is completed. The cluster status changes to 'RUNNING'.
Corresponding events are written to the cluster’s event history.
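In the CLI, auto repair is typically chosen per host group when the cluster is created. The fragment below is a minimal sketch of what this might look like in a cluster creation request, assuming an instanceGroups list with a recoveryMode attribute set to AUTO or MANUAL; the exact field names and full request schema depend on your Cloudbreak version, and the cluster and file names are placeholders.

```
# Hedged sketch: write an excerpt of a cluster request in which the worker and
# compute host groups use automatic repair. The recoveryMode field and the
# other attribute names are assumptions; check the request schema for your
# Cloudbreak version.
cat > instance-groups-excerpt.json <<'EOF'
{
  "instanceGroups": [
    { "group": "master",  "nodeCount": 1, "type": "GATEWAY", "recoveryMode": "MANUAL" },
    { "group": "worker",  "nodeCount": 3, "type": "CORE",    "recoveryMode": "AUTO" },
    { "group": "compute", "nodeCount": 2, "type": "CORE",    "recoveryMode": "AUTO" }
  ]
}
EOF

# The excerpt would be merged into a complete request and submitted with a
# command along the lines of (cluster and file names are placeholders):
#   cb cluster create --name my-cluster --cli-input-json cluster-request.json
```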
Manual repair
- A notification about node failure is printed in the UI:
  - The cluster tile on the cluster dashboard shows unhealthy nodes
  - Nodes are marked as "UNHEALTHY" in the Hardware section
  - The cluster's event history shows "Manual recovery is needed for the following failed nodes"
- You have an option to repair or delete the failed nodes. This option is available from the cluster details in the UI (Actions > Repair) and from the CLI (the cluster repair command; see the example after this list).
- If repair was chosen: (1) Failed nodes are removed, and data from volumes attached to the failed nodes is copied to other volumes; the attached volumes are then removed. (2) New nodes and attached volumes of the same type are added in place of the failed nodes.
- If delete was chosen, failed nodes are deleted with their attached volumes.
- Once the recovery flow is completed, the cluster status changes to 'RUNNING'.
Corresponding events are written to the cluster’s event history.
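A minimal sketch of the CLI path for manual repair, assuming the Cloudbreak CLI (cb) is already configured against your Cloudbreak instance. The cluster name is a placeholder, and flags for selecting individual nodes or deleting them instead of repairing should be checked with cb cluster repair --help for your CLI version.

```
# List clusters and identify the one reporting failed nodes:
cb cluster list

# Inspect the cluster, including the status of its nodes:
cb cluster describe --name my-cluster

# Trigger the repair flow (cluster name is a placeholder); see
# 'cb cluster repair --help' for node-selection and delete options:
cb cluster repair --name my-cluster
```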