Data Lake repair

If a Data Lake node fails, an administrator can trigger a manual repair process to restore the failed node and reconnect it to the persistent Data Lake storage.

For each Data Lake cluster, CDP detects the following failures, either of which indicates that one or more nodes need repair:

  • The node is unresponsive, from a crash or termination
  • The Cloudera Manager agent process is unresponsive
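
The two failure conditions above can be read as a simple predicate over a node's health signals. The following Python sketch is illustrative only: the NodeStatus class, its field names, and both functions are hypothetical and are not part of any CDP API. It only shows that either condition alone is enough to flag a node for repair.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one Data Lake node's health signals."""
    name: str
    host_responsive: bool      # False after a crash or termination
    cm_agent_responsive: bool  # False when the Cloudera Manager agent stops responding


def needs_repair(node: NodeStatus) -> bool:
    """A node needs repair if either failure condition holds."""
    return not node.host_responsive or not node.cm_agent_responsive


def nodes_needing_repair(nodes):
    """Return the names of all nodes flagged for repair, in input order."""
    return [n.name for n in nodes if needs_repair(n)]
```

For example, a node whose host is reachable but whose Cloudera Manager agent has stopped responding is still flagged by this sketch.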

When CDP detects a node failure, a CDP administrator has the option to repair the failure manually. Note that during the repair process, the Data Lake services are not available to the attached workload clusters. Therefore, before triggering a Data Lake repair, consider stopping any jobs running on your workload clusters and restarting them after the Data Lake is restored. Audits and metadata will continue to be queued for collection through the restoration process.

When a node fails, CDP prints a node failure notification in the Event History tab for the Data Lake, marks the affected node as unhealthy in the Hardware tab, and displays a button to start the repair process at the top of the Data Lake details. You can also select the Repair icon next to a host group on the Hardware tab to choose specific nodes for repair. When your CDP administrator triggers node repair, the repair process:

  1. Detaches all non-ephemeral disks from the failed nodes.
  2. Removes the failed nodes.
  3. Provisions new nodes of the same type; no upgrades are applied.
  4. Reattaches the disks to the new nodes.
  5. Reconnects services to the external database.
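
The repair sequence above can be sketched as a small in-memory simulation. This is not how CDP implements repair; the Node class, the repair function, and the choice to reuse the failed node's name for its replacement are all assumptions made for illustration. The sketch only shows the ordering of the steps and that persistent disks and the instance type carry over to the replacement nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a Data Lake node (not a CDP type)."""
    name: str
    instance_type: str
    disks: list = field(default_factory=list)  # non-ephemeral disks only
    healthy: bool = True


def repair(cluster: dict, failed_names: list) -> list:
    """Replace failed nodes in `cluster` (name -> Node); return a log of steps."""
    log = []
    detached = {}
    # 1. Detach all non-ephemeral disks from the failed nodes.
    for name in failed_names:
        detached[name] = cluster[name].disks
        cluster[name].disks = []
        log.append(f"detached {len(detached[name])} disk(s) from {name}")
    # 2. Remove the failed nodes, remembering only their instance type.
    types = {name: cluster.pop(name).instance_type for name in failed_names}
    log.append(f"removed {len(failed_names)} node(s)")
    # 3. Provision new nodes of the same type (no upgrade is applied).
    for name in failed_names:
        cluster[name] = Node(name, types[name])  # assumption: name is reused
        log.append(f"provisioned {name} as {types[name]}")
    # 4. Reattach the preserved disks to the replacement nodes.
    for name in failed_names:
        cluster[name].disks = detached[name]
        log.append(f"reattached {len(detached[name])} disk(s) to {name}")
    # 5. Reconnect services to the external database (modeled as a log entry).
    log.append("reconnected services to external database")
    return log
```

Nodes that are not listed in failed_names are left untouched, which mirrors selecting specific nodes for repair on the Hardware tab.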