Performing disaster recovery using failover

Failover in Cloudera AI is a manual disaster recovery process that switches operations to a backup file system if the primary one fails, ensuring continued access to your workbenches. This active-passive solution relies on a CDP CLI command to move your workbench from the unavailable primary system to the replicated secondary system.

Failover is a reliability mechanism in file systems and databases that switches operations to a backup system when the primary system fails or becomes unavailable. In many systems this switch is automatic; in Cloudera AI, it is triggered manually. Cloudera AI supports a failover disaster recovery solution so that a Cloudera AI Workbench can be recovered in the event of a catastrophe.

The failover and failback process must be executed manually using a CDP CLI command.

Cloudera AI supports an active-passive disaster recovery solution that can span two file systems. If the primary AWS EFS becomes unavailable, Cloudera AI can be forced to fail over to the secondary (backup) file system. During normal operation, the primary file system is writable, while the backup file system is read-only due to the replication configuration in place.

During a failover, your workbench is moved from its existing file system to the replica file system that was created for it.

Run the following CDP CLI command to perform a failover:
cdp ml fail-over-file-system --workspace-crn [***WORKSPACE_CRN***] 
[--x-entitlements [X_ENTITLEMENTS ...]] [--delete-primary-storage] [--no-delete-primary-storage] 
[--cli-input-json CLI_INPUT_JSON] 
[--generate-cli-skeleton] [help]
For example:
cdp ml fail-over-file-system --workspace-crn crn:cdp:ml:us-west-1:csdb-ccce-4f8d-a581-830970ba9808:workspace:678a6b13-69cc-34ff-a111-f934552afabf --profile int
When you run the command, the following failover tasks take place. In these steps, the primary file system is referred to as F1 and the backup file system as F2.
  1. The workbench is suspended.
  2. Access points and mount targets are deleted from F1.
  3. The replication configuration is deleted, which makes F2 read-write.
  4. Access points and mount targets are created on the replica (F2).
  5. F2 is mounted by performing a Helm installation.
  6. The workbench is scaled up and brought back to its normal operating state.

F2 is now the primary, writable file system. In the backend, a new replica of F2 (F3) is created, so F3 becomes the new read-only backup file system.
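
The role change described here can be pictured with a toy shell function. This is only an illustration of the naming used in this topic: roles_after_failover is a hypothetical helper, it does not call the CDP CLI, and the labels F1, F2, and F3 are the placeholders from the steps, not real file system IDs.

```shell
# Toy model of the file system roles after a failover.
# Hypothetical helper for illustration only; it does not touch any real resource.
#   $1: primary before the failover (for example F1)
#   $2: replica before the failover (for example F2)
#   $3: label for the replica created after the failover (for example F3)
roles_after_failover() {
  # The old replica is promoted to primary, and a new replica is created for it.
  printf 'failed=%s primary=%s replica=%s\n' "$1" "$2" "$3"
}

roles_after_failover F1 F2 F3
# prints: failed=F1 primary=F2 replica=F3
```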

Using the CDP CLI, you can delete the corrupted (failed) file system F1 by passing the --delete-primary-storage option, which sets the deletePrimaryStorage flag to true.
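
As a minimal sketch, the choice between keeping and deleting F1 can be wrapped in a small shell helper that only prints the failover command, so that it can be reviewed before it is run. The helper name build_failover_cmd and its true/false argument are assumptions made for this example; the subcommand and options are the ones listed in the command syntax in this topic.

```shell
# Sketch: print the failover command for a workbench without running it.
# build_failover_cmd is a hypothetical helper, not part of the CDP CLI.
#   $1: workbench CRN (required)
#   $2: "true" to delete the failed primary file system (F1); default "false"
build_failover_cmd() {
  if [ -z "$1" ]; then
    echo "usage: build_failover_cmd WORKSPACE_CRN [true|false]" >&2
    return 1
  fi
  if [ "${2:-false}" = "true" ]; then
    storage_flag="--delete-primary-storage"
  else
    storage_flag="--no-delete-primary-storage"
  fi
  # Only print the command; nothing is executed here.
  printf 'cdp ml fail-over-file-system --workspace-crn %s %s\n' "$1" "$storage_flag"
}
```

For example, build_failover_cmd [***WORKSPACE_CRN***] true prints the command with --delete-primary-storage. Printing instead of executing is a deliberate choice in this sketch, because deleting the primary storage cannot be undone.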