Performing disaster recovery using failover
Failover in Cloudera AI is a manual disaster recovery process that switches operations to a backup file system if the primary one fails, ensuring continued access to your workbenches. This active-passive solution relies on a CDP CLI command to move your workbench from the unavailable primary system to the replicated secondary system.
In general, failover is a reliability mechanism in file systems and databases that switches operations to a backup system, often automatically, when the primary system fails or becomes unavailable. Cloudera AI supports failover as a disaster recovery solution so that a Cloudera AI Workbench can be recovered in the event of a catastrophe.
In Cloudera AI, failover is not automatic: both the failover and the failback process must be executed manually using a CDP CLI command.
Cloudera AI supports an active-passive disaster recovery solution that can span two file systems. If the primary AWS EFS becomes unavailable, Cloudera AI can be forced to fail over to the secondary (backup) file system. During normal operation, the primary file system is writable, while the backup file system is read-only due to the replication configuration in place.
Run the following CDP CLI command to perform a failover. The command syntax is shown first, followed by an example invocation.
cdp ml fail-over-file-system --workspace-crn [***WORKSPACE_CRN***]
[--x-entitlements [X_ENTITLEMENTS ...]] [--delete-primary-storage] [--no-delete-primary-storage]
[--cli-input-json CLI_INPUT_JSON]
[--generate-cli-skeleton] [help]
cdp ml fail-over-file-system --workspace-crn crn:cdp:ml:us-west-1:csdb-ccce-4f8d-a581-830970ba9808:workspace:678a6b13-69cc-34ff-a111-f934552afabf --profile int
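Because a failover is disruptive, it can help to review the exact command before running it. The following is a minimal sketch of a wrapper, not part of the CDP CLI: the function name `failover_cmd` and the `DRY_RUN` variable are hypothetical names chosen for illustration. It only assembles the `cdp ml fail-over-file-system` invocation shown above and, by default, prints it instead of executing it.

```shell
# Hypothetical helper (not part of the CDP CLI): builds the documented
# fail-over command and prints it unless DRY_RUN is set to 0.
failover_cmd() {
  crn="$1"            # workspace CRN (required)
  profile="${2:-}"    # optional CDP CLI profile name

  if [ -z "$crn" ]; then
    echo "error: workspace CRN is required" >&2
    return 1
  fi

  # Assemble the command as positional parameters so that quoting is preserved.
  set -- cdp ml fail-over-file-system --workspace-crn "$crn"
  if [ -n "$profile" ]; then
    set -- "$@" --profile "$profile"
  fi

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): show the command without running it.
    echo "$*"
  else
    # Real run: requires an installed and configured CDP CLI.
    "$@"
  fi
}
```

Calling `failover_cmd "<workspace CRN>" int` prints the command for review; setting `DRY_RUN=0` before the call would run it.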
The fail-over-file-system command performs the following steps, where F1 is the primary file system and F2 is its replica:
- The workbench is suspended.
- Access points and mount targets are deleted from F1.
- The replication configuration is deleted, which makes F2 read-write.
- Access points and mount targets are created on the replica (F2).
- F2 is mounted by performing a Helm installation.
- The workbench is scaled up and brought back to its normal operating state.
After the failover, F2 is the primary, writable file system. In the backend, a new replica of F2 (F3) is created, and F3 serves as the new replica file system.
Using the CDP CLI, you can delete the corrupted (failed) file system F1 by setting the deletePrimaryStorage flag to true. This appears to correspond to the --delete-primary-storage option in the command syntax.
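One way to set this flag is through the `--cli-input-json` option listed in the command syntax. The sketch below writes such an input file. The helper name `write_failover_input` is hypothetical, and the `workspaceCrn` key is an assumption based on the usual camelCase naming of CDP CLI input fields. Only `deletePrimaryStorage` is named in this document, so check the real field names with `--generate-cli-skeleton` before relying on them.

```shell
# Hypothetical helper: writes a JSON input file for --cli-input-json.
# "workspaceCrn" is an assumed key name; "deletePrimaryStorage" comes from
# the documentation above. Verify both with --generate-cli-skeleton.
write_failover_input() {
  crn="$1"       # workspace CRN
  outfile="$2"   # path of the JSON file to create

  printf '{\n  "workspaceCrn": "%s",\n  "deletePrimaryStorage": true\n}\n' \
    "$crn" > "$outfile"
}

# Example usage (left commented out because it needs a configured CDP CLI
# and would delete the failed primary file system F1):
#   write_failover_input "$WORKSPACE_CRN" failover.json
#   cdp ml fail-over-file-system --cli-input-json "$(cat failover.json)"
```

The command-line alternative is to append `--delete-primary-storage` to the fail-over command. The related `--no-delete-primary-storage` option suggests that F1 can also be kept explicitly.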