Backing up Cloudera AI Workbenches
Cloudera AI makes it easy to create machine learning projects, jobs, experiments, ML models, and applications in workbenches. The data and metadata of these artifacts are stored in different types of storage systems in the cloud.
You can back up a Cloudera AI Workbench and restore it to a new workbench later. The backup preserves all files, models, applications, and other assets in the workbench. The exception is external NFS-backed workbenches, whose files are not backed up automatically by Cloudera AI. All workbench backups can be viewed in the Workbench Backup Catalog UI.
The Backup and Restore feature lets you back up all of your data (except files in external NFS-backed workbenches) to protect your machine learning artifacts against disasters. If your Cloudera AI Workbench is backed up, you can restore the saved data into a new Cloudera AI Workbench and recover your Cloudera AI artifacts as they were at the time of the chosen backup. Administrators can take “on-demand” backups of Cloudera AI Workbenches. Core services running in the workbench are shut down during the backup process to ensure consistency in the backup data. We recommend taking backups during off-peak hours to minimize the impact on users.
The time required to back up a workbench depends on the amount of data to copy. The backup process copies data from both EBS volumes and EFS. In general, backing up EFS takes significantly longer than backing up EBS. Because backups are incremental, the first backup always takes the longest. Subsequent backups should complete faster because they build on the initial backup copy. For this reason, we recommend that Cloudera AI Workbenches be backed up regularly.
The time to back up EFS is highly dependent on the amount of data, and on the nature and number of files. It is also affected by available bandwidth in the AWS cloud backend. We have seen a first-time backup of a 600 GB EFS file system take around 10 hours. If you have much more than 600 GB on your EFS file system, the default backup timeout of 12 hours may not be long enough. In such cases, we recommend that you take your first backup with a lower timeout, such as 2 hours. The Cloudera AI Control Plane may then abort the backup when the timeout expires, but it does not cancel the underlying backup jobs. You can monitor these backup jobs on the AWS Backup console, and if all of them eventually complete successfully, you can initiate the backup operation again from the Cloudera AI Control Plane. This second backup should complete in a much shorter time, and you will have a good backup copy to restore from if necessary.
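As a rough aid for deciding whether a first backup is likely to outlast the default timeout, the figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes that backup time scales linearly with EFS size, using the single observation quoted above (about 600 GB in about 10 hours). This is a simplifying assumption: as noted, actual times also depend on the nature and number of files and on available bandwidth. The function names are illustrative only and are not part of any Cloudera AI or AWS API.

```python
# Reference observation quoted in the text: ~600 GB of EFS data took ~10 hours.
REFERENCE_GB = 600
REFERENCE_HOURS = 10
# Default backup timeout stated in the text.
DEFAULT_TIMEOUT_HOURS = 12


def estimate_first_backup_hours(efs_size_gb: float) -> float:
    """Linearly extrapolate first-backup time from the single reference point.

    Linear scaling is an assumption made for illustration; real backup
    times also depend on file count, file types, and available bandwidth.
    """
    if efs_size_gb < 0:
        raise ValueError("efs_size_gb must be non-negative")
    return efs_size_gb * REFERENCE_HOURS / REFERENCE_GB


def may_exceed_timeout(efs_size_gb: float,
                       timeout_hours: float = DEFAULT_TIMEOUT_HOURS) -> bool:
    """Return True if the estimated first-backup time is longer than the timeout."""
    return estimate_first_backup_hours(efs_size_gb) > timeout_hours


if __name__ == "__main__":
    for size_gb in (300, 600, 900):
        hours = estimate_first_backup_hours(size_gb)
        flag = "may exceed" if may_exceed_timeout(size_gb) else "within"
        print(f"{size_gb} GB: ~{hours:.1f} h ({flag} the {DEFAULT_TIMEOUT_HOURS} h default timeout)")
```

Under this assumption, file systems larger than roughly 720 GB would be estimated to exceed the 12-hour default. Treat the result as an indication only, not as a guarantee of actual backup duration.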
There is currently no restriction on the number of backups you can take, and the backup snapshots are retained indefinitely in the backup service vault of the underlying cloud platform. Cloudera AI Workbench backup details are stored in the Workbench Backup Catalog UI in the Cloudera AI Control Plane, and these entries may be listed, viewed, deleted, or restored as desired.
Restoring a backup creates a new Cloudera AI Workbench into which the restored data is automatically imported. All the projects, jobs, applications, and other artifacts that existed at the time of the backup are automatically available in the new workbench. Restoring a Cloudera AI backup provisions a new cluster, and then launches restore jobs to create storage volumes from the backup snapshots. The restore process takes longer than a regular workbench provisioning operation because of the extra work of copying data from the backup to the new storage volumes. While backups are incremental, restores are always full-copy restores. The time to restore is dominated by EFS restoration, which takes at least as long as backing up the file system. The restored workbench is always created with the latest Cloudera AI software version, which may be different from the Cloudera AI version of the original workbench that was backed up.