Backing up ML workspaces

Cloudera Machine Learning makes it easy to create machine learning projects, jobs, experiments, ML models, and applications in workspaces. The data and metadata of these artifacts are stored in different types of storage systems in the cloud.

You can backup an ML workspace, and restore it to a new workspace later. The backup preserves all files, models, applications and other assets in the workspace. All backed-up workspaces are stored in the Workspace Backups UI.

The Backup and Restore feature gives you the ability to backup all of your data to protect your machine learning artifacts against disasters. If your Cloudera Machine Learning workspace is backed up, this feature lets you restore the saved data into a new CML workspace so that you can recover your ML artifacts as they were saved in the desired backup. The Backup and Restore feature gives the administrator the ability to take “on-demand” backups of CML workspaces. Core services running in the workspace are shut down during the backup process to ensure consistency in the backup data. It is recommended that backups are taken during off-peak hours to minimize user impacts.

The time required to complete backing up a workspace depends on the amount of data to copy. The backup process copies data from both EBS volumes and EFS. In general, the time taken to backup EFS is the dominant one. Due to the incremental nature of backups, the first backup always takes the longest amount of time. Subsequent backups should complete faster as they are built on top of the initial backup copy. For this reason, we recommend that CML workspaces be backed up regularly.

The time to backup EFS is highly dependent on the amount of data, and on the nature and number of files. It is also affected by available bandwidth in the AWS cloud backend. We have seen first-time backup of a 600 GB EFS file system taking around 10 hours. If you have much more than 600 GB on your EFS file system, the default backup timeout of 12 hours may not be long enough. In such cases, we recommend you take your first backup with a lower timeout, such as 2 hours. The CML Control Plane may abort the backup due to the timeout expiry. However, the Control Plane does not cancel the underlying backup jobs. You can monitor these backup jobs on the AWS Backup console, and if all eventually complete successfully, you can initiate the backup operation again from the CML Control Plane. This should complete in a relatively shorter time, and you will have a good backup copy to restore from if necessary.

There is currently no restriction on the number of backups one can take, and the backup snapshots are retained indefinitely in the backup service vault of the underlying cloud platform. CML workspace backup details are stored in the Workspace Backups UI in the CML control plane, and these entries may be listed, viewed, deleted or restored as desired.

Restoring a backup creates a new CML workspace wherein the restored data is automatically imported. All the projects, jobs, applications, etc., that were in existence during the backup are automatically available in the new workspace. Restoring a CML backup provisions a new cluster, and then launches restore jobs to create storage volumes from the backup snapshots. The restore process takes longer than a regular workspace provisioning operation due to the extra work in copying data from backup to the new storage volumes. While backups are incremental, restores are always full-copy restores. The time to restore is dominated by EFS restoration, which takes at least as long as the time to backup the file system. The restored workspace is always created with the latest CML software version, which may be different from the CML version of the original workspace that was backed up.