Backup and Disaster Recovery for Cloudera Data Science Workbench
All application data for Cloudera Data Science Workbench, including project files and database state, is stored on the master host at /var/lib/cdsw. Given typical access patterns, it is strongly recommended that /var/lib/cdsw be stored on a dedicated SSD block device or SSD RAID configuration. Because application data is not replicated to HDFS or backed up by default, site administrators must put a backup strategy in place that meets their disaster recovery requirements.
Cloudera strongly recommends taking backups both on a regular schedule and before every upgrade, and is not responsible for any data loss.
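The storage recommendation above can be spot-checked from the master host. The following is a minimal sketch, not a Cloudera-provided tool: mount_info and is_dedicated_mount are hypothetical helper names, and the check only tells you whether the directory is its own mount point, not whether the underlying device is an SSD.

```shell
#!/usr/bin/env bash
# Sketch: report which filesystem backs a directory and whether the
# directory is itself a mount point. Helper names are hypothetical.

mount_info() {
    local path="${1:-/var/lib/cdsw}"
    # df -P prints one POSIX-format line per filesystem:
    # Filesystem 1024-blocks Used Available Capacity Mounted-on
    # Assumes neither the device name nor the mount point contains spaces.
    df -P "$path" | awk 'NR == 2 { print $1, $6 }'
}

is_dedicated_mount() {
    local path="${1:-/var/lib/cdsw}"
    local mount_point
    mount_point="$(mount_info "$path" | awk '{ print $2 }')"
    # True only when the directory is the mount point of its filesystem.
    # Pass the path without a trailing slash.
    [ "$mount_point" = "$path" ]
}

# Example (run on the master host):
#   mount_info /var/lib/cdsw
#   is_dedicated_mount /var/lib/cdsw && echo "dedicated mount"
```

If the directory turns out to share a filesystem with the root volume, the device reported by mount_info is the one to examine further, for example with lsblk -d -o NAME,ROTA, where a ROTA value of 0 indicates a non-rotational device.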
Creating a Backup
To stop Cloudera Data Science Workbench:
- Cloudera Data Science Workbench 1.4.2 or lower
Run the script on your master host and stop Cloudera Data Science Workbench (instructions below) only when instructed to do so by the script. Then proceed with step 2 of this process.
- Cloudera Data Science Workbench 1.4.3 or higher
Depending on your deployment, use one of the following sets of instructions to stop the application.
- CSD - Log in to Cloudera Manager. On the tab, click to the right of the CDSW service and select Stop from the dropdown. Wait for the action to complete.
- RPM - Run the following command on the master host:
- After stopping CDSW, wait 2-5 minutes (depending on your disk speed) before running the tar command below, so that all CDSW data has been written to disk. Otherwise the archive may not capture all recent changes.
To create the backup, run the following command on the master host:
tar cvzf cdsw.tar.gz /var/lib/cdsw/*
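The wait-then-archive sequence above can be wrapped in a small script. This is a sketch under stated assumptions: backup_cdsw is a hypothetical helper (not part of the cdsw command line), the timestamped file name is an illustrative choice, and the script must only be run after Cloudera Data Science Workbench has been stopped.

```shell
#!/usr/bin/env bash
# Sketch: create a timestamped backup of the CDSW data directory and
# verify that the archive is readable. backup_cdsw is a hypothetical helper.

backup_cdsw() {
    local src_dir="${1:-/var/lib/cdsw}"   # data directory to archive
    local dest_dir="${2:-.}"              # directory that receives the archive
    local settle_secs="${3:-300}"         # wait after stopping CDSW (2-5 minutes)
    local archive
    archive="${dest_dir}/cdsw-$(date +%Y%m%d-%H%M%S).tar.gz"

    [ -d "$src_dir" ] || { echo "missing directory: $src_dir" >&2; return 1; }

    # Give pending writes time to reach the disk, then flush kernel buffers.
    sleep "$settle_secs"
    sync

    # Same archive layout as the documented command, without verbose output.
    tar -czf "$archive" "$src_dir"/* || return 1

    # List the archive once to confirm it can be read back end to end.
    tar -tzf "$archive" > /dev/null || {
        echo "archive failed verification: $archive" >&2
        return 1
    }

    echo "$archive"
}

# Example (run on the master host after stopping CDSW):
#   backup_cdsw /var/lib/cdsw /mnt/backups
```

The function prints the archive path on success, which makes it easy to copy the file to off-host storage in a follow-up step. Like the documented command, the /var/lib/cdsw/* glob does not match hidden files at the top level of the directory.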
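(Optional) If needed, the following command can be used to unpack the tar bundle. Note that GNU tar strips the leading / when it creates the archive, so members are stored as var/lib/cdsw/... and this command unpacks them beneath /var/lib/cdsw/var/lib/cdsw. To restore the files to their original location, pass -C / instead.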
tar xvzf cdsw.tar.gz -C /var/lib/cdsw
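Before unpacking over a live data directory, it can be useful to extract the bundle into a scratch location and inspect the resulting paths. The sketch below assumes a hypothetical helper named preview_restore and a staging directory of your choosing; it never writes to /var/lib/cdsw.

```shell
#!/usr/bin/env bash
# Sketch: unpack a CDSW backup into a scratch directory for inspection.
# preview_restore is a hypothetical helper name.

preview_restore() {
    local archive="$1"       # path to the tar bundle, e.g. cdsw.tar.gz
    local staging_dir="$2"   # scratch directory that receives the files

    [ -f "$archive" ] || { echo "missing archive: $archive" >&2; return 1; }
    mkdir -p "$staging_dir" || return 1

    # Extract beneath the staging directory only.
    tar -xzf "$archive" -C "$staging_dir" || return 1

    # Print the extracted files, relative to the staging directory.
    (cd "$staging_dir" && find . -type f | sort)
}

# Example:
#   preview_restore cdsw.tar.gz /tmp/cdsw-restore-check
```

The printed paths show how the archive members are laid out (for a bundle created with the command above, they begin with ./var/lib/cdsw/), which indicates the directory to pass to -C for an in-place restore.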