Backup and Disaster Recovery for Cloudera Data Science Workbench
All application data for Cloudera Data Science Workbench, including project files and database state, is stored on the master host at /var/lib/cdsw. Given typical access patterns, it is strongly recommended that /var/lib/cdsw be stored on a dedicated SSD block device or SSD RAID configuration. Because application data is not replicated to HDFS or backed up by default, site administrators must put a backup strategy in place that meets their disaster recovery requirements.
Cloudera strongly recommends taking backups both regularly and before upgrades, and is not responsible for any data loss.
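To check the recommendation above that /var/lib/cdsw sit on a dedicated device, the following is a minimal sketch that uses POSIX df to report which device and mount point back a directory. The function names backing_mount and is_own_mount are hypothetical helpers, not part of Cloudera Data Science Workbench, and the sketch assumes that neither the device name nor the mount point contains spaces.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CDSW) for inspecting where a directory lives.

# Print the device and mount point that back a directory.
# df -P guarantees one line per filesystem; line 2 holds the data.
backing_mount() {
  df -P "$1" | awk 'NR == 2 { print $1, $6 }'
}

# Succeed only if the directory is itself a mount point, which suggests
# (but does not prove) that it lives on its own block device.
# Pass the path without a trailing slash so the comparison matches.
is_own_mount() {
  [ "$(df -P "$1" | awk 'NR == 2 { print $6 }')" = "$1" ]
}

# Example for the documented data directory:
#   backing_mount /var/lib/cdsw
#   is_own_mount /var/lib/cdsw && echo "dedicated mount" || echo "shares a filesystem"
```

On Linux hosts, whether the reported device is an SSD can then be checked with lsblk -d -o NAME,ROTA, where a ROTA value of 0 indicates a non-rotational device.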
Creating a Backup
- Cloudera Data Science Workbench 1.4.2 or lower: Do not stop or restart Cloudera Data Science Workbench without using the cdsw_protect_stop_restart.sh script. The script helps avoid the data loss issue detailed in TSB-346. Run it on your master host, and stop Cloudera Data Science Workbench (instructions below) only when the script instructs you to do so. Then proceed to the backup step (the tar command below).
- Cloudera Data Science Workbench 1.4.3 or higher: Depending on your deployment, use one of the following sets of instructions to stop the application.
To stop Cloudera Data Science Workbench:
- CSD - Log in to Cloudera Manager. Locate the CDSW service, click the dropdown menu to its right, and select Stop. Wait for the action to complete.
- RPM - Run the following command on the master host:
cdsw stop
- After stopping CDSW, wait 2-5 minutes (depending on your disk speed) before running the tar command below, so that all CDSW data is fully written to disk. Otherwise, the tar command may not capture all recent changes.
- To create the backup, run the following command on the master host:
tar cvzf cdsw.tar.gz /var/lib/cdsw/*
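- (Optional) If needed, use the following command to unpack the tar bundle. Because tar strips the leading / from member names when archiving absolute paths, the bundle stores paths as var/lib/cdsw/..., so extract relative to the root directory:
tar xvzf cdsw.tar.gz -C /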
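The backup step above can be wrapped in a small script. The following is a minimal sketch, not Cloudera's tooling: the function name backup_dir, the timestamped file name, and the /backups destination are illustrative assumptions. It archives the directory itself rather than the /var/lib/cdsw/* glob (so top-level dotfiles are included), then reads the archive back to catch a truncated or corrupt file. Run it only after the application has been stopped and the wait period has elapsed.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: archive a directory into a timestamped tarball
# under a destination directory, verify it, and print the archive path.
backup_dir() {
  bk_src="$1"
  bk_dest="$2"
  bk_archive="${bk_dest}/cdsw-$(date +%Y%m%d-%H%M%S).tar.gz"

  # Archive the directory by its absolute path. tar drops the leading "/"
  # from member names and reports that on stderr.
  tar -czf "$bk_archive" "$bk_src"

  # Listing every member forces tar to read the whole archive, so a
  # truncated or corrupt file makes this command (and the script) fail.
  tar -tzf "$bk_archive" > /dev/null

  printf '%s\n' "$bk_archive"
}

# Example, assuming /backups exists and has enough free space:
#   backup_dir /var/lib/cdsw /backups
```

Archives produced this way store members without the leading slash (for example var/lib/cdsw/...), so listing them with tar -tzf before extracting shows where files will land relative to the directory passed to -C.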