Backup and Disaster Recovery for Cloudera Data Science Workbench

All application data for Cloudera Data Science Workbench, including project files and database state, is stored on the master host at /var/lib/cdsw. Given typical access patterns, it is strongly recommended that /var/lib/cdsw be stored on a dedicated SSD block device or SSD RAID configuration. Because application data is not replicated to HDFS or backed up by default, site administrators must put a backup strategy in place to meet their disaster recovery requirements.

Cloudera strongly recommends taking backups both on a regular schedule and before every upgrade. Cloudera is not responsible for any data loss.

Creating a Backup

Cloudera Data Science Workbench 1.4.2 or lower

Do not stop or restart Cloudera Data Science Workbench without using the cdsw_protect_stop_restart.sh script. This helps avoid the data loss issue detailed in TSB-346.

Run the script on your master host, and stop Cloudera Data Science Workbench (instructions below) only when the script instructs you to do so. Then continue with the wait period and the tar backup command described below.

Cloudera Data Science Workbench 1.4.3 or higher

Depending on your deployment, use one of the following sets of instructions to stop Cloudera Data Science Workbench:
- CSD - Log in to Cloudera Manager. On the Home > Status tab, open the dropdown menu to the right of the CDSW service and select Stop. Wait for the action to complete.
  OR
- RPM - Run the following command on the master host:
```
cdsw stop
```
After stopping CDSW, wait 2-5 minutes (depending on your disk speed) before running the tar command below, so that all CDSW data is written to disk. Otherwise, the archive may not capture the most recent changes.
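As a rough complement to the fixed wait, the check below polls the data directory until no file has been modified in the last minute, then flushes filesystem buffers with sync. This is only a sketch: wait_for_quiet_dir and its parameters are hypothetical names, not part of CDSW, and it assumes a find that supports -mmin and -quit (for example GNU find).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CDSW command): wait until no file under a
# directory has been modified during the last minute, then flush to disk.
wait_for_quiet_dir() {
    local dir="${1:-/var/lib/cdsw}"   # directory to watch
    local max_tries="${2:-10}"        # assumed number of polls
    local pause_secs="${3:-30}"       # assumed pause between polls
    local try=1
    local recent
    while [ "$try" -le "$max_tries" ]; do
        # Print at most one regular file modified less than 1 minute ago.
        recent="$(find "$dir" -type f -mmin -1 -print -quit 2>/dev/null)"
        if [ -z "$recent" ]; then
            sync        # flush filesystem buffers to disk
            return 0
        fi
        sleep "$pause_secs"
        try=$((try + 1))
    done
    return 1            # still seeing recent writes after all polls
}

# Example (run on the master host after stopping CDSW):
#   wait_for_quiet_dir /var/lib/cdsw 10 30 && echo "safe to run tar"
```

A quiet directory does not prove that every write has completed, so treat this check as an addition to the recommended wait rather than a replacement for it.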
To create the backup, run the following command on the master host:
```
tar cvzf cdsw.tar.gz /var/lib/cdsw/*
```
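For regular backups, it can help to give each archive a timestamped name and to confirm that it is readable before relying on it. The sketch below wraps the same kind of tar invocation as above. The function name backup_cdsw, the default destination /backup, and the archive naming scheme are all assumptions made for illustration, and sha256sum is assumed to be available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CDSW command): archive a directory under a
# timestamped name, verify the archive is readable, and record a checksum.
backup_cdsw() {
    local src_dir="${1:-/var/lib/cdsw}"   # directory to archive
    local dest_dir="${2:-/backup}"        # assumed backup location
    local stamp
    local archive
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="${dest_dir}/cdsw-${stamp}.tar.gz"

    mkdir -p "$dest_dir" || return 1
    # tar prints a notice on stderr when it strips the leading "/" from
    # member names; that notice is expected and is not an error.
    tar czf "$archive" "$src_dir" || return 1
    # Listing the archive forces tar to read it end to end.
    tar tzf "$archive" > /dev/null || return 1
    # Store a checksum next to the archive for later integrity checks.
    sha256sum "$archive" > "${archive}.sha256" || return 1
    echo "$archive"
}

# Example (run as root on the master host, with CDSW stopped):
#   backup_cdsw /var/lib/cdsw /mnt/backups
```

The checksum file can be checked later with sha256sum -c, for example after copying the archive to another host.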
Note: The /var/lib/cdsw directory contains all persistent application data, including project files, database state, configuration, and image details. Using tar to create the backup preserves important file metadata such as file ownership, which is required for tasks such as migrating CDSW to a new host or another cluster. Other methods of copying or saving files might not preserve this metadata, so tar is the recommended method.
(Optional) If needed, the following command can be used to unpack the tar bundle.
```
# tar strips the leading "/" when archiving, so members are stored as var/lib/cdsw/...; extract relative to /.
tar xvzf cdsw.tar.gz -C /
```
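Before unpacking a backup over a live data directory, it may be worth extracting it into a scratch location first and inspecting the result. The following is a minimal sketch of that idea. restore_to_staging and the staging path are hypothetical, not part of CDSW. Note that tar restores the original file ownership only when it is run as root.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CDSW command): unpack a backup archive into a
# staging directory so its contents can be inspected before a real restore.
restore_to_staging() {
    local archive="$1"    # path to the .tar.gz backup
    local staging="$2"    # empty scratch directory to extract into
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$staging" || return 1
    # -p keeps the recorded permissions; ownership is restored only when
    # this runs as root.
    tar xzpf "$archive" -C "$staging" || return 1
    echo "$staging"
}

# Example:
#   restore_to_staging cdsw.tar.gz /tmp/cdsw-restore-check
#   ls -l /tmp/cdsw-restore-check/var/lib/cdsw
```

Because tar stores member names without the leading slash, the extracted tree appears under var/lib/cdsw inside the staging directory.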
