Backup and Disaster Recovery for Cloudera Data Science Workbench

All application data for Cloudera Data Science Workbench, including project files and database state, is stored on the master host at /var/lib/cdsw. Given typical access patterns, it is strongly recommended that /var/lib/cdsw be stored on a dedicated SSD block device or SSD RAID configuration. Because application data is not replicated to HDFS or backed up by default, site administrators must implement a backup strategy that meets their disaster recovery requirements.

Cloudera strongly recommends taking regular backups as well as a backup before every upgrade, and is not responsible for any data loss.

Creating a Backup

  1. Stop Cloudera Data Science Workbench, using the instructions for your version.

    Cloudera Data Science Workbench 1.4.2 or lower

    Do not stop or restart Cloudera Data Science Workbench without using the cdsw_protect_stop_restart.sh script. This is to help avoid the data loss issue detailed in TSB-346.

    Run the script on your master host and stop Cloudera Data Science Workbench (instructions below) only when instructed to do so by the script. Then proceed with step 2 of this process.

    Cloudera Data Science Workbench 1.4.3 or higher

    Depending on your deployment, use one of the following sets of instructions to stop the application.

    To stop Cloudera Data Science Workbench:
    • CSD - Log in to Cloudera Manager. On the Home > Status tab, click the dropdown arrow to the right of the CDSW service and select Stop from the menu. Wait for the action to complete.

      OR

    • RPM - Run the following command on the master host:
      cdsw stop
  2. After stopping CDSW, and before running the tar command in step 3, wait 2-5 minutes (depending on your disk speed) so that all CDSW data is written to disk. Otherwise, the tar command may not capture the most recent changes.
  3. To create the backup, run the following command on the master host:
    tar -cvzf cdsw.tar.gz -C /var/lib/cdsw/ .
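
As a hedged illustration, steps 2 and 3 above could be wrapped in a small shell function along the following lines. The function name backup_cdsw_data, the timestamped file name, and the destination directory argument are assumptions introduced for this sketch; they are not part of the cdsw command-line tool. The function only archives and verifies. Stopping Cloudera Data Science Workbench and waiting for writes to finish remain manual steps, as described above.

```shell
#!/usr/bin/env bash
# Sketch only: backup_cdsw_data is a hypothetical helper, not a cdsw command.
# Run it only after CDSW has been stopped and the 2-5 minute wait has elapsed.

backup_cdsw_data() {
    local data_dir="${1:-/var/lib/cdsw}"   # directory to archive
    local dest_dir="${2:-.}"               # where the tarball is written
    local stamp archive

    [ -d "$data_dir" ] || { echo "missing data directory: $data_dir" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "missing destination: $dest_dir" >&2; return 1; }

    # Timestamped name (an assumption of this sketch) so older backups are kept.
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="${dest_dir}/cdsw-${stamp}.tar.gz"

    # Ask the kernel to flush buffered writes; this complements the wait
    # described in step 2 and does not replace it.
    sync

    # Same tar invocation as step 3, without -v to keep the output short.
    tar -czf "$archive" -C "$data_dir" . || return 1

    # Listing the archive reads it end to end, which catches a truncated
    # or corrupt tarball before it is relied on for recovery.
    tar -tzf "$archive" > /dev/null || return 1

    # Print the archive path so callers can record or copy it.
    echo "$archive"
}
```

For example, an administrator might run backup_cdsw_data /var/lib/cdsw /mnt/backups, where /mnt/backups is a hypothetical destination on a separate volume. Writing the archive outside /var/lib/cdsw keeps tar from trying to archive its own output.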

Restoring from a Backup

You can restore across versions. For example, you can restore a tarball created on CDSW 1.5 to a deployment running version 1.6. To restore from a backup:

  1. Stop Cloudera Data Science Workbench.
    • CSD - Log in to Cloudera Manager. On the Home > Status tab, click the dropdown arrow to the right of the CDSW service and select Stop from the menu. Wait for the action to complete.

      OR

    • RPM - Run the following command on the master host:
      cdsw stop
  2. Restore the tarball to /var/lib/cdsw by using the following command:
    tar -xvzf cdsw.tar.gz -C /var/lib/cdsw/
  3. Start Cloudera Data Science Workbench.
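
The extraction in step 2 could be guarded with a quick integrity check, sketched below under stated assumptions. The function name restore_cdsw_data is hypothetical and is not part of the cdsw command-line tool. The sketch assumes the application has already been stopped as in step 1.

```shell
#!/usr/bin/env bash
# Sketch only: restore_cdsw_data is a hypothetical helper, not a cdsw command.
# Run it only while CDSW is stopped.

restore_cdsw_data() {
    local archive="$1"                       # path to the backup tarball
    local target_dir="${2:-/var/lib/cdsw}"   # directory to restore into

    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    [ -d "$target_dir" ] || { echo "target directory not found: $target_dir" >&2; return 1; }

    # Read the whole archive first so a corrupt tarball is rejected
    # before anything under the target directory is modified.
    tar -tzf "$archive" > /dev/null || { echo "archive is unreadable: $archive" >&2; return 1; }

    # Same tar invocation as step 2, without -v to keep the output short.
    # Note: tar overwrites files that are in the archive but does not
    # delete files that exist only in the target directory.
    tar -xzf "$archive" -C "$target_dir"
}
```

A call such as restore_cdsw_data cdsw.tar.gz restores into /var/lib/cdsw by default. Passing a second argument, for instance a scratch directory, is one way to inspect a backup's contents without touching live data.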