Migrating from an RPM-based Deployment to the Latest 1.5.x CSD

  1. Before you begin the migration process, make sure you read the Cloudera Data Science Workbench Release Notes relevant to the version you are migrating to/from.

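  2. Save a backup of the Cloudera Data Science Workbench configuration file located at /etc/cdsw/config/cdsw.conf. Then stop Cloudera Data Science Workbench by following whichever of the two options below matches the version you are upgrading from.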
    • (Required for Upgrades from CDSW 1.4.2 or lower) Safely stop Cloudera Data Science Workbench. To avoid running into the data loss issue described in TSB-346, run the cdsw_protect_stop_restart.sh script on the master host and follow the sequence of steps as instructed by the script.

      The script will first back up your project files to the specified target folder. It will then temporarily move your project files aside to protect against the data loss condition. At that point, it is safe to stop Cloudera Data Science Workbench. To stop Cloudera Data Science Workbench, run the following command on all Cloudera Data Science Workbench hosts (master and workers):
      cdsw reset

      After Cloudera Data Science Workbench has stopped, press enter to continue running the script as instructed. It will then move your project files back into place.


    • (Upgrading from CDSW 1.4.3 or higher) Run the following command on all Cloudera Data Science Workbench hosts (master and workers) to stop Cloudera Data Science Workbench.
      cdsw reset
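
      On a cluster with several hosts, the stop command has to be repeated on each one. The following is a minimal sketch, not part of the product: stop_cdsw_on_hosts is a hypothetical helper, the host names in the usage line are placeholders, and it assumes SSH access as a user who is allowed to run cdsw. If you are upgrading from CDSW 1.4.2 or lower, use it only at the point where the cdsw_protect_stop_restart.sh script says it is safe to stop. By default the helper only prints the commands it would run.

```shell
# Hypothetical helper: print (default) or run "cdsw reset" on each host.
# Set DRY_RUN=0 to execute over SSH instead of printing.
stop_cdsw_on_hosts() {
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run: show the command without contacting the host.
      echo "ssh $host cdsw reset"
    else
      ssh "$host" cdsw reset
    fi
  done
}
```

      Example (placeholder host names): stop_cdsw_on_hosts cdsw-master.example.com cdsw-worker-1.example.com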
  3. (Strongly Recommended) On the master host, back up all your application data that is stored in the /var/lib/cdsw directory.
    To create the backup, run the following command on the master host:
    tar -cvzf cdsw.tar.gz -C /var/lib/cdsw/ .
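
    Before uninstalling anything, it can be worth confirming that the archive is readable. The following is a minimal sketch: verify_backup is a hypothetical helper name, and the check only confirms that tar can list at least one entry. It does not compare the archive against /var/lib/cdsw.

```shell
# Hypothetical helper: succeed only if the gzip'd tar archive can be listed
# and contains at least one entry.
verify_backup() {
  archive="$1"
  # A corrupt or missing archive lists nothing, so the count stays at 0.
  count=$(tar -tzf "$archive" 2>/dev/null | wc -l | tr -d ' ')
  [ "$count" -gt 0 ]
}
```

    Example: verify_backup cdsw.tar.gz && echo "backup archive is readable"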
  4. Save a backup of the Cloudera Data Science Workbench configuration file at /etc/cdsw/config/cdsw.conf.
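
    One way to take the configuration backup is sketched below. This is an illustration under assumptions: backup_conf is a hypothetical helper, the destination defaults to your home directory, and the timestamped file name is only a suggested convention.

```shell
# Hypothetical helper: copy cdsw.conf to a timestamped backup file,
# confirm the copy is identical, and print the backup path.
backup_conf() {
  src="${1:-/etc/cdsw/config/cdsw.conf}"
  dest_dir="${2:-$HOME}"
  dest="$dest_dir/cdsw.conf.$(date +%Y%m%d%H%M%S).bak"
  # -p preserves mode and timestamps; cmp -s verifies the copy byte for byte.
  cp -p "$src" "$dest" && cmp -s "$src" "$dest" && echo "$dest"
}
```

    Example: backup_conf /etc/cdsw/config/cdsw.conf /root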
  5. (Required for Upgrades from CDSW 1.4.0 - RedHat only) Cloudera Data Science Workbench 1.4.2 (and higher) includes a fix for a slab leak issue found in RedHat kernels. To have this fix go into effect, RedHat users must reboot all Cloudera Data Science Workbench hosts before proceeding with an upgrade from CDSW 1.4.0.

    As a precaution, consult your cluster/IT administrator before you start rebooting hosts.

  6. Uninstall the previous release of Cloudera Data Science Workbench. Perform this step on the master host, as well as all the worker hosts.
    yum remove cloudera-data-science-workbench 
  7. Install the latest version of Cloudera Data Science Workbench using the CSD and parcel. Note that when you are configuring role assignments for the Cloudera Data Science Workbench service, the Master role must be assigned to the same host that was running as master prior to the upgrade.

    For installation instructions, see Installing Cloudera Data Science Workbench 1.5.x Using Cloudera Manager. If the wildcard DNS domain and block devices are already set up, you might be able to skip the first few steps.

  8. Use the backup copy of cdsw.conf created in Step 2 to recreate its settings in Cloudera Manager by configuring the corresponding properties under the Cloudera Data Science Workbench service.
    1. Log into the Cloudera Manager Admin Console.
    2. Go to the Cloudera Data Science Workbench service.
    3. Click the Configuration tab.
    4. The following table lists all the cdsw.conf properties and their corresponding Cloudera Manager properties (in bold). Use the search box to bring up the properties you want to modify.
    5. Click Save Changes.
    cdsw.conf Property Corresponding Cloudera Manager Property and Description


    Enable TLS: Enable and enforce HTTPS (TLS/SSL) access to the web application (optional). Both internal and external termination are supported. To enable internal termination, you must also set the TLS Certificate for Internal Termination and TLS Key for Internal Termination parameters. If these parameters are not set, terminate TLS using an external proxy.

    For more details on TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.



    TLS Certificate for Internal Termination, TLS Key for Internal Termination

    Complete path to the certificate and private key (in PEM format) to be used for internal TLS termination. Set these parameters only if you are not terminating TLS externally. You must also set the Enable TLS property to enable and enforce termination. The certificate must include both DOMAIN and *.DOMAIN as hostnames.

    Self-signed certificates are not supported unless they are fully trusted by clients. Manually accepting an invalid certificate can cause connection failures for unknown subdomains. For details on certificate requirements and enabling TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.



    HTTP Proxy, HTTPS Proxy

    If your deployment is behind an HTTP or HTTPS proxy, set the respective HTTP Proxy or HTTPS Proxy property to the hostname of the proxy you are using.

    If you are using an intermediate proxy such as Cntlm to handle NTLM authentication, add the Cntlm proxy address to the HTTP Proxy or HTTPS Proxy fields. That is, either http://localhost:3128 or https://localhost:3128 respectively.

    If the proxy server uses TLS encryption to handle connection requests, you will need to add the proxy's root CA certificate to your host's store of trusted certificates. This is because proxy servers typically sign their server certificate with their own root certificate. Therefore, any connection attempts will fail until the Cloudera Data Science Workbench host trusts the proxy's root CA certificate. If you do not have access to your proxy's root certificate, contact your Network / IT administrator.

    To enable trust, copy the proxy's root certificate to the trusted CA certificate store (ca-trust) on the Cloudera Data Science Workbench host.
    cp /tmp/<proxy-root-certificate>.crt /etc/pki/ca-trust/source/anchors/
    Use the following command to rebuild the trusted certificate store.
    update-ca-trust extract


    SOCKS Proxy: If a SOCKS proxy is in use, set this parameter to socks5://<host>:<port>/.


    No Proxy: Comma-separated list of hostnames that should bypass the proxy.

    Starting with version 1.4, if you have defined a proxy in the HTTP_PROXY, HTTPS_PROXY, or ALL_PROXY properties, Cloudera Data Science Workbench automatically appends a list of IP addresses to the NO_PROXY configuration. This appended list is the minimum required configuration for this field.

    The No Proxy list itself includes localhost and any private Docker registries and HTTP services inside the firewall that Cloudera Data Science Workbench users might want to access from the engines.



    Enable GPU Support: When this property is enabled, and the NVIDIA GPU Driver Library Path parameter is set, the GPUs installed on Cloudera Data Science Workbench hosts will be available for use in its workloads. By default, this parameter is disabled.

    For instructions on how to enable GPU-based workloads on Cloudera Data Science Workbench, see Using NVIDIA GPUs for Cloudera Data Science Workbench Projects.


    NVIDIA GPU Driver Library Path: Complete path to the NVIDIA driver libraries. For instructions on how to create this directory, see Enable Docker NVIDIA Volumes on GPU Hosts.
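
    To see which settings in the backed-up cdsw.conf actually need to be recreated in Cloudera Manager, you can list only the properties that have a value. The following is a minimal sketch: list_conf_settings is a hypothetical helper, and it assumes that cdsw.conf consists of shell-style KEY="value" lines with # comments. Adjust it if your file is laid out differently.

```shell
# Hypothetical helper: print KEY=VALUE for every non-empty assignment
# in a shell-style configuration file.
list_conf_settings() {
  awk -F= '
    /^[[:space:]]*#/ { next }               # skip comment lines
    /^[A-Z0-9_]+=/ {
      key = $1
      val = substr($0, length(key) + 2)     # everything after the first equals sign
      gsub(/^"|"$/, "", val)                # strip one pair of surrounding quotes
      if (val != "") print key "=" val      # ignore properties left empty
    }
  ' "$1"
}
```

    Example: list_conf_settings /root/cdsw.conf.bak (the path is a placeholder for wherever you saved the backup). Each line of output names a property to look up in the table above.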

  9. Cloudera Manager will prompt you to restart the service if needed.

  10. Upgrade Projects to Use the Latest Base Engine Images

    If the release you have just upgraded to includes a new version of the base engine image (see release notes), you will need to manually configure existing projects to use the new engine. Cloudera recommends you do so to take advantage of any new features and bug fixes included in the newly released engine.

    To upgrade a project to the new engine, go to the project's Settings > Engine page and select the new engine from the dropdown. If any of your projects are using custom extended engines, you will need to modify them to use the new base engine image.

    Note that this step is required if you have upgraded to running Cloudera Data Science Workbench on CDH 6.

    The base engine image you use must be compatible with the version of CDH you are running. This is especially important if you are running workloads on Spark. Older base engines (v6 and lower) cannot support the latest versions of CDH 6, because these engines were configured to point to the Spark 2 parcel. On CDH 6 clusters, Spark is packaged as part of CDH 6 and the separate add-on Spark 2 parcel is no longer supported. If you want to use Spark on CDH 6, you must upgrade your projects to base engine 7 (or higher).

    CDSW Base Engine Compatibility for Spark Workloads on CDH 5 and CDH 6

    Base Engine Versions          CDH 5   CDH 6
    Base engines 6 (and lower)    Yes     No
    Base engines 7 (and higher)   Yes     Yes
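
    The compatibility table can be expressed as a small check, for example when auditing which projects still need a newer engine. This is a sketch only: engine_supports_cdh is a hypothetical helper that encodes the table above for Spark workloads and knows nothing about any other compatibility constraint.

```shell
# Hypothetical helper encoding the table above.
# Usage: engine_supports_cdh ENGINE_VERSION CDH_MAJOR_VERSION
# Returns 0 if compatible, 1 if not, 2 if the CDH version is not in the table.
engine_supports_cdh() {
  engine="$1"
  cdh_major="$2"
  case "$cdh_major" in
    5) return 0 ;;               # every base engine is listed as compatible with CDH 5
    6) [ "$engine" -ge 7 ] ;;    # CDH 6 requires base engine 7 or higher
    *) return 2 ;;               # not covered by the table
  esac
}
```

    Example: engine_supports_cdh 6 6 || echo "upgrade this project to base engine 7 or higher"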