Installing and Upgrading Cloudera Data Science Workbench 1.4.x Using Cloudera Manager

This topic describes how to install and upgrade Cloudera Data Science Workbench using Cloudera Manager.

Installing Cloudera Data Science Workbench 1.4.x Using Cloudera Manager

Prerequisites

Before you begin installing Cloudera Data Science Workbench, make sure you have completed the steps to secure your hosts, set up DNS subdomains, and configure block devices.

Install CDS 2.x Powered by Apache Spark

If you have not already done so, install and configure the Cloudera Distribution of Apache Spark 2 parcel and CSD. For instructions, see Installing CDS 2.x Powered by Apache Spark.

To be able to use Spark 2, each user must have their own home directory in HDFS (/user/<username>). If you sign in to Hue first, these directories are created for you automatically. Alternatively, cluster administrators can create them with the following commands:
hdfs dfs -mkdir /user/<username>
hdfs dfs -chown <username>:<username> /user/<username>
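
When several accounts need home directories, the two commands above can be wrapped in a small helper. The following is a minimal sketch, not Cloudera tooling: the function name hdfs_home_cmds and the sample usernames are illustrative assumptions. It only prints the commands so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Print the HDFS commands that create a home directory for each given user.
# hdfs_home_cmds is a hypothetical helper name; nothing is executed here.
hdfs_home_cmds() {
  local user
  for user in "$@"; do
    echo "hdfs dfs -mkdir /user/${user}"
    echo "hdfs dfs -chown ${user}:${user} /user/${user}"
  done
}

# Example with placeholder usernames.
hdfs_home_cmds alice bob
```

After reviewing the output, it can be piped to a shell running as an account that is allowed to write under /user in HDFS.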

Configure JAVA_HOME

On CSD-based deployments, Cloudera Manager automatically detects the path and version of Java installed on Cloudera Data Science Workbench gateway hosts. You do not need to explicitly set the value for JAVA_HOME unless you want to use a custom location, use JRE, or in the case of Spark 2, force Cloudera Manager to use JDK 1.8 as explained below.

Setting a value for JAVA_HOME - The value for JAVA_HOME depends on whether you are using JDK or JRE. For example, if you are using JDK 1.8.0_162, set JAVA_HOME to /usr/java/jdk1.8.0_162. If you are only using JRE, set it to /usr/java/jdk1.8.0_162/jre.

Issues with Spark 2.2 and higher - Spark 2.2 (and higher) requires JDK 1.8. However, if a host has both JDK 1.7 and JDK 1.8 installed, Cloudera Manager might choose to use JDK 1.7 over JDK 1.8. If you are using Spark 2.2 (or higher), this will create a problem during the first run of the service because Spark will not work with JDK 1.7. To work around this, explicitly configure Cloudera Manager to use JDK 1.8 on the gateway hosts that are running Cloudera Data Science Workbench.

For instructions on how to set JAVA_HOME, see Configuring a Custom Java Home Location in Cloudera Manager.

To upgrade the whole CDH cluster to JDK 1.8, see Upgrading to Oracle JDK 1.8.
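
To find the value to enter for JAVA_HOME on a host that has more than one JDK, a helper along the following lines may be useful. This is a sketch under stated assumptions: the function names pick_jdk18 and java_home_for are hypothetical, and JDKs are assumed to live under /usr/java with directory names such as jdk1.8.0_162, as in the example paths above. The value itself is still set in Cloudera Manager.

```shell
#!/usr/bin/env bash
# Given candidate JDK install directories, print the first one whose name
# looks like a JDK 1.8 install (jdk1.8.*). Returns non-zero if none match.
pick_jdk18() {
  local dir
  for dir in "$@"; do
    case "$(basename "$dir")" in
      jdk1.8.*) echo "$dir"; return 0 ;;
    esac
  done
  return 1
}

# Print the JAVA_HOME value for a JDK directory. Pass "jre" as the second
# argument when only the JRE is used (it lives in the jre/ subdirectory).
java_home_for() {
  if [ "${2:-jdk}" = "jre" ]; then
    echo "$1/jre"
  else
    echo "$1"
  fi
}

# Example: look under /usr/java (an assumed location) and print a candidate.
if jdk="$(pick_jdk18 /usr/java/jdk*)"; then
  java_home_for "$jdk"
fi
```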

Download and Install the Cloudera Data Science Workbench CSD

CDSW 1.4.x is no longer available for download. Refer to the CDSW documentation for information on supported versions.

Install the Cloudera Data Science Workbench Parcel

CDSW 1.4.x is no longer available for installation. Refer to the CDSW documentation for information on supported versions.

Add the Cloudera Data Science Workbench Service

To add the Cloudera Data Science Workbench service to your cluster:

  1. Log in to the Cloudera Manager Admin Console.
  2. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Add a Service to launch the wizard. A list of services is displayed.
  3. Select the Cloudera Data Science Workbench service and click Continue.
  4. Assign the Master and Worker roles to the gateway hosts. You must assign the Cloudera Data Science Workbench Master role to one gateway host and, optionally, the Worker role to one or more gateway hosts.

    Other Cloudera Data Science Workbench Role Groups - In addition to Master and Worker, there are two more role groups that fall under the Cloudera Data Science Workbench service: the Docker Daemon role and the Application role.
    • The Docker Daemon role must be assigned to every Cloudera Data Science Workbench gateway host. On First Run, Cloudera Manager will automatically assign this role to each Cloudera Data Science Workbench gateway host. However, if any more hosts are added or reassigned to Cloudera Data Science Workbench, you must explicitly assign the Docker Daemon role to them.

    • On First Run, Cloudera Manager will assign the Application role to the host running the Cloudera Data Science Workbench Master role. The Application role is always assigned to the same host as the Master. Consequently, this role must never be assigned to a Worker host.
  5. Configure the following parameters and click Continue.
    Properties Description

    Cloudera Data Science Workbench Domain

    DNS domain configured to point to the master node.

    If the previously configured DNS subdomain entries are cdsw.<your_domain>.com and *.cdsw.<your_domain>.com, then this parameter should be set to cdsw.<your_domain>.com.

    Users' browsers will then be able to contact the Cloudera Data Science Workbench web application at http://cdsw.<your_domain>.com.

    This domain is for DNS only and is unrelated to Kerberos or LDAP domains.

    Master Node IPv4 Address

    IPv4 address for the master node that is reachable from the worker nodes. By default, this field is left blank and Cloudera Manager uses the IPv4 address of the Master node.

    Within an AWS VPC, set this parameter to the internal IP address of the master node; for instance, if your hostname is ip-10-251-50-12.ec2.internal, set this property to the corresponding IP address, 10.251.50.12.

    Install Required Packages

    When this parameter is enabled, the Prepare Node command will install all the required package dependencies on First Run. If you choose to disable this property, you must manually install the following packages on all gateway hosts running Cloudera Data Science Workbench roles.
    nfs-utils
    libseccomp
    lvm2
    bridge-utils
    libtool-ltdl
    iptables   
    rsync 
    policycoreutils-python 
    selinux-policy-base 
    selinux-policy-targeted 
    ntp 
    ebtables 
    bind-utils 
    nmap-ncat  
    openssl 
    e2fsprogs 
    redhat-lsb-core 
    socat

    Docker Block Device

    Block device(s) for Docker images. Specify the full path to each device, for instance, /dev/xvde.

    The Cloudera Data Science Workbench installer will format these block devices and mount them for Docker on each gateway host that is assigned the Docker Daemon role. Do not mount these block devices prior to installation.

  6. The wizard will now begin a First Run of the Cloudera Data Science Workbench service. This includes deploying client configuration for HDFS, YARN and Spark 2, installing the package dependencies on all hosts, and formatting the Docker block device. The wizard will also assign the Application role to the host running Master, and the Docker Daemon role to all the gateway hosts running Cloudera Data Science Workbench.
  7. Once the First Run command has completed successfully, click Finish to go back to the Cloudera Manager home page.
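
If Install Required Packages is disabled, or you want to confirm that the Docker block device is not already mounted, a pre-flight check along these lines can be run on each gateway host before the First Run. It is a sketch only: the function names are hypothetical, an RPM-based host is assumed, and the mount check matches the device name exactly, so partitions of the device are not detected.

```shell
#!/usr/bin/env bash
# Packages listed under the Install Required Packages property above.
REQUIRED_PACKAGES="nfs-utils libseccomp lvm2 bridge-utils libtool-ltdl iptables rsync policycoreutils-python selinux-policy-base selinux-policy-targeted ntp ebtables bind-utils nmap-ncat openssl e2fsprogs redhat-lsb-core socat"

# Return success if the package is installed. Assumes an RPM-based host.
pkg_installed() {
  rpm -q "$1" >/dev/null 2>&1
}

# Print every required package that is not installed, one per line.
missing_packages() {
  local pkg
  for pkg in $REQUIRED_PACKAGES; do
    pkg_installed "$pkg" || echo "$pkg"
  done
}

# Return success if the given block device appears as a mount source in the
# mount table (second argument, defaults to /proc/mounts). Exact match only.
device_is_mounted() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage (the device path is the example from the text above):
#   missing_packages
#   device_is_mounted /dev/xvde && echo "unmount /dev/xvde before installing"
```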

Create the Administrator Account

After your installation is complete, set up the initial administrator account. Go to the Cloudera Data Science Workbench web application at http://cdsw.<your_domain>.com.

You must access Cloudera Data Science Workbench from the Cloudera Data Science Workbench Domain configured when setting up the service, and not the hostname of the master node. Visiting the hostname of the master node will result in a 404 error.

The first account that you create becomes the site administrator. You may now use this account to create a new project and start using the workbench to run data science workloads. For a brief example, see Getting Started with the Cloudera Data Science Workbench.
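
To check from the command line which address the web application answers on, the HTTP status code can be compared for the configured domain and for the master hostname. This is a minimal sketch: http_status is a hypothetical helper name and the URLs in the usage comments are placeholders.

```shell
#!/usr/bin/env bash
# Print the HTTP status code returned for a URL.
# curl reports 000 when no HTTP response is received at all.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Usage (placeholders; substitute your own domain and master hostname):
#   http_status "http://cdsw.<your_domain>.com"   # the configured domain
#   http_status "http://<master_hostname>"        # 404 per the note above
```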

Next Steps

As a site administrator, you can invite new users, monitor resource utilization, secure the deployment, and upload a license key for the product. For more details on these tasks, see the Administration and Security guides.

You can also start using the product by configuring your personal account and creating a new project. For a quickstart that walks you through creating and running a simple template project, see Getting Started with Cloudera Data Science Workbench. For more details on collaborating with teams, working on projects, and sharing results, see Managing Cloudera Data Science Workbench Users.

Upgrading to the Latest Version of Cloudera Data Science Workbench 1.4.x

This section describes how to use a Cloudera Manager CSD and parcel to upgrade Cloudera Data Science Workbench. Before you begin the upgrade, read the Cloudera Data Science Workbench Release Notes relevant to the version you are upgrading to.

Depending on your deployment, choose one of the following upgrade paths:

Upgrading a CSD-based Deployment to the Latest 1.4.x CSD

  1. (Strongly Recommended) Safely stop Cloudera Data Science Workbench. To avoid running into the data loss issue described in TSB-346, run the cdsw_protect_stop_restart.sh script on the master node and follow the sequence of steps as instructed by the script.

    The script will first back up your project files to the specified target folder. It will then temporarily move your project files aside to protect against the data loss condition. At that point, it is safe to stop the CDSW service in Cloudera Manager.

    After Cloudera Data Science Workbench has stopped, press enter to continue running the script as instructed. It will then move your project files back into place.

  2. (Strongly Recommended) On the master node, back up all your application data that is stored in the /var/lib/cdsw directory.
    To create the backup, run the following command on the master host.
    tar cvzf cdsw.tar.gz /var/lib/cdsw/*
  3. (Required for Upgrades from CDSW 1.4.0 - RedHat only) Cloudera Data Science Workbench 1.4.2 (and higher) includes a fix for a slab leak issue found in RedHat kernels. To have this fix go into effect, RedHat users must reboot all Cloudera Data Science Workbench hosts before proceeding with the upgrade.

    As a precaution, consult your cluster/IT administrator before you start rebooting hosts.

  4. Deactivate the existing Cloudera Data Science Workbench parcel. Go to the Cloudera Manager Admin Console. In the top navigation bar, click Hosts > Parcels.

    Locate the currently active CDSW parcel and click Deactivate. On the confirmation pop-up, select Deactivate Only and click OK.

  5. Download and save the latest Cloudera Data Science Workbench CSD to the Cloudera Manager Server host.
    1. Download the latest Cloudera Data Science Workbench CSD.
    2. Log on to the Cloudera Manager Server host, and place the CSD file under /opt/cloudera/csd, which is the default location for CSD files.
    3. Delete any CSD files belonging to older versions of Cloudera Data Science Workbench from /opt/cloudera/csd.

      This is required because older versions of the CSD will not work with the latest Cloudera Data Science Workbench 1.4 parcel. Make sure your CSD and parcel are always the same version.

    4. Set the CSD file ownership to cloudera-scm:cloudera-scm with permission 644.
    5. Restart the Cloudera Manager Server:
      service cloudera-scm-server restart
    6. Log in to the Cloudera Manager Admin Console and restart the Cloudera Management Service.
      1. Select Clusters > Cloudera Management Service.
      2. Select Actions > Restart.
  6. Distribute and activate the new parcel on your cluster.
    1. Log in to the Cloudera Manager Admin Console.
    2. Click Hosts > Parcels in the main navigation bar.
    3. Add the Cloudera Data Science Workbench parcel repository URL to Cloudera Manager.
      1. On the Parcels page, click Configuration.
      2. In the Remote Parcel Repository URLs list, click the addition symbol to create a new row.
      3. Enter the path to the repository.
      4. Click Save Changes.
    4. Go back to the Hosts > Parcels page. The latest parcel should now appear in the set of parcels available for download. Click Download. Once the download is complete, click Distribute to distribute the parcel to all the CDH hosts in your cluster. Then click Activate. On the pop-up screen, click OK. For more detailed information on each of these tasks, see Managing Parcels.
  7. Run the Prepare Node command on all Cloudera Data Science Workbench hosts.
    1. Before you run Prepare Node, you must make sure that the command is allowed to install all the required packages on your cluster. This is controlled by the Install Required Packages property.

      1. Navigate to the CDSW service.
      2. Click Configuration.
      3. Search for the Install Required Packages property. If this property is enabled, you can move on to the next step and run Prepare Node.
        However, if the property has been disabled, you must either enable it or manually install the following packages on all Cloudera Data Science Workbench gateway hosts.
        nfs-utils
        libseccomp
        lvm2
        bridge-utils
        libtool-ltdl
        iptables   
        rsync 
        policycoreutils-python 
        selinux-policy-base 
        selinux-policy-targeted 
        ntp 
        ebtables 
        bind-utils 
        nmap-ncat  
        openssl 
        e2fsprogs 
        redhat-lsb-core 
        socat
    2. Run the Prepare Node command.
      1. In Cloudera Manager, navigate to the Cloudera Data Science Workbench service.
      2. Click the Instances tab.
      3. Use the checkboxes to select all host instances and click Actions for Selected (x).
      4. Click Prepare Node. Once again, click Prepare Node to confirm the action.
  8. Log in to the Cloudera Manager Admin Console and restart the Cloudera Data Science Workbench service.
    1. On the Home > Status tab, click the drop-down arrow to the right of the CDSW service and select Restart from the dropdown.
    2. Confirm your choice on the next screen. Note that a complete restart of the service will take time. Even though the CDSW service status shows Good Health, the application itself will take some more time to get ready.
  9. Additional Post-Upgrade Tasks for Cloudera Data Science Workbench 1.4.x
    1. Check for a New Base Engine - If the release you have just upgraded to includes a new version of the base engine image (see release notes), you will need to manually configure existing projects to use the new engine. Cloudera recommends you do so to take advantage of any new features and bug fixes included in the newly released engine.

      To upgrade a project to the new engine, go to the project's Settings > Engine page and select the new engine from the dropdown. If any of your projects are using custom extended engines, you will need to modify them to use the new base engine image.
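
The command-line portions of steps 2 and 5 above can be sketched as follows. The helper only prints the commands so they can be reviewed and run on the right host: the backup on the master host, the CSD commands on the Cloudera Manager Server host. The function name upgrade_cmds and the CSD file name in the example are placeholders, not real artifact names. Deleting older CSD files from /opt/cloudera/csd and all Cloudera Manager console steps remain manual.

```shell
#!/usr/bin/env bash
# Print the shell commands for the command-line parts of the upgrade.
# $1: path to the downloaded CSD file (placeholder; use the real file name)
# $2: backup archive name (optional, defaults to cdsw.tar.gz)
upgrade_cmds() {
  local csd="$1"
  local archive="${2:-cdsw.tar.gz}"
  local csd_dir=/opt/cloudera/csd
  local csd_name
  csd_name="$(basename "$csd")"
  # Step 2, on the master host: archive the application data.
  echo "tar cvzf ${archive} /var/lib/cdsw/*"
  # Step 5, on the Cloudera Manager Server host: install the new CSD.
  echo "cp ${csd} ${csd_dir}/"
  echo "chown cloudera-scm:cloudera-scm ${csd_dir}/${csd_name}"
  echo "chmod 644 ${csd_dir}/${csd_name}"
  echo "service cloudera-scm-server restart"
}

# Example with a placeholder CSD path.
upgrade_cmds /tmp/NEW_CDSW_CSD.jar
```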

Migrating from an RPM-based Deployment to the Latest 1.4.x CSD

  1. Save a backup of the Cloudera Data Science Workbench configuration file located at /etc/cdsw/config/cdsw.conf.
  2. (Strongly Recommended) Safely stop Cloudera Data Science Workbench. To avoid running into the data loss issue described in TSB-346, run the cdsw_protect_stop_restart.sh script on the master node and follow the sequence of steps as instructed by the script.

    The script will first back up your project files to the specified target folder. It will then temporarily move your project files aside to protect against the data loss condition. At that point, it is safe to stop Cloudera Data Science Workbench. To stop Cloudera Data Science Workbench, run the following command on all Cloudera Data Science Workbench nodes (master and workers):
    cdsw reset

    After Cloudera Data Science Workbench has stopped, press enter to continue running the script as instructed. It will then move your project files back into place.

  3. (Strongly Recommended) On the master node, back up all your application data that is stored in the /var/lib/cdsw directory, and the configuration file at /etc/cdsw/config/cdsw.conf.
    To back up /var/lib/cdsw, run the following command on the master host. The configuration file should already have been saved separately in Step 1.
    tar cvzf cdsw.tar.gz /var/lib/cdsw/*
  4. (Required for Upgrades from CDSW 1.4.0 - RedHat only) Cloudera Data Science Workbench 1.4.2 (and higher) includes a fix for a slab leak issue found in RedHat kernels. To have this fix go into effect, RedHat users must reboot all Cloudera Data Science Workbench hosts before proceeding with the upgrade.

    As a precaution, consult your cluster/IT administrator before you start rebooting hosts.

  5. Uninstall the previous release of Cloudera Data Science Workbench. Perform this step on the master node, as well as all the worker nodes.
    yum remove cloudera-data-science-workbench 
  6. Install the latest version of Cloudera Data Science Workbench using the CSD and parcel. Note that when you are configuring role assignments for the Cloudera Data Science Workbench service, the Master role must be assigned to the same node that was running as master prior to the upgrade.

    For installation instructions, see Installing Cloudera Data Science Workbench 1.4.x Using Cloudera Manager. You might be able to skip the first few steps assuming you have the wildcard DNS domain and block devices already set up.

  7. Use your copy of the backup cdsw.conf created in Step 3 to recreate those settings in Cloudera Manager by configuring the corresponding properties under the Cloudera Data Science Workbench service.
    1. Log in to the Cloudera Manager Admin Console.
    2. Go to the Cloudera Data Science Workbench service.
    3. Click the Configuration tab.
    4. The following table lists all the cdsw.conf properties and their corresponding Cloudera Manager properties (in bold). Use the search box to bring up the properties you want to modify.
    5. Click Save Changes.
    cdsw.conf Property Corresponding Cloudera Manager Property and Description

    TLS_ENABLE

    Enable TLS: Enable and enforce HTTPS (TLS/SSL) access to the web application (optional). Both internal and external termination are supported. To enable internal termination, you must also set the TLS Certificate for Internal Termination and TLS Key for Internal Termination parameters. If these parameters are not set, terminate TLS using an external proxy.

    For more details on TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.

    TLS_CERT

    TLS_KEY

    TLS Certificate for Internal Termination, TLS Key for Internal Termination

    Complete path to the certificate and private key (in PEM format) to be used for internal TLS termination. Set these parameters only if you are not terminating TLS externally. You must also set the Enable TLS property to enable and enforce termination. The certificate must include both DOMAIN and *.DOMAIN as hostnames.

    Self-signed certificates are not supported unless trusted fully by clients. Accepting an invalid certificate manually can cause connection failures for unknown subdomains. For details on certificate requirements and enabling TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.

    HTTP_PROXY

    HTTPS_PROXY

    HTTP Proxy, HTTPS Proxy

    If your deployment is behind an HTTP or HTTPS proxy, set the respective HTTP Proxy or HTTPS Proxy property to the hostname of the proxy you are using.
    http://<proxy_host>:<proxy_port>
    or
    https://<proxy_host>:<proxy_port>

    If you are using an intermediate proxy such as Cntlm to handle NTLM authentication, add the Cntlm proxy address to the HTTP Proxy or HTTPS Proxy fields. That is, either http://localhost:3128 or https://localhost:3128 respectively.

    If the proxy server uses TLS encryption to handle connection requests, you will need to add the proxy's root CA certificate to your host's store of trusted certificates. This is because proxy servers typically sign their server certificate with their own root certificate. Therefore, any connection attempts will fail until the Cloudera Data Science Workbench host trusts the proxy's root CA certificate. If you do not have access to your proxy's root certificate, contact your Network / IT administrator.

    To enable trust, copy the proxy's root certificate to the trusted CA certificate store (ca-trust) on the Cloudera Data Science Workbench host.
    cp /tmp/<proxy-root-certificate>.crt /etc/pki/ca-trust/source/anchors/
    Use the following command to rebuild the trusted certificate store.
    update-ca-trust extract

    ALL_PROXY

    SOCKS Proxy: If a SOCKS proxy is in use, set this parameter to socks5://<host>:<port>/.

    NO_PROXY

    No Proxy: Comma-separated list of hostnames that should bypass the proxy.

    Starting with version 1.4, if you have defined a proxy in the HTTP_PROXY, HTTPS_PROXY, or ALL_PROXY properties, Cloudera Data Science Workbench automatically appends the following list of IP addresses to the NO_PROXY configuration. Note that this is the minimum required configuration for this field.

    This list includes 127.0.0.1, localhost, and any private Docker registries and HTTP services inside the firewall that Cloudera Data Science Workbench users might want to access from the engines.

    "127.0.0.1,localhost,100.66.0.1,100.66.0.2,100.66.0.3,
    100.66.0.4,100.66.0.5,100.66.0.6,100.66.0.7,100.66.0.8,100.66.0.9,
    100.66.0.10,100.66.0.11,100.66.0.12,100.66.0.13,100.66.0.14,
    100.66.0.15,100.66.0.16,100.66.0.17,100.66.0.18,100.66.0.19,
    100.66.0.20,100.66.0.21,100.66.0.22,100.66.0.23,100.66.0.24,
    100.66.0.25,100.66.0.26,100.66.0.27,100.66.0.28,100.66.0.29,
    100.66.0.30,100.66.0.31,100.66.0.32,100.66.0.33,100.66.0.34,
    100.66.0.35,100.66.0.36,100.66.0.37,100.66.0.38,100.66.0.39,
    100.66.0.40,100.66.0.41,100.66.0.42,100.66.0.43,100.66.0.44,
    100.66.0.45,100.66.0.46,100.66.0.47,100.66.0.48,100.66.0.49,
    100.66.0.50,100.77.0.10,100.77.0.128,100.77.0.129,100.77.0.130,
    100.77.0.131,100.77.0.132,100.77.0.133,100.77.0.134,100.77.0.135,
    100.77.0.136,100.77.0.137,100.77.0.138,100.77.0.139"

    NVIDIA_GPU_ENABLE

    Enable GPU Support: When this property is enabled, and the NVIDIA GPU Driver Library Path parameter is set, the GPUs installed on Cloudera Data Science Workbench nodes will be available for use in its workloads. By default, this parameter is disabled.

    For instructions on how to enable GPU-based workloads on Cloudera Data Science Workbench, see Using NVIDIA GPUs for Cloudera Data Science Workbench Projects.

    NVIDIA_LIBRARY_PATH

    NVIDIA GPU Driver Library Path: Complete path to the NVIDIA driver libraries. For instructions on how to create this directory, see Enable Docker NVIDIA Volumes on GPU Nodes.

    RESERVE_PATH

    This property allows you to reserve the master node for Cloudera Data Science Workbench's internal components and services such as Livelog, the PostgreSQL database, and so on. When enabled, user workloads will run exclusively on worker nodes, while the master is reserved for internal application services.

    Note that this property is not yet available as a configuration property in Cloudera Manager. However, you can use an Advanced Configuration Snippet (Safety Valve) to configure this as described here: Reserving the Master Host for Internal CDSW Components.

  8. Cloudera Manager will prompt you to restart the service if needed.
  9. Additional Post-Upgrade Tasks for Cloudera Data Science Workbench 1.4.x
    1. Check for a New Base Engine - If the release you have just upgraded to includes a new version of the base engine image (see release notes), you will need to manually configure existing projects to use the new engine. Cloudera recommends you do so to take advantage of any new features and bug fixes included in the newly released engine.

      To upgrade a project to the new engine, go to the project's Settings > Engine page and select the new engine from the dropdown. If any of your projects are using custom extended engines, you will need to modify them to use the new base engine image.
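
When recreating settings from the cdsw.conf backup (step 7 above), it can help to list only the values that are actually set, next to the Cloudera Manager property each one maps to. The following is a sketch under stated assumptions: the function names are hypothetical, the file is assumed to contain simple KEY="value" lines, and only the properties from the table above that have a Cloudera Manager equivalent are mapped. Keys without a mapping are skipped.

```shell
#!/usr/bin/env bash
# Map a cdsw.conf key to the Cloudera Manager property named in the table.
# Returns non-zero for keys that have no mapping here.
cm_property_for() {
  case "$1" in
    TLS_ENABLE)          echo "Enable TLS" ;;
    TLS_CERT)            echo "TLS Certificate for Internal Termination" ;;
    TLS_KEY)             echo "TLS Key for Internal Termination" ;;
    HTTP_PROXY)          echo "HTTP Proxy" ;;
    HTTPS_PROXY)         echo "HTTPS Proxy" ;;
    ALL_PROXY)           echo "SOCKS Proxy" ;;
    NO_PROXY)            echo "No Proxy" ;;
    NVIDIA_GPU_ENABLE)   echo "Enable GPU Support" ;;
    NVIDIA_LIBRARY_PATH) echo "NVIDIA GPU Driver Library Path" ;;
    *)                   return 1 ;;
  esac
}

# Read a cdsw.conf backup ($1) and print "<CM property>: <value>" for every
# non-empty setting that has a mapping. Assumes simple KEY="value" lines.
list_cm_settings() {
  local key value prop
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case "$key" in ''|'#'*) continue ;; esac
    value="${value%\"}"    # strip a trailing double quote
    value="${value#\"}"    # strip a leading double quote
    [ -n "$value" ] || continue
    prop="$(cm_property_for "$key")" || continue
    echo "${prop}: ${value}"
  done < "$1"
}

# Usage (path is wherever you saved the backup):
#   list_cm_settings /path/to/backup/cdsw.conf
```

Each printed line names a property to search for on the service's Configuration tab and the value to enter.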