Installing Cloudera Data Science Workbench 1.0.x

This topic describes how to install the Cloudera Data Science Workbench package on a CDH cluster managed by Cloudera Manager. Custom Service Descriptor (CSD) and parcel-based installs are not currently supported.

Prerequisites

Review the complete list of prerequisites at Cloudera Data Science Workbench 1.0.x Requirements and Supported Platforms before you proceed with the installation.

Installing Cloudera Data Science Workbench 1.0.x from Packages

Set Up a Wildcard DNS Subdomain

Cloudera Data Science Workbench uses DNS subdomains to isolate user-generated HTML and JavaScript, and to route requests between services. To access Cloudera Data Science Workbench, you must configure the wildcard DNS name *.cdsw.<company>.com for the master host as an A record, along with a root entry for cdsw.<company>.com.

For example, if your master IP address is 172.46.47.48, configure two A records as follows:

cdsw.<company>.com.   IN A 172.46.47.48
*.cdsw.<company>.com.   IN A 172.46.47.48

You can also use a wildcard CNAME record if it is supported by your DNS provider.
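With the CNAME approach, only the root record carries the address; the wildcard aliases it. A zone-file sketch (domain and address are illustrative, not from your environment):

```
cdsw.example.com.     IN A      172.46.47.48
*.cdsw.example.com.   IN CNAME  cdsw.example.com.
```

This has the advantage that a future change to the master IP address only needs to be made in one record.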

Disable Untrusted SSH Access

Cloudera Data Science Workbench assumes that users only access the gateway hosts through the web application. Untrusted users with SSH access to a Cloudera Data Science Workbench host can gain full access to the cluster, including access to other users' workloads. Therefore, untrusted (non-sudo) SSH access to Cloudera Data Science Workbench hosts must be disabled to ensure a secure deployment.

For more information on the security capabilities of Cloudera Data Science Workbench, see the Cloudera Data Science Workbench Security Guide.

Configure Gateway Hosts Using Cloudera Manager

Cloudera Data Science Workbench hosts must be added to your CDH cluster as gateway hosts, with gateway roles properly configured. To configure gateway hosts:
  1. If you have not already done so and plan to use PySpark, install either the Anaconda parcel or Python (versions 2.7.11 and 3.6.1) on your CDH cluster. For more information, see Python Supported Versions.

  2. To support workloads running on Cloudera's Distribution of Apache Spark 2, you must configure the Spark 2 parcel and the Spark 2 CSD. For instructions, see Installing Cloudera Distribution of Apache Spark 2.

    To use Spark 2, each user must have their own home directory in HDFS. If users sign in to Hue first, these directories are automatically created for them. Alternatively, a cluster administrator can create them:
    hdfs dfs -mkdir /user/<username>
    hdfs dfs -chown <username>:<username> /user/<username>
  3. Use Cloudera Manager to add gateway hosts to your CDH cluster.
    1. Create a new host template that includes gateway roles for HDFS, YARN, and Spark 2.
    2. Use the instructions at Adding a Host to the Cluster to add gateway hosts to the cluster. Apply the template created in the previous step to these gateway hosts. If your cluster is kerberized, confirm that the krb5.conf file on your gateway hosts is correct.
  4. Test Spark 2 integration on the gateway hosts.
    1. SSH to a gateway host.
    2. If your cluster is kerberized, run kinit to authenticate to the CDH cluster’s Kerberos Key Distribution Center. The Kerberos ticket you create is not visible to Cloudera Data Science Workbench users.
    3. Submit a test job to Spark 2 by executing the following command:
      spark2-submit --class org.apache.spark.examples.SparkPi \
        --master yarn --deploy-mode client \
        /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-example*.jar 100
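The per-user HDFS home directories from step 2 can be prepared in bulk. Below is a sketch that prints the required commands for a list of users; the user names are illustrative, and you would drop the echo and run the commands as the HDFS superuser to apply them:

```shell
# Print the hdfs commands that create and assign ownership of a
# home directory for each user (remove echo to actually run them).
for u in alice bob; do
  echo hdfs dfs -mkdir -p "/user/${u}"
  echo hdfs dfs -chown "${u}:${u}" "/user/${u}"
done
```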

Configure Block Devices

Docker Block Device

The Cloudera Data Science Workbench installer formats and mounts the Docker block devices on each gateway host. Do not format or mount these block devices prior to installation.

Every Cloudera Data Science Workbench gateway host must have one or more block devices with at least 500 GB dedicated to storage of Docker images. The Docker block devices store the Cloudera Data Science Workbench Docker images, including the Python, R, and Scala engines. Each engine image can take up 15 GB.

Application Block Device or Mount Point

The master host on Cloudera Data Science Workbench requires at least 500 GB for database and project storage. The recommended capacity depends on the expected number of users and projects on the cluster. While large data files should be stored on HDFS, it is not uncommon for individual projects to accumulate gigabytes of data or libraries. Running out of storage causes the application to fail. Cloudera recommends allocating at least 5 GB per project and at least 1 TB of storage in total. Make sure you continue to carefully monitor disk space usage and I/O using Cloudera Manager.

All application data will be located at /var/lib/cdsw on the master node. If an application block device is specified during initialization, Cloudera Data Science Workbench will format it as ext4 and mount it to /var/lib/cdsw. If no device is explicitly specified during initialization, Cloudera Data Science Workbench will store all data at /var/lib/cdsw and assume the system administrator has formatted and mounted one or more block devices to this location. The second option is recommended for production installations.

Regardless of the application data storage configuration you choose, /var/lib/cdsw must be stored on a separate block device. Given typical database and user access patterns, an SSD is strongly recommended.

By default, data in /var/lib/cdsw is not backed up or replicated to HDFS or other nodes. Reliable storage and backup strategy is critical for production installations. See Backup and Disaster Recovery for Cloudera Data Science Workbench for more information.

Install Cloudera Data Science Workbench on the Master Host

To install Cloudera Data Science Workbench and its dependencies:

  1. Download the Cloudera Data Science Workbench repository installer (cloudera-cdsw.repo) from the following table and save it to /etc/yum.repos.d/.
  2. Add the Cloudera Public GPG repository key. This key verifies that you are downloading genuine packages.
    sudo rpm --import https://archive.cloudera.com/cdsw/1/redhat/7/x86_64/cdsw/RPM-GPG-KEY-cloudera
  3. Install the latest RPM with the following command:
    sudo yum install cloudera-data-science-workbench
    For guidance on any warnings displayed during the installation process, see Understanding Installation Warnings.
  4. Edit the configuration file at /etc/cdsw/config/cdsw.conf. The following properties can be configured in cdsw.conf.
    Required Configuration

    DOMAIN

    Wildcard DNS domain configured to point to the master node.

    If the wildcard DNS entries are configured as cdsw.<company>.com and *.cdsw.<company>.com, then DOMAIN should be set to cdsw.<company>.com. Users' browsers should then contact Cloudera Data Science Workbench at http://cdsw.<company>.com.

    This domain is used only for DNS and is unrelated to Kerberos or LDAP domains.

    MASTER_IP

    IPv4 address for the master node that is reachable from the worker nodes.

    Within an AWS VPC, MASTER_IP should be set to the internal IP address of the master node; for instance, if your hostname is ip-10-251-50-12.ec2.internal, set MASTER_IP to the corresponding IP address, 10.251.50.12.

    DOCKER_BLOCK_DEVICES

    Block device(s) for Docker images (space separated if there are multiple).

    Use the full path to specify the device(s), for instance, /dev/xvde.

    Optional Configuration

    APPLICATION_BLOCK_DEVICE

    (Master Node Only) Configure a block device for application state.

    If this property is left blank, the filesystem mounted at /var/lib/cdsw on the master node will be used to store all user data. For production deployments, Cloudera recommends you use this option with a dedicated SSD block device for the /var/lib/cdsw mount.

    (Not recommended) If set, Cloudera Data Science Workbench will format the provided block device as ext4, mount it to /var/lib/cdsw, and store all user data on it. This option has only been provided for proof-of-concept setups, and Cloudera is not responsible for any data loss.

    Use the full path to specify the block device, for instance, /dev/xvdf.

    TLS_ENABLE

    Enable and enforce HTTPS (TLS/SSL) for web access.

    Set to true to enable and enforce HTTPS access to the web application.

    You can also set this property to true to enable external TLS termination. For more details on TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.

    TLS_CERT

    TLS_KEY

    Certificate and private key for internal TLS termination.

    Setting TLS_CERT and TLS_KEY will enable internal TLS termination. You must also set TLS_ENABLE to true above to enable and enforce internal termination. Set these only if you are not terminating TLS externally.

    Make sure you specify the full path to the certificate and key files, which must be in PEM format.

    For details on certificate requirements and enabling TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.

    HTTP_PROXY

    HTTPS_PROXY

    If your deployment is behind an HTTP or HTTPS proxy, set the respective HTTP_PROXY or HTTPS_PROXY property in /etc/cdsw/config/cdsw.conf to the hostname of the proxy you are using.
    HTTP_PROXY="http://<proxy_host>:<proxy_port>"
    HTTPS_PROXY="http://<proxy_host>:<proxy_port>"
    If you are using an intermediate proxy such as Cntlm to handle NTLM authentication, add the Cntlm proxy address to the HTTP_PROXY or HTTPS_PROXY fields in cdsw.conf.
    HTTP_PROXY="http://localhost:3128"
    HTTPS_PROXY="http://localhost:3128"

    If the proxy server uses TLS encryption to handle connection requests, you will need to add the proxy's root CA certificate to your host's store of trusted certificates. This is because proxy servers typically sign their server certificate with their own root certificate. Therefore, any connection attempts will fail until the Cloudera Data Science Workbench host trusts the proxy's root CA certificate. If you do not have access to your proxy's root certificate, contact your Network / IT administrator.

    To enable trust, copy the proxy's root certificate to the trusted CA certificate store (ca-trust) on the Cloudera Data Science Workbench host.
    cp /tmp/<proxy-root-certificate>.crt /etc/pki/ca-trust/source/anchors/
    Use the following command to rebuild the trusted certificate store.
    update-ca-trust extract

    ALL_PROXY

    If a SOCKS proxy is in use, set to socks5://<host>:<port>/.

    NO_PROXY

    Comma-separated list of hostnames that should bypass the proxy.

    These include 127.0.0.1, localhost, the value of MASTER_IP, and any private Docker registries and HTTP services inside the firewall that Cloudera Data Science Workbench users might want to access from the engines.

    At a minimum, Cloudera recommends the following NO_PROXY configuration.
    NO_PROXY="127.0.0.1,localhost,<MASTER_IP>,100.66.0.1,100.66.0.2,
    100.66.0.3,100.66.0.4,100.66.0.5,100.66.0.6,100.66.0.7,100.66.0.8,
    100.66.0.9,100.66.0.10,100.66.0.11,100.66.0.12,100.66.0.13,100.66.0.14,
    100.66.0.15,100.66.0.16,100.66.0.17,100.66.0.18,100.66.0.19,100.66.0.20,
    100.66.0.21,100.66.0.22,100.66.0.23,100.66.0.24,100.66.0.25,100.66.0.26,
    100.66.0.27,100.66.0.28,100.66.0.29,100.66.0.30,100.66.0.31,100.66.0.32,
    100.66.0.33,100.66.0.34,100.66.0.35,100.66.0.36,100.66.0.37,100.66.0.38,
    100.66.0.39,100.66.0.40,100.66.0.41,100.66.0.42,100.66.0.43,100.66.0.44,
    100.66.0.45,100.66.0.46,100.66.0.47,100.66.0.48,100.66.0.49,100.66.0.50"
  5. Initialize and start Cloudera Data Science Workbench.
    cdsw init
    This initialization process requires Internet access to download the Docker image dependencies. The initial download and bootstrap can take 20 to 30 minutes. You can watch the status of installation and startup with watch cdsw status.
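Putting the required properties together, a minimal cdsw.conf for a single-master deployment might look like the following; all values are illustrative and must be replaced with your own:

```
# /etc/cdsw/config/cdsw.conf (example values)
DOMAIN="cdsw.example.com"
MASTER_IP="172.46.47.48"
DOCKER_BLOCK_DEVICES="/dev/xvde"
# Optional: for production, leave blank and pre-mount dedicated
# storage at /var/lib/cdsw instead of setting a device here.
APPLICATION_BLOCK_DEVICE=""
TLS_ENABLE=""
```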
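The long recommended NO_PROXY value shown in the property descriptions above does not need to be typed by hand. It can be generated with a short shell sketch; the MASTER_IP value here is illustrative:

```shell
# Build the recommended NO_PROXY list: loopback, localhost, the master IP,
# and the 100.66.0.1-100.66.0.50 addresses used internally by the application.
MASTER_IP="172.46.47.48"
NO_PROXY="127.0.0.1,localhost,${MASTER_IP}$(printf ',100.66.0.%d' $(seq 1 50))"
echo "$NO_PROXY"
```

Paste the printed value into the NO_PROXY property in cdsw.conf.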

(Optional) Install Cloudera Data Science Workbench on Worker Hosts

Cloudera Data Science Workbench supports adding and removing additional worker hosts at any time. Worker hosts allow you to transparently scale the number of concurrent workloads users can run.

To add worker hosts:
  1. Download the Cloudera Data Science Workbench repository installer (cloudera-cdsw.repo) from the following table and save it to /etc/yum.repos.d/.
  2. Add the Cloudera Public GPG repository key. This key verifies that you are downloading genuine packages.
    sudo rpm --import https://archive.cloudera.com/cdsw/1/redhat/7/x86_64/cdsw/RPM-GPG-KEY-cloudera
  3. Install the latest RPM with the following command:
    sudo yum install cloudera-data-science-workbench
    For guidance on any warnings displayed during the installation process, see Understanding Installation Warnings.
  4. Copy cdsw.conf file from the master host:
    scp root@cdsw-host-1.<company>.com:/etc/cdsw/config/cdsw.conf /etc/cdsw/config/cdsw.conf

    After initialization, the cdsw.conf file includes a generated bootstrap token that allows worker hosts to securely join the cluster. Copying the configuration file from the master, and ensuring it has 644 permissions so the worker can read it, transfers this token to the worker.

    If your hosts have heterogeneous block device configurations, modify the Docker block device settings in the worker host configuration file after you copy it. Worker hosts do not need application block devices, which store the project files and database state, and this configuration option is ignored.

  5. On the master node, whitelist the IPv4 address of the worker node for the NFS server.
    cdsw enable <IPv4_address_of_worker>
  6. On the worker node, run the following command to add the host to the cluster:
    cdsw join

    This causes the worker nodes to register themselves with the Cloudera Data Science Workbench master node and increase the available pool of resources for workloads.

  7. Return to the master node and verify the host is registered with this command:
    cdsw status

Create the Administrator Account

Installation typically takes 30 minutes, although it might take an additional 60 minutes for the R, Python, and Scala engines to become available on all hosts.

After your installation is complete, set up the initial administrator account. Go to the Cloudera Data Science Workbench web application at http://cdsw.<company>.com.

The first account that you create becomes the site administrator. You can now use this account to create a new project and start using the workbench to run data science workloads. For a brief example, see Getting Started with Cloudera Data Science Workbench.

Next Steps

As a site administrator, you can invite new users, monitor resource utilization, secure the deployment, and upload a license key for the product. For more details on these tasks, see the Administration and Security guides.

You can also start using the product by configuring your personal account and creating a new project. For a quickstart that walks you through creating a simple template project, see Getting Started with Cloudera Data Science Workbench. For more details on collaborating with teams, working on projects, and sharing results, see the Cloudera Data Science Workbench User Guide.

Upgrading to the Latest Version of Cloudera Data Science Workbench 1.0.x

  1. Reset the state of every worker node.
    cdsw reset
  2. Reset the state of the master node.
    cdsw reset
  3. (Optional) On the master node, backup the contents of the /var/lib/cdsw directory. This is the directory that stores all your application data.
  4. Uninstall the previous release of Cloudera Data Science Workbench. Perform this step on the master node, as well as all the worker nodes.
    yum remove cloudera-data-science-workbench 
  5. Install the latest version of Cloudera Data Science Workbench on the master node and on all the worker nodes. Follow the same process as you would for a fresh installation. However, note that even though you have installed the latest RPM, your previous configuration settings in cdsw.conf will remain unchanged.
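The backup in step 3 can be scripted. Below is a sketch that prints a dated tar command for /var/lib/cdsw; the backup destination path is illustrative, and you would remove the echo to run the command on the master node:

```shell
# Print a dated backup command for the application data directory
# (remove echo to execute; destination path is an example).
echo tar czf "/root/cdsw-backup-$(date +%F).tar.gz" -C /var/lib cdsw
```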