Installing and Upgrading Cloudera Data Science Workbench 1.4.x

This topic walks you through the installation and upgrade paths available for Cloudera Data Science Workbench 1.4.x. It also describes the steps needed to configure your cluster gateway hosts and block devices before you can begin installing the Cloudera Data Science Workbench parcel/package.

Installing Cloudera Data Science Workbench 1.4.x

You can install Cloudera Data Science Workbench 1.4.x in one of the following ways:
  • Using a Custom Service Descriptor (CSD) and Parcel - Starting with version 1.2.x, Cloudera Data Science Workbench is available as an add-on service for Cloudera Manager. Two files are required for this type of installation: a CSD JAR file that contains all the configuration needed to describe and manage the new Cloudera Data Science Workbench service, and the Cloudera Data Science Workbench parcel. To install this service, first download and copy the CSD file to the Cloudera Manager Server host. Then use Cloudera Manager to distribute the Cloudera Data Science Workbench parcel to the relevant gateway nodes.

    or

  • Using a Package (RPM) - Alternatively, you can install the Cloudera Data Science Workbench package directly on the CDH cluster's gateway nodes. In this case, the Cloudera Data Science Workbench service will not be available in Cloudera Manager.

To begin the installation process, continue reading Required Pre-Installation Steps.

Upgrading to the Latest Version of Cloudera Data Science Workbench 1.4.x

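Depending on your deployment, choose one of the following upgrade paths:
  • CSD and parcel deployments - Follow the steps at Installation and Upgrade Using Cloudera Manager.

  • Package (RPM) deployments - Follow the steps at Installation and Upgrade Using Packages.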

Airgapped Installations

Organizations sometimes isolate parts of their network from the Internet for security reasons. Isolating network segments helps protect valuable data from being compromised, whether through malice or for personal gain. However, isolated hosts cannot reach the Cloudera repositories needed for new installations or upgrades. As of version 1.1.1, Cloudera Data Science Workbench supports installation on CDH clusters that are not connected to the Internet.

For CSD-based installs in an airgapped environment, put the Cloudera Data Science Workbench parcel into a new hosted or local parcel repository, and then configure the Cloudera Manager Server to target this newly created repository.
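As a rough illustration of the "local parcel repository" idea, the sketch below serves a directory over HTTP using only the Python standard library. It is an assumption-laden stand-in, not Cloudera's documented procedure: the function name, the default port 8900, and the directory path are hypothetical. It also assumes the directory already holds the parcel file together with whatever repository metadata Cloudera Manager expects (for example, a manifest.json).

```python
import functools
import http.server


def make_parcel_repo_server(directory, host="0.0.0.0", port=8900):
    """Build an HTTP server that exposes `directory` as a static file tree.

    Hypothetical helper: the port and directory are placeholders, and a
    production repository would normally sit behind a real web server.
    """
    # Bind the handler to the parcel directory instead of the current directory.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)
```

To try it, call `make_parcel_repo_server("/path/to/parcels").serve_forever()` on a host that the Cloudera Manager Server can reach, then point Cloudera Manager at that host and port as the repository location.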

Required Pre-Installation Steps

The rest of this topic describes the steps you should take to review your platforms and configure your gateway hosts before you begin to install Cloudera Data Science Workbench.

  1. Review Requirements and Supported Platforms
  2. Set Up a Wildcard DNS Subdomain
  3. Disable Untrusted SSH Access
  4. Configure Block Devices
  5. Install Cloudera Data Science Workbench

Review Requirements and Supported Platforms

Review the complete list of Cloudera Data Science Workbench 1.4.x Requirements and Supported Platforms before you proceed with the installation.

Set Up a Wildcard DNS Subdomain

Cloudera Data Science Workbench uses DNS to route HTTP requests to specific engines and services. Wildcard subdomains (such as *.cdsw.<your_domain>.com) are required in order to provide isolation for user-generated content. In particular, wildcard subdomains help:
  • Securely expose interactive session services, such as visualizations, the terminal, and web UIs such as TensorBoard, Shiny, Plotly, and so on.

  • Securely isolate user-generated content from the application.

To set up subdomains for Cloudera Data Science Workbench, configure your DNS server with an A record for a wildcard DNS name such as *.cdsw.<your_domain>.com for the master host, and a second A record for the root entry of cdsw.<your_domain>.com.

For example, if your master IP address is 172.46.47.48, you'd configure two A records as follows:

cdsw.<your_domain>.com.   IN A 172.46.47.48
*.cdsw.<your_domain>.com.   IN A 172.46.47.48

You can also use a wildcard CNAME record if it is supported by your DNS provider.
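Once the records are in place, a quick check can confirm that both the root name and an arbitrary subdomain resolve to the master IP address. The following is a minimal sketch using Python's standard library. The function name and the injectable `resolve` parameter are illustrative choices, not part of Cloudera Data Science Workbench.

```python
import socket
import uuid


def check_wildcard_dns(domain, expected_ip, resolve=socket.gethostbyname):
    """Map the root name and one random subdomain to True/False.

    True means the name resolved to expected_ip.
    """
    # A random label stands in for an engine or session subdomain, so a
    # match here suggests the wildcard record is working.
    random_label = uuid.uuid4().hex[:12]
    names = [domain, f"{random_label}.{domain}"]
    results = {}
    for name in names:
        try:
            results[name] = resolve(name) == expected_ip
        except OSError:
            # socket.gaierror is a subclass of OSError; treat failures as a miss.
            results[name] = False
    return results
```

For example, `check_wildcard_dns("cdsw.example.com", "172.46.47.48")` returns a dictionary with one entry per tested name, where `cdsw.example.com` is a placeholder for your own domain. A `False` value for the random subdomain usually points to a missing or misconfigured wildcard record.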

Disable Untrusted SSH Access

Cloudera Data Science Workbench assumes that users only access the gateway hosts through the web application. Untrusted users with SSH access to a Cloudera Data Science Workbench host can gain full access to the cluster, including access to other users' workloads. Therefore, untrusted (non-sudo) SSH access to Cloudera Data Science Workbench hosts must be disabled to ensure a secure deployment.

For more information on the security capabilities of Cloudera Data Science Workbench, see the Cloudera Data Science Workbench Security Guide.
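One possible way to audit the SSH restriction described above is to check whether the SSH daemon configuration on each host contains an allow-list. The sketch below assumes OpenSSH and looks for its AllowUsers and AllowGroups directives in sshd_config-style text. It is deliberately simplified: it ignores Match blocks and Include files, and an allow-list is only one of several ways to restrict access. The function names and the group name used in the usage note are hypothetical.

```python
def ssh_allow_directives(sshd_config_text):
    """Collect AllowUsers and AllowGroups values from sshd_config-style text."""
    found = {"allowusers": [], "allowgroups": []}
    for raw_line in sshd_config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        keyword = parts[0].lower()  # sshd_config keywords are case-insensitive
        if keyword in found:
            found[keyword].extend(parts[1:])
    return found


def ssh_access_is_restricted(sshd_config_text):
    """Return True if at least one allow-list directive is present."""
    return any(ssh_allow_directives(sshd_config_text).values())
```

For instance, text containing the line `AllowGroups cdsw-admins` would be reported as restricted, while a configuration with no allow-list directives would not. Whether the listed users and groups are actually trusted still needs a manual review.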

Configure Block Devices

Docker Block Device

The Cloudera Data Science Workbench installer will format and mount the Docker block devices on each gateway host. Make sure there is no important data stored on these devices. Do not mount these block devices prior to installation.

Every Cloudera Data Science Workbench gateway host must have one or more block devices with at least 1 TB dedicated to storage of Docker images. The Docker block devices store the Cloudera Data Science Workbench Docker images including the Python, R, and Scala engines. Each engine image can occupy 15 GB.
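The two conditions above (devices left unmounted, and at least 1 TB of capacity) can be checked before installation. The sketch below is one way to do so on Linux. It assumes that mount information comes from /proc/mounts-style text, that device sizes are given in 512-byte sectors as reported by /sys/class/block/<device>/size, that "1 TB" means 10^12 bytes, and that the 1 TB applies to the combined capacity of the devices. All function and constant names are illustrative.

```python
SECTOR_BYTES = 512          # /sys/class/block/<dev>/size counts 512-byte sectors
MIN_DOCKER_BYTES = 10**12   # assumption: "1 TB" is read as 10^12 bytes


def is_mounted(device, mounts_text):
    """Return True if `device` is a mount source in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Only exact matches are detected; mounted partitions of the device
        # (for example /dev/sdb1 for /dev/sdb) are not.
        if fields and fields[0] == device:
            return True
    return False


def check_docker_devices(device_sectors, mounts_text):
    """Check candidate Docker devices given {device_path: size_in_sectors}."""
    mounted = [dev for dev in device_sectors if is_mounted(dev, mounts_text)]
    total_bytes = sum(device_sectors.values()) * SECTOR_BYTES
    return {
        "mounted": mounted,
        "total_bytes": total_bytes,
        "ok": not mounted and total_bytes >= MIN_DOCKER_BYTES,
    }
```

In practice, `mounts_text` would be the contents of /proc/mounts and each sector count would be read from the corresponding file under /sys/class/block on the gateway host.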

Application Block Device or Mount Point

The master host on Cloudera Data Science Workbench requires at least 500 GB for database and project storage. The capacity you actually need depends on the expected number of users and projects on the cluster. Although large data files should be stored on HDFS, individual projects commonly hold gigabytes of data or libraries. Running out of storage will cause the application to fail. Cloudera therefore recommends allocating at least 5 GB per project and at least 1 TB of storage in total. After installation, continue to monitor disk space usage and I/O using Cloudera Manager.
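The sizing guidance above reduces to simple arithmetic: take the larger of the 1 TB floor and 5 GB multiplied by the expected number of projects. A minimal sketch follows, assuming 1 TB is counted as 1000 GB. The function and constant names are illustrative.

```python
MIN_TOTAL_GB = 1000   # "at least 1 TB", taken here as 1000 GB
PER_PROJECT_GB = 5    # "at least 5 GB per project"


def recommended_app_storage_gb(expected_projects):
    """Return the recommended application storage in GB for a project count."""
    return max(MIN_TOTAL_GB, PER_PROJECT_GB * expected_projects)
```

With these numbers, the 1 TB floor dominates up to 200 projects. Beyond that the per-project figure takes over, so 400 expected projects would call for 2000 GB.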

Cloudera Data Science Workbench stores all application data at /var/lib/cdsw. On a CSD-based deployment, this location is not configurable. Cloudera Data Science Workbench assumes that the system administrator has formatted and mounted one or more block devices to /var/lib/cdsw on the master host. Application Block Device mounts are not required on worker hosts.

Regardless of the application data storage configuration you choose, /var/lib/cdsw must be stored on a separate block device. Given typical database and user access patterns, an SSD is strongly recommended.
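To sanity-check the mount on the master host, you can confirm that /var/lib/cdsw is a mount point and that the filesystem behind it is large enough. The sketch below uses only the Python standard library. The function name is illustrative, and the default threshold assumes the 500 GB minimum means 500 × 10^9 bytes. Being a mount point suggests, but does not prove, that the path sits on its own block device, because a bind mount would also pass this check.

```python
import os
import shutil


def check_app_mount(path="/var/lib/cdsw", min_bytes=500 * 10**9):
    """Report whether `path` is a mount point with enough total capacity."""
    is_mount = os.path.ismount(path)
    # disk_usage raises if the path is missing, so guard with exists().
    total = shutil.disk_usage(path).total if os.path.exists(path) else 0
    return {
        "is_mount_point": is_mount,
        "total_bytes": total,
        "ok": is_mount and total >= min_bytes,
    }
```

Running `check_app_mount()` on the master host returns a small dictionary. An `ok` value of `False` means the path is either not mounted separately or is smaller than the threshold.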

By default, data in /var/lib/cdsw is not backed up or replicated to HDFS or other nodes. A reliable storage and backup strategy is critical for production installations. For more information, see Backup and Disaster Recovery for Cloudera Data Science Workbench.

Install Cloudera Data Science Workbench

To use the Cloudera Manager CSD and parcel to install Cloudera Data Science Workbench, follow the steps at Installation and Upgrade Using Cloudera Manager.

or

To install the Cloudera Data Science Workbench package on the cluster gateway hosts, follow the steps at Installation and Upgrade Using Packages.