Cloudera Data Science Workbench 1.7.2 Requirements and Supported Platforms

This topic lists the software and hardware configuration required to successfully install and run Cloudera Data Science Workbench. Cloudera Data Science Workbench does not support hosts or clusters that do not conform to the requirements listed on this page.

Cloudera Manager and CDH Requirements

Cloudera Data Science Workbench 1.7.2 is supported on the following versions of CDH and Cloudera Manager:
CSD Deployments
  CDH:
  • CDH 5.10 or higher
  • CDH 6.1.x or higher
  • Cloudera Runtime Data Center 7.0.3 or higher
  Cloudera Manager:
  • Cloudera Manager 5.x: 5.16.2.4505 or higher
  • Cloudera Manager 6.1.x: 6.1.1.4505 or higher
  • Cloudera Manager 6.2.x: 6.2.1.4505 or higher
  • Cloudera Manager 6.3.x+: 6.3.3 or higher
  • Cloudera Manager Data Center 7.0.3 or higher
RPM Deployments
  CDH:
  • CDH 5.10 or higher
  • CDH 6.1.x or higher
  Cloudera Manager:
  • Cloudera Manager 5.11 or higher
  • Cloudera Manager 6.1.x or higher

All cluster hosts must be managed by Cloudera Manager. Note that all Cloudera Data Science Workbench administrative tasks require root access to the cluster's gateway hosts where Cloudera Data Science Workbench is installed. Therefore, Cloudera Data Science Workbench does not support single-user mode installations.

Apache Spark Requirements

CDH Version: CDH 5
Spark 2 Compatibility: CDS 2.1.x Powered by Apache Spark (and higher)

CDH Version: CDH 6
Spark 2 Compatibility: On CDH 6 clusters, Apache Spark 2 is packaged with CDH and can no longer be installed separately.

To find out which version of Spark 2 ships with your version of CDH 6, refer to the CDH 6 Packaging Information guide.

CDH Version: Cloudera Runtime 7
Spark 2 Compatibility: On Cloudera Runtime clusters, Apache Spark 2 is packaged with Cloudera Runtime and can no longer be installed separately.

To find out which version of Spark 2 ships with your version of Cloudera Runtime, refer to the Cloudera Runtime Component Versions guide.

Operating System Requirements

Cloudera Data Science Workbench 1.7.2 is supported on the following operating systems. A gateway host that is dedicated to running Cloudera Data Science Workbench must use one of the following supported versions even if the remaining CDH hosts in your cluster are running any of the other operating systems supported by Cloudera Enterprise 5 or 6.

Operating System: RHEL / CentOS / Oracle Linux RHCK
Versions: 7.8, 7.7, 7.6, 7.5, 7.4, 7.3, 7.2
Notes:
  • When IPv6 is disabled, CDSW installations on RHEL/CentOS 7.3 fail due to an issue in kernel versions 3.10.0-514 through 3.10.0-693. For details, see https://access.redhat.com/solutions/3039771.

  • CDSW installations on RHEL/CentOS 7.2 might fail due to an issue with certain versions of the nfs-utils package. To fix the issue, either downgrade the nfs-utils package or upgrade to a version with the fix.

    View the complete Red Hat bug report here.

Operating System: Oracle Linux (UEK - default)
Versions: 7.3
Notes: -

Cloudera Data Science Workbench publishes placeholder parcels for other operating systems as well. These parcels are not functional and are included only to support mixed-OS clusters.

Additional OS-level Settings

  • Enable memory cgroups on your operating system.
  • Disable swap for optimum stability and performance. For instructions, see Setting the vm.swappiness Linux Kernel Parameter.
  • Cloudera Data Science Workbench uses uid 8536 and uid 28536 for internal service accounts. Make sure that these user IDs are not assigned to any other service or user account.
  • Cloudera recommends that all users have the max-user-processes ulimit set to at least 65536.
  • Cloudera recommends that all users have the max-open-files ulimit set to 1048576.
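
The settings above can be spot-checked on a host before installation. The following is a minimal shell sketch, not a Cloudera-provided tool: the function names are hypothetical, it assumes bash on a host where /proc/cgroups lists the memory controller (cgroup v1 layout), and the ulimit checks only reflect the user who runs the script.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; not shipped with Cloudera Data Science Workbench.

# Succeeds when a ulimit value meets the minimum ("unlimited" always passes).
limit_ok() {
  local value="$1" minimum="$2"
  [ "$value" = "unlimited" ] && return 0
  [ "$value" -ge "$minimum" ]
}

# Succeeds when no account owns the given uid.
uid_is_free() {
  ! getent passwd "$1" >/dev/null
}

# Succeeds when the memory controller is enabled (assumes the cgroup v1 /proc/cgroups layout).
memory_cgroup_enabled() {
  awk '$1 == "memory" && $4 == 1 { ok = 1 } END { exit !ok }' /proc/cgroups
}

# Succeeds when no swap device is active (/proc/swaps holds only its header line).
swap_is_off() {
  [ "$(wc -l < /proc/swaps)" -le 1 ]
}

# Prints one PASS/FAIL line per requirement for the current user and host.
check_os_settings() {
  local uid
  memory_cgroup_enabled && echo "PASS memory cgroups" || echo "FAIL memory cgroups"
  swap_is_off && echo "PASS swap disabled" || echo "FAIL swap is active"
  for uid in 8536 28536; do
    uid_is_free "$uid" && echo "PASS uid $uid unused" || echo "FAIL uid $uid in use"
  done
  limit_ok "$(ulimit -u)" 65536 && echo "PASS max-user-processes" || echo "FAIL max-user-processes"
  limit_ok "$(ulimit -n)" 1048576 && echo "PASS max-open-files" || echo "FAIL max-open-files"
}
```

Calling check_os_settings prints one line per requirement. The uid check is only meaningful before installation, because an installed Cloudera Data Science Workbench host is expected to use those IDs itself.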

JDK Requirements

The entire CDH cluster, including Cloudera Data Science Workbench gateway hosts, must use the same version of JDK. Points to remember:
  • Oracle JDK 7 is supported across all versions of Cloudera Manager 5 and CDH 5. Oracle JDK 8 is supported in Cloudera Enterprise 5.3.x and higher. Note the JDK 8 Requirement for Spark 2.2 (or higher).

  • OpenJDK 8 is supported in Cloudera Enterprise 5.16.1 and higher. OpenJDK 7 is not supported.

  • For Red Hat/CentOS deployments in particular, Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction must be enabled on the Cloudera Data Science Workbench gateway hosts.

For more specifics on the versions of Oracle JDK and OpenJDK recommended for CDH and Cloudera Manager clusters, and instructions on how to install the Java Cryptography Extension, see the Cloudera Product Compatibility Matrix - Supported JDK Versions.

JDK 8 Requirement for Spark 2.2 (or higher)

CSD-based deployments:

On CSD-based deployments, Cloudera Manager automatically detects the path and version of Java installed on Cloudera Data Science Workbench gateway hosts. You do not need to explicitly set the value for JAVA_HOME unless you want to use a custom location, use a JRE, or (in the case of Spark 2) force Cloudera Manager to use JDK 1.8 as explained below.

To upgrade your entire CDH cluster to JDK 1.8, see Upgrading to Oracle JDK 1.8.

Package-based deployments:

Set JAVA_HOME to the JDK 8 path in cdsw.conf during the installation process. If you modify JAVA_HOME after installation, restart the master and worker hosts for the change to take effect.
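
For package-based deployments, the cdsw.conf entry is expected to be a single assignment along the lines of JAVA_HOME="/usr/java/jdk1.8.0_181", where the path is only an illustrative assumption. Before editing the file, it can help to confirm that the chosen directory really holds a JDK 8. The sketch below assumes the usual `java -version` banner, whose first line contains a string such as version "1.8.0_...". The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for validating a JAVA_HOME candidate.

# Succeeds when a `java -version` banner line reports a 1.8 release.
is_jdk8_banner() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds when the java binary under the given directory identifies itself as 1.8.
java_home_is_jdk8() {
  local java_home="$1" banner
  [ -x "$java_home/bin/java" ] || return 1
  # `java -version` writes its banner to stderr, so merge it into stdout.
  banner="$("$java_home/bin/java" -version 2>&1 | head -n 1)"
  is_jdk8_banner "$banner"
}
```

For example, `java_home_is_jdk8 /usr/java/jdk1.8.0_181 && echo "JDK 8 found"` reports success only when that (assumed) directory contains a Java 8 binary.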

Networking and Security Requirements

  • Enable IPv6 on all Cloudera Data Science Workbench gateway hosts. For instructions, refer to the workaround provided in Known Issue: CDSW cannot start sessions due to connection errors.
  • All Cloudera Data Science Workbench gateway hosts must be part of the same data center and use the same network. Hosts from different data centers or networks can result in unreliable performance.
  • A wildcard subdomain such as *.cdsw.company.com must be configured. Wildcard subdomains are used to provide isolation for user-generated content.

    The wildcard DNS hostname configured for Cloudera Data Science Workbench must be resolvable from both the CDSW cluster and your browser.

  • Disable all pre-existing iptables rules. While Kubernetes makes extensive use of iptables, it’s difficult to predict how pre-existing iptables rules will interact with the rules inserted by Kubernetes. Therefore, Cloudera recommends that you disable all pre-existing rules before you proceed with the installation.
    Before you disable the rules, save them and check that they have been written to the /etc/sysconfig/iptables file. If you disable the rules without saving them, the settings can be lost when the system reboots.
    1. Save the iptables by running the following command:
      service iptables save
    2. Verify whether the changes have been written to the file by running the following command:
      ls -l /etc/sysconfig/iptables
    3. Disable the iptables by running the following commands:
      sudo iptables -P INPUT ACCEPT
      sudo iptables -P FORWARD ACCEPT
      sudo iptables -P OUTPUT ACCEPT
      sudo iptables -t nat -F
      sudo iptables -t mangle -F
      sudo iptables -F
      sudo iptables -X
  • Cloudera Data Science Workbench sets the following sysctl options in /etc/sysctl.d/k8s.conf:
    • net.bridge.bridge-nf-call-iptables=1
    • net.bridge.bridge-nf-call-ip6tables=1
    • net.ipv4.ip_forward=1
    • net.ipv4.conf.default.forwarding=1
    Underlying components of Cloudera Data Science Workbench (Docker, Kubernetes, and NFS) require these options to work correctly. Make sure they are not overridden by high-priority configuration such as /etc/sysctl.conf.
  • SELinux must either be disabled or run in permissive mode.
  • Multi-homed networks are supported with Cloudera Data Science Workbench 1.2.2 (and higher). However, you must explicitly configure the private IP address of the worker nodes in the kubelet start script as follows:
    # vi /opt/cloudera/parcels/CDSW/scripts/start-kubelet-worker-standalone-core.sh
    kubelet_opts+=(--v=2)
    kubelet_opts+=(--node-ip=172.x.x.x)
  • Firewall restrictions must be disabled across Cloudera Data Science Workbench and CDH/HDP cluster hosts. For more details on cluster communication, see Ports Required by Cloudera Data Science Workbench.
  • Untrusted (non-sudo) SSH access to Cloudera Data Science Workbench hosts must be disabled to ensure a secure deployment.

    Cloudera Data Science Workbench assumes that users only access the gateway hosts through the web application. Untrusted users with SSH access to a Cloudera Data Science Workbench host can gain full access to the cluster, including access to other users' workloads.

  • localhost must resolve to 127.0.0.1.
  • Forward and reverse DNS lookup must be enabled for the Cloudera Data Science Workbench domain name and IP address (CDSW master host).
  • Cloudera Data Science Workbench does not support DNS servers running on 127.0.0.1:53. This IP address resolves to the container localhost within Cloudera Data Science Workbench containers. As a workaround, use either a non-loopback address or a remote DNS server.
  • All third-party security software (such as McAfee, Tanium, or Symantec) must be disabled on CDSW hosts. Failure to do so can cause Cloudera Data Science Workbench to fail unpredictably.

Cloudera Data Science Workbench does not support hosts or clusters that do not conform to these restrictions.
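
Several of the requirements above can be verified from a shell on each gateway host. The following is a rough sketch rather than an official check: the function names are hypothetical, it assumes the standard sysctl, getenforce, and getent utilities, and it does not cover iptables rules, firewalls, or third-party security software.

```shell
#!/usr/bin/env bash
# Hypothetical network preflight helpers; not part of Cloudera Data Science Workbench.

# Succeeds when an SELinux mode string (as printed by getenforce) is acceptable.
selinux_mode_ok() {
  case "$1" in
    Disabled|Permissive) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds when the live value of a sysctl key equals the expected value.
sysctl_is() {
  [ "$(sysctl -n "$1" 2>/dev/null)" = "$2" ]
}

# Succeeds when localhost resolves to 127.0.0.1 over IPv4.
localhost_ok() {
  getent ahostsv4 localhost | awk '$1 == "127.0.0.1" { ok = 1 } END { exit !ok }'
}

# Succeeds when an arbitrary name under the wildcard domain resolves from this host.
wildcard_resolves() {
  getent hosts "preflight-probe.$1" >/dev/null
}

# Prints one PASS/FAIL line per check; $1 is the CDSW domain, e.g. cdsw.company.com.
check_network() {
  local domain="$1" key mode
  for key in net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables \
             net.ipv4.ip_forward net.ipv4.conf.default.forwarding; do
    sysctl_is "$key" 1 && echo "PASS $key=1" || echo "FAIL $key"
  done
  # Assumption: a host without getenforce has no SELinux tooling and counts as Disabled.
  mode="$(getenforce 2>/dev/null || echo Disabled)"
  selinux_mode_ok "$mode" && echo "PASS SELinux ($mode)" || echo "FAIL SELinux ($mode)"
  localhost_ok && echo "PASS localhost" || echo "FAIL localhost"
  wildcard_resolves "$domain" && echo "PASS wildcard DNS" || echo "FAIL wildcard DNS"
}
```

Running `check_network cdsw.company.com` prints one line per check. The sysctl checks are only meaningful after installation, since Cloudera Data Science Workbench sets those options itself, and the wildcard check covers resolution from the host only, not from a user's browser.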

Ports Required by Cloudera Data Science Workbench

Cloudera Data Science Workbench runs on gateway hosts in a CDH/HDP cluster. As such, Cloudera Data Science Workbench acts as a gateway and requires full connectivity to cluster services such as Impala and Spark 2. Additionally, in the case of Spark 2, cluster hosts require access to the Spark driver, which runs on a set of random ports (20050-32767) on Cloudera Data Science Workbench hosts.

Firewall restrictions must be disabled across Cloudera Data Science Workbench and CDH/HDP cluster hosts. Internally, the Cloudera Data Science Workbench master and worker hosts require full connectivity with no firewalls. Externally, end users connect to Cloudera Data Science Workbench exclusively through a web server running on the master host, and therefore do not need direct access to any other internal Cloudera Data Science Workbench or CDH services.

This information has been summarized in the following table.
Component: Communication with the CDH / HDP cluster
Details:
  • CDH / HDP -> Cloudera Data Science Workbench: The CDH/HDP cluster must have access to the Spark driver that runs on Cloudera Data Science Workbench hosts, on a set of randomized ports in the range 20050-32767.
  • Cloudera Data Science Workbench -> CDH / HDP: As a gateway service, Cloudera Data Science Workbench must have access to all the ports used by CDH and Cloudera Manager.

Component: Communication with the Web Browser
Details:
  • The Cloudera Data Science Workbench web application is available at port 80. HTTPS access is available over port 443.
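
When auditing firewall or security group rules against the port summary above, two small helpers can flag whether a port falls inside the Spark driver range and whether a port answers at all. This is a sketch under assumptions: the function names are hypothetical, and the probe relies on bash's /dev/tcp redirection and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking ports; not part of Cloudera Data Science Workbench.

# Succeeds when the port lies in the Spark driver range 20050-32767.
in_spark_driver_range() {
  [ "$1" -ge 20050 ] && [ "$1" -le 32767 ]
}

# Succeeds when a TCP connection to host ($1) and port ($2) opens within 3 seconds.
port_reachable() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
```

For example, `port_reachable cdsw.company.com 443 && echo "HTTPS reachable"` can be run from a workstation, with cdsw.company.com replaced by the actual domain.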

Recommended Hardware Configuration

Cloudera Data Science Workbench hosts are added to your CDH cluster as gateway hosts. The recommended minimum hardware configuration for Cloudera Data Science Workbench gateway hosts is:

Resource Type: Supported architecture
  • Intel 64 bit, AMD 64

Resource Type: CPU
  • Master: 16+ CPU (vCPU) cores
  • Workers: 16+ CPU (vCPU) cores

Resource Type: RAM
  • Master: 32+ GB
  • Workers: 32+ GB

Resource Type: Disk Space, Root Volume
  • Master: 100+ GB
  • Workers: 100+ GB
  Notes: If you are going to partition the root volume, make sure you allocate at least 20 GB to / so that the installer can proceed without running out of space.

Resource Type: Disk Space, Application Block Device
  • Master: 1 TB
  • Workers: -
  Notes: The Application Block Device is only required on the Master where it is mounted to /var/lib/cdsw.

  You will be asked to create a /var/lib/cdsw directory on all the Worker hosts during the installation process. However, this directory does not need to be mounted to a block device. It is only used to store client configuration for HDP cluster services on Workers.

Resource Type: Disk Space, Docker Block Device
  • Master: 1 TB
  • Workers: 1 TB
  Notes: The Docker Block Device is required on all Master and Worker hosts.

Scaling Guidelines

New hosts can be added to and removed from a Cloudera Data Science Workbench deployment without interrupting any jobs already scheduled on existing hosts, so it is straightforward to increase capacity based on observed usage. Cloudera recommends that you allocate at least 1 CPU core and 2 GB of RAM per concurrent session or job. CPU can burst above a 1 CPU core share when spare resources are available, so a 1 CPU core allocation is often adequate for light workloads. Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.
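
The per-session guideline translates into a quick capacity estimate. The sketch below is a back-of-the-envelope calculation only: it assumes every session receives exactly the minimum of 1 CPU core and 2 GB of RAM and ignores the resources consumed by Cloudera Data Science Workbench's own services. The name max_sessions is a hypothetical helper.

```shell
# Prints how many minimum-size sessions (1 core, 2 GB RAM each) fit on a host.
# $1 = CPU cores, $2 = RAM in GB.
max_sessions() {
  local cores="$1" ram_gb="$2"
  local by_cpu=$(( cores / 1 ))   # 1 CPU core per session
  local by_ram=$(( ram_gb / 2 ))  # 2 GB of RAM per session
  # The scarcer resource sets the limit.
  if [ "$by_cpu" -lt "$by_ram" ]; then echo "$by_cpu"; else echo "$by_ram"; fi
}
```

For a host at the recommended minimum of 16 cores and 32 GB of RAM, `max_sessions 16 32` prints 16. In practice the usable number is lower, because heavier workloads need more than the minimum allocation.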

As a general guideline, Cloudera recommends hosts with between 60 GB and 256 GB of RAM, and between 16 and 48 cores. This provides a useful range of options for end users. Note that SSDs are strongly recommended for application data storage. Using standard HDDs can sometimes result in poor application performance.

For some data science and machine learning applications, users can collect a significant amount of data in memory within a single R or Python process, or use a significant amount of CPU resources that cannot be easily distributed into the CDH cluster. If individual users frequently run larger workloads or run workloads in parallel over long durations, increase the total resources accordingly. Understanding your users' concurrent workload requirements or observing actual usage is the best approach to scaling Cloudera Data Science Workbench.

Python Supported Versions

The default Cloudera Data Science Workbench engine currently includes Python 2.7.17 and Python 3.6.9. Cloudera Data Science Workbench supports the Python versions that come bundled with the base engine image. To use PySpark with lambda functions that run within the CDH cluster, the Spark executors must have access to a matching version of Python. For many common operating systems, the default system Python will not match the minor release of Python included in Cloudera Data Science Workbench.

To ensure that the Python versions match, Python can either be installed on every CDH host or made available per job run using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions.

You can install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON environment variable in your project. Cloudera Data Science Workbench 1.3 (and higher) includes a separate environment variable for Python 3 sessions called PYSPARK3_PYTHON. Python 2 sessions continue to use the default PYSPARK_PYTHON variable. This allows you to run Python 2 and Python 3 sessions in parallel without either variable being overridden by the other.
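
Whichever installation method is used, the interpreter on the cluster hosts should share its major.minor release with the engine's Python. A minimal sketch of that comparison follows. The helper names are hypothetical, and the interpreter path in the usage note is an assumption to replace with the actual location on your cluster hosts.

```shell
# Prints the major.minor part of a Python version string, e.g. 3.6.9 -> 3.6.
python_minor() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeeds when two version strings share the same major.minor release.
same_minor() {
  [ "$(python_minor "$1")" = "$(python_minor "$2")" ]
}
```

For example, on a cluster host, `same_minor 3.6.9 "$(/usr/bin/python3.6 -c 'import platform; print(platform.python_version())')"` succeeds when the interpreter at that (assumed) path is a 3.6 release and would therefore be a reasonable value for PYSPARK3_PYTHON.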

For an example of distributing Python dependencies dynamically, see Example: Distributing Dependencies on a PySpark Cluster.

Anaconda

Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH to enable simple distribution, installation, and management of popular Python packages and their dependencies. Note that this parcel is not directly supported by Cloudera.

Docker and Kubernetes Support

Cloudera Data Science Workbench only supports the versions of Docker and Kubernetes that are shipped with each release. Upgrading Docker or Kubernetes, or running on third-party Kubernetes clusters is not supported.

Supported Browsers

  • Chrome (latest stable version)
  • Firefox (latest released version and latest ESR version)
  • Safari 9+

Cloudera Altus Director Support (AWS and Azure Only)

Altus Director support for Cloudera Data Science Workbench is available for the following platforms:
  • Amazon Web Services (AWS) - Cloudera Altus Director 2.6.0 (and higher)
  • Microsoft Azure - Cloudera Altus Director 2.7 (and higher)

  • Cloudera Manager 5.13.1 (and higher)
  • CSD-based Cloudera Data Science Workbench 1.2.x (and higher)

Deploying Cloudera Data Science Workbench with Altus Director

Points to note when using Altus Director to install Cloudera Data Science Workbench:
  • (Required for Director 2.6) Before you run the command to bootstrap a new cluster, set the lp.normalization.mountAllUnmountedDisksRequired property to false in the Altus Director server's application.properties file, and then restart Altus Director.

    Higher versions of Altus Director do not require this step. Altus Director 2.7 (and higher) includes an instance-level setting called mountAllUnmountedDisks that must be set to false as demonstrated in the following sample configuration files.

  • Depending on your cloud platform, you can use one of the following sample configuration files to deploy a Cloudera Manager cluster with Cloudera Data Science Workbench.

    Note that these sample files are tailored to Altus Director 2.7 (and higher) and they install a very limited CDH cluster with just the following services: HDFS, YARN, and Spark 2. You can extend them as needed to match your use case.

Recommended Configuration on Amazon Web Services (AWS)

On AWS, Cloudera Data Science Workbench must be used with persistent/long-running Apache Hadoop clusters only.

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • m4.4xlarge–m4.16xlarge

      In this case, bigger is better. That is, one m4.16xlarge is better than four m4.4xlarge hosts. AWS pricing scales linearly, and larger instances have more EBS bandwidth.

  • Storage
    • 100 GB root volume block device (gp2) on all hosts
    • 500 GB Docker block devices (gp2) on all hosts
    • 1 TB Application block device (io1) on master host

Recommended Configuration on Microsoft Azure

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • DS13-DS14 v2 instances on all hosts.
  • Storage
    • P30 premium storage for the Application and Docker block devices.

      Cloudera Data Science Workbench requires premium disks for its block devices on Azure. Standard disks can lead to unacceptable performance even on small clusters.