Cloudera Data Science Workbench 1.2.x Requirements and Supported Platforms

This topic lists the software and hardware configuration required to successfully install and run Cloudera Data Science Workbench. Cloudera Data Science Workbench does not support hosts or clusters that do not conform to the requirements listed on this page.

Cloudera Manager and CDH Requirements

Cloudera Data Science Workbench 1.2.x is supported on the following versions of CDH and Cloudera Manager:
  • CDH 5.7 or higher 5.x versions.

  • CSD-based deployments: Cloudera Manager 5.13 or higher 5.x versions.

  • Package-based deployments: Cloudera Manager 5.11 or higher 5.x versions.

    All cluster hosts must be managed by Cloudera Manager. Note that all Cloudera Data Science Workbench administrative tasks require root access to the cluster's gateway hosts where Cloudera Data Science Workbench is installed. Therefore, Cloudera Data Science Workbench does not support single-user mode installations.

  • CDS 2.1 Powered by Apache Spark and higher.

Operating System Requirements

Cloudera Data Science Workbench 1.2.x is supported on the following operating systems:
  • RHEL/CentOS 7.2, 7.3, 7.4
  • Oracle Linux 7.3 (UEK - default)
  • SLES 12 SP2 (supported for Cloudera Data Science Workbench 1.2.2 and higher)

A gateway node that is dedicated to running Cloudera Data Science Workbench must use one of the operating systems listed above, even if the remaining CDH hosts in your cluster run other operating systems supported by Cloudera Enterprise.

Cloudera Data Science Workbench also publishes placeholder parcels for other operating systems. These parcels are not functional; they are included only to support mixed-OS clusters.

JDK Requirements

The entire CDH cluster, including Cloudera Data Science Workbench gateway nodes, must use Oracle JDK. OpenJDK is not currently supported by Cloudera Data Science Workbench.

For more specifics on the versions of Oracle JDK recommended for CDH and Cloudera Manager clusters, see the Cloudera Product Compatibility Matrix - Supported JDK Versions.

JDK 8 Requirement for Spark 2.2

CSD-based deployments:

Spark 2.2 requires JDK 1.8. On CSD-based deployments, Cloudera Manager automatically detects the path and version of Java installed on Cloudera Data Science Workbench gateway hosts. However, if a host has both JDK 1.7 and JDK 1.8 installed, Cloudera Manager might choose to use JDK 1.7 over JDK 1.8. If you are using Spark 2.2, this will create a problem during the first run of the service because Spark 2.2 will not work with JDK 1.7. To work around this, configure Cloudera Manager to use JDK 1.8 on Cloudera Data Science Workbench gateway hosts. For instructions, see Configuring a Custom Java Home Location in Cloudera Manager.

To upgrade your entire CDH cluster to JDK 1.8, see Upgrading to Oracle JDK 1.8.

Package-based deployments:

Set JAVA_HOME to the JDK 8 path in cdsw.conf during the installation process. If you modify JAVA_HOME after installation, restart the master and worker nodes for the change to take effect.
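Before settling on a JAVA_HOME value, it can help to confirm that the path really points at a 1.8 JDK. The following is a minimal sketch, not part of the product: the is_jdk8 helper and the sample version banners are illustrative assumptions, and the check only inspects the first line of the text that `java -version` prints.

```shell
# Return 0 if the given `java -version` banner reports a 1.8 JDK.
# is_jdk8 is a hypothetical helper, not a Cloudera tool.
is_jdk8() {
    # The first banner line looks like: java version "1.8.0_144"
    printf '%s\n' "$1" | head -n 1 | grep -q 'version "1\.8\.'
}

# Sample banners (illustrative values, not taken from a real host).
is_jdk8 'java version "1.8.0_144"' && echo "JDK 8: ok"
is_jdk8 'java version "1.7.0_80"' || echo "JDK 7: not suitable for Spark 2.2"

# On a real host, assuming JAVA_HOME is already exported
# (java -version writes its banner to stderr):
#   is_jdk8 "$("$JAVA_HOME/bin/java" -version 2>&1)"
```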

Networking and Security Requirements

  • A wildcard subdomain such as *.cdsw.company.com. Wildcard subdomains are used to provide isolation for user-generated content.
  • Disable all pre-existing iptables rules. While Kubernetes makes extensive use of iptables, it’s difficult to predict how pre-existing iptables rules will interact with the rules inserted by Kubernetes. Therefore, Cloudera recommends you use the following commands to disable all pre-existing rules before you proceed with the installation.
    sudo iptables -P INPUT ACCEPT
    sudo iptables -P FORWARD ACCEPT
    sudo iptables -P OUTPUT ACCEPT
    sudo iptables -t nat -F
    sudo iptables -t mangle -F
    sudo iptables -F
    sudo iptables -X
  • Cloudera Data Science Workbench sets the following sysctl options in /etc/sysctl.d/k8s.conf:
    • net.bridge.bridge-nf-call-iptables=1
    • net.bridge.bridge-nf-call-ip6tables=1
    • net.ipv4.ip_forward=1
    Underlying components of Cloudera Data Science Workbench (Docker, Kubernetes, and NFS) require these options to work correctly. Make sure they are not overridden by high-priority configuration such as /etc/sysctl.conf.
  • SELinux must either be disabled or run in permissive mode.
  • Multi-homed networks are supported only with Cloudera Data Science Workbench 1.2.2 (and higher).
  • No firewall restrictions across Cloudera Data Science Workbench or CDH hosts.
  • Non-root SSH access is not allowed on Cloudera Data Science Workbench hosts.
  • localhost must resolve to 127.0.0.1.
  • Cloudera Data Science Workbench does not support DNS servers running on 127.0.0.1:53. This IP address resolves to the container localhost within Cloudera Data Science Workbench containers. As a workaround, use either a non-loopback address or a remote DNS server.

Cloudera Data Science Workbench does not support hosts or clusters that do not conform to these restrictions.
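One way to confirm that the three sysctl options listed above are still in effect is to compare the output of sysctl against the expected value of 1. The sketch below is an assumption-laden illustration: missing_sysctl is a hypothetical helper, and the sample text stands in for real sysctl output so that the example does not depend on the host it runs on.

```shell
# Print each required option whose value in the given sysctl output is not 1.
# missing_sysctl is a hypothetical helper, not part of the product.
missing_sysctl() {
    for key in net.bridge.bridge-nf-call-iptables \
               net.bridge.bridge-nf-call-ip6tables \
               net.ipv4.ip_forward; do
        # Split each "key = value" line and pick the value for this key.
        value=$(printf '%s\n' "$1" |
            awk -F'[ \t]*=[ \t]*' -v k="$key" '$1 == k { print $2 }')
        [ "$value" = "1" ] || echo "$key"
    done
}

# Sample output in which IP forwarding has been overridden (illustrative).
sample='net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 0'

missing_sysctl "$sample"   # prints net.ipv4.ip_forward

# On a real host:
#   missing_sysctl "$(sysctl net.bridge.bridge-nf-call-iptables \
#       net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward)"
```

An empty result means all three options report 1. Any name printed points to a setting that another configuration file may be overriding.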

Recommended Hardware Configuration

Cloudera Data Science Workbench hosts are added to your CDH cluster as gateway hosts. The recommended minimum hardware configuration for the master host is:

  • CPU: 16+ CPU (vCPU) cores

  • RAM: 32+ GB RAM

  • Disk
    • Root Volume: 100+ GB.

      The Cloudera Data Science Workbench installer temporarily decompresses the engine image file located in /etc/cdsw/images to the /var/lib/docker/tmp/ directory. If you are going to partition the root volume, make sure you allocate at least 20 GB to /var/lib/docker/tmp so that the installer can proceed without running out of space.

    • Application Block Device or Mount Point (Master Host Only): 500+ GB
    • Docker Image Block Device: 500+ GB
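The minimums above lend themselves to a simple comparison of measured values against thresholds. This is only a sketch: check_min is a hypothetical helper, and the measurements passed to it are made-up numbers for a single master host rather than values read from a real system.

```shell
# Compare a measured value against a documented minimum.
# Arguments: label, measured value, minimum (integers in the same unit).
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "ok: $1 ($2 >= $3)"
    else
        echo "too small: $1 ($2 < $3)"
        return 1
    fi
}

# Hypothetical measurements checked against the minimums listed above.
check_min "CPU cores" 16 16
check_min "RAM GB" 64 32
check_min "root volume GB" 80 100 || echo "resize before installing"

# On a real host the inputs could come from tools such as nproc or df, e.g.
#   df -BG --output=avail /var/lib/docker/tmp   (GNU coreutils)
```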

Scaling Guidelines

Nodes can be added to and removed from a Cloudera Data Science Workbench deployment without interrupting jobs already scheduled on existing hosts, so it is straightforward to increase capacity based on observed usage. Cloudera recommends allocating at least 1 CPU core and 2 GB of RAM per concurrent session or job. CPU can burst above a 1 CPU core share when spare resources are available, so a 1 CPU core allocation is often adequate for light workloads. Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.
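As a rough illustration of the 1 core and 2 GB per session guideline, the number of concurrent sessions a host can hold is bounded by whichever resource runs out first. The max_sessions function below is a hypothetical sketch. It ignores CPU and memory reserved for the operating system and for Cloudera Data Science Workbench itself, so treat its result as an upper bound.

```shell
# Estimate how many concurrent sessions fit on a host at
# 1 CPU core and 2 GB of RAM per session.
# Arguments: CPU cores, RAM in GB.
max_sessions() {
    by_cpu=$1
    by_ram=$(( $2 / 2 ))
    # The scarcer resource sets the limit.
    if [ "$by_cpu" -lt "$by_ram" ]; then
        echo "$by_cpu"
    else
        echo "$by_ram"
    fi
}

max_sessions 16 60   # prints 16 (CPU-bound)
max_sessions 48 64   # prints 32 (RAM-bound)
```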

As a general guideline, Cloudera recommends nodes with RAM between 60 GB and 256 GB, and between 16 and 48 cores. This provides a useful range of options for end users. SSDs are strongly recommended for application data storage.

For some data science and machine learning applications, users can collect a significant amount of data in memory within a single R or Python process, or use a significant amount of CPU resources that cannot be easily distributed into the CDH cluster. If individual users frequently run larger workloads or run workloads in parallel over long durations, increase the total resources accordingly. Understanding your users' concurrent workload requirements or observing actual usage is the best approach to scaling Cloudera Data Science Workbench.

Python Supported Versions

The default Cloudera Data Science Workbench engine (Base Image Version 1) includes Python 2.7.11 and Python 3.6.1. To use PySpark with lambda functions that run within the CDH cluster, the Spark executors must have access to a matching version of Python. For many common operating systems, the default system Python will not match the minor release of Python included in Data Science Workbench.

To ensure that the Python versions match, Python can either be installed on every CDH node or made available per job run using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions. You can install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON variable in your project.
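A quick way to spot a mismatch is to compare the engine's Python version with the one found on the cluster hosts. The sketch below assumes that "matching" means the same major.minor series (for example 3.6). The helper names and the executor versions are illustrative, while the engine versions are the ones listed for Base Image Version 1.

```shell
# Reduce a full version such as 2.7.11 to its major.minor part (2.7).
major_minor() {
    echo "$1" | cut -d. -f1,2
}

# Return 0 when two Python versions share the same major.minor series.
# Assumption: this is the level at which driver and executors must agree.
same_python() {
    [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Engine versions from the text; executor versions are made-up examples.
same_python 2.7.11 2.7.5 && echo "Python 2: engine and executors match"
same_python 3.6.1 3.4.5 || echo "Python 3: install 3.6 on the cluster"

# Once a matching interpreter is installed, point the project at it.
# The path below is a placeholder, not a location the product mandates:
#   export PYSPARK_PYTHON=/path/to/python3.6
```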

Anaconda - Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH to enable simple distribution, installation, and management of popular Python packages and their dependencies. The public Anaconda parcel ships Python 2.7.11. Note that the Anaconda parcel is not directly supported by Cloudera, and no publicly available parcel exists for Python 3.6. For an example of distributing Python dependencies dynamically, see Example: Distributing Dependencies on a PySpark Cluster.

Docker and Kubernetes Support

Cloudera Data Science Workbench supports only the versions of Docker and Kubernetes that are shipped with each release. Upgrading Docker or Kubernetes, or running on third-party Kubernetes clusters, is not supported.

Supported Browsers

  • Chrome (latest stable version)
  • Firefox (latest released version and latest ESR version)
  • Safari 9+
  • Internet Explorer (IE) 11+
    • IE's Compatibility View mode is not supported.

Cloudera Altus Director Support (AWS Only)

Starting with Cloudera Data Science Workbench 1.2.x, you can use Cloudera Altus Director to deploy clusters with Cloudera Data Science Workbench.

Altus Director support is available only for the following platforms:
  • Cloudera Altus Director 2.6.0 (and higher)
  • Cloudera Manager 5.13.1 (and higher)
  • CSD-based Cloudera Data Science Workbench 1.2.x (and higher)
  • Currently, only installations on Amazon Web Services (AWS) are supported.

Deploying Cloudera Data Science Workbench with Cloudera Altus Director

Points to note when using Cloudera Altus Director to install Cloudera Data Science Workbench:
  • (Required) Before you run the command to bootstrap a new cluster, set the lp.normalization.mountAllUnmountedDisksRequired property to false in the Altus Director server's application.properties file, and then restart Altus Director.

  • Use the following sample configuration file to bootstrap a cluster with the Altus Director CLI: aws.cdsw.conf. This will deploy a Cloudera Manager cluster with Cloudera Data Science Workbench on AWS.

    Note that this sample file installs a very limited CDH cluster with just the following services: HDFS, YARN, and Spark 2. You can extend it as needed to match your use case.
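The required property change described above can be scripted. The following sketch makes several assumptions: set_property is a hypothetical helper, the `key: value` layout is assumed for application.properties, and the demonstration runs against a temporary file rather than the real Altus Director configuration, whose location depends on your installation.

```shell
# Set key to value in a properties file, replacing an existing
# (possibly commented-out) entry or appending a new one.
# Arguments: file, key, value. A .bak copy is left when a line is replaced.
set_property() {
    if grep -Eq "^[#[:space:]]*$2[[:space:]]*[:=]" "$1"; then
        sed -i.bak -E "s|^[#[:space:]]*$2[[:space:]]*[:=].*|$2: $3|" "$1"
    else
        printf '%s: %s\n' "$2" "$3" >> "$1"
    fi
}

# Demonstrate on a temporary file that stands in for application.properties.
props=$(mktemp)
echo '# lp.normalization.mountAllUnmountedDisksRequired: true' > "$props"
set_property "$props" lp.normalization.mountAllUnmountedDisksRequired false
cat "$props"   # prints lp.normalization.mountAllUnmountedDisksRequired: false
```

Remember that the text above also requires restarting Altus Director after the property is changed.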

Recommended Configuration on Amazon Web Services (AWS)

On AWS, Cloudera Data Science Workbench must be used with persistent/long-running Apache Hadoop clusters only.

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • m4.4xlarge–m4.16xlarge

      In this case, bigger is better. That is, one m4.16xlarge is better than four m4.4xlarge hosts. AWS pricing scales linearly, and larger instances have more EBS bandwidth.

  • Storage
    • 100 GB root volume block device (gp2) on all hosts
    • 500 GB Docker block devices (gp2) on all hosts
    • 1 TB Application block device (io1) on master host

Recommended Configuration on Microsoft Azure

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • DS13-DS14 v2 instances on all hosts.
  • Storage
    • P30 premium storage for the Application and Docker block devices.