Cloudera Data Science Workbench 1.1.x Requirements and Supported Platforms

This topic lists the software and hardware configuration required to successfully install and run Cloudera Data Science Workbench. Cloudera Data Science Workbench does not support hosts or clusters that do not conform to the requirements listed on this page.

Cloudera Manager and CDH Requirements

Cloudera Data Science Workbench 1.1.x is supported on the following versions of CDH and Cloudera Manager:
  • CDH 5.7 or higher 5.x versions.
  • Cloudera Manager 5.11 or higher 5.x versions.

    All cluster hosts must be managed by Cloudera Manager. All Cloudera Data Science Workbench administration tasks require root access to the cluster hosts. Therefore, Cloudera Data Science Workbench does not support single-user mode installations.

  • Cloudera's Distribution of Apache Spark 2.1 and higher.
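
The version rules above can be expressed as a quick pre-installation check. The following is a minimal sketch, not a Cloudera tool: the function names are hypothetical, and it assumes version strings are plain dotted numbers such as 5.11.2 (vendor suffixes would need to be stripped first).

```python
def parse_version(text):
    """Turn a dotted version string such as '5.11.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_supported_5x(version, minimum):
    """Return True if `version` is a 5.x release at or above `minimum`."""
    parsed = parse_version(version)
    return parsed[0] == 5 and parsed[:2] >= parse_version(minimum)[:2]


def meets_requirements(cdh, cloudera_manager, spark2):
    """Check the three version requirements listed in this section.

    Hypothetical helper: the arguments are version strings you supply
    yourself; nothing here queries a live cluster.
    """
    return (
        is_supported_5x(cdh, "5.7")
        and is_supported_5x(cloudera_manager, "5.11")
        and parse_version(spark2)[:2] >= (2, 1)
    )
```

For example, meets_requirements("5.12.0", "5.12.1", "2.2.0") passes, while a cluster on CDH 5.6 or Cloudera Manager 5.10 does not.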

Operating System Requirements

Cloudera Data Science Workbench 1.1.x is supported on the following operating systems:
  • RHEL/CentOS 7.2, 7.3
  • Oracle Linux 7.3 (UEK - default)

A gateway node that is dedicated to running Cloudera Data Science Workbench must use one of the operating systems listed above, even if the remaining CDH hosts in your cluster run a different operating system supported by CDH.

JDK Requirements

The entire CDH cluster, including Cloudera Data Science Workbench gateway nodes, must use Oracle JDK. OpenJDK is not currently supported by Cloudera Data Science Workbench.

For more specifics on the versions of Oracle JDK recommended for CDH and Cloudera Manager clusters, see the Cloudera Product Compatibility Matrix - Supported JDK Versions.

Networking and Security Requirements

  • A wildcard subdomain such as *.cdsw.company.com. Wildcard subdomains are used to provide isolation for user-generated content.
  • Disable all pre-existing iptables rules. While Kubernetes makes extensive use of iptables, it’s difficult to predict how pre-existing iptables rules will interact with the rules inserted by Kubernetes. Therefore, Cloudera recommends you disable all pre-existing rules before you proceed with the installation.
  • Cloudera Data Science Workbench sets the following sysctl options in /etc/sysctl.d/k8s.conf:
    • net.ipv6.conf.all.disable_ipv6=0
    • net.ipv6.conf.default.disable_ipv6=0
    • net.bridge.bridge-nf-call-iptables=1
    • net.bridge.bridge-nf-call-ip6tables=1
    • net.ipv4.ip_forward=1
    Underlying components of Cloudera Data Science Workbench (Docker, Kubernetes, and NFS) require these options to work correctly. Make sure they are not overridden by high-priority configuration such as /etc/sysctl.conf.
  • SELinux must be disabled.
  • No firewall restrictions across Cloudera Data Science Workbench or CDH hosts.
  • No multi-homed networks.
  • Non-root SSH access is not allowed on Cloudera Data Science Workbench hosts.

Cloudera Data Science Workbench does not support hosts or clusters that do not conform to these restrictions.
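
One way to confirm that the sysctl options listed above have not been overridden is to read the live values from /proc/sys. The following is a minimal sketch under stated assumptions: it is not part of Cloudera Data Science Workbench, the function names are hypothetical, and it assumes a Linux host where each sysctl key maps to a file under /proc/sys with dots replaced by path separators.

```python
from pathlib import Path

# Values taken from the list in this section.
REQUIRED_SYSCTLS = {
    "net.ipv6.conf.all.disable_ipv6": "0",
    "net.ipv6.conf.default.disable_ipv6": "0",
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.bridge.bridge-nf-call-ip6tables": "1",
    "net.ipv4.ip_forward": "1",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as 'net.ipv4.ip_forward' to its file path."""
    return Path(root).joinpath(*key.split("."))


def find_mismatches(required=REQUIRED_SYSCTLS, root="/proc/sys"):
    """Return {key: actual_value} for every option that differs.

    A value of None means the file could not be read, for example
    because the bridge kernel module is not loaded.
    """
    mismatches = {}
    for key, expected in required.items():
        try:
            actual = sysctl_path(key, root).read_text().strip()
        except OSError:
            actual = None
        if actual != expected:
            mismatches[key] = actual
    return mismatches

# Usage on a host: an empty dict means all five options are in effect.
#   print(find_mismatches())
```

The root parameter exists only so the check can be pointed at a test directory instead of the live /proc/sys tree.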

Recommended Hardware Configuration

Cloudera Data Science Workbench hosts are added to your CDH cluster as gateway hosts. The recommended minimum hardware configuration for the master host is:

  • CPU: 16+ CPU (vCPU) cores

  • RAM: 32+ GB RAM

  • Disk:
    • Root Volume: 100+ GB
    • Application Block Device or Mount Point (Master Host Only): 500+ GB
    • Docker Image Block Device: 500+ GB

Scaling Guidelines

Nodes can be added to and removed from a Cloudera Data Science Workbench deployment without interrupting any jobs already scheduled on existing hosts, so it is straightforward to increase capacity based on observed usage. Cloudera recommends you allocate at least 1 CPU core and 2 GB of RAM per concurrent session or job. CPU can burst above a 1 CPU core share when spare resources are available, so a 1 CPU core allocation is often adequate for light workloads. Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.

As a general guideline, Cloudera recommends nodes with RAM between 60 GB and 256 GB, and between 16 and 48 cores. This provides a useful range of options for end users. SSDs are strongly recommended for application data storage.
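
The per-session minimum of 1 CPU core and 2 GB of RAM gives a rough upper bound on how many sessions or jobs a node can run at once. The following sketch works through that arithmetic. The function names are hypothetical, and it assumes that all of a node's cores and RAM are available to sessions, which overstates real capacity because the operating system and Cloudera Data Science Workbench services also consume resources.

```python
def max_concurrent_sessions(cores, ram_gb, cores_per_session=1, ram_gb_per_session=2):
    """Upper bound on concurrent sessions for one node.

    The node is limited by whichever resource runs out first.
    """
    by_cpu = cores // cores_per_session
    by_ram = ram_gb // ram_gb_per_session
    return min(by_cpu, by_ram)


def cluster_capacity(nodes):
    """Sum the per-node bounds for a list of (cores, ram_gb) tuples."""
    return sum(max_concurrent_sessions(cores, ram_gb) for cores, ram_gb in nodes)
```

With these defaults, a node with 16 cores and 64 GB of RAM is CPU-bound at 16 sessions, while a node with 48 cores and 60 GB of RAM is memory-bound at 30 sessions. Raising ram_gb_per_session models users who hold large datasets in a single R or Python process.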

For some data science and machine learning applications, users can collect a significant amount of data in memory within a single R or Python process, or use a significant amount of CPU resources that cannot be easily distributed into the CDH cluster. If individual users frequently run larger workloads or run workloads in parallel over long durations, increase the total resources accordingly. Understanding your users' concurrent workload requirements or observing actual usage is the best approach to scaling Cloudera Data Science Workbench.

Python Supported Versions

The default Cloudera Data Science Workbench engine (Base Image Version 1) includes Python 2.7.11 and Python 3.6.1. To use PySpark with lambda functions that run within the CDH cluster, the Spark executors must have access to a matching version of Python. For many common operating systems, the default system Python will not match the minor release of Python included in Data Science Workbench.

To ensure that the Python versions match, Python can either be installed on every CDH node or made available per job run using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions. You can install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON variable in your project.
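
As an illustration of setting PYSPARK_PYTHON in a project, the sketch below selects a cluster interpreter that matches the major and minor version of the Python running in the session. The install paths and function names are hypothetical; substitute the locations where Python is actually installed on your CDH nodes. The variable is generally expected to be set before the Spark session or context is created.

```python
import os
import sys

# Hypothetical install locations on the CDH nodes; adjust for your cluster.
CLUSTER_PYTHONS = {
    (2, 7): "/opt/python2.7/bin/python",
    (3, 6): "/opt/python3.6/bin/python3",
}


def pyspark_python_for(version_info, cluster_pythons=CLUSTER_PYTHONS):
    """Return the cluster interpreter path matching major.minor."""
    key = (version_info[0], version_info[1])
    if key not in cluster_pythons:
        raise LookupError("No cluster Python registered for %d.%d" % key)
    return cluster_pythons[key]


def configure_pyspark_python(version_info=None, environ=None):
    """Set PYSPARK_PYTHON to match the given (or current) Python version."""
    if version_info is None:
        version_info = sys.version_info
    if environ is None:
        environ = os.environ
    path = pyspark_python_for(version_info)
    environ["PYSPARK_PYTHON"] = path
    return path
```

Calling configure_pyspark_python() with no arguments uses the session's own interpreter version and the process environment. A LookupError signals that no matching Python has been registered for the cluster, which is the mismatch this section warns about.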

Anaconda - Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH to enable simple distribution, installation, and management of popular Python packages and their dependencies. The public Anaconda parcel ships Python 2.7.11. Note that the Anaconda parcel is not directly supported by Cloudera, and no publicly available parcel exists for Python 3.6. For an example of distributing Python dependencies dynamically, see Example: Distributing Dependencies on a PySpark Cluster.

Docker and Kubernetes Support

Cloudera Data Science Workbench supports only the versions of Docker and Kubernetes that are shipped with each release. Upgrading Docker or Kubernetes, or running on third-party Kubernetes clusters, is not supported.

Supported Browsers

  • Chrome (latest stable version)
  • Firefox (latest released version and latest ESR version)
  • Safari 9+
  • Internet Explorer (IE) 11+

Recommended Configuration on Amazon Web Services (AWS)

On AWS, Cloudera Data Science Workbench must be used with persistent/long-running Apache Hadoop clusters only.

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • m4.4xlarge–m4.16xlarge

      In this case, bigger is better. That is, one m4.16xlarge is better than four m4.4xlarge hosts. AWS pricing scales linearly, and larger instances have more EBS bandwidth.

  • Storage
    • 100 GB root volume block device (gp2) on all hosts
    • 500 GB Docker block devices (gp2) on all hosts
    • 1 TB Application block device (io1) on master host

Recommended Configuration on Microsoft Azure

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • DS13-DS14 v2 instances on all hosts.
  • Storage
    • P30 premium storage for the Application and Docker block devices.