Recommended Hardware Configuration

Cloudera Data Science Workbench hosts are added to your CDP cluster as gateway hosts.

The recommended minimum hardware configuration for Cloudera Data Science Workbench gateway hosts is:

| Resource Type | Master | Workers | Notes |
|---|---|---|---|
| Supported architecture | Intel 64 bit, AMD 64 | Intel 64 bit, AMD 64 | |
| CPU | 16+ CPU (vCPU) cores | 16+ CPU (vCPU) cores | |
| RAM | 32+ GB | 32+ GB | See Scaling Guidelines below for more information. |
| Disk Space | | | |
| Root Volume | 200+ GB | 200+ GB | If you are going to partition the root volume, make sure you allocate at least 200 GB to /var/lib/cdsw/docker-tmp so that the installer can proceed without running out of space. |
| Application Block Device | 1 TB | (Not required) | The Application Block Device is required only on the Master, where it is mounted to /var/lib/cdsw. During installation you will be asked to create a /var/lib/cdsw directory on all Worker hosts. On Workers this directory does not need to be mounted to a block device; it is used only to store client configuration for HDP cluster services. |
| Docker Block Device | 1 TB | 1 TB | A raw device is expected on all Master and Worker hosts. Docker will create the necessary volumes on these hosts. |
| Parcel directory | 50 GB | 50 GB | If using Cloudera Manager parcels to distribute CDSW, all hosts managed by Cloudera Manager must have at least 50 GB of free space in the parcel directory, which defaults to /opt/cloudera/parcels/. |
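
The minimums listed above can be checked on a candidate host before installation. The following is a minimal Python sketch and is not part of Cloudera Data Science Workbench: the names MINIMUMS, find_shortfalls, and probe_local_host are illustrative assumptions that only restate the values above, and the RAM probe assumes a Linux host.

```python
import os
import shutil

# Minimums restated from the hardware table (illustrative names, not a CDSW API).
MINIMUMS = {
    "cpu_cores": 16,
    "ram_gb": 32,
    "root_volume_gb": 200,
    "parcel_free_gb": 50,
}

GIB = 1024 ** 3  # sizes are compared in binary gigabytes (an assumption)


def find_shortfalls(observed, minimums=MINIMUMS):
    """Return {resource: (observed, required)} for each resource below its minimum."""
    return {
        name: (observed.get(name, 0), required)
        for name, required in minimums.items()
        if observed.get(name, 0) < required
    }


def probe_local_host(parcel_dir="/opt/cloudera/parcels"):
    """Collect observed values for the local host (RAM probe works on Linux)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # If the parcel directory does not exist yet, measure the root filesystem instead.
    parcel_path = parcel_dir if os.path.isdir(parcel_dir) else "/"
    return {
        "cpu_cores": os.cpu_count() or 0,
        "ram_gb": ram_bytes / GIB,
        "root_volume_gb": shutil.disk_usage("/").total / GIB,
        "parcel_free_gb": shutil.disk_usage(parcel_path).free / GIB,
    }
```

Calling find_shortfalls(probe_local_host()) returns an empty dictionary when every check passes. The RAM that the kernel reports is usually slightly below the nominal installed size, so a near miss on the RAM check deserves a manual look rather than automatic rejection. The sketch does not inspect the Application or Docker block devices.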

Scaling Guidelines

Hosts can be added to and removed from a Cloudera Data Science Workbench deployment without interrupting jobs already scheduled on existing hosts, so it is straightforward to increase capacity based on observed usage. Cloudera recommends allocating at least 1 CPU core and 2 GB of RAM per concurrent session or job. CPU can burst above a 1 CPU core share when spare resources are available, so a 1 CPU core allocation is often adequate for light workloads. Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.
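
The per-session minimums translate into a rough upper bound on how many sessions one host can run at once. The sketch below is an illustrative calculation only: the function name max_concurrent_sessions is hypothetical, and it assumes the entire host is available to sessions, ignoring whatever the operating system and Cloudera Data Science Workbench services themselves consume.

```python
def max_concurrent_sessions(host_cores, host_ram_gb,
                            cores_per_session=1.0, ram_gb_per_session=2.0):
    """Estimate how many sessions fit on one host.

    Defaults follow the 1 CPU core / 2 GB RAM per-session minimum.
    The scarcer of the two resources sets the limit.
    """
    if cores_per_session <= 0 or ram_gb_per_session <= 0:
        raise ValueError("per-session allocations must be positive")
    by_cpu = int(host_cores // cores_per_session)
    by_ram = int(host_ram_gb // ram_gb_per_session)
    return min(by_cpu, by_ram)
```

For a host at the recommended minimum of 16 cores and 32 GB of RAM, the estimate is 16 sessions at the default allocation. Raising the per-session RAM to 4 GB makes memory the limiting resource and halves the estimate to 8.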

As a general guideline, Cloudera recommends hosts with between 60 GB and 256 GB of RAM and between 16 and 48 cores. This provides a useful range of options for end users. SSDs are strongly recommended for application data storage; standard HDDs can result in poor application performance.

For some data science and machine learning applications, users can collect a significant amount of data in memory within a single R or Python process, or use a significant amount of CPU resources that cannot be easily distributed across the CDH cluster. If individual users frequently run larger workloads or run workloads in parallel over long durations, increase the total resources accordingly. Understanding your users' concurrent workload requirements or observing actual usage is the best approach to scaling Cloudera Data Science Workbench.
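
When the expected workload mix is known, the total demand can be turned into a rough host count. The following sketch is a hypothetical back-of-the-envelope calculation, not a Cloudera sizing tool: the function hosts_needed and the workload figures in the usage note are assumptions for illustration. It sums demand across all workloads and ignores both system overhead and the fact that a single large session must fit entirely on one host, so treat the result as a lower bound.

```python
import math


def hosts_needed(workloads, host_cores, host_ram_gb):
    """Lower-bound host count for a workload mix.

    workloads: iterable of (concurrent_count, cores_each, ram_gb_each) tuples.
    Returns the larger of the CPU-driven and RAM-driven host counts (at least 1).
    """
    workloads = list(workloads)
    total_cores = sum(count * cores for count, cores, _ in workloads)
    total_ram_gb = sum(count * ram for count, _, ram in workloads)
    by_cpu = math.ceil(total_cores / host_cores)
    by_ram = math.ceil(total_ram_gb / host_ram_gb)
    return max(by_cpu, by_ram, 1)
```

As an example, 40 light sessions at 1 core and 2 GB each plus 5 heavy sessions at 8 cores and 64 GB each require 80 cores and 400 GB in total. On hosts with 32 cores and 128 GB of RAM, hosts_needed([(40, 1, 2), (5, 8, 64)], 32, 128) returns 4, with memory rather than CPU determining the count.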