Recommended Hardware Configuration

Cloudera Data Science Workbench hosts are added to your CDH cluster as gateway hosts.

The recommended minimum hardware configuration for Cloudera Data Science Workbench gateway hosts is:

Resource Type               Master                  Workers
Supported architecture      Intel 64-bit, AMD 64
CPU                         16+ CPU (vCPU) cores    16+ CPU (vCPU) cores
RAM                         32+ GB                  32+ GB
Disk Space
  Root Volume               200+ GB                 200+ GB
  Application Block Device  1 TB                    -
  Docker Block Device       1 TB                    1 TB

Notes

Root Volume: If you are going to partition the root volume, make sure you allocate at least 20 GB to / so that the installer can proceed without running out of space.

Application Block Device: The Application Block Device is only required on the Master, where it is mounted to /var/lib/cdsw.

You will be asked to create a /var/lib/cdsw directory on all the Worker hosts during the installation process. However, it does not need to be mounted to a block device. It is only used to store client configuration for HDP cluster services on Workers.

Docker Block Device: A raw device is expected on all Master and Worker hosts. Docker will create the necessary volumes on these hosts.
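The CPU, RAM, and root volume minimums listed above can be checked on a candidate host before installation. The following is a minimal sketch, not a Cloudera tool: the function names are hypothetical, it assumes a Linux host, and it measures only the filesystem mounted at /, which will report less than the full root volume if that volume has been partitioned. Reported sizes are also slightly below nominal hardware sizes, so treat a near miss as a prompt to look more closely rather than as a hard failure.

```python
import os
import shutil

# Minimums taken from the recommended hardware configuration above.
MIN_CPU_CORES = 16
MIN_RAM_GB = 32
MIN_ROOT_GB = 200


def check_minimums(cpu_cores, ram_gb, root_gb):
    """Return a list of shortfalls; an empty list means all checks passed."""
    problems = []
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU: {cpu_cores} cores, need {MIN_CPU_CORES}+")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.0f} GB, need {MIN_RAM_GB}+ GB")
    if root_gb < MIN_ROOT_GB:
        problems.append(f"Root volume: {root_gb:.0f} GB, need {MIN_ROOT_GB}+ GB")
    return problems


def read_host_resources():
    """Read core count, physical RAM, and size of / (assumes Linux)."""
    cpu_cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    root_bytes = shutil.disk_usage("/").total
    # Decimal gigabytes (10**9 bytes) are used for both values.
    return cpu_cores, ram_bytes / 1e9, root_bytes / 1e9


if __name__ == "__main__":
    shortfalls = check_minimums(*read_host_resources())
    if shortfalls:
        print("Host is below the recommended minimums:")
        for line in shortfalls:
            print("  " + line)
    else:
        print("Host meets the recommended CPU, RAM, and root volume minimums.")
```

The Application and Docker block devices are not checked here, because the device names are site-specific.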

Scaling Guidelines

Hosts can be added to and removed from a Cloudera Data Science Workbench deployment without interrupting any jobs already scheduled on existing hosts, so it is straightforward to increase capacity based on observed usage. Cloudera recommends that you allocate at least 1 CPU core and 2 GB of RAM per concurrent session or job. CPU can burst above a 1 CPU core share when spare resources are available, so a 1 CPU core allocation is often adequate for light workloads. Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.
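As a rough illustration of the per-session guideline, the sketch below estimates how many concurrent sessions or jobs a single host can carry at 1 CPU core and 2 GB of RAM each. The function name is hypothetical, and the calculation deliberately ignores resources reserved for the operating system and for Cloudera Data Science Workbench's own services, so real capacity will be somewhat lower.

```python
def max_concurrent_sessions(host_cores, host_ram_gb,
                            cores_per_session=1, ram_gb_per_session=2):
    """Upper bound on concurrent sessions for one host.

    The host is limited by whichever resource runs out first, so the
    result is the smaller of the CPU-bound and RAM-bound counts.
    """
    by_cpu = host_cores // cores_per_session
    by_ram = host_ram_gb // ram_gb_per_session
    return min(by_cpu, by_ram)


# A host at the recommended minimum (16 cores, 32 GB RAM):
print(max_concurrent_sessions(16, 32))   # 16

# A larger host (48 cores, 256 GB RAM) is CPU-bound at the default allocation:
print(max_concurrent_sessions(48, 256))  # 48
```

Because CPU can burst when spare capacity exists, the CPU-bound figure is a softer limit than the RAM-bound one.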

As a general guideline, Cloudera recommends hosts with between 60 GB and 256 GB of RAM and between 16 and 48 cores. This provides a useful range of options for end users. SSDs are strongly recommended for application data storage, because standard HDDs can result in poor application performance.

For some data science and machine learning applications, users can collect a significant amount of data in memory within a single R or Python process, or use a significant amount of CPU resources that cannot be easily distributed into the CDH cluster. If individual users frequently run larger workloads or run workloads in parallel over long durations, increase the total resources accordingly. Understanding your users' concurrent workload requirements or observing actual usage is the best approach to scaling Cloudera Data Science Workbench.
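To make the effect of larger workloads concrete, the sketch below totals the CPU and RAM demand of a mix of workload classes and converts it into a host count. All names and figures are hypothetical examples, not measured values. It also assumes resources pool perfectly across hosts, whereas in practice each session must fit on a single host, so the result is a lower bound.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkloadClass:
    name: str
    concurrent: int      # sessions or jobs running at the same time
    cores_each: float    # CPU cores per session
    ram_gb_each: float   # RAM in GB per session


def hosts_needed(workloads, host_cores, host_ram_gb):
    """Lower bound on hosts required to satisfy total CPU and RAM demand."""
    total_cores = sum(w.concurrent * w.cores_each for w in workloads)
    total_ram = sum(w.concurrent * w.ram_gb_each for w in workloads)
    # Whichever resource requires more hosts determines the answer.
    return max(math.ceil(total_cores / host_cores),
               math.ceil(total_ram / host_ram_gb))


# Hypothetical mix: many light sessions plus a few large in-memory jobs.
mix = [
    WorkloadClass("light notebooks", concurrent=20, cores_each=1, ram_gb_each=2),
    WorkloadClass("large in-memory jobs", concurrent=4, cores_each=8, ram_gb_each=64),
]

# Demand is 52 cores and 296 GB; on 32-core, 128 GB hosts RAM is the constraint.
print(hosts_needed(mix, host_cores=32, host_ram_gb=128))  # 3
```

In this example the four large jobs account for most of the RAM demand, which shows why a small number of heavy users can drive sizing more than many light ones.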