Recommended Hardware Configuration
Cloudera Data Science Workbench hosts are added to your CDH cluster as gateway hosts.
The recommended minimum hardware configuration for Cloudera Data Science Workbench gateway hosts is:
Resource Type | Master | Workers | Notes |
---|---|---|---|
Supported architecture | Intel 64-bit, AMD 64-bit | Intel 64-bit, AMD 64-bit | |
CPU | 16+ CPU (vCPU) cores | 16+ CPU (vCPU) cores | |
RAM | 32+ GB | 32+ GB | See the Scaling Guidelines below for more information. |
**Disk Space** | | | |
Root Volume | 200+ GB | 200+ GB | If you are going to partition the root volume, make sure you allocate at least 200 GB to the root partition (`/`). |
Application Block Device | 1 TB | Not required | The Application Block Device is only required on the Master host, where it is mounted to `/var/lib/cdsw`. During installation you will be asked to create a `/var/lib/cdsw` directory on the Worker hosts as well, but it does not need to be mounted to a dedicated block device. |
Docker Block Device | 1 TB | 1 TB | A raw block device is expected on all Master and Worker hosts; Docker will create the necessary volumes on these hosts. |
Parcel directory | 50 GB | 50 GB | If you use Cloudera Manager parcels to distribute Cloudera Data Science Workbench, every host managed by Cloudera Manager must have at least 50 GB of free space in the parcel directory, which defaults to `/opt/cloudera/parcels/`. |
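Before installing, you can sanity-check a candidate gateway host against these minimums. The following is a minimal illustrative sketch, not part of Cloudera Data Science Workbench: it assumes a Linux host with Python 3, and the script name and threshold constants are ours, copied from the table above.

```python
# check_gateway_host.py -- sanity-check a candidate gateway host against
# the minimums in the table above. Illustrative sketch only; assumes a
# Linux host and Python 3.
import os
import shutil

MIN_CORES = 16            # 16+ CPU (vCPU) cores
MIN_RAM_GB = 32           # 32+ GB RAM
MIN_ROOT_GB = 200         # 200+ GB root volume
MIN_PARCEL_FREE_GB = 50   # free space in the parcel directory

def ram_gb():
    # Total physical memory = page count * page size (Linux-specific).
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30

def checks():
    yield "CPU cores", os.cpu_count(), MIN_CORES
    yield "RAM (GB)", ram_gb(), MIN_RAM_GB
    yield "/ total (GB)", shutil.disk_usage("/").total / 2**30, MIN_ROOT_GB
    # Default Cloudera Manager parcel directory; adjust if customized.
    parcels = "/opt/cloudera/parcels"
    if os.path.isdir(parcels):
        free = shutil.disk_usage(parcels).free / 2**30
        yield "parcel dir free (GB)", free, MIN_PARCEL_FREE_GB

if __name__ == "__main__":
    for name, actual, minimum in checks():
        status = "OK " if actual >= minimum else "LOW"
        print(f"{status} {name}: {actual:.0f} (minimum {minimum})")
```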
Scaling Guidelines
New hosts can be added to and removed from a Cloudera Data Science Workbench deployment without interrupting jobs already scheduled on existing hosts, so it is straightforward to increase capacity based on observed usage. Cloudera recommends allocating at least 1 CPU core and 2 GB of RAM per concurrent session or job. CPU can burst above a 1-core share when spare resources are available, so a 1-core allocation is often adequate for light workloads; allocating less than 2 GB of RAM, however, can lead to out-of-memory errors for many applications.
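To make the arithmetic concrete: under the 1 core / 2 GB per-session baseline, a host's concurrent-session capacity is bounded by whichever resource runs out first. The sketch below is illustrative only; the per-session figures come from the guideline above, while the overhead reserved for the operating system and Cloudera Data Science Workbench services is an assumption, not a product number.

```python
# Rough concurrent-session capacity for one gateway host, using the
# 1 CPU core / 2 GB RAM per-session baseline above. The reserved
# overhead figures are illustrative assumptions.
CORES_PER_SESSION = 1
RAM_GB_PER_SESSION = 2

def sessions_per_host(cores, ram_gb, reserved_cores=2, reserved_ram_gb=8):
    by_cpu = (cores - reserved_cores) // CORES_PER_SESSION
    by_ram = (ram_gb - reserved_ram_gb) // RAM_GB_PER_SESSION
    return min(by_cpu, by_ram)  # the scarcer resource is the bound

# A minimum-spec host (16 cores, 32 GB) supports roughly:
print(sessions_per_host(16, 32))   # -> 12, bound by RAM
# A larger host (48 cores, 256 GB):
print(sessions_per_host(48, 256))  # -> 46, bound by CPU
```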
As a general guideline, Cloudera recommends hosts with between 60 GB and 256 GB of RAM and between 16 and 48 cores; this provides a useful range of options for end users. Note that SSDs are strongly recommended for application data storage, as standard HDDs can result in poor application performance.
For some data science and machine learning applications, users can accumulate a significant amount of data in memory within a single R or Python process, or consume CPU resources that cannot easily be distributed to the CDH cluster. If individual users frequently run larger workloads or run workloads in parallel over long durations, increase the total resources accordingly. Understanding your users' concurrent workload requirements, or observing actual usage, is the best approach to scaling Cloudera Data Science Workbench.
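Once you have an estimate of peak concurrent workload, translating it into a host count is simple arithmetic. The helper below and all of its example inputs are hypothetical; it simply sizes for whichever of CPU or RAM demands more hosts.

```python
import math

# Hypothetical sizing helper: given peak concurrent sessions and
# per-session resource needs, estimate how many gateway hosts of a
# given spec are required. All inputs here are example figures.
def hosts_needed(peak_sessions, cores_per_session, ram_gb_per_session,
                 host_cores, host_ram_gb):
    need_cores = peak_sessions * cores_per_session
    need_ram = peak_sessions * ram_gb_per_session
    return max(math.ceil(need_cores / host_cores),
               math.ceil(need_ram / host_ram_gb))

# E.g. 40 concurrent sessions of heavier 2-core / 8 GB workloads,
# on 16-core / 64 GB hosts:
print(hosts_needed(40, 2, 8, host_cores=16, host_ram_gb=64))  # -> 5
```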