Cluster sizing best practices

Review best practices for sizing the physical hardware of a CDP Private Cloud Base cluster.

Each worker node typically has a number of physical disks dedicated to raw storage for Hadoop. This number is used to calculate the total available storage for each cluster. The calculations below assume that 10% of the disk space is allocated for YARN temporary storage. As a general guideline, Cloudera recommends allocating 10-25% of the raw disk space for temporary storage. This setting can be changed in Cloudera Manager and should be adjusted after analyzing production workloads. For example, MapReduce jobs that send little data to the reducers allow this percentage to be adjusted down considerably.

The following tables contain example calculations for a cluster of 17 worker nodes, each with twelve 3 TB drives available for use by Hadoop. They outline the available Hadoop storage under the default replication factor and under two erasure coding policies:

Table 1. Default replication factor

  Raw Storage                                         612 TB
  HDFS Storage (configurable)                         550.8 TB
  HDFS Unique Storage (default replication factor)    183.6 TB
  MapReduce Intermediate Storage (configurable)       61.2 TB

Table 2. Erasure coding RS-6-3

  Raw Storage                                         612 TB
  HDFS Storage (configurable)                         550.8 TB
  HDFS Unique Storage (EC RS-6-3, 1.5x overhead)      367.2 TB
  MapReduce Intermediate Storage (configurable)       61.2 TB

Table 3. Erasure coding RS-10-4

  Raw Storage                                         612 TB
  HDFS Storage (configurable)                         550.8 TB
  HDFS Unique Storage (EC RS-10-4, 1.4x overhead)     393.4 TB
  MapReduce Intermediate Storage (configurable)       61.2 TB
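
The figures in the tables above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are not part of any Cloudera tooling, and the overhead factors (3x for default replication, 1.5x for RS-6-3, 1.4x for RS-10-4) simply restate the values quoted in the tables.

    # Minimal sketch of the storage math behind the tables above.
    # Hypothetical helper, not a Cloudera tool.
    def hadoop_storage_tb(nodes, disks_per_node, disk_tb,
                          temp_fraction=0.10, space_overhead=3.0):
        """Return (raw, hdfs, unique, intermediate) capacity in TB.

        space_overhead is 3.0 for the default 3x replication,
        1.5 for erasure coding RS-6-3, and 1.4 for RS-10-4.
        """
        raw = nodes * disks_per_node * disk_tb
        intermediate = raw * temp_fraction      # YARN/MapReduce temporary space
        hdfs = raw - intermediate               # space left for HDFS block storage
        unique = hdfs / space_overhead          # logical (pre-replication/EC) data
        return raw, hdfs, unique, intermediate

    # Reproduce the three tables: 17 workers, twelve 3 TB drives each.
    for label, overhead in (("3x replication", 3.0),
                            ("EC RS-6-3", 1.5),
                            ("EC RS-10-4", 1.4)):
        raw, hdfs, unique, temp = hadoop_storage_tb(17, 12, 3, space_overhead=overhead)
        print(f"{label}: raw={raw:.1f} TB, HDFS={hdfs:.1f} TB, "
              f"unique={unique:.1f} TB, temp={temp:.1f} TB")

Changing temp_fraction within the recommended 10-25% range shows how much usable HDFS capacity is given up to temporary storage for a given workload profile.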

While Cloudera Manager provides tools such as Static Resource Pools, which use Linux cgroups to let multiple components share hardware, it can be beneficial in high-volume production clusters to allocate dedicated hosts for roles such as Solr and Kafka.