Hardware Requirements

To assess the hardware and resource allocations for your cluster, you need to analyze the types of workloads you want to run on your cluster, and the CDH components you will be using to run these workloads. You should also consider the size of data to be stored and processed, the frequency of workloads, the number of concurrent jobs that need to run, and the speed required for your applications.

As you create the architecture of your cluster, you will need to allocate Cloudera Manager and CDH roles among the hosts in the cluster to maximize your use of resources. Cloudera provides some guidelines about how to assign roles to cluster hosts. See Recommended Cluster Hosts and Role Distribution. When multiple roles are assigned to hosts, add together the total resource requirements (memory, CPUs, disk) for each role on a host to determine the required hardware.
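
For example, a back-of-the-envelope calculation along these lines can help when planning a host; the role names and figures shown are placeholders, not Cloudera-published sizing values:

    # Sketch: sum per-role requirements (memory_gb, cores, disks) for one host.
    # The figures below are placeholders; take real values from the tables in
    # this topic for the roles you actually assign.
    roles = {
        "DataNode":    (4, 4, 12),
        "NodeManager": (1, 8, 8),
    }

    mem_gb = sum(m for m, _, _ in roles.values())
    cores = sum(c for _, c, _ in roles.values())
    disks = sum(d for _, _, d in roles.values())
    print(f"Host needs at least {mem_gb} GB RAM, {cores} cores, {disks} disks")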

Cloudera Manager

Cloudera Manager Server Storage Requirements

Component Storage Notes
Partition hosting /usr 1 GB  
Partition hosting /var 5 GB to 1 TB Scales according to the number of nodes managed. See the table below.
Partition hosting /opt 15 GB minimum Usage grows as the number of parcels downloaded increases.
Cloudera Manager Database Server 5 GB If the Cloudera Manager Database is shared with the Service Monitor and Host Monitor, more storage space is required to meet the requirements for those components.

Host-Based Cloudera Manager Server Requirements

Number of Cluster Hosts Database Host Configuration Heap Size Logical Processors Cloudera Manager Server /var Directory
Very small (≤10) Shared 2 GB 4 5 GB
Small (≤20) Shared 4 GB 6 20 GB minimum
Medium (≤200) Dedicated 8 GB 6 200 GB minimum
Large (≤500) Dedicated 10 GB 8 500 GB minimum
Extra Large (>500) Dedicated 16 GB 16 1 TB minimum

Service Monitor Requirements

The Service Monitor can be the most resource-intensive Cloudera Manager service and needs special attention. Service Monitor requirements are based on the number of monitored entities.

To see the number of monitored entities, perform the following steps:

  1. Open the Cloudera Manager Admin Console and click Clusters > Cloudera Management Service.
  2. Find the Cloudera Management Service Monitored Entities chart. If the chart does not exist, add it from the Chart Library.

For more information about Cloudera Manager entities, see Cloudera Manager Entity Types.

Tuning

Java heap size values (see the tables below) are rough estimates, and some tuning might be necessary. Starting with Cloudera Manager 6.3, Cloudera recommends the G1 garbage collector (G1GC) for the Service Monitor. G1GC eliminates long JVM pauses, but uses slightly more CPU and RAM. It is the default for new installations.

Using G1GC
  1. Open the Service Monitor configuration page and find Java Configuration Options for Service Monitor.
  2. Add -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:-UseParNewGC to the text box if it is not already present.
Verifying your tuned settings
  1. Go to the Service Monitor.
  2. Check the Garbage Collection Time chart. It should show values lower than 3s.
  3. Check the JVM Heap Memory Usage chart. It should show a healthy zig-zag shaped memory usage pattern.

The log should not contain "OutOfMemoryError" messages, or "JVM Pause" messages reporting pauses longer than 3 seconds. See the "Service Monitor Log Directory" configuration property for the log file location. The default location is /var/log/cloudera-scm-firehose.
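
A quick log scan can surface these messages; this is only a sketch, and both the "JVM Pause" message format and the log file name are assumptions to adapt to your environment:

    import re

    # Sketch: flag OutOfMemoryError lines and JVM pauses longer than 3 seconds.
    # The pause-message format and the file name are assumptions; adjust them
    # to match your actual Service Monitor logs.
    PAUSE_RE = re.compile(r"JVM Pause.*?(\d+(?:\.\d+)?)\s*(ms|s)\b", re.IGNORECASE)

    def problem_lines(path, threshold_s=3.0):
        with open(path) as f:
            for line in f:
                if "OutOfMemoryError" in line:
                    yield line.rstrip()
                    continue
                m = PAUSE_RE.search(line)
                if m:
                    value, unit = float(m.group(1)), m.group(2).lower()
                    if (value / 1000 if unit == "ms" else value) > threshold_s:
                        yield line.rstrip()

    for hit in problem_lines("/var/log/cloudera-scm-firehose/servicemonitor.log"):
        print(hit)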

Clusters with HDFS, YARN, or Impala

Use the recommendations in this table for clusters where the only services with worker roles are HDFS, YARN, or Impala.

Number of Monitored Entities Number of Hosts Required Java Heap Size Recommended Non-Java Heap Size
0-2,000 0-100 1 GB 6 GB
2,000-4,000 100-200 1.5 GB 6 GB
4,000-8,000 200-400 1.5 GB 12 GB
8,000-16,000 400-800 2.5 GB 12 GB
16,000-20,000 800-1,000 3.5 GB 12 GB

Clusters with HBase, Solr, Kafka, or Kudu

Use these recommendations when services such as HBase, Solr, Kafka, or Kudu are deployed in the cluster. These services typically have larger quantities of monitored entities.

Number of Monitored Entities Number of Hosts Required Java Heap Size Recommended Non-Java Heap Size
0-30,000 0-100 2 GB 12 GB
30,000-60,000 100-200 3 GB 12 GB
60,000-120,000 200-400 3.5 GB 12 GB
120,000-240,000 400-800 8 GB 20 GB
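
The two tables above can be encoded as a simple lookup; this sketch only restates the published tiers:

    # Sketch: Service Monitor heap tiers as (max_entities, java_gb, non_java_gb).
    HDFS_YARN_IMPALA = [(2_000, 1, 6), (4_000, 1.5, 6), (8_000, 1.5, 12),
                        (16_000, 2.5, 12), (20_000, 3.5, 12)]
    HBASE_SOLR_KAFKA_KUDU = [(30_000, 2, 12), (60_000, 3, 12),
                             (120_000, 3.5, 12), (240_000, 8, 20)]

    def service_monitor_heap(entities, tiers):
        for max_entities, java_gb, non_java_gb in tiers:
            if entities <= max_entities:
                return java_gb, non_java_gb
        raise ValueError("beyond published guidance; size empirically")

    print(service_monitor_heap(5_000, HDFS_YARN_IMPALA))  # -> (1.5, 12)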

Host Monitor

The requirements for the Host Monitor are based on the number of monitored entities.

To see the number of monitored entities, perform the following steps:

  1. Open the Cloudera Manager Admin Console and click Clusters > Cloudera Management Service.
  2. Find the Cloudera Management Service Monitored Entities chart. If the chart does not exist, add it from the Chart Library.

For information about tuning, see Tuning.

For more information about Cloudera Manager entities, see Cloudera Manager Entity Types.

Number of Hosts Number of Monitored Entities Heap Size Non-Java Heap Size
0-200 <6k 1 GB 2 GB
200-800 6k-24k 2 GB 6 GB
800-1000 24k-30k 3 GB 6 GB

Ensure that you have at least 25 GB of disk space available for the Host Monitor, Service Monitor, Reports Manager, and Events Server databases.

For more information, refer to Host Monitor and Service Monitor Memory Configuration.

Reports Manager

The Reports Manager fetches the fsimage from the NameNode at regular intervals. It reads the fsimage and creates a Lucene index for it. To improve the indexing performance, Cloudera recommends provisioning a host as powerful as possible and dedicating an SSD disk to the Reports Manager.
Reports Manager
Component Java Heap CPU Disk
Reports Manager 3-4 times the size of the fsimage.
  • Minimum: 8 cores
  • Recommended: 16 cores (32 logical cores with hyperthreading enabled)
1 dedicated disk that is at least 20 times the size of the fsimage. Cloudera strongly recommends using SSD disks.
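
Because both figures key off the fsimage size, a simple estimate is possible; the 12 GB fsimage below is only an example input:

    # Sketch: size Reports Manager from the current fsimage size.
    def reports_manager_sizing(fsimage_gb):
        return {
            "java_heap_gb": (3 * fsimage_gb, 4 * fsimage_gb),   # 3-4x fsimage
            "dedicated_ssd_gb": 20 * fsimage_gb,                # at least 20x fsimage
        }

    print(reports_manager_sizing(12))
    # -> {'java_heap_gb': (36, 48), 'dedicated_ssd_gb': 240}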

Agent Hosts

An unpacked parcel requires approximately three times the space of the packed parcel that is stored on the Cloudera Manager Server.
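
For example, given the packed sizes of the parcels you plan to distribute (the sizes below are hypothetical), /opt usage can be estimated as follows:

    # Sketch: estimate unpacked parcel space on agent hosts.
    # An unpacked parcel needs roughly 3x the packed parcel size.
    packed_parcel_gb = [2.0, 1.5]  # hypothetical packed parcel sizes
    unpacked_gb = sum(3 * size for size in packed_parcel_gb)
    print(f"Plan for about {unpacked_gb:.1f} GB of unpacked parcels under /opt")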

Component Storage Notes
Partition hosting /opt 15 GB minimum Usage grows as new parcels are downloaded to cluster hosts.
/var/log 2 GB per role Each role running on the host will need at least 2 GB of disk space.

Event Server

The following table lists the minimum requirements for the Event Server:

CPU RAM Storage
1 core 256 MB
  • 5 GB for the Event Database
  • 20 GB for the Event Server Index Directory. The location of this directory is set by the Event Server Index Directory Event Server configuration property.

Alert Publisher

The following table lists the minimum requirements for the Alert Publisher:

CPU RAM Storage
1 core 1 GB Minimum of 1 disk for log files

Cloudera Navigator

The sizing of Navigator components varies heavily depending on the size of the cluster and the number of audit events generated. Refer to Minimum Recommended Memory and Disk Space for more information.

Cloudera Navigator
Component Java Heap / Memory CPU Disk
Navigator Audit Server Minimum: 2-3 GB of Java Heap

Configure this value using the Java Heap Size of Auditing Server in Bytes configuration property.

Minimum: 1 core The database used by the Navigator Audit Server must be able to accommodate hundreds of gigabytes (or tens of millions of rows per day). The database size may reach a terabyte.

Ideally, the database should not be shared with other services, because the audit insertion rate can overwhelm the database server, making other services that use the same database less responsive.

See Storage Space Planning for Cloudera Navigator.

Navigator Metadata Server
  • Minimum: 10 GB of Java Heap
  • Recommended: 20 GB Java Heap

Add 20 GB for the operating system buffer cache; however, memory requirements can be much higher on a busy cluster and could require provisioning a dedicated host. The Navigator Metadata Server logs include sizing estimates based on the number of objects being tracked.

Configure this value using the Java Heap Size of Navigator Metadata Server in Bytes configuration property.

Minimum: 1 core
  • Minimum: 50 GB
  • Recommended: 100-200 GB

Cloudera Data Science Workbench

Hardware Component Requirement Notes
CPU 16+ CPU (vCPU) cores Allocate at least 1 CPU core per session. 1 CPU core is often adequate for light workloads.
Memory 32 GB RAM
  • As a general guideline, Cloudera recommends nodes with RAM between 60 GB and 256 GB.
  • Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.

Disk
  • Root Volume: 100 GB
  • Application Block Device or Mount Point (Master Host Only): 1 TB
  • Docker Image Block Device: 1 TB

SSDs are strongly recommended for application data storage.

For more information on scaling guidelines and storage requirements for cloud providers such as AWS and Azure, see Requirements and Supported Platforms in the Cloudera Data Science Workbench documentation.

CDH

Accumulo

Running Apache Accumulo on top of a CDH 6.0.0 cluster is not currently supported. If you try to upgrade to CDH 6.0.0, you will be asked to remove the Accumulo service from your cluster. Running Accumulo on top of CDH 6 will be supported in a future release.

Flume

Component Java Heap CPU Disk
Flume
  • Minimum: 1 GB
  • Maximum: 4 GB
  • The Java heap size should be greater than the maximum channel capacity

Set this value using the Java Heap Size of Agent in Bytes Flume configuration property.

See Flume Memory Consumption

Calculate the number of cores using the following formula:

(number of sources + number of sinks) / 2

Multiple disks are recommended for file channels, in either a JBOD setup or RAID 10 (preferred because of its increased reliability).
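
For example, the core formula above can be applied directly; the agent below with 4 sources and 6 sinks is hypothetical:

    import math

    # Sketch: Flume cores = (number of sources + number of sinks) / 2,
    # rounded up to a whole core.
    def flume_cores(num_sources, num_sinks):
        return math.ceil((num_sources + num_sinks) / 2)

    print(flume_cores(4, 6))  # -> 5 cores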

HDFS

HDFS
Component Memory CPU Disk
JournalNode 1 GB (default)

Set this value using the Java Heap Size of JournalNode in Bytes HDFS configuration property.

1 core minimum 1 dedicated disk
NameNode
  • Minimum: 1 GB (for proof-of-concept deployments)
  • Add an additional 1 GB for each additional 1,000,000 blocks

    Snapshots and encryption can increase the required heap memory.

See Sizing NameNode Heap Memory

Set this value using the Java Heap Size of NameNode in Bytes HDFS configuration property.

Minimum of 4 dedicated cores; more may be required for larger clusters
  • Minimum of 2 dedicated disks for metadata
  • 1 dedicated disk for log files (This disk may be shared with the operating system.)
  • Maximum disks: 4
DataNode

Minimum: 4 GB

Increase the memory for higher replica counts or a higher number of blocks per DataNode. When increasing the memory, Cloudera recommends an additional 1 GB of memory for every 1 million replicas above 4 million on the DataNodes. For example, 5 million replicas require 5 GB of memory.

Set this value using the Java Heap Size of DataNode in Bytes HDFS configuration property.

Minimum: 4 cores. Add more cores for highly active clusters.

Minimum: 4

Maximum: 24

The maximum acceptable disk size varies depending on the average block size. A DataNode's scalability limits are mostly a function of the number of replicas per DataNode, not the overall number of bytes stored. That said, ultra-dense DataNodes affect recovery times in the event of machine or rack failure. Cloudera does not support more than 100 TB per DataNode. For example, you could use 12 x 8 TB spindles or 24 x 4 TB spindles. Cloudera does not support drives larger than 8 TB.
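
The NameNode and DataNode heap guidance above reduces to simple arithmetic; this sketch restates it:

    # Sketch: HDFS heap estimates from the guidance above.
    def namenode_heap_gb(blocks):
        # 1 GB minimum, adding 1 GB per additional 1,000,000 blocks; snapshots
        # and encryption can push this higher (see Sizing NameNode Heap Memory).
        return max(1, -(-blocks // 1_000_000))  # ceiling division

    def datanode_heap_gb(replicas):
        # 4 GB minimum, plus 1 GB per 1,000,000 replicas above 4,000,000.
        return 4 + -(-max(0, replicas - 4_000_000) // 1_000_000)

    print(namenode_heap_gb(3_500_000))  # -> 4 (GB)
    print(datanode_heap_gb(5_000_000))  # -> 5 (GB), matching the example above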

HBase

Component Java Heap CPU Disk
Master
  • 100-10,000 regions: 4 GB
  • 10,000 or more regions with 200 or more Region Servers: 8 GB
  • 10,000 or more regions with 300 or more Region Servers: 12 GB

Set this value using the Java Heap Size of HBase Master in Bytes HBase configuration property.

Minimum 4 dedicated cores. You can add more cores for larger clusters, when using replication, or for bulk loads.
1 disk for local logs, which can be shared with the operating system and/or other Hadoop logs
Region Server

Set this value using the Java Heap Size of HBase RegionServer in Bytes HBase configuration property.

Minimum: 4 dedicated cores
  • 4 or more spindles for each HDFS DataNode
  • 1 disk for local logs (this disk can be shared with the operating system and/or other Hadoop logs)
Thrift Server 1 GB - 4 GB

Set this value using the Java Heap Size of HBase Thrift Server in Bytes HBase configuration property.

Minimum 2 dedicated cores.
1 disk for local logs, which can be shared with the operating system and other Hadoop logs.

Hive

Component Java Heap CPU Disk
HiveServer2 Single Connection 4 GB Minimum 4 dedicated cores

Minimum 1 disk

This disk is required for the following:

  • HiveServer2 log files
  • stdout and stderr output files
  • Configuration files
  • Operation logs stored in the operation_logs_dir directory, which is configurable
  • Any temporary files that might be created by local map tasks under the /tmp directory
2-10 connections 4-6 GB
11-20 connections 6-12 GB
21-40 connections 12-16 GB
41-80 connections 16-24 GB

Cloudera recommends splitting HiveServer2 into multiple instances and load balancing them once you start allocating more than 16 GB to HiveServer2. The objective is to size the heap to reduce the impact of Java garbage collection on active processing by the service. A sizing sketch of these connection tiers follows the table.

Set this value using the Java Heap Size of HiveServer2 in Bytes Hive configuration property.

Hive Metastore Single Connection 4 GB Minimum 4 dedicated cores

Minimum 1 disk

This disk is required so that the Hive metastore can store the following artifacts:

  • Logs
  • Configuration files
  • Backend database that is used to store metadata if the database server is also hosted on the same node
2-10 connections 4-10 GB
11-20 connections 10-12 GB
21-40 connections 12-16 GB
41-80 connections 16-24 GB

Set this value using the Java Heap Size of Hive Metastore Server in Bytes Hive configuration property.

Beeline CLI Minimum: 2 GB N/A N/A
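
The connection-count tiers for HiveServer2 and the Hive metastore above can be restated as a lookup; this sketch only encodes the published ranges:

    # Sketch: Hive heap tiers as (max_connections, hs2_gb_range, metastore_gb_range).
    TIERS = [(1, (4, 4), (4, 4)),
             (10, (4, 6), (4, 10)),
             (20, (6, 12), (10, 12)),
             (40, (12, 16), (12, 16)),
             (80, (16, 24), (16, 24))]

    def hive_heaps(connections):
        for max_conn, hs2_gb, metastore_gb in TIERS:
            if connections <= max_conn:
                return {"hiveserver2_gb": hs2_gb, "metastore_gb": metastore_gb}
        raise ValueError("split HiveServer2 into multiple load-balanced instances")

    print(hive_heaps(15))
    # -> {'hiveserver2_gb': (6, 12), 'metastore_gb': (10, 12)}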

Hive on Spark Executor Nodes

Component Memory CPU Disk
Hive-on-Spark
  • Minimum: 16 GB
  • Recommended: 32 GB for larger data sizes

Individual executor heaps should be no larger than 16 GB so machines with more RAM can use multiple executors.

  • Minimum: 4 cores
  • Recommended: 8 cores for larger data sizes
Disk space requirements are driven by scratch space requirements for Spark spill.
For more information on how to reserve YARN cores and memory that will be used by Spark executors, refer to Tuning Apache Hive on Spark in CDH.
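
One way to apply the 16 GB heap cap is to split a host's RAM across several executors; the 4-cores-per-executor figure below is an assumption for illustration, not a value from this guide:

    # Sketch: divide host RAM into executors with heaps capped at 16 GB.
    # cores_per_executor is an assumed illustrative value.
    def executor_layout(host_ram_gb, host_cores, cores_per_executor=4):
        by_ram = host_ram_gb // 16
        by_cores = host_cores // cores_per_executor
        executors = max(1, min(by_ram, by_cores))
        return executors, min(16, host_ram_gb // executors)

    print(executor_layout(64, 16))  # -> (4, 16): four executors with 16 GB heaps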

HSM KMS

Component Memory CPU Disk
Navigator HSM KMS 16 GB RAM Minimum: 2 GHz 64-bit quad core 40 GB, using moderate to high-performance drives.

Hue

Component Memory CPU Disk
Hue Server
  • Minimum: 4 GB
  • Maximum: 10 GB
  • If the cluster uses the Hue load balancer, add additional memory
Minimum: 1 core to run Django

When Hue is configured for high availability, add additional cores

Minimum: 10 GB for the database, which grows proportionally according to the cluster size and workloads.

When Hue is configured for high availability, additional temp space is required.

For more information about Hue high availability, see How to Add a Hue Load Balancer.

Impala

Sizing requirements for Impala can vary significantly depending on the size and types of workloads using Impala.

Component Native Memory JVM Heap CPU Disk
Impala Daemon Set this value using the Impala Daemon Memory Limit configuration property.
  • Minimum: 32 GB
  • Recommended: 128 GB
Set this value using the Java Heap Size of Impala Daemon in Bytes configuration property for the Coordinator Impala Daemons.
  • Minimum: 4 GB
  • Recommended: 8 GB
  • Minimum: 4
  • Recommended: 16 or more

CPU instruction set: AVX2

  • Minimum: 1 disk
  • Recommended: 8 or more
Catalog Server Set this value using the Java Heap Size of Catalog Server in Bytes configuration property.
  • Minimum: 4 GB
  • Recommended: 8 GB
  • Minimum: 4
  • Recommended: 16 or more

CPU instruction set: AVX2

  • Minimum and Recommended: 1 disk

For the network topology of a multi-rack cluster, a leaf-spine design is recommended for optimal performance.

Kafka

Kafka requires a fairly small amount of resources, especially with some configuration tuning. By default, Kafka can run on as little as 1 core and 1 GB of memory, with storage scaled according to your data retention requirements.

CPU is rarely a bottleneck because Kafka is I/O heavy, but a moderately sized CPU with enough threads is still important to handle concurrent connections and background tasks.

Kafka brokers tend to have a similar hardware profile to HDFS data nodes. How you build them depends on what is important for your Kafka use cases. Use the following guidelines:
To affect performance of these features: Adjust these parameters:
Message Retention Disk size
Client Throughput (Producer & Consumer) Network capacity
Producer throughput Disk I/O
Consumer throughput Memory
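
For the retention-versus-disk trade-off, a common rule of thumb (not a figure from this guide) multiplies write throughput, retention period, and replication factor, then adds headroom:

    # Sketch: rule-of-thumb disk sizing for message retention. The formula and
    # the 60% fill target are common practice, not Cloudera-published figures.
    def kafka_retention_disk_tb(write_mb_per_s, retention_days, replication=3):
        raw_tb = write_mb_per_s * 86_400 * retention_days * replication / 1_000_000
        return raw_tb / 0.6  # keep disks at most ~60% full

    print(f"{kafka_retention_disk_tb(50, 7):.1f} TB")  # 50 MB/s for 7 days, 3x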

A common choice for a Kafka node is as follows:

Component Memory/Java Heap CPU Disk
Broker
  • RAM: 64 GB
  • Recommended Java heap: 4 GB

Set this value using the Java Heap Size of Broker Kafka configuration property.

See the Other Kafka Broker Properties table.

12-24 cores
  • 1 HDD for the operating system
  • 1 HDD for the ZooKeeper dataLogDir
  • 10 or more HDDs, using RAID 10, for Kafka data
MirrorMaker 1 GB heap

Set this value using the Java Heap Size of MirrorMaker Kafka configuration property.

1 core per 3-4 streams No disk space is needed on the MirrorMaker instance. Destination brokers should have sufficient disk space to store the topics being copied over.

Networking requirements: Gigabit Ethernet or 10 Gigabit Ethernet. Avoid clusters that span multiple data centers.

Key Trustee Server

Component Memory CPU Disk
Key Trustee Server 8 GB 1 GHz 64-bit quad core 20 GB, using moderate to high-performance drives

Key Trustee KMS

Component Memory CPU Disk
Key Trustee KMS 16 GB 2 GHz 64-bit quad core 40 GB, using moderate to high-performance drives

Kudu

Component Memory CPU Disk
Tablet Server
  • Minimum: 4 GB
  • Recommended: 10 GB

Additional hardware may be required, depending on the workloads running in the cluster. If you are using Impala, see the Impala sizing guidelines.

Kudu currently requires a CPU that supports the SSSE3 and SSE4.2 instruction sets.

If you run Kudu inside a VM, enable SSE4.2 pass-through so that the VM can use the instruction set.

1 disk for write-ahead log (WAL). Using an SSD drive may improve performance.
Master
  • Minimum: 256 MB
  • Recommended: 1 GB
  1 disk

For more information, see Kudu Server Management.

Oozie

Component Java Heap CPU Disk
Oozie
  • Minimum: 1 GB (this is the default set by Cloudera Manager). This is sufficient for less than 10 simultaneous workflows, without forking.
  • If you notice excessive garbage collection, or out-of-memory errors, increase the heap size to 4 GB for medium-size production clusters or to 8 GB for large-size production clusters.
  • Set this value using the Java Heap Size of Oozie Server in Bytes Oozie configuration property.
No resources required
No resources required

Additional tuning:

For workloads with many coordinators that run complex workflows (a "max concurrency reached!" warning appears in the log, and the oozie admin -queuedump command shows a large queue):
  • Increase the value of the oozie.service.CallableQueueService.callable.concurrency property to 50.
  • Increase the value of the oozie.service.CallableQueueService.threads property to 200.
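
These two overrides could be staged in configuration-management tooling along these lines; the property names and values come from the guidance above, while the dict-to-XML rendering is a hypothetical convenience, not a Cloudera API:

    # Sketch: render the suggested oozie-site.xml overrides. The rendering
    # below is illustrative only.
    OOZIE_OVERRIDES = {
        "oozie.service.CallableQueueService.callable.concurrency": 50,
        "oozie.service.CallableQueueService.threads": 200,
    }

    for name, value in OOZIE_OVERRIDES.items():
        print(f"<property><name>{name}</name><value>{value}</value></property>")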

Do not use a Derby database as a backend database for Oozie.

Search

Component Java Heap CPU Disk
Solr
  • Small workloads, or evaluations: 16 GB
  • Smaller production environments: 32 GB
  • Larger production environments: 96 GB is sufficient for most clusters.

Set this value using the Java Heap Size of Solr Server in Bytes Solr configuration property.

  • Minimum: 4
  • Recommended: 16 for production workloads
No requirement. Solr uses HDFS for storage.
Note the following considerations for determining the optimal amount of heap memory:
  • Size of searchable material: The more searchable material you have, the more memory you need. All things being equal, 10 TB of searchable data requires more memory than 1 TB of searchable data.
  • Content indexed in the searchable material: Indexing all fields in a collection of logs, email messages, or Wikipedia entries requires more memory than indexing only the Date Created field.
  • The level of performance required: If the system must be stable and respond quickly, more memory may help. If slow responses are acceptable, you may be able to use less memory.

For more information, refer to Deployment Planning for Cloudera Search.

Sentry

Component Java Heap CPU Disk
Sentry Server
  • Minimum: 576 MB
  • Recommended: 2.5 GB per million objects in the Hive database (servers, databases, tables, partitions, columns, URIs, and views)

Set this value with the Java Heap Size of Sentry Server in Bytes Sentry configuration property.

Minimum: 4  

For more information about Sentry requirements, see Before You Install Sentry.
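
The per-object heap guidance above translates directly into an estimate:

    # Sketch: Sentry Server heap from the guidance above.
    def sentry_heap_gb(hive_objects_millions):
        # 2.5 GB per million Hive objects, with a 576 MB floor.
        return max(0.576, 2.5 * hive_objects_millions)

    print(sentry_heap_gb(2))  # 2 million objects -> 5.0 GB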

Spark

Component Java Heap CPU Disk
Spark History Server Minimum: 512 MB

Set this value using the Java Heap Size of History Server in Bytes Spark configuration property.

1 core

Minimum 1 disk for log files.

YARN

Component Java Heap CPU Other Recommendations
Job History Server
  • Minimum: 1 GB
  • Increase memory by 1.6 GB for each 100,000 tasks kept in memory. For example:

    5 jobs @ 100,000 mappers + 20,000 reducers = 600,000 total tasks, requiring 9.6 GB of heap.

    See the Other Recommendations column for additional tuning suggestions, and the sizing sketch at the end of this section.

Set this value using the Java Heap Size of JobHistory Server in Bytes YARN configuration property.

Minimum: 1 core
  • Set the mapreduce.jobhistory.jhist.format property to binary (history files will load about 2x-3x faster with this setting). Available in CDH 5.5.0 or higher only.

  • Set the mapreduce.jobhistory.loadedtasks.cache.size property to a total loaded task count. Using the example in the Java Heap column of 600,000 total tasks, you can set it to 700,000 to allow for some safety margin. This should also prevent the JobHistory Server from hanging during garbage collection, because the retained-job limit alone does not limit the number of loaded tasks.

NodeManager Minimum: 1 GB.

Configure additional heap memory for the following conditions:

  • Large number of containers
  • Large shuffle sizes in Spark or MapReduce

Set this value using the Java Heap Size of NodeManager in Bytes YARN configuration property.

  • Minimum: 8-16 cores
  • Recommended: 32-64 cores
Disks:
  • Minimum: 8 disks
  • Recommended: 12 or more disks
Networking:
  • Minimum: Dual 1Gbps or faster
  • Recommended: Single/Dual 10 Gbps or faster
ResourceManager Minimum: 6 GB
Configure additional heap memory for the following conditions:
  • More jobs
  • Larger cluster size
  • Number of retained finished applications (configured with the yarn.resourcemanager.max-completed-applications property)
  • Scheduler configuration

Set this value using the Java Heap Size of ResourceManager in Bytes YARN configuration property.

Minimum: 1 core  
Other Settings
  • Set the ApplicationMaster Memory YARN configuration property to 512 MB.
  • Set the Container Memory Minimum YARN configuration property to 1 GB.
N/A N/A

For more information, see Tuning YARN.
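
As referenced in the JobHistory Server row above, its heap guidance reduces to simple arithmetic:

    # Sketch: JobHistory Server heap from the guidance above.
    def jhs_heap_gb(tasks_in_memory):
        # 1 GB minimum, plus 1.6 GB per 100,000 tasks kept in memory.
        return max(1.0, 1.6 * tasks_in_memory / 100_000)

    tasks = 5 * (100_000 + 20_000)  # the worked example: 600,000 total tasks
    print(jhs_heap_gb(tasks))       # -> 9.6 (GB)
    # mapreduce.jobhistory.loadedtasks.cache.size could then be set to ~700,000.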

ZooKeeper

Component Java Heap CPU Disk
ZooKeeper Server
  • Minimum: 1 GB
  • Increase the heap size when watching 10,000 to 100,000 ephemeral znodes and using 1,000 or more clients.

Set this value using the Java Heap Size of ZooKeeper Server in Bytes ZooKeeper configuration property.

Minimum: 4 cores

ZooKeeper was not designed to be a low-latency service and does not benefit from the use of SSD drives. The ZooKeeper access patterns – append-only writes and sequential reads – were designed with spinning disks in mind. Therefore Cloudera recommends using HDD drives.