Hardware Requirements
To assess the hardware and resource allocations for your cluster, analyze the types of workloads you want to run and the CDH components you will use to run them. Also consider the volume of data to be stored and processed, the frequency of workloads, the number of concurrent jobs that must run, and the response times your applications require.
As you create the architecture of your cluster, you will need to allocate Cloudera Manager and CDH roles among the hosts in the cluster to maximize your use of resources. Cloudera provides some guidelines about how to assign roles to cluster hosts. See Recommended Cluster Hosts and Role Distribution. When multiple roles are assigned to hosts, add together the total resource requirements (memory, CPUs, disk) for each role on a host to determine the required hardware.
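For example, stacking a DataNode, a NodeManager, and an Impala Daemon on one host means the host must provide the sum of all three roles' requirements. A minimal sketch of this addition (the per-role figures below are illustrative placeholders, not Cloudera recommendations):

```python
# Sketch: sum per-role requirements to size a host that stacks multiple roles.
# The figures below are illustrative placeholders, not Cloudera recommendations.
ROLE_REQUIREMENTS = {
    # role: (memory_gb, cores, disk_gb)
    "DataNode": (4, 4, 4000),
    "NodeManager": (1, 1, 100),
    "Impala Daemon": (16, 4, 100),
}

def host_requirements(roles):
    """Return the combined (memory_gb, cores, disk_gb) for a set of roles."""
    mem = sum(ROLE_REQUIREMENTS[r][0] for r in roles)
    cores = sum(ROLE_REQUIREMENTS[r][1] for r in roles)
    disk = sum(ROLE_REQUIREMENTS[r][2] for r in roles)
    return mem, cores, disk

print(host_requirements(["DataNode", "NodeManager", "Impala Daemon"]))
# (21, 9, 4200)
```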
For more information about sizing for a particular component, see the following minimum requirements:
Cloudera Manager
Cloudera Manager Server Storage Requirements
Component | Storage | Notes |
---|---|---|
Partition hosting /usr | 1 GB | |
Partition hosting /var | 5 GB to 1 TB | Scales according to number of nodes managed. See table below. |
Partition hosting /opt | 15 GB minimum | Usage grows as the number of parcels downloaded increases. |
Cloudera Manager Database Server | 5 GB | If the Cloudera Manager Database is shared with the Service Monitor and Host Monitor, more storage space is required to meet the requirements for those components. |
Cloudera Manager Server Requirements by Number of Cluster Hosts
Number of Cluster Hosts | Database Host Configuration | Heap Size | Logical Processors | Cloudera Manager Server /var Directory |
---|---|---|---|---|
Very small (≤10) | Shared | 2 GB | 4 | 5 GB |
Small (≤20) | Shared | 4 GB | 6 | 20 GB minimum |
Medium (≤200) | Dedicated | 8 GB | 6 | 200 GB minimum |
Large (≤500) | Dedicated | 10 GB | 8 | 500 GB minimum |
Extra Large (>500) | Dedicated | 16 GB | 16 | 1 TB minimum |
Service Monitor Requirements
The Service Monitor can be the most resource-intensive Cloudera Manager service and needs special attention. Service Monitor requirements are based on the number of monitored entities.
To see the number of monitored entities, perform the following steps:
- In the Cloudera Manager Admin Console, go to the Cloudera Management Service.
- Find the Cloudera Management Service Monitored Entities chart. If the chart does not exist, add it from the Chart Library.
For more information about Cloudera Manager entities, see Cloudera Manager Entity Types.
Tuning
Java Heap Size values (see the tables below) are rough estimates, and some tuning might be necessary. Starting with Cloudera Manager 6.3, Cloudera recommends the G1 garbage collector (G1GC) for the Service Monitor. G1GC eliminates long JVM pauses at the cost of slightly more CPU and RAM, and it is the default for new installations.
- Open the Service Monitor configuration page and find Java Configuration Options for Service Monitor.
- Add -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:-UseParNewGC to the text box if these flags are not already present.
- Go to the Service Monitor.
- Check the Garbage Collection Time chart. It should show values below 3 seconds.
- Check the JVM Heap Memory Usage chart. It should show a healthy zig-zag shaped memory usage pattern.
The log should not contain OutOfMemoryError messages or "JVM Pause" messages with a duration longer than 3 seconds. See the Service Monitor Log Directory configuration property for the log file location; the default is /var/log/cloudera-scm-firehose.
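As a quick sanity check, a script can scan the log for these messages. The sketch below assumes the default log directory and the common "pause of approximately Nms" JVM-pause wording; the file name is hypothetical, so adjust both to match your deployment:

```python
# Sketch: flag OutOfMemoryError and long JVM pauses in the Service Monitor log.
# The log file name and the pause-message wording are assumptions; check your
# "Service Monitor Log Directory" setting and the actual log format.
import re

LOG = "/var/log/cloudera-scm-firehose/mgmt-SERVICEMONITOR.log.out"  # hypothetical name
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")

with open(LOG) as f:
    for line in f:
        if "OutOfMemoryError" in line:
            print("OOM:", line.rstrip())
        m = PAUSE_RE.search(line)
        if m and int(m.group(1)) > 3000:  # longer than 3 seconds
            print("Long JVM pause:", line.rstrip())
```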
Clusters with HDFS, YARN, or Impala
Use the recommendations in this table for clusters where the only services with worker roles are HDFS, YARN, or Impala.
Number of Monitored Entities | Number of Hosts | Required Java Heap Size | Recommended Non-Java Heap Size |
---|---|---|---|
0-2,000 | 0-100 | 1 GB | 6 GB |
2,000-4,000 | 100-200 | 1.5 GB | 6 GB |
4,000-8,000 | 200-400 | 1.5 GB | 12 GB |
8,000-16,000 | 400-800 | 2.5 GB | 12 GB |
16,000-20,000 | 800-1,000 | 3.5 GB | 12 GB |
Clusters with HBase, Solr, Kafka, or Kudu
Use these recommendations when services such as HBase, Solr, Kafka, or Kudu are deployed in the cluster. These services typically have larger quantities of monitored entities.
Number of Monitored Entities | Number of Hosts | Required Java Heap Size | Recommended Non-Java Heap Size |
---|---|---|---|
0-30,000 | 0-100 | 2 GB | 12 GB |
30,000-60,000 | 100-200 | 3 GB | 12 GB |
60,000-120,000 | 200-400 | 3.5 GB | 12 GB |
120,000-240,000 | 400-800 | 8 GB | 20 GB |
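The two tables reduce to a simple lookup by monitored-entity count. A minimal sketch:

```python
# Sketch: look up the recommended Service Monitor heap sizes from the tables
# above. Each entry is (max_entities, java_heap_gb, non_java_heap_gb).
HDFS_YARN_IMPALA = [
    (2_000, 1.0, 6), (4_000, 1.5, 6), (8_000, 1.5, 12),
    (16_000, 2.5, 12), (20_000, 3.5, 12),
]
HBASE_SOLR_KAFKA_KUDU = [
    (30_000, 2.0, 12), (60_000, 3.0, 12),
    (120_000, 3.5, 12), (240_000, 8.0, 20),
]

def recommended_heap(entities, table):
    for max_entities, java_gb, non_java_gb in table:
        if entities <= max_entities:
            return java_gb, non_java_gb
    raise ValueError("beyond the sizing covered by the tables")

print(recommended_heap(10_000, HDFS_YARN_IMPALA))       # (2.5, 12)
print(recommended_heap(50_000, HBASE_SOLR_KAFKA_KUDU))  # (3.0, 12)
```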
Host Monitor
The requirements for the Host Monitor are based on the number of monitored entities.
To see the number of monitored entities, perform the following steps:
- In the Cloudera Manager Admin Console, go to the Cloudera Management Service.
- Find the Cloudera Management Service Monitored Entities chart. If the chart does not exist, add it from the Chart Library.
For information about tuning, see Tuning.
For more information about Cloudera Manager entities, see Cloudera Manager Entity Types.
Number of Hosts | Number of Monitored Entities | Heap Size | Non-Java Heap Size |
---|---|---|---|
0-200 | <6k | 1 GB | 2 GB |
200-800 | 6k-24k | 2 GB | 6 GB |
800-1000 | 24k-30k | 3 GB | 6 GB |
Ensure that you have at least 25 GB of disk space available for the Host Monitor, Service Monitor, Reports Manager, and Events Server databases.
For more information, refer to Host Monitor and Service Monitor Memory Configuration.
Reports Manager
Component | Java Heap | CPU | Disk |
---|---|---|---|
Reports Manager | 3-4 times the size of the fsimage | | 1 dedicated disk that is at least 20 times the size of the fsimage. Cloudera strongly recommends using SSD disks. |
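Both figures derive directly from the size of the NameNode's fsimage, so the sizing can be computed once that is known. A minimal sketch:

```python
# Sketch: derive Reports Manager sizing from the HDFS fsimage size, per the
# "3-4x heap" and "20x dedicated disk" guidance above.
def reports_manager_sizing(fsimage_gb):
    heap_gb = (3 * fsimage_gb, 4 * fsimage_gb)  # recommended heap range
    disk_gb = 20 * fsimage_gb                   # dedicated (ideally SSD) disk
    return heap_gb, disk_gb

heap, disk = reports_manager_sizing(fsimage_gb=2)
print(heap, disk)  # (6, 8) 40  -> 6-8 GB heap, 40 GB disk
```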
Agent Hosts
An unpacked parcel requires approximately three times the space of the packed parcel that is stored on the Cloudera Manager Server.
Component | Storage | Notes |
---|---|---|
Partition hosting /opt | 15 GB minimum | Usage grows as new parcels are downloaded to cluster hosts. |
/var/log | 2 GB per role | Each role running on the host will need at least 2 GB of disk space. |
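A rough /opt estimate for an agent host follows from the 3x rule above, assuming the packed copy of each parcel is also retained in the host's local parcel cache:

```python
# Sketch: estimate parcel space on an agent host using the ~3x
# unpacked-to-packed ratio. Assumes the packed parcel remains cached locally
# alongside its unpacked form; adjust if your deployment cleans the cache.
def agent_opt_usage_gb(packed_parcel_sizes_gb):
    return sum(size + 3 * size for size in packed_parcel_sizes_gb)

print(agent_opt_usage_gb([2.0, 0.5]))  # 10.0 GB across two parcels
```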
Event Server
The following table lists the minimum requirements for the Event Server:
CPU | RAM | Storage |
---|---|---|
1 core | 256 MB | |
Alert Publisher
The following table lists the minimum requirements for the Alert Publisher:
CPU | RAM | Storage |
---|---|---|
1 core | 1 GB | Minimum of 1 disk for log files |
Cloudera Navigator
The sizing of Navigator components varies heavily depending on the size of the cluster and the number of audit events generated. Refer to Minimum Recommended Memory and Disk Space for more information.
Component | Java Heap / Memory | CPU | Disk |
---|---|---|---|
Navigator Audit Server | Minimum: 2-3 GB of Java heap. Configure this value using the Java Heap Size of Auditing Server in Bytes configuration property. | Minimum: 1 core | The database used by the Navigator Audit Server must be able to accommodate hundreds of gigabytes (tens of millions of rows per day); the total size may reach a terabyte. Ideally, the database should not be shared with other services, because the audit insertion rate can overwhelm the database server and make other services that use the same database less responsive. |
Navigator Metadata Server | Add 20 GB for the operating system buffer cache; memory requirements can be much higher on a busy cluster and could require provisioning a dedicated host. Navigator logs include estimates based on the number of objects it is tracking. Configure this value using the Java Heap Size of Navigator Metadata Server in Bytes configuration property. | Minimum: 1 core | |
Cloudera Data Science Workbench
Hardware Component | Requirement | Notes |
---|---|---|
CPU | 16+ CPU (vCPU) cores | Allocate at least 1 CPU core per session. 1 CPU core is often adequate for light workloads. |
Memory | 32 GB RAM | |
Disk | | SSDs are strongly recommended for application data storage. |
For more information on scaling guidelines and storage requirements for cloud providers such as AWS and Azure, see Requirements and Supported Platforms in the Cloudera Data Science Workbench documentation.
CDH
Accumulo
Running Apache Accumulo on top of a CDH 6.0.0 cluster is not currently supported. If you try to upgrade to CDH 6.0.0, you are asked to remove the Accumulo service from your cluster. Running Accumulo on top of CDH 6 will be supported in a future release.
Flume
Component | Java Heap | CPU | Disk |
---|---|---|---|
Flume | Set this value using the Java Heap Size of Agent in Bytes Flume configuration property. | Calculate the number of cores using the following formula: (number of sources + number of sinks) / 2 | Multiple disks are recommended for file channels, either a JBOD setup or RAID10 (preferred due to increased reliability). |
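The core-count formula from the table, as a one-line helper (rounding up is an assumption; the source gives only the raw formula):

```python
# Sketch: Flume agent core count = (sources + sinks) / 2, rounded up.
import math

def flume_cores(num_sources, num_sinks):
    return math.ceil((num_sources + num_sinks) / 2)

print(flume_cores(num_sources=3, num_sinks=4))  # 4
```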
HDFS
Component | Memory | CPU | Disk |
---|---|---|---|
JournalNode | 1 GB (default). Set this value using the Java Heap Size of JournalNode in Bytes HDFS configuration property. | 1 core minimum | 1 dedicated disk |
NameNode | See Sizing NameNode Heap Memory. Set this value using the Java Heap Size of NameNode in Bytes HDFS configuration property. | Minimum of 4 dedicated cores; more may be required for larger clusters | |
DataNode | Minimum: 4 GB. Increase the memory for higher replica counts or a higher number of blocks per DataNode. Cloudera recommends an additional 1 GB of memory for every 1 million replicas above 4 million; for example, 5 million replicas require 5 GB of memory. Set this value using the Java Heap Size of DataNode in Bytes HDFS configuration property. | Minimum: 4 cores. Add more cores for highly active clusters. | Minimum: 4 disks. Maximum: 24 disks. The maximum acceptable density depends on the average block size; DataNode scalability limits are mostly a function of the number of replicas per DataNode, not the overall number of bytes stored. Very dense DataNodes also lengthen recovery times after a machine or rack failure. Cloudera does not support more than 100 TB per DataNode (for example, 12 x 8 TB or 24 x 4 TB spindles) or drives larger than 8 TB. |
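The DataNode heap guidance reduces to a small formula. A sketch (rounding partial millions up is an assumption):

```python
# Sketch: DataNode heap per the replica guidance above: minimum 4 GB, plus
# 1 GB per million replicas beyond 4 million.
import math

def datanode_heap_gb(replicas):
    extra_millions = max(0, math.ceil((replicas - 4_000_000) / 1_000_000))
    return 4 + extra_millions

print(datanode_heap_gb(5_000_000))  # 5
print(datanode_heap_gb(3_000_000))  # 4
```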
HBase
Component | Java Heap | CPU | Disk |
---|---|---|---|
Master | Set this value using the Java Heap Size of HBase Master in Bytes HBase configuration property. | Minimum: 4 dedicated cores. Add more cores for larger clusters, when using replication, or for bulk loads. | 1 disk for local logs, which can be shared with the operating system and/or other Hadoop logs |
Region Server | Set this value using the Java Heap Size of HBase RegionServer in Bytes HBase configuration property. | Minimum: 4 dedicated cores | |
Thrift Server | 1 GB - 4 GB. Set this value using the Java Heap Size of HBase Thrift Server in Bytes HBase configuration property. | Minimum: 2 dedicated cores | 1 disk for local logs, which can be shared with the operating system and other Hadoop logs |
Hive
Component | Connections | Java Heap | CPU | Disk |
---|---|---|---|---|
HiveServer 2 | Single connection | 4 GB | Minimum 4 dedicated cores | Minimum 1 disk |
 | 2-10 connections | 4-6 GB | | |
 | 11-20 connections | 6-12 GB | | |
 | 21-40 connections | 12-16 GB | | |
 | 41-80 connections | 16-24 GB | | |
Hive Metastore | Single connection | 4 GB | Minimum 4 dedicated cores | Minimum 1 disk |
 | 2-10 connections | 4-10 GB | | |
 | 11-20 connections | 10-12 GB | | |
 | 21-40 connections | 12-16 GB | | |
 | 41-80 connections | 16-24 GB | | |
Beeline CLI | | Minimum: 2 GB | N/A | N/A |
Set the HiveServer2 value using the Java Heap Size of HiveServer2 in Bytes Hive configuration property, and the metastore value using the Java Heap Size of Hive Metastore Server in Bytes Hive configuration property.
Cloudera recommends splitting HiveServer2 into multiple instances and load balancing them once you start allocating more than 16 GB to a single instance. The objective is to keep the heap small enough that Java garbage collection does not disrupt active processing by the service.
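The connection-count tiers above are easy to encode as a lookup. A minimal sketch:

```python
# Sketch: heap ranges (min_gb, max_gb) keyed by concurrent connections, from
# the HiveServer2 and Hive Metastore tables above.
HIVESERVER2_GB = [(1, (4, 4)), (10, (4, 6)), (20, (6, 12)),
                  (40, (12, 16)), (80, (16, 24))]
METASTORE_GB = [(1, (4, 4)), (10, (4, 10)), (20, (10, 12)),
                (40, (12, 16)), (80, (16, 24))]

def heap_range_gb(connections, table):
    for max_conns, heap_range in table:
        if connections <= max_conns:
            return heap_range
    raise ValueError("more than 80 connections: add instances and load balance")

print(heap_range_gb(15, HIVESERVER2_GB))  # (6, 12)
```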
Hive on Spark Executor Nodes
Component | Memory | CPU | Disk |
---|---|---|---|
Hive-on-Spark | Individual executor heaps should be no larger than 16 GB, so machines with more RAM can use multiple executors. | | Disk space requirements are driven by scratch space requirements for Spark spill. |
For more information on how to reserve YARN cores and memory that will be used by Spark executors, refer to Tuning Apache Hive on Spark in CDH. |
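With executor heaps capped at 16 GB, the number of executors per host is roughly the host's usable RAM divided by the executor heap. A sketch, where the 16 GB OS/daemon reserve is an assumption rather than a Cloudera figure:

```python
# Sketch: executors per host given the 16 GB executor-heap cap above.
# The reserved_gb figure for the OS and other daemons is an assumption.
def executors_per_host(host_ram_gb, executor_heap_gb=16, reserved_gb=16):
    return max(1, (host_ram_gb - reserved_gb) // executor_heap_gb)

print(executors_per_host(128))  # 7
```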
HSM KMS
Component | Memory | CPU | Disk |
---|---|---|---|
Navigator HSM KMS | 16 GB RAM | Minimum: 2 GHz 64-bit quad core | 40 GB, using moderate to high-performance drives. |
Hue
Component | Memory | CPU | Disk |
---|---|---|---|
Hue Server | | Minimum: 1 core to run Django. When Hue is configured for high availability, add additional cores. | Minimum: 10 GB for the database, which grows proportionally with cluster size and workloads. When Hue is configured for high availability, additional temp space is required. |
For more information about Hue high availability, see How to Add a Hue Load Balancer.
Impala
Sizing requirements for Impala can vary significantly depending on the size and types of workloads using Impala.
Component | Native Memory | JVM Heap | CPU | Disk |
---|---|---|---|---|
Impala Daemon | Set this value using the Impala Daemon Memory Limit configuration property. | Set this value using the Java Heap Size of Impala Daemon in Bytes configuration property for the coordinator Impala daemons. | CPU instruction set: AVX2 | |
Catalog Server | | Set this value using the Java Heap Size of Catalog Server in Bytes configuration property. | CPU instruction set: AVX2 | |
For the network topology of a multi-rack cluster, Leaf-Spine is recommended for optimal performance.
Kafka
Kafka requires a fairly small amount of resources, especially with some configuration tuning. By default, Kafka can run on as little as 1 core and 1 GB of memory, with storage scaled according to your data retention requirements.
CPU is rarely a bottleneck because Kafka is I/O heavy, but a moderately sized CPU with enough threads is still important for handling concurrent connections and background tasks.
To affect performance of these features: | Adjust these parameters: |
---|---|
Message Retention | Disk size |
Client Throughput (Producer & Consumer) | Network capacity |
Producer throughput | Disk I/O |
Consumer throughput | Memory |
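For the first row (message retention versus disk size), a standard back-of-the-envelope calculation is write rate x retention x replication factor, plus headroom. The formula below is common capacity-planning practice, not something this document prescribes:

```python
# Sketch: per-broker disk for message retention. The 30% headroom factor is
# an assumption; tune it for your compression ratio and growth expectations.
def broker_disk_gb(write_mb_per_sec, retention_hours, replication_factor,
                   num_brokers, headroom=1.3):
    total_gb = (write_mb_per_sec * 3600 * retention_hours / 1024
                * replication_factor * headroom)
    return total_gb / num_brokers

print(round(broker_disk_gb(10, 168, 3, 5)))  # ~4607 GB per broker
```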
A common choice for a Kafka node is as follows:
Component | Memory/Java Heap | CPU | Disk |
---|---|---|---|
Broker | Set this value using the Java Heap Size of Broker Kafka configuration property. | 12-24 cores | |
MirrorMaker | 1 GB heap. Set this value using the Java Heap Size of MirrorMaker Kafka configuration property. | 1 core per 3-4 streams | No disk space is needed on the MirrorMaker instance itself; destination brokers need sufficient disk space to store the topics being copied. |
Networking requirements: Gigabit Ethernet or 10 Gigabit Ethernet. Avoid clusters that span multiple data centers.
Key Trustee Server
Component | Memory | CPU | Disk |
---|---|---|---|
Key Trustee Server | 8 GB | 1 GHz 64-bit quad core | 20 GB, using moderate to high-performance drives |
Key Trustee KMS
Component | Memory | CPU | Disk |
---|---|---|---|
Key Trustee KMS | 16 GB | 2 GHz 64-bit quad core | 40 GB, using moderate to high-performance drives |
Kudu
Component | Memory | CPU | Disk |
---|---|---|---|
Tablet Server | Additional hardware may be required, depending on the workloads running in the cluster. If you are using Impala, see the Impala sizing guidelines. | Kudu currently requires a CPU that supports the SSSE3 and SSE4.2 instruction sets. If you run Kudu inside a VM, enable SSE4.2 pass-through so that SSE4.2 support is available in the VM. | 1 disk for the write-ahead log (WAL). Using an SSD drive may improve performance. |
Master | | | 1 disk |
For more information, see Kudu Server Management.
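On Linux, you can verify the required instruction sets before installing. A minimal sketch that reads /proc/cpuinfo:

```python
# Sketch: check that the host CPU exposes the SSSE3 and SSE4.2 instruction
# sets that Kudu requires (Linux-only; reads /proc/cpuinfo).
def has_kudu_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {"ssse3", "sse4_2"} <= flags
    return False

print(has_kudu_cpu_flags())
```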
Oozie
Component | Java Heap | CPU | Disk |
---|---|---|---|
Oozie | | No resources required | No resources required |
Additional tuning:
- Increase the value of the oozie.service.CallableQueueService.callable.concurrency property to 50.
- Increase the value of the oozie.service.CallableQueueService.threads property to 200.
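If you apply these overrides through an advanced configuration snippet (safety valve), they must be expressed as oozie-site.xml properties. A minimal sketch that renders the snippet (the safety-valve route is one option, not the only one):

```python
# Sketch: render the two tuning overrides above as oozie-site.xml properties,
# e.g. for an Oozie Server advanced configuration snippet (safety valve).
TUNING = {
    "oozie.service.CallableQueueService.callable.concurrency": 50,
    "oozie.service.CallableQueueService.threads": 200,
}

for name, value in TUNING.items():
    print(f"<property><name>{name}</name><value>{value}</value></property>")
```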
Do not use a Derby database as a backend database for Oozie.
Search
Component | Java Heap | CPU | Disk |
---|---|---|---|
Solr | Set this value using the Java Heap Size of Solr Server in Bytes Solr configuration property. | | No requirement; Solr uses HDFS for storage. |
Memory requirements depend on the following factors:
- Size of searchable material: The more searchable material you have, the more memory you need. All things being equal, 10 TB of searchable data requires more memory than 1 TB of searchable data.
- Content indexed in the searchable material: Indexing all fields in a collection of logs, email messages, or Wikipedia entries requires more memory than indexing only the Date Created field.
- The level of performance required: If the system must be stable and respond quickly, more memory may help. If slow responses are acceptable, you may be able to use less memory.
For more information, refer to Deployment Planning for Cloudera Search.
Sentry
Component | Java Heap | CPU | Disk |
---|---|---|---|
Sentry Server | Set this value using the Java Heap Size of Sentry Server in Bytes Sentry configuration property. | Minimum: 4 cores | |
For more information about Sentry requirements, see Before You Install Sentry.
Spark
Component | Java Heap | CPU | Disk |
---|---|---|---|
Spark History Server | Minimum: 512 MB. Set this value using the Java Heap Size of History Server in Bytes Spark configuration property. | 1 core | Minimum 1 disk for log files. |
YARN
Component | Java Heap | CPU | Other Recommendations |
---|---|---|---|
Job History Server | Set this value using the Java Heap Size of JobHistory Server in Bytes YARN configuration property. | Minimum: 1 core | |
NodeManager | Minimum: 1 GB. Configure additional heap memory as your workload requires. Set this value using the Java Heap Size of NodeManager in Bytes YARN configuration property. | | Disk and network sizing depend on workload; see Tuning YARN. |
ResourceManager | Minimum: 6 GB. Configure additional heap memory as your workload requires. Set this value using the Java Heap Size of ResourceManager in Bytes YARN configuration property. | Minimum: 1 core | |
Other Settings | | N/A | N/A |
For more information, see Tuning YARN.
ZooKeeper
Component | Java Heap | CPU | Disk |
---|---|---|---|
ZooKeeper Server | Set this value using the Java Heap Size of ZooKeeper Server in Bytes ZooKeeper configuration property. | Minimum: 4 cores | ZooKeeper was not designed to be a low-latency service and does not benefit from SSD drives. Its access patterns (append-only writes and sequential reads) were designed with spinning disks in mind, so Cloudera recommends HDD drives. |