Number of cores and multithreading
Other than cost, there is no negative for buying more and better CPUs. However, the return of investment on additional CPU power must be evaluated carefully.
Following are some points to consider:
- Cluster Bottleneck: In general, CPU resources (and lack thereof) are not the bottleneck for MapReduce and HBase. Almost always, the bottleneck is drive and/or network performance. Certainly, there are exceptions to this, such as inefficient Hive queries. Other compute frameworks like Impala, Spark, and Cloudera Search may be CPU-bound depending on the workload.
- Additional Cores/Threads: Within a given MapReduce job, a single task typically uses one thread at a time. With Spark this is different, because a single task might use multiple threads in parallel. As outlined earlier, the number of slots allocated per node may be a function of the number of drives in the node. A more careful evaluation is required in scenarios where the Virtual Private Cluster model is being considered. In such scenarios, if storage and compute are clearly distinct roles, then requirements might vary significantly. As long as there is no huge disparity in the number of cores (threads) and the number of drives, there is no need for additional cores. In addition, a MapReduce task is going to be I/O bound for typical jobs. Thus, a given thread used by the task will have a large amount of idle time while waiting for an I/O response.
- Clock Speed: Because Cloudera clusters often begin with a small number of use cases and associated workloads and grow over time, it makes sense to purchase the fastest CPUs available. Actual CPU usage is use case and workload dependent. For instance, computationally intensive Spark jobs benefit more from faster CPUs than I/O bound MapReduce applications.
For worker nodes, a mid-range 12-14 core CPU running at 2.4-2.5 GHz provides a good cost/performance tradeoff. For master nodes, a mid-range 8 core CPU with a faster clock speed (For example 2.6 GHz) can suffice. Where available, Simultaneous Multi-Threading implementations should be enabled (for example Intel’s HyperThreading). BIOS settings for CPU and memory should be set to Maximum Performance mode or equivalent.
See Hardware Requirements for detailed workload-specific practices.