Disk space, I/O bandwidth (required by Hadoop), and computational power (required for the MapReduce processes) are the most important parameters for accurate hardware sizing. Additionally, if you are installing HBase, you also need to analyze your application and its memory requirements, because HBase is a memory-intensive component. Based on the typical use cases for Hadoop, the following workload patterns are commonly observed in production environments:
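As a rough illustration of how disk-space sizing is usually approached, the sketch below estimates raw cluster storage from logical data volume. The replication factor of 3 is HDFS's default; the 25% allowance for intermediate MapReduce output and the growth figures in the usage example are assumptions chosen for illustration, not recommendations.

```python
# Back-of-the-envelope raw-storage estimate for a Hadoop cluster.
# Assumptions (illustrative only): HDFS default replication of 3, plus
# 25% headroom for intermediate/shuffle data.

def raw_storage_needed_tb(initial_data_tb, monthly_growth_tb, months,
                          replication=3, intermediate_overhead=0.25):
    """Estimate total raw disk capacity (TB) needed across the cluster."""
    logical_data = initial_data_tb + monthly_growth_tb * months
    # Every HDFS block is stored `replication` times; reserve extra
    # headroom for temporary intermediate output.
    return logical_data * replication * (1 + intermediate_overhead)

# Example: 10 TB today, growing 1 TB/month, planned 12 months out.
print(raw_storage_needed_tb(10, 1, 12))  # ~82.5 TB of raw disk
```

Actual sizing should also account for OS/log partitions and the usable-capacity fraction of each disk, which this sketch omits for brevity.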
Balanced Workload
If your jobs are distributed equally across the various job types (CPU-bound, disk I/O-bound, or network I/O-bound), your cluster has a balanced workload pattern. This is a good default configuration for unknown or evolving workloads.
Compute Intensive
These workloads are CPU-bound and are characterized by the need for a large number of CPUs and large amounts of memory to store in-process data. (This usage pattern is typical for natural language processing or HPCC workloads.)
I/O Intensive
A typical MapReduce job (such as sorting) requires very little compute power but relies heavily on the I/O capacity of the cluster (for example, if you have a lot of cold data). Hadoop clusters used for such workloads are typically I/O intensive. For this type of workload, we recommend investing in more disks per box.
Unknown or Evolving Workload Patterns
Most teams looking to build a Hadoop cluster are unaware of their workload patterns, and the first jobs submitted to Hadoop are often very different from the actual jobs run in production. For these reasons, Hortonworks recommends that you either use the Balanced Workload configuration or invest in a pilot Hadoop cluster and plan to evolve its design as you analyze the workload patterns in your environment.