5. Hardware selection for HBase

HBase fills memory with several types of caches, and as a general rule, the more memory HBase has, the better it can cache read requests. Each slave node in an HBase cluster (RegionServer) maintains a number of regions (a region is a contiguous chunk of a table's data). For large clusters, it is important to ensure that the HBase Master and the NameNode run on separate server machines. Note that in large-scale deployments, the ZooKeeper nodes are not co-deployed with the Hadoop/HBase slave nodes.

Choosing storage options

In a distributed setup, HBase stores its data in Hadoop DataNodes. To get the maximum read/write locality, HBase RegionServers and DataNodes are co-deployed on the same machines. Therefore, all recommendations for the DataNode/TaskTracker hardware setup also apply to the RegionServers. Depending on whether your HBase applications are read/write oriented or processing oriented, you must balance the number of disks with the number of CPU cores available. Typically, you should have at least one core per disk.
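The core-per-disk guideline above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a sizing tool: the function names and the example node shapes (12 cores with 12 disks, 8 cores with 12 disks) are hypothetical and chosen only for illustration.

```python
def cores_per_disk(cores: int, disks: int) -> float:
    """Return the ratio of CPU cores to data disks on one node."""
    if cores <= 0 or disks <= 0:
        raise ValueError("cores and disks must be positive")
    return cores / disks


def meets_core_per_disk_rule(cores: int, disks: int) -> bool:
    """True if the node has at least one core per disk (the guideline above)."""
    return cores_per_disk(cores, disks) >= 1.0


if __name__ == "__main__":
    # Hypothetical node shapes, for illustration only.
    print(meets_core_per_disk_rule(12, 12))  # True: one core per disk
    print(meets_core_per_disk_rule(8, 12))   # False: more disks than cores
```

A node that fails the check is not necessarily wrong for a read/write-oriented workload; the ratio is only a starting point for balancing disks against cores.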

Memory sizing

HBase Master node(s) are not as compute intensive as a typical RegionServer or the NameNode server. Therefore, a more modest memory setting can be chosen for the HBase Master. RegionServer memory requirements depend heavily on the workload characteristics of your HBase cluster. Although over-provisioning memory benefits all workload patterns, with very large heap sizes Java’s stop-the-world GC pauses may cause problems.

In addition, when running an HBase cluster with Hadoop core, you must over-provision the memory for Hadoop MapReduce by at least 1 GB to 2 GB per task, on top of the memory allocated to HBase.
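The per-task allowance above translates into a simple memory range per node. The sketch below assumes a hypothetical RegionServer with 16 GB of HBase memory and 8 concurrent MapReduce tasks; the function name and these figures are illustrative, and the result leaves out memory for the operating system and other daemons.

```python
def hbase_plus_mapreduce_memory_gb(hbase_memory_gb: float,
                                   concurrent_tasks: int,
                                   min_per_task_gb: float = 1.0,
                                   max_per_task_gb: float = 2.0) -> tuple[float, float]:
    """Return the (low, high) memory range in GB for HBase plus MapReduce tasks.

    The 1 GB to 2 GB per-task defaults come from the guideline above;
    everything else is supplied by the caller.
    """
    if hbase_memory_gb < 0 or concurrent_tasks < 0:
        raise ValueError("memory and task count must be non-negative")
    if min_per_task_gb > max_per_task_gb:
        raise ValueError("min_per_task_gb must not exceed max_per_task_gb")
    low = hbase_memory_gb + concurrent_tasks * min_per_task_gb
    high = hbase_memory_gb + concurrent_tasks * max_per_task_gb
    return (low, high)


if __name__ == "__main__":
    # Hypothetical node: 16 GB for HBase, 8 concurrent MapReduce tasks.
    low, high = hbase_plus_mapreduce_memory_gb(16.0, 8)
    print(f"Plan for {low} GB to {high} GB")  # Plan for 24.0 GB to 32.0 GB
```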