Node Count and JVM Configuration
The number of nodes in an HBase cluster is typically driven by the following considerations:
Physical size of the data
Read/Write Throughput
Physical Size of the Data
The physical size of data on disk is affected by the following factors:
Factor Affecting Size of Physical Data | Description |
---|---|
HBase Overhead | The default amount of disk space required for a single HBase table cell. Smaller table cells require less overhead. The minimum cell size is 24 bytes and the default maximum is 10485760 bytes. Administrators can customize the maximum cell size with the |
Compression | Choose a data compression tool that makes sense for your data to reduce the physical size of data on disk. Unfortunately, HBase cannot ship with LZO due to licensing issues. However, HBase administrators may install LZO after installing HBase. GZIP provides better compression than LZO but is slower. HBase also supports Snappy. |
HDFS Replication | HBase uses HDFS for storage, so replicating HBase data stored in HDFS affects the total physical size of data. A typical replication factor of 3 for all HBase tables in a cluster would triple the physical size of the stored data. |
RegionServer Write Ahead Log (WAL) | The size of the Write Ahead Log, or WAL, for each RegionServer has minimal impact on the physical size of data. The size of the WAL is typically fixed at less than half of the memory for the RegionServer. Although this factor is included here for completeness, the impact of the WAL on data size is negligible and its size is usually not configured. |
Read/Write Throughput
The number of nodes in an HBase cluster may also be driven by required throughput for disk reads and/or writes. The throughput per node greatly depends on table cell size, data request patterns, as well as node and cluster configuration. Use YCSB tools to test the throughput of a single node or a cluster to determine if read/write throughput should drive your the number of nodes in your HBase cluster. A typical throughput for write operations for one RegionServer is 5-15 MB/s. Unfortunately, there is no good estimate for read throughput; it varies greatly depending on physical data size, request patterns, and hit rate for the block cache.