The number of nodes in an HBase cluster is typically driven by the following considerations:

* Physical size of the data
* Read/write throughput
Physical Size of the Data
The physical size of data on disk is affected by the following factors:
Table 10.2. Factors Affecting Physical Size of Data
HBase Overhead
The default amount of disk space required for a single HBase table cell. Smaller table cells require less overhead. The minimum cell size is 24 bytes and the default maximum is 10485760 bytes. Administrators can customize the maximum cell size with the hbase.client.keyvalue.maxsize property in the hbase-site.xml configuration file. HBase table cells are aggregated into blocks, and the block size is also configurable for each column family with the hbase.mapreduce.hfileoutputformat.blocksize property. The default value is 65536 bytes. Administrators may reduce this value for tables with highly random data access patterns to improve query latency.

Compression
Choose a data compression tool that makes sense for your data to reduce the physical size of data on disk. HBase cannot ship with LZO due to licensing issues, but administrators may install LZO after installing HBase. GZIP provides better compression than LZO but is slower. HBase also supports Snappy.

HDFS Replication
HBase uses HDFS for storage, so replicating HBase data stored in HDFS affects the total physical size of data. A typical replication factor of 3 for all HBase tables in a cluster triples the physical size of the stored data.

Region Server Write Ahead Log (WAL)
The size of the Write Ahead Log (WAL) for each region server has minimal impact on the physical size of data; it is typically fixed at less than half of the memory for the region server. Although this factor is included here for completeness, its impact on data size is negligible and the WAL size is usually not configured.
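The factors above can be combined into a rough back-of-the-envelope sizing estimate. The sketch below is illustrative only: the compression ratio and per-cell overhead fraction are assumptions you would measure for your own data, not values from this guide; only the replication factor of 3 comes from the table above.

```python
def estimated_hbase_disk_size(raw_bytes,
                              compression_ratio=0.5,
                              replication_factor=3,
                              cell_overhead_fraction=0.1):
    """Rough on-disk size estimate for HBase data.

    raw_bytes              -- logical size of the data before storage
    compression_ratio      -- assumed compressed/uncompressed ratio
                              (measure this for your own dataset)
    replication_factor     -- HDFS replication; the typical factor of 3
                              triples the stored data
    cell_overhead_fraction -- assumed per-cell key/metadata overhead
    """
    with_overhead = raw_bytes * (1 + cell_overhead_fraction)
    compressed = with_overhead * compression_ratio
    return compressed * replication_factor

# 1 TB of raw data with the assumed defaults:
# 1e12 * 1.1 * 0.5 * 3 = 1.65e12 bytes (~1.65 TB on disk)
print(estimated_hbase_disk_size(1e12))
```

Note that the WAL is deliberately omitted, since the table above describes its impact on data size as negligible.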
Read/Write Throughput
The number of nodes in an HBase cluster may also be driven by the required throughput for disk reads and/or writes. Throughput per node depends greatly on table cell size, data request patterns, and node and cluster configuration. Use the YCSB (Yahoo! Cloud Serving Benchmark) tools to test the throughput of a single node or of a cluster to determine whether read/write throughput should drive the number of nodes in your HBase cluster. A typical write throughput for one region server is 5-15 Mb/s. Unfortunately, there is no good estimate for read throughput; it varies greatly depending on physical data size, request patterns, and the hit rate for the block cache.
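When write throughput is the driver, the node count can be sketched by dividing the required aggregate write throughput by the per-node figure (5-15 Mb/s per region server, per the range above). The function name, default per-node value, and safety margin below are illustrative assumptions; replace the per-node default with your own YCSB measurements.

```python
import math

def nodes_for_write_throughput(required_mbps,
                               per_node_mbps=10.0,
                               safety_margin=1.2):
    """Estimate the number of region servers needed for write throughput.

    required_mbps  -- aggregate write throughput the cluster must sustain (Mb/s)
    per_node_mbps  -- per-region-server write throughput; 5-15 Mb/s is a
                      typical range, so benchmark with YCSB rather than
                      relying on this assumed default
    safety_margin  -- illustrative headroom factor (assumption)
    """
    return math.ceil(required_mbps * safety_margin / per_node_mbps)

# 100 Mb/s of writes with the assumed defaults -> 12 nodes
print(nodes_for_write_throughput(100))
```

A cluster sized this way should still be cross-checked against the physical-size considerations in the previous section; whichever factor demands more nodes wins.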