1.1. Physical Size of the Data

The physical size of data on disk is affected by the following factors:

Factor Affecting Size of Physical Data

Description

HBase Overhead

The default amount of disk space required for a single HBase table cell. Smaller table cells require less overhead. The minimum cell size is 24 bytes and the default maximum is 10485760 bytes. Administrators can customize the maximum cell size with the hbase.client.keyvalue.maxsize property in the hbase-site.xml configuration file. HBase table cells are aggregated into blocks, and the block size is also configurable for each column family with the hbase.mapreduce.hfileoutputformat.blocksize property. The default value is 65536 bytes. Administrators may reduce this value for tables with highly random data access patterns to improve query latency.

Compression

Choose a data compression tool that makes sense for your data to reduce the physical size of data on disk. Unfortunately, HBase cannot ship with LZO due to licensing issues. However, HBase administrators may install LZO after installing HBase. GZIP provides better compression than LZO but is slower. HBase also supports Snappy.

HDFS Replication

HBase uses HDFS for storage, so replicating HBase data stored in HDFS affects the total physical size of data. A typical replication factor of 3 for all HBase tables in a cluster would triple the physical size of the stored data.

RegionServer Write Ahead Log (WAL)

The size of the Write Ahead Log, or WAL, for each RegionServer has minimal impact on the physical size of data. The size of the WAL is typically fixed at less than half of the memory for the region server. Although this factor is included here for completeness, the impact of the WAL on data size is negligible and its size is usually not configured.


loading table of contents...