Configuring Apache HBase

Using HBase blocksize

Configure the blocksize of a column family to set the smallest unit of data that HBase can read from that column family's HFiles.

HBase data is stored in one (after a major compaction) or more (possibly before a major compaction) HFiles per column family per region. The blocksize for a given column family determines:

  • The smallest unit of data HBase can read from the column family's HFiles.
  • The basic unit of measure cached by a RegionServer in the BlockCache.
The default blocksize is 64 KB. The appropriate blocksize depends on your data and usage patterns. Use the following guidelines to tune the blocksize, in combination with testing and benchmarking as appropriate.
  • Consider the average key/value size for the column family when tuning the blocksize. You can find the average key/value size using the HFile utility:
    $ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /path/to/HFILE -m -v
    ...
    Block index size as per heapsize: 296
    reader=hdfs://srv1.example.com:9000/path/to/HFILE, \
    compression=none, inMemory=false, \
    firstKey=US6683275_20040127/mimetype:/1251853756871/Put, \
    lastKey=US6684814_20040203/mimetype:/1251864683374/Put, \
    avgKeyLen=37, avgValueLen=8, \
    entries=1554, length=84447
    ...
  • Consider the pattern of reads to the table or column family. For instance, if it is common to scan for 500 rows on various parts of the table, performance might improve if the blocksize is large enough to encompass 500-1000 rows, so that a scan often requires only one read operation on the HFile. If your typical scan size is only 3 rows, a block that holds 500-1000 rows would be overkill, because each read would fetch far more data than the scan needs.

    It is difficult to predict the on-disk size of a row before it is written, because the data may be compressed when it is written to the HFile. Perform testing to determine the correct blocksize for your data.
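The two guidelines above can be combined into a rough starting estimate before you benchmark. The following is a minimal sketch, not part of HBase: the function names cells_per_block and bytes_for_rows are hypothetical, and the arithmetic assumes uncompressed, unencoded data and ignores per-cell and per-block overhead. Treat the results as a first guess only.

```shell
#!/bin/sh
# Back-of-the-envelope blocksize estimates from HFile utility output.
# Assumption: uncompressed, unencoded data; per-cell and per-block
# overhead is ignored, so real blocks hold somewhat fewer key/values.

# Usage: cells_per_block BLOCKSIZE_BYTES AVG_KEY_LEN AVG_VALUE_LEN
# Prints roughly how many key/values fit in one block.
cells_per_block() {
    echo $(( $1 / ($2 + $3) ))
}

# Usage: bytes_for_rows ROWS CELLS_PER_ROW AVG_KEY_LEN AVG_VALUE_LEN
# Prints the approximate uncompressed bytes covered by a scan of ROWS rows.
bytes_for_rows() {
    echo $(( $1 * $2 * ($3 + $4) ))
}

# Values from the sample HFile output: avgKeyLen=37, avgValueLen=8.
cells_per_block 65536 37 8     # default 64 KB blocksize
bytes_for_rows 500 1 37 8      # hypothetical: 500-row scan, one cell per row
```

With the sample values, the first call prints 1456 and the second prints 22500. Under these assumptions, a 500-row scan of single-cell rows already fits within one default 64 KB block. Rows with more cells or larger values change the picture quickly, so confirm any estimate with testing.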