Using HBase Blocksize
Configure the HBase blocksize to control the smallest unit of data HBase can read from a column family's HFiles.
HBase data is stored in one HFile (after a major compaction) or more than one HFile (before a major compaction) per column family per region. The blocksize for a given column family determines:
- The smallest unit of data HBase can read from that column family's HFiles.
- The basic unit of measure cached by a RegionServer in the BlockCache.
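The blocksize is set per column family, for example through the HBase shell's alter command (the BLOCKSIZE attribute) or the Java Admin API. The following is a minimal sketch using the HBase 2.x Java client; the table name (mytable), column family (cf), and the 32 KB value are illustrative assumptions, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SetBlocksize {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Rebuild the column family descriptor with a 32 KB blocksize
            // (the default is 64 KB). The new size applies to HFiles written
            // after the change, e.g. on the next flush or compaction.
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf"))   // assumed family name
                    .setBlocksize(32 * 1024)           // illustrative value
                    .build();
            admin.modifyColumnFamily(TableName.valueOf("mytable"), cf);
        }
    }
}
```

In the HBase shell, the equivalent is typically alter 'mytable', {NAME => 'cf', BLOCKSIZE => '32768'}.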
- Consider the average key/value size for the column family when tuning the blocksize. You can find the average key/value size using the HFile utility:

      $ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /path/to/HFILE -m -v
      ...
      Block index size as per heapsize: 296
      reader=hdfs://srv1.example.com:9000/path/to/HFILE, \
      compression=none, inMemory=false, \
      firstKey=US6683275_20040127/mimetype:/1251853756871/Put, \
      lastKey=US6684814_20040203/mimetype:/1251864683374/Put, \
      avgKeyLen=37, avgValueLen=8, \
      entries=1554, length=84447
      ...
- Consider the pattern of reads to the table or column family. For instance, if it is common to scan for 500 rows on various parts of the table, performance might improve if the blocksize is large enough to encompass 500-1000 rows, so that, often, only one read operation on the HFile is required. If your typical scan size is only 3 rows, reading 500-1000 rows per block would be overkill.
It is difficult to predict the size of a row before it is written, because the data will be compressed when it is written to the HFile. Perform testing to determine the correct blocksize for your data.
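As a rough starting point before testing, you can turn the considerations above into a back-of-the-envelope estimate: multiply the average key/value size reported by the HFile utility by the number of cells per row and the number of rows a typical scan touches, then compare the result with candidate blocksizes. The sketch below is illustrative only; the cells-per-row and rows-per-scan values are assumptions, and the avgKeyLen/avgValueLen figures are taken from the sample output above.

```java
public class BlocksizeEstimate {
    public static void main(String[] args) {
        int avgKeyLen = 37;     // avgKeyLen from the sample HFile -m output above
        int avgValueLen = 8;    // avgValueLen from the same output
        int cellsPerRow = 5;    // assumption: cells per row in this column family
        int rowsPerScan = 500;  // assumption: rows touched by a typical scan

        // Uncompressed key/value bytes a typical scan would touch.
        long bytesPerRow = (long) (avgKeyLen + avgValueLen) * cellsPerRow;
        long scanBytes = bytesPerRow * rowsPerScan;

        // Compare against candidate blocksizes; a block large enough to hold a
        // typical scan means the scan is often served by a single HFile read.
        System.out.printf("Typical scan touches ~%d KB of key/value data (uncompressed).%n",
                scanBytes / 1024);
        System.out.println("Compare with candidate blocksizes (default 64 KB), then verify by");
        System.out.println("testing, since data may be compressed when written to the HFile.");
    }
}
```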