This chapter discusses how to plan the capacity of an Apache HBase cluster and how to size its region servers. Planning the capacity of a cluster and performing the initial configuration involve several considerations:
Initial Configuration and Tuning
This chapter requires an understanding of the following HBase concepts:
Table 10.1. HBase Concepts
HBase Concept | Description
Region | A group of contiguous HBase table rows. Tables start with one region, and additional regions are added dynamically as the table grows. Regions can be spread across multiple hosts to provide load balancing and quick recovery from failure. There are two types of region: primary and secondary. A secondary region is a replicated primary region located on a different region server.
Region server | Serves data requests for one or more regions. A single region is serviced by only one region server, but a region server may serve multiple regions.
Column family | A group of semantically related columns stored together.
Memstore | In-memory storage for a region server. Region servers write files to HDFS after the memstore reaches a configurable maximum value specified with the hbase.hregion.memstore.flush.size property in the hbase-site.xml configuration file.
Write Ahead Log (WAL) | A log, persisted to HDFS, where operations are recorded before they are stored in the memstore, so that edits can be replayed if a region server fails.
Compaction storm | When the operations stored in the memstore are flushed to disk, HBase consolidates and merges many smaller files into fewer large files. This consolidation is called compaction, and it is usually very fast. However, if many region servers reach the memstore flush limit at the same time, HBase performance may degrade because of the large number of simultaneous major compactions. Administrators can avoid this by manually splitting tables over time.
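As a minimal sketch of the memstore setting described in the table above, the fragment below shows how the hbase.hregion.memstore.flush.size property might appear in hbase-site.xml. The value of 134217728 bytes (128 MB) is an illustrative assumption, not a recommendation; check the default and the appropriate value for your HBase version and workload.

```xml
<!-- hbase-site.xml: illustrative fragment only; the value is an assumption -->
<configuration>
  <property>
    <!-- Memstore size, in bytes, at which a flush to HDFS is triggered -->
    <name>hbase.hregion.memstore.flush.size</name>
    <!-- 134217728 bytes = 128 * 1024 * 1024 = 128 MB -->
    <value>134217728</value>
  </property>
</configuration>
```

A larger value generally means fewer, larger flushes at the cost of more heap per region; a smaller value produces more small files and therefore more compaction work.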