Chapter 4. High Availability for HBase
HDP enables HBase administrators to configure HBase clusters with read-only High Availability, or HA. This feature benefits HBase applications that require low-latency queries and can tolerate minimal (near-zero-second) staleness for read operations. Examples include queries on remote sensor data, distributed messaging, object stores, and user profile management.
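The trade-off between availability and staleness can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: the `Replica` and `ReplicatedRegion` classes, the `refresh_secondaries` method, and the `(value, is_stale)` return shape are hypothetical names chosen for illustration. The model assumes that writes go to the primary copy only and that a read falls back to a secondary copy, flagged as possibly stale, when the primary is unavailable.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a region; a plain dict stands in for its data."""
    data: dict = field(default_factory=dict)
    available: bool = True


class ReplicatedRegion:
    """Hypothetical model: one primary copy plus read-only secondary copies."""

    def __init__(self, num_secondaries=1):
        self.primary = Replica()
        self.secondaries = [Replica() for _ in range(num_secondaries)]

    def put(self, key, value):
        # Assumption: only the primary accepts writes (HA covers reads only).
        if not self.primary.available:
            raise RuntimeError("primary unavailable: write rejected")
        self.primary.data[key] = value

    def refresh_secondaries(self):
        # Stand-in for whatever mechanism propagates changes to secondaries.
        for secondary in self.secondaries:
            secondary.data = dict(self.primary.data)

    def get(self, key):
        """Return (value, is_stale). Secondary reads are flagged as stale."""
        if self.primary.available:
            return self.primary.data.get(key), False
        for secondary in self.secondaries:
            if secondary.available:
                return secondary.data.get(key), True
        raise RuntimeError("no replica available")


region = ReplicatedRegion(num_secondaries=1)
region.put("sensor-1", 20)
region.refresh_secondaries()       # secondary now holds 20
region.put("sensor-1", 21)         # secondary has not seen this write yet
region.primary.available = False   # simulate a failed primary
print(region.get("sensor-1"))      # served by the secondary, flagged stale
```

In this model the read still succeeds after the primary fails, but it returns the older value together with a staleness flag. An application that uses read HA has to be able to accept that kind of answer.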
High Availability for HBase features the following functionality:
Data is safely protected in HDFS
Failed nodes are automatically recovered
No single point of failure
All HBase API and region operations are supported, including scans, region split/merge, and META table support (the META table stores information about regions)
However, HBase administrators should carefully consider the following costs associated with using High Availability features:
Double or triple MemStore usage
Increased BlockCache usage
Increased network traffic for log replication
Extra backup RPCs for secondary region replicas
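To get a rough feel for the first cost in the list, the following sketch estimates worst-case MemStore usage as a function of the number of copies kept per region. It is a back-of-the-envelope model, not an HBase formula: the function name `estimate_memstore_bytes`, the region count of 100, and the 128 MiB per-region limit are all hypothetical values chosen for illustration, and the model assumes that every copy of a region can fill its MemStore to the same limit.

```python
MIB = 1024 * 1024


def estimate_memstore_bytes(regions, memstore_limit_bytes, copies_per_region):
    """Worst-case MemStore footprint if every copy of every region is full.

    copies_per_region counts the primary too: 1 means no secondary copies.
    """
    if regions < 0 or memstore_limit_bytes < 0:
        raise ValueError("regions and memstore_limit_bytes must be non-negative")
    if copies_per_region < 1:
        raise ValueError("copies_per_region must be at least 1")
    return regions * memstore_limit_bytes * copies_per_region


# Hypothetical cluster: 100 regions, 128 MiB MemStore limit per region.
baseline = estimate_memstore_bytes(100, 128 * MIB, 1)
for copies in (1, 2, 3):
    total = estimate_memstore_bytes(100, 128 * MIB, copies)
    print(f"{copies} copies per region: {total // MIB} MiB "
          f"({total // baseline}x baseline)")
```

Under these assumptions, keeping two or three copies of each region doubles or triples the worst-case MemStore footprint, which matches the first cost listed above.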
HBase is a distributed key-value store designed for fast table scans and read operations at petabyte scale. Before configuring HA for HBase, you should understand the concepts in the following table.
Table 4.1. Basic HBase Concepts
HBase Concept | Description |
---|---|
Region | A group of contiguous rows in an HBase table. Tables start with one region; additional regions are added dynamically as the table grows. Regions can be spread across multiple hosts to balance workloads and recover quickly from failure. There are two types of regions: primary and secondary. A secondary region is a copy of a primary region, replicated on a different RegionServer. |
RegionServer | A RegionServer serves data requests for one or more regions. A single region is serviced by only one RegionServer, but a RegionServer may serve multiple regions. When region replication is enabled, a RegionServer can serve regions in primary and secondary mode concurrently. |
Column family | A column family is a group of semantically related columns that are stored together. |
MemStore | MemStore is in-memory storage for a RegionServer. RegionServers write files to HDFS after the MemStore reaches a configurable maximum size. |
Write Ahead Log (WAL) | The WAL is a log file that records all changes to data until the data is successfully written to disk (MemStore is flushed). This protects against data loss in the event of a failure before MemStore contents are written to disk. |
Compaction | When operations stored in the MemStore are flushed to disk, HBase consolidates and merges many smaller files into fewer large files. This consolidation is called compaction, and it is usually very fast. However, if many RegionServers reach the MemStore size limit at the same time, the large number of simultaneous major compactions may degrade HBase performance. Administrators can avoid this by manually splitting tables over time. |
For information about configuring regions, see Deploying Apache HBase in the HDP Data Access Guide.
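The way the MemStore, WAL, and compaction concepts in Table 4.1 fit together can be sketched with a toy in-memory model. This is an illustration under simplifying assumptions, not how HBase is implemented: the class `ToyRegionStore` and its methods are hypothetical, the flush threshold counts entries rather than bytes, "files" are plain dictionaries rather than files in HDFS, and the WAL is simply cleared after a flush.

```python
class ToyRegionStore:
    """Toy write path: WAL append, MemStore insert, flush, compaction."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # entry count, stand-in for a size limit
        self.wal = []          # changes not yet flushed, in arrival order
        self.memstore = {}     # in-memory view of recent writes
        self.store_files = []  # flushed "files" (dicts), oldest first

    def put(self, key, value):
        self.wal.append((key, value))  # log the change before applying it
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write MemStore contents out as a new sorted file, then clear the WAL."""
        if not self.memstore:
            return
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()
        self.wal.clear()  # flushed changes no longer need the log

    def recover_from_wal(self):
        """Rebuild the MemStore after losing memory by replaying the WAL."""
        self.memstore = {}
        for key, value in self.wal:
            self.memstore[key] = value

    def compact(self):
        """Merge all files into one; newer files override older ones."""
        merged = {}
        for store_file in self.store_files:  # oldest first
            merged.update(store_file)
        self.store_files = [dict(sorted(merged.items()))] if merged else []

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):  # newest first
            if key in store_file:
                return store_file[key]
        return None


store = ToyRegionStore(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.put(key, value)
print(len(store.store_files), store.memstore)  # two flushed files, one pending write

store.memstore.clear()    # simulate losing memory before the next flush
store.recover_from_wal()  # the pending write comes back from the WAL
store.compact()           # two small files become one
print(store.get("a"), store.get("d"), len(store.store_files))
```

The model shows why the WAL only has to cover changes that have not been flushed yet, and why reads get cheaper after compaction: there are fewer files to search.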