Chapter 4. Highly Available Reads with HBase
HDP enables HBase administrators to configure HBase clusters with read-only High Availability, or HA. This feature benefits HBase applications that require low-latency queries and can tolerate minimal (near-zero-second) staleness for read operations. Examples include queries on remote sensor data, distributed messaging, object stores, and user profile management.
High Availability for HBase features the following functionality:
- Data is safely protected in HDFS
- Failed nodes are automatically recovered
- No single point of failure
- All HBase API and region operations are supported, including scans, region splits and merges, and the META table (which stores information about regions)
However, HBase administrators should carefully consider the following costs associated with using High Availability features:
- Double or triple MemStore usage
- Increased BlockCache usage
- Increased network traffic for log replication
- Extra backup RPCs for secondary region replicas
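
To get a rough feel for the first cost listed above, the sketch below estimates aggregate MemStore memory when every replica of a region keeps its own in-memory copy. It is back-of-the-envelope arithmetic, not an HBase API: the function name and the example figures (200 regions, 128 MB of MemStore per region) are hypothetical, and real usage depends on write patterns and flush timing.

```python
def memstore_footprint_mb(regions: int, memstore_mb_per_region: float, replicas: int) -> float:
    """Estimate worst-case cluster-wide MemStore memory, in MB.

    Assumption: each replica (the primary plus every secondary) holds a
    full copy of the region's MemStore, so memory scales linearly with
    the replica count.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1 (the primary)")
    return regions * memstore_mb_per_region * replicas


# Hypothetical cluster: 200 regions, up to 128 MB of MemStore each.
baseline = memstore_footprint_mb(200, 128, replicas=1)
for replicas in (1, 2, 3):
    total = memstore_footprint_mb(200, 128, replicas)
    print(f"replicas={replicas}: {total:.0f} MB ({total / baseline:.0f}x baseline)")
```

With two or three replicas per region, the estimate doubles or triples, which matches the cost described in the list.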
HBase is a distributed key-value store designed for fast table scans and read operations at petabyte scale. Before configuring HA for HBase, you should understand the concepts in the following table.
HBase Concept | Description |
---|---|
Region | A group of contiguous rows in an HBase table. Tables start with one region; additional regions are added dynamically as the table grows. Regions can be spread across multiple hosts to balance workloads and recover quickly from failure. There are two types of regions: primary and secondary. A secondary region is a copy of a primary region, replicated on a different Region Server. |
Region Server | A Region Server serves data requests for one or more regions. A single region is served by only one Region Server, but a Region Server may serve multiple regions. When region replication is enabled, a Region Server can serve regions in primary and secondary mode concurrently. |
Column family | A column family is a group of semantically related columns that are stored together. |
MemStore | MemStore is in-memory storage for a Region Server. Region Servers write files to HDFS after the MemStore reaches a maximum size that is set by a configuration property. |
Write Ahead Log (WAL) | The WAL is a log file that records all changes to data until the data is successfully written to disk (MemStore is flushed). This protects against data loss in the event of a failure before MemStore contents are written to disk. |
Compaction | Each time the contents of the MemStore are flushed to disk, a new file is written. HBase consolidates and merges many smaller files into fewer large files. This consolidation is called compaction, and it is usually very fast. However, if many Region Servers reach the MemStore size limit at the same time, HBase performance may degrade from the large number of simultaneous major compactions. Administrators can avoid this by manually splitting tables over time. |
For information about configuring regions, see "HBase Cluster Capacity and Region Sizing" in the System Administration Guide.
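
The interplay between the MemStore, the WAL, and compaction described in the table above can be illustrated with a toy model. The sketch below is not HBase code and does not use any HBase API: the `ToyRegion` class, its method names, and the cell-count flush threshold are all hypothetical simplifications (real HBase measures the MemStore in bytes and manages the WAL at the Region Server level).

```python
class ToyRegion:
    """Toy model of a write path with a WAL, a MemStore, and store files."""

    def __init__(self, flush_threshold: int):
        self.flush_threshold = flush_threshold  # hypothetical limit, in buffered rows
        self.wal = []           # log of edits that have not been flushed yet
        self.memstore = {}      # in-memory buffer: row key -> value
        self.store_files = []   # immutable sorted files, oldest first

    def put(self, row: str, value: str) -> None:
        self.wal.append((row, value))   # record the change in the log first
        self.memstore[row] = value      # then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write the MemStore out as one sorted file, then discard logged edits."""
        if not self.memstore:
            return
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()  # flushed data is on disk, so the log entries are no longer needed

    def crash_and_recover(self) -> None:
        """Simulate losing memory, then rebuild the MemStore by replaying the WAL."""
        self.memstore = {}
        for row, value in self.wal:
            self.memstore[row] = value

    def compact(self) -> None:
        """Merge all store files into one; values from newer files win."""
        merged = {}
        for store_file in self.store_files:  # oldest to newest
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []

    def get(self, row: str):
        if row in self.memstore:
            return self.memstore[row]
        for store_file in reversed(self.store_files):  # newest file first
            for key, value in store_file:
                if key == row:
                    return value
        return None


region = ToyRegion(flush_threshold=2)
region.put("row1", "a")
region.put("row2", "b")      # threshold reached: MemStore is flushed to a file
region.put("row1", "c")      # buffered in memory and recorded in the WAL
region.crash_and_recover()   # the unflushed edit survives through WAL replay
region.flush()
region.compact()             # two small files become one larger file
print(region.get("row1"), region.get("row2"), len(region.store_files))
```

The model shows why an edit that is still only in memory is not lost on failure: the WAL entry written before the MemStore update is replayed during recovery.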