Hadoop High Availability

Chapter 4. Highly Available Reads with HBase

HDP enables HBase administrators to configure HBase clusters with read-only High Availability, or HA. This feature benefits HBase applications that require low-latency queries and can tolerate minimal (near-zero-second) staleness for read operations. Examples include queries on remote sensor data, distributed messaging, object stores, and user profile management.
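To illustrate what read staleness means here, the following is a minimal toy model, not HBase code. It assumes a primary copy that accepts writes and a secondary copy that receives them later; the class name `ToyReplicatedRegion`, its methods, and the `allow_stale` flag are all hypothetical names chosen for this sketch.

```python
class ToyReplicatedRegion:
    """Toy model of a primary region and one secondary replica (illustrative only)."""

    def __init__(self):
        self.primary = {}    # always reflects the latest writes
        self.secondary = {}  # catches up only when replicate() runs
        self.pending = []    # edits not yet applied to the secondary

    def put(self, row, value):
        # Writes go to the primary only; the secondary sees them later.
        self.primary[row] = value
        self.pending.append((row, value))

    def replicate(self):
        # Apply all outstanding edits to the secondary, in write order.
        for row, value in self.pending:
            self.secondary[row] = value
        self.pending.clear()

    def get(self, row, allow_stale=False):
        """Return (value, from_secondary)."""
        if allow_stale:
            # A secondary read may miss recent writes, so it is flagged.
            return self.secondary.get(row), True
        return self.primary.get(row), False


region = ToyReplicatedRegion()
region.put("sensor-1", "20.5")
before = region.get("sensor-1", allow_stale=True)   # secondary has not caught up
region.replicate()
after = region.get("sensor-1", allow_stale=True)    # secondary now has the write
print(before, after)
```

In this model a reader that accepts stale data still gets an answer while the primary is unavailable, at the cost of possibly missing the most recent writes.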

High Availability for HBase features the following functionality:

  • Data is safely protected in HDFS

  • Failed nodes are automatically recovered

  • No single point of failure

  • All HBase API and region operations are supported, including scans, region splits and merges, and the META table (which stores information about regions)

However, HBase administrators should carefully consider the following costs associated with using High Availability features:

  • Double or triple MemStore usage

  • Increased BlockCache usage

  • Increased network traffic for log replication

  • Extra backup RPCs for secondary region replicas
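The MemStore cost can be made concrete with a rough back-of-the-envelope estimate. The sketch below assumes that every replica of a region may hold up to one full MemStore, which is a simplification; the function name, the count of 100 regions, and the 128 MB flush size are illustrative assumptions, not values from this guide.

```python
def memstore_upper_bound_mb(regions, flush_size_mb, region_replication):
    """Worst-case MemStore footprint in MB if every replica fills its MemStore."""
    return regions * flush_size_mb * region_replication


# Hypothetical cluster: 100 regions, 128 MB flush size per region.
baseline = memstore_upper_bound_mb(regions=100, flush_size_mb=128, region_replication=1)
with_ha = memstore_upper_bound_mb(regions=100, flush_size_mb=128, region_replication=3)

print(baseline)            # 12800
print(with_ha)             # 38400
print(with_ha / baseline)  # 3.0
```

Under these assumptions, a replication factor of 3 triples the upper bound, which matches the "double or triple" cost listed above.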

HBase is a distributed key-value store designed for fast table scans and read operations at petabyte scale. Before configuring HA for HBase, you should understand the concepts in the following table.

HBase Concept: Description

Region

A group of contiguous rows in an HBase table. Tables start with one region; additional regions are added dynamically as the table grows. Regions can be spread across multiple hosts to balance workloads and recover quickly from failure.

There are two types of regions: primary and secondary. A secondary region is a copy of a primary region, replicated on a different Region Server.

Region Server

A Region Server serves data requests for one or more regions. A single region is served by only one Region Server, but a Region Server may serve multiple regions. When region replication is enabled, a Region Server can serve regions in primary and secondary mode concurrently.

Column family

A column family is a group of semantically related columns that are stored together.

MemStore

MemStore is in-memory storage for a Region Server. When the MemStore reaches a configurable maximum size, set with the hbase.hregion.memstore.flush.size property in the hbase-site.xml configuration file, the Region Server flushes its contents to files in HDFS.

Write Ahead Log (WAL)

The WAL is a log file that records all changes to data until the data is successfully written to disk (that is, until the MemStore is flushed). This protects against data loss if a failure occurs before the MemStore contents are written to disk.

Compaction

When data stored in the MemStore is flushed to disk, HBase consolidates and merges many smaller files into fewer large files. This consolidation is called compaction, and it is usually very fast. However, if many Region Servers reach the MemStore size limit at the same time, the large number of simultaneous major compactions may degrade HBase performance. Administrators can avoid this by manually splitting tables over time.
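The way the MemStore, WAL, and compaction concepts above fit together can be sketched with a small toy model. This is not the HBase implementation: the `ToyRegion` class and its methods are hypothetical, sizes are counted as string lengths, and `flush_size_bytes` only stands in for the hbase.hregion.memstore.flush.size property.

```python
class ToyRegion:
    """Toy model of one region's write path (illustrative only)."""

    def __init__(self, flush_size_bytes):
        self.flush_size_bytes = flush_size_bytes  # stand-in for the flush size property
        self.wal = []           # edits logged but not yet flushed
        self.memstore = {}      # row key -> latest value, held in memory
        self.memstore_bytes = 0
        self.store_files = []   # each flush adds one sorted file, oldest first

    def put(self, row, value):
        self.wal.append((row, value))  # 1. record the change in the log first
        old = self.memstore.get(row)
        if old is not None:
            self.memstore_bytes -= len(row) + len(old)
        self.memstore[row] = value     # 2. apply the change in memory
        self.memstore_bytes += len(row) + len(value)
        if self.memstore_bytes >= self.flush_size_bytes:
            self.flush()               # 3. flush once the limit is reached

    def flush(self):
        if not self.memstore:
            return
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.memstore_bytes = 0
        self.wal.clear()  # flushed edits no longer need the log

    def compact(self):
        # Merge all files into one; later files overwrite earlier values.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []

    def recover_memstore(self):
        """Rebuild unflushed edits from the WAL, as after a crash."""
        return dict(self.wal)


region = ToyRegion(flush_size_bytes=16)
region.put("row1", "aaaa")        # 8 bytes in the MemStore
region.put("row2", "bbbb")        # 16 bytes: limit reached, flush to a file
region.put("row1", "cccc")        # unflushed; protected only by the WAL
print(region.recover_memstore())  # {'row1': 'cccc'}
region.flush()
region.compact()
print(region.store_files)         # [[('row1', 'cccc'), ('row2', 'bbbb')]]
```

The model shows why the WAL can be discarded after a flush and why compaction reduces the number of files a read must consult.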

For information about configuring regions, see "HBase Cluster Capacity and Region Sizing" in the System Administration Guide.