Apache HBase Guide

Apache HBase is a scalable, distributed, column-oriented datastore. Apache HBase provides real-time read/write random access to very large datasets hosted on HDFS.

Configuration Settings

HBase has a number of settings that you need to configure. For information, see Configuration Settings for HBase.

By default, HBase ships configured for standalone mode. In this mode of operation, a single JVM hosts the HBase Master, an HBase RegionServer, and a ZooKeeper quorum peer. HBase stores your data in a location on the local filesystem, rather than using HDFS. Standalone mode is only appropriate for initial testing.

Pseudo-distributed mode differs from standalone mode in that each of the component processes run in a separate JVM. It differs from distributed mode in that each of the separate processes run on the same server, rather than multiple servers in a cluster. For more information, see Configuring HBase in Pseudo-Distributed Mode.

Managing HBase

You can manage and configure various aspects of HBase using Cloudera Manager. For more information, see Managing HBase.

HBase Security

For the most part, securing an HBase cluster is a one-way operation, and moving from a secure to an unsecure configuration should not be attempted without contacting Cloudera support for guidance. For an overview security in HBase, see Managing HBase Security.

For information about authentication and authorization with HBase, see HBase Authentication and Configuring HBase Authorization.

HBase Replication

If your data is already in an HBase cluster, replication is useful for getting the data into additional HBase clusters. In HBase, cluster replication refers to keeping one cluster state synchronized with that of another cluster, using the write-ahead log (WAL) of the source cluster to propagate the changes. Replication is enabled at column family granularity. Before enabling replication for a column family, create the table and all column families to be replicated, on the destination cluster.

Cluster replication uses an active-push methodology. An HBase cluster can be a source (also called active, meaning that it writes new data), a destination (also called passive, meaning that it receives data using replication), or can fulfill both roles at once. Replication is asynchronous, and the goal of replication is consistency.

When data is replicated from one cluster to another, the original source of the data is tracked with a cluster ID, which is part of the metadata. In CDH 5, all clusters that have already consumed the data are also tracked. This prevents replication loops.

For more information about replication in HBase, see HBase Replication.

HBase High Availability

Most aspects of HBase are highly available in a standard configuration. A cluster typically consists of one Master and three or more RegionServers, with data stored in HDFS. To ensure that every component is highly available, configure one or more backup Masters. The backup Masters run on other hosts than the active Master.

For information about configuring high availability in HBase, see HBase High Availability.

Troubleshooting HBase

The Cloudera HBase packages have been configured to place logs in /var/log/hbase. Cloudera recommends tailing the .log files in this directory when you start HBase to check for any error messages or failures.

For information about HBase troubleshooting, see Troubleshooting HBase.

Upstream Information for HBase

More HBase information is available on the Apache Software Foundation site on the HBase project page.

For Apache HBase documentation, see the following:

Because Cloudera does not support all HBase features, always check external Hive documentation against the current version and supported features of HBase included in CDH distribution.

HBase has its own JIRA issue tracker.