Deploying Apache HBase
Apache HBase (often simply referred to as HBase) operates with many other big data components of the Apache Hadoop environment. Some of these components might or might not be suitable for use with the HBase deployment in your environment. However, two components that must coexist on your HBase cluster are Apache Hadoop Distributed File System (HDFS) and Apache ZooKeeper. These components are bundled with all HDP distributions.
Apache Hadoop Distributed File System (HDFS) is the persistent data store that holds data in a state that allows users and applications to quickly retrieve and write to HBase tables. While technically it is possible to run HBase on a different distributed filesystem, the vast majority of HBase clusters run with HDFS. HDP uses HDFS as its filesystem.
Apache ZooKeeper (or simply ZooKeeper) is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services in Hadoop ecosystems. ZooKeeper is essential for maintaining stability for HBase applications in the event of node failures, as well as to store and mediate updates to important configuration information across the Hadoop cluster.
If you want to use a SQL-like interface to work with the semistructured data of an HBase cluster, a good complement to the other Hadoop components is Apache Phoenix (or simply Phoenix). Phoenix is a SQL abstraction layer for interacting with HBase. Phoenix enables you to create and interact with tables in the form of typical DDL and DML statements through its standard JDBC API. HDP supports integration of Phoenix with HBase. See Orchestrating SQL and APIs with Apache Phoenix.
The following table defines some main HBase concepts:
HBase Concept | Description |
---|---|
region | A group of contiguous HBase table rows. Tables start with one region, and additional regions are added dynamically as the table grows. Regions can be spread across multiple hosts to provide load balancing and quick recovery from failure. There are two types of regions: primary and secondary. A secondary region is a replicated primary region located on a different RegionServer. |
RegionServer | Serves data requests for one or more regions. A single region is serviced by only one RegionServer, but a RegionServer may serve multiple regions. |
column family | A group of semantically related columns stored together |
MemStore | In-memory storage for a RegionServer. RegionServers write files to HDFS after the MemStore reaches a configurable maximum value specified with the hbase.hregion.memstore.flush.size property in the hbase-site.xml configuration file. |
Write Ahead Log (WAL) | Log in which operations are recorded before they are applied to the MemStore; the WAL is used to recover MemStore data that has not yet been flushed to disk if a RegionServer fails. |
compaction storm | A short period when the operations stored in the MemStore are flushed to disk and HBase consolidates and merges many smaller files into fewer large files. This consolidation is called compaction, and it is usually very fast. However, if many RegionServers reach the data limit specified by the MemStore at the same time, HBase performance might degrade from the large number of simultaneous major compactions. You can avoid this by manually splitting tables over time. |
Installation and Setup
You can install and configure HBase for your HDP cluster by using either of the following methods:
Ambari installation wizard
The wizard is the part of the Apache Ambari web-based platform that guides HDP installation, including deploying various Hadoop components, such as HBase, depending on the needs of your cluster. See the Ambari Install Guide.
Manual installation
You can fetch one of the repositories bundled with HBase and install it on the command line. See the Non-Ambari Installation Guide.
Important: Your HBase installation must be the same version as the one that is packaged with the distribution of the HDP stack version that is deployed across your cluster.
Cluster Capacity and Region Sizing
This section provides information to help you plan the capacity of an HBase cluster and the size of its RegionServers.
Node Count and JVM Configuration
The number of nodes in an HBase cluster is typically driven by the physical size of the data set and the required read/write throughput.
Physical Size of the Data
The physical size of data on disk is affected by the following factors:
Factor Affecting Size of Physical Data | Description |
---|---|
HBase Overhead | The default amount of disk space required for a single HBase table cell. Smaller table cells require less overhead. The minimum cell size is 24 bytes and the default maximum is 10485760 bytes. You can customize the maximum cell size by using the hbase.client.keyvalue.maxsize property in the hbase-site.xml configuration file. |
Compression | You should choose the data compression tool that is most appropriate to reducing the physical size of your data on disk. Although HBase is not shipped with LZO due to licensing issues, you can install LZO after installing HBase. GZIP provides better compression than LZO but is slower. HBase also supports Snappy. |
HDFS Replication | HBase uses HDFS for storage, so replicating HBase data stored in HDFS affects the total physical size of data. A typical replication factor of 3 for all HBase tables in a cluster triples the physical size of the stored data. |
RegionServer Write Ahead Log (WAL) | The size of the Write Ahead Log, or WAL, for each RegionServer has minimal impact on the physical size of data: it is typically fixed at less than half of the memory for the RegionServer. The data size of the WAL is usually not configured. |
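For example, a minimal hbase-site.xml sketch that lowers the maximum cell size is shown below; the 1 MB value is purely illustrative and not a recommendation from this guide:

<property>
  <name>hbase.client.keyvalue.maxsize</name>
  <!-- Illustrative 1 MB cap on cell size; the default maximum is 10485760 bytes -->
  <value>1048576</value>
</property>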
Read/Write Throughput
The number of nodes in an HBase cluster might also be driven by required throughput for disk reads and writes. The throughput per node greatly depends on table cell size and data request patterns, as well as node and cluster configuration. You can use YCSB tools to test the throughput of a single node or a cluster to determine if read/write throughput should drive the number of nodes in your HBase cluster. A typical throughput for write operations for one RegionServer is 5 through 15 MB/s. Unfortunately, there is no good estimate for read throughput, which varies greatly depending on physical data size, request patterns, and hit rate for the block cache.
Region Count and Size
In general, an HBase cluster runs more smoothly with fewer regions. Although administrators cannot directly configure the number of regions for a RegionServer, they can indirectly decrease the number of regions in the following ways:
Increase the size of the MemStore for a RegionServer
Increase the size of a region
Administrators can also increase the number of regions for a RegionServer by splitting large regions to spread data and the request load across the cluster. HBase enables administrators to configure each HBase table individually, which is useful when tables have different workloads and use cases. Most region settings can be set on a per-table basis by using the HTableDescriptor class, as well as by using the HBase CLI. These methods override the properties in the hbase-site.xml configuration file. For further information, see Configuring Compactions.
Note: The HDFS replication factor defined in the previous table affects only disk usage and should not be considered when planning the size of regions.
Increase MemStore Size for RegionServer
Use of the RegionServer MemStore largely determines the maximum number of regions for the RegionServer. Each region has one MemStore for each column family, which grows to a configurable size, usually between 128 and 256 MB. Administrators specify this size by using the hbase.hregion.memstore.flush.size property in the hbase-site.xml configuration file. The RegionServer dedicates some fraction of total memory to region MemStores based on the value of the hbase.regionserver.global.memstore.size configuration property. If usage exceeds this configurable size, HBase might become unresponsive or compaction storms might occur.
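The following hbase-site.xml sketch shows where these two properties are set; the values (a 128 MB flush size and a 0.4 global MemStore fraction) are illustrative and match the example that follows:

<!-- Per-column-family MemStore flush size (128 MB) -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>

<!-- Fraction of RegionServer memory dedicated to all MemStores (0.4) -->
<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.4</value>
</property>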
You can use the following formula to estimate the number of regions for a RegionServer:
(regionserver_memory_size) * (memstore_fraction) / ((memstore_size) * (num_column_families))
For example, assume that your environment uses the following configuration:
RegionServer with 16 GB RAM (or 16384 MB)
MemStore fraction of .4
MemStore with 128 MB RAM
One column family in table
The formula for this configuration is as follows:
(16384 MB * .4) / (128 MB * 1) = approximately 51 regions
The easiest way to decrease the number of regions for this example RegionServer is to increase the MemStore size to 256 MB. The reconfigured RegionServer then has approximately 25 regions, and the HBase cluster runs more smoothly if the reconfiguration is applied to all RegionServers in the cluster. The formula can be used for multiple tables with the same configuration by using the total number of column families in all the tables.
Note: The formula is based on the assumption that all regions are filled at approximately the same rate. If only a fraction of the cluster's regions are written to, divide the result by this fraction.
If the data request pattern is dominated by write operations rather than read operations, you should increase the MemStore fraction. However, this increase negatively impacts the block cache.
Increase Size of Region
The other way to indirectly decrease the number of regions for a RegionServer is to increase the size of the region by using the hbase.hregion.max.filesize property in the hbase-site.xml configuration file. Administrators decrease the number of regions for a RegionServer by increasing the specified size at which new regions are dynamically allocated.
Maximum region size is primarily limited by compactions. Very large compactions can degrade cluster performance. The recommended maximum region size is 10 through 20 GB. For HBase clusters running version 0.90.x, the maximum recommended region size is 4 GB and the default is 256 MB. If you are unable to estimate the size of your tables, you should retain the default value. You should increase the region size only if your table cells tend to be 100 KB or larger.
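For example, a hedged hbase-site.xml sketch that allocates new regions at the low end of the recommended 10 through 20 GB range might look like the following; the 10 GB value is illustrative:

<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 10 GB in bytes; a region is split when it reaches this size -->
  <value>10737418240</value>
</property>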
Note: HBase 0.98 introduced stripe compactions as an experimental feature that also enables administrators to increase the size of regions. For more information, see Experimental: Stripe Compactions on the Apache HBase website.
Initial Tuning of the Cluster
HBase administrators typically use the following methods to initially configure the cluster:
Increasing the request handler thread count
Configuring the size and number of WAL files
Configuring compactions
Splitting tables
Tuning JVM garbage collection in RegionServers
Increasing the Request Handler Thread Count
Administrators who expect their HBase cluster to experience a high volume request pattern should increase the number of listeners generated by the RegionServers. You can use the hbase.regionserver.handler.count property in the hbase-site.xml configuration file to set the number higher than the default value of 30.
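A hedged hbase-site.xml sketch that doubles the handler count is shown below; the value 60 is illustrative rather than a recommendation for every workload:

<property>
  <name>hbase.regionserver.handler.count</name>
  <!-- Default is 30; increase for high-volume request patterns -->
  <value>60</value>
</property>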
Configuring the Size and Number of WAL Files
HBase uses the Write Ahead Log, or WAL, to recover MemStore data not yet flushed to disk if a RegionServer crashes. Administrators should configure these WAL files to be slightly smaller than the HDFS block size. By default, an HDFS block is 64 MB and a WAL is approximately 60 MB. You should ensure that enough WAL files are allocated to contain the total capacity of the MemStores. Use the following formula to determine the number of WAL files needed:
(regionserver_heap_size * memstore_fraction) / (default_WAL_size)
For example, assume that your environment has the following HBase cluster configuration:
16 GB RegionServer heap
0.4 MemStore fraction
60 MB default WAL size
The formula for this configuration is as follows:
(16384 MB * 0.4) / 60 MB = approximately 109 WAL files
Use the following properties in the hbase-site.xml configuration file to configure the size and number of WAL files:
Configuration Property | Description | Default |
---|---|---|
hbase.regionserver.maxlogs | Sets the maximum number of WAL files | 32 |
hbase.regionserver.logroll.multiplier | Multiplier of HDFS block size | 0.95 |
hbase.regionserver.hlog.blocksize | Optional override of HDFS block size | Value assigned to actual HDFS block size |
Note: If recovery from failure takes longer than expected, try reducing the number of WAL files to improve performance.
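Based on the calculation above (approximately 109 WAL files for a 16 GB heap and a 0.4 MemStore fraction), a hedged hbase-site.xml sketch might raise the maximum number of WAL files as follows; the value 110 is illustrative:

<property>
  <name>hbase.regionserver.maxlogs</name>
  <!-- Default is 32; sized here to cover the total MemStore capacity -->
  <value>110</value>
</property>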
Configuring Compactions
Administrators who expect their HBase clusters to host large amounts of data should consider the effect that compactions have on write throughput. For write-intensive data request patterns, administrators should consider less frequent compactions and more store files per region. Use the hbase.hstore.compaction.min property in the hbase-site.xml configuration file to increase the minimum number of files required to trigger a compaction. Administrators opting to increase this value should also increase the value assigned to the hbase.hstore.blockingStoreFiles property because more files will accumulate.
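For example, a hedged hbase-site.xml sketch for a write-intensive workload might raise both values; the numbers shown are illustrative assumptions, not defaults from this guide:

<!-- Require more store files before a minor compaction is triggered -->
<property>
  <name>hbase.hstore.compaction.min</name>
  <value>5</value>
</property>

<!-- Allow more store files to accumulate before writes are blocked -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>20</value>
</property>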
Splitting Tables
Administrators can split tables during table creation based on the target number of regions per RegionServer to avoid costly dynamic splitting as the table starts to fill. In addition, pre-splitting ensures that the regions in the pre-split table are distributed across many host machines. Pre-splitting a table avoids the cost of compactions required to rewrite the data into separate physical files during automatic splitting.
If a table is expected to grow very large, administrators should create at least one region per RegionServer. However, you should not immediately split the table into the total number of desired regions. Rather, choose a low to intermediate value. For multiple tables, you should not create more than one region per RegionServer, especially if you are uncertain how large the table will grow. Creating too many regions for a table that will never exceed 100 MB is not useful; a single region can adequately service a table of this size.
Tuning JVM Garbage Collection in RegionServers
A RegionServer cannot utilize a very large heap due to the cost of garbage collection. Administrators should specify no more than 24 GB for one RegionServer.
To tune garbage collection in HBase RegionServers for stability, make the following configuration changes:
Specify the following configurations in the HBASE_REGIONSERVER_OPTS configuration option in the /conf/hbase-env.sh file:
-XX:+UseConcMarkSweepGC -Xmn2500m (depends on MAX HEAP SIZE, but should not be less than 1g or more than 4g) -XX:PermSize=128m -XX:MaxPermSize=128m -XX:SurvivorRatio=4 -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:ErrorFile=/var/log/hbase/hs_err_pid%p.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
Ensure that the block cache size and the MemStore size combined do not significantly exceed 0.5*MAX_HEAP, which is defined in the HBASE_HEAP_SIZE configuration option of the /conf/hbase-env.sh file.
Enabling Multitenancy with Namespaces
A namespace is a logical grouping of tables analogous to a database or a schema in a relational database system. With namespaces, a group of users can share access to a set of tables but the users can be assigned different privileges. Similarly, one application can run using the tables in a namespace simultaneously with other applications. Each group of users and each application with access to the instance of the tables defined as a namespace is a tenant.
A namespace can support varying ACL-based security modules that can exist among different tenants. Read/write permissions based on groups and users with access to one instance of the namespace function independently from the permissions in another instance.
Unlike relational databases, HBase table names can contain a dot (.). Therefore, HBase uses a different separator, a colon (:), between the namespace name and the table name. For example, a table with the name store1 in a namespace that is called orders has orders:store1 as its fully qualified table name. If you do not assign a table to a namespace, then the table belongs to the special default namespace.
The namespace file, which contains the objects and data for the tables assigned to a namespace, is stored in a subdirectory of the HBase root directory ($hbase.rootdir) on the HDFS layer of your cluster. If $hbase.rootdir is at the default location, the path to the namespace file and table is /apps/hbase/data/data/namespace/table_name.
Example 5.1. Simple Example of Namespace Usage
A software company develops applications with HBase. Developers and quality-assurance (QA) engineers who are testing the code must have access to the same HBase tables that contain sample data for testing. The HBase tables with sample data are a subset of all HBase tables on the system. Developers and QA engineers have different goals in their interaction with the tables and need to separate their data read/write privileges accordingly.
By assigning the sample-data tables to a namespace, access privileges can be provisioned appropriately so that QA engineers do not overwrite developers' work and vice versa. As tenants of the sample-data namespace, developers and QA engineers who are logged in as users of this namespace domain do not access HBase tables in other domains. For security and ease of use, this helps ensure that not every user can view all tables on the HBase cluster.
Default HBase Namespace Actions
Tip: If you do not require multitenancy or formalized schemas for HBase data, then do not concern yourself with namespace definitions and assignments. HBase automatically assigns a default namespace when you create a table and do not associate it with a namespace.
The default namespaces are the following:
- hbase
A namespace that is used to contain HBase internal system tables
- default
A namespace that contains all other tables when you do not assign a specific user-defined namespace
Defining and Dropping Namespaces
Important: You can assign a table to only one namespace, and you should ensure that the table correctly belongs to the namespace before you make the association in HBase. You cannot change the namespace that is assigned to the table later.
The HBase shell has a set of straightforward commands for creating and dropping namespaces. You can assign a table to a namespace when you create the table.
- create_namespace 'my_ns'
Creates a namespace with the name my_ns.
- create 'my_ns:my_table', 'fam1'
Creates my_table with a column family identified as fam1 in the my_ns namespace.
- drop_namespace 'my_ns'
Removes the my_ns namespace from the system. The command only functions when there are no tables with data that are assigned to the namespace.
Security Features Available in Technical Preview
The following security features are in Hortonworks Technical Preview:
Cell-level access control lists (cell-level ACLs): These ACLs are supported in tables of HBase 0.98 and later versions.
Column family encryption: This feature is supported in HBase 0.98 and later versions.
Important: Cell-level ACLs and column family encryption are considered under development. Do not use these features in your production systems. If you have questions about these features, contact Support by logging a case on the Hortonworks Support Portal.