Using Impala to Query HBase Tables
You can use Impala to query HBase tables. This capability allows convenient access to a storage system that is tuned for different kinds of workloads than the default with Impala. The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, HBase can do efficient queries for data organized for OLTP-style workloads, with lookups of individual rows or ranges of values.
From the perspective of an Impala user, coming from an RDBMS background, HBase is a kind of key-value store where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the various fields of the value are mapped to the other columns in the Impala table.
For background information on HBase, see the snapshot of the Apache HBase site (including documentation) for the level of HBase that comes with CDH 4 or CDH 5. To install HBase on a CDH cluster, see the installation instructions for CDH 4 or CDH 5.
- Overview of Using HBase with Impala
- Configuring HBase for Use with Impala
- Supported Data Types for HBase Columns
- Performance Considerations for the Impala-HBase Integration
- Use Cases for Querying HBase through Impala
- Loading Data into an HBase Table
- Limitations and Restrictions of the Impala and HBase Integration
- Examples of Querying HBase Tables from Impala
Overview of Using HBase with Impala
When you use Impala with HBase:
- You create the tables on the Impala side using the Hive shell, because the Impala CREATE TABLE statement currently does not support custom SerDes and some other syntax needed for these tables:
- You designate it as an HBase table using the STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' clause on the Hive CREATE TABLE statement.
- You map these specially created tables to corresponding tables that exist in HBase, with the clause TBLPROPERTIES("hbase.table.name" = "table_name_in_hbase") on the Hive CREATE TABLE statement.
- See Examples of Querying HBase Tables from Impala for a full example.
- You define the column corresponding to the HBase row key as a string with the #string keyword, or map it to a STRING column.
- Because Impala and Hive share the same metastore database, once you create the table in Hive, you can query or insert into it through Impala. (After creating a new table through Hive, issue the INVALIDATE METADATA statement in impala-shell to make Impala aware of the new table.)
- You issue queries against the Impala tables. For efficient queries, use WHERE clauses to find a single key value or a range of key values wherever practical, by testing the Impala column corresponding to the HBase row key. Avoid queries that do full-table scans, which are efficient for regular Impala tables but inefficient in HBase.
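The last two steps can be sketched as follows for impala-shell. This is a minimal illustration, assuming the hbasestringids table from Examples of Querying HBase Tables from Impala has just been created through Hive; the key values are placeholders.

```sql
-- Make Impala aware of the table that was just created through Hive.
INVALIDATE METADATA;

-- Single-row lookup on the column mapped to the HBase row key.
SELECT * FROM hbasestringids WHERE id = '5';

-- Range lookup on the row key. String keys compare lexicographically,
-- so this range covers keys such as '10', '15', and '19'.
SELECT id, string_col FROM hbasestringids WHERE id BETWEEN '10' AND '19';
```

Prefix either query with EXPLAIN to confirm that the plan shows start key and stop key lines rather than a predicates line.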
To work with an HBase table from Impala, ensure that the impala user has read/write privileges for the HBase table, using the GRANT command in the HBase shell. For details about HBase security, see http://hbase.apache.org/book/ch08s04.html#hbase.accesscontrol.configuration.
Configuring HBase for Use with Impala
HBase works out of the box with Impala. There is no mandatory configuration needed to use these two components together.
To avoid delays if HBase is unavailable during Impala startup or after an INVALIDATE METADATA statement, Cloudera recommends setting timeout values as follows in /etc/impala/conf/hbase-site.xml (for environments not managed by Cloudera Manager):
<property>
  <name>hbase.client.retries.number</name>
  <value>3</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>3000</value>
</property>
Currently, Cloudera Manager does not have an Impala-only override for HBase settings, so any HBase configuration change you make through Cloudera Manager would take effect for all HBase applications. Therefore, this change is not recommended on systems managed by Cloudera Manager.
Supported Data Types for HBase Columns
To understand how Impala column data types are mapped to fields in HBase, you should have some background knowledge about HBase first. You set up the mapping by running the CREATE TABLE statement in the Hive shell. See the Hive wiki for a starting point, and Examples of Querying HBase Tables from Impala for examples.
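HBase works as a kind of untyped byte store: it does not enforce any typing for the key or value fields, so all type enforcement is done on the Impala side.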
For best performance of Impala queries against HBase tables, most queries should perform comparisons in the WHERE clause against the column that corresponds to the HBase row key. When creating the table through the Hive shell, use the STRING data type for the column that corresponds to the HBase row key. Impala can translate conditional tests (through operators such as =, <, BETWEEN, and IN) against this column into fast lookups in HBase, but this optimization (predicate pushdown) only works when that column is defined as STRING.
Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive CREATE TABLE statement using binary data types, represented in the Hive table definition using the #binary keyword, often abbreviated as #b. Defining numeric columns as binary can reduce the overall data volume in the HBase tables. You should still define the column that corresponds to the HBase row key as a STRING, to allow fast lookups using those columns.
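A sketch of such a mapping, to be run in the Hive shell. It assumes a hypothetical HBase table named page_stats with a single column family named stats; the table, column family, and column names are illustrative only. The #b suffix in hbase.columns.mapping marks a column as binary-encoded, while the row key stays a STRING.

```sql
-- Run in the Hive shell, not in impala-shell.
-- page_stats, stats, and the column names are hypothetical.
CREATE EXTERNAL TABLE page_stats (
  page_id STRING,      -- HBase row key, kept as STRING for fast lookups
  view_count BIGINT,   -- stored in HBase in binary form
  bounce_rate DOUBLE   -- stored in HBase in binary form
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,stats:view_count#b,stats:bounce_rate#b"
)
TBLPROPERTIES ("hbase.table.name" = "page_stats");
```

After creating the table, issue INVALIDATE METADATA in impala-shell before querying it from Impala.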
Performance Considerations for the Impala-HBase Integration
To understand the performance characteristics of SQL queries against data stored in HBase, you should have some background knowledge about how HBase interacts with SQL-oriented systems first. See the Hive wiki for a starting point; because Impala shares the same metastore database as Hive, the information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too.
Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This querying does not read HFiles directly. The extra communication overhead makes it important to choose what data to store in HBase or in HDFS, and to construct queries that can retrieve the HBase data efficiently:
- Use HBase tables for queries that return a single row or a range of rows, not queries that scan the entire table. (If a query has no WHERE clause, that is a strong indicator that it is an inefficient query for an HBase table.)
- If you have join queries that do aggregation operations on large fact tables and join the results against small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables. (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase lookups based on the join column, only use this technique where the HBase table is small enough that doing a full table scan does not cause a performance bottleneck for the query.)
Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison operations do not work. For example, if row keys are not mapped to string columns, evaluating for greater than (>) or less than (<) cannot be completed.
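The mismatch arises because HBase orders row keys as byte strings rather than as numbers. A quick way to see the difference, runnable in impala-shell with no tables involved, is to compare the same two values as strings and as integers:

```sql
-- String comparison proceeds character by character: '1' sorts before '5',
-- so the string '10' is "less than" the string '5'.
SELECT '10' < '5' AS string_compare;   -- true

-- Numeric comparison of the same values gives the opposite answer.
SELECT 10 < 5 AS numeric_compare;      -- false
```

A start key and stop key built from numeric values would therefore bracket the wrong set of rows, which is why key predicates become scan boundaries only when the row key is mapped to a STRING column.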
Predicates on non-key columns can be sent to HBase to scan as SingleColumnValueFilters, providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates were applied using Impala. While there is some improvement, it is not as great as when start and stop rows are used. This is because the number of rows that HBase must examine is not limited as it is when start and stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and return that row. Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase must still scan the entire table to find the correct result.
Interpreting EXPLAIN Output for HBase Queries
For example, here are some queries against the following Impala table, which is mapped to an HBase table. The examples show excerpts from the output of the EXPLAIN statement, demonstrating what things to look for to indicate an efficient or inefficient query against an HBase table.
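The first column (cust_id) was specified as the key column in the CREATE EXTERNAL TABLE statement; for performance, it is important to declare this column as STRING. Other columns, such as BIRTH_YEAR and NEVER_LOGGED_ON, are also declared as STRING, rather than their natural types of INT or BOOLEAN, because Impala can push predicates down to HBase only for STRING columns. One column, YEAR_REGISTERED, is left as INT to show that filtering on it is inefficient.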
describe hbase_table;
Query: describe hbase_table
+-----------------------+--------+---------+
| name                  | type   | comment |
+-----------------------+--------+---------+
| cust_id               | string |         |
| birth_year            | string |         |
| never_logged_on       | string |         |
| private_email_address | string |         |
| year_registered       | int    |         |
+-----------------------+--------+---------+
The best case for performance involves a single row lookup using an equality comparison on the column defined as the row key:
explain select count(*) from hbase_table where cust_id = 'some_user@example.com';
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.01GB VCores=1                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table                                                                  |
|                                                                                    |
| 03:AGGREGATE [MERGE FINALIZE]                                                      |
| |  output: sum(count(*))                                                           |
| |                                                                                  |
| 02:EXCHANGE [PARTITION=UNPARTITIONED]                                              |
| |                                                                                  |
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HBASE [hbase.hbase_table]                                                  |
|    start key: some_user@example.com                                                |
|    stop key: some_user@example.com\0                                               |
+------------------------------------------------------------------------------------+
Another type of efficient query involves a range lookup on the row key column, using SQL operators such as greater than (or equal), less than (or equal), or BETWEEN. This example also includes an equality test on a non-key column; because that column is a STRING, Impala can let HBase perform that test, indicated by the hbase filters: line in the EXPLAIN output. Doing the filtering within HBase is more efficient than transmitting all the data to Impala and doing the filtering on the Impala side.
explain select count(*) from hbase_table where cust_id between 'a' and 'b' and never_logged_on = 'true';
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HBASE [hbase.hbase_table]                                                  |
|    start key: a                                                                    |
|    stop key: b\0                                                                   |
|    hbase filters: cols:never_logged_on EQUAL 'true'                                |
+------------------------------------------------------------------------------------+
The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the entire HBase table. Impala can only push down predicates to HBase for columns declared as STRING. This example tests a column declared as INT, and the predicates: line in the EXPLAIN output indicates that the test is performed after the data is transmitted to Impala.
explain select count(*) from hbase_table where year_registered = 2010;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HBASE [hbase.hbase_table]                                                  |
|    predicates: year_registered = 2010                                              |
+------------------------------------------------------------------------------------+
The same inefficiency applies if the key column is compared to any non-constant value. Here, even though the key column is a STRING, and is tested using an equality operator, Impala must scan the entire HBase table because the key column is compared to another column value rather than a constant.
explain select count(*) from hbase_table where cust_id = private_email_address;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HBASE [hbase.hbase_table]                                                  |
|    predicates: cust_id = private_email_address                                     |
+------------------------------------------------------------------------------------+
Currently, tests on the row key using OR or IN clauses are not optimized into direct lookups either. Such limitations might be lifted in the future, so always check the EXPLAIN output to be sure whether a particular SQL construct results in an efficient query or not for HBase tables.
explain select count(*) from hbase_table where cust_id = 'some_user@example.com' or cust_id = 'other_user@example.com';
+----------------------------------------------------------------------------------------+
| Explain String                                                                         |
+----------------------------------------------------------------------------------------+
...
| 01:AGGREGATE                                                                           |
| |  output: count(*)                                                                    |
| |                                                                                      |
| 00:SCAN HBASE [hbase.hbase_table]                                                      |
|    predicates: cust_id = 'some_user@example.com' OR cust_id = 'other_user@example.com' |
+----------------------------------------------------------------------------------------+

explain select count(*) from hbase_table where cust_id in ('some_user@example.com', 'other_user@example.com');
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HBASE [hbase.hbase_table]                                                  |
|    predicates: cust_id IN ('some_user@example.com', 'other_user@example.com')      |
+------------------------------------------------------------------------------------+
Either rewrite into separate queries for each value and combine the results in the application, or combine the single-row queries using UNION ALL:
select count(*) from hbase_table where cust_id = 'some_user@example.com';
select count(*) from hbase_table where cust_id = 'other_user@example.com';

explain
  select count(*) from hbase_table where cust_id = 'some_user@example.com'
  union all
  select count(*) from hbase_table where cust_id = 'other_user@example.com';
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
...
| |  04:AGGREGATE                                                                    |
| |  |  output: count(*)                                                             |
| |  |                                                                               |
| |  03:SCAN HBASE [hbase.hbase_table]                                               |
| |     start key: other_user@example.com                                            |
| |     stop key: other_user@example.com\0                                           |
| |                                                                                  |
| 10:MERGE                                                                           |
...
| 02:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 01:SCAN HBASE [hbase.hbase_table]                                                  |
|    start key: some_user@example.com                                                |
|    stop key: some_user@example.com\0                                               |
+------------------------------------------------------------------------------------+
Configuration Options for Java HBase Applications
If you have an HBase Java application that calls the setCacheBlocks or setCaching methods of the class org.apache.hadoop.hbase.client.Scan, you can set these same caching behaviors through Impala query options, to control the memory pressure on the HBase region server. For example, when doing queries in HBase that result in full-table scans (which by default are inefficient for HBase), you can reduce memory usage and speed up the queries by turning off the HBASE_CACHE_BLOCKS setting and specifying a large number for the HBASE_CACHING setting.
To set these options, issue commands like the following in impala-shell:
-- Same as calling setCacheBlocks(true) or setCacheBlocks(false).
set hbase_cache_blocks=true;
set hbase_cache_blocks=false;

-- Same as calling setCaching(rows).
set hbase_caching=1000;
Or update the impalad defaults file /etc/default/impala and include settings for HBASE_CACHE_BLOCKS and/or HBASE_CACHING in the -default_query_options setting for IMPALA_SERVER_ARGS. See Modifying Impala Startup Options for details.
Use Cases for Querying HBase through Impala
The following are popular use cases for using Impala to query HBase tables:
- Keeping large fact tables in Impala, and smaller dimension tables in HBase. The fact tables use Parquet or other binary file format optimized for scan operations. Join queries scan through the large Impala fact tables, and cross-reference the dimension tables using efficient single-row lookups in HBase.
- Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or on a social network, how many connections a user has or how many votes a post received. HBase is efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing each change to disk, and a query always returns the latest value. An application could query specific totals like these from HBase, and combine the results with a broader set of data queried from Impala.
- Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically recording many attributes for an important subject such as a user of an online service. These tables are also often sparse, that is, most of the column values are NULL, 0, false, empty string, or some other blank or placeholder value. (For example, any particular web site user might have never used some site feature, filled in a certain field in their profile, visited a particular part of the site, and so on.) A typical query against this kind of table is to look up a single row to retrieve all the information about a specific subject, rather than summing, averaging, or filtering millions of rows as in typical Impala-managed tables.
Or the HBase table could be joined with a larger Impala-managed table. For example, analyze the large Impala table representing web traffic for a site and pick out 50 users who view the most pages. Join that result with the wide user table in HBase to look up attributes of those users. The HBase side of the join would result in 50 efficient single-row lookups in HBase, rather than scanning the entire user table.
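The join described above might look like the following sketch. All names are hypothetical: web_traffic stands for a large HDFS-backed Impala table, and hbase_users for a wide HBase-backed table whose row key column is user_id.

```sql
-- Hypothetical tables and columns, for illustration only.
-- The inline view picks the 50 users with the most page views;
-- the outer join looks up their attributes in the HBase-backed table.
SELECT u.user_id, u.country, u.signup_date, top_users.page_views
FROM (
  SELECT user_id, count(*) AS page_views
  FROM web_traffic
  GROUP BY user_id
  ORDER BY page_views DESC
  LIMIT 50
) top_users
JOIN hbase_users u ON u.user_id = top_users.user_id;
```

Because the performance section notes that a join can cause Impala to scan the whole HBase table, run the statement with EXPLAIN first and check how the SCAN HBASE node is planned before relying on this pattern for a large user table.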
Loading Data into an HBase Table
The Impala INSERT statement works for HBase tables. The INSERT ... VALUES syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by INSERT ... VALUES are extremely inefficient, so you would not use that technique with tables containing any significant data volume.)
When you use the INSERT ... SELECT syntax, the result in the HBase table could be fewer rows than you expect. HBase only stores the most recent version of each unique row key, so if an INSERT ... SELECT statement copies over multiple rows containing the same value for the key column, subsequent queries will only return one row with each key column value:
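A minimal sketch of this behavior, using hypothetical names: staging_users is a regular Impala table that may contain duplicate, non-NULL user_id values, and hbase_users_by_id is an initially empty HBase-backed table with columns (user_id STRING, name STRING), where user_id is mapped to the row key.

```sql
-- Compare the number of source rows with the number of distinct keys.
SELECT count(*) AS source_rows, count(DISTINCT user_id) AS distinct_keys
FROM staging_users;

-- Copy the rows; rows that share a user_id overwrite one another in HBase.
INSERT INTO hbase_users_by_id
SELECT user_id, name FROM staging_users;

-- Expect this count to equal distinct_keys, not source_rows.
SELECT count(*) FROM hbase_users_by_id;
```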
Although Impala does not have an UPDATE statement, you can achieve the same effect by doing successive INSERT statements using the same value for the key column each time:
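For illustration, assume a hypothetical HBase-backed table page_counters with columns (page_id STRING, views BIGINT), where page_id is mapped to the HBase row key:

```sql
-- First value for the key 'home'.
INSERT INTO page_counters VALUES ('home', 100);

-- Same key again: the new value replaces the old one instead of adding a row.
INSERT INTO page_counters VALUES ('home', 101);

-- Returns a single row for 'home', with views = 101.
SELECT page_id, views FROM page_counters WHERE page_id = 'home';
```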
Limitations and Restrictions of the Impala and HBase Integration
The Impala integration with HBase has the following limitations and restrictions, some inherited from the integration between HBase and Hive, and some unique to Impala:
- If you issue a DROP TABLE for an internal (Impala-managed) table that is mapped to an HBase table, the underlying table is not removed in HBase. By contrast, the Hive DROP TABLE statement does remove the HBase table in this case.
- The INSERT OVERWRITE statement is not available for HBase tables. You can insert new data, or modify an existing row by inserting a new row with the same key value, but not replace the entire contents of the table. You can do an INSERT OVERWRITE in Hive if you need this capability.
- If you issue a CREATE TABLE LIKE statement for a table mapped to an HBase table, the new table is also an HBase table, but inherits the same underlying HBase table name as the original. The new table is effectively an alias for the old one, not a new table with identical column structure. Avoid using CREATE TABLE LIKE for HBase tables, to avoid any confusion.
- Copying data into an HBase table using the Impala INSERT ... SELECT syntax might produce fewer new rows than are in the query result set. If the result set contains multiple rows with the same value for the key column, each row supersedes any previous rows with the same key value. Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve the "latest" version of a particular key value.
Examples of Querying HBase Tables from Impala
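The following examples use HBase with the following table definition. Note that in the HBase shell, the table name is quoted in CREATE and DROP statements. Tables created in HBase begin in the enabled state; before dropping one through the HBase shell, you must first issue a disable 'table_name' statement.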
$ hbase shell
...
create 'hbasealltypessmall', 'bools', 'ints', 'floats', 'strings'
quit
With a String Row Key
Issue the following CREATE TABLE statement in the Hive shell. (The Impala CREATE TABLE statement currently does not support all the required clauses, so you switch into Hive to create the table, then back to Impala and the impala-shell interpreter to issue the queries.)
This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all. The STORED BY clause is the clause not currently supported by Impala that requires using the Hive shell for the CREATE TABLE. The WITH SERDEPROPERTIES clause specifies that the first column (ID) represents the row key, and maps the remaining columns of the SQL table to HBase column families. The first column is defined to be the lookup key; the STRING data type produces the fastest key-based lookups for HBase tables.
$ hive
...
hive> CREATE EXTERNAL TABLE hbasestringids (
  id string,
  bool_col boolean,
  tinyint_col tinyint,
  smallint_col smallint,
  int_col int,
  bigint_col bigint,
  float_col float,
  double_col double,
  date_string_col string,
  string_col string,
  timestamp_col timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
  ":key,bools:bool_col,ints:tinyint_col,ints:smallint_col,ints:int_col,ints:bigint_col,floats:float_col,floats:double_col,strings:date_string_col,strings:string_col,strings:timestamp_col"
)
TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");
Without a String Row Key
This example defines the lookup key column as INT instead of STRING.
Again, issue the following CREATE TABLE statement through Hive, then switch back to Impala and the impala-shell interpreter to issue the queries.
$ hive
...
CREATE EXTERNAL TABLE hbasealltypessmall (
  id int,
  bool_col boolean,
  tinyint_col tinyint,
  smallint_col smallint,
  int_col int,
  bigint_col bigint,
  float_col float,
  double_col double,
  date_string_col string,
  string_col string,
  timestamp_col timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
  ":key,bools:bool_col,ints:tinyint_col,ints:smallint_col,ints:int_col,ints:bigint_col,floats:float_col,floats:double_col,strings:date_string_col,strings:string_col,strings:timestamp_col"
)
TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");
Example Queries
Once you have established the mapping to an HBase table, you can issue queries.
-- If the row key is mapped as a string column, range predicates are applied to the scan.
select * from hbasestringids where id = '5';

-- The predicate on the row key is not transformed into a scan parameter, because
-- the key is mapped as an int (but stored in ASCII and ordered lexicographically).
select * from hbasealltypessmall where id < 5;