You can import data with a bulk load operation that bypasses the HBase API and
writes content, properly formatted as HBase data files (HFiles), directly to the file
system. A bulk load uses fewer CPU and network resources than the HBase API does for
the same amount of work.
The following recommended bulk load procedure uses Apache HCatalog and Apache Pig.
-
Prepare the input file, as shown in the following example input file,
data.tsv:
row1 c1 c2
row2 c1 c2
row3 c1 c2
row4 c1 c2
row5 c1 c2
row6 c1 c2
row7 c1 c2
row8 c1 c2
row9 c1 c2
row10 c1 c2
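The columns in data.tsv must be separated by literal tab characters, because a later step loads the file with PigStorage('\t'). As an optional sanity check (not part of the original procedure), you can make the delimiters visible before uploading the file:
# On GNU systems, cat -A renders each tab as ^I
cat -A data.tsv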
-
Make the data available on the cluster, as shown in this continuation of the
example:
hadoop fs -put data.tsv /tmp/
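Optionally, you can confirm that the file arrived intact (an extra check, not part of the original procedure):
# List the file, then print its contents
hadoop fs -ls /tmp/data.tsv
hadoop fs -cat /tmp/data.tsv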
-
Define the HBase schema for the data. This example creates a script file
called simple.ddl, which contains the HBase schema for data.tsv:
CREATE TABLE simple_hcat_load_table (id STRING, c1 STRING, c2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ( 'hbase.columns.mapping' = 'd:c1,d:c2' )
TBLPROPERTIES ( 'hbase.table.name' = 'simple_hcat_load_table'
);
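In this mapping, the first Hive column (id) serves as the HBase row key; many Hive versions infer this when :key is omitted from hbase.columns.mapping. If your Hive version requires the row key mapping to be explicit, an equivalent form of the same DDL (a variant, not part of the original procedure) is:
CREATE TABLE simple_hcat_load_table (id STRING, c1 STRING, c2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ( 'hbase.columns.mapping' = ':key,d:c1,d:c2' )
TBLPROPERTIES ( 'hbase.table.name' = 'simple_hcat_load_table'
);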
-
Create and register the HBase table in HCatalog by running the script:
hcat -f simple.ddl
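You can confirm that the table now exists by listing tables from the HBase shell (an optional check, not part of the original procedure):
echo "list" | hbase shell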
-
Create the import file. The following example instructs Pig to load data from
data.tsv and store it in simple_hcat_load_table.
For the purposes of this example, assume that you have saved the following statements in a
file named simple.bulkload.pig:
-- Load the tab-separated input from HDFS
A = LOAD 'hdfs:///tmp/data.tsv' USING PigStorage('\t') AS (id:chararray, c1:chararray, c2:chararray);
-- Uncomment the next line to inspect the loaded tuples:
-- DUMP A;
-- Write the relation into the HCatalog-registered HBase table
STORE A INTO 'simple_hcat_load_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
Note: Modify the file names and table schema for your environment.
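One way to avoid editing the script for each environment is Pig parameter substitution; the following sketch (a variant of the script above, not part of the original procedure) parameterizes the input path and table name:
A = LOAD '$INPUT' USING PigStorage('\t') AS (id:chararray, c1:chararray, c2:chararray);
STORE A INTO '$TABLE' USING org.apache.hive.hcatalog.pig.HCatStorer();
You would then supply the values on the command line, for example:
pig -useHCatalog -param INPUT=hdfs:///tmp/data.tsv -param TABLE=simple_hcat_load_table simple.bulkload.pig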
-
Execute the following command on your HBase server machine. The command directs Pig to populate
the HBase table by using an HCatalog bulk load:
pig -useHCatalog simple.bulkload.pig
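After the job completes, you can verify the load from the HBase shell (an optional check, not part of the original procedure); each row of data.tsv should appear with columns d:c1 and d:c2:
echo "scan 'simple_hcat_load_table'" | hbase shell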