Importing Data into HBase

Chapter 1. Importing Data into HBase

Bulk import bypasses the HBase API and writes content, properly formatted as HBase data files (HFiles), directly to the file system. It uses fewer CPU and network resources than loading the same data through the HBase API.

To bulk load data into HBase using Pig:

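  1. Prepare the input file. The following data.tsv file is an example input file. In the actual file, the fields on each line must be separated by a tab character, because the Pig script in step 5 splits each line on tabs: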

    row1 c1 c2
    row2 c1 c2
    row3 c1 c2
    row4 c1 c2
    row5 c1 c2
    row6 c1 c2
    row7 c1 c2
    row8 c1 c2
    row9 c1 c2
    row10 c1 c2
  2. Make the data available on the cluster.

    hadoop fs -put $filename /tmp/

    For example:

    hadoop fs -put data.tsv /tmp/ 
  3. Define the HBase schema for the data. Continuing with the data.tsv example, create a script file called simple.ddl, which contains the HBase schema for data.tsv:

    CREATE TABLE simple_hcat_load_table (id STRING, c1 STRING, c2 STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ( 'hbase.columns.mapping' = 'd:c1,d:c2' )
    TBLPROPERTIES ( 'hbase.table.name' = 'simple_hcat_load_table' );
  4. Create and register the HBase table in HCatalog.

    hcat -f $ddl_script

    For example, the following command runs the DDL script simple.ddl:

    hcat -f simple.ddl
  5. Create the import file.

    The following example instructs Pig to load data from data.tsv and store it in simple_hcat_load_table. This example assumes that you have saved the following statements in a file named simple.bulkload.pig.

    A = LOAD 'hdfs:///tmp/data.tsv' USING PigStorage('\t') AS (id:chararray, c1:chararray, c2:chararray);
    -- DUMP A;
    STORE A INTO 'simple_hcat_load_table' USING org.apache.hive.hcatalog.pig.HCatStorer();

    Note: Modify the file names and table schema for your environment.

  6. Use Pig to populate the HBase table through HCatalog bulk load.

    Continuing with the example, run the following command on your HBase server machine:

    pig -useHCatalog simple.bulkload.pig
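
The Pig script in step 5 splits each input line on tab characters, so a data.tsv file whose fields are separated by spaces would not load as three columns. The following is a minimal sketch, not part of the original procedure, that generates the ten-row sample file with literal tabs and checks that every line has exactly three fields before you upload it in step 2. The variable names `out` and `bad` are illustrative only.

```shell
#!/bin/sh
# Generate the ten-row sample input file from step 1.
# printf expands \t to a literal tab, which PigStorage('\t') expects.
out=data.tsv
: > "$out"
i=1
while [ "$i" -le 10 ]; do
  printf 'row%d\tc1\tc2\n' "$i" >> "$out"
  i=$((i + 1))
done

# Count lines that do not have exactly three tab-separated fields
# (id, c1, c2), matching the schema declared in the Pig LOAD statement.
bad=$(awk -F '\t' 'NF != 3 { n++ } END { print n + 0 }' "$out")
echo "rows: $(wc -l < "$out" | tr -d ' '), malformed: $bad"
```

If the script reports any malformed lines, fix the delimiters in the file before running the hadoop fs -put command in step 2.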