1. Using Hive to Bulk Load Data Into HBase

Use the following instructions to bulk load data into HBase using Hive:

  1. Prepare the input files on the Hive client machine:

    1. Create a tables.ddl file using the following content:

      -- pagecounts data comes from http://dumps.wikimedia.org/other/pagecounts-raw/
      -- documented http://www.mediawiki.org/wiki/Analytics/Wikistats
      
      -- define an external table over the raw pagecounts data
      CREATE EXTERNAL TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
      ROW FORMAT
        DELIMITED FIELDS TERMINATED BY ' '
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/tmp/wikistats';
      
      -- create a view, building a custom hbase rowkey
      CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
      SELECT concat_ws('/',
               projectcode,
               concat_ws('/',
                 pagename,
                 regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
             pageviews, bytes
      FROM pagecounts;
      
      -- create a table to hold the input split partitions
      CREATE EXTERNAL TABLE IF NOT EXISTS hbase_splits(partition STRING)
      ROW FORMAT
        SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
      LOCATION '/tmp/hbase_splits_out';
      
      -- create a location to store the resulting HFiles
      CREATE TABLE hbase_hfiles(rowkey STRING, pageviews STRING, bytes STRING)
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
      TBLPROPERTIES('hfile.family.path' = '/tmp/hbase_hfiles/w');
      

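      The raw pagecounts files are plain space-delimited text, one line per page, matching the four columns declared for the pagecounts table above. For illustration only (values taken from the sample query output in step 4 below), input lines look like this:

        aa Main_Page 4 41431
        aa Special:ListUsers 1 5555
        aa Special:Listusers 1 1052
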
    2. Create an HFile splits file, for example sample.hql, as shown below:

      -- prepare range partitioning of hfiles
      
      ADD JAR /usr/lib/hive/lib/hive-contrib-0.11.0.1.3.0.0-104.jar;
      SET mapred.reduce.tasks=1;
      
      CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
      
      -- the input file contains ~4 million records. Sample it to produce 4 split keys (5 ranges for the 5 reduce tasks used later to generate the HFiles).
      INSERT OVERWRITE TABLE hbase_splits
      SELECT rowkey FROM
        (SELECT rowkey, row_seq() AS seq FROM pgc
         TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rowkey) s
         ORDER BY rowkey
         LIMIT 400) x
      WHERE (seq % 100) = 0
      ORDER BY rowkey
      LIMIT 4;
      
      -- after this is finished, copy the splits file to its final location:
      dfs -cp /tmp/hbase_splits_out/* /tmp/hbase_splits;
      

    3. Create hfiles.hql as shown below:

      ADD JAR /usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar;
      ADD JAR /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-104.jar;
      
      SET mapred.reduce.tasks=5;
      SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
      SET total.order.partitioner.path=/tmp/hbase_splits;
      
      -- generate hfiles using the splits ranges
      INSERT OVERWRITE TABLE hbase_hfiles
      SELECT * FROM pgc
      CLUSTER BY rowkey;
      

  2. Load the input data. Execute the following commands on the Hive client machine:

    mkdir /$Path_to_Input_Files_on_Hive_Client/wikistats
    wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
    hadoop fs -mkdir /$Path_to_Input_Files_on_Hive_Client/wikistats
    hadoop fs -put pagecounts-20081001-000000.gz /$Path_to_Input_Files_on_Hive_Client/wikistats/

    where $Path_to_Input_Files_on_Hive_Client is the location of the input files on the Hive client machine, created in step 1.
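
    To confirm that the raw file landed in HDFS before creating the tables, you can list the directory:

    hadoop fs -ls /$Path_to_Input_Files_on_Hive_Client/wikistats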

  3. Create the necessary tables. Execute the following commands on your Hive client machine:

    $HCATALOG_USER -f /$Path_to_Input_Files_on_Hive_Client/tables.ddl

    where $HCATALOG_USER is the HCatalog service user (for example, hcat) and $Path_to_Input_Files_on_Hive_Client is the location of the input files on the Hive client machine, created in step 1.

    You should see output similar to the following on your console:

    OK
    Time taken: 1.886 seconds
    OK
    Time taken: 0.654 seconds
    OK
    Time taken: 0.047 seconds
    OK
    Time taken: 0.115 seconds
  4. Verify that the tables were created correctly. Execute the following commands on your Hive client machine:

    $HIVE_USER -e "select * from pagecounts limit 10;"

    You should see the following output:

    ...
    OK
    aa      Main_Page       4       41431
    aa      Special:ListUsers       1       5555
    aa      Special:Listusers       1       1052
    ...

    $HIVE_USER -e "select * from pgc limit 10;"

    You should see the following output:

    ...
    OK
    aa/Main_Page/20081001-000000    4       41431
    aa/Special:ListUsers/20081001-000000    1       5555
    aa/Special:Listusers/20081001-000000    1       1052
    ...

    where $HIVE_USER is the Hive service user (for example, hive).

  5. Produce the HFile splits file.

    $HIVE_USER -f /$Path_to_Input_Files_on_Hive_Client/sample.hql
    hadoop fs -ls /tmp/hbase_splits

    To verify, execute the following command:

    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.0.1.3.0.0-104.jar -libjars /usr/lib/hive/lib/hive-exec-0.11.0.1.3.0.0-104.jar -input /tmp/hbase_splits -output /tmp/hbase_splits_txt -inputformat SequenceFileAsTextInputFormat

    You should see output similar to the following:

    ...
     INFO streaming.StreamJob: Output: /tmp/hbase_splits_txt

    Execute the following command:

    hadoop fs -cat /tmp/hbase_splits_txt/*

    You should see output similar to the following. Each line is one split key as written by the BinarySortableSerDe (a leading 0x01 non-null marker, the rowkey bytes, and a trailing 0x00), paired with the null value produced by HiveNullValueSequenceFileOutputFormat:

    01 61 66 2e 71 2f 4d 61 69 6e 5f 50 61 67 65 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
    01 61 66 2f 31 35 35 30 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00      (null)
    01 61 66 2f 32 38 5f 4d 61 61 72 74 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00  (null)
    01 61 66 2f 42 65 65 6c 64 3a 31 30 30 5f 31 38 33 30 2e 4a 50 47 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00    (null)

  6. Generate the HFiles.

    HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar hive -f /$Path_to_Input_Files_on_Hive_Client/hfiles.hql
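
    When the job finishes, the generated HFiles are written under the column family path declared in tables.ddl (hfile.family.path = /tmp/hbase_hfiles/w). The following is a minimal sketch of the final load step, not part of the scripts above: it assumes a target HBase table (here given the hypothetical name pagecounts_hbase) already exists with a column family named w, and that the HBase configuration and dependencies are available to the job (for example, via HADOOP_CLASSPATH=$(hbase classpath)).

    hadoop fs -ls /tmp/hbase_hfiles/w
    # pagecounts_hbase is a hypothetical, pre-created target table with column family 'w'
    hadoop jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar completebulkload /tmp/hbase_hfiles pagecounts_hbase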