Use the following instructions to bulk load data into HBase using Hive:
Prepare the input files on the Hive client machine:
Create a tables.ddl file using the following content:

-- pagecounts data comes from http://dumps.wikimedia.org/other/pagecounts-raw/
-- documented http://www.mediawiki.org/wiki/Analytics/Wikistats

-- define an external table over raw pagecounts data
CREATE TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/wikistats';

-- create a view, building a custom hbase rowkey
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
SELECT concat_ws('/',
         projectcode,
         concat_ws('/',
           pagename,
           regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
       pageviews, bytes
FROM pagecounts;

-- create a table to hold the input split partitions
CREATE EXTERNAL TABLE IF NOT EXISTS hbase_splits(partition STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION '/tmp/hbase_splits_out';

-- create a location to store the resulting HFiles
CREATE TABLE hbase_hfiles(rowkey STRING, pageviews STRING, bytes STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES('hfile.family.path' = '/tmp/hbase_hfiles/w');
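The hfile.family.path property above ends in w, which HiveHFileOutputFormat uses as the HBase column family for the generated HFiles. If the target HBase table does not already exist, you can create it from the HBase shell before loading; the sketch below assumes a table named pagecounts, since this section does not fix the table name:

# Assumption: target table 'pagecounts' with column family 'w'
# (the family must match the last segment of hfile.family.path).
hbase shell
create 'pagecounts', 'w'
exit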
Create an HFile splits file. For example, create sample.hql as shown below:

-- prepare range partitioning of hfiles
ADD JAR /usr/lib/hive/lib/hive-contrib-0.11.0.1.3.0.0-104.jar;
SET mapred.reduce.tasks=1;
CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- input file contains ~4mm records. Sample it so as to produce 5 input splits.
INSERT OVERWRITE TABLE hbase_splits
SELECT rowkey FROM
  (SELECT rowkey, row_seq() AS seq
   FROM pgc
   TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rowkey) s
   ORDER BY rowkey
   LIMIT 400) x
WHERE (seq % 100) = 0
ORDER BY rowkey
LIMIT 4;

-- after this is finished, combine the splits file:
dfs -cp /tmp/hbase_splits_out/* /tmp/hbase_splits;
Create hfiles.hql as shown below:

ADD JAR /usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-104.jar;

SET mapred.reduce.tasks=5;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path=/tmp/hbase_splits;

-- generate hfiles using the splits ranges
INSERT OVERWRITE TABLE hbase_hfiles
SELECT * FROM pgc
CLUSTER BY rowkey;
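Note that sample.hql and hfiles.hql reference jar paths for one specific HDP build; on another build the version suffixes will differ. A quick way to check the actual jar names on your client, assuming the standard /usr/lib layout, is:

# List the jars referenced by the scripts; adjust the ADD JAR paths if the names differ.
ls /usr/lib/hive/lib/hive-contrib-*.jar
ls /usr/lib/hive/lib/hive-hbase-handler-*.jar
ls /usr/lib/hbase/hbase-*-security.jar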
Load the input data. Execute the following commands on the Hive client machine:
mkdir /$Path_to_Input_Files_on_Hive_Client/wikistats
wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
hadoop fs -mkdir /$Path_to_Input_Files_on_Hive_Client/wikistats
hadoop fs -put pagecounts-20081001-000000.gz /$Path_to_Input_Files_on_Hive_Client/wikistats/
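Optionally, confirm that the compressed pagecounts file reached HDFS before defining tables over it:

hadoop fs -ls /$Path_to_Input_Files_on_Hive_Client/wikistats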
where $Path_to_Input_Files_on_Hive_Client is the location of the input files on the Hive client machine, created in step 1.

Create the necessary tables. Execute the following command on your Hive client machine:
$HCATALOG_USER -f /$Path_to_Input_Files_on_Hive_Client/tables.ddl
where $HCATALOG_USER is the HCatalog service user (for example, hcat) and $Path_to_Input_Files_on_Hive_Client is the location of the input files on the Hive client machine, created in step 1.

You should see the following messages on your console:
OK
Time taken: 1.886 seconds
OK
Time taken: 0.654 seconds
OK
Time taken: 0.047 seconds
OK
Time taken: 0.115 seconds
Verify that the tables were created correctly. Execute the following commands on your Hive client machine:
$HIVE_USER -e "select * from pagecounts limit 10;"
You should see the following output:
...
OK
aa Main_Page 4 41431
aa Special:ListUsers 1 5555
aa Special:Listusers 1 1052
...
$HIVE_USER -e "select * from pgc limit 10;"
You should see the following output:
...
OK
aa/Main_Page/20081001-000000 4 41431
aa/Special:ListUsers/20081001-000000 1 5555
aa/Special:Listusers/20081001-000000 1 1052
...
where $HIVE_USER is the Hive service user (for example, hive).

Produce the HFile splits file:
$HIVE_USER -f /$Path_to_Input_Files_on_Hive_Client/sample.hql
hadoop fs -ls /$Path_to_Input_Files_on_Hive_Client/hbase_splits
To verify, execute the following command:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.0.1.3.0.0-104.jar \
  -libjars /usr/lib/hive/lib/hive-exec-0.11.0.1.3.0.0-104.jar \
  -input /tmp/hbase_splits \
  -output /tmp/hbase_splits_txt \
  -inputformat SequenceFileAsTextInputFormat
You should see output similar to the following:
... INFO streaming.StreamJob: Output: /tmp/hbase_splits_txt
Execute the following command:
hadoop fs -cat /tmp/hbase_splits_txt/*
You should see output similar to the following:
1 61 66 2e 71 2f 4d 61 69 6e 5f 50 61 67 65 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
01 61 66 2f 31 35 35 30 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
01 61 66 2f 32 38 5f 4d 61 61 72 74 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
01 61 66 2f 42 65 65 6c 64 3a 31 30 30 5f 31 38 33 30 2e 4a 50 47 2f 32 30 30 38 31 30 30 31 2d 30 30 30 30 30 30 00 (null)
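Each line is one split rowkey printed as hex bytes, followed by a null value. To read a key back as text, you can decode the bytes with xxd (assuming it is installed on the client); for example, the first few bytes of the first key decode as follows:

echo "61 66 2e 71 2f 4d 61 69 6e 5f 50 61 67 65" | xxd -r -p
# prints: af.q/Main_Page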
Generate the HFiles:
HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.94.6.1.3.0.0-104-security.jar hive -f /$Path_to_Input_Files_on_Hive_Client/hfiles.hql
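At this point the HFiles exist under /tmp/hbase_hfiles/w but have not yet been loaded into a live HBase table. That final step is typically done with HBase's LoadIncrementalHFiles (completebulkload) tool. A minimal sketch, assuming a target table named pagecounts that already has column family w (the table name is not fixed by this section):

# Verify the generated HFiles.
hadoop fs -ls /tmp/hbase_hfiles/w

# Bulk load them into HBase; 'pagecounts' is an assumed table name.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hbase_hfiles pagecounts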