Perform ETL: Ingest data from Kafka into Hive
You can extract, transform, and load Kafka records into Hive in a single transaction. The steps below use an existing Kafka-backed Hive table, wiki_kafka_hive, as the source.
1. Create a table to represent source Kafka record offsets. This bookkeeping table holds, for each Kafka partition, the highest offset that has already been loaded.

   CREATE TABLE kafka_table_offsets (
     partition_id int,
     max_offset bigint,
     insert_time timestamp);
2. Initialize the table. Setting max_offset to one less than the minimum available offset ensures that the first run of step 4 picks up every record.

   INSERT OVERWRITE TABLE kafka_table_offsets
   SELECT `__partition`, min(`__offset`) - 1, CURRENT_TIMESTAMP
   FROM wiki_kafka_hive
   GROUP BY `__partition`, CURRENT_TIMESTAMP;
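   As a quick sanity check (an optional query, not required by the procedure), you can inspect the bookkeeping table; expect one row per Kafka partition, with max_offset one less than that partition's earliest available offset.

   -- Expect one row per Kafka partition.
   SELECT partition_id, max_offset, insert_time
   FROM kafka_table_offsets;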
3. Create the destination table.

   CREATE TABLE orc_kafka_table (
     partition_id int,
     koffset bigint,
     ktimestamp bigint,
     `timestamp` timestamp,
     `page` string,
     `user` string,
     `diffurl` string,
     `isrobot` boolean,
     added int,
     deleted int,
     delta bigint)
   STORED AS ORC;
4. Insert Kafka data into the ORC table. This multi-insert statement loads only records whose offsets are beyond the recorded high-water marks and advances the bookmarks in kafka_table_offsets, all in a single transaction.

   FROM wiki_kafka_hive ktable
   JOIN kafka_table_offsets offset_table
     ON (ktable.`__partition` = offset_table.partition_id
     AND ktable.`__offset` > offset_table.max_offset)
   INSERT INTO TABLE orc_kafka_table
     SELECT `__partition`, `__offset`, `__timestamp`,
       `timestamp`, `page`, `user`, `diffurl`, `isrobot`,
       added, deleted, delta
   INSERT OVERWRITE TABLE kafka_table_offsets
     SELECT `__partition`, max(`__offset`), CURRENT_TIMESTAMP
     GROUP BY `__partition`, CURRENT_TIMESTAMP;
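   To preview what the next run of this step would pick up without modifying anything, you can run the same join as a plain query. This dry-run sketch reuses only the tables and metadata columns shown above:

   -- Count the records that step 4 would load next, per partition.
   SELECT ktable.`__partition`, COUNT(*) AS pending
   FROM wiki_kafka_hive ktable
   JOIN kafka_table_offsets offset_table
     ON (ktable.`__partition` = offset_table.partition_id
     AND ktable.`__offset` > offset_table.max_offset)
   GROUP BY ktable.`__partition`;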
5. Check the insertion. The first query returns the highest offset loaded so far; the second returns rows only if a (partition_id, koffset) pair was inserted more than once, so an empty result means no duplicates.

   SELECT MAX(`koffset`) FROM orc_kafka_table LIMIT 10;

   SELECT COUNT(*) AS c
   FROM orc_kafka_table
   GROUP BY partition_id, koffset
   HAVING c > 1;
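   Assuming offsets within each partition are contiguous (no compacted or deleted records), you can also compare the per-partition row count against the offset span; the two numbers should match if nothing was skipped. This extra check is a sketch under that assumption:

   -- Per-partition completeness check: row_cnt should equal offset_span.
   SELECT partition_id,
     COUNT(*) AS row_cnt,
     MAX(koffset) - MIN(koffset) + 1 AS offset_span
   FROM orc_kafka_table
   GROUP BY partition_id;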
6. Repeat step 4 periodically until all the data is loaded into Hive. A scheduling sketch follows this list.
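If your Hive version supports scheduled queries (available in Hive 4.0 and later), one way to automate step 4 might be a scheduled query such as the sketch below. The name ingest_wiki_kafka and the 10-minute cadence are illustrative assumptions, and you should confirm that your deployment accepts a multi-insert statement in this position before relying on it.

   -- Hypothetical automation of step 4; requires Hive 4.0+ scheduled queries.
   CREATE SCHEDULED QUERY ingest_wiki_kafka
   EVERY 10 MINUTES AS
   FROM wiki_kafka_hive ktable
   JOIN kafka_table_offsets offset_table
     ON (ktable.`__partition` = offset_table.partition_id
     AND ktable.`__offset` > offset_table.max_offset)
   INSERT INTO TABLE orc_kafka_table
     SELECT `__partition`, `__offset`, `__timestamp`,
       `timestamp`, `page`, `user`, `diffurl`, `isrobot`,
       added, deleted, delta
   INSERT OVERWRITE TABLE kafka_table_offsets
     SELECT `__partition`, max(`__offset`), CURRENT_TIMESTAMP
     GROUP BY `__partition`, CURRENT_TIMESTAMP;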