Updating imported tables involves importing incremental changes made to the original table using
Sqoop and then merging changes with the tables imported into Hive.
After ingesting data from an operational database to Hive, you usually need to set up a process
for periodically synchronizing the imported table with the operational database table. The base
table is a Hive-managed table that was created during the first data ingestion. Incrementally
updating Hive tables from operational database systems involves merging the base table and change
records to reflect the latest record set. You create the incremental table as a Hive external
table, typically from .CSV data in HDFS, to store the change records. This external table
contains the changes (INSERTs and UPDATEs) from the operational database since the last data
ingestion. Generally, the table is partitioned and only the latest partition is updated, making
this process more efficient.You can automate the steps to incrementally update data in Hive by using Oozie.
- The first time the data was ingested into hive, you stored entire base table in Hive in ORC format.
- The base table definition after moving it from the external table to a Hive-managed table
has the following
schema:
CREATE TABLE base_table (
id STRING,
field1 STRING,
modified_date DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
-
Store the incremental table as an external table in Hive and to fetch records newer than
last_import_date
, which is the date of
the last incremental data update.You frequently import incremental changes since the last time data
was updated and then merging it.
- using
--check-column
to fetch records
- use
--query
to fetch records
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --target-dir /user/hive/incremental_table -m 1 --query 'select * from SOURCE_TBL where modified_date > {last_import_date} AND $CONDITIONS’
-
After the incremental table data is moved into HDFS using Sqoop, you can define an external Hive table over it with the following command
CREATE EXTERNAL TABLE incremental_table (
id STRING,
field1 STRING,
modified_date DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/user/hive/incremental_table';
-
Use the MERGE command to merge the data and reconcile the base table records with
the new records:
merge into base_table
using incremental_table on base.id = incremental_table.id
when matched then update set
fieldl1=incremental_table.email,
modified_date=incremental_table.state
when not matched then insert
values(incremental_table.id, incremental_table.field1, incremental_table.modified_data);