Bootstrap and incremental replication

Replication Manager allows you to replicate Hive databases from a source cluster to a target location on a destination cluster.

When you initiate the replication of Hive data, all of the data from the source location is copied to the destination. This bootstrapping of data can take hours to days, depending on factors such as the amount of data being copied and available network bandwidth. Subsequent replication jobs from the same source location to the same target on the destination are incremental, so only the changed data is copied.

If a bootstrap replication is interrupted, for example by a network failure or an unrecoverable error, Replication Manager automatically retries the job. If a retry succeeds, the replication job continues from the point at which it was interrupted. If the automatic retries do not succeed, you must correct the problem manually before running the policy again. When you activate the policy again, the replication job resumes from the point at which it was interrupted.
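
The retry-and-resume behavior described above can be pictured with a small sketch. This is an illustration only, not Replication Manager's implementation: the names Checkpoint, TransientCopyError, and run_bootstrap, the per-item granularity, and the max_retries limit are all assumptions made for the example.

```python
from dataclasses import dataclass, field


class TransientCopyError(Exception):
    """Hypothetical error standing in for a failure such as a network drop."""


@dataclass
class Checkpoint:
    # Index of the next item to copy; everything before it is already copied.
    next_index: int = 0
    copied: list = field(default_factory=list)


def run_bootstrap(items, copy_item, checkpoint, max_retries=3):
    """Copy items in order, resuming from the checkpoint after each failure.

    Returns True when every item has been copied, or False when the retries
    are exhausted. In the False case the checkpoint still records how far the
    copy got, so a later call resumes there instead of starting over.
    """
    failures = 0
    while checkpoint.next_index < len(items):
        item = items[checkpoint.next_index]
        try:
            copy_item(item)
        except TransientCopyError:
            failures += 1
            if failures > max_retries:
                return False  # manual intervention is needed before rerunning
            continue  # automatic retry of the same item
        checkpoint.copied.append(item)
        checkpoint.next_index += 1
    return True
```

Calling run_bootstrap again with the same Checkpoint after fixing the underlying problem mirrors reactivating the policy: items that were already copied are skipped.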

After the bootstrap replication succeeds, an incremental replication is performed automatically. This job applies to the destination cluster any events that occurred on the source cluster during the bootstrap process. After the data is synchronized, the replicated data is ready for use on the destination.
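
As a rough mental model of how a bootstrap followed by a catch-up incremental run keeps the two clusters consistent, consider the sketch below. It assumes a simplified event log with increasing event IDs. The Event class and the bootstrap and incremental functions are hypothetical names for this example and do not correspond to any Replication Manager or Hive API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: int
    table: str
    value: str  # stands in for whatever the operation wrote


def bootstrap(source_state, last_event_id):
    """Full copy of the source, tagged with the last event ID seen when the copy began."""
    return dict(source_state), last_event_id


def incremental(dest_state, events, last_replicated_id):
    """Apply only events newer than last_replicated_id and return the new mark."""
    for event in sorted(events, key=lambda e: e.event_id):
        if event.event_id > last_replicated_id:
            dest_state[event.table] = event.value
            last_replicated_id = event.event_id
    return last_replicated_id
```

In this model, a change made on the source while the bootstrap is still running gets an event ID higher than the mark recorded by bootstrap, so the first incremental call picks it up.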

Functions, such as Hive user-defined functions (UDFs), are also replicated. For a UDF to be replicated, it must be created with the following syntax:
CREATE FUNCTION [db_name.]function_name AS class_name USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'];

  • ACID tables, external tables, storage handler-based tables (such as HBase), and column statistics are not currently replicated.
  • When creating a schedule for a Hive replication policy, set the frequency so that changes are replicated often enough to avoid overly large copies.

Incremental replication

Incremental replication in Hive relies on the notification events that Hive maintains in the Hive Metastore.

Hive logs notification events for all operations, both metadata and data changes, on managed tables. For external tables, however, Hive cannot track data writes, because external sources perform them directly without using Hive SQL commands. Therefore, Hive always copies the latest data from external tables to the target cluster to avoid any loss of data.
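
The difference between the two table types can be sketched as a planning step for one incremental cycle. This is a simplified illustration under stated assumptions: the function name plan_incremental, the "managed" and "external" labels, and the (event_id, table_name) tuple format are hypothetical and are not taken from Hive or Replication Manager.

```python
def plan_incremental(tables, events, last_replicated_id):
    """Return (events_to_replay, external_tables_to_copy) for one cycle.

    tables: dict mapping table name to "managed" or "external"
    events: list of (event_id, table_name) tuples from the notification log
    """
    # Logged operations newer than the last replicated event are replayed in order.
    replay = [
        (event_id, table)
        for event_id, table in sorted(events)
        if event_id > last_replicated_id
    ]
    # Data writes to external tables never reach the notification log, so there
    # is no way to tell whether they changed: copy their data every cycle.
    full_copy = sorted(name for name, kind in tables.items() if kind == "external")
    return replay, full_copy
```

With this plan, the cost of a managed table scales with the number of new events, while every external table is copied in full on each cycle whether or not it changed.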