Hive cloud replication

Replication Manager supports replication of the Hive database from a cluster with underlying HDFS to another cluster with cloud storage. It uses push-based replication, with the replication job running on the cluster with HDFS.

Hive stores its metadata in Hive Metastore, but the underlying data is stored in HDFS or cloud storage. In a Hadoop cluster with Hive service, the Hive warehouse directory can be configured with either HDFS or cloud storage.

You can perform the following tasks with Hive replication:
  • Rename the dataset in the policy that is replicated.
  • Create a pull-based policy on the source cluster to move data from the target back into the source cluster Hive database.
Hive replication from an HDFS-based cluster to a cloud storage-based cluster requires the following components:
  • Source cluster - The cluster with a Hive warehouse directory on local HDFS. This can be an on-premises cluster or an IaaS cluster with data on local HDFS. The required services are HDFS, YARN, Hive, Ranger, and Knox.
  • Destination cluster - The cluster with data on cloud storage. The cluster minimally requires Hive Metastore, Ranger, and Knox.

Replication Manager does not manage Ranger policies, and Personally Identifiable Information (PII) or any other secure data that gets replicated from on-premises to Amazon S3. You must manage these items outside of Replication Manager.