Hive cloud replication

Replication Manager supports replication of the Hive database from a cluster with underlying HDFS to another cluster with cloud storage. It uses push-based replication, with the replication job running on the cluster with HDFS.

Hive stores its metadata in Hive Metastore, but the underlying data is stored in HDFS or cloud storage. In a Hadoop cluster with Hive service, the Hive warehouse directory can be configured with either HDFS or cloud storage.
  • You can rename the dataset in the policy that is replicated.
  • You can create a pull-based policy on the source cluster to move data from the target back into the source cluster Hive database.
  • Replication Manager does not manage Ranger policies and any PII/secure data that gets replicated from on-premise to Amazon S3. You must manage these items outside of Replication Manager.
  • Hive replication from an HDFS-based cluster to a cloud storage-based cluster requires the following:
    • Source cluster

      The cluster with a Hive warehouse directory on local HDFS. This can be an On-premise cluster or an IaaS cluster with data on local HDFS. The required services are HDFS, YARN, Hive, Ranger, Knox and DLM Engine.

    • Destination cluster

      The cluster with data on cloud storage. The cluster minimally requires Hive Metastore, Ranger, Knox, and DLM Engine.