Hive external table replication policies

Hive external table replication policies copy (replicate) your Hive metastore and data from one cluster to another and keep the Hive metastore and data set on the destination cluster synchronized with the source, based on a specified replication policy. You can also use CDP Private Cloud Base Replication Manager to replicate Hive/Impala data to and from S3 or ADLS; however, you cannot replicate data from one S3 or ADLS instance to another using Replication Manager.

The destination cluster must be managed by the Cloudera Manager Server where the replication is being set up, and the source cluster can be managed by that same server or by a peer Cloudera Manager Server.

Limitations and guidelines to consider for Hive external table replication policies

Consider the following limitations and guidelines before you replicate Hive external tables using Hive external table replication policies:

Hive replication policy considerations
Before you create Hive external table replication policies, you must know how to specify the hosts to improve performance, how DDL commands affect Hive tables during replication, how to disable parameter replication in Cloudera Manager, and which properties to configure for Hive replication in dynamic environments. For more information, see Hive external table replication policy considerations.
Replicate to and from S3 or ADLS
Replication Manager replicates Hive/Impala data to and from S3 or ADLS; however, you cannot replicate data from one S3 or ADLS instance to another using Replication Manager.
To replicate Hive/Impala data to and from S3 or ADLS, you must have the appropriate credentials to access the S3 or ADLS account. Additionally, you must create buckets in S3 or a Data Lake store in ADLS. When you replicate data to cloud storage, Replication Manager backs up file metadata, including extended attributes and ACLs.
Replication Manager supports the following replication scenarios:
  • Replicate to and from Amazon S3 from CDH 5.14+ and Cloudera Manager version 5.13+.

    Replication Manager does not support S3 as a source or destination when S3 is configured to use SSE-KMS.

  • Replicate to and from Microsoft ADLS Gen1 from CDH 5.13+ and Cloudera Manager 5.15, 5.16, 6.1+.
  • Replicate to Microsoft ADLS Gen2 (ABFS) from CDH 5.13+ and Cloudera Manager 6.1+.
Replicate from CDH clusters
  • Because of the warehouse directory changes between CDH clusters and CDP Private Cloud Base, Hive external table replication policies do not copy the table data from the database and tables specified in the source cluster. However, the replication job still completes successfully without any disruptions.
  • Because Hive3 has a different default table type and warehouse directory structure, the following changes apply when you replicate Hive data from CDH5 or CDH6 versions to CDP Private Cloud Base:
    • When you replicate from a CDH cluster to a CDP Private Cloud Base cluster, all tables become external tables during Hive external table replication. This is because the default table type in Hive3 is ACID, which is the only managed table type. As of this release, Replication Manager does not support Hive2 to Hive3 replication into ACID tables, so all tables are replicated as external tables.
    • Replicated tables are created under the external Hive warehouse directory set by the hive.metastore.warehouse.external.dir Hive configuration parameter. Make sure that its value differs from that of the hive.metastore.warehouse.dir Hive configuration parameter, which is the location of managed tables.
    • If you want to replicate the same database from Hive2 to Hive3 (which has different paths by design), you must use the Force Overwrite option per policy to avoid any mismatch issues.
  • Hive external table replication policies do not support managed to managed table replication. When you replicate from a CDH cluster to a CDP Private Cloud Base cluster, Replication Manager converts managed tables to external tables. Therefore, to replicate managed tables (ACID) and external tables in a database successfully, you must perform the following steps in the order shown below:
    1. Create Hive ACID table replication policy for the database to replicate the managed data.
    2. After the replication completes, create the Hive external table replication policy to replicate the external tables in the database.
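
The requirement above that hive.metastore.warehouse.external.dir differ from hive.metastore.warehouse.dir can be checked before you create a policy. The following is a minimal Python sketch, not part of Replication Manager: the function name warehouse_dirs_distinct and the example directory values are illustrative assumptions, and the configuration is assumed to be available as a plain dictionary.

```python
import posixpath
from urllib.parse import urlsplit


def _normalize(location):
    """Reduce a warehouse location to a normalized path.

    Scheme and authority (for example hdfs://ns1) are ignored, so two
    locations with the same path on different file systems are treated
    as equal. That is a deliberately conservative assumption.
    """
    path = urlsplit(location).path or location
    return posixpath.normpath(path)


def warehouse_dirs_distinct(conf):
    """Return True if the managed and external warehouse paths differ."""
    managed = _normalize(conf["hive.metastore.warehouse.dir"])
    external = _normalize(conf["hive.metastore.warehouse.external.dir"])
    return managed != external


if __name__ == "__main__":
    # Example values only; read the real values from your Hive configuration.
    conf = {
        "hive.metastore.warehouse.dir": "/warehouse/tablespace/managed/hive",
        "hive.metastore.warehouse.external.dir": "/warehouse/tablespace/external/hive",
    }
    print(warehouse_dirs_distinct(conf))
```

The check only compares the two values for equality after normalization, which is all the guideline above asks for.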
Migrate Sentry to Ranger
Migrating Sentry to Ranger requires source Cloudera Manager version 6.3.1 or higher and target Cloudera Manager version 7.1.1 or higher.
Replicate Atlas metadata
Atlas metadata for the chosen Hive external tables can be replicated using Hive external table replication policies from CDP Private Cloud Base 7.1.9 SP1 or higher using Cloudera Manager 7.11.3 CHF7 or higher.

During the Hive external table replication policy creation process, when you choose the General > Replicate Atlas Metadata option, Replication Manager:

  1. runs a bootstrap replication for all the chosen Hive external tables and their Atlas metadata during the first replication policy run. Bootstrap replication replicates all the available Hive external tables' data and their associated Atlas metadata.
  2. runs incremental replication on the Hive external tables' data and their Atlas metadata during subsequent replication runs. Only the delta of the data and metadata is replicated during each of these runs.

Ensure that you have the Atlas user credentials in addition to the Replication Administrator or Full Administrator roles to replicate Atlas metadata. The atlas user must also have relevant read and write permissions to the staging locations.

Metadata-only replication for Ozone storage-backed Hive external tables
Metadata-only replication for Ozone storage-backed Hive external tables is supported from CDP Private Cloud Base 7.1.9 SP1 or higher using Cloudera Manager 7.11.3 CHF7 or higher. You must replicate the data using Ozone replication policies.

Replication Manager replicates Ozone-backed Hive external tables when the Destination staging path parameter contains an ofs:// path. By default (when no ofs:// path is used), Ozone-backed external tables are ignored. When an ofs:// staging path is provided, only the Ozone-backed tables and databases are replicated. An error appears if a table or database that is not Ozone-backed is found in the replication scope.

The destination staging path can be specified at the service, volume, or bucket level, and is used to map the replicated source database, table, and partition locations as follows:

Note: Bucket-relative locations are always kept unchanged.

  • Destination staging path: ofs://[***DST OM SERVICE***]
    Mapping: Source volume and bucket names are used on the destination.
    Mapped location: ofs://[***DST OM SERVICE***]/[***SRC VOLUME***]/[***SRC BUCKET***]/[***SRC PATH***]

  • Destination staging path: ofs://[***DST OM SERVICE***]/[***DST VOLUME***]
    Mapping: Source bucket names are used on the destination, the specified destination volume is used, and the source volume is dropped. Replicated entities (tables, databases) cannot spread across multiple volumes.
    Mapped location: ofs://[***DST OM SERVICE***]/[***DST VOLUME***]/[***SRC BUCKET***]/[***SRC PATH***]

  • Destination staging path: ofs://[***DST OM SERVICE***]/[***DST VOLUME***]/[***DST BUCKET***]
    Mapping: Source volume and bucket names are dropped, and the specified bucket is used as the destination. Replicated entities cannot spread across multiple buckets.
    Mapped location: ofs://[***DST OM SERVICE***]/[***DST VOLUME***]/[***DST BUCKET***]/[***SRC PATH***]
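
The three staging path levels described above can be illustrated with a short Python sketch. It is an illustration of the documented mapping rules only, not Replication Manager code: the function name map_location, the service names, and the example locations are hypothetical, and error handling is reduced to basic shape checks.

```python
from urllib.parse import urlsplit


def _split_ofs(uri):
    """Split an ofs:// URI into (OM service, list of path components)."""
    parts = urlsplit(uri)
    if parts.scheme != "ofs" or not parts.netloc:
        raise ValueError(f"not an ofs:// URI: {uri!r}")
    return parts.netloc, [p for p in parts.path.split("/") if p]


def map_location(source_location, staging_path):
    """Map a source table location according to the staging path level."""
    _, src_parts = _split_ofs(source_location)
    if len(src_parts) < 2:
        raise ValueError("source location must include a volume and a bucket")
    src_volume, src_bucket, *src_rest = src_parts

    dst_service, dst_parts = _split_ofs(staging_path)
    if len(dst_parts) > 2:
        raise ValueError("staging path must be at service, volume, or bucket level")

    # A volume or bucket given in the staging path replaces the source one;
    # anything not given is taken from the source location.
    volume = dst_parts[0] if len(dst_parts) >= 1 else src_volume
    bucket = dst_parts[1] if len(dst_parts) == 2 else src_bucket

    # The bucket-relative part of the source location is kept unchanged.
    return "/".join([f"ofs://{dst_service}", volume, bucket, *src_rest])


if __name__ == "__main__":
    src = "ofs://srcom/vol1/buck1/warehouse/db1.db/t1"  # hypothetical location
    for staging in ("ofs://dstom", "ofs://dstom/volA", "ofs://dstom/volA/buckB"):
        print(staging, "->", map_location(src, staging))
```

In each case the bucket-relative part of the location (warehouse/db1.db/t1 in the example) is carried over as is, which matches the note that bucket-relative locations are always kept unchanged.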

Configuration-related guidelines
  • If the hadoop.proxyuser.hive.groups configuration has been changed to restrict access to the Hive Metastore Server to certain users or groups, the list of specified groups must also include the hdfs group, or a group containing the hdfs user, for Hive/Impala replication to work. This configuration can be specified either on the Hive service as an override or in the core-site HDFS configuration. This applies to configuration settings on both the source and destination clusters.
  • If you configured HDFS ACL synchronization on the target cluster for the directory where HDFS data is copied during Hive/Impala replication, the permissions that were copied during replication are overwritten by the HDFS ACL synchronization and are not preserved.
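
The hadoop.proxyuser.hive.groups guideline above can be checked with a small Python sketch. It assumes the property value is a comma-separated list of group names in which "*" allows every group; the function name hdfs_allowed and the example group names are illustrative assumptions, and the groups of the hdfs user must be looked up separately on your cluster.

```python
def hdfs_allowed(proxyuser_groups, hdfs_user_groups):
    """Return True if the proxyuser group list admits the hdfs user.

    proxyuser_groups: value of hadoop.proxyuser.hive.groups as a string.
    hdfs_user_groups: groups that the hdfs user belongs to (assumed input).
    """
    allowed = {g.strip() for g in proxyuser_groups.split(",") if g.strip()}
    if "*" in allowed:
        # The wildcard places no restriction on groups.
        return True
    return bool(allowed & set(hdfs_user_groups))


if __name__ == "__main__":
    # Example values only.
    print(hdfs_allowed("hive,hue", ["hdfs", "hadoop"]))
    print(hdfs_allowed("hive,hue,hdfs", ["hdfs", "hadoop"]))
```

Because the guideline applies to both clusters, run a check like this against the source and the destination configuration.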