Data Replication

For recommendations on using data replication and Sentry authorization, see Configuring Sentry to Enable BDR Replication.

In Cloudera Manager 5, replication is supported between CDH 5 or CDH 4 clusters. Support for HDFS and Hive replication is as follows.

Supported Replication Scenarios

  • HDFS and Hive
    • Cloudera Manager 4 with CDH 4 to Cloudera Manager 5 with CDH 4.
    • Cloudera Manager 5 with CDH 4 to Cloudera Manager 4.7.3 or later with CDH 4.
    • Cloudera Manager 5 with CDH 4 to Cloudera Manager 5 with CDH 4.
    • Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 5.
    • Cloudera Manager 4 or 5 with CDH 4.4 or later to Cloudera Manager 5 with CDH 5.
    • Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 4.4 or later.
    • (HDFS only) Within a single Cloudera Manager instance, from one directory to another on the same cluster or to a different cluster. Both clusters must be running CDH 4.8 or higher.
  • SSL
    • Between CDH 5.0 with SSL and CDH 5.0 with SSL.
    • Between CDH 5.0 with SSL and CDH 5.0 without SSL.
    • From a CDH 5.1 source cluster with SSL and YARN.

Unsupported Replication Scenarios

  • HDFS and Hive
    • Cloudera Manager 5 with CDH 5 as the source, and Cloudera Manager 4 with CDH 4 as the target.
    • Between Cloudera Enterprise and any free edition of Cloudera Manager: Cloudera Express, Cloudera Standard, or Cloudera Manager Free Edition.
    • Between CDH 5 and CDH 4 (in either direction) when the replicated data includes a directory containing a very large number of files or subdirectories (several hundred thousand entries), which causes out-of-memory errors because of limitations in the WebHDFS API. The workaround is to increase the heap size as follows:
      1. On the target Cloudera Manager instance, go to the HDFS service page.
      2. Click the Configuration tab.
      3. Expand the Service-Wide category.
      4. Click Advanced > HDFS Replication Advanced Configuration Snippet.
      5. Increase the heap size by adding a key-value pair, for example, HADOOP_CLIENT_OPTS=-Xmx1g. In this example, 1g sets the heap size to 1 GB; adjust this value based on the number of files and directories being replicated. (A sample snippet appears at the end of this section.)
    • Replication involving HDFS data from CDH 5 HA to CDH 4 clusters, or from CDH 4 HA to CDH 5 clusters, fails if a NameNode failover occurs during replication. This is because of limitations in the CDH WebHDFS API.
  • HDFS
    • Between a source cluster that has over-the-wire encryption enabled and a target cluster running CDH 4.0, because the CDH 4 client used for replication in this case does not support over-the-wire encryption.
    • From CDH 5 to CDH 4 when file or directory names contain URL-encoding characters such as %, because of a bug in the CDH 4 WebHDFS API.
    • From CDH 5 to CDH 4 between clusters in different Kerberos realms when using older JDK versions. To avoid this issue, use JDK 7, or upgrade to JDK 6u34 or later on the CDH 4 cluster.
  • Hive
    • With data replication, between a source cluster that has encryption enabled and a target cluster running CDH 4. This is because the CDH 4 client used for replication does not support encryption.
    • Without data replication, between a source cluster running CDH 4 and a target cluster that has encryption enabled.
    • Between CDH 4.2 or later and CDH 4, if the Hive schema contains views.
    • With the same cluster as both source and destination.
    • Replication from CDH 4 to CDH 5 HA can fail if a NameNode failover happens during replication.
    • Hive replication from CDH 5 to CDH 4 between clusters in different Kerberos realms with older JDK versions, if data replication is enabled (because this involves HDFS replication). To avoid this issue, use JDK 7, or upgrade to JDK 6u34 or later on the CDH 4 cluster.
    • Hive replication from CDH 4 to CDH 5 between clusters in different Kerberos realms with older JDK versions, even without data replication enabled. To avoid this issue, use JDK 7, or upgrade to JDK 6u34 or later on the CDH 4 cluster.
    • Cloudera Manager 5.2 supports replication of Impala user-defined functions (UDFs) only when running CDH 5.2 or later. In clusters running Cloudera Manager 5.2 with a CDH version earlier than 5.2 that include Impala UDFs, Hive replication succeeds, but replication of the Impala UDFs is skipped.
  • SSL
    • From a CDH 4.x source cluster with SSL.
    • From a CDH 5.0 source cluster with SSL and YARN (because of a YARN bug).
    • Between CDH 5.0 with SSL and CDH 4.x.
  • Kerberos
    • From a source cluster configured to use Kerberos authentication to a target cluster that is not configured to use Kerberos authentication.
    • From a source cluster not configured to use Kerberos authentication to a target cluster that is configured to use Kerberos authentication.
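
Sample heap-size configuration snippet

The following sketch shows the key-value pair described in step 5 of the WebHDFS out-of-memory workaround above, entered in the HDFS Replication Advanced Configuration Snippet on the target Cloudera Manager instance. The 1g value comes directly from that step; a replication involving several hundred thousand files or subdirectories may need a larger heap (for example, -Xmx4g), so treat the exact size as workload-dependent rather than a recommendation.

    HADOOP_CLIENT_OPTS=-Xmx1g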