This is the documentation for Cloudera Manager 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Data Replication

Cloudera Manager provides rich functionality for replicating data (stored in HDFS or accessed through Hive) across data centers. When critical data is stored on HDFS, Cloudera Manager provides the necessary capabilities to ensure that the data is available at all times, even in the face of the complete shutdown of a data center.

For recommendations on using data replication and Sentry authorization, see Configuring Sentry to Enable BDR Replication.

In Cloudera Manager 5, replication is supported between clusters running CDH 4 or CDH 5. Support for HDFS and Hive replication is as follows.

  Important: To use HDFS replication, the source and target HDFS services must both use Kerberos authentication, or neither must use it.

Supported Replication Scenarios

  • HDFS and Hive
    • Cloudera Manager 4 with CDH 4 to Cloudera Manager 5 with CDH 4
    • Cloudera Manager 5 with CDH 4 to Cloudera Manager 4.7.3 or later with CDH 4
    • Cloudera Manager 5 with CDH 4 to Cloudera Manager 5 with CDH 4
    • Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 5
    • Cloudera Manager 4 or 5 with CDH 4.4 or later to Cloudera Manager 5 with CDH 5
    • Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 4.4 or later

Unsupported Replication Scenarios

  • HDFS and Hive
    • Cloudera Manager 5 with CDH 5 as the source, and Cloudera Manager 4 with CDH 4 as the target.
    • Between Cloudera Enterprise and any Cloudera Manager free edition: Cloudera Express, Cloudera Standard, Cloudera Manager Free Edition.
    • Between CDH 5 and CDH 4 (in either direction) where the replicated data includes a directory containing a very large number of files or subdirectories (several hundred thousand entries), causing out-of-memory errors. This is because of limitations in the WebHDFS API. The workaround is to increase the heap size as follows:
      1. On the target Cloudera Manager instance, go to the HDFS service page.
      2. Select Configuration > View and Edit.
      3. Expand the Service-Wide category.
      4. Click Advanced > HDFS Replication Advanced Configuration Snippet.
      5. Increase the heap size by adding a key-value pair, for example, HADOOP_CLIENT_OPTS=-Xmx1g. Here, 1g sets the heap size to 1 GB. Adjust this value depending on the number of files and directories being replicated.
    • Replication involving HDFS data from CDH 5 HA to CDH 4 clusters will fail if a NameNode failover happens during replication. This is because of limitations in the CDH 4 WebHDFS API.
  • HDFS
    • Between a source cluster that has encryption enabled and a target cluster running CDH 4.0. This is because the CDH 4 client is used for replication in this case, and it does not support encryption.
    • From CDH 5 to CDH 4 where there are URL-encoding characters such as % in file and directory names. This is because of a bug in the CDH 4 WebHDFS API.
    • From CDH 5 to CDH 4 with different Kerberos realms when using older JDK versions. This is because of a JDK SPNEGO issue; for more information, see JDK-6670362. To work around this issue, use JDK 7 or upgrade to JDK 6u34 or later on the CDH 4 cluster.
    • Replication from CDH 5 HA to CDH 4 where there are separate Kerberos realms and no cross-realm trust.
    • Replication from CDH 4 HA to CDH 5 with Kerberos.
  • Hive
    • With data replication, between a source cluster that has encryption enabled and a target cluster running CDH 4. This is because the CDH 4 client used for replication does not support encryption.
    • Without data replication, between a source cluster running CDH 4 and a target cluster that has encryption enabled.
    • Between CDH 4.2 or later and CDH 4, if the Hive schema contains views.
    • With the same cluster as both source and destination.
    • Replication from CDH 4 to CDH 5 HA can fail if a NameNode failover happens during replication. This is because of limitations in the CDH 4 WebHDFS API.
    • From CDH 5 to CDH 4 with different Kerberos realms when using older JDK versions, if data replication is enabled (because this involves HDFS replication). This is because of a JDK SPNEGO issue; for more information, see JDK-6670362. To work around this issue, use JDK 7 or upgrade to JDK 6u34 or later on the CDH 4 cluster.
    • From CDH 4 to CDH 5 with different Kerberos realms when using older JDK versions (even without data replication enabled). This is because of a JDK SPNEGO issue; for more information, see JDK-6670362. To work around this issue, use JDK 7 or upgrade to JDK 6u34 or later on the CDH 4 cluster.
    • Replication from CDH 5 HA to CDH 4 with separate Kerberos realms and no cross-realm trust when either data replication is involved or Impala UDF jars need to be replicated.
    • Replication from CDH 4 HA to CDH 5 with Kerberos when either data replication is involved or Impala UDF jars need to be replicated.
  • Kerberos
    • From a source cluster configured to use Kerberos authentication to a target cluster that is not configured to use Kerberos authentication.
    • From a source cluster not configured to use Kerberos authentication to a target cluster that is configured to use Kerberos authentication.
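
The heap-size workaround described above under Unsupported Replication Scenarios amounts to a single key-value entry in the HDFS Replication Advanced Configuration Snippet. As a sketch (the -Xmx value shown is an example, not a recommendation; tune it to the number of files and directories being replicated):

```
HADOOP_CLIENT_OPTS=-Xmx1g
```

After saving the change, rerun the replication job so that the larger client heap takes effect.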
Page generated September 3, 2015.