Data Replication
Cloudera Manager enables you to replicate data across datacenters for disaster recovery scenarios. Replications can include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore. When critical data is stored on HDFS, Cloudera Manager helps to ensure that the data is available at all times, even in case of complete shutdown of a datacenter.
You can also use the HBase shell to replicate HBase data. (Cloudera Manager does not manage HBase replications.)
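As a sketch of the HBase shell approach (the peer ID, ZooKeeper hostname, table, and column-family names below are placeholders, and the exact commands vary by HBase version), you add the destination cluster as a replication peer on the source cluster and mark a column family for replication:

```
# Run on the source cluster: hbase shell

# Register the destination cluster as replication peer '1'
# (ZooKeeper quorum host, client port, and znode parent are placeholders).
add_peer '1', 'dest-zk1.example.com:2181:/hbase'

# Enable replication for column family 'cf' of table 'my_table'
# by setting its replication scope.
disable 'my_table'
alter 'my_table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'my_table'
```

Because Cloudera Manager does not manage these peers, you must also monitor and remove them from the shell (for example, with `list_peers` and `remove_peer`).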
For recommendations on using data replication and Sentry authorization, see Configuring Sentry to Enable BDR Replication.
The following sections describe license requirements and supported and unsupported replication scenarios.
Cloudera License Requirements for Replication
Both the source and destination clusters must have a Cloudera Enterprise license.
Supported Replication Scenarios
In Cloudera Manager 5, replication is supported between CDH 4 and CDH 5 clusters. The following tables list the supported and unsupported scenarios for HDFS and Hive replication.
| Service | Source Cloudera Manager Version | Source CDH Version | Source Comment | Destination Cloudera Manager Version | Destination CDH Version | Destination Comment |
|---|---|---|---|---|---|---|
| HDFS, Hive | 4 | 4 | | 5 | 4 | |
| HDFS, Hive | 4 | 4.4 or higher | | 5 | 5 | |
| HDFS, Hive | 4 or 5 | 5 | SSL enabled on Hadoop services | 4 or 5 | 5 | SSL enabled on Hadoop services |
| HDFS, Hive | 4 or 5 | 5 | SSL enabled on Hadoop services | 4 or 5 | 5 | SSL not enabled on Hadoop services |
| HDFS, Hive | 4 or 5 | 5.1 | SSL enabled on Hadoop services and YARN | 4 or 5 | 4 or 5 | |
| HDFS, Hive | 5 | 4 | | 4.7.3 or higher | 4 | |
| HDFS, Hive | 5 | 4 | | 5 | 4 | |
| HDFS, Hive | 5 | 5 | | 5 | 5 | |
| HDFS, Hive | 5 | 5 | | 5 | 4.4 or higher | |
Unsupported Replication Scenarios
| Service | Source Cloudera Manager Version | Source CDH Version | Source Comment | Destination Cloudera Manager Version | Destination CDH Version | Destination Comment |
|---|---|---|---|---|---|---|
| Any | 4 or 5 | 4 or 5 | Kerberos enabled | 4 or 5 | 4 or 5 | Kerberos not enabled |
| Any | 4 or 5 | 4 or 5 | Kerberos not enabled | 4 or 5 | 4 or 5 | Kerberos enabled |
| HDFS, Hive | 4 or 5 | 4 | The replicated data includes a directory that contains a large number of files or subdirectories (several hundred thousand entries), causing out-of-memory errors. To work around this issue, follow the procedure below. | 4 or 5 | 5 | |
| Hive | 4 or 5 | 4 | Replicate HDFS Files is disabled | 4 or 5 | 4 or 5 | Over-the-wire encryption is enabled |
| Hive | 4 or 5 | 4 | Replication can fail if the NameNode fails over during replication | 4 or 5 | 5, with high availability enabled | Replication can fail if the NameNode fails over during replication |
| Hive | 4 or 5 | 4 | The clusters use different Kerberos realms | 4 or 5 | 5 | An older JDK is deployed. (Upgrade the CDH 4 cluster to use JDK 7 or JDK 6u34 to work around this issue.) |
| Any | 4 or 5 | 4 | SSL enabled on Hadoop services | 4 or 5 | 4 or 5 | |
| Hive | 4 or 5 | 4.2 or higher | The Hive schema contains views | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 4, with high availability enabled | Replications fail if NameNode failover occurs during replication | 4 or 5 | 5, without high availability | Replications fail if NameNode failover occurs during replication |
| HDFS | 4 or 5 | 4 or 5 | Over-the-wire encryption is enabled | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 5 | File and directory names contain URL-encoding characters such as % | 4 or 5 | 4 | |
| Hive | 4 or 5 | 4 or 5 | Over-the-wire encryption is enabled and Replicate HDFS Files is enabled | 4 or 5 | 4 | |
| Hive | 4 or 5 | 4 or 5 | From one cluster to the same cluster | 4 or 5 | 4 or 5 | From one cluster to the same cluster |
| HDFS, Hive | 4 or 5 | 5 | The replicated data includes a directory that contains a large number of files or subdirectories (several hundred thousand entries), causing out-of-memory errors. To work around this issue, follow the procedure below. | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 5 | The clusters use different Kerberos realms | 4 or 5 | 4 | An older JDK is deployed. (Upgrade the CDH 4 cluster to use JDK 7 or JDK 6u34 to work around this issue.) |
| Hive | 4 or 5 | 5 | Replicate HDFS Files is enabled and the clusters use different Kerberos realms | 4 or 5 | 4 | An older JDK is deployed. (Upgrade the CDH 4 cluster to use JDK 7 or JDK 6u34 to work around this issue.) |
| Any | 4 or 5 | 5 | SSL enabled on Hadoop services and YARN | 4 or 5 | 4 or 5 | |
| Any | 4 or 5 | 5 | SSL enabled on Hadoop services | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 5, with high availability enabled | Replications fail if NameNode failover occurs during replication | 4 or 5 | 4, without high availability | Replications fail if NameNode failover occurs during replication |
| HDFS, Hive | 5 | 5 | | 4 | 4 | |
| Hive | 5.2 | 5.2 or lower | Replication of Impala UDFs is skipped | 4 or 5 | 4 or 5 | |
To work around the out-of-memory issue noted in the table above, increase the heap size used for replication:

- On the destination Cloudera Manager instance, go to the HDFS service page.
- Click the Configuration tab.
- Select the appropriate scope and category.
- Locate the HDFS Replication Advanced Configuration Snippet property.
- Increase the heap size by adding a key-value pair, for example, HADOOP_CLIENT_OPTS=-Xmx1g. In this example, -Xmx1g sets the heap size to 1 GB. Adjust this value according to the number of files and directories being replicated.
- Click Save Changes to commit the changes.
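For example, the key-value pair entered in step 5 might look like the following in the snippet field (the 1 GB figure is illustrative, not a sizing recommendation; larger listings may need a larger heap such as -Xmx4g):

```
HADOOP_CLIENT_OPTS=-Xmx1g
```

The snippet is an environment safety valve, so the value takes effect the next time a replication job runs; no service restart is required for replication jobs started after the change is saved.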