Data Replication
Cloudera Manager provides rich functionality for replicating data (stored in HDFS or accessed through Hive) across data centers. When critical data is stored on HDFS, Cloudera Manager provides the necessary capabilities to ensure that the data is available at all times, even in the face of the complete shutdown of a data center.
- Supported Replication Scenarios
- Unsupported Replication Scenarios
- Designating a Replication Source
- HBase Replication
- Common Replication Topologies
- Points to Note about Replication
- Requirements
- Deploying HBase Replication
- Disabling Replication at the Peer Level
- Stopping Replication in an Emergency
- Initiating Replication When Data Already Exists
- Understanding How WAL Rolling Affects Replication
- Configuring Secure HBase Replication
- Restoring Data From A Replica
- Replication Caveats
- HDFS Replication
- Hive Replication
- Impala Metadata Replication
- Using Snapshots with Replication
- Enabling Replication Between Clusters in Different Kerberos Realms
- Replication of Encrypted Data
For recommendations on using data replication and Sentry authorization, see Configuring Sentry to Enable BDR Replication.
Cloudera Manager 5 supports replication between CDH 5 or CDH 4 clusters. Support for HDFS and Hive replication is as follows.
Supported Replication Scenarios
- HDFS and Hive
- Cloudera Manager 4 with CDH 4 to Cloudera Manager 5 with CDH 4.
- Cloudera Manager 5 with CDH 4 to Cloudera Manager 4.7.3 or later with CDH 4.
- Cloudera Manager 5 with CDH 4 to Cloudera Manager 5 with CDH 4.
- Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 5.
- Cloudera Manager 4 or 5 with CDH 4.4 or later to Cloudera Manager 5 with CDH 5.
- Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 4.4 or later.
- (HDFS only) Within one Cloudera Manager instance, from one directory to another directory within the same cluster or to a different cluster. Both clusters must be running CDH 4.8 or higher.
- SSL
- Between CDH 5.0 with SSL and CDH 5.0 with SSL.
- Between CDH 5.0 with SSL and CDH 5.0 without SSL.
- From a CDH 5.1 source cluster with SSL and YARN.
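The HDFS and Hive version combinations in the list above can be encoded as a small lookup, which may help when auditing many cluster pairs. The following Python sketch is illustrative only. The Cluster type and the is_supported_pair function are hypothetical names, not part of Cloudera Manager. The sketch covers only the Cloudera Manager and CDH version pairs, not the SSL cases or the HDFS-only same-instance case, and it assumes that "CDH 4.4 or later" refers to CDH 4.x releases.

```python
from typing import NamedTuple, Tuple

Version = Tuple[int, ...]


class Cluster(NamedTuple):
    """Hypothetical description of one cluster (not a Cloudera Manager type)."""
    cm: Version   # Cloudera Manager version, for example (4, 7, 3)
    cdh: Version  # CDH version, for example (4, 4, 0)


def is_supported_pair(source: Cluster, target: Cluster) -> bool:
    """Return True if the source -> target pair matches a listed HDFS/Hive scenario."""
    s_cm, s_cdh = source.cm[0], source.cdh[0]
    t_cm, t_cdh = target.cm[0], target.cdh[0]
    # Assumption: "CDH 4.4 or later" is read as a CDH 4.x release at 4.4 or above.
    src_cdh44 = s_cdh == 4 and source.cdh >= (4, 4)
    tgt_cdh44 = t_cdh == 4 and target.cdh >= (4, 4)
    rules = [
        # CM 4 with CDH 4 -> CM 5 with CDH 4
        s_cm == 4 and s_cdh == 4 and t_cm == 5 and t_cdh == 4,
        # CM 5 with CDH 4 -> CM 4.7.3 or later with CDH 4
        s_cm == 5 and s_cdh == 4 and t_cm == 4
        and target.cm >= (4, 7, 3) and t_cdh == 4,
        # CM 5 with CDH 4 -> CM 5 with CDH 4
        s_cm == 5 and s_cdh == 4 and t_cm == 5 and t_cdh == 4,
        # CM 5 with CDH 5 -> CM 5 with CDH 5
        s_cm == 5 and s_cdh == 5 and t_cm == 5 and t_cdh == 5,
        # CM 4 or 5 with CDH 4.4 or later -> CM 5 with CDH 5
        s_cm in (4, 5) and src_cdh44 and t_cm == 5 and t_cdh == 5,
        # CM 5 with CDH 5 -> CM 5 with CDH 4.4 or later
        s_cm == 5 and s_cdh == 5 and t_cm == 5 and tgt_cdh44,
    ]
    return any(rules)


cm5_cdh5 = Cluster(cm=(5, 0, 0), cdh=(5, 0, 0))
cm4_cdh4 = Cluster(cm=(4, 8, 0), cdh=(4, 5, 0))
print(is_supported_pair(cm4_cdh4, cm5_cdh5))  # CM 4 / CDH 4.5 to CM 5 / CDH 5
print(is_supported_pair(cm5_cdh5, cm4_cdh4))  # CM 5 / CDH 5 to CM 4 / CDH 4
```

Versions are written as three-component tuples so that ordinary tuple comparison handles the "or later" conditions. A pair that passes this check can still fall under one of the unsupported scenarios, so treat the result as a first filter rather than a guarantee.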
Unsupported Replication Scenarios
- HDFS and Hive
- Cloudera Manager 5 with CDH 5 as the source, and Cloudera Manager 4 with CDH 4 as the target.
- Between Cloudera Enterprise and any Cloudera Manager free edition: Cloudera Express, Cloudera Standard, Cloudera Manager Free Edition.
- Between CDH 5 and CDH 4 (in either direction) where the replicated data includes a directory that contains a large number of files or subdirectories (several hundred thousand entries), which causes out-of-memory errors. This is because of limitations in the WebHDFS API. The workaround is to increase the heap size as follows:
- On the target Cloudera Manager instance, go to the HDFS service page.
- Click the Configuration tab.
- Expand the Service-Wide category.
- Click .
- Increase the heap size by adding a key-value pair, for instance, HADOOP_CLIENT_OPTS=-Xmx1g. In this example, 1g sets the heap size to 1 GB. This value should be adjusted depending on the number of files and directories being replicated.
- Replication involving HDFS data from CDH 5 HA to CDH 4 clusters or CDH 4 HA to CDH 5 clusters fails if a NameNode failover happens during replication. This is because of limitations in the CDH WebHDFS API.
- HDFS
- Between a source cluster that has encryption over-the-wire enabled and a target cluster running CDH 4.0. This is because the CDH 4 client is used for replication in this case, and it does not support over-the-wire encryption.
- From CDH 5 to CDH 4 where there are URL-encoding characters such as % in file and directory names. This is because of a bug in the CDH 4 WebHDFS API.
- HDFS replication does not work from CDH 5 to CDH 4 with different realms when using older JDK versions. Use JDK 7 or upgrade to JDK6u34 or later on the CDH 4 cluster to avoid this issue.
- Hive
- With data replication, between a source cluster that has encryption enabled and a target cluster running CDH 4. This is because the CDH 4 client used for replication does not support encryption.
- Without data replication, between a source cluster running CDH 4 and a target cluster that has encryption enabled.
- Between CDH 4.2 or later and CDH 4, if the Hive schema contains views.
- With the same cluster as both source and destination.
- Replication from CDH 4 to CDH 5 HA can fail if a NameNode failover happens during replication.
- Hive replication from CDH 5 to CDH 4 with different realms with older JDK versions, if data replication is enabled (since this involves HDFS replication). Use JDK 7 or upgrade to JDK6u34 or later on the CDH 4 cluster to avoid this issue.
- Hive replication from CDH 4 to CDH 5 with different realms with older JDK versions (even without data replication enabled). Use JDK 7 or upgrade to JDK6u34 or later on the CDH 4 cluster to avoid this issue.
- Cloudera Manager 5.2 only supports replication of Impala UDFs if running CDH 5.2 or later. In clusters running Cloudera Manager 5.2 and a CDH version earlier than 5.2 that include Impala User-Defined Functions (UDFs), Hive replication will succeed, but replication of the Impala UDFs will be skipped.
- SSL
- From a CDH 4.x source cluster with SSL.
- From a CDH 5.0 source cluster with SSL and YARN (because of a YARN bug).
- Between CDH 5.0 with SSL and CDH 4.x.
- Kerberos
- From a source cluster configured to use Kerberos authentication to a target cluster that is not configured to use Kerberos authentication.
- From a source cluster not configured to use Kerberos authentication to a target cluster that is configured to use Kerberos authentication.
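Some of the unsupported scenarios above are simple enough to check before a replication job is scheduled. The following Python sketch shows one way to flag three of them: a Kerberos mismatch between source and target, Hive replication within a single cluster, and % characters in paths when replicating HDFS data from CDH 5 to CDH 4. The ClusterInfo class, the find_blockers function, and the message strings are hypothetical and exist only for illustration. The sketch does not query Cloudera Manager, and the remaining scenarios in the list are not covered.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ClusterInfo:
    """Hypothetical summary of a cluster, filled in by hand or by your own tooling."""
    name: str
    cdh_major: int  # 4 or 5
    kerberos: bool  # True if Kerberos authentication is configured


def find_blockers(source: ClusterInfo, target: ClusterInfo,
                  paths: Iterable[str], replicate_hive: bool = False) -> List[str]:
    """Return a description of each unsupported condition that was detected."""
    blockers = []
    # Kerberos must be configured on both clusters or on neither.
    if source.kerberos != target.kerberos:
        blockers.append("Kerberos is configured on only one of the two clusters")
    # Hive replication with the same cluster as source and destination is unsupported.
    if replicate_hive and source.name == target.name:
        blockers.append("Hive replication uses the same cluster as source and destination")
    # From CDH 5 to CDH 4, names containing '%' are affected by the CDH 4 WebHDFS bug.
    if source.cdh_major == 5 and target.cdh_major == 4:
        bad = [p for p in paths if "%" in p]
        if bad:
            blockers.append("Paths contain '%': " + ", ".join(bad))
    return blockers


src = ClusterInfo(name="prod", cdh_major=5, kerberos=True)
dst = ClusterInfo(name="backup", cdh_major=4, kerberos=False)
for problem in find_blockers(src, dst, ["/data/ok", "/data/100%done"]):
    print(problem)
```

An empty list means only that none of these three conditions was found. The version, SSL, encryption, and JDK restrictions listed above still need to be checked separately.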