Data Replication

Cloudera Manager enables you to replicate data across data centers for disaster recovery scenarios. Replications can include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore. When critical data is stored on HDFS, Cloudera Manager helps to ensure that the data is available at all times, even if a data center shuts down completely.

You can also replicate HDFS data, as well as Hive data and metadata, to and from Amazon S3.

For an overview of data replication, view this video about Backing Up Data Using Cloudera Manager.

You can also use the HBase shell to replicate HBase data. (Cloudera Manager does not manage HBase replications.)

Cloudera License Requirements for Replication

Both the source and destination clusters must have a Cloudera Enterprise license.

Supported and Unsupported Replication Scenarios

Supported Replication Scenarios

Versions
To replicate data to or from clusters managed by Cloudera Manager 6, the source or destination cluster must be managed by Cloudera Manager 5.14.0 or higher. Note that some functionality may not be available in Cloudera Manager 5.14.0 and higher or 6.0.0 and higher.
Kerberos
BDR supports the following replication scenarios when Kerberos authentication is used on a cluster:
  • Secure source to a secure destination.
  • Insecure source to an insecure destination.
  • Insecure source to a secure destination. Keep the following requirements in mind:
    • In replication scenarios where a destination cluster has multiple source clusters, the source clusters must be either all secure or all insecure. BDR does not support replication from a mixture of secure and insecure source clusters.
    • The destination cluster must run Cloudera Manager 6.1.0 or higher.
    • The source cluster must run a compatible Cloudera Manager version.
    • This replication scenario requires additional configuration. For more information, see Replicating from Insecure to Secure Clusters for Hive and Replicating from Insecure to Secure Clusters for HDFS.
Cloud Storage
BDR supports replicating to or from Amazon S3, Microsoft Azure ADLS Gen1, and Microsoft Azure ADLS Gen2 (ABFS).
TLS

You can use TLS with BDR. Additionally, BDR supports replication scenarios where TLS is enabled for non-Hadoop services (Hive/Impala) and TLS is disabled for Hadoop services (such as HDFS, YARN, and MapReduce).

Unsupported Replication Scenarios

Versions
Replicating between a cluster managed by Cloudera Manager 6 and a cluster managed by a Cloudera Manager version earlier than 5.14.0 is not supported.
Kerberos
BDR does not support the following replication scenarios when Kerberos authentication is used on a cluster:
  • Secure source to an insecure destination.
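
The version and Kerberos rules above can be summarized as a small pre-flight check. The following Python sketch is illustrative only: the function names are hypothetical, it is not part of Cloudera Manager, and it encodes only the rules stated in this section (it does not check the "compatible Cloudera Manager version" requirement for the source cluster, which is not specified here).

```python
def parse_version(text):
    """Turn '5.14.0' into (5, 14, 0) so versions compare numerically."""
    parts = [int(part) for part in text.split(".")]
    # Pad short versions such as '6.1' to three components.
    return tuple(parts + [0] * (3 - len(parts)))


def peer_version_supported(peer_cm_version):
    """A peer of a Cloudera Manager 6 cluster must run 5.14.0 or higher."""
    return parse_version(peer_cm_version) >= (5, 14, 0)


def check_replication(source_secure, dest_secure, dest_cm_version):
    """Return (supported, note) for one source/destination pair."""
    if source_secure and not dest_secure:
        return False, "secure source to insecure destination is not supported"
    if not source_secure and dest_secure:
        if parse_version(dest_cm_version) < (6, 1, 0):
            return False, "destination must run Cloudera Manager 6.1.0 or higher"
        return True, "supported, with additional configuration"
    return True, "supported"


def sources_consistent(source_secure_flags):
    """All sources of one destination must be secure, or all insecure."""
    return len(set(source_secure_flags)) <= 1


if __name__ == "__main__":
    print(check_replication(False, True, "6.1.0"))
    print(check_replication(True, False, "6.3.0"))
    print(sources_consistent([True, False]))
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as treating "5.9.0" as higher than "5.14.0".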

Replicating Directories with Thousands of Files and Subdirectories

To replicate data that includes a directory with several hundred thousand files or subdirectories:
  1. On the destination Cloudera Manager instance, go to the HDFS service page.
  2. Click the Configuration tab.
  3. Select Scope > HDFS service name (Service-Wide) and Category > Advanced.
  4. Locate the HDFS Replication Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh property.
  5. Increase the heap size by adding a key-value pair, for instance, HADOOP_CLIENT_OPTS=-Xmx1g. In this example, 1g sets the heap size to 1 GB. This value should be adjusted depending on the number of files and directories being replicated.
  6. Enter a Reason for change, and then click Save Changes to commit the changes.
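
As a small illustration of step 5, the following Python sketch formats the key-value pair for a chosen heap size. The helper name is hypothetical and the sketch only builds the string; choosing an appropriate size for a given number of files and directories is left to you, since this section does not give a sizing formula.

```python
def heap_safety_valve_entry(size, unit="g"):
    """Build the HADOOP_CLIENT_OPTS entry for the safety valve field.

    size is a positive integer; unit is 'm' (megabytes) or 'g' (gigabytes),
    matching the suffixes accepted by the JVM -Xmx option.
    """
    if unit not in ("m", "g"):
        raise ValueError("unit must be 'm' or 'g'")
    if not isinstance(size, int) or size <= 0:
        raise ValueError("size must be a positive integer")
    return f"HADOOP_CLIENT_OPTS=-Xmx{size}{unit}"


if __name__ == "__main__":
    # The example value from the procedure above: a 1 GB heap.
    print(heap_safety_valve_entry(1))
    # The same setting style expressed in megabytes.
    print(heap_safety_valve_entry(2048, "m"))
```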

HDFS and Hive/Impala Replication To and From Cloud Storage

Minimum Required Role: User Administrator (also provided by Full Administrator)

To configure Amazon S3 as a source or destination for HDFS or Hive/Impala replication, you configure AWS Credentials that specify the type of authentication to use, the Access Key ID, and Secret Key. See How to Configure AWS Credentials.

To configure Microsoft ADLS as a source or destination for HDFS or Hive/Impala replication, you configure the service principal for ADLS. See Configuring ADLS Gen1 Connectivity or Configuring ADLS Gen2 Connectivity.

After configuring S3 or ADLS, you can click the Replication Schedules link to define a replication schedule. See HDFS Replication or Hive/Impala Replication for details about creating replication schedules. You can also click Close and create the replication schedules later. Select the AWS Credentials account in the Source or Destination drop-down lists when creating the schedules.

Supported Replication Scenarios for Clusters using Isilon Storage

Note the following when scheduling replication jobs for clusters that use Isilon storage:
  • In CDH 5.8 and higher, replication is supported for clusters using Kerberos and Isilon storage on the source or destination cluster, or both. See Configuring Replication with Kerberos and Isilon. Replication between clusters using Isilon storage and Kerberos is not supported in CDH 5.7.
  • Make sure that the hdfs user is a superuser in the Isilon system. If you specify alternate users with the Run As option when creating replication schedules, those users must also be superusers.
  • Cloudera recommends that you use the Isilon root user for replication jobs. (Specify root in the Run As field when creating replication schedules.)
  • Select the Skip checksum checks property when creating replication schedules.
  • Clusters that use Isilon storage do not support snapshots. Snapshots are used to ensure data consistency during replications in scenarios where the source files are being modified. Therefore, when replicating from an Isilon cluster, Cloudera recommends that you do not replicate Hive tables or HDFS files that could be modified before the replication completes.

See Using CDH with Isilon Storage.

BDR Log Retention

By default, Cloudera Manager retains BDR logs for 90 days. You can change the number of days that Cloudera Manager retains logs, or disable log retention completely. Note that for existing clusters, you must run the POST /clusters/{clusterName}/commands/expireLogs API command at least once to trigger BDR log retention.

To configure the number of days for BDR log retention, perform the following steps:
  1. In the Cloudera Manager Admin Console, search for the following property: Backup and Disaster Log Retention.
  2. Enter the number of days you want to retain logs for. To disable log retention, enter -1.
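
For existing clusters, the one-time expireLogs call mentioned above could be prepared with a short script. The following Python sketch uses only the standard library and builds the request without sending it. Only the endpoint path comes from this section; the host name, port, API version (v32), cluster name, credentials, and the use of HTTP basic authentication are placeholders or assumptions that you should check against your own Cloudera Manager deployment.

```python
import base64
import urllib.parse
import urllib.request


def build_expire_logs_request(base_url, api_version, cluster_name, user, password):
    """Build (but do not send) the POST request for the expireLogs command."""
    # Cluster names can contain spaces, so percent-encode the path segment.
    cluster = urllib.parse.quote(cluster_name, safe="")
    url = (
        f"{base_url.rstrip('/')}/api/{api_version}"
        f"/clusters/{cluster}/commands/expireLogs"
    )
    # Assumption: the API accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=b"",  # empty request body
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    req = build_expire_logs_request(
        "http://cm-host.example.com:7180", "v32", "Cluster 1", "admin", "admin"
    )
    print(req.get_method(), req.full_url)
    # To send the request, pass it to urllib.request.urlopen(req).
```

Building the request separately from sending it makes it easy to inspect the final URL before anything is executed against a live cluster.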