Replication Manager in CDP Private Cloud Base

Replication Manager is a service in Cloudera Manager. You can create replication policies in this service to replicate data across data centers for use cases such as disaster recovery, running hybrid workloads, migrating data to or from the cloud, and generic backup and restore. You can also create HDFS or HBase snapshot policies to take snapshots of HDFS directories and HBase tables, respectively.

The Cloudera Manager Admin Console provides the following key functions that Replication Manager can leverage:

  • Select datasets that are critical for your business operations.
  • Monitor and track progress of your snapshots and replication jobs through a central console and easily identify issues or files that failed to be transferred.
  • Issue alerts when a snapshot or replication job fails or is aborted so that you can diagnose the problem quickly.

You can also use Cloudera Manager to schedule, save, and restore snapshots of HDFS directories and HBase tables.

Replication Manager provides the following functionalities that you can use to accomplish your data replication goals:

HDFS replication policies

These policies replicate HDFS data and metadata from CDH (version 5.10 and higher) clusters to CDP Private Cloud Base (version 7.0.3 and higher) clusters.

Some use cases where you can use HDFS replication policies include:

  • copying data from legacy on-premises systems to Amazon S3 or Microsoft ADLS Gen2 (ABFS) cloud buckets, or from cloud buckets to on-premises systems.
  • replicating required data to another cluster and running load-intensive workflows there, which improves the primary cluster's performance.
  • deploying a complete backup-restore solution for your enterprise.

Hive external table replication policies

These policies replicate HDFS data, Hive external tables (without manual translation of Hive datasets to HDFS datasets, or vice versa), Hive metastore data, Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore, Impala data, and Sentry permissions to Ranger from CDH (version 5.10 and higher) clusters to CDP Private Cloud Base (version 7.0.3 and higher) clusters. Because the table definitions are updated during replication, applications that depend on external table definitions stored in Hive can operate on both the replica and the source.

Some use cases where you might find these replication policies useful are to:

  • back up legacy data for future use or archive cold data
  • replicate or move data to cloud clusters to run analytics
  • implement a complete backup and disaster recovery solution

HDFS and HBase snapshot policies

These policies take regular point-in-time snapshots of HDFS directories and HBase tables, respectively.

Snapshots act as a backup, and you can restore an HDFS directory or an HBase table to a previous version or to another location on the same HDFS or HBase service as necessary. Snapshots are also used by replication policies. The first replication policy run replicates all the data and metadata from the chosen directories. Subsequent replication policy runs leverage HDFS snapshot diffs to replicate only the changed data.
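
The difference between the first run and later runs can be illustrated with a small, self-contained Python sketch. This is a conceptual model only, not how Replication Manager or HDFS is implemented: directories are modeled as plain dictionaries that map a path to a content fingerprint, and the names snapshot_diff and replicate are hypothetical helpers introduced for this illustration.

```python
def snapshot_diff(old, new):
    """Compare two snapshots (dicts of path -> content fingerprint)."""
    return {
        "created": sorted(p for p in new if p not in old),
        "deleted": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
    }


def replicate(source, target, previous_snapshot=None):
    """Replicate source into target.

    Returns the snapshot taken for this run and the number of files copied.
    Without a previous snapshot, everything is copied (first run).
    With one, only the entries reported by the diff are applied.
    """
    snapshot = dict(source)  # toy stand-in for a point-in-time snapshot

    if previous_snapshot is None:
        # First run: full copy of all data.
        target.clear()
        target.update(snapshot)
        return snapshot, len(snapshot)

    # Later runs: apply only what changed between the two snapshots.
    diff = snapshot_diff(previous_snapshot, snapshot)
    for path in diff["deleted"]:
        target.pop(path, None)
    changed = diff["created"] + diff["modified"]
    for path in changed:
        target[path] = snapshot[path]
    return snapshot, len(changed)


# Hypothetical example data: two files on the source, empty target.
source = {"/data/a": "v1", "/data/b": "v1"}
target = {}

# First run copies everything.
snap1, copied_first = replicate(source, target)

# The source changes: one file modified, one added, one removed.
source["/data/b"] = "v2"
source["/data/c"] = "v1"
del source["/data/a"]

# Second run copies only the created and modified files.
snap2, copied_next = replicate(source, target, snap1)
```

In this model, the second run copies two files instead of rescanning and recopying the whole directory, which is the benefit that snapshot diffs are meant to provide on large datasets.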