Replication Manager in Cloudera Public Cloud

Replication Manager is a service in Cloudera Public Cloud. You can create replication policies in Replication Manager to copy and migrate data from CDH (version 5.13 and higher) clusters (HDFS, Hive, and HBase data) and Cloudera Private Cloud Base (version 7.1.4 and higher) clusters (HDFS, Hive external tables, and HBase data) to Cloudera Public Cloud clusters. You can also replicate HDFS data from cloud storage to classic clusters (CDH or Cloudera Private Cloud Base clusters), and Hive external tables to Data Hubs. The supported Cloudera Public Cloud services include Amazon S3 and Microsoft Azure ADLS Gen2 (ABFS). Replicating Hive managed tables using Replication Manager from HDP clusters to Cloudera Public Cloud is a beta feature and is not available for general use.

Before you create replication policies, you must ensure that the clusters are supported by Replication Manager. For more information, see Support matrix for Replication Manager on Cloudera Public Cloud.

You can access the Replication Manager service on the Cloudera Public Cloud web interface. To replicate data between clusters, add the source on-premises clusters as classic clusters on the Cloudera Management Console > Clusters page, add/create one or more Cloudera Public Cloud SDX Data Lakes and/or Data Hubs, and then create the replication policies in Replication Manager. The Replication Policies page shows the progress and status of replication policy jobs. You can also use CDP CLI to create HDFS and Hive replication policies.

Replication Manager provides the following functionalities that you can use to accomplish your data replication goals:

HDFS replication policies

These policies replicate HDFS data and metadata from on-premises clusters (CDH, Cloudera Private Cloud Base, and HDP) to Public Cloud storage buckets such as S3 and ABFS, and from cloud storage to classic clusters (CDH or Cloudera Private Cloud Base clusters). You can choose the frequency of replicating data.

Some use cases where you can use HDFS replication policies include:
  • Moving legacy data (from CDH clusters) to cloud deployments (AWS or Azure on Cloudera Public Cloud).
  • Archiving cold data.
  • Replicating the required data to another cluster to run analytics on it.

Hive replication policies

These policies support table-level replication and can replicate Hive external tables from on-premises clusters (CDH and Cloudera Private Cloud Base) to cloud storage such as S3 and ABFS and to Data Hubs. They also can:
  • replicate data stored in Hive tables, Hive metadata, data in Hive metastore, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore, and
  • migrate Sentry permissions to Ranger.
Some use cases where you can use Hive replication policies include:
  • Backing up data periodically.
  • Performing a recovery operation when necessary.
  • Creating a development and test system for engineers to run quality checks.

HBase replication policies

You can create these policies to replicate HBase data from a source classic cluster (CDH or Cloudera Private Cloud Base cluster), COD, or Data Hub to a target Data Hub or COD cluster. You can also copy or replicate HBase data between different environments within a Virtual Private Cloud (VPC) using these policies. Any future data change in the source cluster is pushed to the target cluster automatically without user intervention.

Some use cases where you can use HBase replication policies include:
  • Performing an active-active disaster recovery with conflict resolution (enabling other disaster recovery use cases which provides an efficient utilization of resources).
  • Copying required data to the cloud clusters for heavy-duty analytics workloads which helps to optimize on-premises cluster performance.
  • Utilizing the continuous data synchronize feature to implement a hybrid cloud that in turn helps you to use it in various other use cases.

CDP CLI for HDFS and Hive replication policies

You can also use CDP CLI commands to create HDFS and Hive replication policies. The CDP CLI commands for Replication Manager are under the replicationmanager CDP CLI option. For more information, see CDP CLI for Replication Manager.

CDP CLI is a unified tool to manage all the Cloudera Public Cloud services. This gives you the flexibility to use it across services from a single pane, view required information in a single scroll (for example, you can view all available clusters’ service status in a single page), and collaborate to troubleshoot issues.