About Replication Manager

Replication Manager is a service in CDP Public Cloud. You can create replication policies in Replication Manager to copy and migrate data to CDP Public Cloud clusters from CDH clusters (version 5.13 and higher; HDFS, Hive, and HBase data) and from CDP Private Cloud Base clusters (version 7.1.4 and higher; HDFS, Hive external tables, and HBase data). You can also replicate HDFS data from cloud storage to classic clusters (CDH or CDP Private Cloud Base clusters), and Hive external tables to Data Hubs. The supported cloud storage services include Amazon S3 and Microsoft Azure ADLS Gen2 (ABFS). Replicating Hive managed tables from HDP clusters to CDP Public Cloud using Replication Manager is a beta feature and is not available for general use.

Before you create replication policies, you must ensure that the source cluster is supported by Replication Manager. For more information, see Support matrix for CDP Public Cloud Replication Manager.
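
As a rough illustration of this pre-check, the following Python sketch encodes only the minimum source versions and data types named in this overview. The table and the function names (SUPPORTED_SOURCES, parse_version, replicable_data) are hypothetical and are not part of Replication Manager or CDP CLI; the support matrix remains the authoritative reference.

```python
# Illustrative only: minimum versions and data types as stated in this overview.
# This is NOT the official support matrix and not a Replication Manager API.
SUPPORTED_SOURCES = {
    "CDH": {
        "min_version": (5, 13),
        "data": ("HDFS", "Hive", "HBase"),
    },
    "CDP Private Cloud Base": {
        "min_version": (7, 1, 4),
        "data": ("HDFS", "Hive external tables", "HBase"),
    },
}


def parse_version(text):
    """Turn a dotted version string such as '7.1.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def replicable_data(platform, version):
    """Return the data types this overview lists for the given source cluster.

    Returns an empty tuple when the version is below the stated minimum.
    Raises ValueError for platforms this sketch does not cover (for example
    HDP, for which this overview gives no minimum version).
    """
    entry = SUPPORTED_SOURCES.get(platform)
    if entry is None:
        raise ValueError(f"platform not covered by this sketch: {platform}")
    if parse_version(version) < entry["min_version"]:
        return ()
    return entry["data"]
```

Versions are compared as integer tuples rather than strings so that, for example, 5.9 sorts below 5.13.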

You can access the Replication Manager service from the CDP Public Cloud web interface. To replicate data between clusters, add the source on-premises clusters as classic clusters on the Management Console > Clusters page, add or create one or more CDP Public Cloud SDX Data Lakes or Data Hubs, and then create the replication policies in Replication Manager. The Replication Policies page shows the progress and status of replication policy jobs. You can also use CDP CLI to create HDFS and Hive replication policies.

Replication Manager provides the following capabilities to help you accomplish your data replication goals:

HDFS replication policies

These policies replicate HDFS data and metadata from on-premises clusters (CDH, CDP Private Cloud Base, and HDP) to Public Cloud storage buckets such as S3 and ABFS, and from cloud storage to classic clusters (CDH or CDP Private Cloud Base clusters). You can choose how frequently the data is replicated.

Some use cases where you can use HDFS replication policies include:
  • Moving legacy data (from CDH clusters) to cloud deployments (AWS or Azure on CDP Public Cloud).
  • Archiving cold data.
  • Replicating the required data to another cluster to run analytics on it.

Hive replication policies

These policies support table-level replication and can replicate Hive external tables from on-premises clusters (CDH and CDP Private Cloud Base) to cloud storage such as S3 and ABFS and to Data Hubs. They can also:

  • replicate data stored in Hive tables, Hive metadata, data in Hive metastore, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore, and
  • migrate Sentry permissions to Ranger.

Some use cases where you can use Hive replication policies include:
  • Backing up data periodically.
  • Performing a recovery operation when necessary.
  • Creating a development and test system for engineers to run quality checks.

HBase replication policies

You can create these policies to replicate HBase data from a source classic cluster (CDH or CDP Private Cloud Base cluster), COD (Cloudera Operational Database), or Data Hub to a target Data Hub or COD cluster. You can also copy or replicate HBase data between different environments within a Virtual Private Cloud (VPC) using these policies. Any future data change in the source cluster is pushed to the target cluster automatically, without user intervention.

Some use cases where you can use HBase replication policies include:
  • Performing active-active disaster recovery with conflict resolution (this enables other disaster recovery use cases and makes efficient use of resources).
  • Copying required data to the cloud clusters for heavy-duty analytics workloads, which helps optimize on-premises cluster performance.
  • Using the continuous data synchronization feature to implement a hybrid cloud, which in turn supports various other use cases.

CDP CLI for HDFS and Hive replication policies

You can also use CDP CLI commands to create HDFS and Hive replication policies. The CDP CLI commands for Replication Manager are under the replicationmanager CDP CLI option. For more information, see CDP CLI for Replication Manager.

CDP CLI is a unified tool for managing all CDP services. It lets you work across services from a single interface, view the required information in one place (for example, the service status of all available clusters on a single page), and collaborate to troubleshoot issues.