Use cases for Replication Manager in CDP Public Cloud

Replication Manager is a service in CDP Public Cloud. You can create replication policies in Replication Manager to copy and migrate data from CDH (version 5.13 and higher) clusters (HDFS, Hive, and HBase data) and CDP Private Cloud Base (version 7.1.4 and higher) clusters (HDFS, Hive external tables, and HBase data) to CDP Public Cloud clusters. You can also replicate HDFS data from classic clusters (CDH, CDP Private Cloud Base, and HDP clusters) to Public Cloud buckets. The supported Public Cloud services include Amazon S3 and Microsoft Azure ADLS Gen2 (ABFS).

You can use Replication Manager for a variety of use cases. Some major use cases include:
  • Implementing a complete backup and restore solution.

    You might want to implement a backup and restore solution for HDFS data or Hive external tables. You create the replication policy based on the type of data you want to backup and restore. To implement this use case, you back up data in ClusterA to ClusterB. When the need arises, you can create another replication policy to restore the data from ClusterB to ClusterA.

  • Migrating legacy data (for example, from CDH clusters) to CDP Public Cloud clusters on Amazon S3 and Microsoft Azure ADLS Gen2 (ABFS).

    Replication Manager supports replicating HDFS data, Hive external tables, and HBase data from CDH (version 5.13 and higher) clusters (HDFS, Hive, and HBase data) to CDP Public Cloud clusters. Before you create replication policies, you must prepare the clusters and complete the prerequisites.

    Alternatively, you can use CDP CLI to create replication policies to replicate HDFS data and Hive external tables.

  • Replicating the required data to another cluster to run workloads or analytics.

    Sometimes, you might want to move workloads, especially heavy-duty workloads to another cluster to reduce the load and optimize the performance of the primary cluster, or run analytics on the required data on another cluster because you do not want to overload the primary cluster. In such scenarios, you can create replication policies in Replication Manager to replicate the required data at regular intervals. After replication, you can use the required tools to analyze the data.