Replicating multiple large tables using CDP Public Cloud Replication Manager

You can create multiple replication policies to replicate multiple large HDFS directories, Hive external tables, or HBase tables using Replication Manager in CDP Public Cloud.

When you have multiple tables or directories to replicate, you might want to create more than one replication policy to replicate the data. This is because even if one replication policy fails, it would not stop the replication process. You can troubleshoot and restart the failed replication policy or create a new replication policy for the tables or directories in the failed replication policy.

Scenario and solution

You want to replicate multiple large HBase tables (around 6000 tables) using the incremental approach. In this approach, you replicate data in batches. For example, 500 tables at a time. This approach ensures that the source cluster is healthy because you replicate data in small batches.

The following steps explain the incremental approach in detail:

  1. You create an HBase replication policy for the first 500 tables.

    Internally, Replication Manager performs the following steps:

    1. Creates a disabled HBase peer on the source cluster or disables if the HBase peer existed.
    2. Creates a snapshot and copies it to the target cluster.

      HBase replication policies use snapshots to replicate existing HBase data; this step ensures that all the existing data existing is replicated.

    3. Restores the snapshot to appear as the table on the target.

      This step ensures that the existing data is replicated to the target cluster.

    4. Deletes the snapshot.

      The Replication Manager performs this step after the replication is successfully complete.

    5. Enables table’s replication scope for subsequent replication.
    6. Enables the peer.

      This step ensures that the accumulated data is completely replicated.

    After all the accumulated data is migrated, the HBase service continues to replicate new/changed data in this batch of tables automatically.

  2. Create another HBase replication policy to replicate the next batch of 500 tables after all the existing data and accumulated data of the first batch of tables is migrated successfully.
  3. You can continue this process until all the tables are replicated successfully.

In an ideal scenario, the time taken to replicate 500 tables of 6 TB size might take around four to five hours, and the time taken to replicate the accumulated data might be another 30 minutes to one and a half hours, depending on the speed at which the data is being generated on the source cluster. Therefore, this approach uses 12 batches and around four to five days to replicate all the 6000+ tables to Cloudera Operational Database (COD).

For more information about the cluster versions supported by Replication Manager, clusters preparation process to complete before you create the HBase replication policy, HBase replication policy creation, and for an in depth analysis of the use case, see related information.