Creating Iceberg replication policy

You can create Iceberg replication policies in CDP Private Cloud Base 7.1.9 clusters using Cloudera Manager 7.11.3 or higher versions.

  1. Add the source cluster as a peer to the target cluster. An Iceberg replication policy requires a replication peer to locate the source data. You can use an existing peer or add a new peer.
    For information about adding a source cluster as a peer, see Adding cluster as a peer.
  2. Go to the Cloudera Manager > Replication > Replication Policies page in the target cluster where the peer is set up.
  3. Click Create Replication Policy > Iceberg Replication Policy.
    The Create Iceberg Replication Policy wizard appears.
  4. Configure the following options on the General tab:
    Option Description
    Policy Name Enter a unique name for the replication policy.
    Source Choose the source cluster that has the required peer, the required source data to replicate, and the source Iceberg Replication service.
    Destination Choose the target cluster that has the required target Iceberg Replication service.

    The dropdown list shows the clusters that are managed by the current Cloudera Manager.

    Schedule Choose:
    • Immediate to run the replication policy immediately after policy creation is complete.
    • Once to run the schedule one time in the future. Set the date and time.
    • Recurring to run the schedule periodically in the future. Set the date, time, and interval between runs.

    You must consider the following factors before you configure the replication frequency or recurring schedule:

    • The anticipated rate of change and the frequency of the schedule can predict the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) during a disaster recovery process. Therefore, choose a schedule that provides an optimal RTO and RPO.
    • Recurring frequency impacts the compute load on the entire system. That is, frequent replication affects the overall compute capacity of the participating nodes in the replication process which in turn can impact the other workloads running on these nodes.
    Inclusion Table Filters Enter the one or more database and table names to include for replication. The table name can be a Java Regular Expression, or the complete table name that is stored in the catalog.
    Exclusion Table Filters Enter the one or more database and table names to exclude from replication. The table name can be a Java Regular Expression, or the complete table name that is stored in the catalog.
  5. Configure the following options on the Resources tab, Replication Manager uses these options to optimize the DistCp jobs during data replication:
    Option Description
    Custom YARN Queue Optional. Enter the name of the YARN queue for the cluster to which the replication job is submitted if you are using Capacity Scheduler queues to limit resource consumption. The default value for this field is default.
    Maximum Number of Copy Mappers Optional. Enter the number of map slots per mapper, as required. The default value is 20.
    Maximum Bandwidth Per Copy Mapper Optional. Enter the bandwidth per mapper, as required. The default value for the bandwidth is 100 MB per second for each mapper.

    The total bandwidth used by the replication policy is equal to Maximum Bandwidth multiplied by Maximum Map Slots. Therefore, you must ensure that the bandwidth and map slots you choose do not impact other tasks or network resources in the target cluster.

    Adjust this setting so that each map task is throttled to consume only the specified bandwidth.

    Each map task (simultaneous copy) is restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy in such a way that the net bandwidth used tends towards the specified value. You can adjust this setting so that each map task is throttled to consume only the specified bandwidth and the net bandwidth used tends towards the specified value.

  6. Click Create.
The replication policy appears on the Replication Policies page. It can take up to 15 seconds for the task to appear.

If you selected Immediate in the Schedule field, the replication job starts replicating after you click Save Policy.