Creating Iceberg replication policy

Iceberg replication policies replicate Iceberg V1 and V2 tables that were created using Spark (read-only with Impala) between Cloudera Private Cloud Base 7.1.9 or higher clusters managed by Cloudera Manager 7.11.3 or higher. Starting with Cloudera Private Cloud Base 7.3.1, Replication Manager can also replicate Iceberg V1 and V2 tables created using Hive.

  1. Add the source cluster as a peer to the target cluster. The replication policy requires a replication peer to locate the source data. You can use an existing peer or add a new peer.
    For information about adding a source cluster as a peer, see Adding cluster as a peer.
  2. Go to the Cloudera Manager > Replication > Replication Policies page in the target cluster where the peer is set up.
  3. Click Create Replication Policy > Iceberg Replication Policy.
    The Create Iceberg Replication Policy wizard appears.
  4. Configure the following options on the General tab:
    Policy Name: Enter a unique name for the replication policy.
    Source: Choose the source cluster that has the required peer, the required source data to replicate, and the source Iceberg Replication service.
    Destination: Choose the target cluster that has the required target Iceberg Replication service.

    The drop-down list shows the clusters that are managed by the current Cloudera Manager.

    Schedule: Choose one of the following options:
    • Immediate to run the replication policy immediately after policy creation is complete.
    • Once to run the schedule one time in the future. Set the date and time.
    • Recurring to run the schedule periodically in the future. Set the date, time, and interval between runs.

    You must consider the following factors before you configure the replication frequency or recurring schedule:

    • The anticipated rate of data change and the frequency of the schedule determine the achievable RTO (Recovery Time Objective) and RPO (Recovery Point Objective) during a disaster recovery process. Therefore, choose a schedule that provides an optimal RTO and RPO.
    • The recurrence frequency affects the compute load on the entire system. Frequent replication consumes compute capacity on the nodes that participate in the replication process, which in turn can impact other workloads running on those nodes.
    Inclusion Table Filters: Enter one or more database and table names to include in replication. A table name can be a Java regular expression or the complete table name as stored in the catalog. Use "|" to separate table names. See the example after this table.
    Exclusion Table Filters: Enter one or more database and table names to exclude from replication. A table name can be a Java regular expression or the complete table name as stored in the catalog.
    Run as Username: Enter the username that runs the MapReduce job. By default, MapReduce jobs run as hdfs.

    To run the MapReduce job as a different user, enter the username here. If you are using Kerberos, you must provide a username, and its user ID must be greater than 1000.

    Replicate Table Column Statistics: Select this option to copy the table column statistics associated with the chosen Iceberg tables.
    Alternate target data root: Optional. Specify an alternate root location for all the tables in the replication scope. All the Iceberg table data and metadata are rebased to this location while the source directory structure is kept intact. For example, with a hypothetical alternate root of /replica, table data stored at /warehouse/db/tbl on the source is written to /replica/warehouse/db/tbl on the target.
    Replicate Atlas Metadata: Select this option to copy the Atlas metadata and data lineage associated with the chosen Iceberg tables. For more information, see How Atlas metadata replication for Iceberg tables work.
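    The following sketch is illustrative only: the database and table names are hypothetical, and whether a filter matches fully qualified database.table names or bare table names is an assumption made for this example. The snippet demonstrates only standard Java regular expression alternation with "|", which is what the filter fields accept:

      import java.util.regex.Pattern;

      public class TableFilterExample {
          public static void main(String[] args) {
              // Hypothetical inclusion filter: every sales_* table in db1, plus db2.orders.
              // "|" separates the alternative names, as in the filter fields above.
              Pattern filter = Pattern.compile("db1\\.sales_.*|db2\\.orders");

              System.out.println(filter.matcher("db1.sales_2024").matches()); // true
              System.out.println(filter.matcher("db2.orders").matches());     // true
              System.out.println(filter.matcher("db2.customers").matches());  // false
          }
      }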
  5. Configure the following options on the Resources tab. Replication Manager uses these options to optimize the DistCp jobs during data replication:
    Custom YARN Queue: Optional. Enter the name of the YARN queue for the cluster to which the replication job is submitted if you are using Capacity Scheduler queues to limit resource consumption. The default value is default.
    Maximum Number of Copy Mappers: Optional. Enter the maximum number of map tasks (simultaneous copies) to use for the replication job, as required. The default value is 20.
    Maximum Bandwidth Per Copy Mapper: Optional. Enter the maximum bandwidth for each mapper, as required. The default value is 100 MB per second for each mapper.

    The total bandwidth that the replication policy can use equals the Maximum Bandwidth Per Copy Mapper multiplied by the Maximum Number of Copy Mappers. Therefore, ensure that the bandwidth and mapper count you choose do not impact other tasks or network resources in the target cluster.

    Each map task (simultaneous copy) is throttled to consume only the specified bandwidth. The throttling is not always exact: the map task reduces its bandwidth consumption during a copy so that the net bandwidth used tends toward the specified value.
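
    For example, with the default values, 20 copy mappers at 100 MB per second each allow a single policy run to consume up to 20 × 100 = 2,000 MB per second of aggregate bandwidth; lower either value to reduce this ceiling. As background, these settings correspond to the standard DistCp -m (maximum simultaneous copies) and -bandwidth (MB per second per map) options.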

  6. Configure the following options on the Advanced tab:
    Use Batch Size: Select this option and enter the maximum number of snapshots to process in an export batch. This limits the amount of work processed in a single batch and can improve throughput.

    By default, this option is cleared so that all available snapshots are processed in a single export batch.

    JVM Options for Export: Enter comma-separated JVM options to use for the export process during the Iceberg replication policy run. For example, JAVA_OPTS = -XX:ErrorFile=file.log

    Some JVM options that you can use are -Xms256m to specify the minimum heap size, -Xmx512m to specify the maximum heap size, and -DmyProperty=value to set the system property myProperty to the required value.

    JVM Options for XFer: Enter comma-separated JVM options to use for the transfer process during the Iceberg replication policy run.
    JVM Options for Sync: Enter comma-separated JVM options to use for the sync process during the Iceberg replication policy run.
    Advanced Configuration Snippet (Safety Valve) for source hdfs-site.xml: Add one or more key-value pairs to the hdfs-site.xml file on the source cluster. New key-value pairs are added to the file, and existing key-value pairs are overwritten.
    Advanced Configuration Snippet (Safety Valve) for source core-site.xml: Add one or more key-value pairs to the core-site.xml file on the source cluster. New key-value pairs are added to the file, and existing key-value pairs are overwritten.
    Advanced Configuration Snippet (Safety Valve) for destination hdfs-site.xml: Add one or more key-value pairs to the hdfs-site.xml file on the target cluster. New key-value pairs are added to the file, and existing key-value pairs are overwritten.
    Advanced Configuration Snippet (Safety Valve) for destination core-site.xml: Add one or more key-value pairs to the core-site.xml file on the target cluster. New key-value pairs are added to the file, and existing key-value pairs are overwritten.
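
    As an illustrative sketch only: each key-value pair is a standard Hadoop configuration property, which ends up in the generated *-site.xml file in the usual property format. The property below is an arbitrary, real HDFS client setting chosen only to show the shape of an entry, not a recommendation:

      <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
      </property>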
  7. Click Create.
The replication policy appears on the Replication Policies page. It can take up to 15 seconds for the policy to appear.

If you selected Immediate in the Schedule field, the replication job starts after you click Create.