Creating Iceberg replication policy
You can create Iceberg replication policies in Cloudera Replication Manager.
- Go to the Replication Manager > Replication Policies page.
- Click Create Policy.
  The Create Replication Policy wizard is displayed.
- On the General page, configure the following details:
  - Type: Select Iceberg to create an Iceberg replication policy.
  - Policy Name: Enter a unique name for the replication policy.
  - Description: Optional. Enter a brief description of the replication policy.
- Click Next.
- On the Select Source page, configure the following details:
  - Source Data Hub: Choose the source Iceberg Replication Data Hub on the source Data Lake. If the required Iceberg Replication Data Hub does not appear, ensure that it is deployed and that the Iceberg Replication feature is enabled in the Data Hub.
  - Inclusion Filters: Enter the database or schema name and the table name. The table name can be a Java regular expression (regex) pattern. Replication Manager includes these tables in the replication policy job runs.
  - Exclusion Filters: Optional. Specify the tables to exclude from the replication policy.
  - Replicate Column Statistics: Select to replicate the column statistics of the tables.
  - Run as Username on Source: Enter the username that runs the replication policy. This username overrides the default hdfs username.
- Click Next.
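The way inclusion and exclusion filters combine can be sketched as follows. This is an illustration only, not Replication Manager code: the function name `select_tables` and the table names are hypothetical, and it assumes that a pattern must match the whole table name and that exclusions are applied after inclusions. Python's `re` module is used for the sketch; for simple patterns like these it behaves the same as Java regex.

```python
import re

def select_tables(tables, include_pattern, exclude_pattern=None):
    """Return the table names kept by an inclusion regex and an optional exclusion regex."""
    include = re.compile(include_pattern)
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    selected = []
    for name in tables:
        # Assumption: a pattern has to match the entire table name.
        if not include.fullmatch(name):
            continue
        # Assumption: exclusion filters are evaluated after inclusion filters.
        if exclude is not None and exclude.fullmatch(name):
            continue
        selected.append(name)
    return selected

# Hypothetical table names in a hypothetical database.
tables = ["sales_2022", "sales_2023", "sales_tmp", "customers"]
kept = select_tables(tables, r"sales_.*", r".*_tmp")
print(kept)
```

Testing a pattern this way before you enter it in the wizard can help catch a regex that is broader than intended.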
- On the Select Destination page, configure the following options:
  - Destination Data Hub: Select the target Iceberg Replication Data Hub on the target Data Lake. If the required Iceberg Replication Data Hub does not appear, ensure that it is deployed and that the Iceberg Replication feature is enabled in the Data Hub.
  - Validate Access on Each Run: Select to verify during each job run that the source or target cluster has the required access to the source or target bucket. By default, access is verified only once, after the replication policy creation process is complete.
  - Run DistCp on Source: Select to run the transfer step on the source cluster during each job run. By default, the transfer step runs on the destination cluster. The transfer step location determines the IAM bucket access requirements. When the transfer step runs on the target cluster, the target cluster’s IAM requires read access to the source data location. When the transfer step runs on the source cluster, the source cluster’s IAM requires read and write access to the destination data locations.
  - Alternative Staging Location: Enter an alternate staging location if you do not want to use the default location to stage the intermediate work. The default location varies with the environment. For example, the default location might be [*** CLOUD DATA ROOT BUCKET FROM ENVIRONMENT ***]/user/replication on the target.
  - Location Mapping: Configure to override the path mapping when copying files.
- Click Validate Access.
  Replication Manager performs the following steps:
  - Validates the following IAM role bucket permissions:
    - Source IAM role:
      - Read access to the source data locations.
      - Read and write access to the staging location.
      - Write access to the destination warehouse if the DistCp jobs run on the source.
    - Target IAM role:
      - Write access to the target data locations.
      - Read access to the staging location.
      - Read access to the source data locations if the DistCp jobs run on the destination.
  - Validates whether the Cloudera Manager peer exists. Otherwise, Replication Manager verifies whether a peer can be created from the target cluster to the source cluster.
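The IAM role permissions listed above depend on where the DistCp jobs run. The following minimal sketch encodes that dependency as a lookup. The function name `required_permissions` and the dictionary layout are hypothetical and chosen only for illustration; the sketch does not call any Replication Manager or cloud provider API.

```python
def required_permissions(distcp_on_source):
    """Return the bucket permissions each IAM role needs, keyed by role and location."""
    source = {
        "source data locations": {"read"},
        "staging location": {"read", "write"},
    }
    target = {
        "target data locations": {"write"},
        "staging location": {"read"},
    }
    if distcp_on_source:
        # DistCp on the source: the source role writes to the destination warehouse.
        source["destination warehouse"] = {"write"}
    else:
        # DistCp on the destination (default): the target role reads the source data.
        target["source data locations"] = {"read"}
    return {"source": source, "target": target}

default_perms = required_permissions(distcp_on_source=False)
source_perms = required_permissions(distcp_on_source=True)
print(default_perms["target"])
print(source_perms["source"])
```

A checklist like this can be useful when you prepare the IAM roles before clicking Validate Access.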
- Click Next after the Validate Access process is successfully completed.
- On the Schedule page, configure the following information:
  - Run Now: Starts replicating the data after the replication policy creation is complete. Select the frequency to replicate data periodically.
  - Schedule Run: Runs the replication policy at a later time. Specify the date and time for the first run, and then set the frequency to replicate data periodically.
  - Frequency: Select one of the following options:
    - Does Not Repeat
    - Custom: In the Custom Recurrence dialog box, set the time, date, and frequency to run the policy.
  Replication Manager ensures that exactly the same number of seconds elapses between runs. For example, if you set the Start Time to January 19, 2022 11:06 AM and the Interval to 1 day, Replication Manager runs the replication policy for the first time at the specified time in the time zone where it was created. Each subsequent run occurs exactly 1 day (24 hours or 86400 seconds) after the previous one.
- Click Next.
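The fixed-interval behavior described for the Schedule page can be worked through with a short calculation. This is a sketch of the arithmetic only, using the start time and interval from the example; the function name `run_times` is hypothetical, and time zone handling is left out.

```python
from datetime import datetime, timedelta

def run_times(start, interval, count):
    """Return the first `count` run times, each exactly `interval` after the previous one."""
    return [start + i * interval for i in range(count)]

# Start Time and Interval taken from the example: January 19, 2022 11:06 AM, 1 day.
start = datetime(2022, 1, 19, 11, 6)
runs = run_times(start, timedelta(days=1), 3)
for run in runs:
    print(run.isoformat())
```

Because the interval is a fixed number of seconds, every gap between consecutive runs in this sketch is 86400 seconds.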
- On the Additional Settings page, configure the values as necessary. You can configure these advanced parameters for specific purposes, depending on your requirements:
  - YARN Queue Name: Enter the name of the YARN queue for the replication job if you are using Capacity Scheduler queues to limit resource consumption. The default value for this field is default.
  - Maximum Maps Slots: Set the maximum number of map tasks (simultaneous copies) per replication job. The default value is 20.
  - Maximum Bandwidth: Specify the maximum bandwidth for each copy (map) task. The default value is 100 MB per second for each mapper or copy task.
  - Batch Size: Enter the maximum number of snapshots to process per export batch. A high volume of source changes affects the time taken by each replication run. Setting this limit controls the amount of work processed in a single batch, which improves throughput and makes the replication run time more predictable. By default, this field is empty, meaning the job processes all available snapshots in an export batch.
  - Alerts: Select when to generate alerts for the replication job: On Failure, On Start, On Success, or On Abort.
  - Advanced Configuration Snippet (Safety Valve) for source hdfs-site.xml: Add one or more key-value pairs to the hdfs-site.xml file on the source cluster. New key-value pairs are added to the file. Existing key-value pairs are overwritten in the file.
  - Advanced Configuration Snippet (Safety Valve) for source core-site.xml: Add one or more key-value pairs to the core-site.xml file on the source cluster. New key-value pairs are added to the file. Existing key-value pairs are overwritten in the file.
  - Advanced Configuration Snippet (Safety Valve) for destination hdfs-site.xml: Add one or more key-value pairs to the hdfs-site.xml file on the target cluster. New key-value pairs are added to the file. Existing key-value pairs are overwritten in the file.
  - Advanced Configuration Snippet (Safety Valve) for destination core-site.xml: Add one or more key-value pairs to the core-site.xml file on the target cluster. New key-value pairs are added to the file. Existing key-value pairs are overwritten in the file.
- Click Create.
The replication policy is displayed on the Replication Policies page. If you selected Run Now on the Schedule page, the replication job starts replicating after you click Create.
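To get a feel for the Maximum Maps Slots, Maximum Bandwidth, and Batch Size settings on the Additional Settings page, the following sketch works through the arithmetic. Both function names are hypothetical, the snapshot IDs are made up, and the result is a theoretical ceiling rather than an expected transfer rate. How Replication Manager actually schedules batches is not covered here; the sketch only shows how a batch size partitions a list of snapshots.

```python
def aggregate_bandwidth_mb_per_s(max_maps=20, per_map_mb_per_s=100):
    """Upper bound on total copy bandwidth: map tasks times per-task bandwidth."""
    return max_maps * per_map_mb_per_s

def split_into_batches(snapshots, batch_size=None):
    """Partition snapshots into batches; an empty Batch Size means one batch with everything."""
    if batch_size is None:
        return [list(snapshots)]
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(snapshots[i:i + batch_size])
            for i in range(0, len(snapshots), batch_size)]

# With the documented defaults: 20 map tasks at 100 MB per second each.
ceiling = aggregate_bandwidth_mb_per_s()

# Hypothetical snapshot IDs, for illustration only.
snapshot_ids = [101, 102, 103, 104, 105]
batches = split_into_batches(snapshot_ids, batch_size=2)
single_batch = split_into_batches(snapshot_ids)
print(ceiling, batches, single_batch)
```

Lowering either the map slots or the per-task bandwidth lowers the ceiling proportionally, which is one way to reason about limiting the load a replication job places on the network.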
