Using Iceberg replication policies

Iceberg replication policies can replicate Iceberg tables between Data Lakes through Data Hubs in version 7.3.2 or higher on AWS. The Data Lakes can be located in a single AWS region or in different regions.

Before you create an Iceberg replication policy, you must deploy the source Iceberg Replication Data Hub in the source Data Lake and the target Iceberg Replication Data Hub in the target Data Lake. The deployed Data Hubs provide the Hive database, the table metadata details, the source location of the tables, and, optionally, compute resources for the replication process. Replication occurs between S3-backed Data Lakes using HDFS protocols (or DistCp).

When you create an Iceberg replication policy, you can do the following:
  • Select the source and target Data Hubs.
  • Choose the Iceberg tables to replicate.
  • Configure location mapping, performance, and tuning parameters.
  • Set a schedule frequency to replicate the tables at required intervals.
  • Specify the exact tables to replicate using inclusion and exclusion filters. These filters are evaluated during each replication job run.

    In the inclusion filters, you specify the tables to include in the replication policy. Each inclusion filter consists of a database (or namespace) and a table pattern defined as a Java regular expression (regex). You can also specify a table pattern that discovers tables created since the last replication job run.

    In the exclusion filters, you specify the tables to exclude from the replication policy.
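
To illustrate how a database plus a Java regex table pattern can select tables, the following is a minimal sketch, not the product's actual implementation. The class and method names (`TableFilter`, `IcebergFilterSketch`, `isSelected`) are hypothetical. The sketch also makes three assumptions that the text above does not confirm: the database name is compared exactly, the regex must match the whole table name, and an exclusion match overrides an inclusion match.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter model: a database (or namespace) plus a Java regex for table names.
final class TableFilter {
    final String database;
    final Pattern tablePattern;

    TableFilter(String database, String tableRegex) {
        this.database = database;
        this.tablePattern = Pattern.compile(tableRegex);
    }

    // Assumption: exact database match, and the regex must match the entire table name.
    boolean matches(String db, String table) {
        return database.equals(db) && tablePattern.matcher(table).matches();
    }
}

final class IcebergFilterSketch {
    // Assumption: a table is selected when at least one inclusion filter matches
    // and no exclusion filter matches.
    static boolean isSelected(String db, String table,
                              List<TableFilter> include, List<TableFilter> exclude) {
        boolean included = include.stream().anyMatch(f -> f.matches(db, table));
        boolean excluded = exclude.stream().anyMatch(f -> f.matches(db, table));
        return included && !excluded;
    }

    public static void main(String[] args) {
        List<TableFilter> include = List.of(new TableFilter("sales", "orders_.*"));
        List<TableFilter> exclude = List.of(new TableFilter("sales", ".*_tmp"));

        System.out.println(isSelected("sales", "orders_2024", include, exclude)); // true
        System.out.println(isSelected("sales", "orders_tmp", include, exclude));  // false
        System.out.println(isSelected("sales", "customers", include, exclude));   // false
    }
}
```

Because the filters are evaluated on every job run, a pattern such as `orders_.*` in this sketch would also pick up a table like `orders_2025` that is created after the policy is defined, which mirrors the discovery behavior described above.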