How Iceberg replication policy works

Replication Manager performs several steps to replicate the Iceberg tables when you create or run an Iceberg replication policy.

The following list shows a few high-level steps that are completed during the replication process:

Determines the tables to replicate depending on the choice you made during the Iceberg replication policy creation process.
Reads the table names to fetch the checkpoint for the tables from the target cluster HMS. A checkpoint is the metadata about the latest Iceberg snapshot for a table on the target cluster and is saved in an HDFS file.
Initiates the exportCLI command in the source cluster to generate a list of files (manifest files, data files, and delete files) to copy from the source cluster to the target cluster.
Copies the files from the source cluster to the target cluster using DistCp jobs which takes advantage of the transfer bandwidth of the target cluster.
The job copies the data files directly to the target data root directory, and it copies the metadata files to a temporary staging location where it is further processed as explained in the next step.
Transforms the copied manifest files to point to the correct manifest files pointers and data files on the target cluster, deletes the pre-transformed manifest files, and updates the target HMS with the latest snapshot.

How Atlas metadata replication for Iceberg tables work🔗

Atlas metadata for the chosen Iceberg tables can be replicated using Iceberg replication policies.

During the Iceberg replication policy creation process, if you:

choose the General > Replicate Atlas Metadata option, Replication Manager:
1. runs a bootstrap replication for all the chosen Iceberg tables and its Atlas metadata during the first replication policy run. Bootstrap replication replicates all the available Iceberg data and its associated Atlas metadata.
2. runs incremental replication on the Iceberg data and its Atlas metadata during subsequent replication runs. Here, the delta data and metadata gets replicated during each run.
choose to replicate an Iceberg table that was created using 'create table as select (CTAS)', Replication Manager sets the Skip lineage option to false and the Fetch type option to CONNECTED during the Iceberg replication policy run.
tip
You can create an Iceberg table based on an existing Hive, Spark, or Impala table using the CTAS query where you can optionally include a partitioning spec for the created table.

Use case🔗

You have an original or base table named T1. You create table T2 using CTAS from T1. Similarly, you create T3 from T2, and T4 from T3. During the Iceberg replication policy creation process, you choose T2 as source table, and then choose Replicate Atlas metadata. In this scenario, Replication Manager performs the following tasks during the replication policy run:

Sets Skip lineage to false, and Fetch type to CONNECTED during Atlas replication step.
Replicates T2 and all the Atlas entities connected to it, which includes the hdfs_path.
Replicates T1 and T3 Iceberg tables.