How Iceberg replication policy works

Replication Manager performs several steps to replicate the Iceberg tables when you create or run an Iceberg replication policy.

The following list shows a few high-level steps that are completed during the replication process:

  1. Determines the tables to replicate depending on the choice you made during the Iceberg replication policy creation process.
  2. Reads the table names to fetch the checkpoint for the tables from the target cluster HMS. A checkpoint is the metadata about the latest Iceberg snapshot for a table on the target cluster and is saved in an HDFS file.
  3. Initiates the exportCLI command in the source cluster to generate a list of files (manifest files, data files, and delete files) to copy from the source cluster to the target cluster.
  4. Copies the files from the source cluster to the target cluster using DistCp jobs which takes advantage of the transfer bandwidth of the target cluster.
    The job copies the data files directly to the target data root directory, and it copies the metadata files to a temporary staging location where it is further processed as explained in the next step.
  5. Transforms the copied manifest files to point to the correct manifest files pointers and data files on the target cluster, deletes the pre-transformed manifest files, and updates the target HMS with the latest snapshot.