Iceberg replication policies

Iceberg replication policies replicate Iceberg V2 tables created using Spark (read-only with Impala) between CDP Private Cloud Base 7.1.9 or higher clusters that use Cloudera Manager 7.11.3 or higher.

Starting from CDP Private Cloud Base 7.1.9 SP1 with Cloudera Manager 7.11.3 CHF7, Iceberg replication policies also let you:

  • enter the maximum number of snapshots to process in an export batch.
  • add one or more key-value pairs to the hdfs-site.xml and core-site.xml files on the source and target clusters.
  • replicate Atlas metadata for Iceberg tables.
  • replicate Iceberg tables residing in custom directories.

Apache Iceberg is a cloud-native, high-performance open table format for organizing petabyte-scale analytic datasets on a file system or object store. Iceberg supports ACID-compliant tables, including row-level deletes and updates, and can define large analytic tables over open-format data files.
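
For example, row-level updates and deletes can be issued directly in Spark SQL against an Iceberg V2 table. The following is a minimal PySpark sketch, not the replication workflow itself; the database and table names are illustrative, and it assumes a Spark session already configured with the Iceberg runtime and SQL extensions.

```python
from pyspark.sql import SparkSession

# Illustrative session; assumes the Iceberg runtime and SQL extensions
# (org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions)
# are already configured for the cluster.
spark = SparkSession.builder.appName("iceberg-v2-demo").getOrCreate()

# format-version 2 is what enables row-level deletes and updates.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.events (
        id     BIGINT,
        status STRING
    )
    USING iceberg
    TBLPROPERTIES ('format-version' = '2')
""")

spark.sql("INSERT INTO db.events VALUES (1, 'open'), (2, 'open')")

# Each row-level mutation commits a new snapshot to the table.
spark.sql("UPDATE db.events SET status = 'closed' WHERE id = 1")
spark.sql("DELETE FROM db.events WHERE id = 2")
```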

Iceberg replication policies:

  • replicate the metadata and catalog from the source cluster Hive Metastore (HMS) to the target cluster HMS.

    The catalog, stored in HMS, contains the pointer to the current metadata file. The metadata file contains the snapshots, each snapshot points to a manifest list, the manifest list references manifest files, and the manifest files in turn point to the data files. You can inspect this chain with the snapshot queries shown after this list.

  • replicate the data files in HDFS from the source cluster to the target cluster. Iceberg replication policies can replicate only between HDFS storage systems.
  • replicate all the snapshots from the source cluster, which allows you to run time travel queries on the target cluster (see the time travel example after this list).
  • replicate at the table level.

    Unless you are on CDP Private Cloud Base 7.1.9 SP1 with Cloudera Manager 7.11.3 CHF7 or higher, ensure that the tables are in the default warehouse location; earlier versions of Iceberg replication policies do not replicate tables in a custom location.
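
The following PySpark sketch illustrates the snapshot chain and time travel points above: it lists the replicated snapshots through Iceberg's snapshots metadata table and then queries the table as of one of them. The table name db.events and the snapshot id are illustrative, and the VERSION AS OF / TIMESTAMP AS OF syntax assumes Spark 3.3 or higher with the Iceberg runtime configured.

```python
from pyspark.sql import SparkSession

# Illustrative session; assumes the Iceberg runtime is on the classpath.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Every Iceberg table exposes metadata tables; <table>.snapshots lists each
# snapshot together with the manifest list it points to.
spark.sql("""
    SELECT snapshot_id, committed_at, operation, manifest_list
    FROM db.events.snapshots
    ORDER BY committed_at
""").show(truncate=False)

# Time travel by snapshot id (illustrative id taken from the query above).
snapshot_id = 1234567890123456789
spark.sql(f"SELECT * FROM db.events VERSION AS OF {snapshot_id}").show()

# Time travel by timestamp.
spark.sql("SELECT * FROM db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
```

Because Iceberg replication policies copy all snapshots to the target cluster, these queries return the same results on the target as on the source.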

You can use Iceberg replication policies to:
  • replicate Iceberg tables between on-premises clusters to archive data or run analytics.
  • implement passive disaster recovery with planned failover and incremental replication at regular intervals between two similar systems, for example, from one HDFS system to another.