Iceberg replication policies
Iceberg replication policies replicate Iceberg V1 and V2 tables stored on HDFS and Ozone between Cloudera Base on premises clusters.
The following table lists the Iceberg replication features and their minimum
supported versions:
| Feature | Minimum supported versions |
|---|---|
| Replicate Iceberg V1 and V2 tables created using Spark (read-only with Impala) stored on HDFS | Cloudera Base on premises 7.1.9* using Cloudera Manager 7.11.3* |
| Replicate Iceberg V1 and V2 tables created using Hive stored on HDFS | Cloudera Base on premises 7.3.1* using Cloudera Manager 7.13.1* |
| Replicate Iceberg V1 and V2 tables created using Spark (read-only with Impala) or Hive stored on Ozone buckets | Cloudera Base on premises 7.3.2* using Cloudera Manager 7.13.2* |
| *The Cloudera Base on premises version must be associated with the correct Cloudera Manager version, including the CHF versions. Iceberg replication currently requires the source and target clusters to run the same full version of Cloudera Base on premises and Cloudera Manager. Running different versions might result in failures such as missing intermediate Iceberg metadata files. | |
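The version rules in the table can be sketched as a small pre-flight check. This is only an illustration, not part of any Cloudera tool: the names `MIN_VERSIONS`, `Cluster`, `version_key`, and `check_replication` are hypothetical, version strings are assumed to be dotted numbers, and the 7.3.1 entry for Hive tables on HDFS is taken to be the Cloudera Base on premises version. The sketch does not verify that a Cloudera Base on premises version is paired with the correct Cloudera Manager version.

```python
from dataclasses import dataclass

# Minimum versions transcribed from the table above, keyed by
# (engine that created the table, storage system). Each value is
# (Cloudera Base on premises, Cloudera Manager). Hypothetical structure.
MIN_VERSIONS = {
    ("spark", "hdfs"): ((7, 1, 9), (7, 11, 3)),
    ("hive", "hdfs"): ((7, 3, 1), (7, 13, 1)),
    ("spark", "ozone"): ((7, 3, 2), (7, 13, 2)),
    ("hive", "ozone"): ((7, 3, 2), (7, 13, 2)),
}


def version_key(text: str) -> tuple:
    """Turn '7.3.1' (or '7.3.1.500') into (7, 3, 1) for comparison.

    Assumes dotted numeric strings; only the first three parts are
    used when checking the minimum version.
    """
    return tuple(int(part) for part in text.split(".")[:3])


@dataclass(frozen=True)
class Cluster:
    base_version: str     # full Cloudera Base on premises version
    manager_version: str  # full Cloudera Manager version


def check_replication(source: Cluster, target: Cluster,
                      engine: str, storage: str) -> list:
    """Return a list of problems; an empty list means the checks pass."""
    problems = []
    # Source and target must run the same full versions, so compare
    # the complete strings rather than the truncated tuples.
    if source != target:
        problems.append("source and target must run the same full versions")
    min_base, min_manager = MIN_VERSIONS[(engine, storage)]
    for label, cluster in (("source", source), ("target", target)):
        if version_key(cluster.base_version) < min_base:
            problems.append(f"{label}: Cloudera Base on premises is older than required")
        if version_key(cluster.manager_version) < min_manager:
            problems.append(f"{label}: Cloudera Manager is older than required")
    return problems


old = Cluster("7.1.9", "7.11.3")
new = Cluster("7.3.1", "7.13.1")
print(check_replication(new, new, "hive", "hdfs"))   # matching, supported versions
print(check_replication(old, new, "spark", "hdfs"))  # versions differ between clusters
```

Comparing the full version strings for equality, and only the leading three components against the minimum, mirrors the two separate rules in the footnote and the table.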
Apache Iceberg is a cloud-native, high-performance open table format for
organizing petabyte-scale analytic datasets on a file system or object store. Iceberg supports
ACID-compliant tables, including row-level deletes and updates, and can define large analytic
data tables using open format files.
Iceberg replication policies provide the following functionalities:
- Replicating metadata and catalog from the source cluster Hive Metastore (HMS) to the target cluster HMS.
- Replicating data files in the HDFS storage system from the source cluster to the target cluster. The Iceberg replication policies can replicate only between HDFS storage systems.
- Replicating data at table level.
- Replicating all the snapshots from the source cluster, which allows you to run time travel queries on the target cluster.
Some use cases for Iceberg replication policies are to:
- replicate Iceberg tables between on-premises clusters to archive data or run analytics, and
- implement passive disaster recovery with planned failover and perform incremental replication at regular intervals between two similar systems, for example, from one HDFS system to another.
This video demonstrates how an Iceberg replication policy replicates multiple Iceberg tables from different locations in the source cluster to a target cluster in a single replication job. It also showcases the Hive on Iceberg feature, which allows you to replicate Iceberg tables created by Hive and Impala, and a use case for the location mapping feature, which maps the source path to the target path. These features are available in Cloudera Base on premises 7.3.1 using Cloudera Manager 7.13.1 and higher versions.
