Before you create an Iceberg replication policy, you must complete the prerequisites.
Iceberg replication policies can replicate Iceberg V2 tables that were created using Spark
(read-only with Impala) between Cloudera Base on premises clusters running version 7.1.9 or
higher with Cloudera Manager 7.11.3 or higher. Starting from
Cloudera Base on premises 7.3.1, Replication Manager can
also replicate V1 and V2 Iceberg tables created using Hive.
You can use one or more Iceberg replication policies to
replicate a database from the source cluster to the target cluster. You must ensure
that you replicate the database in one direction only, from the source to the target, to
maintain a single source of truth for the database.
Ensure that the source and target clusters run Cloudera Base on premises 7.1.9 or higher with Cloudera Manager 7.11.3 or higher.
Ensure that the source and target clusters have the same Cloudera Manager major
version.
Activate the Iceberg Replication parcel. The parcel might be included in your
Cloudera Runtime distribution or in a separate distribution. For more
information, contact your Cloudera account team.
Add the Iceberg Replication service on both the source and target
clusters.
To add a service, go to the Cloudera Manager > Clusters > [***CLUSTER NAME***] page and click Actions > Add Service. For more information, see Adding a Service.
To replicate Atlas metadata, ensure that you have the Atlas user credentials
in addition to the Replication Administrator or Full Administrator
role. The atlas user must also
have the relevant read and write permissions on the staging locations.
If you have enabled Cloudera Lakehouse Optimizer in your AWS or Azure
environment, ensure that the service is disabled and is not available in your
target cluster. This service is available from Cloudera on cloud 7.3.1.500
onwards.
If the Cloudera Lakehouse Optimizer service is available in the target
cluster and a compaction maintenance task is scheduled to run on the
replicated tables, the Cloudera Lakehouse Optimizer policy runs the
compaction maintenance task on the replicated Iceberg tables. By default, if the
metadata.json file of the target cluster is absent on
the source cluster, Replication Manager initiates a bootstrap replication in the
subsequent Iceberg replication policy job. During the bootstrap job, Replication
Manager copies the already replicated small files from the source cluster to the
target cluster again. The Cloudera Lakehouse Optimizer policy detects these
new small files and triggers another compaction maintenance task. This leads to a
repetitive cycle that negates the benefit of the compaction task.
For more
information about Cloudera Lakehouse Optimizer, see Lakehouse
Optimize.
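The version prerequisites above can be checked before you create a policy. The following is a minimal sketch in Python and is not part of any Cloudera tool: the function names, the dictionary keys `runtime` and `cm`, and the assumption that versions are purely numeric dotted strings (for example, `7.1.9` or `7.11.3`) are all hypothetical and chosen only for illustration. How you obtain the version strings for each cluster is left out.

```python
# Minimum versions taken from the prerequisites above.
MIN_RUNTIME = (7, 1, 9)   # Cloudera Base on premises
MIN_CM = (7, 11, 3)       # Cloudera Manager


def parse_version(text):
    """Convert a numeric dotted version such as '7.11.3' into a tuple of ints.

    Assumption: the string contains only digits and dots.
    """
    return tuple(int(part) for part in text.strip().split("."))


def check_prerequisites(source, target):
    """Return a list of version problems; an empty list means none were found.

    `source` and `target` are hypothetical dicts with the keys
    'runtime' (cluster version) and 'cm' (Cloudera Manager version).
    """
    problems = []
    for name, cluster in (("source", source), ("target", target)):
        if parse_version(cluster["runtime"]) < MIN_RUNTIME:
            problems.append(f"{name}: cluster version {cluster['runtime']} is below 7.1.9")
        if parse_version(cluster["cm"]) < MIN_CM:
            problems.append(f"{name}: Cloudera Manager {cluster['cm']} is below 7.11.3")
    # Both clusters must have the same Cloudera Manager major version.
    if parse_version(source["cm"])[0] != parse_version(target["cm"])[0]:
        problems.append("source and target Cloudera Manager major versions differ")
    return problems


if __name__ == "__main__":
    source = {"runtime": "7.1.9", "cm": "7.11.3"}
    target = {"runtime": "7.3.1", "cm": "7.13.1"}
    for problem in check_prerequisites(source, target):
        print(problem)
```

Comparing tuples of integers rather than raw strings avoids the pitfall where `"7.9.0"` sorts after `"7.11.3"` as text even though it is the lower version.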