Replicating Iceberg tables stored in OBS and FSO buckets

Learn how to replicate Iceberg tables stored in OBS and FSO buckets created using S3 gateway APIs.

  1. Install the AWS CLI on only one host in the source cluster.
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
    unzip awscliv2.zip
    sudo ./aws/install
  2. Install the AWS CLI on only one host in the target cluster by running the same commands as in the previous step.
  3. If you are using secure clusters, import the certificate from the Cloudera Manager global truststore (cm-auto-global_truststore.jks) into the default Java truststore (cacerts) on all the hosts of the source cluster.
    1. Search for ssl.client.truststore.location and ssl.client.truststore.password in the /etc/ozone/conf.cloudera.OZONE-1/ozone-site.xml file to get the Cloudera Manager global truststore location and password.
    2. Get the alias name for the cluster certificate using the /usr/java/default/bin/keytool -list -v -keystore [*** SSL.CLIENT.TRUSTSTORE.LOCATION ***] command.
      For example, the alias name cmrootca-0 in the cmrootca-/root/cert_cmrootca-0/[*** REMOTE HOST NAME ***] location is available on all Auto-TLS clusters for replication.
    3. Import the certificate to the Java default truststore using the /usr/java/default/bin/keytool -importkeystore -destkeystore /usr/java/default/lib/security/cacerts -srckeystore [*** SSL.CLIENT.TRUSTSTORE.LOCATION ***] -srcalias [*** ALIAS FOUND IN PREVIOUS COMMAND ***] command.
  4. If you are using secure clusters, import the certificate from the Cloudera Manager global truststore (cm-auto-global_truststore.jks) into the default Java truststore (cacerts) on all the hosts of the target cluster.
    1. Search for ssl.client.truststore.location and ssl.client.truststore.password in the /etc/ozone/conf.cloudera.OZONE-1/ozone-site.xml file to get the Cloudera Manager global truststore location and password.
    2. Get the alias name for the cluster certificate using the /usr/java/default/bin/keytool -list -v -keystore [*** SSL.CLIENT.TRUSTSTORE.LOCATION ***] command.
    3. Import the certificate to the Java default truststore using the /usr/java/default/bin/keytool -importkeystore -destkeystore /usr/java/default/lib/security/cacerts -srckeystore [*** SSL.CLIENT.TRUSTSTORE.LOCATION ***] -srcalias [*** ALIAS FOUND IN PREVIOUS COMMAND ***] command.
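The truststore lookup in steps 3 and 4 can be scripted. The following is a minimal sketch, not a product utility: get_site_property is a hypothetical helper name, and it assumes that ozone-site.xml uses the usual Hadoop property layout with each <name> and <value> element on a single line.

```shell
# Hypothetical helper: print the value of one property from a Hadoop-style
# *-site.xml file. Assumes each <name> and <value> element fits on one line.
get_site_property() {
  # $1 = path to the XML file, $2 = property name
  awk -v name="$2" '
    /<name>/ {
      n = $0
      sub(/.*<name>/, "", n)
      sub(/<\/name>.*/, "", n)
      found = (n == name)
    }
    found && /<value>/ {
      v = $0
      sub(/.*<value>/, "", v)
      sub(/<\/value>.*/, "", v)
      print v
      exit
    }
  ' "$1"
}

# Example usage on a cluster host (paths taken from the steps above):
#   site=/etc/ozone/conf.cloudera.OZONE-1/ozone-site.xml
#   truststore=$(get_site_property "$site" ssl.client.truststore.location)
#   /usr/java/default/bin/keytool -list -v -keystore "$truststore"
```

The helper prints nothing when the property is absent, so check for an empty result before passing it to keytool.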
  5. Generate the S3 secrets from Ozone Manager on any one host of the source cluster, and record the AWS access key and AWS secret key.
    1. On a Kerberos-enabled cluster, run the kinit -kt /cdep/keytabs/om.keytab om command.
    2. Search for the om.service.id property using the cat /etc/ozone/conf.cloudera.OZONE-1/ozone-site.xml command.
    3. Get the secret using the ozone s3 getsecret --om-service-id=[*** OM SERVICE ID ***] command.
  6. Generate the S3 secrets from Ozone Manager on any one host of the target cluster, and record the AWS access key and AWS secret key.
    1. On a Kerberos-enabled cluster, run the kinit -kt /cdep/keytabs/om.keytab om command.
    2. Search for the om.service.id property using the cat /etc/ozone/conf.cloudera.OZONE-1/ozone-site.xml command.
    3. Get the secret using the ozone s3 getsecret --om-service-id=[*** OM SERVICE ID ***] command.
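Recording the keys from steps 5 and 6 by hand is error-prone, so the output can be converted into export statements instead. This is a sketch under an assumption: it expects ozone s3 getsecret to print awsAccessKey=... and awsSecret=... lines, which you should confirm against the output on your release. The function name secret_to_exports is hypothetical.

```shell
# Hypothetical helper: read `ozone s3 getsecret` output on stdin and print
# matching export statements. Assumes the output contains lines of the form
# awsAccessKey=<value> and awsSecret=<value>; all other lines are ignored.
# Values are wrapped in single quotes, so they must not contain a single quote.
secret_to_exports() {
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      awsAccessKey=*)
        printf "export AWS_ACCESS_KEY_ID='%s'\n" "${line#awsAccessKey=}" ;;
      awsSecret=*)
        printf "export AWS_SECRET_ACCESS_KEY='%s'\n" "${line#awsSecret=}" ;;
    esac
  done
}

# Example usage on a cluster host:
#   ozone s3 getsecret --om-service-id=[*** OM SERVICE ID ***] | secret_to_exports
```

The printed lines have the same form as the environment-variable commands used later in this procedure.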
  7. Configure the AWS access key and AWS secret key on one of the source cluster hosts, using one of the following sets of commands:
    • To set the credentials using environment variables, use the following commands:
      export AWS_ACCESS_KEY_ID=[*** ACCESS KEY ***]
      export AWS_SECRET_ACCESS_KEY=[*** SECRET ACCESS KEY ***]
    • To set the credentials using the AWS CLI, use the following commands:
      aws configure set aws_access_key_id [*** ACCESS KEY ***]
      aws configure set aws_secret_access_key [*** SECRET ACCESS KEY ***]
  8. Configure the AWS access key and AWS secret key on one of the target cluster hosts, using one of the following sets of commands:
    • To set the credentials using environment variables, use the following commands:
      export AWS_ACCESS_KEY_ID=[*** ACCESS KEY ***]
      export AWS_SECRET_ACCESS_KEY=[*** SECRET ACCESS KEY ***]
    • To set the credentials using the AWS CLI, use the following commands:
      aws configure set aws_access_key_id [*** ACCESS KEY ***]
      aws configure set aws_secret_access_key [*** SECRET ACCESS KEY ***]
  9. Retrieve the S3 gateway endpoint URL from Cloudera Manager.
    1. Go to Cloudera Manager > Clusters > OZONE service > Instances > S3 Gateway > S3 Gateway Web UI.
    2. Record the endpoint URL. On a secure cluster, the endpoint format is https://[*** HOST NAME ***]:9879. On an unsecure cluster, the format is http://[*** HOST NAME ***]:9878.
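The two endpoint formats in step 9 can be captured in a small function so that later commands do not hard-code the scheme and port. This is an illustrative sketch: s3g_endpoint is a hypothetical name, and the ports are the ones stated in the step above, which may differ if your S3 gateway is configured with non-default ports.

```shell
# Hypothetical helper: build the S3 gateway endpoint URL from a host name.
# $1 = S3 gateway host name
# $2 = "secure" for a secure cluster; any other value means unsecure
s3g_endpoint() {
  if [ "$2" = "secure" ]; then
    printf 'https://%s:9879\n' "$1"
  else
    printf 'http://%s:9878\n' "$1"
  fi
}

# Example usage: list the buckets visible through the gateway.
#   aws s3api --endpoint-url "$(s3g_endpoint [*** HOST NAME ***] secure)" list-buckets
```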
  10. Create Ozone buckets, and then provide access to Hive, Spark, and Impala to create Iceberg databases and tables.
    On secure clusters, you need the SSL client truststore location, which is available in the /etc/ozone/conf.cloudera.OZONE-1/ozone-site.xml file.
    1. To create the OBS buckets using AWS CLI on non-Kerberos clusters, run the aws s3api --endpoint [*** S3 GATEWAY ENDPOINT ***] create-bucket --bucket [*** BUCKET NAME ***] command.
    2. To create the OBS buckets using AWS CLI on Kerberos-enabled clusters, run the following commands:
      kinit -kt /cdep/keytabs/om.keytab om
      aws s3api --endpoint [*** S3 GATEWAY ENDPOINT ***] create-bucket --bucket [*** BUCKET NAME ***] \
        --ca-bundle=[*** SSL.CLIENT.TRUSTSTORE.LOCATION ***]
    3. To create OBS buckets using the Ozone shell, run the ozone sh bucket create s3v/[*** BUCKET NAME ***] --layout OBJECT_STORE command.
    4. To create FSO buckets using the Ozone shell, run the ozone sh bucket create s3v/[*** BUCKET NAME ***] --layout FILE_SYSTEM_OPTIMIZED command.
  11. Disable filesystem paths in the Ozone service configuration by adding the following key-value pair to the Cloudera Manager > Clusters > OZONE service > Configuration > Ozone Service Advanced Configuration Snippet (Safety Valve) for ozone-conf/ozone-site.xml property:
    ozone.om.enable.filesystem.paths = false
  12. Configure the S3A client properties to provide Hive, Spark, and Impala access to the bucket.
    1. Go to Cloudera Manager > Clusters > HDFS service > Configuration > Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property.
    2. Add the following key-value pairs:
      fs.s3a.bucket.[*** BUCKET NAME ***].access.key = [*** AWS ACCESS KEY ***]
      fs.s3a.bucket.[*** BUCKET NAME ***].secret.key = [*** AWS SECRET KEY ***]
      fs.s3a.endpoint = [*** S3 ENDPOINT ***]
      fs.s3a.bucket.probe = 0
      fs.s3a.change.detection.version.required = false
      fs.s3a.path.style.access = true
      fs.s3a.change.detection.mode = none
      fs.s3a.impl.disable.cache = true
    3. Save your changes and refresh the stale configurations.
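If you prefer to paste the step 12 properties as XML rather than typing each key-value pair, the following sketch generates them for one bucket. It assumes that the safety valve field accepts Hadoop-style <property> elements, which you should verify in your Cloudera Manager version. The function names emit_property and s3a_properties_xml are hypothetical, and values are not XML-escaped.

```shell
# Hypothetical helper: print one Hadoop-style <property> element.
emit_property() {
  printf '<property><name>%s</name><value>%s</value></property>\n' "$1" "$2"
}

# Hypothetical helper: print the S3A properties from step 12 for one bucket.
# $1 = bucket name, $2 = AWS access key, $3 = AWS secret key, $4 = S3 endpoint
s3a_properties_xml() {
  emit_property "fs.s3a.bucket.$1.access.key" "$2"
  emit_property "fs.s3a.bucket.$1.secret.key" "$3"
  emit_property "fs.s3a.endpoint" "$4"
  emit_property "fs.s3a.bucket.probe" "0"
  emit_property "fs.s3a.change.detection.version.required" "false"
  emit_property "fs.s3a.path.style.access" "true"
  emit_property "fs.s3a.change.detection.mode" "none"
  emit_property "fs.s3a.impl.disable.cache" "true"
}

# Example usage:
#   s3a_properties_xml [*** BUCKET NAME ***] [*** AWS ACCESS KEY ***] \
#     [*** AWS SECRET KEY ***] [*** S3 ENDPOINT ***]
```

The output contains the secret key in clear text, so avoid redirecting it to a shared or world-readable file.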
  13. Create the Iceberg table.
    create table tb1(id int, val int) stored by iceberg location 's3a://[*** BUCKET ***]/[*** KEY ***]';
  14. Enable the ‘Iceberg on Ozone replication’ feature flag.
  15. Add the source cluster as a peer for replication.
  16. Create the Iceberg replication policy by providing the following mandatory details in the Create Iceberg replication policy wizard:
    1. On the General tab, set the Source Storage Filter to S3.
    2. On the Advanced tab, set the Location Mapping field to s3a://[*** SOURCE BUCKET ***] ---> s3a://[*** TARGET BUCKET ***].