Best practices for Iceberg in CDP

Based on large-scale TPC-DS benchmark and performance testing, as well as real-world experience, Cloudera recommends several best practices when using Iceberg.

Follow the best practices listed below when using Iceberg:

  • Use Iceberg as intended for analytics.

    The table format is designed to manage a large, slow-changing collection of files. For more information, see the Iceberg spec.

  • Reduce read amplification.

    Monitor the growth of positional delete (delta) files, and compact them in a timely manner.

  • To speed up DROP TABLE and prevent deletion of data files, set the following table properties:
    external.table.purge=false and gc.enabled=false
  • Tune the following table properties to improve concurrency on writes and reduce commit failures: commit.retry.num-retries (default is 4), commit.retry.min-wait-ms (default is 100)
  • Maintain a relatively small number of data files under the Iceberg table or partition directory for efficient reads. To alleviate poor performance caused by too many small files, rewrite the table from its pre-truncate snapshot by running the following queries:
    TRUNCATE TABLE target;
    INSERT OVERWRITE TABLE target SELECT * FROM target FOR SYSTEM_VERSION AS OF <preTruncateSnapshotId>;
  • To minimize the number of delete files and file handles and improve performance, ensure that the write.distribution.mode table property is set to hash for Spark writes (the default setting from Iceberg 1.2.0 onwards).
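
The table properties named in the list above can be set with ALTER TABLE ... SET TBLPROPERTIES. The following is a minimal sketch, not a definitive configuration: the table name db.target is hypothetical, and the retry values shown are illustrative assumptions rather than Cloudera recommendations. Adjust them to your workload.

```sql
-- Hypothetical table name: db.target.
-- Retry values below are illustrative assumptions, not recommended settings.

-- Keep data files when the table is dropped, which also speeds up DROP TABLE.
ALTER TABLE db.target SET TBLPROPERTIES (
  'external.table.purge' = 'false',
  'gc.enabled' = 'false'
);

-- Allow more commit retries and a longer initial wait than the defaults (4 and 100 ms).
ALTER TABLE db.target SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',
  'commit.retry.min-wait-ms' = '200'
);

-- Distribute Spark writes by hash to limit the number of delete files and file handles.
ALTER TABLE db.target SET TBLPROPERTIES (
  'write.distribution.mode' = 'hash'
);
```

Raising the retry count mainly helps when several writers commit to the same table concurrently. With a single writer, the defaults are usually sufficient.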