Best practices for Iceberg in CDP

Based on large-scale TPC-DS benchmarking, performance testing, and real-world experience, Cloudera recommends the following best practices when using Iceberg.

Follow these key best practices when using Iceberg:

  • Use Iceberg as intended for analytics.

    The table format is designed to manage a large, slow-changing collection of files. For more information, see the Iceberg spec.

  • Reduce read amplification.

    Monitor the growth of positional delta files, and perform timely compactions.
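    For example, compaction can be triggered from Spark SQL. The procedure name follows the open-source Iceberg Spark extensions; spark_catalog and db.target are placeholder names, and availability depends on your Iceberg/CDP version:

    -- Compact small data files and apply accumulated position deletes:
    CALL spark_catalog.system.rewrite_data_files(table => 'db.target');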

  • Speed up DROP TABLE performance and prevent deletion of the underlying data files by setting the following table properties:
    external.table.purge=false and gc.enabled=false
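    For example (db.target is a placeholder table name):

    ALTER TABLE db.target SET TBLPROPERTIES (
      'external.table.purge' = 'false',
      'gc.enabled' = 'false'
    );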
  • Tune the following table properties to improve write concurrency and reduce commit failures: commit.retry.num-retries (default: 4) and commit.retry.min-wait-ms (default: 100).
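    For example, to allow more retries with a longer initial backoff (the values shown are illustrative, and db.target is a placeholder table name):

    ALTER TABLE db.target SET TBLPROPERTIES (
      'commit.retry.num-retries' = '10',
      'commit.retry.min-wait-ms' = '250'
    );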
  • Maintain a relatively small number of data files under the Iceberg table/partition directory for efficient reads. To alleviate poor performance caused by too many small files, run the following queries:
    TRUNCATE TABLE target;
    INSERT OVERWRITE TABLE target select * from target FOR SYSTEM_VERSION AS OF <preTruncateSnapshotId>; 
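    The pre-truncate snapshot ID can be looked up in the table's snapshots metadata table, in an engine that exposes Iceberg metadata tables (target is the same table as above):

    SELECT snapshot_id, committed_at, operation
    FROM target.snapshots
    ORDER BY committed_at DESC;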
  • To minimize the number of delete files and file handles and improve performance, ensure that the Spark write.distribution-mode table property is set to "hash" (the default for Spark writes since Iceberg 1.2.0).
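    If needed, the mode can be set explicitly (db.target is a placeholder table name):

    ALTER TABLE db.target SET TBLPROPERTIES (
      'write.distribution-mode' = 'hash'
    );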

The following topics list additional best practices.