Best practices for Iceberg in Cloudera
Based on large-scale TPC-DS benchmark testing, performance testing, and real-world experience, Cloudera recommends the following best practices when using Iceberg:
- Use Iceberg as intended for analytics. The table format is designed to manage a large, slow-changing collection of files. For more information, see the Iceberg spec.
- Reduce read amplification. Monitor the growth of position delete files, and perform timely compactions.
- Speed up DROP TABLE performance and prevent deletion of data files by setting the following table properties: external.table.purge=false and gc.enabled=false.
- Tune the following table properties to improve concurrency on writes and reduce commit failures: commit.retry.num-retries (default: 4) and commit.retry.min-wait-ms (default: 100).
- Maintain a relatively small number of data files under the Iceberg table/partition directory for efficient reads. To alleviate poor performance caused by too many small files, run the following queries: TRUNCATE TABLE target; INSERT OVERWRITE TABLE target SELECT * FROM target FOR SYSTEM_VERSION AS OF <preTruncateSnapshotId>;
- To minimize the number of delete files and file handles and improve performance, ensure that the Spark write.distribution.mode table property is set to "hash" (the default setting for Spark Iceberg 1.2.0 onwards).
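For the read-amplification point above, compaction can be run from Spark SQL with Iceberg's built-in procedures. This is a sketch, not a prescribed command: the catalog name `spark_catalog` and table `db.target` are illustrative, and the second procedure assumes Iceberg 1.3.0 or later.

```sql
-- Rewrite small data files and apply accumulated position deletes
-- (catalog and table names are illustrative).
CALL spark_catalog.system.rewrite_data_files(table => 'db.target');

-- On Iceberg 1.3.0 and later, position delete files themselves can
-- also be compacted.
CALL spark_catalog.system.rewrite_position_delete_files(table => 'db.target');
```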
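The drop-table guidance above can be applied with ALTER TABLE. A minimal sketch, assuming a table named `target`:

```sql
-- With these properties set, DROP TABLE removes only table metadata;
-- the underlying data files are left in place.
ALTER TABLE target SET TBLPROPERTIES (
  'external.table.purge' = 'false',
  'gc.enabled' = 'false'
);
```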
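The commit-retry tuning above can be expressed the same way. The values below are illustrative for a table with many concurrent writers, not recommendations (defaults: 4 retries, 100 ms minimum wait):

```sql
-- More retries and a longer minimum backoff reduce commit failures
-- under write contention (values are illustrative).
ALTER TABLE target SET TBLPROPERTIES (
  'commit.retry.num-retries' = '8',
  'commit.retry.min-wait-ms' = '200'
);
```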
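The small-files rewrite above requires the snapshot ID captured before the TRUNCATE. One way to find it is Iceberg's snapshots metadata table; a sketch, assuming Spark SQL and a table named `target`:

```sql
-- Record the current snapshot id before truncating.
SELECT snapshot_id, committed_at
FROM target.snapshots
ORDER BY committed_at DESC;

-- Rewrite the table from that snapshot, consolidating many small files.
-- Replace <preTruncateSnapshotId> with the id captured above.
TRUNCATE TABLE target;
INSERT OVERWRITE TABLE target
SELECT * FROM target FOR SYSTEM_VERSION AS OF <preTruncateSnapshotId>;
```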
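The write-distribution guidance above can also be set explicitly on a table rather than relying on the Spark Iceberg 1.2.0+ default. A sketch, again using the illustrative table name `target`:

```sql
-- Hash-distributing rows on write clusters the data for each partition
-- into fewer files, reducing delete files and open file handles.
ALTER TABLE target SET TBLPROPERTIES ('write.distribution.mode' = 'hash');
```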
 
