Best practices for Iceberg in Cloudera

Based on large-scale TPC-DS benchmark testing, performance testing, and real-world experience, Cloudera recommends several best practices when using Iceberg.

Follow the best practices listed below when using Iceberg:

Use Iceberg as intended for analytics.
The table format is designed to manage a large, slow-changing collection of files. For more information, see the Iceberg spec.
Increase parallelism to handle large manifest list files in Spark.
By default, the iceberg.worker.num-threads system property is set to the number of available processors. To speed up query compilation, try setting iceberg.worker.num-threads to a higher value.
Reduce read amplification.
Monitor the growth of positional delta files, and perform timely compactions.
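As a rough sketch of this monitoring and compaction in Spark SQL, the delete_files metadata table and the rewrite_data_files procedure could be used as shown here. The sketch assumes that the positional delta files mentioned above correspond to Iceberg position delete files, that the Iceberg SQL extensions are enabled, and that the table lives at the hypothetical identifier spark_catalog.db.target. The threshold value is illustrative only, and availability of these features depends on your Iceberg and Cloudera versions.

```sql
-- Count the position delete files currently tracked by the table.
-- In Iceberg metadata tables, content = 1 marks position deletes.
-- (spark_catalog.db.target is a placeholder identifier.)
SELECT COUNT(*) AS position_delete_files
FROM spark_catalog.db.target.delete_files
WHERE content = 1;

-- Rewrite data files that have at least one associated delete file,
-- so that subsequent reads no longer need to merge those deletes.
CALL spark_catalog.system.rewrite_data_files(
  table => 'db.target',
  options => map('delete-file-threshold', '1')
);
```

Running the count on a schedule and compacting when it crosses a limit that suits your workload is one possible way to keep compactions timely.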
Speed up DROP TABLE by preventing the deletion of data files. Set the following table properties:
```
external.table.purge=false
gc.enabled=false
```
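For illustration, these two properties could be applied with an ALTER TABLE statement before the table is dropped. The table name db.target is a placeholder and not part of the original guidance.

```sql
-- Ask the engine to leave data files in place on drop
-- (db.target is a placeholder table name).
ALTER TABLE db.target SET TBLPROPERTIES (
  'external.table.purge' = 'false',
  'gc.enabled' = 'false'
);

-- With both properties set, the data files are expected to remain in storage,
-- so the drop has less work to do.
DROP TABLE db.target;
```

Because the data files are left behind, they need to be cleaned up separately if the storage is to be reclaimed.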
Tune the following table properties to improve write concurrency and reduce commit failures: commit.retry.num-retries (default is 4) and commit.retry.min-wait-ms (default is 100).
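A minimal sketch of such tuning follows, assuming a placeholder table db.target. The values 10 and 500 are illustrative only and are not Cloudera recommendations; suitable values depend on how many writers commit concurrently.

```sql
-- Allow more commit retries and a longer initial wait between attempts.
-- (db.target and both values are hypothetical; defaults are 4 and 100.)
ALTER TABLE db.target SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',
  'commit.retry.min-wait-ms' = '500'
);
```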
Maintain a relatively small number of data files under the Iceberg table or partition directory for efficient reads. To alleviate poor performance caused by too many small files, run the following queries:
```
TRUNCATE TABLE target;
INSERT OVERWRITE TABLE target SELECT * FROM target FOR SYSTEM_VERSION AS OF <preTruncateSnapshotId>;
```
To minimize the number of delete files and file handles and improve performance, ensure that the Spark write.distribution.mode table property value is “hash” (the default setting for Spark Iceberg 1.2.0 onwards).
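As a hedged example, the current value could be checked and, if necessary, set explicitly in Spark SQL. The table name db.target is a placeholder. If the property has never been set on the table, the SHOW statement may report that it is absent, in which case the engine default applies.

```sql
-- Inspect the current setting (db.target is a placeholder table name).
SHOW TBLPROPERTIES db.target ('write.distribution.mode');

-- Set hash distribution explicitly, for example on tables written by
-- Iceberg versions where it was not yet the default.
ALTER TABLE db.target SET TBLPROPERTIES (
  'write.distribution.mode' = 'hash'
);
```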