Best practices for Iceberg in CDP
Based on large scale TPC-DS and performance testing and real-world experiences, Cloudera recommends several best practices when using Iceberg.
Follow the best practices listed below when using Iceberg:
- Use Iceberg as intended for analytics.
The table format is designed to manage a large, slow-changing collection of files. For more information, see the Iceberg spec.
- Increase parallelism to handle large manifest list
files in Spark.
By default, the number of processors determines the preset value of the
iceberg.worker.num-threads
system property. Try increasing parallelism by setting theiceberg.worker.num-threads
system property to a higher value to speed up query compilation. - Reduce read amplification
Monitor the growth of positional delta files, and perform timely compactions.
- Speed up drop table performance, preventing deletion of data files by using the following
table
properties:
Set external.table.purge=false and gc.enabled=false
- To improve concurrency on writes and reduce commit failures, tune the table properties:
commit.retry.num-retries
(default is 4),commit.retry.min-wait-ms
(default is 100)