Performance tuning

Impala uses its own C++ implementation to work with Iceberg tables. This implementation provides significant performance advantages over other engines.

To tune performance, try the following actions:

  • Increase parallelism to handle large manifest list files in Spark.

    By default, the value of the iceberg.worker.num-threads system property is derived from the number of processors. To speed up query compilation, increase parallelism by setting iceberg.worker.num-threads to a higher value.

  • Speed up DROP TABLE operations by preventing deletion of data files. Set the following table properties:

    external.table.purge=false and gc.enabled=false

  • Tune the following table properties to improve concurrency on writes and reduce commit failures:

    commit.retry.num-retries (default is 4) and commit.retry.min-wait-ms (default is 100)

  • Read Iceberg V2 tables from Hive using vectorization when heavy table scanning occurs, as in SELECT COUNT(*) FROM TBL_ICEBERG_PART. Set the following properties:

    • set hive.llap.io.memory.mode=cache;

    • set hive.llap.io.enabled=true;

    • set hive.vectorized.execution.enabled=true;

  • Use Iceberg from Impala for querying Iceberg tables when latency is a concern.

    Impala's massively parallel SQL query engine, with backend executors written in C++ and a frontend (analyzer, planner) written in Java, delivers query results quickly.

  • Cache manifest files as described in the next topic.
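
The table properties listed above can be applied with ALTER TABLE ... SET TBLPROPERTIES. The following is a minimal sketch, not a recommended configuration. It assumes a table named tbl_iceberg_part (borrowed from the query example above), and the retry values are illustrative only; suitable values depend on your write workload.

```sql
-- Assumption: tbl_iceberg_part is a placeholder table name.

-- Keep data files in place when the table is dropped,
-- so that DROP TABLE does not spend time deleting them.
ALTER TABLE tbl_iceberg_part SET TBLPROPERTIES (
  'external.table.purge'='false',
  'gc.enabled'='false'
);

-- Allow more commit retries and a longer minimum wait between them
-- (defaults: 4 retries, 100 ms). The values below are illustrative only.
ALTER TABLE tbl_iceberg_part SET TBLPROPERTIES (
  'commit.retry.num-retries'='10',
  'commit.retry.min-wait-ms'='250'
);
```

With purging disabled, the data files remain in storage after the table is dropped, so remove them separately if they are no longer needed.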