Fixed Issues in Iceberg

Review the list of Iceberg issues that are resolved in Cloudera Runtime 7.1.9 SP1.

CDPD-45139: Improve Iceberg V2 reads with a custom Iceberg Position Delete operator
This fix improves Impala query performance when reading Iceberg tables that contain delete files.
CDPD-47349: Use hive.metastore.table.owner during table creation
Previously, creating an Iceberg table happened in two steps. The first step created the table, but with the wrong owner. The second step ran an ALTER TABLE statement to set the correct table owner.

This fix resolves the issue by creating an Iceberg table with the correct owner in a single step. When creating an Iceberg table, Impala specifies the owner in the hive.metastore.table.owner table property.
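For example, after this fix a table created from Impala carries the correct owner immediately. A rough sketch, where the table name ice_tbl is hypothetical:

  CREATE TABLE ice_tbl (id INT, name STRING) STORED AS ICEBERG;
  -- DESCRIBE FORMATTED now reports the creating user as the table owner,
  -- without a follow-up ALTER TABLE statement.
  DESCRIBE FORMATTED ice_tbl;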

CDPD-55029: Include snapshot ID of Iceberg tables in query plan or profile
This fix includes the snapshot ID of Iceberg tables in the Iceberg SCAN operators of the query plan and profile, so that queries can be re-executed against the same snapshot. Re-executing a query against a known snapshot helps you investigate performance problems and bugs.
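For instance, once the snapshot ID is visible in the SCAN operator of the plan or profile, the query can be re-executed against that exact snapshot using Impala's time travel syntax. A minimal sketch, with a hypothetical table name and snapshot ID:

  -- Re-run the query against the snapshot ID recorded in the query profile.
  SELECT count(*) FROM ice_tbl FOR SYSTEM_VERSION AS OF 1234567890123456789;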
CDPD-59657: Iceberg V2 operator provides incorrect results in PARTITIONED mode
When PARTITIONED mode is used, the fix introduced through IMPALA-12327 performs a binary search when the position-based difference between the current row and previous row is not one.
CDPD-60282: Need better cardinality estimation for Iceberg V2 tables with deletes
Previously, the cardinality of the IcebergDeleteNode was the same as the cardinality of the left-hand side (LHS) and did not take into account the cardinality of the right-hand side (RHS). The RHS contains position delete records, so every record in the RHS removes a record from the LHS.

If there are joins on the Iceberg table, they are assumed to have the same selectivity on the data records and on the delete records.

This fix updates the cardinality of the IcebergDeleteNode to use the following formula:
Cardinality of DELETE operator = Cardinality(LHS) - (Cardinality(RHS) * selectivity of LHS)
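As a hypothetical worked example: if the LHS scan returns 100,000 rows after applying a predicate with a selectivity of 0.1, and the RHS contains 50,000 position delete records, the estimate becomes:
  Cardinality of DELETE operator = 100,000 - (50,000 * 0.1) = 95,000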
CDPD-60946/CDPD-60717: Iceberg tables created through Trino are incompatible with Impala
The Trino SQL engine creates Iceberg tables without setting engine.hive.enabled=true and does not provide users with an option to set this property manually. Therefore, Trino always creates Iceberg tables with non-HiveIceberg storage descriptors.

Impala uses the Input/Output/SerDe properties to determine the table type; however, a table is also considered to be an Iceberg table if the table property table_type=ICEBERG is set.

The fix introduced through IMPALA-12413 ensures that modifications made to the table from Impala go through its Iceberg library (with engine.hive.enabled=true). This results in the HiveIceberg storage descriptors being set, making Iceberg tables created through Trino compatible with Impala.
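As a rough illustration (the table name trino_ice_tbl is hypothetical), a modification issued from Impala now goes through the Iceberg library, and the property can be checked afterwards:

  -- A table modification from Impala, such as an INSERT, uses the Iceberg library:
  INSERT INTO trino_ice_tbl VALUES (1, 'a');
  -- The table parameters should then include engine.hive.enabled=true.
  DESCRIBE FORMATTED trino_ice_tbl;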

CDPD-66786: Impala returns incorrect results when the optimized Iceberg V2 operator is used
When reading Iceberg V2 tables, Impala could return incorrect results when the optimized V2 operator was used. This issue has been resolved by resetting the delete state when the operator detects records from files that do not have delete records.
CDPD-67632: Optimized count(*) for Iceberg table gives wrong results after a Spark rewrite_data_files
Spark's rewrite_data_files action can leave dangling delete files in the Iceberg table, that is, delete files that no longer apply to any data file. This could cause Impala to return incorrect results for simple count(*) queries that it answers from the table statistics stored in the Iceberg metadata layer.
This fix resolves the issue, and Impala now returns correct results even in the presence of dangling delete files.
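As a sketch of the scenario (the catalog, database, and table names are hypothetical), compaction is run from Spark and the table is then queried from Impala:

  -- Spark SQL: compact the table; this can leave dangling delete files behind.
  CALL spark_catalog.system.rewrite_data_files(table => 'db.ice_tbl');

  -- Impala: this count(*) can be answered from the statistics in the Iceberg
  -- metadata layer and now returns the correct result even with dangling delete files.
  SELECT count(*) FROM db.ice_tbl;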

Apache Patch Information

  • IMPALA-11619
  • IMPALA-11776
  • IMPALA-12072
  • IMPALA-12327
  • IMPALA-12371
  • IMPALA-12413
  • IMPALA-12894