Iceberg
You can review the list of reported issues and their fixes for Iceberg in 7.3.1.100.
- CDPD-75667: Querying an Iceberg table with a TIMESTAMP_LTZ column can result in data loss
  When you query an Iceberg table that has a TIMESTAMP_LTZ column, the query could result in data loss. When Impala changes the TIMESTAMP_LTZ column to TIMESTAMP, it does so by calling alter_table() on Hive Metastore (HMS) directly, providing a Metastore Table object to HMS as the desired state of the table. HMS then persists this table object (a sketch of such a call follows this list).
  This issue is fixed by avoiding the alter_table() call to HMS towards the end of loading the Iceberg table, which removes the need to persist the schema adjustments that Impala makes while loading the table.
- CDPD-78355: Impala should ignore character case of Iceberg schema elements
  Impala cannot read Iceberg tables written by Apache Spark that contain schema elements in uppercase or lowercase letters. Schema is case insensitive in Impala; however, Spark allows creating schema elements with uppercase or lowercase letters and stores them in the metadata JSON files of Iceberg.
  With this fix, Impala invokes Scan.caseSensitive(boolean caseSensitive) on the TableScan object to set case insensitivity (see the case-insensitive scan sketch after this list).
- CDPD-78362: Schema resolution does not work for migrated partitioned Iceberg tables that have complex types
  Schema resolution does not work correctly for migrated partitioned Iceberg tables that have complex data types. This fix addresses field ID generation by taking the number of partition columns into account. If none of the partition columns are included in the data file (the common scenario), file-level field IDs are adjusted accordingly. You could also come across a scenario where all the partition columns are included in the data files, which is likewise handled. However, if some partition columns are included in the data file while others are not, an error is generated.
- CDPD-78540: DELETE statement throws DateTimeParseException when deleting from DAY-partitioned Iceberg tables
  Due to an issue in IcebergDeleteSink, Impala cannot successfully run a DELETE operation on Iceberg tables that are partitioned by time-based transforms (YEAR, MONTH, DAY, HOUR).
  This fix addresses the error by adding functions that transform the partition values to their human-readable representations (see the transform sketch after this list). This is done in the IcebergDeleteSink so that the Catalog-side logic is not affected.
- CDPD-78562: Iceberg tables have a large memory footprint in catalog cache
  This fix clears the GroupContentFiles after they are used. GroupContentFiles stores the file descriptors in Iceberg's format and is used for creating file descriptors in Impala's format. Once the creation is complete, the Iceberg ContentFiles no longer need to be retained, and dropping them can significantly reduce the memory footprint of an Iceberg table (see the last sketch after this list).
  For example, the memory size of a test Iceberg table containing 110,000 files was reduced from 140 MB to 80 MB after clearing the GroupContentFiles.
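The following Java sketch is not Impala's code; it only illustrates, for CDPD-75667, what a direct alter_table() call against HMS looks like: the client fetches the Metastore Table object, adjusts a column type, and asks HMS to persist the adjusted object as the desired state of the table. The database, table, and column names are hypothetical, and the HiveMetaStoreClient setup assumes a reachable, configured metastore.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class AlterTableSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical database, table, and column names, used purely for illustration.
    String db = "demo_db";
    String tbl = "demo_iceberg_tbl";
    String col = "event_time";

    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    try {
      // Fetch the current Metastore Table object and adjust the column type in place;
      // the modified object becomes the "desired state" of the table.
      Table desired = client.getTable(db, tbl);
      for (FieldSchema field : desired.getSd().getCols()) {
        if (field.getName().equals(col)) {
          field.setType("timestamp");  // illustrative change from "timestamp with local time zone"
        }
      }
      // A direct alter_table() call makes HMS persist the adjusted schema.
      // The CDPD-75667 fix avoids issuing such a call at the end of table loading.
      client.alter_table(db, tbl, desired);
    } finally {
      client.close();
    }
  }
}
```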
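For CDPD-78355, the following is a minimal sketch of the Iceberg API toggle the fix relies on: Scan.caseSensitive(false) makes a TableScan resolve column names case-insensitively. The HadoopTables catalog, table location, and filter column below are assumptions made only to keep the example self-contained.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class CaseInsensitiveScanSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical table location; any Iceberg catalog would work the same way.
    Table table = new HadoopTables(new Configuration())
        .load("hdfs:///warehouse/demo_db/demo_tbl");

    // caseSensitive(false) lets the scan match a column stored as "ID" in the
    // metadata JSON against the lower-cased reference "id" used by the engine.
    TableScan scan = table.newScan()
        .caseSensitive(false)
        .filter(Expressions.equal("id", 42));

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        System.out.println(task.file().path());
      }
    }
  }
}
```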
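For CDPD-78540, the transform sketch below is not the IcebergDeleteSink code itself; it only illustrates the kind of conversion the fix adds, assuming the Iceberg convention that YEAR, MONTH, DAY, and HOUR transform values are stored as ordinals relative to the Unix epoch (years, months, days, and hours since 1970). The helper names are hypothetical.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimeTransformToHumanSketch {
  // DAY partition values are stored as days since 1970-01-01.
  static String dayToHuman(int daysSinceEpoch) {
    return LocalDate.ofEpochDay(daysSinceEpoch).toString();  // e.g. "2024-05-17"
  }

  // HOUR partition values are stored as hours since 1970-01-01 00:00.
  static String hourToHuman(long hoursSinceEpoch) {
    return LocalDateTime.ofEpochSecond(hoursSinceEpoch * 3600L, 0, ZoneOffset.UTC)
        .format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HH"));  // e.g. "2024-05-17-10"
  }

  // MONTH values are months since 1970-01; assumes non-negative ordinals (dates in or after 1970).
  static String monthToHuman(int monthsSinceEpoch) {
    return String.format("%04d-%02d", 1970 + monthsSinceEpoch / 12, 1 + monthsSinceEpoch % 12);
  }

  // YEAR values are years since 1970.
  static String yearToHuman(int yearsSinceEpoch) {
    return String.format("%04d", 1970 + yearsSinceEpoch);
  }

  public static void main(String[] args) {
    System.out.println(dayToHuman(19860));    // a date in 2024
    System.out.println(hourToHuman(476640L)); // an hour of that same day
    System.out.println(monthToHuman(652));    // 2024-05
    System.out.println(yearToHuman(54));      // 2024
  }
}
```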
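For CDPD-78562, here is a minimal sketch of the memory-saving idea, not the actual catalog code: the Iceberg ContentFile objects are only needed while the engine builds its own compact file descriptors, so the grouping collection can be cleared as soon as the conversion is done and the Iceberg-format objects become garbage-collectable. The FileDescriptor record and convertAndRelease() helper are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.ContentFile;

public class ContentFileConversionSketch {
  // Hypothetical compact descriptor standing in for the engine's own file descriptor format.
  record FileDescriptor(String path, long sizeInBytes, long recordCount) {}

  static FileDescriptor toDescriptor(ContentFile<?> file) {
    return new FileDescriptor(file.path().toString(), file.fileSizeInBytes(), file.recordCount());
  }

  // Converts the grouped Iceberg ContentFiles, then clears the source list so that
  // the larger Iceberg-format objects are no longer referenced by the cache.
  static List<FileDescriptor> convertAndRelease(List<ContentFile<?>> groupedContentFiles) {
    List<FileDescriptor> descriptors = new ArrayList<>(groupedContentFiles.size());
    for (ContentFile<?> file : groupedContentFiles) {
      descriptors.add(toDescriptor(file));
    }
    groupedContentFiles.clear();  // drop references to the Iceberg-format objects
    return descriptors;
  }
}
```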