Fixed Issues in Apache Iceberg

Review the list of Iceberg issues that are resolved in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.

Fixed issues in Cloudera Runtime 7.3.1.400 SP2

There are no new fixed issues in this release.

Fixed issues in Cloudera Runtime 7.3.1.300 SP1 CHF1

CDPD-75411: SELECT COUNT query on an Iceberg table in AWS times out
7.3.1.300
In an AWS environment, a SELECT COUNT query that is run on an Iceberg table times out because some 4KB ORC file parts cannot be downloaded. This issue occurs because Iceberg uses the positional delete index only if the number of positional deletes is below a threshold value, which is 100,000 by default.

This issue has been resolved, and the positional delete index is now always used regardless of the positional delete count, resulting in improved performance.
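The behavioral change can be sketched as a toy decision function (hypothetical names, not Iceberg's actual code; the threshold value is taken from the issue description above):

```python
# Toy sketch of the fix described above -- not Iceberg's actual code.
THRESHOLD = 100_000  # illustrative default from the issue description

def use_delete_index_before_fix(num_positional_deletes: int) -> bool:
    # Before the fix: index only used below the threshold.
    return num_positional_deletes < THRESHOLD

def use_delete_index_after_fix(num_positional_deletes: int) -> bool:
    # After the fix: the index is always used, regardless of the count.
    return True
```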

CDPD-79741: Balance scheduling for consecutive partitions for Iceberg tables
7.3.1.300
During remote read scheduling, Impala does the following:
Non-Iceberg tables:
  • The scheduler processes the scan ranges in partition key order
  • The scheduler selects N executors as candidates
  • The scheduler chooses the executor from the candidates based on minimum number of assigned bytes
  • Therefore, consecutive partitions are more likely to be assigned to different executors
Iceberg tables:
  • The scheduler processes the scan ranges in random order
  • The scheduler selects N executors as candidates
  • The scheduler chooses the executor from the candidates based on minimum number of assigned bytes
  • Therefore, consecutive partitions (by partition key order) are assigned randomly and there is a higher chance of clustering

With this fix, IcebergScanNode orders its file descriptors based on their paths to facilitate more balanced scheduling of consecutive partitions. This is especially important for queries that prune partitions through runtime filters (due to a JOIN): even if the scan ranges are scheduled evenly overall, the scan ranges that survive the runtime filters can still be clustered on certain executors.
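The candidate-selection strategy described in the bullets above can be simulated with a short sketch. This is an illustrative toy, not Impala's scheduler code; all names (executors, paths, sizes) are made up:

```python
# Illustrative simulation of "pick the candidate executor with the fewest
# assigned bytes" scheduling -- not Impala's actual scheduler code.
import random

def schedule(scan_ranges, executors, num_candidates=2, seed=0):
    """Assign each (path, num_bytes) scan range to an executor name."""
    rng = random.Random(seed)
    assigned_bytes = {e: 0 for e in executors}
    assignment = {}
    for path, num_bytes in scan_ranges:
        # Pick N candidate executors, then the one with the fewest bytes.
        candidates = rng.sample(executors, num_candidates)
        chosen = min(candidates, key=lambda e: assigned_bytes[e])
        assigned_bytes[chosen] += num_bytes
        assignment[path] = chosen
    return assignment

# With the fix, IcebergScanNode sorts its file descriptors by path, so the
# scan ranges enter the loop in a deterministic, partition-key-like order:
ranges = [(f"part={i:02d}/data.parq", 1000) for i in range(8)]
ordered = schedule(sorted(ranges), ["exec1", "exec2", "exec3"])
```

Sorting the inputs makes the assignment deterministic across runs, so consecutive partitions are spread out rather than clustered by chance.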

Apache JIRA: IMPALA-12765
CDPD-81311: Unable to query Iceberg tables from Impala
7.3.1.300
After upgrading to Cloudera Runtime 7.3.1.200 or an earlier version, you may notice issues while querying Iceberg tables from Impala. An error is reported indicating that the migrated file has an unexpected schema or partitioning.

In migrated Iceberg tables, there can be data files with missing field IDs. Their schema is assumed to correspond to the table schema at the time the table migration happened, which means field IDs can be generated at runtime. The logic becomes more complicated when the table contains complex types and is partitioned. In such cases, some adjustments are required during field ID generation, and Impala verifies that the file schema corresponds to the table schema at the time of migration.

With this fix, these adjustments are not needed when the table does not contain complex types, so schema verification is skipped. As a result, Impala can still read the table even if there were trivial schema changes before migration.
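The basic idea of runtime field ID generation can be illustrated with a toy sketch: when a data file carries no field IDs, assign them by matching the file's columns positionally against the table schema. All names here are hypothetical; Impala's real logic is considerably more involved, especially with complex types and partitioning:

```python
# Toy sketch of positional field-ID generation for a migrated data file
# with missing field IDs -- not Impala's actual implementation.

def generate_field_ids(file_columns, table_schema):
    """file_columns: list of column names from the data file.
    table_schema: list of (field_id, name) pairs for the table.
    Returns a mapping from file column name to generated field ID."""
    if len(file_columns) != len(table_schema):
        # The file schema does not line up with the table schema.
        raise ValueError("file schema does not match table schema")
    return {col: field_id
            for col, (field_id, _name) in zip(file_columns, table_schema)}
```

For example, `generate_field_ids(["id", "name"], [(1, "id"), (2, "name")])` yields `{"id": 1, "name": 2}`; a positional match like this is only safe when the file schema still corresponds to the table schema, which is why verification matters in the complex-type case.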

Apache JIRA: IMPALA-13853

Fixed issues in Cloudera Runtime 7.3.1.200 SP1

CDPD-71365: Support Iceberg 1.3 on Spark 3.5
7.3.1.200
Cloudera Runtime 7.3.1.200 SP1 introduces support for Apache Spark 3.5.4. Upstream, Iceberg support for Spark 3.5 is only available starting with Iceberg 1.4; however, Cloudera Runtime 7.3.1.200 SP1 ships Iceberg 1.3.

This has been addressed, and Cloudera ensures that its Iceberg 1.3 distribution is compatible with Spark 3.5.4.

CDPD-81709: Update parquet-avro to 1.15.1 due to CVE-2025-30065
7.3.1.200
Due to CVE-2025-30065, schema parsing in the parquet-avro module of Apache Parquet 1.15.0 and earlier versions allows bad actors to execute arbitrary code.

To avoid this CVE, the parquet-avro module is upgraded to version 1.15.1.

Fixed issues in Cloudera Runtime 7.3.1.100 CHF1

CDPD-75667: Querying an Iceberg table with a TIMESTAMP_LTZ column can result in data loss
7.3.1.100
When you query an Iceberg table that has a TIMESTAMP_LTZ column, the query could result in data loss.

When Impala changes the TIMESTAMP_LTZ column to TIMESTAMP, it does it by calling alter_table() on Hive Metastore (HMS) directly. It provides a Metastore Table object to HMS as the desired state of the table. HMS then persists this table object.

This issue is fixed by avoiding the alter_table() call to HMS towards the end of loading the Iceberg table. This avoids the necessity of persisting the schema adjustments that Impala had to make while loading the table.

Apache JIRA: IMPALA-13484
CDPD-78355: Impala should ignore character case of Iceberg schema elements
7.3.1.100
Impala cannot read Iceberg tables written by Apache Spark whose schema element names contain uppercase or mixed-case letters.

Schemas are case insensitive in Impala; however, Spark allows creating schema elements with uppercase or mixed-case letters and stores them in the Iceberg metadata JSON files.

With this fix, Impala invokes Scan.caseSensitive(boolean caseSensitive) on the TableScan object to make schema resolution case insensitive.
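The effect of case-insensitive schema resolution can be illustrated with a small sketch (not Impala or Iceberg code; `resolve_column` is a made-up helper):

```python
# Toy illustration of case-insensitive column name resolution -- the
# behavior Impala now requests from Iceberg via the Scan.caseSensitive API.

def resolve_column(requested: str, schema_names: list) -> str:
    """Return the schema's spelling of a column name, ignoring case."""
    matches = [n for n in schema_names if n.lower() == requested.lower()]
    if not matches:
        raise KeyError(requested)
    # Keep the spelling stored in the Iceberg metadata.
    return matches[0]
```

For example, a Spark-written schema containing `Id` can now be resolved when Impala asks for `id` or `ID`.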

Apache JIRA: IMPALA-13463
CDPD-78362: Schema resolution does not work for migrated partitioned Iceberg tables that have complex types
7.3.1.100
Schema resolution does not work correctly for migrated partitioned Iceberg tables that have complex data types. This fix addresses field ID generation by taking the number of partition columns into account. If none of the partition columns are included in the data file (the common scenario), the file-level field IDs are adjusted accordingly. The scenario where all the partition columns are included in the data files is also handled.

However, if some partition columns are included in the data file while others are not, an error is generated.

Apache JIRA: IMPALA-13364
CDPD-78540: DELETE statement throws DateTimeParseException when deleting from DAY-partitioned Iceberg tables
7.3.1.100
Due to an issue in IcebergDeleteSink, Impala cannot successfully run a DELETE operation on Iceberg tables that are partitioned by time-based transforms (YEAR, MONTH, DAY, HOUR).

This fix addresses the error by adding functions that transform the partition values into their human-readable representations. This is done in IcebergDeleteSink so that the catalog-side logic is not affected.
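Per the Iceberg table spec, time-based transforms store partition values as integer offsets from the Unix epoch (days for DAY, months for MONTH). A small sketch of converting such values to human-readable form, under that assumption (the function names are illustrative, not the actual Impala functions):

```python
# Illustrative conversion of Iceberg time-transform partition values
# (integer offsets from the Unix epoch) to human-readable strings.
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def day_to_human_readable(days_from_epoch: int) -> str:
    # DAY transform stores ordinal days from 1970-01-01.
    return (EPOCH + timedelta(days=days_from_epoch)).isoformat()

def month_to_human_readable(months_from_epoch: int) -> str:
    # MONTH transform stores months from 1970-01.
    year = 1970 + months_from_epoch // 12
    month = months_from_epoch % 12 + 1
    return f"{year:04d}-{month:02d}"
```

For example, `day_to_human_readable(0)` returns `"1970-01-01"`, the form expected in partition paths, rather than the raw integer that triggered the DateTimeParseException.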

Apache JIRA: IMPALA-12557
CDPD-78562: Iceberg tables have a large memory footprint in catalog cache
7.3.1.100
This fix clears the GroupContentFiles after they are used.

GroupContentFiles stores the file descriptors in Iceberg's format and is used for creating file descriptors in Impala's format. Once the creation is complete, Impala does not have to retain the Iceberg ContentFiles. Dropping them can significantly reduce the memory footprint of an Iceberg table.

For example, the memory size of a test Iceberg table containing 110k files was reduced from 140 MB to 80 MB after clearing the GroupContentFiles.

Apache JIRA: IMPALA-11265

Fixed issues in Cloudera Runtime 7.3.1

CDPD-48395: Upgrade the Parquet version to 1.12.3 for Hive
7.3.1
This fix upgrades the Parquet version for Hive to 1.12.3, which is the same Parquet version that is used for Iceberg.