Fixed Issues in Apache Iceberg

Review the list of Iceberg issues that are resolved in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.

Fixed issues in Cloudera Runtime 7.3.1.400 SP2

There are no new fixed issues in this release.

Fixed issues in Cloudera Runtime 7.3.1.300 SP1 CHF1

CDPD-75411: SELECT COUNT query on an Iceberg table in AWS times out
7.3.1.300
In an AWS environment, a SELECT COUNT query that is run on an Iceberg table times out because some 4KB ORC file parts cannot be downloaded. This issue occurs because Iceberg uses the positional delete index only if the number of positional deletes is below a threshold value, which is 100,000 by default.

This issue has been resolved, and the positional delete index is now always used regardless of the positional delete count, resulting in improved performance.
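The behavioral change can be sketched as a toy decision function (hypothetical names, not Iceberg's actual code; the threshold value is taken from the issue description above):

```python
# Toy sketch of the fix described above -- not Iceberg's actual code.
THRESHOLD = 100_000  # illustrative default from the issue description

def use_delete_index_before_fix(num_positional_deletes: int) -> bool:
    # Before the fix: index only used below the threshold.
    return num_positional_deletes < THRESHOLD

def use_delete_index_after_fix(num_positional_deletes: int) -> bool:
    # After the fix: the index is always used, regardless of the count.
    return True
```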

CDPD-79741: Balance scheduling for consecutive partitions for Iceberg tables
7.3.1.300
During remote read scheduling, Impala does the following:
Non-Iceberg tables:
  • The scheduler processes the scan ranges in partition key order
  • The scheduler selects N executors as candidates
  • The scheduler chooses the executor from the candidates based on minimum number of assigned bytes
  • Therefore, consecutive partitions are more likely to be assigned to different executors
Iceberg tables:
  • The scheduler processes the scan ranges in random order
  • The scheduler selects N executors as candidates
  • The scheduler chooses the executor from the candidates based on minimum number of assigned bytes
  • Therefore, consecutive partitions (by partition key order) are assigned randomly and there is a higher chance of clustering

With this fix, IcebergScanNode orders its file descriptors based on their paths to facilitate more balanced scheduling of consecutive partitions. This is especially important for queries that prune partitions through runtime filters (due to a JOIN): even if the scan ranges are scheduled evenly overall, the scan ranges that survive the runtime filters can still be clustered on certain executors.
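The candidate-selection strategy described in the bullets above can be simulated with a short sketch. This is an illustrative toy, not Impala's scheduler code; all names (executors, paths, sizes) are made up:

```python
# Illustrative simulation of "pick the candidate executor with the fewest
# assigned bytes" scheduling -- not Impala's actual scheduler code.
import random

def schedule(scan_ranges, executors, num_candidates=2, seed=0):
    """Assign each (path, num_bytes) scan range to an executor name."""
    rng = random.Random(seed)
    assigned_bytes = {e: 0 for e in executors}
    assignment = {}
    for path, num_bytes in scan_ranges:
        # Pick N candidate executors, then the one with the fewest bytes.
        candidates = rng.sample(executors, num_candidates)
        chosen = min(candidates, key=lambda e: assigned_bytes[e])
        assigned_bytes[chosen] += num_bytes
        assignment[path] = chosen
    return assignment

# With the fix, IcebergScanNode sorts its file descriptors by path, so the
# scan ranges enter the loop in a deterministic, partition-key-like order:
ranges = [(f"part={i:02d}/data.parq", 1000) for i in range(8)]
ordered = schedule(sorted(ranges), ["exec1", "exec2", "exec3"])
```

Sorting the inputs makes the assignment deterministic across runs, so consecutive partitions are spread out rather than clustered by chance.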

Apache JIRA: IMPALA-12765
CDPD-81311: Unable to query Iceberg tables from Impala
7.3.1.300
After upgrading to Cloudera Runtime 7.3.1.200 or an earlier version, you may notice issues while querying Iceberg tables from Impala. An error is reported indicating that the migrated file has an unexpected schema or partitioning.

In migrated Iceberg tables, there can be data files with missing field IDs. Their schema is assumed to correspond to the table schema at the time the table migration happened, which means field IDs can be generated at runtime. The logic becomes more complicated when the table contains complex types and is partitioned. In such cases, some adjustments are required during field ID generation, and Impala verifies that the file schema corresponds to the table schema at the time of migration.

With this fix, these adjustments are not needed when the table does not contain complex types, so schema verification is skipped. As a result, Impala can still read the table even if there were trivial schema changes before migration.
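The basic idea of runtime field ID generation can be illustrated with a toy sketch: when a data file carries no field IDs, assign them by matching the file's columns positionally against the table schema. All names here are hypothetical; Impala's real logic is considerably more involved, especially with complex types and partitioning:

```python
# Toy sketch of positional field-ID generation for a migrated data file
# with missing field IDs -- not Impala's actual implementation.

def generate_field_ids(file_columns, table_schema):
    """file_columns: list of column names from the data file.
    table_schema: list of (field_id, name) pairs for the table.
    Returns a mapping from file column name to generated field ID."""
    if len(file_columns) != len(table_schema):
        # The file schema does not line up with the table schema.
        raise ValueError("file schema does not match table schema")
    return {col: field_id
            for col, (field_id, _name) in zip(file_columns, table_schema)}
```

For example, `generate_field_ids(["id", "name"], [(1, "id"), (2, "name")])` yields `{"id": 1, "name": 2}`; a positional match like this is only safe when the file schema still corresponds to the table schema, which is why verification matters in the complex-type case.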

Apache JIRA: IMPALA-13853

Fixed issues in Cloudera Runtime 7.3.1.200 SP1

CDPD-71365: Support Iceberg 1.3 on Spark 3.5
7.3.1.200
Cloudera Runtime 7.3.1.200 SP1 introduces support for Apache Spark 3.5.4. Upstream, Iceberg support for Spark 3.5 is only available starting with Iceberg 1.4; however, Cloudera Runtime 7.3.1.200 SP1 ships Iceberg 1.3.

This has been addressed, and Cloudera ensures that its Iceberg 1.3 distribution is compatible with Spark 3.5.4.

CDPD-81709: Update parquet-avro to 1.15.1 due to CVE-2025-30065
7.3.1.200
Due to CVE-2025-30065, schema parsing in the parquet-avro module of Apache Parquet 1.15.0 and earlier versions allows bad actors to execute arbitrary code.

To avoid this CVE, the parquet-avro module is upgraded to version 1.15.1.

Fixed issues in Cloudera Runtime 7.3.1.100 CHF1

CDPD-75667: Querying an Iceberg table with a TIMESTAMP_LTZ column can result in data loss
7.3.1.100
When you query an Iceberg table that has a TIMESTAMP_LTZ column, the query could result in data loss.

When Impala changes the TIMESTAMP_LTZ column to TIMESTAMP, it does it by calling alter_table() on Hive Metastore (HMS) directly. It provides a Metastore Table object to HMS as the desired state of the table. HMS then persists this table object.

This issue is fixed by avoiding the alter_table() call to HMS towards the end of loading the Iceberg table. This avoids the necessity of persisting the schema adjustments that Impala had to make while loading the table.

Apache JIRA: IMPALA-13484
CDPD-78355: Impala should ignore character case of Iceberg schema elements
7.3.1.100
Impala cannot read Iceberg tables written by Apache Spark whose schema element names contain uppercase or mixed-case letters.

Schemas are case insensitive in Impala; however, Spark allows creating schema elements with uppercase or mixed-case letters and stores them in the Iceberg metadata JSON files.

With this fix, Impala invokes Scan.caseSensitive(boolean caseSensitive) on the TableScan object to make schema resolution case insensitive.
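The effect of case-insensitive schema resolution can be illustrated with a small sketch (not Impala or Iceberg code; `resolve_column` is a made-up helper):

```python
# Toy illustration of case-insensitive column name resolution -- the
# behavior Impala now requests from Iceberg via the Scan.caseSensitive API.

def resolve_column(requested: str, schema_names: list) -> str:
    """Return the schema's spelling of a column name, ignoring case."""
    matches = [n for n in schema_names if n.lower() == requested.lower()]
    if not matches:
        raise KeyError(requested)
    # Keep the spelling stored in the Iceberg metadata.
    return matches[0]
```

For example, a Spark-written schema containing `Id` can now be resolved when Impala asks for `id` or `ID`.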

Apache JIRA: IMPALA-13463
CDPD-78362: Schema resolution does not work for migrated partitioned Iceberg tables that have complex types
7.3.1.100
Schema resolution does not work correctly for migrated partitioned Iceberg tables that have complex data types. This fix addresses field ID generation by taking the number of partition columns into account. If none of the partition columns are included in the data file (the common scenario), the file-level field IDs are adjusted accordingly. The scenario where all the partition columns are included in the data files is also handled.

However, if some partition columns are included in the data file while others are not, an error is generated.

Apache JIRA: IMPALA-13364
CDPD-78540: DELETE statement throws DateTimeParseException when deleting from DAY-partitioned Iceberg tables
7.3.1.100
Due to an issue in IcebergDeleteSink, Impala cannot successfully run a DELETE operation on Iceberg tables that are partitioned by time-based transforms (YEAR, MONTH, DAY, HOUR).

This fix addresses the error by adding functions that transform the partition values into their human-readable representations. This is done in IcebergDeleteSink so that the catalog-side logic is not affected.
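Per the Iceberg table spec, time-based transforms store partition values as integer offsets from the Unix epoch (days for DAY, months for MONTH). A small sketch of converting such values to human-readable form, under that assumption (the function names are illustrative, not the actual Impala functions):

```python
# Illustrative conversion of Iceberg time-transform partition values
# (integer offsets from the Unix epoch) to human-readable strings.
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def day_to_human_readable(days_from_epoch: int) -> str:
    # DAY transform stores ordinal days from 1970-01-01.
    return (EPOCH + timedelta(days=days_from_epoch)).isoformat()

def month_to_human_readable(months_from_epoch: int) -> str:
    # MONTH transform stores months from 1970-01.
    year = 1970 + months_from_epoch // 12
    month = months_from_epoch % 12 + 1
    return f"{year:04d}-{month:02d}"
```

For example, `day_to_human_readable(0)` returns `"1970-01-01"`, the form expected in partition paths, rather than the raw integer that triggered the DateTimeParseException.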

Apache JIRA: IMPALA-12557
CDPD-78562: Iceberg tables have a large memory footprint in catalog cache
7.3.1.100
This fix clears the GroupContentFiles after they are used.

GroupContentFiles stores the file descriptors in Iceberg's format and is used for creating file descriptors in Impala's format. Once the creation is complete, Impala does not have to retain the Iceberg ContentFiles. Dropping them can significantly reduce the memory footprint of an Iceberg table.

For example, the memory size of a test Iceberg table containing 110k files was reduced from 140 MB to 80 MB after clearing the GroupContentFiles.

Apache JIRA: IMPALA-11265

Fixed issues in Cloudera Runtime 7.3.1

CDPD-48395: Upgrade the Parquet version to 1.12.3 for Hive
7.3.1
This fix upgrades the Parquet version for Hive to 1.12.3, which is the same Parquet version that is used for Iceberg.