Review the list of Iceberg issues that are resolved in Cloudera Runtime
7.3.1, its service packs and cumulative hotfixes.
Fixed issues in Cloudera Runtime 7.3.1.400 SP2
There are no new fixed issues in this release.
Fixed issues in Cloudera Runtime 7.3.1.300 SP1 CHF1
- CDPD-75411: SELECT COUNT query on an Iceberg table in AWS times out
- 7.3.1.300
- In an AWS environment, a SELECT COUNT query that is run on an Iceberg table times out because some 4 KB ORC file parts cannot be downloaded. This issue occurs because Iceberg uses the positional delete index only if the count of positional deletes is less than a threshold value, which is 100000 by default. This issue has been resolved, and the positional delete index is now always used regardless of the positional delete count, resulting in improved performance (a sketch follows this item).
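A minimal Java sketch of the behavior change described above; the class and method names are illustrative, not Impala's actual code:

```java
// Hypothetical sketch of the positional delete index decision (illustrative names).
final class PositionDeletePlannerSketch {
    // Default threshold mentioned above.
    private static final long DELETE_INDEX_THRESHOLD = 100_000L;

    // Old behavior: the index was skipped once the number of positional deletes
    // reached the threshold, which led to many small (4 KB) remote reads.
    static boolean useDeleteIndexBeforeFix(long numPositionDeletes) {
        return numPositionDeletes < DELETE_INDEX_THRESHOLD;
    }

    // Fixed behavior: the positional delete index is used unconditionally.
    static boolean useDeleteIndexAfterFix(long numPositionDeletes) {
        return true;
    }
}
```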
- CDPD-79741: Balance scheduling for consecutive partitions for Iceberg tables
- 7.3.1.300
- During remote read scheduling, Impala does the following:
Non-Iceberg tables:
- The scheduler processes the scan ranges in partition key order
- The scheduler selects N executors as candidates
- The scheduler chooses the executor from the candidates based on the minimum number of assigned bytes
- Therefore, consecutive partitions are more likely to be assigned to different executors
Iceberg tables:
- The scheduler processes the scan ranges in random order
- The scheduler selects N executors as candidates
- The scheduler chooses the executor from the candidates based on the minimum number of assigned bytes
- Therefore, consecutive partitions (by partition key order) are assigned randomly, and there is a higher chance of clustering
With this fix, IcebergScanNode orders its file descriptors by path to facilitate more balanced scheduling of consecutive partitions (see the sketch after this item). This is especially important for queries that prune partitions through runtime filters (due to a JOIN): even if the scan ranges are scheduled evenly, the scan ranges that survive the runtime filters can still be clustered on certain executors.
- Apache JIRA: IMPALA-12765
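A minimal Java sketch of the ordering change, assuming a simplified file descriptor type; this is illustrative, not Impala's actual IcebergScanNode code:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order file descriptors by path before scheduling so that
// scan ranges are handed to the scheduler in a stable, partition-friendly order.
final class IcebergScanNodeSketch {
    static final class FileDescriptor {
        final String path; // e.g. ".../event_day=2024-05-17/data-0001.parquet"
        FileDescriptor(String path) { this.path = path; }
    }

    static void orderByPath(List<FileDescriptor> fileDescriptors) {
        // Paths embed the partition directories, so sorting by path restores
        // partition-key order and spreads consecutive partitions across executors.
        fileDescriptors.sort(Comparator.comparing((FileDescriptor fd) -> fd.path));
    }
}
```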
- CDPD-81311: Unable to query Iceberg tables from Impala
- 7.3.1.300
- After upgrading to Cloudera Runtime 7.3.1.200 or earlier versions, you may notice issues while querying Iceberg tables from Impala. An error is reported indicating that the migrated file has an unexpected schema or partitioning.
In migrated Iceberg tables, there can be data files with missing field IDs. It is assumed that their schema corresponds to the table schema at the point when the table migration happened, which means field IDs can be generated during runtime. The logic becomes complicated when there are complex types in the table and the table is partitioned. In such cases, some adjustments are required during field ID generation, and the file schema is verified against the table schema (at migration time).
This fix ensures that these adjustments are not needed when the table does not have complex types, and therefore schema verification is skipped. As a result, Impala can still read the table even if there were some trivial schema changes before migration.
- Apache JIRA: IMPALA-13853
Fixed issues in Cloudera Runtime 7.3.1.200 SP1
- CDPD-71365: Support Iceberg 1.3 on Spark 3.5
- 7.3.1.200
- Cloudera Runtime 7.3.1.200 SP1 introduces support for Apache Spark 3.5.4. Upstream, Iceberg support for Spark 3.5 is only available from Iceberg 1.4; however, Cloudera Runtime 7.3.1.200 SP1 offers Iceberg 1.3.
This was addressed, and Cloudera ensures that its Iceberg 1.3 is compatible with Spark 3.5.4.
- CDPD-81709: Update parquet-avro to 1.15.1 due to CVE-2025-30065
- 7.3.1.200
- Due to CVE-2025-30065, schema parsing in the parquet-avro module of Apache Parquet 1.15.0 and earlier versions allows bad actors to execute arbitrary code. To avoid this CVE, the parquet-avro module is upgraded to version 1.15.1.
Fixed issues in Cloudera Runtime 7.3.1.100 CHF1
- CDPD-75667: Querying an Iceberg table with a TIMESTAMP_LTZ column can result in data loss
- 7.3.1.100
- When you query an Iceberg table that has a TIMESTAMP_LTZ column, the query could result in data loss. When Impala changes the TIMESTAMP_LTZ column to TIMESTAMP, it does so by calling alter_table() on Hive Metastore (HMS) directly. It provides a Metastore Table object to HMS as the desired state of the table. HMS then persists this table object.
This issue is fixed by avoiding the alter_table() call to HMS towards the end of loading the Iceberg table, which removes the need to persist the schema adjustments that Impala had to make while loading the table (a sketch follows this item).
- Apache JIRA: IMPALA-13484
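A minimal sketch of the change, assuming Hive Metastore's Thrift client interface (IMetaStoreClient.alter_table is a real method; the surrounding control flow is illustrative):

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

// Hypothetical sketch: keep load-time schema adjustments in Impala's in-memory
// catalog instead of persisting them back to HMS.
final class IcebergTableLoaderSketch {
    void finishLoading(IMetaStoreClient hms, Table msTable) throws Exception {
        // Old behavior: the adjusted Metastore Table object was written back to
        // HMS, persisting TIMESTAMP in place of TIMESTAMP_LTZ:
        // hms.alter_table(msTable.getDbName(), msTable.getTableName(), msTable);

        // Fixed behavior: no alter_table() call at the end of table loading, so
        // the adjustments stay local and the original column type is preserved.
    }
}
```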
- CDPD-78355: Impala should ignore character case of Iceberg schema elements
- 7.3.1.100
- Impala cannot read Iceberg tables written by Apache Spark that contain schema elements with uppercase or mixed-case letters.
Schema is case insensitive in Impala; however, Spark allows creating schema elements with uppercase or lowercase letters and stores them in the metadata JSON files of Iceberg.
With this fix, Impala invokes Scan.caseSensitive(boolean caseSensitive) on the TableScan object to set case insensitivity (a sketch follows this item).
- Apache JIRA: IMPALA-13463
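The fix uses Iceberg's Java scan API. A minimal sketch, assuming a loaded Iceberg Table object (the wrapper class is illustrative; Scan.caseSensitive(boolean) is the Iceberg API named above):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;

// Sketch: configure a scan to resolve column names without regard to case,
// mirroring the fix described above.
final class CaseInsensitiveScanSketch {
    static TableScan newCaseInsensitiveScan(Table table) {
        // caseSensitive(false) returns a scan that matches schema element names
        // case-insensitively, so mixed-case names written by Spark resolve.
        return table.newScan().caseSensitive(false);
    }
}
```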
- CDPD-78362: Schema resolution does not work for migrated
partitioned Iceberg tables that have complex types
- 7.3.1.100
- Schema resolution does not work correctly for migrated partitioned Iceberg tables that have complex data types. This fix addresses the field ID generation by taking the number of partitions into account. If none of the partition columns are included in the data file (the common scenario), file-level field IDs are adjusted accordingly. You could also come across a scenario where all the partition columns are included in the data files, which is also handled.
However, if some partition columns are included in the data file while other partition columns are not, an error is generated.
- Apache JIRA: IMPALA-13364
- CDPD-78540: DELETE statement throws DateTimeParseException when
deleting from DAY-partitioned Iceberg tables
- 7.3.1.100
- Due to an issue in IcebergDeleteSink, Impala cannot successfully run a DELETE operation on Iceberg tables that are partitioned by time-based transforms (YEAR, MONTH, DAY, HOUR). This fix addresses the error by adding functions that transform the partition values to their human-readable representations (a sketch follows this item). This is done in the IcebergDeleteSink so that the Catalog-side logic is not affected.
- Apache JIRA: IMPALA-12557
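A minimal Java sketch of such helpers using only the JDK; the method names and exact output formats are illustrative, not the actual IcebergDeleteSink code (Iceberg stores YEAR/MONTH/DAY/HOUR partition values as integer offsets from the Unix epoch):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical helpers: convert integer partition values of time-based
// transforms into human-readable strings.
final class TimeTransformSketch {
    static String yearToHumanString(int yearsFromEpoch) {
        return String.valueOf(1970 + yearsFromEpoch);              // 54 -> "2024"
    }

    static String monthToHumanString(int monthsFromEpoch) {
        LocalDate d = LocalDate.ofEpochDay(0).plusMonths(monthsFromEpoch);
        return String.format("%04d-%02d", d.getYear(), d.getMonthValue()); // "2024-05"
    }

    static String dayToHumanString(int daysFromEpoch) {
        return LocalDate.ofEpochDay(daysFromEpoch).toString();     // "2024-05-17"
    }

    static String hourToHumanString(int hoursFromEpoch) {
        LocalDateTime t = LocalDateTime.ofEpochSecond(hoursFromEpoch * 3600L, 0, ZoneOffset.UTC);
        return String.format("%s-%02d", t.toLocalDate(), t.getHour()); // "2024-05-17-08"
    }
}
```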
- CDPD-78562: Iceberg tables have a large memory footprint in
catalog cache
- 7.3.1.100
- This fix clears the GroupContentFiles after they are used. GroupContentFiles stores the file descriptors in Iceberg's format and is used for creating file descriptors in Impala's format. Once the creation is complete, the Iceberg ContentFiles do not have to be retained, and dropping them can significantly reduce the memory footprint of an Iceberg table (a sketch follows this item).
For example, the memory size of a test Iceberg table containing 110k files was reduced from 140 MB to 80 MB after cleaning the GroupContentFiles.
- Apache JIRA: IMPALA-11265
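A minimal sketch of the pattern with illustrative types (not Impala's actual catalog code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: convert Iceberg-format content files into Impala-format
// descriptors, then drop the Iceberg objects so the catalog cache does not
// retain both representations.
final class GroupContentFilesSketch {
    static List<String> toImpalaDescriptors(List<Object> icebergContentFiles) {
        List<String> descriptors = new ArrayList<>(icebergContentFiles.size());
        for (Object contentFile : icebergContentFiles) {
            descriptors.add(contentFile.toString()); // stand-in for the real conversion
        }
        // The fix: clear the Iceberg ContentFiles once conversion is complete,
        // which is what shrank the example table's footprint from 140 MB to 80 MB.
        icebergContentFiles.clear();
        return descriptors;
    }
}
```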
Fixed issues in Cloudera Runtime 7.3.1
- CDPD-48395: Upgrade the Parquet version to 1.12.3 for Hive
- 7.3.1
- This fix upgrades the Parquet version for Hive to 1.12.3, which
is the same Parquet version that is used for Iceberg.