Known Issues in Apache Iceberg

Learn about the known issues in Iceberg, the impact or changes to the functionality, and the workaround.

Known issues identified in Cloudera Runtime 7.3.1.400 SP2

CDPD-83022: Incorrect row count displayed in table metadata after compaction

7.3.1.400, 7.3.1.500

After running data compaction operations on large tables, the row count displayed by the DESCRIBE FORMATTED command may be inaccurate. Initially, the count may appear higher than the actual number of rows. Subsequently, after running the ANALYZE command to update table statistics, the count might then appear lower than the actual number of rows.

This issue has been observed in large tables containing a significant number of historical snapshots (exceeding 10000). These snapshots are primarily generated through UPDATE operations.

It is important to note that this is just a metadata display issue, and there is no loss of data. The underlying table data remains complete and correct.

To obtain an accurate row count, use the SELECT COUNT(*) query.

Known issues identified in Cloudera Runtime 7.3.1.300 SP1 CHF1

There are no new issues identified in this release.

Known issues identified in Cloudera Runtime 7.3.1.200 SP1

There are no new issues identified in this release.

Known issues identified in Cloudera Runtime 7.3.1.100 CHF1

CDPD-78381: Performance degradation noticed in some Hive Iceberg TPC-DS queries

7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500

During Hive TPC-DS (Parquet + Iceberg) performance benchmarking for Cloudera Runtime 7.3.1.100, the overall performance of Iceberg tables improved by 15.68% compared to Iceberg tables in Cloudera Runtime 7.3.1.0. However, some individual queries showed decreased performance.

None.

CDPD-78134: CBO fails when a materialized view is dropped but its pre-compiled plan remains in the registry

7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500

Consider a cluster that has two HiveServer2 (HS2) instances. Each HS2 instance maintains its own Materialized View (MV) registry, and the registries contain pre-compiled plans of MVs that are enabled for query rewriting. Without the registries, MVs would have to be loaded and compiled during each query compilation, resulting in slow query performance.

When MVs are created or dropped, they are added to or removed from the registry of the HS2 instance that issues the create or drop statement. The other HS2 instance is not immediately notified of the change. A background process is scheduled to refresh the registry; however, this process does not handle the removal of dropped MVs.

When an MV is dropped by one of the HS2 instances, it remains in the registry of the other HS2 instance. Now, if a query is processed in the second HS2 instance, the rewrite algorithm still attempts to use the dropped MV. If this MV is stored in an Iceberg table, the storage handler tries to refresh the MV metadata from the metastore but throws an exception because the MV no longer exists, resulting in a CBO failure.
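The sequence described above can be illustrated with a small toy model. This is only a sketch of the described behavior, not Hive code: the ToyRegistry class, its refresh method, and the mv_sales name are hypothetical and exist purely for illustration.

```python
# Toy model only: ToyRegistry and mv_sales are hypothetical names,
# not part of Hive. It mimics the behavior described in this issue.
class ToyRegistry:
    """Per-HS2 cache of pre-compiled materialized view plans."""

    def __init__(self):
        self.plans = {}

    def refresh(self, metastore_views):
        # As described above: the refresh adds MVs it has not seen yet,
        # but never removes entries for MVs that were dropped.
        for name in metastore_views:
            self.plans.setdefault(name, f"plan({name})")


metastore = {"mv_sales"}  # MVs that currently exist in the metastore
hs2_a, hs2_b = ToyRegistry(), ToyRegistry()
hs2_a.refresh(metastore)
hs2_b.refresh(metastore)

# HS2 A drops the MV: the metastore and A's own registry are updated.
metastore.discard("mv_sales")
hs2_a.plans.pop("mv_sales", None)

# The background refresh on HS2 B runs, but the dropped MV stays cached.
hs2_b.refresh(metastore)
stale = set(hs2_b.plans) - metastore
print(stale)  # {'mv_sales'}
```

In this model, any rewrite attempted on the second instance would still see the plan for the dropped MV, which corresponds to the point at which the real storage handler fails to refresh the MV metadata.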

Perform one of the following workarounds to address the issue:

Restart all the HS2 instances after dropping the MV.
From Cloudera Manager, go to Clusters > Hive > Configuration and add the hive.server2.materializedviews.registry.impl=DUMMY property in the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml. The DUMMY value indicates that MVs should not be cached and requests should be forwarded to Hive Metastore.
Note: The DUMMY setting is intended for testing purposes and can greatly increase the query compilation time.

Apache JIRA: HIVE-28773

Known issues in Cloudera Runtime 7.3.1

CDPD-75667: Querying an Iceberg table with a TIMESTAMP_LTZ column can result in data loss

7.3.1, 7.3.1.100

When you query an Iceberg table that has a TIMESTAMP_LTZ column, the query could result in data loss.

When creating Iceberg tables from Spark, set the following Spark configuration to avoid creating columns with the TIMESTAMP_LTZ type:

spark.sql.timestampType=TIMESTAMP_NTZ

Apache JIRA: IMPALA-13484
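For CDPD-75667, one way to apply the setting is to pass it to spark-submit with the --conf option. The following is a minimal sketch that only builds the argument list; the spark_conf_args helper is a hypothetical name introduced here, and how the arguments are passed to a job depends on your environment.

```python
def spark_conf_args(conf):
    """Turn a dict of Spark settings into spark-submit style arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# The property and value come from the workaround for CDPD-75667.
ntz_conf = {"spark.sql.timestampType": "TIMESTAMP_NTZ"}
print(" ".join(spark_conf_args(ntz_conf)))
# --conf spark.sql.timestampType=TIMESTAMP_NTZ
```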

CDPD-75411: SELECT COUNT query on an Iceberg table in AWS times out

7.3.1, 7.3.1.100, 7.3.1.200, 7.3.1.300

In an AWS environment, a SELECT COUNT query that is run on an Iceberg table times out because some 4 KB ORC file parts cannot be downloaded. This issue occurs because Iceberg uses the positional delete index only if the count of positional deletes is less than a threshold value, which is 100000 by default.

None.
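The threshold condition described in CDPD-75411 can be expressed as a simple comparison. This is an illustrative sketch only: the function and constant names are hypothetical, and the code is not taken from Iceberg.

```python
# Default threshold stated in the issue description for CDPD-75411.
DEFAULT_THRESHOLD = 100_000


def uses_positional_delete_index(delete_count, threshold=DEFAULT_THRESHOLD):
    """Hypothetical helper: the index is used only below the threshold."""
    return delete_count < threshold


print(uses_positional_delete_index(99_999))   # True
print(uses_positional_delete_index(100_000))  # False
```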

CDPD-75088, CDPD-75218: Iceberg tables in Azure cannot be partitioned by strings ending in '.'

7.3.1, 7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500

In an Azure environment, you cannot create Iceberg tables from Spark that are partitioned by string columns having a partition value that ends with the period (.) character. The query fails with the following error:

24/10/08 18:14:12 WARN scheduler.TaskSetManager: [task-result-getter-2]: Lost task 0.0 in stage 2.0 (TID 2) (spark-sfvq0t-compute0.spark-r9.l2ov-m7vs.int.cldr.work executor 1): java.lang.IllegalArgumentException: ABFS does not allow files or directories to end with a dot.

None.
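Because no workaround exists for CDPD-75088, it can help to check partition values before writing. The following is a minimal sketch, assuming (based on the ABFS error message) that only values ending with a period are rejected; the function name and the sample values are hypothetical.

```python
def values_rejected_by_abfs(partition_values):
    """Return the partition values that end with a period."""
    return [value for value in partition_values if value.endswith(".")]


# Hypothetical sample values for a string partition column.
sample = ["emea", "apac.", "v1.2", "inc."]
print(values_rejected_by_abfs(sample))  # ['apac.', 'inc.']
```

A value such as v1.2 contains a period but does not end with one, so this check does not flag it.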

CDPD-72942: Unable to read Iceberg table from Hive after writing data through Apache Flink

7.3.1, 7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500

If you create an Iceberg table with default values using Hive and insert data into the table through Apache Flink, you cannot then read the Iceberg table from Hive using the Beeline client, and the query fails with the following error:

Error while compiling statement: java.io.IOException: java.io.IOException: Cannot create an instance of InputFormat class org.apache.hadoop.mapred.FileInputFormat as specified in mapredWork!

The issue persists even after you use the ALTER TABLE statement to set the engine.hive.enabled table property to "true".

None.

Apache JIRA: HIVE-28525

CDPD-71962: Hive cannot write to a Spark Iceberg table bucketed by date column

7.3.1, 7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500

If you have used Spark to create an Iceberg table that is bucketed by the "date" column and then try inserting or updating this Iceberg table using Hive, the query fails with the following error:

Error: Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.hive.ql.exec.UDFArgumentException: ICEBERG_BUCKET() only takes STRING/CHAR/VARCHAR/BINARY/INT/LONG/DECIMAL/FLOAT/DOUBLE types as first argument, got DATE (state=42000,code=40000)

This issue does not occur if the Iceberg table is created through Hive.

None.

CDPD-84220: Cannot query Iceberg tables

7.3.1, 7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500

You cannot query existing Iceberg tables after you enable HDFS HA. This is because Iceberg stores the table path in the manifest files differently depending on whether HDFS HA is enabled. After you enable HDFS HA, you might not be able to query tables that were created before HDFS HA was enabled.

None.