Iceberg

You must be aware of the known issues and limitations, the areas of impact, and the workarounds for Iceberg in Cloudera Runtime 7.3.1.100.

Known issues identified in Cloudera Runtime 7.3.1.100

CDPD-75411: SELECT COUNT query on an Iceberg table in AWS times out
In an AWS environment, a SELECT COUNT query that is run on an Iceberg table times out because some 4 KB ORC file parts cannot be downloaded. This issue occurs because Iceberg uses the positional delete index only if the count of positional deletes is less than a threshold value, which is 100,000 by default.
None.
CDPD-78134: CBO fails when a materialized view is dropped but its pre-compiled plan remains in the registry
Consider a cluster having two HiveServer2 (HS2) instances. Each HS2 instance contains its own Materialized View (MV) registry, and the registries contain pre-compiled plans of MVs that are enabled for query rewriting. Without the registries, MVs would have to be loaded and compiled during each query compilation, resulting in slow query performance.

When MVs are created or dropped, they are added to or removed from the registry pertaining to the HS2 instance that issues the CREATE or DROP statement. The other HS2 instance is not immediately notified of the change. A background process is scheduled to refresh the registry; however, this process does not handle the removal of dropped MVs.

When an MV is dropped by one of the HS2 instances, it remains in the registry of the other HS2 instance. Now, if a query is processed in the second HS2 instance, the rewrite algorithm still attempts to use the dropped MV. If this MV is stored in an Iceberg table, the storage handler tries to refresh the MV metadata from the metastore but throws an exception because the MV no longer exists, resulting in a CBO failure.

Perform one of the following workarounds to address the issue:
  • Restart all the HS2 instances after dropping the MV.
  • From Cloudera Manager, go to Clusters > Hive > Configuration and add the hive.server2.materializedviews.registry.impl=DUMMY property in the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml, as shown in the snippet below. The DUMMY value indicates that MVs should not be cached and that requests should be forwarded to the Hive Metastore.
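If you use the View as XML option of the safety valve, the entry would look like the following minimal sketch of the property described above:
<property>
  <name>hive.server2.materializedviews.registry.impl</name>
  <value>DUMMY</value>
</property>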
Apache JIRA: HIVE-28773
CDPD-78381: Performance degradation noticed in some Hive Iceberg TPC-DS queries
While running Hive TPC-DS (Parquet + Iceberg) performance benchmarking for Cloudera Runtime 7.3.1.100, the overall performance of Iceberg tables improved by 15.68% compared to Iceberg tables in Cloudera Runtime 7.3.1.0. However, some of the queries showed decreased performance.
None.

Known issues identified before Cloudera Runtime 7.3.1.100

CDPD-75667: Querying an Iceberg table with a TIMESTAMP_LTZ column can result in data loss
When you query an Iceberg table that has a TIMESTAMP_LTZ column, the query could result in data loss.
When creating Iceberg tables from Spark, set the following Spark configuration to avoid creating columns with the TIMESTAMP_LTZ type:
spark.sql.timestampType=TIMESTAMP_NTZ
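For example, in a spark-sql session you can set the property before creating the table; the table and column names below are illustrative:
SET spark.sql.timestampType=TIMESTAMP_NTZ;

-- With the property set, the TIMESTAMP keyword maps to TIMESTAMP_NTZ
-- instead of TIMESTAMP_LTZ.
CREATE TABLE db.events (id INT, event_time TIMESTAMP) USING iceberg;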
Apache JIRA: IMPALA-13484
CDPD-75088: Iceberg tables in Azure cannot be partitioned by strings ending in '.'
In an Azure environment, you cannot create Iceberg tables from Spark that are partitioned by string columns when a partition value ends with a period (.) character. The query fails with the following error:
24/10/08 18:14:12 WARN  scheduler.TaskSetManager: [task-result-getter-2]: Lost task 0.0 in stage 2.0 (TID 2) (spark-sfvq0t-compute0.spark-r9.l2ov-m7vs.int.cldr.work executor 1): java.lang.IllegalArgumentException: ABFS does not allow files or directories to end with a dot.
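The following sketch illustrates the failure; the table and partition value are hypothetical:
-- The partition value ends with a period, so the resulting partition
-- directory name ends with a dot, which ABFS rejects.
CREATE TABLE db.sales (id INT, region STRING)
USING iceberg
PARTITIONED BY (region);

INSERT INTO db.sales VALUES (1, 'emea.');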
None.
CDPD-72942: Unable to read Iceberg table from Hive after writing data through Apache Flink
If you create an Iceberg table with default values using Hive and insert data into the table through Apache Flink, you cannot then read the Iceberg table from Hive using the Beeline client, and the query fails with the following error:
Error while compiling statement: java.io.IOException: java.io.IOException: Cannot create an instance of InputFormat class org.apache.hadoop.mapred.FileInputFormat as specified in mapredWork!

The issue persists even after you use the ALTER TABLE statement to set the engine.hive.enabled table property to "true".
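For reference, the statement that does not resolve the issue is of the following form (the table name is illustrative):
ALTER TABLE db.flink_sink SET TBLPROPERTIES ('engine.hive.enabled'='true');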

None.
Apache JIRA: HIVE-28525
CDPD-71962: Hive cannot write to a Spark Iceberg table bucketed by date column
If you have used Spark to create an Iceberg table that is bucketed by a DATE column and then try to insert into or update this Iceberg table using Hive, the query fails with the following error:
Error: Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.hive.ql.exec.UDFArgumentException:  ICEBERG_BUCKET() only takes STRING/CHAR/VARCHAR/BINARY/INT/LONG/DECIMAL/FLOAT/DOUBLE types as first argument, got DATE (state=42000,code=40000)

This issue does not occur if the Iceberg table is created through Hive.
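As an illustrative sketch (table and column names are hypothetical), a Spark DDL such as the first statement below leads to the error on subsequent Hive writes, whereas creating the table through Hive, as in the second statement, avoids it:
-- Spark DDL: Hive INSERT/UPDATE statements against this table fail
CREATE TABLE db.orders (id INT, order_date DATE)
USING iceberg
PARTITIONED BY (bucket(16, order_date));

-- Hive DDL: the issue does not occur for tables created this way
CREATE EXTERNAL TABLE db.orders (id INT, order_date DATE)
PARTITIONED BY SPEC (bucket(16, order_date))
STORED BY ICEBERG;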

None.
CDPD-66305: Do not turn on the optimized Iceberg V2 operator
The optimized Iceberg V2 operator is disabled by default due to a correctness issue. The correct setting for the property that turns off the operator is DISABLE_OPTIMIZED_ICEBERG_V2_READ=true.
Accept the default setting of the V2 operator. Do not change the setting from true to false.
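Assuming this property is exposed as an Impala query option, you can restore the default at the session level in impala-shell:
SET DISABLE_OPTIMIZED_ICEBERG_V2_READ=true;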
CDPD-64629: Performance degradation of Iceberg tables compared to Hive tables
Cloudera testing of Iceberg and Hive tables using the Hive TPC-DS 1 TB dataset (Parquet) revealed slower performance for a few of the TPC-DS queries on Iceberg tables. Overall, however, queries on Iceberg external tables run faster than queries on Hive tables.
CDPD-57551: Performance issue can occur on reads after writes of Iceberg tables
Hive might generate too many small files, which causes performance degradation.
Maintain a relatively small number of data files under the Iceberg table or partition directory for efficient reads. To alleviate poor performance caused by too many small files, run the following queries:
TRUNCATE TABLE target;
INSERT OVERWRITE TABLE target select * from target FOR SYSTEM_VERSION AS OF <preTruncateSnapshotId>;
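To determine the snapshot ID to substitute for <preTruncateSnapshotId>, you can record the current snapshot before truncating by querying the table's history metadata table, for example (assuming the table is in the default database):
SELECT * FROM default.target.history;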