Iceberg-related known issues in CDW Private Cloud

This topic describes the Iceberg-related known issues in Cloudera Data Warehouse (CDW) Private Cloud.

Known issues identified in 1.5.2

DWX-16591: Concurrent merge and update Iceberg queries are failing: Concurrent merge and update queries on an Iceberg table can fail with the following error in the Hive application logs: "Base metadata location hdfs://<Location-A> is not same as the current table metadata location ‘<Location-B>’ for default.merge_insert_target_iceberg\rorg.apache.iceberg.exceptions.CommitFailedException". This happens when two concurrent queries, Query A and Query B, have overlapping updates. For example, if Query A commits its data and delete files first, Query B fails validation because of the conflicting writes. In this case, Query B should invalidate the commit files that it has already generated and re-execute the full query on the latest snapshot.
Workaround: None.
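
The re-execution described above can be approximated on the client side. The following is a minimal sketch, not a documented workaround: it assumes a hypothetical run_query callable that submits the full MERGE or UPDATE statement through your own client and raises an exception whose message contains the CommitFailedException text quoted above. The function name, retry count, and backoff values are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Substring taken from the error message quoted in the issue description.
COMMIT_CONFLICT_MARKER = "CommitFailedException"


def run_with_retry(run_query: Callable[[], T],
                   max_attempts: int = 3,
                   backoff_seconds: float = 1.0) -> T:
    """Re-run the whole statement when it fails with a commit conflict.

    run_query is a hypothetical callable (an assumption, not a CDW API) that
    submits the full statement and raises on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            is_conflict = COMMIT_CONFLICT_MARKER in str(exc)
            # Re-raise anything that is not a commit conflict, and give up
            # once the last attempt has also failed.
            if not is_conflict or attempt == max_attempts:
                raise
            # Wait a little longer after each conflict before retrying.
            time.sleep(backoff_seconds * attempt)
    raise AssertionError("unreachable")
```

Because the failed query did not commit, re-running the full statement operates on the latest snapshot. Whether a retry is appropriate still depends on your workload.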
CDPD-59413: Unable to view Iceberg table metadata in Atlas: You may see the following exception in the Atlas application logs when you create an Iceberg table from the CDW data service associated with a CDP Private Cloud Base 7.1.8 or 7.1.7 SP2 cluster: Type ENTITY with name iceberg_table does not exist. This happens because the Atlas server on CDP Private Cloud Base 7.1.8 and 7.1.7 SP2 does not include the functionality required to support Iceberg tables. This does not affect creating, querying, or modifying Iceberg tables using CDW, nor does it affect creating policies in Ranger.
On CDP Private Cloud Base 7.1.9, Iceberg table entities are not created in Atlas. You can ignore the following error in the Atlas application logs: ERROR - [NotificationHookConsumer thread-1:] ~ graph rollback due to exception (GraphTransactionInterceptor:200) org.apache.atlas.exception.AtlasBaseException: invalid relationshipDef: hive_table_storagedesc: end type 1: hive_storagedesc, end type 2: iceberg_table
Workaround: If you are on CDP Private Cloud Base 7.1.7 SP2 or 7.1.8, you can manually add the Iceberg model file 1130-iceberg_table_model.json to the /opt/cloudera/parcels/CDH/lib/atlas/models/1000-Hadoop directory as follows:

SSH into the Atlas server host as an Administrator.

Change directory to the following:
cd /opt/cloudera/parcels/CDH/lib/atlas/models/1000-Hadoop

Create a file called 1130-iceberg_table_model.json with the following content:
{
  "enumDefs": [],
  "structDefs": [],
  "classificationDefs": [],
  "entityDefs": [
    {
      "name": "iceberg_table",
      "superTypes": ["hive_table"],
      "serviceType": "hive",
      "typeVersion": "1.0",
      "attributeDefs": [
        {
          "name": "partitionSpec",
          "typeName": "array<string>",
          "cardinality": "SET",
          "isIndexable": false,
          "isOptional": true,
          "isUnique": false
        }
      ]
    },
    {
      "name": "iceberg_column",
      "superTypes": ["hive_column"],
      "serviceType": "hive",
      "typeVersion": "1.0"
    }
  ],
  "relationshipDefs": [
    {
      "name": "iceberg_table_columns",
      "serviceType": "hive",
      "typeVersion": "1.0",
      "relationshipCategory": "COMPOSITION",
      "relationshipLabel": "__iceberg_table.columns",
      "endDef1": {
        "type": "iceberg_table",
        "name": "columns",
        "isContainer": true,
        "cardinality": "SET",
        "isLegacyAttribute": true
      },
      "endDef2": {
        "type": "iceberg_column",
        "name": "table",
        "isContainer": false,
        "cardinality": "SINGLE",
        "isLegacyAttribute": true
      },
      "propagateTags": "NONE"
    }
  ]
}

Save the file and exit.

Restart the Atlas service using Cloudera Manager.
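
Before restarting Atlas, it can help to confirm that the file you created parses as JSON and contains the definitions listed in the content above. The following is a minimal sketch of such a check. The function name check_iceberg_model is hypothetical, and the check only looks at the definition names from this topic; it does not validate the file against Atlas itself.

```python
import json
from typing import List


def check_iceberg_model(text: str) -> List[str]:
    """Return a list of problems found in the model JSON; empty means none found."""
    try:
        model = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(model, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    # Names of the entity definitions present in the file.
    entity_names = {e.get("name") for e in model.get("entityDefs", [])}
    for required in ("iceberg_table", "iceberg_column"):
        if required not in entity_names:
            problems.append(f"missing entityDef: {required}")

    # Names of the relationship definitions present in the file.
    relationship_names = {r.get("name") for r in model.get("relationshipDefs", [])}
    if "iceberg_table_columns" not in relationship_names:
        problems.append("missing relationshipDef: iceberg_table_columns")
    return problems
```

To use it, read the contents of 1130-iceberg_table_model.json into a string, pass it to check_iceberg_model, and fix any reported problems before the restart.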

Technical Service Bulletins

TSB 2024-745: Impala returns incorrect results for Iceberg V2 tables when optimized operator is being used in CDW: Cloudera Data Warehouse (CDW) customers using Apache Impala (Impala) to read Apache Iceberg (Iceberg) V2 tables can encounter an issue of Impala returning incorrect results when the optimized V2 operator is used. The optimized V2 operator is enabled by default in the affected versions below. The issue only affects Iceberg V2 tables that have position delete files.
Knowledge article: For the latest update on this issue, see the corresponding Knowledge article: TSB 2024-745: Impala returns incorrect results for Iceberg V2 tables when optimized operator is being used in CDW.

TSB 2024-746: Concurrent compactions and modify statements can corrupt Iceberg tables: Apache Hive (Hive) and Apache Impala (Impala) modify statements (DELETE/UPDATE/MERGE) on Apache Iceberg (Iceberg) V2 tables can corrupt the tables if there is a concurrent table compaction from Apache Spark. The issue occurs when the compaction and the modify statement run in parallel and the compaction job commits before the modify statement. In this case, the position delete files of the modify statement still point to the old data files. The consequences depend on the statement type:

- DELETE statements
  - Delete records pointing to old files have no effect
- UPDATE / MERGE statements
  - Delete records pointing to old files have no effect
  - The table will also have the newly added data records
  - Rewritten records will still be active
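
The sequence above can be illustrated with a toy model. This is a simplified sketch for intuition only, not how Hive, Impala, or Iceberg are implemented: a table is modelled as a mapping from data file name to rows, and a position delete as a (file name, row position) pair. All file names and helper functions are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A table is modelled as {data_file_name: [rows]}; a position delete is a
# (data_file_name, row_position) pair.
Table = Dict[str, List[str]]
PositionDelete = Tuple[str, int]


def compact(table: Table, new_file: str) -> Table:
    """Rewrite all data files into one new file, as a compaction would."""
    merged = [row for name in sorted(table) for row in table[name]]
    return {new_file: merged}


def read(table: Table, deletes: Set[PositionDelete]) -> List[str]:
    """Return live rows; a delete applies only if its file is still in the table."""
    return [row
            for name in sorted(table)
            for pos, row in enumerate(table[name])
            if (name, pos) not in deletes]


# The DELETE statement plans against the snapshot it started from ...
before = {"data-1.parquet": ["a", "b"], "data-2.parquet": ["c"]}
planned_deletes = {("data-1.parquet", 1)}  # intends to delete row "b"

# ... but the compaction commits first and replaces the data files.
after_compaction = compact(before, "data-3.parquet")

# The position delete still names data-1.parquet, so it matches nothing.
rows_expected = read(before, planned_deletes)           # ['a', 'c']
rows_actual = read(after_compaction, planned_deletes)   # ['a', 'b', 'c']
```

In the model, row "b" survives because the delete record refers to a file that the compaction has already replaced.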

This issue does not affect Apache NiFi (NiFi) and Apache Flink (Flink) as these components write equality delete files.
Knowledge article: For the latest update on this issue, see the corresponding Knowledge article: TSB 2024-746: Concurrent compactions and modify statements can corrupt Iceberg tables.

TSB 2024-752: Dangling delete issue in Spark rewrite_data_files procedure causes incorrect results for Iceberg V2 tables

The Spark Iceberg library includes two procedures: rewrite_data_files and rewrite_position_delete_files. The current implementation of rewrite_data_files has a limitation: position delete files are not deleted and are still tracked by the table metadata, even if they no longer refer to an active data file. This is called the dangling delete problem. To solve this, the rewrite_position_delete_files procedure is implemented in the Spark Iceberg library to remove these old “dangling” position delete files.

Because of this limitation, when an Iceberg table with dangling deletes is queried in Impala, Impala tries to optimize a plain select count(*) from iceberg_table query by returning the result from table statistics. With dangling deletes present, this optimization returns incorrect results.

The following conditions must be met for this issue to occur:

- All delete files in the Iceberg table are “dangling”. This occurs immediately after running Spark rewrite_data_files AND
  - Before any further delete operations are performed on the table OR
  - Before Spark rewrite_position_delete_files is run on the table
- Only stats-optimized plain select count(*) from iceberg_table queries are affected. That is, the query must not have:
  - Any WHERE clause
  - Any GROUP BY clause
  - Any HAVING clause

Remove dangling deletes: After rewrite_data_files, position delete records pointing to the rewritten data files are not always marked for removal, and can remain tracked by the live snapshot metadata of the table. This is known as the dangling delete problem.
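
As a minimal sketch of running the two procedures back to back, the helper below builds the Spark SQL CALL statements for a given table. The helper name maintenance_calls, the catalog name, and the use of the named-argument form (table => '...') are assumptions for illustration. Whether rewrite_position_delete_files is available depends on the Iceberg version in your Spark runtime.

```python
import re
from typing import List

# Accept only simple identifiers so that the generated SQL stays well-formed.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def maintenance_calls(catalog: str, database: str, table: str) -> List[str]:
    """Build CALL statements: compact data files, then remove dangling deletes."""
    for part in (catalog, database, table):
        if not _IDENT.match(part):
            raise ValueError(f"unexpected identifier: {part!r}")
    target = f"{database}.{table}"
    return [
        f"CALL {catalog}.system.rewrite_data_files(table => '{target}')",
        f"CALL {catalog}.system.rewrite_position_delete_files(table => '{target}')",
    ]
```

With an existing SparkSession named spark (an assumption), each returned statement could be passed to spark.sql in order, so that the position delete rewrite always follows the data file rewrite.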

Knowledge article: For the latest update on this issue, see the corresponding Knowledge article: TSB 2024-752: Dangling delete issue in Spark rewrite_data_files procedure causes incorrect results for Iceberg V2 tables.

TSB 2024-758: Truncate command on Iceberg V2 branches cause unintentional data deletion: When working with Apache Hive (Hive) and Apache Iceberg (Iceberg) V2 tables, using the TRUNCATE statement may lead to unintended data deletion. This issue arises when the truncate command is applied to a branch of an Iceberg table. Instead of truncating the branch itself, the command affects the original (main) table, which results in unintended loss of data.
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2024-758: Truncate command on Iceberg V2 branches cause unintentional data deletion.
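
Until a fix is applied, one precaution is to flag TRUNCATE statements that target a branch before they are submitted. The following is a minimal sketch under a stated assumption: that branches in your Hive statements are addressed as <db>.<table>.branch_<name>. The function name truncates_branch is hypothetical, and the pattern can produce false positives, for example for a table whose own name starts with branch_.

```python
import re

# Assumption: a branch is referenced with a trailing ".branch_<name>" part,
# for example default.ice_t.branch_dev. Adjust the pattern if your
# statements address branches differently.
_TRUNCATE_BRANCH = re.compile(
    r"^\s*TRUNCATE\s+(?:TABLE\s+)?[\w`.]*\.branch_\w+",
    re.IGNORECASE,
)


def truncates_branch(statement: str) -> bool:
    """Return True if the statement looks like a TRUNCATE on a branch."""
    return bool(_TRUNCATE_BRANCH.match(statement))
```

A client could call truncates_branch on each statement and refuse to run those for which it returns True.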