Iceberg known issues
This topic describes the Iceberg-related known issues in Cloudera Data Warehouse (CDW) Private Cloud.
Technical Service Bulletins
- TSB 2024-746: Concurrent compactions and modify statements can corrupt Iceberg tables
- Apache Hive (Hive) and Apache Impala (Impala) modify statements (DELETE/UPDATE/MERGE) on Apache Iceberg (Iceberg) V2 tables can corrupt the tables if there is a concurrent table compaction from Apache Spark. The issue occurs when the compaction and the modify statement run in parallel and the compaction job commits before the modify statement. In this case, the position delete files of the modify statement still point to the old data files. This has the following consequences:
- For DELETE statements:
  - Delete records that point to old files have no effect
- For UPDATE/MERGE statements:
  - Delete records that point to old files have no effect
  - The table will also have the newly added data records
  - Rewritten records will still be active
This issue does not affect Apache NiFi (NiFi) and Apache Flink (Flink) as these components write equality delete files.
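The commit ordering described above can be illustrated with a small toy model. This is only a sketch: `ToyTable`, `plan_delete`, and `compact` are hypothetical names invented for illustration and do not correspond to any Iceberg, Hive, Impala, or Spark API. It assumes a position delete is simply a (file path, row position) pair, and shows why a delete planned against the old files has no effect once compaction has replaced those files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for an Iceberg V2 table (not a real API)."""
    data_files: dict = field(default_factory=dict)      # file path -> list of rows
    position_deletes: set = field(default_factory=set)  # {(file path, row position)}

    def live_rows(self):
        # A row is live unless a position delete names its exact file and position.
        return sorted(
            row
            for path, rows in self.data_files.items()
            for pos, row in enumerate(rows)
            if (path, pos) not in self.position_deletes
        )


def plan_delete(table, predicate):
    """Compute position deletes against the files visible right now."""
    return {
        (path, pos)
        for path, rows in table.data_files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    }


def compact(table, new_path):
    """Rewrite all live rows into a single new file, replacing the old files."""
    table.data_files = {new_path: table.live_rows()}


def new_table():
    return ToyTable(data_files={"a.parquet": [1, 2], "b.parquet": [3, 4]})


def racy_delete():
    table = new_table()
    stale = plan_delete(table, lambda row: row == 3)  # DELETE starts, sees old files
    compact(table, "c.parquet")                       # compaction commits first
    table.position_deletes |= stale                   # DELETE commits stale positions
    return table.live_rows()


def serial_delete():
    table = new_table()
    compact(table, "c.parquet")                       # compaction finishes first
    table.position_deletes |= plan_delete(table, lambda row: row == 3)
    return table.live_rows()


print(racy_delete())    # [1, 2, 3, 4]  row 3 survives: its delete points at b.parquet
print(serial_delete())  # [1, 2, 4]     row 3 is removed
```

In the racy case the delete record still names `b.parquet`, which no longer exists after compaction, so the row it was meant to remove stays visible.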
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2024-746: Concurrent compactions and modify statements can corrupt Iceberg tables
- TSB 2024-752: Dangling delete issue in Spark rewrite_data_files procedure causes incorrect results for Iceberg V2 tables
- The Spark Iceberg library includes two procedures: rewrite_data_files and rewrite_position_delete_files. The current implementation of rewrite_data_files has a limitation: position delete files are not deleted and are still tracked by the table metadata, even if they no longer refer to an active data file. This is called the dangling delete problem. To solve this, the rewrite_position_delete_files procedure is implemented in the Spark Iceberg library to remove these old “dangling” position delete files.
When an Iceberg table with dangling deletes is queried in Impala, Impala tries to optimize a select count(*) from iceberg_table query by returning the result from stats. Because of the dangling delete limitation, this optimization returns incorrect results.
The following conditions must be met for this issue to occur:
- All delete files in the Iceberg table are “dangling”
  - This would occur immediately after running Spark rewrite_data_files AND
    - Before any further delete operations are performed on the table OR
    - Before Spark rewrite_position_delete_files is run on the table
- Only stats optimized plain select count(*) from iceberg_table queries are affected. For example, the query should not have:
  - Any WHERE clause
  - Any GROUP BY clause
  - Any HAVING clause
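The first condition above (all delete files are “dangling”) can be sketched as a simple check. This is a toy illustration under stated assumptions, not how Iceberg, Spark, or Impala inspect table metadata: the function names and the representation of a delete file as a list of referenced data file paths are hypothetical.

```python
def dangling_delete_files(live_data_files, delete_files):
    """Return the delete files that reference no live data file.

    live_data_files: iterable of data file paths in the current snapshot.
    delete_files: dict of delete file name -> list of data file paths it references.
    Both shapes are assumptions made for this sketch.
    """
    live = set(live_data_files)
    return [name for name, refs in delete_files.items() if not (set(refs) & live)]


def all_deletes_dangling(live_data_files, delete_files):
    """True when the table has delete files and every one of them is dangling."""
    dangling = dangling_delete_files(live_data_files, delete_files)
    return bool(delete_files) and len(dangling) == len(delete_files)


# State right after a compaction rewrote a.parquet and b.parquet into c.parquet:
live = ["c.parquet"]
deletes = {"del-1": ["a.parquet"], "del-2": ["a.parquet", "b.parquet"]}
print(all_deletes_dangling(live, deletes))   # True

# A later delete operation adds a delete file that references a live data file:
deletes["del-3"] = ["c.parquet"]
print(all_deletes_dangling(live, deletes))   # False
```

The second call reflects the note in the list: once a further delete operation is performed on the table, not all delete files are dangling any more.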
Remove dangling deletes: After rewrite_data_files, position delete records pointing to the rewritten data files are not always marked for removal, and can remain tracked by the live snapshot metadata of the table. This is known as the dangling delete problem.
- Knowledge article
- For the latest update on this issue see the corresponding Knowledge article: TSB 2024-752: Dangling delete issue in Spark rewrite_data_files procedure causes incorrect results for Iceberg V2 tables.