Known Issues in Apache Iceberg

Learn about the known issues in Iceberg, their impact or changes to functionality, and any workarounds.

Known issues identified in Cloudera Runtime 7.3.2

DWX-18843: Unable to read Iceberg table from Hive Virtual Warehouse
7.3.2
If you have used Apache Flink to insert data into an Iceberg table that was created from Hive, you cannot read the Iceberg table from the Hive Virtual Warehouse.
Add the engine.hive.enabled table property through Hive Beeline and set its value to "true". You can set this property either while creating the Iceberg table or later by using the ALTER TABLE statement, as shown in the example below.
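A minimal sketch, assuming a hypothetical table named db.flink_tbl:

CREATE TABLE db.flink_tbl (id INT, data STRING)
STORED BY ICEBERG
TBLPROPERTIES ('engine.hive.enabled'='true');

-- Or, for an existing table:
ALTER TABLE db.flink_tbl SET TBLPROPERTIES ('engine.hive.enabled'='true');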
DWX-18489: Hive compaction of Iceberg tables results in a failure
7.3.2
When Cloudera Data Warehouse and Cloudera Data Hub are deployed in the same environment and use the same Hive Metastore (HMS) instance, the Cloudera Data Hub compaction workers can inadvertently pick up Iceberg compaction tasks. Since Iceberg compaction is not yet supported in the latest Cloudera Data Hub version, the compaction tasks will fail when they are processed by the Cloudera Data Hub compaction workers.

If Cloudera Data Warehouse and Cloudera Data Hub share the same HMS instance and you need to run both Hive ACID and Iceberg compaction jobs, it is recommended that you use the Cloudera Data Warehouse environment for these jobs. If you want to run only Hive ACID compaction tasks, you can use either the Cloudera Data Warehouse or the Cloudera Data Hub environment.

If you want to run the compaction jobs without changing the environment, it is recommended that you use Cloudera Data Warehouse. To avoid interference from Cloudera Data Hub, change the value of the hive.compactor.worker.threads Hive Server (HS2) property to '0'. This ensures that the compaction jobs are not processed by Cloudera Data Hub. Follow these steps; a configuration sketch follows them.
  1. In Cloudera Manager, click Clusters > Hive > Configuration to navigate to the configuration page for HMS.
  2. Search for hive.compactor.worker.threads and modify the value to '0'.
  3. Save the changes and restart the Hive service.
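If you manage the configuration outside the Cloudera Manager UI, the same setting corresponds to the following hive-site.xml entry (a sketch; the property name is the one referenced in the steps above):

<property>
  <name>hive.compactor.worker.threads</name>
  <value>0</value>
</property>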
DWX-17254: Merging Iceberg branches requires a target table alias
7.3.2
Hive supports only one level of qualifier when referencing columns; in other words, only one dot is accepted. For example, select table.col from ...; is allowed, but select db.table.col from ...; is not. Using the MERGE statement to merge Iceberg branches without a target or source table alias causes an exception:
org.apache.hadoop.hive.ql.parse.SemanticException: ... Invalid table alias or column reference ...
Use an alias, for example t, for the target table.
merge into mydb.target.branch_branch1 t using mydb.source.branch_branch1 s on t.id = s.id when matched then update set value = 'matched';
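For contrast, the following is a hypothetical illustration of the failing pattern: without aliases, the column references in the ON clause need more than one qualifier level, which Hive rejects with the SemanticException above.

merge into mydb.target.branch_branch1 using mydb.source.branch_branch1
on mydb.target.branch_branch1.id = mydb.source.branch_branch1.id
when matched then update set value = 'matched';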

Apache Jira: HIVE-28055

DWX-17210, DWX-13733: Timeout issue querying Iceberg tables from Hive
7.3.2
When querying Iceberg tables from Hive, the queries can fail due to a timeout issue.
  1. Add the following configurations to hadoop-core-site for the Database Catalog and the Virtual Warehouse, as shown in the sketch after these steps.
    • fs.s3.maxConnections=1000
    • fs.s3a.connection.maximum=1000
  2. Restart the Database Catalog and Virtual Warehouse.
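Expressed as core-site.xml properties, the hadoop-core-site entries from step 1 take the following form (a sketch; tune the values to your workload):

<property>
  <name>fs.s3.maxConnections</name>
  <value>1000</value>
</property>
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>1000</value>
</property>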
DWX-14163: Limitations reading Iceberg tables in Avro file format from Impala
7.3.2
The Avro, Impala, and Iceberg specifications describe some limitations related to Avro, and those limitations also apply in Cloudera. In addition to these, the DECIMAL type is not supported in this release.
None.
DEX-7946: Data loss during migration of a Hive table to Iceberg
7.3.2
In this release, the table property 'external.table.purge' is set to true by default, which causes the table data and metadata to be deleted if you drop the table during its migration from Hive to Iceberg.
Either of the following workarounds prevents data loss during table migration:
  • Set the table property 'external.table.purge'='FALSE' (see the example after this list).
  • Do not drop a table during migration from Hive to Iceberg.
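A minimal sketch of the first workaround, assuming a hypothetical table named db.hive_tbl:

ALTER TABLE db.hive_tbl SET TBLPROPERTIES ('external.table.purge'='FALSE');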
DWX-13062: Converting a Hive table having CHAR or VARCHAR columns to Iceberg causes an exception
7.3.2
CHAR and VARCHAR data can be shorter than the length specified by the data type; the remaining characters are padded with spaces. During conversion, this data is stored as a string in Iceberg, which can yield incorrect results when you query the converted Iceberg table.
Change columns from CHAR or VARCHAR to string types before converting the Hive table to Iceberg, as in the following example.
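A minimal sketch, assuming a hypothetical Hive table t with a VARCHAR column c:

ALTER TABLE t CHANGE COLUMN c c STRING;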

Apache Jira: HIVE-26507

Known issues identified before Cloudera Runtime 7.3.2

This section lists only unresolved issues from previous releases that continue to affect the Cloudera Runtime 7.3.2 base release.

CDPD-92182: Inserting into Hive Iceberg tables on S3 fails with RazS3ClientCredentialsException
7.3.2, 7.3.1.706, 7.3.1.600, 7.3.1.500
In RAZ-enabled clusters where HDFS is the default file system, attempts to insert data into Hive Iceberg tables that explicitly point to S3 locations fail with a RazS3ClientCredentialsException.
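A hypothetical illustration of the failing pattern; the database, table, and bucket names are placeholders:

CREATE EXTERNAL TABLE db.iceberg_s3_tbl (id INT, data STRING)
STORED BY ICEBERG
LOCATION 's3a://my-bucket/warehouse/iceberg_s3_tbl';

-- Fails with RazS3ClientCredentialsException:
INSERT INTO db.iceberg_s3_tbl VALUES (1, 'a');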
None.
CDPD-89390/CDPD-83022: Incorrect row count displayed in table metadata after compaction
7.3.2, 7.3.1.600, 7.3.1.500, 7.3.1.400
After running data compaction operations on large tables, the row count displayed by the DESCRIBE FORMATTED command may be inaccurate. Initially, the count may appear higher than the actual number of rows. Subsequently, after running the ANALYZE command to update table statistics, the count might then appear lower than the actual number of rows.

This issue has been observed in large tables containing a significant number of historical snapshots (more than 10,000), primarily generated through UPDATE operations.

Note that this is a metadata display issue only; there is no loss of data. The underlying table data remains complete and correct.

To obtain an accurate row count, use the SELECT COUNT(*) query.
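For example, with a placeholder table db.tbl:

-- Statistics-based row count; may be stale after compaction:
DESCRIBE FORMATTED db.tbl;

-- Accurate row count, computed by scanning the table:
SELECT COUNT(*) FROM db.tbl;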