Prerequisites and limitations for using Iceberg
Learn about the supported versions of Cloudera Data Engineering (CDE), Spark, and Data Lake for use with Apache Iceberg in CDE.
To use Apache Iceberg in CDE, you'll need the following prerequisites:
- Spark 3.2 or higher
- A compatible version of Data Lake, as listed in CDE and Data Lake compatibility (linked below)
- CDE 1.16 or higher
- AWS or Azure is supported starting in CDE 1.17-h1 (which supports Iceberg 0.14)
Limitations
- The use of Iceberg tables as Structured Streaming sources or sinks is not supported.
- PyIceberg is not supported. Using Spark SQL to query Iceberg tables in PySpark is supported (see the sketch after this list).
- Iceberg supports two timestamp types:
- timestamp (without timezone)
- timestamptz (with timezone)
In Spark 3.3 and earlier, Spark SQL supports a single TIMESTAMP type, which maps to the Iceberg timestamptz type. However, Impala is unable to write to Iceberg tables with timestamptz columns. To create Iceberg tables from Spark with timestamp rather than timestamptz columns, set the following configurations to true:
- spark.sql.iceberg.handle-timestamp-without-timezone
- spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables
Configure these properties only on Spark 3.3 and earlier.
Even with these properties set, Spark still handles the timestamp column as a timestamp with local timezone, so inconsistent results occur unless Spark is running in UTC. A session configuration sketch follows this list.
- Iceberg tables with equality deletes do not support partition evolution or schema evolution on primary key columns. From Spark, do not perform partition evolution on tables that have primary keys or identifier fields, and do not perform schema evolution on primary key columns, partition columns, or identifier fields.
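As a minimal sketch of the supported path around PyIceberg, an Iceberg table can be queried from PySpark through Spark SQL. This assumes a CDE Spark session that is already Iceberg-enabled; the database and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Reuse the existing Spark session (assumed to be Iceberg-enabled in CDE).
spark = SparkSession.builder.getOrCreate()

# Query an Iceberg table with Spark SQL rather than PyIceberg.
# "db.ice_table" is a placeholder for an existing Iceberg table.
df = spark.sql("SELECT * FROM db.ice_table LIMIT 10")
df.show()
```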
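The timestamp-related properties above can be set when the Spark session is built. This is a sketch for Spark 3.3 and earlier only; the application name is arbitrary, and pinning the session time zone to UTC reflects the caveat about inconsistent results.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-timestamp-without-timezone")
    # Allow Spark to read and write Iceberg timestamp (without timezone) columns.
    .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
    # Create new Iceberg tables with timestamp rather than timestamptz columns.
    .config("spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables", "true")
    # Spark still treats the column as a timestamp with local timezone,
    # so run the session in UTC to avoid inconsistent results.
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)
```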
Iceberg table format version 2
Iceberg table format version 2 (v2) is available starting in Iceberg 0.14. With format v2, row-level DELETE, UPDATE, and MERGE operations write delete files that encode which rows were deleted from existing data files, instead of rewriting the affected data files. When the data is read, the encoded deletes are applied to the affected rows. This functionality is called merge-on-read.
To use Iceberg table format v2, you'll need the following prerequisites:
- CDE 1.17-h1 or higher
- Iceberg 0.14
- Spark 3.2 or higher
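With the prerequisites above in place, the following sketch shows one way to opt a new table into format v2 with merge-on-read write modes from PySpark. The database and table names are placeholders, and the table properties shown are standard Iceberg properties rather than CDE-specific settings.

```python
from pyspark.sql import SparkSession

# Reuse the existing Spark session (assumed to be Iceberg-enabled in CDE).
spark = SparkSession.builder.getOrCreate()

# Sketch only: create an Iceberg v2 table; names and properties are illustrative.
spark.sql("""
    CREATE TABLE db.ice_v2_table (id BIGINT, data STRING)
    USING iceberg
    TBLPROPERTIES (
        'format-version' = '2',                 -- opt in to Iceberg table format v2
        'write.delete.mode' = 'merge-on-read',  -- DELETE writes delete files
        'write.update.mode' = 'merge-on-read',  -- UPDATE writes delete files
        'write.merge.mode'  = 'merge-on-read'   -- MERGE writes delete files
    )
""")

# Row-level DELETE: with merge-on-read, this writes delete files instead of
# rewriting the affected data files.
spark.sql("DELETE FROM db.ice_v2_table WHERE id = 42")
```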