Apache Iceberg in Cloudera Data Platform
Apache Iceberg is a cloud-native, high-performance open table format for organizing petabyte-scale analytic datasets on a file system or object store. Combined with Cloudera Data Platform (CDP), users can build an open data lakehouse architecture for multi-function analytics and to deploy large scale end-to-end pipelines.
Open data lakehouse on CDP simplifies advanced analytics on all data with a unified platform for structured and unstructured data and integrated data services to enable any analytics use case from ML, BI to stream analytics and real-time analytics. Apache Iceberg is the secret sauce of the open lakehouse.
- Apache Iceberg in Cloudera Data Engineering (CDE) Public Cloud
- Apache Iceberg in Cloudera Data Warehouse (CDW) Public Cloud
- Apache Iceberg in Cloudera Data Engineering (CDE) Private Cloud
- Apache Iceberg in Cloudera Data Warehouse (CDW) Private Cloud
- Apache Iceberg in Cloudera Public Cloud Data Hub 7.2.16 and later
- Cloudera DataFlow (CDF)
- CDF Data Service ingestion of Iceberg tables using the PutIceberg processor
- Cloudera Streams Processing Public Cloud integration with Iceberg using a connector to Flink
- Cloudera Machine Learning (CML) Public Cloud connection to Iceberg
Spark integration with Iceberg v1 is available on a GA (general availability) status in CDP Public Cloud. In Cloudera Data Warehouse (CDW) Public Cloud, Iceberg v2 integration from Hive and Impala is also GA.
- v1
Defines large analytic data tables using open format files.
- v2
Specifies ACID complaint tables including row-level deletes and updates.
- Cloudera Data Engineering Private Cloud
- Cloudera Data Warehouse Private Cloud
- Flink in Cloudera Streams Processing (CSP) Public Cloud
- Cloudera DataFlow Flow Designer Public Cloud
- CDF Data Service
- Atlas integration with Iceberg in Data Hub
- Get high throughput reads of large tables at petabyte scale.
- Run time travel queries.
- Query tables with high concurrency on your object store.
- Query Iceberg tables in ORC or Parquet format from Hive or Impala.
- Query Iceberg tables in Parquet format from Spark.
- Query Iceberg V2 tables including delete, update, merge.
- Evolve partitions and schemas quickly and easily.
- Make schema changes quickly and easily.
- Migrate Hive tables to Iceberg.
Supported ACID transaction properties
Iceberg supports atomic and isolated database transaction properties. Writers work in isolation, not affecting the live table, and perform a metadata swap only when the write is complete, making the changes in one atomic commit.
Iceberg uses snapshots to guarantee isolated reads and writes. You see a consistent version of table data without locking the table. Readers always see a consistent version of the data without the need to lock the table. Writers work in isolation, not affecting the live table, and perform a metadata swap only when the write is complete, making the changes in one atomic commit.
Iceberg partitioning
The Iceberg partitioning technique has performance advantages over conventional partitioning, such as Apache Hive partitioning. Iceberg hidden partitioning is easier to use. Iceberg supports in-place partition evolution; to change a partition, you do not rewrite the entire table to add a new partition column, and queries do not need to be rewritten for the updated table. Iceberg continuously gathers data statistics, which supports additional optimizations, such as partition pruning.
Iceberg uses multiple layers of metadata files to find and prune data. Hive and Impala keep track of data at the folder level and not at the file level, performing file list operations when working with data in a table. Performance problems occur during the execution of multiple list operations. Iceberg keeps track of a complete list of files within a table using a persistent tree structure. Changes to an Iceberg table use an atomic object/file level commit to update the path to a new snapshot file. The snapshot points to the individual data files through manifest files.

The manifest files in the diagram below track several data files across many partitions. These files store partition information and column metrics for each data file. A manifest list is an additional index for pruning entire manifests. File pruning increases efficiency.
Iceberg relieves Hive metastore (HMS) pressure by storing partition information in metadata files on the file system/object store instead of within the HMS. This architecture supports rapid scaling without performance hits.