Apache Iceberg in Cloudera Data Platform

Apache Iceberg is a cloud-native, high-performance open table format for organizing petabyte-scale analytic datasets on a file system or object store. Combined with Cloudera Data Platform (CDP), users can build an open data lakehouse architecture for multi-function analytics and to deploy large scale end-to-end pipelines.

Open data lakehouse on CDP simplifies advanced analytics on all data with a unified platform for structured and unstructured data and integrated data services to enable any analytics use case from ML, BI to stream analytics and real-time analytics. Apache Iceberg is the secret sauce of the open lakehouse.

Spark integration with Iceberg v1 is available on a GA (general availability) status in CDP Public Cloud. In Cloudera Data Warehouse (CDW) Public Cloud, Iceberg v2 integration from Hive and Impala is also GA.

The Apache Iceberg format specification describes the following versions of tables:
  • v1

    Defines large analytic data tables using open format files.

  • v2

    Specifies ACID complaint tables including row-level deletes and updates.

The following Iceberg integrations are available as technical previews:
  • Cloudera Data Engineering Private Cloud
  • Cloudera Data Warehouse Private Cloud
  • Flink in Cloudera Streams Processing (CSP) Public Cloud
  • Cloudera DataFlow Flow Designer Public Cloud
  • CDF Data Service
  • Atlas integration with Iceberg in Data Hub
The CDP Shared Data Experience (SDX) provides compliance and self-service data access for all users with consistent security and governance. Data Visualization provides reporting and visualization of Iceberg data. Accessing Iceberg in CDP, you can perform the following tasks:
  • Get high throughput reads of large tables at petabyte scale.
  • Run time travel queries.
  • Query tables with high concurrency on your object store.
  • Query Iceberg tables in ORC or Parquet format from Hive or Impala.
  • Query Iceberg tables in Parquet format from Spark.
  • Query Iceberg V2 tables including delete, update, merge.
  • Evolve partitions and schemas quickly and easily.
  • Make schema changes quickly and easily.
  • Migrate Hive tables to Iceberg.

Supported ACID transaction properties

Iceberg supports atomic and isolated database transaction properties. Writers work in isolation, not affecting the live table, and perform a metadata swap only when the write is complete, making the changes in one atomic commit.

Iceberg uses snapshots to guarantee isolated reads and writes. You see a consistent version of table data without locking the table. Readers always see a consistent version of the data without the need to lock the table. Writers work in isolation, not affecting the live table, and perform a metadata swap only when the write is complete, making the changes in one atomic commit.

Iceberg partitioning

The Iceberg partitioning technique has performance advantages over conventional partitioning, such as Apache Hive partitioning. Iceberg hidden partitioning is easier to use. Iceberg supports in-place partition evolution; to change a partition, you do not rewrite the entire table to add a new partition column, and queries do not need to be rewritten for the updated table. Iceberg continuously gathers data statistics, which supports additional optimizations, such as partition pruning.

Iceberg uses multiple layers of metadata files to find and prune data. Hive and Impala keep track of data at the folder level and not at the file level, performing file list operations when working with data in a table. Performance problems occur during the execution of multiple list operations. Iceberg keeps track of a complete list of files within a table using a persistent tree structure. Changes to an Iceberg table use an atomic object/file level commit to update the path to a new snapshot file. The snapshot points to the individual data files through manifest files.

The manifest files in the diagram below track several data files across many partitions. These files store partition information and column metrics for each data file. A manifest list is an additional index for pruning entire manifests. File pruning increases efficiency.

Iceberg relieves Hive metastore (HMS) pressure by storing partition information in metadata files on the file system/object store instead of within the HMS. This architecture supports rapid scaling without performance hits.