Apache Iceberg in Cloudera Data Platform

Apache Iceberg is a cloud-native, open table format for organizing petabyte-scale analytic datasets on a file system or object store. Combined with the CDP architecture for multi-function analytics, you can deploy large-scale, end-to-end pipelines.

Iceberg supports atomic and isolated database transaction properties. Writers work in isolation, without affecting the live table, and perform a metadata swap only when the write is complete, making the changes in one atomic commit. Iceberg uses snapshots to guarantee isolated reads and writes: readers always see a consistent version of the table data without the need to lock the table.
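For illustration only, the following PySpark sketch shows how snapshots surface in practice. It assumes a SparkSession already configured for Iceberg (catalog and SQL extensions); the table name db.sample and the snapshot ID are hypothetical.

    # A minimal sketch, assuming a SparkSession configured for Iceberg
    # (catalog plus SQL extensions); db.sample and the snapshot ID are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-snapshots").getOrCreate()

    # Each committed write adds a snapshot; the snapshots metadata table lists them.
    spark.sql(
        "SELECT committed_at, snapshot_id, operation FROM db.sample.snapshots"
    ).show()

    # Read an earlier snapshot; the live table is never locked.
    old = (spark.read
           .option("snapshot-id", 5735460211544060522)  # hypothetical snapshot ID
           .format("iceberg")
           .load("db.sample"))
    old.show()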

The Iceberg partitioning technique has performance advantages over conventional partitioning, such as Apache Hive partitioning, and Iceberg hidden partitioning is easier to use. Iceberg supports in-place partition evolution: you can change the partition layout without rewriting the entire table, and queries do not need to be rewritten for the updated table. Iceberg continuously gathers data statistics, which supports additional optimizations, such as partition pruning.
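As an illustration, the following PySpark sketch creates a table with a hidden time-based partition and later evolves the partition spec in place. It assumes Spark is configured with the Iceberg SQL extensions; db.events is a hypothetical table.

    # A sketch of hidden partitioning and in-place partition evolution,
    # assuming the Iceberg SQL extensions are enabled; db.events is hypothetical.
    spark.sql("""
        CREATE TABLE db.events (
            id        BIGINT,
            event_ts  TIMESTAMP,
            payload   STRING)
        USING iceberg
        PARTITIONED BY (days(event_ts))   -- hidden partition: no explicit day column
    """)

    # Change the partition layout without rewriting existing data files;
    # only data written after this change uses the new spec.
    spark.sql("ALTER TABLE db.events ADD PARTITION FIELD bucket(16, id)")

    # Queries filter on the source column; Iceberg prunes partitions automatically.
    spark.sql("SELECT count(*) FROM db.events WHERE event_ts >= '2022-01-01'").show()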

Cloudera Data Engineering (CDE) and Cloudera Data Warehouse (CDW) support Apache Iceberg as a technical preview:

In this technical preview, the Apache Iceberg v1 format and Parquet files are supported. Enhanced DDL, including partition and schema evolution, is included. Query performance in this technical preview is on par with querying Hive external tables.

By accessing Iceberg from within CDW and CDE, you can perform the following tasks:
  • Get high throughput reads of large tables at petabyte scale.
  • Run time travel queries (see the sketch after this list).
  • Query tables with high concurrency on Amazon S3.
  • Query Iceberg tables in ORC or Parquet format from Hive or Impala.
  • Query Iceberg tables in Parquet format from Spark.
  • Evolve partitions and schemas quickly and easily.
  • Make schema changes quickly and easily.
  • Integrate with SDX (table authorization and policies).
  • Migrate Hive tables to Iceberg.
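The time travel and schema change tasks listed above might look like the following from Spark. This is a hedged sketch that reuses the hypothetical db.events table and assumes the same Iceberg-enabled session as the earlier examples.

    # Time travel: read the table as it existed at an earlier point in time.
    df = (spark.read
          .option("as-of-timestamp", "1654041600000")  # milliseconds since the epoch
          .format("iceberg")
          .load("db.events"))
    df.show()

    # Schema evolution: add a column in place; no table rewrite is needed.
    spark.sql("ALTER TABLE db.events ADD COLUMNS (source STRING)")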

Limitations

  • Storing Iceberg tables in Avro format is not supported.
  • If partition columns are not present in the data files, tables cannot be read.
  • Spark can read Iceberg tables created by Hive and Impala only if you select Data Lake version 7.2.12.1 or later when you register the CDP environment.
  • AWS storage is the only supported storage.