Iceberg introduction

Apache Iceberg is a table format for huge analytics datasets that defines how metadata is stored and data files are organized. Iceberg is also a library that compute engines can use to read/write a table. Iceberg supports concurrent reads and writes on object stores, such as Ozone and on file systems, such as HDFS. HadoopFileIO is supported.

Supported engines and platforms

Querying Apache Iceberg from Apache Hive and Apache Impala is fully supported in Cloudera Data Warehouse (CDW) Private Cloud, deployed on CDP Private Cloud Base version 7.1.9. In this deployment, CDP integrates Iceberg with the following engines:
  • Apache Flink
  • Apache Hive
  • Apache NiFi
  • Apache Impala
  • Apache Spark 3
Querying Iceberg from Hive and Impala is a technical preview in CDW Private Cloud, deployed on CDP Private Cloud Base version 7.1.7 or 7.1.8. These versions of CDP Private Cloud Base and CDW Private Cloud are not fully interoperable with Iceberg. Iceberg when deployed on these versions is integrated with the following engines only:
  • Apache Hive
  • Apache Impala

Cloudera supports Iceberg on Red Hat OpenShift and Embedded Container Service (ECS) platforms in CDW on HDFS and Ozone.

Storage

The Hive metastore stores Iceberg metadata, including the location of the table.

Hive metastore plays a lightweight role in the Catalog operations. Iceberg relieves Hive metastore (HMS) pressure by storing partition information in metadata files on the file system/object store instead of within the HMS. This architecture supports rapid scaling without performance hits.

By default, Hive and Impala use the Iceberg HiveCatalog. Cloudera recommends the default HiveCatalog to create an Iceberg table.

Key features overview

You can use Iceberg when a single table contains tens of petabytes of data, and you can read these tables without compromising performance. From Hive and Impala, you can read and write V1 or V2 Iceberg tables. The following features are included:
  • Iceberg V2 tables support reading row-level deletes or updates*, making Apache Iceberg ACID-compliant with serializable isolation and an optimistic concurrency model
  • Materialized views of Iceberg tables
  • Enhanced maintenance features, such as expiring and removing old snapshots
  • Performance and scalability enhancements
* The support for delete operations shown in this table is limited to position deletes. Equality deletes are not supported in these releases.

For more information about Iceberg features in CDP, see the Iceberg Support Matrix.

Iceberg table security and visualization

Apache Iceberg integrates Apache Ranger for security. You can use Ranger integration with Hive and Impala to apply fine-grained access control to sensitive data in Iceberg tables. Iceberg is also integrated with Data Visualization for creating dashboards and other graphics of your Iceberg data.

Supported ACID transaction properties

Iceberg supports atomic and isolated database transaction properties. Writers work in isolation, not affecting the live table, and perform a metadata swap only when the write is complete, making the changes in one atomic commit.

Iceberg uses snapshots to guarantee isolated reads and writes. You see a consistent version of table data without locking the table. Readers always see a consistent version of the data without the need to lock the table. Writers work in isolation, not affecting the live table, and perform a metadata swap only when the write is complete, making the changes in one atomic commit.

Iceberg partitioning

The Iceberg partitioning technique has performance advantages over conventional partitioning, such as Apache Hive partitioning. Iceberg hidden partitioning is easier to use. Iceberg supports in-place partition evolution; to change a partition, you do not rewrite the entire table to add a new partition column, and queries do not need to be rewritten for the updated table. Iceberg continuously gathers data statistics, which supports additional optimizations, such as partition pruning.

Iceberg uses multiple layers of metadata files to find and prune data. Hive and Impala keep track of data at the folder level and not at the file level, performing file list operations when working with data in a table. Performance problems occur during the execution of multiple list operations. Iceberg keeps track of a complete list of files within a table using a persistent tree structure. Changes to an Iceberg table use an atomic object/file level commit to update the path to a new snapshot file. The snapshot points to the individual data files through manifest files.

The manifest files track data files across partitions, storing partition information and column metrics for each data file. A manifest list is an additional index for pruning entire manifests. File pruning increases efficiency.

Iceberg relieves Hive metastore (HMS) pressure by storing partition information in metadata files on the file system/object store instead of within the HMS. This architecture supports rapid scaling without performance hits.