Iceberg introduction

Apache Iceberg is a table format for huge analytics datasets in the cloud. You can efficiently query large Iceberg tables on object stores. Iceberg supports concurrent reads and writes on all storage media. Iceberg supports HadoopFileIO.

You can use Iceberg when a single table contains tens of petabytes of data, and you can read these tables without compromising performance. Iceberg catalog is set to HiveCatalog for Metastore management of Iceberg tables. The catalog stores Iceberg metadata, including the location of the table. Iceberg catalog use of Hive metastore for catalog operations is lightweight. Iceberg relieves Hive metastore (HMS) pressure by storing partition information in metadata files on the file system/object store instead of within the HMS. This architecture supports rapid scaling without performance hits.

Cloudera recommends the default Iceberg catalog to create an Iceberg table.

Apache Iceberg integrates Apache Ranger for security. You can use Ranger integration with Impala to apply fine-grained access control to sensitive data in Iceberg tables. Iceberg is also integrated with Data Visualization for creating dashboards and other graphics of your Iceberg data.

Supported ACID transaction properties

Iceberg supports atomic and isolated database transaction properties. Writers work in isolation, not affecting the live table, and perform an atomic swap for metadata pointers only when the transactions are committed, making the changes in one atomic commit.

Iceberg uses snapshots to guarantee isolated reads and writes. You see a consistent version of table data without locking the table. Readers always see a consistent version of the data without the need to lock the table. Writers work in isolation, not affecting the live table.

Iceberg partitioning

The Iceberg partitioning technique has performance advantages over conventional partitioning, such as Impala partitioning. Iceberg hidden partitioning is easier to use. Iceberg supports in-place partition evolution; to change a partition, you do not rewrite the entire table to add a new partition column, and queries do not need to be rewritten for the updated table. Iceberg continuously gathers data statistics, which supports additional optimizations, such as partition pruning.

Iceberg uses multiple layers of metadata files to find and prune data. Impala, Spark, and Flink keep track of data at the folder level and not at the file level, performing file list operations when working with data in a table. Performance problems occur during the execution of multiple directory listings. Iceberg keeps track of a complete list of files within a table using a persistent tree structure. Changes to an Iceberg table use an atomic object/file level commit to update the path to a new snapshot file. The snapshot points to the individual data files through manifest files.

The manifest files track data files across partitions, storing partition information and column metrics for each data file. A manifest list is an additional index for pruning entire manifests. File pruning increases efficiency.

Iceberg relieves Hive metastore (HMS) pressure by storing partition information in metadata files on the file system/object store instead of within the HMS. This architecture supports rapid scaling without performance hits.