What's New in Apache Iceberg

Learn about the new features of Iceberg in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.

Cloudera Runtime 7.3.1.500 SP3

Cloudera support for Apache Iceberg version 1.5.2
The Apache Iceberg component has been upgraded from 1.3.1 to 1.5.2.
Insert into partition and insert overwrite partition support
From Hive you can insert into, or overwrite data in, Iceberg tables that are statically or dynamically partitioned. For syntax and limitations, see Insert into/overwrite partition support.
Truncate partition support
This release introduces the capability to truncate an Iceberg partition from Hive. Truncation removes all rows from the partition and creates a new snapshot.
Enhancement of expiring Iceberg snapshots
The Expire Snapshots feature has been enhanced to offer more flexibility. In addition to expiring snapshots older than a timestamp, you can now expire snapshots based on the following conditions:
  • A snapshot having a given ID
  • Snapshots having IDs matching a given list of IDs
  • Snapshots within the range of two timestamps

You can keep snapshots you are likely to need, for example recent snapshots, and expire old snapshots. For example, you can keep daily snapshots for the last 30 days, then weekly snapshots for the past year, then monthly snapshots for the last 10 years. You can remove specific snapshots to meet the GDPR right to be forgotten requirements.

For more information, see the Expire snapshots feature.
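
The tiered retention example above (daily for 30 days, weekly for a year, monthly for 10 years) can be sketched in Python as follows. This is only an illustration of the selection logic, not how Hive implements snapshot expiry: the function name `snapshots_to_expire`, the `(snapshot_id, timestamp)` input shape, and the choice to keep the newest snapshot in each day, week, or month bucket are all assumptions. The resulting list of IDs is the kind of input you would pass to an expire-by-ID-list operation.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshots, now):
    """Return the sorted IDs of snapshots that a tiered policy would expire.

    snapshots: list of (snapshot_id, datetime) pairs (hypothetical shape).
    Assumed policy: keep the newest snapshot per day for 30 days, per ISO week
    for 365 days, per calendar month for 3650 days, and nothing older.
    """
    keep = {}  # bucket key -> (snapshot_id, timestamp) of the newest snapshot
    for snap_id, ts in snapshots:
        age = now - ts
        if age <= timedelta(days=30):
            bucket = ("day", ts.date())
        elif age <= timedelta(days=365):
            iso = ts.isocalendar()
            bucket = ("week", iso[0], iso[1])
        elif age <= timedelta(days=3650):
            bucket = ("month", ts.year, ts.month)
        else:
            continue  # older than every tier, so it is never kept
        if bucket not in keep or ts > keep[bucket][1]:
            keep[bucket] = (snap_id, ts)
    kept_ids = {snap_id for snap_id, _ in keep.values()}
    return sorted(s for s, _ in snapshots if s not in kept_ids)


now = datetime(2025, 1, 31, 12, 0)
example = [
    (1, datetime(2025, 1, 30, 8, 0)),   # same day as snapshot 2, older
    (2, datetime(2025, 1, 30, 20, 0)),  # newest snapshot of that day
    (3, datetime(2024, 10, 7, 10, 0)),  # same ISO week as snapshot 4, older
    (4, datetime(2024, 10, 9, 10, 0)),  # newest snapshot of that week
    (5, datetime(2010, 1, 1, 0, 0)),    # older than 10 years
]
expired = snapshots_to_expire(example, now)
```

In this example, `expired` contains the IDs 1, 3, and 5.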

Iceberg branching and tagging
From Hive, you can manage the lifecycle of snapshots using the Iceberg branching and Iceberg tagging features. Branches are references to snapshots that have a lifecycle of their own. Tags identify snapshots you need for auditing and conforming to GDPR.
Drop partition feature
You can remove a partition from an Iceberg table using an ALTER TABLE statement from Impala. Removing a partition does not affect the table schema; the partition column remains in the schema.
Support for copy-on-write (COW)
Hive supports copy-on-write (COW) as well as merge-on-read (MOR) for handling Iceberg row-level updates and deletes. You configure COW or MOR based on your use case and rate of data change.
Support for Iceberg data compaction and related enhancements
You can compact Iceberg tables and optimize them for read operations from Hive and Impala. Compaction is an essential table maintenance activity that creates a new snapshot, which contains the table content in a compact form.
The OPTIMIZE TABLE statement also includes the following improvements:
  • Supports partition evolution

    The Hive and Impala OPTIMIZE TABLE statements support compaction of Iceberg tables with partition evolution.

  • Supports data compaction based on file size threshold

    The Impala OPTIMIZE TABLE statement includes a FILE_SIZE_THRESHOLD_MB option that enables you to specify the maximum size of files (in MB) that should be considered for compaction.

For more information, see Iceberg data compaction.
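
The effect of a file size threshold can be illustrated with a short Python sketch. It only models the selection idea described for FILE_SIZE_THRESHOLD_MB and is not Impala code: the function name `files_to_compact`, the dictionary input, and the strict "smaller than" comparison at the boundary are assumptions.

```python
MB = 1024 * 1024


def files_to_compact(file_sizes_bytes, threshold_mb):
    """Return the names of files that a size threshold would select.

    file_sizes_bytes: dict mapping a file name to its size in bytes
    (hypothetical shape). Files at or above the threshold are left alone;
    the exact boundary behavior is an assumption of this sketch.
    """
    limit = threshold_mb * MB
    return sorted(name for name, size in file_sizes_bytes.items() if size < limit)


files = {"a.parquet": 5 * MB, "b.parquet": 200 * MB, "c.parquet": 99 * MB}
selected = files_to_compact(files, 100)
```

With a threshold of 100 MB, only the two small files are selected and the 200 MB file is not rewritten.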

SQL support for querying Iceberg metadata tables
Apache Iceberg stores extensive metadata for its tables. From Hive and Impala, you can query the metadata tables as you would query a regular table. For example, you can use projections, joins, filters, and so on. See Query metadata tables feature.
Directed distribution mode
This release implements directed distribution mode. The scheduler collects information about which Iceberg data file is scheduled on which host. Because the scan node for a data file is on the same host as the Iceberg join node, delete files are sent directly to that specific host. This mode can improve V2 table read performance.
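
The routing idea can be sketched in Python. This is a conceptual model only, not the Impala scheduler: the function name `route_delete_records`, the host names, and the representation of a delete record as a `(data_file, position)` pair are assumptions made for illustration.

```python
def route_delete_records(file_to_host, delete_records):
    """Group delete records by the host that scans the referenced data file.

    file_to_host: dict mapping a data file to its scheduled host.
    delete_records: list of (data_file, row_position) pairs (hypothetical shape).
    """
    per_host = {}
    for data_file, position in delete_records:
        host = file_to_host[data_file]  # host chosen by the scheduler
        per_host.setdefault(host, []).append((data_file, position))
    return per_host


file_to_host = {"f1": "hostA", "f2": "hostB"}
deletes = [("f1", 3), ("f2", 0), ("f1", 7)]
routed = route_delete_records(file_to_host, deletes)
```

Each host receives only the delete records for the data files it scans, instead of every host receiving every delete record.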
Impala support for reading Iceberg equality deletes for NiFi (Preview)
Cloudera supports row-level deletes, and starting with this release, you can read equality deletes from Impala, with support added for Apache NiFi. See the Delete data feature.
Reading Iceberg Puffin statistics
Impala supports reading Puffin statistics from current and older snapshots. When there are Puffin statistics for multiple snapshots, Impala chooses the most recent statistics for each column. This means that statistics for different columns may come from different snapshots. If there are both Hive Metastore (HMS) and Puffin statistics for a column, the more recent statistics are used. To determine which of the two is more recent, the impala.lastComputeStatsTime property is used for HMS statistics and the snapshot timestamp is used for Puffin statistics. For more information, see Iceberg Puffin statistics.
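
The per-column selection rule can be sketched in Python. This is a simplified model of the rule as described, not Impala's implementation: the function name `choose_column_stats`, the input shapes, the use of plain integers as timestamps, and the tie-breaking (an existing choice wins when timestamps are equal) are assumptions.

```python
def choose_column_stats(puffin_stats, hms_stats, hms_last_compute_time):
    """Pick, for each column, the most recent statistic and its source.

    puffin_stats: list of (snapshot_timestamp, {column: value}) pairs.
    hms_stats: {column: value}, all dated by hms_last_compute_time, which
    stands in for the impala.lastComputeStatsTime property.
    """
    chosen = {}  # column -> (timestamp, source, value)
    for snapshot_ts, columns in puffin_stats:
        for column, value in columns.items():
            if column not in chosen or snapshot_ts > chosen[column][0]:
                chosen[column] = (snapshot_ts, "puffin", value)
    for column, value in hms_stats.items():
        if column not in chosen or hms_last_compute_time > chosen[column][0]:
            chosen[column] = (hms_last_compute_time, "hms", value)
    return {col: (src, val) for col, (_, src, val) in chosen.items()}


puffin = [(100, {"a": 10, "b": 20}), (200, {"a": 11})]
hms = {"b": 25, "c": 5}
stats = choose_column_stats(puffin, hms, hms_last_compute_time=150)
```

Here column `a` takes its value from the Puffin statistics of the snapshot at time 200, column `b` takes the HMS value because 150 is later than 100, and column `c` has only HMS statistics.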
Impala supports the UPDATE statement for Iceberg tables
This release introduces support for the UPDATE statement in Impala for Iceberg tables. With this enhancement, you can use Impala to update data in a V2 Iceberg table. UPDATE operations in Impala are executed as an atomic pair of DELETE and INSERT operations. The Iceberg V2 format supports row-level modifications using delete files, which makes row-level updates possible. For more information, see the Iceberg Update data feature.
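
The idea of an update as a paired delete and insert can be modeled with a toy Python class. This is a conceptual sketch only and does not reflect Impala or Iceberg internals: the class `ToyV2Table`, its in-memory data files, and the position-delete set are hypothetical constructs for illustration.

```python
class ToyV2Table:
    """In-memory stand-in for a table with data files and position deletes."""

    def __init__(self):
        self.data_files = []           # each data file is a list of row dicts
        self.position_deletes = set()  # (file_index, row_position) pairs

    def insert(self, rows):
        self.data_files.append(list(rows))

    def scan(self):
        # Readers skip every row that a position delete marks as removed.
        for file_index, rows in enumerate(self.data_files):
            for pos, row in enumerate(rows):
                if (file_index, pos) not in self.position_deletes:
                    yield file_index, pos, row

    def update(self, predicate, assign):
        matches = [m for m in self.scan() if predicate(m[2])]
        new_deletes = {(f, pos) for f, pos, _ in matches}
        new_rows = [{**row, **assign(row)} for _, _, row in matches]
        # Apply the delete and the insert together, as one logical change.
        self.position_deletes |= new_deletes
        if new_rows:
            self.data_files.append(new_rows)
        return len(matches)

    def rows(self):
        return [row for _, _, row in self.scan()]


table = ToyV2Table()
table.insert([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
updated = table.update(lambda r: r["id"] == 2, lambda r: {"v": "z"})
```

The original data file is never rewritten: the old version of row 2 is hidden by a position delete, and the new version lives in a second data file.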
Impala supports the MERGE INTO statement for Iceberg tables
You can use Impala to run a MERGE INTO statement on an Iceberg table based on the results of a join between a target and source Iceberg table. For more information, see the Iceberg Merge feature.

Cloudera Runtime 7.3.1.400 SP2

There are no new features in this release.

Cloudera Runtime 7.3.1.300 SP1 CHF1

There are no new features in this release.

Cloudera Runtime 7.3.1.200 SP1

There are no new features in this release.

Cloudera Runtime 7.3.1.100 CHF1

There are no new features in this release.

Cloudera Runtime 7.3.1

Apache Iceberg support for Hive
Cloudera supports a Data Lakehouse architecture by pre-integrating and unifying the capabilities of Data Warehouses and Data Lakes, to support data engineering, business intelligence, and machine learning – all on a single platform.

Starting from this release, Cloudera Base on premises supports queries of Iceberg tables from the Apache Hive compute engine. You can run SQL queries to create and query Iceberg tables. Hive queries are table-format agnostic. You can run nested, correlated, or analytic queries on all supported table types. Hive on Iceberg supports and enables you to use the following Apache Iceberg features:

  • ACID transactions with Iceberg V2 tables
  • Point in time queries using Iceberg Time travel
  • Rollback table
  • Position deletes
  • Schema evolution
  • Flexible partitioning using partition evolution and partition transform
  • Support for materialized views
  • Snapshot expiry
  • Merge table
  • Multi-engine concurrent read and write

For more information about the Apache Iceberg features supported in Cloudera, see Using Apache Iceberg.

If you want to migrate your existing Hive tables to Iceberg tables, you can use the ALTER TABLE statement. For more information, see Migrate Hive table to Iceberg.

Cloudera supports the integration of Iceberg and Atlas that helps you identify the Iceberg tables to scan data and provide lineage support. Learn how Atlas works with Iceberg, and what schema evolution, partition specification, and partition evolution are, with examples.