What's New in Apache Iceberg

Learn about the new features of Iceberg in Cloudera Runtime 7.3.2, its service packs and cumulative hotfixes.

Cloudera Runtime 7.3.2

Cloudera Runtime 7.3.2 introduces new features of Iceberg and includes all service packs and cumulative hotfixes from 7.3.1.100 through 7.3.1.706. For a comprehensive record of all updates in Cloudera Runtime 7.3.1.x, see New Features.

Cloudera Lakehouse Optimizer for Iceberg table optimization

In Cloudera Runtime 7.3.2 and higher versions, you can use the Cloudera Lakehouse Optimizer service in Cloudera Manager to automate Iceberg table maintenance tasks.

Cloudera Lakehouse Optimizer runs Spark jobs to automate maintenance of Iceberg tables in Cloudera Open Data Lakehouse. It simplifies table management, improves query performance, and reduces operational costs.

You can add the service to an existing Cloudera Base on premises cluster at version 7.3.2 or higher that is managed by Cloudera Manager 7.13.2 or higher, or you can create a dedicated cluster and then add the service. In either case, ensure that the cluster contains all the required services. After you finish configuring the service, you can use the Cloudera Lakehouse Optimizer REST APIs to define and manage Cloudera Lakehouse Optimizer policies and to run other Iceberg table optimization operations.

For more information, see Cloudera Lakehouse Optimizer.

Integrate Iceberg scan metrics into Impala query profiles

Iceberg scan metrics are now integrated into the Frontend section of Impala query profiles, providing deeper insight into query planning performance for Iceberg tables.

The query profile now displays scan metrics from Iceberg's planFiles() API, including total planning time, the number of data files, delete files, and manifests, and the number of skipped files.

Metrics are displayed on a per-table basis. If a query scans multiple Iceberg tables, a separate metrics section will appear in the profile for each one.
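
As a rough illustration, one way to look at these metrics is to run a query against an Iceberg table in impala-shell and then print the profile of that query. The table name ice_tbl and the column event_date are placeholders, and the exact metric names that appear in the Frontend section are not reproduced here.

```sql
-- Run a query that scans an Iceberg table.
-- ice_tbl and event_date are placeholder names for this sketch.
SELECT count(*) FROM ice_tbl WHERE event_date >= '2024-01-01';

-- impala-shell command (not a SQL statement): print the profile of the most recent query.
PROFILE;
```

If the query scans more than one Iceberg table, expect one metrics section per table in the printed profile.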

Apache Jira: IMPALA-13628

Delete orphan files for Iceberg tables

You can now use the following syntax to remove orphan files for Iceberg tables:

-- Remove orphan files older than '2022-01-04 10:00:00'.
ALTER TABLE ice_tbl EXECUTE remove_orphan_files('2022-01-04 10:00:00');

-- Remove orphan files that are more than 5 days old.
ALTER TABLE ice_tbl EXECUTE remove_orphan_files(now() - interval 5 days);

This feature removes all files from a table’s data directory that are not referenced by any metadata file and that are older than the value of the older_than parameter. Removing orphan files periodically is recommended to keep the size of a table’s data directory under control.

Apache Jira: IMPALA-14492

Allow forced predicate pushdown to Iceberg

Since IMPALA-11591, Impala has optimized query planning by avoiding predicate pushdown to Iceberg unless it is strictly necessary. While this default behavior makes planning faster, it can miss opportunities to prune files early based on Iceberg's file-level statistics.

A new table property, impala.iceberg.push_down_hint, lets you force predicate pushdown for specific columns. The property accepts a comma-separated list of column names, for example, 'col_a, col_b'.

If a query contains a predicate on any column listed in this property, Impala will push that predicate down to Iceberg for evaluation during the planning phase.
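
The following is a minimal sketch of how the property might be set, assuming it is configured like any other table property through ALTER TABLE ... SET TBLPROPERTIES. The table name ice_tbl and the columns col_a and col_b are placeholders.

```sql
-- Assumed example: ice_tbl, col_a, and col_b are placeholder names.
-- List the columns whose predicates should be pushed down to Iceberg.
ALTER TABLE ice_tbl SET TBLPROPERTIES ('impala.iceberg.push_down_hint'='col_a, col_b');

-- This query has a predicate on col_a, which is listed in the property,
-- so the predicate is pushed down to Iceberg during planning.
SELECT count(*) FROM ice_tbl WHERE col_a = 42;
```

Listing only columns with selective predicates keeps the faster default planning path for all other queries.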

Apache Jira: IMPALA-14123

UPDATE operations now skip rows that already have the desired value

The UPDATE statement for Iceberg and Kudu tables is optimized to reduce unnecessary writes.

Previously, an UPDATE operation would modify all rows matching the WHERE clause, even if those rows already contained the new value. For Iceberg tables, this resulted in writing unnecessary new data and delete records.

With this enhancement, Impala automatically adds an extra predicate to the UPDATE statement to exclude rows that already match the target value.
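
To make the effect concrete, the sketch below shows a user-written UPDATE and a statement that is conceptually similar to what Impala evaluates after the rewrite. The table and column names are placeholders, and the exact form of the predicate that Impala generates internally is an assumption made for illustration.

```sql
-- Statement as written by the user (ice_tbl, status, and id are placeholder names).
UPDATE ice_tbl SET status = 'done' WHERE id < 100;

-- Conceptually similar statement after the optimization (assumed form):
-- rows whose status is already 'done' no longer match and are not rewritten.
UPDATE ice_tbl SET status = 'done'
WHERE id < 100 AND status IS DISTINCT FROM 'done';
```

IS DISTINCT FROM is used in this sketch instead of != so that rows where status is NULL still qualify for the update.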

Apache Jira: IMPALA-12588