What's New in Apache Iceberg
Learn about the new features of Iceberg in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.
Cloudera Runtime 7.3.1.500 SP3
- Cloudera support for Apache Iceberg version 1.5.2
- The Apache Iceberg component has been upgraded from 1.3.1 to 1.5.2.
- Insert into partition and insert overwrite partition support
- From Hive you can insert into, or overwrite data in, Iceberg tables that are statically or dynamically partitioned. For syntax and limitations, see Insert into/overwrite partition support.
- Truncate partition support
- This release introduces the capability to truncate an Iceberg partition from Hive. Truncation removes all rows from the partition and creates a new snapshot.
- Enhancement of expiring Iceberg snapshots
- The Expire Snapshots feature has been enhanced to offer more flexibility. In addition
to expiring snapshots older than a timestamp, you can now expire snapshots based on the
following conditions:
- A snapshot having a given ID
- Snapshots having IDs matching a given list of IDs
- Snapshots within the range of two timestamps
You can keep snapshots you are likely to need, for example recent snapshots, and expire old snapshots. For example, you can keep daily snapshots for the last 30 days, then weekly snapshots for the past year, then monthly snapshots for the last 10 years. You can remove specific snapshots to meet the GDPR right to be forgotten requirements.
For more information, see the Expire snapshots feature.
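The selection conditions listed above can be illustrated with a small Python sketch. It only models which snapshots each condition would pick and does not call Hive or Iceberg. The `Snapshot` class, the function names, and the inclusive bounds of the timestamp range are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical stand-in for an Iceberg snapshot entry.
    snapshot_id: int
    committed_at: datetime


def select_older_than(snapshots, cutoff):
    """Snapshots committed strictly before the cutoff timestamp."""
    return [s for s in snapshots if s.committed_at < cutoff]


def select_by_ids(snapshots, ids):
    """Snapshots whose ID is in the given collection (one ID or several)."""
    wanted = set(ids)
    return [s for s in snapshots if s.snapshot_id in wanted]


def select_between(snapshots, start, end):
    """Snapshots committed between two timestamps (inclusive bounds assumed)."""
    return [s for s in snapshots if start <= s.committed_at <= end]


snapshots = [
    Snapshot(101, datetime(2024, 1, 1)),
    Snapshot(102, datetime(2024, 2, 1)),
    Snapshot(103, datetime(2024, 3, 1)),
    Snapshot(104, datetime(2024, 4, 1)),
]

older = select_older_than(snapshots, datetime(2024, 2, 15))
single = select_by_ids(snapshots, [103])
listed = select_by_ids(snapshots, [101, 104])
ranged = select_between(snapshots, datetime(2024, 1, 15), datetime(2024, 3, 15))
```

A single ID is treated as a list with one element, so the same helper covers both ID-based conditions.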
- Iceberg branching and tagging
- From Hive, you can manage the lifecycle of snapshots using the Iceberg branching and Iceberg tagging features. Branches are references to snapshots that have a lifecycle of their own. Tags identify snapshots you need for auditing and conforming to GDPR.
- Drop partition feature
- You can remove a partition from an Iceberg table using an alter table statement from Impala. Removing a partition does not affect the table schema: the partition column is not removed from the schema.
- Support for copy-on-write (COW)
- Hive supports copy-on-write (COW) as well as merge-on-read (MOR) for handling Iceberg row-level updates and deletes. You configure COW or MOR based on your use case and rate of data change.
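The difference between the two modes can be sketched in plain Python. This is a conceptual model only, not Hive or Iceberg code: a "data file" is a list of rows, and the MOR "delete file" is modeled as a set of row positions. All function names are hypothetical.

```python
def delete_copy_on_write(data_file, predicate):
    """COW: rewrite the data file at write time, leaving out the deleted rows."""
    return [row for row in data_file if not predicate(row)]


def delete_merge_on_read(data_file, predicate):
    """MOR: keep the data file as is and record the positions of deleted rows."""
    return {pos for pos, row in enumerate(data_file) if predicate(row)}


def read_merge_on_read(data_file, deleted_positions):
    """MOR read path: skip rows whose position is listed in the delete file."""
    return [row for pos, row in enumerate(data_file) if pos not in deleted_positions]


def is_even(row):
    return row["id"] % 2 == 0


rows = [{"id": 1}, {"id": 2}, {"id": 3}]

cow_file = delete_copy_on_write(rows, is_even)      # new, smaller data file
mor_deletes = delete_merge_on_read(rows, is_even)   # small delete file, data untouched
mor_view = read_merge_on_read(rows, mor_deletes)    # merge happens when reading
```

Both paths return the same rows to a reader. COW pays the rewrite cost at write time, while MOR defers the cost to each read, which is why the rate of data change matters when choosing between them.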
- Support for Iceberg data compaction and related enhancements
- You can compact Iceberg tables and optimize them for read operations from Hive and Impala. Compaction is an essential table maintenance activity that creates a new snapshot, which contains the table content in a compact form. The OPTIMIZE TABLE statement also includes the following improvements:
- Supports partition evolution
The Hive and Impala OPTIMIZE TABLE statement supports compaction of Iceberg tables with partition evolution.
- Supports data compaction based on file size threshold
The Impala OPTIMIZE TABLE statement includes a FILE_SIZE_THRESHOLD_MB option that enables you to specify the maximum size of files (in MB) that should be considered for compaction.
For more information, see Iceberg data compaction.
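As a rough illustration of the file size threshold described above, the following Python sketch picks the files that would be candidates for compaction. It assumes that files strictly smaller than the threshold qualify. The exact boundary and any further selection rules Impala applies are not stated here, and the function and file names are hypothetical.

```python
MB = 1024 * 1024


def files_to_compact(file_sizes_bytes, threshold_mb):
    """Return the files considered for compaction under a size threshold.

    Assumption: a file qualifies when it is strictly smaller than the threshold.
    """
    limit = threshold_mb * MB
    return sorted(path for path, size in file_sizes_bytes.items() if size < limit)


file_sizes = {
    "part-0.parquet": 4 * MB,
    "part-1.parquet": 90 * MB,
    "part-2.parquet": 300 * MB,
}

candidates = files_to_compact(file_sizes, threshold_mb=100)
```

With a 100 MB threshold, the 300 MB file is left alone and only the two smaller files are rewritten.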
- SQL support for querying Iceberg metadata tables
- Apache Iceberg stores extensive metadata for its tables. From Hive and Impala, you can query the metadata tables as you would query a regular table. For example, you can use projections, joins, filters, and so on. See Query metadata tables feature.
- Directed distribution mode
- This release implements directed distribution mode. The scheduler collects information about which Iceberg data file is scheduled on which host. Because the scan node for a data file is on the same host as the Iceberg join node, delete files are sent directly to that specific host. This mode can improve V2 table read performance.
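The routing idea can be sketched in Python as follows. This is a simplified model, not Impala code: delete records are `(data_file, position)` pairs, and `data_file_hosts` stands in for the information the scheduler collects. The broadcast function is a hypothetical baseline, included only to show how much less data moves when records are routed directly.

```python
from collections import defaultdict


def route_directed(data_file_hosts, delete_records):
    """Send each delete record only to the host that scans its data file."""
    per_host = defaultdict(list)
    for data_file, position in delete_records:
        per_host[data_file_hosts[data_file]].append((data_file, position))
    return dict(per_host)


def route_broadcast(data_file_hosts, delete_records):
    """Hypothetical baseline: every host receives every delete record."""
    hosts = set(data_file_hosts.values())
    return {host: list(delete_records) for host in hosts}


data_file_hosts = {"a.parquet": "host1", "b.parquet": "host2", "c.parquet": "host1"}
delete_records = [("a.parquet", 4), ("b.parquet", 0), ("c.parquet", 7), ("a.parquet", 9)]

directed = route_directed(data_file_hosts, delete_records)
broadcast = route_broadcast(data_file_hosts, delete_records)

directed_total = sum(len(records) for records in directed.values())
broadcast_total = sum(len(records) for records in broadcast.values())
```

In this toy example the directed routing ships four records in total, while the broadcast baseline ships eight, one copy of each record per host.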
- Impala support for reading Iceberg equality deletes for NiFi (Preview)
- Cloudera supports row-level deletes, and starting with this release you can read equality deletes from Impala, with support added for Apache NiFi. See the Delete data feature.
- Reading Iceberg Puffin statistics
- Impala supports reading Puffin statistics from current and older snapshots. When there are Puffin statistics for multiple snapshots, Impala chooses the most recent statistics for each column. This means that statistics for different columns may come from different snapshots. If there are Hive Metastore (HMS) and Puffin statistics for a column, the most recent statistics are considered. For HMS statistics, the impala.lastComputeStatsTime property is used and for Puffin statistics, the snapshot timestamp is used to determine which among the two is the most recent. For more information, see Iceberg Puffin statistics.
- Impala supports the UPDATE statement for Iceberg tables
- This release introduces support for the UPDATE statement in Impala for Iceberg tables. With this enhancement, you can now use Impala to update data in a V2 Iceberg table. The UPDATE operations in Impala are executed as an atomic pair of DELETE and INSERT operations. The Iceberg V2 format supports row-level modifications using delete files, enabling seamless row-level updates. For more information, see the Iceberg Update data feature.
- Impala supports the MERGE INTO statement for Iceberg tables
- You can use Impala to run a MERGE INTO statement on an Iceberg table based on the results of a join between a target and source Iceberg table. For more information, see the Iceberg Merge feature.
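Returning to the Puffin statistics entry in this release, the rule for choosing between HMS and Puffin statistics per column can be sketched in Python. This is a simplified model, not Impala code. Timestamps are plain integers, `hms_last_compute_ts` stands in for the impala.lastComputeStatsTime property, and the tie-break in favor of HMS on equal timestamps is an assumption.

```python
def pick_column_stats(puffin_by_snapshot, hms_stats, hms_last_compute_ts):
    """Choose, per column, the most recent statistics from HMS or Puffin.

    puffin_by_snapshot: list of (snapshot_timestamp, {column: value}) pairs
    hms_stats: {column: value} computed at hms_last_compute_ts
    Returns {column: (source, value)}.
    """
    chosen = {}
    # Start with the HMS statistics, stamped with the last compute time.
    for column, value in hms_stats.items():
        chosen[column] = (hms_last_compute_ts, "hms", value)
    # A Puffin entry replaces the current choice only when it is strictly newer
    # (assumption: HMS wins on equal timestamps).
    for snapshot_ts, columns in puffin_by_snapshot:
        for column, value in columns.items():
            current = chosen.get(column)
            if current is None or snapshot_ts > current[0]:
                chosen[column] = (snapshot_ts, "puffin", value)
    return {column: (source, value) for column, (_, source, value) in chosen.items()}


puffin = [
    (100, {"id": 50, "name": 40}),  # older snapshot
    (300, {"id": 55}),              # newer snapshot, covers only "id"
]
hms = {"name": 42, "city": 7}

stats = pick_column_stats(puffin, hms, hms_last_compute_ts=200)
```

Here "id" comes from the newest Puffin snapshot, "name" comes from HMS because the HMS statistics are newer than the only Puffin snapshot covering that column, and "city" comes from HMS because no Puffin statistics exist for it.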
Cloudera Runtime 7.3.1.400 SP2
There are no new features in this release.
Cloudera Runtime 7.3.1.300 SP1 CHF1
There are no new features in this release.
Cloudera Runtime 7.3.1.200 SP1
There are no new features in this release.
Cloudera Runtime 7.3.1.100 CHF1
There are no new features in this release.
