Data compaction

From Hive and Impala, you can compact Iceberg tables and optimize them for read operations. Compaction is an essential table maintenance activity that creates a new snapshot, which contains the table content in a compact form.

Frequent updates and row-level modifications on Iceberg tables can produce many small data files and delete files, which must be merged on read. This degrades query performance over time. You can use the following Hive and Impala SQL statements to compact Iceberg tables and optimize them for reading.

Impala Syntax:
OPTIMIZE TABLE [db_name.]table_name;
Impala Example:
OPTIMIZE TABLE ice_table;
Hive Syntax:
ALTER TABLE [database_name.]table_name COMPACT 'compaction_type' [AND WAIT];
OPTIMIZE TABLE [database_name.]table_name REWRITE DATA;
Hive Example:
ALTER TABLE ice_table COMPACT 'MAJOR';
OPTIMIZE TABLE ice_table REWRITE DATA;
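
The optional parts of the syntax above can be combined as in the following sketch. The database name sales_db is hypothetical and used only for illustration. In Hive, the AND WAIT clause makes the statement block until the compaction finishes instead of returning as soon as the request is queued.

```sql
-- Hive: major compaction of a database-qualified table, blocking until done.
-- sales_db is a hypothetical database name.
ALTER TABLE sales_db.ice_table COMPACT 'MAJOR' AND WAIT;

-- Hive: list compaction requests and their current state.
SHOW COMPACTIONS;

-- Impala: the same table, qualified with the database name.
OPTIMIZE TABLE sales_db.ice_table;
```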

To perform table optimization, ensure that the following prerequisites are met:

  • The user performing compaction must have the 'ALL' permission on the table, which can be set through Ranger.
  • Impala can write only Parquet files, so the write.format.default table property must be set to parquet before Impala compacts the table. Hive can write both Parquet and ORC file formats.
  • Impala cannot compact tables with complex data types.
  • Impala cannot compact views or empty tables.
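
The file format prerequisite can be checked and set with statements like the following. This is a minimal sketch that reuses the ice_table name from the examples. Changing the property affects only files written afterwards; existing files keep their format until they are rewritten.

```sql
-- Inspect the table properties, including write.format.default if it is set.
DESCRIBE FORMATTED ice_table;

-- Set the default write format to Parquet so that Impala can compact the table.
ALTER TABLE ice_table SET TBLPROPERTIES ('write.format.default'='parquet');
```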

The OPTIMIZE TABLE statement rewrites the entire table, performing the following tasks:

  • Compact small files into larger files
  • Merge delete files created by previously run DELETE and UPDATE operations
  • Rewrite all files, converting them to the latest table schema
  • Rewrite all partitions according to the latest partition specification

When an Iceberg table is optimized, a new snapshot is created where all the old files of the table are replaced with newly written files. The old table state and old files can still be queried using time travel, because the rewritten data and delete files are not removed physically. This can lead to the accumulation of unused files that belong to old snapshots. Use the Expire Snapshots feature to permanently remove the old files from the file system.
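
The following sketch shows how the old table state might be queried after compaction and how old snapshots might then be expired. The timestamps are placeholders, and the exact syntax supported can differ between Hive and Impala versions, so treat these statements as an illustration rather than a definitive reference.

```sql
-- Impala: list the snapshots of the table to find a point in time of interest.
DESCRIBE HISTORY ice_table;

-- Time travel: read the table as it was at a placeholder timestamp
-- that precedes the compaction.
SELECT * FROM ice_table FOR SYSTEM_TIME AS OF '2024-01-01 10:00:00';

-- Expire snapshots older than the placeholder timestamp. Files that are
-- referenced only by the expired snapshots are removed from the file system.
ALTER TABLE ice_table EXECUTE expire_snapshots('2024-01-01 10:00:00');
```

After the snapshots are expired, time travel queries that target a point before the placeholder timestamp can no longer be served.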