Row-level operations

Hive supports the copy-on-write (COW) as well as merge-on-read (MOR) modes for handling Iceberg row-level updates and deletes. Impala supports only the MOR mode and will fail if configured for copy-on-write. Impala does support reading copy-on-write tables.

Depending on the COW or MOR setting, Hive performs updates and deletes to Iceberg tables as follows:
  • COW: Hive creates a new version of files for each update/delete. Use COW when updating/deleting a large number of rows, or when reading data frequently.
  • MOR (default): Updates/deletes are logged to delta files, which tends to be faster than creating new versions of the files. Later a compaction can eliminate the delete files and rewrite the affected data files.

Impala uses only the MOR method. Impala does not support copy-on-write and will fail if configured for copy-on-write. Impala does support reading copy on write tables.

Configuring Hive tables to support COW or MOR

You use the CREATE TABLE or ALTER commands to set either COW or MOR mode in table properties as shown in the following examples:

ALTER TABLE ice01 SET TBLPROPERTIES ('write.delete.mode'='copy-on-write');
ALTER TABLE ice01 SET TBLPROPERTIES ('write.update.mode'='copy-on-write');

When to use COW or MOR

Set either COW or MOR based on your use case and rate of data change. Consider the following advantages and disadvantages of the modes:
  • MOR

    • Writes are efficient.

    • Reads are inefficient due to read amplification, but regularly scheduled compaction can reduce inefficiency.

    • A good choice when streaming.

    • A good choice when frequently writing or updating, such as running hourly batch jobs.

    • A good choice when the percentage of data change is low.

    COW

    • Reads are efficient.

    • A good choice for bulk updates and deletes, such as running a daily batch job.

    • Writes are inefficient due to write amplification, but the need for compaction is reduced.

    • A good choice when the percentage of data change is high.

Position or equality delete files

By default, Hive and Impala delete Iceberg V2 table data using position delete files. Hive and Impala can read equality deletes, which you might encounter in a table created and written to by Flink or another engine that supports writing equality deletes. Hive and Impala cannot write equality deletes.

Choose the MOR or COW mode based on your use case. The MOR mode keeps track of deleted rows, so it's expensive when you're making many deletions.

If you use an Iceberg table in the following ways, consider using the COW mode:
  • As a sink for streaming data from Flink to Cloudera Data Warehouse (CDW) Public Cloud
  • As an Apache NiFi data flow to ingest data in Iceberg table format into CDW

Equality deletes are streamed or ingested, respectively, in these cases. If you cannot read equality deletes, the output of the read is incorrect with regard to any deletions.

To create an Iceberg table for these use cases, see the following documentation:

The position and equality deletes are summarized below:

Position delete: DELETE operations find the data records that need to be deleted, then write out their file path and position into delete files. Postponing the rewrite of the files speeds writes of updates/deletes. Read operations need to merge the data and delete files to retrieve the active rows.

Equality delete: DELETE operations write out column values (for example, primary keys) that identify the records to be deleted. If you already know the column values of the records you want to delete, then there is no need to read the whole table, which means equality deletes can be written faster than position deletes. Read operations need to merge the data and delete files to retrieve the active rows. Processing equality deletes is typically slower than processing position deletes.