Delete data feature

Cloudera supports row-level deletes in Iceberg V2 tables.

From Hive and Impala, you can use the following delete file types defined by the Iceberg Spec:
  • position deletes
  • equality deletes

From Hive and Impala, you can read and write position deletes. You can read equality deletes, but you cannot write them.

Equality deletes are written when you use an Iceberg table in the following ways:
  • As a sink for streaming data from Flink to Cloudera Data Warehouse (CDW) Public Cloud
  • As the target of an Apache NiFi data flow that ingests data in Iceberg table format into CDW

If the engine reading the table cannot process equality deletes, the output of the read is incorrect with regard to any deletions.

For equality deletes, you must be aware of the following considerations:
  • If you are using Apache Flink or Apache NiFi to write equality deletes, ensure that you define a PRIMARY KEY for the table. Engines need the key to know which columns to write into the equality delete files.
  • If the table is partitioned, the partition columns must be part of the PRIMARY KEY.
  • For Apache Flink, the table must be in 'upsert-mode' to write equality deletes (see the sketch after this list).
  • Partition evolution is not allowed for Iceberg tables that have a PRIMARY KEY.
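
The following is a minimal sketch, not a definitive procedure, of a Flink SQL table definition that satisfies these requirements. It assumes an Iceberg catalog named ice_catalog is already configured in Flink; the database, table, and column names are hypothetical:

-- Hypothetical Iceberg v2 table that Flink can write equality deletes to
CREATE TABLE ice_catalog.db1.events_ice (
  id        INT,
  category  STRING,
  msg       STRING,
  -- The partition column must be part of the PRIMARY KEY
  PRIMARY KEY (id, category) NOT ENFORCED
) PARTITIONED BY (category)
WITH (
  'format-version' = '2',
  -- Upsert mode makes Flink write equality delete files for changed keys
  'write.upsert.enabled' = 'true'
);
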
To create an Iceberg table for these use cases, see the related documentation.
Position delete files contain the following information:
  • file_path, the full URI of the data file
  • pos, the ordinal position of the deleted row in the data file
Rows in a position delete file are sorted by file_path and pos. The following table shows an example of delete files in a partitioned table:

Table 1. Delete files in a partitioned table

Partition            Data file                              Delete file
YEAR(ts)=2022        ts_year=2022/data-abcd.parquet         ts_year=2022/delete-wxyz.parquet
                     ts_year=2022/data-bcde.parquet
YEAR(ts)=2023        ts_year=2023/data-efgh.parquet         ts_year=2023/delete-hxkw.parquet
MONTH(ts)=2023-06    ts_month=2023-06/data-ijkl.parquet     ts_month=2023-06/delete-uzwd.parquet
MONTH(ts)=2023-07    ts_month=2023-07/data-mnop.parquet     ts_month=2023-07/delete-udqx.parquet

Inserting, deleting, or updating table data generates a snapshot.
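
As a quick, hedged illustration of snapshots, you can list a table's snapshots from Impala and read an older snapshot with time travel from Hive or Impala. The snapshot ID below is a placeholder that you would copy from the history output, and tbl_ice refers to the example table created later in this topic:

-- List the snapshots created by INSERT, DELETE, and UPDATE statements (Impala)
describe history tbl_ice;

-- Time travel: read the table as of an earlier snapshot ID (placeholder value)
select * from tbl_ice for system_version as of 1234567890123456789;
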

You use a WHERE clause in your DELETE statement. For example:

delete from tbl_ice where a <= 2;

Hive and Impala evaluate the rows of the table against the WHERE clause and delete all rows that match its conditions. If you want to delete all rows, use the Truncate feature instead. The WHERE expression is similar to the WHERE expression used in SELECT, and its conditions can refer to any columns of the table.
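
For example, assuming the tbl_ice table created in the examples below, the following statements sketch a delete whose conditions refer to several columns, and a truncate that removes all remaining rows:

-- Conditions in the WHERE clause can refer to any columns
delete from tbl_ice where b = 'two' and c > 50;

-- Remove every remaining row instead of running DELETE without a WHERE clause
truncate table tbl_ice;
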

Concurrent operations that include DELETE do not introduce inconsistent table states. Iceberg runs validation checks to detect concurrent modifications. Concurrent DELETE+DELETE and INSERT+INSERT operations can both succeed. However, for a concurrent DELETE+UPDATE, UPDATE+UPDATE, DELETE+INSERT, or UPDATE+INSERT from Hive, only the first operation succeeds.

From joined tables, you can delete all matching rows from one of the tables. You can join tables of any kind, but the table from which the rows are deleted must be an Iceberg table. The FROM keyword is required in this case, to separate the name of the table whose rows are being deleted from the table names of the join clauses.

Hive or Impala syntax

delete from tablename [where expression]
delete joined_tablename from [joined_tablename, joined_tablename2, ...] [ where expression ] 

Hive or Impala examples

create external table tbl_ice(a int, b string, c int) stored by iceberg tblproperties ('format-version'='2');

insert into tbl_ice values (1, 'one', 50), (2, 'two', 51), (3, 'three', 52), (4, 'four', 53), (5, 'five', 54), (111, 'one', 55), (333, 'two', 56);

delete from tbl_ice where a <= 2;

The following example deletes 0, 1, or more rows of the table. If col1 is a primary key, 0 or 1 rows are deleted:

delete from ice_table where col1 = 100;
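
The following sketch shows a delete over joined tables, following the syntax above. The second table, tbl_ids, and its column a are hypothetical names used only for illustration; tbl_ice must be an Iceberg table, while tbl_ids can be a table of any kind:

-- Delete the rows of tbl_ice that have a matching row in tbl_ids
delete tbl_ice from tbl_ice, tbl_ids where tbl_ice.a = tbl_ids.a;
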