Delete data feature

Cloudera supports row-level deletes in Iceberg V2 tables.

From Hive and Impala, you can use the following formatting types defined by the Iceberg Spec:

position deletes
equality deletes
note
The equality deletes feature is in technical preview and not recommended for use in production deployments. Cloudera recommends that you try this feature in test and development environments.

From Hive and Impala, you can read/write position deletes, but you can only read equality deletes. You cannot write equality deletes.

If you use an Iceberg table in the following ways, equality deletes are streamed or ingested, respectively:

As a sink for streaming data from Flink to Cloudera Data Warehouse on cloud
As an Apache NiFi data flow to ingest data in Iceberg table format into Cloudera Data Warehouse

If you cannot read equality deletes, the output of the read is incorrect with regard to any deletions.

For equality deletes, you must be aware of the following considerations:

If you are using Apache Flink or Apache NiFi to write equality deletes, then ensure that you provide a PRIMARY KEY for the table. This is required for engines to know which columns to write into the equality delete files.
If the table is partitioned then the partition columns have to be part of the PRIMARY KEY
For Apache Flink, the table should be in 'upsert-mode' to write equality deletes
Partition evolution is not allowed for Iceberg tables that have PRIMARY KEYs

To create an Iceberg table for these use cases, see the following documentation:

Creating Iceberg tables in SQL Stream Builder
Ingesting data into Cloudera Data Warehouse using Iceberg

Position delete files contain the following information:

file_path, which is a full URI
pos, the file position of the row

Delete files are sorted by file_path and pos. The following table shows an example of delete files in a partitioned table:

Table 1.
Partition	Partition information	Delete file
YEAR(ts)=2022	ts_year=2022/data-abcd.parquet ts_year=2022/data-bcde.parquet	ts_year=2022/delete-wxyz.parquet
YEAR(ts)=2023	ts_year=2023/data-efgh.parquet	ts_year=2023/delete-hxkw.parquet
MONTH(ts)=2023-06	ts_month=2023-06/data-ijkl.parquet	ts_month=2023-06/delete-uzwd.parquet
MONTH(ts)=2023-07	ts_month=2023-07/data-mnop.parquet	ts_month=2023-07/delete-udqx.parquet

Inserting, deleting, or updating table data generates a snapshot.

You use a WHERE clause in your DELETE statement. For example:

delete from tbl_ice where a <= 2,1;

Hive and Impala evaluate rows from one table against a WHERE clause, and delete all the rows that match WHERE conditions. If you want delete all rows, use the Truncate feature. The WHERE expression is similar to the WHERE expression used in SELECT. The conditions in the WHERE clause can refer to any columns.

Concurrent operations that include DELETE do not introduce inconsistent table states. Iceberg runs validation checks to check for concurrent modifications, such as DELETE+INSERT. Only one will succeed. On the other hand, DELETE+DELETE, and INSERT+INSERT can both succeed, but in the case of a concurrent DELETE+UPDATE, UPDATE+UPDATE, DELETE+INSERT, UPDATE+INSERT from Hive, only the first operation will succeed.

From joined tables, you can delete all matching rows from one of the tables. You can join tables of any kind, but the table from which the rows are deleted must be an Iceberg table. The FROM keyword is required in this case, to separate the name of the table whose rows are being deleted from the table names of the join clauses.

Hive or Impala syntax🔗

delete from tablename [where expression]

delete joined_tablename from [joined_tablename, joined_tablename2, ...] [ where expression ]

Hive or Impala examples🔗

create external table tbl_ice(a int, b string, c int) stored by iceberg tblproperties ('format-version'='2');

insert into tbl_ice values (1, 'one', 50), (2, 'two', 51), (3, 'three', 52), (4, 'four', 53), (5, 'five', 54), (111, 'one', 55), (333, 'two', 56);

delete from tbl_ice where a <= 2,1;

The following example deletes 0, 1, or more rows of the table. If col1 is a primary key, 0 or 1 rows are deleted:

delete from ice_table where col1 = 100;

Delete data feature

Hive or Impala syntax🔗

Hive or Impala examples🔗

We want your opinion

How can we improve this page?