Data compaction

As an administrator, you need to manage the compaction of delta files that accumulate during data ingestion. Compaction is a process that performs critical cleanup of these files.

Hive creates a set of delta files for each transaction that alters a table or partition. By default, compaction of delta and base files occurs at regular intervals. Compactions occur in the background without affecting concurrent reads and writes.

There are two types of compaction:

Minor
Rewrites a set of delta files to a single delta file for a bucket.
Major
Rewrites one or more delta files and the base file as a new base file for a bucket.
Carefully consider the need for a major compaction, because this process can consume significant system resources and take a long time.

Compaction operates on the base and delta files of a table or partition. You can configure automatic compactions or run compactions manually. Start a major compaction during periods of low traffic.

To start a compaction manually, use an ALTER TABLE statement. A manual compaction request returns either the ID of the newly accepted compaction request or, if a request for the same target already exists, the ID and current state of that request. The request is stored in the COMPACTION_QUEUE table.
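As a minimal sketch of the manual path described above, the following statements request compactions and then list the queued requests. The table names (sales_acid, customers_acid) and the partition column (sale_date) are hypothetical and used only for illustration; substitute your own transactional tables.

```sql
-- Hypothetical table and partition names, for illustration only.

-- Request a minor compaction of a single partition.
ALTER TABLE sales_acid PARTITION (sale_date = '2024-01-15') COMPACT 'minor';

-- Request a major compaction of an unpartitioned table.
-- Consider running this during a period of low traffic.
ALTER TABLE customers_acid COMPACT 'major';

-- List compaction requests together with their IDs and current states.
SHOW COMPACTIONS;
```

Each ALTER TABLE statement only queues a request. The compaction itself runs in the background, so use SHOW COMPACTIONS to follow its state.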

The compactor initiator must run on only one Hive Metastore (HMS) instance at a time.

Data retention and recovery during compaction

The Hive compaction cleaner deletes obsolete directories, such as delta and base directories, without moving them to the Trash. The cleaner explicitly bypasses the Trash to prevent internal system artifacts from accumulating in storage, and it does not honor table-level retention settings. Because these files are not user-managed data, you cannot recover them from the Trash after a compaction cycle occurs. If you require backup and recovery capabilities for your storage, Cloudera recommends using storage-layer features, for example, AWS S3 versioning or equivalent backup solutions.