Managing Apache Hive

Data compaction

As an administrator, you need to manage compaction of delta files that accumulate during data ingestion. Compaction is a process that performs critical cleanup of files.

Hive creates a set of delta files for each transaction that alters a table or partition. By default, compaction of delta and base files occurs at regular intervals. Compactions occur in the background without affecting concurrent reads and writes.

There are two types of compaction:

Minor
Rewrites a set of delta files to a single delta file for a bucket.
Major
Rewrites one or more delta files and the base file as a new base file for a bucket. A major compaction runs if there are multiple deltas and no base file.
Carefully consider the need for a major compaction, as this process can consume significant system resources and take a long time. A major compaction rewrites both the base and delta files for a table or partition.
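The difference between the two types can be sketched with a toy model. The following Python is illustrative only and is not Hive code: it assumes the delta_<min>_<max> and base_<id> directory naming used by Hive transactional tables, ignores buckets, statement IDs, and visibility suffixes, and uses hypothetical function names.

```python
import re

# Simplified directory-name patterns (assumption: no statement ID or
# visibility suffix, which real directory names may carry).
DELTA = re.compile(r"^delta_(\d+)_(\d+)$")
BASE = re.compile(r"^base_(\d+)$")


def _delta_range(dirs):
    """Return (lowest, highest) write ID covered by the delta directories."""
    ranges = [tuple(map(int, m.groups())) for d in dirs if (m := DELTA.match(d))]
    if not ranges:
        raise ValueError("no delta directories to compact")
    return min(lo for lo, _ in ranges), max(hi for _, hi in ranges)


def minor_compact(dirs):
    """Merge all deltas into one delta; an existing base is left untouched."""
    lo, hi = _delta_range(dirs)
    bases = [d for d in dirs if BASE.match(d)]
    return bases + [f"delta_{lo:07d}_{hi:07d}"]


def major_compact(dirs):
    """Fold the base (if any) and all deltas into a single new base."""
    _, hi = _delta_range(dirs)
    return [f"base_{hi:07d}"]


dirs = ["base_0000002", "delta_0000003_0000003", "delta_0000004_0000004"]
print(minor_compact(dirs))  # ['base_0000002', 'delta_0000003_0000004']
print(major_compact(dirs))  # ['base_0000004']
```

In this model, a minor compaction leaves the base in place and reduces the number of deltas, while a major compaction leaves a single base, which is why it touches more data.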
You can configure automatic compactions or do manual compactions. To start a compaction manually, you use an ALTER TABLE statement. Start a major compaction during periods of low traffic. A manual compaction request returns either the ID of the newly accepted request or, if a request for the same target already exists, the ID and current state of that request. The request is stored in the COMPACTION_QUEUE table.
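As a rough illustration of the manual path, the helper below assembles an ALTER TABLE ... COMPACT statement as a string. The helper name, table name, and partition values are hypothetical, partition values are naively quoted as strings without escaping, and the sketch only builds the text: you would still submit the statement through your usual Hive client, such as Beeline.

```python
def build_compaction_statement(table, compaction_type, partition=None):
    """Build an ALTER TABLE ... COMPACT statement (hypothetical helper).

    partition is an optional dict mapping partition column to value.
    """
    kind = compaction_type.lower()
    if kind not in ("minor", "major"):
        raise ValueError("compaction_type must be 'minor' or 'major'")
    clause = ""
    if partition:
        # Assumption: every partition value is rendered as a quoted string.
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        clause = f" PARTITION ({spec})"
    return f"ALTER TABLE {table}{clause} COMPACT '{kind}'"


# Hypothetical table and partition, for illustration only.
print(build_compaction_statement("web_logs", "major", {"ds": "2024-01-01"}))
# ALTER TABLE web_logs PARTITION (ds='2024-01-01') COMPACT 'major'
```

After submitting such a statement, running SHOW COMPACTIONS lists compaction requests and their current states.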

