Managing Apache Hive

Data compaction

To prevent NameNode capacity problems, you, as administrator, need to manage the compaction of delta files that accumulate during data ingestion.

Hive stores data in base files that cannot be updated in place, because HDFS does not support in-place modification of files. Instead, for each transaction that alters a table or partition, Hive writes a set of delta files to a separate delta directory. Compaction is the consolidation of these files. By default, Hive automatically compacts delta and base files at regular intervals. You can configure automatic compaction, and you can also trigger compaction of base and delta files manually. Hive performs all compactions in the background without affecting concurrent reads and writes.
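As an illustration, the directory of a transactional table might look like the following after a major compaction and two subsequent insert transactions (the path and transaction IDs are hypothetical; exact names depend on your warehouse location and workload):

```
/warehouse/tablespace/managed/hive/t/
    base_0000005/                    <- produced by a major compaction
        bucket_00000
    delta_0000006_0000006_0000/      <- one insert transaction
        bucket_00000
    delta_0000007_0000007_0000/      <- another insert transaction
        bucket_00000
```

A minor compaction would merge the two delta directories into one delta directory; a major compaction would merge the deltas and the base into a new base directory.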

The compactor initiator should run on only one Hive Metastore (HMS) instance at a time.
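A minimal sketch of the relevant hive-site.xml settings, assuming a deployment where one HMS instance runs the initiator (`hive.compactor.initiator.on`) and worker threads are enabled; the thread count shown is only an example value:

```xml
<!-- On exactly one HMS instance: start the compaction initiator -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<!-- Number of worker threads that execute compactions (example value) -->
<property>
  <name>hive.compactor.worker.threads</name>
  <value>4</value>
</property>
```

On all other HMS instances, leave `hive.compactor.initiator.on` set to false so that only one initiator queues compaction requests.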

There are two types of compaction:

  • Minor

    Rewrites a set of delta files to a single delta file for a bucket.

  • Major

    Rewrites one or more delta files and the base file as a new base file for a bucket.
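To trigger a compaction manually, you can use the ALTER TABLE ... COMPACT statement and monitor progress with SHOW COMPACTIONS. The table and partition names below are hypothetical:

```sql
-- Queue a minor compaction for one partition of a transactional table
ALTER TABLE t PARTITION (ds = '2023-01-01') COMPACT 'minor';

-- Queue a major compaction for the whole table
ALTER TABLE t COMPACT 'major';

-- Check the state of queued, running, and completed compactions
SHOW COMPACTIONS;
```

Manual compaction requests are queued and then executed by the compactor worker threads in the background, just like automatic compactions.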

Transactional tables that you created in an earlier version of Hive require a major compaction before you upgrade to Hive 3.