You check and change a number of Apache Hive properties to configure the compaction of
delta files that accumulate during data ingestion. You need to know the defaults and valid
values.
Basic compactor properties
- hive.compactor.initiator.on
- Default=false
- Whether to run the initiator and cleaner threads on this metastore instance or not.
Set in only one instance.
- hive.compactor.worker.threads
- Default=0
- Set this to a positive number to enable Hive transactions, which are required to
trigger transactions. Worker threads spawn jobs to perform compactions, but do not
perform the compactions themselves. Increasing the number of worker threads decreases
the time that it takes tables or partitions to be compacted. However, increasing the
number of worker threads also increases the background load on the CDP cluster because
they cause more jobs to run in the background.
- hive.metastore.runworker.in
- Default=HS2
- Specifies where to run the Worker threads that spawn jobs to perform compactions. Valid
values are HiveServer (HS2) or Hive metastore (HMS).
- hive.compactor.abortedtxn.threshold
- Default=1000 aborts
- The number of aborted transactions that triggers compaction on a table/partition.
- hive.compactor.aborted.txn.time.threshold
- Default=12 hours
- The hours of aborted transactions that trigger compaction on a table/partition.
Advanced compactor properties
- hive.compactor.worker.timeout
- Default=86400s
-
Expects a time value using d/day, h/hour, m/min, s/sec, ms/msec, us/usec, or ns/nsec
(default is sec). Time in seconds after which a compaction job will be declared failed
and the compaction re-queued.
- hive.compactor.check.interval
- Default=300s
- A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec,
ns/nsec), which is sec if not specified.
-
Time in seconds between checks to see if any tables or partitions need to be
compacted. This value should be kept high because each check for compaction requires
many calls against the NameNode. Decreasing this value reduces the time it takes to start compaction
for a table or partition that requires it. However, checking if compaction
is needed requires several calls to the NameNode for each table or partition involved in
a transaction done since the last major compaction. Consequently, decreasing this
value increases the load on the NameNode.
- hive.compactor.delta.num.threshold
- Default=10
-
Number of delta directories in a table or partition that triggers a minor
compaction.
- hive.compactor.delta.pct.threshold
- Default=0.1
-
Percentage (fractional) size of the delta files relative to the base that triggers
a major compaction. (1.0 = 100%, so the default 0.1 = 10%.)
- hive.compactor.max.num.delta
- Default=500
- Maximum number of delta files that the compactor attempts to handle in a single job.
- hive.compactor.wait.timeout
- Default=300000
- The value must be greater than 2000 milliseconds.
- Time out in milliseconds for blocking compaction.
- hive.compactor.initiator.failed.compacts.threshold
- Default=2
- A valid value is between 1 and 20, and
must be less than
hive.compactor.history.retention.failed
.
- The number of consecutive compaction failures (per table/partition) after which
automatic compactions are not scheduled any longer.
- hive.compactor.cleaner.run.interval
- Default=5000ms
- A valid value is a time with unit
(d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not
specified.
- The time between runs of the cleaner thread.
- hive.compactor.job.queue
- Specifies the Hadoop queue name to which compaction jobs are submitted. If the
value is an empty string, Hadoop chooses the queue.
- hive.compactor.compact.insert.only
- Default=true
- The compactor compacts insert-only tables, or not (false). A safety
switch.
- hive.compactor.crud.query.based
- Default=false
- Performs major compaction on full CRUD tables as a query, and disables minor
compaction.
- hive.split.grouping.mode
- Default=query
- A valid value is either
query
or compactor
.
- This property is set to compactor from within
the query-based compactor. This setting enables the Tez SplitGrouper to group splits based on
their bucket number, so that all rows from different bucket files for the same bucket
number can end up in the same bucket file after the compaction.
- hive.compactor.history.retention.succeeded
- Default=3
- A valid value is between 0 and 100.
- Determines how many successful compaction
records are retained in compaction history for a given table/partition.
- hive.compactor.history.retention.failed
- Default=3
- A valid value is between 0 and 100.
- Determines how many failed compaction
records are retained in compaction history for a given table/partition.
- hive.compactor.history.retention.attempted
- Default=2
- A valid value is between 0 and 100.
- Determines how many attempted compaction
records are retained in compaction history for a given table/partition.
- hive.compactor.history.reaper.interval
- Default=2m
- A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec,
us/usec, ns/nsec), which is msec if not specified.
- Determines how often the compaction history reaper runs.