Compactor properties

You can check and change several Apache Hive properties to configure the compaction of delta files that accumulate during data ingestion. You need to know the defaults, the valid values, and where to set each property: Cloudera Manager, TBLPROPERTIES, hive-site.xml, or core-site.xml. If a property does not appear when you search the configuration properties of a runtime service in Cloudera Manager, add the property to hive-site.xml or core-site.xml using the Cloudera Manager Safety Valve.

Basic compactor properties

hive.compactor.initiator.on
Default=false
Whether to run the initiator and cleaner threads on this metastore instance or not.
hive.compactor.worker.threads
Default=0
Set this to a positive number to enable Hive transactions, which are required to trigger compactions. Worker threads spawn jobs to perform compactions, but do not perform the compactions themselves. Increasing the number of worker threads decreases the time that it takes to compact tables or partitions. However, it also increases the background load on the CDP cluster because more jobs run in the background.
hive.metastore.runworker.in
Default=HS2
Specifies where to run the worker threads that spawn jobs to perform compactions. Valid values are HS2 (HiveServer) or HMS (Hive metastore).
hive.compactor.abortedtxn.threshold
Default=1000
The number of aborted transactions involving a given table or partition that will trigger a major compaction.
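
As an illustration of the Safety Valve route described at the top of this page, the following Python sketch renders name/value pairs as hive-site.xml property elements. The helper name render_hive_site_properties and the values shown (true and 4) are assumptions chosen for the example, not recommended settings.

```python
from xml.sax.saxutils import escape


def render_hive_site_properties(props):
    """Render name/value pairs as hive-site.xml <property> elements."""
    blocks = []
    for name, value in props.items():
        blocks.append(
            "<property>\n"
            f"  <name>{escape(name)}</name>\n"
            f"  <value>{escape(str(value))}</value>\n"
            "</property>"
        )
    return "\n".join(blocks)


# Illustrative values only (assumptions); the documented defaults
# are false and 0 respectively.
basic_compactor_settings = {
    "hive.compactor.initiator.on": "true",
    "hive.compactor.worker.threads": 4,
}

snippet = render_hive_site_properties(basic_compactor_settings)
print(snippet)
```

The printed text is the kind of fragment you would paste into the hive-site.xml Safety Valve field. Choose the actual values for your own cluster.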

Advanced compactor properties

hive.compactor.worker.timeout
Default=86400s
Expects a time value using d/day, h/hour, m/min, s/sec, ms/msec, us/usec, or ns/nsec (default is sec). Time in seconds after which a compaction job will be declared failed and the compaction re-queued.
hive.compactor.check.interval
Default=300s
A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
Time in seconds between checks to see if any tables or partitions need to be compacted. Keep this value high: each check requires several calls to the NameNode for every table or partition involved in a transaction since the last major compaction. Decreasing this value reduces the time it takes to start compaction for a table or partition that requires it, but increases the load on the NameNode.
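
The time-with-unit format used by several properties on this page can be illustrated with a small Python sketch. It is not Hive's own parser. The function names parse_time_value and to_seconds are hypothetical, and accepting upper-case suffixes and surrounding whitespace is an assumption of the sketch.

```python
import re

# Nanoseconds per unit, for each suffix listed in the property descriptions.
_NS = 10**9
_UNIT_NS = {
    "d": 86_400 * _NS, "day": 86_400 * _NS,
    "h": 3_600 * _NS, "hour": 3_600 * _NS,
    "m": 60 * _NS, "min": 60 * _NS,
    "s": _NS, "sec": _NS,
    "ms": 10**6, "msec": 10**6,
    "us": 10**3, "usec": 10**3,
    "ns": 1, "nsec": 1,
}


def parse_time_value(text, default_unit):
    """Return the value in nanoseconds; default_unit applies if no suffix is given."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", text)
    if match is None:
        raise ValueError(f"not a time value: {text!r}")
    number, unit = match.groups()
    unit = unit.lower() or default_unit  # assumption: suffixes are case-insensitive
    if unit not in _UNIT_NS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return int(number) * _UNIT_NS[unit]


def to_seconds(text, default_unit="sec"):
    """Convert a time value to seconds."""
    return parse_time_value(text, default_unit) / _NS


print(to_seconds("86400s"))                     # worker timeout default
print(to_seconds("300"))                        # no suffix, treated as seconds
print(to_seconds("5000", default_unit="msec"))  # no suffix, treated as milliseconds
```

The default unit matters when the suffix is omitted: the same bare number means seconds for hive.compactor.check.interval but milliseconds for hive.compactor.cleaner.run.interval.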
hive.compactor.delta.num.threshold
Default=10
Number of delta directories in a table or partition that triggers a minor compaction.
hive.compactor.delta.pct.threshold
Default=0.1
Size of the delta files relative to the base, expressed as a fraction, that triggers a major compaction (1.0 = 100%, so the default 0.1 = 10%).
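
The two thresholds above can be read together as a simple decision rule. The Python sketch below is a simplified illustration under stated assumptions, not the initiator's actual logic: the function name choose_compaction is hypothetical, the comparisons are assumed to be strict (greater than), and the major check is assumed to take precedence over the minor check.

```python
def choose_compaction(delta_count, delta_bytes, base_bytes,
                      num_threshold=10, pct_threshold=0.1):
    """Return "major", "minor", or None for one table or partition.

    num_threshold mirrors hive.compactor.delta.num.threshold and
    pct_threshold mirrors hive.compactor.delta.pct.threshold.
    """
    # Major compaction: deltas are large relative to the base.
    # A missing base (base_bytes == 0) is not described on this page,
    # so this sketch skips the ratio check in that case.
    if base_bytes > 0 and delta_bytes / base_bytes > pct_threshold:
        return "major"
    # Minor compaction: too many delta directories have accumulated.
    if delta_count > num_threshold:
        return "minor"
    return None


print(choose_compaction(delta_count=3, delta_bytes=50, base_bytes=1000))
print(choose_compaction(delta_count=11, delta_bytes=50, base_bytes=1000))
print(choose_compaction(delta_count=3, delta_bytes=200, base_bytes=1000))
```

With the defaults, 3 small deltas trigger nothing, 11 deltas at 5% of the base size trigger a minor compaction, and deltas at 20% of the base size trigger a major compaction.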
hive.compactor.max.num.delta
Default=500
Maximum number of delta files that the compactor attempts to handle in a single job.
hive.compactor.wait.timeout
Default=300000
The value must be greater than 2000 milliseconds.
Timeout in milliseconds for blocking compaction.
hive.compactor.initiator.failed.compacts.threshold
Default=2
A valid value is between 1 and 20, and must be less than hive.compactor.history.retention.failed.
The number of consecutive compaction failures (per table/partition) after which automatic compactions are no longer scheduled.
hive.compactor.cleaner.run.interval
Default=5000ms
A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
The time between runs of the cleaner thread.
hive.compactor.job.queue
Specifies the Hadoop queue name to which compaction jobs are submitted. If the value is an empty string, Hadoop chooses the default queue to submit compaction jobs.
Providing an invalid queue name results in compaction job failures.
hive.compactor.compact.insert.only
Default=true
Whether the compactor compacts insert-only tables (true) or not (false). This property acts as a safety switch.
hive.compactor.crud.query.based
Default=false
Performs major compaction on full CRUD tables as a query, and disables minor compaction.
hive.split.grouping.mode
Default=query
A valid value is either query or compactor.
This property is set to compactor from within the query-based compactor. This setting enables the Tez SplitGrouper to group splits based on their bucket number, so that all rows from different bucket files for the same bucket number can end up in the same bucket file after the compaction.
hive.compactor.history.retention.succeeded
Default=3
A valid value is between 0 and 100.
Determines how many successful compaction records are retained in compaction history for a given table/partition.
hive.compactor.history.retention.failed
Default=3
A valid value is between 0 and 100.
Determines how many failed compaction records are retained in compaction history for a given table/partition.
hive.compactor.history.retention.attempted
Default=2
A valid value is between 0 and 100.
Determines how many attempted compaction records are retained in compaction history for a given table/partition.
hive.compactor.history.reaper.interval
Default=2m
A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
Determines how often the compaction history reaper runs.
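
The valid ranges stated for the failure threshold and history retention properties can be checked before you apply a change. The Python sketch below encodes only the constraints listed on this page. The function name validate_history_settings and its parameter names are hypothetical, and the sketch assumes the stated ranges are inclusive.

```python
def validate_history_settings(failed_compacts_threshold=2,
                              retention_succeeded=3,
                              retention_failed=3,
                              retention_attempted=2):
    """Return a list of problems; an empty list means the values are consistent."""
    errors = []

    # hive.compactor.initiator.failed.compacts.threshold: between 1 and 20.
    if not 1 <= failed_compacts_threshold <= 20:
        errors.append("failed.compacts.threshold must be between 1 and 20")

    # ...and it must be less than hive.compactor.history.retention.failed.
    if failed_compacts_threshold >= retention_failed:
        errors.append(
            "failed.compacts.threshold must be less than history.retention.failed"
        )

    # hive.compactor.history.retention.*: each between 0 and 100.
    retention = {
        "succeeded": retention_succeeded,
        "failed": retention_failed,
        "attempted": retention_attempted,
    }
    for name, value in retention.items():
        if not 0 <= value <= 100:
            errors.append(f"history.retention.{name} must be between 0 and 100")

    return errors


print(validate_history_settings())                             # documented defaults
print(validate_history_settings(failed_compacts_threshold=3))  # not less than retention.failed
```

The documented defaults pass because the failure threshold (2) is less than the number of retained failed records (3). Raising the threshold without also raising hive.compactor.history.retention.failed violates the stated constraint.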