You check and change a number of Apache Hive properties to configure the compaction of
delta files that accumulate during data ingestion. You need to know the defaults, valid values,
and where to set these properties: Cloudera Manager, TBLPROPERTIES, hive-site.xml, or
core-site.xml. When a property does not appear in the Cloudera Manager configuration property
search for a runtime service, add the property to hive-site.xml or core-site.xml using the
Cloudera Manager Safety Valve.
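For example, before changing a compactor property in Cloudera Manager, you can confirm its
effective value from a Beeline session. The following is a minimal sketch, assuming a working
Beeline connection to HiveServer; SET with no value prints the current setting.

  -- Print the effective values of a few compactor properties.
  -- Sketch only: assumes a Beeline connection to HiveServer. Changes to
  -- service-level properties are still made in Cloudera Manager or through
  -- the Safety Valve, as described above.
  SET hive.compactor.initiator.on;
  SET hive.compactor.worker.threads;
  SET hive.compactor.check.interval;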
Basic compactor properties
- hive.compactor.initiator.on
- Default=false
- Whether to run the initiator and cleaner threads on this metastore instance or not.
- hive.compactor.worker.threads
- Default=0
- Set this to a positive number to enable Hive transactions, which are required to
trigger compactions. Worker threads spawn jobs to perform compactions, but do not
perform the compactions themselves. Increasing the number of worker threads decreases
the time it takes to compact tables or partitions, but also increases the background
load on the CDP cluster because more jobs run in the background.
- hive.metastore.runworker.in
- Default=HS2
- Specifies where to run the Worker threads that spawn jobs to perform compactions. Valid
values are HiveServer (HS2) or Hive metastore (HMS).
- hive.compactor.abortedtxn.threshold
- Default=1000 aborts
- The number of aborted transactions that triggers compaction on a table/partition.
- hive.compactor.aborted.txn.time.threshold
- Default=12 hours
- The age, in hours, of the oldest aborted transaction that triggers compaction on a table/partition.
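The properties above determine when automatic compaction is initiated. For reference, you can
also queue a compaction manually for a specific table or partition with standard Hive SQL, as in
the minimal sketch below; the table and partition names are placeholders.

  -- Queue a major compaction for one partition of a transactional table.
  -- Sketch only: 'sales' and the partition spec are placeholder names.
  ALTER TABLE sales PARTITION (ds = '2023-01-01') COMPACT 'major';
  -- Queue a minor compaction for an unpartitioned table.
  ALTER TABLE web_logs COMPACT 'minor';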
Advanced compactor properties
- hive.compactor.worker.timeout
- Default=86400s
- A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec,
ns/nsec), which is sec if not specified.
- Time in seconds after which a compaction job is declared failed and the compaction
is re-queued.
- hive.compactor.check.interval
- Default=300s
- A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec,
ns/nsec), which is sec if not specified.
- Time in seconds between checks to see if any tables or partitions need to be
compacted. Decreasing this value reduces the time it takes to start compaction for a
table or partition that requires it. However, each check requires several calls to the
NameNode for every table or partition involved in a transaction since the last major
compaction, so decreasing this value also increases the load on the NameNode. Keep
this value high to limit that load.
- hive.compactor.delta.num.threshold
- Default=10
- Number of delta directories in a table or partition that triggers a minor
compaction.
- hive.compactor.delta.pct.threshold
- Default=0.1
- Percentage (fractional) size of the delta files relative to the base that triggers
a major compaction. (1.0 = 100%, so the default 0.1 = 10%.) This threshold and
hive.compactor.delta.num.threshold can be overridden per table, as shown in the sketch
after this list.
- hive.compactor.max.num.delta
- Default=500
- Maximum number of delta files that the compactor attempts to handle in a single job.
- hive.compactor.wait.timeout
- Default=300000
- A valid value is greater than 2000 milliseconds.
- Timeout in milliseconds for blocking compaction.
- hive.compactor.initiator.failed.compacts.threshold
- Default=2
- A valid value is between 1 and 20, and must be less than
hive.compactor.history.retention.failed.
- The number of consecutive compaction failures (per table/partition) after which
automatic compactions are no longer scheduled.
- hive.compactor.cleaner.run.interval
- Default=5000ms
- A valid value is a time with unit
(d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not
specified.
- The time between runs of the cleaner thread.
- hive.compactor.job.queue
- Specifies the Hadoop queue name to which compaction jobs are submitted. If the
value is an empty string, Hadoop chooses the queue.
- hive.compactor.compact.insert.only
- Default=true
- Whether the compactor compacts insert-only tables (true) or not (false). A safety
switch.
- hive.compactor.crud.query.based
- Default=false
- When enabled, performs major compaction on full CRUD tables as a query, and disables
minor compaction.
- hive.split.grouping.mode
- Default=query
- A valid value is either query or compactor.
- This property is set to compactor from within
the query-based compactor. This setting enables the Tez SplitGrouper to group splits based on
their bucket number, so that all rows from different bucket files for the same bucket
number can end up in the same bucket file after the compaction.
- hive.compactor.history.retention.succeeded
- Default=3
- A valid value is between 0 and 100.
- Determines how many successful compaction
records are retained in compaction history for a given table/partition.
- hive.compactor.history.retention.failed
- Default=3
- A valid value is between 0 and 100.
- Determines how many failed compaction
records are retained in compaction history for a given table/partition.
- hive.compactor.history.retention.attempted
- Default=2
- A valid value is between 0 and 100.
- Determines how many attempted compaction
records are retained in compaction history for a given table/partition.
- hive.compactor.history.reaper.interval
- Default=2m
- A valid value is a time with unit (d/day, h/hour, m/min, s/sec, ms/msec,
us/usec, ns/nsec), which is msec if not specified.
- Determines how often the compaction history reaper runs.
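The history retention properties above affect how much compaction history remains visible. As a
quick check of compaction activity and history, the minimal sketch below lists the compaction
requests that the system currently tracks, including their state.

  -- List current and recent compactions and their state
  -- (for example: initiated, working, succeeded, failed).
  SHOW COMPACTIONS;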
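As noted for hive.compactor.delta.num.threshold and hive.compactor.delta.pct.threshold, some
thresholds can be overridden per table through TBLPROPERTIES. The minimal sketch below assumes
the compactorthreshold. prefix documented for Hive ACID tables applies in your environment;
'sales' and the threshold values are placeholders.

  -- Per-table override of the delta-file compaction thresholds.
  -- Sketch only: assumes the compactorthreshold. TBLPROPERTIES prefix is
  -- supported here; 'sales' and the values are placeholders.
  ALTER TABLE sales SET TBLPROPERTIES (
    'compactorthreshold.hive.compactor.delta.num.threshold' = '20',
    'compactorthreshold.hive.compactor.delta.pct.threshold' = '0.2'
  );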