2. New Feature: Transactions

Support for transactions in Hive 0.13 enables SQL atomicity of operations at the row level rather than at the level of a table or partition. This means that a Hive client may read from the same partition to which another Hive client is adding rows. In addition, transactions provide a mechanism for streaming clients to rapidly update Hive tables and partitions. However, Hive transactions are different than RDBMS transactions: each transaction has an identifier, and multiple transactions are grouped into a single transaction batch. A streaming client requests a set of transaction IDs after connecting to Hive and subsequently uses these transaction IDs one at a time during the initialization of new transaction batches. Clients write one or more records for each transaction and either commit or abort a transaction before moving to the next transaction.

[Note]Note

This feature is a technical preview and considered under development. Do not use this feature in your production systems. If you have questions regarding this feature, contact Support by logging a case on our Hortonworks Support Portal at http://support.hortonworks.com.

[Note]Note

Hive transactions do not work for this release when using Oracle as the Metastore database.

Understanding Compactions

Hive stores data in base files that cannot be updated by HDFS. Instead, Hive creates a set of delta files for each transaction that alters a table or partition and stores them in a separate delta directory. Occasionally, Hive compacts, or merges, the base and delta files. Hive performs all compactions in the background without affecting concurrent reads and writes of Hive clients. There are two types of compactions:

 

Table 3.3. Hive Compaction Types

Compaction TypeDescription
MinorRewrites a set of delta files to a single delta file for a bucket.
MajorRewrites one or more delta files and the base file as a new base file for a bucket.


By default, Hive automatically compacts delta and base files at regular intervals. However, Hadoop administrators can configure automatic compactions, as well as perform manual compactions of base and delta files using the following configuration parameters in hive-site.xml.

 

Table 3.4. Hive Transaction Configuration Parameters

Configuration ParameterDescription
hive.txn.managerSpecifies the class name of the transaction manager used by Hive. Configure this property with org.apache.hadoop.hive.ql.lockmgr.DbTxnManager to enable transactions. The default value is org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager, which disables transactions.
hive.txn.driverSpecifies the class name of the JDBC driver for the Hive metastore database. Include the username and password of a Hive administrator. The default MySQL database uses the following JDBC connector. Change the username and password to match your environment: jdbc:mysql://hdp.example.com/hivedb?user=hivedba&password=hivedbp
hive.txn.timeoutSpecifies the time period, in seconds, after which Hive fails a transaction if a Hive client has not sent a hearbeat. The default value is 300 seconds.
hive.txn.max.open.batchSpecifies the maximum number of transactions that can be retrieved in one call to open_txn(). The default value is 1000 transactions.
hive.compactor.initiator.onSpecifies whether to run the initiator and cleaner threads on this metastore instance. The default value is false. Must be set to true for exactly one instance of the Hive metastore service.
hive.compactor.worker.threadsSpecifies the number of of worker threads to run on this metastore instance. The default value is 0, which must be set to greater than 0 to enable compactions. Worker threads initialize MapReduce jobs to do compactions. Increasing the number of worker threads decreases the time required to compact tables after they cross a threshold that triggers compactions. However, increasing the number of worker threads also increases the background load on a Hadoop cluster.
hive.compactor.worker.timeoutSpecifies the time period, in seconds, after which a compaction job is failed and requeued. The default value is 86400 seconds, or 24 hours.
hive.compactor.check.intervalSpecifies the time period, in seconds, between checks to see if any partitions require compacting. The default value is 300 seconds. Decreasing this value reduces the time required to start a compaction for a table or partition. However, it also increases the background load on the NameNode since each check requires several calls to the NameNode.
hive.compactor.delta.num.thresholdSpecifies the number of delta directories in a partition that triggers an automatic minor compaction. The default value is 10.
hive.compactor.delta.pct.thresholdSpecifies the percentage size of delta files relative to the corresponding base files that triggers an automatic major compaction. The default value is .1, which is 10 percent.
hive.compactor.abortedtxn.thresholdSpecifies the number of aborted transactions on a single partition that trigger an automatic major compaction.


Enabling the Hive Transaction Manager

Configure a the following Hive configuration properties from the table above to enable transactions:

  • hive.txn.manager

  • hive.txn.driver

  • hive.compactor.initiator.on

  • hive.compactor.worker.threads

[Note]Note

To disable automatic compactions for individual tables, set the NO_AUTO_COMPACTION table property for those tables. This overrides the configuration settings in hive-site.xml. However, this property does not prevent manual compactions.

Check the hive log file at /tmp/hive/hive.log for errors if you experience problems enabling hive transactions.

Performing Manual Compactions

Hive administrators use the ALTER TABLE DLL command to queue a request to compact base and delta files for a table or partition:

ALTER TABLE tablename [PARTITION (partition_key='partition_value' [,...])] COMPACT 'compaction_type'

[Note]Note

ALTER TABLE will compact tables even if the NO_AUTO_COMPACTION table property is set.

Use the SHOW COMPACTIONS DDL command to monitor the progress of the compactions:

SHOW COMPACTIONS

This command provides the following output for each compaction:

  • Database name

  • Table name

  • Partition name

  • Major or minor compaction

  • Compaction state:

    • Initiated - waiting in queue

    • Working - currently compacting

    • Ready for cleaning - compaction completed and old files scheduled for removal

  • Thread ID

  • Start time of compaction

Hive administrators can also view a list of currently open and aborted transactions with the the SHOW TRANSACTIONS DDL command:

SHOW TRANSACTIONS

This command provides the following output for each transaction:

  • Transaction ID

  • Transaction state

  • Hive user who initiated the transaction

  • Host machine where transaction was initiated

New Lock Manager

Hive 0.13 utilizes a new lock manager, DbLockManager, to store all transaction and related lock information in the Hive metastore. Heartbeats are sent regularly from lock holders and transaction initiators to the Hive metastore to prevent stale locks and transactions. The lock or transaction is aborted if the Hive metastore does not receive a heartbeat within the amount of time specified by the hive.txn.timeout configuration property. Hive administrators use the SHOW LOCKS DDL command to view information about locks associated with transactions.

This command provides the following output for each lock:

  • Database name

  • Table name

  • Partition, if the table is paritioned

  • Lock state:

    • Acquired - transaction initiator hold the lock

    • Waiting - transaction initiator is waiting for the lock

    • Aborted - the lock has timed out but has not yet been cleaned

  • Lock type:

    • Exclusive - the lock may not be shared

    • Shared_read - the lock may be shared with any number of other shared_read locks

    • Shared_write - the lock may be shared by any number of other shared_read locks but not with other shared_write locks

  • Transaction ID associated with the lock, if one exists

  • Last time lock holder sent a heartbeat

  • Time the lock was acquired, if it has been acquired

  • Hive user who requested teh lock

  • Host machine on which the Hive user is running a Hive client

[Note]Note

The output of the command reverts to behavior prior to Hive 0.13 if administrators use Zookeeper or in-memory lock managers.

Transaction Limitations

HDP 2.1 has the following limitations for transactions with Hive:

  • The user initiating the Hive session must have write permission for the destination partition or table.

  • Zookeeper and in-memory locks are not compatible with transactions.

  • Only ORC files are supported.

  • Destination tables must be bucketed and not sorted.

  • Snapshot-level isolation, similar to READ COMMITTED. A query is provided with a consistent snapshot of the data during execution.


loading table of contents...