Migrating Hive workloads to ACID

An Apache Hive transactional table is also known as a Hive ACID table. The terms transactional and ACID are interchangeable. Data from transactions involving money, especially, but also other transactions, require databases that meet ACID requirements. You learn what ACID means and some features available if you use ACID tables.

This documentation describes Hive 3 under default conditions. All managed tables are ACID tables.

ACID is an acronym that describes database properties as follows:

A = Atomic: Changes to the database occur all at once or not at all.

C = Consistent: Before and after system failures, results of inserts and deletes are the same.

I = Isolated: A read operation is not affected by changes to the database that occur during the operation.

D = Durable: Successful transactions occur regardless of system failure.

Atomic means simultaneous operations do not interfere with each othter. Once done, the result of an atomic operation is available to others. Transactions are consistent between operations. An operation is isolated in its own workspace and durable, meaning after being committed, data remains and survives any failure.

If the data about your transactions does not have to be atomic, consistent, isolated, and durable, you can configure Hive to create managed tables that are not ACID.

You get a number of capabilities described in the following sections when you use ACID tables.

Query results cache and data cache🔗

Results of ACID tables that Hive manages on the file system and in HMS can be cached. To reliably deal with data caching and query results caching, Hive needs to check that nothing is changed. If something changed, Hive invalidates those items to read more data. Hive must have complete control over the tables to perform the checks associated with caching.

System management of ACID tables🔗

ACID tables, which are transactional tables, are managed by Hive. Hive tightly controls data storage and access. By default, direct file system access to the managed Hive warehouse files is not allowed. Hive keeps you out of the warehouse. You do not need to access the Hive files on the file system. Access to the file system subverts security and governance on Hive tables beyond just reading the columns. The data masking in columns and row filtering are compromised if you have access to the files. Exceptions: Ranger RMS policies are projected down to the file system level to handle Spark and Hive Warehouse Connector (HWC) Direct Reader access.

Hive stops external operations from changing managed files to prevent data integrity and corruption problems.

Compaction🔗

Managed tables go through a compaction process. You can make inserts into table repeatedly and the ACID compaction system helps clean up inserts. The system compacts the changes, reducing your footprint of small files (delta files). You cannot, however, batch micro inserts, such as making 1000 inserts at one time. This creates problems with the system. Hive is a batch system, not intended to be handling online transaction processing (OLTP).

Hive statistics🔗

Managed tables generates and manages table statistics for ACID tables. This is not necessarily true for other tables.

Governance of Hive tables🔗

Hive tables are managed, audited, and have protected columns. You can protect the data level using masking and row level filtering.

Advanced features🔗

Advanced features, in addition to ACID, include materialized views, the query cache, and automatic statistics generation. You can update or delete data from ACID tables. Hive 3 addresses small files different from Hive 2.

For more information, see the Hive product documentation on the Cloudera web site:

Materialized views
Query results cache
Automatically collected statistics
Mutability
- Updates allowed
- Deletes allowed
Handling small files