What's New in Apache Hive

Hive ACID compaction observability

Compaction observability is a notification and information system based on metrics about the health of the compaction process. You can use Cloudera Manager to view compaction health checks for the Hive Metastore and Hive on Tez services, view actions and advice related to configurations and thresholds, and use the Compaction tab from the Hive Metastore service to view compaction-related charts based on the collected metrics. For more information, see Compaction Observability in Cloudera Manager.

Using SYS table to monitor compactions, transactions, and locks

You can monitor the progress and filter for specific compaction, transaction, and transaction lock jobs by querying the COMPACTIONS, TRANSACTIONS, and LOCKS view within the SYS database. For details, see Monitoring compactions, Monitoring transactions, and Monitoring transaction locks.

Support table defaults at database-level

You can use the database property, defaultTableType=EXTERNAL or ACID to specify the default table type to be created using the CREATE TABLE statement. You can specify this property when creating the database or at a later point using the ALTER DATABASE statement. For more information, see Understanding CREATE TABLE behavior.

Support external-only tables at database-level

You can choose to configure a database to allow only external tables to be created and prevent the creation of ACID tables. While creating a database, you can set the database property, EXTERNAL_TABLES_ONLY=true to ensure that only external tables are created in the database. For more information, see Understanding CREATE TABLE behavior.

Partition filtering support for the MSCK REPAIR TABLE statement

The MSCK REPAIR TABLE statement is enhanced to support filtering of the partition columns using operators so that a larger subset of partitions can be recovered (added/removed) without triggering a full repair. For more information, see Partition refresh and configuration.

Mapping specific columns in the INSERT clause of the MERGE statement

The MERGE statement is enhanced to support mapping of specific columns in the INSERT clause of the query instead of passing values (including null) for columns in the target table that do not have any data to insert. The unspecified columns in the INSERT clause are either mapped to null or use default constraints, if any. For more information, see Merging data in Hive tables.

Hive Warehouse Connector support for Spark 3

As part of this release, Hive Warehouse Connector (HWC) is certified to work with Spark 3. You can use the CDS 3.3 Powered by Apache Spark add-on service to use Spark 3 for querying Hive managed tables. For more information, see Introduction to HWC.

Simplified Hive Warehouse Connector configuration

Setting up HWC configurations that are required by Spark is now simplified. As a cluster administrator, you need to specify the required configurations in Cloudera Manager as a one-time activity and then enable HWC by setting the spark.cloudera.useHWC property to "true". The spark.cloudera.useHWC property can either be specified in the spark-defaults.conf file or by using the –conf option in spark-shell or spark-submit. For more information, see Setting up HWC configurations.

Enable caching for Hive Warehouse Connector Secure access mode

You can enable caching for the HWC secure access mode to have finer control over read queries and ensure that the content updated outside of a Spark session is considered during reads. For more information, see Enabling caching for secure access mode.

Hive metastore supports Oracle RAC databases for High Availability

Hive metastore supports using Oracle Real Application Clusters (RAC) to achieve high availability and seamlessly handle SELECT, DDL, and DML operations.

Create and Drop UDFs are authorized in Hive metastore (Technical Preview)

The Create and Drop user-defined functions (UDFs) that were earlier authorized only in HiveServer2 (HS2) are enhanced to be authorized in Hive metastore (HMS). For details, see HIVE-25468.

Technical Preview: This is a technical preview feature and considered under development. Do not use this in your production systems. To share your feedback, contact Support by logging a case on our Cloudera Support Portal. Technical preview features are not guaranteed troubleshooting guidance and fixes.

Hive ACID improvements

As part of this release, the following improvements have been implemented to enhance Hive ACID operations:

Allow faster CREATE TABLE (no-rename CTAS) — During table creation, the table location is suffixed with the transaction identifier of the query to prevent a possible race condition between users who are trying to create a table with the same name at the exact same time. See HIVE-24906.
Faster Drop Table/Drop Partition operation — HIVE-25393 introduces read locks in compaction (MR, Minor Query-based) that would prevent concurrency issues with DDL operations (DROP/RENAME table).
Faster sequence numbers — HIVE-23048 implements the use of sequences for TXN_ID generation, which does not restore the number when rolling back the transaction.
Split Update statement and skip sorting of Insert rows — HIVE-21158 rewrites the Update statement as Multi-Insert and performs Update split early. This helps in preventing the sorting of insert delta since these records have new record identifier which are generated in a monotonic increase manner, and can lead to a faster statement execution when a large number of records are affected.
Improved read-locks and prevent single transaction locks — This improvement fixes the issue that occurs when a single long running transaction prevents the Cleaner to clean up any tables. This causes file buildup in tables, which can cause performance penalties when listing the directories.

Hive ACID replication using REPL commands

As part of this release, you can now use REPL commands (REPL DUMP and REPL LOAD) to replicate Hive ACID tables between CDP Private Cloud Base clusters.