Apache Hive 3 architectural overview
Understanding Apache Hive 3 major design features, such as default ACID transaction processing, can help you use Hive to address the growing needs of enterprise data warehouse systems.
Data storage and access control
One of the major architectural changes to support Hive 3 design gives Hive much more control over metadata memory resources and the file system, or object store. The following architectural changes from Hive 2 to Hive 3 provide improved security:
- Tightly controlled file system and computer memory resources, replacing flexible boundaries: Definitive boundaries increase predictability. Greater file system control improves security.
- Optimized workloads in shared files and YARN containers
- Hive uses ACID to determine which files to read rather than relying on the storage system.
- In Hive 3, file movement is reduced from that in Hive 2.
- Hive caches metadata and data agressively to reduce file system operations
The major authorization model for Hive is Ranger. Hive enforces access controls specified in Ranger. This model offers stronger security than other security schemes and more flexibility in managing policies.
This model permits only Hive to access the Hive warehouse.
Transaction processing
You can deploy new Hive application types by taking advantage of the following transaction processing characteristics:
- Mature versions of ACID transaction processing:
ACID tables are the default table type.
ACID enabled by default causes no performance or operational overload.
- Simplified application development, operations with strong transactional guarantees, and
simple semantics for SQL commands
You do not need to bucket ACID tables.
- Materialized view rewrites
- Automatic query cache
- Advanced optimizations
Hive client changes
You can use the thin client Beeline for querying Hive from the command line. You can run Hive
administrative commands from the command line. Beeline uses a JDBC connection to Hive to run
commands. Hive parses, compiles, and runs operations. Beeline supports many of the
command-line options that Hive CLI supported. Beeline does not support hive -e set
key=value
to configure the Hive Metastore.
You enter supported Hive CLI commands by invoking Beeline using the hive
keyword, command option, and command. For example, hive -e set
. Using Beeline
instead of the thick client Hive CLI, which is no longer supported, has several advantages,
including low overhead. Beeline does not use the entire Hive code base. A small number of
daemons required to run queries simplifies monitoring and debugging.
Hive enforces allowlist and denylist settings that you can change using SET commands. Using the denylist, you can restrict memory configuration changes to prevent instability. Different Hive instances with different allowlists and denylists to establish different levels of stability.
Apache Hive Metastore sharing
Hive, Impala, and other components can share a remote Hive metastore.
Spark integration
Spark and Hive tables interoperate using the Hive Warehouse Connector.
You can use the Hive Warehouse Connector (HWC) to access Hive managed tables from Spark. HWC is specifically designed to access managed Hive tables, and supports writing to tables in ORC format only.
You do not need HWC to read from or write to Hive external tables. Spark uses native Spark to read external tables.
Query execution of batch and interactive workloads
You can connect to Hive using a JDBC command-line tool, such as Beeline, or using an JDBC/ODBC driver with a BI tool, such as Tableau. You configure the settings file for each instance to perform either batch or interactive processing.