Apache Hive 3 architectural overview
Understanding Apache Hive 3 major design features, such as default ACID transaction processing, can help you use Hive to address the growing needs of enterprise data warehouse systems.
Apache Tez is the Hive execution engine for the Hive-on-Tez service in Cloudera Manager. MapReduce is not supported. In a Cloudera cluster, if a legacy script or application specifies MapReduce for execution, an exception occurs. Most user-defined functions (UDFs) require no change to execute on Tez instead of MapReduce.
With expressions of directed acyclic graphs (DAGs) and data transfer primitives, execution of Hive queries on Tez instead of MapReduce improves query performance. In Cloudera Data Plane (CDP), Tez is usually used only by Hive, and HiveServer launches and manages Tez AM automatically when HiveServer2 starts. SQL queries you submit to Hive are executed as follows:
- Hive compiles the query.
- Tez executes the query.
- Resources are allocated for applications across the cluster.
- Hive updates the data in the data source and returns query results.
Hive on Tez runs tasks on ephemeral containers and uses the standard YARN shuffle service.
Data storage and access control
One of the major architectural changes to support Hive 3 design gives Hive much more control over metadata memory resources and the vfile system, or object store. The following architectural changes from Hive 2 to Hive 3 provide improved security:
- Tightly controlled file system and computer memory resources, replacing flexible boundaries: Definitive boundaries increase predictability. Greater file system control improves security.
- Optimized workloads in shared files and YARN containers
- Hive uses ACID to determine which files to read rather than relying on the storage system.
- In Hive 3, file movement is reduced from that in Hive 2.
- Hive caches metadata and data agressively to reduce file system operations
The major authorization model for Hive is Ranger. Hive enforces access controls specified in Ranger. This model offers stronger security than other security schemes and more flexibility in managing policies.
This model permits only Hive to access the data warehouse. If you do not enable the Ranger security service, or other security, CDP Data Center by default Hive uses storage-based authorization (SBA) based on user impersonation.
HDFS permission changes
- Increased flexibility when giving multiple groups and users specific permissions
- Convenient application of permissions to a directory tree rather than by individual files
You can deploy new Hive application types by taking advantage of the following transaction processing characteristics:
- Mature versions of ACID transaction processing:
ACID tables are the default table type.
ACID enabled by default causes no performance or operational overload.
- Simplified application development, operations with strong transactional guarantees, and
simple semantics for SQL commands
You do not need to bucket ACID tables.
- Materialized view rewrites
- Automatic query cache
- Advanced optimizations
Hive client changes
CDP Data Center supports the thin client Beeline for working on the command line. You can run
Hive administrative commands from the command line. Beeline uses a JDBC connection to HiveServer
to execute commands. Parsing, compiling, and executing operations occur in HiveServer. Beeline
supports many of the command-line options that Hive CLI supported. Beeline does not support
hive -e set key=value to configure the Hive Metastore.
You enter supported Hive CLI commands by invoking Beeline using the
keyword, command option, and command. For example,
hive -e set. Using Beeline
instead of the thick client Hive CLI, which is no longer supported, has several advantages,
including low overhead. Beeline does not use the entire Hive code base. A small number of
daemons required to execute queries simplifies monitoring and debugging.
HiveServer enforces whitelist and blacklist settings that you can change using SET commands. Using the blacklist, you can restrict memory configuration changes to prevent HiveServer instability. You can configure multiple HiveServer instances with different whitelists and blacklists to establish different levels of stability.
You use the grunt command line to work with Apache Pig.
Apache Hive Metastore sharing
HiveServer, Impala, and other components can share a remote Hive metastore. In CDP Public Cloud, HMS uses a pre-installed MySQL database. You perform little, or no, configuration of HMS in the cloud.
Spark and Hive tables interoperate using the Hive Warehouse Connector in some cases.
You can access ACID and external tables from Spark using the Hive Warehouse Connector. You do not need the Hive Warehouse Connector to read Hive external tables from Spark and write Hive external tables from Spark.
Query execution of batch and interactive workloads
You can connect to Hive using Data Analytics Studio (DAS), a JDBC command-line tool, such as Beeline, or using an JDBC/ODBC driver with a BI tool, such as Tableau. Clients communicate with an instance of the same HiveServer version. You configure the settings file for each instance to perform either batch or interactive processing.