What's new in this release: Apache Hive
HDP 3.x includes many Apache Hive 3 enhancements. HDP 3.x can help you improve query performance and comply with internet regulations. A few interfaces available in prior releases are unsupported in HDP 3.x.
HDP 3.1.5 key features
In HDP 3.1.5 and later, Spark and Hive share a catalog in Hive metastore (HMS) instead of using separate catalogs, as they did in HDP 3.1.4 and earlier. The shared catalog simplifies use of the Hive Warehouse Connector (HWC). To read a Hive external table from Spark, you no longer need to define the table redundantly in the Spark catalog. HDP 3.1.5 also introduces HMS table transformations. HMS detects the type of client interacting with it, for example Hive or Spark, and compares the capabilities of the client with the requirements of the table. HMS then takes an action that is appropriate given the client capabilities and other factors. See link below.
The default table type created from Spark using HWC has changed to external. The external.table.purge property is set to true, so external table behavior is like HDP 2.x managed tables with regard to the drop statement, which now drops the table data, not just the schema.
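The purge behavior described above is controlled by a table property, so it can also be set explicitly when an external table is created from Hive. The following is a minimal sketch that only builds the HiveQL statement as a string; the helper name, table name, and columns are hypothetical and used for illustration only.

```python
def external_table_ddl(table, columns, purge=True):
    """Build a CREATE EXTERNAL TABLE statement that sets external.table.purge."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    purge_value = "true" if purge else "false"
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_list}) "
        f"TBLPROPERTIES ('external.table.purge'='{purge_value}')"
    )


# Hypothetical table and columns, for illustration only.
ddl = external_table_ddl("sales_ext", [("id", "INT"), ("amount", "DOUBLE")])
print(ddl)
```

With the property set to true, a later drop of the table is expected to remove the data files as well as the schema; with false, only the schema is removed.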
When you use SparkSQL, standard Spark APIs access tables in the Spark catalog.
HDP 3.1.4 issues resolved and key features
- Automatic partition management synchronizes changes in the metadata and on the file system.
- You can configure how long to retain partition data and metadata.
- Hive Warehouse Connector now validates mapping of columns against those in Hive to alert the user to input errors.
- Writing a DataFrame to Hive supports specifying partitions.
- A new MergeBuilder interface for HiveWarehouseSession API operations supports merging tables.
See related links below.
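As an illustration of the partition management features listed above, the sketch below builds an ALTER TABLE statement that turns on partition discovery and, optionally, a retention period through the discover.partitions and partition.retention.period table properties. The helper and the table name are hypothetical, and the retention value is passed through unchecked, so treat this as a starting point rather than a complete recipe.

```python
def partition_management_ddl(table, retention=None):
    """Build an ALTER TABLE statement for automatic partition management."""
    properties = {"discover.partitions": "true"}
    if retention is not None:
        # Example value: "7d". The value is not validated in this sketch.
        properties["partition.retention.period"] = retention
    property_list = ", ".join(f"'{key}'='{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({property_list})"


# Hypothetical table name, for illustration only.
partition_ddl = partition_management_ddl("employees_ext", retention="7d")
print(partition_ddl)
```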
HDP 3.1 key features
- Kafka-Hive connector reads data from Kafka into Hive and writes data from Hive to Kafka.
- JdbcStorageHandler connects to BI tools.
- Built-in UDF generates surrogate keys for ACID tables.
- Hive Warehouse Connector creates Hive tables based on the existing Spark DataFrames when you save a DataFrame to Hive.
- Ambari 2.7.3 feature configures Hive high availability. You can add multiple HDP 3.1 HiveServer Interactive instances that run in passive (standby) mode, ready to take over in the event of a problem.
See related links below.
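To illustrate the surrogate key feature in the list above, the following sketch builds a CREATE TABLE statement whose key column defaults to the built-in SURROGATE_KEY() UDF. The helper, table name, and column names are hypothetical; the statement is only assembled as text, not run against Hive.

```python
def surrogate_key_table_ddl(table, key_column, columns):
    """Build a CREATE TABLE statement with a surrogate key column."""
    definitions = [f"{key_column} BIGINT DEFAULT SURROGATE_KEY()"]
    definitions += [f"{name} {hive_type}" for name, hive_type in columns]
    return f"CREATE TABLE {table} ({', '.join(definitions)})"


# Hypothetical table and columns, for illustration only.
surrogate_ddl = surrogate_key_table_ddl(
    "students_v2", "id", [("row_id", "INT"), ("name", "VARCHAR(64)")]
)
print(surrogate_ddl)
```

Rows inserted without a value for the key column would then receive a generated key.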
HDP 3.0.1 key feature
HDP 3.0.1 includes partitioning of materialized views, which can improve query responsiveness, and maintenance fixes.
HDP 3.0 key features
HDP 3.0 includes many features. The primary features are listed below.
Workload management
Using workload management, you can configure who uses resources, how much can be used, and how quickly Hive responds to resource requests. Managing resources is critical to Hive LLAP (low-latency analytical processing), especially in a multitenant environment. Using workload management, you can create resource pools and allocate resources to match availability needs and prevent contention for those resources. Workload management improves parallel query execution and cluster sharing for queries running on Hive LLAP, and also improves performance of non-LLAP queries. Workload management reduces resource starvation in large clusters. You implement workload management on the command line using Hive queries.
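Because workload management is implemented with Hive queries, a resource plan can be expressed as a short series of statements. The sketch below generates CREATE RESOURCE PLAN and CREATE POOL statements for a set of pools. The plan name, pool names, and the client-side check that allocation fractions do not exceed 1.0 are assumptions made for this example; handling of the plan's default pool and activation of the plan are separate steps that are not shown.

```python
def resource_plan_statements(plan, pools):
    """Build statements for a resource plan.

    pools maps a pool name to (alloc_fraction, query_parallelism).
    """
    total = sum(fraction for fraction, _ in pools.values())
    if total > 1.0 + 1e-9:
        # Hypothetical sanity check done before sending anything to Hive.
        raise ValueError(f"allocation fractions sum to {total}, expected at most 1.0")
    statements = [f"CREATE RESOURCE PLAN {plan}"]
    for name, (fraction, parallelism) in pools.items():
        statements.append(
            f"CREATE POOL {plan}.{name} WITH "
            f"ALLOC_FRACTION={fraction}, QUERY_PARALLELISM={parallelism}"
        )
    return statements


# Hypothetical plan and pools, for illustration only.
plan_statements = resource_plan_statements("daytime", {"bi": (0.8, 5), "etl": (0.2, 20)})
for statement in plan_statements:
    print(statement + ";")
```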
Transaction processing improvements
Mature versions of ACID (Atomicity, Consistency, Isolation, and Durability) transaction processing and LLAP evolve in Hive and HDP 3.0. ACID tables are enhanced to serve as the default table type in HDP 3.0, without performance or operational overhead. LLAP processes queries in less than a second. Using ACID table operations facilitates compliance with the right-to-be-forgotten requirement of the GDPR (General Data Protection Regulation). Application development and operations are simplified with stronger transactional guarantees and simpler semantics for SQL commands. You do not need to bucket ACID tables, so maintenance is easier. You no longer need to perform ACID delete operations in a Hive table.
With improvements in transactional semantics come advanced optimizations, such as materialized view rewrites and automatic query cache. With these optimizations, you can deploy new Hive application types. Because multiple queries frequently need the same intermediate rollup or joined table, you can avoid costly, repeated computation of the shared query portions by precomputing and caching intermediate tables in materialized views. The query optimizer automatically leverages the precomputed cache, improving performance. For example, materialized views increase the speed of join and aggregation queries in business intelligence (BI) and dashboard applications.
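A materialized view of the kind described above is defined with a single statement. The following sketch builds that statement, with an optional PARTITIONED ON clause for the partitioned variant. The helper, the view name, and the query over hypothetical emps and depts tables are illustrative assumptions.

```python
def materialized_view_ddl(view, query, partitioned_on=None):
    """Build a CREATE MATERIALIZED VIEW statement, optionally partitioned."""
    partition_clause = ""
    if partitioned_on:
        partition_clause = f" PARTITIONED ON ({', '.join(partitioned_on)})"
    return f"CREATE MATERIALIZED VIEW {view}{partition_clause} AS {query}"


# Hypothetical source tables and columns, for illustration only.
rollup_query = (
    "SELECT d.deptname, COUNT(*) AS headcount "
    "FROM emps e JOIN depts d ON e.deptno = d.deptno "
    "GROUP BY d.deptname"
)
mv_ddl = materialized_view_ddl("dept_headcount_mv", rollup_query)
print(mv_ddl)
```

Queries that compute the same join and aggregation are candidates for automatic rewriting against the view, which is where the performance gain comes from.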
Cost-based optimizer enhancements
Hive can push down the filtering, sorting, and joining of columns in a query. For example, joins of MySQL tables can be pushed down to the underlying database.
Direct, low-latency Hive query of Kafka topics
You can ingest Kafka data into ACID tables, or query the data in Kafka messages directly from Hive. With HDP 3.0, you can create a Druid table within Hive from a Kafka topic in a single command. This feature simplifies queries of Kafka data by eliminating the data processing step between delivery by Kafka and querying in Druid.
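For querying Kafka data directly from Hive, a topic is typically exposed as an external table backed by the Kafka storage handler. The sketch below assembles such a statement. The helper, table name, columns, topic, and broker address are hypothetical, and the sketch covers only the Hive-on-Kafka case, not the Druid ingestion path.

```python
def kafka_table_ddl(table, columns, topic, bootstrap_servers):
    """Build a CREATE EXTERNAL TABLE statement backed by a Kafka topic."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_list}) "
        "STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler' "
        f"TBLPROPERTIES ('kafka.topic'='{topic}', "
        f"'kafka.bootstrap.servers'='{bootstrap_servers}')"
    )


# Hypothetical topic and broker, for illustration only.
kafka_ddl = kafka_table_ddl(
    "clicks_kafka", [("user_id", "STRING"), ("url", "STRING")],
    topic="clicks", bootstrap_servers="localhost:9092",
)
print(kafka_ddl)
```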
Superset
HDP 3 introduces a technical preview of Apache Superset, the data exploration and visualization UI platform. Superset is a way to create HDP dashboards. Using Superset, installed by default as a service in Ambari, you can connect to Hive, create visualizations of Hive data, and create custom dashboards on Hive datasets. Superset is an alternative to Hive View, which is not available in HDP 3.0.
Spark integration with Hive
You can use Hive 3 to query data from Apache Spark and Apache Kafka applications, without workarounds. The Hive Warehouse Connector supports reading and writing Hive tables from Spark.
Hive security improvements
Apache Ranger secures Hive data by default. Through the Hive Warehouse Connector (HWC), you can secure data access at the column or row level from Spark. To meet customer demands for concurrency improvements, ACID support for GDPR (General Data Protection Regulation), render security, and other features, Hive now tightly controls the file system and computer memory resources. With the additional control, Hive better optimizes workloads in shared files and YARN containers. The more Hive controls the file system, the better Hive can secure data.
Query result cache
Hive filters and caches similar or identical queries. Hive does not recompute the data that has not changed. Caching repetitive queries can reduce the load substantially when hundreds or thousands of users of BI tools and web services query Hive.
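The idea behind the result cache can be shown with a toy model. This is not Hive's implementation: the class, its per-table version counters, and the whitespace-only query normalization are assumptions chosen to keep the sketch small. A cached result is reused only while none of the tables the query reads has been written to since the result was stored.

```python
class QueryResultCache:
    """Toy model of a result cache invalidated by table writes."""

    def __init__(self):
        self._entries = {}         # normalized query -> (table versions, result)
        self._table_versions = {}  # table name -> number of writes seen

    def record_write(self, table):
        """Note that a table changed, which invalidates dependent entries."""
        self._table_versions[table] = self._table_versions.get(table, 0) + 1

    def run(self, query, tables, compute):
        """Return a cached result if still valid, otherwise compute and store it."""
        key = " ".join(query.split())  # simplification: collapse whitespace only
        snapshot = tuple(self._table_versions.get(t, 0) for t in tables)
        entry = self._entries.get(key)
        if entry is not None and entry[0] == snapshot:
            return entry[1]
        result = compute()
        self._entries[key] = (snapshot, result)
        return result


executions = []

def expensive_query():
    executions.append(1)  # count how often the query really runs
    return [("east", 10), ("west", 7)]

cache = QueryResultCache()
sql = "SELECT region, COUNT(*) FROM sales GROUP BY region"
first = cache.run(sql, ["sales"], expensive_query)
second = cache.run(sql, ["sales"], expensive_query)  # served from the cache
cache.record_write("sales")                          # the table changes
third = cache.run(sql, ["sales"], expensive_query)   # recomputed
print(len(executions))
```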
Information schema database
Hive creates two databases from JDBC data sources when you add the Hive service to a cluster: information_schema and sys. All Metastore tables are mapped into your tablespace and available in sys. The information_schema data reveals the state of the system, similar to sys database data. You can query information_schema using SQL standard queries, which are portable from one DBMS to another.
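As a small example of a standard SQL query against information_schema, the sketch below builds a statement that lists the tables in one schema. The helper and its identifier check are hypothetical; the check is a deliberately strict assumption that keeps arbitrary text out of the query string.

```python
def tables_in_schema_query(schema):
    """Build a query that lists the tables of one schema from information_schema."""
    if not schema.replace("_", "").isalnum():
        # Hypothetical guard: accept only letters, digits, and underscores.
        raise ValueError(f"unexpected schema name: {schema!r}")
    return (
        "SELECT table_name, table_type FROM information_schema.tables "
        f"WHERE table_schema = '{schema}'"
    )


info_query = tables_in_schema_query("default")
print(info_query)
```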
Deprecated, unavailable, and unsupported interfaces
In HDP 3.0 and later, Hive does not support the following features:
- Hcat CLI
- Hive CLI (replaced by Beeline)
- SQL Standard Authorization
- MapReduce execution engine (replaced by Tez)
- Apache Hadoop Distributed Copy (DistCp)