Hive Warehouse Connector read modes

Learn about the supported Hive Warehouse Connector (HWC) read modes and their advantages, requirements, and limitations.

Hive Warehouse Connector read modes supported by Cloudera Data Engineering

HWC supports multiple read modes, each tailored for specific use cases. These modes determine how Spark interacts with Hive tables and views.

HWC provides the following read methods, each suited for different scenarios:

  • DIRECT_READER_V2
  • JDBC_CLUSTER
  • SECURE_ACCESS

DIRECT_READER_V2

HWC requires a connection to the Hive Metastore (HMS). Spark reads the data directly from the managed table location using a transaction snapshot.

Requirements:
  • You must have read and execute access to the underlying data and folder structure of the table.
  • You need access to HiveServer2 (HS2).

Advantages: The DIRECT_READER_V2 read mode is recommended for achieving better performance through vectorized Optimized Row Columnar (ORC) reads.

Limitations:
  • Cannot write data using HWC DIRECT_READER.
  • No support for fine-grained access control (FGAC).
  • Supports only single-table transaction consistency. The DIRECT_READER does not guarantee that multiple tables referenced in a query read the same snapshot of data.

Usage: You can use implicit methods, like

sql("select * from default.hwctest").show(1, false)

or

hive.sql("select * from default.hwctest").show(1, false)

Configuration

Example Cloudera Data Engineering job or session configuration to use the DIRECT_READER_V2 read mode of HWC:

spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2
spark.hadoop.secure.access.cache.disable=true
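In practice, properties like these are passed to a job as --conf key=value pairs. The following sketch shows one way to keep a shared base configuration and swap only the read mode per job; the helper name, placeholder host values, and structure are illustrative, not part of HWC.

```python
# Sketch: assemble HWC properties as spark-submit --conf flags.
# BASE_HWC_CONF values are placeholders; substitute your own endpoints.

BASE_HWC_CONF = {
    "spark.datasource.hive.warehouse.metastoreUri": "thrift://hms-host:9083",
    "spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://hs2-host:10000",
    "spark.sql.extensions": "com.hortonworks.spark.sql.rule.Extensions",
    "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
}

def hwc_submit_flags(read_mode, extra=None):
    """Render the full --conf flag list for the given HWC read mode."""
    conf = dict(BASE_HWC_CONF)
    conf["spark.datasource.hive.warehouse.read.mode"] = read_mode
    conf.update(extra or {})
    return ["--conf {}={}".format(k, v) for k, v in sorted(conf.items())]

print("\n".join(hwc_submit_flags("DIRECT_READER_V2")))
```

Keeping the base configuration in one place means only the read mode (and, for SECURE_ACCESS, the staging directory) varies between jobs.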

JDBC_CLUSTER

HWC connects to HiveServer2 (HS2) to obtain transaction information. This read mode is secured through Ranger authorization and supports FGAC, such as column masking.

Requirements: You need access to HiveServer2 (HS2).

Limitations:
  • Recommended for workloads smaller than 1 GB; data is transferred through a single JDBC connection, which becomes a bottleneck for larger reads.
  • Requires HiveServer2 connection.
  • Cannot correctly resolve queries that use the ORDER BY clause when run as hive.sql(" [***QUERY***] ").
  • Cannot correctly query a table that has a complex column type, such as ARRAY, STRUCT, or MAP: the type is incorrectly represented as a string in the returned data frame.
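The size and FGAC limitations above suggest a simple pre-flight rule of thumb for picking a read mode. The helper below and its 1 GB threshold are illustrative heuristics derived from the listed limitations, not an HWC API.

```python
# Sketch: choose an HWC read mode from workload traits.
# choose_read_mode and the threshold are illustrative, not part of HWC.

ONE_GB = 1 << 30

def choose_read_mode(estimated_bytes, needs_fgac):
    if not needs_fgac:
        # Fastest path: vectorized ORC reads straight from storage.
        return "DIRECT_READER_V2"
    if estimated_bytes < ONE_GB:
        # Small, policy-protected reads fit through one JDBC connection.
        return "JDBC_CLUSTER"
    # Large reads that still need Ranger FGAC go through secure staging.
    return "SECURE_ACCESS"

print(choose_read_mode(10 * ONE_GB, needs_fgac=True))  # SECURE_ACCESS
```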

Usage: You can use explicit methods, like

hive.sql("select * from default.hwctest").show(1, false)

Configuration

Example Cloudera Data Engineering job or session configuration to use the JDBC_CLUSTER read mode of HWC:

spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=JDBC_CLUSTER
spark.hadoop.secure.access.cache.disable=true

SECURE_ACCESS

The SECURE_ACCESS read mode allows FGAC column masking and row filtering to secure the data of managed (Atomicity, Consistency, Isolation, and Durability, or ACID) tables or external Hive tables read from Cloudera Data Engineering Spark sessions and jobs. It is recommended for large workloads and low-latency requirements. HWC creates external tables in a configured staging location using CREATE TABLE AS SELECT (CTAS); Ranger policies are applied and FGAC is enforced during the CTAS operation. Users can then read the external tables in the secure staging location.

Advantages: Enforces FGAC policies through Ranger.

Usage: The SECURE_ACCESS read mode ensures compliance with FGAC policies.

Configuration

Example Cloudera Data Engineering job or session configuration to use the SECURE_ACCESS read mode of HWC:

spark.hadoop.hive.metastore.uris=thrift://[***HMS-HOST***]:9083
spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=SECURE_ACCESS
spark.datasource.hive.warehouse.load.staging.dir=[***STAGING-DIR-PATH***]
spark.hadoop.secure.access.cache.disable=true
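Unlike the other two modes, SECURE_ACCESS adds a required property: the staging directory. A small pre-submit check like the sketch below can catch a missing key before the job fails at runtime; the validator and its REQUIRED_COMMON list are hypothetical, not part of HWC.

```python
# Sketch: pre-submit check that a SECURE_ACCESS configuration carries the
# staging-directory property the mode requires. Illustrative only.

REQUIRED_COMMON = [
    "spark.datasource.hive.warehouse.metastoreUri",
    "spark.sql.hive.hiveserver2.jdbc.url",
    "spark.datasource.hive.warehouse.read.mode",
]

def missing_keys(conf):
    required = list(REQUIRED_COMMON)
    if conf.get("spark.datasource.hive.warehouse.read.mode") == "SECURE_ACCESS":
        required.append("spark.datasource.hive.warehouse.load.staging.dir")
    return [k for k in required if k not in conf]

conf = {
    "spark.datasource.hive.warehouse.metastoreUri": "thrift://hms-host:9083",
    "spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://hs2-host:10000",
    "spark.datasource.hive.warehouse.read.mode": "SECURE_ACCESS",
}
print(missing_keys(conf))  # ['spark.datasource.hive.warehouse.load.staging.dir']
```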