Hive Warehouse Connector read modes
Learn about the supported Hive Warehouse Connector (HWC) read modes and their advantages, requirements, and limitations.
Hive Warehouse Connector read modes supported by Cloudera Data Engineering
HWC provides the following read methods, each suited for different scenarios:
- DIRECT_READER_V2
- JDBC_CLUSTER
- SECURE_ACCESS
- DIRECT_READER_V2
- HWC needs a connection to Hive Metastore (HMS). Spark reads the data directly from the managed table location using a transaction snapshot.
Requirements:
- You must have read and execute access to the underlying data and folder structure of the table.
- You need access to HiveServer2 (HS2).
Advantages: The DIRECT_READER_V2 read mode is recommended for achieving better performance through vectorized Optimized Row Columnar (ORC) reads.
Limitations:
- Cannot write data using HWC DIRECT_READER.
- No support for fine-grained access control (FGAC).
- Supports only single-table transaction consistency. The DIRECT_READER does not guarantee that multiple tables referenced in a query read the same snapshot of data.
Usage: You can use implicit methods, like
sql("select * from default.hwctest").show(1, false)
or
hive.sql("select * from default.hwctest").show(1, false)
Configuration
Example Cloudera Data Engineering job or session configuration to use the DIRECT_READER_V2 read mode of HWC:
spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2
spark.hadoop.secure.access.cache.disable=true
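With the configuration above in place, a job can build an HWC session and read the table the way the usage examples show. The following is a minimal sketch; it assumes the HWC jar is on the classpath, a live cluster behind the configured endpoints, and the same default.hwctest table used in the examples above, so it cannot run standalone.

```scala
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

object DirectReaderExample {
  def main(args: Array[String]): Unit = {
    // The spark.datasource.hive.warehouse.* settings from the job
    // configuration are picked up from the Spark conf at session build time.
    val spark = SparkSession.builder().appName("hwc-direct-reader-v2").getOrCreate()
    val hive = HiveWarehouseSession.session(spark).build()

    // With read.mode=DIRECT_READER_V2, HWC fetches a transaction snapshot
    // from HMS and Spark then reads the ORC files directly (vectorized),
    // bypassing HS2 for the data transfer itself.
    val df = hive.sql("select * from default.hwctest")
    df.show(1, false)
  }
}
```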
- JDBC_CLUSTER
- HWC makes a connection to HiveServer2 (HS2) to get transaction information. This read mode is secured through Ranger authorization and supports FGAC, such as column masking.
Requirements: You need access to HiveServer2 (HS2).
Limitations:
- Recommended for workloads smaller than 1 GB, because data is transferred through a single JDBC connection and larger reads become a bottleneck.
- Requires a HiveServer2 connection.
- Cannot correctly resolve queries that use the ORDER BY clause when run as hive.sql(" [***QUERY***] ").
- Cannot correctly query a table that has a complex column type, such as ARRAY, STRUCT, and MAP: it incorrectly represents the type as string in the returned data frame.
Usage: You can use explicit methods, like
hive.sql("select * from default.hwctest").show(1, false)
Configuration
Example Cloudera Data Engineering job or session configuration to use the JDBC_CLUSTER read mode of HWC:
spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=JDBC_CLUSTER
spark.hadoop.secure.access.cache.disable=true
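A sketch of a JDBC_CLUSTER read, including one way to work around the ORDER BY limitation listed above: sort the returned data frame on the Spark side instead of inside the query passed to hive.sql. This assumes a live cluster, the HWC jar on the classpath, and a hypothetical sort column named id on the example table.

```scala
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

object JdbcClusterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hwc-jdbc-cluster").getOrCreate()
    val hive = HiveWarehouseSession.session(spark).build()

    // The query runs through HS2, so Ranger authorization and FGAC
    // (for example, column masking) are applied before data reaches Spark.
    // Keep ORDER BY out of the query text: it is not resolved correctly
    // in this read mode.
    val df = hive.sql("select * from default.hwctest")

    // Sort in Spark instead ("id" is a hypothetical column for illustration).
    df.orderBy("id").show(1, false)
  }
}
```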
- SECURE_ACCESS
- The SECURE_ACCESS read mode allows FGAC column masking and row filtering to secure the data of managed (Atomicity, Consistency, Isolation, and Durability (ACID)) or external Hive tables read from Cloudera Data Engineering Spark sessions and jobs. It is recommended for large workloads with low-latency requirements. HWC creates external tables in a configured staging location using CREATE TABLE AS SELECT (CTAS). Ranger policies are applied and FGAC is enforced during the CTAS operation, and users then read the external tables from the secure staging location.
Advantages: Enforces FGAC policies through Ranger.
Usage: The SECURE_ACCESS read mode ensures compliance with FGAC policies.
Configuration
Example Cloudera Data Engineering job or session configuration to use the SECURE_ACCESS read mode of HWC:
spark.hadoop.hive.metastore.uris=thrift://[***HMS-HOST***]:9083
spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=SECURE_ACCESS
spark.datasource.hive.warehouse.load.staging.dir=[***STAGING-DIR-PATH***]
spark.hadoop.secure.access.cache.disable=true
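A sketch of a SECURE_ACCESS read under the configuration above. This assumes a live cluster, the HWC jar on the classpath, and write access to the configured staging directory; the table name follows the document's earlier examples.

```scala
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

object SecureAccessExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hwc-secure-access").getOrCreate()
    val hive = HiveWarehouseSession.session(spark).build()

    // With read.mode=SECURE_ACCESS, HWC runs a CTAS through HS2 into the
    // staging location (spark.datasource.hive.warehouse.load.staging.dir)
    // with Ranger FGAC policies enforced, and Spark then reads the resulting
    // external table from that secure staging location.
    val df = hive.sql("select * from default.hwctest")
    df.show(1, false)
  }
}
```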
