Configuring Spark Direct Reader mode

HWC transparently connects to the Hive Metastore (HMS) to get transaction information, and then reads the data directly from the managed table location using the transaction snapshot. You can read Hive tables from Spark using Spark SQL, the Hive Warehouse Connector (HWC) API, or the DataFrame API. The ORC and Parquet file formats are supported.
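
For example, once the configuration described below is in place, a plain Spark SQL query in spark-shell reads an ACID table through the Direct Reader. The database and table names (sales.transactions) in this sketch are placeholders for your own:

spark.sql("SELECT COUNT(*) FROM sales.transactions").show()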

The Hive Warehouse Connector (HWC) Spark Direct Reader, which reads Hive ACID (transactional) tables from Spark, is supported for production use. Spark Direct Reader mode is recommended for ETL jobs that do not require authorization and run as a superuser.

Requirements

Spark Direct Reader mode requires a connection to the Hive Metastore (HMS). A HiveServer (HS2) or LLAP connection is not needed.

Enable auto translate (by setting the spark.sql.extensions property described below), and execute reads with hive.execute, which processes queries through HWC in both JDBC and Spark Direct Reader modes.
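
For example, the following sketch reads a table through the HWC API from spark-shell, assuming the HWC jar is on the classpath; the database and table names (sales.transactions) are placeholders:

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing Spark session.
val hive = HiveWarehouseSession.session(spark).build()

// hive.execute processes the query through HWC; with
// spark.sql.hive.hwc.execution.mode=spark it runs in Direct Reader mode.
hive.execute("SELECT * FROM sales.transactions").show()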

Component Interaction

The following diagram shows component interaction in Spark Direct Reader mode.

Configuration

You configure the Spark Direct Reader using the following properties:

Name: spark.sql.extensions
Value: com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension
Required for using Spark SQL in auto-translate Direct Reader mode. Set before creating the Spark session.

Name: spark.kryo.registrator
Value: com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
Required if serialization = kryo. Set before creating the Spark session.

Name: spark.sql.hive.hwc.execution.mode
Value: spark
Required only if you are using the HWC API for execution. Cannot be changed.

Name: spark.hadoop.hive.metastore.uris
Value: thrift://<host>:<port>
Hive Metastore URI.

Name: --jars
Value: HWC jar
Pass the HWC jar to spark-shell or spark-submit using the --jars option when launching the application. For example, launch spark-shell as follows.

Example: Launch a spark-shell

spark-shell --jars \
/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf "spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension" \
--conf "spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator" \
--conf "spark.hadoop.hive.metastore.uris=<metastore_uri>"

Unsupported functionality

Spark Direct Reader does not support the following functionality:
  • Writes
  • Streaming inserts
  • CTAS statements

Limitations

  • No authorization enforcement; you must therefore configure read access to the HDFS (or other) location of the managed tables. You must have Read and Execute permissions on the Hive warehouse location (hive.metastore.warehouse.dir).
  • Supports only single-table transaction consistency. The direct reader does not guarantee that multiple tables referenced in a query read the same snapshot of data.
  • Does not auto-commit transactions submitted by RDD APIs. Explicitly close transactions to release locks.
  • Requires read and execute access on the hive-managed table locations.
  • Does not support Ranger column masking and fine-grained access control.
  • Blocks compaction on open read transactions.
  • The way Spark handles null and empty strings can cause a discrepancy when data read by Spark Direct Reader is written to a CSV file.
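
For example, one way to keep nulls and empty strings distinguishable when writing such data to CSV is to set the writer's nullValue and emptyValue options explicitly. The sentinel string and output path in this sketch are placeholders:

// df is assumed to be a DataFrame read from an ACID table through the Direct Reader.
df.write
  .option("nullValue", "\\N")    // write SQL NULL values as \N
  .option("emptyValue", "\"\"")  // write empty strings as "" (quoted)
  .csv("/tmp/direct_reader_csv") // placeholder output path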