Apache Spark-Apache Hive connection configuration

In Spark, you can use the Hive Warehouse Connector (HWC) to access ACID table data in Hive. You need to understand the workflow and the service changes that this kind of access involves.

Configuring the HWC mode for reads

The HWC runs in the following modes for reading Hive-managed tables:

  • LLAP
    • true (not supported in CDP Data Center)
    • false
  • JDBC
    • cluster
    • client

Set the spark.datasource.hive.warehouse.read.via.llap property to true or false to turn LLAP mode on or off. Set spark.datasource.hive.warehouse.read.jdbc.mode to cluster or client to select the JDBC mode.
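For example, these properties can be set when you create the Spark session. The following is a minimal Scala sketch, not a complete configuration: the application name is illustrative, and the other settings HWC needs (such as spark.sql.hive.hiveserver2.jdbc.url) are assumed to be configured elsewhere.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hwc-read-mode")   // illustrative name
      // LLAP reads, where LLAP is available:
      .config("spark.datasource.hive.warehouse.read.via.llap", "true")
      // JDBC reads instead: turn LLAP off and pick where the read runs.
      // "cluster" uses executor containers; "client" pulls the result
      // set through the driver.
      //.config("spark.datasource.hive.warehouse.read.via.llap", "false")
      //.config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
      .getOrCreate()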

Configuring the HWC connection in CDP Public Cloud

Configure the following software and property settings to connect Spark and Hive using the HiveWarehouseConnector library in CDP Public Cloud:
  • Spark 2.4.x
  • Low-latency analytical processing (LLAP), which is recommended, or alternatively the Hive JDBC database driver
  • Set spark.datasource.hive.warehouse.read.via.llap=true (recommended).
    Alternatively, for a non-LLAP cluster, set the following properties, as shown in the sketch after this list:
    • spark.datasource.hive.warehouse.read.via.llap=false
    • spark.datasource.hive.warehouse.read.jdbc.mode=cluster
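The following Scala sketch shows the recommended LLAP configuration, with the non-LLAP alternative commented out, together with a read through the connector. The table name sales is a placeholder, and the HiveServer2 JDBC URL and other cluster-specific HWC settings are assumed to be configured elsewhere.

    import com.hortonworks.hwc.HiveWarehouseSession
    import org.apache.spark.sql.SparkSession

    // CDP Public Cloud: LLAP reads (recommended).
    val spark = SparkSession.builder()
      .appName("hwc-public-cloud-read")   // illustrative name
      .config("spark.datasource.hive.warehouse.read.via.llap", "true")
      // Alternative for a non-LLAP cluster:
      //.config("spark.datasource.hive.warehouse.read.via.llap", "false")
      //.config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
      .getOrCreate()

    // Build an HWC session and read a Hive-managed (ACID) table.
    val hive = HiveWarehouseSession.session(spark).build()
    val df = hive.executeQuery("SELECT id, amount FROM sales")   // placeholder table
    df.show()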

The Hive Warehouse Connector is required for certain tasks in CDP Public Cloud. Low-latency analytical processing (LLAP) is recommended for reading ACID and other Hive-managed tables from Spark; you do not need LLAP to write to them. CDP Data Center does not support LLAP, so there you use JDBC mode with HWC instead. JDBC mode works well only for small datasets.

You do not need LLAP to access external tables from Spark, with the caveats shown in Table 1. To write data, the HWC library internally uses the Hive Streaming API and LOAD DATA Hive commands, so writes do not depend on LLAP either.
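As a sketch of the write path, the following Scala snippet appends the DataFrame df from the earlier Public Cloud example to a hypothetical managed table sales_copy; HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR is the data source name constant that the HWC API exposes.

    import com.hortonworks.hwc.HiveWarehouseSession

    // Write df to a Hive-managed (ACID) table through HWC. The write
    // goes through the Hive Streaming API / LOAD DATA, not LLAP.
    df.write
      .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
      .option("table", "sales_copy")   // hypothetical target table
      .save()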

Table 1. Spark Compatibility: CDP Public Cloud

  Tasks                                  | HWC Required | Recommended HWC Mode                        | Other Requirement/Comments
  ---------------------------------------|--------------|---------------------------------------------|---------------------------
  Read Hive managed tables from Spark    | Yes          | LLAP mode=true                              | Ranger ACLs enforced.*
  Write Hive managed tables from Spark   | Yes          | N/A                                         | Ranger ACLs enforced.*
  Read Hive external tables from Spark   | No           | N/A unless HWC is used, then LLAP mode=true | Ranger ACLs not enforced.
  Write Hive external tables from Spark  | No           | N/A                                         | Ranger ACLs enforced.

* Ranger column-level security or column masking is supported for each access pattern when you use HWC.

Configuring the HWC connection in CDP Data Center

In CDP Data Center, you configure the spark.datasource.hive.warehouse.read.jdbc.mode property as described below. The following software and property settings are required to connect Spark and Hive using the HiveWarehouseConnector library:

  • Spark 2.4.x
  • Hive JDBC database driver

    Download the Hive JDBC database driver from the Cloudera Downloads page.

  • Set spark.datasource.hive.warehouse.read.jdbc.mode=cluster (recommended). Alternatively, set this property to client if you expect the result set to fit in memory, as shown in the sketch after this list.
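The following Scala sketch shows the recommended CDP Data Center configuration together with a read through the connector. The table name sales is a placeholder, and the HiveServer2 JDBC URL and other cluster-specific HWC settings are assumed to be configured elsewhere.

    import com.hortonworks.hwc.HiveWarehouseSession
    import org.apache.spark.sql.SparkSession

    // CDP Data Center: JDBC reads in cluster mode (recommended).
    val spark = SparkSession.builder()
      .appName("hwc-data-center-read")   // illustrative name
      .config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
      // Use "client" only when the result set fits in driver memory:
      //.config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")
      .getOrCreate()

    val hive = HiveWarehouseSession.session(spark).build()
    hive.executeQuery("SELECT COUNT(*) FROM sales").show()   // placeholder table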

Table 2. Spark Compatibility: CDP Data Center

  Tasks                                  | HWC Required | Recommended HWC Mode | Other Requirement/Comments
  ---------------------------------------|--------------|----------------------|---------------------------
  Read Hive managed tables from Spark    | Yes          | JDBC mode=cluster    | Ranger ACLs enforced.*
  Write Hive managed tables from Spark   | Yes          | N/A                  | Ranger ACLs enforced.*
  Read Hive external tables from Spark   | No           | N/A                  | Ranger ACLs not enforced.
  Write Hive external tables from Spark  | No           | N/A                  | Ranger ACLs enforced.

* Ranger column-level security or column masking is supported for each access pattern when you use HWC.

Spark on a Kerberized YARN cluster in CDP Data Center

In Spark client mode on a Kerberized YARN cluster, set the spark.sql.hive.hiveserver2.jdbc.url.principal property. Its value must equal the value of hive.server2.authentication.kerberos.principal.

In Spark cluster mode on a Kerberized YARN cluster, set the following property (see the sketch after this list):
  • Property: spark.security.credentials.hiveserver2.enabled
  • Description: Enables the Spark ServiceCredentialProvider; set to a boolean value, such as true
  • Comment: true by default
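The following Scala sketch sets both Kerberos-related properties at session creation. The principal value is a placeholder; it must match your cluster's hive.server2.authentication.kerberos.principal.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Client mode: the principal in the HiveServer2 JDBC URL must equal
      // hive.server2.authentication.kerberos.principal.
      .config("spark.sql.hive.hiveserver2.jdbc.url.principal",
              "hive/_HOST@EXAMPLE.COM")   // placeholder principal
      // Cluster mode: let Spark's ServiceCredentialProvider obtain
      // HiveServer2 delegation tokens (true by default).
      .config("spark.security.credentials.hiveserver2.enabled", "true")
      .getOrCreate()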