Apache Spark-Apache Hive connection configuration
In Spark, you can use the Hive Warehouse Connector (HWC) for accessing ACID table data in Hive. You need to understand the workflow and service changes involved in such access.
Configuring the HWC mode for reads
The HWC runs in the following modes for reading Hive-managed tables:
- LLAP
  - true (not supported in CDP Data Center)
  - false
- JDBC
  - cluster
  - client
You set the spark.datasource.hive.warehouse.read.via.llap property to true or false to turn LLAP mode on or off. You set spark.datasource.hive.warehouse.read.jdbc.mode to cluster or client to configure JDBC mode.
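For example, here is a minimal sketch of setting these properties when building a SparkSession; the application name is a placeholder, and in practice the properties often go in spark-defaults.conf or on the command line with --conf instead:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: the application name is a placeholder, and these
// properties are commonly supplied at launch time rather than in code.
val spark = SparkSession.builder()
  .appName("hwc-mode-example")
  // LLAP mode on (true is not supported in CDP Data Center):
  .config("spark.datasource.hive.warehouse.read.via.llap", "true")
  // To use JDBC mode instead, turn LLAP off and pick cluster or client:
  // .config("spark.datasource.hive.warehouse.read.via.llap", "false")
  // .config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
  .getOrCreate()
```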
Configuring the HWC connection in CDP Public Cloud
The following software and property settings are required for connecting Spark and Hive using the HiveWarehouseConnector library in CDP Public Cloud:
- Spark 2.4.x
- Low-latency analytical processing (LLAP) is recommended; alternatively, use the Hive JDBC database driver.
- Set spark.datasource.hive.warehouse.read.via.llap=true (recommended). Alternatively, for a non-LLAP cluster, set spark.datasource.hive.warehouse.read.via.llap=false and spark.datasource.hive.warehouse.read.jdbc.mode=cluster (see the read sketch after this list).
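A minimal read sketch under these settings, assuming the HWC jar is on the classpath and that com.hortonworks.hwc.HiveWarehouseSession is the session API shipped with your HWC version; the database and table names are placeholders:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Assumes `spark` is a SparkSession configured for LLAP reads as
// described above; sales_db.transactions is a placeholder table.
val hive = HiveWarehouseSession.session(spark).build()
hive.executeQuery("SELECT * FROM sales_db.transactions LIMIT 10").show()
```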
The Hive Warehouse Connector is required for certain tasks in CDP Public Cloud. Low-latency analytical processing (LLAP) is recommended for reading ACID and other Hive-managed tables from Spark. You do not need LLAP to write to ACID or other managed tables from Spark. CDP Data Center supports JDBC mode, which you use with HWC in lieu of LLAP; JDBC mode works well for small datasets only.
The HWC library internally uses the Hive Streaming API and LOAD DATA Hive commands to write data (a write sketch follows the table below). You do not need LLAP to access external tables from Spark, subject to the caveats shown in the following table.
Tasks | HWC Required | Recommended HWC Mode | Other Requirement/Comments |
---|---|---|---|
Read Hive managed tables from Spark | Yes | LLAP mode=true | Ranger ACLs enforced.* |
Write Hive managed tables from Spark | Yes | N/A | Ranger ACLs enforced.* |
Read Hive external tables from Spark | No | N/A unless HWC is used, then LLAP mode=true | Ranger ACLs not enforced. |
Write Hive external tables from Spark | No | N/A | Ranger ACLs enforced. |
* Ranger column-level security or column masking is supported for each access pattern when you use HWC.
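As noted above, writes to managed tables go through the Hive Streaming API and LOAD DATA rather than LLAP. A minimal write sketch, assuming the com.hortonworks.hwc API and the HWC jar on the classpath; the DataFrame contents and table name are placeholders:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()
import spark.implicits._

// Placeholder DataFrame; in practice this is your own data.
val df = Seq((1, "pen"), (2, "notebook")).toDF("id", "item")

// Write to a Hive-managed (ACID) table; LLAP is not required on the
// write path. The format constant resolves to the HWC data source.
df.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "sales_db.items")
  .save()
```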
Configuring the HWC connection in CDP Data Center
You configure the spark.datasource.hive.warehouse.read.jdbc.mode property as described below. The following software and property settings are required for connecting Spark and Hive using the HiveWarehouseConnector library in CDP Data Center:
- Spark 2.4.x
- Hive JDBC database driver (download it from the Cloudera Downloads page)
- Set spark.datasource.hive.warehouse.read.jdbc.mode=cluster (recommended). Alternatively, set this property to client if you expect your result set to fit in memory (see the sketch after this list).
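A sketch of the cluster/client choice; the application name is a placeholder, and as before these properties can also be set in spark-defaults.conf or with --conf:

```scala
import org.apache.spark.sql.SparkSession

// CDP Data Center sketch: JDBC is the only HWC read mode (no LLAP).
// client mode routes results through the driver, so reserve it for
// result sets that fit in driver memory.
val spark = SparkSession.builder()
  .appName("hwc-jdbc-mode-example")
  .config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
  // .config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")
  .getOrCreate()
```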
Tasks | HWC Required | Recommended HWC Mode | Other Requirement/Comments |
---|---|---|---|
Read Hive managed tables from Spark | Yes | JDBC mode=cluster | Ranger ACLs enforced.* |
Write Hive managed tables from Spark | Yes | N/A | Ranger ACLs enforced.* |
Read Hive external tables from Spark | No | N/A | Ranger ACLs not enforced. |
Write Hive external tables from Spark | No | N/A | Ranger ACLs enforced. |
* Ranger column-level security or column masking is supported for each access pattern when you use HWC.
Spark on a Kerberized YARN cluster in CDP Data Center
In Spark client mode on a kerberized YARN cluster, set the following property: spark.sql.hive.hiveserver2.jdbc.url.principal. This property must be equal to hive.server2.authentication.kerberos.principal. A combined configuration sketch follows the property summary below.
- Property: spark.security.credentials.hiveserver2.enabled
- Description: Use Spark ServiceCredentialProvider and set equal to a boolean, such as true
- Comment: true by default
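Combining the two Kerberos-related properties, a hedged sketch follows; the principal is a placeholder that must match hive.server2.authentication.kerberos.principal on your cluster, and in client mode these values are normally supplied at launch (spark-defaults.conf or --conf) rather than in code:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder principal; replace with the value of
// hive.server2.authentication.kerberos.principal on your cluster.
val spark = SparkSession.builder()
  .appName("hwc-kerberos-example")
  .config("spark.sql.hive.hiveserver2.jdbc.url.principal",
    "hive/_HOST@EXAMPLE.COM")
  // Uses Spark's ServiceCredentialProvider; defaults to true.
  .config("spark.security.credentials.hiveserver2.enabled", "true")
  .getOrCreate()
```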