Apache Spark-Apache Hive connection configuration

In Spark, you can use the Hive Warehouse Connector (HWC) to access ACID table data in Hive. You need to understand the workflow and the service changes that this kind of access involves.

Configuring the HWC mode for reads

The HWC runs in the following modes for reading Hive-managed tables:

  • LLAP
    • true (not supported in CDP Data Center)
    • false
  • JDBC
    • cluster
    • client

Set the spark.datasource.hive.warehouse.read.via.llap property to true or false to turn LLAP mode on or off. Set spark.datasource.hive.warehouse.read.jdbc.mode to cluster or client to select the JDBC mode.
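For example, these properties can be set when you create the Spark session. The following is a minimal Scala sketch, not a complete configuration: the application name is illustrative, and the other settings HWC needs (such as spark.sql.hive.hiveserver2.jdbc.url) are assumed to be configured elsewhere.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hwc-read-mode")   // illustrative name
      // LLAP reads, where LLAP is available:
      .config("spark.datasource.hive.warehouse.read.via.llap", "true")
      // JDBC reads instead: turn LLAP off and pick where the read runs.
      // "cluster" uses executor containers; "client" pulls the result
      // set through the driver.
      //.config("spark.datasource.hive.warehouse.read.via.llap", "false")
      //.config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
      .getOrCreate()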

Configuring the HWC connection in CDP Public Cloud

Configure the following software and property settings to connect Spark and Hive using the HiveWarehouseConnector library in CDP Public Cloud:
  • Spark 2.4.x
  • Low-latency analytical processing (LLAP), which is recommended, or alternatively the Hive JDBC database driver
  • Set spark.datasource.hive.warehouse.read.via.llap=true (recommended).
    Alternatively, for a non-LLAP cluster, set the following properties, as shown in the sketch after this list:
    • spark.datasource.hive.warehouse.read.via.llap=false
    • spark.datasource.hive.warehouse.read.jdbc.mode=cluster
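The following Scala sketch shows the recommended LLAP configuration, with the non-LLAP alternative commented out, together with a read through the connector. The table name sales is a placeholder, and the HiveServer2 JDBC URL and other cluster-specific HWC settings are assumed to be configured elsewhere.

    import com.hortonworks.hwc.HiveWarehouseSession
    import org.apache.spark.sql.SparkSession

    // CDP Public Cloud: LLAP reads (recommended).
    val spark = SparkSession.builder()
      .appName("hwc-public-cloud-read")   // illustrative name
      .config("spark.datasource.hive.warehouse.read.via.llap", "true")
      // Alternative for a non-LLAP cluster:
      //.config("spark.datasource.hive.warehouse.read.via.llap", "false")
      //.config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
      .getOrCreate()

    // Build an HWC session and read a Hive-managed (ACID) table.
    val hive = HiveWarehouseSession.session(spark).build()
    val df = hive.executeQuery("SELECT id, amount FROM sales")   // placeholder table
    df.show()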

The Hive Warehouse Connector is required for certain tasks in CDP Public Cloud. Low-latency analytical processing (LLAP) is recommended for reading ACID and other Hive-managed tables from Spark; you do not need LLAP to write to them. CDP Data Center does not support LLAP, so there you use JDBC mode with HWC instead. JDBC mode works well only for small datasets.

You do not need LLAP to access external tables from Spark, with the caveats shown in Table 1. To write data, the HWC library internally uses the Hive Streaming API and LOAD DATA Hive commands, so writes do not depend on LLAP either.
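As a sketch of the write path, the following Scala snippet appends the DataFrame df from the earlier Public Cloud example to a hypothetical managed table sales_copy; HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR is the data source name constant that the HWC API exposes.

    import com.hortonworks.hwc.HiveWarehouseSession

    // Write df to a Hive-managed (ACID) table through HWC. The write
    // goes through the Hive Streaming API / LOAD DATA, not LLAP.
    df.write
      .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
      .option("table", "sales_copy")   // hypothetical target table
      .save()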

Table 1. Spark Compatibility: CDP Public Cloud

  Tasks                                  | HWC Required | Recommended HWC Mode                        | Other Requirement/Comments
  ---------------------------------------|--------------|---------------------------------------------|---------------------------
  Read Hive managed tables from Spark    | Yes          | LLAP mode=true                              | Ranger ACLs enforced.*
  Write Hive managed tables from Spark   | Yes          | N/A                                         | Ranger ACLs enforced.*
  Read Hive external tables from Spark   | No           | N/A unless HWC is used, then LLAP mode=true | Ranger ACLs not enforced.
  Write Hive external tables from Spark  | No           | N/A                                         | Ranger ACLs enforced.

* Ranger column-level security or column masking is supported for each access pattern when you use HWC.

Configuring the HWC connection in CDP Data Center

In CDP Data Center, you configure the spark.datasource.hive.warehouse.read.jdbc.mode property as described below. The following software and property settings are required to connect Spark and Hive using the HiveWarehouseConnector library:

  • Spark 2.4.x
  • Hive JDBC database driver

    Download the Hive JDBC database driver from the Cloudera Downloads page.

  • Set spark.datasource.hive.warehouse.read.jdbc.mode=cluster (recommended). Alternatively, set this property to client if you expect the result set to fit in memory, as shown in the sketch after this list.
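The following Scala sketch shows the recommended CDP Data Center configuration together with a read through the connector. The table name sales is a placeholder, and the HiveServer2 JDBC URL and other cluster-specific HWC settings are assumed to be configured elsewhere.

    import com.hortonworks.hwc.HiveWarehouseSession
    import org.apache.spark.sql.SparkSession

    // CDP Data Center: JDBC reads in cluster mode (recommended).
    val spark = SparkSession.builder()
      .appName("hwc-data-center-read")   // illustrative name
      .config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
      // Use "client" only when the result set fits in driver memory:
      //.config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")
      .getOrCreate()

    val hive = HiveWarehouseSession.session(spark).build()
    hive.executeQuery("SELECT COUNT(*) FROM sales").show()   // placeholder table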

Table 2. Spark Compatibility: CDP Data Center

  Tasks                                  | HWC Required | Recommended HWC Mode | Other Requirement/Comments
  ---------------------------------------|--------------|----------------------|---------------------------
  Read Hive managed tables from Spark    | Yes          | JDBC mode=cluster    | Ranger ACLs enforced.*
  Write Hive managed tables from Spark   | Yes          | N/A                  | Ranger ACLs enforced.*
  Read Hive external tables from Spark   | No           | N/A                  | Ranger ACLs not enforced.
  Write Hive external tables from Spark  | No           | N/A                  | Ranger ACLs enforced.

* Ranger column-level security or column masking is supported for each access pattern when you use HWC.

Spark on a Kerberized YARN cluster in CDP Data Center

In Spark client mode on a Kerberized YARN cluster, set the spark.sql.hive.hiveserver2.jdbc.url.principal property. Its value must equal the value of hive.server2.authentication.kerberos.principal.

In Spark cluster mode on a Kerberized YARN cluster, set the following property (see the sketch after this list):
  • Property: spark.security.credentials.hiveserver2.enabled
  • Description: Enables the Spark ServiceCredentialProvider; set to a boolean value, such as true
  • Comment: true by default
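The following Scala sketch sets both Kerberos-related properties at session creation. The principal value is a placeholder; it must match your cluster's hive.server2.authentication.kerberos.principal.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Client mode: the principal in the HiveServer2 JDBC URL must equal
      // hive.server2.authentication.kerberos.principal.
      .config("spark.sql.hive.hiveserver2.jdbc.url.principal",
              "hive/_HOST@EXAMPLE.COM")   // placeholder principal
      // Cluster mode: let Spark's ServiceCredentialProvider obtain
      // HiveServer2 delegation tokens (true by default).
      .config("spark.security.credentials.hiveserver2.enabled", "true")
      .getOrCreate()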