Configuring HWC in CDP Data Center

In CDP Data Center, HWC processes data through the Hive JDBC driver. You must configure processing based on your use case.

The HWC runs in the following JDBC modes:
  • cluster
  • client
You can choose cluster processing (recommended) or in-memory client processing, depending on the size of your resultset and your latency requirements. Cluster processing, described later in this topic, is recommended for ingesting large data sets.
Configure the JDBC mode for in-memory (client) processing only if the resultset fits in memory. HWC stores the entire resultset in an in-memory cache for fast processing, and the capacity of that cache is limited by the memory available to the Spark driver/client application.
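For example, the mode can be selected at launch time with a single property, described in the next section. The option below is a sketch, not a complete command line:

    spark-shell --conf spark.datasource.hive.warehouse.read.jdbc.mode=cluster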

Configuring HWC reads in JDBC mode

The following property settings are required for connecting Spark and Hive using the HiveWarehouseConnector library.

Configure these properties in configuration/spark-defaults.conf, or set them using the spark-submit/spark-shell --conf option. A combined example follows the list.
  • spark.datasource.hive.warehouse.read.via.llap

    Because LLAP is not supported in this release, you must run HWC in JDBC mode. Set this property to false.

  • spark.datasource.hive.warehouse.read.jdbc.mode

    Configures the JDBC mode. Values: cluster (recommended) or client (if the resultset will fit in memory)

  • spark.sql.hive.hiveserver2.jdbc.url

    The Hive JDBC URL, as specified in /etc/hive/conf/beeline-site.xml.

  • spark.datasource.hive.warehouse.metastoreUri

    URI of Hive metastore. In Cloudera Manager, click Clusters > Hive-1 > Configuration, search for hive.metastore.uris, and use that value.

  • spark.datasource.hive.warehouse.load.staging.dir

    Temporary staging location required by HWC.

    Set the value to a file system location where the HWC user has write permission.
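Putting these properties together, a spark-defaults.conf sketch might look like the following. The host names, port, and staging path are placeholders for illustration only; substitute the values from beeline-site.xml and Cloudera Manager as described above.

    spark.datasource.hive.warehouse.read.via.llap        false
    spark.datasource.hive.warehouse.read.jdbc.mode       cluster
    spark.sql.hive.hiveserver2.jdbc.url                  jdbc:hive2://<hiveserver2-host>:<port>/default
    spark.datasource.hive.warehouse.metastoreUri         thrift://<metastore-host>:9083
    spark.datasource.hive.warehouse.load.staging.dir     /tmp/hwc-staging
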
Table 1. Spark Compatibility

  Task                                      HWC Required          Recommended HWC Mode
  Read Hive managed tables from Spark       Yes                   JDBC mode=cluster
  Write Hive managed tables from Spark      Yes                   N/A
  Read Hive external tables from Spark      Ok, but unnecessary   N/A
  Write Hive external tables from Spark     Ok, but unnecessary   N/A
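With the properties above in place, a read of a Hive managed table from Spark goes through the HWC session API. The following Scala sketch assumes the HWC assembly jar is on the application classpath; the database and table names are hypothetical:

    import com.hortonworks.hwc.HiveWarehouseSession

    // Build an HWC session from the active SparkSession (spark)
    val hive = HiveWarehouseSession.session(spark).build()

    // Read a Hive managed table through HiveServer, using the configured JDBC mode
    val df = hive.executeQuery("SELECT * FROM sales_db.transactions")  // hypothetical database and table
    df.show()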

Some configuration is required for enforcing Ranger ACLs. For more information, see Accessing Hive tables in HMS from Spark.

Authorization of read/writes of external tables from Spark

If you use HWC, HiveServer authorizes external table drops during query compilation. If you do not use HWC, the Hive metastore (HMS) API, which is integrated with Ranger, authorizes external table access. In that case, the HMS API-Ranger integration enforces the Ranger Hive ACLs.

For information about the authorization of external tables, see the section HMS Security (link below).

Spark on a Kerberized YARN cluster

For Spark applications on a kerberized YARN cluster, set the following property: spark.sql.hive.hiveserver2.jdbc.url.principal. This property must be set to the value of hive.server2.authentication.kerberos.principal.

In Spark cluster mode on a kerberized YARN cluster, also set the following property (see the example after this list):
  • Property: spark.security.credentials.hiveserver2.enabled
  • Description: Uses the Spark ServiceCredentialProvider; set to a boolean value, such as true
  • Comment: true by default
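
For example, a spark-submit invocation for cluster deploy mode on a kerberized cluster might include the following options. The principal shown is a placeholder; use the value of hive.server2.authentication.kerberos.principal from your cluster:

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@EXAMPLE.COM \
      --conf spark.security.credentials.hiveserver2.enabled=true \
      <application jar and other options>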