Configuring HWC in CDP Data Center

In CDP Data Center, HWC processes data through the Hive JDBC driver. You must configure processing based on your use case.

The HWC runs in the following JDBC modes:
  • cluster
  • client
You can choose cluster processing (recommended) or in-memory client processing, depending on the size of your resultset and your latency requirements. Cluster processing, described later in this topic, is recommended for ingesting large data sets.
Configure the JDBC mode for in-memory (client) processing only if the resultset fits in memory. HWC stores the entire resultset in an in-memory cache for fast processing, and the capacity of that cache is limited by the memory available to the Spark driver/client application.
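For example, the mode can be selected at launch time with a single property, described in the next section. The option below is a sketch, not a complete command line:

    spark-shell --conf spark.datasource.hive.warehouse.read.jdbc.mode=cluster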

Configuring HWC reads in JDBC mode

The following property settings are required for connecting Spark and Hive using the HiveWarehouseConnector library.

Configure these properties in configuration/spark-defaults.conf, or set them using the spark-submit/spark-shell --conf option. A combined example follows the list.
  • spark.datasource.hive.warehouse.read.via.llap

    Because LLAP is not supported in this release, you must run HWC in JDBC mode. Set this property to false.

  • spark.datasource.hive.warehouse.read.jdbc.mode

    Configures the JDBC mode. Values: cluster (recommended) or client (if the resultset will fit in memory)

  • spark.sql.hive.hiveserver2.jdbc.url

    The Hive JDBC URL, as specified in /etc/hive/conf/beeline-site.xml.

  • spark.datasource.hive.warehouse.metastoreUri

    URI of Hive metastore. In Cloudera Manager, click Clusters > Hive-1 > Configuration, search for hive.metastore.uris, and use that value.

  • spark.datasource.hive.warehouse.load.staging.dir

    Temporary staging location required by HWC.

    Set the value to a file system location where the HWC user has write permission.
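Putting these properties together, a spark-defaults.conf sketch might look like the following. The host names, port, and staging path are placeholders for illustration only; substitute the values from beeline-site.xml and Cloudera Manager as described above.

    spark.datasource.hive.warehouse.read.via.llap        false
    spark.datasource.hive.warehouse.read.jdbc.mode       cluster
    spark.sql.hive.hiveserver2.jdbc.url                  jdbc:hive2://<hiveserver2-host>:<port>/default
    spark.datasource.hive.warehouse.metastoreUri         thrift://<metastore-host>:9083
    spark.datasource.hive.warehouse.load.staging.dir     /tmp/hwc-staging
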
Table 1. Spark Compatibility

  Task                                      HWC Required          Recommended HWC Mode
  Read Hive managed tables from Spark       Yes                   JDBC mode=cluster
  Write Hive managed tables from Spark      Yes                   N/A
  Read Hive external tables from Spark      Ok, but unnecessary   N/A
  Write Hive external tables from Spark     Ok, but unnecessary   N/A
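With the properties above in place, a read of a Hive managed table from Spark goes through the HWC session API. The following Scala sketch assumes the HWC assembly jar is on the application classpath; the database and table names are hypothetical:

    import com.hortonworks.hwc.HiveWarehouseSession

    // Build an HWC session from the active SparkSession (spark)
    val hive = HiveWarehouseSession.session(spark).build()

    // Read a Hive managed table through HiveServer, using the configured JDBC mode
    val df = hive.executeQuery("SELECT * FROM sales_db.transactions")  // hypothetical database and table
    df.show()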

Some configuration is required for enforcing Ranger ACLs. For more information, see Accessing Hive tables in HMS from Spark.

Authorization of read/writes of external tables from Spark

If you use HWC, HiveServer authorizes external table drops during query compilation. If you do not use HWC, the Hive metastore (HMS) API, which is integrated with Ranger, authorizes external table access. In that case, the HMS API-Ranger integration enforces the Ranger Hive ACLs.

For information about the authorization of external tables, see the section HMS Security (link below).

Spark on a Kerberized YARN cluster

For Spark applications on a kerberized YARN cluster, set the following property: spark.sql.hive.hiveserver2.jdbc.url.principal. This property must be set to the value of hive.server2.authentication.kerberos.principal.

In Spark cluster mode on a kerberized YARN cluster, also set the following property (see the example after this list):
  • Property: spark.security.credentials.hiveserver2.enabled
  • Description: Uses the Spark ServiceCredentialProvider; set to a boolean value, such as true
  • Comment: true by default
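
For example, a spark-submit invocation for cluster deploy mode on a kerberized cluster might include the following options. The principal shown is a placeholder; use the value of hive.server2.authentication.kerberos.principal from your cluster:

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@EXAMPLE.COM \
      --conf spark.security.credentials.hiveserver2.enabled=true \
      <application jar and other options>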