Apache Spark-Apache Hive connection configuration

You need to understand the workflow and service changes involved in accessing ACID table data from Spark. You can configure Spark properties in Ambari for using the Hive Warehouse Connector.

Prerequisites

The Hive connection string must include a user name and password; otherwise, Spark and Hive cannot connect. For example:

jdbc:hive2://<host>:2181;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive;user=<user name>;password=<password>

You need to use the following software to connect Spark and Hive using the HiveWarehouseConnector library.

HDP 3.0 or later
Spark2

Hive with HiveServer Interactive (HSI)

The Hive Warehouse Connector (HWC) and low-latency analytical processing (LLAP) are required for certain tasks, as shown in the following table:

Table 1. Spark Compatibility
Tasks	HWC Required	LLAP Required	Other Requirement/Comments
Read Hive managed tables from Spark	Yes	Yes	Ranger ACLs enforced.
Write Hive managed tables from Spark	Yes	No	Ranger ACLs enforced. Supports ORC only.
Read Hive external tables from Spark	No	Only if HWC is used	Table must be defined in Spark catalog. Ranger ACLs not enforced.
Write Hive external tables from Spark	No	No	Ranger ACLs enforced.

You need low-latency analytical processing (LLAP) in HSI to read ACID, or other Hive-managed tables, from Spark. You do not need LLAP to write to ACID, or other managed tables, from Spark. The HWC library internally uses the Hive Streaming API and LOAD DATA Hive commands to write the data. You do not need LLAP to access external tables from Spark with caveats shown in the table above.

Required properties

You must add several Spark properties through spark-2-defaults in Ambari to use the Hive Warehouse Connector for accessing data in Hive. Alternatively, configuration can be provided for each job using hive --conf.


Property	Description	Comments
spark.sql.hive.hiveserver2.jdbc.url	URL for HiveServer2 Interactive	In Ambari, copy the value from Services > Hive > Summary > HIVESERVER2 INTERACTIVE JDBC URL.
spark.datasource.hive.warehouse.metastoreUri	URI for metastore	Copy the value from hive.metastore.uris. For example, `thrift://mycluster-1.com:9083`.
spark.datasource.hive.warehouse.load.staging.dir	HDFS temp directory for batch writes to Hive	For example, `/tmp.`
spark.hadoop.hive.llap.daemon.service.hosts	Application name for LLAP service	Copy value from Advanced hive-interactive-site > `hive.llap.daemon.service.hosts`.
spark.hadoop.hive.zookeeper.quorum	Zookeeper hosts used by LLAP	Copy value from Advanced hive-site`hive.zookeeper.quorum`.

Spark on a Kerberized YARN cluster

In Spark client mode on a kerberized Yarn cluster, set the following property: spark.sql.hive.hiveserver2.jdbc.url.principal. This property must be equal to hive.server2.authentication.kerberos.principal. In Ambari, copy the value for this property from Services > Hive > Configs > Advanced > Advanced hive-site hive.server2.authentication.kerberos.principal.