Apache Spark-Apache Hive connection configuration
You need to understand the workflow and service changes involved in accessing ACID table data from Spark. You can configure Spark properties in Ambari to use the Hive Warehouse Connector.
Prerequisites
The Hive connection string must include a user name and password; otherwise, Spark and Hive cannot connect. For example:
jdbc:hive2://<host>:2181;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive;user=<user name>;password=<password>
You need to use the following software to connect Spark and Hive using the HiveWarehouseConnector library.
- HDP 3.1.5
- Spark2
- Hive with HiveServer Interactive (HSI)
The Hive Warehouse Connector (HWC) and low-latency analytical processing (LLAP) are required for certain tasks, as shown in the following table:
Table 1. Spark Compatibility

| Tasks | HWC Required | LLAP Required | Other Requirement/Comments |
|---|---|---|---|
| Read Hive managed tables from Spark | Yes | Yes | Ranger ACLs enforced.* |
| Write Hive managed tables from Spark | Yes | No | Ranger ACLs enforced.* Supports ORC only. |
| Read Hive external tables from Spark | No | Only if HWC is used | Ranger ACLs not enforced. |
| Write Hive external tables from Spark | No | No | Ranger ACLs enforced. |

\* Ranger column level security or column masking is supported for each access pattern when you use HWC.
You need low-latency analytical processing (LLAP) in HSI to read ACID, or other Hive-managed tables, from Spark. You do not need LLAP to write to ACID, or other managed tables, from Spark. The HWC library internally uses the Hive Streaming API and LOAD DATA Hive commands to write the data. You do not need LLAP to access external tables from Spark with caveats shown in the table above.
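Because the read and write paths differ, a short illustration may help. The following is a minimal Scala sketch of both paths through the HWC API, assuming an existing SparkSession named spark, the HWC jar on the classpath, and hypothetical table names sales and sales_copy:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of an existing SparkSession.
val hive = HiveWarehouseSession.session(spark).build()

// Read path: queries against managed (ACID) tables run through
// HiveServer2 Interactive and require LLAP.
val df = hive.executeQuery("SELECT * FROM sales WHERE year = 2018")

// Write path: uses the Hive Streaming API and LOAD DATA internally,
// so LLAP is not required. Managed-table writes support ORC only.
df.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "sales_copy")
  .save()
```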
Required properties
Add the following properties to spark-2-defaults in Ambari to use the Hive Warehouse Connector for accessing data in Hive. Alternatively, configuration can be provided for each job using --conf; a programmatic sketch follows this list.
- spark.sql.hive.hiveserver2.jdbc.url: The URL for HiveServer2 Interactive
- spark.datasource.hive.warehouse.metastoreUri: The URI for the metastore
- spark.datasource.hive.warehouse.load.staging.dir: The HDFS temp directory for batch writes to Hive, for example /tmp
- spark.hadoop.hive.llap.daemon.service.hosts: The application name for the LLAP service
- spark.hadoop.hive.zookeeper.quorum: The ZooKeeper hosts used by LLAP
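As a sketch of the per-job alternative, the same properties can be supplied when the session is built. The host names and the LLAP application name @llap0 below are placeholders, not values from this document; substitute the values copied from Ambari as described next:

```scala
import org.apache.spark.sql.SparkSession

// All values below are placeholders; copy the real ones from Ambari
// as described in the steps that follow.
val spark = SparkSession.builder()
  .appName("hwc-example")
  .config("spark.sql.hive.hiveserver2.jdbc.url",
    "jdbc:hive2://mycluster-1.com:2181;serviceDiscoveryMode=zooKeeper;" +
      "zooKeeperNamespace=hiveserver2-interactive")
  .config("spark.datasource.hive.warehouse.metastoreUri",
    "thrift://mycluster-1.com:9083")
  .config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")
  .config("spark.hadoop.hive.llap.daemon.service.hosts", "@llap0")
  .config("spark.hadoop.hive.zookeeper.quorum",
    "mycluster-1.com:2181,mycluster-2.com:2181,mycluster-3.com:2181")
  .getOrCreate()
```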
Obtain the values for these properties as follows:
- spark.sql.hive.hiveserver2.jdbc.url: In Ambari, copy the value from Services > Hive > Summary > HIVESERVER2 INTERACTIVE JDBC URL.
- spark.datasource.hive.warehouse.metastoreUri: Copy the value of hive.metastore.uris. In Hive, at the hive> prompt, enter set hive.metastore.uris; and copy the output. For example, thrift://mycluster-1.com:9083.
- spark.hadoop.hive.llap.daemon.service.hosts: Copy the value from Advanced hive-interactive-site > hive.llap.daemon.service.hosts.
- spark.hadoop.hive.zookeeper.quorum: Copy the value from Advanced hive-site > hive.zookeeper.quorum.
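Once these values are in place, a quick smoke test can confirm that Spark reaches HiveServer2 Interactive. This is a minimal sketch, assuming the properties above are already set on the session:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// If the configuration is correct, listing databases succeeds
// without touching any table data.
val hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
```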
Optional HWC properties
- spark.datasource.hive.warehouse.write.path.strictColumnNamesMapping: Validates the mapping of columns against those in Hive to alert the user to input errors. Default = true.
- spark.sql.hive.conf.list: Propagates one or more configuration properties from the HWC to Hive. Set these properties on the command line using the --conf option, for example:
  --conf spark.sql.hive.conf.list="hive.vectorized.execution.filesink.arrow.native.enabled=true;hive.vectorized.execution.enabled=true"
  Do not attempt to set spark.sql.hive.conf.list programmatically.

Spark on a Kerberized YARN cluster
In Spark client mode on a Kerberized YARN cluster, set the following property:
- spark.sql.hive.hiveserver2.jdbc.url.principal: This property must be equal to hive.server2.authentication.kerberos.principal. In Ambari, copy the value for this property from Advanced hive-site > hive.server2.authentication.kerberos.principal.
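For illustration, a hedged sketch of this setting follows; the principal hive/_HOST@EXAMPLE.COM is a placeholder, not a value from this document:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder principal; copy the real value from
// hive.server2.authentication.kerberos.principal in Ambari.
val spark = SparkSession.builder()
  .appName("hwc-kerberized-example")
  .config("spark.sql.hive.hiveserver2.jdbc.url.principal",
    "hive/_HOST@EXAMPLE.COM")
  .getOrCreate()
```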