JDBC execution mode
You need to understand how JDBC mode interacts with Apache Hive components when you read Hive tables from Spark through HWC. Where your queries execute affects configuration. Understanding execution locations and the related recommendations helps you configure JDBC execution mode for your use case.
Component Interaction
JDBC mode creates only one JDBC connection to HiveServer (HS2) or HiveServer Interactive (HSI), which is a potential bottleneck in data transfer to Spark. The following diagram shows the interaction in JDBC mode with the Hive metastore (HMS), Tez, and HDFS.
HWC does not use JDBC mode during a write. HWC writes to an intermediate location from Spark, and then executes a LOAD DATA query to write the data. Using HWC to write data is recommended for production.
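As a concrete illustration, the following sketch shows an HWC write from spark-shell. It assumes the HWC library is on the classpath and the HWC session properties are already set; the DataFrame contents and the target table sample_table are hypothetical.

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the spark-shell SparkSession (spark).
val hive = HiveWarehouseSession.session(spark).build()

import spark.implicits._
val df = Seq((1, "alpha"), (2, "beta")).toDF("id", "name")  // hypothetical data

// The write path does not use JDBC mode: HWC stages the rows in the
// configured staging directory and then issues a LOAD DATA query.
df.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .mode("append")
  .option("table", "sample_table")  // hypothetical target table
  .save()
```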
Configuration
In JDBC mode, execution takes place in one of these locations (a read sketch follows this list):
- Client (Driver): Using the Hive JDBC URL, the Spark driver connects to Hive and executes the query. In client mode, failures to connect to HiveServer (HS2) are not retried.
- Cluster (Executor), recommended: Spark executors connect to Hive through JDBC and execute the query. In cluster mode, failures to connect to HS2 are retried automatically.

Authorization occurs on the server.
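The following is a minimal read sketch from spark-shell, assuming an HWC build in which executeQuery honors the configured JDBC read mode; sample_db.sample_table is a hypothetical table.

```scala
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// The query travels over the single JDBC connection to HS2. Whether it
// executes on the driver (client) or on the executors (cluster) is
// controlled by spark.datasource.hive.warehouse.read.jdbc.mode.
val df = hive.executeQuery("SELECT * FROM sample_db.sample_table")
df.show()
```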
JDBC mode is not recommended for production reads because performance is slow when reading large data sets. Where your queries execute affects the Kerberos configuration for HWC.
In spark-defaults.conf, or using the --conf option in spark-submit/spark-shell, set the following properties (a configuration sketch follows this list):
- Name: spark.datasource.hive.warehouse.read.jdbc.mode
- Value: client or cluster
- Name: spark.sql.hive.hiveserver2.jdbc.url
- Value: The JDBC URL of HiveServer (HS2).
- Name: spark.datasource.hive.warehouse.load.staging.dir
- Value: Temporary staging location required by HWC. Set the value to a file system location where the HWC user has write permission.
- Name: spark.hadoop.hive.zookeeper.quorum
- Value: The ZooKeeper quorum used for HiveServer (HS2) service discovery.
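The sketch below sets these properties programmatically when building the SparkSession; the JDBC URL, staging directory, and ZooKeeper hosts are hypothetical placeholders, and in practice you would typically pass the same values through --conf at launch.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical endpoints; substitute the values for your cluster.
val spark = SparkSession.builder()
  .appName("hwc-jdbc-example")
  .config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")
  .config("spark.sql.hive.hiveserver2.jdbc.url",
    "jdbc:hive2://hs2-host.example.com:10000/default")
  .config("spark.datasource.hive.warehouse.load.staging.dir",
    "/tmp/hwc-staging")
  .config("spark.hadoop.hive.zookeeper.quorum",
    "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181")
  .getOrCreate()
```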
JDBC Mode Limitations
- If you configured Auto Translate, run JDBC in cluster mode.
- JDBC mode, which is used for reads only, is not recommended for production because bottlenecks develop in data transfer to Spark.
- Writes through HWC do not use JDBC mode and are recommended for production.