Integrating Apache Hive with Apache Spark and BIPDF version

JDBC execution mode

You need to understand how JDBC mode interacts with Apache Hive components to read Hive tables from Spark through HWC. Where your queries are executed affects configuration. Understanding execution locations and recommendations help you configure JDBC execution mode for your use case.

JDBC mode creates only one JDBC connection to HiveServer (HS2) or HiveServer Interactive (HSI), a potential bottleneck in data transfer to Spark. The following diagram shows interaction in JDBC mode with Hive metastore (HMS), TEZ, and HDFS.

HWC does not use JDBC mode during a write. HWC writes to an intermediate location from Spark, and then executes a LOAD DATA query to write the data. Using HWC to write data is recommended for production.

In JDBC mode, execution takes place in these locations:

  • Driver: Using the Hive JDBC url, connects to Hive and executes the query on the driver side.
  • Cluster: From Spark executors, connects to Hive through JDBC and executes the query.

Authorization occurs on the server.

JDBC mode runs in the client or cluster:
  • Client (Driver)

    In client mode, any failures to connect to HiveServer (HS2) will not be retried.

  • Cluster (Executor)--recommended

    In cluster mode any failures to connect to HS2 will be retried automatically.

JDBC mode is recommended for production reads of workloads having a data size of 1 GB or less. Using larger workloads is not recommended due to slow performance when reading huge data sets. Where your queries are executed affects the Kerberos configurations for HWC.

In configuration/spark-defaults.conf, or using the --conf option in spark-submit/spark-shell set the following properties:

Name: spark.datasource.hive.warehouse.read.jdbc.mode
Value: client or cluster
Configures the driver location.
Name: spark.sql.hive.hiveserver2.jdbc.url
Value:
The JDBC endpoint for HiveServer. For more information, see the Apache Hive Wiki (link below). For Knox, provide the HiveServer, not Knox, endpoint.
Name: spark.datasource.hive.warehouse.load.staging.dir
Value: Temporary staging location required by HWC. Set the value to a file system location where the HWC user has write permission.
Name: spark.hadoop.hive.zookeeper.quorum
  • If you configured Auto Translate, run JDBC in cluster mode.
  • JDBC mode, which is used for reads only, recommended for production workloads having a data size of 1 GB or less. In larger workloads, bottlenecks develop in data transfer to Spark.

    Writes through HWC of any size are recommended for production. Writes do not use JDBC mode