Reading data through HWC

You can configure one of several HWC modes to read Apache Hive managed tables from Apache Spark. This topic describes the modes available for querying Hive from Spark and gives examples of how to configure them.

In this release, HWC configuration has been simplified.

Set the following configurations when you start the Spark shell:

  • spark.sql.extensions="com.hortonworks.spark.sql.rule.Extensions"
  • spark.datasource.hive.warehouse.read.mode=<mode>

    where <mode> is one of the following:

      • DIRECT_READER_V1 or DIRECT_READER_V2
      • JDBC_CLUSTER

You can transparently read with HWC in any of these modes using just spark.sql("<query>"). You can specify the mode in the spark-shell when you run Spark SQL commands to query Apache Hive tables from Apache Spark. You can also specify the mode in configuration/spark-defaults.conf, or by using the --conf option in spark-submit.
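
As a minimal sketch of passing the mode on the command line, the following shell function checks a mode name against the three values listed above and prints the two --conf options described in this topic. The function name hwc_read_conf is a hypothetical helper introduced only for illustration; it is not part of HWC or Spark.

```shell
#!/bin/sh
# Sketch: print the HWC read options for a given mode.
# hwc_read_conf is a hypothetical helper, not an HWC or Spark command.
hwc_read_conf() {
  mode="$1"
  # Accept only the modes named in this topic.
  case "$mode" in
    DIRECT_READER_V1|DIRECT_READER_V2|JDBC_CLUSTER) ;;
    *) echo "unsupported read mode: $mode" >&2; return 1 ;;
  esac
  # One --conf option per line, ready to add to a launch command.
  printf '%s\n' \
    "--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions" \
    "--conf spark.datasource.hive.warehouse.read.mode=$mode"
}

# Print the options for the JDBC_CLUSTER mode.
hwc_read_conf JDBC_CLUSTER
```

The printed options can be added to a spark-shell or spark-submit command line. In a properties file such as spark-defaults.conf, the same setting is written as a key and a value on one line, without the --conf prefix.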

For backward compatibility, configuring spark.datasource.hive.warehouse.read.mode is equivalent to setting the following deprecated configurations:

  • --conf spark.datasource.hive.warehouse.read.jdbc.mode //deprecated
  • --conf spark.sql.hive.hwc.execution.mode //deprecated
  • --conf spark.datasource.hive.warehouse.read.via.llap //deprecated

The old configurations are still supported for backward compatibility, but support for them will end in a later release, when spark.datasource.hive.warehouse.read.mode will replace them. When both old and new configurations are set, HWC gives precedence to the new ones.
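
Because the new setting takes precedence, a deprecated key left in a properties file can go unnoticed. The sketch below shows one possible way to find such leftovers: it searches a Spark properties file for the three deprecated key names listed above. The function name find_deprecated_hwc_keys and the sample file contents are assumptions made for illustration only.

```shell
#!/bin/sh
# Sketch: report deprecated HWC read settings in a Spark properties file.
# find_deprecated_hwc_keys is a hypothetical helper; the key names are the
# three deprecated configurations named in this topic.
find_deprecated_hwc_keys() {
  conf_file="$1"
  # -n prints the line number of each match; the key must start the line.
  grep -n -E '^[[:space:]]*(spark\.datasource\.hive\.warehouse\.read\.jdbc\.mode|spark\.sql\.hive\.hwc\.execution\.mode|spark\.datasource\.hive\.warehouse\.read\.via\.llap)([[:space:]=]|$)' "$conf_file"
}

# Example run against a throwaway file (contents are illustrative only).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
spark.sql.extensions com.hortonworks.spark.sql.rule.Extensions
spark.datasource.hive.warehouse.read.via.llap false
spark.datasource.hive.warehouse.read.mode DIRECT_READER_V2
EOF
find_deprecated_hwc_keys "$sample"
rm -f "$sample"
```

Each reported line is a candidate for removal once spark.datasource.hive.warehouse.read.mode is set.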

Example of configuring and reading a Hive managed table

Before you begin, set up Kerberos for HWC.
  1. Choose a read mode.
  2. Start the Spark session using the following configurations.
    For example, start the Spark session using Direct Reader and configure Kryo serialization:
    spark-shell --jars ./hive-warehouse-connector-assembly-<version>.jar \
    --master yarn \
    --conf spark.sql.extensions="com.hortonworks.spark.sql.rule.Extensions" \
    --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator \
    --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://hwc-2.hwc.root.hwx.site:2181/default;retries=5;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" \
    --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@ROOT.HWX.SITE \
    --conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2

    For example, start the Spark session using the JDBC_CLUSTER option:

    spark-shell --jars ./hive-warehouse-connector-assembly-<version>.jar \
    --master yarn \
    --conf spark.sql.extensions="com.hortonworks.spark.sql.rule.Extensions" \
    --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator \
    --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://hwc-2.hwc.root.hwx.site:2181/default;retries=5;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" \
    --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@ROOT.HWX.SITE \
    --conf spark.datasource.hive.warehouse.read.mode=JDBC_CLUSTER

    You must set the Direct Reader option before the Spark session starts, so include the configurations in the launch string.
  3. Read Apache Hive managed tables.
    For example:
    scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
    
    scala> hive.sql("select * from managedTable").show