A step-by-step procedure walks you through connecting to HiveServer2 (HS2) to perform batch writes from Spark, which is recommended for production. You configure HWC for the managed table write, launch the Spark session, and write ACID managed tables to Apache Hive.
-
From Data Hub, open a terminal window, start the Apache Spark session, and
include the URL for HiveServer2.
spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-<version>.jar \
--conf spark.sql.hive.hiveserver2.jdbc.url=<JDBC endpoint for HiveServer>
...
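The JDBC endpoint follows the standard HiveServer2 URL format. A minimal sketch of such a URL, with a hypothetical host, port, and database that you replace with values from your cluster:
jdbc:hive2://hs2-host.example.com:10000/default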
-
In the launch string, include a configuration property that sets the intermediate location HWC uses as a staging directory for the batch write. Example syntax:
...
--conf spark.datasource.hive.warehouse.load.staging.dir=<path to directory>
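Combining both steps, a complete launch string might look like the following sketch; the JDBC endpoint and staging path shown here are hypothetical placeholders that you replace with values from your cluster:
spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-<version>.jar \
--conf spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://hs2-host.example.com:10000/default \
--conf spark.datasource.hive.warehouse.load.staging.dir=/tmp/hwc-staging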
-
Write a Hive managed table.
For example, in Scala:
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
// Read source data and register it as a temporary view for the write below.
hive.setDatabase("tpcds_bin_partitioned_orc_1000")
val df = hive.executeQuery("select * from web_sales")
df.createOrReplaceTempView("web_sales")
// Generic pattern for writing any DataFrame to a managed table in batch:
// df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).save()
// Create the target managed table if it does not already exist.
hive.setDatabase("testDatabase")
hive.createTable("newTable")
  .ifNotExists()
  .column("ws_sold_time_sk", "bigint")
  .column("ws_ship_date_sk", "bigint")
  .create()
// Select from the temporary view and append the result to the managed table.
sql("SELECT ws_sold_time_sk, ws_ship_date_sk FROM web_sales WHERE ws_sold_time_sk > 80000")
  .write.format(HIVE_WAREHOUSE_CONNECTOR)
  .mode("append")
  .option("table", "newTable")
  .save()
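To confirm the write, you can read the table back through the same HWC session. This is a minimal sketch that reuses the hive session variable from the example above; the output depends on your data:
// Read the newly written rows back through HWC and display a sample.
hive.executeQuery("SELECT * FROM newTable").show(10)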