Writing data through HWC

This step-by-step procedure walks you through connecting to HiveServer2 (HS2) to perform batch writes from Spark, which is the recommended practice for production. You configure HWC for the managed table write, launch the Spark session, and write ACID managed tables to Apache Hive.

  • Accept the default spark.datasource.hive.warehouse.load.staging.dir for the temporary staging location required by HWC.
  • Check that spark.hadoop.hive.zookeeper.quorum is configured.
  • Set Kerberos configurations for HWC, or for an unsecured cluster, set spark.security.credentials.hiveserver2.enabled=false.
Limitation: Only the ORC format is supported for writes.
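
Before launching, the prerequisite settings above can be sanity-checked from an existing Spark session. A minimal sketch (illustrative only; run inside spark-shell, with the property names taken from the list above):

    // Illustrative check: confirm the prerequisite settings are visible
    // to the Spark session before attempting an HWC write.
    println(spark.conf.getOption("spark.hadoop.hive.zookeeper.quorum")
      .getOrElse("spark.hadoop.hive.zookeeper.quorum is NOT set"))
    println(spark.conf.getOption("spark.datasource.hive.warehouse.load.staging.dir")
      .getOrElse("no override set; HWC uses the default staging directory"))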

HWC does not use JDBC mode to write. Instead, HWC writes the data from Spark to an intermediate staging location, and then executes a LOAD DATA query to load it into the Hive table. Using HWC to write data is recommended for production in CDP.
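The LOAD DATA step mentioned above resembles the following HiveQL. This is an illustrative sketch only, not HWC's literal internal statement, and the staging path shown is hypothetical:

    -- Illustrative HiveQL: move staged files into the target managed table.
    -- The path corresponds to the staging directory configured for HWC.
    LOAD DATA INPATH '/tmp/hwc-staging/part-files' INTO TABLE newTable;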

  1. From Data Hub, open a terminal window, start the Apache Spark session, and include the URL for HiveServer2.
    spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-<version>.jar \
    --conf spark.sql.hive.hiveserver2.jdbc.url=<JDBC endpoint for HiveServer>
  2. In the launch string, include a configuration for the intermediate location to use as a staging directory.
    Example syntax:
    --conf spark.datasource.hive.warehouse.load.staging.dir=<path to directory>
  3. Write a Hive managed table.
    For example, in Scala:
    import com.hortonworks.hwc.HiveWarehouseSession
    import com.hortonworks.hwc.HiveWarehouseSession._
    val hive = HiveWarehouseSession.session(spark).build()
    hive.createTable("newTable")
    .ifNotExists()
    .column("ws_sold_time_sk", "bigint")
    .column("ws_ship_date_sk", "bigint")
    .create()
    sql("SELECT ws_sold_time_sk, ws_ship_date_sk FROM web_sales WHERE ws_sold_time_sk > 80000")
    .write.format(HIVE_WAREHOUSE_CONNECTOR)
    .mode("append")
    .option("table", "newTable")
    .save()
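
After the write completes, you can read the table back through the same HWC session to confirm the load. A minimal sketch, assuming the "newTable" example above:

    // Hypothetical follow-up: count the rows just written to newTable
    // using the same HiveWarehouseSession instance (hive).
    val result = hive.executeQuery("SELECT COUNT(*) FROM newTable")
    result.show()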