Managing streaming with Hive Warehouse Connector

Understand how Hive Warehouse Connector (HWC) uses the Hive Metastore (HMS) for transaction management and writes ORC files directly to Hive table locations without relying on HiveServer2.

Hive Warehouse Connector does not rely on HiveServer2 for streaming. Instead, it interacts with HMS for transaction management and writes ORC bucket files directly to the table's location.

The following example uses the DATAFRAME_TO_STREAM write format to write a static DataFrame (a non-streaming write) through the Hive Streaming API:
// DATAFRAME_TO_STREAM is a constant provided by the connector
// (import com.hortonworks.hwc.HiveWarehouseSession._)
myDF.write.format(DATAFRAME_TO_STREAM)
  .option("metastoreUri", "thrift://jkovacs-1.jkovacs.root.hwx.site:9083")
  .option("metastoreKrbPrincipal", "hive/_HOST@AD.HALXG.CLOUDERA.COM")
  .option("database", "default")
  .option("table", "hwctest")
  .save()
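For continuous writes from a Spark Structured Streaming source, the connector also provides the STREAM_TO_STREAM format. The sketch below is illustrative only: the socket source, table name, and checkpoint path are assumptions, not values from this document.

```scala
// Sketch of a structured streaming write through the Hive Streaming sink.
// Assumes: import com.hortonworks.hwc.HiveWarehouseSession._
// Source, metastore host, table, and checkpoint path are placeholders.
val lines = spark.readStream
  .format("socket")              // example source; replace with Kafka, etc.
  .option("host", "localhost")
  .option("port", 9999)
  .load()

lines.writeStream
  .format(STREAM_TO_STREAM)
  .option("metastoreUri", "thrift://<metastore-host>:9083")
  .option("database", "default")
  .option("table", "hwctest")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
```

The checkpointLocation option lets Spark track streaming progress so the query can recover after a failure.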

Important Notes:

  • Always pre-create the target Hive table before writing to it; Hive streaming ingest requires a transactional (ACID) ORC table.
  • Ensure that the Spark session user has read and write permissions on the table's file system location.
  • Verify that the metastoreUri option points to the correct Hive Metastore Thrift endpoint.
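The first note above can be illustrated with a DDL sketch. The table name hwctest matches the earlier example, but the columns and bucket count are illustrative assumptions:

```sql
-- Hypothetical DDL for the target table; column names and bucket count
-- are assumptions. Hive streaming ingest requires a transactional ORC table.
CREATE TABLE hwctest (
  id INT,
  value STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```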
By following these steps, you can use Hive Warehouse Connector to stream data efficiently into Hive tables from Spark.