Hive Warehouse Connector write modes

Learn about the supported Hive Warehouse Connector (HWC) write modes and their advantages, requirements, and limitations.

Hive Warehouse Connector write modes supported by Cloudera Data Engineering

HWC provides the following write modes, each suited to different scenarios:

  • HIVE_WAREHOUSE_CONNECTOR
  • DATAFRAME_TO_STREAM
  • STREAM_TO_STREAM
HIVE_WAREHOUSE_CONNECTOR

HWC requires a connection to HiveServer2 (HS2) to perform batch writes from Cloudera Data Engineering Spark sessions and jobs to Hive. HWC writes data to an intermediate staging location defined by the spark.datasource.hive.warehouse.load.staging.dir property, which is included in the Cloudera Data Engineering job or session configuration.

  • Requirements:
    • Write access to the file system and the temporary staging location of the table.
    • JDBC connection to HiveServer2.
  • Features:
    • Automatically creates tables if they do not exist.
    • Supports append and overwrite modes.
  • Configuration:

    Example Cloudera Data Engineering job or session configuration to use the HIVE_WAREHOUSE_CONNECTOR write mode and the SECURE_ACCESS read mode of HWC:

    spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
    spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
    spark.datasource.hive.warehouse.load.staging.dir=[***STAGING-DIR-PATH***]
    spark.datasource.hive.warehouse.read.mode=secure_access
    spark.security.credentials.hiveserver2.enabled=true
    spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
    spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
    spark.hadoop.secure.access.cache.disable=true
  • Example:
    [***SPARK-DATAFRAME***].write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR) \
        .mode("append") \
        .option("database", "[***DB***]") \
        .option("table", "[***TABLE***]") \
        .option("fileformat", "orc") \
        .save()
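The configuration block above can also be assembled programmatically, for example when defining Cloudera Data Engineering jobs through scripts or an API. The sketch below is illustrative only; the helper name and parameters are assumptions, while the property names and values come from the configuration above:

```python
def hwc_batch_write_conf(staging_dir, hs2_jdbc_url, realm):
    # Spark properties for the HIVE_WAREHOUSE_CONNECTOR write mode with
    # the SECURE_ACCESS read mode, mirroring the configuration block above.
    # The function name and parameters are assumptions for this sketch.
    return {
        "spark.sql.extensions": "com.hortonworks.spark.sql.rule.Extensions",
        "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
        "spark.datasource.hive.warehouse.load.staging.dir": staging_dir,
        "spark.datasource.hive.warehouse.read.mode": "secure_access",
        "spark.security.credentials.hiveserver2.enabled": "true",
        "spark.sql.hive.hiveserver2.jdbc.url": hs2_jdbc_url,
        "spark.sql.hive.hiveserver2.jdbc.url.principal": f"hive/_HOST@{realm}",
        "spark.hadoop.secure.access.cache.disable": "true",
    }

# Hypothetical values for illustration:
conf = hwc_batch_write_conf("s3a://bucket/tmp/hwc-staging",
                            "jdbc:hive2://hs2.example.com:10000/default",
                            "EXAMPLE.COM")
print(conf["spark.sql.hive.hiveserver2.jdbc.url.principal"])  # hive/_HOST@EXAMPLE.COM
```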
DATAFRAME_TO_STREAM
For streaming writes, HWC does not rely on HiveServer2. Instead, it interacts with the Hive Metastore (HMS) for transaction management and writes ORC bucket files directly to the table location.
  • Requirements:
    • Connection to HMS.
    • Write access to the file-system location of the table.
    • Pre-created tables; this mode does not create tables automatically.
  • Features:
    • Directly writes Optimized Row Columnar (ORC) bucket files to the table location.
  • Configuration:

    Example Cloudera Data Engineering job or session configuration to use the DATAFRAME_TO_STREAM write mode of HWC:

    spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
    spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
    spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
    spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
    spark.sql.warehouse.dir=[***HIVE-WAREHOUSE-DIR-S3-PATH***]
  • Example:
    [***SPARK-DATAFRAME***].write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM) \
        .mode("append") \
        .option("metastoreUri", "[***MetaStoreUri***]") \
        .option("metastoreKrbPrincipal", "[***Principal***]") \
        .option("database", "[***DB***]") \
        .option("table", "[***TABLE***]") \
        .option("fileformat", "orc") \
        .save()
STREAM_TO_STREAM
HWC supports streaming data from Spark into Hive tables, enabling real-time data ingestion.
  • Requirements:
    • Connection to Hive Metastore (HMS).
    • Write access to the file-system location of the table.
    • Pre-created tables; this mode does not create tables automatically.
  • Features:

    Similar to DATAFRAME_TO_STREAM, but optimized for streaming data.

  • Configuration:

    Example Cloudera Data Engineering job or session configuration to use the STREAM_TO_STREAM write mode of HWC:

    spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
    spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
    spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
    spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
    spark.sql.warehouse.dir=[***HIVE-WAREHOUSE-DIR-PATH***]
    spark.sql.streaming.checkpointLocation=[***STREAM-CHECKPOINT-DIR-PATH***]
  • Example:
    query = [***SPARK-STREAM-DATAFRAME***].writeStream \
        .format(HiveWarehouseSession().STREAM_TO_STREAM) \
        .outputMode("append") \
        .option("metastoreUri", "[***MetaStoreUri***]") \
        .option("database", "[***DB***]") \
        .option("table", "[***PRE-CREATED-TABLE***]") \
        .trigger(processingTime="1 seconds") \
        .start()
    query.awaitTermination()
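As the two configuration blocks show, DATAFRAME_TO_STREAM and STREAM_TO_STREAM share the same HMS-based properties; STREAM_TO_STREAM additionally needs a streaming checkpoint location. A small sketch that assembles either variant programmatically (the helper name and parameters are assumptions; the property names come from the blocks above):

```python
def hwc_streaming_conf(hms_host, zk_host, warehouse_dir, checkpoint_dir=None):
    # Shared Spark properties for the DATAFRAME_TO_STREAM and STREAM_TO_STREAM
    # write modes, mirroring the configuration blocks above. The helper name
    # and parameters are assumptions for this sketch.
    conf = {
        "spark.datasource.hive.warehouse.metastoreUri": f"thrift://{hms_host}:9083",
        "spark.hadoop.hive.zookeeper.quorum": f"{zk_host}:2181",
        "spark.sql.extensions": "com.hortonworks.spark.sql.rule.Extensions",
        "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
        "spark.sql.warehouse.dir": warehouse_dir,
    }
    # STREAM_TO_STREAM additionally requires a checkpoint location.
    if checkpoint_dir:
        conf["spark.sql.streaming.checkpointLocation"] = checkpoint_dir
    return conf

# Hypothetical values for illustration:
conf = hwc_streaming_conf("hms.example.com", "zk.example.com",
                          "s3a://bucket/warehouse",
                          checkpoint_dir="s3a://bucket/checkpoints/job1")
print(conf["spark.datasource.hive.warehouse.metastoreUri"])  # thrift://hms.example.com:9083
```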