Hive Warehouse Connector write modes

Learn about the supported Hive Warehouse Connector (HWC) write modes and their advantages, requirements, and limitations.

Hive Warehouse Connector write modes supported by Cloudera Data Engineering

HWC provides the following write modes, each suited to different scenarios:

  • HIVE_WAREHOUSE_CONNECTOR
  • DATAFRAME_TO_STREAM
  • STREAM_TO_STREAM
HIVE_WAREHOUSE_CONNECTOR

HWC requires a connection to HiveServer2 (HS2) to perform batch writes from Cloudera Data Engineering Spark sessions and jobs to Hive. HWC writes data to an intermediate staging location defined by the spark.datasource.hive.warehouse.load.staging.dir property, which is included in the Cloudera Data Engineering job or session configuration.

  • Requirements:
    • Write access to the file system and the temporary staging location of the table.
    • JDBC connection to HiveServer2.
  • Features:
    • Automatically creates tables if they do not exist.
    • Supports append and overwrite modes.
  • Configuration:

    Example Cloudera Data Engineering job or session configuration to use the HIVE_WAREHOUSE_CONNECTOR write mode and the SECURE_ACCESS read mode of HWC:

    spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
    spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
    spark.datasource.hive.warehouse.load.staging.dir=[***STAGING-DIR-PATH***]
    spark.datasource.hive.warehouse.read.mode=secure_access
    spark.security.credentials.hiveserver2.enabled=true
    spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
    spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
    spark.hadoop.secure.access.cache.disable=true
  • Example:
    [***SPARK-DATAFRAME***].write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR) \
        .mode("append") \
        .option("database", "[***DB***]") \
        .option("table", "[***TABLE***]") \
        .option("fileformat", "orc") \
        .save()
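The configuration block above can also be assembled programmatically, for example when defining Cloudera Data Engineering jobs through scripts or an API. The sketch below is illustrative only; the helper name and parameters are assumptions, while the property names and values come from the configuration above:

```python
def hwc_batch_write_conf(staging_dir, hs2_jdbc_url, realm):
    # Spark properties for the HIVE_WAREHOUSE_CONNECTOR write mode with
    # the SECURE_ACCESS read mode, mirroring the configuration block above.
    # The function name and parameters are assumptions for this sketch.
    return {
        "spark.sql.extensions": "com.hortonworks.spark.sql.rule.Extensions",
        "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
        "spark.datasource.hive.warehouse.load.staging.dir": staging_dir,
        "spark.datasource.hive.warehouse.read.mode": "secure_access",
        "spark.security.credentials.hiveserver2.enabled": "true",
        "spark.sql.hive.hiveserver2.jdbc.url": hs2_jdbc_url,
        "spark.sql.hive.hiveserver2.jdbc.url.principal": f"hive/_HOST@{realm}",
        "spark.hadoop.secure.access.cache.disable": "true",
    }

# Hypothetical values for illustration:
conf = hwc_batch_write_conf("s3a://bucket/tmp/hwc-staging",
                            "jdbc:hive2://hs2.example.com:10000/default",
                            "EXAMPLE.COM")
print(conf["spark.sql.hive.hiveserver2.jdbc.url.principal"])  # hive/_HOST@EXAMPLE.COM
```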
DATAFRAME_TO_STREAM
For streaming writes, HWC does not rely on HiveServer2. Instead, it interacts with the Hive Metastore (HMS) for transaction management and writes ORC bucket files directly to the table location.
  • Requirements:
    • Connection to HMS.
    • Write access to the file-system location of the table.
    • Pre-created tables; this mode does not create tables automatically.
  • Features:
    • Directly writes Optimized Row Columnar (ORC) bucket files to the table location.
  • Configuration:

    Example Cloudera Data Engineering job or session configuration to use the DATAFRAME_TO_STREAM write mode of HWC:

    spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
    spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
    spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
    spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
    spark.sql.warehouse.dir=[***HIVE-WAREHOUSE-DIR-S3-PATH***]
  • Example:
    [***SPARK-DATAFRAME***].write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM) \
        .mode("append") \
        .option("metastoreUri", "[***MetaStoreUri***]") \
        .option("metastoreKrbPrincipal", "[***Principal***]") \
        .option("database", "[***DB***]") \
        .option("table", "[***TABLE***]") \
        .option("fileformat", "orc") \
        .save()
STREAM_TO_STREAM
HWC supports streaming data from Spark into Hive tables, enabling real-time data ingestion.
  • Requirements:
    • Connection to Hive Metastore (HMS).
    • Write access to the file-system location of the table.
    • Pre-created tables; this mode does not create tables automatically.
  • Features:

    Similar to DATAFRAME_TO_STREAM, but optimized for streaming data.

  • Configuration:

    Example Cloudera Data Engineering job or session configuration to use the STREAM_TO_STREAM write mode of HWC:

    spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
    spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
    spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
    spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
    spark.sql.warehouse.dir=[***HIVE-WAREHOUSE-DIR-PATH***]
    spark.sql.streaming.checkpointLocation=[***STREAM-CHECKPOINT-DIR-PATH***]
  • Example:
    query = [***SPARK-STREAM-DATAFRAME***].writeStream \
        .format(HiveWarehouseSession().STREAM_TO_STREAM) \
        .outputMode("append") \
        .option("metastoreUri", "[***MetaStoreUri***]") \
        .option("database", "[***DB***]") \
        .option("table", "[***PRE-CREATED-TABLE***]") \
        .trigger(processingTime="1 seconds") \
        .start()
    query.awaitTermination()
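As the two configuration blocks show, DATAFRAME_TO_STREAM and STREAM_TO_STREAM share the same HMS-based properties; STREAM_TO_STREAM additionally needs a streaming checkpoint location. A small sketch that assembles either variant programmatically (the helper name and parameters are assumptions; the property names come from the blocks above):

```python
def hwc_streaming_conf(hms_host, zk_host, warehouse_dir, checkpoint_dir=None):
    # Shared Spark properties for the DATAFRAME_TO_STREAM and STREAM_TO_STREAM
    # write modes, mirroring the configuration blocks above. The helper name
    # and parameters are assumptions for this sketch.
    conf = {
        "spark.datasource.hive.warehouse.metastoreUri": f"thrift://{hms_host}:9083",
        "spark.hadoop.hive.zookeeper.quorum": f"{zk_host}:2181",
        "spark.sql.extensions": "com.hortonworks.spark.sql.rule.Extensions",
        "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
        "spark.sql.warehouse.dir": warehouse_dir,
    }
    # STREAM_TO_STREAM additionally requires a checkpoint location.
    if checkpoint_dir:
        conf["spark.sql.streaming.checkpointLocation"] = checkpoint_dir
    return conf

# Hypothetical values for illustration:
conf = hwc_streaming_conf("hms.example.com", "zk.example.com",
                          "s3a://bucket/warehouse",
                          checkpoint_dir="s3a://bucket/checkpoints/job1")
print(conf["spark.datasource.hive.warehouse.metastoreUri"])  # thrift://hms.example.com:9083
```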