Hive Warehouse Connector write modes
Learn about the supported Hive Warehouse Connector (HWC) write modes and their advantages, requirements, and limitations.
Hive Warehouse Connector write modes supported by Cloudera Data Engineering
HWC provides the following write modes, each suited to different scenarios:
- HIVE_WAREHOUSE_CONNECTOR
- DATAFRAME_TO_STREAM
- STREAM_TO_STREAM
- HIVE_WAREHOUSE_CONNECTOR
HWC requires a connection to HiveServer2 (HS2) to perform batch writes from Cloudera Data Engineering Spark sessions and jobs to Hive. HWC writes to an intermediate staging location defined by the spark.datasource.hive.warehouse.load.staging.dir configuration. This configuration is included in the Cloudera Data Engineering job or session configuration.
- Requirements:
- Access to the file system and a temporary staging location for the table.
- JDBC connection to HiveServer2.
- Features:
- Automatically creates tables if they do not exist.
- Supports append and overwrite modes.
- Configuration
Example Cloudera Data Engineering job or session configuration to use the HIVE_WAREHOUSE_CONNECTOR write mode and the SECURE_ACCESS read mode of HWC:
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.load.staging.dir=[***STAGING-DIR-PATH***]
spark.datasource.hive.warehouse.read.mode=secure_access
spark.security.credentials.hiveserver2.enabled=true
spark.sql.hive.hiveserver2.jdbc.url=[***HS2-URL***]
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@[***REALM***]
spark.hadoop.secure.access.cache.disable=true
- Example:
[***SPARK-DATAFRAME***].write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR) \
  .mode("append") \
  .option("database", db) \
  .option("table", tname) \
  .option("fileformat", "orc") \
  .save()
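If you submit the job from the command line rather than through the Cloudera Data Engineering UI or API, each key=value pair above becomes a --conf argument. A minimal sketch of a hypothetical helper (build_spark_conf_flags is not part of HWC or Spark) that renders such a mapping into spark-submit flags; the bracketed placeholders must be replaced with real values for your cluster:

```python
# Hypothetical helper: render the HIVE_WAREHOUSE_CONNECTOR settings shown
# above into spark-submit --conf flags. Keys mirror the example configuration;
# placeholder values are illustrative only.

def build_spark_conf_flags(settings):
    """Turn a dict of Spark settings into a flat list of --conf arguments."""
    flags = []
    for key, value in sorted(settings.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags

hwc_batch_settings = {
    "spark.sql.extensions": "com.hortonworks.spark.sql.rule.Extensions",
    "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
    "spark.datasource.hive.warehouse.load.staging.dir": "[***STAGING-DIR-PATH***]",
    "spark.datasource.hive.warehouse.read.mode": "secure_access",
    "spark.security.credentials.hiveserver2.enabled": "true",
    "spark.sql.hive.hiveserver2.jdbc.url": "[***HS2-URL***]",
    "spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@[***REALM***]",
    "spark.hadoop.secure.access.cache.disable": "true",
}

flags = build_spark_conf_flags(hwc_batch_settings)
```

The same mapping can be pasted into a Cloudera Data Engineering job or session configuration directly; the helper only matters for command-line submission.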
- DATAFRAME_TO_STREAM
For streaming, HWC does not rely on HiveServer2; instead, it interacts with Hive Metastore (HMS) for transaction management and writes ORC bucket files directly to the table location.
- Requirements:
- Connection to HMS.
- Write access to the file-system location of the table.
- Requires pre-created tables.
- Features:
- Directly writes Optimized Row Columnar (ORC) bucket files to the table location.
- Configuration
Example Cloudera Data Engineering job or session configuration to use the DATAFRAME_TO_STREAM write mode of HWC:
spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.sql.warehouse.dir=[***HIVE-WAREHOUSE-DIR-S3-PATH***]
- Example:
[***SPARK-DATAFRAME***].write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM) \
  .mode("append") \
  .option("metastoreUri", "[***MetaStoreUri***]") \
  .option("metastoreKrbPrincipal", "[***Principal***]") \
  .option("database", "[***DB***]") \
  .option("table", "[***TABLE***]") \
  .option("fileformat", "orc") \
  .save()
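Because this mode requires pre-created tables and writes ORC bucket files through Hive's transaction machinery, the target is typically a transactional (ACID) table stored as ORC. A minimal sketch of building such DDL; the database, table, and column names here are hypothetical and the exact properties depend on your Hive version and schema:

```python
# Sketch: assemble CREATE TABLE DDL for a pre-created transactional ORC table
# that a DATAFRAME_TO_STREAM write can target. Names below are hypothetical.

def transactional_orc_ddl(database, table, columns):
    """Render a CREATE TABLE statement for a full-ACID ORC table."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {database}.{table} ({cols}) "
        "STORED AS ORC "
        "TBLPROPERTIES ('transactional'='true')"
    )

ddl = transactional_orc_ddl(
    "demo_db", "sensor_events",
    [("event_id", "BIGINT"), ("reading", "DOUBLE")],
)
```

Run the resulting statement through Beeline or spark.sql() before submitting the write job, since HWC does not create the table for you in this mode.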
- STREAM_TO_STREAM
HWC supports streaming data from Spark into Hive tables, enabling real-time data ingestion.
- Requirements:
- Connection to Hive Metastore (HMS).
- Write access to the file-system location of the table.
- Requires pre-created tables.
- Features:
- Similar to DATAFRAME_TO_STREAM, but optimized for streaming data.
- Configuration
Example Cloudera Data Engineering job or session configuration to use the STREAM_TO_STREAM write mode of HWC:
spark.datasource.hive.warehouse.metastoreUri=thrift://[***HMS-HOST***]:9083
spark.hadoop.hive.zookeeper.quorum=[***ZK-HOST***]:2181
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.sql.warehouse.dir=[***HIVE-WAREHOUSE-DIR-PATH***]
spark.sql.streaming.checkpointLocation=[***STREAM-CHECKPOINT-DIR-PATH***]
- Example:
query = [***SPARK-STREAM-DATAFRAME***].writeStream \
  .format(HiveWarehouseSession().STREAM_TO_STREAM) \
  .outputMode("append") \
  .option("metastoreUri", "[***MetaStoreUri***]") \
  .option("database", "[***DB***]") \
  .option("table", "[***PRE-CREATED-TABLE***]") \
  .trigger(processingTime='1 seconds') \
  .start()
query.awaitTermination()
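A streaming write fails at runtime if a required writer option is missing, and Structured Streaming also needs a checkpoint location (set above through spark.sql.streaming.checkpointLocation). A hedged sketch of a pre-flight check you might run before start(); validate_stream_write_options is a hypothetical helper, not an HWC API:

```python
# Hypothetical pre-flight validation for a STREAM_TO_STREAM write. It checks
# the writer option keys used in the example above plus the checkpoint
# location; HWC itself does not ship such a helper.

REQUIRED_WRITE_OPTIONS = ("metastoreUri", "database", "table")

def validate_stream_write_options(options, spark_conf):
    """Return a list of human-readable problems; empty means OK to start()."""
    problems = [f"missing writer option: {key}"
                for key in REQUIRED_WRITE_OPTIONS if not options.get(key)]
    if not spark_conf.get("spark.sql.streaming.checkpointLocation"):
        problems.append("spark.sql.streaming.checkpointLocation is not set")
    return problems

# Example: "table" and the checkpoint location are deliberately omitted.
problems = validate_stream_write_options(
    {"metastoreUri": "thrift://hms-host:9083", "database": "demo_db"},
    {"spark.sql.warehouse.dir": "/warehouse"},
)
```

Failing fast on missing options is cheaper than letting a long-running query die on its first micro-batch.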
