Hive Warehouse Connector streaming for transactional tables
Learn how to use Hive Warehouse Connector to stream data from Apache Spark into transactional Hive tables.
Running HWC Streaming
Hive Warehouse Connector supports streaming data from Apache Spark into Hive tables, enabling real-time data ingestion. However, the target Hive table must be transactional to use this feature. Use the steps below to set up and run streaming with Hive Warehouse Connector.
- Create the Hive Table
Before writing streaming data, pre-create the target Hive table. This table must be transactional and accessible to the Spark session. Example:
CREATE TABLE spark_rate_source(`timestamp` STRING, value BIGINT);
- Configure and Write Streaming Data from Spark
The following example demonstrates streaming data to the Hive table using Spark’s built-in rate source, which is ideal for testing.
- Spark streaming requires a checkpoint location to manage state.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/spark_checkpoint")
- Use Spark’s rate source to generate rows at a configurable rate.
val rateDF = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
- Use Hive Warehouse Connector’s streaming mode (STREAM_TO_STREAM) to write data directly to the Hive table. A sketch for verifying that rows arrive follows these steps.

import org.apache.spark.sql.streaming.Trigger
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._

val query = rateDF.writeStream
  .format(STREAM_TO_STREAM)
  .outputMode("append")
  .option("metastoreUri", "thrift://<metastore-host>:9083")
  .option("database", "default")
  .option("table", "spark_rate_source")
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .start()
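Once the query is running, you can confirm that rows are landing in the table. The following sketch is illustrative only: it assumes the HWC jar is on the classpath and that the Spark session is already configured with a HiveServer2 connection (for example, spark.sql.hive.hiveserver2.jdbc.url); adjust it to your environment.

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from the existing SparkSession (assumes the
// HiveServer2 JDBC URL and related settings are already configured).
val hive = HiveWarehouseSession.session(spark).build()

// Let the stream run for a while before checking.
query.awaitTermination(30000)  // wait up to 30 seconds

// Count the rows streamed into the transactional table so far.
hive.executeQuery("SELECT COUNT(*) FROM default.spark_rate_source").show()

// Stop the streaming query when you are done testing.
query.stop()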
Key Requirements for Streaming
- Transactional Tables: The Hive table must be transactional and pre-created (a DDL sketch follows this list).
- HMS Connection: Hive Warehouse Connector connects to the Hive Metastore Service to initiate transactions, obtain write IDs, and fetch file locations.
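Regarding the transactional-table requirement: in Hive 3, a managed ORC-backed table created as shown above is typically transactional (full ACID) by default. If your environment does not create ACID tables by default, the sketch below shows one way to declare the table explicitly transactional at creation time; the column names and types simply mirror the earlier example and may need adjusting for your data.

CREATE TABLE spark_rate_source(`timestamp` STRING, value BIGINT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');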