Hive Warehouse Connector streaming for transactional tables
Learn how to use Hive Warehouse Connector to stream data from Apache Spark into transactional Hive tables.
Running HWC Streaming
Hive Warehouse Connector supports streaming data from Apache Spark into Hive tables, enabling real-time data ingestion. However, the target Hive table must be transactional to use this feature. Use the steps below to set up and run streaming with Hive Warehouse Connector.
- Create the Hive Table
Before writing streaming data, pre-create the target Hive table. This table must be transactional and accessible to the Spark session. Example:
CREATE TABLE spark_rate_source(`timestamp` STRING, value BIGINT);
- Configure and Write Streaming Data from Spark
The following example demonstrates streaming data to the Hive table using Spark’s built-in rate source, which is ideal for testing.
- Spark streaming requires a checkpoint location to manage state.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/spark_checkpoint")
- Use Spark’s rate source to generate rows at a configurable rate.
val rateDF = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
- Use Hive Warehouse Connector’s streaming mode (STREAM_TO_STREAM) to write data directly to the Hive table. A sketch for verifying that rows arrive follows these steps.

import org.apache.spark.sql.streaming.Trigger
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._

val query = rateDF.writeStream
  .format(STREAM_TO_STREAM)
  .outputMode("append")
  .option("metastoreUri", "thrift://<metastore-host>:9083")
  .option("database", "default")
  .option("table", "spark_rate_source")
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .start()
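Once the query is running, you can confirm that rows are landing in the table. The following sketch is illustrative only: it assumes the HWC jar is on the classpath and that the Spark session is already configured with a HiveServer2 connection (for example, spark.sql.hive.hiveserver2.jdbc.url); adjust it to your environment.

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from the existing SparkSession (assumes the
// HiveServer2 JDBC URL and related settings are already configured).
val hive = HiveWarehouseSession.session(spark).build()

// Let the stream run for a while before checking.
query.awaitTermination(30000)  // wait up to 30 seconds

// Count the rows streamed into the transactional table so far.
hive.executeQuery("SELECT COUNT(*) FROM default.spark_rate_source").show()

// Stop the streaming query when you are done testing.
query.stop()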
Key Requirements for Streaming
- Transactional Tables: The Hive table must be transactional and pre-created (a DDL sketch follows this list).
- HMS Connection: Hive Warehouse Connector connects to the Hive Metastore Service to initiate transactions, obtain write IDs, and fetch file locations.
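Regarding the transactional-table requirement: in Hive 3, a managed ORC-backed table created as shown above is typically transactional (full ACID) by default. If your environment does not create ACID tables by default, the sketch below shows one way to declare the table explicitly transactional at creation time; the column names and types simply mirror the earlier example and may need adjusting for your data.

CREATE TABLE spark_rate_source(`timestamp` STRING, value BIGINT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');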