Integrating Apache Hive with Apache Spark and BI

Writing data to Kafka

You can extract, transform, and load a Hive table to a Kafka topic for real-time streaming of a large volume of Hive data. You need some understanding of write semantics and the metadata columns required for writing data to Kafka.

The Hive-Kafka connector supports the following write semantics:

  • At least once (default)
  • Exactly once
At least once (default)
At least once is the most common write semantic used by streaming engines. The internal Kafka producer retries on errors. If a message still cannot be delivered, the exception is raised to the task level, which restarts the task and triggers more retries. The at least once semantic leads to one of the following outcomes:
  • If the job succeeds, each record is guaranteed to be delivered at least once.
  • If the job fails, some records might already have been delivered and others might not have been sent.

    In this case, you can retry the query, which eventually leads to the delivery of each record at least once.

Exactly once
With the exactly once semantic, the Hive job ensures that either every record is delivered exactly once, or nothing is delivered. This semantic works only with Kafka brokers that support the Transaction API (0.11.0.x or later). To use this semantic, you must set the table property "kafka.write.semantic"="EXACTLY_ONCE".
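
As a minimal sketch, the table property named above can be set on an existing Kafka-backed Hive table with a standard ALTER TABLE statement. The table name kafka_table is hypothetical; substitute the name of your own table.

```sql
-- kafka_table is a hypothetical Kafka-backed Hive table.
-- Switch its write semantic from the default (at least once) to exactly once.
ALTER TABLE kafka_table
  SET TBLPROPERTIES ("kafka.write.semantic"="EXACTLY_ONCE");
```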

In addition to the user row payload, the insert statement must include values for the following extra columns:

__key
You can set this metadata column to null, but a meaningful key value is recommended to avoid unbalanced partitions. Any binary value is valid.
__partition
Use null unless you want to route the record to a particular partition. Using a nonexistent partition value results in an error.
__offset
You cannot set this value, which is fixed at -1.
__timestamp
You can set this value to a meaningful timestamp, represented as the number of milliseconds since epoch. Alternatively, you can set this value to null or -1, in which case the Kafka broker strategy sets the timestamp.
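
An insert statement that follows the rules above might look like the sketch below. The table names kafka_table and source_table and the payload columns id and message are hypothetical. The sketch assumes that kafka_table is a Kafka-backed Hive table whose payload columns are id and message, followed by the four metadata columns.

```sql
-- Hypothetical names: kafka_table (Kafka-backed target), source_table (Hive source),
-- and payload columns id and message.
INSERT INTO TABLE kafka_table
SELECT
  id,
  message,
  NULL AS `__key`,        -- null is allowed; a meaningful binary key balances partitions better
  NULL AS `__partition`,  -- null: do not route to a particular partition
  -1   AS `__offset`,     -- cannot be chosen; always -1
  -1   AS `__timestamp`   -- -1 (or null): the Kafka broker strategy sets the timestamp
FROM source_table;
```

Under the at least once semantic, rerunning this statement after a failed job can deliver some records more than once.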