Writing data to Kafka
You can extract, transform, and load a Hive table to a Kafka topic for real-time streaming of a large volume of Hive data. You need some understanding of write semantics and the metadata columns required for writing data to Kafka.
Write semantics
The Hive-Kafka connector supports two write semantics: at least once (the default) and exactly once.
- At least once (default)
- The default semantic. At least once is the most common write semantic used by streaming engines. The internal Kafka producer retries on errors. If a message is not delivered, the exception is raised to the task level, which causes a restart and more retries. The at least once semantic leads to one of the following outcomes:
  - If the job succeeds, each record is guaranteed to be delivered at least once.
  - If the job fails, some of the records might be lost and some might not be sent. In this case, you can retry the query, which eventually leads to the delivery of each record at least once.
- Exactly once
- Following the exactly once semantic, the Hive job ensures that either every record is delivered exactly once, or nothing is delivered. You can use only Kafka brokers supporting the Transaction API (0.11.0.x or later). To use this semantic, you must set the table property "kafka.write.semantic"="EXACTLY_ONCE".
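As a minimal sketch, the exactly once table property could be set on an existing Kafka-backed table with an ALTER TABLE statement. The table name kafka_table is a placeholder used for illustration, not a name taken from this documentation.

```sql
-- kafka_table is a hypothetical Hive table backed by the Kafka storage handler.
-- Switch its write semantic from the default (at least once) to exactly once.
ALTER TABLE kafka_table
SET TBLPROPERTIES ("kafka.write.semantic"="EXACTLY_ONCE");
```

The Kafka brokers behind the table must support the Transaction API for this setting to work.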
Metadata columns
In addition to the user row payload, the insert statement must include values for the following extra columns:
- __key
- Although you can set the value of this metadata column to null, using a meaningful key value to avoid unbalanced partitions is recommended. Any binary value is valid.
- __partition
- Use null unless you want to route the record to a particular partition. Using a nonexistent partition value results in an error.
- __offset
- You cannot set this value, which is fixed at -1.
- __timestamp
- You can set this value to a meaningful timestamp, represented as the number of milliseconds since epoch. Optionally, you can set this value to null or -1, which means that the Kafka broker strategy sets the timestamp column.
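The following sketch shows how an insert statement might supply the four metadata columns after the row payload. All table and column names are assumptions for illustration: kafka_table is a Kafka-backed target declared with the payload columns page, user_name, and added, and src_table is an ordinary Hive table with the same columns.

```sql
-- Hypothetical names: kafka_table (Kafka-backed target), src_table (source Hive table),
-- and the payload columns page, user_name, added.
INSERT INTO TABLE kafka_table
SELECT
  `page`,
  `user_name`,
  `added`,
  CAST(`page` AS BINARY) AS `__key`,  -- meaningful key to help avoid unbalanced partitions
  null AS `__partition`,              -- no particular partition requested
  -1 AS `__offset`,                   -- fixed at -1; cannot be set
  -1 AS `__timestamp`                 -- let the Kafka broker strategy set the timestamp
FROM src_table;
```

Here the key is derived from a payload column so that records with the same page value share a key. To set an explicit timestamp instead, replace the final -1 with a value in milliseconds since epoch.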