Integrating Apache Hive with Kafka, Spark, and BI

Querying Kafka data

You can get useful information, including Kafka record metadata, from a table of Kafka data by using typical Hive queries.
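For example, the following query (a minimal sketch; the table name kafka_table is a placeholder for a Kafka-backed table, not something defined in this document) counts records per Kafka partition by grouping on one of the metadata columns described below:

  -- Count the records in each Kafka partition of a Kafka-backed Hive table
  SELECT `__partition`, COUNT(*) AS record_count
  FROM kafka_table
  GROUP BY `__partition`;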

Each Kafka record consists of a user payload key (byte[]) and value (byte[]), plus the following metadata fields:
  • Partition (int32)
  • Offset (int64)
  • Timestamp (int64)
The Hive row represents the dual composition of Kafka data:
  • The user payload serialized in the value byte array
  • The metadata: key byte array, partition, offset, and timestamp fields
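To make this composition concrete, the following is a sketch of a Kafka-backed table definition. The column names, topic, and broker address are placeholder assumptions; the storage handler class and table properties are those of the Hive Kafka storage handler:

  -- Declared columns map to fields serialized in the Kafka value byte array;
  -- Hive adds the __key, __partition, __offset, and __timestamp metadata
  -- columns automatically.
  CREATE EXTERNAL TABLE kafka_table (
    `timestamp` timestamp,
    `page` string,
    `added` int,
    `deleted` int
  )
  STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
  TBLPROPERTIES (
    'kafka.topic' = 'test-topic',
    'kafka.bootstrap.servers' = 'localhost:9092'
  );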

In the Hive representation of the Kafka record, the key byte array is called __key and is of type binary. You can cast __key at query time. Hive appends __key after the last column derived from the value byte array, and then appends the partition, offset, and timestamp metadata as columns named __partition, __offset, and __timestamp.
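For instance, a query such as the following (again using the hypothetical kafka_table) casts __key to a readable string and selects the metadata columns alongside it:

  -- Cast the binary record key to a string and read the record metadata
  SELECT CAST(`__key` AS STRING) AS record_key,
         `__partition`, `__offset`, `__timestamp`
  FROM kafka_table
  LIMIT 10;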