Integrating Apache Hive with Kafka, Spark, and BI

Apache Hive-Kafka integration

As an Apache Hive user, you can connect to, analyze, and transform data in Apache Kafka from Hive. You can offload data from Kafka to the Hive warehouse. Using Hive-Kafka integration, you can perform actions on real-time data and incorporate streamed data into your application.

You connect to Kafka data from Hive by creating an external table that maps to a Kafka topic. The table definition includes a reference to a Kafka storage handler that connects to Kafka. On the external table, Hive-Kafka integration supports ad hoc queries, such as questions about data changes in the stream over a period of time. You can transform Kafka data in the following ways:
  • Perform data masking
  • Join the stream with dimension tables or with any other stream
  • Aggregate data
  • Change the SerDe encoding of the original stream
  • Create a persistent stream in a Kafka topic
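As a minimal sketch of the mapping described above, the following HiveQL creates an external table over a Kafka topic and then runs an ad hoc query on recent records. The table name, column names, topic name, and broker address are hypothetical placeholders, not values from this guide. The storage handler class and the `kafka.topic` and `kafka.bootstrap.servers` table properties come from the Apache Hive Kafka storage handler, which also exposes Kafka metadata columns such as `__partition`, `__offset`, and `__timestamp` (milliseconds since the epoch).

```sql
-- Hypothetical table, columns, topic, and broker address; adjust for your cluster.
CREATE EXTERNAL TABLE kafka_events (
  `user_id` STRING,
  `page`    STRING,
  `bytes`   BIGINT
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "events-topic",
  "kafka.bootstrap.servers" = "localhost:9092"
);

-- Ad hoc query: count and total bytes per page for records
-- that arrived in the last 10 minutes.
SELECT `page`, COUNT(*) AS hits, SUM(`bytes`) AS total_bytes
FROM kafka_events
WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '10' MINUTES)
GROUP BY `page`;
```

Filtering on `__timestamp` lets the storage handler narrow the range of the topic it scans, so a query over a short time window does not have to read the whole stream.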
You offload data from Kafka to Hive by controlling the position in the stream from which data is read. The Hive-Kafka connector supports the following serialization and deserialization formats:
  • JsonSerDe (default)
  • OpenCSVSerde
  • AvroSerDe
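The following sketch shows one way to select a non-default format and to offload a slice of the stream into the Hive warehouse. It assumes a hypothetical external Kafka table named `kafka_events` with `user_id` and `page` columns and a hypothetical managed table named `events_offload`. The `kafka.serde.class` table property belongs to the Apache Hive Kafka storage handler. The time filter is only an illustration of limiting the stream position that is read; a production offload would typically also track the last offset loaded for each partition.

```sql
-- Switch the (hypothetical) Kafka table from the default JsonSerDe to Avro.
ALTER TABLE kafka_events
SET TBLPROPERTIES (
  "kafka.serde.class" = "org.apache.hadoop.hive.serde2.avro.AvroSerDe"
);

-- Hypothetical managed table that receives the offloaded records.
CREATE TABLE events_offload (
  `user_id`         STRING,
  `page`            STRING,
  `kafka_partition` INT,
  `kafka_offset`    BIGINT,
  `kafka_ts`        BIGINT
);

-- Offload the last hour of the stream, keeping the Kafka metadata
-- so that later loads can resume from a known position.
INSERT INTO TABLE events_offload
SELECT `user_id`, `page`, `__partition`, `__offset`, `__timestamp`
FROM kafka_events
WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '1' HOURS);
```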