Apache Flume Guide

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Flume is event-driven and typically handles unstructured or semi-structured data that arrives continuously. It transfers data into CDH components such as HDFS, Apache Spark, Apache HBase, and Cloudera Search.
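At its core, a Flume agent moves events from a source, through a channel that buffers them, to a sink that delivers them. The following minimal configuration is a sketch of that pipeline, wiring a netcat source to an HDFS sink through an in-memory channel; the agent and component names, the listening port, and the HDFS path are placeholders, not values prescribed by this guide.

    # Minimal Flume agent: netcat source -> memory channel -> HDFS sink.
    # Agent/component names, port, and HDFS path are placeholders.
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Listen for newline-delimited events on a TCP port.
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = 0.0.0.0
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # Buffer events in memory: fast, but events are lost if the agent
    # stops. Use a file channel where durability matters.
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000
    agent1.channels.ch1.transactionCapacity = 1000

    # Write events into date-bucketed directories in HDFS.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

An agent reads such a file at startup, for example:

    flume-ng agent --name agent1 --conf ./conf --conf-file example.conf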

Flume is similar in some ways to Apache Kafka, although architectural differences often make one more suitable than the other for a particular use case. For example, Flume uses a push model in which each data source is tightly coupled to its destination (sink), which makes it a good fit for multi-stage delivery pipelines. Flume can also consume data from Kafka through the KafkaSource class and publish data to Kafka using the KafkaSink class or the Kafka channel. Flume is closely integrated with Hadoop, while Kafka is a more general-purpose publish-subscribe system that can run on Hadoop or other kinds of systems.
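To make the Kafka integration concrete, here is a sketch of the relevant properties, assuming a recent Flume release (1.7 or later, which uses the kafka.bootstrap.servers style of configuration); the broker addresses, topic names, and consumer group id are placeholders. The source consumes from one Kafka topic and the sink republishes to another, with a durable file channel in between.

    # Kafka source -> file channel -> Kafka sink. Brokers, topics, and
    # group id below are placeholder values.
    agent1.sources  = kafkaSrc
    agent1.channels = ch1
    agent1.sinks    = kafkaSink

    # Consume events from a Kafka topic.
    agent1.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
    agent1.sources.kafkaSrc.kafka.bootstrap.servers = broker1:9092,broker2:9092
    agent1.sources.kafkaSrc.kafka.topics = ingest-raw
    agent1.sources.kafkaSrc.kafka.consumer.group.id = flume-agent1
    agent1.sources.kafkaSrc.channels = ch1

    # File channel persists buffered events to local disk.
    agent1.channels.ch1.type = file

    # Publish events to another Kafka topic.
    agent1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
    agent1.sinks.kafkaSink.kafka.bootstrap.servers = broker1:9092,broker2:9092
    agent1.sinks.kafkaSink.kafka.topic = ingest-processed
    agent1.sinks.kafkaSink.channel = ch1

The Kafka channel (org.apache.flume.channel.kafka.KafkaChannel) can also replace the file channel here, using a Kafka topic itself as the buffer between source and sink.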

Flume is often contrasted with Apache Sqoop because their use cases are distinct: Flume handles unstructured or semi-structured data that arrives continuously in small batches, such as log records or event messages, while Sqoop performs occasional bulk transfers of structured data from sources such as relational databases.
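For contrast, a typical Sqoop transfer is a one-shot bulk import driven from the command line; the JDBC URL, table name, and target directory below are hypothetical:

    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --table orders \
      --target-dir /data/orders \
      -m 4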

The Apache Flume Guide describes how to configure and manage Apache Flume.