1. Understanding Falcon

Falcon manages data movement and processing through the concept of pipelines. A pipeline combines the datasets and the processes that operate on them across your cluster.

Each pipeline consists of XML specifications, called entities. Together, these entities define how data is loaded, cleaned, and processed.

There are three types of Falcon entities:

  • Cluster: Defines the Hadoop cluster where data is stored and processes run.

  • Feed: Defines the datasets to be cleaned and processed.

  • Process: Consumes feeds, invokes processing logic, and produces further feeds. A process defines the configuration of the Oozie workflow and specifies when and how often that workflow should run. It also allows for late data handling. Minimal sketches of these entities appear below.
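
To make the entity types concrete, here is a minimal, illustrative sketch of a cluster and a feed specification. Every name, endpoint, and path in it (primary-cluster, raw-clicks, the interface hosts, the HDFS locations) is a hypothetical placeholder, not a value from this document; a real specification must match your cluster and validate against the Falcon entity schemas.

  <!-- Cluster entity: where data lives and work runs (illustrative values) -->
  <cluster name="primary-cluster" description="Primary Hadoop cluster"
           colo="datacenter-1" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
      <!-- Endpoints Falcon uses to read, write, execute, and message -->
      <interface type="readonly"  endpoint="hftp://namenode:50070"            version="2.2.0"/>
      <interface type="write"     endpoint="hdfs://namenode:8020"             version="2.2.0"/>
      <interface type="execute"   endpoint="resourcemanager:8050"             version="2.2.0"/>
      <interface type="workflow"  endpoint="http://oozie:11000/oozie/"        version="4.0.0"/>
      <interface type="messaging" endpoint="tcp://activemq:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
      <!-- HDFS paths reserved for Falcon's own staging and working files -->
      <location name="staging" path="/apps/falcon/primary-cluster/staging"/>
      <location name="temp"    path="/tmp"/>
      <location name="working" path="/apps/falcon/primary-cluster/working"/>
    </locations>
  </cluster>

  <!-- Feed entity: an hourly dataset with retention and late-arrival policies -->
  <feed name="raw-clicks" description="Hourly raw clickstream data"
        xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <!-- Data arriving up to four hours late is still accepted -->
    <late-arrival cut-off="hours(4)"/>
    <clusters>
      <cluster name="primary-cluster" type="source">
        <validity start="2015-01-01T00:00Z" end="2017-01-01T00:00Z"/>
        <!-- Retention policy: delete instances older than 90 days -->
        <retention limit="days(90)" action="delete"/>
      </cluster>
    </clusters>
    <locations>
      <!-- One directory per hourly instance, resolved from the schedule time -->
      <location type="data" path="/data/raw-clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
  </feed>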

Each entity is defined separately and then linked together to form a data pipeline. Falcon provides predefined policies for data replication, retention, and late data handling. These sample policies are easily customized to suit your needs. A process entity that links the sketched cluster and feed, including a late data handling policy, is shown below.
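
Continuing the illustrative sketch above, a process entity might link the hypothetical primary-cluster and raw-clicks entities, point at an Oozie workflow, and declare a late data handling policy. Again, every name and path here is an assumed placeholder, including the clean-clicks output feed, which would need its own feed entity:

  <!-- Process entity: links the cluster and feed above and runs an Oozie workflow -->
  <process name="clean-clicks" xmlns="uri:falcon:process:0.1">
    <clusters>
      <cluster name="primary-cluster">
        <validity start="2015-01-01T00:00Z" end="2017-01-01T00:00Z"/>
      </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
      <!-- Consume the raw-clicks feed instance for the current hour -->
      <input name="input" feed="raw-clicks" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
      <!-- Produce an instance of the hypothetical clean-clicks feed -->
      <output name="output" feed="clean-clicks" instance="now(0,0)"/>
    </outputs>
    <!-- The workflow Falcon schedules; it becomes an Oozie coordinator job -->
    <workflow engine="oozie" path="/apps/falcon/workflows/clean-clicks"/>
    <!-- Late data handling: reprocess with exponential backoff if input arrives late -->
    <late-process policy="exp-backoff" delay="hours(1)">
      <late-input input="input" workflow-path="/apps/falcon/workflows/clean-clicks-late"/>
    </late-process>
  </process>

Because the process references its cluster and feed only by name, the same cluster and feed definitions can be reused by any number of processes.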

Once defined, these entities can be reused many times to define data management policies for Oozie jobs, Pig scripts, and Hive queries. For example, Falcon data management policies become Oozie coordinator jobs:

Figure 1.1. falc2flow.png