Getting Started with CDE Airflow

Apache Airflow is a platform to author, schedule and execute Data Engineering pipelines. It is widely used to create dynamic and robust workflows for batch Data Engineering use cases because of its flexibility and ease of use.

CDE embeds Apache Airflow at the CDE Virtual Cluster level. It is automatically deployed for the CDE user during CDE Virtual Cluster creation and requires no maintenance on the part of the CDE Admin.

Learning about Airflow CDE jobs

CDE jobs of type Airflow correspond to Airflow DAGs. Just like Spark CDE jobs, there are three ways to build an Airflow CDE job:

  • Using the CDE web interface. For more information, see Running Jobs in Cloudera Data Engineering.
  • Using the CDE CLI tool; a short command sketch follows this list. For more information, see Using the Cloudera Data Engineering command line interface.
  • Using CDE REST API endpoints. For more information, see CDE API Jobs.
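
As an illustration of the CLI option, the following is a minimal sketch of creating an Airflow CDE job from a local DAG file. The resource name, job name, file name, and exact flags are illustrative and may vary by CDE CLI version; see the CLI documentation linked above for the authoritative syntax.

    # Create a CDE resource and upload the DAG file to it (names are hypothetical)
    cde resource create --name my-dag-resource
    cde resource upload --name my-dag-resource --local-path my_dag.py

    # Create and run an Airflow CDE job backed by that DAG file
    cde job create --name my-airflow-job --type airflow --dag-file my_dag.py --mount-1-resource my-dag-resource
    cde job run --name my-airflow-job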

In addition, you can automate migrations from Oozie on CDP Public Cloud Data Hub, CDP Private Cloud Base, CDH and HDP to Spark and Airflow CDE Jobs with the oozie2cde API. For information about using the API, see Migrating Oozie to CDE with the oozie2cde API.

Airflow Concepts

Airflow DAG

In Airflow, a DAG (Directed Acyclic Graph) is defined in a Python script that represents the DAG's structure (tasks and their dependencies) as code.

For example, consider a simple DAG consisting of three tasks: A, B, and C. The DAG can specify that A has to run successfully before B can run, but that C can run at any time. It can also specify that task A times out after 5 minutes, that B can be restarted up to 5 times if it fails, and that the workflow runs every night at 10pm but should not start until a certain date.
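
These rules can be expressed directly in the DAG file. The following is a minimal sketch of the example above, assuming Airflow 2.x import paths and using BashOperator placeholders for the tasks; the DAG name, commands, schedule, and start date are illustrative only.

    # Minimal sketch of the DAG described above (assumes Airflow 2.x).
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_three_task_dag",      # hypothetical name
        schedule_interval="0 22 * * *",       # run every night at 10pm
        start_date=datetime(2023, 1, 1),      # do not start before this date
        catchup=False,
    ) as dag:
        # Task A times out after 5 minutes.
        task_a = BashOperator(
            task_id="A",
            bash_command="echo 'running A'",
            execution_timeout=timedelta(minutes=5),
        )

        # Task B can be retried up to 5 times if it fails.
        task_b = BashOperator(
            task_id="B",
            bash_command="echo 'running B'",
            retries=5,
        )

        # Task C has no dependency on A or B and can run at any time.
        task_c = BashOperator(
            task_id="C",
            bash_command="echo 'running C'",
        )

        # A must run successfully before B can run.
        task_a >> task_b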

For more information about Airflow DAGs, see Apache Airflow DAG documentation. For an example DAG in CDE, see CDE Airflow DAG documentation.

Airflow UI

The Airflow UI makes it easy to monitor and troubleshoot your data pipelines. For a complete overview of the Airflow UI, see Apache Airflow UI documentation.