Apache Airflow in Cloudera Data Engineering

Learn how Apache Airflow is integrated with Cloudera Data Engineering, and how to automate a workflow or data pipeline using Apache Airflow Python DAG files.

Cloudera Data Engineering (CDE) enables you to automate a workflow or data pipeline using Apache Airflow Python DAG (directed acyclic graph) files. Each Cloudera Data Engineering Virtual Cluster includes an embedded instance of Apache Airflow. You can also use Cloudera Data Engineering with your own Airflow deployment. For more information, see Using Cloudera Data Engineering with an external Apache Airflow deployment.

Cloudera Data Engineering currently supports multiple Airflow operators, such as one for running Cloudera Data Engineering jobs and one for accessing and executing SQL commands on Cloudera Data Warehouse. For the complete list of installed and supported operators, see Supported Airflow operators and hooks.

You can create and manage Apache Airflow jobs by writing Python DAG files and uploading them through the UI. For more information about Cloudera Data Engineering Airflow job management, see Creating and managing Cloudera Data Engineering Airflow Jobs using the Cloudera Data Engineering UI.

Cloudera Data Engineering provides various ways to connect to Cloudera Data Warehouse (CDW) or to other Cloudera Data Engineering Virtual Clusters. You must create an Airflow connection containing the target system's access details, and then reference it in the associated operators to execute the required workloads.

You can also install and use custom operators and libraries (Python packages) for Airflow with Cloudera Data Engineering. The Custom Operators and Libraries feature in the Cloudera Data Engineering user interface (UI) lets you extend the default installed packages with third-party or custom Python packages.
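The packages to add are typically declared in a pip-style requirements file uploaded through the UI; the entries below are illustrative examples, not defaults.

```text
# Example requirements file for the Custom Operators and Libraries feature
# (package names and the decision to pin versions are illustrative).
apache-airflow-providers-http
simple-salesforce==1.12.5
```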