Automating data pipelines using Apache Airflow in Cloudera Data Engineering
Cloudera Data Engineering (CDE) enables you to automate a workflow or data pipeline using Apache Airflow Python DAG files. Each CDE virtual cluster includes an embedded instance of Apache Airflow. You can also use CDE with your own Airflow deployment. CDE currently supports two Airflow operators: one to run a CDE job and one to access Cloudera Data Warehouse (CDW).
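As a sketch of how the two operators fit together in a DAG file, the following example chains a CDE job run into a CDW query. This is an illustrative fragment for the embedded Airflow service: the job name `example-spark-job`, the connection ID `cdw-hive`, and the query are placeholders, not values from this guide, and the operator import paths assume the packages bundled with the CDE virtual cluster.

```python
# Illustrative CDE Airflow DAG sketch. Assumes the embedded Airflow
# service in a CDE virtual cluster; names below are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

default_args = {
    "owner": "cdeuser",
    "retries": 1,
    "retry_delay": timedelta(seconds=30),
}

with DAG(
    dag_id="example_cde_pipeline",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    # Run an existing CDE job defined in the same virtual cluster.
    spark_step = CDEJobRunOperator(
        task_id="run_spark_job",
        job_name="example-spark-job",  # placeholder job name
    )

    # Run a Hive query against a CDW virtual warehouse, using an Airflow
    # connection configured with the hostname found in the steps below.
    report_step = CDWOperator(
        task_id="refresh_report",
        cli_conn_id="cdw-hive",  # placeholder connection ID
        hql="SELECT count(*) FROM default.example_table",
        schema="default",
    )

    # The CDW query runs only after the Spark job succeeds.
    spark_step >> report_step
```

The `>>` dependency means the warehouse query is attempted only once the CDE job completes, which is the typical pattern for a load-then-report pipeline.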
To determine the CDW hostname to use for the connection:
- Navigate to the Cloudera Data Warehouse Overview page by clicking the Data Warehouse tile in the Cloudera Data Platform (CDP) management console.
- In the Virtual Warehouses column, find the warehouse you want to connect to.
- Click the three-dot menu for the selected warehouse, and then click Copy JDBC URL.
- Paste the URL into a text editor, and make note of the hostname. For example, in the following JDBC URL, the hostname portion is hs2-aws-2-hive.env-k5ip0r.dw.ylcu-atmi.cloudera.site:
jdbc:hive2://hs2-aws-2-hive.env-k5ip0r.dw.ylcu-atmi.cloudera.site/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3;
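If you prefer to pull the hostname out of the copied URL programmatically rather than by eye, a minimal standard-library sketch looks like this. The `jdbc_hostname` helper is an illustration introduced here, not part of any Cloudera API.

```python
# Minimal sketch: extract the host portion of a Hive JDBC URL like the
# one copied from the CDW UI. Standard library only; jdbc_hostname is a
# hypothetical helper, not a Cloudera API.
from urllib.parse import urlparse

def jdbc_hostname(jdbc_url: str) -> str:
    """Return the hostname from a jdbc:hive2:// URL."""
    # Strip the "jdbc:" prefix so urlparse sees a hive2:// scheme URL.
    parsed = urlparse(jdbc_url.removeprefix("jdbc:"))
    return parsed.hostname

url = ("jdbc:hive2://hs2-aws-2-hive.env-k5ip0r.dw.ylcu-atmi.cloudera.site/"
       "default;transportMode=http;httpPath=cliservice;ssl=true;retries=3;")
print(jdbc_hostname(url))
# → hs2-aws-2-hive.env-k5ip0r.dw.ylcu-atmi.cloudera.site
```

The hostname is what you enter when creating the Airflow connection to the virtual warehouse; the trailing `;key=value` options in the JDBC URL are not part of it.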
The following instructions are for using the Airflow service provided with each CDE virtual cluster. For instructions on using your own Airflow deployment, see Using the Cloudera provider for Apache Airflow.