Orchestrating workflows and pipelinesPDF version

Using CDE with an external Apache Airflow deployment

The Cloudera provider for Apache Airflow, available at the Cloudera GitHub repository, provides an Airflow operator for running Cloudera Data Engineering (CDE) jobs. You can install the provider on your existing Apache Airflow deployment to integrate.

  • The Cloudera provider for Apache Airflow is for use with existing Airflow deployments. If you want to use the embedded Airflow service provided by CDE, see Automating data pipelines with CDE using Apache Airflow.
  • The provider requires Python 3.6 or higher.
  • The provider requires the Python cryptography package version 3.3.2 or higher to address CVE-2020-36242. If an older version is installed, the plugin automatically updates the cryptography library.

This component provides an Airflow Operator CDEJobRunOperator to be integrated in your DAGs. The CDEJobRunOperator is for running Cloudera Data Engineering jobs.

Install Cloudera Airflow provider on your Airflow servers
  1. Run the following pip command on each Airflow server: pip install cloudera-airflow-provider
Create a connection using the Airflow UI

Before you can run a CDE job from your Airflow deployment, you must configure a connection using the Airflow UI.

  1. In the Cloudera Data Platform (CDP) console, click the Data Engineering tile. The CDE Home page displays.
  2. Click Administration in the left navigation menu. The Administration page displays.
  3. In the Virtual Clusters column, click the Cluster Details icon.
  4. Click JOBS API URL to copy the URL.
  5. Go to your Airflow web console (where you installed the Cloudera provider).
  6. Go to Admin > Connection.
  7. Click + Add a new record.
  8. Fill in connection details:
    Conn Id
    Create a unique connection identifier.
    Conn Type
    The type of the connection. From the drop-down, select
    • HTTP (if you are using Apache Airflow version 1)
    • HTTP or Cloudera Data engineering (if you are using Apache Airflow version 2)
    Host/Virtual API Endpoint
    URL of the host where you want the job to run. Paste here the JOBS API URL you copied in a previous step.
    Login/CDP Access Key

    Provide the CDP access key of the account for running jobs on the CDE VC.

    Password/CDP Private Key

    Provide the CDP private key of the account for running jobs on the CDE VC.

  9. Click Save.
  10. In the CDE Home page, click Jobs in the left navigation menu, and then click Create a Job.
  11. Fill in the Job Details:
    Job Type
    Select the option matching your use case.
    Name
    Specify a name for the job.
    DAG File
    Provide a DAG file.
    Use the CDEJobRunOperator to specify a CDE job to run. The job definition in the DAG file must contain:
    connection_id
    The Conn Id you specified on the Airflow UI when creating the connection.
    task_id
    The ID that identifies the job within the DAG.
    dag
    The variable containing the dag object
    job_name
    The name of the CDE job to run. This job must exist in the CDE virtual cluster you are connecting to.
    For example:
    from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
    ...
    t1 = CDEJobRunOperator(
        connection_id='cde-vc01-dev',
        task_id='ingest',
        dag=example_dag,
        job_name='etl-ingest-job'
    )
    
  12. Click Create and Run to create the job and run it immediately, or click the dropdown button and select Create to create the job.