Orchestrating workflows and pipelines

Using Cloudera Data Engineering with an external Apache Airflow deployment

The Cloudera provider for Apache Airflow, available in the Cloudera GitHub repository, provides an Airflow operator for running Cloudera Data Engineering jobs. You can install the provider on your existing Apache Airflow deployment to integrate it with Cloudera Data Engineering.

  • The Cloudera provider for Apache Airflow is for use with existing Airflow deployments. If you want to use the embedded Airflow service provided by Cloudera Data Engineering, see Automating data pipelines with Cloudera Data Engineering using Apache Airflow.
  • The provider requires Python 3.6 or higher.
  • The provider requires the Python cryptography package version 3.3.2 or higher to address CVE-2020-36242. If an older version is installed, the provider automatically updates the cryptography library. A quick way to check both prerequisites is sketched below.
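
The following is only an illustrative check of the two requirements above, run with the same Python interpreter that runs Airflow. It assumes the packaging helper library (normally present alongside pip) is importable in that environment.

    # Sketch: verify the prerequisites listed above.
    # Assumes the 'packaging' helper library is available in this environment.
    import sys

    import cryptography
    from packaging.version import Version

    assert sys.version_info >= (3, 6), "Python 3.6 or higher is required"
    assert Version(cryptography.__version__) >= Version("3.3.2"), \
        "cryptography 3.3.2 or higher is required (CVE-2020-36242)"
    print("Prerequisites look OK")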

The provider includes the CDEJobRunOperator Airflow operator, which you can use in your DAGs to run Cloudera Data Engineering jobs.

Install the Cloudera Airflow provider on your Airflow servers
  1. Run the following pip command on each Airflow server:
    pip install cloudera-airflow-provider
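
To confirm that the provider is visible to Airflow, one simple sanity check is to import the operator with the Python interpreter that runs Airflow. The import path below is the one used by the DAG example later on this page; the print statement is only illustrative.

    # Sanity check: this import succeeds only if the provider is installed
    # in the environment that Airflow uses.
    from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

    print(CDEJobRunOperator.__name__)  # prints "CDEJobRunOperator" on success
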
Create a connection using the Airflow UI

Before you can run a Cloudera Data Engineering job from your Airflow deployment, you must configure a connection using the Airflow UI.

  1. In the Cloudera console, click the Data Engineering tile. The Cloudera Data Engineering Home page displays.
  2. Click Administration in the left navigation menu. The Administration page displays.
  3. In the Virtual Clusters column, click the Cluster Details icon.
  4. Click JOBS API URL to copy the URL.
  5. Go to your Airflow web console (where you installed the Cloudera provider).
  6. Go to Admin > Connections.
  7. Click + Add a new record.
  8. Fill in the connection details:
    Conn Id
    Create a unique connection identifier.
    Conn Type
    The type of the connection. From the drop-down, select:
    • HTTP (if you are using Apache Airflow version 1)
    • HTTP or Cloudera Data Engineering (if you are using Apache Airflow version 2)
    Host/Virtual API Endpoint
    The URL of the host where you want the job to run. Paste the JOBS API URL that you copied in a previous step.
    Login/Cloudera Access Key
    Provide the Cloudera access key of the account for running jobs on the Cloudera Data Engineering virtual cluster.
    Password/Cloudera Private Key
    Provide the Cloudera private key of the account for running jobs on the Cloudera Data Engineering virtual cluster.
  9. Click Save.
  10. In the Cloudera Data Engineering Home page, click Jobs in the left navigation menu, and then click Create Job.
  11. Fill in the Job Details:
    Job Type
    Select the option matching your use case.
    Name
    Specify a name for the job.
    DAG File
    Provide a DAG file.
    Use the CDEJobRunOperator to specify a Cloudera Data Engineering job to run. The job definition in the DAG file must contain:
    connection_id
    The Conn Id you specified on the Airflow UI when creating the connection.
    task_id
    The ID that identifies the job within the DAG.
    dag
    The variable containing the DAG object.
    job_name
    The name of the Cloudera Data Engineering job to run. This job must exist in the Cloudera Data Engineering virtual cluster you are connecting to.
    For example (a complete DAG file sketch is shown after these steps):
    from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
    ...
    t1 = CDEJobRunOperator(
        connection_id='cde-vc01-dev',
        task_id='ingest',
        dag=example_dag,
        job_name='etl-ingest-job'
    )
    
  12. Click Create and Run to create the job and run it immediately, or click the dropdown button and select Create to create the job.
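
For reference, the CDEJobRunOperator snippet above might live in a DAG file similar to the following sketch. The DAG ID, schedule, start date, connection ID, and job name are placeholder values: substitute the Conn Id you created earlier and a job that already exists in the virtual cluster you are connecting to. The sketch assumes an Apache Airflow 2 environment.

    # Illustrative DAG file; all names below are placeholders.
    from datetime import datetime

    from airflow import DAG
    from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

    # The DAG object referenced by the operator's dag= argument.
    example_dag = DAG(
        dag_id='example_dag',
        start_date=datetime(2023, 1, 1),
        schedule_interval='@daily',
        catchup=False,
    )

    t1 = CDEJobRunOperator(
        connection_id='cde-vc01-dev',   # Conn Id created in the Airflow UI
        task_id='ingest',               # ID of this task within the DAG
        dag=example_dag,                # the DAG object defined above
        job_name='etl-ingest-job',      # existing job in the CDE virtual cluster
    )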
