Running Airflow jobs on other CDE Virtual Clusters
Learn how to run Airflow jobs on other Cloudera Data Engineering (CDE) Virtual Clusters.
The following steps are for using the Apache Airflow service provided with each CDE Virtual Cluster. For more information on using your own Airflow deployment, see Using Cloudera Data Engineering with an external Apache Airflow deployment.
To run this job, an existing Airflow connection is required. For more information about how to create an Airflow connection, see Creating a connection to run jobs on other CDE Virtual Clusters.
Create an Airflow DAG file in Python. Import the required operators and define the tasks and dependencies.
The following example DAG starts a job called “example-scala-pi” on the CDE Virtual Cluster defined by the connection named “cde_runtime_api”. The “cde_runtime_api” connection is the default connection and points to the same Virtual Cluster that the Airflow job itself runs on. For more information about the operator and its capabilities, see the GitHub page.
from datetime import timedelta

import pendulum
from airflow import DAG
from cloudera.airflow.providers.operators.cde import CdeRunJobOperator

# Default arguments applied to every task in the DAG.
default_args = {
    'retry_delay': timedelta(seconds=5),
    'depends_on_past': False,
    'start_date': pendulum.datetime(2021, 1, 1, tz="UTC"),
}

example_dag = DAG(
    'example-cdeoperator',
    default_args=default_args,
    schedule_interval='@once',
    catchup=False,
    is_paused_upon_creation=False,
)

# Run the CDE job 'example-scala-pi' using the 'cde_runtime_api' connection.
ingest_step1 = CdeRunJobOperator(
    connection_id='cde_runtime_api',
    task_id='ingest',
    retries=3,
    dag=example_dag,
    job_name='example-scala-pi',
)

# When connection_id is omitted, the operator uses the default
# 'cde_runtime_api' connection.
prep_step2 = CdeRunJobOperator(
    task_id='data_prep',
    dag=example_dag,
    job_name='example-scala-pi',
)

# Run the data_prep task after the ingest task completes.
ingest_step1 >> prep_step2
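To launch a job on a different Virtual Cluster from the same DAG, point a task at the connection you created for that cluster. The following sketch extends the DAG above; the connection name “cde_other_vc” and the assumption that a job named “example-scala-pi” exists on the remote cluster are placeholders for whatever you configured when creating the connection.

# Hypothetical task targeting another Virtual Cluster. Assumes a
# connection named 'cde_other_vc' was created for that cluster and
# that a job named 'example-scala-pi' exists there.
remote_step3 = CdeRunJobOperator(
    connection_id='cde_other_vc',
    task_id='remote_run',
    dag=example_dag,
    job_name='example-scala-pi',
)

# Run the remote job after the local data_prep task completes.
prep_step2 >> remote_step3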