Running Airflow jobs on other CDE Virtual Clusters
Learn how to run Airflow jobs on other Cloudera Data Engineering (CDE) Virtual Clusters.
The following steps are for using the Apache Airflow service provided with each CDE Virtual Cluster. For more information on using your own Airflow deployment, see Using Cloudera Data Engineering with an external Apache Airflow deployment.
To run this job, an existing Airflow connection is required. For more information about how to create an Airflow connection, see Creating a connection to run jobs on other CDE Virtual Clusters.
Create an Airflow DAG file in Python. Import the required operators and define the tasks and dependencies.
The following example DAG starts a job called “example-scala-pi” on the CDE Virtual Cluster defined by the connection named “cde_runtime_api”. The “cde_runtime_api” connection is the default connection and points to the same Virtual Cluster that the Airflow job itself runs on. For more information about the operator and its capabilities, see the GitHub page.
from datetime import timedelta

import pendulum
from airflow import DAG
from cloudera.airflow.providers.operators.cde import CdeRunJobOperator

# Default arguments applied to every task in the DAG.
default_args = {
    'retry_delay': timedelta(seconds=5),
    'depends_on_past': False,
    'start_date': pendulum.datetime(2021, 1, 1, tz="UTC"),
}

example_dag = DAG(
    'example-cdeoperator',
    default_args=default_args,
    schedule_interval='@once',
    catchup=False,
    is_paused_upon_creation=False,
)

# Run the CDE job 'example-scala-pi' using the 'cde_runtime_api' connection.
ingest_step1 = CdeRunJobOperator(
    connection_id='cde_runtime_api',
    task_id='ingest',
    retries=3,
    dag=example_dag,
    job_name='example-scala-pi',
)

# When connection_id is omitted, the operator uses the default
# 'cde_runtime_api' connection.
prep_step2 = CdeRunJobOperator(
    task_id='data_prep',
    dag=example_dag,
    job_name='example-scala-pi',
)

# Run the data_prep task after the ingest task completes.
ingest_step1 >> prep_step2
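To launch a job on a different Virtual Cluster from the same DAG, point a task at the connection you created for that cluster. The following sketch extends the DAG above; the connection name “cde_other_vc” and the assumption that a job named “example-scala-pi” exists on the remote cluster are placeholders for whatever you configured when creating the connection.

# Hypothetical task targeting another Virtual Cluster. Assumes a
# connection named 'cde_other_vc' was created for that cluster and
# that a job named 'example-scala-pi' exists there.
remote_step3 = CdeRunJobOperator(
    connection_id='cde_other_vc',
    task_id='remote_run',
    dag=example_dag,
    job_name='example-scala-pi',
)

# Run the remote job after the local data_prep task completes.
prep_step2 >> remote_step3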