Learn how you can create an Airflow connection to a Cloudera Data Warehouse
Virtual Warehouse, such as Hive or Impala, using SQL operators.
The following steps are for using the Airflow service provided with each Cloudera Data Engineering Virtual Cluster. For information about using
your own Airflow deployment, see Using Cloudera Data Engineering with an
external Apache Airflow deployment.
To determine the Cloudera Data Warehouse Virtual Warehouse JDBC URL,
perform the following steps:
- In the Cloudera console, click the
Data Warehouse tile. The Cloudera Data Warehouse Home page displays.
- Click Virtual Warehouses in the left navigation menu and
select the relevant Virtual Warehouse for which you want to create an Airflow
connection.
- Click the options menu next to the selected Virtual Warehouse, and then click Copy JDBC URL.
- Paste the URL into a text editor, and take note of the following:
ssl
transportMode
KrbServiceName
KrbHostFQDN
Example
jdbc:impala://coordinator-sql-operator-impala-reduced.apps.shared-rke-dev-01.kcloud.cloudera.com:443/default;AuthMech=1;transportMode=http;httpPath=cliservice;ssl=1;KrbHostFQDN=dwx-env-ga5gne-env.cdp.local;KrbServiceName=hive
In this example, take note of the following values:
ssl = 1
transportMode = http
KrbServiceName = hive
KrbHostFQDN = dwx-env-ga5gne-env.cdp.local
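The same values can also be pulled out programmatically. The following minimal Python sketch (illustrative only) splits the semicolon-separated properties out of the example JDBC URL above; substitute the URL you copied from your own Virtual Warehouse.
# Minimal sketch: extract the key=value properties from a copied JDBC URL.
# The URL below is the example from this page; replace it with your own copied value.
jdbc_url = (
    "jdbc:impala://coordinator-sql-operator-impala-reduced.apps.shared-rke-dev-01."
    "kcloud.cloudera.com:443/default;AuthMech=1;transportMode=http;httpPath=cliservice;"
    "ssl=1;KrbHostFQDN=dwx-env-ga5gne-env.cdp.local;KrbServiceName=hive"
)

# Everything after the first ";" is a list of key=value properties.
properties = dict(
    part.split("=", 1) for part in jdbc_url.split(";")[1:] if "=" in part
)

print(properties["ssl"])             # 1
print(properties["transportMode"])   # http
print(properties["KrbServiceName"])  # hive
print(properties["KrbHostFQDN"])     # dwx-env-ga5gne-env.cdp.local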
To create a connection to an existing Cloudera Data Warehouse Virtual
Warehouse using the embedded Airflow UI, perform the following steps:
-
In a text editor, define the following additional arguments for the Airflow connection, using the values you noted from the JDBC URL.
{
"auth_mechanism": "GSSAPI",
"use_ssl": "<***TRUE_OR_FALSE***>",
"use_http_transport": "<***TRUE_OR_FALSE***>",
"http_path": "cliservice",
"kerberos_service_name": "<***KERBEROS-SERVICE-NAME***>",
"krb_host": "<***KERBEROS-HOST-NAME***>"
}
- The auth_mechanism value must be GSSAPI for Kerberos authentication.
- If ssl = 1 in the JDBC URL, set use_ssl to True. Otherwise, set it to False.
- If transportMode = http in the JDBC URL, set use_http_transport to True. Otherwise, set it to False.
- The http_path value must always be cliservice.
- The kerberos_service_name value is the KrbServiceName value in the JDBC URL.
- The krb_host value is the KrbHostFQDN value in the JDBC URL.
The example below shows these rules applied to the JDBC URL from the previous section; a short sketch that applies the same mapping in code follows the example.
Example
{
"auth_mechanism": "GSSAPI",
"use_ssl": "True",
"use_http_transport": "True",
"http_path": "cliservice",
"kerberos_service_name": "hive",
"krb_host": "dwx-env-ga5gne-env.cdp.local"
}
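Continuing the illustrative sketch above, the mapping rules can also be applied in Python to generate the extra-arguments JSON. The hard-coded properties dictionary below holds the example values from the JDBC URL and is an assumption; replace it with the values you parsed from your own URL.
import json

# Example values noted from the JDBC URL; replace with your own parsed properties.
properties = {
    "ssl": "1",
    "transportMode": "http",
    "KrbServiceName": "hive",
    "KrbHostFQDN": "dwx-env-ga5gne-env.cdp.local",
}

extra_args = {
    "auth_mechanism": "GSSAPI",                  # always GSSAPI for Kerberos
    "use_ssl": "True" if properties["ssl"] == "1" else "False",
    "use_http_transport": "True" if properties["transportMode"] == "http" else "False",
    "http_path": "cliservice",                   # always cliservice
    "kerberos_service_name": properties["KrbServiceName"],
    "krb_host": properties["KrbHostFQDN"],
}

# Paste this JSON into the Extra field of the Airflow connection in the steps below.
print(json.dumps(extra_args, indent=2))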
-
In the Cloudera console, click the
Data Engineering tile. The Cloudera Data Engineering Home page displays.
-
Click Administration in the left navigation menu and
select the service containing the Virtual Cluster that you are using.
-
In the Virtual Clusters column, click Cluster
Details for the Virtual Cluster.
-
Click AIRFLOW UI.
-
In the Airflow UI, click Connections in the Admin menu.
-
Click the plus sign to add a new record and fill in the following fields:
- Connection Id: Create a unique connection
identifier. For example, cdw-hive-sql.
- Connection Type: Select
Impala, regardless of the Data Warehouse
type.
- Description: Enter a description of the connection.
- Host: Enter the hostname copied from the JDBC
connection URL. Do not enter the full JDBC URL.
- Schema: Enter the schema to be used. The default
value is
default.
- Login: Leave it blank.
- Password: Leave it blank.
- Port: Set to 443.
- Extra: Copy the extra arguments created in step 1 and paste them into the text box.
-
Click Save.
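If you prefer to manage connections in code rather than through the embedded Airflow UI, the same fields can be supplied to Airflow's Connection model. This is a hedged sketch, not part of the documented procedure: it assumes the code runs where the Airflow metadata database is reachable, it assumes the connection type string registered by the Impala provider is impala (matching the Impala selection above), and it reuses the example host, identifier, and extra values from the earlier steps.
import json

from airflow import settings
from airflow.models import Connection

# Sketch only: mirrors the UI fields from the step above.
conn = Connection(
    conn_id="cdw-hive-sql",  # same example identifier as above
    conn_type="impala",      # assumed type string for the Impala connection type
    description="Cloudera Data Warehouse Virtual Warehouse over Kerberos",
    host="coordinator-sql-operator-impala-reduced.apps.shared-rke-dev-01.kcloud.cloudera.com",
    schema="default",
    port=443,
    extra=json.dumps(
        {
            "auth_mechanism": "GSSAPI",
            "use_ssl": "True",
            "use_http_transport": "True",
            "http_path": "cliservice",
            "kerberos_service_name": "hive",
            "krb_host": "dwx-env-ga5gne-env.cdp.local",
        }
    ),
)

# Persist the connection in the Airflow metadata database.
session = settings.Session()
session.add(conn)
session.commit()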
-
Create your DAG as a Cloudera Data Engineering resource and use the
previously created Airflow connection in the DAG.
Example
from __future__ import annotations

import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

DAG_ID = "pnc_test_kerberos_impala_cdw"

with DAG(
    dag_id=DAG_ID,
    start_date=datetime.datetime(2025, 5, 6),
    # conn_id in default_args is passed to every SQLExecuteQueryOperator task in this
    # DAG; set it to the Connection Id you created in the Airflow UI.
    default_args={"conn_id": "pnc_test_kerberos_impala_cdw"},
    schedule="@once",
    catchup=False,
) as dag:
    create_table_impala_task = SQLExecuteQueryOperator(
        task_id="create_table_impala",
        sql="""
        CREATE TABLE IF NOT EXISTS pnc_test_kerberos_impala_cdw (
            a STRING,
            b INT
        )
        PARTITIONED BY (c INT)
        """,
    )

    drop_table_impala_task = SQLExecuteQueryOperator(
        task_id="drop_table_impala",
        sql="DROP TABLE pnc_test_kerberos_impala_cdw",
    )

    # Run the CREATE TABLE task before the DROP TABLE task.
    (
        create_table_impala_task
        >> drop_table_impala_task
    )
-
Create an Airflow Job using the created DAG file. For more information about
creating an Airflow job, see Creating an Airflow DAG using the Pipeline
UI.
-
Once the job is created, run the Airflow Job.