Creating a connection to Cloudera Data Warehouse Virtual Warehouse

Learn how you can create an Airflow connection to a Cloudera Data Warehouse Virtual Warehouse, such as a Hive or Impala Virtual Warehouse, using SQL operators.

The following steps are for using the Airflow service provided with each Cloudera Data Engineering Virtual Cluster. For information about using your own Airflow deployment, see Using Cloudera Data Engineering with an external Apache Airflow deployment.

To determine the Cloudera Data Warehouse Virtual Warehouse JDBC URL, perform the following steps:

  1. In the Cloudera console, click the Data Warehouse tile. The Cloudera Data Warehouse Home page displays.
  2. Click Virtual Warehouses in the left navigation menu and select the relevant Virtual Warehouse for which you want to create an Airflow connection.
  3. Click the options menu next to the selected Virtual Warehouse, and then click Copy JDBC URL.
  4. Paste the URL into a text editor and take note of the following properties (see the parsing sketch after this procedure):
    • ssl
    • transportMode
    • KrbServiceName
    • KrbHostFQDN
    Example
    jdbc:impala://coordinator-sql-operator-impala-reduced.apps.shared-rke-dev-01.kcloud.cloudera.com:443/default;AuthMech=1;transportMode=http;httpPath=cliservice;ssl=1;KrbHostFQDN=dwx-env-ga5gne-env.cdp.local;KrbServiceName=hive
    In this example, take note of the following values:
    • ssl = 1
    • transportMode = http
    • KrbServiceName = hive
    • KrbHostFQDN = dwx-env-ga5gne-env.cdp.local
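
If you prefer, you can pull these values out of the copied URL programmatically. The following is a minimal sketch, not part of the procedure, using the example URL shown above:

    # Minimal sketch: extract the host, port, and Kerberos-related properties
    # from a copied Cloudera Data Warehouse JDBC URL (example URL from above).
    jdbc_url = (
        "jdbc:impala://coordinator-sql-operator-impala-reduced.apps."
        "shared-rke-dev-01.kcloud.cloudera.com:443/default;AuthMech=1;"
        "transportMode=http;httpPath=cliservice;ssl=1;"
        "KrbHostFQDN=dwx-env-ga5gne-env.cdp.local;KrbServiceName=hive"
    )

    # The host and port come before the first "/" after "//".
    host_and_port = jdbc_url.split("//", 1)[1].split("/", 1)[0]
    host, port = host_and_port.rsplit(":", 1)

    # The remaining ";key=value" pairs carry ssl, transportMode,
    # KrbServiceName, and KrbHostFQDN.
    props = dict(p.split("=", 1) for p in jdbc_url.split(";")[1:] if "=" in p)

    print(host, port)
    print(props["ssl"], props["transportMode"])
    print(props["KrbServiceName"], props["KrbHostFQDN"])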

To create a connection to an existing Cloudera Data Warehouse Virtual Warehouse using the embedded Airflow UI, perform the following steps:

  1. In a text editor, define the following additional arguments for the Airflow connection, using the values you noted from the JDBC URL.
    {
      "auth_mechanism": "GSSAPI",
      "use_ssl": "<***TRUE_OR_FALSE***>",
      "use_http_transport": "<***TRUE_OR_FALSE***>",
      "http_path": "cliservice",
      "kerberos_service_name": "<***KERBEROS-SERVICE-NAME***>",
      "krb_host": "<***KERBEROS-HOST-NAME***>" 
    }
    
    • The auth_mechanism value must be GSSAPI for Kerberos authentication.
    • If ssl = 1 in the JDBC URL, set use_ssl to True. Otherwise, set it to False.
    • If transportMode = http in the JDBC URL, set use_http_transport to True. Otherwise, set it to False.
    • The http_path value must always be cliservice.
    • The kerberos_service_name value is the KrbServiceName in the JDBC URL.
    • The krb_host value is the KrbHostFQDN in the JDBC URL.
    Example
    {
      "auth_mechanism": "GSSAPI", 
      "use_ssl": "True",
      "use_http_transport": "True",
      "http_path": "cliservice",
      "kerberos_service_name": "hive",
      "krb_host": "dwx-env-ga5gne-env.cdp.local"
    }
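
    If it helps, the mapping from the JDBC URL values to this JSON can also be expressed as a short sketch; the values shown are taken from the example above and are not part of the procedure:

    import json

    # Values noted from the example JDBC URL above.
    ssl = "1"
    transport_mode = "http"
    krb_service_name = "hive"
    krb_host_fqdn = "dwx-env-ga5gne-env.cdp.local"

    extra = {
        "auth_mechanism": "GSSAPI",                                       # always GSSAPI for Kerberos
        "use_ssl": "True" if ssl == "1" else "False",
        "use_http_transport": "True" if transport_mode == "http" else "False",
        "http_path": "cliservice",                                        # always cliservice
        "kerberos_service_name": krb_service_name,                        # KrbServiceName
        "krb_host": krb_host_fqdn,                                        # KrbHostFQDN
    }

    # Paste this JSON into the connection's Extra field.
    print(json.dumps(extra, indent=2))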
    
  2. In the Cloudera console, click the Data Engineering tile. The Cloudera Data Engineering Home page displays.
  3. Click Administration in the left navigation menu and select the service containing the Virtual Cluster that you are using.
  4. In the Virtual Clusters column, click Cluster Details for the Virtual Cluster.
  5. Click AIRFLOW UI.
  6. In the Airflow UI, click Connections in the Admin menu.
  7. Click the plus sign to add a new record, and fill in the following fields:
    • Connection Id: Create a unique connection identifier. For example, cdw-hive-sql.
    • Connection Type: Select Impala, regardless of the Data Warehouse type.
    • Description: Enter a description of the connection.
    • Host: Enter the hostname copied from the JDBC connection URL. Do not enter the full JDBC URL.
    • Schema: Enter the schema to be used. The default value is default.
    • Login: Leave it blank.
    • Password: Leave it blank.
    • Port: Set to 443.
    • Extra: Copy the extra arguments created in step 1 and paste them into the text box.
  8. Click Save.
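    Optionally, you can confirm that the connection and its extra arguments were saved as expected. This is a minimal sketch, assuming the example Connection Id cdw-hive-sql; it must run where the Airflow metadata is reachable, for example inside a task in a Cloudera Data Engineering DAG:

    from airflow.hooks.base import BaseHook

    # Look up the saved connection by its Connection Id (example value from step 7).
    conn = BaseHook.get_connection("cdw-hive-sql")
    print(conn.host, conn.port, conn.schema)
    print(conn.extra_dejson)  # the extra arguments pasted into the Extra field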
  9. Create your DAG as a Cloudera Data Engineering resource and use the previously created Airflow connection in the DAG.
    Example
    from __future__ import annotations

    import datetime

    from airflow import DAG
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    DAG_ID = "pnc_test_kerberos_impala_cdw"

    with DAG(
        dag_id=DAG_ID,
        start_date=datetime.datetime(2025, 5, 6),
        # conn_id in default_args is passed to every SQL operator in this DAG;
        # it must match the Connection Id of the Airflow connection you created.
        default_args={"conn_id": "pnc_test_kerberos_impala_cdw"},
        schedule="@once",
        catchup=False,
    ) as dag:
        # Create a partitioned test table in the Virtual Warehouse.
        create_table_impala_task = SQLExecuteQueryOperator(
            task_id="create_table_impala",
            sql="""
                CREATE TABLE IF NOT EXISTS pnc_test_kerberos_impala_cdw (
                    a STRING,
                    b INT
                )
                PARTITIONED BY (c INT)
            """,
        )

        # Drop the table again after it has been created.
        drop_table_impala_task = SQLExecuteQueryOperator(
            task_id="drop_table_impala",
            sql="DROP TABLE pnc_test_kerberos_impala_cdw",
        )

        create_table_impala_task >> drop_table_impala_task
    
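    In this example, conn_id is supplied through default_args, so both SQLExecuteQueryOperator tasks use the same connection. Its value must match the Connection Id you created in the Airflow UI (the DAG above assumes a connection named pnc_test_kerberos_impala_cdw, while step 7 used cdw-hive-sql as an example). As a minimal alternative sketch, you can set the connection on each operator inside the same with DAG(...) block instead of in default_args:

    create_table_impala_task = SQLExecuteQueryOperator(
        task_id="create_table_impala",
        conn_id="cdw-hive-sql",  # example Connection Id from step 7
        sql="SHOW TABLES",
    )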
  10. Create an Airflow Job using the created DAG file. For more information about creating an Airflow job, see Creating an Airflow DAG using the Pipeline UI.
  11. Once the job is created, run the Airflow Job.