Creating a connection to Cloudera Base on premises

Learn how you can create an Airflow connection to Cloudera Base on premises Hive or Impala using SQL operators.

The following steps are for using the Airflow service provided with each Cloudera Data Engineering Virtual Cluster. For information about using your own Airflow deployment, see Using Cloudera Data Engineering with an external Apache Airflow deployment.

To determine the Impala configurations required for creating an Airflow connection for Cloudera Base on premises Impala, perform the following steps:

  1. Go to Cloudera Manager > IMPALA > Instances.
  2. On the Instances page, identify a hostname that has Impala Daemon assigned as the Role Type and copy it to a text editor. If multiple hostnames have Impala Daemon as the Role Type, copy any one of them.
  3. Go to the Configurations tab.
  4. Copy the following configuration values and paste them to the text editor along with the hostname copied earlier, as in the sample notes after this list:
    • Port: the port number in the Impala Daemon HiveServer2 HTTP Port field
    • use_ssl: True if the client_services_ssl_enabled field is enabled; otherwise, False
    • kerberos_service_name: the value in the kerberos_princ_name field
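
For example, the collected notes might look like the following (the hostname and port are illustrative values from a sample cluster, not values to copy):

    hostname: impala-daemon-1.example.com
    port: 28000
    use_ssl: True
    kerberos_service_name: impala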

To determine the Hive configurations required for creating an Airflow connection for Cloudera Base on premises Hive, perform the following steps:

  1. Go to Cloudera Manager > HIVE_ON_TEZ > Instances.
  2. On the Instances page, identify a hostname that has HiveServer2 assigned as the Role Type and copy it to a text editor. If multiple hostnames have HiveServer2 as the Role Type, copy any one of them.
  3. Go to the Configurations tab.
  4. Copy the following configuration values and paste them to the text editor along with the hostname copied earlier, as in the sample notes after this list:
    • Port: the port number in the HiveServer2 Port field
    • use_ssl: True if the hive.server2.use.SSL field is enabled; otherwise, False
    • kerberos_service_name: the value in the kerberos_princ_name field
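
For example, the collected notes might look like the following (the hostname and port are illustrative values from a sample cluster, not values to copy):

    hostname: hiveserver2-1.example.com
    port: 10001
    use_ssl: True
    kerberos_service_name: hive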

To create a connection to an existing Cloudera Base on premises Hive or Impala using the embedded Airflow UI, perform the following steps:

  1. In a text editor, define the following Airflow connection extra arguments using the configurations copied earlier:
    {
      "auth_mechanism": "GSSAPI",
      "use_ssl": "<***TRUE_OR_FALSE***>",
      "use_http_transport": "True",
      "http_path": "cliservice",
      "kerberos_service_name": "<***KERBEROS-SERVICE-NAME***>"
    }
    
    • The auth_mechanism value must be GSSAPI for Kerberos authentication.
    • The use_ssl argument must be True if the client_services_ssl_enabled field is enabled for Impala or the hive.server2.use.SSL field is enabled for Hive in the Configurations tab. Otherwise, it must be False.
    • The use_http_transport value must always be True.
    • The http_path value must be cliservice.
    • The kerberos_service_name value is the kerberos_princ_name value in the Configurations tab.
    Example for Impala
    {
      "auth_mechanism": "GSSAPI", 
      "use_ssl": "True",
      "use_http_transport": "True",
      "http_path": "cliservice",
      "kerberos_service_name": "impala"
    }
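
    Example for Hive. The extra arguments differ only in the kerberos_service_name value, which comes from the kerberos_princ_name field in the Hive Configurations tab (hive here is illustrative):
    {
      "auth_mechanism": "GSSAPI",
      "use_ssl": "True",
      "use_http_transport": "True",
      "http_path": "cliservice",
      "kerberos_service_name": "hive"
    }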
    
  2. In the Cloudera console, click the Data Engineering tile. The Cloudera Data Engineering Home page displays.
  3. Click Administration in the left navigation menu and select the service containing the Virtual Cluster that you are using.
  4. In the Virtual Clusters column, click Cluster Details for the Virtual Cluster.
  5. Click AIRFLOW UI.
  6. In the Airflow UI, click Connections in the Admin menu.
  7. Click the plus sign to add a new record and fill in the following fields (an illustrative example follows this list):
    • Connection Id: Create a unique connection identifier. For example, cdw-hive-sql.
    • Connection Type: Select Impala, regardless of the Data Warehouse type.
    • Description: Enter a description of the connection.
    • Host: Enter the hostname copied from the Impala Instances tab or Hive Instances tab.
    • Schema: Leave it blank or set it to default, which is the default value.
    • Login: Leave it empty.
    • Password: Leave it empty.
    • Port: Enter the port number copied from the Impala Configurations tab or Hive Configurations tab.
    • Extra: Copy the extra arguments created in step 1 and paste them into the text box.
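    Example (all values are illustrative; substitute the hostname, port, and extra arguments collected from your cluster):
      Connection Id: impala-base-conn
      Connection Type: Impala
      Host: impala-daemon-1.example.com
      Schema: default
      Port: 28000
      Extra: the extra arguments JSON from step 1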
  8. Click Save.
  9. Create your DAG as a Cloudera Data Engineering resource and use the previously created Airflow connection in the DAG.
    Example
     from __future__ import annotations

     import datetime

     from airflow import DAG
     from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

     DAG_ID = "pnc_test_kerberos_impala_base"

     with DAG(
         dag_id=DAG_ID,
         start_date=datetime.datetime(2025, 5, 6),
         # conn_id must match the Connection Id of the Airflow connection created earlier.
         default_args={"conn_id": "pnc_test_kerberos_impala_base"},
         schedule="@once",
         catchup=False,
     ) as dag:
         # Create a partitioned Impala table over the Airflow connection.
         create_table_impala_task = SQLExecuteQueryOperator(
             task_id="create_table_impala",
             sql="""
                 CREATE TABLE IF NOT EXISTS pnc_test_kerberos_impala_base (
                     a STRING,
                     b INT
                 )
                 PARTITIONED BY (c INT)
             """,
         )

         # Drop the table after it has been created.
         drop_table_impala_task = SQLExecuteQueryOperator(
             task_id="drop_table_impala",
             sql="DROP TABLE pnc_test_kerberos_impala_base",
         )

         # Run the create task before the drop task.
         create_table_impala_task >> drop_table_impala_task
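
     The conn_id can also be set on an individual operator instead of through default_args. A minimal sketch (the task name and query are illustrative):

     select_rows_task = SQLExecuteQueryOperator(
         task_id="select_rows",
         conn_id="pnc_test_kerberos_impala_base",
         sql="SELECT COUNT(*) FROM pnc_test_kerberos_impala_base",
     )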
    
  10. Create an Airflow Job using the created DAG file. For more information about creating an Airflow job, see Creating an Airflow DAG using the Pipeline UI.
  11. After the Airflow job is created, run it.