Creating a connection for Sqoop Operator

Learn how you can create an Airflow connection for Sqoop operators.

The following steps are for using the Airflow service provided with each Cloudera Data Engineering Virtual Cluster. For information about using your own Airflow deployment, see Using Cloudera Data Engineering with an external Apache Airflow deployment.

The Java version used by Cloudera Manager must be 11 or higher.

To create an Airflow connection for the Sqoop operator using the embedded Airflow UI, perform the following steps:

  1. Download the PostgreSQL or MySQL JDBC connector, depending on the database used by Cloudera Base on premises.
  2. Open a terminal on your machine.
  3. Run the following command to SSH into one of the base cluster nodes and enter the password when prompted:
    ssh <***USERNAME***>@<***CLOUDERA_MANAGER-HOST_NAME***>
    For example, if the hostname is ccycloud-1.sqoop-test.root.comops.site, use the ssh cm@ccycloud-1.sqoop-test.root.comops.site command.
  4. Run the following command to get the database properties for the Sqoop connection and copy them to a text editor:
    cat /etc/cloudera-scm-server/db.properties
    
    Sample output
    com.cloudera.cmf.db.type=mysql
    com.cloudera.cmf.db.host=localhost:3306
    com.cloudera.cmf.db.name=cm
    com.cloudera.cmf.db.user=cm
    com.cloudera.cmf.db.setupType=EXTERNAL
    com.cloudera.cmf.db.password=cmverystr0ngP4ss
    
  5. In the Cloudera console, click the Data Engineering tile. The Cloudera Data Engineering Home page displays.
  6. Click Administration in the left navigation menu and select the service containing the Virtual Cluster that you are using.
  7. In the Virtual Clusters column, click Cluster Details for the Virtual Cluster.
  8. Click AIRFLOW UI.
  9. In the Airflow UI, click Connections in the Admin menu.
  10. Click the plus sign to add a new record and fill in the following fields. Example values based on the sample output from step 4 are shown after this list:
    • Connection Id: Create a unique connection identifier. For example, test_kerberos_sqoop_mysql.
    • Connection Type: Select Sqoop.
    • Description: Enter a description of the connection.
    • Host: Use jdbc:mysql://<***HOST_NAME***> for MySQL and jdbc:postgresql://<***HOST_NAME***> for PostgreSQL, where <***HOST_NAME***> is the com.cloudera.cmf.db.host value in the database properties copied in step 4.
    • Schema: Enter the database name identified in the database properties copied in step 4.
    • Login: Use the com.cloudera.cmf.db.user value in the database properties copied in step 4.
    • Password: Use the com.cloudera.cmf.db.password value in the database properties copied in step 4.
    • Port: Use 3306 for MySQL and 5432 for PostgreSQL.
    • Extra: Leave it empty.
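
    For example, with the sample database properties shown in step 4, the fields could be filled in roughly as follows. These values are illustrative; use the host name, database name, user, and password from your own db.properties output.
    Connection Id: test_kerberos_sqoop_mysql
    Connection Type: Sqoop
    Description: Sqoop connection to the Cloudera Manager MySQL database
    Host: jdbc:mysql://<***HOST_NAME***>
    Schema: cm
    Login: cm
    Password: cmverystr0ngP4ss
    Port: 3306
    Extra: (leave empty)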
  11. Click Save.
  12. Create your DAG as a Cloudera Data Engineering resource and use the previously created Airflow connection in the DAG.
    Example
    from datetime import timedelta

    from dateutil import parser

    from airflow import DAG
    from airflow.providers.apache.sqoop.operators.sqoop import SqoopOperator
    from airflow.utils import timezone

    # Default arguments applied to every task in the DAG.
    default_args = {
        'retry_delay': timedelta(seconds=5),
        'depends_on_past': False,
    }

    dag = DAG(
        'pnc_test_kerberos_sqoop_mysql',
        default_args=default_args,
        start_date=parser.isoparse("2023-10-16T20:20:04.268Z").replace(tzinfo=timezone.utc),
        schedule_interval=None,
        catchup=False,
        is_paused_upon_creation=False
    )

    # Import the HOSTS table into HDFS. The conn_id must match the
    # Connection Id of the Airflow connection created above.
    sqoop_mysql_import = SqoopOperator(conn_id="test_kerberos_sqoop_mysql",
                                       table="HOSTS",
                                       cmd_type="import",
                                       target_dir="hdfs:///tmp/sqoopMySQL",
                                       driver="com.mysql.cj.jdbc.Driver",
                                       num_mappers=1,
                                       task_id="sqoop_import",
                                       dag=dag)

    sqoop_mysql_import
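
    If the base cluster database is PostgreSQL instead of MySQL, the import task is very similar. The following sketch is illustrative only: the connection ID, target directory, and task ID are example names, and it assumes a Sqoop connection created for the PostgreSQL database as described above. org.postgresql.Driver is the PostgreSQL JDBC driver class.
    sqoop_postgres_import = SqoopOperator(conn_id="test_kerberos_sqoop_postgres",  # example Connection Id for the PostgreSQL database
                                          table="HOSTS",
                                          cmd_type="import",
                                          target_dir="hdfs:///tmp/sqoopPostgres",
                                          driver="org.postgresql.Driver",
                                          num_mappers=1,
                                          task_id="sqoop_postgres_import",
                                          dag=dag)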
    
  13. Create an Airflow job using the DAG file that you created. In addition, add the JDBC connector JAR file to a Cloudera Data Engineering resource and then upload that resource to the job. For more information about creating an Airflow job, see Creating an Airflow DAG using the Pipeline UI.
  14. After the job is created, run the Airflow job.