Creating a connection for Sqoop Operator
Learn how you can create an Airflow connection for Sqoop operators.
The following steps are for using the Airflow service provided with each Cloudera Data Engineering Virtual Cluster. For information about using your own Airflow deployment, see Using Cloudera Data Engineering with an external Apache Airflow deployment.
The Java version used by Cloudera Manager must be 11 or higher.
To create an Airflow connection for the Sqoop operator using the embedded Airflow UI, perform the following steps:
- Download a PostgreSQL or MySQL JDBC connector, depending on the database used in Cloudera Base on premises.
- Open a terminal on your machine.
-
Run the following command to SSH into one of the base cluster nodes and enter the password when prompted:
ssh <***USERNAME***>@<***CLOUDERA_MANAGER-HOST_NAME***>
For example, if the hostname is ccycloud-1.sqoop-test.root.comops.site, use the ssh cm@ccycloud-1.sqoop-test.root.comops.site command.
-
Run the following command to get the database properties for the Sqoop connection and copy them to a text editor:
cat /etc/cloudera-scm-server/db.properties
Sample output
com.cloudera.cmf.db.type=mysql
com.cloudera.cmf.db.host=localhost:3306
com.cloudera.cmf.db.name=cm
com.cloudera.cmf.db.user=cm
com.cloudera.cmf.db.setupType=EXTERNAL
com.cloudera.cmf.db.password=cmverystr0ngP4ss
- In the Cloudera console, click the Data Engineering tile. The Cloudera Data Engineering Home page displays.
- Click Administration in the left navigation menu and select the service containing the Virtual Cluster that you are using.
- In the Virtual Clusters column, click Cluster Details for the Virtual Cluster.
- Click AIRFLOW UI.
-
In the Airflow UI, click the Connection link from the
Admin menu.
-
Click the plus sign to add a new record and fill in the following fields (a sketch of how these values map to a connection follows the Save step):
- Connection Id: Create a unique connection identifier. For example, pnc_test_kerberos_sqoop_mysql.
- Connection Type: Select Sqoop.
- Description: Enter a description of the connection.
- Host: Use jdbc:mysql://<***HOST_NAME***> for MySQL and jdbc:postgresql://<***HOST_NAME***> for PostgreSQL, in which <***HOST_NAME***> is the com.cloudera.cmf.db.host value in the database properties copied in step 4.
- Schema: Enter the database name identified in the database properties copied in step 4.
- Login: Use the com.cloudera.cmf.db.user value in the database properties copied in step 4.
- Password: Use the com.cloudera.cmf.db.password value in the database properties copied in step 4.
- Port: Use 3306 for MySQL and 5432 for PostgreSQL.
- Extra: Leave it empty.
- Click Save.
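For reference, the following is a minimal sketch of how the field values above map onto an Airflow connection object, using the sample db.properties values from step 4; the connection id, description, host, credentials, and port shown here are illustrative and will differ in your environment. It only demonstrates the field mapping and does not replace creating the connection in the Airflow UI.
from airflow.models import Connection

# Illustrative mapping of the UI fields to an Airflow Connection
# (values taken from the sample db.properties output in step 4).
sqoop_conn = Connection(
    conn_id="pnc_test_kerberos_sqoop_mysql",             # Connection Id
    conn_type="sqoop",                                   # Connection Type
    description="Sqoop connection to the CM database",   # Description (example text)
    host="jdbc:mysql://localhost",                       # Host: JDBC prefix + host part of com.cloudera.cmf.db.host
    schema="cm",                                         # Schema: com.cloudera.cmf.db.name
    login="cm",                                          # Login: com.cloudera.cmf.db.user
    password="cmverystr0ngP4ss",                         # Password: com.cloudera.cmf.db.password
    port=3306,                                           # Port: 3306 for MySQL, 5432 for PostgreSQL
)
print(sqoop_conn.get_uri())                              # URI form of the same connection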
-
Create your DAG as a Cloudera Data Engineering resource and use the
previously created Airflow connection in the DAG.
Example
from datetime import timedelta

from dateutil import parser
from airflow import DAG
from airflow.utils import timezone
from airflow.providers.apache.sqoop.operators.sqoop import SqoopOperator

default_args = {
    'retry_delay': timedelta(seconds=5),
    'depends_on_past': False,
}

dag = DAG(
    'pnc_test_kerberos_sqoop_mysql',
    default_args=default_args,
    start_date=parser.isoparse("2023-10-16T20:20:04.268Z").replace(tzinfo=timezone.utc),
    schedule_interval=None,
    catchup=False,
    is_paused_upon_creation=False
)

sqoop_mysql_import = SqoopOperator(
    conn_id="pnc_test_kerberos_sqoop_mysql",
    table="HOSTS",
    cmd_type="import",
    target_dir="hdfs:///tmp/sqoopMySQL",
    driver="com.mysql.cj.jdbc.Driver",
    num_mappers=1,
    task_id="sqoop_import",
    dag=dag
)

sqoop_mysql_import
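The DAG above imports from MySQL. If the base cluster database is PostgreSQL instead, a similar DAG could look like the following minimal sketch; the connection id test_kerberos_sqoop_postgres, the table, and the target directory are hypothetical placeholders, and the connection is assumed to have been created with a jdbc:postgresql host and port 5432 as described above.
from datetime import timedelta

from dateutil import parser
from airflow import DAG
from airflow.utils import timezone
from airflow.providers.apache.sqoop.operators.sqoop import SqoopOperator

# Hypothetical PostgreSQL variant of the MySQL example above.
dag = DAG(
    'test_kerberos_sqoop_postgres',
    default_args={'retry_delay': timedelta(seconds=5), 'depends_on_past': False},
    start_date=parser.isoparse("2023-10-16T20:20:04.268Z").replace(tzinfo=timezone.utc),
    schedule_interval=None,
    catchup=False,
    is_paused_upon_creation=False
)

sqoop_postgres_import = SqoopOperator(
    conn_id="test_kerberos_sqoop_postgres",   # connection created with a jdbc:postgresql host and port 5432
    table="HOSTS",                            # adjust to a table that exists in your database
    cmd_type="import",
    target_dir="hdfs:///tmp/sqoopPostgres",   # illustrative HDFS target directory
    driver="org.postgresql.Driver",           # PostgreSQL JDBC driver class
    num_mappers=1,
    task_id="sqoop_import",
    dag=dag
)

sqoop_postgres_import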
-
Create an Airflow job using the created DAG file, add the JDBC storage connector JAR to a Cloudera Data Engineering resource, and then upload that resource to the job. For more information about creating an Airflow job, see Creating an Airflow DAG using the Pipeline UI.
-
Once the job is created, run the Airflow Job.
