You can add custom Python packages for Airflow with Cloudera Data Engineering (CDE). Cloudera provides access, through the UI, to the open source packages that you can use for your Airflow jobs.
Installing an operator package alone is not always sufficient: if the operator requires additional runtime dependencies, such as binaries on the path or environment configuration like Kerberos and cloud credentials, the operator will not work. To use a third-party Airflow operator from your custom library and operator package, you must also configure the corresponding Airflow connection in the Airflow UI.
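For context, here is a minimal sketch of why the connection matters: a custom operator typically resolves its endpoint and credentials from an Airflow connection at runtime rather than at install time, so the package can install cleanly and still fail when it runs. The operator class MyServiceOperator and the connection ID my_service_default below are hypothetical names used only for illustration.

    from airflow.hooks.base import BaseHook
    from airflow.models import BaseOperator


    class MyServiceOperator(BaseOperator):
        """Illustrative operator that reads its host and credentials
        from an Airflow connection instead of hard-coding them."""

        def __init__(self, conn_id: str = "my_service_default", **kwargs):
            super().__init__(**kwargs)
            self.conn_id = conn_id

        def execute(self, context):
            # This lookup fails at runtime if the connection has not been
            # configured in the Airflow UI, even though the package itself
            # installed without errors.
            conn = BaseHook.get_connection(self.conn_id)
            self.log.info("Connecting to %s as %s", conn.host, conn.login)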
In the Cloudera Data Platform (CDP) console, click the
Data Engineering tile. The CDE Home page
displays.
Click Administration in the left navigation menu. The
Administration page displays.
Locate the Virtual Cluster that you want to edit, and click Cluster
Details.
Go to the Airflow tab. The Libraries and Operators page displays.
Under the Configure Repositories section, complete the following fields to configure the Python Package Index (PyPI) repositories used to source your custom libraries and operators:
PyPI Repository URL - Enter the Python Package Index (PyPI) URL, for example, https://pypi.org/simple/ for the public index.
Optional: SSL Certificate - Enter the PEM-encoded CA certificate.
Optional: Authorization Credentials - If you are configuring a private or protected PyPI repository that requires authorization for access, enter the Username and Password.
Click Validate Configurations.
Under the Build section, upload a
requirements.txt file that contains a list of all library and
operator packages that you want to enable. Once uploaded, the system will automatically
build and install your packages.
You can specify any Python package that is compatible with the Airflow Python constraints for the Airflow version that your virtual cluster runs.
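For example, a requirements.txt that enables one provider package and one internal operator package might look like the following. The package names and versions are illustrative; pin versions that satisfy your Airflow constraints.

    apache-airflow-providers-slack==8.6.0
    requests==2.31.0
    my-company-operators==1.2.3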
Click Activate. Activation restarts the Airflow server and may take a few minutes. Once activation is complete, the Installed Packages list displays.
You can now create and run an Airflow job using the custom library and operators that you have activated.
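As a closing sketch, a DAG that uses a third-party operator enabled through this process might look like the following. SlackAPIPostOperator from the apache-airflow-providers-slack package is used only as an example of an installed operator, and slack_default is assumed to be a connection that you have already configured in the Airflow UI.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.slack.operators.slack import SlackAPIPostOperator

    # DAG that calls a third-party operator installed through the
    # custom requirements.txt build on the virtual cluster.
    with DAG(
        dag_id="custom_operator_example",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        notify = SlackAPIPostOperator(
            task_id="notify_slack",
            slack_conn_id="slack_default",  # connection configured in the Airflow UI
            channel="#data-engineering",
            text="CDE Airflow job finished.",
        )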