Install custom Python libraries in flow deployments
If your data flow requires custom Python packages, you can modify your Python script to install these dependencies at runtime and run the script from a NiFi processor.
- Create a Python script that installs the package you want to add.
  Replace [***PACKAGE NAME***] with the name of the package you want to import and [***IMPORT AS***] with a meaningful name you want the package to be called in your data flow.

      #!/usr/bin/python3
      try:
          import [***PACKAGE NAME***] as [***IMPORT AS***]
      except ImportError:
          from pip._internal import main as pip
          pip(['install', '--user', '[***PACKAGE NAME***]'])
          import [***PACKAGE NAME***] as [***IMPORT AS***]

      import sys

      file = [***IMPORT AS***].read_csv(sys.stdin)

  For example, to install pandas and import it as pd:

      #!/usr/bin/python3
      try:
          import pandas as pd
      except ImportError:
          from pip._internal import main as pip
          pip(['install', '--user', 'pandas'])
          import pandas as pd

      import sys

      file = pd.read_csv(sys.stdin)
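  The example script reads the incoming flowfile content but does not yet produce any output. The ExecuteStreamCommand processor configured in a later step pipes the flowfile content to the script's standard input and, by default, captures the script's standard output as the content of the outgoing flowfile, so a complete script typically ends by writing its result to stdout. A minimal sketch of that pattern, using an illustrative dropna() step as a stand-in for your own processing logic:

      #!/usr/bin/python3
      try:
          import pandas as pd
      except ImportError:
          from pip._internal import main as pip
          pip(['install', '--user', 'pandas'])
          import pandas as pd

      import sys

      # ExecuteStreamCommand delivers the flowfile content on stdin.
      file = pd.read_csv(sys.stdin)

      # Illustrative transformation only; replace with your own logic.
      file = file.dropna()

      # Anything written to stdout becomes the outgoing flowfile content.
      file.to_csv(sys.stdout, index=False)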
- In NiFi, open the flow that requires the custom packages.
- Add and configure an ExecuteStreamCommand processor to run your script. Make the following property settings:
  - Command Arguments: provide #{Script}
  - Command Path: provide python
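  With these settings the processor runs python with the uploaded script as its argument and streams each flowfile's content through it. To sanity-check your script before deploying, you can simulate that invocation locally; the sketch below assumes placeholder file names script.py and sample.csv:

      #!/usr/bin/python3
      # Local simulation of the ExecuteStreamCommand invocation.
      # script.py and sample.csv are hypothetical names for this sketch.
      import subprocess

      with open('sample.csv', 'rb') as flowfile:
          result = subprocess.run(
              ['python', 'script.py'],  # Command Path plus Command Arguments
              stdin=flowfile,           # flowfile content arrives on stdin
              capture_output=True,
              check=True,
          )

      # In NiFi, this stdout would become the outgoing flowfile content.
      print(result.stdout.decode())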
- Add and configure a processor that allows uploading file-based resources as part of its deployment in CDF-PC. This procedure uses FetchHDFS as an example. For this processor, make the following configuration:
  - Hadoop Configuration Resources: provide #{Script}

  Leave all other parameters with their default values. Referencing the #{Script} parameter from a property that accepts file-based resources is what makes the Script parameter available as a file upload in the Deployment Wizard.
- Download the data flow as a flow definition from NiFi and import it to Cloudera DataFlow.
- Initiate a flow deployment from the Catalog. In the Parameters step of the Deployment Wizard, upload your Python script to the Script parameter. Complete the Wizard and submit your deployment request.