Install custom Python libraries in flow deployments

If your data flow requires custom Python packages, you can modify your Python script to install these dependencies and run the script with a NiFi processor.

  1. Create a Python script that installs the package you want to add:
    #!/usr/bin/python3
    try: import [***PACKAGE NAME***] as [***IMPORT AS***]
    except ImportError:
        from pip._internal import main as pip
        pip(['install', '--user', '[***PACKAGE NAME***]'])
        import [***PACKAGE NAME***] as [***IMPORT AS***]
    import sys
    file = [***IMPORT AS***].read_csv(sys.stdin)
    Replace [***PACKAGE NAME***] with the name of the package you want to import and [***IMPORT AS***] with the alias you want to use for the package in your script. For example, to install and import pandas:
    #!/usr/bin/python3
    try: import pandas as pd
    except ImportError:
        from pip._internal import main as pip
        pip(['install', '--user', 'pandas'])
        import pandas as pd
    import sys
    file = pd.read_csv(sys.stdin)
  2. Open the flow definition which requires custom packages in NiFi.
  3. Add and configure an ExecuteStreamCommand processor to run your script.
    Set the following properties:
    Command Arguments: #{Script}
    Command Path: python
    Leave all other properties at their default values.
  4. If you have edited your data flow in NiFi, download it as a flow definition and import it into Cloudera DataFlow. If you have edited your data flow in the Flow Designer, publish the flow to the Catalog.
  5. Initiate a flow deployment from the Catalog. In the Parameters step of the Deployment Wizard, upload your Python script to the Script parameter. Upload additional files to the AdditionalResources parameter if applicable. Complete the Wizard and submit your deployment request.
Your Python script is uploaded to the flow deployment, and the required custom libraries are installed when the script runs as part of the data flow.
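
The install-on-import pattern from step 1 can also be written as a reusable helper. The following is a minimal sketch, not part of the documented procedure: the function name ensure_package and its import_name parameter are hypothetical. It runs pip through the current interpreter (sys.executable -m pip) instead of importing pip's internal modules, and it assumes that pip is available to that interpreter and that the environment can reach a package index.

```python
#!/usr/bin/python3
import importlib
import site
import subprocess
import sys


def ensure_package(package_name, import_name=None):
    """Import a module, installing its package with pip first if it is missing.

    ensure_package is a hypothetical helper name used for illustration.
    """
    import_name = import_name or package_name
    try:
        return importlib.import_module(import_name)
    except ImportError:
        # Run pip with the same interpreter that is executing this script.
        # pip's output is sent to stderr so that stdout carries only the
        # script's own output (assumption: the processor captures stdout).
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", package_name],
            stdout=sys.stderr,
        )
        # The user site directory may not be on sys.path if it did not
        # exist when the interpreter started, so add it before retrying.
        user_site = site.getusersitepackages()
        if user_site not in sys.path:
            sys.path.append(user_site)
        importlib.invalidate_caches()
        return importlib.import_module(import_name)
```

In a script, a call such as pd = ensure_package("pandas") would take the place of the try/except block, and the rest of the script can use pd as before. The import_name argument covers packages whose install name differs from the module name.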