Supported NiFi Python components [Technical Preview]

This release is based on Apache NiFi 1.25 and 2.0.0. Apache NiFi 2.0 introduces a set of NiFi components written in Python, most of which are supported by Cloudera.

Familiarize yourself with the supported Python components, and avoid using any unsupported Python component in production environments.

While additional Python components are developed and tested by the community, they are not officially supported by Cloudera. Python components may be excluded for various reasons, such as insufficient reliability, incomplete test coverage, a community declaration of non-production readiness, or deviation from Cloudera best practices. Do not use these unsupported Python components in your production environments.

Supported Python components:

ChunkDocument
Divides a large text document into smaller chunks. Input is expected in the form of a FlowFile containing a JSON Lines document, where each line includes a 'text' and a 'metadata' element.
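The expected input shape can be sketched in Python. This is illustrative only: the 'text' and 'metadata' field names come from the component description above, while the sample values and record contents are made up.

```python
import json

# Build a JSON Lines document in the shape the chunking component expects:
# one JSON object per line, each carrying a 'text' and a 'metadata' element.
# (Field names are from the component docs; the sample values are invented.)
records = [
    {"text": "First document body ...", "metadata": {"source": "a.pdf"}},
    {"text": "Second document body ...", "metadata": {"source": "b.pdf"}},
]

jsonl = "\n".join(json.dumps(r) for r in records)

# Each line parses back to an object carrying both required keys.
for line in jsonl.splitlines():
    obj = json.loads(line)
    assert "text" in obj and "metadata" in obj
```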

ParseDocument
Parses incoming unstructured text documents and performs optical character recognition (OCR) to extract text from PDF and image files. The output is formatted as 'json-lines' with two keys: 'text' and 'metadata'. This processor may require significant storage space and RAM because of the third-party dependencies needed to process PDF and image files. You must also install Tesseract and Poppler on your system to process PDFs or images.
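Because the Tesseract and Poppler requirement is external to NiFi, a quick preflight check on the host can save debugging time. A minimal sketch, assuming the Tesseract CLI is named `tesseract` and that `pdftotext` (shipped with Poppler) is a reasonable proxy for a Poppler installation; the exact binaries your deployment needs may differ.

```python
import shutil

def missing_ocr_dependencies():
    """Report which external OCR tools are absent from PATH.

    Illustrative helper (not part of the processor's API): checks for the
    'tesseract' CLI and Poppler's 'pdftotext'.
    """
    required = ["tesseract", "pdftotext"]
    return [tool for tool in required if shutil.which(tool) is None]

# An empty list means both tools were found on this system.
missing = missing_ocr_dependencies()
```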

PromptChatGPT
Submits a prompt to ChatGPT, writing the results either to a FlowFile attribute or to the contents of the FlowFile.

PutChroma
Publishes JSON data to a Chroma VectorDB. The incoming data must be in JSON Lines format, with each line containing two keys: 'text' and 'metadata'. The text must be a string, and the metadata must be a map with string values. Any additional fields are ignored. If the specified collection does not exist, the processor creates it automatically.
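The validation rules above can be expressed as a small Python check. This is a hypothetical helper for pre-validating records before they reach the processor, not part of the processor's API; only the field names and constraints come from the description above.

```python
import json

def normalize_chroma_record(line: str) -> dict:
    """Validate one JSON line against the stated rules.

    Illustrative only: 'text' must be a string, 'metadata' a map of string
    values; any additional fields are dropped, mirroring the documented
    behavior of ignoring extra fields.
    """
    obj = json.loads(line)
    if not isinstance(obj.get("text"), str):
        raise ValueError("'text' must be a string")
    metadata = obj.get("metadata", {})
    if not (isinstance(metadata, dict)
            and all(isinstance(v, str) for v in metadata.values())):
        raise ValueError("'metadata' must be a map with string values")
    return {"text": obj["text"], "metadata": metadata}

# The extra field is silently dropped; 'text' and 'metadata' pass through.
record = normalize_chroma_record(
    '{"text": "hello", "metadata": {"source": "doc1"}, "extra": 42}'
)
```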

QueryChroma
Queries a Chroma Vector Database to gather a specified number of documents that are most closely related to the given query.

PutPinecone
Creates vectors/embeddings that represent text content and sends the vectors to Pinecone. This processor assumes the data is already in JSON Lines format, with the text to be stored in Pinecone provided in the 'text' field.

QueryPinecone
Queries Pinecone to gather a specified number of documents that are most closely related to the given query.