Supported NiFi Python components [Technical Preview]

Apache NiFi 2.0 introduces a set of NiFi components written in Python. Most of these Python components are supported by Cloudera.

Supported Python components:

Bedrock

Invokes different types of models with the given prompt through Amazon Bedrock.
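
For illustration only (this is not the processor's internal code), invoking a Bedrock model with a prompt through boto3 looks roughly like this; the model ID and request body schema shown are examples that vary per model family:

```python
import json
import boto3

# Illustrative sketch: invoke an Anthropic model on Amazon Bedrock.
# The model ID and body schema are examples, not processor defaults.
client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize this document."}],
    }),
)
print(json.loads(response["body"].read()))
```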

ChunkData

Processes the output of the Partition* processors and creates chunks from the input document in a standardized format, based on user-defined settings. The processor handles text content only; any other data, such as images, is silently ignored. The output is a JSON document.

ChunkDocument

Divides a large text document into smaller chunks. Input is expected in the form of a FlowFile containing a JSON Lines document, where each line includes a 'text' and a 'metadata' element.
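
For illustration, a minimal Python sketch producing the kind of JSON Lines content ChunkDocument expects (the field values are made up):

```python
import json

# Each line is one JSON object with a 'text' and a 'metadata' element.
lines = [
    {"text": "First section of the document...", "metadata": {"source": "report.pdf", "page": 1}},
    {"text": "Second section of the document...", "metadata": {"source": "report.pdf", "page": 2}},
]
flowfile_content = "\n".join(json.dumps(line) for line in lines)
print(flowfile_content)
```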

EmbedData

Embeds incoming data using a locally available model. The processor either embeds the entire incoming content or specific values of an incoming JSON input. Models can be downloaded, for example, from huggingface.co by cloning the model's repository.
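
As a hypothetical sketch of this workflow (the sentence-transformers library is an assumption, not a statement of the processor's actual embedding backend), embedding text with a locally cloned model could look like this:

```python
# Assumes a model cloned locally, for example with:
#   git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./all-MiniLM-L6-v2")  # local model directory
vector = model.encode("Some incoming text to embed").tolist()
print(len(vector), vector[:5])  # embedding dimension and first components
```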

InsertToMilvus

Inserts or updates a vector in a Milvus collection. The input data is expected to be a float vector in JSON format (the dimension of the input must match the dimension of the collection). It is usually used together with the EmbedData processor, which provides the float vector as the input for InsertToMilvus.
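
For illustration, an equivalent insert with the pymilvus client (a sketch, not the processor's code; assumes a recent pymilvus and a local Milvus instance):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus
# The vector mirrors the JSON float vector InsertToMilvus expects; its
# length must match the dimension of the collection's vector field.
client.insert(
    collection_name="documents",
    data=[{"id": 1, "vector": [0.12, -0.05, 0.33, 0.71], "text": "chunk one"}],
)
```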

LexicalQueryMilvus

Performs a lexical search on a Milvus collection. The processor can query Milvus either by a list of IDs or by a filter. The IDs can be specified as a comma-separated list in a specified attribute or in the content of the FlowFile. If the IDs are extracted from the content, the FlowFile content should be a JSON array of Milvus element objects, either in the format output by the VectorQueryMilvus processor (a list of lists) or as a simple JSON list of Milvus objects. Each Milvus object is expected to have at least the primary key field specified.
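
For illustration, the two query modes map onto pymilvus roughly as follows (a sketch with made-up collection and filter values; assumes a recent pymilvus):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus
# Query by a list of primary key IDs:
by_ids = client.query(collection_name="documents", ids=[1, 2, 3])
# Query by a filter expression:
by_filter = client.query(collection_name="documents", filter="page > 10")
print(by_ids, by_filter)
```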

ParseDocument

Parses incoming unstructured text documents and performs optical character recognition (OCR) to extract text from PDF and image files. The output is formatted as 'json-lines' with two keys: 'text' and 'metadata'. This processor may require significant storage space and RAM due to the third-party dependencies needed for processing PDF and image files. Additionally, Tesseract and Poppler must be installed on your system to enable the processing of PDFs or images.
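
As a hypothetical sketch of why these system dependencies are needed (pdf2image and pytesseract are assumed libraries, not a statement of the processor's internals): Poppler rasterizes PDF pages, and Tesseract runs OCR on the resulting images:

```python
from pdf2image import convert_from_path  # requires Poppler installed
import pytesseract                       # requires Tesseract installed

pages = convert_from_path("scanned.pdf")
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text[:200])
```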

PartitionCsv

Partitions a CSV file using the partition_csv function of unstructured.io. Properties are forwarded to partition_csv as parameters. The output is a JSON document in the format output by partition_csv.

PartitionDocx

Partitions DOCX data using the partition_docx function of unstructured.io. Properties are forwarded to partition_docx as parameters. The output is a JSON document in the format output by partition_docx.

PartitionHtml

Partitions HTML data using the partition_html function of unstructured.io. Properties are forwarded to partition_html as parameters. The output is a JSON document in the format output by partition_html.

PartitionPdf

Partitions a PDF file using the partition_pdf function of unstructured.io. Properties are forwarded to partition_pdf as parameters. The output is a JSON document in the format output by partition_pdf.

PartitionText

Partitions a text file using the partition_text function of unstructured.io. Properties are forwarded to partition_text as parameters. The output is a JSON document in the format output by partition_text.
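
The five Partition* processors above all follow the same pattern; for illustration, calling the underlying unstructured.io function directly looks like this (partition_text is shown, and the CSV, DOCX, HTML, and PDF variants work analogously):

```python
from unstructured.partition.text import partition_text
from unstructured.staging.base import elements_to_json

# Partition raw text into elements, then serialize them to JSON in the
# format output by partition_text.
elements = partition_text(text="Report Title\n\nA paragraph of body text.")
print(elements_to_json(elements))
```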

PromptChatGPT

Submits a prompt to ChatGPT, writing the results either to a FlowFile attribute or to the contents of the FlowFile.
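
For illustration, a prompt submission resembling this behavior with the OpenAI Python client (a sketch; the model name is an example, not a processor default):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
)
# The processor would write this result to a FlowFile attribute or to
# the FlowFile content.
print(reply.choices[0].message.content)
```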

PutChroma

Publishes JSON data to a Chroma VectorDB. The incoming data must be in JSON Lines format (one JSON object per line), each line containing two keys: 'text' and 'metadata'. The text must be a string, while the metadata must be a map with string values. Any additional fields are ignored. If the specified collection does not exist, the processor creates it automatically.
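
For illustration, the equivalent operations with the chromadb client (a sketch mirroring the described behavior, with made-up values):

```python
import chromadb

client = chromadb.Client()
# Created automatically if the collection does not exist yet.
collection = client.get_or_create_collection("articles")
collection.add(
    ids=["doc-1"],
    documents=["Some chunked text."],                 # the 'text' key
    metadatas=[{"source": "report.pdf", "page": 1}],  # the 'metadata' key
)
```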

PutOpenSearchVector

Publishes JSON data to OpenSearch. The incoming data must be in JSON Lines format (one JSON object per line), each line containing two keys: 'text' and 'metadata'. The text must be a string, while the metadata must be a map with string values. Any additional fields are ignored.

PutPinecone

Creates vectors/embeddings that represent text content and sends the vectors to Pinecone. This use case assumes that the data has already been formatted as JSON Lines, with the text to be stored in Pinecone provided in the 'text' field.

PutQdrant

Publishes JSON data to Qdrant. The incoming data must be in JSON Lines format (one JSON object per line), each line containing two keys: 'text' and 'metadata'. The text must be a string, while the metadata must be a map with string values. Any additional fields are ignored.

QueryChroma

Queries a Chroma Vector Database to gather a specified number of documents that are most closely related to the given query.
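
For illustration, a comparable query with the chromadb client (a sketch with made-up values):

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("articles")
# Gather the five documents most closely related to the query.
results = collection.query(query_texts=["What does the report conclude?"], n_results=5)
print(results["documents"])
```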

QueryOpenSearchVector

Queries OpenSearch to gather a specified number of documents that are most closely related to the given query.

QueryPinecone

Queries Pinecone to gather a specified number of documents that are most closely related to the given query.

QueryQdrant

Queries Qdrant to gather a specified number of documents that are most closely related to the given query.

VectorQueryMilvus

Performs a vector search in a Milvus collection. The input data is expected to be a float vector in JSON format (the dimension of the input must match the dimension of the collection). It is usually used together with the EmbedData processor, which provides the float vector as the input for VectorQueryMilvus.
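
For illustration, a comparable search with the pymilvus client (a sketch; the query vector is made up and must match the collection's dimension):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus
hits = client.search(
    collection_name="documents",
    data=[[0.12, -0.05, 0.33, 0.71]],  # one query vector, e.g. from EmbedData
    limit=5,
)
print(hits)
```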

While additional Python components are developed and tested by the community, they are not officially supported by Cloudera. Python components may be excluded from support for various reasons, such as insufficient reliability, incomplete test coverage, a community declaration of non-production readiness, or deviations from Cloudera best practices. Do not use these unsupported Python components in production environments.