Supported NiFi Python components [Technical Preview]

Apache NiFi 2.0 introduces a set of NiFi components written in Python. Most of these Python components are supported by Cloudera.

Supported Python components:

Bedrock

Invokes different types of models with the given prompt through Amazon Bedrock.
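
For illustration only (this is not the processor's internal code), invoking a Bedrock model with a prompt through boto3 looks roughly like this; the model ID and request body schema shown are examples that vary per model family:

```python
import json
import boto3

# Illustrative sketch: invoke an Anthropic model on Amazon Bedrock.
# The model ID and body schema are examples, not processor defaults.
client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize this document."}],
    }),
)
print(json.loads(response["body"].read()))
```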

ChunkData

Processes the output of the Partition* processors and creates chunks from the input document in a standardized format, based on user-defined settings. The processor handles text content only; any other data, such as images, is silently ignored. The output is a JSON document.

ChunkDocument

Divides a large text document into smaller chunks. Input is expected in the form of a FlowFile containing a JSON Lines document, where each line includes a 'text' and a 'metadata' element.
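
For illustration, a minimal Python sketch producing the kind of JSON Lines content ChunkDocument expects (the field values are made up):

```python
import json

# Each line is one JSON object with a 'text' and a 'metadata' element.
lines = [
    {"text": "First section of the document...", "metadata": {"source": "report.pdf", "page": 1}},
    {"text": "Second section of the document...", "metadata": {"source": "report.pdf", "page": 2}},
]
flowfile_content = "\n".join(json.dumps(line) for line in lines)
print(flowfile_content)
```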

EmbedData

Embeds incoming data using a locally available model. The processor either embeds the entire incoming content or specific values of an incoming JSON input. Models can be downloaded, for example, from huggingface.co by cloning the model's repository.
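
As a hypothetical sketch of this workflow (the sentence-transformers library is an assumption, not a statement of the processor's actual embedding backend), embedding text with a locally cloned model could look like this:

```python
# Assumes a model cloned locally, for example with:
#   git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./all-MiniLM-L6-v2")  # local model directory
vector = model.encode("Some incoming text to embed").tolist()
print(len(vector), vector[:5])  # embedding dimension and first components
```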

InsertToMilvus

Inserts or updates a vector in a Milvus collection. The input data is expected to be a float vector in JSON format (the dimension of the input must match the dimension of the collection). It is usually used together with the EmbedData processor, which provides the float vector as the input for InsertToMilvus.
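
For illustration, an equivalent insert with the pymilvus client (a sketch, not the processor's code; assumes a recent pymilvus and a local Milvus instance):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus
# The vector mirrors the JSON float vector InsertToMilvus expects; its
# length must match the dimension of the collection's vector field.
client.insert(
    collection_name="documents",
    data=[{"id": 1, "vector": [0.12, -0.05, 0.33, 0.71], "text": "chunk one"}],
)
```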

LexicalQueryMilvus

Performs a lexical search on a Milvus collection. The processor can query Milvus either by a list of IDs or by a filter. The IDs can be specified as a comma-separated list in a specified attribute or in the content of the FlowFile. If the IDs are extracted from the content, the FlowFile content should be a JSON array of Milvus element objects, either in the format output by the VectorQueryMilvus processor (a list of lists) or as a simple JSON list of Milvus objects. Each Milvus object is expected to have at least the primary key field specified.
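
For illustration, the two query modes map onto pymilvus roughly as follows (a sketch with made-up collection and filter values; assumes a recent pymilvus):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus
# Query by a list of primary key IDs:
by_ids = client.query(collection_name="documents", ids=[1, 2, 3])
# Query by a filter expression:
by_filter = client.query(collection_name="documents", filter="page > 10")
print(by_ids, by_filter)
```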

ParseDocument

Parses incoming unstructured text documents and performs optical character recognition (OCR) to extract text from PDF and image files. The output is formatted as 'json-lines' with two keys: 'text' and 'metadata'. This processor may require significant storage space and RAM due to the third-party dependencies needed for processing PDF and image files. Additionally, Tesseract and Poppler must be installed on your system to enable the processing of PDFs or images.
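
As a hypothetical sketch of why these system dependencies are needed (pdf2image and pytesseract are assumed libraries, not a statement of the processor's internals): Poppler rasterizes PDF pages, and Tesseract runs OCR on the resulting images:

```python
from pdf2image import convert_from_path  # requires Poppler installed
import pytesseract                       # requires Tesseract installed

pages = convert_from_path("scanned.pdf")
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text[:200])
```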

PartitionCsv

Partitions a CSV file using the partition_csv function of unstructured.io. Properties are forwarded to partition_csv as parameters. The output is a JSON document in the format output by partition_csv.

PartitionDocx

Partitions DOCX data using the partition_docx function of unstructured.io. Properties are forwarded to partition_docx as parameters. The output is a JSON document in the format output by partition_docx.

PartitionHtml

Partitions HTML data using the partition_html function of unstructured.io. Properties are forwarded to partition_html as parameters. The output is a JSON document in the format output by partition_html.

PartitionPdf

Partitions a PDF file using the partition_pdf function of unstructured.io. Properties are forwarded to partition_pdf as parameters. The output is a JSON document in the format output by partition_pdf.

PartitionText

Partitions a text file using the partition_text function of unstructured.io. Properties are forwarded to partition_text as parameters. The output is a JSON document in the format output by partition_text.
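
The five Partition* processors above all follow the same pattern; for illustration, calling the underlying unstructured.io function directly looks like this (partition_text is shown, and the CSV, DOCX, HTML, and PDF variants work analogously):

```python
from unstructured.partition.text import partition_text
from unstructured.staging.base import elements_to_json

# Partition raw text into elements, then serialize them to JSON in the
# format output by partition_text.
elements = partition_text(text="Report Title\n\nA paragraph of body text.")
print(elements_to_json(elements))
```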

PromptChatGPT

Submits a prompt to ChatGPT, writing the results either to a FlowFile attribute or to the contents of the FlowFile.
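
For illustration, a prompt submission resembling this behavior with the OpenAI Python client (a sketch; the model name is an example, not a processor default):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
)
# The processor would write this result to a FlowFile attribute or to
# the FlowFile content.
print(reply.choices[0].message.content)
```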

PutChroma

Publishes JSON data to a Chroma VectorDB. The incoming data must be in JSON Lines format (one JSON object per line), each line containing two keys: 'text' and 'metadata'. The text must be a string, while the metadata must be a map with string values. Any additional fields are ignored. If the specified collection does not exist, the processor creates it automatically.
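
For illustration, the equivalent operations with the chromadb client (a sketch mirroring the described behavior, with made-up values):

```python
import chromadb

client = chromadb.Client()
# Created automatically if the collection does not exist yet.
collection = client.get_or_create_collection("articles")
collection.add(
    ids=["doc-1"],
    documents=["Some chunked text."],                 # the 'text' key
    metadatas=[{"source": "report.pdf", "page": 1}],  # the 'metadata' key
)
```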

PutOpenSearchVector

Publishes JSON data to OpenSearch. The incoming data must be in JSON Lines format (one JSON object per line), each line containing two keys: 'text' and 'metadata'. The text must be a string, while the metadata must be a map with string values. Any additional fields are ignored.

PutPinecone

Creates vectors/embeddings that represent text content and sends the vectors to Pinecone. This use case assumes that the data has already been formatted as JSON Lines, with the text to be stored in Pinecone provided in the 'text' field.

PutQdrant

Publishes JSON data to Qdrant. The incoming data must be in JSON Lines format (one JSON object per line), each line containing two keys: 'text' and 'metadata'. The text must be a string, while the metadata must be a map with string values. Any additional fields are ignored.

QueryChroma

Queries a Chroma Vector Database to gather a specified number of documents that are most closely related to the given query.
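
For illustration, a comparable query with the chromadb client (a sketch with made-up values):

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("articles")
# Gather the five documents most closely related to the query.
results = collection.query(query_texts=["What does the report conclude?"], n_results=5)
print(results["documents"])
```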

QueryOpenSearchVector

Queries OpenSearch to gather a specified number of documents that are most closely related to the given query.

QueryPinecone

Queries Pinecone to gather a specified number of documents that are most closely related to the given query.

QueryQdrant

Queries Qdrant to gather a specified number of documents that are most closely related to the given query.

VectorQueryMilvus

Performs a vector search in a Milvus collection. The input data is expected to be a float vector in JSON format (the dimension of the input must match the dimension of the collection). It is usually used together with the EmbedData processor, which provides the float vector as the input for VectorQueryMilvus.
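
For illustration, a comparable search with the pymilvus client (a sketch; the query vector is made up and must match the collection's dimension):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus
hits = client.search(
    collection_name="documents",
    data=[[0.12, -0.05, 0.33, 0.71]],  # one query vector, e.g. from EmbedData
    limit=5,
)
print(hits)
```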

While additional Python components are developed and tested by the community, they are not officially supported by Cloudera. Python components may be excluded from support for various reasons, such as insufficient reliability, incomplete test coverage, a community declaration of non-production readiness, or deviations from Cloudera best practices. Do not use these unsupported Python components in production environments.