EmbedData

Description:

Embeds incoming data using a locally present model. The processor embeds either the entire incoming content or specific values of an incoming JSON input. Models can be downloaded, for example, from huggingface.co by cloning the model's repository.
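
To make the JSON mode concrete, the following Python sketch mirrors the behavior spelled out in the property descriptions below. It is not the processor's implementation: `fake_embed` is a hypothetical stand-in for a real embedding model, and `parse_input` and `embed_json_values` are illustrative names. The sketch assumes that a 'metadata' field holds a JSON object, that its keys override same-named root keys, and that listed fields missing from an object are skipped; the documentation specifies none of these details.

```python
import json


def fake_embed(text):
    """Hypothetical stand-in for a real embedding model: returns a tiny vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def parse_input(content):
    """Accept either a JSON array of objects or one JSON object per line."""
    stripped = content.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def embed_json_values(content, fields_to_embed, embed=fake_embed):
    """Embed the comma-separated fields in every JSON object of the input."""
    field_names = [name.strip() for name in fields_to_embed.split(",") if name.strip()]
    results = []
    for obj in parse_input(content):
        obj = dict(obj)
        # Assumption: 'metadata' is a JSON object; its entries move to the root.
        metadata = obj.get("metadata")
        if isinstance(metadata, dict):
            del obj["metadata"]
            obj.update(metadata)
        for name in field_names:
            if name in obj:  # assumption: fields missing from an object are skipped
                obj[name + "_embedding"] = embed(str(obj[name]))
        results.append(obj)
    return results


array_input = '[{"title": "a", "metadata": {"source": "x"}}]'
line_input = '{"title": "a"}\n{"title": "bc"}'
print(embed_json_values(array_input, "title"))
print(embed_json_values(line_input, "title, body"))
```

Each output object keeps its original fields and gains one '<field_key>_embedding' entry per embedded field.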

Tags:

vector, vectordb, vectorstore, embeddings, ai, artificial intelligence, ml, machine learning, text, LLM

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

| Display Name | API Name | Default Value | Description |
| --- | --- | --- | --- |
| Embed Scope | Embed Scope | Whole Input | Scope of the input to embed. If it is set to 'Whole Input', the whole input content will be embedded. If it is set to 'JSON Value', the value of the JSON fields defined in the JSON Fields to Embed property will be embedded in each JSON object of the input. In the latter case the input is expected to be a JSON array of objects or separate JSON objects line by line. In case a 'metadata' field is present in the JSON object, its content will be moved to the root of the object. |
| JSON Fields to Embed | JSON Fields to Embed | | The names of the JSON fields to embed, provided as a comma-separated list. This is only used if the Embed Scope is set to 'JSON Value'. The embedded value will be stored in the same JSON object under the key '<field_key>_embedding'. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
| Embedding Model | Embedding Model | | HuggingFace model name to download or a local directory path of the model to use for embedding the data. In case a local directory path is specified, the directory should contain the model's config.json file. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
| Keep Input Limit | Keep Input Limit | 3500 | This property limits the size of the input content that is kept in the output FlowFile. If the input content is larger than this limit, it will be replaced with an empty string. In case the input to embed is the whole input content, the input content will be kept in the 'embeddata.text' attribute of the output FlowFile if its size is below this limit. Otherwise, the 'embeddata.text' attribute will be empty. In case the input to embed is a JSON object or an array of JSON objects, the value of the JSON fields to embed will be kept in the output JSON file if their size is below this limit. Otherwise, the value of the JSON fields to embed will be replaced with an empty string. If the limit is set to below 0, the input will always be kept. |
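
The Keep Input Limit rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the processor's code: the function names are hypothetical, size is assumed to be measured with `len()` on the text (the documentation does not say whether it counts characters or bytes), and text exactly at the limit is assumed to be kept, since the description only says that larger input is replaced.

```python
def apply_keep_limit(text, keep_input_limit=3500):
    """Return the text to keep in the output, or '' when it exceeds the limit."""
    if keep_input_limit < 0:
        return text  # a limit below 0 means the input is always kept
    # Assumptions: size is len(text); text exactly at the limit is kept.
    return text if len(text) <= keep_input_limit else ""


def whole_input_attributes(content, keep_input_limit=3500):
    """'Whole Input' scope: the kept text goes into the 'embeddata.text' attribute."""
    return {"embeddata.text": apply_keep_limit(content, keep_input_limit)}


def json_value_output(obj, field_names, keep_input_limit=3500):
    """'JSON Value' scope: oversized embedded fields are blanked in the output JSON."""
    out = dict(obj)
    for name in field_names:
        if name in out and isinstance(out[name], str):
            out[name] = apply_keep_limit(out[name], keep_input_limit)
    return out


print(whole_input_attributes("short text"))
print(json_value_output({"title": "long text", "id": 1}, ["title"], keep_input_limit=4))
```

Only the source text is affected by the limit; the sketch does not touch any embedding values stored alongside it.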