ChunkData

Description:

Processes the output of Partition* processors and creates chunks from the input document in a standardized format, based on user-defined settings. The processor handles only text content; any other data, such as images, is silently ignored. The output is a JSON document.
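The exact output schema is not specified in this description. Purely as a hypothetical illustration (the field names below are assumptions, not the processor's actual schema), a chunked output might resemble:

```json
{
  "chunks": [
    { "text": "Introduction This section describes the system..." },
    { "text": "Configuration The following properties control..." }
  ]
}
```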

Tags:

ai, artificial intelligence, ml, machine learning, text, LLM, chunk, chunk_data

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name: Chunk By
API Name: Chunk By
Default Value: Partition
Description: Defines which strategy to use when chunking partitioned inputs. You can choose from four strategies: 'Chunk By Partition' puts each partition into its own chunk (1:1 mapping). 'Chunk By Document' collects all text partitions from the incoming FlowFile into a single chunk; texts are joined by a single whitespace. There is no semantic differentiation, so titles, footnotes, and paragraphs are all collected together. 'Chunk By Title' creates a chunk for each title partition, and any other partitions between titles are merged into the preceding title's chunk. 'Chunk By Character Count' works similarly to 'Chunk By Document', but may create multiple chunks, based on the defined character limit. For example, if your document has 2000 characters and you set the character limit to 1500, you get two chunks: one containing the first 1500 characters of the document, and another containing the remaining 500.

Display Name: Max Characters By Chunk
API Name: Max Characters By Chunk
Default Value: 10000
Description: Split/chunk the input document by this number of characters.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
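The two description-based strategies can be sketched in a few lines. This is a minimal illustration of the behavior described above, not the processor's actual implementation; the function names and the partition dictionary shape are assumptions.

```python
def chunk_by_character_count(partitions, max_characters=10000):
    """Join text partitions with a single whitespace, then split the
    result into chunks of at most max_characters characters.
    Illustrative only; not the processor's real code."""
    text = " ".join(partitions)
    return [text[i:i + max_characters]
            for i in range(0, len(text), max_characters)]

def chunk_by_title(partitions):
    """Start a new chunk at each title partition; merge any other
    partitions into the chunk of the preceding title.
    Partition shape ({'type': ..., 'text': ...}) is assumed."""
    chunks = []
    for p in partitions:
        if p["type"] == "Title" or not chunks:
            chunks.append([p["text"]])
        else:
            chunks[-1].append(p["text"])
    return [" ".join(c) for c in chunks]

# The example from the description: a 2000-character document with a
# 1500-character limit yields two chunks of 1500 and 500 characters.
chunks = chunk_by_character_count(["x" * 2000], max_characters=1500)
# [len(c) for c in chunks] == [1500, 500]
```

The character-count split is purely positional: it may cut mid-word or mid-sentence, whereas the title-based strategy preserves section boundaries.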