Splits incoming JSON Lines documents into chunks that are appropriately sized for creating text embeddings. Each line of the input is expected to have a 'text' and a 'metadata' element, and each input line is split into one or more lines in the output.
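As a rough sketch of that transformation (illustration only, not the processor's implementation; `split_fn` is a hypothetical stand-in for whatever chunking strategy is configured):

```python
import json

def chunk_jsonl(jsonl_text, split_fn):
    """Turn one JSON Lines document into another: each input line's
    'text' is split into chunks, and each chunk becomes one output
    line carrying the original line's 'metadata' unchanged."""
    out_lines = []
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        for chunk in split_fn(record["text"]):
            out_lines.append(json.dumps({"text": chunk,
                                         "metadata": record["metadata"]}))
    return "\n".join(out_lines)
```

For example, an input line `{"text": "abcdef", "metadata": {"source": "a.txt"}}` with a splitter that cuts the text in half yields two output lines, each with the same metadata.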
text, split, chunk, langchain, embeddings, vector, machine learning, ML, artificial intelligence, ai, document
In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.
Display Name | API Name | Default Value | Description |
---|---|---|---|
Chunking Strategy | Chunking Strategy | Recursively Split by Character | Specifies which splitter should be used to split the text |
Separator | Separator | \n\n,\n, , | Specifies the character sequence to use for splitting apart the text. If using a Chunking Strategy of Recursively Split by Character, it is a comma-separated list of character sequences. Meta-characters \n, \r and \t are automatically un-escaped. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Separator Format | Separator Format | Plain Text | Specifies how to interpret the value of the <Separator> property |
Chunk Size | Chunk Size | 4000 | The maximum size of a chunk that should be returned |
Chunk Overlap | Chunk Overlap | 200 | The number of characters that should be overlapped between each chunk of text |
Keep Separator | Keep Separator | false | Whether or not to keep the text separator in each chunk of data |
Strip Whitespace | Strip Whitespace | true | Whether or not to strip the whitespace at the beginning and end of each chunk |
Language | Language | python | The programming language whose syntax should be used when splitting code |
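To make the interplay of these properties concrete, here is a simplified sketch of recursive character splitting (illustration only; it ignores Chunk Overlap, Keep Separator and Strip Whitespace, and is not the processor's actual implementation, which the tags suggest is built on langchain's text splitters):

```python
def recursive_split(text, separators, chunk_size):
    """Simplified recursive character splitting: try each separator in
    order, greedily merge adjacent pieces up to chunk_size, and recurse
    with the finer-grained separators on pieces that are still too large."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep and sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= chunk_size:
                    current = candidate
                    continue
                if current:
                    chunks.append(current)
                if len(piece) > chunk_size:
                    # This piece alone exceeds the limit: recurse using
                    # only the separators after the current one.
                    chunks.extend(recursive_split(piece, separators[i + 1:],
                                                  chunk_size))
                    current = ""
                else:
                    current = piece
            if current:
                chunks.append(current)
            return chunks
    # No usable separator remains: hard-split by size as a last resort.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
```

With `separators=["\n\n", "\n", " "]` and `chunk_size=5`, the text `"aa bb cc dd"` splits on spaces and merges back into `["aa bb", "cc dd"]`, each at most 5 characters.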
Create chunks of text from a single larger chunk.
The input for this use case is expected to be a FlowFile whose content is a JSON Lines document, with each line having a 'text' and a 'metadata' element.
Chunk Plaintext data in order to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.
The input for this use case is expected to be a FlowFile whose content is a plaintext document.
On a ParseDocument processor:
Set "Input Format" to "Plain Text"
Set "Element Strategy" to "Single Document"
Connect the 'success' Relationship to ChunkDocument.
On the ChunkDocument processor, set the following properties:
"Chunking Strategy" = "Recursively Split by Character"
"Separator" = "\n\n,\n, ,"
"Separator Format" = "Plain Text"
"Chunk Size" = "4000"
"Chunk Overlap" = "200"
"Keep Separator" = "false"
Connect the 'success' Relationship to the appropriate destination to store data in the desired vector store.
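The "Separator" value used above, `\n\n,\n, ,`, is a comma-separated list whose meta-characters are un-escaped before use; note that the trailing comma yields an empty final separator, which (assuming langchain-style semantics) allows splitting at any character as a last resort. A sketch of that parsing, based only on the behavior described in the property table:

```python
def parse_separators(value):
    """Parse the comma-separated 'Separator' property value,
    un-escaping the \\n, \\r and \\t meta-characters."""
    return [part.replace("\\n", "\n")
                .replace("\\r", "\r")
                .replace("\\t", "\t")
            for part in value.split(",")]
```

For example, the default value parses to a paragraph break, a line break, a space, and an empty string, tried in that order.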
Parse and chunk the textual contents of a PDF document in order to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.
The input for this use case is expected to be a FlowFile whose content is a PDF document.
On a ParseDocument processor:
Set "Input Format" to "PDF"
Set "Element Strategy" to "Single Document"
Set "Include Extracted Metadata" to "false"
Connect the 'success' Relationship to ChunkDocument.
On the ChunkDocument processor, set the following properties:
"Chunking Strategy" = "Recursively Split by Character"
"Separator" = "\n\n,\n, ,"
"Separator Format" = "Plain Text"
"Chunk Size" = "4000"
"Chunk Overlap" = "200"
"Keep Separator" = "false"
Connect the 'success' Relationship to the appropriate destination to store data in the desired vector store.
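The "Chunk Overlap" of 200 characters means consecutive chunks share text at their boundaries, so that context spanning a boundary is not lost. A simplified fixed-window illustration (the actual splitter overlaps whole separator-delimited pieces rather than raw character windows):

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size chunks in which each chunk begins with the last
    chunk_overlap characters of the previous chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

For instance, a 10-character string with `chunk_size=4` and `chunk_overlap=2` produces chunks starting every 2 characters, each repeating the tail of its predecessor.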