ChunkDocument

Description:

Splits incoming documents that are formatted as JSON Lines into chunks that are appropriately sized for creating text embeddings. The input is expected to be in "json-lines" format, with each line containing a 'text' and a 'metadata' element. Each input line is then split into one or more lines in the output.
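
As a rough illustration of the expected input, the sketch below parses a small JSON Lines payload in Python. The sample text, the `source` metadata key, and the `read_json_lines` helper are made up for this example; only the 'text' and 'metadata' elements come from the description above.

```python
import json

# Hypothetical two-line JSON Lines payload; each line is one JSON object
# with a 'text' and a 'metadata' element.
sample_input = (
    '{"text": "First document body.", "metadata": {"source": "a.txt"}}\n'
    '{"text": "Second document body.", "metadata": {"source": "b.txt"}}\n'
)


def read_json_lines(payload):
    """Parse a JSON Lines string into a list of dicts, skipping blank lines."""
    documents = []
    for line in payload.splitlines():
        if not line.strip():
            continue
        doc = json.loads(line)
        # Fail early if a line lacks one of the two expected elements.
        if "text" not in doc or "metadata" not in doc:
            raise ValueError("each line needs 'text' and 'metadata' elements")
        documents.append(doc)
    return documents


documents = read_json_lines(sample_input)
```

Each entry in `documents` corresponds to one input line, which the processor would then split into one or more output lines.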

Tags:

text, split, chunk, langchain, embeddings, vector, machine learning, ML, artificial intelligence, ai, document

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

| Display Name | API Name | Default Value | Description |
| --- | --- | --- | --- |
| Chunking Strategy | Chunking Strategy | Recursively Split by Character | Specifies which splitter should be used to split the text |
| Separator | Separator | \n\n,\n, , | Specifies the character sequence to use for splitting apart the text. If using a Chunking Strategy of Recursively Split by Character, it is a comma-separated list of character sequences. Meta-characters \n, \r and \t are automatically un-escaped. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
| Separator Format | Separator Format | Plain Text | Specifies how to interpret the value of the <Separator> property |
| Chunk Size | Chunk Size | 4000 | The maximum size of a chunk that should be returned |
| Chunk Overlap | Chunk Overlap | 200 | The number of characters that should be overlapped between each chunk of text |
| Keep Separator | Keep Separator | false | Whether or not to keep the text separator in each chunk of data |
| Strip Whitespace | Strip Whitespace | true | Whether or not to strip the whitespace at the beginning and end of each chunk |
| Language | Language | python | The language to use for the Code's syntax |
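
The Separator description can be read as "un-escape the meta-characters, then split on commas". The sketch below follows that reading; `parse_separators` is a hypothetical helper, and the order of the two steps is an assumption rather than the processor's confirmed implementation.

```python
def parse_separators(raw):
    """Un-escape \\n, \\r and \\t, then split the value on commas.

    Assumption: un-escaping happens before the comma split.
    """
    unescaped = (
        raw.replace("\\n", "\n")
        .replace("\\r", "\r")
        .replace("\\t", "\t")
    )
    return unescaped.split(",")


# The default value, written as the literal characters typed into the property.
default_value = r"\n\n,\n, ,"
separators = parse_separators(default_value)
```

Under this reading, the default value yields four separators in order: a blank line, a single newline, a single space, and the empty string left over after the trailing comma.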

Example Use Cases:

Use Case:

Create chunks of text from a single larger chunk.

Notes:

The input for this use case is expected to be a FlowFile whose content is a JSON Lines document, with each line having a 'text' and a 'metadata' element.

Keywords:

embedding, vector, text, rag, retrieval augmented generation

Configuration:

Set "Input Format" to "Plain Text"

Set "Element Strategy" to "Single Document"



Example Use Cases Involving Other Components:

Use Case:

Chunk plaintext data in order to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.

Notes:

The input for this use case is expected to be a FlowFile whose content is a plaintext document.

Keywords:

embedding, vector, text, rag, retrieval augmented generation

Components involved:

Component Type: ParseDocument

Configuration:

Set "Input Format" to "Plain Text"

Set "Element Strategy" to "Single Document"

Connect the 'success' Relationship to ChunkDocument.



Component Type: ChunkDocument

Configuration:

Set the following properties:

"Chunking Strategy" = "Recursively Split by Character"

"Separator" = "\n\n,\n, ,"

"Separator Format" = "Plain Text"

"Chunk Size" = "4000"

"Chunk Overlap" = "200"

"Keep Separator" = "false"

Connect the 'success' Relationship to the appropriate destination to store data in the desired vector store.
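
To give a feel for how "Chunk Size" = "4000" and "Chunk Overlap" = "200" interact, here is a simplified sketch that cuts text into fixed character windows. This is not the "Recursively Split by Character" strategy, which prefers to break at the configured separators and may therefore produce shorter chunks; `fixed_window_chunks` is a hypothetical helper used only to show the size and overlap arithmetic.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a plain
    sliding window, not a separator-aware splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 10,000 characters of synthetic text so overlaps can be compared.
text = "".join(str(i % 10) for i in range(10000))
chunks = fixed_window_chunks(text, chunk_size=4000, chunk_overlap=200)
```

With these settings each new window starts 3800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.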





Use Case:

Parse and chunk the textual contents of a PDF document in order to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.

Notes:

The input for this use case is expected to be a FlowFile whose content is a PDF document.

Keywords:

pdf, embedding, vector, text, rag, retrieval augmented generation

Components involved:

Component Type: ParseDocument

Configuration:

Set "Input Format" to "PDF"

Set "Element Strategy" to "Single Document"

Set "Include Extracted Metadata" to "false"

Connect the 'success' Relationship to ChunkDocument.



Component Type: ChunkDocument

Configuration:

Set the following properties:

"Chunking Strategy" = "Recursively Split by Character"

"Separator" = "\n\n,\n, ,"

"Separator Format" = "Plain Text"

"Chunk Size" = "4000"

"Chunk Overlap" = "200"

"Keep Separator" = "false"

Connect the 'success' Relationship to the appropriate destination to store data in the desired vector store.
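
The "json-lines" output described for this use case can be pictured with the sketch below, which writes one JSON object per chunk. The `to_json_lines` helper and the `source` and `chunk_index` metadata keys are hypothetical; the metadata fields the processor actually emits are not listed here.

```python
import json


def to_json_lines(chunks, base_metadata):
    """Serialize chunks as JSON Lines, one 'text'/'metadata' object per line."""
    lines = []
    for index, chunk in enumerate(chunks):
        # Copy the document-level metadata so each chunk gets its own dict.
        metadata = dict(base_metadata)
        metadata["chunk_index"] = index  # hypothetical per-chunk field
        lines.append(json.dumps({"text": chunk, "metadata": metadata}))
    return "\n".join(lines)


output = to_json_lines(
    ["first chunk", "second chunk"],
    {"source": "report.pdf"},  # hypothetical document-level metadata
)
```

A downstream component that stores data in a vector store could then read `output` line by line and embed the 'text' element of each object.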