ChunkDocument

Description:

Splits incoming JSON Lines documents into chunks that are appropriately sized for creating text embeddings. The input is expected to be in "json-lines" format, with each line containing a 'text' and a 'metadata' element. Each input line is split into one or more lines in the output.
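As an illustration of the input and output shape described above, the following Python sketch reads a JSON Lines document, splits the 'text' of each line, and emits one output line per chunk. The `split_text` helper is a hypothetical fixed-width splitter used only to keep the example self-contained; it is not the processor's splitting logic. Copying the input 'metadata' unchanged onto every chunk is likewise an assumption made for the example.

```python
import json


def split_text(text, chunk_size):
    # Hypothetical stand-in splitter: fixed-width slices with no overlap.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_json_lines(content, chunk_size=4000):
    """Turn each input JSON line into one or more output JSON lines."""
    output_lines = []
    for line in content.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        doc = json.loads(line)
        for piece in split_text(doc["text"], chunk_size):
            # Assumption: the input metadata is carried over as-is.
            output_lines.append(
                json.dumps({"text": piece, "metadata": doc["metadata"]})
            )
    return "\n".join(output_lines)


sample = json.dumps({"text": "abcdefghij", "metadata": {"source": "example.txt"}})
print(chunk_json_lines(sample, chunk_size=4))
```

With a chunk size of 4, the single ten-character input line becomes three output lines, each a JSON object with its own 'text' and 'metadata' elements.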

Tags:

text, split, chunk, langchain, embeddings, vector, machine learning, ML, artificial intelligence, ai, document

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name	API Name	Default Value	Description
Chunking Strategy	Chunking Strategy	Recursively Split by Character	Specifies which splitter should be used to split the text
Separator	Separator	\n\n,\n, ,	Specifies the character sequence to use for splitting the text. When the Chunking Strategy is Recursively Split by Character, this is a comma-separated list of character sequences. The meta-characters \n, \r and \t are automatically un-escaped. Supports Expression Language: true (evaluated using FlowFile attributes and environment variables)
Separator Format	Separator Format	Plain Text	Specifies how to interpret the value of the <Separator> property
Chunk Size	Chunk Size	4000	The maximum size of a chunk that should be returned
Chunk Overlap	Chunk Overlap	200	The number of characters that consecutive chunks of text should share
Keep Separator	Keep Separator	false	Whether or not to keep the text separator in each chunk of data
Strip Whitespace	Strip Whitespace	true	Whether or not to strip the whitespace at the beginning and end of each chunk
Language	Language	python	The programming language whose syntax is used when splitting source code
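The Separator description above can be sketched in a few lines of Python: un-escape the three meta-characters, then split on commas. The function name `parse_separators` is hypothetical, and how the processor treats the empty entry produced by the trailing comma in the default value is an assumption; this sketch simply keeps it as a final empty-string entry.

```python
def parse_separators(raw):
    """Convert a Separator property value into a list of separators."""
    # Un-escape the meta-characters named in the property description.
    unescaped = (
        raw.replace("\\n", "\n")
           .replace("\\r", "\r")
           .replace("\\t", "\t")
    )
    # Split on commas; a trailing comma yields a final empty string.
    return unescaped.split(",")


# The default property value, written as a Python string literal.
separators = parse_separators("\\n\\n,\\n, ,")
print(separators)  # ['\n\n', '\n', ' ', '']
```

Read this way, the default value lists separators from coarsest to finest: paragraph breaks, line breaks, then spaces.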

Example Use Cases:

Use Case:

Create chunks of text from a single larger chunk.

Notes:

The input for this use case is expected to be a FlowFile whose content is a JSON Lines document, with each line having a 'text' and a 'metadata' element.

Keywords:

embedding, vector, text, rag, retrieval augmented generation

Configuration:

Set "Input Format" to "Plain Text"

Set "Element Strategy" to "Single Document"

Example Use Cases Involving Other Components:

Use Case:

Chunk plaintext data to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.

Notes:

The input for this use case is expected to be a FlowFile whose content is a plaintext document.

Keywords:

embedding, vector, text, rag, retrieval augmented generation

Components involved:

Component Type: ParseDocument

Configuration:

Set "Input Format" to "Plain Text"

Set "Element Strategy" to "Single Document"

Connect the 'success' Relationship to ChunkDocument.

Component Type: ChunkDocument

Configuration:

Set the following properties:

"Chunking Strategy" = "Recursively Split by Character"

"Separator" = "\n\n,\n, ,"

"Separator Format" = "Plain Text"

"Chunk Size" = "4000"

"Chunk Overlap" = "200"

"Keep Separator" = "false"

Connect the 'success' Relationship to the appropriate destination to store data in the desired vector store.
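To get a feel for how the "Chunk Size" and "Chunk Overlap" values above interact, the following Python sketch cuts text into fixed windows of at most 4000 characters, each starting 200 characters before the previous one ended. This is a deliberate simplification: it ignores separators entirely, so it is not the "Recursively Split by Character" strategy, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=4000, chunk_overlap=200):
    """Cut text into windows of at most chunk_size characters,
    where each window repeats the last chunk_overlap characters
    of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


text = "x" * 10000
chunks = sliding_chunks(text)
print([len(c) for c in chunks])  # [4000, 4000, 2400]
```

Under these simplified assumptions, a 10,000-character document yields three chunks. A separator-aware splitter will usually produce chunks shorter than the maximum, because it breaks at the nearest separator rather than at an exact character offset.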

Use Case:

Parse and chunk the textual contents of a PDF document in order to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.

Notes:

The input for this use case is expected to be a FlowFile whose content is a PDF document.

Keywords:

pdf, embedding, vector, text, rag, retrieval augmented generation

Components involved:

Component Type: ParseDocument

Configuration:

Set "Input Format" to "PDF"

Set "Element Strategy" to "Single Document"

Set "Include Extracted Metadata" to "false"

Connect the 'success' Relationship to ChunkDocument.

Component Type: ChunkDocument

Configuration:

Set the following properties:

"Chunking Strategy" = "Recursively Split by Character"

"Separator" = "\n\n,\n, ,"

"Separator Format" = "Plain Text"

"Chunk Size" = "4000"

"Chunk Overlap" = "200"

"Keep Separator" = "false"

Connect the 'success' Relationship to the appropriate destination to store data in the desired vector store.