PartitionDocx

Description:

Partitions DOCX data using the partition_docx function of unstructured.io. Property values are forwarded to partition_docx as parameters. The output is a JSON document in the format produced by partition_docx.
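As a rough illustration of consuming the output downstream, the sketch below parses a small hand-written sample shaped like the element JSON that unstructured emits (a list of objects with type, text, and metadata fields) and groups the text by element type. The sample content and the helper name group_text_by_type are assumptions made for illustration; this is not output captured from the processor.

```python
import json
from collections import defaultdict

# Hypothetical sample, hand-written to resemble unstructured's element JSON.
SAMPLE_OUTPUT = """
[
  {"type": "Title", "element_id": "a1", "text": "Quarterly Report",
   "metadata": {"filename": "report.docx", "languages": ["eng"], "page_number": 1}},
  {"type": "NarrativeText", "element_id": "b2", "text": "Revenue grew in the third quarter.",
   "metadata": {"filename": "report.docx", "languages": ["eng"], "page_number": 1}},
  {"type": "PageBreak", "element_id": "c3", "text": "",
   "metadata": {"filename": "report.docx"}},
  {"type": "Table", "element_id": "d4", "text": "Region Sales",
   "metadata": {"filename": "report.docx", "page_number": 2,
                "text_as_html": "<table><tr><td>Region</td><td>Sales</td></tr></table>"}}
]
"""


def group_text_by_type(raw_json: str) -> dict:
    """Group element text by element type, skipping elements with empty text."""
    grouped = defaultdict(list)
    for element in json.loads(raw_json):
        if element.get("text"):
            grouped[element["type"]].append(element["text"])
    return dict(grouped)


grouped = group_text_by_type(SAMPLE_OUTPUT)
print(grouped["Title"])
```

Elements with empty text, such as the PageBreak entry in the sample, are skipped here; a flow that needs page boundaries would keep them instead.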

Tags:

ai, artificial intelligence, ml, machine learning, text, LLM, partition, docx, partition_docx

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Description
Languages | Languages | | Comma-separated list of 3-letter language codes to be used as metadata.languages. If unset, the language is detected via langdetect. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Detect Language Per Element | Detect Language Per Element | false | Detect language per element instead of at the document level. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Metadata Last Modified | Metadata Last Modified | | Date-time to include in the metadata as last_modified. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Metadata Filename | Metadata Filename | | If present, will be included in the metadata as filename. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Strategy | Strategy | fast | How to process image-like input. NOTE: strategies other than 'fast' can be very slow, use them at your own risk.
Infer Table Structure | Infer Table Structure | true | If true, add text_as_html field to metadata on extracted tables. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Starting Page Number | Starting Page Number | 1 | Assign this number to the first page of the document and increment the page number from there. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Include Page Breaks | Include Page Breaks | true | When true, add a PageBreak element to the output where a page break is detected. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
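
The forwarding of property values to partition_docx could look roughly like the sketch below. It converts the string values of the properties listed above into typed keyword arguments, applying the listed defaults when a property is unset. The helper names parse_bool and build_partition_kwargs are hypothetical, and the keyword names are assumed to mirror the partition_docx parameters; this is not the processor's actual source.

```python
def parse_bool(value: str) -> bool:
    """Interpret a property string such as 'true' or 'False' as a boolean."""
    return value.strip().lower() == "true"


def build_partition_kwargs(props: dict) -> dict:
    """Translate property strings into keyword arguments (hypothetical helper).

    Keys of `props` are the display names from the property table; the
    keyword names are assumed to match partition_docx's parameters.
    """
    kwargs = {
        "detect_language_per_element": parse_bool(
            props.get("Detect Language Per Element", "false")),
        "strategy": props.get("Strategy", "fast"),
        "infer_table_structure": parse_bool(
            props.get("Infer Table Structure", "true")),
        "starting_page_number": int(props.get("Starting Page Number", "1")),
        "include_page_breaks": parse_bool(
            props.get("Include Page Breaks", "true")),
    }

    # Languages is a comma-separated list; omit it when unset so that
    # language detection can take over.
    languages = props.get("Languages", "").strip()
    if languages:
        kwargs["languages"] = [
            code.strip() for code in languages.split(",") if code.strip()]

    # Optional metadata overrides are only passed when a value is present.
    optional = (
        ("Metadata Last Modified", "metadata_last_modified"),
        ("Metadata Filename", "metadata_filename"),
    )
    for prop_name, kwarg_name in optional:
        value = props.get(prop_name, "").strip()
        if value:
            kwargs[kwarg_name] = value

    return kwargs


kwargs = build_partition_kwargs(
    {"Languages": "eng, deu", "Starting Page Number": "5"})
print(kwargs)
```

With unstructured installed, a dictionary like this could then be unpacked into a call such as partition_docx(file=docx_stream, **kwargs), where docx_stream is a placeholder for the incoming DOCX bytes wrapped in a file-like object.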