Partitions DOCX data using the partition_docx function of unstructured.io. Properties are forwarded to partition_docx as parameters. The output is a JSON document in the format output by partition_docx.
ai, artificial intelligence, ml, machine learning, text, LLM, partition, docx, partition_docx
In the table below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.
Display Name | API Name | Default Value | Description |
---|---|---|---|
Languages | Languages | | Comma-separated list of 3-letter language codes to be used as metadata.languages. If unset, the language is detected via langdetect. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Detect Language Per Element | Detect Language Per Element | false | Detect language per element instead of at the document level. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Metadata Last Modified | Metadata Last Modified | | Date-time to include in the metadata as last_modified. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Metadata Filename | Metadata Filename | | If present, will be included in the metadata as filename. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Strategy | Strategy | fast | How to process image-like input. NOTE: strategies other than 'fast' can be very slow; use them at your own risk. |
Infer Table Structure | Infer Table Structure | true | If true, add text_as_html field to metadata on extracted tables. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Starting Page Number | Starting Page Number | 1 | Assign this number to the first page of the document and increment the page number from there. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Include Page Breaks | Include Page Breaks | true | When true, add a PageBreak element to the output where a page break is detected. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
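
The properties above reach partition_docx as parameters, so each string value has to be converted to a suitable type first. The following is a minimal sketch of that conversion, not the processor's actual code. The helper `build_partition_kwargs` is hypothetical, and the keyword names it produces (`languages`, `detect_language_per_element`, `metadata_last_modified`, `metadata_filename`, `strategy`, `infer_table_structure`, `starting_page_number`, `include_page_breaks`) are assumed to match the partition_docx signature of the installed unstructured version; check them before relying on this.

```python
from typing import Any, Dict


def _to_bool(value: str) -> bool:
    """Interpret a property string such as 'true' or 'False' as a boolean."""
    return value.strip().lower() == "true"


def build_partition_kwargs(props: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical helper: map property display names to keyword arguments.

    Defaults mirror the Default Value column of the table above. The keyword
    names are assumptions about partition_docx, not confirmed by this page.
    """
    kwargs: Dict[str, Any] = {
        "detect_language_per_element": _to_bool(
            props.get("Detect Language Per Element", "false")
        ),
        "strategy": props.get("Strategy", "fast"),
        "infer_table_structure": _to_bool(props.get("Infer Table Structure", "true")),
        "starting_page_number": int(props.get("Starting Page Number", "1")),
        "include_page_breaks": _to_bool(props.get("Include Page Breaks", "true")),
    }

    # Languages is a comma-separated list; leave it unset so that the
    # language is detected automatically when the property is empty.
    languages = [c.strip() for c in props.get("Languages", "").split(",") if c.strip()]
    if languages:
        kwargs["languages"] = languages

    # Optional metadata overrides are forwarded only when a value is present.
    if props.get("Metadata Last Modified"):
        kwargs["metadata_last_modified"] = props["Metadata Last Modified"]
    if props.get("Metadata Filename"):
        kwargs["metadata_filename"] = props["Metadata Filename"]

    return kwargs


example_kwargs = build_partition_kwargs(
    {"Languages": "eng, deu", "Starting Page Number": "3"}
)
```

With the unstructured package installed, the resulting dictionary could then be passed along the lines of `partition_docx(filename="report.docx", **example_kwargs)`, where `report.docx` is a placeholder file name.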