Partitions HTML data using the partition_html function of unstructured.io. Properties are forwarded to partition_html as parameters. The output is a JSON document in the format output by partition_html.
ai, artificial intelligence, ml, machine learning, text, LLM, partition, html, partition_html
In the table below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values and whether a property supports the NiFi Expression Language.
Display Name | API Name | Default Value | Description |
---|---|---|---|
Encoding | Encoding | UTF-8 | The character encoding used to decode the text input. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Include Metadata | Include Metadata | true | Whether to include metadata in the output. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Languages | Languages | | Comma-separated list of 3-letter language codes to be used as metadata.languages. If unset, the language is detected via langdetect. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Detect Language Per Element | Detect Language Per Element | false | Detect language per element instead of at the document level. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Metadata Last Modified | Metadata Last Modified | | Date-time to include in the metadata as last_modified. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Skip Headers and Footers | Skip Headers and Footers | false | Ignore content within <header> and <footer> tags. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
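
The properties in the table are forwarded to partition_html as parameters. The sketch below illustrates one way that forwarding could look. It is an assumption, not the processor's actual implementation: the helper names `build_partition_kwargs` and `_to_bool` are hypothetical, and the mapping from display names to keyword arguments is inferred from the property descriptions above.

```python
from typing import Any, Dict


def _to_bool(value: str) -> bool:
    """Interpret a NiFi property string such as 'true' or 'false' (hypothetical helper)."""
    return value.strip().lower() == "true"


def build_partition_kwargs(props: Dict[str, str]) -> Dict[str, Any]:
    """Translate processor property values into keyword arguments for partition_html.

    Hypothetical helper: the defaults mirror the table above, and unset
    optional properties are simply left out of the result.
    """
    kwargs: Dict[str, Any] = {
        "encoding": props.get("Encoding", "UTF-8"),
        "include_metadata": _to_bool(props.get("Include Metadata", "true")),
        "detect_language_per_element": _to_bool(
            props.get("Detect Language Per Element", "false")
        ),
        "skip_headers_and_footers": _to_bool(
            props.get("Skip Headers and Footers", "false")
        ),
    }

    # "Languages" is a comma-separated list of 3-letter codes; omit it when
    # unset so that language detection can run instead.
    languages = props.get("Languages", "").strip()
    if languages:
        kwargs["languages"] = [c.strip() for c in languages.split(",") if c.strip()]

    # Only pass a last-modified value when one was configured.
    last_modified = props.get("Metadata Last Modified", "").strip()
    if last_modified:
        kwargs["metadata_last_modified"] = last_modified

    return kwargs


# With the unstructured library installed, the result could be used like this:
#   from unstructured.partition.html import partition_html
#   elements = partition_html(text=html_text, **build_partition_kwargs(props))
```

For example, calling `build_partition_kwargs({"Languages": "eng, deu"})` yields a `languages` list of `["eng", "deu"]` alongside the table's default values for the remaining properties.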