PartitionHtml

Description:

Partitions HTML data using the partition_html function of unstructured.io. Properties are forwarded to partition_html as parameters. The output is a JSON document in the format output by partition_html.
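As a rough illustration of consuming the output downstream, the sketch below groups element text by element type using only the standard library. It assumes the output is a JSON array of objects with "type", "element_id", "text", and "metadata" keys, which is the usual shape of unstructured element JSON. The sample document and the texts_by_type helper are hypothetical and not part of the processor.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like partition_html element JSON.
# Real output typically carries more metadata fields than shown here.
sample_output = json.dumps([
    {"type": "Title", "element_id": "a1", "text": "Release notes",
     "metadata": {"languages": ["eng"]}},
    {"type": "NarrativeText", "element_id": "b2",
     "text": "This release fixes two bugs.",
     "metadata": {"languages": ["eng"]}},
])


def texts_by_type(document: str) -> dict:
    """Group the text of each element under its element type."""
    grouped = defaultdict(list)
    for element in json.loads(document):
        grouped[element["type"]].append(element["text"])
    return dict(grouped)


grouped = texts_by_type(sample_output)
```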

Tags:

ai, artificial intelligence, ml, machine learning, text, LLM, partition, html, partition_html

Properties:

In the table below, the names of required properties appear in bold. Any other properties (not in bold) are optional. The table also lists default values and indicates whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Description
Encoding | Encoding | UTF-8 | The character encoding used to decode the text input. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Include Metadata | Include Metadata | true | Whether to include metadata in the output. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Languages | Languages | | Comma-separated list of 3-letter language codes to be used as metadata.languages. If unset, the language is detected via langdetect. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Detect Language Per Element | Detect Language Per Element | false | Detect language per element instead of at the document level. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Metadata Last Modified | Metadata Last Modified | | Date-time to include in the metadata as last_modified. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Skip Headers and Footers | Skip Headers and Footers | false | Ignore content within <header> and <footer> tags. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
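
The properties above could be forwarded to partition_html roughly as sketched below. This is a minimal sketch, not the processor's actual implementation: the helper names build_partition_kwargs and partition_html_to_json are hypothetical, and the mapping assumes each keyword argument is the snake_case form of the property name, which may vary between unstructured versions. The unstructured imports are deferred so that the mapping helper can be run without the library installed.

```python
import io
from typing import Any, Dict


def build_partition_kwargs(props: Dict[str, str]) -> Dict[str, Any]:
    """Convert string-valued properties into keyword arguments (assumed mapping)."""

    def as_bool(value: str) -> bool:
        return value.strip().lower() == "true"

    kwargs: Dict[str, Any] = {
        "encoding": props.get("Encoding", "UTF-8"),
        "include_metadata": as_bool(props.get("Include Metadata", "true")),
        "detect_language_per_element": as_bool(
            props.get("Detect Language Per Element", "false")
        ),
        "skip_headers_and_footers": as_bool(
            props.get("Skip Headers and Footers", "false")
        ),
    }
    # Optional properties are passed only when set, so that the library's
    # own defaults (such as automatic language detection) still apply.
    languages = props.get("Languages", "").strip()
    if languages:
        kwargs["languages"] = [
            code.strip() for code in languages.split(",") if code.strip()
        ]
    last_modified = props.get("Metadata Last Modified", "").strip()
    if last_modified:
        kwargs["metadata_last_modified"] = last_modified
    return kwargs


def partition_html_to_json(html_bytes: bytes, props: Dict[str, str]) -> str:
    """Partition raw HTML bytes and return the elements as a JSON string."""
    # Imported lazily; requires the unstructured package to be installed.
    from unstructured.partition.html import partition_html
    from unstructured.staging.base import elements_to_json

    elements = partition_html(
        file=io.BytesIO(html_bytes), **build_partition_kwargs(props)
    )
    return elements_to_json(elements)


example_kwargs = build_partition_kwargs(
    {"Languages": "eng, deu", "Skip Headers and Footers": "true"}
)
```

Leaving Languages and Metadata Last Modified out of the keyword arguments when they are unset mirrors the empty defaults in the table.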