PartitionPdf

Description:

Partitions a PDF file using the partition_pdf function of the unstructured library (unstructured.io). Property values are forwarded to partition_pdf as parameters. The output is a JSON document in the format that partition_pdf produces.
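As a rough illustration of consuming this output downstream, the sketch below assumes the flow file content is a JSON array of element objects with type, text, and (optionally) metadata keys, which is how unstructured serializes its elements. The sample payload and the helper name text_by_type are invented for illustration and do not come from the processor itself.

```python
import json
from collections import defaultdict

# Hypothetical sample of the processor's output; real output carries more
# metadata fields, and "metadata" is absent when Include Metadata is false.
SAMPLE_OUTPUT = """
[
  {"type": "Title", "element_id": "a1", "text": "Quarterly Report",
   "metadata": {"page_number": 1}},
  {"type": "NarrativeText", "element_id": "b2", "text": "Revenue grew.",
   "metadata": {"page_number": 1}},
  {"type": "NarrativeText", "element_id": "c3", "text": "Costs fell.",
   "metadata": {"page_number": 2}}
]
"""


def text_by_type(raw_json: str) -> dict[str, list[str]]:
    """Group the text of each element by its element type."""
    grouped = defaultdict(list)
    for element in json.loads(raw_json):
        # Use .get so that elements missing a key do not raise.
        grouped[element.get("type", "Unknown")].append(element.get("text", ""))
    return dict(grouped)


if __name__ == "__main__":
    for element_type, texts in text_by_type(SAMPLE_OUTPUT).items():
        print(element_type, texts)
```

Grouping by type is only one possible use; the same loop could filter on metadata such as the page number when metadata is included.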

Tags:

ai, artificial intelligence, ml, machine learning, text, LLM, partition, pdf, partition_pdf

Properties:

In the table below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Description
Include Metadata | Include Metadata | true | Whether to include metadata in the output. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Languages | Languages | | Comma-separated list of 3-letter language codes to be used as metadata.languages. If unset, the language is detected via langdetect. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Max Partition | Max Partition | | The maximum number of characters in each partition. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Min Partition | Min Partition | | The minimum number of characters in each partition. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Metadata Last Modified | Metadata Last Modified | | Date-time to include in the metadata as last_modified. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Strategy | Strategy | fast | How to process image-like input. NOTE: strategies other than 'fast' can be very slow; use them at your own risk.
Extract Images in PDF | Extract Images in PDF | | Whether to extract images from the PDF file. Deprecated in favor of Extract Image Block Types. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Extract Image Block Types | Extract Image Block Types | | Comma-separated list of block types to extract. Allowed values are Image and Table. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Extract Image Block to Payload | Extract Image Block to Payload | | Whether to include base64-encoded extracted images in the output. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Extract Image Block Output Directory | Extract Image Block Output Directory | | Directory to save extracted images to. Only works if Extract Image Block to Payload is false. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)