Partitions a PDF file using the partition_pdf function of unstructured.io. Properties are forwarded to partition_pdf as parameters. The output is a JSON document in the format output by partition_pdf.
ai, artificial intelligence, ml, machine learning, text, LLM, partition, pdf, partition_pdf
In the table below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.
Display Name | API Name | Default Value | Description |
---|---|---|---|
Include Metadata | Include Metadata | true | Whether to include metadata in the output. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Languages | Languages | | Comma-separated list of 3-letter language codes to be used as metadata.languages. If unset, the language is detected via langdetect. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Max Partition | Max Partition | | The maximum number of characters in each partition. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Min Partition | Min Partition | | The minimum number of characters in each partition. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Metadata Last Modified | Metadata Last Modified | | Date-time to include in the metadata as last_modified. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Strategy | Strategy | fast | How to process image-like input. NOTE: strategies other than 'fast' can be very slow; use them at your own risk. |
Extract Images in PDF | Extract Images in PDF | | Whether to extract images from the PDF file. Deprecated in favor of Extract Image Block Types. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Extract Image Block Types | Extract Image Block Types | | Comma-separated list of block types to extract. Allowed values are Image and Table. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Extract Image Block to Payload | Extract Image Block to Payload | | Whether to include base64-encoded extracted images in the output. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
Extract Image Block Output Directory | Extract Image Block Output Directory | | Directory to save extracted images to. Only works if Extract Image Block to Payload is false. Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables) |
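
Because the properties are forwarded to partition_pdf as parameters, it can help to see how string-valued property settings might translate into typed keyword arguments. The sketch below is illustrative only and is not the processor's actual code: the helper `build_partition_kwargs`, the `PROPERTY_TO_KWARG` mapping, and the keyword names on the right-hand side are assumptions inferred from the property descriptions, and should be checked against the installed version of unstructured.

```python
from typing import Any, Dict, Tuple

# Hypothetical mapping: property display name -> (assumed keyword name, value kind).
# The keyword names are assumptions, not confirmed by this page.
PROPERTY_TO_KWARG: Dict[str, Tuple[str, str]] = {
    "Include Metadata": ("include_metadata", "bool"),
    "Languages": ("languages", "list"),
    "Max Partition": ("max_partition", "int"),
    "Min Partition": ("min_partition", "int"),
    "Metadata Last Modified": ("metadata_last_modified", "str"),
    "Strategy": ("strategy", "str"),
    "Extract Images in PDF": ("extract_images_in_pdf", "bool"),
    "Extract Image Block Types": ("extract_image_block_types", "list"),
    "Extract Image Block to Payload": ("extract_image_block_to_payload", "bool"),
    "Extract Image Block Output Directory": ("extract_image_block_output_dir", "str"),
}


def _convert(raw: str, kind: str) -> Any:
    """Convert a property string into the Python type its kind calls for."""
    if kind == "bool":
        return raw.strip().lower() == "true"
    if kind == "int":
        return int(raw)
    if kind == "list":
        # Comma-separated values; surrounding whitespace and empty items are dropped.
        return [item.strip() for item in raw.split(",") if item.strip()]
    return raw


def build_partition_kwargs(properties: Dict[str, str]) -> Dict[str, Any]:
    """Build keyword arguments from property values, skipping unset properties."""
    kwargs: Dict[str, Any] = {}
    for name, (kwarg, kind) in PROPERTY_TO_KWARG.items():
        raw = properties.get(name)
        if raw is None or raw.strip() == "":
            continue  # unset: leave the parameter at its library default
        kwargs[kwarg] = _convert(raw, kind)
    return kwargs


example = build_partition_kwargs({
    "Include Metadata": "true",
    "Languages": "eng, deu",
    "Strategy": "fast",
    "Extract Image Block Types": "Image,Table",
})
print(example)
```

Under these assumptions, the resulting dictionary could be passed along as `partition_pdf(filename=..., **example)`, with `partition_pdf` imported from `unstructured.partition.pdf`. Skipping unset properties keeps the library's own defaults in effect, for example language detection when Languages is empty.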