ParseDocument

Description:

Parses incoming unstructured text documents and performs optical character recognition (OCR) to extract text from PDF and image files. The output is formatted as "json-lines", with each line containing two keys: 'text' and 'metadata'. Note that this Processor may require significant storage space and RAM because of the third-party dependencies needed to process PDF and image files. Also note that Tesseract and Poppler must be installed on the system in order to process PDF or image files.
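
Because PDF and image processing depends on Tesseract and Poppler being installed, it can help to confirm that both are reachable before routing such files to this Processor. The following is a minimal sketch, not part of the Processor itself. The executable names are assumptions: "tesseract" is the Tesseract command-line tool and "pdftoppm" is one of the utilities shipped with Poppler. The helper name missing_tools is hypothetical.

```python
import shutil

# Assumed executable names: "tesseract" for Tesseract and "pdftoppm",
# one of the command-line utilities that ships with Poppler.
REQUIRED_TOOLS = {"Tesseract": "tesseract", "Poppler": "pdftoppm"}


def missing_tools(which=shutil.which):
    """Return the names of required tools whose executable is not on PATH.

    The lookup function is injectable so the check can be exercised
    without depending on what is installed on the current machine.
    """
    return [name for name, exe in REQUIRED_TOOLS.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract and Poppler executables found on PATH")
```

Run this on the host where the Processor executes. A tool installed outside the PATH of the user running that process would still be reported as missing.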

Tags:

text, embeddings, vector, machine learning, ML, artificial intelligence, ai, document, langchain, pdf, html, markdown, word, excel, powerpoint

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values.

Display Name | API Name | Default Value | Description
Input Format | Input Format | Plain Text | The format of the input FlowFile. This dictates which TextLoader will be used to parse the input. Note that in order to process images or extract tables from PDF files, you must have both 'poppler' and 'tesseract' installed on your system.
Parsing Strategy | PDF Parsing Strategy | Automatic | Specifies the strategy to use when parsing a PDF
PDF Parsing Model | PDF Parsing Model | yolox | The model to use for parsing. Different models will have their own strengths and weaknesses.
Element Strategy | Element Strategy | Document Per Element | Specifies whether the input should be loaded as a single Document, or if each element in the input should be separated out into its own Document
Include Page Breaks | Include Page Breaks | false | Specifies whether or not page breaks should be considered when creating Documents from the input
Infer Table Structure | Infer Table Structure | false | If true, any table that is identified in the PDF will be parsed and translated into an HTML structure. The HTML of that table will then be added to the Document's metadata in a key named 'text_as_html'. Regardless of the value of this property, the textual contents of the table will be written to the contents without the structure.
Languages | Languages | Eng | A comma-separated list of language codes that should be used when using OCR to determine the text.
Metadata Fields | Metadata Fields | filename, uuid | A comma-separated list of FlowFile attributes that will be added to the Documents' Metadata
Include Extracted Metadata | Include Extracted Metadata | true | Whether or not to include the metadata that is extracted from the input in each of the Documents
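
The output described above is json-lines in which each line carries a 'text' key and a 'metadata' key, and, when Infer Table Structure is true, a table's HTML is stored in the metadata under 'text_as_html'. A downstream consumer could read that output roughly as sketched below. This is an illustration only: the helper names read_documents and tables_as_html are hypothetical, and the sample records are made up rather than real Processor output.

```python
import json


def read_documents(jsonl_text):
    """Parse json-lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def tables_as_html(documents):
    """Collect the 'text_as_html' metadata value from documents that have one."""
    return [
        doc["metadata"]["text_as_html"]
        for doc in documents
        if "text_as_html" in doc.get("metadata", {})
    ]


# Hypothetical sample output, invented for illustration only.
TABLE_HTML = "<table><tr><td>Region</td><td>Sales</td></tr></table>"
sample = "\n".join([
    json.dumps({"text": "Quarterly report", "metadata": {"filename": "report.pdf"}}),
    json.dumps({
        "text": "Region Sales",
        "metadata": {"filename": "report.pdf", "text_as_html": TABLE_HTML},
    }),
])

documents = read_documents(sample)
for doc in documents:
    # 'filename' appears here because it is one of the default Metadata Fields.
    print(doc["metadata"].get("filename"), "->", doc["text"])
print(tables_as_html(documents))
```

Reading 'text_as_html' defensively, as above, keeps the consumer working whether or not Infer Table Structure is enabled, since the table's plain text is written to 'text' in either case.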