PDF Parser Processor
The PDF Parser Processor converts binary PDF data into structured text. It splits PDF content into sections so that the output can be passed to AI-powered chunking or stored in vector databases.
This makes the text and images in PDF documents available for downstream processing and analysis.
INPUT CONFIGURATION
This section controls how the input PDF data is read. Configure the parameters as explained below.
Input Column
Select the column containing the PDF data to be parsed.
Base64 Encoded
Select this checkbox if the input data is Base64-encoded.
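As an illustration of what the Base64 Encoded option implies, the following minimal sketch decodes a column value and checks that the result starts with the PDF header marker. The helper name `to_pdf_bytes` and its parameters are hypothetical and are not part of the processor.

```python
import base64
import binascii

PDF_MAGIC = b"%PDF-"  # header marker at the start of a PDF file


def to_pdf_bytes(value, base64_encoded):
    """Return raw PDF bytes from a column value (hypothetical helper)."""
    if base64_encoded:
        try:
            # validate=True rejects characters outside the Base64 alphabet
            raw = base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError) as exc:
            raise ValueError("value is not valid Base64") from exc
    else:
        # Assumption: non-encoded input is already bytes, or a byte-preserving string
        raw = value if isinstance(value, bytes) else value.encode("latin-1")
    if not raw.startswith(PDF_MAGIC):
        raise ValueError("data does not look like a PDF")
    return raw
```

If the option is selected for data that is not actually Base64, a check of this kind fails early instead of passing corrupt bytes to the parser.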
Password-Protected
Select this checkbox if the input PDF file is password-protected.
Password
Enter the password to decrypt the password-protected PDF file(s).
Extract Images
Select this checkbox to also parse any images in the PDF.
INPUT CONTENT SPLIT OPTIONS
This section defines how the content of the PDF is split into manageable chunks.
You can split by pages, sections, lines, words, or characters, depending on what the downstream processing needs.
Configure the parameters as explained below.
Split Text By
Choose how to split the PDF document: by pages, sections, lines, words, or characters. The default value is Do Not Split.
Section Delimiter
Specify the delimiter used to identify sections when splitting based on sections.
Example: “\n” (newline character), indicating that each new line marks the beginning of a new section.
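The effect of a section delimiter can be sketched as follows. This is only an illustration: the function name `split_by_delimiter` is hypothetical, and dropping empty sections is an assumption, since the processor's handling of empty pieces is not documented here.

```python
def split_by_delimiter(text, delimiter):
    """Split text into sections at each occurrence of the delimiter."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    # Assumption: whitespace-only pieces are discarded
    return [part for part in text.split(delimiter) if part.strip()]


sections = split_by_delimiter("Intro\nMethods\n\nResults", "\n")
print(sections)
```

With "\n" as the delimiter, each non-empty line of the extracted text becomes its own section.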
Lines Per Section
Specify the maximum number of lines per section when splitting based on lines.
Words Per Section
Specify the maximum number of words per section when splitting based on words.
Characters Per Section
Specify the maximum number of characters per section when splitting based on characters.
Overlap Count
Specify the number of overlapping characters between adjacent split sections.
This ensures that adjacent sections in the output share some content, which preserves continuity across split boundaries.
Choose the value with the selected Split Text By option in mind.
Example: If splitting text by lines, a smaller value might be suitable, while for larger units like sections or pages, a larger value could be more appropriate.
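To make the interaction between a per-section size limit and the overlap count concrete, here is a minimal sketch of fixed-size splitting with overlap. The function `split_with_overlap` is hypothetical and not the processor's implementation; it assumes the overlap is smaller than the section size.

```python
def split_with_overlap(units, size, overlap):
    """Split a sequence into chunks of at most `size` items.

    Each chunk repeats the last `overlap` items of the previous chunk.
    `units` can be a string (characters) or a list (for example, words).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be between 0 and size - 1")
    step = size - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + size])
        if start + size >= len(units):
            break  # the last chunk already reaches the end
    return chunks


print(split_with_overlap("abcdefghij", 4, 1))  # ['abcd', 'defg', 'ghij']
```

In this sketch a section size of 4 characters with an overlap of 1 means each chunk starts with the last character of the previous one. Passing a list of words instead of a string overlaps whole words; whether the processor counts overlap in units other than characters is not stated in this document.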
Text Split Identifier
Define a format for indicating the location of split text chunks in the PDF file. Provide a Page Prefix and, depending on the selected Split Text By option, a separator string and a Section Number Prefix.
The example shown below this field updates to reflect the values you enter.
Image Identifier
This field appears only if Extract Images is selected under INPUT CONFIGURATION.
Define a format for identifying the location of images in the PDF file. Provide a Page Prefix, a separator string, and an Image Number Prefix.
The example shown below this field updates to reflect how the image identifier will appear in the output.
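The text split and image identifiers are both assembled from a page prefix, a separator, and a numbered prefix. The sketch below shows one way such identifiers could be composed. The function name and the sample values ("Page-", "_", "Section-", "Image-") are assumptions for illustration only; the actual format is whatever the example under each field displays.

```python
def build_identifier(page, number, page_prefix, separator, number_prefix):
    """Compose an identifier such as 'Page-3_Section-2' (hypothetical format)."""
    return f"{page_prefix}{page}{separator}{number_prefix}{number}"


# Assumed sample values, not defaults of the processor
text_id = build_identifier(3, 2, "Page-", "_", "Section-")
image_id = build_identifier(3, 1, "Page-", "_", "Image-")
print(text_id, image_id)
```

Keeping the same page prefix and separator for text and images makes it straightforward to group all content from one page when querying the output.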
OUTPUT CONTENT
This section configures the output of the parsing process: the columns in which the content type, the parsed content, and the content identifiers are stored. Configure the parameters as explained below.
Content Type Column
Specify or select the column where the content type information (e.g., text, image) will be stored.
Content Column
Specify or select the column where the split text chunks and images will be stored.
Content Identifier Column
Specify or select the column where identifiers for the parsed images and text chunks will be stored.
Drop Input Column from Output
Select this checkbox to drop the Input Column from the output after parsing.
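The output settings above can be pictured with a small sketch that turns parsed items into output records. It assumes one output record per text chunk or image, which this document does not confirm, and all function and column names are hypothetical.

```python
def build_output_rows(input_row, parsed_items, input_column,
                      type_column, content_column, id_column, drop_input):
    """Build one output record per parsed item (assumed layout)."""
    rows = []
    for content_type, content, identifier in parsed_items:
        row = dict(input_row)  # copy so the original record is not modified
        if drop_input:
            row.pop(input_column, None)
        row[type_column] = content_type   # e.g. "text" or "image"
        row[content_column] = content     # the chunk text or image data
        row[id_column] = identifier       # location of the item in the PDF
        rows.append(row)
    return rows


rows = build_output_rows(
    {"doc_id": 1, "pdf": b"%PDF-1.7"},
    [("text", "Hello", "Page-1_Section-1"), ("image", b"\x89PNG", "Page-1_Image-1")],
    input_column="pdf", type_column="content_type",
    content_column="content", id_column="content_id", drop_input=True,
)
```

Here the other columns of the input record (such as `doc_id`) are carried over to every output record, and the original PDF column is removed because `drop_input` is set.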
OTHERS
This section provides an additional setting related to the parsing process. Configure it as explained below.
Assign Failed Content
Specify or select the column in which the content of the input column is captured if parsing fails.
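The idea behind Assign Failed Content can be sketched as a try/except around the parse step, where the original input is copied into a dedicated column on failure. This is a minimal sketch under stated assumptions: `parse_or_capture` and `demo_parser` are hypothetical stand-ins, not the processor's actual parsing logic.

```python
def demo_parser(data):
    """Stand-in parser: accepts only data that starts with the PDF header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF")
    return ["parsed chunk"]


def parse_or_capture(row, input_column, failed_column, parser):
    """Return (chunks, None) on success or (None, failed_row) on failure."""
    try:
        return parser(row[input_column]), None
    except Exception:
        failed_row = dict(row)
        # Keep the unparsed input so the record can be inspected or retried
        failed_row[failed_column] = row[input_column]
        return None, failed_row


chunks, failed = parse_or_capture({"pdf": b"garbage"}, "pdf", "failed_content", demo_parser)
```

Capturing the raw input this way means failed records are not lost and can be routed to a separate path for review.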