PDF Parser Processor

The PDF Parser Processor converts binary PDF data into structured text. It splits PDF content into sections so that the output is ready for AI-powered chunking or storage in vector databases.

This processor simplifies extracting data and insights from PDF documents for downstream processing and analysis.

Input Configuration

This section controls how the input PDF data is read. Configure the parameters as explained below.

Input Column

Select the column containing the PDF data to be parsed.

Base64 Encoded

Select this option if the input data is Base64-encoded.
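
As a rough illustration of what this option implies, the sketch below decodes a Base64 column value back into raw PDF bytes before parsing. The helper name to_pdf_bytes and the header check are assumptions made for this example, not part of the processor.

```python
import base64

def to_pdf_bytes(value, base64_encoded):
    """Return raw PDF bytes from a column value (hypothetical helper)."""
    raw = base64.b64decode(value) if base64_encoded else value
    # A PDF file normally begins with the "%PDF-" marker.
    if not raw.startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF")
    return raw

# Placeholder bytes that only carry a PDF header, not a complete document.
sample = b"%PDF-1.7\n% placeholder content"
encoded = base64.b64encode(sample)
decoded = to_pdf_bytes(encoded, base64_encoded=True)
```

If the option does not match the actual encoding of the column, a check like this one fails early instead of passing unreadable bytes to the parser.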

Password-Protected

Select this option if the input PDF file is password-protected.

Password

Enter the password to decrypt the password-protected PDF file(s).

Extract Images

Select this option to parse any images in the PDF.

INPUT CONTENT SPLIT OPTIONS

In this section, you can define how the content of the PDF will be split into manageable chunks.

Options include splitting by page, section, words, lines, or characters, allowing you to tailor the parsing process to your specific needs.

Configure the content split parameters as explained below.

Split Text By

Choose how to split the PDF document: by pages, sections, lines, words, or characters. The default is Do Not Split.

Section Delimiter

Specify the delimiter used to identify sections when splitting based on sections.

Example: “\n” (newline character), indicating that each new line marks the beginning of a new section.
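
A minimal sketch of delimiter-based splitting, assuming the extracted text is available as a single string. The function name split_by_delimiter is hypothetical, and dropping empty sections is a choice made for this example rather than documented processor behavior.

```python
def split_by_delimiter(text, delimiter="\n"):
    """Split text into sections wherever the delimiter occurs."""
    # Skip pieces that are empty or whitespace-only (assumption for this sketch).
    return [part for part in text.split(delimiter) if part.strip()]

extracted = "Title\nFirst paragraph.\n\nSecond paragraph."
sections = split_by_delimiter(extracted, "\n")
```

With a newline delimiter, this sample text yields three sections: the title and the two paragraphs.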

Lines Per Section

Specify the maximum number of lines per section when splitting based on lines.

Words Per Section

Specify the maximum number of words per section when splitting based on words.

Characters Per Section

Specify the maximum number of characters per section when splitting based on characters.
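
The three count-based settings above follow the same pattern: group the text into sections of at most N units. The sketch below shows that pattern for words and characters. The helper split_by_count is hypothetical and only illustrates the idea.

```python
def split_by_count(units, per_section):
    """Group a sequence of units into sections of at most per_section items."""
    if per_section < 1:
        raise ValueError("per_section must be at least 1")
    return [units[i:i + per_section] for i in range(0, len(units), per_section)]

text = "one two three four five six seven"

# Words Per Section = 3
word_sections = [" ".join(group) for group in split_by_count(text.split(), 3)]

# Characters Per Section = 10
char_sections = split_by_count(text, 10)
```

The last section can be shorter than the limit, because the value sets a maximum rather than an exact size.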

Overlap Count

Specify the number of overlapping characters between adjacent split sections.

This ensures that adjacent sections in the output share some common content, which preserves continuity across section boundaries.

To determine the appropriate value for this setting, consider the selected Split Text By option.

Example: If splitting text by lines, a smaller value might be suitable, while for larger units like sections or pages, a larger value could be more appropriate.

Adjust this value based on the granularity of the text splitting to achieve the desired level of continuity between sections.
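
To make the effect of an overlap concrete, here is a small sketch of character-based splitting with overlap. The function split_with_overlap and its parameters are illustrative assumptions. The processor's internal logic is not documented here.

```python
def split_with_overlap(text, size, overlap):
    """Split text into chunks of up to `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current chunk reaches the end of the text.
        if start + size >= len(text):
            break
        start += step
    return chunks

chunks = split_with_overlap("abcdefghij", size=4, overlap=2)
```

Here each chunk repeats the last two characters of the previous one, so text that straddles a boundary appears in full in at least one chunk.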

Text Split Identifier

Define a format for indicating the location of split text chunks in the PDF file. Provide a Page Prefix, and as per the chosen Split Text By option, a separator string along with a Section Number Prefix.

The example shown below this field updates to reflect the values you define, so you can preview how the text split identifier will appear in the output.

Image Identifier

This field appears only when Extract Images is selected under Input Configuration.

Define a format for identifying image location in the PDF file. Provide a Page Prefix, a separator string, and an Image Number Prefix.

The example shown below this field updates to reflect the values you define, so you can preview how the image identifier will appear in the output.
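
The text split and image identifiers can be pictured as a page part and a numbered part joined by the separator. The sketch below assumes that layout. The function name and the prefix values are placeholders, and the exact format produced by the processor is the one shown in the live example under each field.

```python
def location_identifier(page, number, page_prefix, separator, number_prefix):
    """Build an identifier from a page number and a section or image number."""
    return f"{page_prefix}{page}{separator}{number_prefix}{number}"

# Hypothetical prefix and separator values chosen for illustration.
text_id = location_identifier(2, 3, "Page_", "-", "Section_")
image_id = location_identifier(2, 1, "Page_", "-", "Image_")
```

With these placeholder values, the third text section on page 2 is labeled Page_2-Section_3 and the first image on the same page is labeled Page_2-Image_1.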

OUTPUT CONTENT

This section configures the output of the parsing process. You can specify the columns where the content type information, the parsed content, and the content identifiers will be placed. Configure the parameters as explained below.

Content Type Column

Specify or select the column where the content type information (e.g., text, image) will be stored.

Content Column

Specify or select the column where the split text chunks and images will be stored.

Content Identifier Column

Specify or select the column where identifiers for the parsed images and text chunks will be stored.

Drop Input Column from Output

Select this option to drop the Input Column from the output after parsing.
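
One way to picture the output settings is as three extra columns added to each output record. The sketch below assumes one output record per parsed chunk or image, which is an assumption of this example. The column names content_type, content, and content_id are placeholders for whatever you configure.

```python
def build_output_rows(input_row, parsed_items, input_column, drop_input=False):
    """Attach content type, content, and identifier columns to copies of the input row."""
    base = dict(input_row)
    if drop_input:
        # Mirrors the Drop Input Column from Output option.
        base.pop(input_column, None)
    rows = []
    for content_type, content, identifier in parsed_items:
        row = dict(base)
        row["content_type"] = content_type
        row["content"] = content
        row["content_id"] = identifier
        rows.append(row)
    return rows

source = {"doc_name": "report.pdf", "pdf_data": b"%PDF-1.7"}
items = [
    ("text", "Quarterly summary", "Page_1-Section_1"),
    ("image", b"\x89PNG", "Page_1-Image_1"),
]
rows = build_output_rows(source, items, "pdf_data", drop_input=True)
```

Dropping the input column keeps the large binary PDF value out of every output record while the other input columns are preserved.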

OTHERS

This section provides an additional setting related to the parsing process. Configure the parameter as explained below.

Assign Failed Content

Specify or select the column that will capture the content of the input column if parsing fails.
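
The idea behind this setting can be sketched as a try/except wrapper: when parsing raises an error, the original input value is copied into the failed-content column instead of being lost. All names below, including strict_parser, are hypothetical stand-ins for this example.

```python
def parse_or_capture(row, input_column, failed_column, parser):
    """Parse the input column, or capture its content if parsing fails."""
    try:
        return {**row, "content": parser(row[input_column])}
    except Exception:
        # On failure, keep the original input in the failed-content column.
        return {**row, failed_column: row[input_column]}

def strict_parser(data):
    """Stand-in parser that rejects anything without a PDF header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF")
    return "parsed text"

ok = parse_or_capture({"pdf": b"%PDF-1.7"}, "pdf", "failed", strict_parser)
bad = parse_or_capture({"pdf": b"oops"}, "pdf", "failed", strict_parser)
```

Records that land in the failed-content column can then be inspected or reprocessed separately.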
