Web Scraper ETL Source
The Web Scraper ETL connector extracts data from web sources using several scraping methods, such as direct URLs, recursive searches, and sitemaps, making the scraped content available for use in your data workflows.
Schema Type
To infer the schema during the application design phase, specify the maximum number of records to fetch and the sampling method.
After providing schema type details, the next step is to configure the data source.
Data Source Configuration
Configure the data source parameters as explained below.
Scrape Method
Specify how the Web Scraper component should gather data. Supported options are:
URL(s): Scrapes data from specified URL(s).
Recursive: Scrapes data from URLs to a specified depth.
Sitemap(s): Scrapes data from one or more sitemaps.
URL Fetch Method
Choose how to provide URLs for fetching when the Scrape Method is URL(s) or Recursive. Supported options are:
Enter URL(s): (For Methods - URL(s) & Recursive) Manually enter one or more URLs.
Upload File: (For Method - URLs) Upload a file containing URLs. Supported formats: .txt, .rtf, and .xml.
Fetch from S3: (For Method - URLs) Retrieve URLs from an Amazon S3 bucket. Provide a valid S3 connection, a Bucket name, and the path to the file or directory containing the URL(s).
A connection name can be selected from the list if you have created and saved S3 connection details earlier, or you can create one as explained in the topic Amazon S3 Connection.
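The connector handles the S3 retrieval itself, but conceptually the source reads a plain-text file of URLs, one per line. A minimal Python sketch of that idea using boto3, with a hypothetical bucket and key for illustration:

import boto3  # AWS SDK for Python

# Hypothetical bucket and key, for illustration only.
BUCKET = "my-scraper-input"
KEY = "urls/url-list.txt"

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
body = obj["Body"].read().decode("utf-8")

# One URL per line; blank lines are ignored.
urls = [line.strip() for line in body.splitlines() if line.strip()]
print(urls)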
These fields appear for Recursive Scraping:
Depth
The maximum level of recursion allowed when fetching URLs from within the same domain.
If you set a depth of 2, the scraper will not only fetch data from the initial URLs provided but will also follow links found on those pages up to two levels deep.
URL Within Same Domain
Enabling the “URL Within Same Domain” option restricts the recursive URL searching to only those URLs within the same domain as the initial URLs provided. This ensures that external website references found outside of the specified domain are not scraped.
Maximum Number Of URLs
The maximum number of URLs that the scraper will queue up for recursive searching. Once this limit is reached, the scraper will stop adding new URLs to the queue, even if it hasn’t reached the maximum depth.
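To illustrate how Depth, URL Within Same Domain, and Maximum Number Of URLs interact, here is a minimal Python sketch of a depth-limited crawl using requests and BeautifulSoup. The function and its defaults are illustrative only, not the connector's implementation:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, depth=2, same_domain=True, max_urls=100):
    # Illustrative depth-limited crawl; not the connector's actual code.
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])   # (url, recursion level)
    seen = {start_url}
    pages = {}
    while queue:
        url, level = queue.popleft()
        html = requests.get(url, timeout=10).text
        pages[url] = html
        if level >= depth:
            continue  # Depth reached: scrape this page, follow no further links.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if same_domain and urlparse(link).netloc != domain:
                continue  # External domain: skipped when the restriction is on.
            if link not in seen and len(seen) < max_urls:
                seen.add(link)  # Stop queueing once the URL cap is reached.
                queue.append((link, level + 1))
    return pages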
Source
Specify how to fetch URL(s) from the sitemap when the Scrape Method is Sitemap(s). Supported options are:
File Upload: Upload a sitemap file from your local system. Supported formats: .txt, .rtf, and .xml.
S3: Retrieve the sitemap from an Amazon S3 bucket. Provide a valid S3 connection, a Bucket name, and the path to the file or directory containing the sitemap(s).
A connection name can be selected from the list if you have created and saved S3 connection details earlier, or you can create one as explained in the topic Amazon S3 Connection.
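A sitemap is an XML file that lists page URLs in <loc> elements. As a rough illustration of what the connector extracts from it, the following Python sketch parses a local sitemap file (the file name is hypothetical):

import xml.etree.ElementTree as ET

# Standard sitemap namespace; page URLs live in <loc> elements.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(path):
    # Illustrative parse: return the URLs listed in <loc> tags.
    tree = ET.parse(path)
    return [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

print(urls_from_sitemap("sitemap.xml"))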
Include HTML Tags
HTML tags to include in the scraped data. Type the element name and press Enter to include it.
For example: h1,h2,p
Exclude HTML Tags
HTML tag(s) to exclude from the scraped data. Type the element name and press Enter to exclude.
For example: h1,h2,p
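As an illustration of how include and exclude lists typically behave, the following Python sketch filters elements with BeautifulSoup; the sample HTML and tag lists are hypothetical, and the connector's internal filtering may differ:

from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>Body text</p><footer>Legal</footer>"
soup = BeautifulSoup(html, "html.parser")

include = ["h1", "h2", "p"]   # keep only these elements
exclude = ["footer"]          # drop these elements entirely

for tag in soup.find_all(exclude):
    tag.decompose()  # Remove excluded tags and their contents.

texts = [el.get_text(strip=True) for el in soup.find_all(include)]
print(texts)  # ['Title', 'Body text']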
Include Images
Enable this option to include images in the response.
Image Formats
Select the image formats to be included in the response. Supported formats are:
JPEG
PNG
JPG
Max Image Size (Design Time)
Specify the maximum cumulative size (in KB) for all images to be fetched during design time.
The minimum allowed size is 1 KB, and the maximum allowed size is 4096 KB. If the total size of the images exceeds this limit, additional images will not be fetched.
Note: All images will be scraped at runtime.
Image Count (Design Time)
Specify the maximum number of images to be fetched during design time.
The minimum number of images allowed is 1, and the maximum allowed is 4. If the total number of images exceeds this limit, additional images will not be fetched.
Note: All images will be scraped at runtime.
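The following Python sketch illustrates how the two design-time limits could be enforced together: fetching stops as soon as either the image count or the cumulative size cap would be exceeded. The fetch loop is illustrative, not the connector's code:

import requests

MAX_TOTAL_KB = 4096  # design-time cumulative size cap (KB)
MAX_COUNT = 4        # design-time image count cap

def fetch_images(urls):
    # Illustrative design-time behavior: stop once either cap is hit.
    images, total_kb = [], 0
    for url in urls:
        if len(images) >= MAX_COUNT:
            break  # Count limit reached; skip remaining images.
        data = requests.get(url, timeout=10).content
        size_kb = len(data) / 1024
        if total_kb + size_kb > MAX_TOTAL_KB:
            break  # Cumulative size limit would be exceeded.
        images.append(data)
        total_kb += size_kb
    return images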
Output
Select how to format the output. Supported output formats are:
Content in tags as JSON
Content as text
Content as markdown
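To make the distinction between the three formats concrete, the following Python sketch derives each shape from the same HTML fragment. The exact structures the connector emits may differ; this only illustrates the idea:

from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>Body text</p>"
soup = BeautifulSoup(html, "html.parser")

# Content in tags as JSON: one entry per element, keyed by tag name.
as_json = [{el.name: el.get_text(strip=True)} for el in soup.find_all(True)]
# -> [{'h1': 'Title'}, {'p': 'Body text'}]

# Content as text: tags stripped, text only.
as_text = soup.get_text(" ", strip=True)
# -> 'Title Body text'

# Content as markdown: elements mapped to Markdown syntax.
as_markdown = "# Title\n\nBody text"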
Add Configuration: Additional properties can be added using this option as key-value pairs.
More Configurations
This section contains additional configuration parameters.
Timeout (sec)
Maximum time (in seconds) allowed for each URL fetch attempt before timing out. Default value: 10 seconds.
Retries
Number of retry attempts for failed URL fetch attempts. Default value: 3 retries.
Retry Delay (ms)
Time (in milliseconds) to wait between retry attempts. Default value: 1000 milliseconds (1 second).
Backoff Factor
Factor by which the retry delay increases after each attempt (exponential backoff). Default value: 0.3 (30%).
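The documentation does not state the exact backoff formula. Assuming the common convention where each retry delay grows by the backoff factor relative to the previous delay, the defaults above would produce the schedule computed below; treat the formula itself as an assumption:

# Assumed formula: delay_n = retry_delay * (1 + backoff_factor) ** n
retry_delay_ms = 1000   # default Retry Delay (ms)
backoff_factor = 0.3    # default Backoff Factor
retries = 3             # default Retries

for n in range(retries):
    delay = retry_delay_ms * (1 + backoff_factor) ** n
    print(f"retry {n + 1}: wait {delay:.0f} ms")

# Output under this assumption:
# retry 1: wait 1000 ms
# retry 2: wait 1300 ms
# retry 3: wait 1690 ms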
Use Default Truststore
Check this box to use TLSv1.2 for secure communication.
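For context, the following Python sketch shows what enforcing TLSv1.2 with a default trust store typically looks like; it is a generic illustration, not the connector's internal code:

import ssl
import urllib.request

# Load the system's default CA certificates (the default truststore)
# and require TLSv1.2 or later for the connection.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

with urllib.request.urlopen("https://example.com", context=ctx) as resp:
    print(resp.status)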
Pre Action
To understand how to provide SQL queries or stored procedures that will be executed during the pipeline run, see Pre-Actions →.
Notes
Optionally, enter notes in the Notes → tab and save the configuration.