Web Scraper ETL Source
The Web Scraper ETL connector extracts data from web sources using several scraping methods, such as direct URLs, recursive searches, and sitemaps, making the scraped content available for use in your data workflows.
Schema Type
To infer the schema during the application design phase, specify the maximum number of records to fetch and the sampling method.
After providing schema type details, the next step is to configure the data source.
Data Source Configuration
Configure the data source parameters as explained below.
Scrape Method
Specify how the Web Scraper component should gather data. Supported options are:
URL(s): Scrapes data from specified URL(s).
Recursive: Scrapes data from URLs to a specified depth.
Sitemap(s): Scrapes data from one or more sitemaps.
URL Fetch Method
Choose how to provide URLs for fetching when the Scrape Method is URL(s) or Recursive. Supported options are:
Enter URL(s): (For Methods - URL(s) & Recursive) Manually enter one or more URLs.
Upload File: (For Method - URLs) Upload a file containing URLs. Supported formats: .txt, .rtf, and .xml.
Fetch from S3: (For Method - URLs) Retrieve URLs from an Amazon S3 bucket. Provide a valid S3 connection, a Bucket name, and the path to the file or directory containing the URL(s).
A connection name can be selected from the list if you have created and saved S3 connection details earlier, or you can create one as explained in the topic Amazon S3 Connection.
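The connector handles the S3 retrieval itself, but conceptually the source reads a plain-text file of URLs, one per line. A minimal Python sketch of that idea using boto3, with a hypothetical bucket and key for illustration:

import boto3  # AWS SDK for Python

# Hypothetical bucket and key, for illustration only.
BUCKET = "my-scraper-input"
KEY = "urls/url-list.txt"

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
body = obj["Body"].read().decode("utf-8")

# One URL per line; blank lines are ignored.
urls = [line.strip() for line in body.splitlines() if line.strip()]
print(urls)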
These fields appear for Recursive Scraping:
Depth
The maximum level of recursion allowed when fetching URLs from within the same domain.
If you set a depth of 2, the scraper will not only fetch data from the initial URLs provided but will also follow links found on those pages up to two levels deep.
URL Within Same Domain
Enabling the “URL Within Same Domain” option restricts the recursive URL searching to only those URLs within the same domain as the initial URLs provided. This ensures that external website references found outside of the specified domain are not scraped.
Maximum Number Of URLs
The maximum number of URLs that the scraper will queue up for recursive searching. Once this limit is reached, the scraper will stop adding new URLs to the queue, even if it hasn’t reached the maximum depth.
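To illustrate how Depth, URL Within Same Domain, and Maximum Number Of URLs interact, here is a minimal Python sketch of a depth-limited crawl using requests and BeautifulSoup. The function and its defaults are illustrative only, not the connector's implementation:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, depth=2, same_domain=True, max_urls=100):
    # Illustrative depth-limited crawl; not the connector's actual code.
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])   # (url, recursion level)
    seen = {start_url}
    pages = {}
    while queue:
        url, level = queue.popleft()
        html = requests.get(url, timeout=10).text
        pages[url] = html
        if level >= depth:
            continue  # Depth reached: scrape this page, follow no further links.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if same_domain and urlparse(link).netloc != domain:
                continue  # External domain: skipped when the restriction is on.
            if link not in seen and len(seen) < max_urls:
                seen.add(link)  # Stop queueing once the URL cap is reached.
                queue.append((link, level + 1))
    return pages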
Source
Specify how to fetch URL(s) from the sitemap when the Scrape Method is Sitemap(s). Supported options are:
File Upload: Upload a sitemap file from your local system. Supported formats: .txt, .rtf, and .xml.
S3: Retrieve the sitemap from an Amazon S3 bucket. Provide a valid S3 connection, a Bucket name, and the path to the file or directory containing the sitemap(s).
A connection name can be selected from the list if you have created and saved S3 connection details earlier, or you can create one as explained in the topic Amazon S3 Connection.
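A sitemap is an XML file that lists page URLs in <loc> elements. As a rough illustration of what the connector extracts from it, the following Python sketch parses a local sitemap file (the file name is hypothetical):

import xml.etree.ElementTree as ET

# Standard sitemap namespace; page URLs live in <loc> elements.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(path):
    # Illustrative parse: return the URLs listed in <loc> tags.
    tree = ET.parse(path)
    return [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

print(urls_from_sitemap("sitemap.xml"))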
Include HTML Tags
HTML tags to include in the scraped data. Type the element name and press Enter to include it.
For example: h1,h2,p
Exclude HTML Tags
HTML tag(s) to exclude from the scraped data. Type the element name and press Enter to exclude.
For example: h1,h2,p
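As an illustration of how include and exclude lists typically behave, the following Python sketch filters elements with BeautifulSoup; the sample HTML and tag lists are hypothetical, and the connector's internal filtering may differ:

from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>Body text</p><footer>Legal</footer>"
soup = BeautifulSoup(html, "html.parser")

include = ["h1", "h2", "p"]   # keep only these elements
exclude = ["footer"]          # drop these elements entirely

for tag in soup.find_all(exclude):
    tag.decompose()  # Remove excluded tags and their contents.

texts = [el.get_text(strip=True) for el in soup.find_all(include)]
print(texts)  # ['Title', 'Body text']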
Include Images
Enable this option to include images in the response.
Image Formats
Select the image formats to be included in the response. Supported formats are:
JPEG
PNG
JPG
Max Image Size (Design Time)
Specify the maximum cumulative size (in KB) for all images to be fetched during design time.
The minimum allowed size is 1 KB, and the maximum allowed size is 4096 KB. If the total size of the images exceeds this limit, additional images will not be fetched.
Note: All images will be scraped at runtime.
Image Count (Design Time)
Specify the maximum number of images to be fetched during design time.
The minimum number of images allowed is 1, and the maximum allowed is 4. If the total number of images exceeds this limit, additional images will not be fetched.
Note: All images will be scraped at runtime.
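The following Python sketch illustrates how the two design-time limits could be enforced together: fetching stops as soon as either the image count or the cumulative size cap would be exceeded. The fetch loop is illustrative, not the connector's code:

import requests

MAX_TOTAL_KB = 4096  # design-time cumulative size cap (KB)
MAX_COUNT = 4        # design-time image count cap

def fetch_images(urls):
    # Illustrative design-time behavior: stop once either cap is hit.
    images, total_kb = [], 0
    for url in urls:
        if len(images) >= MAX_COUNT:
            break  # Count limit reached; skip remaining images.
        data = requests.get(url, timeout=10).content
        size_kb = len(data) / 1024
        if total_kb + size_kb > MAX_TOTAL_KB:
            break  # Cumulative size limit would be exceeded.
        images.append(data)
        total_kb += size_kb
    return images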
Output
Select how to format the output. Supported output formats are:
Content in tags as JSON
Content as text
Content as markdown
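To make the distinction between the three formats concrete, the following Python sketch derives each shape from the same HTML fragment. The exact structures the connector emits may differ; this only illustrates the idea:

from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>Body text</p>"
soup = BeautifulSoup(html, "html.parser")

# Content in tags as JSON: one entry per element, keyed by tag name.
as_json = [{el.name: el.get_text(strip=True)} for el in soup.find_all(True)]
# -> [{'h1': 'Title'}, {'p': 'Body text'}]

# Content as text: tags stripped, text only.
as_text = soup.get_text(" ", strip=True)
# -> 'Title Body text'

# Content as markdown: elements mapped to Markdown syntax.
as_markdown = "# Title\n\nBody text"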
Add Configuration: Additional properties can be added using this option as key-value pairs.
More Configurations
This section contains additional configuration parameters.
Timeout (sec)
Maximum time (in seconds) allowed for each URL fetch attempt before timing out. Default value: 10 seconds.
Retries
Number of retry attempts for failed URL fetch attempts. Default value: 3 retries.
Retry Delay (ms)
Time (in milliseconds) to wait between retry attempts. Default value: 1000 milliseconds (1 second).
Backoff Factor
Factor by which the retry delay increases after each attempt (exponential backoff). Default value: 0.3 (30%).
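The documentation does not state the exact backoff formula. Assuming the common convention where each retry delay grows by the backoff factor relative to the previous delay, the defaults above would produce the schedule computed below; treat the formula itself as an assumption:

# Assumed formula: delay_n = retry_delay * (1 + backoff_factor) ** n
retry_delay_ms = 1000   # default Retry Delay (ms)
backoff_factor = 0.3    # default Backoff Factor
retries = 3             # default Retries

for n in range(retries):
    delay = retry_delay_ms * (1 + backoff_factor) ** n
    print(f"retry {n + 1}: wait {delay:.0f} ms")

# Output under this assumption:
# retry 1: wait 1000 ms
# retry 2: wait 1300 ms
# retry 3: wait 1690 ms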
Use Default Truststore
Check this box to use TLSv1.2 for secure communication.
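For context, the following Python sketch shows what enforcing TLSv1.2 with a default trust store typically looks like; it is a generic illustration, not the connector's internal code:

import ssl
import urllib.request

# Load the system's default CA certificates (the default truststore)
# and require TLSv1.2 or later for the connection.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

with urllib.request.urlopen("https://example.com", context=ctx) as resp:
    print(resp.status)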
Pre Action
To understand how to provide SQL queries or stored procedures that will be executed during the pipeline run, see Pre-Actions →.
Notes
Optionally, enter notes in the Notes → tab and save the configuration.