SFTP Ingestion Source

The SFTP data source allows you to read data from network file systems.

Data Source Configuration

Configure the data source parameters as explained below.

Fetch From Source/Upload Data File

When designing the application, you can either fetch sample data from the SFTP source by providing the data source connection details, or upload a sample data file in one of the supported formats. Either option lets you see the schema details during application design.


If Upload Data File is selected to fetch sample data, provide the details below.

File Format

Select the sample file format (file type) depending on the data type.

Gathr-supported file formats for SFTP data source are CSV, JSON, TEXT, Parquet, ORC and MS Excel.

For CSV file format, select its corresponding delimiter.

Header Included

Enable this option to read the first row as a header if your SFTP data is in CSV format.
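
The effect of the delimiter and Header Included settings can be illustrated with a minimal Python sketch. This is not Gathr's implementation: the function name read_csv and the placeholder column names (col_0, col_1, ...) are assumptions made only for this example.

```python
import csv
import io


def read_csv(text, delimiter=",", header_included=True):
    """Parse CSV text and return (column_names, data_rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if header_included:
        # The first row supplies the column names and is not treated as data.
        return rows[0], rows[1:]
    # No header: every row is data, so placeholder names are generated
    # (hypothetical naming scheme, chosen for this sketch only).
    names = [f"col_{i}" for i in range(len(rows[0]))]
    return names, rows


sample = "id|name\n1|alice\n2|bob\n"

columns, data = read_csv(sample, delimiter="|", header_included=True)
print(columns, data)

columns_no_header, data_no_header = read_csv(sample, delimiter="|", header_included=False)
print(columns_no_header, data_no_header)
```

With the header enabled, the first row becomes the column names. With it disabled, the same row is read as an ordinary data record.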

Specify Sheets

(If file type is selected as MS Excel) Specifies how the data should be read from the workbook. Either choose the option Specific Sheets and enter the names of all the sheets that need to be read, or select the option All to read the data from the entire Excel file.

Schema Sheet

(If all the sheets in the Excel file are to be read)
Enter the name of the sheet from which the schema should be inferred.

Sheet Name

(If specific sheets in the Excel file are to be read)
Enter the name(s) of the sheets to be read. The first sheet listed is used to detect the schema, and all the listed sheets are read at runtime.

Upload

Upload a sample file in the file format selected above.


If Fetch From Source is selected, ensure that at least one file in the source directory is smaller than 500 MB. At design time, the SFTP data source can fetch sample data only from files smaller than 500 MB.


Continue configuring the data source.


Connection Name

Connections are the service identifiers. Select a connection name from the list if you have already created and saved connection details for SFTP. Otherwise, create one as explained in the topic SFTP Connection →

Use the Test Connection option to ensure that the connection with the SFTP channel is established successfully.

A success message states that the connection is available. If the connection test fails, edit the connection to resolve the issue before proceeding.


File Path

Provide the file path on the SFTP file system.

The wildcards asterisk (*) and question mark (?) are also supported.

Wildcards can be used in either the folder name or the file name for pattern matching.

Use a question mark (?) to match a single character and an asterisk (*) to match any number of characters.

Example: The query /folder/?ink will fetch files from the folders named pink, sink, wink, etc.

By contrast, the query /folder/bird* will fetch files from the folders named bird, birding, birds, and any other folder whose name starts with bird.
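
The wildcard rules above can be tried out with Python's standard fnmatch module, which uses the same meaning for ? and *. This is only an illustration of the matching semantics on individual folder names; the folder list is made up, and the module is not claimed to be what Gathr uses internally.

```python
from fnmatch import fnmatchcase


def matching_names(names, pattern):
    """Return the names that match a wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical folder names under /folder, used only for this example.
folders = ["pink", "sink", "wink", "drink", "ink", "bird", "birds", "birding", "blackbird"]

single_char = matching_names(folders, "?ink")   # ? stands for exactly one character
any_suffix = matching_names(folders, "bird*")   # * stands for zero or more characters

print(single_char)
print(any_suffix)
```

Note that drink and ink do not match ?ink, because the question mark stands for exactly one character, and blackbird does not match bird* because the name must start with bird.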


  • For MS Excel, only .xlsx files are supported.

  • If a given folder path (Example: /home/ec2-user) contains Excel files and its sub-folders also contain Excel files, the data is read from all the Excel files, including the ones in the sub-folders.

  • To read the data only from the specified folder, provide the path with a wildcard, as in this example: /home/ec2-user/*.xlsx
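
The difference between giving a bare folder path and giving a path with the *.xlsx wildcard can be sketched as follows. The helper excel_files and the list of remote paths are hypothetical and exist only to show which files each form would select under the behavior described above.

```python
from pathlib import PurePosixPath


def excel_files(paths, folder, recursive=True):
    """Select .xlsx files under a folder.

    recursive=True mimics giving the bare folder path (sub-folders included);
    recursive=False mimics giving folder/*.xlsx (that folder only).
    """
    folder = PurePosixPath(folder)
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix != ".xlsx":
            continue  # other extensions, including .xls, are skipped
        if recursive:
            if folder in path.parents:
                selected.append(raw)
        elif path.parent == folder:
            selected.append(raw)
    return selected


# Made-up remote listing for illustration.
remote = [
    "/home/ec2-user/a.xlsx",
    "/home/ec2-user/reports/b.xlsx",
    "/home/ec2-user/old.xls",
    "/home/ec2-user/notes.csv",
]

whole_tree = excel_files(remote, "/home/ec2-user", recursive=True)
folder_only = excel_files(remote, "/home/ec2-user", recursive=False)
print(whole_tree)
print(folder_only)
```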


Add Configuration

Additional properties can be added using this option as key-value pairs.

Some useful configurations:

maxSampleSizeInMB - Limits the size of the file(s) read from the SFTP source during application design. The default is 50 MB, and the maximum value that can be set is 500 MB.

locale - This configuration can be provided in BCP47 format while reading MS Excel data. It specifies the document’s locale for evaluating fields (e.g., numeric or date fields). By default, the system locale is used.

emulateCSV - Simulates reading the content of MS Excel files as if they had been saved as CSV. Accepted values are True and False.
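
As a rough illustration, the key-value pairs described above could look like the dictionary below. The keys come from this page; the values (200, en-US, True) are example choices, and the check_config helper is a hypothetical sanity check written for this sketch, not part of Gathr.

```python
DEFAULT_SAMPLE_MB = 50    # default stated on this page
MAX_SAMPLE_LIMIT_MB = 500  # maximum stated on this page


def check_config(config):
    """Return a list of problems found in the key-value pairs (empty if none)."""
    problems = []

    size = config.get("maxSampleSizeInMB", str(DEFAULT_SAMPLE_MB))
    try:
        size_mb = int(size)
    except (TypeError, ValueError):
        problems.append("maxSampleSizeInMB must be a whole number of MB")
    else:
        if not 0 < size_mb <= MAX_SAMPLE_LIMIT_MB:
            problems.append("maxSampleSizeInMB must be between 1 and 500")

    emulate = config.get("emulateCSV")
    if emulate is not None and str(emulate).lower() not in ("true", "false"):
        problems.append("emulateCSV must be True or False")

    return problems


# Example key-value pairs; the locale value is a BCP47 tag.
example = {
    "maxSampleSizeInMB": "200",
    "locale": "en-US",
    "emulateCSV": "True",
}

print(check_config(example))
print(check_config({"maxSampleSizeInMB": "800"}))
```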


More Configurations

This section contains additional configuration parameters.

Incremental Read

Select this option to read only the latest file when the file path points to a folder.

Parallelism

Specify the number of SFTP threads to launch in parallel for greater download speed. The default value is 4.
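
The idea behind the Parallelism setting can be sketched with a thread pool. This is a generic illustration, not Gathr's implementation: the fetch function is a stand-in that does no network I/O, and the file paths are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(remote_path):
    """Stand-in for a real SFTP download; returns the path and a fake byte count."""
    return remote_path, len(remote_path)


def fetch_all(remote_paths, parallelism=4):
    """Run up to `parallelism` downloads at the same time."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(fetch, remote_paths))


paths = [f"/data/part-{i}.csv" for i in range(10)]
results = fetch_all(paths, parallelism=4)
print(results[:2])
```

A higher value allows more files to be transferred concurrently, at the cost of more simultaneous load on the SFTP server.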

Is Compressed

Select this option if the source files are compressed (for example, in *.zip, *.tar or *.tar.gz format).


Fetch File Name

Choose this option to automatically fetch the file name from the SFTP source and create a new column named ‘sourceFileName’ for subsequent processing.
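
Conceptually, this option adds one extra field to every record read from a file. The sketch below shows that idea on plain dictionaries. The helper tag_with_source and the sample records are hypothetical, and whether the column holds the bare file name or the full path is an assumption here (the bare name is used).

```python
def tag_with_source(records, file_name):
    """Return copies of the records with a sourceFileName field added."""
    return [{**record, "sourceFileName": file_name} for record in records]


# Made-up records, as if read from a file named orders.csv.
rows = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.0}]
tagged = tag_with_source(rows, "orders.csv")
print(tagged)
```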


Schema

Check the populated schema details. For more details, see Schema Preview →
