SFTP ETL Source

SFTP data source allows you to read data from network file systems.

Schema Type

See the topic Provide Schema for ETL Source → to know how schema details can be provided for data sources.

Header Included

Option to enable or disable the scanning of first row as a header for excel files. This option is disabled by default.

When enabled, an additional option Include Header will get displayed.

Include Header

If All is selected, then the first row will be considered as header of all the sheets that are selected in the source excel file.

If First Sheet is selected, then the first row of only the first sheet will be considered as header out of all the sheets that are selected in the source excel file.


Contains Formulas

Check this option if your sheets contain formulas.

The application can read formulas from Excel only if all sheets are selected.

While using multiple MS Excel sources, choose all sheets in each MS Excel source.

The below condition is applicable if the Contains Formula field is set to True.

To parse date time columns in a SFTP data source with Type of Data selected as MS Excel, use an Expression Evaluator processor and provide an expression in the following format.

CASE WHEN CAST(date_column AS DOUBLE) IS NOT NULL
	THEN FROM_UNIXTIME(UNIX_TIMESTAMP(‘1900-01-01 00:00:000’) + (date_column- 2) * 86400)
	ELSE date_column
END

Specify Sheets

(If file type is selected as MS Excel) The option to specify how the data should be read from the workbook. Either choose the option Specific Sheets and enter the names of all the required sheets that needs to be read. Or, select the option All to read the entire excel file data.

Schema Sheet

(If All the sheets in the excel file are to be read) Preferred sheet name should be entered from which the schema will get inferred. If left blank, the schema will be inferred from the first sheet of the Excel workbook.

Sheet Name

(If Specific Sheets in the excel file are to be read) Preferred sheet(s) name should be entered from which the first sheet will be considered for detecting schema and all given sheets will be read during runtime.

If Fetch From Source is selected, ensure that at least one file in the source directory is less than 500 MB. During design time, SFTP data source can fetch sample data only from files less than 500 MB.

After providing schema type details, the next step is to configure the data source.


Data Source Configuration

Configure the data source parameters as explained below.

Connection Name

Connections are the service identifiers. A connection name can be selected from the list if you have created and saved connection details for SFTP earlier. Or create one as explained in the topic - SFTP Connection →


File Path

File path of the SFTP file system is to be given.

The wildcards, asterisk (*) and question mark (?) are also supported.

Wildcards can be provided in either folder or file for pattern matching.

Use question mark (?) as a wildcard to search for a single character and an asterisk (*) as a wildcard for any number of characters.

Example: The query /folder/?ink will fetch files from the folders named pink, sink, wink, etc.

Whereas, the query /folder/bird* will fetch files from the folders named bird, birding, birds, and other folders that start with bird.


  • Only .xlsx files are supported.

  • If a given folder path (Example: /home/ec2-user) has excel files and sub-folders too have excel files, then the data will be read from all the excel files including the ones available in the sub-folders.

  • In order to read the data only from the specified folder, provide the path with wildcard as given in this example: /home/ec2-user/*.xlsx

Incremental Read: Check mark to read latest file in case of folder.


Parallelism

Option to provide the required number of multiple SFTP threads to be launched in parallel for greater download speed. Default value is given as 4.


Is Compressed

Check mark if the source files are compressed. (For example, in *.zip, *.tar or *.tar.gz formats)


Fetch File Name

Choose this option to automatically fetch the file name from the SFTP source and create a new column named ‘sourceFileName’ for subsequent processing.


Add Configuration

Additional properties can be added using this option as key-value pairs.

Some useful configurations:

maxSampleSizeInMB - This configuration can be used to limit the size of the file(s) to be read from the SFTP source data during application design. The default is 50 MB, and the max size supported to read a file to get sample data can be set to 500 MB.

locale - This configuration can be provided in BCP47 format while reading MS Excel data. It specifies the document’s locale for evaluating fields (e.g., numeric or date fields). By default, the system locale is used.

emulateCSV - This configuration can be used to simulate reading the content of MS Excel files as if they have been saved as CSV. This can have values as True | False


Detect Schema

Check the populated schema details. For more details, see Schema Preview →

Pre Action

To understand how to provide SQL queries or Stored Procedures that will be executed during pipeline run, see Pre-Actions →.

Notes

Optionally, enter notes in the Notes → tab and save the configuration.

Top