ADLS Data Source - Batch and Streaming
Add an ADLS batch or streaming data source to create a pipeline. Click the component to configure it.
Under the Schema Type tab, select the Fetch From Source, Upload Data File, or Use Existing Dataset option. Edit the schema if required and click Next to configure.
On the ADLS channel, you can read data in formats including JSON, CSV, TEXT, XML, Fixed Length, Binary, Avro, Parquet, and ORC. The Provide Schema field offers the following options:
- Infer from Data
- Inline Avro Schema
- Upload Avro Schema
These Provide Schema options are also available with the Upload Data File option under the Schema Type tab.
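Gathr generates the underlying read from the UI configuration, so no code is required. Purely as an illustration, the sketch below shows what a schema pasted into the Inline Avro Schema option might look like for a hypothetical employee dataset; the record and field names are assumptions, not part of the product.

```python
import json

# Hypothetical inline Avro schema for the "Inline Avro Schema" option.
# Record and field names (Employee, id, name, salary) are illustrative only.
inline_avro_schema = """
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id",     "type": "int"},
    {"name": "name",   "type": "string"},
    {"name": "salary", "type": ["null", "double"], "default": null}
  ]
}
"""

# Quick sanity check that the schema is valid JSON before pasting it into the UI.
print(json.loads(inline_avro_schema)["fields"])
```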
Field | Description |
---|---|
Connection Name | Connections are the service identifiers. Select the connection name from the list of available connections from which you want to read data. |
Override Credentials | Check this option to override the credentials of the selected connection. |
Authentication Type | Select the Azure ADLS authentication type. |
Account Name | Provide a valid Azure ADLS account name. |
Account Key | Provide a valid account key. You can also test the connection by clicking the TEST CONNECTION button. |
Container | Provide the container name in Azure Blob storage. |
ADLS Directory Path | Provide the directory path for the ADLS file system. The ADLS source can also be configured with supported compressed data files. |
File Filter | Provide a file pattern. The file filter includes only files whose names match the pattern, for example *.pdf or *emp*.csv. |
Recursive File Lookup | Check this option to retrieve files from the current folder and its sub-folder(s). |
ADD CONFIGURATIONS | Add further configurations (optional). |
Environment Params | Add further environment parameters (optional). |
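Gathr manages the connection internally, but conceptually the fields above correspond to a Spark read over the ADLS Gen2 (abfss) endpoint. The following is a minimal PySpark sketch, not Gathr's implementation; the account name, key, container, directory path, and file pattern are hypothetical placeholders for the fields above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-source-sketch").getOrCreate()

# Hypothetical values standing in for the Account Name, Account Key,
# Container and ADLS Directory Path fields configured in the UI.
account_name = "mygathracct"
account_key  = "<account-key>"
container    = "sales"
directory    = "landing/orders"

# Account-key authentication against the ADLS Gen2 (abfss) endpoint.
spark.conf.set(
    f"fs.azure.account.key.{account_name}.dfs.core.windows.net", account_key
)

# File Filter and Recursive File Lookup roughly correspond to these read options.
df = (
    spark.read
    .option("pathGlobFilter", "*emp*.csv")      # File Filter
    .option("recursiveFileLookup", "true")      # Recursive File Lookup
    .option("header", "true")
    .csv(f"abfss://{container}@{account_name}.dfs.core.windows.net/{directory}")
)
df.show()
```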
Once the above fields are configured for the ADLS data source, click Next for the Incremental Read option.
Field | Description |
---|---|
Enable Incremental Read | Unchecked by default. Check this option to enable incremental read support. |
Read By | Choose to read data incrementally either by the File Modification Time option or by the Column Partition option. |
Upon selecting the File Modification Time option, provide the below detail:
Field | Description |
---|---|
Offset | Specifies the last modified time of the file. The offset time must be earlier than the latest file modification time. Records with a timestamp greater than the specified datetime (in UTC) will be fetched. After each pipeline run, the datetime configuration is set to the most recent timestamp value from the last fetched records. The value should be given in UTC using the ISO date format yyyy-MM-dd'T'HH:mm:ss.SSSZZZ. Example: 2021-12-24T13:20:54.825+0000. |
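If you need to produce the Offset value programmatically, the small Python sketch below (a tooling assumption, not part of Gathr) prints the current UTC time in the expected yyyy-MM-dd'T'HH:mm:ss.SSSZZZ layout.

```python
from datetime import datetime, timezone

# Build an Offset string in UTC ISO format, e.g. 2021-12-24T13:20:54.825+0000.
now_utc = datetime.now(timezone.utc)
offset = now_utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{now_utc.microsecond // 1000:03d}+0000"
print(offset)
```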
Upon selecting the Column Partition option, provide the below details:
Field | Description |
---|---|
Read Control Type | Options to control data fetch: All records in the reference column with values greater than the start value will be read. Limit by Value: all records in the reference column with values greater than the start value but less than or equal to the max value that you set will be read. Limit by Incremental Size: all records in the reference column with values greater than the start value, up to the incremental size that you specify, will be selected. Upon selecting the Inclusive Start Offset checkbox, the start value is included within the selected size. |
Inclusive Start Offset | Check this option to include the start value when incrementally reading the schema. Supports the integer, date, and timestamp data types. If the Limit by Value option is selected as the Read Control Type, provide the Max value along with the Start value of the selected column ID. With Inclusive Start Offset selected, the schema from the Start value to the Max value is read incrementally. |
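As a rough illustration of what Column Partition incremental read amounts to (not Gathr's internal logic), the sketch below filters a DataFrame on a hypothetical reference column id using a start value, a max value, and the Inclusive Start Offset flag; it reuses the df from the connection sketch above.

```python
from pyspark.sql import functions as F

# Hypothetical values standing in for the incremental read configuration.
start_value     = 1000    # Start value (offset carried over from the last run)
max_value       = 2000    # Max value, used when Read Control Type = Limit by Value
inclusive_start = True    # Inclusive Start Offset checkbox

# Include or exclude the start value depending on the checkbox, then cap at max_value.
lower_bound = F.col("id") >= start_value if inclusive_start else F.col("id") > start_value
incremental_df = df.filter(lower_bound & (F.col("id") <= max_value))
incremental_df.show()
```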