S3 Batch Data Source

The S3 Batch channel can incrementally read data from S3 buckets.

An S3 Batch channel can read data from the specified S3 bucket in formats such as JSON, CSV, Text, Parquet, and ORC.

For an S3 Data Source, if the schema is fetched from the source and the data type is CSV, the source configuration shows an additional option, Is Header Included in Source. This indicates whether the data fetched from the source includes a header row.

If Upload Data File is chosen instead, the same Is Header Included in Source option appears, indicating whether the uploaded data includes a header row.

Configuring S3 Batch Data Source

To add an S3 Batch Data Source into your pipeline, drag the Data Source to the canvas and right-click on it to configure.

Under the Schema Type tab, select Fetch From Source or Upload Data File.

Connection Name

Connections are the service identifiers.

Select the connection from which you want to read data from the list of available connections.

Override Credentials

Unchecked by default; select this checkbox to override the connection credentials for user-specific actions.

When checked, provide the AWS Key ID and Secret Access Key.

S3 Protocol

The available S3 protocols are S3, S3n, and S3a.

S3a is supported by Databricks and by Hadoop version 3.x.

S3 and S3n are supported by EMR and by Hadoop versions below 3.x.
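Taken together, the selected protocol, bucket, and path typically resolve to a URI such as the following (an illustrative sketch; the bucket and path names are assumptions):

```
s3://my-bucket/input/data.csv    # S3 protocol  (EMR, Hadoop < 3.x)
s3n://my-bucket/input/data.csv   # S3n protocol (EMR, Hadoop < 3.x)
s3a://my-bucket/input/data.csv   # S3a protocol (Databricks, Hadoop 3.x)
```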

End Point

S3 endpoint details; required if the source is Dell EMC S3.

Bucket Name

Buckets are storage units used to store objects, which consist of data and metadata that describes the data.

Path

File or directory path from which data is to be read.

Enable Incremental Read

Note: Incremental read works during the pipeline run.

Offset

Specifies the last modified time of a file; all files whose last modified time is greater than this value will be read.
Add Configuration

Optionally add custom S3 properties as key-value pairs.

Further configurations can be added in the following ways:

- Use key avroSchema and paste the Avro schema content (as the value) in JSON format to map the schema.

- Use key avroSchemaFilePath and provide the absolute S3 path of the AVSC schema file as the value.

To load the schema file from S3, the IAM role attached to the instance profile is used.
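When the avroSchema key is used, the value is a standard Avro schema declaration in JSON form. A minimal illustrative example (the record and field names are assumptions, not part of the product):

```json
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```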

Click Next for the Incremental Read option.

Incremental Read Configuration

Enable Incremental Read

Unchecked by default; select this option to enable incremental read support.

Read By

Option to read data incrementally, either by File Modification Time or by Column Partition.

Upon selecting the File Modification Time option, provide the detail below:

Offset

Records with a timestamp value greater than the specified datetime (in UTC) will be fetched. After each pipeline run, the offset is set to the most recent timestamp value from the last fetched records. The value should be in UTC, in ISO date format yyyy-MM-dd'T'HH:mm:ss.SSSZZZ, for example 2021-12-24T13:20:54.825+0000.
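A minimal sketch of producing an offset string in the expected format (the helper name is an assumption, not part of the product):

```python
from datetime import datetime, timezone

def to_offset(dt: datetime) -> str:
    """Format a datetime as the UTC offset string the source expects,
    e.g. 2021-12-24T13:20:54.825+0000 (yyyy-MM-dd'T'HH:mm:ss.SSSZZZ)."""
    dt = dt.astimezone(timezone.utc)
    # %f would give microseconds; keep only milliseconds (three digits).
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}+0000"

print(to_offset(datetime(2021, 12, 24, 13, 20, 54, 825000, tzinfo=timezone.utc)))
# 2021-12-24T13:20:54.825+0000
```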

Upon selecting the Column Partition option, provide the details below:

Column

Select the column for incremental read. The listed columns can be of integer, long, date, timestamp, decimal, and similar types.

Note: The selected column should have sequential, sorted (in increasing order), and unique values.

Start Value

Mention the value of the reference column. Only records whose reference column value is greater than this value will be read.
Read Control Type

Provides three options to control the data to be fetched: None, Limit By Count, and Limit By Value.

None: All records whose reference column value is greater than the offset will be read.

Limit By Count: Only the mentioned number of records whose reference column value is greater than the offset will be read.

Limit By Value: All records whose reference column value is greater than the offset and less than the Column Value field will be read.

For None and Limit By Count, it is recommended that the table's data be in sequential, sorted (increasing) order.
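The three read control types can be sketched as filters over the reference column. This is illustrative only; the function and parameter names are assumptions, not the product's API:

```python
def select_records(records, offset, mode="None", count=None, column_value=None):
    """Illustrative filter mimicking the three Read Control Type options.

    `records` is a list of reference-column values, assumed sequential
    and sorted in increasing order, as the documentation recommends.
    """
    above = [r for r in records if r > offset]          # value > offset
    if mode == "Limit By Count":
        return above[:count]                            # first N matching records
    if mode == "Limit By Value":
        return [r for r in above if r < column_value]   # offset < value < Column Value
    return above                                        # None: everything above offset

rows = [10, 20, 30, 40, 50]
print(select_records(rows, 20))                                     # [30, 40, 50]
print(select_records(rows, 20, "Limit By Count", count=2))          # [30, 40]
print(select_records(rows, 20, "Limit By Value", column_value=50))  # [30, 40]
```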

Click on the Add Notes tab. Enter the notes in the space provided.

Click Done to save the configuration.

Configure Pre-Action in Source →
