S3 Streaming Data Source

The S3 data source reads objects from an Amazon S3 bucket. Amazon S3 stores data as objects within resources called buckets.

For an S3 data source, if the schema is fetched from the source and the data type is CSV, the schema has an added tab, Header Included in Source.

This signifies whether the data fetched from the source has a header or not.

If Upload Data File is chosen, the added tab is Is Header Included in Source, which signifies whether the uploaded data has a header or not.

Configuring S3 Data Source

To add the S3 data source to your pipeline, drag the source onto the canvas and click it to configure it.

Under the Schema Type tab, select Fetch From Source or Upload Data File.

Connection Name: Connections are the service identifiers. Select the connection name from the available list of connections from which you would like to read the data.

S3 Protocol: Protocols available are S3, S3n, and S3a.

End Point: S3 endpoint details should be provided if the source is Dell EMC S3.

Bucket Name: Buckets are storage units used to store objects, which consist of data and metadata that describes the data.

Override Credentials: Unchecked by default. Check the checkbox to override credentials for user-specific actions, and provide the AWS Key ID and Secret Access Key.

Path: File or directory path from which data is to be read. The path must be a directory, not an absolute path.

Add Configuration: Used to add additional custom S3 properties as key-value pairs.

Users can add further configurations in the following ways:

- Use key avroSchema if you want to provide an Avro schema file, and paste its content (as the value) in JSON format to map the schema.

- Use key avroSchemaFilePath and provide the absolute S3 path of the AVSC schema file as the value.

To load the schema file from S3, the IAM Role attached to the Instance Profile will be used.
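The sketch below illustrates what the two configuration styles might look like as key-value pairs. Only the keys avroSchema and avroSchemaFilePath come from this documentation; the record name, field names, bucket, and file path are hypothetical placeholders.

```python
import json

# Hypothetical Avro schema used only for illustration.
example_avro_schema = {
    "type": "record",
    "name": "Customer",  # hypothetical record name
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# Option 1: paste the schema content itself (in JSON format) as the value
# of the avroSchema key.
add_configuration = {
    "avroSchema": json.dumps(example_avro_schema),
}

# Option 2: point to an AVSC file stored in S3 instead of pasting the content.
# The bucket and object key below are placeholders.
add_configuration_alt = {
    "avroSchemaFilePath": "s3a://example-bucket/schemas/customer.avsc",
}
```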

Click on the Add Notes tab. Enter the notes in the space provided.

Click Done to save the configuration.
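As a rough, non-authoritative illustration of how the S3 Protocol, Bucket Name, and Path fields described above fit together, the sketch below composes them into an object-store URI. The function and values are placeholders and are not part of the product.

```python
def build_s3_uri(protocol: str, bucket: str, directory: str) -> str:
    """Compose the protocol, bucket, and directory path into one URI."""
    return f"{protocol.lower()}://{bucket}/{directory.strip('/')}/"

# Placeholder values: protocol "s3a", bucket "example-bucket",
# directory "incoming/csv" (a directory, not an absolute file path).
print(build_s3_uri("s3a", "example-bucket", "incoming/csv"))
# -> s3a://example-bucket/incoming/csv/
```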
