S3 Emitter

Amazon S3 stores data as objects within resources called buckets. The S3 emitter writes the pipeline's output as objects to an Amazon S3 bucket.

S3 Emitter Configuration

To add an S3 emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor. Right-click on the emitter to configure it as explained below:

Connection Name

All available S3 connections are listed here. Select a connection for connecting to S3.

S3 Protocol

S3 protocol to be used while writing to S3.

End Point

S3 endpoint details, to be provided if the source is Dell EMC S3.

Bucket Name

Buckets are storage units used to store objects, which consist of data and metadata that describes the data.

Override Credentials

Unchecked by default. Check this option to override credentials for user-specific actions.

AWS Key Id

Provide the S3 account access key.

Secret Access Key

Provide the S3 account secret key.

Path

File or directory path where the data is to be stored.

Output Type

Output format in which the result will be written.

Delimiter

Message field separator.

Output Fields

Fields of the output message.

Partitioning Required

Whether or not to partition the data on S3.
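As a rough illustration only, the sketch below shows how settings like these typically correspond to a plain Spark write to S3 (the emitter applies the equivalent configuration internally). The bucket, credentials, endpoint, paths, and column names are placeholders, and the mapping onto standard Hadoop s3a and Spark CSV options is an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Override Credentials / End Point: assumed mapping to standard Hadoop s3a settings.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "EXAMPLE_AWS_KEY_ID")         # AWS Key Id (placeholder)
hconf.set("fs.s3a.secret.key", "EXAMPLE_SECRET_ACCESS_KEY")  # Secret Access Key (placeholder)
hconf.set("fs.s3a.endpoint", "https://ecs.example.com")      # End Point, e.g. for Dell EMC S3 (placeholder)

df = spark.createDataFrame(
    [(1, "US", "a"), (2, "IN", "b")],
    ["id", "country", "value"])            # stand-in for the pipeline output

(df.select("id", "country", "value")       # Output Fields
   .write
   .option("sep", "|")                     # Delimiter
   .partitionBy("country")                 # Partitioning Required: partition by a column
   .csv("s3a://example-bucket/output/"))   # s3a protocol + Bucket Name + Path, delimited Output Type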
Save Mode

Save Mode is used to specify the expected behavior of saving data to a data sink.

ErrorIfExists: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if the data/table already exists, the new contents are expected to be appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new contents.

Ignore: When persisting data, if the data/table already exists, the save operation is expected to neither save the new contents nor change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.
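Assuming these options behave like Spark's standard DataFrameWriter save modes, a minimal sketch (data and path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])  # stand-in for the pipeline output
target = "s3a://example-bucket/output/"                  # placeholder path

df.write.mode("errorifexists").csv(target)  # ErrorIfExists: fail if data already exists at the path
df.write.mode("append").csv(target)         # Append: add the new contents to the existing data
df.write.mode("overwrite").csv(target)      # Overwrite: replace the existing data
df.write.mode("ignore").csv(target)         # Ignore: leave existing data untouched and skip the write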

Output Mode

Output mode to be used while writing the data to the streaming sink. Select the output mode from the given three options:

Append: Only the new rows in the streaming data are written to the sink.

Complete: All the rows in the streaming data are written to the sink every time there are updates.

Update: Only the rows that were updated in the streaming data are written to the sink every time there are updates.
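For reference, a small sketch of these output modes on a plain Spark Structured Streaming writer. The source, sinks, and paths are placeholders; note that a file sink such as CSV supports only Append, so Complete/Update are shown against the console sink:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()  # built-in test source, stand-in for the pipeline stream

# Append: only the newly arrived rows are written to the sink.
append_query = (stream.writeStream
    .outputMode("append")
    .format("csv")
    .option("path", "s3a://example-bucket/output/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/append/")
    .start())

# Complete: every trigger rewrites all result rows; Update would write only the changed rows.
counts = stream.groupBy("value").count()
complete_query = (counts.writeStream
    .outputMode("complete")   # or .outputMode("update")
    .format("console")
    .start())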

Checkpoint Storage Location

Select the checkpointing storage location. Available options are HDFS, S3, and EFS.

Note: It is recommended to use the s3a protocol in the path.

In the case of an AWS Databricks cluster, while creating a new cluster (within the Cluster List View), the S3 role must be selected under IAM Role.

Checkpoint Connections

Select the connection. Connections are listed corresponding to the selected storage location.

Override Credential

Check this option to override the credentials for user-specific actions, and provide the username.
Checkpoint Directory

The path where the Spark application stores the checkpointing data.

For HDFS and EFS, enter a relative path like /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself.

For S3, enter an absolute path like: S3://BucketName/checkpointingDir

Time-Based Check Point

Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each run the checkpoint location provided above is appended with the current time in milliseconds.
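As an illustration (paths are placeholders), this is roughly how a checkpoint directory, including the time-based suffix, would be passed to a Spark streaming write:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()

checkpoint_dir = "s3a://example-bucket/checkpointingDir"     # S3: absolute path
# checkpoint_dir = "/user/hadoop/checkpointingDir"           # HDFS/EFS: relative path; the system adds the prefix
checkpoint_dir += "/" + str(int(time.time() * 1000))         # Time-Based Check Point: append current time in millis

query = (stream.writeStream
    .format("csv")
    .option("path", "s3a://example-bucket/output/")
    .option("checkpointLocation", checkpoint_dir)
    .start())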
Enable Trigger

A trigger defines how frequently a streaming query should be executed.
Trigger Type

Select one of the options available from the drop-down:

- One-Time Micro-Batch

- Fixed Interval Micro-Batches
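Assuming these map to Spark's standard streaming triggers, a short sketch (interval and paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()

writer = (stream.writeStream
    .format("csv")
    .option("path", "s3a://example-bucket/output/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/"))

query = writer.trigger(once=True).start()                     # One-Time Micro-Batch: process available data once, then stop
# query = writer.trigger(processingTime="5 minutes").start()  # Fixed Interval Micro-Batches: run a micro-batch every interval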

Priority

Priority defines the execution order of emitters.
ADD CONFIGURATION

Enables the user to configure additional custom properties.

Note: Add various Spark configurations as per requirement.

Example: Perform imputation by clicking the ADD CONFIGURATION button.

Note: For imputation, nullValue/emptyValue is replaced with the entered value across the data. (Optional)

Example: With nullValue = 123, the output will replace all null values with 123.
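For illustration, nullValue and emptyValue are also standard Spark CSV writer options, so their effect can be sketched as below (data and path are placeholders; this is an assumption about how the configuration is applied, not the emitter's exact implementation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, "x")], ["id", "value"])  # stand-in with a null value

(df.write
   .option("nullValue", "123")    # nulls in the output are written as 123
   .option("emptyValue", "123")   # empty strings in the output are written as 123
   .csv("s3a://example-bucket/output/"))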
