S3 Emitter
Amazon S3 stores data as objects within resources called buckets. The S3 emitter writes the pipeline's output as objects to an Amazon S3 bucket.
S3 Emitter Configuration
To add an S3 emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor. Right-click on the emitter to configure it as explained below:
Field | Description |
---|---|
Connection Name | All S3 connections will be listed here. Select a connection for connecting to S3. |
S3 protocol | S3 protocol to be used while writing to S3. |
End Point | Provide the S3 endpoint details when connecting to Dell EMC S3. |
Bucket Name | Name of the bucket in which objects are stored. Buckets are storage units that hold objects, each consisting of data and metadata that describes the data. |
Override Credentials | Unchecked by default. Select the checkbox to override the connection credentials for user-specific actions. |
AWS Key Id | Provide the access key ID of the S3 account. |
Secret Access Key | Provide the secret access key of the S3 account. Once the AWS Key Id and Secret Access Key are provided, you have the option to test the connection. |
Path | File or directory path where the data is to be stored. |
Output Type | Output format in which the result will be written. |
Delimiter | Field separator for the output message. |
Output Fields | Fields of the output message. |
Partitioning Required | Whether or not to partition the data on S3. |
Save Mode | Save Mode specifies the expected behavior of saving data to the data sink (see the batch write sketch after this table). ErrorIfExists: when persisting data, if the data already exists, an exception is expected to be thrown. Append: when persisting data, if the data/table already exists, the contents of the data are expected to be appended to the existing data. Overwrite: when persisting data, if the data/table already exists, the existing data is expected to be overwritten by the contents of the data. Ignore: when persisting data, if the data/table already exists, the save operation is expected to neither save the contents of the data nor change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL. |
Output Mode | Output mode to be used while writing data to the streaming emitter (see the streaming write sketch after this table). Select one of the three options: Append: only the new rows in the streaming data are written to the sink. Complete: all the rows in the streaming data are written to the sink every time there are updates. Update: only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Checkpoint Storage Location | Select the checkpointing storage location. Available options are HDFS, S3, and EFS. Note: It is recommended to use the s3a protocol along with the path. For an AWS Databricks cluster, while creating a new cluster (within the Cluster List View), select the S3 role under IAM role. |
Checkpoint Connections | Select the connection. Connections are listed corresponding to the selected storage location. |
Override Credential | Check the option to override credentials for user-specific actions, and provide the username. |
Checkpoint Directory | The path where the Spark application stores checkpointing data. For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself. For S3, enter an absolute path such as s3://BucketName/checkpointingDir. |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each pipeline run the checkpoint location provided above is appended with the current time in milliseconds. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
Trigger Type | Select one of the options available from the drop-down: - One-Time Micro-Batch - Fixed Interval Micro-Batches. |
Priority | Priority defines the execution order of emitters. |
ADD CONFIGURATION | Enables you to configure additional custom properties. Note: add Spark configurations as per requirement. Example: perform imputation by clicking the ADD CONFIGURATION button. Note: for imputation, nullValue/emptyValue is replaced with the entered value across the data. (Optional) Example: with nullValue = 123, the output will replace all null values with 123 (see the imputation sketch after this table). |
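The Save Mode, Partitioning Required, Delimiter, and Path fields correspond closely to options on Spark's DataFrameWriter. The following PySpark sketch shows how an equivalent batch write to S3 might look; the bucket name, path, and column names are hypothetical placeholders, and the sketch illustrates the underlying Spark behavior rather than Gathr's internal implementation.

```python
# Minimal batch-write sketch, assuming CSV output over the s3a protocol.
# Bucket, path, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-emitter-batch-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", "US"), (2, "bob", "IN")],
    ["id", "name", "country"],
)

(df.write
   .mode("append")                       # Save Mode: append / overwrite / ignore / errorifexists
   .partitionBy("country")               # Partitioning Required: partition the output by a column
   .option("delimiter", ",")             # Delimiter: field separator for delimited output
   .csv("s3a://my-bucket/output/path"))  # Path and Bucket Name, written via the s3a protocol
```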
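For streaming pipelines, the Output Mode, Checkpoint Directory, and Trigger Type fields map to Spark Structured Streaming's writeStream options. Below is a hedged sketch, assuming Parquet output to S3 with a fixed-interval trigger; the rate source and all paths are hypothetical placeholders, and the `spark` session is reused from the batch sketch above.

```python
# Minimal structured-streaming sketch: output mode, checkpoint location, and trigger.
stream_df = (spark.readStream
                  .format("rate")   # toy source that generates rows at a fixed rate
                  .load())

query = (stream_df.writeStream
                  .outputMode("append")                              # Output Mode
                  .format("parquet")                                 # Output Type
                  .option("path", "s3a://my-bucket/stream-output/")  # Path
                  .option("checkpointLocation",
                          "s3a://my-bucket/checkpointingDir")        # Checkpoint Directory (absolute S3 path)
                  .trigger(processingTime="30 seconds")              # Fixed Interval Micro-Batches
                  .start())

query.awaitTermination()
```

For a One-Time Micro-Batch trigger, `.trigger(availableNow=True)` (or `once=True` on older Spark versions) would be used instead of a processing-time interval.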
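The nullValue/emptyValue imputation described under ADD CONFIGURATION can be pictured in plain Spark terms. Whether Gathr applies it through the CSV writer options or through a fill on the DataFrame is an assumption; both variants below emit 123 wherever a value is missing, and the path and column names are hypothetical placeholders (the `spark` session is reused from the batch sketch above).

```python
# Sketch of nullValue/emptyValue imputation when writing CSV output.
df_with_nulls = spark.createDataFrame(
    [(1, "alice"), (2, None)],
    ["id", "name"],
)

(df_with_nulls.write
   .mode("overwrite")
   .option("nullValue", "123")    # write "123" in place of null fields
   .option("emptyValue", "123")   # write "123" in place of empty-string fields
   .csv("s3a://my-bucket/imputed/"))

# Equivalent effect applied to the DataFrame itself before writing:
df_filled = df_with_nulls.fillna("123", subset=["name"])
```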