Azure Blob ETL Target

A Blob emitter lets you write data in different formats (JSON, CSV, ORC, Parquet, and more) to blob containers by specifying a directory path.

Azure Blob Emitter Configuration

To add an Azure Blob emitter to your pipeline, drag the emitter onto the canvas and connect it to a data source or processor.

👉

If the data source in the pipeline has a streaming component, the emitter shows four additional properties: Checkpoint Storage Location, Checkpoint Connections, Checkpoint Directory, and Time-Based Check Point.

The configuration settings are as follows:

Connection Name All connections will be listed here. Select a connection for connecting to Azure Blob.

Container Azure Blob Container Name.

Path Sub-directories of the container mentioned above to which data is to be written.

Output Type Output format in which result will be processed.

Delimiter Message Field separator.

Output Fields Select the fields that need to be included in the output data.

Partitioning Required If checked, data will be partitioned.

Save Mode Save mode specifies how to handle the existing data.

Output Mode Output mode to be used while writing the data to the streaming emitter. Select the output mode from the three options given below:

Append: Only the new rows in the streaming data are written to the sink.

Complete Mode: All the rows in the streaming data are written to the sink every time there are updates.

Update Mode: Only the rows that were updated in the streaming data are written to the sink every time there are updates.
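The difference between the three output modes can be illustrated with a small plain-Python sketch. It does not use Spark or Gathr; the function name `rows_to_write` and the dictionary-based "result table" are hypothetical and exist only to show which rows each mode would hand to the sink after one batch of updates.

```python
def rows_to_write(mode, previous, current):
    """Return the rows a sink would receive after one batch.

    `previous` and `current` map a row key to its value before and
    after the batch. This is an illustration, not the Spark API.
    """
    if mode == "append":
        # Only rows that did not exist before this batch.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "complete":
        # Every row, each time there is an update.
        return dict(current)
    if mode == "update":
        # Rows whose value changed; a brand-new row also counts as
        # changed in this sketch.
        return {k: v for k, v in current.items()
                if k not in previous or previous[k] != v}
    raise ValueError(f"unknown output mode: {mode}")


before = {"a": 1, "b": 2}
after = {"a": 1, "b": 3, "c": 4}
for mode in ("append", "complete", "update"):
    print(mode, rows_to_write(mode, before, after))
```

In this example, row `a` is unchanged, row `b` is updated, and row `c` is new, so each mode selects a different subset.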

Checkpoint Storage Location Select the checkpointing storage location. Available options are HDFS, S3, and EFS.

Checkpoint Connections Select the connection. Connections are listed corresponding to the selected storage location.

Checkpoint Directory Path where the Spark application stores the checkpointing data.

For HDFS and EFS, enter a relative path like /user/hadoop/checkpointingDir. The system will add a suitable prefix by itself.

For S3, enter an absolute path like: S3://BucketName/checkpointingDir
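The path rules for the Checkpoint Directory can be expressed as a small validation sketch. The function `check_checkpoint_dir` is hypothetical and is not part of Gathr; it only encodes the two rules stated here (a path without a scheme for HDFS and EFS, a full `S3://bucket/dir` path for S3) and does not reproduce whatever prefix the system adds internally.

```python
def check_checkpoint_dir(storage, path):
    """Validate a checkpoint directory against the documented rules.

    Returns the path unchanged when it looks valid, otherwise raises
    ValueError. Hypothetical helper for illustration only.
    """
    storage = storage.upper()
    if storage in ("HDFS", "EFS"):
        # No scheme or host here: the system adds the prefix itself.
        if "://" in path or not path.startswith("/"):
            raise ValueError("expected a path like /user/hadoop/checkpointingDir")
        return path
    if storage == "S3":
        # A full path is required: scheme, bucket, and directory.
        if not path.lower().startswith("s3://"):
            raise ValueError("expected a path like S3://BucketName/checkpointingDir")
        bucket, _, directory = path[len("s3://"):].partition("/")
        if not bucket or not directory:
            raise ValueError("both a bucket name and a directory are required")
        return path
    raise ValueError(f"unsupported storage location: {storage}")


print(check_checkpoint_dir("HDFS", "/user/hadoop/checkpointingDir"))
print(check_checkpoint_dir("S3", "S3://BucketName/checkpointingDir"))
```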

Time-Based Check Point Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., in each pipeline run the checkpoint location provided above is appended with the current time in milliseconds.
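As a rough illustration of a time-based checkpoint, the sketch below appends the current time in milliseconds to a checkpoint location. The function name `timestamped_checkpoint` and the use of `/` as the separator are assumptions; the exact format the product uses is not stated here.

```python
import time


def timestamped_checkpoint(base, now_millis=None):
    """Append the current epoch time in milliseconds to `base`.

    The "/" separator is an assumption made for this sketch.
    """
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    return f"{base.rstrip('/')}/{now_millis}"


# Each run gets its own checkpoint location.
print(timestamped_checkpoint("/user/hadoop/checkpointingDir"))
```

Because every run gets a fresh location, a run started this way does not pick up the checkpoint state written by an earlier run.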

Enable Trigger Trigger defines how frequently a streaming query should be executed.

Processing Time Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger time interval in minutes or seconds.

Add Configuration Lets you configure additional properties.
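For readers familiar with Spark Structured Streaming, a processing-time trigger is usually written as an interval string such as "30 seconds". The helper below is a hypothetical sketch (not a Gathr or Spark function) that builds such a string from a value and a unit, restricted to the two units mentioned here.

```python
def processing_time(value, unit):
    """Build an interval string such as "30 seconds".

    Hypothetical helper; only the two units named in this section
    are accepted.
    """
    if unit not in ("seconds", "minutes"):
        raise ValueError("unit must be 'seconds' or 'minutes'")
    if not isinstance(value, int) or value <= 0:
        raise ValueError("value must be a positive integer")
    return f"{value} {unit}"


# In PySpark, a string like this can be passed as
# DataStreamWriter.trigger(processingTime="30 seconds").
print(processing_time(30, "seconds"))
print(processing_time(5, "minutes"))
```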

👉

Add various Spark configurations as per requirement.

Example: Perform imputation by clicking the ADD CONFIGURATION button.

👉

For imputation, replace nullValue/emptyValue with the entered value across the data. (Optional)

Example: with nullValue = 123, all null values in the output are replaced with 123.
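The effect of the nullValue and emptyValue settings can be pictured with a plain-Python sketch. The function `impute` and the list-of-dictionaries data layout are hypothetical; the emitter applies the replacement itself when writing, so this only shows the expected result.

```python
def impute(rows, null_value=None, empty_value=None):
    """Replace nulls and empty strings across all fields.

    A setting left as None is treated as "not configured", so the
    matching values are kept unchanged.
    """
    result = []
    for row in rows:
        new_row = {}
        for field, value in row.items():
            if value is None and null_value is not None:
                value = null_value
            elif value == "" and empty_value is not None:
                value = empty_value
            new_row[field] = value
        result.append(new_row)
    return result


rows = [{"id": 1, "score": None}, {"id": 2, "score": 7}]
print(impute(rows, null_value="123"))
```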

Click on the Next button. Enter the notes in the space provided.

Click on the DONE button to save the configuration.
