Advanced HDFS Emitter

The Advanced HDFS emitter allows you to apply a file rotation policy while writing data to HDFS.

Advanced HDFS Emitter Configuration

To add an Advanced HDFS emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor. The configuration settings are as follows:

Connection Name

All advanced HDFS connections are listed here. Select the connection to be used for connecting to HDFS.

Override Credentials

Check this checkbox to override the connection's credentials with user-specific credentials.

Username

The name of the user through which the Hadoop service is running.

Output Mode

Output mode to be used while writing the data to the streaming sink. Select the output mode from the given three options:

Append: Output mode in which only the new rows in the streaming data are written to the sink.

Complete: Output mode in which all the rows in the streaming data are written to the sink every time there are updates.

Update: Output mode in which only the rows that were updated in the streaming data are written to the sink every time there are updates.
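
These three options correspond to Spark Structured Streaming's output modes. A minimal Scala sketch of the idea, using the built-in rate source and a console sink purely for illustration (the emitter itself targets HDFS):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("output-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    // The built-in rate source emits (timestamp, value) rows for testing.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()

    stream.writeStream
      .outputMode("append")   // "update" also works here; "complete" requires an aggregation in the query
      .format("console")
      .start()
      .awaitTermination()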

Enable Trigger

A trigger defines how frequently a streaming query should be executed.

Trigger Type

Available options:

- One-Time Micro Batch

- Fixed-Interval Micro Batches (Provide the Processing Time)
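
Both options correspond to Spark's trigger settings. A hedged sketch, reusing stream from the previous sketch:

    import org.apache.spark.sql.streaming.Trigger

    // One-Time Micro Batch: process whatever is available once, then stop.
    stream.writeStream
      .trigger(Trigger.Once())
      .format("console")
      .start()

    // Fixed-Interval Micro Batches: fire a micro-batch at each processing-time interval.
    stream.writeStream
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .format("console")
      .start()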

ADD CONFIGURATION

The user can add further configurations (optional).

Example: Perform null-value imputation by clicking the ADD CONFIGURATION button.

With nullValue = 123, the output will replace all null values with 123.
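
A minimal sketch of what such a configuration amounts to, assuming the key-value pair is passed through as a writer option on a delimited output (the data and the output path are illustrative):

    import spark.implicits._   // spark from the first sketch

    val df = Seq(("a", Some(1)), ("b", None)).toDF("key", "value")

    df.write
      .option("nullValue", "123")   // nulls in the CSV output are written as the literal 123
      .option("header", "true")
      .csv("/tmp/advanced-hdfs-sketch/null-value")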

ENVIRONMENT PARAMS

Click the + ADD PARAM button to add further parameters as key-value pairs.

HDFS PATH
Path Type

Select path type from the drop-down. Available options are:

- Static

- Dynamic

HDFS Path

Directory path in HDFS to which the data is to be written.

Output Fields

Fields of the message that will be persisted.

Partitioning Required

Check this checkbox to partition the table.

Partition Column

Select the fields on which the table will be partitioned.

Include Column Name in Partition

Check this option to include the column name in the partition path.

Output Type

Output format in which the result will be written.

Delimited: A delimited format such as a comma-separated values (CSV) file, in which the data items are separated by a delimiter such as a comma.

JSON: An open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).

ORC: Stands for Optimized Row Columnar, a columnar format that stores data more efficiently than many other file formats.

AVRO: Avro stores the data definition in JSON format, making it easy to read and interpret.

Parquet: Parquet stores nested data structures in a flat columnar format.
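
In Spark terms, each output type selects the writer's data source format. A sketch, reusing df from the earlier sketch:

    df.write.format("csv").option("delimiter", ",").save("/tmp/out/delimited")
    df.write.format("json").save("/tmp/out/json")
    df.write.format("orc").save("/tmp/out/orc")
    df.write.format("avro").save("/tmp/out/avro")   // needs the external spark-avro package
    df.write.format("parquet").save("/tmp/out/parquet")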

Rotation Policy

Select a rotation policy from: None, Size based, Time based, Time and Size based, or Record based.

None: No rotation policy is applied.

Size based: Data is written to the same file until the specified size in bytes has been reached.

Time based: Data is written to the same file for the specified amount of time in seconds.

Time and Size based: Data is written to the same file until either the size or the time criterion is met, whichever comes first.

Record based: Data is written to the same file until the specified number of records has been written.
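
The emitter's internal rotation logic is not documented here, but the decision that the combined policy encodes can be sketched. A hypothetical illustration; the names below are not the product's API:

    // Hypothetical sketch of the "Time and Size based" check: rotate the current
    // file as soon as either criterion is met, whichever happens first.
    final case class RotationPolicy(maxBytes: Long, maxAgeMillis: Long)

    def shouldRotate(bytesWritten: Long, fileOpenedAtMillis: Long, p: RotationPolicy): Boolean = {
      val ageMillis = System.currentTimeMillis() - fileOpenedAtMillis
      bytesWritten >= p.maxBytes || ageMillis >= p.maxAgeMillis
    }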

Delimiter

Select the message field separator.

Checkpoint Directory

The HDFS path where the Spark application stores the checkpoint data.

Time-Based Checkpoint

Select this checkbox to enable a time-based checkpoint on each pipeline run, i.e., in each run the checkpoint location provided above is suffixed with the current time in milliseconds.

Block Size

Size of each block (in bytes) allocated in HDFS.

Replication

Replication factor used to make additional copies of the data.

Compression Type

Algorithm used to compress the data. The supported algorithms are listed below.

Save Mode

Save Mode is used to specify the expected behavior of saving data to a data sink.

ErrorIfExists: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if data/table already exists, the contents of the DataFrame are expected to be appended to the existing data.

Overwrite: When persisting data, if data/table already exists, existing data is expected to be overwritten by the contents of the data.

Ignore: When persisting data, if data/table already exists, the save operation is expected to not save the contents of the data and to not change the existing data.

This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
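
These four behaviors correspond to Spark's SaveMode values. A sketch, reusing df from the earlier sketch:

    import org.apache.spark.sql.SaveMode

    df.write
      .mode(SaveMode.Append)   // or SaveMode.ErrorIfExists / SaveMode.Overwrite / SaveMode.Ignore
      .parquet("/tmp/out/save-mode")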

Enable Trigger

A trigger defines how frequently a streaming query should be executed.

Path Type

Path Type could be either Static or Dynamic.

Static: When you select Static, the HDFS Path field is presented as a text box. Provide a valid HDFS path to emit data to.

Dynamic: With Dynamic, the HDFS Path field is populated with a drop-down of the field names of the incoming dataset.

The value of the selected field should be a valid HDFS path; that path is the location where the records will be emitted.

HDFS Path

Enter the directory path (for the Static path type) or the field name (for Dynamic).

Output Fields

Fields in the message that need to be a part of the output message.

Partitioning Required

Check this option if the table is to be partitioned. The Partition List input box will appear.

Partition List

Select the fields on which the table will be partitioned.

Include Column Name in Partition

Includes the column name in the partition path.

Example: {Column Name}={Column Value}
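
Spark's partitionBy produces exactly this column=value directory layout. A sketch with a hypothetical country column:

    import spark.implicits._   // spark from the first sketch

    val sales = Seq(("US", 10), ("DE", 7)).toDF("country", "amount")

    // Writes directories like /tmp/out/partitioned/country=US/ and .../country=DE/
    sales.write.partitionBy("country").parquet("/tmp/out/partitioned")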

Output Type

Output format in which the results will be written.

Rotation Policy

Select a rotation policy from: None, Size based, Time based, Time and Size based, or Record based.

None: No rotation policy is applied.

Size based: Data is written to the same file until the specified size in bytes has been reached.

Time based: Data is written to the same file for the specified amount of time in seconds.

Time and Size based: Data is written to the same file until either the size or the time criterion is met, whichever comes first.

Record based: Data is written to the same file until the specified number of records has been written.

Raw Data Size

If the rotation policy is Size based or Time and Size based, enter the raw data size in bytes after which the file will be rotated.

File Rotation Time

If the rotation policy is Time based or Time and Size based, enter the time in milliseconds after which the file will be rotated.

Record Count

Number of records after which the data will be written to a new file. This field appears when you select Record based as the rotation policy.

Delimiter

Message field separator.

Checkpoint Directory

The path where the Spark application stores the checkpointing data.

For HDFS and EFS, enter a relative path like /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself.

For S3, enter an absolute path like: S3://BucketName/checkpointingDir
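
In Spark terms this is the checkpointLocation of the streaming query. A sketch, reusing stream from the first sketch; both paths are illustrative:

    stream.writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", "/data/out")                                     // HDFS output directory
      .option("checkpointLocation", "/user/hadoop/checkpointingDir")   // relative HDFS path, per the note above
      .start()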

Time-Based Checkpoint

Select this checkbox to enable a time-based checkpoint on each pipeline run, i.e., in each run the checkpoint location provided above is suffixed with the current time in milliseconds.

Block Size

Size of each block (in bytes) allocated in HDFS.

Replication

Replication factor used to make additional copies of the data.
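
Both settings map onto standard HDFS client properties. A sketch of setting them on the job's Hadoop configuration, assuming the emitter forwards these values there:

    // dfs.blocksize and dfs.replication are standard HDFS client properties.
    spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", "134217728") // 128 MB blocks
    spark.sparkContext.hadoopConfiguration.set("dfs.replication", "3")       // keep three copies of each block
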
Compression Type

Algorithm used to compress the data. The compression algorithms that you can apply are:

- NONE

- DEFLATE

- GZIP

- BZIP2

- SNAPPY
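
For the file-based output types, the chosen algorithm maps onto the writer's compression option; which codecs are accepted depends on the output type. A sketch, reusing df:

    df.write
      .option("compression", "gzip")   // CSV also accepts none, bzip2, deflate, and snappy
      .csv("/tmp/out/compressed")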
