Advanced HDFS Emitter
The Advanced HDFS emitter lets you apply a rotation policy while writing data to HDFS, so that output files are rolled over based on size, time, or record count.
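The rotation policies referenced in the configuration table below (size based, time based, and record based) all follow the same basic idea: keep writing to the current file until a limit is crossed, then roll over to a new file. The following is a minimal, illustrative sketch of that decision logic, assuming limits that correspond to the Raw Data Size, File Rotation Time, and Record Count fields described later; the `RotationPolicy` class and its parameter names are hypothetical and do not reflect Gathr's actual implementation.

```python
# Illustrative sketch only -- NOT Gathr's implementation. It models the kind of
# decision a file-rotation policy makes when choosing whether to roll over.
import time


class RotationPolicy:
    """Hypothetical helper tracking one open file and deciding when to rotate.

    max_bytes   -- size-based limit (cf. Raw Data Size), or None
    max_seconds -- time-based limit (cf. File Rotation Time), or None
    max_records -- record-based limit (cf. Record Count), or None
    """

    def __init__(self, max_bytes=None, max_seconds=None, max_records=None):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.max_records = max_records
        self.reset()

    def reset(self):
        # Called whenever a new output file is opened.
        self.bytes_written = 0
        self.records_written = 0
        self.opened_at = time.time()

    def record_write(self, record_bytes):
        # Update counters after each record is written.
        self.bytes_written += record_bytes
        self.records_written += 1

    def should_rotate(self):
        # "Time and Size based" rotates as soon as EITHER criterion is met.
        if self.max_bytes is not None and self.bytes_written >= self.max_bytes:
            return True
        if self.max_seconds is not None and time.time() - self.opened_at >= self.max_seconds:
            return True
        if self.max_records is not None and self.records_written >= self.max_records:
            return True
        return False
```

Under these assumptions, a Time and Size based policy is simply a policy with both `max_bytes` and `max_seconds` set, and a policy of None is one with no limits set at all.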
Advanced HDFS Emitter Configuration
To add an Advanced HDFS emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor. The configuration settings are as follows:
Field | Description |
---|---|
Connection Name | All available advanced HDFS connections are listed here. Select the connection to use for connecting to HDFS. |
Override Credentials | Check this checkbox to override the connection credentials and perform user-specific actions. |
Username | The name of the user through which the Hadoop service is running. Click the TEST CONNECTION button to test the connection. |
Output Mode | Output mode to be used while writing the data to the streaming sink (these streaming write settings are illustrated in the PySpark sketch after the configuration table). Select the output mode from the given three options: Append: output mode in which only the new rows in the streaming data are written to the sink. Complete: output mode in which all the rows in the streaming data are written to the sink every time there are updates. Update: output mode in which only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
Trigger Type | Available options: - One-Time Micro Batch - Fixed-Interval Micro Batches (Provide the Processing Time) |
ADD CONFIGURATION | Optionally add further Spark configurations as per requirement by clicking the ADD CONFIGURATION button. Example: to perform imputation, add a nullValue/emptyValue configuration; the entered value replaces null/empty values across the data. For instance, with nullValue = 123, all null values in the output are replaced with 123. |
ENVIRONMENT PARAMS | Click the + ADD PARAM button to add further parameters as key-value pair. |
HDFS PATH | |
Path Type | Select path type from the drop-down. Available options are: - Static - Dynamic |
HDFS Path | Directory path in HDFS to which the data is to be written. |
Output Fields | Message fields that will be persisted. |
Partitioning Required | Check the check-box to partition the table. |
Partition Column | Option to select fields on which the table will be partitioned. |
Include Column Name in Partition | Option to include the column name in partition path. |
Output Type | Output format in which the result will be processed. Delimited: delimited text formats; in a comma-separated values (CSV) file the data items are separated using commas as a delimiter. JSON: an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). ORC: Optimized Row Columnar, a format that stores data in a more optimized way than other file formats. AVRO: stores the data definition in JSON format, making it easy to read and interpret. Parquet: stores nested data structures in a flat columnar format. |
Rotation Policy | Select a rotation policy from: None, Size based, Time based, Time and Size based, or Record based. None: no rotation policy is applied. Size based: data is written to the same file until the mentioned size in bytes has been reached. Time based: data is written to the same file for the mentioned amount of time in seconds. Time and Size based: data is written to the same file until either the size or the time criterion is met. Record based: data is written to the same file until the mentioned number of records has been written. |
Delimiter | Select the message field separator. |
Check Point Directory | It is the HDFS Path where the Spark application stores the checkpoint data. |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., in each run the checkpoint location provided above is appended with the current time in milliseconds. |
Block Size | Size of each block (in bytes) allocated in HDFS. |
Replication | Replication factor used to make additional copies of the data. |
Compression Type | Algorithm used to compress the data. |
Save Mode | Save Mode is used to specify the expected behavior of saving data to a data sink. ErrorIfExists: when persisting data, if the data already exists, an exception is expected to be thrown. Append: when persisting data, if the data/table already exists, the contents of the DataFrame are expected to be appended to the existing data. Overwrite: when persisting data, if the data/table already exists, the existing data is expected to be overwritten by the contents of the DataFrame. Ignore: when persisting data, if the data/table already exists, the save operation is expected to neither save the contents of the DataFrame nor change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL. |
Enable Triggers | Trigger defines how frequently a streaming query should be executed. |
Path Type | Path Type can be either Static or Dynamic. Static: when you select Static, the HDFS Path field is presented as a text box. Provide a valid HDFS path to emit data to. Dynamic: when you select Dynamic, the HDFS Path field is populated with a drop-down of the field names of the incoming dataset. The value of the selected field should be a valid HDFS path, and that path is the location where the records will be emitted. |
HDFS Path | Enter the Directory path/Field name. |
Output Fields | Fields in the message that need to be part of the output message. |
Partitioning Required | Check this option if the table is to be partitioned. The Partition List input box will then appear. |
Partition List | Select the fields on which the table will be partitioned. |
Include Column Name in Partition | Includes the column name in the partition path. Example: {Column Name}={Column Value} |
Output Type | Output format in which the results will be processed. |
Rotation Policy | Select a rotation policy from: None, Size based, Time based, Time and Size based, or Record based. None: no rotation policy is applied. Size based: data is written to the same file until the mentioned size in bytes has been reached. Time based: data is written to the same file for the mentioned amount of time in seconds. Time and Size based: data is written to the same file until either the size or the time criterion is met. Record based: data is written to the same file until the mentioned number of records has been written. |
Raw Data Size | If the rotation policy is Size based or Time and Size based, enter the raw data size in bytes after which the file will be rotated. |
File Rotation Time | If the rotation policy is Time based or Time and Size based, enter the time in milliseconds after which the file will be rotated. |
Record Count | Number of records after which the data will be written to a new file. This field appears when the Rotation Policy is Record based. |
Delimiter | Message field separator. |
Checkpoint Directory | It is the path where the Spark application stores the checkpointing data. For HDFS and EFS, enter a relative path; for S3, enter an absolute path. |
Time-based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., in each run the checkpoint location provided above is appended with the current time in milliseconds. |
Block Size | Size of each block (in bytes) allocated in HDFS. |
Replication | Replication factor used to make additional copies of the data. |
Compression Type | Algorithm used to compress data. Types of Compression algorithms that you can apply are: - NONE - DEFLATE - GZIP - BZIP2 - SNAPPY |
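Because the emitter runs on Spark, several of the settings above (output mode, trigger, checkpoint directory, partition columns, output type, and compression) correspond to standard Spark Structured Streaming write options. The sketch below is a rough, hedged illustration of that correspondence, not the code Gathr generates; the rate source, the paths, and the chosen values are placeholders.

```python
# Illustrative PySpark sketch, not Gathr's generated code. It shows roughly how
# the emitter's settings map onto standard Structured Streaming write options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("advanced-hdfs-emitter-sketch").getOrCreate()

# Placeholder streaming source; in a pipeline this would be the connected
# Data Source or processor output.
df = spark.readStream.format("rate").load()

query = (
    df.writeStream
      .outputMode("append")                                  # Output Mode: Append / Complete / Update
      .trigger(processingTime="10 seconds")                  # Trigger Type: fixed-interval micro batches
      .format("parquet")                                     # Output Type: delimited / json / orc / avro / parquet
      .option("checkpointLocation", "/tmp/checkpoints/hdfs-emitter")  # Checkpoint Directory
      .option("compression", "snappy")                       # Compression Type
      .partitionBy("value")                                  # Partition column(s), when partitioning is required
      .start("hdfs://namenode:8020/data/output")             # HDFS Path (Static)
)

query.awaitTermination()
```

A static HDFS Path maps to the literal path passed to start(); the dynamic path type, file rotation, block size, replication, and delimited-output handling are emitter-specific behaviors with no direct one-line Spark equivalent in this sketch.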