Advanced HDFS Emitter
The Advanced HDFS emitter lets you apply a rotation policy while writing data to HDFS, so that output files are rolled over based on size, time, or record count.
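The rotation policies referenced in the configuration table below (size based, time based, and record based) all follow the same basic idea: keep writing to the current file until a limit is crossed, then roll over to a new file. The following is a minimal, illustrative sketch of that decision logic, assuming limits that correspond to the Raw Data Size, File Rotation Time, and Record Count fields described later; the `RotationPolicy` class and its parameter names are hypothetical and do not reflect Gathr's actual implementation.

```python
# Illustrative sketch only -- NOT Gathr's implementation. It models the kind of
# decision a file-rotation policy makes when choosing whether to roll over.
import time


class RotationPolicy:
    """Hypothetical helper tracking one open file and deciding when to rotate.

    max_bytes   -- size-based limit (cf. Raw Data Size), or None
    max_seconds -- time-based limit (cf. File Rotation Time), or None
    max_records -- record-based limit (cf. Record Count), or None
    """

    def __init__(self, max_bytes=None, max_seconds=None, max_records=None):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.max_records = max_records
        self.reset()

    def reset(self):
        # Called whenever a new output file is opened.
        self.bytes_written = 0
        self.records_written = 0
        self.opened_at = time.time()

    def record_write(self, record_bytes):
        # Update counters after each record is written.
        self.bytes_written += record_bytes
        self.records_written += 1

    def should_rotate(self):
        # "Time and Size based" rotates as soon as EITHER criterion is met.
        if self.max_bytes is not None and self.bytes_written >= self.max_bytes:
            return True
        if self.max_seconds is not None and time.time() - self.opened_at >= self.max_seconds:
            return True
        if self.max_records is not None and self.records_written >= self.max_records:
            return True
        return False
```

Under these assumptions, a Time and Size based policy is simply a policy with both `max_bytes` and `max_seconds` set, and a policy of None is one with no limits set at all.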
Advanced HDFS Emitter Configuration
To add an Advanced HDFS emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor. The configuration settings are as follows:
Field | Description |
---|---|
Connection Name | All available advanced HDFS connections are listed here. Select the connection to use for connecting to HDFS. |
Override Credentials | Check this checkbox to override the connection credentials and perform user-specific actions. |
Username | The name of the user through which the Hadoop service is running. Click the TEST CONNECTION button to test the connection. |
Output Mode | Output mode to be used while writing the data to the streaming sink (these streaming write settings are illustrated in the PySpark sketch after the configuration table). Select the output mode from the given three options: Append: output mode in which only the new rows in the streaming data are written to the sink. Complete: output mode in which all the rows in the streaming data are written to the sink every time there are updates. Update: output mode in which only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
Trigger Type | Available options: - One-Time Micro Batch - Fixed-Interval Micro Batches (Provide the Processing Time) |
ADD CONFIGURATION | Optionally add further Spark configurations as per requirement by clicking the ADD CONFIGURATION button. Example: to perform imputation, add a nullValue/emptyValue configuration; the entered value replaces null/empty values across the data. For instance, with nullValue = 123, all null values in the output are replaced with 123. |
ENVIRONMENT PARAMS | Click the + ADD PARAM button to add further parameters as key-value pair. |
HDFS PATH | |
Path Type | Select path type from the drop-down. Available options are: - Static - Dynamic |
HDFS Path | Directory path in HDFS to which the data is to be written. |
Output Fields | Message fields that will be persisted. |
Partitioning Required | Check the check-box to partition the table. |
Partition Column | Option to select fields on which the table will be partitioned. |
Include Column Name in Partition | Option to include the column name in partition path. |
Output Type | Output format in which the result will be processed. Delimited: delimited text formats; in a comma-separated values (CSV) file the data items are separated using commas as a delimiter. JSON: an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). ORC: Optimized Row Columnar, a format that stores data in a more optimized way than other file formats. AVRO: stores the data definition in JSON format, making it easy to read and interpret. Parquet: stores nested data structures in a flat columnar format. |
Rotation Policy | Select a rotation policy from: None, Size based, Time based, Time and Size based, or Record based. None: no rotation policy is applied. Size based: data is written to the same file until the mentioned size in bytes has been reached. Time based: data is written to the same file for the mentioned amount of time in seconds. Time and Size based: data is written to the same file until either the size or the time criterion is met. Record based: data is written to the same file until the mentioned number of records has been written. |
Delimiter | Select the message field separator. |
Check Point Directory | It is the HDFS Path where the Spark application stores the checkpoint data. |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., in each run the checkpoint location provided above is appended with the current time in milliseconds. |
Block Size | Size of each block (in bytes) allocated in HDFS. |
Replication | Replication factor used to make additional copies of the data. |
Compression Type | Algorithm used to compress the data. |
Save Mode | Save Mode is used to specify the expected behavior of saving data to a data sink. ErrorIfExists: when persisting data, if the data already exists, an exception is expected to be thrown. Append: when persisting data, if the data/table already exists, the contents of the DataFrame are expected to be appended to the existing data. Overwrite: when persisting data, if the data/table already exists, the existing data is expected to be overwritten by the contents of the DataFrame. Ignore: when persisting data, if the data/table already exists, the save operation is expected to neither save the contents of the DataFrame nor change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL. |
Enable Triggers | Trigger defines how frequently a streaming query should be executed. |
Path Type | Path Type can be either Static or Dynamic. Static: when you select Static, the HDFS Path field is presented as a text box. Provide a valid HDFS path to emit data to. Dynamic: when you select Dynamic, the HDFS Path field is populated with a drop-down of the field names of the incoming dataset. The value of the selected field should be a valid HDFS path, and that path is the location where the records will be emitted. |
HDFS Path | Enter the Directory path/Field name. |
Output Fields | Fields in the message that need to be part of the output message. |
Partitioning Required | Check this option if the table is to be partitioned. The Partition List input box will then appear. |
Partition List | Select the fields on which the table will be partitioned. |
Include Column Name in Partition | Includes the column name in the partition path. Example: {Column Name}={Column Value} |
Output Type | Output format in which the results will be processed. |
Rotation Policy | Select a rotation policy from: None, Size based, Time based, Time and Size based, or Record based. None: no rotation policy is applied. Size based: data is written to the same file until the mentioned size in bytes has been reached. Time based: data is written to the same file for the mentioned amount of time in seconds. Time and Size based: data is written to the same file until either the size or the time criterion is met. Record based: data is written to the same file until the mentioned number of records has been written. |
Raw Data Size | If the rotation policy is Size based or Time and Size based, enter the raw data size in bytes after which the file will be rotated. |
File Rotation Time | If the rotation policy is Time based or Time and Size based, enter the time in milliseconds after which the file will be rotated. |
Record Count | Number of records after which the data will be written to a new file. This field appears when the Rotation Policy is Record based. |
Delimiter | Message field separator. |
Checkpoint Directory | It is the path where the Spark application stores the checkpointing data. For HDFS and EFS, enter a relative path; for S3, enter an absolute path. |
Time-based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., in each run the checkpoint location provided above is appended with the current time in milliseconds. |
Block Size | Size of each block (in bytes) allocated in HDFS. |
Replication | Replication factor used to make additional copies of the data. |
Compression Type | Algorithm used to compress data. Types of Compression algorithms that you can apply are: - NONE - DEFLATE - GZIP - BZIP2 - SNAPPY |
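Because the emitter runs on Spark, several of the settings above (output mode, trigger, checkpoint directory, partition columns, output type, and compression) correspond to standard Spark Structured Streaming write options. The sketch below is a rough, hedged illustration of that correspondence, not the code Gathr generates; the rate source, the paths, and the chosen values are placeholders.

```python
# Illustrative PySpark sketch, not Gathr's generated code. It shows roughly how
# the emitter's settings map onto standard Structured Streaming write options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("advanced-hdfs-emitter-sketch").getOrCreate()

# Placeholder streaming source; in a pipeline this would be the connected
# Data Source or processor output.
df = spark.readStream.format("rate").load()

query = (
    df.writeStream
      .outputMode("append")                                  # Output Mode: Append / Complete / Update
      .trigger(processingTime="10 seconds")                  # Trigger Type: fixed-interval micro batches
      .format("parquet")                                     # Output Type: delimited / json / orc / avro / parquet
      .option("checkpointLocation", "/tmp/checkpoints/hdfs-emitter")  # Checkpoint Directory
      .option("compression", "snappy")                       # Compression Type
      .partitionBy("value")                                  # Partition column(s), when partitioning is required
      .start("hdfs://namenode:8020/data/output")             # HDFS Path (Static)
)

query.awaitTermination()
```

A static HDFS Path maps to the literal path passed to start(); the dynamic path type, file rotation, block size, replication, and delimited-output handling are emitter-specific behaviors with no direct one-line Spark equivalent in this sketch.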