Advanced HDFS Emitter
The Advanced HDFS emitter lets you add a rotation policy to the emitter.
An ETL application that uses the Advanced HDFS emitter is supported on registered clusters only, not on Gathr clusters.
To learn how to register a cluster with Gathr by establishing PrivateLink, see Compute Setup →
Target Configuration
Connection Name: Connections are the service identifiers. Select a connection name from the list if you have already created and saved connection details for HDFS, or create one as explained in the topic HDFS Connection →
Save Mode: Save Mode is used to specify the expected behavior of saving data to a data sink.
ErrorifExist: When persisting data, if the data already exists, an exception is thrown.
Append: When persisting data, if the data/table already exists, the contents of the data are appended to the existing data.
Overwrite: When persisting data, if the data/table already exists, the existing data is overwritten by the contents of the data.
Ignore: When persisting data, if the data/table already exists, the save operation does not save the contents of the data and does not change the existing data.
This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
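The four save modes can be summarized with a small in-memory model. This is only an illustrative sketch: the `save` function, the `store` dictionary standing in for the data sink, and the lowercase mode strings are hypothetical names and are not part of Gathr or HDFS.

```python
def save(store, path, rows, mode):
    """Apply a save mode to an in-memory stand-in for a data sink.

    store: dict mapping a path to a list of rows (hypothetical sink).
    """
    exists = path in store
    if mode == "errorifexist":
        # Refuse to write when data is already present.
        if exists:
            raise FileExistsError(f"data already exists at {path}")
        store[path] = list(rows)
    elif mode == "append":
        # Keep the existing rows and add the new ones after them.
        store.setdefault(path, []).extend(rows)
    elif mode == "overwrite":
        # Replace whatever was there.
        store[path] = list(rows)
    elif mode == "ignore":
        # Write only when nothing exists yet; otherwise do nothing.
        if not exists:
            store[path] = list(rows)
    else:
        raise ValueError(f"unknown save mode: {mode}")
    return store
```

In this model, Ignore is the only mode that silently leaves both the existing data and the new data untouched, which is what makes it comparable to CREATE TABLE IF NOT EXISTS.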
Add Configuration: Additional properties can be added.
HDFS Path
Path Type: Whether the HDFS path is static or dynamic.
HDFS Path: Enter the directory path or the field name, depending on the path type.
Output Fields: Fields in the message that need to be part of the output message.
Partitioning Required: Check this option if the table is to be partitioned. The Partition List input box is then displayed.
Partition Columns: Select the fields on which the table will be partitioned.
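As a rough illustration of how the path type and partition columns could combine into an output directory, consider the sketch below. It rests on two assumptions that this page does not confirm: that a dynamic path reads the directory from the named field of each record, and that partitions are laid out as Hive-style `column=value` subdirectories. The function name `resolve_output_dir` and the sample field names are hypothetical.

```python
import posixpath


def resolve_output_dir(path_type, hdfs_path, record, partition_columns=()):
    """Return the directory a record would be written to (illustrative only)."""
    if path_type == "static":
        # Assumption: the configured value is the directory itself.
        base = hdfs_path
    elif path_type == "dynamic":
        # Assumption: the configured value names the field holding the directory.
        base = str(record[hdfs_path])
    else:
        raise ValueError(f"unknown path type: {path_type}")
    # Assumed Hive-style layout: one column=value directory per partition column.
    parts = [f"{col}={record[col]}" for col in partition_columns]
    return posixpath.join(base, *parts)
```

For example, a record with `country` set to `IN` and a static path of `/data/events` partitioned on `country` would resolve to `/data/events/country=IN` under these assumptions.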
Output Type: Output format in which the results will be processed.
Rotation Policy: Select a rotation policy from None, Size based, Time based, Time or Size based, or Record based.
None - no rotation policy is applied.
Size based - data is written to the same file until the specified size in bytes is reached.
Time based - data is written to the same file for the time specified in File Rotation Time.
Time or Size based - data is written to the same file until either the specified size or the specified time is reached, whichever comes first.
Record based - data is written to the same file until the specified number of records has been written.
Raw Data Size: If the rotation policy is Size based or Time or Size based, enter the raw data size in bytes after which the file will be rotated.
File Rotation Time: If the rotation policy is Time based or Time or Size based, enter the time in milliseconds after which the file will be rotated.
Record Count: Number of records after which data is written to a new file. This field appears when you select Record based as the rotation policy.
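The rotation policies and their thresholds can be read as a simple decision rule. The sketch below is a hypothetical model, not the emitter's actual implementation: the function `should_rotate`, its parameter names, and the policy strings are invented for illustration, and it assumes rotation triggers as soon as a threshold is reached (greater than or equal). Time is taken in milliseconds, following the File Rotation Time field.

```python
def should_rotate(policy, bytes_written=0, elapsed_ms=0, records_written=0,
                  raw_data_size=None, rotation_time_ms=None, record_count=None):
    """Decide whether the current file should be closed and a new one started."""
    # A threshold that is not configured (None) can never be hit.
    size_hit = raw_data_size is not None and bytes_written >= raw_data_size
    time_hit = rotation_time_ms is not None and elapsed_ms >= rotation_time_ms
    count_hit = record_count is not None and records_written >= record_count

    if policy == "none":
        return False                    # keep writing to the same file
    if policy == "size":
        return size_hit
    if policy == "time":
        return time_hit
    if policy == "time_or_size":
        return size_hit or time_hit     # whichever threshold is reached first
    if policy == "record":
        return count_hit
    raise ValueError(f"unknown rotation policy: {policy}")
```

With a time-or-size policy, a slow stream rotates on the timer while a bursty stream rotates on size, so neither very small nor very large files accumulate.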
Delimiter: Message field separator.
Block Size: Size of each block (in bytes) allocated in HDFS.
Replication: Replication factor used to make additional copies of the data.
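To see how block size and replication affect a file, the arithmetic can be sketched as follows. The helper name `hdfs_footprint` is hypothetical. It relies on standard HDFS behavior: a file is split into blocks of at most the block size, the last block occupies only as much space as it needs, and every block is stored once per replica.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes, replication):
    """Return (block_count, raw_bytes_stored) for a single file."""
    # Ceiling division: a partial last block still counts as one block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # Each byte of the file is stored once per replica.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


MIB = 1024 * 1024
# A 300 MiB file with 128 MiB blocks and replication 3.
print(hdfs_footprint(300 * MIB, 128 * MIB, 3))  # (3, 943718400)
```

Small rotated files each occupy at least one block entry, so a very low rotation threshold combined with a large block size tends to produce many small, mostly empty blocks.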
Compression Type: Algorithm used to compress data. The available compression algorithms are:
- NONE
- DEFLATE
- GZIP
- BZIP2
- LZO
- SNAPPY
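To get a feel for how the choice of algorithm affects output size, the following sketch compares the codecs that ship with the Python standard library on a sample payload. It is an offline approximation only: it does not call the emitter, `compressed_sizes` and `CODECS` are hypothetical names, and LZO and SNAPPY are left out because they need third-party packages.

```python
import bz2
import gzip
import zlib

# Standard-library codecs only; LZO and SNAPPY need third-party packages.
CODECS = {
    "NONE": (lambda b: b, lambda b: b),
    "DEFLATE": (zlib.compress, zlib.decompress),  # zlib-wrapped DEFLATE stream
    "GZIP": (gzip.compress, gzip.decompress),
    "BZIP2": (bz2.compress, bz2.decompress),
}


def compressed_sizes(payload):
    """Return {codec name: compressed size in bytes} after a round-trip check."""
    sizes = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(payload)
        # Every codec listed here is lossless, so the round trip must match.
        assert decompress(packed) == payload
        sizes[name] = len(packed)
    return sizes
```

Calling `compressed_sizes` on a representative sample of your own delimited records gives a rough indication of the trade-off before you pick a compression type.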
Post Action
To understand how to provide SQL queries or stored procedures that will be executed during a pipeline run, see Post-Actions →
Notes
Optionally, enter notes in the Notes → tab and save the configuration.