HDFS Ingestion Target
The HDFS target stores data in the Hadoop Distributed File System (HDFS).
An ingestion application with an HDFS target can run only on registered clusters, not on Gathr clusters.
To learn how to register a cluster with Gathr by establishing a PrivateLink, see Compute Setup →
To configure an HDFS target, provide the HDFS directory path along with the list of schema fields to be written. These field values are stored in one or more HDFS files, in the specified format, inside the provided HDFS directory.
Target Configuration
Connection Name: Connections are the service identifiers. Select a connection name from the list if you have already created and saved HDFS connection details, or create one as explained in the topic HDFS Connection →
Use the Test Connection option to ensure that the connection with the HDFS channel is established successfully.
A success message states that the connection is available. If the test connection fails, edit the connection to resolve the issue before proceeding.
Save Mode: Save Mode is used to specify the expected behavior of saving data to a data sink.
ErrorIfExists: When persisting data, if the data already exists, an exception is expected to be thrown.
Append: When persisting data, if the data/table already exists, the contents of the DataFrame are expected to be appended to the existing data.
Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the contents of the DataFrame.
Ignore: When persisting data, if the data/table already exists, the save operation is expected not to save the contents of the DataFrame and not to change the existing data.
This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
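These save modes mirror the semantics of Spark's SaveMode options. Below is a minimal sketch of the equivalent behavior at the Spark DataFrame level; the input and HDFS path are hypothetical, and this illustrates the semantics rather than Gathr's own API:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HdfsTargetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-target-sketch").getOrCreate()
    val df = spark.read.json("/tmp/input")          // any DataFrame to persist

    df.write
      .mode(SaveMode.Append)                        // or ErrorIfExists, Overwrite, Ignore
      .parquet("hdfs://namenode:8020/data/events")  // hypothetical HDFS path
  }
}
```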
Add Configuration: Additional properties can be added using the Add Configuration link.
HDFS Path
Path Type: Select whether the HDFS path is static or dynamic.
HDFS Path: Directory path on HDFS where the data is to be written.
Output Fields: Fields that will be part of the output data.
Partitioning Required: If checked, the output data will be partitioned.
Partition Columns: Select the fields on which the data is to be partitioned.
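As an illustration, partitioned output corresponds to Spark's partitionBy behavior. In the sketch below, which continues the example above, "year" and "month" are assumed partition columns:

```scala
// Continuing the sketch above: "year" and "month" are assumed partition
// columns that exist in df.
df.write
  .mode(SaveMode.Append)
  .partitionBy("year", "month")
  .parquet("hdfs://namenode:8020/data/events")
// Produces a directory layout like .../year=2024/month=06/part-*.parquet
```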
Output Type: Output format in which the result will be written.
Delimited: Values are separated by a delimiter; for example, in a comma-separated values (CSV) file the data items are separated by commas.
JSON: An open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).
Parquet: Parquet stores nested data structures in a flat columnar format.
AVRO: Avro stores the data definition in JSON format, making it easy to read and interpret.
ORC: ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than other file formats.
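For reference, each output type corresponds to a standard Spark datasource format. The sketch below is illustrative; the paths are hypothetical, and the AVRO line assumes the spark-avro package is on the classpath:

```scala
// Illustrative mapping of each output type to a Spark datasource format.
val base = "hdfs://namenode:8020/data/out"          // hypothetical base path
df.write.option("delimiter", "|").csv(s"$base/csv") // Delimited, custom separator
df.write.json(s"$base/json")                        // JSON
df.write.parquet(s"$base/parquet")                  // Parquet
df.write.format("avro").save(s"$base/avro")         // AVRO (needs spark-avro)
df.write.orc(s"$base/orc")                          // ORC
```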
Rotation Policy: Select a rotation policy from Size Based, Time Based, Size or Time Based, Records Based, or None.
Size Based: Data is written to the same file until the specified size (in MB) is reached.
Time Based: Data is written to the same file for the specified time (in seconds).
Size or Time Based: Data is written to the same file until either the specified size or the specified time is reached, whichever comes first.
Records Based: Data is written to the same file until the specified number of records is reached.
None: No rotation policy is applied.
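Purely as an illustration of the decision each policy encodes (a hypothetical sketch, not Gathr's implementation), a "Size or Time Based" check could look like:

```scala
// Hypothetical sketch of a "Size or Time Based" rotation check; this is not
// Gathr's implementation, only an illustration of the criteria involved.
final case class RotationPolicy(maxBytes: Long, maxMillis: Long)

def shouldRotate(bytesWritten: Long, openedAtMillis: Long,
                 nowMillis: Long, policy: RotationPolicy): Boolean =
  bytesWritten >= policy.maxBytes ||                  // Size Based criterion
    (nowMillis - openedAtMillis) >= policy.maxMillis  // Time Based criterion
```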
Delimiter: Field separator used in the message.
Block Size: Size of each block (in bytes) allocated in HDFS.
Replication: Number of copies of the data to maintain in HDFS (the replication factor).
Compression Type: Algorithm used to compress the data.
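For reference, these three settings correspond to standard Hadoop client and Spark writer options. A sketch of the equivalents, continuing the example above and not specific to Gathr:

```scala
// dfs.blocksize and dfs.replication are standard Hadoop client settings,
// not Gathr-specific names.
spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", "134217728") // 128 MB blocks
spark.sparkContext.hadoopConfiguration.set("dfs.replication", "2")       // keep 2 copies
df.write
  .option("compression", "snappy")                                       // compression type
  .parquet("hdfs://namenode:8020/data/out")                              // hypothetical path
```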