HDFS Ingestion Source
The HDFS data source lets you read data from HDFS storage.
An Ingestion application with an HDFS data source can run only on registered clusters, not on Gathr clusters.
To learn how to register a cluster with Gathr by establishing PrivateLink, see Compute Setup →
Data Source Configuration
Fetch From Source/Upload Data File
To see the schema details while designing the application, you can either fetch sample data from the HDFS source by providing the data source connection details, or upload a sample data file in one of the supported formats.
If Upload Data File is selected to fetch sample data, provide the following details.
File Format: Select the format (file type) of the sample file.
Gathr-supported file formats for HDFS data source are CSV, JSON, TEXT, Parquet and ORC.
For CSV file format, select its corresponding delimiter.
Header Included: Enable this option to read the first row as a header if your HDFS data is in CSV format.
Upload: Upload a sample file in the file format selected above.
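The effect of the delimiter and Header Included options on a CSV sample can be illustrated with a minimal Python sketch. It is not Gathr's implementation: the function name `read_rows`, the pipe-delimited sample, and the placeholder column names `col_0`, `col_1` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical pipe-delimited sample; the first row holds column names.
SAMPLE = "id|name\n1|alice\n2|bob\n"


def read_rows(text, delimiter, header_included):
    """Parse delimited text and split it into column names and data rows.

    When header_included is True the first row is used as column names;
    otherwise every row is data and placeholder names are generated.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if header_included:
        return rows[0], rows[1:]
    # Placeholder naming is an assumption, not documented Gathr behavior.
    columns = [f"col_{i}" for i in range(len(rows[0]))]
    return columns, rows


columns, data = read_rows(SAMPLE, delimiter="|", header_included=True)
print(columns, len(data))
```

With Header Included disabled, the same sample would yield three data rows, because the first row is no longer consumed as a header.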
If Fetch From Source is selected, continue configuring the data source.
Connection Name: Connections are the service identifiers. Select a connection name from the list if you have already created and saved connection details for HDFS, or create one as explained in the topic HDFS Connection →
Use the Test Connection option to verify that the connection to HDFS is established successfully.
A success message confirms that the connection is available. If the test connection fails, edit the connection to resolve the issue before proceeding.
HDFS File Path: Provide the path of the file in the HDFS file system.
File Format: Select the type of data to be fetched: CSV, JSON, TEXT, Parquet or ORC.
If CSV is selected, also select the delimiter character that separates values in the source file.
Header Included: Enable this option to read the first row of a CSV file as a header. This option is disabled by default.
Add Configuration: Additional properties can be added using this option as key-value pairs.
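The fields above can be summarized as a small validation sketch in Python. The dictionary keys (`hdfs_file_path`, `file_format`, `delimiter`, `header_included`, `extra_properties`), the example path, and the function `validate_source_config` are hypothetical names chosen for illustration; they are not Gathr's actual configuration schema or API.

```python
# File formats listed in this article as supported for the HDFS data source.
SUPPORTED_FORMATS = {"CSV", "JSON", "TEXT", "Parquet", "ORC"}


def validate_source_config(config):
    """Return a list of problems found in a hypothetical source config dict."""
    problems = []
    if not config.get("hdfs_file_path"):
        problems.append("HDFS File Path is required")
    file_format = config.get("file_format")
    if file_format not in SUPPORTED_FORMATS:
        problems.append(f"unsupported file format: {file_format!r}")
    # A delimiter only applies to CSV, where it must be selected.
    if file_format == "CSV" and not config.get("delimiter"):
        problems.append("CSV requires a delimiter")
    return problems


example = {
    "hdfs_file_path": "/data/landing/orders",  # hypothetical path
    "file_format": "CSV",
    "delimiter": ",",
    "header_included": False,  # disabled by default, as described above
    "extra_properties": {},  # Add Configuration key-value pairs
}
print(validate_source_config(example))
```

An empty list means the sketch found no problems; removing the `delimiter` entry from a CSV configuration would produce a "CSV requires a delimiter" message.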
Schema
Check the populated schema details. For more details, see Schema Preview →
Advanced Configuration
Optionally, you can enable incremental read. For more details, see HDFS Incremental Configuration →