HDFS Ingestion Source

The HDFS data source lets you read data from HDFS storage.

An Ingestion application with an HDFS data source is supported on registered clusters but not on Gathr clusters.

To learn how to register a cluster with Gathr by establishing PrivateLink, see Compute Setup →

Data Source Configuration

Fetch From Source/Upload Data File

To design the application, you can either fetch sample data from the HDFS source by providing the data source connection details, or upload a sample data file in one of the supported formats. Either option lets you see the schema details during application design.

If Upload Data File is selected, provide the following details.

File Format: Select the format (file type) of the sample file that matches your data.

The file formats that Gathr supports for the HDFS data source are CSV, JSON, TEXT, Parquet, and ORC.

For the CSV file format, also select the corresponding delimiter.

Header Included: Enable this option to read the first row as a header if your HDFS data is in CSV format.

Upload: Upload a sample file in the file format selected above.

Note: Make sure that the file size does not exceed 10 MB.
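As a quick local check before uploading, the size limit can be verified with a few lines of Python. This is a minimal sketch and not part of Gathr: the function name `within_upload_limit` is hypothetical, and it assumes that 10 MB means 10 × 1024 × 1024 bytes, since the limit is not stated as binary or decimal.

```python
import os

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


def within_upload_limit(path, limit=MAX_UPLOAD_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit
```

For example, `within_upload_limit("sample.csv")` returns `True` when the hypothetical file `sample.csv` fits within the assumed limit.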

If Fetch From Source is selected, continue configuring the data source.

Connection Name: Connections are the service identifiers. Select a connection name from the list if you have previously created and saved connection details for HDFS, or create one as explained in the topic HDFS Connection →

Use the Test Connection option to ensure that the connection with the HDFS channel is established successfully.

A success message confirms that the connection is available. If the test connection reports an error, edit the connection to resolve the issue before proceeding.

HDFS File Path: Provide the file path in the HDFS file system.

File Format: Select the type of data to be fetched: CSV, JSON, TEXT, Parquet, or ORC.

If CSV is selected as the type of data, also select, from the available options, the delimiter character that separates values in the source file.

Note: If the type of data is selected as CSV, the source configuration has an added field, Header Included.

Header Included: Enables or disables reading the first row as a header for CSV files. This option is disabled by default.

Add Configuration: Use this option to add additional properties as key-value pairs.
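To illustrate what the delimiter and Header Included settings described above mean for a CSV file, here is a minimal Python sketch using the standard `csv` module. It is not Gathr's implementation. The function `read_csv_sample`, its parameters, and the placeholder column names (`col_0`, `col_1`, and so on) used when no header is present are assumptions made for illustration.

```python
import csv
import io


def read_csv_sample(text, delimiter=",", header_included=False):
    """Parse CSV text into (column_names, rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if header_included:
        # The first row supplies the column names; the rest is data.
        return rows[0], rows[1:]
    # No header: every row is data, so placeholder names are generated.
    names = [f"col_{i}" for i in range(len(rows[0]))]
    return names, rows


sample = "id|name\n1|alice\n2|bob\n"

# Header enabled: column names are ['id', 'name'], with two data rows.
print(read_csv_sample(sample, delimiter="|", header_included=True))

# Header disabled (the documented default): 'id|name' is treated as data.
print(read_csv_sample(sample, delimiter="|", header_included=False))
```

With the header option disabled, the first line of the file appears as an ordinary data row, which is usually the sign that Header Included should be enabled for that file.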

Schema

Check the populated schema details. For more details, see Schema Preview →

Advanced Configuration

Optionally, you can enable incremental read. For more details, see HDFS Incremental Configuration →
