HDFS Data Source
This is a Batch component.
To add an HDFS Data Source to your pipeline, drag the Data Source to the canvas and right-click on it to configure.
The Schema Type tab allows you to create a schema and define its fields. On the Detect Schema tab, select a Data Source or Upload Data.
The HDFS channel can read data in the following formats: JSON, CSV, TEXT, XML, Fixed Length, Parquet, Binary, and AVRO.
If data is fetched from the source and it is in CSV format, an additional tab, Is Header Included, appears in the data source configuration.
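Since the component reads files through a Spark runtime, the header choice maps to the behavior of Spark's CSV reader. The following is a minimal, hypothetical PySpark sketch of an equivalent read; the HDFS URL and path are placeholders, not values from this page.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-csv-read").getOrCreate()

# Hypothetical HDFS location; replace the NameNode host and path with your own.
path = "hdfs://namenode:8020/data/employees/*.csv"

# header=True tells Spark that the first row holds column names,
# which is what the Is Header Included option controls for CSV data.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)  # derive column types from the data
      .csv(path))

df.printSchema()
```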
Field | Description |
---|---|
Connection Name | Connections are the service identifiers. Select the connection from the list of available connections from which you want to read the data. |
Override Credentials | Check this option to override the connection's default credentials with user-specific ones. |
Username | The name of the user under which the Hadoop services are running. This option is available once you check the Override Credentials option. |
KeyTab Select Option | Select one of the following options for the keytab upload: KeyTab File or Specify KeyTab File Path. |
HDFS file path | Path of the file(s) on the HDFS file system. The file(s) at the given path must contain a header row. |
File Filter | Provide a file pattern; only files whose names match the pattern are included, for example *.pdf or *emp*.csv. See the sketch after this table for the equivalent Spark options. |
Recursive File Lookup | Check this option to retrieve files from the current folder and its sub-folder(s). |
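The File Filter and Recursive File Lookup fields correspond to Spark's generic file-source read options `pathGlobFilter` and `recursiveFileLookup`. The sketch below is illustrative only, assuming a Spark 3.x runtime; the base path and filter pattern are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-filtered-read").getOrCreate()

# Hypothetical base directory on HDFS.
base = "hdfs://namenode:8020/data/employees"

df = (spark.read
      # pathGlobFilter keeps only files whose names match the pattern,
      # analogous to the File Filter field above.
      .option("pathGlobFilter", "*emp*.csv")
      # recursiveFileLookup descends into sub-folders,
      # analogous to the Recursive File Lookup checkbox.
      .option("recursiveFileLookup", True)
      .option("header", True)
      .csv(base))

print(df.count())
```

Note that enabling `recursiveFileLookup` disables Spark's partition inference, so partition-style directory names are treated as plain folders.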
Click on the Add Notes tab. Enter the notes in the space provided.
Click Done to save the configuration.