Native HDFS ETL Target

The HDFS emitter stores data in the Hadoop Distributed File System (HDFS).

To configure a Native HDFS emitter, provide the HDFS directory path along with the list of schema fields to be written. The field values are stored in one or more HDFS files, in the specified format, inside the provided HDFS directory.

Target Configuration

Connection Name: Connections are the service identifiers. Select a connection name from the list if you have already created and saved HDFS connection details, or create one as explained in the topic HDFS Connection →

Save Mode: Specifies the expected behavior of saving data to the data sink.

ErrorifExist: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if data already exists, the contents of the output are expected to be appended to the existing data.

Overwrite: When persisting data, if data already exists, the existing data is expected to be overwritten by the contents of the output.

Ignore: When persisting data, if data already exists, the save operation is expected to not save the contents of the output and to leave the existing data unchanged.

This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
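These semantics mirror Apache Spark's DataFrameWriter save modes. A minimal PySpark sketch, assuming a hypothetical namenode and target directory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-emitter-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
path = "hdfs://namenode:8020/data/out"  # hypothetical HDFS directory

df.write.mode("errorifexists").parquet(path)  # ErrorifExist: fail if data exists
df.write.mode("append").parquet(path)         # Append: add new files alongside existing data
df.write.mode("overwrite").parquet(path)      # Overwrite: replace the existing contents
df.write.mode("ignore").parquet(path)         # Ignore: skip the write if data exists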

Add Configuration: Additional properties can be added.

HDFS Path: Directory path on HDFS where the data is to be written.

Partitioning Required: If checked, the output data will be partitioned.

Partition List: Select the fields on which the data is to be partitioned.
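A minimal PySpark sketch of field-based partitioning, assuming hypothetical fields and a hypothetical path; each distinct value combination of the selected fields becomes its own subdirectory under the target path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-partition-demo").getOrCreate()
df = spark.createDataFrame(
    [("US", 2023, 10.5), ("IN", 2024, 7.2)],
    ["country", "year", "amount"],
)

# Produces subdirectories such as .../country=US/year=2023/part-*.parquet
df.write.mode("overwrite") \
    .partitionBy("country", "year") \
    .parquet("hdfs://namenode:8020/data/out_partitioned")  # hypothetical path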

Output Fields: Fields which will be part of output data.

Output Type: Output format in which the results will be written.

Delimited: Data items are separated by a delimiter character; for example, in a comma-separated values (CSV) file the items are separated by commas.

JSON: An open-standard file format that uses human-readable text to transmit data objects consisting of attribute-value pairs and array data types (or any other serializable value).

ORC: ORC stands for Optimized Row Columnar, a format that can store data more efficiently than plain row-oriented file formats.

AVRO: Avro stores the data definition in JSON format, making it easy to read and interpret.

Parquet: Parquet stores nested data structures in a flat columnar format.

XML: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
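A minimal PySpark sketch of writing each of these output types to a hypothetical base directory; Avro additionally requires the spark-avro package, and XML (illustrated after the Row Tag and Root Tag fields below) requires the spark-xml package:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-format-demo").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])
base = "hdfs://namenode:8020/data"  # hypothetical base directory

df.write.option("sep", ",").option("header", True).csv(base + "/csv")  # Delimited
df.write.json(base + "/json")
df.write.orc(base + "/orc")
df.write.format("avro").save(base + "/avro")  # requires the spark-avro package
df.write.parquet(base + "/parquet")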

In the case of multi-level (nested) JSON input, arrays and nested objects are rendered as repeated and nested tags in the XML output. The nested JSON sample further below produces the following XML:

Sample XML output

<root>
  <row>
    <arrayArrayLevel1>[1,2]</arrayArrayLevel1>
    <arrayArrayLevel1>[3,4]</arrayArrayLevel1>
    <arrayArrayLevel1>[6,7]</arrayArrayLevel1>
    <arrayJsonLevel1>
      <array1StringLevel2>jj</array1StringLevel2>
    </arrayJsonLevel1>
    <arrayJsonLevel1>
      <array2StringLevel2>jj22</array2StringLevel2>
    </arrayJsonLevel1>
    <arrayStringLevel1>a</arrayStringLevel1>
    <arrayStringLevel1>b</arrayStringLevel1>
    <arrayStringLevel1>c</arrayStringLevel1>
    <doubleLevel1>10.0</doubleLevel1>
    <intLevel1>1</intLevel1>
    <jsonLevel1>
      <jsonLevel2>
        <stringLevel3>bye</stringLevel3>
      </jsonLevel2>
      <stringLevel2>hello</stringLevel2>
    </jsonLevel1>
    <stringLevel1>hi1</stringLevel1>
  </row>
</root>

Sample nested JSON input

{"stringLevel1":"hi1","intLevel1":1,"doubleLevel1":10.0,"jsonLevel1":{"stringLevel2":"hello","jsonLevel2":{"stringLevel3":"bye"}},"arrayStringLevel1":["a","b","c"],"arrayJsonLevel1":[{"array1StringLevel2":"jj"},{"array2StringLevel2":"jj22"}],"arrayArrayLevel1":[[1,2],[3,4],[6,7]]}

Delimiter: Field separator character for Delimited output.

Row Tag: Tag that wraps each record (row) in the output XML.

Root Tag: Top-level tag that encloses all rows in the output XML.
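A sketch of XML output using the third-party spark-xml package, whose rootTag and rowTag write options correspond to the Root Tag and Row Tag fields above (path and data are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-xml-demo").getOrCreate()
df = spark.createDataFrame([("hi1", 1)], ["stringLevel1", "intLevel1"])

# Writes <root><row>...</row></root>; requires the spark-xml package
df.write.format("xml") \
    .option("rootTag", "root") \
    .option("rowTag", "row") \
    .save("hdfs://namenode:8020/data/xml_out")  # hypothetical path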

Block Size: Size of each block (in bytes) allocated in HDFS.

Replication: Replication factor, i.e. the number of copies of each block that HDFS maintains.

Compression Type: Algorithm used to compress the data.
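A sketch of how these storage settings could be passed from Spark, assuming the spark.hadoop.* property prefix for the Hadoop client settings dfs.blocksize and dfs.replication (values are hypothetical); compression is set per write, and codec support varies by output format:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-storage-demo")
         .config("spark.hadoop.dfs.blocksize", "134217728")  # 128 MB blocks
         .config("spark.hadoop.dfs.replication", "3")        # keep 3 copies of each block
         .getOrCreate())

df = spark.createDataFrame([(1, "a")], ["id", "value"])

# gzip-compressed JSON output at a hypothetical path
df.write.mode("overwrite") \
    .option("compression", "gzip") \
    .json("hdfs://namenode:8020/data/json_gz")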

Limitations

This emitter works only for batch pipelines. Partitioning does not work when the output format is XML.

Post Action

To understand how to provide SQL queries or stored procedures that will be executed during the pipeline run, see Post-Actions →

Notes

Optionally, enter notes in the Notes → tab and save the configuration.
