Native HDFS Emitter

The Native HDFS emitter stores data in the Hadoop Distributed File System (HDFS).

To configure a Native HDFS emitter, provide the HDFS directory path along with the list of schema fields to be written. These field values are stored in HDFS file(s), in the specified format, inside the provided HDFS directory.

Native HDFS Configuration

To add Native HDFS emitter to your pipeline, drag the emitter onto the canvas, connect it to a Data Source or processor, and right-click on it to configure it. You can also save it as a Dataset.

Field descriptions:
Save As Dataset: When you select this checkbox, you can save the data of this emitter as a Dataset. After selecting it, provide a name for the Dataset.
Connection Name: All Native HDFS connections are listed here. Select a connection for connecting to HDFS.
Save Mode

Save Mode is used to specify the expected behavior of saving data to a data sink.

ErrorifExist: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if the data/table already exists, the new contents are expected to be appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new contents.

Ignore: When persisting data, if the data/table already exists, the save operation is expected to neither save the new contents nor change the existing data.

This is similar to CREATE TABLE IF NOT EXISTS in SQL.
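
These save modes behave like Spark's standard DataFrameWriter modes. The following is a minimal Scala sketch of the four behaviours; the Spark session setup, sample DataFrame, and HDFS path are illustrative only and are not part of the emitter configuration.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("save-mode-demo").master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")   // sample data to persist

// ErrorifExist: throw an exception if the target directory already contains data (Spark's default)
df.write.mode(SaveMode.ErrorIfExists).parquet("hdfs:///tmp/native-hdfs-demo")

// Append: add the new contents alongside the existing data
df.write.mode(SaveMode.Append).parquet("hdfs:///tmp/native-hdfs-demo")

// Overwrite: replace the existing data with the new contents
df.write.mode(SaveMode.Overwrite).parquet("hdfs:///tmp/native-hdfs-demo")

// Ignore: skip the write and leave the existing data unchanged (like CREATE TABLE IF NOT EXISTS)
df.write.mode(SaveMode.Ignore).parquet("hdfs:///tmp/native-hdfs-demo")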

HDFS Path: Directory path on HDFS where data has to be written.
Output Fields: Fields which will be part of the output data.
Partitioning Required: If checked, the table will be partitioned.
Partition List: Select the fields on which the table is to be partitioned.
Output Type

Output format in which the results will be written.

Delimited: Data items are separated by a delimiter, as in a comma-separated values (CSV) file, where commas separate the values.

JSON: An open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).

ORC: ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats.

AVRO: Avro stores the data definition in JSON format making it easy to read and interpret.

Parquet: Parquet stores nested data structures in a flat columnar format.

XML: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

In the case of multi-level (nested) JSON, nested arrays are also handled; the XML output below corresponds to the sample nested JSON input shown after it.

Output (XML)

<root>
  <row>
    <arrayArrayLevel1>[1,2]</arrayArrayLevel1>
    <arrayArrayLevel1>[3,4]</arrayArrayLevel1>
    <arrayArrayLevel1>[6,7]</arrayArrayLevel1>
    <arrayJsonLevel1>
      <array1StringLevel2>jj</array1StringLevel2>
    </arrayJsonLevel1>
    <arrayJsonLevel1>
      <array2StringLevel2>jj22</array2StringLevel2>
    </arrayJsonLevel1>
    <arrayStringLevel1>a</arrayStringLevel1>
    <arrayStringLevel1>b</arrayStringLevel1>
    <arrayStringLevel1>c</arrayStringLevel1>
    <doubleLevel1>10.0</doubleLevel1>
    <intLevel1>1</intLevel1>
    <jsonLevel1>
      <jsonLevel2>
        <stringLevel3>bye</stringLevel3>
      </jsonLevel2>
      <stringLevel2>hello</stringLevel2>
    </jsonLevel1>
    <stringLevel1>hi1</stringLevel1>
  </row>
</root>

Sample nested JSON (input)

{"stringLevel1":"hi1","intLevel1":1,"doubleLevel1":10.0,"jsonLevel1":{"stringLevel2":"hello","jsonLevel2":{"stringLevel3":"bye"}},"arrayStringLevel1":["a","b","c"],"arrayJsonLevel1":[{"array1StringLevel2":"jj"},{"array2StringLevel2":"jj22"}],"arrayArrayLevel1":[[1,2],[3,4],[6,7]]}
Delimiter: Message field separator.
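
For reference, each output type corresponds to a standard Spark DataFrameWriter format. A minimal Scala sketch follows, continuing with the spark session from the earlier sketch; the paths are hypothetical, and the Avro and XML writes assume the spark-avro and spark-xml packages are on the classpath (the emitter itself handles all of this through the UI configuration above).

// Read the sample nested JSON record shown above (hypothetical path)
val df = spark.read.json("hdfs:///data/sample_nested.json")

// Output Fields: keep only the fields that should appear in the output
val out = df.select("stringLevel1", "intLevel1", "doubleLevel1")

// Delimited: CSV with an explicit delimiter
out.write.option("delimiter", ",").option("header", "true").csv("hdfs:///data/out_csv")

// JSON, ORC, Parquet
out.write.json("hdfs:///data/out_json")
out.write.orc("hdfs:///data/out_orc")
out.write.parquet("hdfs:///data/out_parquet")

// AVRO: requires the spark-avro module
out.write.format("avro").save("hdfs:///data/out_avro")

// XML: requires the spark-xml package; Row Tag and Root Tag map to these options
out.write.format("xml").option("rowTag", "row").option("rootTag", "root").save("hdfs:///data/out_xml")

// Partitioning Required / Partition List: partition the output by the chosen fields
out.write.partitionBy("stringLevel1").parquet("hdfs:///data/out_partitioned")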
Checkpoint Directory

The path where the Spark application stores its checkpointing data.

For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself.

For S3, enter an absolute path like: S3://BucketName/checkpointingDir

Time-based Check Point: Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each run the checkpoint location provided above is appended with the current time in milliseconds.
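
As an illustration of this behaviour, the sketch below derives a time-suffixed checkpoint path and registers it with Spark; the base paths are the examples from above, and setCheckpointDir is shown only as one standard way a Spark application registers a checkpoint directory.

// Base checkpoint location (relative for HDFS/EFS, absolute for S3)
val baseCheckpointDir = "/user/hadoop/checkpointingDir"
// val baseCheckpointDir = "s3://BucketName/checkpointingDir"

// Time-based Check Point: append the current time in milliseconds on each run
val checkpointDir = s"$baseCheckpointDir/${System.currentTimeMillis()}"

spark.sparkContext.setCheckpointDir(checkpointDir)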
Row Tag: Row tag for the output XML.
Root Tag: Root tag for the output XML.
Block Size: Size of each block (in bytes) allocated in HDFS.
Replication: Enables making additional copies of the data.
Compression Type: Algorithm used to compress the data.
ADD CONFIGURATION: Enables configuring additional custom properties.
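
For context, block size and replication correspond to standard HDFS client properties, and the compression type to the writer's compression option. The sketch below shows one way these could be set for a plain Spark write; the property names are standard Hadoop/Spark ones, while the values and path are illustrative.

// HDFS block size and replication via standard Hadoop client properties
spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", "134217728")   // 128 MB per block
spark.sparkContext.hadoopConfiguration.set("dfs.replication", "3")         // keep 3 copies of each block

// Compression Type: codec used when writing the files
out.write
  .option("compression", "snappy")   // e.g. none, snappy, gzip, depending on the format
  .mode("overwrite")
  .parquet("hdfs:///data/out_compressed")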

Each output type selected populates its own set of fields; for example, when Delimited is selected, a Delimiter field appears, so choose the delimiter accordingly.

Limitation: This emitter works only for batch pipelines. Partitioning does not work when the output format is XML.

Click the NEXT button and enter notes in the space provided.

Click SAVE to save the configuration details.
