Kudu Emitter

Apache Kudu is a column-oriented data store in the Apache Hadoop ecosystem. It enables fast analytics on fast (rapidly changing) data. Kudu is engineered to take advantage of modern hardware and in-memory processing, and it lowers query latency significantly compared with similar tools.

Kudu Emitter Configuration

To add a Kudu emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor.

The configuration settings are as follows:

Connection Name: Connection URL used to create the Kudu connection.
Table Administration: If checked, the table will be created.
Primary Keys: Select the fields that will form the primary key of the table.
Partition List: Select the fields on which the table will be partitioned.
Buckets: Number of buckets used for partitioning.
Replication: Replication factor used to make additional copies of the data. The value must be 1, 3, 5, or 7.
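The replication constraint above can be checked before submitting the configuration. A minimal sketch; the function name and error message are illustrative, not part of the product:

```python
# Allowed Kudu replication factors per the configuration rule above.
VALID_REPLICATION_FACTORS = {1, 3, 5, 7}

def validate_replication(factor: int) -> int:
    """Return the factor unchanged if it is an allowed value, else raise."""
    if factor not in VALID_REPLICATION_FACTORS:
        raise ValueError(
            f"Replication factor must be one of "
            f"{sorted(VALID_REPLICATION_FACTORS)}, got {factor}"
        )
    return factor
```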
Checkpoint Storage Location: Select the checkpoint storage location. Available options are HDFS, S3, and EFS.
Checkpoint Connections: Select the connection. Connections are listed according to the selected storage location.
Checkpoint Directory: The path where the Spark application stores its checkpoint data.

For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself.

For S3, enter an absolute path such as s3://BucketName/checkpointingDir.
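The path rules above can be expressed as a small validation helper. This is an illustrative sketch under the rules stated in this section, not product code; the function name and messages are assumptions:

```python
def resolve_checkpoint_dir(storage: str, path: str) -> str:
    """Apply the checkpoint-directory rules for each storage location.

    HDFS and EFS take a relative path (the system adds its own prefix
    later); S3 requires an absolute s3:// URI.
    """
    storage = storage.upper()
    if storage in ("HDFS", "EFS"):
        if path.lower().startswith("s3://"):
            raise ValueError(
                f"{storage} expects a relative path, got an S3 URI: {path}"
            )
        return path  # e.g. /user/hadoop/checkpointingDir
    if storage == "S3":
        if not path.lower().startswith("s3://"):
            raise ValueError(f"S3 expects an absolute s3:// path, got: {path}")
        return path
    raise ValueError(f"Unsupported checkpoint storage location: {storage}")
```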

Time-Based Checkpoint: Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e. on each run the checkpoint location provided above is appended with the current time in milliseconds.
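The time-based checkpoint behaviour described above can be sketched as appending epoch milliseconds to the configured location (illustrative only):

```python
import time

def time_based_checkpoint_dir(base_dir: str) -> str:
    """Append the current time in milliseconds to the checkpoint location,
    so each pipeline run writes to a fresh checkpoint directory."""
    millis = int(time.time() * 1000)
    return f"{base_dir.rstrip('/')}/{millis}"
```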
Output Fields: Select the fields whose values you want to persist in the table.
Output Mode: The output mode to be used while writing the data to the streaming sink.

Append: Only the new rows in the streaming data are written to the sink.

Complete: All the rows in the streaming data are written to the sink every time there are updates.

Update: Only the rows that were updated in the streaming data are written to the sink every time there are updates.
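The difference between the three modes can be illustrated with a small simulation of a keyed streaming aggregate. This is a simplified conceptual sketch of the semantics, not the emitter's actual code (in Spark, append mode with aggregations additionally waits for the watermark to finalize rows):

```python
def emit_batch(mode, counts, changed_keys, new_keys):
    """Return the rows a sink would receive for one micro-batch.

    counts       -- full aggregate state after the batch, e.g. {"a": 2}
    changed_keys -- keys whose aggregate value changed in this batch
    new_keys     -- keys that appeared for the first time in this batch
    """
    if mode == "append":
        # Only brand-new rows are written.
        return {k: counts[k] for k in new_keys}
    if mode == "complete":
        # The entire result table is rewritten.
        return dict(counts)
    if mode == "update":
        # Only rows whose value changed (including new ones) are written.
        return {k: counts[k] for k in changed_keys}
    raise ValueError(f"Unknown output mode: {mode}")
```

For example, after a batch that raises the count of "a" from 1 to 2 and introduces "b", append emits only the row for "b", update emits the rows for "a" and "b", and complete emits the whole result table.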

Save Mode: Specifies how the existing data is handled.
Enable Trigger: The trigger defines how frequently the streaming query is executed.
Processing Time: Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger interval, in minutes or seconds.
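A trigger interval given in minutes or seconds ultimately resolves to a fixed processing-time period. A small illustrative converter (names are assumptions, not the product's API):

```python
_UNITS = {"second": 1, "seconds": 1, "minute": 60, "minutes": 60}

def trigger_interval_seconds(value: int, unit: str) -> int:
    """Convert a Processing Time setting such as (5, "minutes") to seconds."""
    try:
        return value * _UNITS[unit.lower()]
    except KeyError:
        raise ValueError(f"Unit must be seconds or minutes, got: {unit}") from None
```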

ADD CONFIGURATION: Enables additional Kudu configuration properties.

Click the Next button. Enter notes in the space provided.

Click DONE to save the configuration.
