Kudu Emitter
Apache Kudu is a column-oriented data store in the Apache Hadoop ecosystem. It enables fast analytics on fast (rapidly changing) data. The emitter is engineered to take advantage of hardware and in-memory processing, and it significantly lowers query latency compared to similar tools.
Kudu Emitter Configuration
To add a Kudu emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor.
The configuration settings are as follows:
Field | Description |
---|---|
Connection Name | The connection URL used to create the Kudu connection. |
Table Administration | If checked, the table will be created. |
Primary Keys | Select the fields that will be the primary keys of the table (see the table-creation sketch below). |
Partition List | Select the fields on which the table will be partitioned. |
Buckets | Number of buckets used for partitioning. |
Replication | Replication factor used to make additional copies of data. The value must be 1, 3, 5, or 7. |
Checkpoint Storage Location | Select the checkpointing storage location. Available options are HDFS, S3, and EFS. |
Checkpoint Connections | Select the connection. Connections are listed according to the selected storage location. |
Checkpoint Directory | The path where the Spark application stores its checkpointing data. For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system adds a suitable prefix by itself. For S3, enter an absolute path such as s3://BucketName/checkpointingDir. |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each run the checkpoint location provided above is appended with the current time in milliseconds (see the streaming sketch below). |
Output Fields | Select the fields whose values you want to persist in the table. |
Output Mode | Output mode to be used while writing the data to the streaming sink. Append: only the new rows in the streaming data are written to the sink. Complete: all the rows in the streaming data are written to the sink every time there are updates. Update: only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Save Mode | Save mode specifies how to handle the existing data. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
Processing Time | Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger time interval, in minutes or seconds. |
ADD CONFIGURATION | Enables additional configuration properties for the Kudu emitter. |
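When Table Administration is checked, the Primary Keys, Partition List, Buckets, and Replication settings map onto Kudu's table-creation options. Below is a minimal sketch of the equivalent call using the Kudu Java client from Scala; the master address (kudu-master:7051), the table name (metrics), and the columns are hypothetical placeholders, not values the emitter prescribes:

```scala
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
import org.apache.kudu.{ColumnSchema, Schema, Type}
import scala.collection.JavaConverters._

object CreateKuduTableSketch {
  def main(args: Array[String]): Unit = {
    // Connection Name: the Kudu master address (hypothetical).
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
    try {
      // Primary Keys: columns flagged with key(true) form the primary key.
      val columns = List(
        new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build()
      ).asJava
      val schema = new Schema(columns)

      // Partition List + Buckets: hash-partition on "id" into 4 buckets.
      // Replication: must be 1, 3, 5, or 7.
      val options = new CreateTableOptions()
        .addHashPartitions(List("id").asJava, 4)
        .setNumReplicas(3)

      client.createTable("metrics", schema, options)
    } finally {
      client.close()
    }
  }
}
```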
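The checkpointing, output mode, and trigger settings correspond to standard Spark Structured Streaming write options. The sketch below shows a hand-written equivalent, assuming the hypothetical metrics table from the previous example and the kudu-spark integration's KuduContext; the rate source merely stands in for a real data source:

```scala
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object KuduEmitterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-emitter-sketch").getOrCreate()
    val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

    // Time-Based Check Point: append the current time in millis to the base
    // path so each pipeline run checkpoints into a fresh directory.
    val checkpointDir = s"/user/hadoop/checkpointingDir/${System.currentTimeMillis()}"

    // Stand-in streaming source; Output Fields are the columns selected here,
    // mapped onto the hypothetical "metrics" table (id INT64, value STRING).
    val input: DataFrame = spark.readStream
      .format("rate")
      .load()
      .selectExpr("value AS id", "CAST(timestamp AS STRING) AS value")

    // Write every micro-batch to Kudu via the batch-style KuduContext API.
    val writeToKudu = (batch: DataFrame, batchId: Long) =>
      kuduContext.upsertRows(batch, "metrics")

    val query = input.writeStream
      .outputMode("append")                          // Output Mode
      .option("checkpointLocation", checkpointDir)   // Checkpoint Directory
      .trigger(Trigger.ProcessingTime("30 seconds")) // Enable Trigger + Processing Time
      .foreachBatch(writeToKudu)
      .start()

    query.awaitTermination()
  }
}
```

foreachBatch is used here because KuduContext exposes batch-style write methods (insertRows, upsertRows); the same write logic is applied to every micro-batch, and the checkpoint directory lets Spark resume the query after a restart.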
Click on the Next button. Enter the notes in the space provided.
Click on DONE to save the configuration.