Kudu Emitter
Apache Kudu is a column-oriented data store in the Apache Hadoop ecosystem. It enables fast analytics on fast (rapidly changing) data. The emitter is engineered to take advantage of hardware and in-memory processing, and it significantly lowers query latency compared to similar tools.
Kudu Emitter Configuration
To add a Kudu emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor.
The configuration settings are as follows:
Field | Description |
---|---|
Connection Name | The connection URL used to create the Kudu connection. |
Table Administration | If checked, the table will be created. |
Primary Keys | Select the fields that will be the primary keys of the table (see the table-creation sketch below). |
Partition List | Select the fields on which the table will be partitioned. |
Buckets | Number of buckets used for partitioning. |
Replication | Replication factor used to make additional copies of data. The value must be 1, 3, 5, or 7. |
Checkpoint Storage Location | Select the checkpointing storage location. Available options are HDFS, S3, and EFS. |
Checkpoint Connections | Select the connection. Connections are listed according to the selected storage location. |
Checkpoint Directory | The path where the Spark application stores its checkpointing data. For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system adds a suitable prefix by itself. For S3, enter an absolute path such as s3://BucketName/checkpointingDir. |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each run the checkpoint location provided above is appended with the current time in milliseconds (see the streaming sketch below). |
Output Fields | Select the fields whose values you want to persist in the table. |
Output Mode | Output mode to be used while writing the data to the streaming sink. Append: only the new rows in the streaming data are written to the sink. Complete: all the rows in the streaming data are written to the sink every time there are updates. Update: only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Save Mode | Save mode specifies how to handle the existing data. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
Processing Time | Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger time interval, in minutes or seconds. |
ADD CONFIGURATION | Enables additional configuration properties for the Kudu emitter. |
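When Table Administration is checked, the Primary Keys, Partition List, Buckets, and Replication settings map onto Kudu's table-creation options. Below is a minimal sketch of the equivalent call using the Kudu Java client from Scala; the master address (kudu-master:7051), the table name (metrics), and the columns are hypothetical placeholders, not values the emitter prescribes:

```scala
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
import org.apache.kudu.{ColumnSchema, Schema, Type}
import scala.collection.JavaConverters._

object CreateKuduTableSketch {
  def main(args: Array[String]): Unit = {
    // Connection Name: the Kudu master address (hypothetical).
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
    try {
      // Primary Keys: columns flagged with key(true) form the primary key.
      val columns = List(
        new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build()
      ).asJava
      val schema = new Schema(columns)

      // Partition List + Buckets: hash-partition on "id" into 4 buckets.
      // Replication: must be 1, 3, 5, or 7.
      val options = new CreateTableOptions()
        .addHashPartitions(List("id").asJava, 4)
        .setNumReplicas(3)

      client.createTable("metrics", schema, options)
    } finally {
      client.close()
    }
  }
}
```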
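The checkpointing, output mode, and trigger settings correspond to standard Spark Structured Streaming write options. The sketch below shows a hand-written equivalent, assuming the hypothetical metrics table from the previous example and the kudu-spark integration's KuduContext; the rate source merely stands in for a real data source:

```scala
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object KuduEmitterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-emitter-sketch").getOrCreate()
    val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

    // Time-Based Check Point: append the current time in millis to the base
    // path so each pipeline run checkpoints into a fresh directory.
    val checkpointDir = s"/user/hadoop/checkpointingDir/${System.currentTimeMillis()}"

    // Stand-in streaming source; Output Fields are the columns selected here,
    // mapped onto the hypothetical "metrics" table (id INT64, value STRING).
    val input: DataFrame = spark.readStream
      .format("rate")
      .load()
      .selectExpr("value AS id", "CAST(timestamp AS STRING) AS value")

    // Write every micro-batch to Kudu via the batch-style KuduContext API.
    val writeToKudu = (batch: DataFrame, batchId: Long) =>
      kuduContext.upsertRows(batch, "metrics")

    val query = input.writeStream
      .outputMode("append")                          // Output Mode
      .option("checkpointLocation", checkpointDir)   // Checkpoint Directory
      .trigger(Trigger.ProcessingTime("30 seconds")) // Enable Trigger + Processing Time
      .foreachBatch(writeToKudu)
      .start()

    query.awaitTermination()
  }
}
```

foreachBatch is used here because KuduContext exposes batch-style write methods (insertRows, upsertRows); the same write logic is applied to every micro-batch, and the checkpoint directory lets Spark resume the query after a restart.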
Click on the Next button. Enter the notes in the space provided.
Click on DONE to save the configuration.