Kafka ETL Source
Schema Type
See the topic Provide Schema for ETL Source → to learn how to provide schema details for data sources.
After providing schema type details, the next step is to configure the data source.
Data Source Configuration
Each configuration property available in the Kafka data source is explained below.
Connection Name
Connections are service identifiers. Select a connection name from the list if you have previously created and saved connection details for Kafka, or create one as explained in the topic Kafka Connection →
Batch
Option to enable batch processing.
Topic Type
Select one of the options below to fetch records from Kafka topic(s).
Topic name: Subscribes to a single topic.
Topic list: Subscribes to a comma-separated list of topics.
Pattern: Subscribes to all topics whose names match a Java regex.
With Partitions: Consumes only specific partitions of specific topic(s), given as a JSON string. For example, {"topicA":[0,1],"topicB":[2,4]}
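The four topic types can be pictured as four ways of building a subscription. The sketch below is illustrative only: the function name build_subscription is hypothetical, and the option keys subscribe, subscribePattern, and assign are assumptions borrowed from the Spark Structured Streaming Kafka source, not confirmed as what the component uses internally.

```python
import json
import re


def build_subscription(topic_type, value):
    """Return a (key, value) pair describing the subscription.

    The option keys are assumptions for illustration, not a documented API.
    """
    if topic_type == "Topic name":
        # Exactly one topic.
        return ("subscribe", value.strip())
    if topic_type == "Topic list":
        # Comma-separated topics, normalized to drop stray spaces.
        topics = [t.strip() for t in value.split(",") if t.strip()]
        return ("subscribe", ",".join(topics))
    if topic_type == "Pattern":
        # Validate the regex early. Python's `re` is only a stand-in here;
        # Java regex syntax differs from it in places.
        re.compile(value)
        return ("subscribePattern", value)
    if topic_type == "With Partitions":
        # JSON object mapping topic name -> list of partition numbers.
        assignment = json.loads(value)
        for topic, partitions in assignment.items():
            if not all(isinstance(p, int) and p >= 0 for p in partitions):
                raise ValueError(f"invalid partitions for {topic}: {partitions}")
        return ("assign", json.dumps(assignment, separators=(",", ":")))
    raise ValueError(f"unknown topic type: {topic_type}")


print(build_subscription("Topic list", "orders, payments"))
print(build_subscription("With Partitions", '{"topicA":[0,1],"topicB":[2,4]}'))
```

Validating the JSON and the regex before the pipeline starts surfaces malformed input at configuration time rather than at run time.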
Additional configuration fields that appear for the Topic Name are described below:
Topic Name
The Kafka topic from which messages are read.
Partitions
Number of partitions. Each partition is an ordered, immutable sequence of messages that is continually appended to a commit log.
Replication Factor
Number of replicas of each partition. Replication provides stronger durability and higher availability. For example, a topic with replication factor N can tolerate up to N-1 server failures without losing any messages committed to the log.
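The N-1 rule above is simple arithmetic; a minimal sketch follows. The function name tolerated_failures is hypothetical and exists only for this illustration.

```python
def tolerated_failures(replication_factor):
    """Server failures a topic can survive without losing committed messages."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


for n in (1, 2, 3):
    print(f"replication factor {n}: tolerates {tolerated_failures(n)} failure(s)")
```

A replication factor of 1 therefore gives no tolerance for server failure.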
Record Has Header
Option to read record headers along with data from the Kafka topics.
Replace Nulls with Blanks
Check this option to replace all null values in the incoming data with blanks.
Preserve Quotes
Check this option to preserve quotes in the delimited dataset. For example, 'a,b',c is emitted as two field values, 'a,b' and c. If unchecked, the same input is emitted as three field values: a, b, and c.
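The difference between the two settings can be reproduced with a small splitter. This is a sketch of the behavior described above, not the component's actual parser; the function name split_fields, the comma delimiter, and the straight single-quote character are assumptions.

```python
def split_fields(line, preserve_quotes, delimiter=",", quote="'"):
    """Split one delimited record into field values."""
    if not preserve_quotes:
        # Quotes are dropped and every delimiter splits the record.
        return line.replace(quote, "").split(delimiter)

    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes
            current.append(ch)  # keep the quote character in the value
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)  # delimiters inside quotes stay in the value
    fields.append("".join(current))
    return fields


print(split_fields("'a,b',c", preserve_quotes=True))
print(split_fields("'a,b',c", preserve_quotes=False))
```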
Specify Consumer Group
Specify the consumer ID type. The default type is auto, which means the ID is auto-generated by Kafka.
The other options are Group Id and Group Id Prefix.
Define Offset
The following options set the starting Kafka offset.
Earliest: The query starts from the earliest (first) available offset.
Latest: The query starts from the latest offset, so only records that arrive after the query starts are read.
Connection Retries
The number of retries for the component connection. Possible values are -1, 0, or any positive number. If the value is -1, the connection is retried indefinitely.
Max Offset Per Trigger
Rate limit on the maximum number of offsets processed per trigger interval. The specified total number of offsets is split proportionally across topic partitions of different volumes.
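The proportional split can be illustrated with a short calculation. This is a sketch under stated assumptions: the function name split_offsets is hypothetical, and rounding each share down is a choice made for the example, so the real engine may distribute remainders differently.

```python
def split_offsets(available, max_offsets_per_trigger):
    """Split a per-trigger offset limit across partitions by pending volume.

    available: dict mapping a partition label to its number of unread offsets.
    """
    total = sum(available.values())
    if total <= max_offsets_per_trigger:
        # Everything fits within the limit, so nothing is throttled.
        return dict(available)
    return {
        # Each partition gets a share proportional to its backlog (floored).
        partition: (count * max_offsets_per_trigger) // total
        for partition, count in available.items()
    }


backlog = {"topicA-0": 300, "topicA-1": 100}
print(split_offsets(backlog, 100))
```

With a limit of 100 and backlogs of 300 and 100, the busier partition receives three quarters of the budget for that trigger.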
Fail on Data Loss
Option to fail the query when data may have been lost (for example, when topics are deleted or offsets are out of range). This may be a false alarm; you can disable the option if it does not work as you expect. Batch queries always fail if they cannot read data from the provided offsets because of data loss.
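The out-of-range case can be sketched as a check on a single partition. The names DataLossError and resolve_start are hypothetical, and the fallback used when the option is disabled (continuing from the nearest offset that still exists) is an assumption made for illustration, not documented behavior.

```python
class DataLossError(Exception):
    """Raised when a requested offset is no longer available."""


def resolve_start(requested, earliest, latest, fail_on_data_loss):
    """Return the offset to start reading from for one partition."""
    if earliest <= requested <= latest:
        return requested
    if fail_on_data_loss:
        raise DataLossError(
            f"offset {requested} is outside the available range "
            f"[{earliest}, {latest}]"
        )
    # Assumed fallback: clamp to the nearest offset that still exists.
    return min(max(requested, earliest), latest)


# Offsets 0-9 have already been removed by retention in this example.
print(resolve_start(5, earliest=10, latest=20, fail_on_data_loss=False))
```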
Delay Between Connection Retries
Retry delay interval for component connection (in milliseconds).
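Connection Retries and Delay Between Connection Retries work together as a retry loop. A minimal sketch follows, assuming a hypothetical connect callable that raises ConnectionError on failure; the function name connect_with_retries is also an assumption.

```python
import time


def connect_with_retries(connect, retries, delay_ms, sleep=time.sleep):
    """Call `connect` until it succeeds or the retries are exhausted.

    retries: -1 retries forever, 0 never retries, n > 0 retries up to n times.
    delay_ms: wait between attempts, in milliseconds.
    """
    if retries < -1:
        raise ValueError("retries must be -1, 0, or a positive number")
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            if retries != -1 and attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(delay_ms / 1000.0)
```

The sleep parameter is injectable so that the loop can be exercised in tests without real waiting.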
Log Parsing Errors
Check this option to log parsing errors in the pipeline logs.
ADD CONFIGURATION: Use this option to add custom Kafka properties as key-value pairs.
Detect Schema
Check the populated schema details. For more details, see Schema Preview →
Notes
Optionally, enter notes in the Notes → tab and save the configuration.