Dedup Processor

Applications often work with large datasets that contain duplicate records. To keep the data consistent and accurate, you need to remove the duplicates and keep only one copy of each record.

The Dedup processor returns a new dataset with all duplicate records removed.
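As a rough illustration of what deduplication on a set of columns means, here is a minimal Python sketch. It is not the processor's implementation: the function name dedup, the dict-based record shape, and the choice to keep the first occurrence of each duplicate are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List, Sequence

Record = Dict[str, Any]


def dedup(records: Iterable[Record], dedup_columns: Sequence[str]) -> List[Record]:
    """Return a new list with one record per distinct combination of dedup_columns.

    Assumption: the first record seen for each key is the one that is kept.
    """
    seen = set()
    result: List[Record] = []
    for record in records:
        # Build the comparison key from the selected columns only.
        key = tuple(record.get(column) for column in dedup_columns)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


if __name__ == "__main__":
    rows = [
        {"id": 1, "name": "alice"},
        {"id": 1, "name": "alice"},
        {"id": 2, "name": "bob"},
    ]
    for row in dedup(rows, ["id"]):
        print(row)
```

Only the selected columns take part in the comparison, so two records that differ in other fields are still treated as duplicates if their key columns match.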

Configuring the Dedup processor for ETL pipelines

Processor Configuration

De-Dup Columns: Columns used to identify duplicate records.

Watermarking

Yes: Watermarking is applied.

No: Watermarking is not applied.

Watermark Duration: Specify the watermark duration (in seconds).

Event Column: The message field, of type timestamp, used as the event time for watermarking.

ADD CONFIGURATION: Use the ADD CONFIGURATION link to add additional properties.
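The watermarking options above can be pictured with a small Python sketch. It shows one common interpretation of watermark-based deduplication, not the processor's documented behavior: the watermark is taken as the latest event time seen minus the watermark duration, records older than the watermark are dropped as late, and remembered keys older than the watermark are forgotten. The function name dedup_with_watermark and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple

Record = Dict[str, Any]


def dedup_with_watermark(
    records: Iterable[Record],
    dedup_columns: Sequence[str],
    event_column: str,
    watermark_seconds: int,
) -> List[Record]:
    """Deduplicate records while bounding state with a watermark (assumed semantics)."""
    delay = timedelta(seconds=watermark_seconds)
    max_event_time: Optional[datetime] = None
    seen: Dict[Tuple[Any, ...], datetime] = {}  # key -> event time of the kept record
    output: List[Record] = []

    for record in records:
        event_time = record[event_column]
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay

        # Records that arrive behind the watermark are treated as too late.
        if event_time < watermark:
            continue

        # Forget keys whose kept record is now older than the watermark.
        for expired in [k for k, t in seen.items() if t < watermark]:
            del seen[expired]

        key = tuple(record.get(column) for column in dedup_columns)
        if key not in seen:
            seen[key] = event_time
            output.append(record)

    return output


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    events = [
        {"id": 1, "ts": start},
        {"id": 1, "ts": start + timedelta(seconds=5)},
        {"id": 2, "ts": start + timedelta(seconds=100)},
    ]
    for event in dedup_with_watermark(events, ["id"], "ts", 10):
        print(event)
```

Under these assumptions, a longer watermark duration catches duplicates that arrive further apart but keeps more keys in memory, while a shorter one frees state sooner at the cost of letting widely spaced duplicates through.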
