Dedup Processor

Applications often work with large datasets that contain duplicate records. To keep the data consistent and accurate, you need to remove the duplicates and keep only one copy of each record.

The Dedup processor returns a new dataset with all duplicate records removed.

Dedup Processor Configuration

To add a Dedup processor to your pipeline, drag the processor onto the canvas and right-click it to configure.

Field | Description
DeDup Columns | Columns used to determine duplicate values.
Watermarking | Yes: watermarking is applied. No: watermarking is not applied.
Watermark Duration | Specify the watermark duration.
eventColumn | Message field of type timestamp.
ADD CONFIGURATION | Additional properties can be added using the ADD CONFIGURATION link.
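
The page does not spell out how watermarking affects deduplication, so the following is only a rough sketch under stated assumptions. It assumes semantics loosely modeled on watermark-based streaming deduplication: the watermark trails the latest event time seen by the Watermark Duration, records older than the watermark are dropped as late, and remembered keys older than the watermark are forgotten. The function `dedup_with_watermark` and the field names `id` and `ts` are hypothetical and are not part of the processor.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, object]


def dedup_with_watermark(
    records: Iterable[Record],
    dedup_columns: Sequence[str],
    event_column: str,
    watermark_duration: timedelta,
) -> List[Record]:
    """Hypothetical sketch: deduplicate a stream while bounding remembered state."""
    seen: Dict[Tuple, datetime] = {}  # key -> event time of the record that was kept
    max_event_time = None
    output: List[Record] = []
    for record in records:
        event_time = record[event_column]
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # Assumption: the watermark trails the latest event time by the duration.
        watermark = max_event_time - watermark_duration
        # Forget keys whose kept record is now older than the watermark.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if event_time < watermark:
            continue  # late record, dropped
        key = tuple(record[col] for col in dedup_columns)
        if key not in seen:
            seen[key] = event_time
            output.append(record)
    return output


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    {"id": 1, "ts": t0},
    {"id": 1, "ts": t0 + timedelta(minutes=2)},   # duplicate inside the window
    {"id": 2, "ts": t0 + timedelta(minutes=30)},  # advances the watermark
    {"id": 1, "ts": t0 + timedelta(minutes=31)},  # earlier id 1 state has expired
    {"id": 3, "ts": t0 + timedelta(minutes=5)},   # older than the watermark
]
kept = dedup_with_watermark(events, ["id"], "ts", timedelta(minutes=10))
kept_ids = [r["id"] for r in kept]
print(kept_ids)
```

Under these assumptions, a longer Watermark Duration catches duplicates that arrive further apart, at the cost of remembering more keys.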

Click NEXT and enter any notes in the space provided.

Click Save to save the configuration details.

Example to demonstrate how Dedup works: 

You have a dataset with the following rows:

[Row(name='Alice', age=5, height=80), 
Row(name='Alice', age=5, height=80), 
Row(name='Alice', age=10, height=80)]

If the Dedup columns are [age, height], the Dedup processor returns the following dataset:

[Row(name='Alice', age=5, height=80), 
Row(name='Alice', age=10, height=80)]

If the Dedup columns are [name, height], the Dedup processor returns the following dataset:

[Row(name='Alice', age=5, height=80)]
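
The behavior in this example can be reproduced with a minimal Python sketch. It assumes the first record seen for each combination of Dedup column values is the one kept, which matches the outputs shown above but is not stated as a guarantee. The function `dedup` is hypothetical and uses plain dictionaries in place of Row objects.

```python
from typing import Dict, Iterable, List, Sequence

Record = Dict[str, object]


def dedup(records: Iterable[Record], dedup_columns: Sequence[str]) -> List[Record]:
    """Keep the first record seen for each distinct combination of dedup_columns."""
    seen = set()
    result: List[Record] = []
    for record in records:
        # Only the selected columns decide whether two records are duplicates.
        key = tuple(record[col] for col in dedup_columns)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

by_age_height = dedup(rows, ["age", "height"])
by_name_height = dedup(rows, ["name", "height"])
print(by_age_height)
print(by_name_height)
```

Columns that are not listed as Dedup columns (age in the second case) are ignored when comparing records, which is why records with different ages collapse into one.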