Dedup Processor

Applications often work with large datasets that contain duplicate records. To keep the data consistent and accurate, you need to remove the duplicates and keep only one copy of each record.

The Dedup processor returns a new dataset with all duplicate records removed.

Dedup Processor Configuration

To add a Dedup processor to your pipeline, drag the processor onto the canvas and right-click it to configure.

Field	Description
DeDup Columns	Columns used to identify duplicate records.
Watermarking	Yes: watermarking is applied. No: watermarking is not applied.
Watermark Duration	Specify the watermark duration.
eventColumn	Message field of type timestamp.
ADD CONFIGURATION	Use the ADD CONFIGURATION link to add additional properties.
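
The interaction between Watermarking, Watermark Duration, and eventColumn can be illustrated with a small sketch. The page does not spell out the exact semantics, so the following is only an assumption modeled on how watermark-bounded deduplication commonly works in streaming engines: keys are remembered only while they fall within the watermark duration of the latest event time seen, and events older than that threshold are dropped. The function name dedup_with_watermark and the field names id and ts are hypothetical, chosen for illustration.

```python
from datetime import datetime, timedelta


def dedup_with_watermark(events, dedup_columns, event_column, watermark):
    """Drop duplicates, remembering keys only while they are within the watermark.

    Illustrative sketch only; not Gathr's implementation.
    events: iterable of dicts in arrival order.
    watermark: timedelta, how far behind the latest event time a key is kept.
    """
    seen = {}  # dedup key -> event time at which that key was first kept
    max_event_time = None
    output = []
    for event in events:
        ts = event[event_column]
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
        threshold = max_event_time - watermark
        # Forget keys whose event time has fallen behind the threshold.
        seen = {k: t for k, t in seen.items() if t >= threshold}
        if ts < threshold:
            continue  # assumption: events later than the watermark are dropped
        key = tuple(event[c] for c in dedup_columns)
        if key not in seen:
            seen[key] = ts
            output.append(event)
    return output


t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    {"id": 1, "ts": t0},
    {"id": 1, "ts": t0 + timedelta(seconds=5)},   # duplicate inside the window
    {"id": 2, "ts": t0 + timedelta(seconds=60)},  # advances the watermark
    {"id": 1, "ts": t0 + timedelta(seconds=61)},  # earlier id=1 already forgotten
]
kept = dedup_with_watermark(events, ["id"], "ts", timedelta(seconds=10))
kept_ids = [e["id"] for e in kept]
print(kept_ids)
```

Under these assumptions, the watermark trades completeness for bounded memory: a duplicate that arrives after its key has been forgotten passes through again, as the last event in the sample shows.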

Click NEXT and enter any notes in the space provided.

Click Save to save the configuration details.

Example to demonstrate how Dedup works:

You have a dataset with the following rows:

[Row(name='Alice', age=5, height=80), 
Row(name='Alice', age=5, height=80), 
Row(name='Alice', age=10, height=80)]

If the Dedup columns are [age, height], the Dedup processor returns the following dataset:

[Row(name='Alice', age=5, height=80), 
Row(name='Alice', age=10, height=80)]

If the Dedup columns are [name, height], the Dedup processor returns the following dataset:

[Row(name='Alice', age=5, height=80)]
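
The behavior in this example can be reproduced with a minimal plain-Python sketch. It is not the processor's actual implementation: the dedup helper and the namedtuple-based Row are hypothetical stand-ins, and keeping the first row seen for each key is an assumption that happens to match the outputs shown above.

```python
from collections import namedtuple

# Stand-in for the Row type used in the example above.
Row = namedtuple("Row", ["name", "age", "height"])


def dedup(rows, dedup_columns):
    """Keep the first row seen for each distinct combination of dedup_columns."""
    seen = set()
    result = []
    for row in rows:
        # Only the values in the Dedup columns decide whether a row is a duplicate.
        key = tuple(getattr(row, col) for col in dedup_columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    Row(name="Alice", age=5, height=80),
    Row(name="Alice", age=5, height=80),
    Row(name="Alice", age=10, height=80),
]

by_age_height = dedup(rows, ["age", "height"])
by_name_height = dedup(rows, ["name", "height"])
print(by_age_height)
print(by_name_height)
```

Columns outside the Dedup columns (age in the second call) do not affect which rows count as duplicates, which is why three rows collapse to one there.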

