Dedup Processor
Applications often work with large datasets that contain duplicate records. To keep the data consistent and accurate, you need to remove the duplicates and keep only one copy of each record.
The Dedup processor returns a new dataset with all duplicate records removed.
Dedup Processor Configuration
To add a Dedup processor to your pipeline, drag the processor onto the canvas and right-click it to configure.
Field | Description |
---|---|
DeDup Columns | Columns used to identify duplicate records. Rows with the same values in all of these columns are treated as duplicates. |
Watermarking | Yes: watermarking is applied. No: watermarking is not applied. |
Watermark Duration | Specify the watermark duration. |
eventColumn | Message field of type timestamp. |
ADD CONFIGURATION | Additional properties can be added using the ADD CONFIGURATION link. |
Click the NEXT button and enter any notes in the space provided.
Click Save to save the configuration details.
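The Watermarking, Watermark Duration, and eventColumn fields are not described in more detail here. The sketch below shows one common interpretation of event-time watermarking in streaming deduplication, written in plain Python. It is an illustration under stated assumptions, not Gathr's implementation: the function name `dedup_with_watermark`, the eviction rule, and the handling of late records are all hypothetical.

```python
from datetime import datetime, timedelta


def dedup_with_watermark(records, dedup_columns, event_column, watermark_duration):
    """Yield the first record seen per key, forgetting keys older than the watermark.

    Assumption: the watermark is the largest event time seen so far minus
    watermark_duration. Records older than the watermark are dropped as late.
    """
    seen = {}  # key -> event time of the record that was kept
    max_event_time = None
    for record in records:
        event_time = record[event_column]
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - watermark_duration

        # Evict keys that are too old to receive further duplicates.
        for key in [k for k, t in seen.items() if t < watermark]:
            del seen[key]

        if event_time < watermark:
            continue  # late record, dropped in this sketch

        key = tuple(record[c] for c in dedup_columns)
        if key not in seen:
            seen[key] = event_time
            yield record


t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    {"id": 1, "ts": t0},
    {"id": 1, "ts": t0 + timedelta(seconds=10)},  # duplicate inside the watermark
    {"id": 2, "ts": t0 + timedelta(minutes=20)},  # advances the watermark
    {"id": 1, "ts": t0 + timedelta(minutes=21)},  # earlier state for id 1 was evicted
    {"id": 3, "ts": t0 + timedelta(minutes=1)},   # arrives too late
]
kept = list(dedup_with_watermark(events, ["id"], "ts", timedelta(minutes=10)))
kept_ids = [r["id"] for r in kept]
```

In this sketch, a longer watermark duration keeps keys in memory longer, so it catches duplicates that arrive further apart at the cost of more state.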
Example to demonstrate how Dedup works:
You have a dataset with the following rows:
[Row(name='Alice', age=5, height=80),
Row(name='Alice', age=5, height=80),
Row(name='Alice', age=10, height=80)]
If the Dedup columns are [age, height], the Dedup processor returns the following dataset:
[Row(name='Alice', age=5, height=80),
Row(name='Alice', age=10, height=80)]
If the Dedup columns are [name, height], the Dedup processor returns the following dataset:
[Row(name='Alice', age=5, height=80)]
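The behavior in this example can be reproduced with a minimal plain-Python sketch. It assumes that the first row encountered for each combination of Dedup column values is the one that is kept; the processor may choose a different row among duplicates. The `Row` type and the `dedup` function are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Hypothetical row type mirroring the rows in the example above.
Row = namedtuple("Row", ["name", "age", "height"])


def dedup(rows, dedup_columns):
    """Return rows with duplicates removed, comparing only dedup_columns.

    Assumption: the first row seen for each key is retained.
    """
    seen = set()
    result = []
    for row in rows:
        key = tuple(getattr(row, column) for column in dedup_columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    Row(name="Alice", age=5, height=80),
    Row(name="Alice", age=5, height=80),
    Row(name="Alice", age=10, height=80),
]

by_age_height = dedup(rows, ["age", "height"])    # two rows remain
by_name_height = dedup(rows, ["name", "height"])  # one row remains
```

Only the listed columns are compared, which is why the row with age=10 survives when deduplicating on [age, height] but not on [name, height].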