Distinct Processor

Distinct is a core Apache Spark operation that can be applied to streaming data. The Distinct processor eliminates duplicate records from a dataset.

Distinct Processor Configuration

To add a Distinct processor to your pipeline, drag the processor onto the canvas and right-click on it to configure.

Enter the fields on which the distinct operation is to be performed.

Click on the NEXT button. Enter the notes in the space provided.

Click SAVE to save the configuration details.

Note: Distinct cannot be used immediately after an Aggregation or Pivot processor.

The following example demonstrates how Distinct works.

If you apply Distinct on two fields, Name and Age, records that share the same values for both fields are collapsed into a single record. The output for the given input is shown below:

Input Set:
{Name:Mike,Age:7}
{Name:Rosy,Age:9}
{Name:Jack,Age:5}
{Name:Mike,Age:6}
{Name:Rosy,Age:9}
{Name:Jack,Age:5}

Output Set:
{Name:Mike,Age:7}
{Name:Mike,Age:6}
{Name:Rosy,Age:9}
{Name:Jack,Age:5}
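
The example above can be reproduced with a short plain-Python sketch. This is only an illustration of the deduplication logic, not the processor's or Spark's actual implementation; the function name distinct_on and the variable names are hypothetical. The sketch keeps the first record seen for each unique combination of the selected fields, so its output order may differ from the order shown above (a distributed engine such as Spark does not guarantee output order for a distinct operation).

```python
from typing import Dict, Iterable, List, Sequence

Record = Dict[str, object]


def distinct_on(records: Iterable[Record], fields: Sequence[str]) -> List[Record]:
    """Hypothetical helper: keep the first record for each unique
    combination of values in `fields`."""
    seen = set()
    result = []
    for record in records:
        # Build a hashable key from the selected field values.
        key = tuple(record[f] for f in fields)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


# Same six records as the Input Set in the example.
input_set = [
    {"Name": "Mike", "Age": 7},
    {"Name": "Rosy", "Age": 9},
    {"Name": "Jack", "Age": 5},
    {"Name": "Mike", "Age": 6},
    {"Name": "Rosy", "Age": 9},
    {"Name": "Jack", "Age": 5},
]

output_set = distinct_on(input_set, ["Name", "Age"])
for record in output_set:
    print(record)
```

Both Mike records survive because their Age values differ, while the repeated Rosy and Jack records are dropped, leaving four records in total.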
