Data Cleansing Processor
The Data Cleansing Processor cleanses a dataset using rules defined in metadata. To add a Data Cleansing Processor to your pipeline, drag it onto the canvas and right-click it to configure the following fields:
| Field | Description |
| --- | --- |
| Columns included while Extract Schema | Column names listed here are used in the data cleansing process. |
| Connection Type | Select the connection type from which the metadata files are read. The available connection types are RDS and S3. |
| Connection Name | Select the connection name used to fetch the metadata file. |
| S3 Protocol | Select the S3 protocol from the drop-down list. When the S3 connection type is selected, the supported protocols vary by platform: <br>- HDP versions: S3a <br>- CDH versions: S3a <br>- Apache versions: S3n <br>- GCP: S3n and S3a <br>- EMR: S3, S3n, and S3a <br>- AWS Databricks: S3a |
| Bucket Name | Provide the bucket name if the S3 connection type is selected. |
| Path | Provide the path or sub-directories of the bucket specified above to which the data is written, if the S3 connection type is selected. |
| Schema Name | Select the schema name from the drop-down list if the RDS connection type is selected. |
| Table Name | Select the table name from the drop-down list if the RDS connection type is selected. Note: The metadata must be in tabular form. |
| Feed ID | Provide the feed ID used to filter records from the metadata. |
| Remove Duplicate | Check this option to remove duplicate records. |
| Include Extra Input Columns | Check this option to include extra input columns. Additional configurations can be added by clicking the ADD CONFIGURATION button. |
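To illustrate what metadata-driven cleansing means conceptually, the sketch below shows a minimal, hypothetical implementation in Python. The metadata fields (`feed_id`, `column`, `dtype`, `default`) and the feed ID `feed_01` are illustrative assumptions, not Gathr's actual metadata schema: each metadata row describes one column of one feed, and records are cast to the declared type, given defaults when values are missing or invalid, and optionally de-duplicated.

```python
# Hypothetical metadata table in tabular form; the field names here are
# assumptions for illustration, not Gathr's actual schema.
METADATA = [
    {"feed_id": "feed_01", "column": "id",    "dtype": "int", "default": 0},
    {"feed_id": "feed_01", "column": "email", "dtype": "str", "default": ""},
]

# Map the declared dtype names to Python casting functions.
CASTS = {"int": int, "float": float, "str": str}

def cleanse(records, metadata, feed_id, remove_duplicates=True):
    """Cleanse records using the metadata rows filtered by feed ID."""
    rules = [m for m in metadata if m["feed_id"] == feed_id]
    seen, cleaned = set(), []
    for rec in records:
        row = {}
        for rule in rules:
            raw = rec.get(rule["column"])
            if raw is None:                     # missing value -> default
                raw = rule["default"]
            try:
                row[rule["column"]] = CASTS[rule["dtype"]](raw)
            except (TypeError, ValueError):     # uncastable value -> default
                row[rule["column"]] = rule["default"]
        key = tuple(sorted(row.items()))
        if remove_duplicates and key in seen:   # mirrors the Remove Duplicate option
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

raw_records = [
    {"id": "1", "email": "a@example.com"},
    {"id": "1", "email": "a@example.com"},  # duplicate, dropped
    {"id": "oops", "email": None},          # bad int and missing email -> defaults
]
print(cleanse(raw_records, METADATA, "feed_01"))
# -> [{'id': 1, 'email': 'a@example.com'}, {'id': 0, 'email': ''}]
```

This is only a sketch of the general technique; the actual processor runs inside the pipeline engine and reads its metadata from the RDS table or S3 path configured above.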