Data Cleansing Processor
The Data Cleansing Processor cleanses a dataset using rules defined in metadata. To add a Data Cleansing Processor to your pipeline, drag it onto the canvas and right-click it to configure the following fields:
| Field | Description |
| --- | --- |
| Columns included while Extract Schema | Column names listed here are used in the data cleansing process. |
| Connection Type | Select the connection type from which the metadata files are read. The available connection types are RDS and S3. |
| Connection Name | Select the connection name used to fetch the metadata file. |
| S3 Protocol | Select the S3 protocol from the drop-down list. When the S3 connection type is selected, the supported protocols vary by platform: <br>- HDP versions: S3a <br>- CDH versions: S3a <br>- Apache versions: S3n <br>- GCP: S3n and S3a <br>- EMR: S3, S3n, and S3a <br>- AWS Databricks: S3a |
| Bucket Name | Provide the bucket name if the S3 connection type is selected. |
| Path | Provide the path or sub-directories of the bucket specified above to which the data is written, if the S3 connection type is selected. |
| Schema Name | Select the schema name from the drop-down list if the RDS connection type is selected. |
| Table Name | Select the table name from the drop-down list if the RDS connection type is selected. Note: The metadata must be in tabular form. |
| Feed ID | Provide the feed ID used to filter records from the metadata. |
| Remove Duplicate | Check this option to remove duplicate records. |
| Include Extra Input Columns | Check this option to include extra input columns. Additional configurations can be added by clicking the ADD CONFIGURATION button. |
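To illustrate what metadata-driven cleansing means conceptually, the sketch below shows a minimal, hypothetical implementation in Python. The metadata fields (`feed_id`, `column`, `dtype`, `default`) and the feed ID `feed_01` are illustrative assumptions, not Gathr's actual metadata schema: each metadata row describes one column of one feed, and records are cast to the declared type, given defaults when values are missing or invalid, and optionally de-duplicated.

```python
# Hypothetical metadata table in tabular form; the field names here are
# assumptions for illustration, not Gathr's actual schema.
METADATA = [
    {"feed_id": "feed_01", "column": "id",    "dtype": "int", "default": 0},
    {"feed_id": "feed_01", "column": "email", "dtype": "str", "default": ""},
]

# Map the declared dtype names to Python casting functions.
CASTS = {"int": int, "float": float, "str": str}

def cleanse(records, metadata, feed_id, remove_duplicates=True):
    """Cleanse records using the metadata rows filtered by feed ID."""
    rules = [m for m in metadata if m["feed_id"] == feed_id]
    seen, cleaned = set(), []
    for rec in records:
        row = {}
        for rule in rules:
            raw = rec.get(rule["column"])
            if raw is None:                     # missing value -> default
                raw = rule["default"]
            try:
                row[rule["column"]] = CASTS[rule["dtype"]](raw)
            except (TypeError, ValueError):     # uncastable value -> default
                row[rule["column"]] = rule["default"]
        key = tuple(sorted(row.items()))
        if remove_duplicates and key in seen:   # mirrors the Remove Duplicate option
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

raw_records = [
    {"id": "1", "email": "a@example.com"},
    {"id": "1", "email": "a@example.com"},  # duplicate, dropped
    {"id": "oops", "email": None},          # bad int and missing email -> defaults
]
print(cleanse(raw_records, METADATA, "feed_01"))
# -> [{'id': 1, 'email': 'a@example.com'}, {'id': 0, 'email': ''}]
```

This is only a sketch of the general technique; the actual processor runs inside the pipeline engine and reads its metadata from the RDS table or S3 path configured above.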