Data Cleansing Processor

The Data Cleansing Processor cleanses a dataset using metadata. To add a Data Cleansing Processor to your pipeline, drag the processor onto the canvas and right-click it to configure the following fields:
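Before walking through the configuration fields, here is a rough sketch of what metadata-driven cleansing can look like. This is an illustrative pandas example under assumed metadata and column names, not the product's implementation:

```python
import pandas as pd

# Hypothetical metadata table (the processor reads this from RDS or S3);
# it lists, per feed ID, the columns to include in the cleansing process.
metadata = pd.DataFrame({
    "feed_id": ["feed_1", "feed_1", "feed_2"],
    "column_name": ["name", "email", "phone"],
})

# Input dataset flowing through the pipeline.
data = pd.DataFrame({
    "name": [" Alice ", "Bob "],
    "email": ["a@example.com", "b@example.com"],
    "extra": [1, 2],
})

# Keep only the columns that the metadata names for the chosen feed,
# then trim stray whitespace from string columns.
columns = metadata.loc[metadata["feed_id"] == "feed_1", "column_name"]
cleansed = data[[c for c in columns if c in data.columns]]
cleansed = cleansed.apply(lambda s: s.str.strip() if s.dtype == object else s)
print(cleansed)
```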

Columns included while Extract Schema

The column names listed here are used in the data cleansing process.

Connection Type

Select the connection type from which to read the metadata files. The available connection types are RDS and S3.

Connection Name

Select the connection name used to fetch the metadata file.
S3 Protocol

Select the S3 protocol from the drop-down list. The following protocols are supported for various versions when the S3 connection type is selected:

- For HDP versions, the S3a protocol is supported.

- For CDH versions, the S3a protocol is supported.

- For Apache versions, the S3n protocol is supported.

- For GCP, the S3n and S3a protocols are supported.

- For EMR, the S3, S3n, and S3a protocols are supported.

- For AWS Databricks, the S3a protocol is supported.

Bucket Name

Provide the bucket name if the S3 connection is selected.

Path

Provide the path or sub-directories of the bucket named above to which the data is to be written, if the S3 connection is selected.

Schema Name

Select the schema name from the drop-down list if the RDS connection is selected.
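Taken together, the S3 protocol, bucket name, and path resolve to a fully qualified S3 location. A minimal sketch with illustrative values:

```python
# Illustrative values only: how the S3 Protocol, Bucket Name, and Path
# fields combine into a fully qualified S3 location.
protocol = "s3a"
bucket_name = "my-metadata-bucket"
path = "cleansing/metadata"

location = f"{protocol}://{bucket_name}/{path}"
print(location)  # s3a://my-metadata-bucket/cleansing/metadata
```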
Table Name

Select the table name from the drop-down list if the RDS connection is selected.

Note: The metadata should be in tabular form.

Feed ID

Provide the feed ID to be filtered out from the metadata.

Remove Duplicate

Check this option to remove duplicate records.

Include Extra Input Columns

Check this option to include extra input columns.
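As a hedged sketch, the two checkbox options above might behave like the following pandas logic (illustrative flags and column names, not the product's implementation):

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob"],
    "score": [1, 2, 2],
    "debug_flag": ["x", "y", "y"],  # an extra input column not in the metadata
})
metadata_columns = ["name", "score"]

remove_duplicate = True             # the "Remove Duplicate" checkbox
include_extra_input_columns = True  # the "Include Extra Input Columns" checkbox

# Extra input columns pass through only when the checkbox is checked.
columns = list(data.columns) if include_extra_input_columns else metadata_columns
result = data[columns]
if remove_duplicate:
    result = result.drop_duplicates()
print(result)
```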
