Model Pre-Processing

In Pre-Processing, the data is transformed or consolidated so that the resulting mining process is more efficient, and the patterns found are easier to understand.

Once features are selected on the Feature Selection tab, you can apply various transformations on the Pre-Processing tab.

All ML models require the feature column to be of the Vector data type. Use Pre-Processing transformations to convert raw input fields into Vectors.

The following sections describe the transformations/algorithms supported by Gathr across the various analytics processors.

Binarizer

Binarizer thresholds numerical features to binary (0/1) features. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.

Enter values in the following parameters:

Input Columns: Input column name over which the Binarizer transformation is to be applied. Note: To apply the algorithm on multiple columns, apply the Vector Assembler transformation before the algorithm.
Output Column: Name of the output column that will contain the transformed values after the Binarizer transformation is applied.
Threshold: Threshold value used for binarization. Features greater than the threshold are binarized to 1.0; features equal to or less than the threshold are binarized to 0.0. Default: 0.0
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
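The thresholding rule can be sketched in plain Python (illustrative only, not Gathr's actual implementation; the function name is made up):

```python
def binarize(values, threshold=0.0):
    """Map each value to 1.0 if it is strictly greater than the threshold,
    else 0.0 (values equal to the threshold fall in the 0.0 bucket)."""
    return [1.0 if v > threshold else 0.0 for v in values]
```

Note that the comparison is strict: a feature exactly equal to the threshold is binarized to 0.0.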

Bucketizer

Bucketizer transforms a column of continuous features to a column of feature buckets.

To configure the Bucketizer transformation, select the Bucketizer algorithm.

Enter values in the following parameters:

Input Columns: Input column name over which the Bucketizer transformation is to be applied. Note: To apply the algorithm on multiple columns, apply the Vector Assembler transformation before the algorithm.
Output Column: Name of the output column that will contain the bucket indices after the Bucketizer transformation is applied.
Splits: Splits used to define the buckets. With n+1 splits, there are n buckets. Splits must be strictly increasing. Use –inf for negative infinity and +inf for positive infinity.
Handle Invalid: How invalid records are handled. The three options are Keep, Skip and Error. Keep retains the invalid value and places it in a special additional bucket, Skip drops that particular record, and Error raises an exception when an invalid record is input to the transformation.
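The split-to-bucket mapping can be sketched in plain Python (illustrative only; the function name is made up). Each bucket [splits[i], splits[i+1]) is half-open except the last, which also includes its upper split:

```python
import bisect

def bucketize(value, splits):
    """Return the bucket index for value. With n+1 strictly increasing
    splits there are n buckets, indexed 0 .. n-1."""
    # bisect_right finds the insertion point; subtract 1 to get the bucket.
    idx = bisect.bisect_right(splits, value) - 1
    # A value equal to the very last split belongs to the last bucket.
    if idx == len(splits) - 1:
        idx -= 1
    return idx

# Three buckets: (-inf, 0), [0, 10), [10, +inf)
splits = [float("-inf"), 0.0, 10.0, float("inf")]
```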

CountVectorizer

CountVectorizer converts a collection of text documents to vectors of token counts.

Input Columns: Name of the input column over which the CountVectorizer transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the count vector for the respective input column.
Vocabulary Size: Maximum size of the vocabulary. CountVectorizer builds a vocabulary that only considers the top vocabSize terms, ordered by term frequency across the corpus.
Minimum Document Frequency: Minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, it specifies the number of documents the term must appear in; if it is a double in [0, 1), it specifies the fraction of documents.
Minimum Term Frequency: Filter to ignore rare words in a document. For each document, terms with a frequency/count less than the given threshold are ignored. If this is an integer >= 1, it specifies a count (of times the term must appear in the document); if it is a double in [0, 1), it specifies a fraction (of the document’s token count).
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
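The two stages, building the vocabulary and counting tokens per document, can be sketched in plain Python (illustrative only; function names are made up):

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size):
    """Keep the top vocab_size terms by total frequency across the corpus."""
    freq = Counter(term for doc in docs for term in doc)
    return [term for term, _ in freq.most_common(vocab_size)]

def count_vector(doc, vocab):
    """Dense vector of token counts, one slot per vocabulary term."""
    counts = Counter(doc)
    return [counts.get(term, 0) for term in vocab]

docs = [["a", "b", "a"], ["a", "c"]]
vocab = fit_vocabulary(docs, vocab_size=2)
```

Terms outside the fitted vocabulary are simply dropped from the count vector.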

Feature Hasher

Feature Hasher projects a set of categorical or numerical features into a feature vector of specified dimension. This is done using a hashing trick to map features to indices in the feature vector. Null (missing) values are ignored (implicitly zero in the resulting feature vector).

To configure Feature Hasher, select the algorithm on the transformations tab. Enter values in the following parameters:

Input Columns: Name of the input column over which the Feature Hasher transformation is to be applied.
Output Column: Name of the output column that will contain the transformed feature vector.
Num Features: Number of features; must be greater than 0. Default: 2^18
Categorical Columns: Numeric columns to treat as categorical features. By default, only string and boolean columns are treated as categorical, so this parameter can be used to explicitly mark numeric columns as categorical. Note: the relevant columns must also be included in Input Columns.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
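The hashing trick can be sketched in plain Python. This is a rough approximation only: Python's `hash()` stands in for the real hash function, so the indices it produces will differ from any production implementation, and the function name is made up:

```python
def feature_hash(row, num_features=2 ** 18, categorical_cols=()):
    """Hash each feature to an index and accumulate values at that index.
    Categorical features hash 'column=value' and contribute 1.0; other
    numeric features hash the column name and contribute their value."""
    vec = {}  # sparse {index: value} representation
    for col, value in row.items():
        if isinstance(value, (str, bool)) or col in categorical_cols:
            idx = hash(f"{col}={value}") % num_features
            delta = 1.0
        else:
            idx = hash(col) % num_features
            delta = float(value)
        vec[idx] = vec.get(idx, 0.0) + delta
    return vec
```

Null handling is not shown; as noted above, missing values are ignored (implicitly zero in the resulting vector).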

HashingTF

HashingTF is a transformer which takes sets of terms and converts those sets into fixed-length feature vectors.

Input Columns: Name of the input column over which the HashingTF transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the term-frequency vector for the respective input column.
Number of Features: Number of features; must be greater than 0. Default: 2^18
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
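The core idea, hashing each term to a fixed-size index space and counting, can be sketched in plain Python (illustrative only; Python's `hash()` stands in for the real hash function, so indices will differ, and the function name is made up):

```python
def hashing_tf(terms, num_features=2 ** 18):
    """Map each term to an index via hashing and count occurrences,
    giving a fixed-length (sparse) term-frequency vector."""
    vec = {}  # sparse {index: count} representation
    for term in terms:
        idx = hash(term) % num_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec
```

Because distinct terms can hash to the same index, smaller values of Number of Features increase the chance of collisions.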

IDF

The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.

Input Columns: Name of the input column over which the IDF transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the IDF-scaled vector for the respective input column.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
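The down-weighting can be sketched in plain Python (illustrative only; function names are made up). It assumes the common smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term:

```python
import math

def idf_weights(doc_term_matrix):
    """One IDF weight per column of a documents-by-terms count matrix.
    A term appearing in every document gets weight 0."""
    m = len(doc_term_matrix)
    weights = []
    for j in range(len(doc_term_matrix[0])):
        df = sum(1 for row in doc_term_matrix if row[j] > 0)
        weights.append(math.log((m + 1) / (df + 1)))
    return weights

def apply_idf(row, weights):
    """Scale a term-frequency vector by the fitted IDF weights."""
    return [v * w for v, w in zip(row, weights)]
```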

Imputer

The Imputer transformer completes missing values in a dataset, either using the mean or the median of the columns in which the missing values are located. Imputer doesn’t support categorical features. By default, all null values in the input columns are treated as missing, and so are also imputed. The input columns should be of decimal type.

Input Columns: Input column name over which the Imputer transformation is to be applied. You can select multiple input columns.
Output Column: Name of the output columns. In each output column, missing values are replaced by the surrogate value for the relevant column. Note: You can select multiple output columns too; however, the first input column maps to the first output column, the second to the second, and so on.
Strategy: The imputation strategy. Available options are “mean” and “median”. If “mean” is selected, all occurrences of missing values are replaced with the mean of the column; if “median” is selected, they are replaced with the approximate median of the column. Default: mean
Missing Value: The placeholder for missing values. All occurrences of this value are imputed. Note that null values are always treated as missing.
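The per-column logic can be sketched in plain Python (illustrative only; the function name is made up, and `None` stands in for null):

```python
import statistics

def impute(column, strategy="mean", missing_value=None):
    """Replace missing entries with the mean or median of the observed
    values. None always counts as missing, as does missing_value if set."""
    def is_missing(v):
        return v is None or v == missing_value

    observed = [v for v in column if not is_missing(v)]
    if strategy == "mean":
        surrogate = statistics.mean(observed)
    else:
        surrogate = statistics.median(observed)
    return [surrogate if is_missing(v) else v for v in column]
```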

MaxAbsScaler

MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to range [-1, 1] by dividing through the maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.

MaxAbsScaler computes summary statistics on a data set and produces a MaxAbsScalerModel. The model can then transform each feature individually to range [-1, 1].

Input Columns: Name of the input column over which the MaxAbsScaler transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the rescaled vector for the respective input column.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
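Per feature, the rescaling is a single division; a plain-Python sketch (illustrative only; the function name is made up):

```python
def max_abs_scale(column):
    """Rescale a feature column to [-1, 1] by dividing through the
    maximum absolute value. Zeros stay zero, preserving sparsity."""
    max_abs = max(abs(v) for v in column)
    if max_abs == 0:
        return list(column)  # all-zero feature is left unchanged
    return [v / max_abs for v in column]
```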

MinMaxScaler

MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (specified by the Min Value and Max Value parameters).

Input Columns: Name of the input column over which the MinMaxScaler transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the rescaled vector for the respective input column.
Min Value: Lower bound after transformation, shared by all features.
Max Value: Upper bound after transformation, shared by all features.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
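Per feature, the rescaling maps the observed [min, max] linearly onto [Min Value, Max Value]; a plain-Python sketch (illustrative only; the function name and the midpoint convention for constant features are assumptions):

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Rescale a feature column to [new_min, new_max] using the
    observed minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature is mapped to the midpoint of the target range.
        return [(new_min + new_max) / 2.0] * len(column)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in column]
```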

NGram

NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.

Input Columns: Name of the input column over which the NGram transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the n-grams for the respective input column.
N-Gram Param: Number of terms in each n-gram; must be >= 1. Default: 2
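The sliding-window behaviour described above can be sketched in plain Python (illustrative only; the function name is made up):

```python
def ngrams(tokens, n=2):
    """Space-joined n-grams of consecutive tokens; empty output when
    the input has fewer than n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```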

Normalizer

Normalizer is a transformer that transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes parameter p, which specifies the p-norm used for normalization. (p=2 by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.

Input Columns: Name of the input column over which the Normalizer transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the normalized vector for the respective input column.
Norm: The p-norm used to normalize each vector to unit norm.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
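The p-norm scaling can be sketched in plain Python (illustrative only; the function name is made up):

```python
def normalize(vector, p=2.0):
    """Scale a vector to unit p-norm: divide every component by
    (sum of |v|^p) ** (1/p)."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # zero vector cannot be normalized
    return [v / norm for v in vector]
```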

OneHotEncoder

One-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.

Input Columns: Input column name over which the One-Hot Encoder transformation is to be applied. You can add multiple input columns.
Output Column: Name of the output columns. Each output column will contain the one-hot-encoded vector for the respective input column. Note: You can select multiple output columns too; however, the first input column maps to the first output column, the second to the second, and so on.
Drop Last: Whether to drop the last category in the encoded vector. Default: true
Handle Invalid: How invalid values encountered during the transformation are handled. Available options are “keep” (invalid data is represented as an extra categorical feature) and “error” (throw an error). Default: “error”
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
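The encoding of a single label index, including the Drop Last behaviour, can be sketched in plain Python (illustrative only; the function name is made up):

```python
def one_hot(index, num_categories, drop_last=True):
    """Binary vector with at most a single 1.0 at position `index`.
    With drop_last, the last category is encoded as the all-zeros
    vector, which avoids a linearly dependent column."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec
```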

PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA.

Input Columns: Name of the input column over which the PCA transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the principal-component vector for the respective input column.
Number of Principal Components: Number of principal components.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Output Size Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”

StandardScaler

StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean.

Input Columns: Name of the input column over which the StandardScaler transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the standardized vector for the respective input column.
With Std Dev: Whether to scale the data to unit standard deviation.
With Mean: Whether to center the data with the mean before scaling.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”
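The effect of the two flags on a single feature column can be sketched in plain Python (illustrative only; the function name is made up, and the sample standard deviation is an assumption):

```python
import statistics

def standard_scale(column, with_std=True, with_mean=False):
    """Optionally centre a feature on its mean, then optionally scale
    it to unit (sample) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    out = [v - mean for v in column] if with_mean else list(column)
    if with_std and std:
        out = [v / std for v in out]
    return out
```

Centring with the mean turns a sparse vector dense, which is why it is a separate option.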

StopWordsRemover

Stop words are words which should be excluded from the input, characteristically because the words appear frequently and do not carry much meaning.

StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of StopWords is specified by the StopWords parameter.

Input Columns: Name of the input column over which the StopWordsRemover transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the filtered tokens for the respective input column.
Load Default Stop Words: When this checkbox is checked, you select the language whose default stop words should be removed by the transformation (options include English, French, Spanish, etc.). If you do not enable this option, another field is shown for providing the stop words yourself.
Language: The language of the stop words.
Case Sensitive: Whether the stop-word matching is case sensitive.
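The filtering, including the case-sensitivity option, can be sketched in plain Python (illustrative only; the function name is made up):

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token that appears in the stop-word list."""
    if case_sensitive:
        stops = set(stop_words)
        return [t for t in tokens if t not in stops]
    # Case-insensitive: compare lower-cased forms, keep original tokens.
    stops = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stops]
```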

StringIndexer

StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. The unseen labels will be put at index numLabels if user chooses to keep them. If the input column is numeric, we cast it to string and index the string values.

To configure the StringIndexer transformation, select the StringIndexer algorithm on the transformations tab. Enter values in the following parameters:

Input Columns: Name of the input column over which the StringIndexer transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the label indices for the respective input column.
Handle Invalid: How invalid records are handled. The two options available are Skip and Error. Skip drops that particular record and Error raises an exception when an invalid record is input to the transformation.
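The frequency-ordered indexing can be sketched in plain Python (illustrative only; the function name and the alphabetical tie-break are assumptions):

```python
from collections import Counter

def fit_string_indexer(labels):
    """Assign index 0 to the most frequent label, 1 to the next most
    frequent, and so on; ties are broken alphabetically here."""
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}
```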

Tokenizer

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.

Input Columns: Name of the input column over which the Tokenizer transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the tokens for the respective input column.
Pattern: Regex pattern used to match delimiters (if Gaps is true) or tokens (if Gaps is false).
Gaps: Indicates whether the regex splits on gaps (true) or matches tokens (false).
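The interplay of Pattern and Gaps can be sketched in plain Python (illustrative only; the function name and the lower-casing default are assumptions):

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True, to_lowercase=True):
    """Split text on the pattern when gaps=True, or collect the
    pattern's matches as tokens when gaps=False."""
    if to_lowercase:
        text = text.lower()
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    return [t for t in tokens if t]  # drop empty strings
```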

VectorAssembler

VectorAssembler is a transformer that combines a given list of columns into a single vector column.

Input Columns: Input column names over which the VectorAssembler transformation is to be applied.
Output Column: Name of the output column, which will contain the assembled vector.
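The concatenation of scalar and vector-valued columns can be sketched in plain Python (illustrative only; the function name is made up, and a row is modelled as a dict):

```python
def assemble(row, input_cols):
    """Concatenate the listed columns of a row into one flat vector.
    Vector-valued columns are spliced in; scalars are appended."""
    out = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            out.extend(value)
        else:
            out.append(value)
    return out
```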

VectorIndexer

VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices.

To configure the VectorIndexer transformation, select the VectorIndexer algorithm on the transformations tab. Enter values in the following parameters:

Input Columns: Name of the input column over which the VectorIndexer transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the indexed vector for the respective input column.
Max Categories: Threshold for the number of values a categorical feature can take. If a feature is found to have more than maxCategories distinct values, it is declared continuous. Must be greater than or equal to 2. Default: 20
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”

Word2Vec

Word2Vec maps each word to a unique fixed-size vector.

To configure the Word2Vec transformation, select the Word2Vec algorithm on the transformations tab. Enter values in the following parameters:

Input Columns: Name of the input column over which the Word2Vec transformation is to be applied.
Output Column: Name of the output columns. Each output column will contain the word vector for the respective input column.
Vector Size: The dimension of the vector that words are transformed into. Default: 100
Window Size: The window size (context words from [-window, window]). Default: 5
Step Size: The step size (learning rate) used during training.
Min Count: The minimum number of times a token must appear to be included in the Word2Vec model’s vocabulary. Default: 5
Max Iteration: The maximum number of training iterations.
Max Sentence Length: The maximum sentence length.
Output Size Hint: Size of the output Vector that will be generated after the transformation is applied.
Handle Invalid: How invalid vectors are handled. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size and keep all rows). Default: “error”