Model Pre-Processing
In Pre-Processing, the data is transformed or consolidated so that the resulting mining process is more efficient and the patterns found are easier to understand.
Once features are selected on the Feature Selection tab, you can apply various transformations on the Pre-Processing tab.
All ML models require the feature column to be of the Vector data type. Use the Pre-Processing transformations to convert raw input fields into Vector type.
The following are descriptions of all the transformations/algorithms supported by Gathr across the various analytics processors.
Binarizer
Binarizer thresholds numerical features to binary (0/1) features. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
Enter values in the following parameters:
Field Name | Description |
---|---|
Input Columns | Input column name over which Binarizer transformation is to be applied. Note: If you wish to apply algorithm on multiple columns, apply Vector Assembler transformation before the algorithm. |
Output Column | Name of the output column which will contain the transformed values after Binarizer transformation is applied. |
Threshold | Threshold value to be used for binarization. Features greater than the threshold will be binarized to 1.0; features equal to or less than the threshold will be binarized to 0.0. Default: 0.0 |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
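The thresholding rule can be sketched in a few lines of plain Python (an illustration of the logic only, not Gathr's implementation; the `binarize` helper is hypothetical):

```python
def binarize(values, threshold=0.0):
    """Map each numeric feature to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]
```

Note that a value exactly equal to the threshold maps to 0.0.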
Bucketizer
Bucketizer transforms a column of continuous features to a column of feature buckets.
To configure the Bucketizer transformation, select Bucketizer as the algorithm.
Enter values in the following parameters:
Field Name | Description |
---|---|
Input Columns | Input column name over which Bucketizer transformation is to be applied. Note: If you wish to apply algorithm on multiple columns, apply Vector Assembler transformation before the algorithm. |
Output Column | Name of the output column which will contain the transformed values after Bucketizer transformation is applied. |
Splits | Splits are used to define buckets. With n+1 splits, there are n buckets. Splits should be strictly increasing. Use –inf for negative infinity and +inf for positive infinity. |
Handle Invalid | With this parameter, one can decide what to do with invalid records. The three options available are Keep/Skip/Error. Keep will retain invalid values by placing them in a special additional bucket, Skip will skip that particular record, and Error will raise an exception if an invalid record is input to the transformation. |
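The bucketing rule described above can be sketched in plain Python (an illustration of the logic only; `bucketize` and its return conventions are hypothetical, with `None` standing in for a skipped row):

```python
import bisect
import math

def bucketize(value, splits, handle_invalid="error"):
    """Return the bucket index for value given strictly increasing splits.

    With n+1 splits there are n buckets; bucket i covers [splits[i], splits[i+1])
    except the last bucket, which also includes the upper bound.
    """
    invalid = value is None or math.isnan(value) or value < splits[0] or value > splits[-1]
    if invalid:
        if handle_invalid == "error":
            raise ValueError(f"value {value!r} falls outside the splits")
        if handle_invalid == "skip":
            return None            # caller drops this row
        return len(splits) - 1     # "keep": special extra bucket for invalid values
    if value == splits[-1]:
        return len(splits) - 2     # upper bound belongs to the last regular bucket
    return bisect.bisect_right(splits, value) - 1
```

Use `float("-inf")` and `float("inf")` as the first and last splits to cover an unbounded range.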
CountVectorizer
CountVectorizer converts a collection of text documents to vectors of token counts.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which CountVectorizer transformation is to be applied |
Output Column | Name of the output columns. Each output column will contain CountVector for the respective input column. |
Vocabulary Size | Max size of the vocabulary. CountVectorizer will build a vocabulary that only considers the top vocabSize terms ordered by term frequency across the corpus. |
Minimum Document Frequency | Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, it specifies the number of documents the term must appear in; if it is a double in [0,1), it specifies the fraction of documents. |
Minimum Term Frequency | Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer >=1, then this specifies a count (of times the term must appear in the document) and if this is a double in [0,1), then this specifies a fraction (out of the document’s token count). |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
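The fit-then-transform behavior can be sketched in plain Python (an illustration only, not Gathr's implementation; the helper names are hypothetical, and the fractional form of minimum document frequency is omitted for brevity):

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=1 << 18, min_df=1):
    """Build a vocabulary of at most vocab_size terms, ordered by corpus-wide
    term frequency, keeping only terms appearing in at least min_df documents."""
    term_freq = Counter()
    doc_freq = Counter()
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))   # each document counts a term once
    terms = [t for t in term_freq if doc_freq[t] >= min_df]
    terms.sort(key=lambda t: -term_freq[t])
    return {t: i for i, t in enumerate(terms[:vocab_size])}

def count_vector(doc, vocab):
    """Dense vector of token counts for one document; out-of-vocabulary
    tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok in doc:
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec
```

In practice the output is a sparse vector, but the counting logic is the same.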
Feature Hasher
Feature Hasher projects a set of categorical or numerical features into a feature vector of specified dimension. This is done using a hashing trick to map features to indices in the feature vector. Null (missing) values are ignored (implicitly zero in the resulting feature vector).
To configure Feature Hasher, select the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Feature Hasher transformation is to be applied |
Output Column | Name of the output column which will contain the transformed feature vector. |
Num Features | Number of features. Should be greater than 0. Default: 2^18 (262,144) |
Categorical Columns | Numeric columns to treat as categorical features. By default only string and boolean columns are treated as categorical, so this param can be used to explicitly specify the numerical columns to treat as categorical. Note, the relevant columns must also be set in inputCols. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
HashingTF
HashingTF is a transformer which takes sets of terms and converts those sets into fixed-length feature vectors.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which HashingTF transformation is to be applied |
Output Column | Name of the output column, which will contain the fixed-length term-frequency vector for the respective input column. |
Number of Features | Number of features. Should be > 0. Default: 2^18 (262,144) |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
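The hashing trick behind HashingTF can be sketched in plain Python (an illustration only; Python's built-in `hash()` stands in for Spark's MurmurHash3, so indices differ between runs unless `PYTHONHASHSEED` is fixed):

```python
def hashing_tf(tokens, num_features=1 << 18):
    """Fixed-length term-frequency vector: each token is hashed to an index
    modulo num_features and its count is accumulated there. No vocabulary is
    stored, at the cost of possible hash collisions between distinct terms."""
    vec = [0.0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1.0
    return vec
```

Larger `num_features` values reduce the chance of collisions.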
IDF
The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which idf transformation is to be applied. |
Output Column | Name of the output column, which will contain the TF-IDF-scaled vector for the respective input column. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
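The down-weighting can be sketched in plain Python (an illustration only, using the smoothed formula log((m + 1) / (df + 1)) that Spark's IDF uses; the helper names are hypothetical):

```python
import math

def idf_weights(doc_vectors):
    """IDF weight per column: log((m + 1) / (df + 1)), where m is the number
    of documents and df is the number of documents with a nonzero entry in
    that column. Columns appearing in every document get weight 0."""
    m = len(doc_vectors)
    n = len(doc_vectors[0])
    df = [sum(1 for d in doc_vectors if d[j] != 0) for j in range(n)]
    return [math.log((m + 1) / (df[j] + 1)) for j in range(n)]

def apply_idf(vec, weights):
    """Scale a term-frequency vector column-wise by the fitted IDF weights."""
    return [v * w for v, w in zip(vec, weights)]
```

A term present in every document thus contributes nothing after scaling, which is exactly the down-weighting of frequent columns described above.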
Imputer
The Imputer transformer completes missing values in a dataset, either using the mean or the median of the columns in which the missing values are located. Imputer doesn’t support categorical features. By default, all null values in the input columns are treated as missing, and so are also imputed. The input columns should be of decimal type.
Field Name | Description |
---|---|
Input Columns | Input column name over which Imputer transformation is to be applied. You can select multiple Input Columns. |
Output Column | Name of the output columns. In each output column, missing values will be replaced by the surrogate value for the relevant column. Note: You can select multiple Output Columns too; input and output columns are mapped in order, the first input column to the first output column, and so on. |
Strategy | The imputation strategy. Available options are “mean” and “median”. With “mean”, all occurrences of missing values are replaced with the mean value of the column; with “median”, they are replaced with the approximate median value of the column. Default is “mean” |
Missing Value | The placeholder for the missing values. All occurrences of missingValue will be imputed. Note that null values are always treated as missing. |
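The imputation logic can be sketched in plain Python (an illustration only; the `impute` helper is hypothetical, and `None` stands in for a null, with an optional extra placeholder value):

```python
import statistics

def impute(rows, strategy="mean", missing_value=None):
    """Replace missing entries in each column with that column's mean or
    median, computed over the non-missing values. None is always treated
    as missing; missing_value may name an additional placeholder."""
    cols = list(zip(*rows))
    surrogates = []
    for col in cols:
        observed = [v for v in col if v is not None and v != missing_value]
        surrogates.append(statistics.mean(observed) if strategy == "mean"
                          else statistics.median(observed))
    return [[surrogates[j] if (v is None or v == missing_value) else v
             for j, v in enumerate(row)]
            for row in rows]
```

Each column's surrogate is computed independently, so columns with different scales are imputed correctly.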
MaxAbsScaler
MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to range [-1, 1] by dividing through the maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.
MaxAbsScaler computes summary statistics on a data set and produces a MaxAbsScalerModel. The model can then transform each feature individually to range [-1, 1].
Field Name | Description |
---|---|
Input Columns | Name of the input column over which MaxAbsScaler transformation is to be applied |
Output Column | Name of the output column, which will contain the rescaled vector for the respective input column. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
MinMaxScaler
MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (specified by parameter-min/max).
Field Name | Description |
---|---|
Input Columns | Name of the input column over which MinMaxScaler transformation is to be applied |
Output Column | Name of the output column, which will contain the rescaled vector for the respective input column. |
Min Value | Lower bound after transformation, shared by all features. |
Max Value | Upper bound after transformation, shared by all features. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
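The per-feature rescaling can be sketched in plain Python (an illustration only; the `min_max_scale` helper is hypothetical, and a constant column is mapped to the midpoint of the target range, matching the usual convention):

```python
def min_max_scale(vectors, min_out=0.0, max_out=1.0):
    """Rescale each feature column to [min_out, max_out] using the column's
    observed minimum and maximum over the dataset."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    out = []
    for vec in vectors:
        row = []
        for j, v in enumerate(vec):
            if hi[j] == lo[j]:
                row.append((min_out + max_out) / 2)  # constant feature: midpoint
            else:
                row.append((v - lo[j]) / (hi[j] - lo[j]) * (max_out - min_out) + min_out)
        out.append(row)
    return out
```

Because the scaling uses the observed min and max, a single outlier can compress the rest of the feature into a narrow band.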
NGram
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which NGram transformation is to be applied. |
Output Column | Name of the output column, which will contain the sequence of n-grams for the respective input column. |
N-Gram Param | Minimum n-gram length, >= 1. Default value is 2. |
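The sliding-window construction can be sketched in plain Python (an illustration only; the `ngrams` helper is hypothetical):

```python
def ngrams(tokens, n=2):
    """Sequence of n-grams, each a space-joined run of n consecutive tokens.
    An input shorter than n tokens yields an empty list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```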
Normalizer
Normalizer is a transformer that transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes parameter p, which specifies the p-norm used for normalization. (p=2 by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Normalizer transformation is to be applied |
Output Column | Name of the output column, which will contain the normalized vector for the respective input column. |
Norm | Normalizes a vector to have unit norm using the given p-norm. The p-norm value is given by this Norm parameter. Default: 2 |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
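The p-norm scaling can be sketched in plain Python (an illustration only; the `normalize` helper is hypothetical, with a zero vector returned unchanged to avoid division by zero):

```python
def normalize(vec, p=2.0):
    """Scale vec to unit p-norm. p=float('inf') uses the max-absolute-value
    norm; a zero vector is returned unchanged."""
    if p == float("inf"):
        norm = max(abs(v) for v in vec)
    else:
        norm = sum(abs(v) ** p for v in vec) ** (1.0 / p)
    return [v / norm for v in vec] if norm != 0 else list(vec)
```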
OneHotEncoder
One-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
Field Name | Description |
---|---|
Input Columns | Input column name over which One-Hot Encoder transformation is to be applied. You can add multiple input columns. |
Output Column | Name of the output columns. Each output column will contain the one-hot-encoded vector for the respective input column. Note: You can select multiple Output Columns too; input and output columns are mapped in order, the first input column to the first output column, and so on. |
Drop Last | Whether to drop the last category in the encoded vector. Default value is true |
Handle Invalid | Parameter for handling invalid values encountered during the transformation. Available options are “keep” (invalid data presented as an extra categorical feature) or “error” (throw an error). Default is “error”. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
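The encoding of a single label index, including the Drop Last behavior, can be sketched in plain Python (an illustration only; the `one_hot` helper is hypothetical):

```python
def one_hot(index, num_categories, drop_last=True):
    """Binary vector for a label index. With drop_last, the output has
    num_categories - 1 slots and the final category is encoded as the
    all-zero vector, avoiding a linearly dependent column."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec
```

Dropping the last category is the default because the full set of indicator columns always sums to one, which is redundant for linear models.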
PCA
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which PCA transformation is to be applied. |
Output Column | Name of the output column, which will contain the projected principal-component vector for the respective input column. |
Number of Principal Components | Number of principal components. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied. |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error”. |
StandardScaler
StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which StandardScaler transformation is to be applied. |
Output Column | Name of the output column, which will contain the standardized vector for the respective input column. |
With Std Dev | Whether to scale the data to unit standard deviation or not. |
With Mean | Whether to center the data with mean before scaling or not. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
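The effect of the With Mean and With Std Dev flags can be sketched in plain Python (an illustration only; the `standard_scale` helper is hypothetical and uses the sample standard deviation):

```python
import statistics

def standard_scale(vectors, with_mean=False, with_std=True):
    """Per-column standardization: optionally subtract the column mean,
    then optionally divide by the column's sample standard deviation."""
    cols = list(zip(*vectors))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) if len(c) > 1 else 1.0 for c in cols]
    out = []
    for vec in vectors:
        row = []
        for j, v in enumerate(vec):
            if with_mean:
                v -= means[j]
            if with_std and stds[j] != 0:
                v /= stds[j]
            row.append(v)
        out.append(row)
    return out
```

Centering with the mean produces dense output, so it is typically disabled for sparse data.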
StopWordsRemover
Stop words are words which should be excluded from the input, characteristically because the words appear frequently and do not carry much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of StopWords is specified by the StopWords parameter.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which StopWordsRemover transformation is to be applied. |
Output Column | Name of the output column, which will contain the input sequence with the stop words removed. |
Load Default Stop Words | When you check this checkbox, you are asked for the language whose default stop words the StopWordsRemover should remove. Options include English, French, Spanish, etc. If you do not enable this option, another field is available in which to provide the stop words. |
Language | The language of the stop words should be selected. |
Case Sensitive | Whether stop words are case sensitive or not |
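The filtering, including the Case Sensitive option, can be sketched in plain Python (an illustration only; the `remove_stop_words` helper is hypothetical):

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token found in stop_words; comparison is lowercased
    unless case_sensitive is set."""
    if case_sensitive:
        stop = set(stop_words)
        return [t for t in tokens if t not in stop]
    stop = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop]
```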
StringIndexer
StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. The unseen labels will be put at index numLabels if user chooses to keep them. If the input column is numeric, we cast it to string and index the string values.
To configure the StringIndexer transformation, select StringIndexer as the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which StringIndexer transformation is to be applied |
Output Column | Name of the output column, which will contain the label indices for the respective input column. |
Handle Invalid | With this parameter, one can decide what to do with invalid records. The two options available are Skip and Error: Skip will skip that particular record, and Error will raise an exception if an invalid record is input to the transformation. |
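The frequency-ordered indexing can be sketched in plain Python (an illustration only; the helper names are hypothetical, and the tie-breaking rule between equally frequent labels is an assumption of this sketch):

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each distinct label to an index in [0, numLabels), most frequent
    label first; ties are broken alphabetically in this sketch."""
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda l: (-freq[l], l))
    return {label: float(i) for i, label in enumerate(ordered)}

def index_labels(labels, mapping, handle_invalid="error"):
    """Transform labels to indices; unseen labels are skipped or raise."""
    out = []
    for label in labels:
        if label in mapping:
            out.append(mapping[label])
        elif handle_invalid == "skip":
            continue
        else:
            raise ValueError(f"unseen label: {label!r}")
    return out
```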
Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Tokenizer transformation is to be applied. |
Output Column | Name of the output column, which will contain the sequence of tokens for the respective input column. |
Pattern | Regex pattern used to match delimiters (if Gaps is true) or tokens (if Gaps is false). |
Gaps | Indicates whether regex splits on gaps (true) or matches tokens (false). |
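The interplay of the Pattern and Gaps parameters can be sketched in plain Python (an illustration only; the `regex_tokenize` helper and its lowercasing default are hypothetical):

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True, to_lowercase=True):
    """Split text into tokens. With gaps=True the pattern matches the
    delimiters (the text is split on it); with gaps=False the pattern
    matches the tokens themselves."""
    if to_lowercase:
        text = text.lower()
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    return [t for t in tokens if t]   # drop empty strings at the edges
```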
VectorAssembler
VectorAssembler is a transformer that combines a given list of columns into a single vector column.
Field Name | Description |
---|---|
Input Columns | Input column name over which VectorAssembler transformation is to be applied |
Output Column | Name of the output column, which will contain the assembled vector combining all input columns. |
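The concatenation can be sketched in plain Python (an illustration only; the `assemble` helper is hypothetical, with a dict standing in for a row and a list standing in for a vector column):

```python
def assemble(row, input_cols):
    """Concatenate the named columns of a row into one flat feature vector;
    scalar values and list/vector values are both accepted."""
    vec = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            vec.extend(float(x) for x in value)   # flatten vector columns
        else:
            vec.append(float(value))              # promote scalars
    return vec
```

This is the transformation to apply before any algorithm that accepts only a single vector-typed feature column.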
VectorIndexer
VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices.
To configure the VectorIndexer transformation, select VectorIndexer as the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which VectorIndexer transformation is to be applied. |
Output Column | Name of the output column, which will contain the indexed vector for the respective input column. |
Max Categories | Threshold for the number of values a categorical feature can take. If a feature is found to have more than maxCategories distinct values, it is declared continuous. Must be greater than or equal to 2. Default: 20 |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied. |
Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error”. |
Word2Vec
Word2Vec maps each word to a unique fixed-size vector.
To configure the Word2Vec transformation, select Word2Vec as the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Word2Vec transformation is to be applied. |
Output Column | Name of the output column, which will contain the learned vector representation for the respective input column. |
Vector Size | The dimension of the vector that each word is transformed into. Default value is 100. |
Window Size | The window size (context words from [-window, window]). Default value is 5. |
Step Size | The step size (learning rate) used for each iteration of optimization. |
Min Count | The minimum number of times a token must appear to be included in the Word2Vec model’s vocabulary. Default value is 5. |
Max Iteration | The maximum number of iterations. |
Max Sentence Length | The maximum sentence length; any sentence longer than this threshold is divided into chunks of up to this size. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied. |
Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error”. |
If you have any feedback on Gathr documentation, please email us!