Model Pre-Processing
In Pre-Processing, the data is transformed or consolidated so that the resulting mining process is more efficient and the patterns found are easier to understand.
Once features are selected on the Feature Selection tab, you can apply various transformations on the Pre-Processing tab.
All ML models require the feature column to be of the Vector data type. Use the Pre-Processing transformations to convert raw input fields into Vector type.
The following are descriptions of all the transformations/algorithms supported by Gathr across the various analytics processors.
Binarizer
Binarizer thresholds numerical features to binary (0/1) features. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
Enter values in the following parameters:
Field Name | Description |
---|---|
Input Columns | Input column name over which Binarizer transformation is to be applied. Note: If you wish to apply algorithm on multiple columns, apply Vector Assembler transformation before the algorithm. |
Output Column | Name of the output column which will contain the transformed values after Binarizer transformation is applied. |
Threshold | Threshold value to be used for binarization. Features greater than the threshold will be binarized to 1.0; features equal to or less than the threshold will be binarized to 0.0. Default: 0.0 |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
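The thresholding rule can be sketched in a few lines of plain Python (an illustration of the logic only, not Gathr's implementation; the `binarize` helper is hypothetical):

```python
def binarize(values, threshold=0.0):
    """Map each numeric feature to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]
```

Note that a value exactly equal to the threshold maps to 0.0.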
Bucketizer
Bucketizer transforms a column of continuous features to a column of feature buckets.
To configure the Bucketizer transformation, select Bucketizer as the algorithm.
Enter values in the following parameters:
Field Name | Description |
---|---|
Input Columns | Input column name over which Bucketizer transformation is to be applied. Note: If you wish to apply algorithm on multiple columns, apply Vector Assembler transformation before the algorithm. |
Output Column | Name of the output column which will contain the transformed values after Bucketizer transformation is applied. |
Splits | Splits are used to define buckets. With n+1 splits, there are n buckets. Splits should be strictly increasing. Use –inf for negative infinity and +inf for positive infinity. |
Handle Invalid | With this parameter, one can decide what to do with invalid records. The three options available are Keep/Skip/Error. Keep will retain invalid values by placing them in a special additional bucket, Skip will skip that particular record, and Error will raise an exception if an invalid record is input to the transformation. |
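The bucketing rule described above can be sketched in plain Python (an illustration of the logic only; `bucketize` and its return conventions are hypothetical, with `None` standing in for a skipped row):

```python
import bisect
import math

def bucketize(value, splits, handle_invalid="error"):
    """Return the bucket index for value given strictly increasing splits.

    With n+1 splits there are n buckets; bucket i covers [splits[i], splits[i+1])
    except the last bucket, which also includes the upper bound.
    """
    invalid = value is None or math.isnan(value) or value < splits[0] or value > splits[-1]
    if invalid:
        if handle_invalid == "error":
            raise ValueError(f"value {value!r} falls outside the splits")
        if handle_invalid == "skip":
            return None            # caller drops this row
        return len(splits) - 1     # "keep": special extra bucket for invalid values
    if value == splits[-1]:
        return len(splits) - 2     # upper bound belongs to the last regular bucket
    return bisect.bisect_right(splits, value) - 1
```

Use `float("-inf")` and `float("inf")` as the first and last splits to cover an unbounded range.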
CountVectorizer
CountVectorizer converts a collection of text documents to vectors of token counts.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which CountVectorizer transformation is to be applied |
Output Column | Name of the output columns. Each output column will contain CountVector for the respective input column. |
Vocabulary Size | Max size of the vocabulary. CountVectorizer will build a vocabulary that only considers the top vocabSize terms ordered by term frequency across the corpus. |
Minimum Document Frequency | Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, it specifies the number of documents the term must appear in; if it is a double in [0,1), it specifies the fraction of documents. |
Minimum Term Frequency | Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer >=1, then this specifies a count (of times the term must appear in the document) and if this is a double in [0,1), then this specifies a fraction (out of the document’s token count). |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
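The fit-then-transform behavior can be sketched in plain Python (an illustration only, not Gathr's implementation; the helper names are hypothetical, and the fractional form of minimum document frequency is omitted for brevity):

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=1 << 18, min_df=1):
    """Build a vocabulary of at most vocab_size terms, ordered by corpus-wide
    term frequency, keeping only terms appearing in at least min_df documents."""
    term_freq = Counter()
    doc_freq = Counter()
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))   # each document counts a term once
    terms = [t for t in term_freq if doc_freq[t] >= min_df]
    terms.sort(key=lambda t: -term_freq[t])
    return {t: i for i, t in enumerate(terms[:vocab_size])}

def count_vector(doc, vocab):
    """Dense vector of token counts for one document; out-of-vocabulary
    tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok in doc:
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec
```

In practice the output is a sparse vector, but the counting logic is the same.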
Feature Hasher
Feature Hasher projects a set of categorical or numerical features into a feature vector of specified dimension. This is done using a hashing trick to map features to indices in the feature vector. Null (missing) values are ignored (implicitly zero in the resulting feature vector).
To configure Feature Hasher, select the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Feature Hasher transformation is to be applied |
Output Column | Name of the output column which will contain the transformed feature vector. |
Num Features | Number of features. Should be greater than 0. Default: 2^18 (262,144) |
Categorical Columns | Numeric columns to treat as categorical features. By default only string and boolean columns are treated as categorical, so this param can be used to explicitly specify the numerical columns to treat as categorical. Note, the relevant columns must also be set in inputCols. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
HashingTF
HashingTF is a transformer which takes sets of terms and converts those sets into fixed-length feature vectors.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which HashingTF transformation is to be applied |
Output Column | Name of the output column, which will contain the fixed-length term-frequency vector for the respective input column. |
Number of Features | Number of features. Should be > 0. Default: 2^18 (262,144) |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
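The hashing trick behind HashingTF can be sketched in plain Python (an illustration only; Python's built-in `hash()` stands in for Spark's MurmurHash3, so indices differ between runs unless `PYTHONHASHSEED` is fixed):

```python
def hashing_tf(tokens, num_features=1 << 18):
    """Fixed-length term-frequency vector: each token is hashed to an index
    modulo num_features and its count is accumulated there. No vocabulary is
    stored, at the cost of possible hash collisions between distinct terms."""
    vec = [0.0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1.0
    return vec
```

Larger `num_features` values reduce the chance of collisions.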
IDF
The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which idf transformation is to be applied. |
Output Column | Name of the output column, which will contain the TF-IDF-scaled vector for the respective input column. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
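The down-weighting can be sketched in plain Python (an illustration only, using the smoothed formula log((m + 1) / (df + 1)) that Spark's IDF uses; the helper names are hypothetical):

```python
import math

def idf_weights(doc_vectors):
    """IDF weight per column: log((m + 1) / (df + 1)), where m is the number
    of documents and df is the number of documents with a nonzero entry in
    that column. Columns appearing in every document get weight 0."""
    m = len(doc_vectors)
    n = len(doc_vectors[0])
    df = [sum(1 for d in doc_vectors if d[j] != 0) for j in range(n)]
    return [math.log((m + 1) / (df[j] + 1)) for j in range(n)]

def apply_idf(vec, weights):
    """Scale a term-frequency vector column-wise by the fitted IDF weights."""
    return [v * w for v, w in zip(vec, weights)]
```

A term present in every document thus contributes nothing after scaling, which is exactly the down-weighting of frequent columns described above.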
Imputer
The Imputer transformer completes missing values in a dataset, either using the mean or the median of the columns in which the missing values are located. Imputer doesn’t support categorical features. By default, all null values in the input columns are treated as missing, and so are also imputed. The input columns should be of decimal type.
Field Name | Description |
---|---|
Input Columns | Input column name over which Imputer transformation is to be applied. You can select multiple Input Columns. |
Output Column | Name of the output columns. In each output column, missing values will be replaced by the surrogate value for the relevant column. Note: You can select multiple Output Columns too; input and output columns are mapped in order, the first input column to the first output column, and so on. |
Strategy | The imputation strategy. Available options are “mean” and “median”. With “mean”, all occurrences of missing values are replaced with the mean value of the column; with “median”, they are replaced with the approximate median value of the column. Default is “mean” |
Missing Value | The placeholder for the missing values. All occurrences of missingValue will be imputed. Note that null values are always treated as missing. |
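The imputation logic can be sketched in plain Python (an illustration only; the `impute` helper is hypothetical, and `None` stands in for a null, with an optional extra placeholder value):

```python
import statistics

def impute(rows, strategy="mean", missing_value=None):
    """Replace missing entries in each column with that column's mean or
    median, computed over the non-missing values. None is always treated
    as missing; missing_value may name an additional placeholder."""
    cols = list(zip(*rows))
    surrogates = []
    for col in cols:
        observed = [v for v in col if v is not None and v != missing_value]
        surrogates.append(statistics.mean(observed) if strategy == "mean"
                          else statistics.median(observed))
    return [[surrogates[j] if (v is None or v == missing_value) else v
             for j, v in enumerate(row)]
            for row in rows]
```

Each column's surrogate is computed independently, so columns with different scales are imputed correctly.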
MaxAbsScaler
MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to range [-1, 1] by dividing through the maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.
MaxAbsScaler computes summary statistics on a data set and produces a MaxAbsScalerModel. The model can then transform each feature individually to range [-1, 1].
Field Name | Description |
---|---|
Input Columns | Name of the input column over which MaxAbsScaler transformation is to be applied |
Output Column | Name of the output column, which will contain the rescaled vector for the respective input column. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
MinMaxScaler
MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (specified by parameter-min/max).
Field Name | Description |
---|---|
Input Columns | Name of the input column over which MinMaxScaler transformation is to be applied |
Output Column | Name of the output column, which will contain the rescaled vector for the respective input column. |
Min Value | Lower bound after transformation, shared by all features. |
Max Value | Upper bound after transformation, shared by all features. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
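The per-feature rescaling can be sketched in plain Python (an illustration only; the `min_max_scale` helper is hypothetical, and a constant column is mapped to the midpoint of the target range, matching the usual convention):

```python
def min_max_scale(vectors, min_out=0.0, max_out=1.0):
    """Rescale each feature column to [min_out, max_out] using the column's
    observed minimum and maximum over the dataset."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    out = []
    for vec in vectors:
        row = []
        for j, v in enumerate(vec):
            if hi[j] == lo[j]:
                row.append((min_out + max_out) / 2)  # constant feature: midpoint
            else:
                row.append((v - lo[j]) / (hi[j] - lo[j]) * (max_out - min_out) + min_out)
        out.append(row)
    return out
```

Because the scaling uses the observed min and max, a single outlier can compress the rest of the feature into a narrow band.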
NGram
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which NGram transformation is to be applied. |
Output Column | Name of the output column, which will contain the sequence of n-grams for the respective input column. |
N-Gram Param | Minimum n-gram length, >= 1. Default value is 2. |
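The sliding-window construction can be sketched in plain Python (an illustration only; the `ngrams` helper is hypothetical):

```python
def ngrams(tokens, n=2):
    """Sequence of n-grams, each a space-joined run of n consecutive tokens.
    An input shorter than n tokens yields an empty list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```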
Normalizer
Normalizer is a transformer that transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes parameter p, which specifies the p-norm used for normalization. (p=2 by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Normalizer transformation is to be applied |
Output Column | Name of the output column, which will contain the normalized vector for the respective input column. |
Norm | Normalizes a vector to have unit norm using the given p-norm. The p-norm value is given by this Norm parameter. Default: 2 |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
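The p-norm scaling can be sketched in plain Python (an illustration only; the `normalize` helper is hypothetical, with a zero vector returned unchanged to avoid division by zero):

```python
def normalize(vec, p=2.0):
    """Scale vec to unit p-norm. p=float('inf') uses the max-absolute-value
    norm; a zero vector is returned unchanged."""
    if p == float("inf"):
        norm = max(abs(v) for v in vec)
    else:
        norm = sum(abs(v) ** p for v in vec) ** (1.0 / p)
    return [v / norm for v in vec] if norm != 0 else list(vec)
```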
OneHotEncoder
One-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
Field Name | Description |
---|---|
Input Columns | Input column name over which One-Hot Encoder transformation is to be applied. You can add multiple input columns. |
Output Column | Name of the output columns. Each output column will contain the one-hot-encoded vector for the respective input column. Note: You can select multiple Output Columns too; input and output columns are mapped in order, the first input column to the first output column, and so on. |
Drop Last | Whether to drop the last category in the encoded vector. Default value is true |
Handle Invalid | Parameter for handling invalid values encountered during the transformation. Available options are “keep” (invalid data presented as an extra categorical feature) or “error” (throw an error). Default is “error”. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
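The encoding of a single label index, including the Drop Last behavior, can be sketched in plain Python (an illustration only; the `one_hot` helper is hypothetical):

```python
def one_hot(index, num_categories, drop_last=True):
    """Binary vector for a label index. With drop_last, the output has
    num_categories - 1 slots and the final category is encoded as the
    all-zero vector, avoiding a linearly dependent column."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec
```

Dropping the last category is the default because the full set of indicator columns always sums to one, which is redundant for linear models.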
PCA
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which PCA transformation is to be applied. |
Output Column | Name of the output column, which will contain the projected principal-component vector for the respective input column. |
Number of Principal Components | Number of principal components. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied. |
Output Size Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error”. |
StandardScaler
StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which StandardScaler transformation is to be applied. |
Output Column | Name of the output column, which will contain the standardized vector for the respective input column. |
With Std Dev | Whether to scale the data to unit standard deviation or not. |
With Mean | Whether to center the data with mean before scaling or not. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied |
Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error” |
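The effect of the With Mean and With Std Dev flags can be sketched in plain Python (an illustration only; the `standard_scale` helper is hypothetical and uses the sample standard deviation):

```python
import statistics

def standard_scale(vectors, with_mean=False, with_std=True):
    """Per-column standardization: optionally subtract the column mean,
    then optionally divide by the column's sample standard deviation."""
    cols = list(zip(*vectors))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) if len(c) > 1 else 1.0 for c in cols]
    out = []
    for vec in vectors:
        row = []
        for j, v in enumerate(vec):
            if with_mean:
                v -= means[j]
            if with_std and stds[j] != 0:
                v /= stds[j]
            row.append(v)
        out.append(row)
    return out
```

Centering with the mean produces dense output, so it is typically disabled for sparse data.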
StopWordsRemover
Stop words are words which should be excluded from the input, characteristically because the words appear frequently and do not carry much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of StopWords is specified by the StopWords parameter.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which StopWordsRemover transformation is to be applied. |
Output Column | Name of the output column, which will contain the input sequence with the stop words removed. |
Load Default Stop Words | When you check this checkbox, you are asked for the language whose default stop words the StopWordsRemover should remove. Options include English, French, Spanish, etc. If you do not enable this option, another field is available in which to provide the stop words. |
Language | The language of the stop words should be selected. |
Case Sensitive | Whether stop words are case sensitive or not |
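The filtering, including the Case Sensitive option, can be sketched in plain Python (an illustration only; the `remove_stop_words` helper is hypothetical):

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token found in stop_words; comparison is lowercased
    unless case_sensitive is set."""
    if case_sensitive:
        stop = set(stop_words)
        return [t for t in tokens if t not in stop]
    stop = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop]
```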
StringIndexer
StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. The unseen labels will be put at index numLabels if user chooses to keep them. If the input column is numeric, we cast it to string and index the string values.
To configure the StringIndexer transformation, select StringIndexer as the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which StringIndexer transformation is to be applied |
Output Column | Name of the output column, which will contain the label indices for the respective input column. |
Handle Invalid | With this parameter, one can decide what to do with invalid records. The two options available are Skip and Error: Skip will skip that particular record, and Error will raise an exception if an invalid record is input to the transformation. |
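The frequency-ordered indexing can be sketched in plain Python (an illustration only; the helper names are hypothetical, and the tie-breaking rule between equally frequent labels is an assumption of this sketch):

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each distinct label to an index in [0, numLabels), most frequent
    label first; ties are broken alphabetically in this sketch."""
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda l: (-freq[l], l))
    return {label: float(i) for i, label in enumerate(ordered)}

def index_labels(labels, mapping, handle_invalid="error"):
    """Transform labels to indices; unseen labels are skipped or raise."""
    out = []
    for label in labels:
        if label in mapping:
            out.append(mapping[label])
        elif handle_invalid == "skip":
            continue
        else:
            raise ValueError(f"unseen label: {label!r}")
    return out
```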
Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Tokenizer transformation is to be applied. |
Output Column | Name of the output column, which will contain the sequence of tokens for the respective input column. |
Pattern | Regex pattern used to match delimiters (if Gaps is true) or tokens (if Gaps is false). |
Gaps | Indicates whether regex splits on gaps (true) or matches tokens (false). |
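The interplay of the Pattern and Gaps parameters can be sketched in plain Python (an illustration only; the `regex_tokenize` helper and its lowercasing default are hypothetical):

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True, to_lowercase=True):
    """Split text into tokens. With gaps=True the pattern matches the
    delimiters (the text is split on it); with gaps=False the pattern
    matches the tokens themselves."""
    if to_lowercase:
        text = text.lower()
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    return [t for t in tokens if t]   # drop empty strings at the edges
```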
VectorAssembler
VectorAssembler is a transformer that combines a given list of columns into a single vector column.
Field Name | Description |
---|---|
Input Columns | Input column name over which VectorAssembler transformation is to be applied |
Output Column | Name of the output column, which will contain the assembled vector combining all input columns. |
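The concatenation can be sketched in plain Python (an illustration only; the `assemble` helper is hypothetical, with a dict standing in for a row and a list standing in for a vector column):

```python
def assemble(row, input_cols):
    """Concatenate the named columns of a row into one flat feature vector;
    scalar values and list/vector values are both accepted."""
    vec = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            vec.extend(float(x) for x in value)   # flatten vector columns
        else:
            vec.append(float(value))              # promote scalars
    return vec
```

This is the transformation to apply before any algorithm that accepts only a single vector-typed feature column.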
VectorIndexer
VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices.
To configure the VectorIndexer transformation, select VectorIndexer as the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which VectorIndexer transformation is to be applied. |
Output Column | Name of the output column, which will contain the indexed vector for the respective input column. |
Max Categories | Threshold for the number of values a categorical feature can take. If a feature is found to have more than maxCategories distinct values, it is declared continuous. Must be greater than or equal to 2. Default: 20 |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied. |
Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error”. |
Word2Vec
Word2Vec maps each word to a unique fixed-size vector.
To configure the Word2Vec transformation, select Word2Vec as the algorithm on the transformations tab. It asks for various configuration fields, described below:
Field Name | Description |
---|---|
Input Columns | Name of the input column over which Word2Vec transformation is to be applied. |
Output Column | Name of the output column, which will contain the learned vector representation for the respective input column. |
Vector Size | The dimension of the vector that each word is transformed into. Default value is 100. |
Window Size | The window size (context words from [-window, window]). Default value is 5. |
Step Size | The step size (learning rate) used for each iteration of optimization. |
Min Count | The minimum number of times a token must appear to be included in the Word2Vec model’s vocabulary. Default value is 5. |
Max Iteration | The maximum number of iterations. |
Max Sentence Length | The maximum sentence length; any sentence longer than this threshold is divided into chunks of up to this size. |
Output Size Hint | Mention the size of the output Vector which will be generated after transformation is applied. |
Handle Invalid | Parameter for handling invalid vectors. Invalid vectors include nulls and vectors with the wrong size. The options are “skip” (filter out rows with invalid vectors), “error” (throw an error) and “optimistic” (do not check the vector size, and keep all rows). Default is “error”. |
If you have any feedback on Gathr documentation, please email us!