Random Forest Trees Algorithm
In this article
Random forests are ensembles of decision trees. They combine many decision trees to reduce the risk of overfitting. Random forests can be used for binary and multi-class classification and for regression, using both continuous and categorical features.
The Random Forest Trees Analytics processor analyzes data using Spark ML's RandomForestClassificationModel and RandomForestRegressionModel.
To use a Random Forest Trees model in Data Pipeline, drag and drop the model component onto the pipeline canvas and right-click it to configure.
The Configuration Section → of every ML model is identical.
After the Configuration tab comes the Feature Selection → tab, which is identical for all the models except K Means.
Once Feature Selection is done, perform Pre-Processing → on the data before feeding it to the model. These configuration settings are identical for all the ML models.
Then configure the Model using Model Configuration.
Model Configuration
Label Column: Column name that will be treated as the label column while training the model.
Probability Column: Column name that holds the probabilities of the predicted output.
Prediction Column: Set the column to be predicted. The value of Prediction Column must be set to “prediction” in order to deploy the model as a REST service.
Feature Column: Column name that will be treated as the feature column while training the model.
Max Bins: Maximum number of bins used when discretizing continuous features. Specify the value of the max bins parameter for model training.
Max Depth: Specify the depth of the tree to be trained. This should be chosen carefully, as it acts as a stopping criterion for model training.
Impurity: Parameter that decides the splitting criterion at each node.
Available options are Gini Impurity and Entropy for classification and Variance for regression problems.
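The three impurity measures can be sketched in plain Python. This is an illustrative sketch, not the processor's internal code; the function names `gini`, `entropy`, and `variance` are chosen here for clarity.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy: -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def variance(values):
    """Variance of the target values, the impurity measure for regression."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# A pure node (one class only) has zero impurity; a 50/50 node is maximal.
print(gini([5, 5]))     # 0.5
print(entropy([5, 5]))  # 1.0
print(gini([10, 0]))    # 0.0
```

Both classification measures are zero for a pure node and largest for an even class mix; the tree-building procedure prefers splits that lower them the most.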
Minimum Information Gain: The minimum improvement in impurity (as measured by the Impurity parameter) required for a split.
The information gained by splitting a node on a feature must exceed this value for the tree to be split on that feature at that node.
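The gain check above can be made concrete with a small sketch (illustrative only, using Gini impurity): the gain of a split is the parent's impurity minus the weighted impurity of the children.

```python
def impurity(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def information_gain(parent, left, right):
    """Impurity decrease from splitting parent counts into left/right children."""
    n = sum(parent)
    w_left, w_right = sum(left) / n, sum(right) / n
    return impurity(parent) - (w_left * impurity(left) + w_right * impurity(right))

# Splitting a mixed node [6, 4] into two pure children yields a positive gain;
# the split is kept only if this gain exceeds Minimum Information Gain.
print(information_gain([6, 4], [6, 0], [0, 4]))  # 0.48
```

A split that leaves both children with the same class mix as the parent has zero gain and would be rejected by any positive Minimum Information Gain value.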
Seed: Number used to seed the random number sequence, making the result of the algorithm reproducible.
Specify the value of the seed parameter to be used for model training.
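The effect of the seed can be illustrated with Python's standard `random` module: the same seed drives the same sequence of random draws, so the random choices made during training (row samples, feature subsets) repeat exactly. The helper below is a hypothetical sketch, not part of the product.

```python
import random

def sample_rows(n_rows, seed):
    """Draw a small random row sample; the same seed gives the same sample."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(5)]

run1 = sample_rows(100, seed=42)
run2 = sample_rows(100, seed=42)
print(run1 == run2)  # True: identical seed, identical draws, reproducible model
```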
Thresholds: Specify the threshold parameter for the class ranges. The number of thresholds must equal the number of output classes.
Required only in the case of classification problems.
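Spark ML applies per-class thresholds by predicting the class with the largest ratio of predicted probability to its threshold; a minimal sketch of that rule (the function name is illustrative):

```python
def predict_with_thresholds(probabilities, thresholds):
    """Pick the class with the largest probability/threshold ratio,
    mirroring how Spark ML applies per-class thresholds."""
    ratios = [p / t for p, t in zip(probabilities, thresholds)]
    return ratios.index(max(ratios))

# Equal thresholds reduce to a plain argmax over the probabilities...
print(predict_with_thresholds([0.6, 0.4], [0.5, 0.5]))  # 0
# ...while a lower threshold makes its class easier to predict.
print(predict_with_thresholds([0.6, 0.4], [0.9, 0.1]))  # 1
```

Lowering a class's threshold biases predictions toward that class, which is useful when one class is rare or misclassifying it is costly.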
Number of Trees: Number of trees in the forest. Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy. Training time increases roughly linearly in the number of trees.
Feature Subset Strategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
Sub Sampling Rate: Size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
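The last two parameters can be sketched together: each tree trains on a bootstrap sample whose size is `subsamplingRate` times the dataset, and at each node only a subset of features is considered, with the subset size set by the Feature Subset Strategy. This pure-Python sketch is illustrative; the strategy names mirror Spark ML's `all`, `sqrt`, and `onethird` options, and the function names are assumptions.

```python
import math
import random

def bootstrap_sample(n_rows, subsampling_rate, seed):
    """Row indices used to train one tree: a fraction of the dataset,
    drawn with replacement."""
    rng = random.Random(seed)
    k = int(n_rows * subsampling_rate)
    return [rng.randrange(n_rows) for _ in range(k)]

def candidate_features(n_features, strategy, seed):
    """Features considered as split candidates at one tree node."""
    sizes = {
        "all": n_features,
        "sqrt": int(math.sqrt(n_features)),   # common choice for classification
        "onethird": max(1, n_features // 3),  # common choice for regression
    }
    rng = random.Random(seed)
    return rng.sample(range(n_features), sizes[strategy])

print(len(bootstrap_sample(1000, 0.5, seed=7)))     # 500 rows per tree
print(len(candidate_features(16, "sqrt", seed=7)))  # 4 candidate features
```

Because each tree sees a different sample and different feature candidates, the trees are decorrelated, which is what lets averaging many of them reduce variance.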
After Model Configuration, Post-Processing → is performed, and then Model Evaluation → can be carried out.
Then, apply the Hyper Parameters → on the model to tune your configuration, after which you can simply add notes and save the configuration.
If you have any feedback on Gathr documentation, please email us!