Gradient-Boosted Tree Algorithm

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs can be used for binary classification and for regression, using both continuous and categorical features.

The Gradient-Boosted Trees Analytics processor analyzes data using Spark ML's GBTClassificationModel and GBTRegressionModel.
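Under the hood, these correspond to Spark ML's GBT estimators. The following is a minimal sketch of the equivalent calls in plain Spark ML, assuming a training DataFrame named training with label and features columns (the names are illustrative, not fixed by the processor):

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.regression.GBTRegressor

// Binary classification: fit() produces a GBTClassificationModel.
val classifier = new GBTClassifier()
  .setLabelCol("label")        // assumed label column name
  .setFeaturesCol("features")  // assumed feature-vector column name
val classificationModel = classifier.fit(training)

// Regression: fit() produces a GBTRegressionModel.
val regressor = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
val regressionModel = regressor.fit(training)
```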

To use a GBT model in Data Pipeline, drag and drop the model component onto the pipeline canvas and right-click it to configure.

The Configuration Section → is identical for every ML model.

After the Configuration tab comes the Feature Selection → tab. (It is identical for all models except K Means.)

Once Feature Selection is done, perform Pre-Processing → on the data before feeding it to the model. These settings are identical for all the ML models.
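As an illustration of what such pre-processing can look like in plain Spark ML, the sketch below indexes a categorical column and assembles the inputs into the single feature vector that the GBT estimators expect. The DataFrame raw and the column names category, x1, and x2 are hypothetical:

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Encode a categorical column as numeric indices (hypothetical column names).
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// Assemble the indexed column and the numeric columns into one vector column,
// which is what the model expects as its feature column.
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIndex", "x1", "x2"))
  .setOutputCol("features")

val prepared = assembler.transform(indexer.fit(raw).transform(raw))
```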

Then configure the model itself under Model Configuration.

Model Configuration

Label Column: Name of the column to be treated as the label column while training the model.

Probability Column: Name of the column that will hold the predicted class probabilities.

Prediction Column: Name of the column that will hold the predicted values. The Prediction Column must be set to “prediction” in order to deploy the model as a REST service.

Feature Column: Name of the column to be treated as the features column while training the model.

Max Bins: Maximum number of bins used for discretizing continuous features and for determining how to split on categorical features during model training.

Max Depth: Maximum depth of the tree to be trained. Choose this carefully, as it acts as a stopping criterion for model training.

Impurity: Criterion used to decide how to split the data at each node. Available options are Gini Impurity and Entropy for classification problems and Variance for regression problems.

Minimum Information Gain: Minimum gain, computed using the Impurity criterion, that a split must provide in order to be made at a node.

The information gained by splitting on a feature at a particular node must exceed this value for the tree to split on that feature at that node.
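For intuition, the quantity being tested is the standard decision-tree information gain (the textbook definition, not something specific to this processor). For a node with N rows split into N_left and N_right:

```latex
\mathrm{gain} = \mathrm{impurity}(\mathrm{parent})
  - \frac{N_{\mathrm{left}}}{N}\,\mathrm{impurity}(\mathrm{left})
  - \frac{N_{\mathrm{right}}}{N}\,\mathrm{impurity}(\mathrm{right})
```

The split is made only when this gain exceeds the Minimum Information Gain value.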

Seed: Random seed value used for model training; fixing it makes training runs reproducible.

Thresholds: Per-class thresholds used when assigning a predicted class.

The number of thresholds must equal the number of output classes.

Specify this parameter only for classification problems.

Loss Type: Loss function that GBT tries to minimize. Supported options are “squared” (L2) and “absolute” (L1) for regression problems and “logistic” for classification problems.

Max Iterations: Number of boosting iterations used to build the ensemble of trees. The number of output trees equals the number of iterations specified. This acts as one of the stopping criteria for model training.

Sub Sampling Rate: Fraction of the original dataset used for training each tree in the ensemble. The default (1.0) is recommended, but decreasing this fraction can speed up training.

Step Size: Defines the learning rate, which determines the impact of each tree on the final outcome. GBT works by starting from an initial estimate that is updated using the output of each successive tree.

The learning rate controls the magnitude of these updates. Lower values are generally preferred, as they make the model robust to the specific characteristics of each individual tree and thus allow it to generalize well. However, lower values require a larger number of trees to model all the records, which makes training more computationally expensive.
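Taken together, these fields map directly onto Spark ML's GBT parameters. The following is a minimal sketch of an equivalent classifier configuration in plain Spark ML; all values are illustrative examples, not recommendations:

```scala
import org.apache.spark.ml.classification.GBTClassifier

val gbt = new GBTClassifier()
  .setLabelCol("label")             // Label Column
  .setFeaturesCol("features")       // Feature Column
  .setPredictionCol("prediction")   // Prediction Column ("prediction" is required for REST deployment)
  .setProbabilityCol("probability") // Probability Column
  .setMaxBins(32)                   // Max Bins
  .setMaxDepth(5)                   // Max Depth
  .setMinInfoGain(0.0)              // Minimum Information Gain
  .setSeed(42L)                     // Seed
  .setThresholds(Array(0.5, 0.5))   // Thresholds: one per output class
  .setLossType("logistic")          // Loss Type for classification
  .setMaxIter(20)                   // Max Iterations = number of output trees
  .setSubsamplingRate(1.0)          // Sub Sampling Rate
  .setStepSize(0.1)                 // Step Size (learning rate)
```

A GBTRegressor would be configured the same way, using setLossType("squared") or setLossType("absolute") and omitting the probability and thresholds parameters.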

After Model Configuration comes Post-Processing →, after which Model Evaluation → can be performed.

Then apply Hyper Parameters → to the model to tune your configuration, after which you can add notes and save the configuration.
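For reference, this kind of tuning in plain Spark ML is typically done with CrossValidator over a parameter grid, as sketched below; the grid values are illustrative, and training is an assumed DataFrame:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val gbt = new GBTClassifier().setLabelCol("label").setFeaturesCol("features")

// Try a few combinations of tree depth and learning rate (illustrative values).
val grid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(3, 5))
  .addGrid(gbt.stepSize, Array(0.05, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val bestModel = cv.fit(training).bestModel
```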
