Decision Tree Algorithm

Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision tree algorithms are easy to interpret, handle categorical features, extend to the multi-class classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

The Decision Tree Analytics processor is used to analyze data using ML’s DecisionTreeClassificationModel and DecisionTreeRegressionModel.

To use a Decision Tree Model in a Data Pipeline, drag and drop the model component onto the pipeline canvas and right-click it to configure it:

The Configuration Section → of every ML model is identical.

After the Configuration tab comes the Feature Selection → tab. (It is identical for all models except K Means.)

Once Feature Selection is done, perform Pre-Processing → on the data before feeding it to the Model. The configuration settings are identical for all the ML models.

Then configure the Model using Model Configuration.

Model Configuration

Label Column: Name of the column that is treated as the label column while training the model.

Probability Column: Name of the column that holds the probabilities of the predicted output.

Prediction Column: Name of the column that holds the predicted output. The value of Prediction Column must be set to “prediction” in order to deploy the model as a REST service.

Feature Column: Name of the column that is treated as the feature column while training the model.

Max Bins: Number of bins used when discretizing continuous features.
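To make the effect of Max Bins concrete, the following minimal sketch cuts a continuous feature into at most max_bins roughly equal-sized bins and returns the thresholds between them, which are the candidate split points a tree could test. The function name candidate_splits and the equal-frequency strategy are assumptions made for illustration; the binning the processor actually performs may differ.

```python
def candidate_splits(values, max_bins):
    """Return up to max_bins - 1 thresholds that cut the sorted values
    into roughly equal-sized bins (hypothetical helper, illustration only)."""
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for i in range(1, max_bins):
        idx = i * n // max_bins
        if idx <= 0 or idx >= n:
            continue
        low, high = ordered[idx - 1], ordered[idx]
        if low == high:
            continue  # no boundary between identical values
        threshold = (low + high) / 2  # midpoint between neighbouring values
        if not splits or threshold > splits[-1]:
            splits.append(threshold)  # skip thresholds already emitted
    return splits


ages = [18, 22, 25, 31, 35, 40, 47, 52, 60, 71]
print(candidate_splits(ages, max_bins=4))  # [23.5, 37.5, 49.5]
```

A larger Max Bins value gives the tree more candidate thresholds per feature, at the cost of more work per node.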

Max Depth: Maximum depth to which the tree may be grown during training.

This should be chosen carefully, as it acts as a stopping criterion for model training.
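As a rough guide to why Max Depth matters, the sketch below computes the upper bound on tree size for a given depth. It assumes a binary tree in which depth 0 is a single leaf and depth 1 is one split with two leaves; the helper names are hypothetical.

```python
def max_leaves(max_depth):
    # A binary tree of depth d has at most 2**d leaves.
    return 2 ** max_depth


def max_nodes(max_depth):
    # Internal nodes plus leaves: 2**(d + 1) - 1 in a full binary tree.
    return 2 ** (max_depth + 1) - 1


for depth in (1, 5, 10):
    print(depth, max_leaves(depth), max_nodes(depth))
```

Because the bound doubles with every extra level, a deep tree can memorize the training data, while a very shallow one may underfit.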

Impurity: Measure used to select the split at each node.

Available options are Gini Impurity and Entropy for classification problems, and Variance for regression problems.
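The three impurity options can be sketched with their textbook definitions. This is a minimal illustration and not the processor's own implementation; the function names and the use of base-2 logarithms for entropy are assumptions.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum(p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def variance(targets):
    """Population variance of the regression targets in a node."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


print(gini(["a", "a", "b", "b"]))      # 0.5
print(entropy(["a", "a", "b", "b"]))   # 1.0
print(variance([1.0, 2.0, 3.0, 4.0]))  # 1.25
```

All three measures are zero for a node whose rows share one label or target value, and grow as the node becomes more mixed.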

Minimum Information Gain: Minimum information gain that a split must achieve to be considered at a node.

Information gain is calculated on the basis of the Impurity parameter.
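The relationship between impurity and information gain can be sketched as follows: the gain of a split is the parent node's impurity minus the size-weighted impurity of its two children. The helper names, the choice of Gini impurity, and the "greater than or equal to" comparison are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def should_split(parent, left, right, min_info_gain):
    # Keep the split only if it improves purity by at least min_info_gain.
    return information_gain(parent, left, right) >= min_info_gain


parent = ["yes"] * 4 + ["no"] * 4
left = ["yes", "yes", "yes", "no"]
right = ["yes", "no", "no", "no"]

print(information_gain(parent, left, right))                 # 0.125
print(should_split(parent, left, right, min_info_gain=0.1))  # True
print(should_split(parent, left, right, min_info_gain=0.2))  # False
```

Raising the minimum prunes weak splits and yields a smaller tree; a value of 0 accepts any split that does not make the children less pure.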

Seed: Number used to produce a random number sequence, which makes the result of the algorithm reproducible.

Specify the value of the seed parameter to be used for model training.
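The general idea behind a seed can be shown with Python's standard random module: two generators created with the same seed produce the same sequence. The sample_rows helper is hypothetical and only illustrates the principle; it does not reproduce how the processor uses its seed internally.

```python
import random


def sample_rows(rows, k, seed):
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(rows, k)


rows = list(range(100))
first = sample_rows(rows, 5, seed=42)
second = sample_rows(rows, 5, seed=42)

print(first == second)  # True: same seed, same sample
```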

Thresholds: Threshold value for each output class.

The number of thresholds should be equal to the number of output classes.

Specify thresholds only for classification problems.
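One common way per-class thresholds are applied, and the convention used by Spark ML classifiers, is to predict the class with the largest ratio of probability to threshold. The sketch below assumes the processor follows that convention, which the text above does not confirm; the function name is hypothetical, and requiring every threshold to be strictly positive is a simplification.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest probability / threshold."""
    if len(probabilities) != len(thresholds):
        raise ValueError("need exactly one threshold per output class")
    if any(t <= 0 for t in thresholds):
        raise ValueError("thresholds must be positive in this sketch")
    ratios = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(ratios)), key=ratios.__getitem__)


probs = [0.5, 0.3, 0.2]

# Equal thresholds: the most probable class wins.
print(predict_with_thresholds(probs, [1.0, 1.0, 1.0]))  # 0

# A high threshold on class 0 makes it harder to predict.
print(predict_with_thresholds(probs, [0.8, 0.3, 0.4]))  # 1
```

Under this convention, raising a class's threshold makes that class less likely to be predicted, which is useful when one kind of error is costlier than another.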

After Model Configuration and Post-Processing → are done, Model Evaluation → can be performed.

Then apply the Hyper Parameters → to the model to tune your configuration, after which you can add notes and save the configuration.
