# Decision Forest Classification and Regression (DF)

Decision Forest (DF) classification and regression algorithms are based on an ensemble of
tree-structured classifiers, which are known as decision trees. Decision forest is built
using the general technique of bagging, a bootstrap aggregation, and a random choice of features.
For more details, see [Breiman84] and [Breiman2001].

Operation | Computational methods | Programming Interface | |||

## Mathematical formulation

Training

Given

*feature vectors of size*n

*, their non-negative observation weights and*p

*responses ,*n

Classification

- , where
is the number of classesC

Regression

the problem is to build a decision forest classification or regression model.

The library uses the following algorithmic framework for the training
stage. Let be the set of observations. Given positive
integer parameters, such as the number of trees

*, the bootstrap parameter , where*B

*is a fraction of observations used for a training of each tree in the forest, and the number of features per node*f

*, the algorithm does the following for*m

**:**b = 1, ldots, B

- Selects randomly with replacement the set of
vectors from the setN. The set is called aSset.bootstrap - Trains a decision tree classifier on using parameter
for each tree.m

Decision tree

*is trained using the training set*T

*of size*D

*. Each node*N

*in the tree corresponds to the subset of the training set*t

*, with the root node being*D

*itself. Each internal node*D

*represents a binary test (split) that divides the subset in two subsets, and , corresponding to their children, and .*t

Training method:

Dense

In

*training method, all possible splits for each feature are taken from the subset of selected features for the current node and evaluated for best split computation.*dense

Training method:

Hist

In

*training method, only a selected subset of splits is considered for best split computation. This subset of splits is computed for each feature at the initialization stage of the algorithm. After computing the subset of splits, each value from the initially provided data is substituted with the value of the corresponding bin. Bins are continuous intervals between selected splits.*hist

Split Criteria

The metric for measuring the best split is called

*,*impurity

*. It generally reflects the homogeneity of responses within the subset in the node*i(t)

*.*t

Classification

*is an impurity metric for classification, calculated as follows:*

Gini index

where

is a set of observations that reach the node;D- is specified in the table below:

Without sample weights | With sample weights |
---|---|

is the observed fraction of observations that belong to class in i D | is the observed weighted fraction of observations that belong to class in i :D |

Regression

*is an impurity metric for regression, calculated as follows:*

MSE

Without sample weights | With sample weights |
---|---|

, which is equivalent to the number of elements in S |

Let the

*in the node*impurity decrease

*be*t

Termination Criteria

The library supports the following termination criteria of
decision forest training:

- Minimal number of observations in a leaf node
- Node
is not processed if is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.t - Minimal number of observations in a split node
- Node
is not processed if is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.t - Minimum weighted fraction of the sum total of weights of all the input observations required to be at a leaf node
- Node
is not processed if is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.t - Maximal tree depth
- Node
is not processed if its depth in the tree reached the predefined value.t - Impurity threshold
- Node
is not processed if its impurity is smaller than the predefined threshold.t - Maximal number of leaf nodes
- Grow trees with positive maximal number of leaf nodes in a best-first fashion. Best nodes are defined by relative reduction in impurity. If maximal number of leaf nodes equals zero, then this criterion does not limit the number of leaf nodes, and trees grow in a depth-first fashion.

Tree Building Strategies

Maximal number of leaf nodes defines the strategy of tree building:
depth-first or best-first.

Depth-first Strategy

If maximal number of leaf nodes equals zero, a decision tree is built using depth-first strategy.
In each terminal node

*, the following recursive procedure is applied:*t

- Stop if the termination criteria are met.
- Choose randomly without replacement
feature indices .m - For each , find the best split that partitions subset and maximizes impurity decrease .
- A node is a split if this split induces a decrease of the impurity greater than or equal to the predefined value. Get the best split that maximizes impurity decrease in all splits.
- Apply this procedure recursively to and .

Best-first Strategy

If maximal number of leaf nodes is positive, a decision tree is built using best-first strategy.
In each terminal node

*, the following steps are applied:*t

- Stop if the termination criteria are met.
- Choose randomly without replacement
feature indices .m - For each , find the best split that partitions subset and maximizes impurity decrease .
- A node is a split if this split induces a decrease of the impurity greater than or equal to the predefined value and the number of split nodes is less or equal to . Get the best split that maximizes impurity decrease in all splits.
- Put a node into a sorted array, where sort criterion is the improvement in impurity . The node with maximal improvement is the first in the array. For a leaf node, the improvement in impurity is zero.
- Apply this procedure to and and grow a tree one by one getting the first element from the array until the array is empty.

Inference

Given decision forest classification or regression model and vectors ,
the problem is to calculate the responses for those vectors.

Inference methods:

*and*Dense

Hist

*and*

Dense

*inference methods perform prediction in the same way. To solve the problem for each given query vector , the algorithm does the following:*

hist

Classification

For each tree in the forest, it finds the leaf node that gives its label. The label

*that the majority of trees in the forest vote for is chosen as the predicted label for the query vector .*y

Regression

For each tree in the forest, it finds the leaf node that gives the response as the mean of
dependent variables. The mean of responses from all trees in the forest is the predicted response for the query vector .

Additional Characteristics Calculated by the Decision Forest

Decision forests can produce additional characteristics, such as
an estimate of generalization error and an importance measure
(relative decisive power) of each of p features (variables).

Out-of-bag Error

The estimate of the generalization error based on the training
data can be obtained and calculated as follows:

Classification

- For each vector in the dataset
, predict its label by having the majority of votes from the trees that contain in their OOB set, and vote for that label.X - Calculate the OOB error of the decision forest
as the average of misclassifications:T - If OOB error value per each observation is required, then calculate the prediction error for :

Regression

- For each vector in the dataset
, predict its response as the mean of prediction from the trees that contain in their OOB set:X, where and is the result of prediction by . - Calculate the OOB error of the decision forest
as the Mean-Square Error (MSE):T - If OOB error value per each observation is required, then calculate the prediction error for :

Variable Importance

There are two main types of variable importance measures:

importance (MDI)Mean Decrease Impurity

Importance of the-th variable for predictingjis the sum of weighted impurity decreases for all nodesYthat use , averaged over allttrees in the forest:Bwhere is the fraction of observations reaching nodein the tree , and is the index of the variable used in split .t

(MDA)Mean Decrease Accuracy

Importance of the-th variable for predictingjis the average increase in the OOB error over all trees in the forest when the values of theY-th variable are randomly permuted in the OOB set. For that reason, this latter measure is also known asj.permutation importanceIn more details, the library calculates MDA importance as follows:

Let be the set of feature vectors where the-th variable is randomly permuted over all vectors in the set.j Let be the OOB error calculated for on its out-of-bag dataset . Let be the OOB error calculated for using , and its out-of-bag dataset is permuted on the-th variable. Thenj

is the OOB error increase for the tree . is MDA importance. , where is the variance of