# Details

Given a set $X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}$ of $n$ $p$-dimensional feature vectors and a set $Y = \{y_1, \ldots, y_n\}$ of $n$ responses, the problem is to build a decision forest classification or regression model.
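As a concrete picture of this setup, the feature vectors can be stored as an $n \times p$ matrix together with a length-$n$ response vector. A minimal sketch (the array names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 6, 3                      # n observations, p features
X = rng.normal(size=(n, p))      # rows are the feature vectors x_1, ..., x_n
Y = rng.integers(0, 2, size=n)   # responses y_1, ..., y_n (class labels here)

print(X.shape, Y.shape)          # (6, 3) (6,)
```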

## Training Stage

Let $S = (X, Y)$ be the set of observations. Given positive integer parameters, such as the number of trees $B$, the bootstrap parameter $N = f \cdot n$, where $f$ is the fraction of observations used for training of one tree, and the number of features per node $m$, the algorithm does the following for $b = 1, \ldots, B$:

- Selects randomly with replacement the set $D_b$ of $N$ vectors from the set $S$. The set $D_b$ is called a *bootstrap* set.
- Trains a decision tree classifier $T_b$ on $D_b$ using parameter $m$ for each tree.
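The bootstrap step can be sketched with NumPy as follows (the names and values are illustrative; the library itself uses its own parallel random-number engines, described below):

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100          # total number of observations in S
f = 0.8          # fraction of observations used to train one tree
B = 10           # number of trees
N = int(f * n)   # bootstrap parameter: size of each set D_b

# For each tree b = 1, ..., B, draw N indices with replacement
# from {0, ..., n-1}; duplicates within one D_b are allowed.
bootstrap_indices = [rng.integers(0, n, size=N) for _ in range(B)]

print(len(bootstrap_indices), len(bootstrap_indices[0]))  # 10 80
```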

Decision tree $T$ is trained using the training set $D$ of size $N$. Each node $t$ in the tree corresponds to the subset $D_t$ of the training set $D$, with the root node being $D$ itself. Each internal node $t$ represents a binary test (split) dividing its subset $X_t$ in two subsets $X_{t_L}$ and $X_{t_R}$, corresponding to their children $t_L$ and $t_R$.

The metric for measuring the best split is called *impurity*, $i(t)$. It generally reflects the homogeneity of responses within the subset $D_t$ in the node $t$. For the detailed definition of $i(t)$ metrics, see the description of a specific algorithm.
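For intuition, one widely used impurity metric for classification is the Gini index, $i(t) = 1 - \sum_k p_k^2$, where $p_k$ is the fraction of observations of class $k$ in $D_t$; which metric a particular algorithm uses is defined in its own description, so the sketch below is purely illustrative:

```python
import numpy as np

def gini_impurity(labels):
    """Gini index of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

# A pure node has zero impurity; a 50/50 node has impurity 0.5.
print(gini_impurity([1, 1, 1, 1]))   # 0.0
print(gini_impurity([0, 0, 1, 1]))   # 0.5
```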

Let the *impurity decrease* in the node $t$ be $\Delta i(t)$.

Training of a tree stops when one of the following termination criteria is met:

- *Minimal number of observations in a leaf node.* Node $t$ is not processed if $|D_t|$ is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
- *Maximal tree depth.* Node $t$ is not processed if its depth in the tree reached the predefined value.
- *Impurity threshold.* Node $t$ is not processed if its impurity is smaller than the predefined threshold.

The tree is built recursively; for each node $t$:

- Stop, if the termination criteria are met.
- Choose randomly without replacement $m$ feature indices $J_t \subset \{0, 1, \ldots, p-1\}$.
- For each $j \in J_t$, find the best split $s_{j,t}$ that partitions subset $D_t$ and maximizes impurity decrease $\Delta i(t)$.
- Get the best split $s_t$ that maximizes impurity decrease $\Delta i$ among all $s_{j,t}$ splits.
- Apply this procedure recursively to $t_L$ and $t_R$.
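The per-node split search can be sketched as follows, using the Gini index and the weighted impurity decrease $\Delta i(t) = i(t) - \frac{|D_{t_L}|}{|D_t|} i(t_L) - \frac{|D_{t_R}|}{|D_t|} i(t_R)$. This is an illustrative sketch, not the library's implementation:

```python
import numpy as np

def gini(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_indices):
    """Search thresholds on the chosen features; return the split
    (feature j, threshold s) with the largest impurity decrease."""
    best = (None, None, 0.0)                 # (feature, threshold, delta_i)
    parent = gini(y)
    for j in feature_indices:
        for s in np.unique(X[:, j])[:-1]:    # candidate thresholds
            left = y[X[:, j] <= s]
            right = y[X[:, j] > s]
            delta = parent - (len(left) * gini(left)
                              + len(right) * gini(right)) / len(y)
            if delta > best[2]:
                best = (j, s, delta)
    return best

rng = np.random.default_rng(0)
X = np.array([[0.0, 5.0], [1.0, 4.0], [2.0, 1.0], [3.0, 0.0]])
y = np.array([0, 0, 1, 1])
m = 2
J_t = rng.choice(X.shape[1], size=m, replace=False)  # random feature subset
j, s, delta = best_split(X, y, J_t)
print(delta)   # 0.5 (a perfect split exists on either feature)
```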

To create a *bootstrap* set and choose feature indices in a performant way, the training algorithm requires a source of random numbers capable of producing sequences of random numbers in parallel. See Algorithms > Analysis > Engines for the description of the respective techniques.

The engine state is updated once the training of the decision forest model is completed. The library supports the following methods of updating the state:

- Family: the updated state is the set of states that represent individual engines in the family.
- Leapfrog: the updated state is the state of the subsequence with the rightmost position on the original random number sequence. For example, with two interleaved subsequences ('x' and 'o') of the random number sequence, the saved state is that of the subsequence that has advanced furthest.
- SkipAhead: the updated state is the state of the independent subsequence with the rightmost position on the original random number sequence.
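The difference between the Leapfrog and SkipAhead partitioning schemes can be illustrated on the index positions of a single base stream split between two engines, 'x' and 'o' (a toy sketch, not the library's engine API):

```python
# Positions 0..7 of one base random number sequence.
stream = list(range(8))

# Leapfrog: engines take interleaved elements (stride = number of engines).
leapfrog_x = stream[0::2]    # [0, 2, 4, 6]
leapfrog_o = stream[1::2]    # [1, 3, 5, 7]

# SkipAhead: engines take contiguous blocks of the sequence.
skipahead_x = stream[:4]     # [0, 1, 2, 3]
skipahead_o = stream[4:]     # [4, 5, 6, 7]

# In both schemes, the "updated state" after training corresponds to the
# subsequence that has advanced to the rightmost position of the stream.
print(leapfrog_x, skipahead_x)
```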

## Prediction Stage

Given the trained model and $r$ feature vectors $x_1, \ldots, x_r$, the problem is to calculate the responses for those vectors. To solve the problem for each given query vector $x_i$, the algorithm finds the leaf node in each tree in the forest that gives the response by that tree. The response of the forest is based on an aggregation of responses from all trees in the forest. For the detailed definition, see the description of a specific algorithm.
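For classification, the aggregation can be sketched as a majority vote over the per-tree leaf responses (for regression, the responses would be averaged instead). The function name and inputs are illustrative:

```python
import numpy as np

def forest_predict(per_tree_responses):
    """Aggregate the responses from all trees for one query vector:
    majority vote for classification."""
    values, counts = np.unique(per_tree_responses, return_counts=True)
    return values[np.argmax(counts)]

# Responses given by the leaf nodes of 5 trees for one query vector x_i:
responses = np.array([1, 0, 1, 1, 0])
print(forest_predict(responses))   # 1
```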

## Additional Characteristics Calculated by the Decision Forest

In addition to the model itself, the library computes characteristics such as the out-of-bag error and variable importance, which ranks the $p$ features (variables).

## Out-of-bag Error

- For each tree $T_b$ in the forest, trained on the bootstrap set $D_b$, the set $\overline{D}_b = D \setminus D_b$ is called the out-of-bag (OOB) set.
- Predict the data from the OOB set with $T_b$.
- For each vector $x_i$ in the dataset $X$, predict its response $\hat{y}_i$ by the trees that contain $x_i$ in their OOB set.
- Aggregate the out-of-bag predictions in all trees and calculate the OOB error of the decision forest.
- If the OOB error value per each observation is required, then calculate the prediction error for $x_i$.
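The procedure can be sketched as follows. The `trees` here are stand-in callables, and the misclassification rate over majority-voted OOB predictions is one possible OOB error definition; the exact definition depends on the specific algorithm:

```python
import numpy as np

def oob_error(X, y, trees, bootstrap_indices):
    """Aggregate per-observation OOB predictions (majority vote) and
    return the misclassification rate over observations that received
    at least one OOB prediction."""
    n = len(y)
    votes = [[] for _ in range(n)]
    for predict, idx in zip(trees, bootstrap_indices):
        oob = np.setdiff1d(np.arange(n), idx)   # D \ D_b: the OOB set
        for i in oob:
            votes[i].append(predict(X[i]))
    errors, counted = 0, 0
    for i in range(n):
        if votes[i]:                            # x_i was OOB for some tree
            vals, cnts = np.unique(votes[i], return_counts=True)
            errors += int(vals[np.argmax(cnts)] != y[i])
            counted += 1
    return errors / counted

# Two toy "trees": one always predicts 0, one always predicts 1.
X = np.zeros((4, 1))
y = np.array([0, 0, 1, 1])
trees = [lambda x: 0, lambda x: 1]
bootstrap_indices = [np.array([0, 1]), np.array([2, 3])]
print(oob_error(X, y, trees, bootstrap_indices))   # 1.0: every OOB vote is wrong
```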

## Variable Importance

*Mean Decrease Impurity* importance (MDI). Importance of the $j$-th variable for predicting $Y$ is the sum of weighted impurity decreases $p(t)\,\Delta i(s_t, t)$ for all nodes $t$ that use $x_j$, averaged over all $B$ trees in the forest:

$$MDI(j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_b : v(s_t) = j} p(t)\,\Delta i(s_t, t),$$

where $p(t) = \frac{|D_t|}{|D|}$ is the fraction of observations reaching node $t$ in the tree $T_b$, and $v(s_t)$ is the index of the variable used in the split $s_t$.

*Mean Decrease Accuracy* (MDA). Importance of the $j$-th variable for predicting $Y$ is the average increase in the OOB error over all trees in the forest when the values of the $j$-th variable are randomly permuted in the OOB set. For that reason, this latter measure is also known as *permutation importance*.

In more detail, the library calculates MDA importance as follows:

- Let $\pi(X, j)$ be the set of feature vectors where the $j$-th variable is randomly permuted over all vectors in the set.
- Let $E_b$ be the OOB error calculated for $T_b$ on its out-of-bag dataset $\overline{D}_b$.
- Let $E_{b,j}$ be the OOB error calculated for $T_b$ using $\pi(\overline{D}_b, j)$, that is, its out-of-bag dataset with the $j$-th variable permuted. Then
  - the raw MDA of the $j$-th variable is the average of $D_{b,j} = E_{b,j} - E_b$ over all $B$ trees $T_b$;
  - the scaled MDA is the raw MDA divided by its standard error, estimated from the variance of the differences $D_{b,j}$.
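The raw MDA computation can be sketched as follows: permute the $j$-th column of each tree's OOB data, recompute the per-tree error, and average the increase over trees. The `predict` callables and function names here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def tree_error(predict, X, y):
    """Misclassification rate of one tree on a dataset."""
    preds = np.array([predict(row) for row in X])
    return float(np.mean(preds != y))

def mda_importance(predict_fns, oob_sets, j):
    """Raw MDA of feature j: mean over trees of E_{b,j} - E_b,
    where E_{b,j} uses the OOB set with column j permuted."""
    diffs = []
    for predict, (X_oob, y_oob) in zip(predict_fns, oob_sets):
        e_b = tree_error(predict, X_oob, y_oob)
        X_perm = X_oob.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # pi(OOB set, j)
        e_bj = tree_error(predict, X_perm, y_oob)
        diffs.append(e_bj - e_b)
    return float(np.mean(diffs))

# Example: one tree that thresholds feature 0; feature 1 is irrelevant.
predict = lambda x: int(x[0] > 0)
X_oob = np.array([[-1.0, 5.0], [1.0, 5.0], [-2.0, 5.0], [2.0, 5.0]])
y_oob = np.array([0, 1, 0, 1])
print(mda_importance([predict], [(X_oob, y_oob)], j=1))   # 0.0 (irrelevant)
```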