Developer Guide and Reference

  • 2021.1
  • 12/04/2020
  • Public Content
Contents

Decision Forest

Details

Given n feature vectors LaTex Math image. of
n
p
-dimensional feature vectors and n responses LaTex Math image. , the problem is to build a decision forest classification or regression model.
Training Stage
Library uses the following algorithmic framework for the training stage. Let LaTex Math image. be the set of observations. Given a positive integer parameters, such as the number of trees
B
, the bootstrap parameter LaTex Math image. , where
f
is a fraction of observations used for a training of one tree, and the number of features per node
m
, the algorithm does the following for LaTex Math image. :
  • Selects randomly with replacement the set LaTex Math image. of
    N
    vectors from the set
    S
    . The set LaTex Math image. is called a
    bootstrap
    set.
  • Trains a decision tree classifier LaTex Math image. on LaTex Math image. using parameter
    m
    for each tree.
Decision tree
T
is trained using the training set
D
of size
N
. Each node
t
in the tree corresponds to the subset LaTex Math image. of the training set
D
, with the root node being
D
itself. Its internal nodes
t
represent a binary test (split) dividing their subset LaTex Math image. in two subsets LaTex Math image. and LaTex Math image. , corresponding to their children LaTex Math image. and LaTex Math image. .
Inexact Histogram Computation Method
In inexact histogram method only a selected subset of splits is considered for computation of a best split. This subset is computed for each feature at the initialization stage of the algorithm. After the set of splits is computed, each value from initially provided data is substituted with the value of the corresponding bin. The bins are continuous intervals between the selected splits.
Split Criteria
The metric for measuring the best split is called
impurity
, LaTex Math image. . It generally reflects the homogeneity of responses within the subset LaTex Math image. in the node
t
. For the detailed definition of LaTex Math image. metrics, see the description of a specific algorithm.
Let the
impurity decrease
in the node
t
be
LaTex Math image.
Termination Criteria
The library supports the following termination criteria of decision forest training:
Minimal number of observations in a leaf node
Node
t
is not processed if LaTex Math image. is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
Minimal number of observations in a split node
Node
t
is not processed if LaTex Math image. is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
Minimum weighted fraction of the sum total of weights of all the input observations required to be at a leaf node
Node
t
is not processed if LaTex Math image. is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
Maximal tree depth
Node
t
is not processed if its depth in the tree reached the predefined value.
Impurity threshold
Node
t
is not processed if its impurity is smaller than the predefined threshold.
Maximal number of leaf nodes
Grow trees with positive maximal number of leaf nodes in a best-first fashion. Best nodes are defined by relative reduction in impurity. If maximal number of leaf nodes equals zero, then this criterion does not limit the number of leaf nodes, and trees grow in a depth-first fashion.
Tree Building Strategies
Maximal number of leaf nodes defines the strategy of tree building: depth-first or best-first.
Depth-first Strategy
If maximal number of leaf nodes equals zero, a decision tree is built using depth-first strategy. In each terminal node
t
, the following recursive procedure is applied:
  • Stop if the termination criteria are met.
  • Choose randomly without replacement
    m
    feature indices LaTex Math image. .
  • For each LaTex Math image. , find the best split LaTex Math image. that partitions subset LaTex Math image. and maximizes impurity decrease LaTex Math image. .
  • A node is a split if this split induces a decrease of the impurity greater than or equal to the predefined value. Get the best split LaTex Math image. that maximizes impurity decrease LaTex Math image. in all LaTex Math image. splits.
  • Apply this procedure recursively to LaTex Math image. and LaTex Math image. .
Best-first Strategy
If maximal number of leaf nodes is positive, a decision tree is built using best-first strategy. In each terminal node
t
, the following steps are applied:
  • Stop if the termination criteria are met.
  • Choose randomly without replacement
    m
    feature indices LaTex Math image. .
  • For each LaTex Math image. , find the best split LaTex Math image. that partitions subset LaTex Math image. and maximizes impurity decrease LaTex Math image. .
  • A node is a split if this split induces a decrease of the impurity greater than or equal to the predefined value and the number of split nodes is less or equal to LaTex Math image. . Get the best split LaTex Math image. that maximizes impurity decrease LaTex Math image. in all LaTex Math image. splits.
  • Put a node into a sorted array, where sort criterion is the improvement in impurity LaTex Math image. . The node with maximal improvement is the first in the array. For a leaf node, the improvement in impurity is zero.
  • Apply this procedure to LaTex Math image. and LaTex Math image. and grow a tree one by one getting the first element from the array until the array is empty.
Random Numbers Generation
To create a
bootstrap
set and choose feature indices in the performant way, the training algorithm requires the source of random numbers, capable to produce sequences of random numbers in parallel.
Initialization of the engine in the decision forest is based on the scheme below:
The state of the engine is updated once the training of the decision forest model is completed. The library provides support to retrieve the instance of the engine with updated state that can be used in other computations. The update of the state is engine-specific and depends on the parallelization technique used as defined earlier:
  • Family: the updated state is the set of states that represent individual engines in the family.
  • Leapfrog: the updated state is the state of the sequence with the rightmost position on the sequence. The example below demonstrates the idea for case of 2 subsequences (‘x’ and ‘o’) of the random number sequence:
  • SkipAhead: the updated state is the state of the independent sequence with the rightmost position on the sequence. The example below demonstrates the idea for case of 2 subsequences (‘x’ and ‘o’) of the random number sequence:
Prediction Stage
Given decision forest classifier and vectors LaTex Math image. , the problem is to calculate the responses for those vectors. To solve the problem for each given query vector LaTex Math image. , the algorithm finds the leaf node in a tree in the forest that gives the response by that tree. The response of the forest is based on an aggregation of responses from all trees in the forest. For the detailed definition, see the description of a specific algorithm.
Additional Characteristics Calculated by the Decision Forest
Decision forests can produce additional characteristics, such as an estimate of generalization error and an importance measure (relative decisive power) of each of p features (variables).
Out-of-bag Error
The estimate of the generalization error based on the training data can be obtained and calculated as follows:
  • For each tree LaTex Math image. in the forest, trained on the bootstrap set LaTex Math image. , the set LaTex Math image. is called the out-of-bag (OOB) set.
  • Predict the data from LaTex Math image. set by LaTex Math image. .
  • For each vector LaTex Math image. in the dataset X, predict its response LaTex Math image. by the trees that contain LaTex Math image. in their OOB set.
  • Aggregate the out-of-bag predictions in all trees and calculate the OOB error of the decision forest.
  • If OOB error value per each observation is required, then calculate the prediction error for LaTex Math image. .
For the detailed definition, see the description of a specific algorithm.
Variable Importance
There are two main types of variable importance measures:
  • Mean Decrease Impurity
    importance (MDI).
Importance of the
j
-th variable for predicting
Y
is the sum of weighted impurity decreases LaTex Math image. for all nodes
t
that use LaTex Math image. , averaged over all
B
trees in the forest:
LaTex Math image.
where LaTex Math image. is the fraction of observations reaching node
t
in the tree LaTex Math image. , and LaTex Math image. is the index of the variable used in split LaTex Math image. .
  • Mean Decrease Accuracy
    (MDA).
Importance of the
j
-th variable for predicting
Y
is the average increase in the OOB error over all trees in the forest when the values of the
j
-th variable are randomly permuted in the OOB set. For that reason, this latter measure is also known as
permutation importance
.
In more details, the library calculates MDA importance as follows:
  • Let LaTex Math image. be the set of feature vectors where the
    j
    -th variable is randomly permuted over all vectors in the set.
  • Let LaTex Math image. be the OOB error calculated for LaTex Math image. on its out-of-bag dataset LaTex Math image. .
  • Let LaTex Math image. be the OOB error calculated for LaTex Math image. using LaTex Math image. , and its out-of-bag dataset LaTex Math image. is permuted on the
    j
    -th variable. Then
    • LaTex Math image. is the OOB error increase for the tree LaTex Math image. .
    • LaTex Math image. is MDA importance.
    • LaTex Math image. , where LaTex Math image. is the variance of LaTex Math image.

Batch Processing

Decision forest classification and regression follows the general workflow described in Classification Usage Model.
Training
At the training stage, decision forest regression has the following parameters:
Parameter
Default Value
Description
nTrees
100
The number of trees in the forest.
observationsPerTreeFraction
1
Fraction of the training set S used to form the bootstrap set for a single tree training, LaTex Math image. . The observations are sampled randomly with replacement.
featuresPerNode
0
The number of features tried as possible splits per node. If the parameter is set to
0
, the library uses the square root of the number of features, LaTex Math image. , for classification and LaTex Math image. features for regression.
maxTreeDepth
0
Maximal tree depth. Default is
0
(unlimited).
DEPRECATED:
seed
777
The seed for random number generator, which is used to choose the bootstrap set, split features in every split node in a tree, and generate permutation required in computations of
MDA
variable importance.
This parameter is deprecated and will be removed in future releases. Use
engine
instead.
engine
SharePtr< engines:: mt2203:: Batch>()
Pointer to the random number generator engine.
The random numbers produced by this engine are used to choose the bootstrap set, split features in every split node in a tree, and generate permutation required in computations of MDA variable importance.
impurityThreshold
0
The threshold value used as stopping criteria: if the impurity value in the node is smaller than the threshold, the node is not split anymore.
varImportance
none
The variable importance computation mode.
Possible values:
  • none
    – variable importance is not calculated
  • MDI
    - Mean Decrease of Impurity, also known as the Gini importance or Mean Decrease Gini
  • MDA_Raw
    - Mean Decrease of Accuracy (permutation importance)
  • MDA_Scaled
    - the MDA_Raw value scaled by its standard deviation
resultsToCompute
0
The 64-bit integer flag that specifies which extra characteristics of the decision forest to compute. Provide one of the following values to request a single characteristic or use bitwise OR to request a combination of the characteristics:
  • computeOutOfBagError
  • computeOutOfBagErrorPerObservation
bootstrap
true
If true, the training set for a tree is a bootstrap of the whole training set. If false, the whole training set is used to build trees.
minObservationsInLeafNode
1
for classification,
5
for regression
Minimum number of observations in the leaf node.
minObservationsInSplitNode
2
Minimum number of samples required to split an internal node; it can be any non-negative number.
minWeightFractionInLeafNode
0.0
Minimum weighted fraction of the sum total of weights of all the input observations required to be at a leaf node, from
0.0
to
0.5
.
All observations have equal weights if the weights of the observations are not provided.
minImpurityDecreaseInSplitNode
0.0
Minimum amount of impurity decrease required to split a node; it can be any non-negative number.
maxLeafNodes
0
Grow trees with positive maximal number of leaf nodes in a best-first fashion. Best nodes are defined as relative reduction in impurity. If maximal number of leaf nodes equals zero, then this parameter does not limit the number of leaf nodes, and trees grow in a depth-first fashion.
Output
In addition to regression or classifier output, decision forest calculates the result described below. Pass the
Result ID
as a parameter to the methods that access the result of your algorithm.
Result ID
Result
outOfBagError
A numeric table LaTex Math image. containing out-of-bag error computed when the
computeOutOfBagErroroption
option is on.
By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
.
variableImportance
A numeric table LaTex Math image. that contains variable importance values for each feature. If you set the
varImportance
parameter to none, the library returns a null pointer to the table.
By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
PackedTriangularMatrix
and
PackedSymmetricMatrix
.
outOfBagErrorPerObservation
A numeric table of size LaTex Math image. that contains the computed out-of-bag error when the
computeOutOfBagErrorPerObservation
option is enabled. The value LaTex Math image. in the table indicates that no OOB value was computed because this observation was not in OOB set for any of the trees in the model (never left out during the bootstrap).
By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
.
updatedEngine
Engine instance with state updated after computations.

Performance Considerations

To get the best performance of the decision forest variable importance computation, use the Mean Decrease Impurity (MDI) rather than the Mean Decrease Accuracy (MDA) method.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.