Developer Guide

Details

Given a set $X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}$ of $n$ $p$-dimensional feature vectors and a set $Y = \{y_1, \ldots, y_n\}$ of $n$ responses, the problem is to build a gradient boosted trees classification or regression model.
The tree ensemble model uses $M$ additive functions to predict the output:

$$\hat{y}_i = \sum_{k=1}^{M} f_k(x_i), \quad f_k \in F,$$

where $F = \left\{ f(x) = w_{q(x)},\ q: \mathbb{R}^p \to \{1, \ldots, T\},\ w \in \mathbb{R}^T \right\}$ is the space of regression trees, $T$ is the number of leaves in the tree, $w$ is a vector of leaf weights, $w_i$ is the score on the $i$-th leaf, and $q(x)$ represents the structure of each tree that maps an observation to the corresponding leaf index.
The training procedure is an iterative functional gradient descent algorithm that minimizes the objective function over function space by iteratively choosing a function (regression tree) that points in the negative gradient direction. The objective function is

$$L(f) = \sum_{i=1}^{n} l(y_i, f(x_i)) + \sum_{k=1}^{M} \Omega(f_k),$$

where $l(y, f)$ is a differentiable convex loss function that measures the difference between the prediction $f(x_i)$ and the response $y_i$, and $\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$ is a regularization term that penalizes the complexity of the model defined by the number of leaves $T$ and the L2 norm of the leaf weights $w$ for each tree; $\gamma$ and $\lambda$ are regularization parameters.

Training Stage

The library uses the second-order approximation of the objective function

$$L_k(f) \approx \sum_{i=1}^{n} \left( g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right) + \Omega(f_k),$$

where $g_i = \dfrac{\partial l\left(y_i, \hat{y}_i^{(k-1)}\right)}{\partial \hat{y}_i^{(k-1)}}$, $h_i = \dfrac{\partial^2 l\left(y_i, \hat{y}_i^{(k-1)}\right)}{\partial \left(\hat{y}_i^{(k-1)}\right)^2}$, and the following algorithmic framework for the training stage.
Let $S = (X, Y)$ be the set of observations. Given the training parameters, such as the number of iterations $M$, the loss function $l(y, f)$, regression tree training parameters, regularization parameters $\gamma$ and $\lambda$, and the shrinkage (learning rate) parameter $\theta$, the algorithm does the following (a minimal sketch of this loop follows the list):
  • Find an initial guess $\hat{y}_i^{(0)}$, $i = 1, \ldots, n$.
  • For $k = 1, \ldots, M$:
    • Update $g_i$ and $h_i$, $i = 1, \ldots, n$.
    • Grow a regression tree $f_k(x)$ that minimizes the objective function $-\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$, where $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$, and $I_j = \{\, i \mid q(x_i) = j \,\}$.
    • Assign an optimal weight $w_j^{*} = -\frac{G_j}{H_j + \lambda}$ to the leaf $j$, $j = 1, \ldots, T$.
    • Apply the shrinkage parameter $\theta$ to the tree leaves and add the tree to the model.
    • Update $\hat{y}_i^{(k)} = \hat{y}_i^{(k-1)} + \theta f_k(x_i)$.
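For illustration, here is a minimal NumPy sketch of this training loop, assuming a squared-error loss $l(y, f) = \frac{1}{2}(y - f)^2$ so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$. It is not the library implementation; grow_regression_tree is a hypothetical helper that grows one regression tree on the current gradients and Hessians, as described below.

    import numpy as np

    def train_gbt(X, y, grow_regression_tree, M=100, theta=0.1):
        """Sketch of the boosting loop for squared-error loss.

        grow_regression_tree(X, g, h) is a hypothetical helper that grows one
        regression tree and returns an object whose predict(X) method yields
        the optimal leaf weight for each observation.
        """
        n = X.shape[0]
        base_score = y.mean()                 # initial guess y_hat^(0)
        y_hat = np.full(n, base_score)
        trees = []
        for _ in range(M):                    # k = 1, ..., M
            g = y_hat - y                     # g_i: first derivative of the loss
            h = np.ones(n)                    # h_i: second derivative of the loss
            tree = grow_regression_tree(X, g, h)
            y_hat += theta * tree.predict(X)  # shrinkage and prediction update
            trees.append(tree)
        return base_score, trees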
The algorithm for growing the tree (see the split-finding sketch after this list):
  • Generate a bootstrap training set if required (stochastic gradient boosting) as follows: randomly select without replacement $N = f \cdot n$ observations, where $f$ is the fraction of observations used for training of one tree.
  • Start from the tree with depth 0.
  • For each leaf node in the tree:
    • Choose a subset of features for split finding if required (stochastic gradient boosting).
    • Find the best split that maximizes the gain:
      $$\frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
  • Stop when a termination criterion is met.
For more details, see [Chen2016].
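The sketch below illustrates (not as the library's implementation) an exact split search over one feature: it scans the sorted feature values and evaluates the gain formula above from accumulated gradient sums, with lam and gamma standing in for $\lambda$ and $\gamma$.

    import numpy as np

    def best_split_for_feature(x, g, h, lam=1.0, gamma=0.0):
        """Examine all possible split values of one feature (1-D arrays x, g, h)
        and return (best_gain, best_threshold) for the gain
        0.5*(G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)) - gamma.
        """
        order = np.argsort(x)
        x, g, h = x[order], g[order], h[order]
        G, H = g.sum(), h.sum()
        parent_term = G * G / (H + lam)       # (G_L + G_R)^2 / (H_L + H_R + lam)
        G_L = H_L = 0.0
        best_gain, best_threshold = -np.inf, None
        for i in range(len(x) - 1):
            G_L += g[i]
            H_L += h[i]
            if x[i] == x[i + 1]:              # cannot split between equal values
                continue
            G_R, H_R = G - G_L, H - H_L
            gain = 0.5 * (G_L * G_L / (H_L + lam)
                          + G_R * G_R / (H_R + lam)
                          - parent_term) - gamma
            if gain > best_gain:
                best_gain = gain
                best_threshold = 0.5 * (x[i] + x[i + 1])
        return best_gain, best_threshold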
The library supports the following termination criteria when growing the tree (see the sketch after this list):
  • Minimal number of observations in a leaf node.
    Node $t$ is not processed if its subset of observations is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
  • Maximal tree depth.
    Node $t$ is not processed if its depth in the tree has reached the predefined value.
  • Minimal split loss.
    Node $t$ is not split if the gain of its best possible split is smaller than the parameter $\gamma$.
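As a sketch only, the checks below show how these criteria could gate node processing and split acceptance; the parameter names min_observations, max_depth, and min_split_loss are illustrative, with min_split_loss playing the role of $\gamma$.

    def may_process_node(n_node_samples, depth, min_observations=5, max_depth=6):
        # Node t is skipped if it has too few observations
        # or the maximal depth is reached.
        return n_node_samples >= min_observations and depth < max_depth

    def may_apply_split(n_left, n_right, best_gain,
                        min_observations=5, min_split_loss=0.0):
        # A split is rejected if it would produce a child that is too small
        # or if the best gain is below the minimal split loss (gamma).
        return (n_left >= min_observations and
                n_right >= min_observations and
                best_gain >= min_split_loss)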

Training Alternative

If you already have a set of precomputed values for nodes in each tree, you can use the Model Builder class to get a trained Intel DAAL Gradient Boosted Trees Classification model based on the external model you have.
The following schema illustrates the use of the Model Builder class for Gradient Boosted Trees Classification:
[Figure: Model Builder Class for Gradient Boosted Trees]
For general information on using the Model Builder class, see Training and Prediction. For details on using the Model Builder class for Gradient Boosted Trees Classification, see Usage of training alternative.

Prediction Stage

Given a gradient boosted trees model and vectors $x_1, \ldots, x_r$, the problem is to calculate the responses for those vectors. To solve the problem for each given query vector $x_i$, the algorithm finds in every tree of the ensemble the leaf node that determines that tree's response. The resulting response is based on an aggregation of the responses from all trees in the ensemble. For a detailed definition, see the description of a specific algorithm.
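As a minimal sketch (not the library's model layout), assume each tree is stored as flat arrays of split features, thresholds, child indices, and leaf values; prediction then walks each tree to a leaf via $q(x)$ and, for regression, sums the tree responses with the base score.

    import numpy as np

    def predict_one_tree(tree, x):
        # `tree` is an illustrative dict of flat arrays: "feature", "threshold",
        # "left", "right", "value"; left[node] == -1 marks a leaf.
        node = 0
        while tree["left"][node] != -1:
            if x[tree["feature"][node]] < tree["threshold"][node]:
                node = tree["left"][node]
            else:
                node = tree["right"][node]
        return tree["value"][node]            # leaf weight = response of this tree

    def predict(base_score, trees, X):
        # Aggregate the responses of all trees (a plain sum, for regression).
        return np.array([base_score + sum(predict_one_tree(t, x) for t in trees)
                         for x in X])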

Split Calculation Mode

The library supports two split calculation modes:
  • exact
    - all possible split values are examined when searching for the best split for a feature.
  • inexact
    - continuous features are bucketed into discrete bins, and the possible splits are restricted to the bucket borders only (see the binning sketch below).
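The following sketch contrasts the two modes for a single continuous feature; max_bins is an illustrative parameter name, and quantile binning is one possible bucketing strategy, not necessarily the one the library uses.

    import numpy as np

    def candidate_splits_exact(x):
        # Exact mode: every distinct feature value can produce a split;
        # midpoints between adjacent sorted values are the candidates.
        values = np.unique(x)
        return (values[:-1] + values[1:]) / 2.0

    def candidate_splits_inexact(x, max_bins=256):
        # Inexact mode: bucket the feature into at most max_bins quantile bins
        # and restrict candidate splits to the bin borders only.
        quantiles = np.linspace(0.0, 1.0, max_bins + 1)[1:-1]
        return np.unique(np.quantile(x, quantiles))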

Variable Importance

There are five main types of variable importance measures (a sketch computing all five follows this list):
  • Weight
    represents the number of nodes that split samples by the given feature $f$:
    $$\mathrm{Weight}(f) = \sum_{j=1}^{M} \sum_{k} I\left(\mathrm{featureSplit}_{jk} = f\right),$$
    where $\mathrm{featureSplit}_{jk}$ is the feature that is used to split samples in the $k$-th node of the $j$-th tree.
  • Total cover
    represents the total number of samples passing through the nodes that split samples by the given feature:
    $$\mathrm{TotalCover}(f) = \sum_{j=1}^{M} \sum_{k} \mathrm{samples}_{jk} \cdot I\left(\mathrm{featureSplit}_{jk} = f\right),$$
    where
    • $\mathrm{featureSplit}_{jk}$ is the feature that is used to split samples in the $k$-th node of the $j$-th tree
    • $\mathrm{samples}_{jk}$ is the number of samples passing through the $k$-th node of the $j$-th tree
  • Cover
    represents the average number of samples passing through one split of the given feature:
    $$\mathrm{Cover}(f) = \frac{\mathrm{TotalCover}(f)}{\mathrm{Weight}(f)}$$
  • Total gain
    is the total reduction of the loss function from splits on the given feature in the entire model:
    $$\mathrm{TotalGain}(f) = \sum_{j=1}^{M} \sum_{k} \left(\mathrm{impurity}_{jk} - \mathrm{impurityLeft}_{jk} - \mathrm{impurityRight}_{jk}\right) \cdot I\left(\mathrm{featureSplit}_{jk} = f\right),$$
    where
    • $\mathrm{featureSplit}_{jk}$ is the feature that is used to split samples in the $k$-th node of the $j$-th tree
    • $\mathrm{impurityLeft}_{jk}$ is the impurity of the left child of the $k$-th node of the $j$-th tree
    • $\mathrm{impurityRight}_{jk}$ is the impurity of the right child of the $k$-th node of the $j$-th tree
    • $\mathrm{impurity}_{jk}$ is the impurity of the $k$-th node of the $j$-th tree
  • Gain
    is the average reduction of the loss function by one split of the given feature:
    $$\mathrm{Gain}(f) = \frac{\mathrm{TotalGain}(f)}{\mathrm{Weight}(f)}$$
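A minimal sketch that accumulates all five measures from per-node records is shown below; each record is assumed (for illustration only) to carry the split feature, the number of samples passing through the node, and the node and child impurities.

    from collections import defaultdict

    def variable_importance(trees):
        # `trees` is an illustrative list of trees, each a list of split-node
        # records: dicts with keys "feature", "samples", "impurity",
        # "impurity_left", "impurity_right".
        weight = defaultdict(int)
        total_cover = defaultdict(float)
        total_gain = defaultdict(float)
        for tree in trees:                    # j-th tree
            for node in tree:                 # k-th split node
                f = node["feature"]
                weight[f] += 1
                total_cover[f] += node["samples"]
                total_gain[f] += (node["impurity"]
                                  - node["impurity_left"]
                                  - node["impurity_right"])
        cover = {f: total_cover[f] / weight[f] for f in weight}
        gain = {f: total_gain[f] / weight[f] for f in weight}
        return weight, total_cover, cover, total_gain, gain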
