Gradient Boosted Trees
Details
Given n feature vectors of
n
p
dimensional feature vectors and n
responses , the problem is to build a gradient boosted trees
classification or regression model.The tree ensemble model uses M additive functions to predict the output
where
is the space of regression trees,
T
is the number of
leaves in the tree, w
is a leaf weights vector, is a score
on i
th leaf. q(x)
represents the structure of each tree that maps an
observation to the corresponding leaf index.Training procedure is an iterative functional gradient descent
algorithm which minimizes objective function over function space by
iteratively choosing a function (regression tree) that points in the
negative gradient direction. The objective function is
where
l(f)
is twice differentiable convex loss function and
is a regularization term that penalizes the complexity of
the model defined by the number of leaves T and the L2 norm of the weights for each tree, and
are regularization parameters.Training Stage
Library uses the secondorder approximation of objective function
where ,
and following algorithmic framework for the training stage.
Let be the set of observations. Given the training
parameters, such as the number of iterations
M
, loss function l(f)
, regression tree training parameters,
regularization parameters and , shrinkage (learning rate) parameter , the
algorithm does the following: Find an initial guess ,i = 1, …, n
 For :
 Update and ,i = 1, …, n
 Grow a regression tree that minimizes the objective function , where , , , .
 Assign an optimal weight to the leafj,j = 1, ldots, T.
 Apply shrinkage parameter to the tree leafs and add the tree to the model
 Update
The algorithm for growing the tree:
 Generate a bootstrap training set if required (stochastic gradient boosting) as follows: select randomly without replacement observations, wherefis a fraction of observations used for training of one tree.
 Start from the tree with depth0.
 For each leaf node in the tree:
 Choose a subset of feature for split finding if required (stochastic gradient boosting).
 Find the best split that maximizes the gain:
 Stop when a termination criterion is met.
For more details, see [Chen2016].
The library supports the following termination criteria when
growing the tree:
 Minimal number of observations in a leaf node.Node t is not processed if the subset of observations is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
 Maximal tree depth.Node t is not processed, if its depth in the tree reached the predefined value.
 Minimal split loss.Node t is not processed, if the best possible split is smaller than parameter .
Prediction Stage
Given a gradient boosted trees model and vectors , the problem is to calculate the responses for those
vectors. To solve the problem for each given query vector , the algorithm finds the leaf node in a tree in the
ensemble which gives the response by that tree. Resulting response
is based on an aggregation of responses from all trees in the
ensemble. For detailed definition, see description of a specific
algorithm.
Split Calculation Mode
The library supports two split calculation modes:
 exact  all possible split values are examined when searching for the best split for a feature.
 inexact  continuous features are bucketed into discrete bins and the possible splits are restricted by the buckets borders only.
Batch Processing
Gradient boosted trees classification and regression follows the
general workflow described in Classification Usage Model and Regression Usage Model.
Training
For description of the input and output, refer to .
At the training stage, the gradient boosted trees batch algorithm
has the following parameters:
Parameter  Default Value  Description 

splitMethod  inexact  Split computation mode. Possible values:

maxIterations  50  Maximal number of iterations when training the model, defines maximal number of trees in the model. 
maxTreeDepth  6  Maximal tree depth. If the parameter is set to 0 then the depth is unlimited. 
shrinkage  0.3  Learning rate of the boosting procedure. Scales the contribution of each tree by a factor 
minSplitLoss  0  Loss regularization parameter. Minimal loss reduction required to make a further partition on a leaf node of the tree. Range: 
lambda  1  L2 regularization parameter on weights. Range: 
observationsPerTreeFraction  1  Fraction of the training set S used for a single tree training, . The observations are sampled randomly without replacement. 
featuresPerNode  0  The number of features tried as the possible splits per node. If the parameter is set to 0 , all features are used. 
minObservationsInLeafNode  5  Minimal number of observations in the leaf node. 
memorySavingMode  false  If true then use memory saving (but slower) mode. 
engine  SharePtr< engines:: mt19937:: Batch>()  Pointer to the random number generator. 
maxBins  256  Used with inexact split method only. Maximal number of discrete bins to
bucket continuous features. Increasing the number results in higher
computation costs 
minBinSize  5  Used with inexact split method only. Minimal number of observations in a bin. 