Decision Forest Classification and Regression (DF)
Mathematical formulation
- For classification, $y_i \in \{0, \ldots, C-1\}$, where $C$ is the number of classes, the tree ensemble model predicts the output by selecting the response $y$ that is voted for by the majority of the trees in the forest.
- For regression, the tree ensemble model uses the mean of the $M$ functions' results to predict the output, i.e. $y_i = f(x_i) = \frac{1}{M} \sum_{k=1}^{M} f_k(x_i),\ f_k \in F$, where $F = \{ f(x) = w_{q(x)},\ q : \mathbb{R}^p \to \{1, \ldots, T\},\ w \in \mathbb{R}^T \}$ is a set of regression trees, $w$ is a set of tree leaves' scores and $T$ is the number of leaves in the tree. In other words, each tree maps an observation to the corresponding leaf's score.
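The regression rule above (each tree maps an observation to a leaf score, and the forest returns the mean over its trees) can be sketched in a few lines. This is an illustration only, not the library's implementation: `toy_tree` and `forest_mean` are hypothetical names, and the leaf-assignment function `q` is supplied by the caller rather than learned.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one regression tree: q maps an observation to a
// leaf index in [0, T), and w holds the T leaf scores, so f(x) = w[q(x)].
struct toy_tree {
    std::function<std::size_t(const std::vector<double>&)> q;
    std::vector<double> w;

    double predict(const std::vector<double>& x) const {
        return w.at(q(x)); // at() throws if q returns an invalid leaf index
    }
};

// Regression forest output: the mean of the M individual tree predictions.
double forest_mean(const std::vector<toy_tree>& forest, const std::vector<double>& x) {
    if (forest.empty()) {
        throw std::invalid_argument("forest must contain at least one tree");
    }
    double sum = 0.0;
    for (const auto& tree : forest) {
        sum += tree.predict(x);
    }
    return sum / static_cast<double>(forest.size());
}
```

With two toy trees whose selected leaves score 1.0 and 5.0 for some observation, `forest_mean` returns their average.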
Programming Interface
- enum class error_metric_mode
- error_metric_mode::none
- Do not compute the error metric.
- error_metric_mode::out_of_bag_error
- Train produces a table with the cumulative prediction error for out-of-bag observations.
- error_metric_mode::out_of_bag_error_per_observation
- Train produces a table with the prediction error for each out-of-bag observation.
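To make the difference between the two out-of-bag modes concrete, the sketch below computes a per-observation error and then collapses it into one cumulative number. It is a hedged illustration for the classification case: the zero-one loss, the use of the mean as the cumulative value, and both function names are assumptions of this example, not statements about how the library computes its tables.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Per-observation view: 1.0 where the out-of-bag prediction missed the true
// label, 0.0 where it matched. Zero-one loss is an assumption of this sketch.
std::vector<double> oob_error_per_observation(const std::vector<int>& truth,
                                              const std::vector<int>& oob_prediction) {
    if (truth.size() != oob_prediction.size()) {
        throw std::invalid_argument("truth and oob_prediction must have equal size");
    }
    std::vector<double> err(truth.size());
    for (std::size_t i = 0; i < truth.size(); ++i) {
        err[i] = (truth[i] == oob_prediction[i]) ? 0.0 : 1.0;
    }
    return err;
}

// Cumulative view: a single value, taken here as the mean per-observation error.
double oob_error(const std::vector<double>& per_observation) {
    if (per_observation.empty()) {
        throw std::invalid_argument("no out-of-bag observations");
    }
    double sum = 0.0;
    for (double e : per_observation) {
        sum += e;
    }
    return sum / static_cast<double>(per_observation.size());
}
```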
- enum class variable_importance_mode
- variable_importance_mode::none
- Do not compute variable importance.
- variable_importance_mode::mdi
- Mean Decrease Impurity. Computed as the sum of weighted impurity decreases for all nodes where the variable is used, averaged over all trees in the forest.
- variable_importance_mode::mda_raw
- Mean Decrease Accuracy (permutation importance). For each tree, the prediction error on the out-of-bag portion of the data is computed (error rate for classification, MSE for regression). The same is done after permuting each predictor variable. The differences between the two errors are then averaged over all trees.
- variable_importance_mode::mda_scaled
- Mean Decrease Accuracy (permutation importance). This is the mda_raw value scaled by its standard deviation.
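The relationship between the raw and scaled MDA values can be sketched as follows for a single variable, given the per-tree differences in out-of-bag error. This is a minimal illustration under stated assumptions: the names `mda_values` and `mda_from_tree_differences` are hypothetical, the sample standard deviation (divisor n - 1) is one possible choice, and returning 0 when the deviation is zero is a convention of this sketch only.

```cpp
#include <cmath>
#include <stdexcept>
#include <vector>

struct mda_values {
    double raw;    // mean per-tree increase in out-of-bag error
    double scaled; // raw value divided by the standard deviation of the increases
};

// diff[t] = (out-of-bag error of tree t after permuting the variable)
//         - (out-of-bag error of tree t before permuting it).
mda_values mda_from_tree_differences(const std::vector<double>& diff) {
    if (diff.size() < 2) {
        throw std::invalid_argument("need differences from at least two trees");
    }
    const double n = static_cast<double>(diff.size());

    double mean = 0.0;
    for (double d : diff) {
        mean += d;
    }
    mean /= n;

    double squared_dev = 0.0;
    for (double d : diff) {
        squared_dev += (d - mean) * (d - mean);
    }
    // Sample standard deviation; the choice of divisor is an assumption.
    const double sd = std::sqrt(squared_dev / (n - 1.0));

    return { mean, sd > 0.0 ? mean / sd : 0.0 };
}
```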
- enum class infer_mode
- infer_mode::class_labels
- Infer produces a table with the predicted labels.
- infer_mode::class_probabilities
- Infer produces a table with the predicted class probabilities for each observation.
- enum class voting_mode
- voting_mode::weighted
- The final prediction is determined by weighted majority voting.
- voting_mode::unweighted
- The final prediction is determined by simple majority voting.
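The two voting modes can differ on the same forest, as the sketch below shows. It assumes that each tree reports a probability per class, that unweighted voting gives each tree one vote for its most probable class, and that weighted voting sums the per-tree probabilities. These readings, and the names `vote_unweighted` and `vote_weighted`, are assumptions of this example rather than a description of the library's internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// tree_probabilities[t][c] is the probability that tree t assigns to class c.
// All rows are assumed to be non-empty and of equal length.
using tree_probabilities = std::vector<std::vector<double>>;

// Index of the largest element; ties resolve to the lowest index.
static std::size_t argmax(const std::vector<double>& v) {
    return static_cast<std::size_t>(
        std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}

// Simple majority: every tree casts one vote for its most probable class.
std::size_t vote_unweighted(const tree_probabilities& probs) {
    std::vector<double> votes(probs.at(0).size(), 0.0);
    for (const auto& p : probs) {
        votes[argmax(p)] += 1.0;
    }
    return argmax(votes);
}

// Weighted majority: every tree contributes its full probability vector.
std::size_t vote_weighted(const tree_probabilities& probs) {
    std::vector<double> total(probs.at(0).size(), 0.0);
    for (const auto& p : probs) {
        for (std::size_t c = 0; c < total.size(); ++c) {
            total[c] += p.at(c);
        }
    }
    return argmax(total);
}
```

For three trees with probabilities (0.6, 0.4), (0.6, 0.4) and (0.0, 1.0), the simple majority picks class 0 (two votes to one), while the weighted sum favors class 1 (1.8 against 1.2).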
- template<typename Task = task::by_default> class descriptor_base
- Template Parameters
- Task – Tag-type that specifies the type of the problem to solve. Can be task::classification or task::regression.
Constructors
- descriptor_base()
- Creates a new instance of the class with the default property values.
Properties
- double observations_per_tree_fraction = 1.0
- The fraction of observations per tree.
- Getter & Setter
double get_observations_per_tree_fraction() const
- Invariants
observations_per_tree_fraction > 0.0
observations_per_tree_fraction <= 1.0
- double impurity_threshold = 0.0
- The impurity threshold. A node is split only if the split decreases the impurity by at least this value.
- Getter & Setter
double get_impurity_threshold() const
- Invariants
impurity_threshold >= 0.0
- double min_weight_fraction_in_leaf_node = 0.0
- The minimum weighted fraction of the total sum of weights (over all input observations) required at a leaf node.
- Getter & Setter
double get_min_weight_fraction_in_leaf_node() const
- Invariants
min_weight_fraction_in_leaf_node >= 0.0
min_weight_fraction_in_leaf_node <= 0.5
- double min_impurity_decrease_in_split_node = 0.0
- The minimum impurity decrease in a split node, a threshold for early stopping of tree growth. A node is split if its impurity is above the threshold; otherwise it becomes a leaf.
- Getter & Setter
double get_min_impurity_decrease_in_split_node() const
- Invariants
min_impurity_decrease_in_split_node >= 0.0
- std::int64_t tree_count = 100
- The number of trees in the forest.
- Getter & Setter
std::int64_t get_tree_count() const
- Invariants
tree_count > 0
- std::int64_t features_per_node = task::classification ? sqrt(p) : p/3, where p is the total number of features
- The number of features to consider when looking for the best split for a node.
- Getter & Setter
std::int64_t get_features_per_node() const
- std::int64_t max_tree_depth = 0
- The maximal depth of the tree. If 0, nodes are expanded until all leaves are pure or until every leaf contains at most min_observations_in_leaf_node observations.
- Getter & Setter
std::int64_t get_max_tree_depth() const
- std::int64_t min_observations_in_leaf_node = task::classification ? 1 : 5
- The minimal number of observations in a leaf node.
- Getter & Setter
std::int64_t get_min_observations_in_leaf_node() const
- Invariants
min_observations_in_leaf_node > 0
- std::int64_t min_observations_in_split_node = 2
- The minimal number of observations in a split node.
- Getter & Setter
std::int64_t get_min_observations_in_split_node() const
- Invariants
min_observations_in_split_node > 1
- std::int64_t max_leaf_nodes = 0
- The maximal number of leaf nodes. If 0, the number of leaf nodes is not limited.
- Getter & Setter
std::int64_t get_max_leaf_nodes() const
- std::int64_t max_bins = 256
- The maximal number of discrete bins used to bucket continuous features. Used with the method::hist split-finding method only. Increasing the number results in higher computation costs.
- Getter & Setter
std::int64_t get_max_bins() const
- Invariants
max_bins > 1
- std::int64_t min_bin_size = 5
- The minimal number of observations in a bin. Used with the method::hist split-finding method only.
- Getter & Setter
std::int64_t get_min_bin_size() const
- Invariants
min_bin_size > 0
- bool memory_saving_mode = false
- The memory-saving (but slower) mode.
- Getter & Setter
bool get_memory_saving_mode() const
- bool bootstrap = true
- The bootstrap mode. If true, the training set for each tree is a bootstrap sample of the whole training set; if false, the whole dataset is used to build each tree.
- Getter & Setter
bool get_bootstrap() const
- error_metric_mode error_metric_mode
- The error metric mode.
- Getter & Setter
error_metric_mode get_error_metric_mode() const
- variable_importance_mode variable_importance_mode
- The variable importance mode.
- Getter & Setter
variable_importance_mode get_variable_importance_mode() const
- std::int64_t class_count = 2
- The class count. Used with task::classification only.
- Getter & Setter
template<typename T = Task, typename None = detail::enable_if_classification_t<T>> std::int64_t get_class_count() const
- infer_mode infer_mode
- The infer mode. Used with task::classification only.
- Getter & Setter
template<typename T = Task, typename None = detail::enable_if_classification_t<T>> infer_mode get_infer_mode() const
- voting_mode voting_mode
- The voting mode. Used with task::classification only.
- Getter & Setter
template<typename T = Task, typename None = detail::enable_if_classification_t<T>> voting_mode get_voting_mode() const
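The invariants listed for the properties above can be checked in one place before training. The sketch below is a hypothetical, self-contained illustration: the `forest_params` struct and both function names are made up for this example and are not part of the library. It also shows one possible reading of the documented `features_per_node` default, where truncating the result and clamping it to at least 1 are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical plain struct mirroring a subset of the documented properties
// and their default values.
struct forest_params {
    double observations_per_tree_fraction = 1.0;
    double impurity_threshold = 0.0;
    double min_weight_fraction_in_leaf_node = 0.0;
    double min_impurity_decrease_in_split_node = 0.0;
    std::int64_t tree_count = 100;
    std::int64_t min_observations_in_split_node = 2;
    std::int64_t max_bins = 256;
    std::int64_t min_bin_size = 5;
};

// Returns true when every invariant documented for these properties holds.
bool satisfies_invariants(const forest_params& p) {
    return p.observations_per_tree_fraction > 0.0 &&
           p.observations_per_tree_fraction <= 1.0 &&
           p.impurity_threshold >= 0.0 &&
           p.min_weight_fraction_in_leaf_node >= 0.0 &&
           p.min_weight_fraction_in_leaf_node <= 0.5 &&
           p.min_impurity_decrease_in_split_node >= 0.0 &&
           p.tree_count > 0 &&
           p.min_observations_in_split_node > 1 &&
           p.max_bins > 1 &&
           p.min_bin_size > 0;
}

// Documented default: sqrt(p) for classification, p/3 for regression.
// Truncation and the lower bound of 1 are assumptions of this sketch.
std::int64_t default_features_per_node(std::int64_t feature_count, bool is_classification) {
    if (feature_count <= 0) {
        throw std::invalid_argument("feature_count must be positive");
    }
    const std::int64_t value = is_classification
        ? static_cast<std::int64_t>(std::sqrt(static_cast<double>(feature_count)))
        : feature_count / 3;
    return std::max<std::int64_t>(value, 1);
}
```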
- template<typename Float = detail::descriptor_base<>::float_t, typename Method = detail::descriptor_base<>::method_t, typename Task = detail::descriptor_base<>::task_t> class descriptor
- Template Parameters
- Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
- Method – Tag-type that specifies an implementation of the algorithm. Can be method::dense or method::hist.
- Task – Tag-type that specifies the type of the problem to solve. Can be task::classification or task::regression.
Public Methods
- auto &set_observations_per_tree_fraction(double value)
- auto &set_impurity_threshold(double value)
- auto &set_min_weight_fraction_in_leaf_node(double value)
- auto &set_min_impurity_decrease_in_split_node(double value)
- auto &set_tree_count(std::int64_t value)
- auto &set_features_per_node(std::int64_t value)
- auto &set_max_tree_depth(std::int64_t value)
- auto &set_min_observations_in_leaf_node(std::int64_t value)
- auto &set_min_observations_in_split_node(std::int64_t value)
- auto &set_max_leaf_nodes(std::int64_t value)
- auto &set_max_bins(std::int64_t value)
- auto &set_min_bin_size(std::int64_t value)
- auto &set_memory_saving_mode(bool value)
- auto &set_bootstrap(bool value)
- template<typename T = Task, typename None = detail::enable_if_classification_t<T>> auto &set_class_count(std::int64_t value)
- template<typename T = Task, typename None = detail::enable_if_classification_t<T>> auto &set_infer_mode(infer_mode value)
- template<typename T = Task, typename None = detail::enable_if_classification_t<T>> auto &set_voting_mode(voting_mode value)
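Two patterns are visible in the setter signatures above: every setter returns a reference so calls can be chained, and the classification-only setters are guarded by an extra defaulted template parameter. The sketch below reproduces both patterns in a standalone mock. Everything in the `sketch` namespace is hypothetical and only imitates the shape of the documented interface; in particular, the definition of `enable_if_classification_t` is an assumption about what such a helper might look like.

```cpp
#include <cstdint>
#include <type_traits>

namespace sketch {

struct classification {};
struct regression {};

// Assumed shape of the guard: well-formed only when T is the classification tag.
template <typename T>
using enable_if_classification_t =
    typename std::enable_if<std::is_same<T, classification>::value>::type;

template <typename Task = classification>
class descriptor {
public:
    // Returning *this by reference is what makes call chaining possible.
    auto& set_tree_count(std::int64_t value) { tree_count_ = value; return *this; }
    auto& set_max_tree_depth(std::int64_t value) { max_tree_depth_ = value; return *this; }

    // Usable only when Task is classification; for regression, a call to
    // this member fails to compile instead of failing at run time.
    template <typename T = Task, typename None = enable_if_classification_t<T>>
    auto& set_class_count(std::int64_t value) { class_count_ = value; return *this; }

    std::int64_t get_tree_count() const { return tree_count_; }
    std::int64_t get_max_tree_depth() const { return max_tree_depth_; }

    template <typename T = Task, typename None = enable_if_classification_t<T>>
    std::int64_t get_class_count() const { return class_count_; }

private:
    std::int64_t tree_count_ = 100;
    std::int64_t max_tree_depth_ = 0;
    std::int64_t class_count_ = 2;
};

} // namespace sketch
```

With this mock, `sketch::descriptor<sketch::classification>{}.set_tree_count(50).set_class_count(3)` compiles, whereas calling `set_class_count` on a `sketch::descriptor<sketch::regression>` is rejected at compile time because the guard has no valid type.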
- struct dense
- Tag-type that denotes the dense computational method.
- struct hist
- Tag-type that denotes the hist computational method.
- struct classification
- Tag-type that parameterizes entities used for solving a classification problem.
- struct regression
- Tag-type that parameterizes entities used for solving a regression problem.
- using by_default = classification
- Alias tag-type for the classification task.
- template<typename Task = task::by_default> class model
- Template Parameters
- Task – Tag-type that specifies the type of the problem to solve. Can be task::classification or task::regression.
Constructors
- model()
- Creates a new instance of the class with the default property values.
Properties
- std::int64_t tree_count = 100
- The number of trees in the forest.
- Getter & Setter
std::int64_t get_tree_count() const
- Invariants
tree_count > 0
- std::int64_t class_count = 2
- The class count. Used with task::classification only.
- Getter & Setter
template<typename T = Task, typename None = detail::enable_if_classification_t<T>> std::int64_t get_class_count() const
- template<typename Task = task::by_default> class train_input
- Template Parameters
- Task – Tag-type that specifies the type of the problem to solve. Can be task::classification or task::regression.
Constructors
- train_input(const table &data, const table &labels)
- Creates a new instance of the class with the given data and labels property values.
Properties
- const table &data = table{}
- The training set.
- Getter & Setter
const table & get_data() const
auto & set_data(const table &value)
- const table &labels = table{}
- Vector of labels for the training set.
- Getter & Setter
const table & get_labels() const
auto & set_labels(const table &value)
- template<typename Task = task::by_default> class train_result
- Template Parameters
- Task – Tag-type that specifies the type of the problem to solve. Can be task::classification or task::regression.
Constructors
- train_result()
- Creates a new instance of the class with the default property values.
Properties
- const table &oob_err = table{}
- A table containing the cumulative out-of-bag error value. Computed when error_metric_mode is set to error_metric_mode::out_of_bag_error.
- Getter & Setter
const table & get_oob_err() const
auto & set_oob_err(const table &value)
- const table &oob_err_per_observation = table{}
- A table containing the out-of-bag error value per observation. Computed when error_metric_mode is set to error_metric_mode::out_of_bag_error_per_observation.
- Getter & Setter
const table & get_oob_err_per_observation() const
auto & set_oob_err_per_observation(const table &value)
- const table &var_importance = table{}
- A table containing the variable importance value for each feature. Computed when variable_importance_mode != variable_importance_mode::none.
- Getter & Setter
const table & get_var_importance() const
auto & set_var_importance(const table &value)
- template<typename Task = task::by_default> class infer_input
- Template Parameters
- Task – Tag-type that specifies the type of the problem to solve. Can be task::classification or task::regression.
Constructors
- Creates a new instance of the class with the given model and data property values.
Properties
- const table &data = table{}
- The dataset for inference.
- Getter & Setter
const table & get_data() const
auto & set_data(const table &value)
- template<typename Task = task::by_default> class infer_result
- Template Parameters
- Task – Tag-type that specifies the type of the problem to solve. Can be task::classification or task::regression.
Constructors
- infer_result()
- Creates a new instance of the class with the default property values.
Properties
- const table &labels = table{}
- The table with the predicted labels.
- Getter & Setter
const table & get_labels() const
auto & set_labels(const table &value)
- const table &probabilities
- A table with the predicted class probabilities for each observation.
- Getter & Setter
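template<typename T = Task, typename None = detail::enable_if_classification_t<T>> const table & get_probabilities() const
template<typename T = Task, typename None = detail::enable_if_classification_t<T>> auto & set_probabilities(const table &value)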