Intel® Data Analytics Acceleration Library - Decision Forest

By Gennady Fedorov, Published: 08/14/2017, Last Updated: 09/12/2017

Introduction

The decision trees method is known to be unstable: a small perturbation of the training sample or of the parameters used in tree construction can result in large changes in the resulting predictor.

Ensemble (or committee) methods are a general technique for reducing the variance of an estimated prediction function. One approach exploiting this concept is bagging (bootstrap aggregation), introduced by Breiman [1]. Random (decision) forest is Breiman's next step [2]: a substantial improvement of bagging on decision trees, based on a random choice of features.

This approach turns out to perform very well compared to many other algorithms, and it is simple to train and tune. The trees of a random forest can be built in a completely independent manner, which suits multicore and distributed systems well. Random (decision) forests can also produce additional characteristics, such as an estimate of the generalization error and an importance measure (relative decisive power) of the features (variables).

This article describes the decision forest algorithm and how Intel® Data Analytics Acceleration Library (Intel® DAAL) [3, 4, 5] helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.

What is a Decision forest?

It is an ensemble of tree-structured predictors (decision trees) built using the general technique of bootstrap aggregation (bagging) and a random choice of features.

In bagging, each tree is grown independently on a bootstrap sample of the data set. The resulting learning rule is an aggregation of all the tree-based estimators: the average of the predictions for regression, or the majority of votes for classification (a short sketch of both rules follows).
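To make the aggregation rule concrete, here is a minimal sketch in plain C++ (illustrative only, not part of the Intel DAAL API) of both rules applied to the per-tree outputs for a single sample:

#include <cstddef>
#include <map>
#include <vector>

/* Regression: the bagged prediction is the mean of the per-tree outputs. */
double aggregateRegression(const std::vector<double>& treeOutputs)
{
    double sum = 0.0;
    for (size_t i = 0; i < treeOutputs.size(); ++i) sum += treeOutputs[i];
    return sum / treeOutputs.size();
}

/* Classification: the bagged prediction is the majority vote.
   Assumes at least one tree has voted. */
int aggregateClassification(const std::vector<int>& treeVotes)
{
    std::map<int, size_t> counts;
    for (size_t i = 0; i < treeVotes.size(); ++i) ++counts[treeVotes[i]];
    int best = treeVotes[0];
    size_t bestCount = 0;
    for (std::map<int, size_t>::const_iterator it = counts.begin(); it != counts.end(); ++it)
        if (it->second > bestCount) { bestCount = it->second; best = it->first; }
    return best;
}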

In addition, random (decision) forests change the way the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a decision forest, each node is split using the best split among a subset of predictors randomly chosen at that node. Each tree is grown fully, until each node is pure, and the trees are not pruned.
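Conceptually, the split selection at each node looks like the following sketch (illustrative C++, not Intel DAAL internals; bestSplitOnFeature is a hypothetical helper that scans a single feature for its best threshold and impurity decrease):

#include <cstddef>
#include <cstdlib>

/* A candidate split: the feature, the threshold, and how much it reduces impurity. */
struct Split { size_t feature; double threshold; double impurityDecrease; };

Split bestSplitOnFeature(size_t feature); /* hypothetical helper, defined elsewhere */

/* At each node, only m randomly chosen features out of p are considered.
   For brevity this draws features with replacement; real implementations
   draw m distinct features. */
Split chooseSplit(size_t p, size_t m)
{
    Split best = { 0, 0.0, -1.0 };
    for (size_t k = 0; k < m; ++k)
    {
        const size_t feature = std::rand() % p;
        const Split candidate = bestSplitOnFeature(feature);
        if (candidate.impurityDecrease > best.impurityDecrease) best = candidate;
    }
    return best;
}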

Why does it work? It can be proven mathematically that the variance of a bagging model decreases when the correlation between the trees decreases. The component trees of a forest are all randomly different from one another, and the random choice of features decorrelates the individual tree predictions, which in turn results in improved generalization and robustness.

Parameter m, the number of features used for the computation of one split when growing a tree, controls the amount of randomness and hence the tree correlation. Large values of m correspond to little randomness and thus to large tree correlation; in this case the forest behaves very much as if it were a single tree. Small values of m correspond to large randomness in the training process, so the forest component trees are all very different from one another.
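This effect can be made precise. For B identically distributed trees, each with variance \sigma^2 and pairwise correlation \rho, the variance of the bagged average is a standard result (see, e.g., Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning):

\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

As B grows, the second term vanishes, so the achievable variance reduction is limited by the correlation \rho; the random choice of features is exactly what drives \rho down.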

Default values of the m parameter are sqrt(p) for classification and p/3 for regression, where p is the number of features.
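In Intel DAAL this quantity is exposed as the featuresPerNode parameter; leaving it at 0 selects the defaults above. A minimal sketch of what these defaults evaluate to (the function names here are illustrative, not DAAL API):

#include <cmath>
#include <cstddef>

/* Default number of features tried per node, as described above:
   sqrt(p) for classification, p/3 for regression. In Intel DAAL, leaving
   the featuresPerNode parameter at 0 selects these values automatically. */
size_t defaultFeaturesPerNodeClassification(size_t p)
{
    return static_cast<size_t>(std::sqrt(static_cast<double>(p)));
}

size_t defaultFeaturesPerNodeRegression(size_t p)
{
    return p / 3;
}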

Out of bag

It turns out that bagging, besides its primary purpose of increasing accuracy, has valuable byproducts. Some of the training examples do not appear in the bootstrap training set of a particular predictor; to this predictor, these examples are unseen test examples. The predictions for such "out-of-bag" (OOB) examples can be used to form accurate estimates of important quantities.

An estimate of the generalization error based on the training data can be obtained as the OOB error, calculated as follows: for every tree grown, its OOB cases are put down the tree and responses are predicted for them. For each vector in the dataset, the responses predicted by the trees that hold it in their OOB set are aggregated; comparing the aggregated predictions with the true responses over the whole dataset gives the OOB error of the decision forest (a sketch of this computation follows).
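The following sketch shows this computation for classification (illustrative only, not Intel DAAL internals; predict and isOutOfBag are hypothetical helpers):

#include <cstddef>
#include <vector>

int predict(size_t tree, size_t sample);     /* hypothetical: label from one tree */
bool isOutOfBag(size_t tree, size_t sample); /* hypothetical: was the sample OOB for this tree? */

double oobError(size_t nTrees, size_t nSamples, size_t nClasses,
                const std::vector<int>& groundTruth)
{
    size_t errors = 0, counted = 0;
    for (size_t i = 0; i < nSamples; ++i)
    {
        /* Aggregate votes only from the trees for which sample i was out-of-bag */
        std::vector<size_t> votes(nClasses, 0);
        bool hasVote = false;
        for (size_t t = 0; t < nTrees; ++t)
            if (isOutOfBag(t, i)) { ++votes[predict(t, i)]; hasVote = true; }
        if (!hasVote) continue; /* sample appeared in every bootstrap set */
        size_t best = 0;
        for (size_t c = 1; c < nClasses; ++c)
            if (votes[c] > votes[best]) best = c;
        ++counted;
        if (static_cast<int>(best) != groundTruth[i]) ++errors;
    }
    return counted ? static_cast<double>(errors) / counted : 0.0;
}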

Variable importance

An important task in many applications is not only to make the most accurate predictions of the response but also to identify which predictor variables are most important for making those predictions. For example, this is required in scientific tasks in order to understand the underlying process. Random (decision) forest provides two different ways of looking at variable importance.

    Mean Decrease Impurity importance (MDI) - The importance of the j-th variable for predicting Y is measured as the sum of weighted impurity decreases over all nodes where Xj is used, averaged over all trees in the forest. When the Gini index is used as the impurity function, this measure is known as the Gini importance or Mean Decrease Gini; however, it can be defined for any impurity measure.

    Mean Decrease Accuracy (MDA) - The importance of the j-th variable for predicting Y is measured as the average increase in the OOB error over all trees in the forest when the values of the j-th variable are randomly permuted in the OOB set. For that reason, this measure is also known as the permutation importance. In other words, the j-th variable can be considered important for predicting Y if breaking the link between Xj and Y increases the prediction error, and random permutation is the way to break that link (see the sketch below).
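The permutation step for a single tree and a single feature looks like this sketch (illustrative only, not Intel DAAL internals; treeOobError is a hypothetical helper that evaluates one tree on a block of OOB rows):

#include <algorithm>
#include <cstddef>
#include <vector>

double treeOobError(size_t tree, const std::vector<std::vector<double> >& oobRows); /* hypothetical */

/* MDA importance contribution of feature j for one tree: the increase in the
   tree's OOB error after shuffling column j. The forest importance is the
   average of this quantity over all trees. oobRows is taken by value so the
   caller's data stays intact. */
double permutationImportance(size_t tree, size_t j,
                             std::vector<std::vector<double> > oobRows)
{
    const double baseline = treeOobError(tree, oobRows);
    /* Shuffle column j across the OOB rows to break the link between Xj and Y */
    std::vector<double> column(oobRows.size());
    for (size_t i = 0; i < oobRows.size(); ++i) column[i] = oobRows[i][j];
    std::random_shuffle(column.begin(), column.end());
    for (size_t i = 0; i < oobRows.size(); ++i) oobRows[i][j] = column[i];
    return treeOobError(tree, oobRows) - baseline;
}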

Intel® Data Analytics Acceleration Library

Intel® DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning and highly tuned for the latest features of Intel® processors. More about Intel DAAL can be found in [3, 4, 5]. Intel DAAL provides both decision forest classification and decision forest regression algorithms.

Using Decision forest in Intel® Data Analytics Acceleration Library

This section shows how to invoke decision forest classification and regression using Intel DAAL. Follow these steps to invoke the decision forest classification algorithm from Intel® DAAL:

1.	Ensure that you have Intel® DAAL installed and that your environment is prepared. See the details in [3, 4, 5] according to your operating system.
2.	Include header file daal.h into your application:
#include <daal.h>
3.	To simplify the usage of Intel® DAAL namespaces, we will use the following using directives:
using namespace daal;
using namespace daal::algorithms;
4.	We will assume that the training and testing datasets are stored in .csv files, which we read into Intel® DAAL numeric tables:
const size_t nFeatures = 3; /* Number of features in training and testing data sets */
const size_t categoricalFeaturesIndices[] = { 2 };
/* Decision forest parameters */
const size_t nTrees = 100;
const size_t minObservationsInLeafNode = 8;
const size_t nClasses = 5;  /* Number of classes */

void loadData(const std::string& fileName, NumericTablePtr& pData, NumericTablePtr& pDependentVar)
{
  /* Initialize FileDataSource<CSVFeatureManager> to retrieve the input data from a .csv file */
  FileDataSource<CSVFeatureManager> trainDataSource(fileName,
      DataSource::notAllocateNumericTable,
      DataSource::doDictionaryFromContext);

  /* Create Numeric Tables for training data and dependent variables */
  pData.reset(new HomogenNumericTable<>(nFeatures, 0, NumericTable::notAllocate));
  pDependentVar.reset(new HomogenNumericTable<>(1, 0, NumericTable::notAllocate));
  NumericTablePtr mergedData(new MergedNumericTable(pData, pDependentVar));

  /* Retrieve the data from input file */
  trainDataSource.loadDataBlock(mergedData.get());
  /* Mark the categorical features in the table dictionary */
  NumericTableDictionaryPtr pDictionary = pData->getDictionarySharedPtr();
  const size_t n = sizeof(categoricalFeaturesIndices) / sizeof(categoricalFeaturesIndices[0]);
  for (size_t i = 0; i < n; ++i)
      (*pDictionary)[categoricalFeaturesIndices[i]].featureType = data_feature_utils::DAAL_CATEGORICAL;
}
5.	Train the model using the given dataset and parameters, requesting computation of the OOB error and the MDI variable importance:
training::ResultPtr trainModel()
{
    /* Create Numeric Tables for training data and dependent variables */
    NumericTablePtr trainData;
    NumericTablePtr trainDependentVariable;

    loadData(trainDatasetFileName, trainData, trainDependentVariable);

    /* Create an algorithm object to train the decision forest classification model */
    training::Batch<> algorithm(nClasses);

    /* Pass a training data set and dependent values to the algorithm */
    algorithm.input.set(classifier::training::data, trainData);
    algorithm.input.set(classifier::training::labels, trainDependentVariable);

    algorithm.parameter.nTrees = nTrees;
    algorithm.parameter.featuresPerNode = nFeatures;
    algorithm.parameter.minObservationsInLeafNode = minObservationsInLeafNode;
    algorithm.parameter.varImportance = algorithms::decision_forest::training::MDI;
    algorithm.parameter.resultsToCompute = algorithms::decision_forest::training::computeOutOfBagError;

    /* Build the decision forest classification model */
    algorithm.compute();

    /* Retrieve the algorithm results */
    training::ResultPtr trainingResult = algorithm.getResult();
    printNumericTable(trainingResult->get(training::variableImportance), "Variable importance results: ");
    printNumericTable(trainingResult->get(training::outOfBagError), "OOB error: ");
    return trainingResult;
}
6.	Test the model using the given dataset:
void testModel(const training::ResultPtr& trainingResult)
{
    /* Create Numeric Tables for testing data and ground truth values */
    NumericTablePtr testData;
    NumericTablePtr testGroundTruth;

    loadData(testDatasetFileName, testData, testGroundTruth);

    /* Create an algorithm object to predict values of decision forest classification */
    prediction::Batch<> algorithm(nClasses);

    /* Pass a testing data set and the trained model to the algorithm */
    algorithm.input.set(classifier::prediction::data, testData);
    algorithm.input.set(classifier::prediction::model, trainingResult->get(classifier::training::model));

    /* Predict values of decision forest classification */
    algorithm.compute();

    /* Retrieve the algorithm results */
    classifier::prediction::ResultPtr predictionResult = algorithm.getResult();
    printNumericTable(predictionResult->get(classifier::prediction::prediction),
        "Decision forest prediction results (first 10 rows):", 10);
    printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10);
}
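The two functions above can be tied together with a minimal driver. The file names below are placeholders modeled on the corresponding Intel DAAL example, and printNumericTable comes from the service.h helper shipped with the Intel DAAL examples; the two constants must be declared before the functions that use them:

#include "service.h" /* printNumericTable and other helpers from the Intel DAAL examples */

/* Placeholder dataset paths; point these at your own .csv files */
const std::string trainDatasetFileName = "df_classification_train.csv";
const std::string testDatasetFileName  = "df_classification_test.csv";

int main()
{
    training::ResultPtr trainingResult = trainModel();
    testModel(trainingResult);
    return 0;
}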

For decision forest regression, the code is very similar:

1.	We will assume that the training and testing datasets are stored in .csv files, which we read into Intel® DAAL numeric tables:
const size_t nFeatures = 13; /* Number of features in training and testing data sets */
const size_t categoricalFeaturesIndices[] = { 3 };
/* Decision forest parameters */
const size_t nTrees = 100;

2.	Train the model using the given dataset and parameters, requesting computation of the OOB error and the MDA variable importance:

training::ResultPtr trainModel()
{
    /* Create Numeric Tables for training data and dependent variables */
    NumericTablePtr trainData;
    NumericTablePtr trainDependentVariable;

    loadData(trainDatasetFileName, trainData, trainDependentVariable);

    /* Create an algorithm object to train the decision forest regression model with the default method */
    training::Batch<> algorithm;

    /* Pass a training data set and dependent values to the algorithm */
    algorithm.input.set(training::data, trainData);
    algorithm.input.set(training::dependentVariable, trainDependentVariable);

    algorithm.parameter.nTrees = nTrees;
    algorithm.parameter.varImportance = algorithms::decision_forest::training::MDA_Raw;
    algorithm.parameter.resultsToCompute = algorithms::decision_forest::training::computeOutOfBagError;

    /* Build the decision forest regression model */
    algorithm.compute();

    /* Retrieve the algorithm results */
    training::ResultPtr trainingResult = algorithm.getResult();
    printNumericTable(trainingResult->get(training::variableImportance), "Variable importance results: ");
    printNumericTable(trainingResult->get(training::outOfBagError), "OOB error: ");
    return trainingResult;
}
3.	Test the model using the given dataset:
void testModel(const training::ResultPtr& trainingResult)
{
    /* Create Numeric Tables for testing data and ground truth values */
    NumericTablePtr testData;
    NumericTablePtr testGroundTruth;

    loadData(testDatasetFileName, testData, testGroundTruth);

    /* Create an algorithm object to predict values of decision forest regression */
    prediction::Batch<> algorithm;

    /* Pass a testing data set and the trained model to the algorithm */
    algorithm.input.set(prediction::data, testData);
    algorithm.input.set(prediction::model, trainingResult->get(training::model));

    /* Predict values of decision forest regression */
    algorithm.compute();

    /* Retrieve the algorithm results */
    prediction::ResultPtr predictionResult = algorithm.getResult();
    printNumericTable(predictionResult->get(prediction::prediction),
        "Decision forest prediction results (first 10 rows):", 10);
    printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10);
}
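Note that the classification and regression listings name the same training::Batch<> and prediction::Batch<> templates but resolve them to different classes. As in the Intel DAAL examples, each sample is assumed to add a using directive for its own decision forest namespace:

/* For the classification sample: */
using namespace daal::algorithms::decision_forest::classification;
/* For the regression sample: */
using namespace daal::algorithms::decision_forest::regression;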

Performance considerations and results

Note that the decision forest training algorithms implemented in Intel® DAAL can run in two modes, regulated by the memorySavingMode parameter. Its default value is false, which selects the time-efficient mode; this mode requires some extra memory, proportional to the size of the input data. If memorySavingMode is set to true, training does not use this extra memory, but this normally results in longer computation times on big datasets and on datasets with a large number of features.
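Switching modes is a one-line change on the training algorithm object:

/* Trade speed for memory on large datasets with many features */
algorithm.parameter.memorySavingMode = true;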

Note also that computation of the MDA (permutation) variable importance takes longer than that of the MDI variable importance.
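Both importance measures are requested through the same varImportance parameter. The modes available in the decision_forest::training namespace are MDI, MDA_Raw, MDA_Scaled (MDA divided by its standard deviation), and none (the default, which skips the computation):

/* Request permutation importance scaled by its standard deviation */
algorithm.parameter.varImportance = algorithms::decision_forest::training::MDA_Scaled;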

The following performance chart demonstrates the performance advantage of Intel® DAAL decision forest over the ranger* library, version 0.7.1. The results were measured for the classification algorithm on three data sets (PenDigits, Letter, and Isolet) from the UCI Machine Learning Repository.

Conclusion

Decision forest is a powerful method that can be used for both classification and regression, and Intel® DAAL provides an optimized implementation of it. By using Intel® DAAL, developers can take advantage of new features in future generations of Intel® Xeon® processors without having to modify their applications; they only need to link their applications to the latest version of Intel® DAAL.

References

  1. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone. Classification and Regression Trees. Chapman & Hall. 1984.
  2. Leo Breiman. "Random Forests". Machine Learning 45 (1): 5–32, 2001.
  3. https://software.intel.com/en-us/get-started-with-daal-for-linux
  4. https://software.intel.com/en-us/get-started-with-daal-for-windows
  5. https://software.intel.com/en-us/get-started-with-daal-for-macos
