Details

The library provides kNN classification based on multidimensional binary search tree (K-D tree, where D means the dimension and K means the number of dimensions in the feature space). For more details, see [James2013, Patwary2016].

Given n feature vectors x1= (x11,…,x1p), …, xn= (xn1,…,xnp) of size p and a vector of class labels y = (y1,…,yn), where yi {0, 1, ..., C - 1} and C is the number of classes, describes the class to which the feature vector xi belongs, the problem is to build a kNN classifier.

Given a positive integer parameter k and a test observation x0, the kNN classifier does the following:

  1. Identifies the set N0 of the k feature vectors in the training data that are closest to x0 according to the Euclidian distance
  2. Estimates the conditional probability for the class j as the fraction of vectors in N0 whose labels y are equal to j
  3. Assigns the class with the largest probability to the test observation x0

Intel DAAL version of the kNN algorithm uses the PANDA algorithm [Patwary2016] that uses a space-partitioning data structure known as K-D tree. Each non-leaf node of a tree contains the identifier of a feature along which to split the feature space and an appropriate feature value (a cut-point) that defines the splitting hyperplane to partition the feature space into two parts. Each leaf node of the tree has an associated subset (a bucket) of elements of the training data set. Feature vectors from any bucket belong to the region of the space defined by tree nodes on the path from the root node to the respective leaf.

Training Stage

For each non-leaf node, the process of building a K-D tree involves the choice of the feature (that is, dimension in the feature space) and the value for this feature (a cut-point) to split the feature space. This procedure starts with the entire feature space for the root node of the tree, and for every next level of the tree deals with ever smaller part of the feature space.

The PANDA algorithm constructs the K-D tree by choosing the dimension with the maximum variance for splitting [Patwary2016]. Therefore, for each new non-leaf node of the tree, the algorithm computes the variance of values that belong to the respective region of the space for each of the features and chooses the feature with the largest variance. Due to high computational cost of this operation, PANDA uses a subset of feature values to compute the variance.

PANDA uses a sampling heuristic to estimate the data distribution for the chosen feature and chooses the median estimate as the cut-point.

PANDA generates new K-D tree levels until the number of feature vectors in a leaf node gets less or equal to a predefined threshold. Once the threshold is reached, PANDA stops growing the tree and associates the feature vectors with the bucket of the respective leaf node.

Prediction Stage

Given kNN classifier and query vectors x0, ..., xr, the problem is to calculate the labels for those vectors. To solve the problem for each given query vector xi, the algorithm traverses the K-D tree to find feature vectors associated with a leaf node that are closest to xi. During the search, the algorithm limits exploration of the nodes for which the distance between the query vector and respective part of the feature space is not less than the distance from the kth neighbor. This distance is progressively updated during the tree traverse.
For more complete information about compiler optimizations, see our Optimization Notice.
Select sticky button color: 
Orange (only for download buttons)