Developer Guide and Reference

  • 2021.2
  • 03/26/2021
  • Public Content

K-Means Clustering

K-Means and K-Means initialization are also available with oneAPI interfaces.
K-Means is among the most popular and simplest clustering methods. It is intended to partition a data set into a small number of clusters such that feature vectors within a cluster have greater similarity with one another than with feature vectors from other clusters. Each cluster is characterized by a representative point, called a centroid, and a cluster radius.
In other words, clustering reduces the problem of analyzing the entire data set to the analysis of its clusters.
There are numerous ways to define the measure of similarity and centroids. For K-Means, the centroid is defined as the mean of feature vectors within the cluster.

Details

Given the set \(X = \{x_1, \ldots, x_n\}\) of \(n\) \(p\)-dimensional feature vectors and a positive integer \(k\), the problem is to find a set \(C = \{c_1, \ldots, c_k\}\) of \(k\) \(p\)-dimensional vectors that minimize the objective function (overall error)

\[\Phi_X(C) = \sum_{x_i \in X} d^2(x_i, C),\]

where \(d^2(x_i, C)\) is the distance from \(x_i\) to the closest center in \(C\), such as the Euclidean distance. The vectors \(c_1, \ldots, c_k\) are called centroids. To start computations, the algorithm requires initial values of centroids.
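The objective function above can be sketched in plain Python. This is an illustration of the definition, not the library implementation; the data and centers below are made up:

```python
# Overall error: the sum over all points of the squared distance
# to the closest center in C.
def squared_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def objective(X, C):
    return sum(min(squared_dist(x, c) for c in C) for x in X)

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
C = [(0.0, 0.0), (10.0, 10.0)]
print(objective(X, C))  # 1.0: only (1, 0) is off its nearest center
```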
Centroid Initialization
Centroid initialization can be done using the following methods:
  • Choice of the first \(k\) feature vectors from the data set \(X\).
  • Random choice of \(k\) feature vectors from the data set using the following simple random sampling draw-by-draw algorithm. The algorithm does the following:
    1. Chooses one of the feature vectors \(x_i\) from \(X\) with equal probability.
    2. Excludes \(x_i\) from \(X\) and adds it to the current set of centers.
    3. Resumes from step 1 until the set of centers reaches the desired size \(k\).
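The draw-by-draw sampling above can be sketched as follows (an illustrative helper, not the library API; the function name and seed handling are assumptions):

```python
import random

def random_init(X, k, seed=0):
    """Draw-by-draw simple random sampling: repeatedly pick one
    remaining feature vector with equal probability, move it from
    the pool to the set of centers, until k centers are chosen."""
    rng = random.Random(seed)
    pool = list(X)
    centers = []
    while len(centers) < k:
        i = rng.randrange(len(pool))   # equal probability over the pool
        centers.append(pool.pop(i))    # exclude from X, add to centers
    return centers

X = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(random_init(X, 2, seed=42))
```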
  • K-Means++ algorithm [Arthur2007], which selects centers with probability proportional to their contribution to the overall error \(\Phi_X(C)\), according to the following scheme:
    1. Chooses one of the feature vectors \(x_i\) from \(X\) with equal probability.
    2. Excludes \(x_i\) from \(X\) and adds it to the current set of centers \(C\).
    3. For each feature vector \(x_i\) in \(X\), calculates its minimal distance \(d(x_i, C)\) from the current set of centers \(C\).
    4. Chooses one of the feature vectors \(x_i\) from \(X\) with probability \(\frac{d^2(x_i, C)}{\sum_{x_i \in X} d^2(x_i, C)}\).
    5. Resumes from step 2 until the set of centers \(C\) reaches the desired size \(k\).
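The K-Means++ selection loop can be sketched in plain Python. This is an illustration, not the library implementation; the optional `weights` parameter is included so the same routine covers the weighted variant used by the parallel scheme described next:

```python
import random

def squared_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeanspp_init(X, k, seed=0, weights=None):
    """K-Means++ sketch: the first center is chosen uniformly; each
    next center is drawn with probability proportional to
    w_i * d^2(x_i, C), with w_i = 1 in the unweighted case."""
    rng = random.Random(seed)
    w = weights or [1.0] * len(X)
    pool = list(zip(X, w))
    i = rng.randrange(len(pool))          # step 1: uniform choice
    centers = [pool.pop(i)[0]]            # step 2: move it to C
    while len(centers) < k:
        # step 3: minimal squared distance to the current centers
        d2 = [wi * min(squared_dist(x, c) for c in centers)
              for x, wi in pool]
        # step 4: draw an index with probability d2[j] / sum(d2)
        r = rng.random() * sum(d2)
        acc = 0.0
        j = 0
        for j, v in enumerate(d2):
            acc += v
            if acc >= r:
                break
        centers.append(pool.pop(j)[0])    # step 5: repeat until |C| = k
    return centers
```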
  • Parallel K-Means++ algorithm [Bahmani2012] that does the following:
    1. Chooses one of the feature vectors \(x_i\) from \(X\) with equal probability.
    2. Excludes \(x_i\) from \(X\) and adds it to the current set of centers \(C\).
    3. Repeats nRounds times:
       1. For each feature vector \(x_i\) from \(X\), calculates its minimal distance \(d(x_i, C)\) from the current set of centers \(C\).
       2. Chooses \(L\) feature vectors \(x_i\) from \(X\) with probability \(\frac{L\, d^2(x_i, C)}{\sum_{x_i \in X} d^2(x_i, C)}\).
       3. Excludes the vectors chosen in the previous step from \(X\) and adds them to the current set of centers \(C\).
    4. For each \(c_i \in C\), sets \(w_i\) to the rating, the number of points in \(X\) closer to \(c_i\) than to any other point in \(C\).
    5. Applies the K-Means++ algorithm with weights \(w_i\) to the points in \(C\), which means that the following probability is used in step 4:
       \[\frac{w_i\, d^2(x_i, C)}{\sum_{x_i \in X} w_i\, d^2(x_i, C)}\]
    The algorithm parameters define the number of candidates \(L\) selected in each round and the number of rounds:
    • Choose oversamplingFactor to make \(L = k \cdot \mathrm{oversamplingFactor}\).
    • Choose nRounds as \(O(\log \phi)\), where \(\phi\) is the estimation of the goal function when the first center is chosen. [Bahmani2012] recommends setting nRounds to a constant value not greater than 8.
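The parallel scheme above can be sketched serially in Python. This is a simplified illustration under stated assumptions: `parallel_init` and `oversampling_factor` are made-up names, and the final top-k-by-rating reduction stands in for the full weighted K-Means++ pass of step 5:

```python
import random

def squared_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def parallel_init(X, k, oversampling_factor=2, n_rounds=5, seed=0):
    """Parallel K-Means++ sketch [Bahmani2012]: each round samples
    roughly L = k * oversampling_factor candidates, each point drawn
    independently with probability L * d^2(x, C) / sum(d^2(x, C));
    the oversampled candidate set is then reduced to k centers."""
    rng = random.Random(seed)
    L = k * oversampling_factor
    pool = list(X)
    centers = [pool.pop(rng.randrange(len(pool)))]   # uniform first center
    for _ in range(n_rounds):
        d2 = [min(squared_dist(x, c) for c in centers) for x in pool]
        total = sum(d2) or 1.0
        chosen = [i for i, v in enumerate(d2)
                  if rng.random() < L * v / total]    # independent draws
        for i in reversed(chosen):                    # exclude from X, add to C
            centers.append(pool.pop(i))
    # rating w_i: number of points closest to candidate c_i
    w = [0] * len(centers)
    for x in X:
        j = min(range(len(centers)),
                key=lambda i: squared_dist(x, centers[i]))
        w[j] += 1
    # reduce candidates to k (stand-in for the weighted K-Means++ pass)
    order = sorted(range(len(centers)), key=lambda i: -w[i])
    return [centers[i] for i in order[:k]]
```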
Computation
Computation of the goal function includes computation of the Euclidean distance between vectors \(x_i\) and \(c_j\). The algorithm uses the following modification of the Euclidean distance between feature vectors \(a\) and \(b\): \(d(a, b) = d_1(a, b) + d_2(a, b)\), where \(d_1\) is computed for continuous features as

\[d_1(a, b) = \sqrt{\sum_{k=1}^{p_1} (a_k - b_k)^2}\]

and \(d_2\) is computed for binary categorical features as

\[d_2(a, b) = \gamma \sqrt{\sum_{k=1}^{p_2} (a_k - b_k)^2}\]

In these equations, \(\gamma\) weighs the impact of binary categorical features on the clustering, \(p_1\) is the number of continuous features, and \(p_2\) is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.
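The modified distance can be sketched as follows (an illustrative helper with made-up names, assuming the continuous and binary categorical feature columns have already been split apart):

```python
import math

def modified_distance(a_cont, b_cont, a_cat, b_cat, gamma=1.0):
    """d(a, b) = d1 + d2: plain Euclidean distance over the p1
    continuous features, plus a gamma-weighted Euclidean distance
    over the p2 binary categorical features."""
    d1 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a_cont, b_cont)))
    d2 = gamma * math.sqrt(sum((x - y) ** 2 for x, y in zip(a_cat, b_cat)))
    return d1 + d2

# With gamma = 0, the binary categorical features are ignored entirely.
print(modified_distance((0.0, 0.0), (3.0, 4.0), (1,), (0,), gamma=0.0))  # 5.0
```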
The K-Means clustering algorithm computes centroids using Lloyd's method [Lloyd82]. For each feature vector \(x_i\), you can also compute the index of the cluster that contains it.
In some cases, if no vectors are assigned to a cluster on a particular iteration, that iteration produces an empty cluster. This may occur due to bad initialization of centroids or the structure of the dataset. In this case, the algorithm uses the following strategy to replace the empty cluster centers and decrease the value of the overall goal function:
  • The feature vectors most distant from their assigned centroids are selected as the new cluster centers. Information about these vectors is gathered automatically during the algorithm execution.
  • In the distributed processing mode, the most distant vectors from the local nodes are computed (Step 1), stored in PartialResult, and collected on the master node (Step 2). For more details, see the PartialResult description at Step 1 [Tan2005].
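Lloyd's iteration with the empty-cluster replacement strategy can be sketched in a single-node form. This is an illustration, not the library implementation; `lloyd` and its fixed iteration count are assumptions:

```python
def squared_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lloyd(X, centers, n_iters=10):
    """Lloyd's method sketch: assign each point to its closest center,
    recompute each center as the mean of its cluster, and replace any
    empty cluster with the point most distant from its assigned center."""
    centers = [list(c) for c in centers]
    assignments = [0] * len(X)
    for _ in range(n_iters):
        clusters = [[] for _ in centers]
        for idx, x in enumerate(X):
            j = min(range(len(centers)),
                    key=lambda i: squared_dist(x, centers[i]))
            assignments[idx] = j
            clusters[j].append(x)
        # candidate replacement: point most distant from its centroid
        far = max(range(len(X)),
                  key=lambda i: squared_dist(X[i], centers[assignments[i]]))
        for j, members in enumerate(clusters):
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
            else:
                centers[j] = list(X[far])
    return centers, assignments

X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = lloyd(X, [(0.0, 0.0), (10.0, 10.0)], n_iters=5)
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
print(labels)   # [0, 0, 1, 1]
```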

Initialization

The K-Means clustering algorithm requires initialization of centroids as an explicit step. The initialization flow depends on the computation mode. Skip this step if you have already calculated the initial centroids.
For initialization, the batch and distributed computation modes are available.

Computation

The batch and distributed computation modes are available. Distributed mode is not available for oneAPI interfaces and for Python* with DPC++ support.

Examples

  • oneAPI DPC++
  • C++ (CPU)
  • Java* (there is no support for Java on GPU): Batch Processing, Distributed Processing
  • Python* with DPC++ support: Batch Processing
  • Python*: Batch Processing, Distributed Processing

Performance Considerations

To get the best overall performance of the K-Means algorithm:
  • If input data is homogeneous, provide the input data and store results in homogeneous numeric tables of the same type as specified in the algorithmFPType class template parameter.
  • If input data is non-homogeneous, use AOS layout rather than SOA layout.
  • For the output assignments table, use a homogeneous numeric table of the int type.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex​.
Notice revision #20201201
