Developer Guide

Details

Given the set $X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}$ of $n$ $p$-dimensional feature vectors and a positive integer $k$, the problem is to find a set $C = \{c_1, \ldots, c_k\}$ of $k$ $p$-dimensional vectors that minimize the objective function (overall error)

$$\Phi_X(C) = \sum_{x_i \in X} d^2(x_i, C),$$

where $d^2(x_i, C)$ is the distance from $x_i$ to the closest center in $C$, such as the Euclidean distance. The vectors $c_1, \ldots, c_k$ are called centroids. To start computations, the algorithm requires initial values of the centroids.
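
As a point of reference, here is a minimal NumPy sketch of this objective under the squared Euclidean distance (the function name and array layout are illustrative, not part of the library API):

```python
import numpy as np

def overall_error(X, C):
    """Overall error Phi_X(C): the sum over all points of the squared
    Euclidean distance to the closest center in C.

    X: (n, p) array of feature vectors
    C: (k, p) array of centroids
    """
    # Pairwise squared distances between every point and every center, shape (n, k)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    # For each point, keep only the distance to its closest center, then sum
    return d2.min(axis=1).sum()
```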

Centroid Initialization

Centroid initialization can be done using one of the following methods:
  • Choice of the first $k$ feature vectors from the data set $X$.
  • Random choice of $k$ feature vectors from the data set using the following simple random sampling draw-by-draw algorithm. The algorithm does the following:
    1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
    2. Excludes $x_i$ from $X$ and adds it to the current set of centers.
    3. Resumes from step 1 until the set of centers reaches the desired size $k$.
  • K-Means++ algorithm [Arthur2007], which selects centers with probability proportional to their contribution to the overall error $\Phi_X(C)$ according to the following scheme (see the sketch after this list):
    1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
    2. Excludes $x_i$ from $X$ and adds it to the current set of centers $C$.
    3. For each feature vector $x_i$ in $X$, calculates its minimal distance $d(x_i, C)$ from the current set of centers $C$.
    4. Chooses one of the feature vectors $x_i$ from $X$ with probability $\frac{d^2(x_i, C)}{\Phi_X(C)}$.
    5. Resumes from step 2 until the set of centers $C$ reaches the desired size $k$.
  • Parallel K-Means++ algorithm [Bahmani2012] that does the following:
    1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
    2. Excludes $x_i$ from $X$ and adds it to the current set of centers $C$.
    3. Repeats nRounds times:
      1. For each feature vector $x_i$ from $X$, calculates its minimal distance $d(x_i, C)$ from the current set of centers $C$.
      2. Chooses $L$ = oversamplingFactor $\cdot$ $k$ feature vectors $x_i$ from $X$, each with probability $\frac{L \cdot d^2(x_i, C)}{\Phi_X(C)}$.
      3. Excludes the $x_i$ vectors chosen in the previous step from $X$ and adds them to the current set of centers $C$.
    4. For each $c_i \in C$, sets $w_i$ to the rating of $c_i$: the number of points in $X$ closer to $c_i$ than to any other point in $C$.
    5. Applies the K-Means++ algorithm with weights $w_i$ to the points in $C$ (the candidate set now plays the role of the data set), which means that the following probability is used in step 4: $\frac{w_i \, d^2(x_i, C)}{\sum_{x_i \in X} w_i \, d^2(x_i, C)}$.
    The algorithm parameters define the number of candidates $L$ selected in each round and the number of rounds:
    • Choose oversamplingFactor to make $L = O(k)$.
    • Choose nRounds as $O(\log \Phi_X(C))$, where $\Phi_X(C)$ is the estimate of the goal function when the first center is chosen. [Bahmani2012] recommends setting nRounds to a constant value not greater than 8.
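
To make the K-Means++ scheme above concrete, here is a minimal NumPy sketch (an illustration, not the library's implementation; the function name and the rng argument are hypothetical):

```python
import numpy as np

def kmeans_plus_plus_init(X, k, rng=None):
    """K-Means++ initialization [Arthur2007]: after a uniform first pick,
    each new center is drawn with probability d^2(x_i, C) / Phi_X(C)."""
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # step 1: uniform random choice
    for _ in range(k - 1):
        C = np.asarray(centers)
        # step 3: minimal squared distance of every point to the current centers
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        # step 4: already-chosen points have d2 == 0, so they are never re-drawn,
        # which has the same effect as excluding them from X
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centers)
```

The parallel variant [Bahmani2012] follows the same pattern, except that each round draws $L$ = oversamplingFactor $\cdot$ $k$ candidates at once, and the oversampled candidate set is then reduced to $k$ centers with the weighted K-Means++ step described above.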

Computation

Computation of the goal function includes computation of the Euclidean distance between vectors $\|x_j - m_i\|$. The algorithm uses the following modification of the Euclidean distance between feature vectors $a$ and $b$: $d(a, b) = d_1(a, b) + d_2(a, b)$, where $d_1$ is computed for continuous features as

$$d_1(a, b) = \sqrt{\sum_{i=1}^{p_1} (a_i - b_i)^2}$$

and $d_2$ is computed for binary categorical features as

$$d_2(a, b) = \gamma \sqrt{\sum_{i=1}^{p_2} (a_i - b_i)^2}$$

In these equations, $\gamma$ weighs the impact of binary categorical features on the clustering, $p_1$ is the number of continuous features, and $p_2$ is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.
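
A small sketch of this modified distance, assuming (for illustration only) that the continuous features occupy the first $p_1$ columns and the binary categorical features the remaining $p_2$ columns:

```python
import numpy as np

def modified_distance(a, b, p1, gamma):
    """d(a, b) = d1(a, b) + d2(a, b): Euclidean distance over the continuous
    features plus a gamma-weighted Euclidean distance over the binary features.

    Assumes continuous features come first (columns 0..p1-1) and binary
    categorical features follow; this column layout is a convention chosen
    here, not mandated by the library.
    """
    d1 = np.sqrt(((a[:p1] - b[:p1]) ** 2).sum())          # continuous part
    d2 = gamma * np.sqrt(((a[p1:] - b[p1:]) ** 2).sum())  # binary categorical part
    return d1 + d2
```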
The K-Means clustering algorithm computes centroids using Lloyd's method [Lloyd82]. For each feature vector $x_1, \ldots, x_n$, you can also compute the index of the cluster that contains the feature vector.
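
A minimal sketch of one Lloyd iteration (assignment step plus centroid update); the helper name is illustrative:

```python
import numpy as np

def lloyd_iteration(X, C):
    """One Lloyd step [Lloyd82]: assign each point to its closest centroid,
    then move each centroid to the mean of its assigned points."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)  # cluster index of each feature vector
    new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(C.shape[0])])  # empty clusters keep their centroid here
    return new_C, labels
```

Iterating this step until the assignments stop changing (or a maximum iteration count is reached) yields both the centroids and the per-vector cluster indices.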

Handling empty clusters

If no vectors are assigned to some clusters on a particular iteration, that iteration produces one or more empty clusters. This may occur due to bad initialization of the centroids or the structure of the dataset. In this case, the algorithm uses the following strategy to replace the empty cluster centers and decrease the value of the overall goal function (a sketch follows this list):
  • Feature vectors most distant from their assigned centroids are selected as the new cluster centers. Information about these vectors is gathered automatically during the algorithm execution.
  • In the distributed processing mode, the most distant vectors from the local nodes are computed (Step 1), stored in PartialResult, and collected on the master node (Step 2). For more details, see the PartialResult description at Step 1 [Tan2005].
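
A sketch of the replacement strategy in the non-distributed case, with a hypothetical helper name; it simply promotes the points farthest from their assigned centroids to be the new centers:

```python
import numpy as np

def replace_empty_clusters(X, C, labels):
    """Replace each empty cluster's center with one of the feature vectors
    most distant from its assigned centroid."""
    d2 = ((X - C[labels]) ** 2).sum(axis=1)  # distance of each point to its own centroid
    farthest = np.argsort(-d2)               # most distant points first
    empty = [j for j in range(C.shape[0]) if not np.any(labels == j)]
    for j, i in zip(empty, farthest):
        C[j] = X[i]                           # promote the farthest point to a new center
    return C
```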
