# K-Means Clustering

K-Means is among the most popular and simplest clustering methods. It
is intended to partition a data set into a small number of clusters
such that feature vectors within a cluster have greater similarity
with one another than with feature vectors from other clusters. Each
cluster is characterized by a representative point, called a
centroid, and a cluster radius.

In other words, the clustering methods enable reducing the problem of
analysis of the entire data set to the analysis of clusters.

There are numerous ways to define the measure of similarity and
centroids. For K-Means, the centroid is defined as the mean of
feature vectors within the cluster.

## Details

Given the set
of

n

*-dimensional feature vectors and a positive integer*p

*, the problem is to find a set of*k

k

*-dimensional vectors that minimize the objective function (overall error)*p

where
is the distance from
to the closest center in

*, such as the Euclidean distance. The vectors are called centroids. To start computations, the algorithm requires initial values of centroids.*C

Centroid Initialization

Centroids initialization can be done using these methods:

- Choice of first
feature vectors from the data setk.X - Random choice of
feature vectors from the data set using the following simple random sampling draw-by-draw algorithm. The algorithm does the following:k- Chooses one of the feature vectors from
with equal probability.X - Excludes from
and adds it to the current set of centers.X - Resumes from step 1 until the set of centers reaches the desired size
.k

- K-Means++ algorithm [Arthur2007], which selects centers with the probability proportional to their contribution to the overall error according to the following scheme:
- Chooses one of the feature vectors from
with equal probability.X - Excludes from
and adds it to the current set of centersX.C - For each feature vector in
calculates its minimal distance from the current set of centersX.C - Chooses one of the feature vectors from
with the probability .X - Resumes from step 2 until the set of centers
reaches the desired sizeC.k

- Parallel K-Means++ algorithm [Bahmani2012] that does the following:
- Chooses one of the feature vectors from
with equal probability.X - Excludes from
and adds it to the current set of centersX.C - Repeats
times:nRounds- For each feature vector from
calculates its minimal distance from the current set of centersX.C - Chooses feature vectors from
with the probability .X - Excludes vectors chosen in the previous step from
and adds them to the current set of centersX.C

- For sets to the ratings, the number of points in
closer to than to any other point inX.C - Applies K-Means++ algorithm with weights to the points in
, which means that the following probability is used in step:C

The algorithm parameters define the number of candidatesselected in each round and number of rounds:L- Choose
to make .oversamplingFactor - Choose nRounds as , where is the estimation of the goal function when the first center is chosen. [Bahmani2012] recommends to set
to a constant value not greater thannRounds.8

Computation

Computation of the goal function includes computation of the
Euclidean distance between vectors
.
The algorithm uses the following modification of the Euclidean
distance between feature vectors

*and*a

*: , where is computed for continuous features as*b

and
is computed for binary categorical features as

In these equations,
γ weighs the impact of binary categorical
features on the clustering,
is the number of continuous
features, and
is the number of binary categorical
features. Note that the algorithm does not support non-binary
categorical features.

The K-Means clustering algorithm computes centroids using Lloyd’s
method [Lloyd82].
For each feature vector
, you can also
compute the index of the cluster that contains the feature vector.

In some cases, if no vectors are assigned to some clusters on a
particular iteration, the iteration produces an empty cluster. It
may occur due to bad initialization of centroids or the dataset
structure. In this case, the algorithm uses the following strategy
to replace the empty cluster centers and decrease the value of the
overall goal function:

- Feature vectors, most distant from their assigned centroids, are selected as the new cluster centers. Information about these vectors is gathered automatically during the algorithm execution.
- In the distributed processing mode, most distant vectors from the local nodes are computed (Step 1), stored in
, and collected on the master node (Step 2). For more details, see thePartialResultdescription at Step 1 [Tan2005].PartialResult

## Initialization

The K-Means clustering algorithm requires initialization of
centroids as an explicit step. Initialization flow depends on the
computation mode. Skip this step if you already calculated initial
centroids.

For initialization, the following computation modes are available:

## Computation

The following computation modes are available:

Distributed mode is not available for oneAPI interfaces and for Python* with DPC++ support.

## Examples

oneAPI DPC++

Batch Processing:

oneAPI C++

Batch Processing:

C++ (CPU)

Batch Processing:

Distributed Processing:

Java*

There is no support for Java on GPU.

Batch Processing:

Distributed Processing

Python* with DPC++ support

Batch Processing:

Python*

## Performance Considerations

To get the best overall performance of the K-Means algorithm:

- If input data is homogeneous, provide the input data and store results in homogeneous numeric tables of the same type as specified in the algorithmFPType class template parameter.
- If input data is non-homogeneous, use AOS layout rather than SOA layout.
- For the output assignments table, use a homogeneous numeric table of the int type.

Optimization Notice |
---|

Intel’s compilers may or may not optimize to the same degree for
non-Intel microprocessors for optimizations that are not unique to
Intel microprocessors. These optimizations include SSE2, SSE3, and
SSSE3 instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not
specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and
Reference Guides for more information regarding the specific
instruction sets covered by this notice. Notice revision #20110804 |