## Developer Guide and Reference

• 2021.3
• 06/28/2021
• Public Content
Contents

# K-Means

The K-Means algorithm solves clustering problem by partitioning
n
feature vectors into
k
clusters minimizing some criterion. Each cluster is characterized by a representative point, called
a centroid
.
 Operation Computational methods Programming Interface

## Mathematical formulation

Training
Given the training set of
p
-dimensional feature vectors and a positive integer
k
, the problem is to find a set of
p
-dimensional centroids that minimize the objective function
where is the squared Euclidean distance from to the closest centroid in
C
,
Expression denotes norm.
In the general case,
d
may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.
Training method:
Lloyd’s
The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating
Assignment
and
Update
steps, where
t
denotes a index of the current iteration, e.g., is the set of centroids at the
t
-th iteration. The method requires the initial centroids to be specified at the beginning of the algorithm ().
(1) Assignment step:
Assign each feature vector to the nearest centroid. denotes the assigned label (cluster index) to the feature vector .
Each feature vector from the training set
X
is assigned to exactly one centroid so that
X
is partitioned to
k
disjoint sets (clusters)
(2) Update step:
Recalculate centroids by averaging feature vectors assigned to each cluster.
The steps (1) and (2) are performed until the following
stop condition
,
is satisfied or number of iterations exceeds the maximal value
T
defined by the user.
Inference
Given the inference set of
p
-dimensional feature vectors and the set of centroids produced at the training stage, the problem is to predict the index , , of the centroid in accordance with a method-defined rule.
Inference method:
Lloyd’s
Lloyd’s inference method computes the as an index of the centroid closest to the feature vector ,

## Usage example

Training
``````kmeans::model<> run_training(const table& data,
const table& initial_centroids) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(10)
.set_max_iteration_count(50)
.set_accuracy_threshold(1e-4);

const auto result = train(kmeans_desc, data, initial_centroids);

print_table("labels", result.get_labels());
print_table("centroids", result.get_model().get_centroids());
print_value("objective", result.get_objective_function_value());

return result.get_model();
}``````
Inference
``````table run_inference(const kmeans::model<>& model,
const table& new_data) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(model.get_cluster_count());

const auto result = infer(kmeans_desc, model, new_data);

print_table("labels", result.get_labels());
}``````

## Examples

oneAPI DPC++
Batch Processing:
oneAPI C++
Batch Processing:
Python* with DPC++ support
Batch Processing:

#### Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.