Developer Guide and Reference

  • 2021.2
  • 03/26/2021
  • Public Content
Contents

K-Means

The K-Means algorithm solves clustering problem by partitioning
n
feature vectors into
k
clusters minimizing some criterion. Each cluster is characterized by a representative point, called
a centroid
.

Mathematical formulation

Training
Given the training set LaTex Math image. of
p
-dimensional feature vectors and a positive integer
k
, the problem is to find a set LaTex Math image. of
p
-dimensional centroids that minimize the objective function
LaTex Math image.
where LaTex Math image. is the squared Euclidean distance from LaTex Math image. to the closest centroid in
C
,
LaTex Math image.
Expression LaTex Math image. denotes LaTex Math image. norm.
In the general case,
d
may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.
Training method:
Lloyd’s
The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating
Assignment
and
Update
steps, where
t
denotes a index of the current iteration, e.g., LaTex Math image. is the set of centroids at the
t
-th iteration. The method requires the initial centroids LaTex Math image. to be specified at the beginning of the algorithm (
t = 1
).
(1) Assignment step:
Assign each feature vector LaTex Math image. to the nearest centroid. LaTex Math image. denotes the assigned label (cluster index) to the feature vector LaTex Math image..
LaTex Math image.
Each feature vector from the training set
X
is assigned to exactly one centroid so that
X
is partitioned to
k
disjoint sets (clusters)
LaTex Math image.
(2) Update step:
Recalculate centroids by averaging feature vectors assigned to each cluster.
LaTex Math image.
The steps (1) and (2) are performed until the following
stop condition
,
LaTex Math image.
is satisfied or number of iterations exceeds the maximal value
T
defined by the user.
Inference
Given the inference set LaTex Math image. of
p
-dimensional feature vectors and the set LaTex Math image. of centroids produced at the training stage, the problem is to predict the index LaTex Math image., LaTex Math image., of the centroid in accordance with a method-defined rule.
Inference method:
Lloyd’s
Lloyd’s inference method computes the LaTex Math image. as an index of the centroid closest to the feature vector LaTex Math image.,
LaTex Math image.

Programming Interface

Usage example

Training
kmeans::model<> run_training(const table& data, const table& initial_centroids) { const auto kmeans_desc = kmeans::descriptor<float>{} .set_cluster_count(10) .set_max_iteration_count(50) .set_accuracy_threshold(1e-4); const auto result = train(kmeans_desc, data, initial_centroids); print_table("labels", result.get_labels()); print_table("centroids", result.get_model().get_centroids()); print_value("objective", result.get_objective_function_value()); return result.get_model(); }
Inference
table run_inference(const kmeans::model<>& model, const table& new_data) { const auto kmeans_desc = kmeans::descriptor<float>{} .set_cluster_count(model.get_cluster_count()); const auto result = infer(kmeans_desc, model, new_data); print_table("labels", result.get_labels()); }

Examples

oneAPI DPC++
Batch Processing:
oneAPI C++
Batch Processing:
Python* with DPC++ support
Batch Processing:

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.