Developer Guide and Reference

  • 2021.1
  • 12/04/2020
  • Public Content
Contents

K-Means

The K-Means algorithm solves clustering problem by partitioning
n
feature vectors into
k
clusters minimizing some criterion. Each cluster is characterized by a representative point, called
a centroid
.

Mathematical formulation

Training
Given the training set LaTex Math image. of
p
-dimensional feature vectors and a positive integer
k
, the problem is to find a set LaTex Math image. of
p
-dimensional centroids that minimize the objective function
LaTex Math image.
where LaTex Math image. is the squared Euclidean distance from LaTex Math image. to the closest centroid in
C
,
LaTex Math image.
Expression LaTex Math image. denotes LaTex Math image. norm.
In the general case,
d
may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.
Training method:
Lloyd’s
The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating
Assignment
and
Update
steps, where
t
denotes a index of the current iteration, e.g., LaTex Math image. is the set of centroids at the
t
-th iteration. The method requires the initial centroids LaTex Math image. to be specified at the beginning of the algorithm (
t = 1
).
(1) Assignment step:
Assign each feature vector LaTex Math image. to the nearest centroid. LaTex Math image. denotes the assigned label (cluster index) to the feature vector LaTex Math image. .
LaTex Math image.
Each feature vector from the training set
X
is assigned to exactly one centroid so that
X
is partitioned to
k
disjoint sets (clusters)
LaTex Math image.
(2) Update step:
Recalculate centroids by averaging feature vectors assigned to each cluster.
LaTex Math image.
The steps (1) and (2) are performed until the following
stop condition
,
LaTex Math image.
is satisfied or number of iterations exceeds the maximal value
T
defined by the user.
Inference
Given the inference set LaTex Math image. of
p
-dimensional feature vectors and the set LaTex Math image. of centroids produced at the training stage, the problem is to predict the index LaTex Math image. , LaTex Math image. , of the centroid in accordance with a method-defined rule.
Inference method:
Lloyd’s
Lloyd’s inference method computes the LaTex Math image. as an index of the centroid closest to the feature vector LaTex Math image. ,
LaTex Math image.

Programming Interface

All types and functions in this section are declared in the
oneapi::dal::kmeans
namespace and be available via inclusion of the
oneapi/dal/algo/kmeans.hpp
header file.
Descriptor
template<typename
Float
= detail::descriptor_base<>::float_t, typename
Method
= detail::descriptor_base<>::method_t, typename
Task
= detail::descriptor_base<>::task_t>
class
descriptor
Template Parameters
  • Float
    – The floating-point type that the algorithm uses for intermediate computations. Can be
    float
    or
    double
    .
  • Method
    – Tag-type that specifies an implementation of algorithm. Can be
    method::lloyd
    .
  • Task
    – Tag-type that specifies the type of the problem to solve. Can be
    task::clustering
    .
Constructors
descriptor
(std::int64_t
cluster_count
= 2)
Creates a new instance of the class with the given
cluster_count
.
Public Methods
auto &
set_cluster_count
(int64_t
value
)
auto &
set_max_iteration_count
(int64_t
value
)
auto &
set_accuracy_threshold
(double
value
)
Method tags
struct
lloyd_dense
Tag-type that denotes Lloyd’s computational method.
using
by_default
= lloyd_dense
Alias tag-type for Lloyd’s computational method.
Task tags
struct
clustering
Tag-type that parameterizes entities used for solving clustering problem.
using
by_default
= clustering
Alias tag-type for the clustering task.
Model
template<typename
Task
= task::by_default>
class
model
Template Parameters
Task
– Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
model
()
Creates a new instance of the class with the default property values.
Properties
const
table &
centroids
= table{}
A LaTex Math image. table with the cluster centroids. Each row of the table stores one centroid.
Getter & Setter


const table & get_centroids() const
auto & set_centroids(const table &value)

std::int64_t
cluster_count
= 0
Number of clusters LaTex Math image. in the trained model.
Getter & Setter


std::int64_t get_cluster_count() const

Invariants


cluster_count == centroids.row_count

Training
train(...)
Input
template<typename
Task
= task::by_default>
class
train_input
Template Parameters
Task
– Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
train_input
(
const
table &
data
)
train_input
(
const
table &
data
,
const
table &
initial_centroids
)
Creates a new instance of the class with the given
data
and
initial_centroids
.
Properties
const
table &
data
An LaTex Math image. table with the data to be clustered, where each row stores one feature vector.
Getter & Setter


const table & get_data() const
auto & set_data(const table &data)

const
table &
initial_centroids
A LaTex Math image. table with the initial centroids, where each row stores one centroid.
Getter & Setter


const table & get_initial_centroids() const
auto & set_initial_centroids(const table &data)

Result
template<typename
Task
= task::by_default>
class
train_result
Template Parameters
Task
– Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
train_result
()
Creates a new instance of the class with the default property values.
Properties
const
model<Task> &
model
= model<Task>{}
The trained K-means model.
Getter & Setter


const model< Task > & get_model() const
auto & set_model(const model< Task > &value)

const
table &
labels
= table{}
An LaTex Math image. table with the labels LaTex Math image. assigned to the samples LaTex Math image. in the input data, LaTex Math image. .
Getter & Setter


const table & get_labels() const
auto & set_labels(const table &value)

int64_t
iteration_count
= 0
The number of iterations performed by the algorithm.
Getter & Setter


int64_t get_iteration_count() const
auto & set_iteration_count(std::int64_t value)

Invariants


iteration_count >= 0

double
objective_function_value
The value of the objective function LaTex Math image. , where LaTex Math image. is
model.centroids
(see
kmeans::model::centroids
).
Getter & Setter


double get_objective_function_value() const
auto & set_objective_function_value(double value)

Invariants


objective_function_value >= 0.0

Inference
infer(...)
Input
template<typename
Task
= task::by_default>
class
infer_input
Template Parameters
Task
– Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
infer_input
(
const
model<Task> &
trained_model
,
const
table &
data
)
Creates a new instance of the class with the given
model
and
data
.
Properties
const
model<Task> &
model
= model<Task>{}
An LaTex Math image. table with the data to be assigned to the clusters, where each row stores one feature vector.
Getter & Setter


const model< Task > & get_model() const
auto & set_model(const model< Task > &value)

const
table &
data
= table{}
The trained K-Means model.
Getter & Setter


const table & get_data() const
auto & set_data(const table &value)

Result
template<typename
Task
= task::by_default>
class
infer_result
Template Parameters
Task
– Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
infer_result
()
Creates a new instance of the class with the default property values.
Properties
const
table &
labels
= table{}
An LaTex Math image. table with assignments labels to feature vectors in the input data.
Getter & Setter


const table & get_labels() const
auto & set_labels(const table &value)

double
objective_function_value
= 0.0
The value of the objective function LaTex Math image. , where LaTex Math image. is defined by the corresponding
infer_input::model::centroids
.
Getter & Setter


double get_objective_function_value() const
auto & set_objective_function_value(double value)

Invariants


objective_function_value >= 0.0

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.