# Details

Given the set X = {x₁ = (x₁₁, …, x₁ₚ), …, xₙ = (xₙ₁, …, xₙₚ)} of n p-dimensional feature vectors and a positive integer k, the problem is to find a set C = {c₁, …, cₖ} of k p-dimensional vectors (centers) that minimize the objective function (overall error)

Φ_X(C) = Σ_{xᵢ ∈ X} d²(xᵢ, C),

where d²(xᵢ, C) is the distance from xᵢ to the closest center in C, such as the Euclidean distance. The vectors c₁, …, cₖ are called centroids. To start computations, the algorithm requires initial values of the centroids.
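The objective above can be made concrete with a small sketch. The following plain-Python helpers (names are illustrative, not part of any library) compute Φ_X(C) directly from its definition, using the squared Euclidean distance:

```python
def dist2(x, c):
    # Squared Euclidean distance between two p-dimensional vectors.
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def overall_error(X, C):
    # Phi_X(C): for each x in X, take the squared distance to the
    # closest center in C, then sum over all x.
    return sum(min(dist2(x, c) for c in C) for x in X)

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
C = [(0.5, 0.0), (10.5, 0.0)]
print(overall_error(X, C))  # each point is 0.5 from its center -> 4 * 0.25 = 1.0
```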

## Centroid Initialization

- Choice of the first k feature vectors from the data set X.
- Random choice of k feature vectors from the data set X using the following simple random sampling draw-by-draw algorithm. The algorithm does the following:
  1. Chooses one of the feature vectors xᵢ from X with equal probability.
  2. Excludes xᵢ from X and adds it to the current set of centers.
  3. Resumes from step 1 until the set of centers reaches the desired size k.
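The draw-by-draw procedure can be sketched in a few lines of plain Python (the helper name is my own, shown only to mirror the three steps):

```python
import random

def draw_by_draw(X, k, rng=None):
    # Simple random sampling, draw by draw: repeatedly pick one
    # remaining feature vector with equal probability, remove it
    # from the pool, and add it to the set of centers.
    rng = rng or random.Random(0)
    pool = list(X)
    centers = []
    while len(centers) < k:
        i = rng.randrange(len(pool))   # step 1: equal probability
        centers.append(pool.pop(i))    # step 2: exclude and add
    return centers                     # step 3: repeat until size k

X = [(float(i),) for i in range(10)]
centers = draw_by_draw(X, 3)
print(centers)
```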

- K-Means++ algorithm [Arthur2007], which selects centers with probability proportional to their contribution to the overall error Φ_X(C) according to the following scheme:
  1. Chooses one of the feature vectors xᵢ from X with equal probability.
  2. Excludes xᵢ from X and adds it to the current set of centers C.
  3. For each feature vector xᵢ in X, calculates its minimal distance d(xᵢ, C) from the current set of centers C.
  4. Chooses one of the feature vectors xᵢ from X with probability proportional to d²(xᵢ, C).
  5. Resumes from step 2 until the set of centers C reaches the desired size k.
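A minimal sketch of this scheme follows, assuming the squared Euclidean distance; the sampling in step 4 is implemented by inverting the cumulative sum of the distances (function names are illustrative):

```python
import random

def dist2(x, c):
    # squared Euclidean distance between two feature vectors
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans_pp(X, k, rng=None):
    # K-Means++ seeding [Arthur2007]: the first center is chosen
    # uniformly; every further center is drawn with probability
    # proportional to its squared distance to the current centers.
    rng = rng or random.Random(0)
    pool = list(X)
    centers = [pool.pop(rng.randrange(len(pool)))]
    while len(centers) < k:
        d2 = [min(dist2(x, c) for c in centers) for x in pool]
        r, acc = rng.uniform(0.0, sum(d2)), 0.0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                break
        centers.append(pool.pop(i))
    return centers

X = [(0.0,), (0.1,), (10.0,), (10.1,)]
centers = kmeans_pp(X, 2)
print(centers)
```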

- Parallel K-Means++ algorithm [Bahmani2012] that does the following:
  1. Chooses one of the feature vectors xᵢ from X with equal probability.
  2. Excludes xᵢ from X and adds it to the current set of centers C.
  3. Repeats nRounds times:
     1. For each feature vector xᵢ from X, calculates its minimal distance d(xᵢ, C) from the current set of centers C.
     2. Chooses L = oversamplingFactor·k feature vectors xᵢ from X, each with probability proportional to d²(xᵢ, C).
     3. Excludes the xᵢ vectors chosen in the previous step from X and adds them to the current set of centers C.
  4. For each cᵢ ∈ C, sets wᵢ to the rating, the number of points in X closer to cᵢ than to any other point in C.
  5. Applies the K-Means++ algorithm with weights wᵢ to the points in C, which means that the probability used in step 4 becomes proportional to wᵢ·d²(xᵢ, C).

The algorithm parameters define the number of candidates L selected in each round and the number of rounds:

- Choose oversamplingFactor to make L = O(k).
- Choose nRounds as O(log Φ_X(C)), where Φ_X(C) is the estimate of the objective function when the first center is chosen. [Bahmani2012] recommends setting nRounds to a constant value not greater than 8.
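Putting the rounds together, a compact and purely illustrative sketch of the parallel scheme might look as follows; the exact per-round sampling rule of [Bahmani2012] (keep each point independently with probability roughly L·d²/Φ) and the final weighted K-Means++ reclustering are my own straightforward renderings:

```python
import random

def dist2(x, c):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans_parallel(X, k, oversampling_factor=2, n_rounds=5, rng=None):
    # Illustrative sketch of parallel K-Means++: oversample roughly
    # L = oversampling_factor * k candidates per round, then recluster
    # the candidate set down to k centers with weighted K-Means++.
    rng = rng or random.Random(0)
    pool = list(X)
    C = [pool.pop(rng.randrange(len(pool)))]   # one uniform center
    L = oversampling_factor * k
    for _ in range(n_rounds):
        d2 = [min(dist2(x, c) for c in C) for x in pool]
        total = sum(d2) or 1.0
        # keep each remaining vector with probability ~ L * d2 / total
        chosen = [i for i, w in enumerate(d2)
                  if rng.random() < min(1.0, L * w / total)]
        for i in reversed(chosen):             # exclude from X,
            C.append(pool.pop(i))              # add to centers
    # w_j: number of points in X closer to candidate C[j] than to
    # any other candidate
    weights = [0] * len(C)
    for x in X:
        weights[min(range(len(C)), key=lambda j: dist2(x, C[j]))] += 1
    # weighted K-Means++ pass over the candidate set
    remaining = list(range(len(C)))
    idx = [remaining.pop(rng.randrange(len(remaining)))]
    while len(idx) < k:
        d2 = [weights[j] * min(dist2(C[j], C[i]) for i in idx)
              for j in remaining]
        total = sum(d2)
        if total == 0:
            pos = rng.randrange(len(remaining))
        else:
            r, acc = rng.uniform(0.0, total), 0.0
            for pos, w in enumerate(d2):
                acc += w
                if acc >= r:
                    break
        idx.append(remaining.pop(pos))
    return [C[j] for j in idx]

X = [(float(i % 10), float(i // 10)) for i in range(40)]
centers = kmeans_parallel(X, 3)
print(centers)
```

In a real distributed run, the per-round distance computation and sampling happen independently on each node; this single-process sketch only mirrors the control flow.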

## Computation

Given the initial centroids, the algorithm assigns each feature vector xⱼ to the cluster with the nearest centroid mᵢ, that is, the centroid minimizing ‖xⱼ − mᵢ‖. The algorithm uses the following modification of the Euclidean distance between feature vectors a and b: d(a, b) = d₁(a, b) + d₂(a, b), where the component d₁ is computed over the continuous features, the component d₂ is computed over the binary categorical features, p₁ is the number of continuous features, and p₂ is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.

For the feature vectors x₁, …, xₙ, you can also compute the index of the cluster that contains each feature vector.
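To illustrate the split distance, the sketch below assumes a hypothetical layout with the p₁ continuous features stored first in each vector. Note that the text does not give the exact form of d₁ and d₂, so the Euclidean form over each group of components is an assumption of this sketch:

```python
def split_distance(a, b, p1):
    # d(a, b) = d1(a, b) + d2(a, b):
    # d1 over the first p1 (continuous) features,
    # d2 over the remaining (binary categorical) features.
    # Both taken as Euclidean here -- an assumption, see lead-in.
    d1 = sum((a[i] - b[i]) ** 2 for i in range(p1)) ** 0.5
    d2 = sum((a[i] - b[i]) ** 2 for i in range(p1, len(a))) ** 0.5
    return d1 + d2

def assign(X, M, p1):
    # Assign each feature vector to the nearest centroid and return
    # the cluster index for each vector.
    return [min(range(len(M)), key=lambda i: split_distance(x, M[i], p1))
            for x in X]

# one continuous feature, one binary categorical feature
X = [(0.0, 1.0), (0.2, 1.0), (5.0, 0.0), (5.2, 0.0)]
M = [(0.1, 1.0), (5.1, 0.0)]
print(assign(X, M, p1=1))  # -> [0, 0, 1, 1]
```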

## Handling empty clusters

- Feature vectors that are most distant from their assigned centroids are selected as the new cluster centers. Information about these vectors is gathered automatically during the algorithm execution.
- In the distributed processing mode, the most distant vectors are computed on the local nodes (Step 1), stored in `PartialResult`, and collected on the master node (Step 2). For more details, see the `PartialResult` description at Step 1 [Tan2005].
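The batch-mode policy can be sketched as follows. This is an illustrative helper: the distances to assigned centroids are recomputed here, whereas the algorithm described above gathers them as a by-product of execution:

```python
def dist2(x, c):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(x, c))

def fix_empty_clusters(X, centers, assignments):
    # For every cluster that received no points, promote the feature
    # vector most distant from its assigned centroid to be the new
    # center of that cluster.
    counts = [0] * len(centers)
    for a in assignments:
        counts[a] += 1
    # point indices ordered from most to least distant from their
    # assigned centroid
    far = sorted(range(len(X)),
                 key=lambda i: dist2(X[i], centers[assignments[i]]),
                 reverse=True)
    it = iter(far)
    for j, c in enumerate(counts):
        if c == 0:
            centers[j] = X[next(it)]
    return centers

X = [(0.0,), (1.0,), (9.0,)]
centers = [(0.5,), (100.0,)]     # cluster 1 received no points
assignments = [0, 0, 0]
new_centers = fix_empty_clusters(X, centers, assignments)
print(new_centers)               # cluster 1 reseeded with (9.0,)
```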