# Details

Given the set $X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}$ of $n$ $p$-dimensional feature vectors and a positive integer $k$, the problem is to find a set $C = \{c_1, \ldots, c_k\}$ of $k$ $p$-dimensional vectors that minimize the objective function (overall error)

$$\Phi_X(C) = \sum_{x_i \in X} d^2(x_i, C),$$

where $d^2(x_i, C)$ is the distance from $x_i$ to the closest center in $C$, such as the Euclidean distance. The vectors $c_1, \ldots, c_k$ are called centroids. To start computations, the algorithm requires initial values of centroids.
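The objective function above can be evaluated directly. Below is a minimal sketch in plain Python using the squared Euclidean distance; the function names are illustrative, not part of any library API:

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two p-dimensional vectors."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def objective(X, C):
    """Overall error Phi_X(C): for every feature vector, take the squared
    distance to its closest center in C, then sum over all vectors."""
    return sum(min(squared_distance(x, c) for c in C) for x in X)

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
C = [(0.5, 0.0), (10.0, 10.0)]
print(objective(X, C))  # 0.25 + 0.25 + 0.0 = 0.5
```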

## Centroid Initialization

- Choice of the first $k$ feature vectors from the data set $X$.
- Random choice of $k$ feature vectors from the data set $X$ using the following simple random sampling draw-by-draw algorithm. The algorithm does the following:
  1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
  2. Excludes $x_i$ from $X$ and adds it to the current set of centers.
  3. Resumes from step 1 until the set of centers reaches the desired size $k$.

- K-Means++ algorithm [Arthur2007], which selects centers with the probability proportional to their contribution to the overall error $\Phi_X(C)$ according to the following scheme:
  1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
  2. Excludes $x_i$ from $X$ and adds it to the current set of centers $C$.
  3. For each feature vector $x_i$ in $X$, calculates its minimal distance $d(x_i, C)$ from the current set of centers $C$.
  4. Chooses one of the feature vectors $x_i$ from $X$ with the probability $\dfrac{d^2(x_i, C)}{\Phi_X(C)}$.
  5. Resumes from step 2 until the set of centers $C$ reaches the desired size $k$.

- Parallel K-Means++ algorithm [Bahmani2012] that does the following:
  1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
  2. Excludes $x_i$ from $X$ and adds it to the current set of centers $C$.
  3. Repeats `nRounds` times:
     1. For each feature vector $x_i$ from $X$, calculates its minimal distance $d(x_i, C)$ from the current set of centers $C$.
     2. Chooses $L = $ `oversamplingFactor` $\cdot\, k$ feature vectors $x_i$ from $X$, each with the probability $\dfrac{d^2(x_i, C)}{\Phi_X(C)}$.
     3. Excludes the vectors chosen in the previous step from $X$ and adds them to the current set of centers $C$.
  4. For each $c_i \in C$, sets $w_i$ to the rating, the number of points in $X$ closer to $c_i$ than to any other point in $C$.
  5. Applies the K-Means++ algorithm with weights $w_i$ to the points in $C$, which means that the following probability is used in step 4:

     $$\frac{w_i\, d^2(x_i, C)}{\sum_{x_i \in C} w_i\, d^2(x_i, C)}$$

The algorithm parameters define the number of candidates $L$ selected in each round and the number of rounds:

- Choose `oversamplingFactor` to make $L = O(k)$.
- Choose `nRounds` as $O(\log \Phi_X(C))$, where $\Phi_X(C)$ is the estimation of the goal function when the first center is chosen. [Bahmani2012] recommends setting `nRounds` to a constant value not greater than 8.
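The K-Means++ selection scheme described above can be sketched as follows. This is a simplified single-process illustration of the [Arthur2007] scheme, not the library's implementation; the function names and the optional `rng` parameter are illustrative:

```python
import random

def squared_distance(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def kmeans_plus_plus_init(X, k, rng=random):
    """Select k initial centers: the first uniformly at random, each
    subsequent one with probability d^2(x_i, C) / Phi_X(C)."""
    X = list(X)
    # Steps 1-2: choose the first center with equal probability and move it to C.
    centers = [X.pop(rng.randrange(len(X)))]
    while len(centers) < k:
        # Step 3: minimal squared distance of each remaining point to C.
        d2 = [min(squared_distance(x, c) for c in centers) for x in X]
        phi = sum(d2)  # overall error Phi_X(C)
        # Step 4: sample one point with probability d^2(x_i, C) / Phi_X(C).
        r = rng.uniform(0.0, phi)
        cum = 0.0
        for i, w in enumerate(d2):
            cum += w
            if r <= cum:
                break
        # Step 5 (resume from step 2): move the chosen point into C.
        centers.append(X.pop(i))
    return centers

centers = kmeans_plus_plus_init([(0, 0), (0, 1), (9, 9), (9, 8)], k=2)
```

Sampling proportionally to the squared distance makes far-away, poorly covered points likely to become the next center, which is what keeps the expected overall error bounded.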

## Computation

The goal function is computed with the Euclidean distance $\|x_j - m_i\|$ between a feature vector and a centroid. The algorithm uses the following modification of the Euclidean distance between feature vectors $a$ and $b$: $d(a, b) = d_1(a, b) + d_2(a, b)$, where $d_1$ is computed for continuous features as

$$d_1(a, b) = \sqrt{\sum_{k=1}^{p_1} (a_k - b_k)^2}$$

and $d_2$ is computed for binary categorical features as

$$d_2(a, b) = \gamma \sqrt{\sum_{k=1}^{p_2} (a_k - b_k)^2},$$

where $\gamma$ weighs the impact of binary categorical features on the clustering, $p_1$ is the number of continuous features, and $p_2$ is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.

For each feature vector $x_1, \ldots, x_n$, you can also compute the index of the cluster that contains the feature vector.
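The modified distance can be sketched as below, assuming the continuous features occupy the first $p_1$ positions of each vector and the binary categorical features the rest; the function name and parameter layout are illustrative:

```python
import math

def mixed_distance(a, b, p1, gamma=1.0):
    """d(a, b) = d1(a, b) + d2(a, b): Euclidean distance over the first
    p1 (continuous) features plus gamma times the Euclidean distance
    over the remaining (binary categorical) features."""
    d1 = math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a[:p1], b[:p1])))
    d2 = gamma * math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a[p1:], b[p1:])))
    return d1 + d2

# Two vectors: 2 continuous features followed by 2 binary features.
a = (0.0, 0.0, 1, 0)
b = (3.0, 4.0, 0, 0)
print(mixed_distance(a, b, p1=2, gamma=0.5))  # 5.0 + 0.5 * 1.0 = 5.5
```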

## Handling Empty Clusters

- Feature vectors that are most distant from their assigned centroids are selected as the new cluster centers. Information about these vectors is gathered automatically during the algorithm execution.
- In the distributed processing mode, the most distant vectors from the local nodes are computed (Step 1), stored in `PartialResult`, and collected on the master node (Step 2). For more details, see the `PartialResult` description at Step 1 [Tan2005].
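The single-node part of this strategy can be sketched as follows. This is a simplified illustration, not the library's code; the function name and data layout are assumptions:

```python
def fix_empty_clusters(X, assignments, centroids):
    """Replace the centroid of each empty cluster with the feature vector
    most distant from its currently assigned centroid."""
    k = len(centroids)
    used = set(assignments)
    # Point indices sorted by squared distance to their assigned centroid,
    # farthest first.
    by_distance = sorted(
        range(len(X)),
        key=lambda i: sum((xj - cj) ** 2
                          for xj, cj in zip(X[i], centroids[assignments[i]])),
        reverse=True,
    )
    farthest = iter(by_distance)
    for c in range(k):
        if c not in used:  # empty cluster: reseed it from a distant point
            i = next(farthest)
            centroids[c] = X[i]
            assignments[i] = c
    return centroids, assignments

X = [(0.0, 0.0), (0.1, 0.0), (9.0, 9.0)]
assignments = [0, 0, 0]           # cluster 1 is empty
centroids = [(0.0, 0.0), (5.0, 5.0)]
fix_empty_clusters(X, assignments, centroids)
# The most distant point (9.0, 9.0) becomes the new center of cluster 1.
```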