Developer Guide

Distributed Processing

This mode assumes that the data set is split into nblocks blocks across computation nodes.
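
As an illustration of the block layout assumed above, the following minimal sketch splits an in-memory data set into nblocks contiguous blocks. The helper name split_into_blocks and the contiguous, near-equal split are assumptions made for this example; the guide does not prescribe how the data is partitioned across nodes.

```python
def split_into_blocks(rows, nblocks):
    """Split a list of observations into nblocks contiguous, near-equal blocks.

    Hypothetical helper for illustration only; not part of the library API.
    """
    if nblocks < 1:
        raise ValueError("nblocks must be at least 1")
    base, extra = divmod(len(rows), nblocks)
    blocks, start = [], 0
    for i in range(nblocks):
        # The first `extra` blocks receive one additional observation.
        size = base + (1 if i < extra else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks


# Five observations with p = 2 features, split across two nodes.
data = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]]
blocks = split_into_blocks(data, nblocks=2)
```

Each element of blocks plays the role of the i-th data block of size n_i x p that is passed to a local node in Step 1.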

Algorithm Parameters

The K-Means clustering algorithm in the distributed processing mode has the following parameters:
| Parameter | Default Value | Description |
|---|---|---|
| computeStep | Not applicable | The parameter required to initialize the algorithm. Can be step1Local (the first step, performed on local nodes) or step2Master (the second step, performed on a master node). |
| algorithmFPType | float | The floating-point type that the algorithm uses for intermediate computations. Can be float or double. |
| method | defaultDense | Available computation methods for K-Means clustering: defaultDense (implementation of Lloyd's algorithm) or lloydCSR (implementation of Lloyd's algorithm for CSR numeric tables). |
| nClusters | Not applicable | The number of clusters. Required to initialize the algorithm. |
| gamma | 1.0 | The weight to be used in distance calculation for binary categorical features. |
| distanceType | euclidean | The measure of closeness between points (observations) being clustered. The only distance type supported so far is the Euclidean distance. |
| assignFlag | false | A flag that enables computation of assignments, that is, assigning cluster indices to respective observations. |
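The parameters and defaults listed above can be summarized in a small container. This is only a sketch: the class KMeansDistributedParams and its snake_case field names are hypothetical and do not correspond to the library's own parameter classes; only the parameter names, allowed values, and defaults are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class KMeansDistributedParams:
    """Hypothetical container mirroring the parameter table; not a library class."""
    n_clusters: int                    # nClusters: required, no default
    compute_step: str                  # computeStep: required, no default
    algorithm_fp_type: str = "float"   # algorithmFPType: "float" or "double"
    method: str = "defaultDense"       # method: "defaultDense" or "lloydCSR"
    gamma: float = 1.0                 # weight for binary categorical features
    distance_type: str = "euclidean"   # only the Euclidean distance is supported
    assign_flag: bool = False          # assignFlag: compute assignments or not

    def __post_init__(self):
        # Reject values that the table does not list as allowed.
        if self.n_clusters < 1:
            raise ValueError("nClusters must be a positive integer")
        if self.compute_step not in ("step1Local", "step2Master"):
            raise ValueError("computeStep must be step1Local or step2Master")
        if self.algorithm_fp_type not in ("float", "double"):
            raise ValueError("algorithmFPType must be float or double")
        if self.method not in ("defaultDense", "lloydCSR"):
            raise ValueError("method must be defaultDense or lloydCSR")
        if self.distance_type != "euclidean":
            raise ValueError("only the euclidean distance type is supported")


local_params = KMeansDistributedParams(n_clusters=3, compute_step="step1Local")
```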
To compute K-Means clustering in the distributed processing mode, use the general schema described in Algorithms as follows:

Step 1 - on Local Nodes

Figure: K-Means Clustering Distributed Workflow Step 1
In this step, the K-Means clustering algorithm accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.
| Input ID | Input |
|---|---|
| data | Pointer to the n_i x p numeric table that represents the i-th data block on the local node. The input can be an object of any class derived from NumericTable. |
| inputCentroids | Pointer to the nClusters x p numeric table with the initial cluster centroids. This input can be an object of any class derived from NumericTable. |
In this step, the K-Means clustering algorithm calculates the partial results and results described below. Pass the Partial Result ID or Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.
| Partial Result ID | Result |
|---|---|
| nObservations | Pointer to the nClusters x 1 numeric table that contains the number of observations assigned to the clusters on the local node. By default, this result is an object of the HomogenNumericTable class, but you can define this result as an object of any class derived from NumericTable except CSRNumericTable. |
| partialSums | Pointer to the nClusters x p numeric table with partial sums of observations assigned to the clusters on the local node. By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable. |
| partialObjectiveFunction (deprecated: partialGoalFunction) | Pointer to the 1 x 1 numeric table that contains the value of the partial goal function for observations processed on the local node. By default, this result is an object of the HomogenNumericTable class, but you can define this result as an object of any class derived from NumericTable except CSRNumericTable. |
| partialCandidatesDistances | Pointer to the nClusters x 1 numeric table that contains the nClusters largest goal function values for the observations processed on the local node, stored in descending order. By default, this result is an object of the HomogenNumericTable class, but you can define this result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable. |
| partialCandidatesCentroids | Pointer to the nClusters x 1 numeric table that contains the observations with the nClusters largest goal function values processed on the local node, stored in descending order of the goal function. By default, this result is an object of the HomogenNumericTable class, but you can define this result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable. |
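The partial results above can be illustrated with a plain-Python sketch of one local step. It assumes that the goal function is the sum of squared Euclidean distances from each observation to its nearest input centroid, and that the candidates are the observations with the largest such distances; the guide does not state either definition explicitly. The function step1_local and the dictionary layout are hypothetical and only reuse the Partial Result IDs as keys.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between an observation and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def step1_local(block, input_centroids):
    """Hypothetical sketch of Step 1 on one local node (not the library API)."""
    n_clusters = len(input_centroids)
    p = len(input_centroids[0])
    n_observations = [0] * n_clusters
    partial_sums = [[0.0] * p for _ in range(n_clusters)]
    partial_objective = 0.0
    scored = []  # (distance to nearest centroid, observation)
    for x in block:
        dists = [squared_distance(x, c) for c in input_centroids]
        k = min(range(n_clusters), key=dists.__getitem__)  # nearest centroid
        n_observations[k] += 1
        for j in range(p):
            partial_sums[k][j] += x[j]
        partial_objective += dists[k]  # assumed goal function contribution
        scored.append((dists[k], x))
    # Keep the nClusters observations with the largest distances, descending.
    scored.sort(key=lambda t: t[0], reverse=True)
    top = scored[:n_clusters]
    return {
        "nObservations": n_observations,
        "partialSums": partial_sums,
        "partialObjectiveFunction": partial_objective,
        "partialCandidatesDistances": [d for d, _ in top],
        "partialCandidatesCentroids": [list(x) for _, x in top],
    }


block = [[0.0, 0.0], [0.0, 3.0], [10.0, 10.0]]
input_centroids = [[0.0, 1.0], [10.0, 10.0]]
partial = step1_local(block, input_centroids)
```

Note that the node returns sums and counts rather than means, so that the master node can combine results from blocks of different sizes.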
| Result ID | Result |
|---|---|
| assignments | Use when assignFlag = true. Pointer to the n_i x 1 numeric table with 32-bit integer assignments of cluster indices to feature vectors in the input data on the local node. By default, this result is an object of the HomogenNumericTable class, but you can define this result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable. |
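A minimal sketch of the assignments result follows. It assumes that each observation is assigned to the input centroid with the smallest squared Euclidean distance; the function compute_assignments is hypothetical and returns None when the flag is off to mirror the assignFlag default of false.

```python
def compute_assignments(block, input_centroids, assign_flag=False):
    """Hypothetical sketch: index of the nearest input centroid per observation.

    Returns None when assign_flag is false, mirroring the default behavior
    in which assignments are not computed.
    """
    if not assign_flag:
        return None
    assignments = []
    for x in block:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c))
                 for c in input_centroids]
        assignments.append(min(range(len(dists)), key=dists.__getitem__))
    return assignments


block = [[0.0, 0.0], [9.0, 9.0], [0.0, 3.0]]
input_centroids = [[0.0, 1.0], [10.0, 10.0]]
assignments = compute_assignments(block, input_centroids, assign_flag=True)
```

The result has one cluster index per observation in the block, which corresponds to the n_i x 1 shape described in the table.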

Step 2 - on Master Node

Figure: K-Means Clustering Distributed Workflow Step 2
In this step, the K-Means clustering algorithm accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.
| Input ID | Input |
|---|---|
| partialResults | A collection that contains results computed in Step 1 on local nodes. |
In this step, the K-Means clustering algorithm calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.
| Result ID | Result |
|---|---|
| centroids | Pointer to the nClusters x p numeric table with the cluster centroids. By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable. |
| objectiveFunction (deprecated: goalFunction) | Pointer to the 1 x 1 numeric table that contains the value of the goal function. By default, this result is an object of the HomogenNumericTable class, but you can define this result as an object of any class derived from NumericTable except CSRNumericTable. |
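The master step can be sketched as a merge of the per-node partial results. The function step2_master below is hypothetical and works on dictionaries keyed by the Partial Result IDs. Filling an empty cluster with the candidate observation that has the largest distance is an assumption of this sketch; the guide does not state how the candidates are consumed.

```python
def step2_master(partial_results, n_clusters, p):
    """Hypothetical sketch of Step 2 on the master node (not the library API)."""
    counts = [0] * n_clusters
    sums = [[0.0] * p for _ in range(n_clusters)]
    objective = 0.0
    candidates = []  # (distance, observation) pairs gathered from all nodes
    for pr in partial_results:
        for k in range(n_clusters):
            counts[k] += pr["nObservations"][k]
            for j in range(p):
                sums[k][j] += pr["partialSums"][k][j]
        objective += pr["partialObjectiveFunction"]
        candidates.extend(zip(pr.get("partialCandidatesDistances", []),
                              pr.get("partialCandidatesCentroids", [])))
    candidates.sort(key=lambda t: t[0], reverse=True)
    centroids = []
    for k in range(n_clusters):
        if counts[k] > 0:
            # New centroid is the mean of all observations assigned to cluster k.
            centroids.append([s / counts[k] for s in sums[k]])
        elif candidates:
            # Assumption: an empty cluster takes the farthest candidate.
            centroids.append(list(candidates.pop(0)[1]))
        else:
            raise ValueError("empty cluster and no candidates available")
    return {"centroids": centroids, "objectiveFunction": objective}


node_a = {"nObservations": [2, 1], "partialSums": [[0.0, 3.0], [10.0, 10.0]],
          "partialObjectiveFunction": 5.0}
node_b = {"nObservations": [1, 1], "partialSums": [[3.0, 0.0], [12.0, 10.0]],
          "partialObjectiveFunction": 2.0}
result = step2_master([node_a, node_b], n_clusters=2, p=2)
```

In this sketch the merged objective value is the sum of the partial values, so it is measured against the input centroids of the step, not against the newly computed ones.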
The algorithm computes assignments using the input centroids. Therefore, to compute assignments using the final computed centroids, after the last call to the Step 2 compute() method on the master node, set assignFlag to true on each local node and make one additional call to the Step 1 compute() and finalizeCompute() methods. To obtain assignments in each step, always set assignFlag to true and call finalizeCompute().
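The ordering described above can be sketched as a single-process simulation of the distributed loop. All function names here are hypothetical stand-ins for the library's Step 1 and Step 2 calls, the nodes are simulated by a list of blocks, and an empty cluster simply keeps its previous centroid, which is a simplification of this sketch.

```python
def nearest(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return min(range(len(centroids)), key=d.__getitem__)


def step1_local(block, centroids, assign_flag=False):
    """Simulated Step 1: partial counts and sums, optionally assignments."""
    p = len(centroids[0])
    counts = [0] * len(centroids)
    sums = [[0.0] * p for _ in centroids]
    assignments = []
    for x in block:
        k = nearest(x, centroids)
        counts[k] += 1
        sums[k] = [s + v for s, v in zip(sums[k], x)]
        assignments.append(k)
    return {"nObservations": counts, "partialSums": sums,
            "assignments": assignments if assign_flag else None}


def step2_master(partials, previous):
    """Simulated Step 2: merge partial results into new centroids."""
    centroids = []
    for k, old in enumerate(previous):
        n = sum(pr["nObservations"][k] for pr in partials)
        total = [sum(pr["partialSums"][k][j] for pr in partials)
                 for j in range(len(old))]
        # Simplification: an empty cluster keeps its previous centroid.
        centroids.append([t / n for t in total] if n else list(old))
    return centroids


def run_distributed_kmeans(blocks, init_centroids, n_iterations):
    centroids = init_centroids
    for _ in range(n_iterations):
        partials = [step1_local(b, centroids) for b in blocks]  # local nodes
        centroids = step2_master(partials, centroids)           # master node
    # One additional Step 1 pass with the flag set, so that assignments
    # are computed against the final centroids.
    final = [step1_local(b, centroids, assign_flag=True) for b in blocks]
    return centroids, [f["assignments"] for f in final]


blocks = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]]
centroids, assignments = run_distributed_kmeans(
    blocks, [[0.0, 0.0], [10.0, 10.0]], n_iterations=3)
```

Without the extra pass, the assignments returned by the last regular iteration would refer to the centroids that were the input of that iteration, not to the final ones.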
To compute assignments using the original inputCentroids on a given node, you can use the K-Means clustering algorithm in the batch processing mode with the subset of the data available on this node. See Batch Processing for more details.
