Developer Guide

Contents

Distributed Processing

This mode assumes that the data set is split into
nblocks
blocks across computation nodes.

Parameters

Centroid initialization for K-Means clustering in the distributed processing mode has the following parameters:
Parameter
method
Default Value
Description
computeStep
any
Not applicable
The parameter required to initialize the algorithm. Can be:
  • step1Local
    - the first step, performed on local nodes. Applicable for all methods.
  • step2Master
    - the second step, performed on a master node. Applicable for
    deterministic
    and
    random
    methods only.
  • step2Local
    - the second step, performed on local nodes. Applicable for
    plusPlus
    and
    parallelPlus
    methods only.
  • step3Master
    - the third step, performed on a master node. Applicable for
    plusPlus
    and
    ParallelPlus
    methods only.
  • step4Local
    - the forth step, performed on local nodes. Applicable for
    plusPlus
    and
    parallelPlus
    methods only.
  • step5Master
    - the fifth step, performed on a master node. Applicable for
    plusPlus
    and
    parallelPlus
    methods only.
algorithmFPType
any
float
The floating-point type that the algorithm uses for intermediate computations. Can be
float
or
double
.
method
Not applicable
defaultDense
Available initialization methods for K-Means clustering:
  • defaultDense
    - uses first
    nClusters
    feature vectors as initial centroids
  • deterministicCSR
    - uses first
    nClusters
    feature vectors as initial centroids for data in a CSR numeric table
  • randomDense
    - uses random
    nClusters
    feature vectors as initial centroids
  • randomCSR
    - uses random
    nClusters
    feature vectors as initial centroids for data in a CSR numeric table
  • plusPlusDense
    - uses K-Means++ algorithm [ Arthur2007 ]
  • plusPlusCSR
    - uses K-Means++ algorithm for data in a CSR numeric table
  • parallelPlusDense
    - uses parallel K-Means++ algorithm [ Bahmani2012 ]
  • parallelPlusCSR
    - uses parallel K-Means++ algorithm for data in a CSR numeric table
For more details, see the algorithm description .
nClusters
any
Not applicable
The number of centroids. Required.
nRowsTotal
any
0
The total number of rows in all input data sets on all nodes. Required in the distributed processing mode in the first step.
offset
Not applicable
Offset in the total data set specifying the start of a block stored on a given local node. Required.
oversampling
Factor
parallelPlusDense
,
parallelPlusCSR
0.5
A fraction of
nClusters
in each of
nRounds
of parallel K-Means++.
L
=
nClusters
*
oversamplingFactor
points are sampled in a round. For details, see [ Bahmani2012 ], section 3.3.
nRounds
parallelPlusDense
,
parallelPlusCSR
5
The number of rounds for parallel K-Means++. (
L
*
nRounds
) must be greater than
nClusters
. For details, see [ Bahmani2012 ], section 3.3.
firstIteration
plusPlusDense
,
plusPlusCSR
,
parallelPlusDense
,
parallelPlusCSR
false
Set to
true
if
step2Local
is called for the first time.
outputForStep5
Required
parallelPlusDense
,
parallelPlusCSR
false
Set to
true
if
step4Local
is called on the last iteration of the Step 2 - Step 4 loop.
Centroid initialization for K-Means clustering follows the general schema described in Algorithms .
plusPlus methods
K-Means++ Clustering Initialization Distributed General Workflow
parrallelPlus methods
K-Means Parallel Clustering Initialization Distributed General Workflow

Step 1 - on Local Nodes (
deterministic
,
random
,
plusPlus
, and
parallelPlus
methods)

plusPlus methods
K-Means++ Clustering Initialization Distributed Step 1
parrallelPlus methods
K-Means Parallel Clustering Initialization Distributed Step 1
In this step, centroid initialization for K-Means clustering accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms .
Input ID
Input
data
Pointer to the
n
i
x
p
numeric table that represents the
i
-th data block on the local node. While the input for
defaultDense
,
randomDense
,
plusPlusDense
, or
parallelPlusDense
method can be an object of any class derived from
NumericTable
, the input for
deterministicCSR
,
randomCSR
,
plusPlusCSR
, or
parallelPlusCSR
method can only be an object of the
CSRNumericTable
class.
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms .
Result ID
Result
DEPRECATED:
partialClustersNumber
Pointer to the 1 x 1 numeric table that contains the number of centroids computed on the local node. By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
CSRNumericTable
,
PackedTriangularMatrix
, and
PackedSymmetricMatrix
.
DEPRECATED:
partialClusters
USE INSTEAD:
partialCentroids
Pointer to the
nClusters
x
p
numeric table with the centroids computed on the local node. By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
PackedTriangularMatrix
,
PackedSymmetricMatrix
, and
CSRNumericTable
.

Step 2 - on Master Node (
deterministic
and
random
methods)

This step is applicable for
deterministic
and
random
methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms .
Input ID
Input
partialResuts
A collection that contains results computed in Step 1 on local nodes (two numeric tables from each local node).
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms .
Result ID
Result
centroids
Pointer to the
nClusters
x
p
numeric table with centroids. By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
PackedTriangularMatrix
,
PackedSymmetricMatrix
, and
CSRNumericTable
.

Step 2 - on Local Nodes (
plusPlus
and
parallelPlus
methods)

plusPlus methods
K-Means++ Clustering Initialization Distributed Step 2
parrallelPlus methods
K-Means Parallel Clustering Initialization Distributed Step 2
This step is applicable for
plusPlus
and
parallelPlus
methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms .
Input ID
Input
data
Pointer to the
n
i
x
p
numeric table that represents the
i
-th data block on the local node. While the input for
defaultDense
,
randomDense
,
plusPlusDense
, or
parallelPlusDense
method can be an object of any class derived from
NumericTable
, the input for
deterministicCSR
,
randomCSR
,
plusPlusCSR
, or
parallelPlusCSR
method can only be an object of the
CSRNumericTable
class.
inputOfStep2
Pointer to the
m
x
p
numeric table with the centroids calculated in the previous steps ( Step 1 or Step 4 ).
The value of
m
is defined by the method and iteration of the algorithm:
  • plusPlus
    method:
    m
    = 1
  • parallelPlus
    method:
    • m
      = 1 for the first iteration of the Step 2 - Step 4 loop
    • m
      =
      L
      =
      nClusters
      *
      oversamplingFactor
      for other iterations
This input can be an object of any class derived from
NumericTable
, except
CSRNumericTable
,
PackedTriangularMatrix
, and
PackedSymmetricMatrix
.
internalInput
Pointer to the
DataCollection
object with the internal data of the distributed algorithm used by its local nodes in Step 2 and Step 4 . The
DataCollection
is created in Step 2 when
firstIteration
is set to
true
, and then the
DataCollection
should be set from the partial result as an input for next local steps ( Step 2 and Step 4 ).
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms .
Result ID
Result
outputOfStep2ForStep3
Pointer to the 1 x 1 numeric table that contains the
overall error
accumulated on the node. For a description of the overall error, see K-Means Clustering Details .
By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
PackedTriangularMatrix
,
PackedSymmetricMatrix
, and
CSRNumericTable
.
outputOfStep2ForStep5
Applicable for
parallelPlus
methods only and calculated when
outputForStep5Required
is set to true. Pointer to the 1 x
m
numeric table with the ratings of centroid candidates computed on the previous steps and
m
=
oversamplingFactor
*
nClusters
*
nRounds
+ 1. For a description of ratings, see K-Means Clustering Details .
By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
PackedTriangularMatrix
,
PackedSymmetricMatrix
, and
CSRNumericTable
.

Step 3 - on Master Node (
plusPlus
and
parallelPlus
methods)

plusPlus methods
K-Means++ Clustering Initialization Distributed Step 3
parrallelPlus methods
K-Means Parallel Clustering Initialization Distributed Step 3
This step is applicable for
plusPlus
and
parallelPlus
methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms .
Input ID
Input
inputOfStep3FromStep2
A key-value data collection that maps parts of the accumulated error to the local nodes:
i
-th element of this collection is a numeric table that contains overall error accumulated on the
i
-th node.
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms .
Result ID
Result
outputOfStep3ForStep4
A key-value data collection that maps the input from Step 4 to local nodes:
i
-th element of this collection is a numeric table that contains the input from Step 4 on the
i
-th node. Note that Step 3 may produce no input for Step 4 on some local nodes, which means the collection may not contain the
i
-th node entry. The single element of this numeric table
v
Φ
X
(
C
), where the overall error
Φ
X
(
C
) calculated on the node. For a description of the overall error, see K-Means Clustering Details . This value defines the probability to sample a new centroid on the
i
-th node.
outputOfStep3ForStep5
Applicable for
parallelPlus
methods only. Pointer to the service data to be used in Step 5 .

Step 4 - on Local Nodes (
plusPlus
and
parallelPlus
methods)

plusPlus methods
K-Means++ Clustering Initialization Distributed Step 4
parrallelPlus methods
K-Means Parallel Clustering Initialization Distributed Step 4
This step is applicable for
plusPlus
and
parallelPlus
methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms .
Input ID
Input
data
Pointer to the
n
i
x
p
numeric table that represents the
i
-th data block on the local node. While the input for
defaultDense
,
randomDense
,
plusPlusDense
, or
parallelPlusDense
method can be an object of any class derived from
NumericTable
, the input for
deterministicCSR
,
randomCSR
,
plusPlusCSR
, or
parallelPlusCSR
method can only be an object of the
CSRNumericTable
class.
inputOfStep4FromStep3
Pointer to the
l
x
m
numeric table with the values calculated in Step 3 .
The value of
m
is defined by the method of the algorithm:
  • plusPlus
    method:
    m
    = 1
  • parallelPlus
    method:
    m
    L
    ,
    L
    =
    nClusters
    *
    oversamplingFactor
This input can be an object of any class derived from
NumericTable
, except
CSRNumericTable
,
PackedTriangularMatrix
, and
PackedSymmetricMatrix
.
internalInput
Pointer to the
DataCollection
object with the internal data of the distributed algorithm used by its local nodes in Step 2 and Step 4 . The
DataCollection
is created in Step 2 when
firstIteration
is set to
true
, and then the
DataCollection
should be set from the partial result as the input for next local steps ( Step 2 and Step 4 ).
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms .
Result ID
Result
outputOfStep4
Pointer to the
m
x
p
numeric table that contains centroids computed on this local node, where
m
equals to the one in
inputOfStep4FromStep3
. By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
CSRNumericTable
,
PackedTriangularMatrix
, and
PackedSymmetricMatrix
.

Step 5 - on Master Node (
parallelPlus
methods)

K-Means Parallel Clustering Initialization Distributed Step 5
This step is applicable for
parallelPlus
methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms .
Input ID
Input
inputCentroids
A data collection with the centroids calculated in Step 1 or Step 4 . Each item in the collection is the pointer to
m
x
p
numeric table, where the value of
m
is defined by the method and the iteration of the algorithm:
  • parallelPlus
    method:
    • m
      = 1 for the data added as the output of Step 1
    • m
      L
      ,
      L
      =
      nClusters
      *
      oversamplingFactor
      for the data added as the output of Step 4
Each numeric table can be an object of any class derived from
NumericTable
, except
CSRNumericTable
,
PackedTriangularMatrix
, and
PackedSymmetricMatrix
.
inputOfStep5FromStep2
A data collection with the items calculated in Step 2 on local nodes. For a detailed definition, see
outputOfStep2ForStep5
above.
inputOfStep5FromStep3
Pointer to the service data generated as the output of Step 3 on master node. For a detailed definition, see
outputOfStep3ForStep5
above.
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms .
Result ID
Result
centroids
Pointer to the
nClusters
x
p
numeric table with the centroids. By default, this result is an object of the
HomogenNumericTable
class, but you can define the result as an object of any class derived from
NumericTable
except
PackedTriangularMatrix
,
PackedSymmetricMatrix
, and
CSRNumericTable
.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804