Batch Processing#

Input#

Centroid initialization for K-Means clustering accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm.

Algorithm Input for K-Means Initialization (Batch Processing)#
Input ID	Input
`data`	Pointer to the \(n \times p\) numeric table with the data to be clustered.

Note

The input can be an object of any class derived from NumericTable.

Parameters#

The following table lists the parameters of centroid initialization for K-Means clustering. Some parameters apply only to specific values of the `method` parameter, as shown in the method column.

Algorithm Parameters for K-Means Initialization (Batch Processing)#
Parameter	method	Default Value	Description
`algorithmFPType`	any	`float`	The floating-point type that the algorithm uses for intermediate computations. Can be `float` or `double`.
`method`	Not applicable	`defaultDense`	Available initialization methods for K-Means clustering. For CPU: `defaultDense` (uses the first nClusters points as initial centroids); `deterministicCSR` (uses the first nClusters points as initial centroids for data in a CSR numeric table); `randomDense` (uses random nClusters points as initial centroids); `randomCSR` (uses random nClusters points as initial centroids for data in a CSR numeric table); `plusPlusDense` (uses the K-Means++ algorithm [Arthur2007]); `plusPlusCSR` (uses the K-Means++ algorithm for data in a CSR numeric table); `parallelPlusDense` (uses the parallel K-Means++ algorithm [Bahmani2012]); `parallelPlusCSR` (uses the parallel K-Means++ algorithm for data in a CSR numeric table). For GPU: `defaultDense` (uses the first nClusters points as initial centroids); `randomDense` (uses random nClusters points as initial centroids).
`nClusters`	any	Not applicable	The number of clusters. Required.
`nTrials`	`parallelPlusDense` `parallelPlusCSR`	\(1\)	The number of trials to generate all clusters but the first initial cluster. For details, see [Arthur2007], section 5.
`oversamplingFactor`	`parallelPlusDense` `parallelPlusCSR`	\(0.5\)	The fraction of nClusters sampled in each of the nRounds rounds of parallel K-Means++: L = nClusters * oversamplingFactor points are sampled in a round. For details, see [Bahmani2012], section 3.3.
`nRounds`	`parallelPlusDense` `parallelPlusCSR`	\(5\)	The number of rounds for parallel K-Means++. (L*nRounds) must be greater than nClusters. For details, see [Bahmani2012], section 3.3.
`engine`	any	`SharedPtr<engines::mt19937::Batch>()`	Pointer to the random number generator engine that is used internally for random number generation.
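
The relationship between `nClusters`, `oversamplingFactor`, and `nRounds` in the table above can be checked with simple arithmetic. The following is a minimal sketch in standard C++ that does not use the library: `ParallelPlusParams`, `pointsPerRound`, and `satisfiesRoundConstraint` are hypothetical names introduced for illustration, and keeping L as a floating-point value is an assumption, since the table does not state how L is rounded.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type, not part of the library: the two parallel
// K-Means++ parameters from the table, with their documented defaults.
struct ParallelPlusParams {
    double oversamplingFactor = 0.5;
    std::size_t nRounds = 5;
};

// L = nClusters * oversamplingFactor, the number of points sampled per round.
// Kept as a double here (assumption: the rounding rule is not documented above).
double pointsPerRound(std::size_t nClusters, const ParallelPlusParams& p) {
    assert(p.oversamplingFactor > 0.0);  // assumed precondition for this sketch
    return static_cast<double>(nClusters) * p.oversamplingFactor;
}

// Returns true when L * nRounds is strictly greater than nClusters,
// which is the requirement stated for nRounds.
bool satisfiesRoundConstraint(std::size_t nClusters, const ParallelPlusParams& p) {
    return pointsPerRound(nClusters, p) * static_cast<double>(p.nRounds)
           > static_cast<double>(nClusters);
}
```

With the default values, L * nRounds equals 2.5 * nClusters, so the requirement holds for any positive nClusters. Lowering either parameter can break it: for example, `oversamplingFactor` = 0.25 with `nRounds` = 4 gives L * nRounds equal to nClusters, which is not strictly greater.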

Output#

Centroid initialization for K-Means clustering calculates the result described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm.

Algorithm Output for K-Means Initialization (Batch Processing)#
Result ID	Result
`centroids`	Pointer to the \(nClusters \times p\) numeric table with the cluster centroids.

Note

By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except for PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.
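
To make the \(nClusters \times p\) shape of `centroids` concrete, the sketch below mimics what the dense deterministic and random methods are described as doing: selecting nClusters rows of the \(n \times p\) input. It is an illustration in standard C++, not the library's implementation. `Table`, `firstPointsInit`, and `randomPointsInit` are hypothetical names, `std::mt19937` stands in for the `engine` parameter, and choosing distinct rows in the random case is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Row-major n x p table; a stand-in for a dense numeric table.
using Table = std::vector<std::vector<double>>;

// Deterministic initialization: copy the first nClusters rows of the data.
Table firstPointsInit(const Table& data, std::size_t nClusters) {
    assert(nClusters <= data.size());
    return Table(data.begin(),
                 data.begin() + static_cast<std::ptrdiff_t>(nClusters));
}

// Random initialization: pick nClusters distinct rows (assumption) by
// shuffling the row indices and taking the first nClusters of them.
Table randomPointsInit(const Table& data, std::size_t nClusters,
                       std::mt19937& engine) {
    assert(nClusters <= data.size());
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::shuffle(idx.begin(), idx.end(), engine);

    Table centroids;
    centroids.reserve(nClusters);
    for (std::size_t i = 0; i < nClusters; ++i) {
        centroids.push_back(data[idx[i]]);
    }
    return centroids;
}
```

In both functions the result has nClusters rows and the same number of columns p as the input, which matches the shape given for `centroids` in the output table.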
