KMeans A simple yet effective clustering algorithm


The k-means clustering algorithm is simple: given a set of data points, it finds a number ("k") of centroids which represent the data distribution pretty well. Each centroid is representative of one cluster, and each data point is labelled as belonging to the cluster whose centroid is nearest.


This can be used for unsupervised classification of data, and it can be used "on-line" meaning that results are available even after only a few data points have been added, and you can easily add more data points and the algorithm can update the cluster positions and labels accordingly.


Any dimensionality of data can be clustered (the examples below use 2D data).


// Create an instance. This one will have 3 clusters.

k = KMeans.new(3)


// Feed it a single data point:

k.add([15,32]);

// The data is stored internally, and some initial centroid positions and assignments are generated:

k.data

k.centroids

k.assignments

// Feed it some more:

k.add([7,13]);

k.add([1,11]);

k.add([5,2]);

k.add([1,1]);

k.data

k.centroids

k.assignments

// When you have AT LEAST k points stored (k.centroids.size == k.k), you can run the learning step:

k.update;

// After update, the centroids should have moved to reflect the data better:

k.centroids


// Here's how we add a whole batch of data points -

// here we'll deliberately generate data clustered around three points.

// We add all the data points in a batch, before performing an update.

// For on-the-fly learning you can update after each datum, if you like.

(

var datum;

1000.do{

// The deliberately-designed centres are these three pairs of co-ordinates:

datum = [[5,2], [7,3], [1,1]].choose + [1.0.sum3rand, 1.0.sum3rand];

k.add(datum);

};

k.update;

"The classifier's centroids are:".postln;

k.centroids.do(_.postln); 

"How nicely do they match the deliberately-designed centres?";

)

// If you notice that the result of the above is too strongly affected by the first few data points

// (which define the initialisations for the centroids), you can .reset and recalculate:

(

k.reset.update;

"The classifier's centroids are:".postln;

k.centroids.do(_.postln); 

"How nicely do they match the deliberately-designed centres?";

)