Monday, March 26, 2007

K-Means Clustering

We talked today about unsupervised learning, with some time spent on dimensionality reduction and principal component analysis, but the most interesting subtopic was k-means clustering.

Suppose you have some data set organized into clusters. The first and most apparent problem is that you can't know ahead of time how many clusters you have, which means you can't know how many categories new data will fall into. The basic idea of k-means clustering is that you can algorithmically derive a location (called a prototype) for each cluster which minimizes the sum of the squared distances to the points in that cluster.
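For concreteness, here's a minimal sketch (plain Python, helper names of my own choosing) of the quantity being minimized: the sum of squared distances from each cluster's points to its prototype.

```python
def squared_distance(p, q):
    # Squared Euclidean distance between two points given as
    # equal-length sequences of coordinates.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_cluster_ss(clusters, prototypes):
    # The objective k-means drives down: for every cluster, sum the
    # squared distances from its points to that cluster's prototype.
    return sum(
        squared_distance(point, proto)
        for proto, points in zip(prototypes, clusters)
        for point in points
    )
```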

The procedure is fairly straightforward (a rough code sketch follows the steps):

1. You need to initialize some number of prototypes (M_k). In general this is done by randomly choosing some number of points in the training data.
2. For each M_k, compute the set of all points in the data for which the squared distance to M_k is smaller than to every other prototype. This produces a number of sets equal to the number of initial prototypes, each of which is the 'cluster' S_k associated with that prototype.
3. Re-compute the location of each M_k to be the position which minimizes the sum of the squared distances to all of the points in its matching set S_k (for Euclidean distance this is just the mean of the points in S_k).
4. Iterate 2 and 3 until convergence.
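A rough translation of those four steps into code might look like the following. This is only a sketch under the assumption of squared Euclidean distance; the function and parameter names are my own, and it reuses squared_distance from the earlier snippet.

```python
import random

def mean_point(points):
    # Component-wise mean of a non-empty list of points; for squared
    # Euclidean distance this is the position that minimizes the sum
    # of squared distances to those points (step 3).
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def k_means(data, k, max_iters=100):
    # Step 1: initialize the prototypes M_k by picking k points from the data.
    prototypes = random.sample(data, k)
    clusters = None
    for _ in range(max_iters):
        # Step 2: build each S_k by assigning every point to the prototype
        # with the smallest squared distance.
        new_clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k),
                          key=lambda i: squared_distance(point, prototypes[i]))
            new_clusters[nearest].append(point)
        # Step 4: stop once the assignments no longer change.
        if new_clusters == clusters:
            break
        clusters = new_clusters
        # Step 3: move each prototype to the mean of its cluster,
        # keeping the old prototype if its cluster came up empty.
        prototypes = [mean_point(c) if c else prototypes[i]
                      for i, c in enumerate(clusters)]
    return prototypes, clusters
```

Called as k_means(points, 3) on a list of coordinate lists, this returns the final prototypes along with the cluster each point ended up in.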

Once this algorithm has converged, classification of new data is extremely cheap: any new data point takes the category of the closest prototype, with error equal to the squared distance between the two.
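That lookup might be sketched like so (hypothetical names again, reusing squared_distance and the prototypes returned by the k_means sketch above):

```python
def classify(point, prototypes):
    # Return the index of the nearest prototype along with the squared
    # distance to it, which serves as the error of the assignment.
    errors = [squared_distance(point, proto) for proto in prototypes]
    best = min(range(len(prototypes)), key=lambda i: errors[i])
    return best, errors[best]
```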

There are some great advantages to this scheme, first and foremost that it works with any distance measure and makes sense for data of any dimension. Another important advantage is that the algorithm is extremely straightforward to implement efficiently.

One major problem, though, needs to be considered. With poor initial prototype selection it's very possible to get stuck in extremely bad solutions. Consider data with a large number of clusters where the initial selection places several prototypes within one or a few clusters and leaves several other clusters with no prototype at all. Some clusters will go unrepresented in the final list of prototypes (their points absorbed instead by some nearby prototype), while other clusters will be incorrectly split between two or more prototypes.

This can be partially compensated for by performing what's called hierarchical k-means: select exactly two prototypes and run normal k-means, then run k-means again with k=2 on each of the two resulting subclusters, and continue recursively. This procedure tends to avoid the failure mode where some clusters are missed while multiple prototypes converge into the same one, but it can be significantly more expensive.
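One way this might look in code, reusing the k_means sketch from above (the depth parameter and recursion scheme here are my own way of bounding the splitting, not part of any standard definition):

```python
def hierarchical_k_means(data, depth):
    # Split the data in two with ordinary k-means, then recurse on each
    # half; a depth of d yields at most 2**d leaf clusters.
    if depth == 0 or len(data) < 2:
        return [data]
    _, halves = k_means(data, 2)
    leaves = []
    for half in halves:
        leaves.extend(hierarchical_k_means(half, depth - 1))
    return leaves
```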
