In ecology and other sciences, grouping similar objects together, whether for further analysis or as an end in itself, is a fundamental task, and it is the job of *cluster analysis*, one of the core tools of statistics. In all but the smallest samples, the number of possible groupings becomes enormous very rapidly, and it is therefore necessary both to (1) have some way of efficiently avoiding the vast number of clearly non-optimal clusterings, and (2) choose the best solution from among those that seem at least reasonable.

First, some background. There are (at least) three basic approaches to clustering. Two of these are inherently *hierarchical* in nature: they either aggregate individual objects into ever-larger groups (*agglomerative* methods), or successively divide the entire set into ever-smaller ones (*divisive* methods). Hierarchical methods are based on a distance matrix that gives the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one’s definition of “distance”. This distance matrix contains n(n-1)/2 unique entries, i.e., it grows roughly as the square of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which the average scientist typically does not.
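To see how fast that blows up, here is a quick sketch (in Python, purely illustrative; the function names are mine):

```python
# n objects give n(n-1)/2 unique pairwise distances; stored as 8-byte
# doubles, the matrix becomes unwieldy long before n reaches a million.
def dist_matrix_entries(n):
    return n * (n - 1) // 2

def dist_matrix_gib(n, bytes_per_entry=8):
    return dist_matrix_entries(n) * bytes_per_entry / 2**30

for n in (1_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: {dist_matrix_entries(n):>17,} entries, "
          f"{dist_matrix_gib(n):12.3f} GiB")
```

At n = 100,000 the unique distances alone occupy roughly 37 GiB.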

The k-means clustering algorithm works differently: it doesn’t use a distance matrix. Instead it chooses a number of cluster starting points (“centers”), often at random, assigns each object to its nearest center, recomputes each center as the mean of the objects assigned to it, and repeats until the assignments stabilize. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then the best solution is not guaranteed. Furthermore, one sets the number of final clusters beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all cluster numbers one deems reasonable, with “reasonable” often being less than obvious.
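The assign-and-update loop, and the reason multiple random starts matter, can be sketched in a few lines. This is a toy one-dimensional version in Python, not R’s *kmeans*; the function names are mine:

```python
import random

def kmeans_1d(data, k, iters=50):
    """One run of Lloyd's algorithm on 1-D data, from random initial centers."""
    centers = random.sample(data, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each object joins the group of its nearest center.
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Update step: each center moves to the mean of its current group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    wss = sum((x - centers[i]) ** 2 for i, g in enumerate(groups) for x in g)
    return groups, wss

def best_of_restarts(data, k, nstart=20):
    """Mimic the spirit of R's kmeans(..., nstart=): run many random starts
    and keep the solution with the lowest within-group sum of squares."""
    return min((kmeans_1d(data, k) for _ in range(nstart)), key=lambda r: r[1])

random.seed(1)
groups, wss = best_of_restarts([0, 1, 2, 7, 8, 9], k=2)
print(sorted(sorted(g) for g in groups), round(wss, 3))
# → [[0, 1, 2], [7, 8, 9]] 4.0
```

Note that choosing k is still up to the user; the restarts only guard against a bad initialization for a *given* k.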

When I first did a k-means cluster analysis, years ago, I did it in SPSS, and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed a pretty major shortcoming, possibly inexcusable. Now I’m working in R, and I find…the same thing. In R, the two workhorse clustering algorithms, both in the base *stats* package, are *kmeans* and *hclust*, corresponding to k-means and hierarchical clustering, respectively. Neither method gives the probability of the solution as part of its output. So it wasn’t just SPSS; if R doesn’t provide it, then it’s quite possible that *no* statistical software program (SAS, S-Plus, SigmaStat, etc.) does, although I don’t know for sure.

There is one function in R that attempts to identify what it calls the “optimal clustering”: function *optCluster*, in the package of the same name. But that function, while definitely useful, only provides a set of different metrics by which to evaluate the effectiveness of any given clustering solution, as obtained from 16 possible clustering methods, with no actual probabilities attached to any of them. What I’m after is different, more defensible, and definitely more probabilistic. It requires some careful thought about what clustering should be about in the first place.

If we talk about grouping objects together, we have to be careful. This piece at *Variance Explained* gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from *any* data set, but the result doesn’t necessarily *mean* anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place, say, 0 to 3 in one group and 4 to 9 in the other, or some similar outcome (which they conceivably might; I haven’t tested them), the main point remains: there are no “natural” groupings in those ten numbers. They are as evenly spaced as possible, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you’ve got the set {0,1,2,7,8,9}, it’s quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of the group mean, with an obvious gap between the two.
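A crude way to see the difference numerically is to compare the best two-group within-group sum of squares to the total sum of squares for each set. A sketch in Python (the ratio and the function names are mine, just for illustration, not a standard statistic):

```python
from statistics import mean

def two_group_wss(data, cut):
    """Within-group sum of squares for the split data[:cut] / data[cut:]."""
    wss = 0.0
    for g in (data[:cut], data[cut:]):
        m = mean(g)
        wss += sum((x - m) ** 2 for x in g)
    return wss

def best_split_ratio(data):
    """Best two-group WSS as a fraction of total SS (lower = tighter groups)."""
    m = mean(data)
    tss = sum((x - m) ** 2 for x in data)
    best = min(two_group_wss(sorted(data), c) for c in range(1, len(data)))
    return best / tss

print(round(best_split_ratio(list(range(10))), 3))     # perfect gradient → 0.242
print(round(best_split_ratio([0, 1, 2, 7, 8, 9]), 3))  # two tight groups → 0.052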

This point is critical, as it indicates that we should seek a clustering evaluation method based on an algorithm capable of discriminating between a perfect gradient and tightly clustered data. Actually, it has to do better than that: it has to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will show some degree of clustering simply by chance, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about this, but the one that seems most directly obvious and relevant is based on the Poisson distribution. The Poisson gives the expected values in a set of sub-samples, given a known rate determined from the entire object collection for the variable of interest. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values of the n groups returned by a given clustering algorithm (of any method) follow the expectation defined by the Poisson distribution with that overall mean (the Poisson being defined by just that one parameter). The lower that probability, the more likely that the clusters returned by the algorithm *do in fact represent a real feature of the data set*, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Now, maybe somebody’s already done this, I don’t know, but I’ve not seen it in any of the statistical software I’ve used, including R’s two workhorse packages *stats* and *cluster*.
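To make the idea concrete, here is one possible interpretation, sketched in Python. Everything here is my own illustration, not an established routine: I treat each group’s total as a draw from a Poisson whose mean is the overall total divided evenly among the groups, and use a Monte-Carlo test to ask how often random Poisson draws scatter at least as widely as the observed group totals:

```python
import math
import random

def uniformity_pvalue(group_totals, nsim=20_000, seed=0):
    """Monte-Carlo p-value for the Poisson check sketched above: how often
    do k independent Poisson(lam) draws (lam = overall mean per group)
    scatter at least as widely as the observed group totals?"""
    k = len(group_totals)
    lam = sum(group_totals) / k
    obs_disp = sum((c - lam) ** 2 for c in group_totals)
    rng = random.Random(seed)

    def poisson_draw():
        # Inverse-CDF sampling: walk the Poisson(lam) pmf until it covers u.
        u, c = rng.random(), 0
        p = cdf = math.exp(-lam)
        while u > cdf:
            c += 1
            p *= lam / c
            cdf += p
        return c

    hits = sum(
        1 for _ in range(nsim)
        if sum((c - lam) ** 2 for c in (poisson_draw() for _ in range(k)))
        >= obs_disp
    )
    return hits / nsim

# Evenly filled groups are consistent with a single overall Poisson rate
# (p-value of 1); a very lopsided partition is not (p-value near 0),
# which under this reading suggests real structure.
print(uniformity_pvalue([5, 5, 5, 5]))
print(uniformity_pvalue([17, 1, 1, 1]))
```

Note the direction of the logic, per the argument above: a *low* probability of the group values arising under the single overall Poisson rate is evidence that the partition reflects a natural aggregation rather than an arbitrary cut of random or gradient data.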

More hideous detail to come so take cover and shield your eyes.