On clustering, part one

In ecology and other sciences, grouping similar objects together, either for further analysis or simply as an end in itself, is a fundamental task, and it is accomplished by cluster analysis, one of the basic tools of statistics. For all but the smallest sample sizes the number of possible groupings rapidly becomes enormous, and it is therefore necessary both to (1) have some way of efficiently avoiding the vast number of clearly non-optimal clusterings, and (2) choose the best solution from among those that seem at least reasonable.
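To put a number on "enormous": the count of possible partitions of n objects into any number of non-empty, unlabeled groups is the nth Bell number, and a few lines of R (illustration only) show how quickly it blows up.

```r
# Bell numbers count the ways to partition n objects into any number of
# non-empty, unlabeled groups -- i.e. the number of possible clusterings.
# Computed here via the Bell triangle.
bell <- function(n) {
  row <- 1
  for (i in seq_len(n)) {
    new_row <- numeric(i + 1)
    new_row[1] <- row[length(row)]
    for (j in seq_len(i)) new_row[j + 1] <- new_row[j] + row[j]
    row <- new_row
  }
  row[1]
}
sapply(c(5, 10, 15, 20), bell)
# 52, 115975, 1382958545, 51724158235372 -- enormous well before n = 20
```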

First some background. There are (at least) three basic approaches to clustering. Two of these are inherently hierarchical in nature: they either aggregate individual objects into ever-larger groups (agglomerative methods), or successively divide the entire set into ever smaller ones (divisive methods). Hierarchical methods are based on a distance matrix that gives the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one's definition of "distance". This distance matrix has n(n-1)/2 unique entries, growing roughly as the square of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which the average scientist typically does not.
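To make that concrete in R (a minimal sketch; the built-in iris measurements are just a stand-in dataset, and the Euclidean/average-linkage choices are arbitrary):

```r
# Hierarchical clustering in R: dist() builds the full pairwise distance
# matrix, hclust() agglomerates from it.
d  <- dist(iris[, 1:4], method = "euclidean")  # n = 150, so n(n-1)/2 = 11175 distances
hc <- hclust(d, method = "average")            # agglomerative clustering
groups <- cutree(hc, k = 3)                    # trim the tree to 3 groups
table(groups)

# The memory problem: number of pairwise distances for n objects.
n <- c(1e3, 1e4, 1e5)
n * (n - 1) / 2   # ~5e5, ~5e7, ~5e9 -- the last is already prohibitive
```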

The k-means clustering algorithm works differently: it doesn't use a distance matrix. Instead it chooses a number of cluster starting points ("centers"), typically at random, assigns each object to its nearest center, recomputes the centers from those assignments, and iterates until the groupings stabilize. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then the best solution is not guaranteed. Furthermore, one sets the number of final clusters desired beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all cluster numbers that one deems reasonable, with "reasonable" often being less than obvious.
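In R this looks something like the following (a sketch only; the simulated data, the range of k values tried, and nstart = 25 are placeholder choices):

```r
# k-means in R: nstart controls how many random sets of starting centers
# are tried, and the best of those runs (lowest within-cluster sum of
# squares) is kept. Looping over k illustrates the "repeat for every
# reasonable cluster number" problem.
set.seed(1)
x <- matrix(rnorm(1000 * 4), ncol = 4)   # placeholder data: 1000 objects, 4 variables

fits <- lapply(2:6, function(k) kmeans(x, centers = k, nstart = 25))
sapply(fits, function(f) f$tot.withinss)  # one fit statistic per candidate k
```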

When I first did a k-means cluster analysis, years ago, I did it in SPSS, and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed a pretty major shortcoming, possibly an inexcusable one. Now I'm working in R, and I find…the same thing. In R, the two workhorse clustering functions, both in the base stats package, are kmeans and hclust, corresponding to k-means and hierarchical clustering, respectively. Neither gives the probability of the solution as part of its output. So it wasn't just SPSS: if R doesn't provide it, then it's quite possible that no statistical software package (SAS, S-Plus, SigmaStat, etc.) does, although I don't know for sure.

There is one function in R that attempts to identify what it calls the "optimal clustering": the function optCluster, in the package of the same name. But that function, while definitely useful, appears only to provide a set of different metrics by which to evaluate the effectiveness of a given clustering solution, as obtained from any of 16 possible clustering methods, with no actual probabilities attached to any of them. What I'm after is different, more defensible and definitely more probabilistic. It requires some careful thought regarding just what clustering should be all about in the first place.

If we talk about grouping objects together, we gotta be careful. This piece at Variance Explained gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from any data set, but the result doesn’t necessarily mean anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place say, 0 to 3 in one group and 4 to 9 in the other, or similar outcome (which they conceivably might–I haven’t tested them), the main point remains: there are no “natural” groupings in those ten numbers–they are as evenly spaced as is possible to be, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you’ve got the set {0,1,2,7,8,9} it’s quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of the means thereof, and with an obvious gap between the two.
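A quick check in R bears this out, and also shows the problem: the output looks the same either way (sketch only; nstart is set high simply to avoid a poor random start):

```r
# A perfect gradient: any 2-group split is arbitrary.
gradient <- 0:9
kmeans(gradient, centers = 2, nstart = 20)$cluster
# e.g. 1 1 1 1 1 2 2 2 2 2 -- a split, but not a meaningful one

# Two natural groupings with an obvious gap.
gappy <- c(0, 1, 2, 7, 8, 9)
kmeans(gappy, centers = 2, nstart = 20)$cluster
# e.g. 1 1 1 2 2 2 -- the same kind of output, with no indication that
# this split is "real" while the previous one was not
```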

This point is critical, as it indicates that we should seek a clustering evaluation method based on an algorithm capable of discriminating between a perfect gradient and tightly clustered data. Actually it has to do better than that: it has to be able to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will show some degree of apparent clustering just by chance, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about this, but the most directly obvious and relevant one is based on the Poisson distribution. The Poisson defines the expected values in a set of sub-samples, given a known value determined from the entire object collection, for the variable of interest. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values for each of the n groups resulting from a given clustering algorithm (of any method) follow the expectation defined by the Poisson distribution determined by that overall mean (the Poisson being defined by just one parameter). The lower that probability, the more likely that the clusters returned by the algorithm do in fact represent a real feature of the data set, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Maybe somebody's already done this, I don't know, but I've not seen it in any of the statistical software I've used, including R's two workhorse packages, stats and cluster.
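To give the flavor of this in R, here is a rough sketch, not the exact calculation described above (which works on the group means returned by a clustering algorithm): it uses the closely related variance-to-mean (index of dispersion) test on bin counts against the Poisson expectation, and the bin number, the choice of test, and the toy data are all placeholder assumptions.

```r
# Under the Poisson expectation (no aggregation), counts of objects per
# sub-sample have variance equal to their mean. Clustered data are
# overdispersed (variance >> mean); a perfect gradient is underdispersed.
# The statistic (n_bins - 1) * variance / mean is referred to a
# chi-squared distribution with n_bins - 1 degrees of freedom.
dispersion_test <- function(x, n_bins = 10) {
  counts <- as.numeric(table(cut(x, breaks = n_bins)))
  ratio  <- var(counts) / mean(counts)
  stat   <- (n_bins - 1) * ratio
  p <- 2 * min(pchisq(stat, df = n_bins - 1),
               pchisq(stat, df = n_bins - 1, lower.tail = FALSE))
  c(var.to.mean = ratio, p.value = p)
}

set.seed(3)
dispersion_test(seq(0, 100, length.out = 60))           # gradient: ratio near 0
dispersion_test(runif(60, 0, 100))                       # random: ratio near 1
dispersion_test(c(rnorm(30, 20, 3), rnorm(30, 80, 3)))   # clustered: ratio >> 1
```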

More hideous detail to come so take cover and shield your eyes.


3 thoughts on “On clustering, part one”

  1. Oh goody! At some point will we get to discuss the merits of applying Touchard polynomials? I seldom have any trouble sleeping. But when I do I pull out my old statistics notes and have another go at figuring out how TPs (or Bell numbers for that matter) relate to Poisson distributed data. It never fails me.

    • Well, we’re in the electronic age now so consider my posts a replacement for your old notes. Bear in mind however, that sleep therapy is a billable item and we no longer take soybeans as barter here at the Institute.

      No but seriously, I think you might find it interesting. Compared to say, asphalt studies.

    • Yes, sleep therapy as a billable item. That does have a reasonable ring to it. And I am all too used to having my soybeans turned away as something akin to Jack’s treatment in Jack and the Beanstalk. [Didn’t need that cow anyway]

      But I do want to quickly agree that I am all ears (or all eyes as the case may be) when it comes to using a more appropriate statistical tool (Poisson in this case) for an analysis. The jab at Touchard was merely an immodest crack on my part… suggesting I might perhaps know more about stats than I really do. Having heard of Bell numbers and Touchard polynomials should by no means insinuate that I have the slightest ability to apply the math. In my dreams. With a bit of homework I can keep up with the Poisson – so by all means – do carry on!!
