# On clustering, part two

In part one of what has quickly developed into an enthralling series, I made the points that (1) at least some important software doesn’t provide a probability value for cluster outputs, and (2) that while it’s possible to cluster any data set, univariate or multivariate, into clearly distinct groups, so doing doesn’t necessarily mean anything important. Such outputs only tell us something useful if there is some actual structure in the data, and the clustering algorithm can detect it.

But just what is “structure” in the data? The univariate case is simplest because with multivariate data, structure can have two different aspects. But in either situation we can take the standard statistical stance that structure is the detectable departure from random expectation, at some defined probability criterion (p value). The Poisson and gamma distributions define this expectation, the former for count (integer valued) data, and the latter for continuous data. By “expectation” I mean the expected distribution of values across the full data. If we have a calculated overall mean value, i.e. an overall rate, the Poisson and gamma then define this distribution, assuming each value is measured over an identical sampling interval. With the Poisson, that interval takes the form of a fixed denominator, whereas with the gamma it takes the form of a fixed numerator.

Using the familiar example of density (number of objects per unit space or time), the Poisson fixes the unit space while the integer number of objects in each unit varies, whereas the gamma fixes the integer rank of the object to be measured to from each random starting point (the nearest, the second nearest, and so on), with the distance to each such object (and the corresponding area) varying. The two approaches are just flip sides of the same coin really, but with very important practical considerations related to both data collection and mathematical bias. Without getting heavily into the data collection issue here, the Poisson approach–counting the number of objects in areas of pre-defined size–can get you into real trouble in terms of time efficiency (regardless of whether density tends to be low or high). This consideration is very important in opting for distance-based sampling and the use of the gamma distribution over area-based sampling and use of the Poisson.
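To make the two flip sides concrete, here is a minimal simulation sketch in Python, using only the standard library. It scatters objects at random along a line (a one-dimensional stand-in for space or time) and then summarizes them both ways: counts per fixed unit (the Poisson view) and distance from random starting points to the k-th object ahead (the gamma view). The function names, the rate of 2 objects per unit, and the choice of k = 3 are illustrative assumptions only.

```python
import bisect
import random
from statistics import mean, variance


def simulate_events(rate, length, rng):
    """Sorted positions of randomly placed objects along a line of given length."""
    positions = []
    x = rng.expovariate(rate)  # random gaps between objects are exponential
    while x < length:
        positions.append(x)
        x += rng.expovariate(rate)
    return positions


def counts_per_unit(positions, length):
    """Poisson view: fix the unit interval, count the objects falling in each."""
    counts = [0] * int(length)
    for x in positions:
        counts[int(x)] += 1
    return counts


def distances_to_kth(positions, k, n_starts, length, rng):
    """Gamma view: fix the rank k, measure from random starts to the k-th object ahead."""
    distances = []
    for _ in range(n_starts):
        # Starts are kept away from the far end so a k-th object almost always exists.
        start = rng.uniform(0, 0.9 * length)
        i = bisect.bisect_right(positions, start) + k - 1
        if i < len(positions):
            distances.append(positions[i] - start)
    return distances


rng = random.Random(42)
objects = simulate_events(rate=2.0, length=2000, rng=rng)
counts = counts_per_unit(objects, length=2000)
dists = distances_to_kth(objects, k=3, n_starts=2000, length=2000, rng=rng)
print("counts:    mean", round(mean(counts), 2),
      "variance/mean", round(variance(counts) / mean(counts), 2))
print("distances: mean", round(mean(dists), 2), "theory k/rate =", 3 / 2.0)
```

Under pure randomness the counts should show a variance-to-mean ratio near 1, and the distances should average close to k divided by the rate, which is what the Poisson and gamma respectively predict.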

But returning to the original problem as discussed in part one, the two standard clustering approaches–k-means and hierarchical–are always going to return groupings that are of low probability of random occurrence, no matter what “natural structure” in the data there may actually be. The solution, it seems to me, is to instead evaluate relative probabilities: the probability of the values within each group being Poisson or gamma distributed, relative to the probability of the overall set of values being so distributed. In each case these probabilities are determined by a goodness-of-fit test, namely a chi-square test for count (integer) data and a Kolmogorov-Smirnov test for continuous data. If there is in fact some natural structure in the data–that is, groups of values that are more similar (or dissimilar) to each other than the Poisson or gamma would predict–then this relative probability (or odds ratio if you like) will be maximized at the clustering solution that most closely reflects the actual structure in the data, this solution being defined by (1) the number of groups, and (2) the membership of each. It is a maximum likelihood approach to the problem.
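One possible reading of this idea for continuous data, sketched with NumPy and SciPy, is below. Everything specific in it is an assumption made for illustration: the function names, the use of Ward hierarchical clustering, averaging the within-group p values with a simple mean, and skipping groups with fewer than five values. Note also that fitting the gamma to the same values being tested makes the Kolmogorov-Smirnov p values optimistic, so treat the ratios as relative indicators rather than calibrated probabilities.

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import fcluster, linkage


def gamma_fit_pvalue(values):
    """KS goodness-of-fit p value against a gamma fitted to the same values."""
    shape, loc, scale = stats.gamma.fit(values, floc=0)  # location fixed at zero
    return stats.kstest(values, "gamma", args=(shape, loc, scale)).pvalue


def odds_ratio(values, n_groups, min_size=5):
    """Mean within-group p value relative to the p value of the full data set."""
    values = np.asarray(values, dtype=float)
    # Hierarchical clustering of the univariate values, cut into at most n_groups.
    tree = linkage(values.reshape(-1, 1), method="ward")
    labels = fcluster(tree, t=n_groups, criterion="maxclust")
    group_p = [
        gamma_fit_pvalue(values[labels == g])
        for g in np.unique(labels)
        if np.sum(labels == g) >= min_size  # too few values to test sensibly
    ]
    if not group_p:
        return float("nan")
    return float(np.mean(group_p)) / gamma_fit_pvalue(values)


# Hypothetical data: two well-separated gamma-distributed groups.
rng = np.random.default_rng(1)
structured = np.concatenate([rng.gamma(3.0, 1.0, 150), rng.gamma(30.0, 1.0, 150)])
for k in range(2, 6):
    print(k, odds_ratio(structured, k))
```

Scanning the ratio over a range of group numbers, as in the loop at the end, is the step where a peak (real structure) or a flat plateau (no structure) would show up.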

If there is little or no actual structure in the data, then these odds ratios computed across different numbers of final groups will show no clearly defensible maximal value, but rather a broad, flat plateau in which all the ratios are similar, varying from each other only randomly. But when there is real structure therein, there will be a ratio that is quantifiably higher than all others, a unimodal response with a peak value. The statistical significance of this maximum can be evaluated with a likelihood ratio test or something similar, though I haven’t thought very hard about that issue yet.

Moving from the univariate case to the multivariate ain’t not no big deal really, in terms of the above discussion–it just requires averaging those odds ratios over all variables. But multivariate data does introduce a second, subtle aspect into what we mean by the term “data structure”, in the following respect. It is possible that no single variable in the data shows clear evidence of structure, per the above approach, when in fact there very much is structure, but of a different kind. That outcome would occur whenever particular pairs (or larger groups) of variables are correlated with each other (above random expectation), even though the values for each such variable are in fact Poisson/gamma distributed overall. That is, there is a statistically defensible relationship between variables across sample units, but no detectable departure from random variation in the values within each variable, across those sample units.
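A small standard-library Python sketch shows how this second kind of structure can arise. Two count variables are each built as an independent Poisson part plus a shared Poisson part; each variable on its own is still Poisson distributed, yet the pair is clearly correlated across sample units. The construction, the function names, and the rates (2 for the independent parts, 3 for the shared part) are assumptions chosen purely for illustration.

```python
import math
import random
from statistics import mean, variance


def poisson_draw(lam, rng):
    """One Poisson variate by Knuth's multiplication method (fine for small lam)."""
    threshold, k, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def correlated_counts(n, own_rate, shared_rate, rng):
    """Two count variables sharing a common Poisson component in each sample unit."""
    x, y = [], []
    for _ in range(n):
        shared = poisson_draw(shared_rate, rng)
        x.append(poisson_draw(own_rate, rng) + shared)
        y.append(poisson_draw(own_rate, rng) + shared)
    return x, y


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


rng = random.Random(11)
x, y = correlated_counts(2000, own_rate=2.0, shared_rate=3.0, rng=rng)
print("variance/mean of x:", round(variance(x) / mean(x), 2))
print("variance/mean of y:", round(variance(y) / mean(y), 2))
print("correlation of x and y:", round(pearson(x, y), 2))
```

Each variance-to-mean ratio should sit near 1, as expected for a Poisson variable, while the correlation should sit near shared_rate / (own_rate + shared_rate), here 0.6, so a variable-by-variable test would see nothing while the pairwise relationship is strong.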

Such an outcome would provide definite evidence of behavioral similarity among variables even in the absence of a structuring of those variables by some latent (unmeasured) variable. I think it would be interesting to know how often such a situation occurs in different types of ecological and other systems, and I’m pretty sure nobody’s done any such analysis. Bear in mind however that I also once thought, at about 4:30 AM on a sleep deprived week if I remember right, that it would be interesting to see if I could beat the Tahoe casinos at blackjack based on quick probability considerations.

I hope the above has made at least some sense and you have not damaged your computer by say, throwing your coffee mug through the screen, or yelled something untoward, at volume, within earshot of those who might take offense. The Institute hereby disavows any responsibility, liability or other legal or financial connection to such events, past or future.

There will be more!