On clustering, part four

I have not written any post, or series thereof, that raised more questions for me, or made me think more carefully about the purposes and products of statistical approaches and algorithms, than this one. I’m finding this topic of statistical clustering of multivariate data to be far subtler and more interesting than I thought when I waded into it. Justifiably counter-intuitive, even; I encourage you to read through this one, as I think the root concepts are pretty important.

It’s easy to misunderstand exactly what the output of a statistical/mathematical procedure represents, and thus vital to grasp just what is going on, and why. This might seem like a perfectly obvious statement, but my experience has often been that we who work in mostly non-experimental fields (e.g. ecology, geology, climatology, etc.) frequently do not, in fact, have a clear and exact grasp of what our various quantitative methods are doing. I’m not just referring to highly complex methods either, but even to simple methods we often take for granted. This applies to me personally, and to all kinds of papers by others that I’ve read, in various fields.

The first point to be made here is to retract and correct the example I gave in part one of this series. I had said there that if one clusters the ten numbers 0-9, into two groups, using either a k-means or hierarchical clustering procedure, the first group will contain {0 to 4} and the second will contain {5 to 9}. That’s not the problem–they will indeed do so (at least in R). The problem is the statement that this output is meaningless–that it does not imply anything about structure in the data. On thinking this through a little more, I conclude that whether it does or doesn’t depends on the specifics of the data.

It is very true that an inherent potential problem with clustering methods is the imposition of an artificial structure on the data when none actually exists. With the “k-means” algorithm this is a direct result of minimizing sums of the squared differences from the mean within groups, whereas with hierarchical methods (AGNES, etc), it results from iteratively joining the closest items (item to item or item to group), as determined by a constantly updated distance matrix after each join. Either way, the algorithms cause similarly valued items to be placed together in groups.
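
For concreteness, here is a minimal R sketch of that example (my own addition): the two-group k-means split of 0:9 is {0:4} versus {5:9}, while a hierarchical two-group cut can be obtained with hclust and cutree, though its exact membership can depend on the linkage method chosen.

x = 0:9

# k-means with two centers; a seed is set because the starting centers are chosen at random
set.seed(1)
km = kmeans(x, centers = 2)
split(x, km$cluster)

# agglomerative hierarchical clustering, cut into two groups
hc = hclust(dist(x), method = "average")
split(x, cutree(hc, k = 2))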

The problem is with the example I used; specifically, with the assumption that a set of values appearing to follow a smooth gradient, such as the integers 0:9 in the example, rather than displaying definite groupings/clusters, necessarily indicates a lack of structure, and thus a meaningless clustering result. This is not necessarily the case, and the reason can only be grasped in the context of an a priori expected distribution of values, given an observed overall mean value. These expectations are given by the Poisson and gamma distributions, for integer- and real-valued data respectively.

The most immediate question arising from that last statement should be “Why exactly should the Poisson or gamma distributions define our expectation, over any other distribution?”. The answer is pretty important, as it gets at just what we’re trying to do when we cluster values. I would strongly argue that that purpose has to be the identification of structure in the data, that is, a departure from randomness, which in turn means we need an estimate of the random condition against which to gauge possible departures. Without this “randomness reference point”, the fact that {0:4} and {5:9} will certainly fall into two different groups when clustered is nothing more than the trivial re-statement that {5:9} are larger values than {0:4}: fully meaningless. But R (and presumably most other statistical analysis programs as well) does not tell you that–it gives you no information on just how meaningful a given clustering output is. Not good: as scientists we’re after meaningful output, not just output.

The answer to the above question is that the Poisson and gamma distributions provide those randomness reference points, and the reason why is important: they are essentially descriptors of expected distributions due to a type of sampling error alone. Both are statements that if one has some randomly distributed item–say tree locations in two dimensions in a forest, or whatever example you like–then a set of pre-defined measurement units, placed at random throughout the sampling space (geographic area in my example), will vary in the number of items contained (Poisson), and in the area encompassed by measurements from random points to the nth closest objects thereto (gamma). Not sampling error in the traditional sense, but the concept is very similar, so that’s what I’ll call it.
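
As a rough illustration of this randomness reference point (a simulation sketch of my own, not part of the analysis below): scatter points at random over a unit square, count how many fall in each cell of an equal-area grid, and the per-cell counts follow a Poisson distribution with lambda equal to the mean count per cell.

# random "tree" locations in a unit square
set.seed(42)
n.pts = 2000
x = runif(n.pts); y = runif(n.pts)

# count the points falling in each cell of a 20 x 20 grid (400 equal-area quadrats)
counts = as.vector(table(cut(x, seq(0, 1, 0.05)), cut(y, seq(0, 1, 0.05))))

lambda = mean(counts)                    # 2000 / 400 = 5 points per quadrat, on average
var(counts) / lambda                     # variance-to-mean ratio, ~1 under the Poisson
table(counts) / length(counts)           # observed frequencies of each count...
round(dpois(0:max(counts), lambda), 3)   # ...versus the corresponding Poisson probabilities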

Therefore, departures from such expectations, in an observed data set, are necessarily evidence for structure in those data, in the sense that something other than simple sampling variation–some causative agent–is operating to cause the departure. The main goal of science is to seek and quantify these agents…but how are you supposed to seek those somethings until you have evidence that there is some something there to begin with? And similarly, of the many variables that might be important in a given system, how does one decide which of those are most strongly structured? These questions are relevant because this discussion is in the context of large, multivariate, non-experimental systems, with potentially many variables of unknown interaction. We want evidence regarding whether or not the observations are explainable by a single homogeneous process.

Things get interesting, given this perspective, and I can demonstrate why using, say, the set of n = 21 integers {0:20}, a perfect gradient with no evidence of aggregation.

The Poisson distribution tells us that we do not expect a distribution of {0:20} for 21 integers having a mean value of 10. The observed versus expected values (for a CDF evaluated at 5% quantile steps) are:

   quantile observed expected
     0.05        1        5
     0.10        2        6
     0.15        3        7
     0.20        4        7
     0.25        5        8
     0.30        6        8
     0.35        7        9
     0.40        8        9
     0.45        9        9
     0.50       10       10
     0.55       11       10
     0.60       12       11
     0.65       13       11
     0.70       14       12
     0.75       15       12
     0.80       16       13
     0.85       17       13
     0.90       18       14
     0.95       19       15

…as obtained by (in R):

# the observed values and their number
observed = 0:20; n = length(observed)
# Poisson quantiles at the same cumulative probabilities, with lambda set to the observed mean
expected = qpois(p = seq(0, 1, length.out=n), lambda=mean(observed))
# assemble, dropping the 0% and 100% quantiles (qpois returns 0 and Inf at those)
df = data.frame(quantile = seq(0, 1, length.out=n), observed, expected)[2:20,]

Running a chi-square test on the observed values:

# goodness-of-fit test; with no 'p' argument, equal expected counts are assumed
chisq.test(x=df$observed)

…we get:

data: df$observed
X-squared = 57, df = 18, p-value = .000006

There is thus only a very tiny chance of getting the observed result from sampling variation alone. Observation of the above table shows that the observed data are stretched–having longer left and right tails, relative to expectation.
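
One quick way to quantify that stretching (a small check of my own, not in the original analysis): for a Poisson distribution the variance equals the mean, so the variance-to-mean ratio of the observed values indicates how much more spread out they are than the Poisson expectation.

observed = 0:20
var(observed) / mean(observed)   # 38.5 / 10 = 3.85, versus ~1 expected under the Poisson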

But this series is about clustering and how to find evidence for it in the data…and there is no tendency toward clustering evident in these data. Indeed, the opposite is true–the observed distribution is long-tailed, “stretched out” compared to expectation. But this result must nevertheless mean that some structuring force is operating on the data–some departure from the homogeneous state that the Poisson assumes. We can’t know, from this analysis alone, just how many “real” clusters of data there are, but we do know that it must be more than one (and I hope it is apparent why)! More precisely, if we’ve decided to look for clusters in the data, then the chi-square test gives us evidence that more than one cluster is highly likely.

Just how many clusters is a more difficult question, but we could evaluate the first step in that direction (i.e. two clusters) by dividing the values into two roughly even-sized groups defined by the midpoint (= 10) and evaluating whether Poisson-distributed values for the two groups, having means of 4.5 and 15, give a better fit to the observations:

# expected Poisson quantiles for the lower group (mean 4.5), inner quantiles only
(exp1 = qpois(p = seq(0, 1, length.out=11), lambda=4.5)[2:10])
# expected Poisson quantiles for the upper group (mean 15), inner quantiles only
(exp2 = qpois(p = seq(0, 1, length.out=12), lambda=15)[2:11])
# combine the two expectations and line them up against the observed values
(df2 = data.frame(quantile = seq(0, 1, length.out=n)[2:20], obs=df$observed, exp = sort(c(exp1,exp2))))

…which gives:

   quant. obs. exp.
     0.05   1   2
     0.10   2   3
     0.15   3   3
     0.20   4   4
     0.25   5   4
     0.30   6   5
     0.35   7   5
     0.40   8   6
     0.45   9   7
     0.50  10  10
     0.55  11  11
     0.60  12  13
     0.65  13  14
     0.70  14  14
     0.75  15  15
     0.80  16  16
     0.85  17  17
     0.90  18  18
     0.95  19  20

…and then compute the mean of the two chi-square probabilities:

# observed values in the two candidate groups
obs1=0:9; obs2=10:20
# mean of the two goodness-of-fit p-values, one chi-square test per group
p = mean(c(chisq.test(obs1)$p.value, chisq.test(obs2)$p.value))

…which returns p = 0.362
[Edit note: I had wrong values here originally–because I mistakenly ran the chi-square tests on the expected values instead of the observed. Now corrected–still a very high odds ratio, just not as high as before.]

The odds ratio of the two hypotheses (two Poisson-distributed groups having means of 4.5 and 15.0, versus just one group with a mean of 10.0) is thus 0.362 / .000006 = 60,429. Thus, clustering the observed data into these two groups would be a very highly defensible decision, even though the observed data comprise a perfect gradient having no tendency toward aggregation whatsoever!

The extension from the univariate to the multivariate case is straightforward, involving nothing more than performing the same analysis on each variable and then averaging the resulting set of probabilities or odds ratios.
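
A minimal sketch of what that might look like (my own illustration, using a hypothetical matrix X with observations in rows and variables in columns; only the single-group chi-square step is shown, averaged across variables):

# hypothetical example data: one gradient-like variable and one Poisson-distributed variable
set.seed(1)
X = cbind(v1 = 0:20, v2 = rpois(21, lambda = 10))

# run the same single-group goodness-of-fit test on each column, then average the p-values
p.per.var = apply(X, 2, function(v) chisq.test(v)$p.value)
mean(p.per.var)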


2 thoughts on “On clustering, part four”

  1. Jim,

    thank you very much for sharing your thoughts here! I was often tempted to use clustering with multivariate data sets, but then abstained from it, exactly because I did not see any obvious metric to assess the multitude of possible solutions. A problem I see is that running k different parametrizations of a clustering algorithm and then looking at the p-values would be the equivalent of testing k different hypotheses on the same data set, so depending on k, any data set analyzed that way needs to be large. Anyway, I’d like to see your approach implemented as an R package and would be willing to contribute.

    • Hugo, thanks for your comment.

      Yes, the approach I’m advocating would definitely involve evaluating a number of different solutions (e.g. values of k in a k-means algorithm, or tree prune levels in a hierarchical method). It’s a maximum likelihood approach for sure–I just haven’t worked out the exact tack to take and thus the algorithmic details.

      Hmmmm, an R package, yes that would be useful, I hadn’t thought about it really. I don’t have time for it right now, and need to develop the ideas a little more, but I will definitely consider it, and get hold of you if I decide to do it.
