Recognition For Review

I just found out that the second annual Peer Review Week is well underway. There are several online articles on the topic, perhaps best found via Twitter searches using #RecognizeReview or #PeerRevWk16, or via links at the link above.

This year’s theme is Recognition For Review, which makes it perfect timing for a peer review post that I already had in mind. I don’t think there’s any question that the peer review process as a whole has very major problems, ones which greatly weaken the clarity, efficiency and reliability of the scientific process. These problems originate largely in the design of the review process, which in turn affects review execution. However, this reality doesn’t change the fact that thousands of people perform excellent review work, daily. And they’re not getting much credit for it either.

Some attention then, to one of the most interesting, important–and puzzling–reviews I’ve ever seen. Occasionally a paper comes out which is worth paying intense attention to, for reasons that go beyond just its technical content, and this is surely one in my opinion. The review and paper in question are publicly available at Atmospheric Chemistry and Physics (ACP). This was a long, involved review on a long, involved paper. If you have limited time to devote to this, go read Peter Thorne’s ICARUS article, a summary of his overall review experience.

The journal is one of a set of European Geosciences Union (EGU) journals that have gone to a completely open review process. The commenting process is online and open to anyone, and the editor also designates two or more official reviewers, who (unlike the volunteer commenters) may remain anonymous if they choose. For this open process alone the EGU deserves major recognition and gratitude, as it is arguably the single biggest step that can be taken to improve the peer review process. Everything has to be open.

There is a lot to say on this and I’ll start with the puzzling aspect of it. The lead author of the article in question is James Hansen, arguably still the most famous climate scientist in the world. Several of the reviews show that the article’s main claims are quite contentious, relative to the evidence and analysis presented, as summarized most completely by Thorne’s two reviews, the second of which–a phenomenal piece of review work–also summarizes Hansen et al.’s responses (and non-responses) to the numerous reviewer comments, a job which presumably should have fallen to the editor.

I’ve not yet worked all the way through everything, but you can’t read it and not wonder about some things. The authors didn’t have to submit their paper to an open review journal. So why did they? Did they assume the claims of the paper were largely non-contentious and it would thus slide smoothly through review? But given the clearly important claims, why not then submit to a highly prominent journal like Science or Nature for maximum attention and effect? Maybe they did, had it rejected and this was the second or third submission–I don’t know.

A second issue, one of several that did not sit at all well with Thorne, was the fact that Hansen et al. notified members of the press before submission, some of whom Thorne points out then treated it as if it were in fact a new peer reviewed paper, which it surely was not. When confronted on this point, Hansen was completely unapologetic, saying he would do the same thing again if given the chance, and giving as his reason the great importance of the findings to the world at large, future generations in particular. What? That response pretty well answers the question regarding his confidence in the main conclusions of the paper, and is disturbing in more than one way.

Thorne was also not at all pleased with Hansen’s flippant and/or non-responses to some of the review comments, and he took Hansen severely to task for his general attitude, especially given the major weaknesses of the paper. The most important of those weaknesses was that there was no actual, model connection between the proposed processes driving rapid ice sheet melt, and the amount of fresh water flowing into the oceans to drive the rapid sea level rise that is the main claim of the paper. Rather, that flow was prescribed independently of the ice melt processes, in what amounted to a set of “what if” scenarios more or less independent of the model’s ice melt dynamics. Worse, this crucial fact was not made clear and prominent: it had to be dug out by careful reading, and moreover, Hansen essentially denied that this was in fact the case.

There are major lessons here regarding the conduct of peer review, how scientists should behave (senior scientists in particular), and scientific methodology. Unfortunately, I have no more time to give this right now–and I would give it a LOT more if I did. This is thus largely a “make aware” post. The paper and its review comprise a case study in many respects, and working through them requires a significant commitment. I personally have not seen a more important paper review in a very long time, if ever. Peter Thorne, some of the other volunteer reviewers, and ACP all deserve recognition for this work.

Please do not fire off any uninformed comments. Thanks.

Too many memories

I remember this town, with a girl by my side
And a love seldom found, in this day and time
And it gets melancholy, every now and again
When you let your mind go, and it drifts way back when
Now life plays its tricks, some cruel but fair
And even a fool can’t pretend they don’t care

When there’s too many memories for one heart to hold
Once a future so bright now seems so distant and cold
And the shadows grow long and your eyes look so old
When there’s too many memories for one heart to hold

There are those moments, and they just never fade
Like the look in her eyes and the way the light played
God moved in that moment, and the angels all cried
And they gave you a memory that you’ll have ’til you die
Now the lesson you learned, and you don’t dare forget
What makes you grow old is replacing hope with regret

And there’s too many memories for one heart to hold
Once a future so bright, now seems so distant and cold
And the shadows grow long, and your eyes look so old
When there’s too many memories for one heart to hold

The late Stephen Bruton, Too Many Memories
(Thanks to Mike Flynn for playing the Tom Rush cover of this last night on his great show, The Folk Sampler)

“Why the Americans Are More Addicted to Practical Than to Theoretical Science”

Those who cultivate the sciences among a democratic people are always afraid of losing their way in visionary speculation. They mistrust systems; they adhere closely to facts and the study of facts with their own senses. As they do not easily defer to the mere name of any fellow-man, they are never inclined to rest upon any man’s authority; but, on the contrary, they are unremitting in their efforts to point out the weaker points of their neighbor’s opinions. Scientific precedents have very little weight with them; they are never long detained by the subtlety of the schools, nor ready to accept big words for sterling coin; they penetrate, as far as they can, into the principal parts of the subject which engages them, and they expound them in the vernacular tongue. Scientific pursuits then follow a freer and a safer course, but a less lofty one.

Tocqueville Vol 2 frontis

The mind may, as it appears to me, divide science into three parts. The first comprises the most theoretical principles, and those more abstract notions, whose application is either unknown or very remote. The second is composed of those general truths, which still belong to pure theory, but lead nevertheless by a straight and short road to practical results. Methods of application and means of execution make up the third. Each of these different portions of science may be separately cultivated, although reason and experience show that none of them can prosper long, if it be absolutely cut off from the two others.

In America the purely practical part of science is admirably understood, and careful attention is paid to the theoretical portion which is immediately requisite to application. On this head the Americans always display a clear, free, original, and inventive power of mind. But hardly any one in the United States devotes himself to the essentially theoretical and abstract portion of human knowledge. In this respect the Americans carry to excess a tendency which is, I think, discernible, though in a less degree, among all democratic nations.


Does the Poisson scale up?

I often get obsessed with certain topics, especially statistical and mathematical ones. Lately I’ve been thinking about the Poisson distribution a lot, as it figures heavily in analyses of random populations, and thus also in assessing deviations therefrom, a topic that I’ve been heavily involved in for a while, relative to historic and current tree populations. I also find that often a specific question will arise, related to whatever it is I’m doing, that is best (and often quickest) answered, with the most confidence, by direct simulation, for which I use R.

The topic of varying scales of variation, and the information that can be obtained by analysis thereof, is a pretty interesting one IMO. When the generating processes for a phenomenon of interest are complex, or poorly understood for whatever reason, one can (and should!) obtain valuable information regarding likelihoods of various hypothetical process drivers, by multi-scale analysis–essentially, obtaining evidence for the scale(s) at which departures from randomness are greatest and using that information to suggest, or constrain, possible explanations. There are a number of ways to go about doing so.

The Poisson distribution is the appropriate descriptor of a homogeneous (constant) rate process whose individual event outcomes are random. An “under-dispersed” population at a particular scale of analysis will be more “regular” in its arrangement than expected from a random process, and in such a situation there must necessarily also be under-dispersion at at least some other scales, both smaller and larger. To illustrate via an extreme example, suppose some location gets 36 inches of precipitation (P) per year on average, distributed as exactly three inches per month, every month. The probability of such a result arising, when P varies randomly (Poisson) at any sub-monthly scale, is extremely low; it won’t occur over any extended period of time. The same principle holds, though muted, if there is some monthly variance around 3.0.
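
As a quick sanity check on just how low, here’s a minimal back-of-envelope calculation in R, treating each monthly total as a Poisson count with mean 3 (an idealization for illustration only; real precipitation totals aren’t integer-valued):

p.month = dpois(3, lambda = 3)	# chance a Poisson(3) monthly total lands exactly on 3; ~0.22
p.year = p.month^12		# chance all 12 months of a year do so (independence assumed); ~1.6e-08
c(p.month, p.year)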

In an over-dispersed (“clustered”) population, again at some defined scale, the situation is different. Such a population will also be over-dispersed at smaller scales, but not necessarily at larger ones, at least not at the same intensity, and there should be some unknown scale at which the variation reduces to the Poisson. This means that a Poisson population is not necessarily Poisson at smaller scales, but it should be so at larger scales. That is, it should “scale up” according to Poisson expectation, i.e. with the same rate but a greater absolute number, and variance therein, per sample unit.

But does it? Or rather, what does R think about the matter?

Well, here’s what I get, using the example case of a mean annual P of 36″ and 100,000 simulated monthly, or weekly, sums obtained by randomly sampling the Poisson expectation at sub-interval (weekly or daily) time scales.

rm(list=ls())
options(digits=3)
# 1. Observed annual mean and corresponding scaled sub-interval means.  Year = 364 days, month = 28 days.
obs.ann.mn = 36			# observed annual mean, from record
(monthly.mn = obs.ann.mn/13)	# 13 months/yr! 
(weekly.mn = obs.ann.mn/52)
(daily.mn = obs.ann.mn/364)

# 2. Poisson CDF expectations, ignore slight variations in days:
 # Equal interval CDF probs determined by no. time intervals in a year, eases interpr.
 # Set CDF probs to correspond to the various time ints. to give temporal distribution 
 # NOTE that qpois, for cdf = 1.0, = Inf., so omit last interval
# Poisson
P.Pmonth = qpois(p=(1:12)/13, lambda = monthly.mn) 		# 13 mos
P.Pweek = qpois(p=(1:51)/52, lambda = weekly.mn)		# 52 weeks
P.Pday = qpois(p=(1:363)/364, lambda = daily.mn)		# 364 days
table (P.Pmonth); table(P.Pweek); table(P.Pday)

# 3. Simulations: do repeated samples taken from shorter time periods and summed, match Poisson/gamma expectations at longer periods?
n.trials = 1e5
P.month.week = rep(NA,n.trials)
 for (i in 1:n.trials) P.month.week[i] = sum(sample(P.Pweek, 4, replace=T))			# Exactly 4 weeks to our months
 q.P.month.week = as.vector(quantile(P.month.week, probs = (1:12)/13)); rm(P.month.week)
P.month.day = rep(NA,n.trials)
 for (i in 1:n.trials) P.month.day[i] = sum(sample(P.Pday, 28, replace=T))
 q.P.month.day = as.vector(quantile(P.month.day, probs = (1:12)/13)); rm(P.month.day)
P.week.day = rep(NA,n.trials)
 for (i in 1:n.trials) P.week.day[i] = sum(sample(P.Pday, 7, replace=T))
 q.P.week.day = as.vector(quantile(P.week.day, probs = (1:51)/52)); rm(P.week.day)

mw = data.frame(table (P.Pmonth), table(q.P.month.week))[,-3]; colnames(mw)=c("Precip, monthly", "Poisson Expect.", "Aggr., weekly")
md = data.frame(table (P.Pmonth), table(q.P.month.day))[,-3]; colnames(md)=c("Precip, monthly", "Poisson Expect.", "Aggr., daily")
wd = data.frame(table (P.Pweek), table(q.P.week.day))[,-3]; colnames(wd)=c("Precip, weekly", "Poisson Expect.", "Aggr., daily")
mw; md; wd

Answer: Yes, it does exactly.*

Precip, monthly 	Poisson Exp. 	Aggr., weekly
               1               3             3
               2               3             3
               3               3             3
               4               2             2
               5               1             1
Precip, monthly 	Poisson Exp.	Aggr., daily
               1               3            3
               2               3            3
               3               3            3
               4               2            2
               5               1            1
Precip, weekly 		Poisson Exp.	 Aggr., daily
               0              26           26
               1              18           18
               2               6            6
               3               1            1

*However, I also evaluated gamma expectations, mainly as a check and/or curiosity (the gamma interpolates between the Poisson’s integer values). I didn’t always get the exact correspondence as expected, and I’m not really sure why. Close, but not close enough to be due to rounding errors, so that’s kind of interesting, but not enough to pursue further.

Funding for this post was provided in equal part by the French Fish Ectoderm and Statistics Foundation and the American Association for Advancement of Amalgamated and Aggregated Associations. These organizations are solely responsible for any errors herein, and associated management related consequences.

Twitter science

Discussing science on the internet can be interesting at times, even on Twitter, which seems to have been designed specifically to foster misunderstanding by way of brevity. Here are two examples from my week.

Early in the week, Brian Brettschneider, a climatologist in Alaska, put up a global map of monthly precipitation variability:
Brettschneider map
Brian said the metric graphed constitutes the percentiles of a chi-square goodness-of-fit test comparing average monthly precipitation (P) against uniform monthly P. I then made the point that he might consider using the Poisson distribution of monthly P as the reference departure point instead, as this is the more correct expectation for the “no variation” situation. Brian responded that there was no knowledge, or expectation, regarding the dispersion of the data, upon which to base such a decision. That response made me think a bit, and I then realized that I was thinking of the issue in terms of variation in whatever driving processes lead to precipitation measured at monthly scales, whereas Brian was thinking strictly in terms of the observations themselves–the data as they are, without assumptions. So, my suggestion was only “correct” if one is thinking about the issue the way I was. Then, yes, the Poisson distribution around the overall monthly mean will describe the expected variation of a homogeneous, random process, sampled monthly. But Brian was right in that there is no necessary reason to assume, a priori, that this is in fact the process that generated the data in various locations.
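
For concreteness, here’s a minimal sketch of the two reference points, with made-up monthly values and both framings simplified considerably:

p.monthly = c(4.1, 3.6, 3.2, 2.8, 2.5, 2.2, 1.9, 2.1, 2.6, 3.0, 3.4, 3.9)	# hypothetical station means, inches
# Roughly Brian's framing: chi-square statistic against equal ("uniform") monthly amounts
expected = rep(mean(p.monthly), 12)
sum((p.monthly - expected)^2 / expected)
# Roughly my framing: compare month-to-month dispersion to the Poisson expectation
# (variance = mean) for a homogeneous random process
var(p.monthly) / mean(p.monthly)	# ~1 if Poisson-like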

The second interchange was more significant, and worrisome. Green Party candidate for President, physician Jill Stein, stated “12.3M Americans could lose their homes due to a sea level rise of 9ft by 2050. 100% renewable energy by 2030 isn’t a choice, it’s a must.” This was followed by criticisms, not just from the expected group but also from some scientists and activists who are concerned about climate change. One of them, an academic paleoecologist, Jacquelyn Gill, stated “I’m a climate scientist and this exceeds even extreme estimates”, and later “This is NOT correct by even the most extreme estimates”. She later added some ad hominem barbs such as “That wasn’t a scientist speaking, it was a lawyer” and “The point of Stein’s tweet was to court green voters with a cherry-picked figure”. And some other things that aren’t really worth repeating.

OK so what’s the problem here? Shouldn’t we be criticizing exaggerations of science claims when they appear in the mass culture? Sure, fine, to the extent that you are aware of them and have the time and expertise to do so. But that ain’t really the point here, which is instead something different and more problematic IMO. Bit of a can of worms in fact.

Steve Bloom has been following the climate change debate for (at least) several years, and works as hard to keep up on the science as any non-scientist I’ve seen. He saw Gill’s tweets and responded that no, Stein’s statement did not really go so far beyond the extreme scientific estimates. He did not reference some poor or obsolete study by unknown authors from 25 years ago, but rather a long, wide-ranging study by James Hansen and others, only a few months old, one that went through an impressive and unique open review process (Peter Thorne was one of the reviewers, and critical of several major aspects of the paper; final review here, and summary of the overall review experience here). Their work does indeed place such a high rate of rise within the realm of defensible consideration, depending on glacier and ice sheet dynamics in Greenland and Antarctica, for which they incorporate some recent findings into their modeling. So, Jill Stein is not so off-the-wall in her comments after all, though she may have exaggerated slightly, and I don’t know where she got the “12.3M homes” figure.

The point is not that James Hansen is the infallible king of climate science, and therefore to be assumed correct. Hansen et al. might be right or they might be wrong, I don’t know. [If they’re right we’re in big trouble]. I wasn’t aware of the study until Steve’s tweeted link, and without question it will take some serious time and effort to work through the thing, even just to understand what they claim and how they got there, which is all I can expect to achieve. If I get to it at all, that is.

One point is that some weird process has developed, where all of a sudden a number of scientists sort of gang up on some politician or whatever who supposedly said some outrageous thing or other. It’s not scientist A criticizing public person B this week and then scientist C criticizing public person D the next week–it’s a rather predictable group all ganging up on one source, at once. To say the least, this is suspicious behavior, especially given the magnitude of the problems I see within science itself. I do wonder how much of this is driven by climate change “skeptics” complaining about the lack of criticisms of extreme statements in the past.

To me, the bigger problem is that these criticisms are rarely aimed at scientists, but rather at various public persons. Those people are not immune to criticism, far from it. But in many cases, and clearly in this one, the things being claimed originate from scientists themselves, in publications, interviews or speeches. For the most part, people don’t just fabricate claims, they derive them from science sources (or what they consider to be such), though they certainly may exaggerate them. If you don’t think the idea of such a rapid rise is tenable, fine…then take Hansen et al. to the cleaners, not Jill Stein. But, unless you are intimately familiar with the several issues involving sea level rise rates, especially ice melt, then you’ve got some very long and serious work ahead of you before you’re in any position to do so. This stuff is not easy or simple and the authors are no beginners or lightweights.

The second issue involves the whole topic of consensus, which is a very weird phenomenon among certain climate scientists (not all, by any means). As expected, when I noted that Stein was indeed basically referencing Hansen et al., I was hit with the basic argument (paraphrased) “well, they’re outside of the consensus (and/or IPCC) position, so the point remains”. Okay, aside from the issues of just exactly how this sacred consensus is to be defined anyway… let’s say they are outside of it, so what? The “consensus position” now takes authority over evidence and reasoning, modeling and statistics, newly acquired data etc., that is, over the set of tools we have for deciding which, of a various set of claims, is most likely correct? Good luck advancing science with that approach, especially in cases where questionable or outright wrong studies have formed at least part of the basis of your consensus. It’s remarkably similar to Bayesian philosophy–they’re going to force the results from prior studies to be admitted as evidence, like it or not, independent of any assessment of their relative worth. Scientific goulash.

And yes, such cases do indeed exist, even now–I work on a couple of them in ecology, and the whole endeavor of trying to clarify issues and correct bad work can be utterly maddening when you have to deal with that basic mindset.

“We live in a Chi-square society due to political correctness”

So, without getting into the reasons, I’m reading through the entry in the International Encyclopedia of Statistical Science on “Statistical Fallacies: Misconceptions and Myths”, written by one “Shlomo Sawilowsky, Professor, Wayne State University, Detroit MI, USA”. Within the entry, 20 such fallacies are each briefly described.

Sawilowsky introduces the topic by stating:

Compilations and illustrations of statistical fallacies, misconceptions, and myths abound…The statistical faux pas is appealing, intuitive, logical, and persuasive, but demonstrably false. They are uniformly presented based on authority and supported based on assertion…these errors spontaneously regenerate every few years, propagating in peer reviewed journal articles…and dissident literature. Some of the most egregious and grievous are noted below.

Great, let’s get after it then.

He then gets into his list, which proceeds through a set of +/- standard types of issues, including misunderstanding of the Central Limit Theorem, Type I errors, p values, effect sizes and etc. Up comes item 14:

14. Chi-square
(a) We live in a Chi-square society due to political correctness that dictates equality of outcome instead of equality of opportunity. The test of independence version of this statistic is accepted sans voire dire by many legal systems as the single most important arbiter of truth, justice, and salvation. It has been asserted that any statistical difference between (often even nonrandomly selected) samples of ethnicity, gender, or other demographic as compared with (often even inaccurate, incomplete, and outdated) census data is prima facie evidence of institutional racism, sexism, or other ism. A plaintiff allegation that is supportable by a significant Chi-square is often accepted by the court (judges and juries) praesumptio iuris et de iure. Similarly, the goodness of fit version of this statistic is also placed on an unwarranted pedestal.

Bingo Shlomo!!

Now this is exactly what I want from my encyclopedia entries: a strictly apolitical, logical description of the issue at hand. In fact, I hope to delve deep into other statistical writings of Dr. Sawilowsky to gain, hopefully, even better insights than this one.

Postscript: I’m not really bent out of shape on this, and would indeed read his works (especially this one: Sawilowsky, S. (2003) Deconstructing arguments from the case against hypothesis testing. J. Mod. Appl. Stat. Meth. 2(2):467-474). I can readily overlook ideologically driven examples like this to get at the substance I’m after, but I do wonder how a professional statistician worked that into an encyclopedia entry.

I note also that the supposed “screening fallacy” popular on certain blogs is not included in the list…and I’m not the least bit surprised.

Adjusting the various contentions of the elements

General views of the Fashioned, be it matter aggregated into the farthest stars of heaven, be it the phenomena of earthly things at hand, are not merely more attractive and elevating than the special studies which embrace particular portions of natural science; they further recommend themselves peculiarly to those who have little leisure to bestow on occupations of the latter kind. The descriptive natural sciences are mostly adapted to particular circumstances: they are not equally attractive at every season of the year, in every country, or in every district we inhabit. The immediate inspection of natural objects, which they require, we must often forego, either for long years, or always in these northern latitudes; and if our attention be limited to a determinate class of objects, the most graphic accounts of the travelling naturalist afford us little pleasure if the particular matters, which have been the special subjects of our studies, chance to be passed over without notice.

As universal history, when it succeeds in exposing the true causal connection of events, solves many enigmas in the fate of nations, and explains the varying phases of their intellectual progress—why it was now impeded, now accelerated—so must a physical history of creation, happily conceived, and executed with a due knowledge of the state of discovery, remove a portion of the contradictions which the warring forces of nature present, at first sight, in their aggregate operations. General views raise our conceptions of the dignity and grandeur of nature; and have a peculiarly enlightening and composing influence on the spirit; for they strive simultaneously to adjust the contentions of the elements by the discovery of universal laws, laws that reign in the most delicate textures which meet us on earth, no less than in the Archipelagos of thickly clustered nebulae which we see in heaven, and even in the awful depths of space—those wastes without a world.

General views accustom us to regard each organic form as a portion of a whole; to see in the plant and in the animal less the individual or dissevered kind, than the natural form, inseparably linked with the aggregate of organic forms. General views give an irresistible charm to the assurance we have from the late voyages of discovery undertaken towards either pole, and sent from the stations now fixed under almost every parallel of latitude, of the almost simultaneous occurrence of magnetic disturbances or storms, and which furnish us with a ready means of divining the connection in which the results of later observation stand to phenomena recorded as having occurred in bygone times; general views enlarge our spiritual existence, and bring us, even if we live in solitude and seclusion, into communion with the whole circle of life and activity — with the earth, with the universe.

Alexander von Humboldt, 1845
Kosmos: A General Survey of the Physical Phenomena of the Universe, Vol. I, pp. 23-24

Warner, 1833

Just one reason why I will never tire of reading history and exploration, extracted from:
Anonymous (1891). A Memorial and Biographical History of Northern California. Lewis Publishing Co., Chicago IL.

Colonel J. J. Warner, now of Los Angeles, a member of the Ewing trapping expedition, which passed north through these valleys in 1832, and back again in 1833, says:

“In the fall of 1832, there were a number of Indian villages on King’s River, between its mouth and the mountains; also on the San Joaquin River, from the base of the mountains down to and some distance below the great slough. On the Merced River, from the mountains to its junction with the San Joaquin, there were no Indian villages; but from about this point on the San Joaquin, as well as on its principal tributaries, the Indian villages were numerous, many of them containing some fifty to one hundred dwellings, built with poles and thatched with rushes. With some few exceptions, the Indians were peaceably disposed. On the Tuolumne, Stanislaus and Calaveras rivers there were no Indian villages above the mouths, as also at or near their junction with the San Joaquin. The most hostile were on the Calaveras River. The banks of the Sacramento River, in its whole course through the valley, was studded with Indian villages, the houses of which, in the spring, during the day-time, were red with the salmon the aborigines were curing.

At this time there were not, on the San Joaquin or Sacramento river, or any of their tributaries, nor within the valleys of the two rivers, any inhabitants but Indians. On no part of the continent over which I had then, or have since, traveled, was so numerous an Indian population, subsisting on the natural products of the soil and waters, as in the valleys of the San Joaquin and Sacramento. There was no cultivation of the soil by them; game, fish, nuts of the forest and seeds of the field constituted their entire food. They were experts in catching fish in many ways, and in snaring game in diverse modes.

On our return, late in the summer of 1833,we found the valleys depopulated. From the head of the Sacramento to the great bend and slough of the San Joaquin we did not see more than six or eight live Indians, while large numbers of their bodies and skulls were to be seen under almost every shade-tree near water, where the uninhabited and deserted villages had been converted into grave-yards; and on the San Joaquin River, in the immediate neighborhood of the larger class of villages, which the preceding year were the abodes of large numbers of these Indians, we found not only many graves, but the vestiges of a funeral pyre. At the mouth of King’s River we encountered the first and only village of the stricken race that we had seen after entering the great valley; this village contained a large number of Indians temporarily stopping at that place.

We were encamped near the village one night only, and during that time the death angel, passing over the camping-ground of the plague stricken fugitives, waved his wand, summoning from a little remnant of a once numerous people a score of victims to muster in the land of the Manitou; and the cries of the dying, mingling with the wails of the bereaved, made the night hideous in that veritable valley of death.

History N CA cover

On clustering, part five

This post constitutes a wrap-up and summary of this series of articles on clustering data.

The main point thereof is that one needs an objective method for obtaining evidence of meaningful groupings of values (clusters) in a given data set. This issue is most relevant to non-experimental science, in which one is trying to obtain evidence of whether the observed data are explainable by random processes alone, versus processes that lead to whatever structure may have been observed in the data.

But I’m still not happy with my description in part four of this series, regarding observed and expected data distributions, and what these imply for data clustering outputs. In going over Ben Bolker’s outstanding book for the zillionth time (I have literally worn the cover off of this book; parts of it are freely available here), I find that he explains, better than I did, what I was trying to say, in his description of the negative binomial distribution relative to the concept of statistical over-dispersion, where he writes (p. 124):

…rather than counting the number of successes obtained in a fixed number of trials, as in a binomial distribution, the negative binomial counts the number of failures before a pre-determined number of successes occurs.

This failure-process parameterization is only occasionally useful in ecological modeling. Ecologists use the negative binomial because it is discrete, like the Poisson, but its variance can be larger than its mean (i.e. it can be over-dispersed). Thus, it’s a good phenomenological description of a patchy or clustered distribution with no intrinsic upper limit, that has more variance than the Poisson…The over-dispersion parameter measures the amount of clustering, or aggregation, or heterogeneity in the data…

Specifically, you can get a negative binomial distribution as the result of a Poisson sampling process where the rate lambda itself varies. If lambda is Gamma-distributed (p.131) with shape parameter k and mean u, and x is Poisson-distributed with mean lambda, then the distribution of x will be a negative binomial distribution with mean u and over-dispersion parameter k (May, 1978; Hilborn and Mangel, 1997). In this case, the negative binomial reflects unmeasured (“random”) variability in the population.

The relevance of this quote is that a distribution that is over-dispersed, that is, one that has longer right or left (or both) tails than expected from a Poisson distribution having a given mean, is evidence for a non-constant process structuring the data. The negative binomial distribution describes this non-constancy, in the form of an “over-dispersion parameter” (k). In that case, the process that is varying is doing so smoothly (as defined by a gamma distribution), and the resulting distribution of observations will therefore also be smooth. In a simpler situation, one where there are say, just two driving process states, a bi-modal distribution of observations will result.
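
Bolker’s gamma-mixed Poisson point is easy to check by simulation; here’s a minimal sketch with arbitrary parameter values (u and k as in the quote):

set.seed(1)
n = 1e5; u = 5; k = 2
lambda = rgamma(n, shape = k, rate = k/u)	# gamma-distributed rate with mean u and shape k
x.mix = rpois(n, lambda)			# Poisson sampling around the varying rate
x.nb = rnbinom(n, size = k, mu = u)		# negative binomial with the same u and k
rbind(mixture = c(mean(x.mix), var(x.mix)),
      negbin  = c(mean(x.nb), var(x.nb)))	# both should approximate mean u, variance u + u^2/k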

Slapping a clustering algorithm on the latter will return two clusters whose distinction is truly meaningful–the two sets of values were likely generated by two different generating parameters. A clustering applied to a negative binomial distribution will be arbitrary with respect to just which values get placed in which cluster, and even to the final number of clusters delineated, but not with respect to the idea that the observations do not result from a single homogeneous process, which is a potentially important piece of information. Observation of the data, followed by some maximum likelihood curve fits of negative binomial distributions, would then inform one that the driving process parameters varied smoothly, rather than discretely/bimodally.

On clustering, part four

I have not written any post, or series thereof, that raised more questions, or made me think more precisely about the purposes and products of statistical approaches/algorithms, than this one. I’m finding this topic of statistical clustering of multivariate data to be far subtler and more interesting than I thought when I waded into it. Justifiably counter-intuitive even; I encourage you to read through this one as I think the root concepts are pretty important.

It’s easy to misunderstand exactly what the output of a statistical/mathematical procedure represents, and thus vital to grasp just what is going on, and why. This might seem like a fully obvious statement, but my experience has often been that we who work in mostly non-experimental fields (e.g. ecology, geology, climatology, etc) frequently do not in fact have a clear and very exact grasp of what our various quantitative methods are doing. I’m not just referring to highly complex methods either, but even to simple methods we often take for granted. This applies to me personally, and to all kinds of papers by others that I’ve read, in various fields.

The first point to be made here is to retract and correct the example I gave in part one of this series. I had said there that if one clusters the ten numbers 0-9, into two groups, using either a k-means or hierarchical clustering procedure, the first group will contain {0 to 4} and the second will contain {5 to 9}. That’s not the problem–they will indeed do so (at least in R). The problem is the statement that this output is meaningless–that it does not imply anything about structure in the data. On thinking this through a little more, I conclude that whether it does or doesn’t depends on the specifics of the data.

It is very true that an inherent potential problem with clustering methods is the imposition of an artificial structure on the data when none actually exists. With the “k-means” algorithm this is a direct result of minimizing sums of the squared differences from the mean within groups, whereas with hierarchical methods (AGNES, etc), it results from iteratively joining the closest items (item to item or item to group), as determined by a constantly updated distance matrix after each join. Either way, the algorithms cause similarly valued items to be placed together in groups.

The problem with my example is, specifically, the implication that a set of values appearing to follow a smooth gradient, such as the integers 0:9, rather than displaying definite groupings/clusters, necessarily indicates a lack of structure, and thus a meaningless clustering result. This is not necessarily the case, which can only be grasped in the context of an a priori expected distribution of values, given an observed overall mean value. These expectations are given by the Poisson and gamma distributions, for integer- and real-valued data respectively.

The most immediate question arising from that last statement should be “Why exactly should the Poisson or gamma distributions define our expectation, over any other distribution?”. The answer is pretty important, as it gets at just what we’re trying to do when we cluster values. I would strongly argue that that purpose has to be the identification of structure in the data, that is, a departure from randomness, which in turn means we need an estimate of the random condition against which to gauge possible departures. Without having this “randomness reference point”, the fact that {0:4} and {5:9} will certainly fall into two different groups when clustered is nothing more than the trivial re-statement that {5:9} are larger values than {0:4}: fully meaningless. But R (and presumably most other statistical analysis programs as well), does not tell you that–it gives you no information on just how meaningful a given clustering output is. Not good: as scientists we’re after meaningful output, not just output.

The answer to the above question is that the Poisson and gamma distributions provide those randomness reference points, and the reason why is important: they are essentially descriptors of expected distributions due to a type of sampling error alone. Both are statements that if one has some randomly distributed item–say tree locations in two dimensions in a forest, or whatever example you like–then a set of pre-defined measurement units, placed at random throughout the sampling space (geographic area in my example), will vary in the number of items contained (Poisson), and in the area encompassed by measurements from random points to the nth closest objects thereto (gamma). Not sampling error in the traditional sense, but the concept is very similar, so that’s what I’ll call it.
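
Here’s a quick simulated example of that “sampling error alone” idea, using the tree-location case with arbitrary values: scatter points at random over a unit square, count them in equal-sized quadrats, and the counts come out Poisson, i.e. with a variance-to-mean ratio near one.

set.seed(42)
n.trees = 2000
x = runif(n.trees); y = runif(n.trees)		# random tree locations in a unit square
counts = table(cut(x, 20), cut(y, 20))		# tree counts in a 20 x 20 grid of quadrats
c(mean = mean(counts), var = var(as.vector(counts)),
  disp = var(as.vector(counts))/mean(counts))	# dispersion ~1 under randomness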

Therefore, departures from such expectations, in an observed data set, are necessarily evidence for structure in those data, in the sense that something other than simple sampling variation–some causative agent–is operating to cause the departure. The main goal of science is to seek and quantify these agents…but how are you supposed to seek those somethings until you have evidence that there is some something there to begin with? And similarly, of the many variables that might be important in a given system, how does one decide which of those are most strongly structured? These questions are relevant because this discussion is in the context of large, multivariate, non-experimental systems, with potentially many variables of unknown interaction. We want evidence regarding whether or not the observations are explainable by a single homogeneous process, or not.

Things get interesting, given this perspective, and I can demonstrate why using say, the set of n = 21 integers {0:20}, a perfect gradient with no evidence of aggregation.

The Poisson distribution tells us that we do not expect a distribution of {0:20} for 21 integers having a mean value of 10. We observe and expect (for a cdf at 5% quantile steps):

   quantile observed expected
     0.05        1        5
     0.10        2        6
     0.15        3        7
     0.20        4        7
     0.25        5        8
     0.30        6        8
     0.35        7        9
     0.40        8        9
     0.45        9        9
     0.50       10       10
     0.55       11       10
     0.60       12       11
     0.65       13       11
     0.70       14       12
     0.75       15       12
     0.80       16       13
     0.85       17       13
     0.90       18       14
     0.95       19       15

…as obtained by (in R):

observed = 0:20; n = length(observed)
expected = qpois(p = seq(0, 1, length.out=n), lambda=mean(observed))
df = data.frame(quantile = seq(0, 1, length.out=n), observed, expected)[2:20,]

Running a chi-square test on the observed values:

chisq.test(x=df$observed)

…we get:

data: df$observed
X-squared = 57, df = 18, p-value = .000006

There is thus only a very tiny chance of getting the observed result from sampling variation alone. Observation of the above table shows that the observed data are stretched–having longer left and right tails, relative to expectation.

But this series is about clustering and how to find evidence for it in the data…and there is no tendency toward clustering evident in these data. Indeed, the opposite is true–the observed distribution is long-tailed, “stretched out” compared to expectation. But…this result must mean that there is indeed some structuring force on the data to cause this result–some departure from the homogeneous state that the Poisson assumes. We can’t know, from just this analysis alone, just how many “real” clusters of data there are, but we do know that it must be more than one (and I hope it is apparent as to why)! More precisely, if we’ve decided to look for clusters in the data, then the chi-square test gives us evidence that more than one cluster is highly likely.

Just how many clusters is a more difficult question, but we could evaluate the first step in that direction (i.e. two clusters) by dividing the values into two roughly even sized groups defined by the midpoint (= 10) and evaluating whether Poisson-distributed values for the two groups, having means of 4.5 and 15, give a better fit to the observations:

(exp1 = qpois(p = seq(0, 1, length.out=11), lambda=4.5)[2:10])
(exp2 = qpois(p = seq(0, 1, length.out=12), lambda=15)[2:11])
(df2 = data.frame(quantile = seq(0, 1, length.out=n)[2:20], obs=df$observed, exp = sort(c(exp1,exp2))))

…which gives:

   quant. obs. exp.
     0.05   1   2
     0.10   2   3
     0.15   3   3
     0.20   4   4
     0.25   5   4
     0.30   6   5
     0.35   7   5
     0.40   8   6
     0.45   9   7
     0.50  10  10
     0.55  11  11
     0.60  12  13
     0.65  13  14
     0.70  14  14
     0.75  15  15
     0.80  16  16
     0.85  17  17
     0.90  18  18
     0.95  19  20

…and then compute the mean of the two chi-square probabilities:

obs1=0:9; obs2=10:20
p = mean(c(chisq.test(obs1)$p.value, chisq.test(obs2)$p.value))

…which returns p = 0.362
[Edit note: I had wrong values here originally–because I mistakenly ran the chi square tests on the expected values instead of the observed. Now corrected–still a very high odds ratio, just not as high as before.]

The odds ratio of the two hypotheses (two Poisson-distributed groups having means of 4.5 and 15.0, versus just one group with a mean of 10.0) is thus 0.362 / .000006 = 60,429. Thus, clustering the observed data into these two groups would be a very highly defensible decision, even though the observed data comprise a perfect gradient having no tendency toward aggregation whatsoever!

The extension from the univariate to the multivariate case is straight-forward, involving nothing more than performing the same analysis on each variable and then averaging the resulting set of probabilities or odds ratios.
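
As a rough sketch of that extension (with made-up data, and simplifying the quantile bookkeeping used above), one could wrap the univariate test in a function and average over columns; poisson.p here is a hypothetical helper, not anything defined previously:

poisson.p = function(v) chisq.test(v)$p.value			# simplified univariate GOF p-value
multivar.p = function(dat) mean(apply(dat, 2, poisson.p))	# average over variables (columns)
dat = cbind(v1 = 0:20, v2 = rpois(21, 10), v3 = rpois(21, 10))	# made-up example data
multivar.p(dat)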

On throwing a change up when a fastball’s your best pitch

Sports are interesting, and one of the interesting aspects about them, among many, is that the very unlikely can sometimes happen.

The Louisville Cardinals baseball team went 50-12 this year through the regular season and first round (“regional”) of the NCAA baseball playoff. Moreover, they were an astounding 36-1 at home, the only loss coming by three runs at the hands of last year’s national champion, Virginia. Over the last several years they have been one of the best teams in the country, making it to the College World Series twice, though not yet winning it. They were considered by the tournament selection committee to be the #2 team in the country, behind Florida, but many of the better computer polls had Louisville as #1.

The college baseball playoff is one of the most interesting tournaments out there, from a structural perspective. Because it’s baseball, it’s not a one-loss tournament, at any of the four levels thereof, at least since 2003. Those four levels are: (1) the sixteen regionals of four teams each, (2) the eight “super regionals” determined by the regional champs, and (3 and 4) the two rounds at the College World Series in Omaha, comprising the eight super regional champs. A team can in fact lose as many as four games total over the course of the playoff, and yet still win the national championship. It’s not easy to do though, because a loss in the first game, at either the regional level, or in round one of the CWS, requires a team to win four games to advance, instead of three. In the 13 years of this format, only Fresno State has pulled that feat off, in 2008.

In winning their regional and being one of the top eight seeds, Louisville hosted the winner of the Nashville regional, which was won in an upset over favorite Vanderbilt, by UC Santa Barbara of the Big West Conference. That conference is not as good top to bottom as is the Atlantic Coast Conference (ACC) that Louisville plays in, but neither is it any slouch, containing perennial power CSU Fullerton, and also Long Beach State, who gave third ranked Miami fits in its regional. More generally, the caliber of the baseball played on the west coast, including the PAC-12 and the Big West, is very high, though often slighted by writers and pollsters in favor of teams from the southeast (ACC and Southeast (SEC) conferences in particular). Based on the results of the regional and super regional playoff rounds, the slighting this year was serious: only two of the eight teams in the CWS are from the ACC/SEC, even though teams from the two conferences had home field advantage in fully 83 percent (20/24) of all the first and second round series. Five schools west of the Mississippi River are in, including the top three from the Big 12 conference.

In the super regional, the first team to win twice goes on to the CWS in Omaha. To make a long and interesting story short, UCSB won the first game 4-2 and thus needed just one more win to knock out Louisville and advance to the CWS for the first time in their history. Down 3-0, in the bottom of the ninth inning, they were facing one of the best closers in all of college baseball, just taken as the 27th overall pick in the MLB amateur draft by the Chicago White Sox. Coming in with 100+ mph fastballs, he got the first batter out without problem. However, the second batter singled, and then he began to lose his control and he did exactly what you shouldn’t do: walked the next two batters to load the bases. The UCSB coach decided to go to his bench to bring in a left-handed hitting pinch-hitter, a freshman with only 26 at-bats on the season, albeit with one home run among his nine hits on the year.

And the rest, as they say, is history:

(All the games from this weekend are available for replay here)

On clustering, part three

It’s not always easy to hit all the important points in explaining an unfamiliar topic and I need to step back and mention a couple of important but omitted points.

The first of these is that, from a known mean and the assumption of a random distribution of values, we can estimate the expected distribution of individual values; and since the mean must itself be obtained from a set of individual values, we can compare the expected and observed distributions and thus evaluate randomness. The statistical distributions designed for this task are the Poisson and the gamma, for integer- and real-valued data respectively. Much of common statistical analysis is built around the normal distribution, and people are thus generally most familiar with it and prone to use it, but the normal won’t do the job here. This is primarily because it’s not designed to handle skewed distributions, which are a problem whenever data values are small or otherwise limited at one end of the distribution (most often by the value of zero).

Conversely, the Poisson and gamma have no problem with such situations: they are built for the task. This fact is interesting, given that both are defined by just one parameter (the overall mean) instead of two, as is the case for the normal (mean and standard deviation). So, they are simpler, and yet are more accurate over more situations than is the normal–not an everyday occurrence in modeling. Instead, for whatever reason, there’s historically been a lot of effort devoted to transforming skewed distributions into roughly normal ones, usually by taking logarithms or roots, as in e.g. the log-normal distribution. But this is ad hoc methodology that brings with it other problems, including back-transformation.
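
A two-line illustration of the skewness problem with small-mean count data, assuming a mean of 2 purely for example:

dpois(0, lambda = 2)			# Poisson: ~13.5% of the probability mass sits at zero
pnorm(0, mean = 2, sd = sqrt(2))	# normal with the same mean and variance: ~8% of its mass falls below zero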

The second point is hopefully more obvious. This is that although it is easy to just look at a small set of univariate data and see evidence of structure (clustered or overly regular values), large sample sizes and/or multivariate data quickly overwhelm the brain’s ability to do this well, and at any rate we want to assign a probability to this non-randomness.

The third point is maybe the most important one, and relates to why the Poisson and gamma (and others, e.g. the binomial, negative binomial etc.) are very important in analyzing non-experimental data in particular. Indeed, this point relates to the issue of forward versus inverse modeling, and to issues in legitimacy of data mining approaches. I don’t know that it can be emphasized enough how radically different the experimental and non-experimental sciences are, in terms of method and approach and consequent confidence of inference. This is no small issue, constantly overlooked IMO.

If I’ve got an observed data set, originating from some imperfectly known set of processes operating over time and space, I’ve got immediate trouble on my hands in terms of causal inference. Needless to say there are many such data sets in the world. When the system is known to be complex, such that elucidating the mechanistic processes at the temporal and spatial scales of interest is likely to be difficult, it makes perfect sense to examine whether certain types of structures might exist just in the observed data themselves, structures that can provide clues as to just what is going on. The standard knock on data mining and inverse modeling approaches more generally is that of the possibility of false positive results–concluding that apparent structures in the data are explainable by some driving mechanism when in fact they are due to random processes. This is of course a real possibility, but I find this objection to be more or less completely overblown, primarily because those who conduct this type of analysis are usually quite well aware of this possibility thank you.

Overlooked in those criticisms is the fact that by first identifying real structure in the data–patterns explainable by random processes at only a very low probability–one can immediately gain important clues as to just what possible causal factors to examine more closely, instead of going on a random fishing expedition. A lot of examples can be given here, but I’m thinking ecologically, and in ecology there are many variables that vary in a highly discontinuous way, and this affects the way we have to consider things. This concept applies not only to biotic processes, which are inherently structured by the various aggregational processes inherent in populations and communities of organisms, but to various biophysical thresholds and inflection points as well, whose operation over large scales of space or time is often anything but well understood or documented. As just one rough but informative example, in plant ecology a large fraction of what is going on occurs underground, where all kinds of important discontinuities can occur–chemical, hydrologic, climatic, and of course biological.

So, the search for non-random patterns within observed data sets–before ever even considering the possible drivers of those patterns–is, depending on the level of apriori knowledge of the system in question, a potentially very important activity. In fact, I would argue that this is the most natural and efficient way to proceed in running down cause and effect in complex systems. And it is also one requiring a scientist to have a definite awareness of the various possible drivers of observed patterns and their scales of variation.

So, there’s a reason plant ecologists should know some physiology, some reproductive biology, some taxonomy, some soil science, some climatology, some…

On clustering, part two

In part one of what has quickly developed into an enthralling series, I made the points that (1) at least some important software doesn’t provide a probability value for cluster outputs, and (2) that while it’s possible to cluster any data set, univariate or multivariate, into clearly distinct groups, so doing doesn’t necessarily mean anything important. Such outputs only tell us something useful if there is some actual structure in the data, and the clustering algorithm can detect it.

But just what is “structure” in the data? The univariate case is simplest because with multivariate data, structure can have two different aspects. But in either situation we can take the standard statistical stance that structure is the detectable departure from random expectation, at some defined probability criterion (p value). The Poisson and gamma distributions define this expectation, the former for count (integer valued) data, and the latter for continuous data. By “expectation” I mean the expected distribution of values across the full data. If we have a calculated overall mean value, i.e. an overall rate, the Poisson and gamma then define this distribution, assuming each value is measured over an identical sampling interval. With the Poisson, the latter takes the form of a fixed denominator, whereas with the gamma it takes the form of a fixed numerator.

Using the familiar example of density (number of objects per unit space or time), the Poisson fixes the unit space while the integer number of objects in each unit varies, whereas the gamma fixes the integer rank of the objects that will be measured to from random starting points, with the distance to each such object (and corresponding area thereof) varying. The two approaches are just flip sides of the same coin really, but with very important practical considerations related to both data collection and mathematical bias. Without getting heavily into the data collection issue here, the Poisson approach–counting the number of objects in areas of pre-defined size–can get you into real trouble in terms of time efficiency (regardless of whether density tends to be low or high). This consideration is very important in opting for distance-based sampling and the use of the gamma distribution over area-based sampling and use of the Poisson.
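
Here’s a small simulation sketch of that distance-based flip side, with arbitrary values and crude edge handling: for a random (Poisson) pattern of intensity lambda objects per unit area, the area out to the nth-closest object from a random starting point is gamma-distributed with shape n and rate lambda, so its mean is n/lambda.

set.seed(7)
lambda = 50; L = 10; n = 3			# intensity, window size, object rank
N = rpois(1, lambda * L^2)			# number of objects in the window
pts = matrix(runif(2*N, 0, L), ncol = 2)	# random object locations
samp = matrix(runif(2000, 1, L - 1), ncol = 2)	# 1000 random start points, buffered from the edges
area.n = apply(samp, 1, function(s) {
  d = sqrt((pts[,1] - s[1])^2 + (pts[,2] - s[2])^2)
  pi * sort(d)[n]^2				# area out to the nth closest object
})
c(mean.obs = mean(area.n), mean.gamma = n/lambda)	# should agree closely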

But returning to the original problem as discussed in part one, the two standard clustering approaches–k-means and hierarchical–are always going to return groupings that have a low probability of random occurrence, no matter what "natural structure" there may actually be in the data. The solution, it seems to me, is to instead evaluate relative probabilities: the probability of the values within each group being Poisson or gamma distributed, relative to the probability of the overall distribution of values. In each case these probabilities are determined by a goodness-of-fit test, namely a chi-square test for count (integer) data and a Kolmogorov-Smirnov test for continuous data. If there is in fact some natural structure in the data–that is, groups of values that are more similar (or dissimilar) to each other than the Poisson or gamma expectation allows–then this relative probability (or odds ratio if you like) will be maximized at the clustering solution that most closely reflects the actual structure in the data, that solution being defined by (1) the number of groups, and (2) the membership of each. It is a maximum likelihood approach to the problem.
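
Here is one rough way that idea might be coded for the continuous (gamma) case, as I read it. This is my own sketch, not the author's actual method; the method-of-moments gamma fit, the raw Kolmogorov-Smirnov p-value (with parameters estimated from the same data), and the averaging of within-group p-values are all simplifying assumptions.

gamma.gof <- function(v) {                    # crude gamma goodness-of-fit p-value for a vector of values
  m <- mean(v); s2 <- var(v)                  # method-of-moments estimates of shape and rate
  ks.test(v, "pgamma", shape = m^2 / s2, rate = m / s2)$p.value
}
odds.ratio <- function(v, groups) {           # within-group fit relative to the overall fit
  mean(sapply(split(v, groups), gamma.gof)) / gamma.gof(v)
}

# e.g., a two-group k-means split of some continuous variable v (simulated here):
v  <- c(rgamma(50, shape = 2, rate = 2), rgamma(50, shape = 2, rate = 0.2))
km <- kmeans(v, centers = 2, nstart = 25)
odds.ratio(v, km$cluster)                     # values well above 1 suggest real grouping structure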

If there is little or no actual structure in the data, then these odds ratios, computed across different numbers of final groups, will show no clearly defensible maximal value, but rather a broad, flat plateau in which all the ratios are similar, varying from each other only randomly. But when there is real structure therein, there will be a ratio that is quantifiably higher than all the others, a unimodal response with a peak value. The statistical significance of that maximum can be evaluated with a likelihood ratio test or something similar, though I haven't thought very hard about that issue yet.
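
Continuing the hypothetical odds.ratio sketch above (again, just an illustration of the idea rather than a worked-out method), one could scan candidate group numbers and look for the unimodal peak being described:

ratios <- sapply(2:6, function(k) odds.ratio(v, kmeans(v, centers = k, nstart = 25)$cluster))
plot(2:6, ratios, type = "b", xlab = "number of groups", ylab = "relative fit (odds ratio)")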

Moving from the univariate case to the multivariate ain't no big deal really, in terms of the above discussion–it just requires averaging those odds ratios over all variables. But multivariate data does introduce a second, subtle aspect into what we mean by the term "data structure", in the following respect. A situation can arise wherein no single variable in the data shows clear evidence of structure, per the above approach, when in fact there very much is such, but of a different kind. That outcome would occur whenever particular pairs (or larger groups) of variables are correlated with each other (above random expectation), even though the values of each such variable are in fact Poisson/gamma distributed overall. That is, there is a statistically defensible relationship between variables across sample units, but no detectable non-random structure in the values within each variable, across those sample units.
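
In code terms, still leaning on the hypothetical odds.ratio sketch from above, the first aspect is just an average over columns, while the second aspect calls for a separate check of between-variable association. The data and variable names here are invented purely for illustration.

X   <- cbind(v1 = rgamma(100, 2, 2), v2 = rgamma(100, 2, 2))   # hypothetical two-variable data set
grp <- kmeans(X, centers = 2, nstart = 25)$cluster
mean(apply(X, 2, odds.ratio, groups = grp))   # first aspect: per-variable structure, averaged over variables
cor.test(X[, 1], X[, 2])                      # second aspect: association between variables across sample units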

Such an outcome would provide definite evidence of behavioral similarity among variables even in the absence of a structuring of those variables by some latent (unmeasured) variable. I think it would be interesting to know how often such a situation occurs in different types of ecological and other systems, and I’m pretty sure nobody’s done any such analysis. Bear in mind however that I also once thought, at about 4:30 AM on a sleep deprived week if I remember right, that it would be interesting to see if I could beat the Tahoe casinos at blackjack based on quick probability considerations.

I hope the above has made at least some sense and you have not damaged your computer by say, throwing your coffee mug through the screen, or yelled something untoward, at volume, within earshot of those who might take offense. The Institute hereby disavows any responsibility, liability or other legal or financial connection to such events, past or future.

There will be more!

On clustering, part one

In ecology and other sciences, grouping similar objects together, for further analytical purposes or just as an end in itself, is a fundamental task, one accomplished by cluster analysis, one of the most basic tools in statistics. In all but the smallest sample sizes, the number of possible groupings very rapidly becomes enormous, and it is therefore necessary to both (1) have some way of efficiently avoiding the vast number of clearly non-optimal clusterings, and (2) choose the best solution from among those that seem at least reasonable.
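
To put a number on "enormous": the count of ways to split n objects into k non-empty groups is given by the Stirling numbers of the second kind, which a few lines of R can compute. This is a naive recursive sketch of my own, fine for small n.

stirling2 <- function(n, k) {                 # ways to partition n objects into k non-empty groups
  if (k == 0) return(as.numeric(n == 0))
  if (n == 0 || k > n) return(0)
  k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
}
stirling2(10, 2)    # 511 possible two-group splits of just 10 objects
stirling2(20, 4)    # roughly 4.5e10 four-group splits of 20 objects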

First some background. There are (at least) three basic approaches to clustering. Two of these are inherently hierarchical in nature: they either aggregate individual objects into ever-larger groups (agglomerative methods), or successively divide the entire set into ever smaller ones (divisive methods). Hierarchical methods are based on a distance matrix that defines the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one’s definitions of “distance”. This distance matrix increases in size as a function of (n-1)(n/2), or roughly a squared function of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which typically the average scientist does not.
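
In R, the hierarchical route looks something like the following (an agglomerative example on made-up data; the linkage method and the number of groups cut from the tree are arbitrary choices for illustration):

X  <- matrix(rnorm(100 * 3), ncol = 3)   # 100 objects measured on 3 variables
d  <- dist(X)                            # all pairwise Euclidean distances between objects
hc <- hclust(d, method = "average")      # agglomerative clustering built from the distance matrix
grp <- cutree(hc, k = 4)                 # trim the tree to 4 groups
length(d)                                # 100*99/2 = 4950 distances; this is what explodes with n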

The k-means clustering algorithm works differently–it doesn't use a distance matrix. Instead it chooses a number of random cluster starting points ("centers"), assigns each object to its nearest center, and then iteratively recomputes the centers and reassigns objects until the grouping stabilizes. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then the best solution is not guaranteed. Furthermore, one sets the number of final clusters desired beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all cluster numbers that one deems reasonable, with "reasonable" often being less than obvious.
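
And the k-means route, with the multiple-starts caveat just mentioned handled via the nstart argument (the choice of 4 centers and 50 starts is arbitrary, and X is the same made-up matrix from the sketch above):

km <- kmeans(X, centers = 4, nstart = 50)   # 50 random starting configurations, best one kept
km$cluster                                  # group membership of each object
km$tot.withinss                             # total within-group sum of squares for this choice of k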

When I first did a k-means cluster analysis, years ago, I did it in SPSS, and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed to be a pretty major shortcoming, possibly an inexcusable one. Now I'm working in R, and I find…the same thing. In R, the two workhorse clustering functions, both in the main stats package, are kmeans and hclust, corresponding to k-means and hierarchical clustering, respectively. Neither one gives the probability of the solution as part of its output. So it wasn't just SPSS–and if R doesn't provide it, then it's quite possible that no statistical software program (SAS, S-Plus, SigmaStat, etc.) does so, although I don't know that for sure.

There is one function in R that attempts to identify what it calls the "optimal clustering": function optCluster, in the package of the same name. But that function, while definitely useful, appears only to provide a set of different metrics by which to evaluate the effectiveness of any given clustering solution, as obtained from 16 possible clustering methods, with no actual probabilities attached to any of them. What I'm after is different, more defensible and definitely more probabilistic. It requires some careful thought regarding just what clustering should be all about in the first place.

If we talk about grouping objects together, we gotta be careful. This piece at Variance Explained gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from any data set, but the result doesn’t necessarily mean anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place, say, 0 to 3 in one group and 4 to 9 in the other, or some similar outcome (which they conceivably might–I haven't tested them), the main point remains: there are no "natural" groupings in those ten numbers–they are as evenly spaced as it is possible to be, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you've got the set {0,1,2,7,8,9}, it's quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of their group mean, with an obvious gap between the two groups.
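
The toy example can be run directly; the exact hierarchical split can depend on tie-breaking and linkage method, so treat the second line as indicative rather than guaranteed.

kmeans(0:9, centers = 2, nstart = 10)$cluster                    # splits 0-4 vs 5-9
cutree(hclust(dist(0:9)), k = 2)                                 # typically the same two groups
kmeans(c(0, 1, 2, 7, 8, 9), centers = 2, nstart = 10)$cluster    # recovers the two natural groups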

This point is critical, as it indicates that we should seek a clustering evaluation method based on an algorithm capable of discriminating between a perfect gradient and tightly clustered data. Actually it has to do better than that–it has to be able to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will have a natural degree of clustering by definition, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about doing this, but the one that seems most directly obvious and relevant is based on the Poisson distribution. The Poisson defines the expected values in a set of sub-samples, given a known overall value for the variable of interest determined from the entire object collection. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values of the groups returned by a given clustering algorithm (of any method) follow the expectation defined by the Poisson distribution with that overall mean (the Poisson being defined by just one parameter). The lower that probability is, the more likely it is that the clusters returned by the algorithm do in fact represent a real feature of the data set, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Now maybe somebody's already done this, I don't know, but I've not seen it in any of the statistical software I've used, including R's two workhorse packages stats and cluster.
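
One simple way to put a number on that idea (my own sketch, not something from any package, and the chi-square approximation is rough when counts are small) is to compare each group's total count to its Poisson expectation under the single overall rate:

poisson.group.p <- function(counts, groups) {
  obs  <- tapply(counts, groups, sum)       # observed total count in each group
  n    <- tapply(counts, groups, length)    # number of sample units in each group
  expd <- n * mean(counts)                  # expected group totals under one common rate
  stat <- sum((obs - expd)^2 / expd)        # Pearson chi-square statistic
  pchisq(stat, df = length(obs) - 1, lower.tail = FALSE)
}
# a low p-value means the group means depart from the single-Poisson expectation;
# note, though (per part two above), that algorithm-chosen groups will nearly always do so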

More hideous detail to come so take cover and shield your eyes.