This post wraps up and summarizes this series of articles on clustering data.
The main point of the series is that one needs an objective method for obtaining evidence of meaningful groupings of values (clusters) in a given data set. This issue is most relevant to non-experimental science, in which one is trying to determine whether the observed data are explainable by random processes alone, or instead by processes that produce whatever structure has been observed in the data.
But I’m still not happy with my description in part four of this series, regarding observed and expected data distributions, and what these imply for data clustering outputs. In going over Ben Bolker’s outstanding book for the zillionth time (I have literally worn the cover off of this book, parts of it freely available here), I find that he explains what I was trying to say better than I did, in his description of the negative binomial distribution relative to the concept of statistical over-dispersion, where he writes (p. 124):
…rather than counting the number of successes obtained in a fixed number of trials, as in a binomial distribution, the negative binomial counts the number of failures before a pre-determined number of successes occurs.
This failure-process parameterization is only occasionally useful in ecological modeling. Ecologists use the negative binomial because it is discrete, like the Poisson, but its variance can be larger than its mean (i.e. it can be over-dispersed). Thus, it’s a good phenomenological description of a patchy or clustered distribution with no intrinsic upper limit, that has more variance than the Poisson…The over-dispersion parameter measures the amount of clustering, or aggregation, or heterogeneity in the data…
Specifically, you can get a negative binomial distribution as the result of a Poisson sampling process where the rate lambda itself varies. If lambda is Gamma-distributed (p.131) with shape parameter k and mean u, and x is Poisson-distributed with mean lambda, then the distribution of x will be a negative binomial distribution with mean u and over-dispersion parameter k (May, 1978; Hilborn and Mangel, 1997). In this case, the negative binomial reflects unmeasured (“random”) variability in the population.
The relevance of this quote is that a distribution that is over-dispersed, that is, one whose variance is larger (and whose tails are therefore longer) than expected from a Poisson distribution having the same mean, is evidence for a non-constant process structuring the data. The negative binomial distribution describes this non-constancy by way of an “over-dispersion parameter” (k). In that case, the process that is varying is doing so smoothly (as defined by a gamma distribution), and the resulting distribution of observations will therefore also be smooth. In a simpler situation, one where there are, say, just two driving process states, a bimodal distribution of observations will typically result, provided the two states are far enough apart.
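To make the contrast concrete, here is a minimal simulation sketch in Python (standard library only). It draws counts from three hypothetical processes: a constant Poisson rate, a rate that varies smoothly according to a gamma distribution (which, per the quote, gives a negative binomial), and a rate that flips between two states. The values u = 5, k = 2 and the two rates 2 and 15 are arbitrary choices of mine for illustration, as are all of the function names; the Poisson sampler is Knuth's simple multiplication method, which is fine only for modest rates.

```python
import math
import random

def poisson_draw(lam, rng):
    """One Poisson variate via Knuth's multiplication method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate(n, rate_sampler, rng):
    """Draw n counts; each count gets its own rate from rate_sampler."""
    return [poisson_draw(rate_sampler(rng), rng) for _ in range(n)]

def dispersion_index(xs):
    """Sample variance divided by sample mean; close to 1 for a Poisson."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return var / m

def text_histogram(xs, width=50):
    """Print a crude horizontal histogram of integer counts."""
    counts = {}
    for x in xs:
        counts[x] = counts.get(x, 0) + 1
    peak = max(counts.values())
    for value in range(min(counts), max(counts) + 1):
        bar = "#" * round(width * counts.get(value, 0) / peak)
        print(f"{value:3d} {bar}")

rng = random.Random(42)  # arbitrary seed, for repeatability only
n = 20000
u, k = 5.0, 2.0          # illustrative mean and over-dispersion, not from any data set

samples = {
    "constant rate": simulate(n, lambda r: u, rng),
    # gammavariate takes (shape, scale); shape k and scale u/k give mean u
    "gamma-varying rate": simulate(n, lambda r: r.gammavariate(k, u / k), rng),
    # two equally likely process states (hypothetical rates)
    "two-state rate": simulate(n, lambda r: r.choice((2.0, 15.0)), rng),
}

for name, xs in samples.items():
    print(f"{name}: variance/mean = {dispersion_index(xs):.2f}")
    text_histogram(xs)
```

Under these assumptions the constant-rate sample should have a variance-to-mean ratio near 1, while both varying-rate samples should come out well above 1. The ratio alone therefore flags over-dispersion but does not say what kind; it is the shape of the histogram that separates the smooth, single-humped gamma-varying case from the two-humped two-state case.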
Slapping a clustering algorithm on the bimodal, two-state case will return two clusters whose distinction is truly meaningful: the two sets of values were likely produced by two different generating parameters. A clustering applied to a negative binomial distribution will be arbitrary with respect to just which values get placed in which cluster, and even with respect to the final number of clusters delineated, but not with respect to the idea that the observations do not result from a single homogeneous process, which is a potentially important piece of information. Inspection of the data, followed by some maximum likelihood curve fits of negative binomial distributions, would then inform one that the driving process parameters varied smoothly, rather than discretely (bimodally).
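As a rough sketch of what such a maximum likelihood fit might look like, the Python below (standard library only) fits both a Poisson and a negative binomial to simulated over-dispersed counts and compares them by AIC. Everything here is an assumption of mine for illustration: the simulated data (u = 5, k = 2), the function names, the search bounds on k, and the use of a golden-section search, which presumes the likelihood has a single peak in k. The mean is fixed at the sample mean, which is its maximum likelihood estimate under this mean/over-dispersion parameterization.

```python
import math
import random

def poisson_draw(lam, rng):
    """One Poisson variate via Knuth's multiplication method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def poisson_loglik(xs, lam):
    """Poisson log-likelihood of the counts xs at rate lam."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

def negbin_loglik(xs, mu, k):
    """Negative binomial log-likelihood with mean mu and over-dispersion parameter k."""
    log_a = math.log(k / (k + mu))
    log_b = math.log(mu / (k + mu))
    return sum(
        math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
        + k * log_a + x * log_b
        for x in xs
    )

def fit_negbin(xs, k_low=1e-2, k_high=1e3, iterations=60):
    """Set mu to the sample mean, then maximize over k by golden-section search on log(k)."""
    mu = sum(xs) / len(xs)
    ratio = (math.sqrt(5) - 1) / 2
    lo, hi = math.log(k_low), math.log(k_high)
    for _ in range(iterations):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if negbin_loglik(xs, mu, math.exp(a)) < negbin_loglik(xs, mu, math.exp(b)):
            lo = a  # the maximum lies to the right of a
        else:
            hi = b  # the maximum lies to the left of b
    return mu, math.exp((lo + hi) / 2)

# Simulated stand-in for observed counts: a Poisson whose rate is gamma-distributed.
rng = random.Random(7)   # arbitrary seed
u_true, k_true = 5.0, 2.0
data = [poisson_draw(rng.gammavariate(k_true, u_true / k_true), rng) for _ in range(3000)]

mu_hat, k_hat = fit_negbin(data)
ll_nb = negbin_loglik(data, mu_hat, k_hat)
ll_pois = poisson_loglik(data, mu_hat)  # the Poisson ML rate is also the sample mean

aic_nb = 2 * 2 - 2 * ll_nb      # two fitted parameters: mu and k
aic_pois = 2 * 1 - 2 * ll_pois  # one fitted parameter: lambda

print(f"mu_hat = {mu_hat:.2f}, k_hat = {k_hat:.2f}")
print(f"AIC Poisson = {aic_pois:.1f}, AIC negative binomial = {aic_nb:.1f}")
```

A much lower AIC for the negative binomial, together with a small fitted k, would point toward a smoothly varying rate rather than a single homogeneous process. It would not by itself rule out a discrete two-state process, so the fit belongs alongside a look at the histogram for more than one mode, not in place of it.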