# On clustering, part three

It’s not always easy to hit all the important points when explaining an unfamiliar topic, so I need to step back and mention a few important points I omitted.

The first of these is that, assuming a random distribution of values, we can estimate the expected distribution of individual values from a known mean. Since the mean must be obtained from a set of individual values, this means we can compare the expected and observed values, and thus evaluate randomness. The statistical distributions designed for this task are the Poisson and the gamma, for integer- and real-valued data respectively. Much of common statistical analysis is built around the normal distribution, so people are generally most familiar with it and prone to use it, but the normal won’t do the job here. This is primarily because it is not designed to handle skewed distributions, which are a problem whenever data values are small or otherwise bounded at one end of the distribution (most often at zero).
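As a minimal sketch of this idea for integer data, using made-up quadrat counts (the data and variable names here are hypothetical, purely for illustration): compute the mean, then compare the Poisson probabilities implied by that mean against the observed relative frequencies.

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing exactly k events under a Poisson with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

# Hypothetical example: counts of plants per quadrat (made-up data)
observed = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]
mu = sum(observed) / len(observed)  # sample mean = 2.0

# Compare expected Poisson probabilities with observed relative frequencies
for k in range(max(observed) + 1):
    expected = poisson_pmf(k, mu)
    obs_freq = observed.count(k) / len(observed)
    print(f"k={k}: expected={expected:.3f}, observed={obs_freq:.3f}")
```

Large discrepancies between the two columns are the raw material for a formal randomness test.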

Conversely, the Poisson and gamma have no problem with such situations: they are built for the task. This is interesting, given that both can be specified from just one parameter, the overall mean (for the gamma, once a shape is fixed), instead of two as for the normal (mean and standard deviation). So they are simpler, and yet more accurate over more situations than the normal, which is not an everyday occurrence in modeling. Instead, for whatever reason, a lot of historical effort has gone into transforming skewed distributions into roughly normal ones, usually by taking logarithms or roots, as in e.g. the log-normal distribution. But this is ad hoc methodology that brings other problems with it, including bias on back-transformation.
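One of those back-transformation problems can be shown in a few lines. A quick sketch with made-up skewed values (hypothetical data, for illustration only): exponentiating the mean of the logs gives the geometric mean, which systematically understates the arithmetic mean for skewed data.

```python
import math

# Hypothetical skewed data (e.g. biomass values); made up for illustration
x = [0.5, 0.8, 1.0, 1.2, 2.0, 3.5, 6.0, 12.0]

arith_mean = sum(x) / len(x)
# Naive back-transformation: exponentiate the mean of the logs
log_mean_back = math.exp(sum(math.log(v) for v in x) / len(x))

print(f"arithmetic mean: {arith_mean:.2f}")
print(f"back-transformed log mean: {log_mean_back:.2f}")
# The back-transformed value is the geometric mean, which is always
# <= the arithmetic mean: a systematic downward bias for skewed data.
```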

The second point is hopefully more obvious: although it is easy to look at a small set of univariate data and see evidence of structure (clustered or overly regular values), large sample sizes and/or multivariate data quickly overwhelm the brain’s ability to do this well, and in any case we want to assign a probability to that non-randomness.
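One standard way to attach such a probability, sketched here under the Poisson null (the data and function name are my own illustrative choices, not anything from the text): the index of dispersion. Under a Poisson null, (n − 1) × variance / mean follows a chi-square distribution with n − 1 degrees of freedom; a ratio well above 1 indicates clustering, well below 1 over-regularity. The tail probability below uses Fisher's normal approximation to the chi-square, to keep the sketch stdlib-only.

```python
import math

def dispersion_test(counts):
    """Index-of-dispersion test for Poisson randomness.
    Returns (variance/mean ratio, approximate upper-tail p-value).
    ratio > 1 suggests clustering; ratio < 1 suggests over-regularity."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((x - mean) ** 2 for x in counts) / (n - 1)
    ratio = var / mean
    chi2 = (n - 1) * ratio
    # Fisher's approximation: sqrt(2*chi2) - sqrt(2*df - 1) ~ N(0, 1)
    z = math.sqrt(2 * chi2) - math.sqrt(2 * (n - 1) - 1)
    p_clustered = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail p-value
    return ratio, p_clustered

# Hypothetical quadrat counts: strongly clustered (many zeros, a few big counts)
clustered = [0, 0, 0, 0, 9, 0, 0, 8, 0, 0, 0, 7, 0, 0, 0]
ratio, p = dispersion_test(clustered)
print(f"variance/mean = {ratio:.2f}, upper-tail p = {p:.2e}")
```

The clustered sample gives a ratio far above 1 and a vanishingly small p-value, which is exactly the kind of probabilistic statement the eyeball cannot make.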

The third point is maybe the most important one, and relates to why the Poisson and gamma (and others, e.g. the binomial and negative binomial) are so important in analyzing non-experimental data in particular. Indeed, this point bears on the issue of forward versus inverse modeling, and on the legitimacy of data mining approaches. I don’t know that it can be emphasized enough how radically different the experimental and non-experimental sciences are in method, approach, and consequent confidence of inference. This is no small issue, and it is constantly overlooked IMO.

If I’ve got an observed data set, originating from some imperfectly known set of processes operating over time and space, I’ve got immediate trouble on my hands in terms of causal inference. Needless to say, there are many such data sets in the world. When the system is known to be complex, such that elucidating the mechanistic processes at the temporal and spatial scales of interest is likely to be difficult, it makes perfect sense to examine whether certain types of structure might exist in the observed data themselves, structure that can provide clues as to just what is going on. The standard knock on data mining, and inverse modeling more generally, is the possibility of false positives: concluding that apparent structures in the data are explainable by some driving mechanism when in fact they are due to random processes. This is of course a real possibility, but I find the objection more or less completely overblown, primarily because those who conduct this type of analysis are usually quite well aware of it, thank you.

Overlooked in those criticisms is the fact that by first identifying real structure in the data, i.e. patterns explainable by random processes at only a very low probability, one can immediately gain important clues as to which possible causal factors to examine more closely, instead of going on a random fishing expedition. A lot of examples could be given here, but I’m thinking ecologically, and in ecology there are many variables that vary in a highly discontinuous way, which affects how we have to consider things. This applies not only to biotic processes, which are inherently structured by the various aggregational processes inherent in populations and communities of organisms, but also to various biophysical thresholds and inflection points, whose operation over large scales of space and time is often anything but well understood or documented. As just one rough but informative example, in plant ecology a large fraction of what is going on occurs underground, where all kinds of important discontinuities can occur: chemical, hydrologic, climatic, and of course biological.

So, the search for non-random patterns within observed data sets, before ever even considering the possible drivers of those patterns, is, depending on the level of a priori knowledge of the system in question, a potentially very important activity. In fact, I would argue that it is the most natural and efficient way to proceed in running down cause and effect in complex systems. It is also one that requires a scientist to have a definite awareness of the various possible drivers of observed patterns and their scales of variation.

So, there’s a reason plant ecologists should know some physiology, some reproductive biology, some taxonomy, some soil science, some climatology, some…