This post is about how binomial probability models can, and cannot, be applied for inference in a couple of very unrelated biological contexts. The issue has (once again) made popular media headlines recently, been the focus of talk shows, etc., and so I thought it would be a good time to join in the discussions. We should, after all, always focus our attention wherever other large masses of people have focused theirs, particularly on the internet. No need to depart from the herd.

Binomial models give the coefficients/probabilities for all possible outcomes when repeated trials are performed of an event that has two possible outcomes that occur with known probabilities. The classic example is flipping a coin: each flip has two possible outcomes, heads or tails, with probabilities h = t = 0.5, and if you flip it, say, twice (two trials), you get 1:2:1 as the binomial coefficients for the three possible outcomes of (1) hh = two heads, (2) ht = one head and one tail, or (3) tt = two tails, which gives corresponding probabilities of {hh, ht, tt} = {0.25, 0.50, 0.25}. These probabilities are given by the three terms of (h + t)^2, where the exponent 2 gives the number of trials. The number of possible outcomes after all trials is always one greater than the number of trials, with the order of the outcomes being irrelevant. Simple and easy to understand. The direct extension of this concept is found in multinomial models, in which more than two possible outcomes exist for each trial; the concept is identical, there are just more total probabilities to compute. Throwing a pair of dice would be a classic example.
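For the computationally inclined, the general calculation is easy to sketch in code (a minimal Python sketch; the function name is mine):

```python
from math import comb

def binomial_probs(n, p):
    """Probabilities of k 'successes' in n trials, for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# Two flips of a fair coin, indexed by number of heads {tt, ht, hh}:
print(binomial_probs(2, 0.5))  # [0.25, 0.5, 0.25]
```

The list always sums to 1, and its length is n + 1, one more than the number of trials, matching the statement above.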

The most well-known application of binomial probability in biology is probably Hardy-Weinberg equilibrium (HWeq) analysis in population genetics, due to the fact that chromosome segregation (in diploids) always gives a dichotomous result, each chromosome of each pair having an equal probability of occurrence in the haploid gametes. The binomial coefficients then apply to the expected gamete combinations (i.e. genotypes) in the diploid offspring, under conditions of random mating, no selection acting on the gene (and on closely linked genes), and no migration in or out of the defined population.

So, given a single gene with two alleles (call them p and q), each of frequency 0.5 in the population, there are three genotypes that can result from random mating; call them pp, pq and qq. If the conditions just mentioned hold, the three should be found in the population in ratios of {0.25, 0.50, 0.25}. If observations depart from this expectation, the traditional conclusion is that one or more of the conditions are therefore unmet (but more on this interpretation below). The degree of departure from expectation, and other information, gives evidence as to which condition it likely is.
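The expectation, and a simple goodness-of-fit check of observed genotype counts against it, can be sketched as follows (a hedged sketch, not any standard package's API; the names are mine):

```python
def hw_expected(p):
    """Hardy-Weinberg expected genotype frequencies {pp, pq, qq}
    for an allele frequency p (so q = 1 - p)."""
    q = 1 - p
    return (p * p, 2 * p * q, q * q)

def chi_square(observed_counts, expected_freqs):
    """Pearson chi-square statistic measuring departure from expectation."""
    n = sum(observed_counts)
    return sum((o - n * e) ** 2 / (n * e)
               for o, e in zip(observed_counts, expected_freqs))

print(hw_expected(0.5))                            # (0.25, 0.5, 0.25)
print(chi_square([25, 50, 25], hw_expected(0.5)))  # 0.0 -- exact agreement
```

A large statistic points to one or more of the equilibrium conditions being unmet; a small one, as discussed below, does not by itself prove they are all met.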

The connection to the tree sampling by government land surveyors in the United States preceding European settlement should now be patently obvious. 🙂

So about that old land survey. To make a longish story very short, the earliest federal land surveyors very commonly recorded in their notes one tree on either side of a line running along cardinal bearings (N-S or E-W). They did so every half mile, along survey lines that were typically gridded on a one mile spacing. A long-standing question for forest and landscape ecologists (including me) who’ve been examining these data for over 75 years now, starting with Paul Sears at Ohio State in the 1930s, has been the degree to which the choices of those trees were unbiased, that is, representative of the true proportions of different species (and diameters) occurring on the landscape at the time. Another important question has involved the spatial patterns and co-occurrence of different species and tree sizes.

If there were only two tree species on the landscape, binomial models could be used exactly as described in HWeq analysis above, to predict the number of sampling points expected to have two trees of one species, one of each, or two of the other, *if certain conditions hold*. These values could then be compared with the observations, to assess whether or not those conditions did in fact hold. The analog of the assumption of completely random mating in HWeq, is that the two species are thoroughly mixed over the landscape–as if each were randomly distributed throughout. The analog for the assumption of no immigration is that the tree population is continually regenerated from within. And most relevant here, the analog to no genetic selection is the condition that the surveyors were not biased in their choices of trees–they chose them randomly.

Taking the simplest possible example, suppose surveyors recorded 600 sugar maple and 400 American beech at 500 sampling points, so p = 0.6 and q = 0.4, where p now represents sugar maples and q, beech. If the “equilibrium” conditions are met, 0.36 of the samples will have two sugar maples, 0.48 one maple and one beech, and 0.16 two beeches, i.e. {pp, pq, qq} = {0.36, 0.48, 0.16}, from (p + q)^2. If we actually observed those frequencies, then the classical interpretation, in a Hardy-Weinberg sense, would be that all the necessary conditions are in fact met, and thus that the two species occur in a 60/40 ratio and are maximally mixed throughout the landscape. Unfortunately, this conclusion would not necessarily be correct: all the assumptions are not in fact necessarily met. [Note that this is also true for conclusions of HWeq in population genetics.]

In fact, two possible explanations for the observations actually exist, due to the possibility of surveyor bias in the tree selections. The first would be that the assumptions *are* in fact met; the two species are always found in close association with each other (as perfectly intermixed as possible), and never as single species stands, and that surveyors were not biased in their tree choices. [I use “stand” here to refer to a much smaller spatial scale than the typical usage implies–on the order of say ~100 to 200 square meters, corresponding to the linear point-to-tree distances typically recorded by the surveyors.] The fact that 52% of the samples consisted of only one species or the other does *not* indicate that 52% of the landscape was in a single species condition, but rather reflects the result of random sampling when only two trees are selected and the overall probabilities of selecting the two species are 0.6 and 0.4.

The second possibility is the tricky, subtle one. Specifically, the actual frequencies on the landscape may in fact differ from p = 0.6 and q = 0.4, but biased tree selections by the surveyors occurred in such a way that tree selection appears to have been random, i.e. the frequencies of the different possible species compositions in the samples equal those expected from binomial probability calculations. Worse still, almost any combination of actual frequencies and degree of surveyor bias can give this same result, which would appear to represent an unsolvable inference problem, i.e. multiple pathways to the same result. Fortunately however, there is a way to distinguish between these two possibilities, but it requires using other data from the surveys–I’ll come back to that later.

To elaborate more, consider situations in which the recorded data values for the three possible sample outcomes cannot be explained by simple random sampling of a completely mixed population. One such example would be an excess of the two single species results and a deficit of the mixture, relative to the expectation under random sampling, say {pp, pq, qq} = {0.48, 0.24, 0.28}. [I got that by subtracting 0.24 from the “pq” frequency and adding 0.12 to each of the two single species fractions; the species proportions of 0.6 maple and 0.4 beech remain unchanged.] If we knew that surveyors were in fact unbiased in their tree selections, some fairly simple math tells us that the two species occur as single species stands exactly half the time, and mixed stands the other half.*
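That “fairly simple math” amounts to a one-liner: the fraction of mixed stands is the observed mixed-sample frequency divided by the 2pq expected under complete mixing (a minimal sketch assuming unbiased selection; the function name is mine):

```python
def mixed_stand_fraction(obs):
    """Estimated fraction of the landscape in mixed-species stands,
    *assuming* unbiased tree selection.
    obs = (pp, pq, qq) observed sample-type frequencies."""
    pp, pq, qq = obs
    p = pp + pq / 2              # recorded frequency of the first species
    q = 1 - p
    return pq / (2 * p * q)      # observed / expected mixed-sample fraction

print(mixed_stand_fraction((0.48, 0.24, 0.28)))  # 0.24/0.48, i.e. half mixed stands
```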

However, we *don’t* in fact know that the surveyors were unbiased, and therefore cannot be sure of that conclusion. These identical results can also occur when the tree selections were biased and the actual frequencies different from those recorded. To illustrate this, suppose the actual (true) maple and beech frequencies on the landscape were p = q = 0.5 each, and were thoroughly mixed, i.e. no single species stands. If, at 20% of the sample points, the surveyor was biased towards maples (choosing a maple in place of the beech in what should have been a mixed sample), then sample frequencies that should have been {pp, pq, qq} = {0.25, 0.50, 0.25} become {0.45, 0.30, 0.25}. These in turn translate to landscape scale frequencies of maple = 0.6 (= 0.45 + 0.5*0.30), and beech = 1 – 0.6 = 0.4.

But if those species proportions were actually correct, *and* random sampling were practiced, the expected outcome probabilities would be {pp, pq, qq} = (0.6 + 0.4)^2 = {0.36, 0.48, 0.16}. They are instead observed to be {0.45, 0.30, 0.25}, values which can arise in the condition in which 0.625 of the landscape contains mixed species stands, and 1.00 – .625 = 0.375 of it contains single species stands, under unbiased tree selection by the surveyors. The value of 0.625 is obtained from the fraction of the observed to expected values of mixed species stands, 0.30/0.48. *But this explanation would be quite wrong*, both in the estimate of overall species composition, and in its smaller scale structure across the landscape: the two species actually occur in equal proportions and are completely mixed throughout. So this is a big inferential problem.
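To make the ambiguity concrete, here is a minimal simulation of the biased-selection scenario just described (the function and its name are mine, purely for illustration):

```python
def apply_maple_bias(freqs, bias):
    """At a fraction `bias` of all sample points, the surveyor records a
    maple in place of the beech in what should be a mixed sample (pq -> pp)."""
    pp, pq, qq = freqs
    return (pp + bias, pq - bias, qq)

truth = (0.25, 0.50, 0.25)        # p = q = 0.5, completely mixed landscape
pp, pq, qq = apply_maple_bias(truth, 0.20)
print((pp, pq, qq))               # (0.45, 0.3, 0.25)

# A naive analysis that assumes unbiased selection then concludes:
p = pp + pq / 2                   # apparent maple frequency, ~0.6
mixed = pq / (2 * p * (1 - p))    # apparent mixed-stand fraction, ~0.30/0.48 = 0.625
```

Both naive conclusions are wrong, even though the observed sample frequencies match them perfectly.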

Fortunately as mentioned above, there is a way to distinguish between these possibilities. Two types of information can be used for this. One is the distance information recorded by the surveyors: they recorded the distance (and the compass bearing) from the survey point to each tree. When tree selection is entirely random, there is no reason to expect that on average, the two species will differ in their distance from the sampling point in mixed species samples. Conversely, if surveyors were consciously favoring one over the other, the distance to the favored species will be *greater* than it is to any non-favored species. Therefore, non-parametric tests, including a different application of binomial probability (using the ranked distances), can be used to evaluate the probability that surveyors made biased tree selections. This approach does have some limits which I won’t go into, but suffice it to say that it’s possible to accurately estimate both the actual proportions of the various species across the landscape, and also the degree to which they were associated with each other, spatially.
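One such non-parametric check can be sketched as a sign test on which species is the farther tree in mixed samples, which under unbiased selection should behave like a fair coin flip (the counts in the example are hypothetical, purely for illustration):

```python
from math import comb

def sign_test_p(k, n):
    """One-sided sign test: probability of k or more 'favored species is
    the farther tree' outcomes in n mixed samples, if the distance rank
    were a fair coin flip (i.e. no surveyor bias)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Hypothetical: the maple was the farther tree in 29 of 40 mixed samples.
print(sign_test_p(29, 40))  # a small tail probability suggests bias toward maple
```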

The other useful ancillary data are sample points at which up to four trees, not two, were sampled. Such points constitute about 1/3 of the total for all data collected after about 1852. With those data, there are four binomial trials and hence five possible outcomes. The key point there is that three of those outcomes are mixed species samples, which together constitute a *much* larger fraction of the observed outcomes than is the case when only two trees are sampled. This allows for a greater number of distance comparisons between species, and therefore greater power in determining if surveyors were biased for or against particular taxa. Those points are therefore much more informative, per unit, than are the two-tree sample points, and thus, far more powerful in distinguishing between the two possible explanations described above.
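The gain from four-tree points is easy to quantify under random selection (a sketch; the function name is mine):

```python
def mixed_fraction(n, p):
    """Fraction of n-tree samples expected to contain both species under
    random selection, when one species has overall frequency p."""
    q = 1 - p
    return 1 - p ** n - q ** n   # 1 minus the two single-species outcomes

print(mixed_fraction(2, 0.6))  # ~0.48 with two-tree samples
print(mixed_fraction(4, 0.6))  # ~0.84 with four-tree samples
```

At p = 0.6, mixed samples jump from roughly half of two-tree points to roughly 84% of four-tree points.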

Now, it is also possible to get results of the opposite nature to those just discussed, i.e. an excess of the mixed species samples and a deficit of one or both single species samples, relative to random sampling, for example {pp, pq, qq} = {0.25, 0.60, 0.15}. These values give frequencies of maple = 0.55 (= 0.25 + 0.5*0.60) and beech = 0.45 instead of the actual 0.50 for each. This situation is more definitive as to cause: it can only arise by surveyor bias. This is because, under random sampling there can *never* be more mixed species samples in the observations than predicted by binomial probability, which in this case is 2*p*q = 2*0.55*0.45 = 0.495.
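That impossibility gives a clean diagnostic (a minimal sketch; the function name is mine):

```python
def exceeds_binomial_max(obs):
    """True if the observed mixed-sample fraction exceeds 2pq, the maximum
    possible under unbiased random sampling -- which can only mean bias.
    obs = (pp, pq, qq) observed sample-type frequencies."""
    pp, pq, qq = obs
    p = pp + pq / 2
    return pq > 2 * p * (1 - p)

print(exceeds_binomial_max((0.25, 0.60, 0.15)))  # True: 0.60 > 0.495
```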

Note that I made the surveyor bias take the form of an increase in mixed species samples (up 0.10) at the expense of beech-only samples (down 0.10)–the maple-only frequency was unchanged at 0.25. That is, at 10% of the sample points, surveyors chose a maple and a beech instead of the two beeches that they should have chosen if they were unbiased. That’s only one form the bias towards maple can take. They could also choose two maples when random sampling called for one of each, or for that matter, two maples when two beeches were actually called for. And if they did, they could create results in which the sample result frequencies are perfectly predicted from the overall tree frequencies, and yet are perfectly wrong. As before, four-tree samples are much more powerful in detecting this than are the two-tree samples, primarily through the non-parametric tests of rank distances mentioned briefly above.

I will stop here to allow for the Discovery Channel crew debriefing, as they prepare for the first episode in their new series “*Landscape-level tree population analysis from old data, the untold story, the forgotten heroes*”.

*The percentage of mixed species stands is given by the ratio of the fraction of such stands in the observations (0.24) to that in the expectation under random sampling (0.48). This might be clearer if one imagines results of {pp, pq, qq} = {0.60, 0.00, 0.40}, where there are clearly no mixed species stands, 0.00/0.48.

Paging Matt Skaggs…

Hi Jim, I read most of what you post. What’s up?

Matt

You can rot your brain doing that.

I just thought you might be interested in the genetics part of this, albeit pretty boring stuff. I got bogged down, couldn’t respond to your last comment on the speciation discussion, but I’ll get back to it eventually I think.

I did find it interesting, I could actually follow the genetics! I have continued to dive back in to speciation theory after fifteen years away from it. Your invitation to discuss a botanical topic was the trigger. I’m finding that there is plenty of recent literature with much more sophisticated genetic analysis (alas!) than there used to be, which is both exciting and quite challenging for me. Not surprisingly, despite all the work, the battle lines between the sympatricists and the saltationists have hardly moved. I continue to think that speciation is primarily or exclusively saltational, and follows a simple set of rules. Sewall Wright got extremely close to figuring it out but lacked access to modern genetic tools to prove it. The tools are now available and the proverbial brass ring awaits the researcher who poses the proper questions.

On a barely related topic, are you familiar with the book “Island of the Colorblind”? I like it because it provides conclusive proof that a deleterious mutation can increase in frequency solely due to historical contingency.

No I’m not familiar with that, but it sounds interesting. Your last sentence there: is that meant in the sense that historical contingency is the only way it can happen, or rather that historical contingency can in some cases be solely responsible?

The explosion in data production in genetics is phenomenal, thanks to sequencing (and other) technology advances. I don’t know the degree to which it informs resolving the two speciation modes you mention. I’d be interested in hearing more about your Sewall Wright hypothesis there, and what data he lacked to demonstrate his case.

Jim wrote:

“I’d be interested in hearing more about your Sewall Wright hypothesis there, and what data he lacked to demonstrate his case.”

I need to go back and re-read this stuff again, can’t trust old memories. I will be visiting my folks in south Texas next week, lots of free time; I will see if I can put this back together.

Matt

Sorry, I left this hanging:

“Your last sentence there: is that meant in the sense that historical contingency is the only way it can happen, or rather that historical contingency can in some cases be solely responsible?”

I think the most parsimonious interpretation of diversity points toward a simple set of rules for speciation, as I have previously written. The process would be stochastic and based upon extremely rare events. The simplest way to make this work is through mutations that are deleterious to individuals but advantageous to populations. The current paradigm does not allow for such things. However, the theory of founder mutations (which currently stops well short of speciation) suggests such things can occur, and the book documents how a deleterious founder mutation can become fixed in a population. So to answer your question, historical contingency that leads to the fixation of a deleterious mutation is one way around the obstacle, but not the only way. In this paradigm, historical contingency plays the same role as serpentine tolerance in the sense that both release the hopeful monster from the need to compete with its relatives. The hopeful monster is simply doing something different. I mentioned it because speciation writ large clearly does not always involve serpentine tolerance nor selfing. These are merely the simplest mechanisms I can find that point toward the set of rules that govern speciation.