# Be careful, look closely

The scientific enterprise involves inferring cause from effect, where “effect” refers to things that can be readily observed and measured. Many would argue that this task is the very essence of science, and this definition sure is simpler than wading into Kant, Bacon, Popper, etc. Along those lines, this post gives one important example of how to discriminate between different possible causative explanations, given a set of observed data. I’ll use binomial probability models, because they are simple, very widely applicable, and I’ve discussed them some in previous posts.

Binomial probability models are applicable to data from any dichotomous variable; the classic example involves the repeated flipping of a coin. But any continuous (or categorical) variable can be made dichotomous simply by choosing a breakpoint somewhere, either arbitrarily or using, e.g., cluster analysis (although methods designed for continuous data are usually more powerful). Thus the very wide applicability. When we flip a coin, we know, both from practical experience (i.e. empirical data) and basic reasoning based on its shape (which we might refer to as theoretical knowledge), that only two results are possible. But with many observed phenomena, we have no such extensive experience, nor clear and definite theoretical expectations, since the system is typically complex.
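As a minimal sketch of that dichotomization step (the variable name, sample size, and breakpoint here are hypothetical, invented for illustration, not from any real dataset):

```r
# Hypothetical continuous measurements, split at the median into two classes
cone_length = rnorm(100, mean=12, sd=2)   # made-up cone lengths (cm)
shape = ifelse(cone_length > median(cone_length), "shape1", "shape2")
table(shape)   # roughly a 50-50 split, by construction of the median breakpoint
```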

One fundamentally important question we can/should ask is whether or not the process(es) that generated the observations is/are roughly constant, i.e. repeatable, across some wide-ranging set of observations. Suppose, as an example, that 10,000 observations over a large area indicate that red fir (Abies magnifica) trees have cones of two clearly different shapes, and that these occur at frequencies p1 = p2 = 0.50. The question then arises as to how constant/repeatable these two values are, temporally and spatially. The more constant they are, the more likely it is that the governing process is relatively simple; the less constant, the more likely the process is complex. And one common and basic goal of science is to gain more insight into the likely complexity of the governing process. But just how do we go about determining what constitutes relative constancy, given an observed data set?

Binomial probability models can address these questions very well; the key in so doing is dividing the full population into samples of defined size and evaluating the distribution of the statistic across those samples, rather than over the entire population. For this example, we know an important biological fact: red fir is a wind pollinated out-crosser, and therefore the potential for spatial autocorrelation in genetic traits is minimized relative to other possible mating systems (e.g. self fertilization and insect-based out-crossing), but I need go no deeper into biology or genetics than that, for the purposes here, and could avoid biology altogether by discussing the whole topic in terms of coin flipping. But I don’t want anyone to flip out, and I like trees, and I especially like red fir trees.

Suppose also that our population of 10,000 trees is structured as 1000 samples of 10 trees each, with the 1000 samples widely spaced on a regular grid across part of the Sierra Nevada. If there is truly a 50-50 chance of getting either cone shape, regardless of location, then a binomial model with p1 = 0.5 will give us the expected numbers of trees with cone shape 1 (and hence shape 2, since p2 = 1 – p1) across the 1000 samples, a scenario that I’ll call “random chance”. Conversely, if there is something else going on, i.e. some more complex process operating, then some different distribution across those samples will arise. For example, it could be that regardless of the mating system, there is still some spatial structure in the cone shapes across the samples: shape 1 occurs more frequently in some locations, and less frequently in others, than would be expected by random chance. This is an entirely reasonable possibility, biologically.
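Under the “random chance” scenario, those expected numbers can be written down directly from the binomial density, no simulation needed; a quick sketch:

```r
# Expected number of samples (out of 1000) in which exactly 0..10 of the
# 10 trees show cone shape 1, if p1 = 0.5 holds everywhere:
nsamples = 1000; ntrials = 10
expected = nsamples * dbinom(0:10, size=ntrials, prob=0.5)
round(expected)
#  1  10  44 117 205 246 205 117  44  10   1
```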

It is at this point that we have to be careful to avoid coming to wrong conclusions. In R, the maximum likelihood estimate for p1 or p2, given an observed data set, can be derived using the “optim” function (base R) or the “mle2” function (from the bbmle package). These two return identical results, usually very close to the true (overall) p1 and p2 values. But we already know those values; they are our starting point, so R is telling us nothing new. The reason for this is that both functions return only a point estimate (the mean p1 value), regardless of the distribution of p1 over the set of 1000 samples. But many different such distributions will return very similar point estimates, and these two optimization functions do not recognize this. They assume a universally operating (i.e. constant) process, and this is a very real potential problem.

Two sets of R code below demonstrate the issue, by returning similar p value estimates from two very different situations. [I’ve embedded some of the results right in with the code and hopefully it’s clear].

First, the “random chance” situation (where p1 = p2 = 0.5, universally):

```
library(bbmle)

## Make a random sample
# 1. Via binomial prob., using rbinom:
nsamples = 1000; ntrials = 10; prob = 0.500
k = rbinom(n=nsamples, size=ntrials, prob=prob)

mean(k); var(k); median(k); table(k)
# 5.004
# 2.678663
# 5
# k
#  0   1   2   3   4   5   6   7   8   9  10
#  3  12  43 131 191 229 198 134  53   6   0

# Compute ML estimates using mle2 and optim; mean success rate is the initial estimate.
# Note that mle2 is a wrapper for optim and other optimization functions.
binom.func = function(k, p, N) {-sum(dbinom(x=k, prob=p, size=ntrials, log=TRUE))}
m1 = mle2(binom.func, start=list(p=mean(k)/ntrials), data=list(N=nsamples, k=k))
m2 = optim(par=mean(k)/ntrials, fn=binom.func, k=k, N=nsamples, method="BFGS")

# MLE results should be identical; check:
as.vector(c(m1@coef, m2$par))
# 0.5004 0.5004
```

And second, a situation in which, rather than p1 and p2 = 0.5 universally, we have a gradient where each varies from p = 0.0 to 1.0 across the samples, i.e. a uniform distribution of values spanning the full domain of [0,1], but still having an overall mean of p = 0.5, as before:

```
## Make a random sample:
# 2. From a UNIFORM distribution, but having the same mean as before (0.50):
nsamples = 1000; ntrials = 10; prob = 0.500
mn = prob*ntrials; limit = min(c(abs(mn-ntrials), mn))
k = sample((mn-limit):(mn+limit), nsamples, replace=TRUE)

mean(k); var(k); median(k); table(k)
# 5.116
# 9.700244
# 5
# k
#   0   1   2   3   4   5   6   7   8   9  10
#  76  87  98  91  81  93 101  94  90 103  86

# Compute ML estimates using mle2 and optim; mean success rate is the initial estimate:
binom.func = function(k, p, N) {-sum(dbinom(x=k, prob=p, size=ntrials, log=TRUE))}
m1 = mle2(binom.func, start=list(p=mean(k)/ntrials), data=list(N=nsamples, k=k))
m2 = optim(par=mean(k)/ntrials, fn=binom.func, k=k, N=nsamples, method="BFGS")

# MLE results should be identical; check:
as.vector(c(m1@coef, m2$par))
# 0.5116 0.5116
```

We can see that the true value (p1 = 0.50) is very well estimated in each case: 0.5004 in the first case and 0.5116 in the second. For a sample of 10 trees, we thus expect 5.004 and 5.116 of them, on average, to have cone shape 1. The very small difference between the two is due strictly to the randomness inherent in creating the simulated data of the two scenarios.

What we now need, to discriminate between these two very different situations, is simply a fuller look at the distribution statistics in the two cases, for example the higher order “moments” (i.e. variance, skew, kurtosis, etc) or quantiles (% breakpoints) of the cumulative distribution, because unlike the mean, these values will not be expected to be similar in the two scenarios.
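Those higher-order statistics are easy to compute directly from each simulated k, without any extra packages; a minimal sketch, using the standard moment formulas (not any particular library’s definitions):

```r
# Observed moments of a simulated sample, to compare against the
# constant-p binomial expectation of np(1-p) for the variance
moments = function(k) {
  m = mean(k); s = sd(k)
  c(mean     = m,
    variance = var(k),
    skew     = mean((k - m)^3) / s^3,
    kurtosis = mean((k - m)^4) / s^4 - 3)   # excess kurtosis
}
ntrials = 10; p = 0.5
ntrials * p * (1 - p)            # theoretical binomial variance: 2.5
moments(rbinom(1000, ntrials, p))
```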

Here, I can illustrate the main point well enough just by examining the variance in each case (“var(k)” in the code above). In the first case, the variance is 2.678, which is close to the theoretically expected value of 2.5 (given by np(1-p), where n = 10 and p = 0.5). But in the second case it’s almost 4x as large, at 9.7, and this large difference is a powerful discriminator between the two situations. I also computed, via “table(k)”, the distribution of the 11 possible outcomes (i.e. from 0 to 10 trees having cone shape 1) across the 1000 samples. In the first case, this distribution is strongly unimodal, centered on 5, with almost no occurrences of 0 or 10. [This is exactly analogous to flipping a coin 10 times and recording the number of heads (or tails), then repeating this 1000 times: very rarely will you get 0 or 10 heads or tails, and most frequently you’ll get 5.]

In the second case however, the distribution is close to uniform: the counts of the 11 possible outcomes vary only across the much narrower range from 76 to 103, with p still centered on 0.5, and hence the much larger variance. You will never get such a result by flipping a coin; it’s not possible, because p = 0.5 for a head or tail is a universally applicable value. You don’t sometimes, or with some coins, have p = 0.1 for heads and p = 0.9 for tails, or whatever. It’s an extremely simple, and constant, system.

The conclusion from the second set of results could only be that, rather than a universally operating process wherein cone shapes 1 and 2 occur with equal probability (0.5) regardless of location, p1 (and p2) instead vary strongly as a function of location. Biologically, the simplest and likeliest explanation for this result would be restricted gene flow across the sampled area.
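That location-dependent alternative can also be simulated directly: draw a separate p1 for each sample, then draw k from the binomial with that p. This is a sketch of my own, not the post’s original simulation (which sampled k itself uniformly); with p uniform on [0,1] the mixture has a theoretical variance of exactly 10, versus 2.5 under constant p:

```r
# Each of 1000 locations gets its own p1, uniform on [0,1];
# then 10 trees are "observed" at each location
set.seed(1)   # for reproducibility
p.local = runif(1000)
k = rbinom(n=1000, size=10, prob=p.local)
mean(k)/10    # overall point estimate still lands near 0.5
var(k)        # far above the constant-p expectation of 2.5 (theory: 10)
```

So the point estimate alone cannot distinguish this scenario from the constant-p one, but the variance flags it immediately.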

To conclude, the more closely you look at the distribution of some variable of interest, the more information you can obtain about the likely universality and complexity of the underlying process, and you can do this even with simple models, such as binomial probability.

Addendum. In the deluge of emails and comments I’ve received on this post thanking me profusely and begging for more posts like this one, there were many requests for code giving the “brute force” approach to a maximum likelihood estimate, in order to see how the results compare to those from the optimization algorithm above. So, for the binomial distribution, here it is:

```
# Brute force approach, i.e. likelihoods by stepped probabilities, on the domain (0,1)
step = 1/1000
probs = seq(from=step, to=(1-step), by=step)
likelihoods = rep(NA, length(probs))
for (i in 1:length(probs)) likelihoods[i] = -sum(dbinom(x=k, size=ntrials, prob=probs[i], log=TRUE))

L = data.frame(probs, likelihoods)
maxL = probs[which.min(likelihoods)]; maxL
```

Result: the method returns an estimate of 0.512, close to the true value of 0.500 and essentially identical to the optimization-derived estimate of 0.5116 (the brute-force run uses whichever k is in the workspace, here the uniform-scenario sample from the second code block).