I often get obsessed with certain topics, especially statistical and mathematical ones. Lately I’ve been thinking a lot about the Poisson distribution, as it figures heavily in analyses of random populations, and thus also in assessing deviations therefrom, a topic I’ve been heavily involved in for a while, relative to historic and current tree populations. I also find that a specific question will often arise, related to whatever it is I’m doing, that is best (and often quickest) answered, *in a way that gives me the most confidence*, by direct simulation, for which I use R.

The topic of varying scales of variation, and the information that can be obtained by analyzing them, is a pretty interesting one IMO. When the generating processes for a phenomenon of interest are complex, or poorly understood for whatever reason, one can (and should!) obtain valuable information about the likelihoods of various hypothetical process drivers via *multi-scale analysis*: essentially, finding evidence for the scale(s) at which departures from randomness are greatest, and using that information to suggest, or constrain, possible explanations. There are a number of ways to go about this.

The Poisson distribution is the appropriate descriptor of a homogeneous (constant) rate process whose individual event outcomes are random. An “under-dispersed” population at a particular scale of analysis will be more “regular” in its arrangement than expected under a random process, and in such a situation there must necessarily also be under-dispersion at at least *some* other scales, both smaller and larger. To illustrate with an extreme example, suppose a location gets 36 inches of precipitation (P) per year on average, distributed as exactly three inches per month, every month. The probability of such a result arising when P varies randomly (Poisson) at any sub-monthly scale is extremely low; it won’t occur over any extended period of time. The same principle holds, though muted, if there is some monthly variance around 3.0.
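To put a rough number on that, here’s a minimal R check of the extreme example, treating whole inches per month as Poisson counts with the same rate of 3 per month:

```r
# Probability that a Poisson month with lambda = 3 yields exactly 3 inches:
p.single = dpois(3, lambda = 3)
round(p.single, 3)       # about 0.224

# Probability that all 12 months of a year land on exactly 3 inches:
signif(p.single^12, 2)   # about 1.6e-08, i.e. essentially never
```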

In an *over-dispersed* (“clustered”) population, again at some defined scale, the situation is different. Such a population will also be over-dispersed at smaller scales, but not necessarily at larger ones, at least not at the same intensity, and there should be some unknown scale at which the variation reduces to the Poisson. This means that a Poisson population is not necessarily Poisson at smaller scales, but it *should* be so at larger scales. That is, it should “scale up” according to Poisson expectation, i.e. with the same rate but a greater absolute number, and variance therein, per sample unit.
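That “scale up” expectation is just the additivity of the Poisson: a sum of independent Poisson variables is itself Poisson with the summed rate. A quick simulated sanity check (an illustrative rate, not tied to the precipitation example):

```r
set.seed(1)
n = 1e5
lam = 0.7                                 # e.g., a weekly rate

# Sum four independent "weekly" draws per trial...
sums4 = colSums(matrix(rpois(4*n, lam), nrow = 4))

# ...and compare with direct "monthly" draws at rate 4*lam:
direct = rpois(n, 4*lam)

c(mean(sums4), mean(direct))              # both near 2.8
c(var(sums4), var(direct))                # also near 2.8 (Poisson: variance = mean)
```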

But does it? Or rather, what does R think about the matter?

Well, here’s what I get, using the example case of a mean annual P of 36″ and 100,000 simulated monthly (or weekly) sums, obtained by randomly sampling the Poisson expectation at sub-interval (weekly or daily) time scales.

```r
rm(list=ls())
options(digits=3)

# 1. Observed annual mean and corresponding scaled sub-interval means.
#    Year = 364 days, month = 28 days (so 13 months/yr).
obs.ann.mn = 36                        # observed annual mean, from record
(monthly.mn = obs.ann.mn/13)           # 13 months/yr!
(weekly.mn  = obs.ann.mn/52)
(daily.mn   = obs.ann.mn/364)

# 2. Poisson CDF expectations; ignore slight variations in days.
#    Equal-interval CDF probs determined by the number of time intervals in a
#    year, which eases interpretation. Set the CDF probs to correspond to the
#    various time intervals to give the temporal distribution.
#    NOTE: qpois at cdf = 1.0 returns Inf, so omit the last interval.
P.Pmonth = qpois(p=(1:12)/13,   lambda = monthly.mn)   # 13 months
P.Pweek  = qpois(p=(1:51)/52,   lambda = weekly.mn)    # 52 weeks
P.Pday   = qpois(p=(1:363)/364, lambda = daily.mn)     # 364 days
table(P.Pmonth); table(P.Pweek); table(P.Pday)

# 3. Simulations: do repeated samples taken from shorter time periods,
#    then summed, match Poisson/gamma expectations at longer periods?
n.trials = 1e5

P.month.week = rep(NA, n.trials)
for (i in 1:n.trials) P.month.week[i] = sum(sample(P.Pweek, 4, replace=T))  # exactly 4 weeks to our months
q.P.month.week = as.vector(quantile(P.month.week, probs = (1:12)/13)); rm(P.month.week)

P.month.day = rep(NA, n.trials)
for (i in 1:n.trials) P.month.day[i] = sum(sample(P.Pday, 28, replace=T))
q.P.month.day = as.vector(quantile(P.month.day, probs = (1:12)/13)); rm(P.month.day)

P.week.day = rep(NA, n.trials)
for (i in 1:n.trials) P.week.day[i] = sum(sample(P.Pday, 7, replace=T))
q.P.week.day = as.vector(quantile(P.week.day, probs = (1:51)/52)); rm(P.week.day)

mw = data.frame(table(P.Pmonth), table(q.P.month.week))[,-3]
colnames(mw) = c("Precip, monthly", "Poisson Expect.", "Aggr., weekly")
md = data.frame(table(P.Pmonth), table(q.P.month.day))[,-3]
colnames(md) = c("Precip, monthly", "Poisson Expect.", "Aggr., daily")
wd = data.frame(table(P.Pweek), table(q.P.week.day))[,-3]
colnames(wd) = c("Precip, weekly", "Poisson Expect.", "Aggr., daily")
mw; md; wd
```

**Answer:** Yes, it does exactly.*

```
Precip, monthly  Poisson Exp.  Aggr., weekly
              1             3              3
              2             3              3
              3             3              3
              4             2              2
              5             1              1

Precip, monthly  Poisson Exp.  Aggr., daily
              1             3              3
              2             3              3
              3             3              3
              4             2              2
              5             1              1

Precip, weekly   Poisson Exp.  Aggr., daily
              0            26             26
              1            18             18
              2             6              6
              3             1              1
```

*However, I also evaluated gamma expectations, mainly as a check and/or curiosity (the gamma interpolates between the Poisson’s integer values). I didn’t always get the exact correspondence I expected, and I’m not really sure why. Close, but not close enough to be due to rounding errors, so that’s kind of interesting, but not enough to pursue further.

Funding for this post was provided in equal part by the French Fish Ectoderm and Statistics Foundation and the American Association for Advancement of Amalgamated and Aggregated Associations. These organizations are solely responsible for any errors herein, and associated management related consequences.

I’d bet a fin that the sole reason you wrote this post was to juxtapose “Poisson” and “scales”. 😉

That, and to avoid doing what I should have been doing of course.

You have been MIA around here Harold. I was beginning to suspect something fishy might be up, so glad this post reeled you back in.

Glad to see the divulgence of funding sources. A conflict of interest statement would also be nice, but somehow I really don’t expect there is any COI in this instance.

I do have a teensy question, however. Given that your two auspicious funders each contributed equally to this effort (and that this effort must be worth a cool million) I’m wondering what the odds of them each desiring to put a half million – exactly – on the table so that this wonderful work could be accomplished? Or asked a different way – would a curious administrator, wondering about the likelihood of certain funding probabilities for his/her faculty be better advised to employ Poisson or Gamma distributions in an analysis of this type? I can already see future tenure decision committees running R code to evaluate young Ph.D.s and we really should help these poor souls get a fair shot.

“The authors declare no conflict of interest, or much interest of any kind for that matter, and only a marginally better memory of their high school French”

Moving on from the crappy fish puns, I was curious about your observation that a similar script didn’t show that the gamma distribution scales. [Since one can show this algebraically.] So I looked in more detail at your script. Not to carp, but I think the fault lies with generating random variables by sampling a coarse set of inverse cdf points. This distorts the distribution somewhat — for example, P.Pweek gives the chance of a week with zero precipitation as 26/51~51%, but the actual Poisson probability is a hair above 50%. Perhaps this flaw is more noticeable for gamma because it quantizes a continuous distribution, and has less effect on a discrete distribution. Or maybe the Poisson result is a fluke.
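The distortion the commenter describes is easy to quantify directly, using the same weekly rate as in the post (36/52):

```r
weekly.mn = 36/52
P.Pweek = qpois(p = (1:51)/52, lambda = weekly.mn)

# Share of zeros in the inverse-CDF grid vs. the true Poisson probability:
mean(P.Pweek == 0)      # 26/51, about 0.51
dpois(0, weekly.mn)     # exp(-36/52), about 0.50
```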

If you replace “sample(P.Pweek, 4, replace=T)” with “rpois(4, lambda = weekly.mn)” and make analogous changes to the other “sample” calls, you should get a more accurate Monte Carlo. This should work for Poisson, and also for gamma. Mull it over.
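Concretely, the monthly-from-weekly simulation with that substitution might look like this (a sketch keeping the post’s variable names):

```r
weekly.mn = 36/52
n.trials = 1e5

# Draw 4 fresh weekly Poisson values per trial and sum them, instead of
# resampling the coarse inverse-CDF grid:
P.month.week = rep(NA, n.trials)
for (i in 1:n.trials) P.month.week[i] = sum(rpois(4, lambda = weekly.mn))

# The sums are now exactly Poisson at the monthly rate:
c(mean(P.month.week), 4*weekly.mn)   # both near 2.77
```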

Yes, I think you very likely called it right Harold. From my perch, I can’t imagine what else it could poissonibly be. Sometimes I just flounder about on these things.

Although I’m still very much an R novice, I’ve noticed that “for” loops are eschewed. I replaced the line

“for (i in 1:n.trials) P.month.week[i] = sum(rpois(4, lambda = weekly.mn))”

with

“P.month.week <- colSums(matrix(rpois(4*n.trials,lambda=weekly.mn),nrow=4))”

and that executed considerably faster on my machine. [The number of trials was increased to 1e6 to make timing easier.]

Ah, thanks for that solution! Yes, loops of any kind are considered inefficient. I use them only when I can’t figure out a better way, or don’t care about the time considerations, the latter being the case here. But you are right and I always try to use the various “apply” functions (sapply, lapply etc) whenever possible. In the past I’ve had to do things where several computations were involved in each step and I could figure out no more efficient solution than a loop.
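For what it’s worth, a couple of loop-free variants of the same computation (assuming the weekly.mn and n.trials values from the post):

```r
weekly.mn = 36/52
n.trials = 1e5

# replicate() (a thin wrapper around sapply) avoids the explicit loop:
v1 = replicate(n.trials, sum(rpois(4, lambda = weekly.mn)))

# The matrix/colSums version is usually much faster still, since it
# makes a single call to rpois:
v2 = colSums(matrix(rpois(4*n.trials, lambda = weekly.mn), nrow = 4))

c(mean(v1), mean(v2), 4*weekly.mn)   # all near 2.77
```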