Does the Poisson scale up?

I often get obsessed with certain topics, especially statistical and mathematical ones. Lately I’ve been thinking about the Poisson distribution a lot, as it figures heavily in analyses of random populations, and thus also in assessing deviations therefrom, a topic that I’ve been heavily involved for a while, relative to historic and current tree populations. I also find that often a specific question will arise, related to whatever it is I’m doing, that is best (and often quickest) answered in a way that gives me the most confidence, by direct simulation, for which I use R.

The topic of varying scales of variation, and the information that can be obtained by analysis thereof, is a pretty interesting one IMO. When the generating processes for a phenomenon of interest are complex, or poorly understood for whatever reason, one can (and should!) obtain valuable information regarding likelihoods of various hypothetical process drivers, by multi-scale analysis–essentially, obtaining evidence for the scale(s) at which departures from randomness are greatest and using that information to suggest, or constrain, possible explanations. There are a number of ways to go about doing so.

The Poisson distribution is the appropriate descriptor of a homogeneous (constant) rate process whose individual event outcomes are random. An “under-dispersed” population at a particular scale of analysis will be more “regular” in its arrangement than expected by a random process, and it is clear that in such a situation there must necessarily also be under-dispersion at at least some other scales as well, both smaller and larger. To illustrate via an extreme example, suppose some location that gets 36 inches of precipitation (P) per year on average, distributed as exactly three inches per month, every month. The probability of such a result arising, when P varies randomly (Poisson) at any sub-monthly scale, is extremely low. It won’t occur over any extended period of time. The same principle holds, though muted, if there is some monthly variance around 3.0.

In an over-dispersed (“clustered”) population, again at some defined scale, the situation is different. Such a population will also be over-dispersed at smaller scales, but not necessarily at larger ones, at least not at the same intensity, and there should be some unknown scale at which the variation reduces to the Poisson. This means that a Poisson population is not necessarily Poisson at smaller scales, but it should be so at larger scales. That is, it should “scale up” according to Poisson expectation, i.e. with the same rate but a greater absolute number, and variance therein, per sample unit.

But does it? Or rather, what does R think about the matter?

Well, here’s what I get, using the example case of a mean annual P/yr of 36″ and 100,000 simulated monthly, or weekly, sums from randomly sampling the Poisson expectation at sub-interval (weekly or daily) time scales.

rm(list=ls())
options(digits=3)
# 1. Observed annual mean and corresponding scaled sub-int. means.  Year = 364, month = 28, days.
obs.ann.mn = 36			# observed annual mean, from record
(monthly.mn = obs.ann.mn/13)	# 13 months/yr!
(weekly.mn = obs.ann.mn/52)
(daily.mn = obs.ann.mn/364)

# 2. Poisson CDF expectations, ignore slight variations in days:
# Equal interval CDF probs determined by no. time intervals in a year, eases interpr.
# Set CDF probs to correspond to the various time ints. to give temporal distribution
# NOTE that qpois, for cdf = 1.0, = Inf., so omit last interval
# Poisson
P.Pmonth = qpois(p=(1:12)/13, lambda = monthly.mn) 		# 13 mos
P.Pweek = qpois(p=(1:51)/52, lambda = weekly.mn)		# 52 weeks
P.Pday = qpois(p=(1:363)/364, lambda = daily.mn)		# 364 days
table (P.Pmonth); table(P.Pweek); table(P.Pday)

# 3. Simulations: do repeated samples taken from shorter time periods and summed, match Poisson/gamma expectations at longer periods?
n.trials = 1e5
P.month.week = rep(NA,n.trials)
for (i in 1:n.trials) P.month.week[i] = sum(sample(P.Pweek, 4, replace=T))			# Exactly 4 weeks to our months
q.P.month.week = as.vector(quantile(P.month.week, probs = (1:12)/13)); rm(P.month.week)
P.month.day = rep(NA,n.trials)
for (i in 1:n.trials) P.month.day[i] = sum(sample(P.Pday, 28, replace=T))
q.P.month.day = as.vector(quantile(P.month.day, probs = (1:12)/13)); rm(P.month.day)
P.week.day = rep(NA,n.trials)
for (i in 1:n.trials) P.week.day[i] = sum(sample(P.Pday, 7, replace=T))
q.P.week.day = as.vector(quantile(P.week.day, probs = (1:51)/52)); rm(P.week.day)

mw = data.frame(table (P.Pmonth), table(q.P.month.week))[,-3]; colnames(mw)=c("Precip, monthly", "Poisson Expect.", "Aggr., weekly")
md = data.frame(table (P.Pmonth), table(q.P.month.day))[,-3]; colnames(md)=c("Precip, monthly", "Poisson Expect.", "Aggr., daily")
wd = data.frame(table (P.Pweek), table(q.P.week.day))[,-3]; colnames(wd)=c("Precip, weekly", "Poisson Expect.", "Aggr., daily")
mw; md; wd


Precip, monthly 	Poisson Exp. 	Aggr., weekly
1               3             3
2               3             3
3               3             3
4               2             2
5               1             1
Precip, monthly 	Poisson Exp.	Aggr., daily
1               3            3
2               3            3
3               3            3
4               2            2
5               1            1
Precip, weekly 		Poisson Exp.	 Aggr., daily
0              26           26
1              18           18
2               6            6
3               1            1


*However, I also evaluated gamma expectations, mainly as a check and/or curiosity (the gamma interpolates between the Poisson’s integer values). I didn’t always get the exact correspondence as expected, and I’m not really sure why. Close, but not close enough to be due to rounding errors, so that’s kind of interesting, but not enough to pursue further.

Funding for this post was provided in equal part by the French Fish Ectoderm and Statistics Foundation and the American Association for Advancement of Amalgamated and Aggregated Associations. These organizations are solely responsible for any errors herein, and associated management related consequences.

On clustering, part four

I have not written any post, or series thereof, in which I had more questions or made me think more precisely about purposes and products of statistical approaches/algorithms, than this one. I’m finding this topic of statistical clustering of multivariate data to be far subtler and more interesting than I thought when I waded into it. Justifiably counter-intuitive even; I encourage you to read through this one as I think the root concepts are pretty important.

It’s easy to misunderstand exactly what the output of a statistical/mathematical procedure represents, and thus vital to grasp just what is going on, and why. This might seem like a fully obvious statement, but my experience has often been that we who work in mostly non-experimental fields (e.g. ecology, geology, climatology, etc) frequently do not in fact have a clear and very exact grasp of what our various quantitative methods are doing. I’m not just referring to highly complex methods either, but even to simple methods we often take for granted. This applies to me personally, and to all kinds of papers by others that I’ve read, in various fields.

The first point to be made here is to retract and correct the example I gave in part one of this series. I had said there that if one clusters the ten numbers 0-9, into two groups, using either a k-means or hierarchical clustering procedure, the first group will contain {0 to 4} and the second will contain {5 to 9}. That’s not the problem–they will indeed do so (at least in R). The problem is the statement that this output is meaningless–that it does not imply anything about structure in the data. On thinking this through a little more, I conclude that whether it does or doesn’t depends on the specifics of the data.

It is very true that an inherent potential problem with clustering methods is the imposition of an artificial structure on the data when none actually exists. With the “k-means” algorithm this is a direct result of minimizing sums of the squared differences from the mean within groups, whereas with hierarchical methods (AGNES, etc), it results from iteratively joining the closest items (item to item or item to group), as determined by a constantly updated distance matrix after each join. Either way, the algorithms cause similarly valued items to be placed together in groups.

The problem is with the example I used, specifically, that a set of values appearing to follow a smooth gradient, such as the integers 0:9 in the example, rather than displaying definite groupings/clusters, necessarily indicates a lack of structure, and thus a meaningless clustering result. This is not necessarily the case, which can only be grasped in the context of an apriori expected distribution of values, given an observed overall mean value. These expectations are given by the Poisson and gamma distributions, for integer- and real-valued data respectively.

The most immediate question arising from that last statement should be “Why exactly should the Poisson or gamma distributions define our expectation, over any other distribution?”. The answer is pretty important, as it gets at just what we’re trying to do when we cluster values. I would strongly argue that that purpose has to be the identification of structure in the data, that is, a departure from randomness, which in turn means we need an estimate of the random condition against which to gauge possible departures. Without having this “randomness reference point”, the fact that {0:4} and {5:9} will certainly fall into two different groups when clustered is nothing more than the trivial re-statement that {5:9} are larger values than {0:4}: fully meaningless. But R (and presumably most other statistical analysis programs as well), does not tell you that–it gives you no information on just how meaningful a given clustering output is. Not good: as scientists we’re after meaningful output, not just output.

The answer to the above question is that the Poisson and gamma distributions provide those randomness reference points, and the reason why is important: they are essentially descriptors of expected distributions due to a type of sampling error alone. Both are statements that if one has some randomly distributed item–say tree locations in two dimensions in a forest, or whatever example you like–then a set of pre-defined measurement units, placed at random throughout the sampling space (geographic area in my example), will vary in the number of items contained (Poisson), and in the area encompassed by measurements from random points to the nth closest objects thereto (gamma). Not sampling error in the traditional sense, but the concept is very similar, so that’s what I’ll call it.

Therefore, departures from such expectations, in an observed data set, are necessarily evidence for structure in those data, in the sense that something other than simple sampling variation–some causative agent–is operating to cause the departure. The main goal of science is to seek and quantify these agents…but how are you supposed to seek those somethings until you have evidence that there is some something there to begin with? And similarly, of the many variables that might be important in a given system, how does one decide which of those are most strongly structured? These questions are relevant because this discussion is in the context of large, multivariate, non-experimental systems, with potentially many variables of unknown interaction. We want evidence regarding whether or not the observations are explainable by a single homogeneous process, or not.

Things get interesting, given this perspective, and I can demonstrate why using say, the set of n = 21 integers {0:20}, a perfect gradient with no evidence of aggregation.

The Poisson distribution tells us that we do not expect a distribution of {0:20} for 21 integers having a mean value of 10. We observe and expect (for a cdf at 5% quantile steps):

   quantile observed expected
0.05        1        5
0.10        2        6
0.15        3        7
0.20        4        7
0.25        5        8
0.30        6        8
0.35        7        9
0.40        8        9
0.45        9        9
0.50       10       10
0.55       11       10
0.60       12       11
0.65       13       11
0.70       14       12
0.75       15       12
0.80       16       13
0.85       17       13
0.90       18       14
0.95       19       15


…as obtained by (in R):

observed = 0:20; n = length(observed)
expected = qpois(p = seq(0, 1, length.out=n), lambda=mean(observed))
df = data.frame(quantile = seq(0, 1, length.out=n), observed, expected)[2:20,]


Running a chi-square test on the observed values:

chisq.test(x=df$observed)  …we get: data: df$observed
X-squared = 57, df = 18, p-value = .000006

There is thus only a very tiny chance of getting the observed result from sampling variation alone. Observation of the above table shows that the observed data are stretched–having longer left and right tails, relative to expectation.

But this series is about clustering and how to find evidence for it in the data…and there is no tendency toward clustering evident in these data. Indeed, the opposite is true–the observed distribution is long-tailed, “stretched out” compared to expectation. But…this result must mean that there is indeed some structuring force on the data to cause this result–some departure from the homogeneous state that the Poisson assumes. We can’t know, from just this analysis alone, just how many “real” clusters of data there are, but we do know that it must be more than one (and I hope it is apparent as to why)! More precisely, if we’ve decided to look for clusters in the data, then the chi-square test gives us evidence that more than one cluster is highly likely.

Just how many clusters is a more difficult question, but we could evaluate the first step in that direction (i.e. two clusters) by dividing the values into two roughly even sized groups defined by the midpoint (= 10) and evaluating whether Poisson-distributed values for the two groups, having means of 4.5 and 15, give a better fit to the observations:

(exp1 = qpois(p = seq(0, 1, length.out=11), lambda=4.5)[2:10])
(exp2 = qpois(p = seq(0, 1, length.out=12), lambda=15)[2:11])
(df2 = data.frame(quantile = seq(0, 1, length.out=n)[2:20], obs=df$observed, exp = sort(c(exp1,exp2))))  …which gives:  quant. obs. exp. 0.05 1 2 0.10 2 3 0.15 3 3 0.20 4 4 0.25 5 4 0.30 6 5 0.35 7 5 0.40 8 6 0.45 9 7 0.50 10 10 0.55 11 11 0.60 12 13 0.65 13 14 0.70 14 14 0.75 15 15 0.80 16 16 0.85 17 17 0.90 18 18 0.95 19 20  …and then compute the mean of the two chi-square probabilities: obs1=0:9; obs2=10:20 p = mean(c(chisq.test(obs1)$p.value, chisq.test(obs2)\$p.value))


…which returns p = 0.362
[Edit note: I had wrong values here originally–because I mistakenly ran the chi square tests on the expected values instead of the observed. Now corrected–still a very high odds ratio, just not as high as before.]

The odds ratio of the two hypotheses (two Poisson-distributed groups having means of 4.5 and 15.0, versus just one group with a mean of 10.0) is thus 0.362 / .000006 = 60,429. Thus, clustering the observed data into these two groups would be a very highly defensible decision, even though the observed data comprise a perfect gradient having no tendency toward aggregation whatsoever!

The extension from the univariate to the multivariate case is straight-forward, involving nothing more than performing the same analysis on each variable and then averaging the resulting set of probabilities or odds ratios.

Count ’em

Suppose you have n objects and want to enumerate all possible unique allocations thereof among q groups, where “unique” is defined by numerical allocation only, independent of actual group membership for each object. Each group can take any size from 0 to n, contingent on the total in the other q-1 groups. I’m working on a problem where I have to do this; there might well be an available method for doing this, but one can waste a lot of time looking for such things in my experience, you typically learn a lot more when you have to solve a problem yourself. Anyway I couldn’t find one readily so I wrote this R function as a script. Feel free to use if it’s, well, useful.

all.breaks = function(n, groups){
t1 = cbind(n - 0:n, n - n:0)
t1 = unique(t(apply(t1, 1, sort)))
test = groups - 2
if (test>0){
for (i in 1:test){
split=ncol(t1)
t2a = sapply(t1[,split], function(x) 0:x)
t2b = sapply(t1[,split], function(x) x:0)
reps = unlist(lapply(t2a, length))
t2a = unlist(t2a); t2b = unlist(t2b)
temp = t(apply(cbind(t2a,t2b), 1, sort))
t1 = apply(t1, 2, function(x) rep(x,reps))
t1 = cbind(t1[,-split], temp)
t1 = unique(t(apply(t1, 1, sort)))
} # End for()
} # End if()
t1 = t(apply(t1,1,sort, decreasing=T))
colnames(t1) = letters[1:groups]; rownames(t1) = 1:nrow(t1)
t1
} # End function


Example with 15 objects and 6 groups:

> all.breaks(n=15, groups=6)

a b c d e f
1   15 0 0 0 0 0
2   14 1 0 0 0 0
3   13 2 0 0 0 0
4   12 3 0 0 0 0
5   11 4 0 0 0 0
6   10 5 0 0 0 0
7    9 6 0 0 0 0
8    8 7 0 0 0 0
9   13 1 1 0 0 0
10  12 2 1 0 0 0
11  11 3 1 0 0 0
12  10 4 1 0 0 0
13   9 5 1 0 0 0
14   8 6 1 0 0 0
15   7 7 1 0 0 0
16  11 2 2 0 0 0
17  10 3 2 0 0 0
18   9 4 2 0 0 0
19   8 5 2 0 0 0
20   7 6 2 0 0 0
21   9 3 3 0 0 0
22   8 4 3 0 0 0
23   7 5 3 0 0 0
24   6 6 3 0 0 0
25   7 4 4 0 0 0
26   6 5 4 0 0 0
27   5 5 5 0 0 0
28  12 1 1 1 0 0
29  11 2 1 1 0 0
30  10 3 1 1 0 0
31   9 4 1 1 0 0
32   8 5 1 1 0 0
33   7 6 1 1 0 0
34  10 2 2 1 0 0
35   9 3 2 1 0 0
36   8 4 2 1 0 0
37   7 5 2 1 0 0
38   6 6 2 1 0 0
39   8 3 3 1 0 0
40   7 4 3 1 0 0
41   6 5 3 1 0 0
42   6 4 4 1 0 0
43   5 5 4 1 0 0
44   9 2 2 2 0 0
45   8 3 2 2 0 0
46   7 4 2 2 0 0
47   6 5 2 2 0 0
48   7 3 3 2 0 0
49   6 4 3 2 0 0
50   5 5 3 2 0 0
51   5 4 4 2 0 0
52   6 3 3 3 0 0
53   5 4 3 3 0 0
54   4 4 4 3 0 0
55  11 1 1 1 1 0
56  10 2 1 1 1 0
57   9 3 1 1 1 0
58   8 4 1 1 1 0
59   7 5 1 1 1 0
60   6 6 1 1 1 0
61   9 2 2 1 1 0
62   8 3 2 1 1 0
63   7 4 2 1 1 0
64   6 5 2 1 1 0
65   7 3 3 1 1 0
66   6 4 3 1 1 0
67   5 5 3 1 1 0
68   5 4 4 1 1 0
69   8 2 2 2 1 0
70   7 3 2 2 1 0
71   6 4 2 2 1 0
72   5 5 2 2 1 0
73   6 3 3 2 1 0
74   5 4 3 2 1 0
75   4 4 4 2 1 0
76   5 3 3 3 1 0
77   4 4 3 3 1 0
78   7 2 2 2 2 0
79   6 3 2 2 2 0
80   5 4 2 2 2 0
81   5 3 3 2 2 0
82   4 4 3 2 2 0
83   4 3 3 3 2 0
84   3 3 3 3 3 0
85  10 1 1 1 1 1
86   9 2 1 1 1 1
87   8 3 1 1 1 1
88   7 4 1 1 1 1
89   6 5 1 1 1 1
90   8 2 2 1 1 1
91   7 3 2 1 1 1
92   6 4 2 1 1 1
93   5 5 2 1 1 1
94   6 3 3 1 1 1
95   5 4 3 1 1 1
96   4 4 4 1 1 1
97   7 2 2 2 1 1
98   6 3 2 2 1 1
99   5 4 2 2 1 1
100  5 3 3 2 1 1
101  4 4 3 2 1 1
102  4 3 3 3 1 1
103  6 2 2 2 2 1
104  5 3 2 2 2 1
105  4 4 2 2 2 1
106  4 3 3 2 2 1
107  3 3 3 3 2 1
108  5 2 2 2 2 2
109  4 3 2 2 2 2
110  3 3 3 2 2 2


The next post will examine the probabilities of each such outcome.

Karl et al 2015, updated analysis

Update: As mentioned in the piece below, I had suspicions that the data at the NOAA ERSST v4 webpage were not the correct (version 4) data, and it turns out that this is in fact the case. I learned this after Gavin Schmidt of NASA forwarded me data he had been sent directly from NOAA personnel. So, I need to do the analysis below again with the version 4 data, when I get time. Needless to say, NOAA needs to put correct data on their web pages, especially when it forms the basis of a paper making this magnitude of a claim. It’s also telling, and depressing, that of the large number of scientists commenting on this paper, none had apparently gone to get the actual data with which to check Karl et al’s analysis. If they had, they’d have noticed the discrepancy against what Karl et al report.
End Update

And the situation’s worse than I had first thought. Unfortunately.

As I mentioned in the update to previous post, Gavin Schmidt of NASA brought to my attention on Twitter the fact that you cannot average trend rates obtained via Ordinary Least Squares (OLS), between two consecutive time periods, and assume that value will equal the trend computed over the combined periods. I was confused by this at first, but Gavin made a plot showing exactly what the issue is:
Nothing forces the two trend lines to be continuous at the junction of the two intervals, and a big jump or drop in the data at or near that junction, with suitably short sample sizes, will cause this to happen. A slight variation of this is just what I did in comparing trend rates over the time intervals that Karl et al discussed. I did so because they did not include the rates, in their Table S1, for all of the intervals they discussed, so I had to make algebraic estimates. [Note that Table S1 is in fact a big problem, as it gives a hodgepodge of time periods that do not in fact allow one to make a direct and clear comparison of rates between consecutive time intervals, the very point of the paper. And their Figure 1 and discussion do not help.]

So, here I compute and report the OLS-derived trend estimates for Karl et al.’s variously defined intervals, and some others they didn’t discuss, using the new NOAA data directly. These intervals are defined by: (1) four start dates (1911, 1931, 1951, 1971), (2) two breakpoint years (1998 and 2000), and two end years (2012 and 2014); this creates 16 possible pairs of time intervals. For each I compute the OLS-derived trend (degrees C per decade) for the two consecutive intervals, and the ratio between them, as the second (“slowdown”) over the first (pre-“slowdown”). [Note that the mismatch issue arises because an OLS-derived straight line is the “stiffest” possible model one can fit to a time series. Using something more flexible like lowess or a polynomial model, would lessen this problem, but then you run into other issues. Since OLS is “BLUE” (Best Linear Unbiased Estimator), and minimizes the possibility of over-fitting the data (very important here!), and since Karl et al used it, I stick with it here. Also, I am only presenting data on the trend estimates themselves, not on their probabilities relative to a null, or any other possible, model. Doing that is trickier and more time consuming than people might think, time I don’t have right now.]

First, the new NOAA annual GMST data, from 1880 on, look like this:

Second, the trend analysis results:

   Start1 End1 Start2 End2 Slope1 Slope2 Ratio
2    1911 1997   1998 2012 0.0685 0.0342 0.499
3    1911 1997   1998 2014 0.0685 0.0560 0.817
4    1911 1999   2000 2012 0.0719 0.0390 0.542
5    1911 1999   2000 2014 0.0719 0.0649 0.903
6    1931 1997   1998 2012 0.0588 0.0342 0.582
7    1931 1997   1998 2014 0.0588 0.0560 0.952
8    1931 1999   2000 2012 0.0650 0.0390 0.600
9    1931 1999   2000 2014 0.0650 0.0649 1.000
10   1951 1997   1998 2012 0.0938 0.0342 0.365
11   1951 1997   1998 2014 0.0938 0.0560 0.597
12   1951 1999   2000 2012 0.1026 0.0390 0.380
13   1951 1999   2000 2014 0.1026 0.0649 0.633
14   1971 1997   1998 2012 0.1584 0.0342 0.216
15   1971 1997   1998 2014 0.1584 0.0560 0.354
16   1971 1999   2000 2012 0.1714 0.0390 0.227
17   1971 1999   2000 2014 0.1714 0.0649 0.379


The table shows several important results. First, there is only one case in which the slope of the slowdown period is as large as that of the pre-slowdown period, given by the eighth row of the table (start = 1931, break = 2000, end = 2014). In two other cases the ratio exceeds 0.90 (rows 5 and 7) and in one more case it exceeds 0.8 (row 3). For the 1951 start date most emphasized by Karl (rows 10-13), no ratio higher than 0.63 results, no matter how the intervals are defined. The mean and median of the 16 ratios are both about 0.56, so there’s no skew issue.

If one’s goal is to compare the data against the trends and discussions given in the AR5, that is, to assess the effects of the bias reductions in the new NOAA data on possible rate changes, then 2012 is of course the end year to use. For the two possible break years, those ratios are given in rows 10 and 12, as 0.365 and 0.380 respectively: the “slowdown” interval rates are thus ~ 35 to 40 percent of the pre-slowdown rates. In the previous post, in which I had to estimate the trend rate in the first interval algebraically, I got estimates of about 65-68%. So…the problem’s actually worse than I first estimated, and by quite a bit, 35-40% instead of 65-68%. And there are some serious discrepancies in my trend values vs theirs. For example, Karl et al report the trend from 2000 to 2014 as 0.116; I get 0.0649. For 1998-2012 they report 0.086 and I get 0.034. The numbers I’m getting are much closer to their reported “old” (bias uncorrected) values, than to their “new” ones, but they don’t exactly match those either. And the data link names provided by NOAA at their ERSSTv4 page do not seem to refer to version 4. So, some problems here.

If one’s goal is instead to assess what the combined effects of the bias reductions and the extra two years of data is, relative to the general question of whether the rate of global warming has slowed over the last ~1.5 decades or not, then rows 11 and 13, which go to 2014, give higher estimates: 60 to 63 percent post/pre. So those two years of extra data heavily affect the answer to that general question. This same basic story holds true for other start dates. And the further back you push the start date, the less the rate difference between the two periods, for any given choice of end and breakpoint dates.

Again, no significance testing here. It is important to remember though, that each year’s GMST value derives from many hundreds of data points, even if reduced to an effective size by accounting for autocorrelation, which certainly exists. I much doubt that any except the very highest ratios given here would prove to be statistically insignificant if the trend rates were computed from those raw data, instead of from the annual means, and even using the latter, they still would not be in many cases.

So, these results seem to make the principal conclusion of Karl et al, that there has been no slowdown in the rate of global warming, even more suspect than I reported in the previous post. The bias reductions and extra years’ data reduce the differences, but they most certainly do not eliminate them.

The R code for the analysis is below

## 1. Get and read data
# The ERSSTv4 GMST data from 1880 on is at: http://www1.ncdc.noaa.gov/pub/data/mlost/operational/products/aravg.ann.land_ocean.90S.90N.v3.5.4.201504.asc
options(digits=3)

## 2. Define time periods; start year = 1951; breakpoint year at either 1998 or 2000, end year at either 2012 or 2014
# breakpoint year always included in *2nd* of the 2 intervals
# 1998 and 2000 useful choices--differing global means
# start years: 40 and 20 years before, and 20 years after, Karl et al's emphasis on 1951; thus covers last 100+ years

start.yr = c(1911,1931,1951,1971); break.yr = c(1998,2000); end.yr = c(2012,2014)
dates = expand.grid(start.yr,break.yr,end.yr)
dates[,4] = dates[,2]-1; dates= dates[,c(1,4,2,3)]; names(dates) = c("Start1","End1","Start2","End2")
which.dates = dates-1879; n = nrow(dates)
intvl.1=paste("int",1:n,".1",sep=""); intvl.2=paste("int",1:n,".2",sep="")
for (i in 1:n){
assign(intvl.1[i], which.dates[i,1]:which.dates[i,2])
assign(intvl.2[i], which.dates[i,3]:which.dates[i,4])
}

## 3. Get OLS-derived lm and slope estimates, ignoring possible AC for now, since no p testing
lm.1 = paste("lm",1:n,".1",sep=""); lm.2 = paste("lm",1:n,".2",sep="")
sl.1 = paste("sl",1:n,".1",sep=""); sl.2 = paste("sl",1:n,".2",sep="")

for (i in 1:n){
assign(lm.1[i], lm(T[get(intvl.1[i]), 2] ~ T[get(intvl.1[i]), 1]) )		# OLS lm fits
assign(sl.1[i], get(lm.1[i]) [[1]][[2]])							# slope only
assign(lm.2[i], lm(T[get(intvl.2[i]), 2] ~ T[get(intvl.2[i]), 1]) )
assign(sl.2[i], get(lm.2[i]) [[1]][[2]])
}

Ratios = rep(NA,n); Slope1=Ratios; Slope2=Ratios
for (i in 1:n){
Slope1[i] = get(sl.1[i]); Slope2[i] = get(sl.2[i])
Ratios[i] =  Slope2[i] / Slope1[i]
}
final = cbind(dates, Slope1, Slope2, Ratios)
final = final[order(final[,1], final[,2], decreasing=F) ,]
rownames(final) = 1:n; final
plot(T[,1], T[,2], xlab="Year", ylab= "ERSSTv4 GMST (deg. C) rel. to 1971-2000 mean", type="l", lwd=2)


Transient temperature change estimation under varying radiative forcing scenarios, part three

In this third post of this series, I’ll demonstrate various outputs of the method/approach described in parts one and two of this series. You may need to look at those to follow this. The key point is that the methods described there allow one to estimate the model-predicted, global mean surface temperature (GMST) at any point in time, if one knows the equilibrium climate sensitivity and the temperature trajectory (i.e. response and lag) of a pulsed increase in radiative forcing. Both of these are obtained from CMIP5 climate model output.  This can be used to both look at future temperature predictions, and also to examine historic temperature changes in light of estimated RFs for various agents. This post concentrates on the former, the next one on the latter.

A few more points first though.  The key one is that Good et al. (2011) demonstrated that a series of linearly scaled RF pulses (at 1% of the instant 4X CO2 experiment), when continued for 140 years, gave a very good estimate of that produced by the annual 1% CO2 increase experiments, for the nine models’ outputs available in 2011.  Caldeira and Myhrvold (2013) confirm this, and note two and three parameter negative exponential models performed about equally well in this prediction (based on RSMSE criteria; actually these are four- and six-parameter fits–see the paper). Information criteria tell us we should go with the simpler model in such cases, as it will often be a better predictor for cases beyond the original sample range. Note that everything is fully deterministic here–I’m not including any random variation, i.e. systemic variability, measurement error, etc. Also, CO2 is the lone RF evaluated, and thus an important underlying assumption is that the actual RF change is known with high confidence.

First, what do the 20 CMIP5 AOGCM model temperature trajectories of the instant 4X CO2 experiment look like?  For the CM2013 two-parameter model fits, they look like this over 1000 years (starting from 1850, which I take as +/- end of the pre-industrial period):

and like this over the first 50 years:

Ebola rates, updated

Latest data from the WHO on the W. Africa Ebola outbreak (report of 09-18-14; data therein as of 09-14-14). For data table go here, and for R code generating data and graphs go here.

Reporting issues are likely responsible for the large fluctuations in the raw data, hence the loess smoothing (dark line) for a better approximation of the true rates. See here for a more in depth discussion of this issue.

On baseball (finally!)

I’ve discussed no baseball here yet, which is kind of surprising, given that I’ve been a big fan all my life. I played a lot growing up, through high school and even a little in college and afterwards. If I had the time, I would likely start a blog just devoted strictly to baseball (and not just analysis either), because I have a lot to say on a lot of topics. But alas…

To me, the real interest in any sport comes from actually playing the game, not watching it, and I watch very little baseball (now) because the games are just too time consuming (though I still have a hard time refraining in October). When I do watch, I’m not obsessively analytical–that takes the fun out of it for me. It’s an athletic contest, not a statistics class; I want to see the center fielder go full speed and lay out for a catch, or a base thief challenge the pitcher or whatever, not sit there with numbers in my head. Analysis is for later, and I do like it, so I wade in at times, thereby joining the SABR-metric (or “sabermetric”) revolution of the last 3-4 decades (the Society for American Baseball Research (SABR), initiated much of this). And baseball offers endless analytical opportunities, for (at least) two reasons.

Billy Hamilton of the Cincinnati Reds

June US historical precipitation, a quick analysis

The NCDC June climate summary for the United States is out. June’s a real important month in the Northern Hemisphere, especially agriculturally. I’ll use these data as a chance to demonstrate the probability that June 2014 precipitation (P) in the US is drawn from a stationary distribution (over 1895-2014, 120 years, CONUS), using both state- and region-based P rankings, and 250k Monte Carlo simulations. [There are 97 years of record for Alaska, and Hawaii’s not included.] If precipitation is stationary over that time, we expect the state rankings (normalized to a 0-1 scale) not to differ significantly from 0.5. Easier still, although a slightly different question, is to evaluate the probability of having 9 states rank among their 10 wettest (as was observed), as then we only need that information, not the actual rankings of each state.

Below are the graphics showing the rankings (120 = wettest; Alaska, not shown, had the 2nd wettest June in its 97 year record):

One nice thing about this is that only a few lines of code are needed. Here it is for the first situation:

## 1. State-based. Probability that 9 (of 49) states' June 2014 precip ranks among the state's wettest.  Hawaii not included.
# Non-Alaska: 120 years in the NCDC record (1895-2014).  Alaska: 97 year record (1918-2014)
# 1=dryest, 120=wettest; all rankings normalized to 0-1 scale; test stat = 111/120 = .9175.

rm(list=ls()); options(digits=3)
trials=250000; data=rep(NA,trials)
for (i in 1:trials) {z1=sample(seq(.001,1,.001),size=49,replace=F); data[i]=length(z1[z1 >= 111/120])}
(p = length(data[data>=9])/trials)


States are not all the same size, so we should normalize accordingly. A quicker approximation is just to use climate regions, which are more roughly equal in size than the states are. However, there are only ten of them, so it might be better to look at their central tendency and dispersion, rather than the number placing in the ten wettest years. [Of course, for both analyses, it would be even better to use the actual P values themselves, instead of their ranks, but with 120 years of data, this will be a good approximation of that].

## 2. Region-based. Probability that mean and std. dev. of regional (including AK) June 2014 precip ranks exceed expectation under hypothesis of stationarity (no change).  Hawaii not included.
regn.ranks = c(c(88,40,107,120,112,97,23,13,50)/120, 96/97)
par1=mean(regn.ranks); par2=sd(regn.ranks)
trials=250000; data=matrix(NA,nrow=trials,ncol=2)
for (i in 1:trials){
z1=sample((1:1000)/1000,10,F)
data[i,1]=mean(z1); data[i,2]=sd(z1)
print(i)
}
p.mean = length(data[,1][data[,1]>=par1])/trials
p.sd = length(data[,2][data[,2]>=par2])/trials


OK, so the results then. The state-based analysis (top) returns a value of p = 0.009, or just under 1%, for the probability of having 9 states out of 49 rank in the top 10 of their historical records. The region-based analysis gives p = 0.063 for a stationary mean, and p = 0.098 for a stationary standard deviation, at the region level, thus neither quite reaching the standard p = .05 signficance level, but both getting there. Remember, p = 0.5 would be the expected value for each metric under a stationary June precipitation; values deviating therefrom, either way, indicating evidence for dynamics. Note also that this is not a trend analysis; for that you would need the time series of either the P values or the rankings for each state or region.

Ebola epidemiology data scraper

Note: The following post is current as of WHO report of 09-18-14, which includes data to 09-14-14. [I’ve altered this code a number of times because of the nearly constantly changing format and location of the WHO data.]
#######

I wanted to find certain statistics for the West African Ebolavirus (EBV) outbreak from its inception, e.g. recent case and death rates. By “recent” I mean at WHO GAR reporting intervals, typically a few days. But the sites where I’d expect to find such (WHO, CDC etc) didn’t have them, at least in a synthesized form. So, I wrote an R script to scrape, compile, rearrange, and plot the data, for any country or all together, as taken from two sources. Starting with the WHO July 1 GAR report, tables of new and cumulative cases and deaths, categorized by degree of certainty, for each of the three countries in the outbreak, are given. Wikipedia has a less detailed table for all dates from March 25 on. I used it to obtain total cases and deaths up to July 1.

Below are graphs of the recent, per-day (1) case rates, and (2) death rates, from March 22. Each shows the raw data (thinner line), and a loess-fitted trend (thicker line). Note that reporting issues are partly, perhaps largely, responsible for extreme fluctuations in the raw data.

Global temperature change computations using transient sensitivity, in comparison to AR5

This post is more or less a thinking-out-loud exercise, but I thought I would inflict it on you, since there are not enough blog posts about climate change. It’s a comparison of point estimates of expected global mean surface temperature change, as given by the IPCC AR5, and ones I computed using Transient Climate Sensitivities (TCS).

This was inspired by a recent cartoon in xkcd and a conversation with Brian McGill (starting here) over at Dynamic Ecology. It’s just a cartoon, and the author seems to try hard to get science and stat topics right. But on first look, I thought it was not even “ballpark” correct, except perhaps under the highest radiative forcing (RF) scenario (RCP85), whereas some other people thought it was ballpark OK. Of course, one person’s concept of “ballpark” might not be another’s.

Be careful, look closely

The scientific enterprise involves inferring cause from effect, where “effect” refers to something(s) that can be readily observed and measured. Many would argue that this task is the very essence of science, and this definition sure is simpler than wading into Kant, Bacon, Popper, etc. Along those lines, this post gives one important example of how to discriminate between different possible causative explanations, given a set of observed data. I’ll use binomial probability models, because they are simple, very widely applicable, and I’ve discussed them some in previous posts.

Tree growth analysis issues, again, part five

Continuing in this series, the point of which is to explore issues in tree growth analysis and their relationship to claims made by Stephenson et al. (2014)

Previously, I derived a function relating tree above-ground biomass (AGB) to time, given an apriori equation from the literature (Chave et al., 2005) relating biomass to diameter that is similar to the one used by Stephenson et al. [I used this similar equation, rather than the one actually used by Stephenson et al. (2014), because doing so allowed for an easier derivation of the function relating diameter to time that would produce a constant mass growth rate.] The point of that exercise was to establish a reference frame of constant mass growth rate, from which I could then investigate what increasing and decreasing mass growth rates would imply for radial growth rates. The analysis showed me that models of that basic structure could only produce continuously increasing or decreasing mass (and radial) growth rates.

But continuously decreasing rates are biologically unreasonable, because they require impossible growth rates when trees are very young, and although continuously increasing rates do support what Stephenson et al. claim, they too are untenable. This is simply because, of course, nothing in nature can grow at an ever-accelerating rate; there has to be a deceleration at some point, at the very least to an asymptote, but more likely to an actual decline. The universe is not over-run by tree mass after all. In fact, one of the things that worries me about their paper is that their data shows very little rate deceleration at all, including in species that reach very large size. So…given the rate accelerations at the maximum tree sizes they report–the main point of their paper–then where are all the trees that are > 1.0X the mass of the largest they report? Shouldn’t they also exist? Did all their tree sizes just happen to fall short of the size threshold where growth rates start to decline? Or do they imagine that tree growth rates never slow down and maximum size is limited by various other forces? If so, what are they? Some things just don’t add up here.

Tree growth analysis issues, again, part four

Continuing….

Last time, I derived an equation that gives constant tree mass growth rates with time, given an apriori, defined relationship between biomass and diameter. Varying two parameters thereof (q and s) produced continuously increasing or decreasing mass and radial (diameter) growth rates, but only the continuously increasing mass growth rates also had reasonable radial growth rates over the full range of small and large trees. I now want to investigate whether there are radial growth models that can give more complex relationships of mass to time, unimodal responses in particular. But before I get to that I’ll show some graphs illustrating these dynamics, and provide the R code as well (at the end).

I’ve altered two things from the previous post, for convenience. First, I re-defined q so as to make the connection between M = f(D), and D = f(t), easier to follow. Specifically,

$M = de^{-1.864 + 2.608\ln(D)}$ and
$ln(D) = \frac{\ln(st)}{2.608q}$ and therefore
$M = de^{-1.864 + \frac{\ln(st)}{q}}$

where M is tree mass, D is diameter, d is wood density, e is the base of natural logs, t is time in years and s and q are computed parameters that jointly give equal mean growth rates at year t = 200 (as explained in the previous post). In this formulation, any model having q < 1.0 will result in continuously increasing mass growth rates as functions of either time or diameter. Second, I changed the target diameter from 156 cm to 120 cm, at 200 years (i.e. slower mean growth rate). This slower mean growth rate just allows me to extend growth beyond 200 years and not get unrealistically sized trees as quickly, at say 300 years.

But before going on, below are four graphs, each showing the growth dynamics of six models, given what I’ve done so far. All produce increasing mass growth rates as a function of time, result from varying q and s in the above equation, and have q < 1.0. The graphs show: (1) diameter and (2) mass, as functions of time, and then mass growth rates as functions of (3) time and (4) diameter. The values of q and s listed in the legends are for the third equation above $M = de^{-1.864 + \frac{\ln(st)}{q}}$

Severe analytical problems in dendroclimatology, part five

OK, posting code here doesn’t work; let’s try this again.

Last week after I put up part four of this series, there was some discussion regarding demonstration/specification of the claims presented there.  I gave it some thought and decided that, since I’ve already described the fundamental issues involved, and framed the approach taken, the cat is really already out of the bag, so I might as well demonstrate it with some code. As mentioned, I could not put up the full code, nor simply extract a snippet from the jungle that it is, so I wrote something much briefer that does the job (I think).  Perhaps it was a needed exercise anyway, to prove to myself that some wires didn’t get crossed somewhere. Discussion is welcome.

The code is here. Please heed the notice at the top!

When and what to share publically

In my last post in the series on analytical problems in dendroclimatology, the issue of sharing computer code was raised, which is part of the larger issue of being fully open about one’s claims, so that other people can check them.  I am pretty much 100% in the camp that everything in science needs to be defensible, and so all critical assumptions, inferential methods, data, computer code etc., does need to be made available to the world at large.  We all do definitely have to defend our work, and that’s how you defend it.  Saying “trust me” will not cut it.