Rate process estimation: Poisson vs gamma

This post is about estimating a rate process, particularly when certain required input data are missing or critical assumptions are unmet. Although it’s quite unrelated to recent posts (here and here) I’m actually headed somewhere definite with the collection of them, if you can bear with me. If you can’t bear with me, I can bear that.

As we know, estimating rate processes is a very common task in science, where “rate” is defined generally as a measured change in one variable per unit measure of another, usually expressed as a ratio or fraction. In the discussion that follows I’ll use object density–the number of objects of interest per unit area–to illustrate the concepts, but they are applicable to rates of any type.

Obtaining rate estimates, with rare exceptions, requires empirical sampling over the domain of interest, followed by some mathematical operation on the samples, often an averaging. Sampling, however, can take two opposing approaches, determined by which of the denominator and the numerator is fixed by design. Using our density example, in the first approach we tally the objects occurring within samples of some a priori defined area (the denominator thus fixed), and then simply average the values over all samples. In the second, distance sampling (DS), approach, we instead measure the distances to the nth closest objects from random starting points, with the value of n chosen a priori, and thus with the numerator fixed and the denominator varying. Low values of n are typically chosen for convenience, and the distances are converted to density using a suitable density estimator.

This DS approach has definite advantages. For one, it is time efficient, as no plot delineation is required, and decisions on which objects to tally, or which have already been tallied, are typically easier. A second advantage is that the total sample size is independent of object density, and a third is an often superior scatter of samples through the study area, leading to a better characterization of spatial pattern.

The data from the two approaches are modeled by two well-established statistical models, the Poisson and the gamma, respectively. With the first (plot-based) approach, the number of objects falling in each of a collection of plots follows a Poisson distribution, with mean determined by the overall density and the plot area. With the second (distance-based) approach, the distances to the nth closest objects follow a gamma distribution.
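To make the two approaches concrete, here is a minimal R sketch (all parameter values and object names below are mine, chosen purely for illustration) that scatters objects at random over a square study area and then samples the same pattern both ways:

# Simulate a randomly (Poisson) arranged set of objects, then estimate density
# by (1) counts in fixed-area plots and (2) distances to the nth closest object
set.seed(1)
lambda = 50                                   # true density: objects per unit area
side = 10                                     # study area is a side x side square
N = rpois(1, lambda * side^2)
obj = cbind(x = runif(N, 0, side), y = runif(N, 0, side))

# Approach 1: tally objects in 1 x 1 plots; the counts follow a Poisson distribution
plot.counts = table(cut(obj[, "x"], 0:side), cut(obj[, "y"], 0:side))
mean(plot.counts)                             # density estimate, ~ lambda

# Approach 2: measure from random points to the n.th closest object
n.th = 2
pts = cbind(runif(200, 0, side), runif(200, 0, side))
d.nth = apply(pts, 1, function(p) {
 d = sqrt((obj[, "x"] - p[1])^2 + (obj[, "y"] - p[2])^2)
 sort(d)[n.th]                                # distance to the n.th closest object
})
n.th / mean(pi * d.nth^2)                     # a simple distance-based density estimate

Edge effects and the choice of density estimator are ignored here; the point is only the contrast between fixed-area counts and fixed-n distances.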

Either approach requires that objects be randomly located throughout the study area for unbiased estimates, but there is a second potential bias source with distance sampling (DS), and hence a down-side. This bias has a magnitude of (n+1)/n, and so, e.g., measuring to the n = 1st closest object will bias the density estimate by a factor of 2, to the n = 2nd closest by 3/2, etc. The bias is easily removed simply by multiplying by the inverse, n/(n+1); equivalently, measuring to the 2nd closest objects will give the area corresponding to unit density, that is, the area at which object density equals one. Which is kind of interesting, assuming you’ve made it this far and nothing more enthralling occupies your mind at the moment.

The shape of the gamma distribution varies with n. When n = 1, the gamma assumes a negative exponential shape; when n > 1 it is unimodal, strongly skewed at low values of n but of decreasing skew (more nearly normal) at higher n. The upshot is that one can diagnose n if it is unknown, because these distributions will all differ, at least slightly. However, the power to do so decreases with increasing n, because the differences between the distributions shrink as n grows, and the diagnosis also assumes the objects are randomly arranged.
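A quick sketch of that shape change, assuming for illustration that the search area around a point, out to the nth closest object, follows a gamma distribution with shape n (rate set to 1 for unit density):

# Gamma densities for n = 1 through 5: negative exponential at n = 1,
# unimodal and progressively less skewed as n increases
curve(dgamma(x, shape = 1, rate = 1), from = 0, to = 10,
 xlab = "area to nth closest object", ylab = "density")
for (n in 2:5) curve(dgamma(x, shape = n, rate = 1), add = TRUE, lty = n)
legend("topright", legend = paste("n =", 1:5), lty = 1:5)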

If our rate denominator is measured in more than one dimension–as it is in our example, area–we can subdivide the space around each sample point into equally sized sectors and measure the distance to the nth closest object within each sector. Sectors here correspond to angles emanating from the sample point, and this “angle-order” method increases the sample size at each sample point. Supposing four sectors and a given value of n, the four measurements at a point give (four choose two) = six possible ratios between them, each ratio defined as the farther distance over the closer. The distributions of these six values, over a collection of sample points, are then collectively discriminatory for n.
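As a small illustration (the four sector distances below are made-up numbers), the six ratios at a single sample point can be computed directly:

# Six pairwise ratios (farther / closer) among the four sector distances at one point
sector.d = c(a = 1.8, b = 0.9, c = 2.4, d = 1.2)    # hypothetical distances, one per sector
ratio.pairs = combn(sector.d, 2)                    # (four choose two) = 6 pairs
apply(ratio.pairs, 2, function(x) max(x) / min(x))  # the six ratios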

Usually both the rate and its spatial homogeneity are unknown, and we need the latter to get the former. If a non-random pattern exists, the non-randomness can be quantified (in several ways) if we know n, and density estimates can then be made using suitable corrections. If we don’t know n but do know the objects are in fact randomly arranged, we can still infer density, although not with nearly as much precision as if we knew the value of n. Interestingly, the bias arising from a clustered, non-random pattern is in the same direction as that from an under-estimate of n, both leading to density under-estimates. Similarly, a tendency toward spatial pattern regularity gives a bias in the same direction as an over-estimate of n.

These last facts can be useful, because our main interest is not n or the spatial pattern, but rather the rate (density), and the effects of uncertainties in them are not additive, and can even be compensating, depending on the situation. This serves to greatly constrain the total uncertainty in the rate estimate even when these critical pieces of input information are lacking. I’m going to try to expound on that in the next post in this utterly enthralling series.

Support for this post has been provided by the Society for Public Education of Not Very Well Recognized or Cared About Issues. Nothing in the post should be construed, or mis-construed, as in any way necessarily reflecting the views, opinions, sentiments, thoughts, conceptual leanings, quasi-conscious daydreaming or water cooler banter of said Society, or really of anyone in particular other than me, and even that is open to debate. You may now return to your regularly scheduled life.

Revisited, revisited

Fifty years ago later this week. August 30, 1965 to be exact.

One of the more interesting, unexpected and outlandish albums in modern American music. And the first of many changes in musical direction for its author, who wasn’t afraid to upset the apple cart of expectations and go in some new direction. As he’s quoted as saying, “I try my hardest to be just like I am”.

Oh God said to Abraham, “Kill me a son”
Abe says, “Man, you must be puttin’ me on”
God say, “No.” Abe say, “What?”
God say, “You can do what you want Abe, but
The next time you see me comin’ you better run”
Well Abe says, “Where do you want this killin’ done?”
God says, “Out on Highway 61”

Outcome probabilities

Continuing from the previous post, where I discussed earth’s recent surface temperature increase hiatus/slowdown/backoff/vacation…

Well not really–I discussed instead the very closely related topic of enumerating the outcomes of a given probabilistic process. And actually not so much a discussion as a monologue. But couldn’t somebody please address that other issue, it’s just been badly neglected… :)

Anyway, enumerating the possible allocations of n objects into q groups is rarely an end in itself; the probability, p, of each allocation is usually what we want. This is a multinomial probability (MP) problem when q > 2, and a binomial (BP) problem when q = 2, in which we know a priori the per-trial p values and want to determine the probabilities of the various possible outcomes over some number, n, of such trials. In the given example, the per-trial probabilities of group membership are all equal (1/6) and we want to know the probability of each possible result from n = 15 trials.
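In R, for instance, the probability of any one specific outcome follows directly from dmultinom(), and dbinom() handles the q = 2 case (the particular allocation below is an arbitrary illustration):

# Probability of one specific allocation of n = 15 trials across q = 6 equally likely groups
dmultinom(x = c(4, 3, 3, 2, 2, 1), prob = rep(1/6, 6))
# Binomial (q = 2) case: e.g., 9 "successes" in n = 15 trials with per-trial p = 0.5
dbinom(9, size = 15, prob = 0.5)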

One has to be careful in defining exactly what constitutes a “trial” and a “sample size” in these models though, because trials can be nested. We could, for example, conduct n2 = 100 higher-level trials, in each of which the results from n1 = 2 lower-level trials are combined. This is best exemplified by Hardy-Weinberg analysis in population genetics: a lower-level trial consists of randomly choosing n1 = 2 alleles from the population and combining them into a genotype. This is repeated n2 times and the expected genotype frequencies, under certain assumptions about population behavior, are then compared to the observed frequencies, to test whether those assumptions are likely met or not. If only two possible outcomes of a single trial (alleles in this case) exist in the population, the model is binomial, and if more than two, multinomial.
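A minimal sketch of that nested structure (the allele frequency p below is an assumed, hypothetical value):

# Hardy-Weinberg style expectation: each genotype is n1 = 2 binomial draws of allele "A",
# repeated n2 = 100 times; expected genotype counts are then compared to observed counts
p = 0.7                                          # assumed frequency of allele A
n2 = 100
expected = n2 * dbinom(0:2, size = 2, prob = p)  # expected counts of aa, Aa, AA genotypes
names(expected) = c("aa", "Aa", "AA")
expected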

There are two types of MP/BP models, corresponding to whether or not group identity matters. When it does, the BP/MP coefficients determine the expected probabilities of each specific outcome. For n objects, q groups and group sizes a through f, these equal the number of permutations, as given by n! / (a!b!c!d!e!f!), where “!” is the factorial operator and 0! = 1 by definition. This formula is extremely useful; without it we’d have to enumerate all permutations of a given BP/MP process. And this would choke us quickly, as such values become astronomical in a hurry: with n = 15 and q = 6, we already have 6^15 = 470 billion possible permutations.
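For example, the coefficient for a single allocation can be computed directly (the group sizes below are an arbitrary illustration summing to n = 15):

# Number of permutations for one allocation of n = 15 objects into group sizes a..f
sizes = c(4, 3, 3, 2, 2, 1)
factorial(sum(sizes)) / prod(factorial(sizes))
# the same on the log scale, which avoids overflow for larger n
exp(lgamma(sum(sizes) + 1) - sum(lgamma(sizes + 1)))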

When group identity doesn’t matter, only the numerical distribution matters; this decreases the total number of coefficients but increases the value of each of them. For example, in locating the n = 50 closest trees to a random sampling point, one may want to know only the expected numerical distribution across the q angle sectors around the point. To take a smaller case, the allocation [2,1,1,0] into sectors [a,b,c,d] would be equivalent to [0,1,1,2] and to 10 others, and these thus have to be aggregated. The number of aggregated outcomes is given by the number of distinct permutations of the observed group sizes, which in turn depends on how much they vary. When all group sizes differ, e.g. [0,1,2,3,4,5] for n = 15, the number of equivalent outcomes is maximized, equaling q! (in this case, 720). When some but not all of the group sizes are identical, the count is intermediate, as determined by products of factorials. When the number of identical group sizes is maximized, the equivalent outcomes are minimized, at either q or 1; in this case there are 6 variations of [n,0,0,0,0,0].
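A short sketch of that counting rule (the helper function name is mine): the number of equivalent allocations is q! divided by the product of the factorials of the multiplicities of the group sizes, which reproduces the counts noted above:

# Distinct orderings of a group-size vector: q! / product of factorials of the size multiplicities
n.perms = function(sizes) factorial(length(sizes)) / prod(factorial(table(sizes)))
n.perms(c(2, 1, 1, 0))           # 12 equivalent allocations ([2,1,1,0], [0,1,1,2], and 10 others)
n.perms(c(0, 1, 2, 3, 4, 5))     # 720 = 6!; all group sizes differ
n.perms(c(15, 0, 0, 0, 0, 0))    # 6 variations of [n,0,0,0,0,0]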

To get the desired probability distribution, the raw MP/BP coefficients, obtained by either method, are standardized to sum to 1.0 (this yields the outcome probabilities when the per-trial group probabilities are equal, as in the example here).
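For the binomial case this is easy to check against R’s built-in distribution (a tiny sketch using the n = 15, p = 0.5 case):

# Raw binomial coefficients for n = 15, standardized to sum to 1, reproduce dbinom() at p = 0.5
coefs = choose(15, 0:15)
all.equal(coefs / sum(coefs), dbinom(0:15, size = 15, prob = 0.5))   # TRUE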

Next time I’m going to discuss a general solution to the problem of estimating the otherwise unknown relationships between pairs of objects involved in a rate process, such as density, the number of objects per unit area or time. These can be deduced analytically using multinomial and gamma probability models in conjunction.

It will make you call your neighbors over for a party and group discussion. If it doesn’t you get a full refund.

Count ’em

Suppose you have n objects and want to enumerate all possible unique allocations thereof among q groups, where “unique” is defined by numerical allocation only, independent of actual group membership for each object. Each group can take any size from 0 to n, contingent on the total in the other q-1 groups. I’m working on a problem where I have to do this; there might well be an available method for doing it, but in my experience one can waste a lot of time looking for such things, and you typically learn a lot more when you have to solve a problem yourself. Anyway, I couldn’t find one readily, so I wrote this R function as a script. Feel free to use it if it’s, well, useful.

all.breaks = function(n, groups){
 # all ordered ways to split n into two parts, then sort each pair and drop
 # duplicates, giving the unique two-group allocations
 t1 = cbind(n - 0:n, n - n:0)
 t1 = unique(t(apply(t1, 1, sort)))
 test = groups - 2
  if (test>0){
   # add one group at a time by splitting the largest group size in each row
   for (i in 1:test){
    split=ncol(t1)
    t2a = sapply(t1[,split], function(x) 0:x)   # all possible splits of the last (largest) column
    t2b = sapply(t1[,split], function(x) x:0)
    reps = unlist(lapply(t2a, length))          # number of splits generated by each existing row
    t2a = unlist(t2a); t2b = unlist(t2b)
    temp = t(apply(cbind(t2a,t2b), 1, sort))
    t1 = apply(t1, 2, function(x) rep(x,reps))  # replicate existing rows to match their splits
    t1 = cbind(t1[,-split], temp)
    t1 = unique(t(apply(t1, 1, sort)))          # re-sort each row and drop duplicate allocations
   } # End for() 
  } # End if()
  t1 = t(apply(t1,1,sort, decreasing=T))        # present group sizes from largest to smallest
  colnames(t1) = letters[1:groups]; rownames(t1) = 1:nrow(t1)
  t1
 } # End function

Example with 15 objects and 6 groups:

> all.breaks(n=15, groups=6)

     a b c d e f
1   15 0 0 0 0 0
2   14 1 0 0 0 0
3   13 2 0 0 0 0
4   12 3 0 0 0 0
5   11 4 0 0 0 0
6   10 5 0 0 0 0
7    9 6 0 0 0 0
8    8 7 0 0 0 0
9   13 1 1 0 0 0
10  12 2 1 0 0 0
11  11 3 1 0 0 0
12  10 4 1 0 0 0
13   9 5 1 0 0 0
14   8 6 1 0 0 0
15   7 7 1 0 0 0
16  11 2 2 0 0 0
17  10 3 2 0 0 0
18   9 4 2 0 0 0
19   8 5 2 0 0 0
20   7 6 2 0 0 0
21   9 3 3 0 0 0
22   8 4 3 0 0 0
23   7 5 3 0 0 0
24   6 6 3 0 0 0
25   7 4 4 0 0 0
26   6 5 4 0 0 0
27   5 5 5 0 0 0
28  12 1 1 1 0 0
29  11 2 1 1 0 0
30  10 3 1 1 0 0
31   9 4 1 1 0 0
32   8 5 1 1 0 0
33   7 6 1 1 0 0
34  10 2 2 1 0 0
35   9 3 2 1 0 0
36   8 4 2 1 0 0
37   7 5 2 1 0 0
38   6 6 2 1 0 0
39   8 3 3 1 0 0
40   7 4 3 1 0 0
41   6 5 3 1 0 0
42   6 4 4 1 0 0
43   5 5 4 1 0 0
44   9 2 2 2 0 0
45   8 3 2 2 0 0
46   7 4 2 2 0 0
47   6 5 2 2 0 0
48   7 3 3 2 0 0
49   6 4 3 2 0 0
50   5 5 3 2 0 0
51   5 4 4 2 0 0
52   6 3 3 3 0 0
53   5 4 3 3 0 0
54   4 4 4 3 0 0
55  11 1 1 1 1 0
56  10 2 1 1 1 0
57   9 3 1 1 1 0
58   8 4 1 1 1 0
59   7 5 1 1 1 0
60   6 6 1 1 1 0
61   9 2 2 1 1 0
62   8 3 2 1 1 0
63   7 4 2 1 1 0
64   6 5 2 1 1 0
65   7 3 3 1 1 0
66   6 4 3 1 1 0
67   5 5 3 1 1 0
68   5 4 4 1 1 0
69   8 2 2 2 1 0
70   7 3 2 2 1 0
71   6 4 2 2 1 0
72   5 5 2 2 1 0
73   6 3 3 2 1 0
74   5 4 3 2 1 0
75   4 4 4 2 1 0
76   5 3 3 3 1 0
77   4 4 3 3 1 0
78   7 2 2 2 2 0
79   6 3 2 2 2 0
80   5 4 2 2 2 0
81   5 3 3 2 2 0
82   4 4 3 2 2 0
83   4 3 3 3 2 0
84   3 3 3 3 3 0
85  10 1 1 1 1 1
86   9 2 1 1 1 1
87   8 3 1 1 1 1
88   7 4 1 1 1 1
89   6 5 1 1 1 1
90   8 2 2 1 1 1
91   7 3 2 1 1 1
92   6 4 2 1 1 1
93   5 5 2 1 1 1
94   6 3 3 1 1 1
95   5 4 3 1 1 1
96   4 4 4 1 1 1
97   7 2 2 2 1 1
98   6 3 2 2 1 1
99   5 4 2 2 1 1
100  5 3 3 2 1 1
101  4 4 3 2 1 1
102  4 3 3 3 1 1
103  6 2 2 2 2 1
104  5 3 2 2 2 1
105  4 4 2 2 2 1
106  4 3 3 2 2 1
107  3 3 3 3 2 1
108  5 2 2 2 2 2
109  4 3 2 2 2 2
110  3 3 3 2 2 2

The next post will examine the probabilities of each such outcome.

Meandering Numps

If you need something conducive to positive mental health, then you want to check out Meandering Numps ASAP. I’m jealous that I didn’t think of it first. Hats off to Andy–stuff like this makes life livable.


“The Wally was a black beast, and although he had a ferocious demeanor, he was in fact a daft ape, and a massive dope.”


Just one of the cast of characters over at Numps’ place.

Local history, Underground Railroad edition

There was a local history event in the old home town today. Decided to go check it out, as I’m a sucker for these types of things and am rarely disappointed, as was the case again. Learning, for example, that James Ashley, who introduced to Congress the 13th Amendment that legally abolished slavery, was a local guy. And of the myriad ways there are of hiding and transporting people, and distracting the runaway slave hounds, when you put your mind to it. And more.

Right in our back yard. We had no idea growing up. But we could name all 50 states…

Mr Douglass was there–and he was impressive.


When you sing you pray twice

John Gorka has long been one of my favorite songwriters–his combination of lyrics, guitar style, humor, and intelligence is equaled by very few (like maybe 2-3 others) in my experience. Had a much tougher go learning this one than expected though–John’s chord progressions can be tricky–but eventually got it. Thanks be to Chordify!

When you sing
You make the world a better place
When you sing
You bring us to a state of grace
When you sing, when you sing
Do you know the joy you bring?
When you sing

Like when you sing in your car
All alone with the radio
Or along with the Sunday songs
Breathe in, breathe out, move on
The song begins in your heart
Respond with your voice
Freedom’s right where you are
You still have a choice

St Francis from Italy, he said long ago
When you sing, you pray twice and
That was, and always will be, so
Don’t worry, you can’t make a sound
To please a stranger’s ear
Do it for others, and you’ll
Sing sweet, loud, and clear
Sweet, loud, and clear

You can sing with your hands
You can sing with your words
Sing with that still small voice
It’ll still be heard
You can sing with your thoughts
You can sing with your gifts
Sing to the weather, ’til the weather lifts

John Gorka, When You Sing

I was free to wonder

I couldn’t bribe a wino on what I used to make
My fortune was as sure as the wind
But I was free to wonder and time was on my hands
It was mine to burn and to bend

Freedom for freedom, call that an even scheme
Give me time to wonder and to dream
I need the money; they’ll take the time
Down to the land of the bottom line

Then there came a chance to make some steady dough
Bouncing up my alley to the door
You fill your clothes with keys, and damned responsibilities
You trade in the maybe for the sure

Freedom for freedom, call that an even scheme
Give me time to wonder and to dream
I need the money; they’ll take the time
Down to the land of the bottom line

Land of the Bottom Line, John Gorka

Shakespeare from the natives

An interesting incident in connection with the Bradstreet expedition was a journey undertaken by Captain Morris…Under instructions from his superior he “set out in good spirits from Cedar Point (mouth of the Maumee), Lake Erie on the 26th of August, 1764…and at the same time the army proceeded for Detroit”. He was accompanied by two Canadians and a dozen Indians…The party proceeded up the Maumee to the headquarters of Pontiac, whose “army consisting of 600 savages with tomahawks in their hands”, surrounded him….At Fort Wayne another rabble of Indians met the embassy in a threatening manner, but Morris remained in a canoe reading “The Tragedy of Anthony and Cleopatra” in a volume of Shakespeare which had been presented to him by an Indian Chief. This was undoubtedly one of the strangest circumstances under which the works of Shakespeare were ever perused.

Winter, N.O. 1917. A History of Northwest Ohio, vol. 1

Without saying a word

It’s amazing how you can speak right to my heart
Without saying a word, you can light up the dark
Try as I may I could never explain
What I hear when you don’t say a thing

The smile on your face lets me know that you need me
There’s a truth in your eyes sayin’ you’ll never leave me
The touch of your hand says you’ll catch me whenever I fall
You say it best…when you say nothing at all

Alison Krauss, When You Say Nothing at All

Karl et al., once again

I’m just going to pick up here from the update posted at the top of the previous post. Read the previous two posts if you need the background on what I’m doing.

I was fortunately able to get the global mean annual values for NOAA’s ERSST version 4 data from Gavin [edit: this must actually be merged land and ocean, or MLOST, not ERSST data]. Here’s the same analysis I did previously, using those data. I also realized that the data at the NOAA web page included the year-to-date value for 2015 (which so far is apparently record warm); I removed that year here. As before, no significance testing here, which is trickier than it might appear, or than I have time for. (Note that Karl et al. did not test for significance either.)

First, here’s a graph of the two time series (previous ERSST version = black, new data = blue). The graph appears to be identical, or nearly so, to Karl et al.’s Fig 1a:
(Figure: ERSSTv4_2)

This is where it gets interesting. Visually, there’s clearly little difference between the two series; their correlation is about 0.995. But…when I run the same analysis on the new data that I did on the previous version, the results (right-most two columns) are very different indeed:

   Start1 End1 Start2 End2  Slope1 Slope2  Ratio Previous
2    1911 1997   1998 2012 0.06933 0.0855 1.2332   0.4995
3    1911 1997   1998 2014 0.06933 0.1063 1.5325   0.8170
4    1911 1999   2000 2012 0.07252 0.0920 1.2685   0.5422
5    1911 1999   2000 2014 0.07252 0.1162 1.6018   0.9033
6    1931 1997   1998 2012 0.06517 0.0855 1.3120   0.5821
7    1931 1997   1998 2014 0.06517 0.1063 1.6305   0.9522
8    1931 1999   2000 2012 0.07071 0.0920 1.3011   0.5999
9    1931 1999   2000 2014 0.07071 0.1162 1.6430   0.9995
10   1951 1997   1998 2012 0.10488 0.0855 0.8152   0.3647
11   1951 1997   1998 2014 0.10488 0.1063 1.0131   0.5966
12   1951 1999   2000 2012 0.11211 0.0920 0.8206   0.3797
13   1951 1999   2000 2014 0.11211 0.1162 1.0363   0.6327
14   1971 1997   1998 2012 0.17069 0.0855 0.5009   0.2162
15   1971 1997   1998 2014 0.17069 0.1063 0.6225   0.3536
16   1971 1999   2000 2012 0.17887 0.0920 0.5143   0.2274
17   1971 1999   2000 2014 0.17887 0.1162 0.6495   0.3788

All of the ratios are now higher; that is, there is less of a difference in slope between the post- and pre-breakpoint time periods. This is regardless of start, break, or end years. Most of the ratios are now > 1.0; before, none of them were. For the 1951 start date emphasized the most by Karl et al., and ending in 2014, there is near equality in slopes, just as they state. If one ends instead at 2012, the recent interval is just over 80% of the earlier, much higher than using the previous version data, where it was 36-38%. Choice of start year has a large effect: the highest ratios arise using a 1931 start and the lowest from a 1971 start. A 1971 start date gives the largest discrepancy in rates among any of the four tested; there has clearly been a slowdown if one starts there, and there’s just as clearly been an increase if one starts from 1931. It washes out as a draw if one starts from 1951.

The only real point of contention now is the last sentence in the paper: “…based on our new analysis, the IPCC’s (1) statement of two years ago – that the global surface temperature “has shown a much smaller increasing linear trend over the past 15 years than over the past 30 to 60 years” – is no longer valid.” Comparing a 30 to 60 year interval with a sub-interval of it is not the proper comparison. You have to compare non-overlapping intervals, and if you start those from 1971, then yes, there definitely has been a slowdown, based on these data.

What’s interesting–and unexpected–to me is how such a small difference between the two time series can have such a large impact on the ratio of warming rates in the two time periods. When I first graphed the above, my first thought was “nearly identical, not going to affect the rate ratios much at all”.

Wrong!

p.s. On the NOAA data issue–I noticed that one can in fact get version 4 data from NOAA, just not the spatio-temporally aggregated versions, which, as mentioned, have incorrect links to the previous version. You have to be willing and able to do your own aggregating, but if you are, you can get both ASCII and NetCDF format, by month and grid cell.

Karl et al 2015, updated analysis

Update: As mentioned in the piece below, I had suspicions that the data at the NOAA ERSST v4 webpage were not the correct (version 4) data, and it turns out that this is in fact the case. I learned this after Gavin Schmidt of NASA forwarded me data he had been sent directly from NOAA personnel. So, I need to do the analysis below again with the version 4 data, when I get time. Needless to say, NOAA needs to put correct data on their web pages, especially when it forms the basis of a paper making this magnitude of a claim. It’s also telling, and depressing, that of the large number of scientists commenting on this paper, none had apparently gone to get the actual data with which to check Karl et al’s analysis. If they had, they’d have noticed the discrepancy against what Karl et al report.
End Update

And the situation’s worse than I had first thought. Unfortunately.

As I mentioned in the update to previous post, Gavin Schmidt of NASA brought to my attention on Twitter the fact that you cannot average trend rates obtained via Ordinary Least Squares (OLS), between two consecutive time periods, and assume that value will equal the trend computed over the combined periods. I was confused by this at first, but Gavin made a plot showing exactly what the issue is:
(Figure: OLS trend averaging)

Nothing forces the two trend lines to be continuous at the junction of the two intervals, and a big jump or drop in the data at or near that junction, with suitably short sample sizes, will cause this to happen. A slight variation of this is just what I did in comparing trend rates over the time intervals that Karl et al discussed. I did so because they did not include the rates, in their Table S1, for all of the intervals they discussed, so I had to make algebraic estimates. [Note that Table S1 is in fact a big problem, as it gives a hodgepodge of time periods that do not in fact allow one to make a direct and clear comparison of rates between consecutive time intervals, the very point of the paper. And their Figure 1 and discussion do not help.]
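Here is a tiny simulated sketch of that point (entirely made-up data, just to show the mechanics): the two sub-period OLS slopes can both be small while the combined-period slope is much larger, because a jump near the junction loads onto the single combined trend line.

# Two consecutive 20-year periods, each with the same small trend, but with a jump between them
set.seed(2)
yr = 1:40
y = c(0.02 * yr[1:20], 0.5 + 0.02 * yr[21:40]) + rnorm(40, 0, 0.05)
ols.slope = function(i) coef(lm(y[i] ~ yr[i]))[2]
ols.slope(1:20); ols.slope(21:40)   # both sub-period slopes ~ 0.02
ols.slope(1:40)                     # combined-period slope is roughly twice as large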

So, here I compute and report the OLS-derived trend estimates for Karl et al.’s variously defined intervals, and some others they didn’t discuss, using the new NOAA data directly. These intervals are defined by: (1) four start dates (1911, 1931, 1951, 1971), (2) two breakpoint years (1998 and 2000), and (3) two end years (2012 and 2014); this creates 16 possible pairs of time intervals. For each I compute the OLS-derived trend (degrees C per decade) for the two consecutive intervals, and the ratio between them, as the second (“slowdown”) over the first (pre-“slowdown”). [Note that the mismatch issue arises because an OLS-derived straight line is the “stiffest” possible model one can fit to a time series. Using something more flexible like lowess or a polynomial model would lessen this problem, but then you run into other issues. Since OLS is “BLUE” (Best Linear Unbiased Estimator), minimizes the possibility of over-fitting the data (very important here!), and is what Karl et al used, I stick with it here. Also, I am only presenting data on the trend estimates themselves, not on their probabilities relative to a null, or any other possible, model. Doing that is trickier and more time consuming than people might think, time I don’t have right now.]

First, the new NOAA annual GMST data, from 1880 on, look like this:
(Figure: ERSSTv4)

Second, the trend analysis results:

   Start1 End1 Start2 End2 Slope1 Slope2 Ratio
2    1911 1997   1998 2012 0.0685 0.0342 0.499
3    1911 1997   1998 2014 0.0685 0.0560 0.817
4    1911 1999   2000 2012 0.0719 0.0390 0.542
5    1911 1999   2000 2014 0.0719 0.0649 0.903
6    1931 1997   1998 2012 0.0588 0.0342 0.582
7    1931 1997   1998 2014 0.0588 0.0560 0.952
8    1931 1999   2000 2012 0.0650 0.0390 0.600
9    1931 1999   2000 2014 0.0650 0.0649 1.000
10   1951 1997   1998 2012 0.0938 0.0342 0.365
11   1951 1997   1998 2014 0.0938 0.0560 0.597
12   1951 1999   2000 2012 0.1026 0.0390 0.380
13   1951 1999   2000 2014 0.1026 0.0649 0.633
14   1971 1997   1998 2012 0.1584 0.0342 0.216
15   1971 1997   1998 2014 0.1584 0.0560 0.354
16   1971 1999   2000 2012 0.1714 0.0390 0.227
17   1971 1999   2000 2014 0.1714 0.0649 0.379

The table shows several important results. First, there is only one case in which the slope of the slowdown period is as large as that of the pre-slowdown period, given by the eighth row of the table (start = 1931, break = 2000, end = 2014). In two other cases the ratio exceeds 0.90 (rows 5 and 7) and in one more case it exceeds 0.8 (row 3). For the 1951 start date most emphasized by Karl (rows 10-13), no ratio higher than 0.63 results, no matter how the intervals are defined. The mean and median of the 16 ratios are both about 0.56, so there’s no skew issue.

If one’s goal is to compare the data against the trends and discussions given in the AR5, that is, to assess the effects of the bias reductions in the new NOAA data on possible rate changes, then 2012 is of course the end year to use. For the two possible break years, those ratios are given in rows 10 and 12, as 0.365 and 0.380 respectively: the “slowdown” interval rates are thus ~ 35 to 40 percent of the pre-slowdown rates. In the previous post, in which I had to estimate the trend rate in the first interval algebraically, I got estimates of about 65-68%. So…the problem’s actually worse than I first estimated, and by quite a bit, 35-40% instead of 65-68%. And there are some serious discrepancies in my trend values vs theirs. For example, Karl et al report the trend from 2000 to 2014 as 0.116; I get 0.0649. For 1998-2012 they report 0.086 and I get 0.034. The numbers I’m getting are much closer to their reported “old” (bias uncorrected) values, than to their “new” ones, but they don’t exactly match those either. And the data link names provided by NOAA at their ERSSTv4 page do not seem to refer to version 4. So, some problems here.

If one’s goal is instead to assess what the combined effects of the bias reductions and the extra two years of data is, relative to the general question of whether the rate of global warming has slowed over the last ~1.5 decades or not, then rows 11 and 13, which go to 2014, give higher estimates: 60 to 63 percent post/pre. So those two years of extra data heavily affect the answer to that general question. This same basic story holds true for other start dates. And the further back you push the start date, the less the rate difference between the two periods, for any given choice of end and breakpoint dates.

Again, no significance testing here. It is important to remember, though, that each year’s GMST value derives from many hundreds of data points, even if reduced to a smaller effective size by accounting for autocorrelation, which certainly exists. I very much doubt that any but the very highest of the ratios given here would prove statistically insignificant if the trend rates were computed from those raw data instead of from the annual means, and even using the latter, many still would not be.

So, these results seem to make the principal conclusion of Karl et al, that there has been no slowdown in the rate of global warming, even more suspect than I reported in the previous post. The bias reductions and extra years’ data reduce the differences, but they most certainly do not eliminate them.

The R code for the analysis is below

## 1. Get and read data
# The ERSSTv4 GMST data from 1880 on is at: http://www1.ncdc.noaa.gov/pub/data/mlost/operational/products/aravg.ann.land_ocean.90S.90N.v3.5.4.201504.asc
# Download data and set appropriate working directory
options(digits=3)
T = read.delim("NOAA ERSST aravg.ann.land_ocean.90S.90N.v3.5.4.201504.asc.txt", sep="", header=F)

## 2. Define time periods; start year = 1951; breakpoint year at either 1998 or 2000, end year at either 2012 or 2014
# breakpoint year always included in *2nd* of the 2 intervals
# 1998 and 2000 useful choices--differing global means
# start years: 40 and 20 years before, and 20 years after, Karl et al's emphasis on 1951; thus covers last 100+ years
  
start.yr = c(1911,1931,1951,1971); break.yr = c(1998,2000); end.yr = c(2012,2014)
dates = expand.grid(start.yr,break.yr,end.yr)
dates[,4] = dates[,2]-1; dates= dates[,c(1,4,2,3)]; names(dates) = c("Start1","End1","Start2","End2")
which.dates = dates-1879; n = nrow(dates)
intvl.1=paste("int",1:n,".1",sep=""); intvl.2=paste("int",1:n,".2",sep="")
for (i in 1:n){
 assign(intvl.1[i], which.dates[i,1]:which.dates[i,2])
 assign(intvl.2[i], which.dates[i,3]:which.dates[i,4])
}

## 3. Get OLS-derived lm and slope estimates, ignoring possible AC for now, since no p testing
lm.1 = paste("lm",1:n,".1",sep=""); lm.2 = paste("lm",1:n,".2",sep="")
sl.1 = paste("sl",1:n,".1",sep=""); sl.2 = paste("sl",1:n,".2",sep="")

for (i in 1:n){
 assign(lm.1[i], lm(T[get(intvl.1[i]), 2] ~ T[get(intvl.1[i]), 1]) )		# OLS lm fits
 assign(sl.1[i], get(lm.1[i]) [[1]][[2]])							# slope only
 assign(lm.2[i], lm(T[get(intvl.2[i]), 2] ~ T[get(intvl.2[i]), 1]) )
 assign(sl.2[i], get(lm.2[i]) [[1]][[2]])
}

Ratios = rep(NA,n); Slope1=Ratios; Slope2=Ratios
for (i in 1:n){
 Slope1[i] = get(sl.1[i]); Slope2[i] = get(sl.2[i])
 Ratios[i] =  Slope2[i] / Slope1[i]
}
final = cbind(dates, Slope1, Slope2, Ratios)
final = final[order(final[,1], final[,2], decreasing=F) ,]
rownames(final) = 1:n; final
plot(T[,1], T[,2], xlab="Year", ylab= "ERSSTv4 GMST (deg. C) rel. to 1971-2000 mean", type="l", lwd=2)

Did the rate of global warming decline around year 2000 or not?

Update, 6-6-15: I’m revising this post based on an issue that Gavin Schmidt of NASA made me aware of on Twitter, involving the averaging of trends. It will change the results I’ve given below. Stay tuned.

I don’t like having to write posts like this one but I’m at a loss right now. I’m going to keep it short, if for no other reason than the depressing quality of this so-called “debate”.

People following the climate change debate probably know that Karl et al came out with a paper in Science yesterday (Supplement here). Try to give it a read if you can. The lead author is a NOAA scientist and it was heavily promoted by that agency on Twitter and their agency blog. I counted at least 4 or 5 Twitter announcements, with bold headlines regarding the paper’s findings. Gavin Schmidt and Doug McNeall both wrote blog pieces that I thought did a decent to good job tempering the NOAA hoopla. Judith Curry also wrote one, more critical, but I haven’t read it yet. I’m going to take it a step further here.

The basic claim of the paper is that the multiply- and variously-named “hiatus”, “slowdown”, etc., in global mean surface temperatures over the last 15 to 17 years or so is basically non-existent statistically. The last sentence of the abstract states “These results do not support the notion of a “slowdown” in the increase of global surface temperature“, and the last paragraph states:

“In summary, newly corrected and updated global surface temperature data from NOAA’s NCEI do not support the notion of a global warming “hiatus.” As shown in Fig. 1, there is no discernible (statistical or otherwise) decrease in the rate of warming between the second half of the 20th century and the first 15 years of the 21st century.”

That figure is not particularly helpful as far as trend differences go. If you look instead at their data table in the Supplement (Table S1), it doesn’t show what they claim there, at all. Look at the table–the trend from 1951 to 2012, based on their new data/method with ocean temperature biases accounted for (“New”), gives a decadal warming rate of 0.129 degrees C. The rate from 1998 to 2012 is given as 0.086. This means the rate from 1951 to 1997, which is what you need for the proper rate comparison, but which is not reported(!), must therefore be:

0.086*(15/62) + x*(47/62) = 0.129, or x = (0.129 – .028) * 62/47 = 0.133
(fractions of 62 are the time period weights).

So that’s 0.133 vs 0.086, which is a ratio of about 1.55. If one wants to use the year 2000 as the demarcation point between the two periods, then the comparison’s a little trickier because Karl et al don’t actually report an estimated rate from 2000 to 2012. But one can estimate it by assuming the rate difference between the two periods 1998-2014 and 2000-2014, both of which they do report, is a decent approximation of the difference between the 1998-2012 and 2000-2012 periods. When I do so, I get a ratio of warming rates between the two periods (1951-1999 and 2000-2012) that’s very similar to the first: 1.47. Taking the inverses, the rate of warming in the slowdown period relative to the pre-slowdown period is about 2/3. Given the large number of data points (grid boxes x time points) that make up each year’s mean value, it has to be essentially certain that, according to these very data, there has in fact been a very statistically significant difference in warming rates between these two periods, regardless of whether you use 1998 or 2000 as the breakpoint year.

Note that I’m excluding any data post-2012 here, because they really shouldn’t have been including those years’ data when analyzing the effect of data biases on trends going back to 1951–that’s just additional data for more up-to-date trend estimates relative to the AR5, which is another issue entirely. And, although the authors discuss the decadal warming rates between variously defined periods, it is confusing and not at all a direct apples-to-apples comparison. In fact, it’s misleading.

So what gives on this?

As always, bring a definite argument to the table, preferably with hard numbers.

Here in the going, going, gone

Dark laughter on the teeter-totter
An old song floats across the water
I know I should pack up and move on
One note Johnnies proliferate
The wind rises, the hour is late
Here in the going, going, gone

My heart ain’t mine, my heart is yours
Or else I left it out-of-doors
Like a baseball glove out on the lawn
I’d walk through fire to retrieve it
But still you never would believe it
Here in the going, going, gone

Everywhere you look you see
More of you and more of me
Scrambling for the goods; the lines are drawn
Peace and quiet–is there any?
We are the beautiful too many
Here in the going, going, gone

Greg Brown

Walls of Red Wing

Oh, the age of the inmates
I remember quite freely:
No younger than twelve,
No older ‘n seventeen.
Thrown in like bandits
And cast off like criminals,
Inside the walls,
On the grounds of Red Wing.

From the dirty old mess hall
You march to the brick wall,
Too weary to talk
And too tired to sing.
Oh, it’s all afternoon
You remember your home town,
Inside the walls,
The walls of Red Wing.

The night aimed shadows
Through the crossbar windows,
And the wind punched hard
To make the wall-siding sing.
It’s many a night
I pretended to be a-sleepin’,
Inside the walls,
The walls of Red Wing.

Oh, some of us’ll end up
In St. Cloud Prison,
And some of us’ll wind up
To be lawyers and things,
And some of us’ll stand up
To meet you on your crossroads,
From inside the walls,
The walls of Red Wing.

Bob Dylan, 1963