On trend estimates in auto-correlated data

I’ve been doing some investigations on the detection of trends, using simulation, because I encountered some real confusion (again) on the topic this week in a discussion that referred back to a remarkable climate change-related discussion at Bart Verheggen’s site a couple of years ago. That discussion ran to over 2000 comments, much of it on the validity or invalidity of trend estimates, with special reference to temperature change and its potential causes over the instrumental record period. The same types of arguments have also been made elsewhere, including in the scientific literature. It’s easy to get this issue wrong (see here for a current example) through mistaken or incomplete technical arguments. I’ll get into those issues in a second post. Here I just want to show some interesting results, which surprised me, regarding what auto-correlation does to trend lines fit by different methods.

The issue here is that if you have auto-correlation (“red” noise) in a time series, you will get an increased frequency of spurious trends when fitting a simple linear regression line, i.e. a straight line, to the series, compared to what you’ll get if there is no auto-correlation (i.e. the data are “white” noise). The frequency of these spurious trends increases with the strength of the auto-correlation. I wanted to see if this was still the case if the regression lines were instead fit by a more “robust” fitting method, by which I mean one that does not depend on minimizing the sum of the squared differences of the residuals from the trend line. That squared-residual procedure is referred to as Ordinary Least Squares, or “OLS” fitting, and is the standard, time-honored method.

However, OLS methods do not account for auto-correlation in the residuals from the fitted line (residual = actual value minus trend-line value), and this inflates the variation in the slope estimates, causing more than five percent of them to come out significant at the standard p = .05 level. (Five percent is the false-positive rate you expect from OLS fitting when there is in fact no real trend.) But I wasn’t sure whether that would still hold if other methods of line fitting were used, ones in which the “leverage” of individual data points near the ends of the series is not as problematic as it is in OLS fitting. And the very clear answer from the results is no, it does not hold: the robust methods produce far fewer spurious trend estimates in the presence of (even extremely high) auto-correlation.
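To make that inflation concrete, here is a minimal R sketch (not the script used for this post; the series length, seed and a value are arbitrary assumptions) that estimates how often an OLS trend fitted to pure, trendless AR(1) noise comes out “significant” at p = .05:

    ## How often does an OLS trend on trendless AR(1) noise look significant?
    set.seed(1)
    n_series <- 1000   # number of simulated series (assumed)
    n        <- 100    # length of each series (assumed)
    a        <- 0.75   # lag-1 autocorrelation
    spurious <- replicate(n_series, {
      x   <- as.numeric(arima.sim(model = list(ar = a), n = n))
      fit <- lm(x ~ seq_len(n))
      summary(fit)$coefficients[2, 4] < 0.05   # p-value of the fitted slope
    })
    mean(spurious)   # comes out well above the nominal 0.05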

The simulation involves generating time series with four different levels of auto-correlation.  The model is “AR(1)”, where AR stands for Auto-Regressive and 1 indicates that the auto-correlation is at time lag = 1.  This model is simply:

x(t) = a * x(t-1) + w

where x is the variable in question, t is time, w is white noise drawn from a normal distribution, and a is the lag-1 correlation coefficient.  This model simply says that values of the variable at succeeding time points are similar to each other, in direct proportion to the value of a.  When a = 0 there is no similarity and the series is defined entirely by w, which is random “white” noise.  The individual values are then completely independent of each other, and there is no ability to predict the next value in the series (time t+1) from the value at time t.  Conversely, when a = 1 there is very high similarity between succeeding values, the number of effectively independent values is greatly reduced, and the ability to predict one time step ahead is greatly increased. When a = 1, the series is referred to as a “random walk”, and it gives a spurious, statistically significant slope the vast majority of the time, on the order of 70-80 percent.
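For anyone who wants to play along, here is a minimal sketch of that recursion written as a plain loop (the function name and settings are my own, not from the post), which also allows the a = 1 random-walk case:

    ## x(t) = a * x(t-1) + w, with w ~ Normal(0, sd)
    simulate_ar1 <- function(n, a, sd = 1) {
      w <- rnorm(n, mean = 0, sd = sd)   # the white-noise term
      x <- numeric(n)
      x[1] <- w[1]
      for (t in 2:n) x[t] <- a * x[t - 1] + w[t]
      x
    }

    set.seed(42)
    white <- simulate_ar1(100, a = 0)     # pure white noise
    red   <- simulate_ar1(100, a = 0.75)  # strongly autocorrelated
    walk  <- simulate_ar1(100, a = 1)     # random walk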

I used values of a = {0, 0.50, 0.75, and 1.0} here to illustrate these concepts in the next four graphs.  I ran 1000 simulations for each value of a, recording the slopes of the trend lines fit by two different methods. The linear regression model is the standard y = b*x + c, where b is the slope. The robust line fit uses a standardized average slope taken over all pairs of successive time points. There are other possible robust methods as well, such as Tukey’s resistant line (function “line” in R, although I could not discern exactly how it works from the R code).
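Because the post describes the robust fit only loosely (an average slope over successive pairs of points), the sketch below uses the closely related Theil-Sen estimator, the median of all pairwise slopes, as a stand-in alongside the standard lm() fit; it is not necessarily the exact estimator used for the figures:

    ## OLS slope vs. a robust (Theil-Sen, median-of-pairwise-slopes) slope
    theil_sen_slope <- function(x) {
      tt <- seq_along(x)
      ij <- combn(length(x), 2)      # all pairs of time points (i < j)
      median((x[ij[2, ]] - x[ij[1, ]]) / (tt[ij[2, ]] - tt[ij[1, ]]))
    }

    set.seed(7)
    x <- as.numeric(arima.sim(model = list(ar = 0.75), n = 100))
    ols_slope    <- unname(coef(lm(x ~ seq_along(x)))[2])
    robust_slope <- theil_sen_slope(x)
    c(ols = ols_slope, robust = robust_slope)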

[Figures: simulated time series and fitted trend lines for AR(1) with a = 0, 0.50, 0.75, and 1.0]

In each case, the lm-fitted (OLS) trend line (solid red) shown is the one corresponding to exactly p = .05.  You will get a slope that large or larger 5 percent of the time, even though there is in fact no actual trend in the time series.  The robust line (dashed red) is also shown, to indicate the difference.  The white noise series (black line), and its trend line, are also shown, to give an idea of how auto-correlation changes the nature of the time series.  The only difference between the white noise and red noise series in each graph is the value of the parameter a: the mean and standard deviation of the series are identical in all cases.  Note that in the top figure, in which a = 0, the black and red lines showing the raw time series overlie each other exactly, as they should (because the red series in that case is just white noise).
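A rough sketch of how one such panel can be drawn follows (the colors and the a value here are assumptions, and unlike the post’s figures this sketch does not bother matching the means and standard deviations of the two series):

    ## One panel: white noise vs. AR(1) noise, each with its OLS trend line
    set.seed(99)
    n     <- 100
    white <- rnorm(n)
    red   <- as.numeric(arima.sim(model = list(ar = 0.75), n = n))

    plot(red, type = "l", col = "red", xlab = "time", ylab = "x",
         main = "AR(1), a = 0.75 (red) vs. white noise (black)")
    lines(white, col = "black")
    abline(lm(red   ~ seq_len(n)), col = "red")              # OLS fit, red noise
    abline(lm(white ~ seq_len(n)), col = "black", lty = 2)   # OLS fit, white noise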

The take home message here is that traditional linear modeling using OLS is frequently fooled when auto-correlation is present in the data, giving very wrong (exaggerated) trend estimates, whereas robust line fitting is not so fooled: it returns a far superior estimate of the true trend in all cases.  Interesting and important.

There is a bigger issue here however.  This fact has been used by some to argue, sometimes adamantly as in the linked article at the top, that strong auto-correlation (as in e.g., a random walk) weakens or invalidates many trend estimates.  This argument is not correct however, as there is more involved in deciding that question than just whether the observed data are strongly auto-correlated or not.  Specifically, it ignores the very important concept embedded in maximum likelihood estimation, a bedrock concept in statistical analysis, one especially important when evaluations of different possible causes of an observed effect are the goal.  That topic’s more important than this one;  I’ll detail it in a future post.

p.s. I just noticed that Tamino also has a brand new post up on this same topic, but emphasizing some slightly different aspects of it.

6 thoughts on “On trend estimates in auto-correlated data”

  1. Sorry to be very late on this, but I have only just looked at it. I am not quite clear what you have done. Did you generate a series using the equation x(t) = a*x(t-1) + w (equation 1) and then fit a trend line to the resulting series of the form x(t) = b*t (equation 2), where t is a time index? If you did, then equation 2 is misspecified, which will show up in the residuals, which you have not plotted. This was VS’s first point in the thread you mention. If the equation is misspecified it should be respecified, in this case to include the lagged value of the dependent variable.
    More generally the VS thread was about, among other things, whether temperature was best described as a stochastic trend (in the simple case of only first order autocorrelation, a random walk) or as varying randomly around a deterministic trend. In the first order case the stochastic trend is
    x(t) = x(t-1) + w
    where w is white noise.
    The deterministic trend is
    x(t) = b*t + z
    where z is white noise.
    A natural way to distinguish between these formulations (adding for completeness a drift term d) is to estimate the following equation
    x(t) = d + a*x(t-1) + b*t + u
    where u is white noise, and test whether a=1 and b=0.
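    In R, the regression described here might be set up roughly as follows (a sketch only, with an arbitrary simulated random walk standing in for the data; note that under a unit root the usual t-statistics require Dickey-Fuller-type critical values rather than the standard ones):

        ## x(t) = d + a*x(t-1) + b*t + u, then look at whether a = 1 and b = 0
        set.seed(5)
        x   <- cumsum(rnorm(150))                      # a simulated random walk
        n   <- length(x)
        dat <- data.frame(x = x[-1], x_lag = x[-n], trend = 2:n)
        fit <- lm(x ~ x_lag + trend, data = dat)
        summary(fit)    # "x_lag" estimates a, "trend" estimates b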

    • Right, that’s what I did, for four different values of “a”. It’s not that “equation 2”, as you call it, is mis-specified as you state, but simply that the value of b is inflated. The specification, which refers to the form of the equation and not the parameters themselves, is by definition a straight line. We’re just trying to see if there’s a linear trend in the data (the series generated by your “equation 1”), and for that you simply use y = b*x + c.

      That thread went all over the place and he said many different things, but the main point I remember was the claim that a random walk for global temperature over the last 150 years or so cannot be ruled out. The key concept implicit in that statement is that the fluctuations from the mean are random and trend-less. This by itself is a very specious statement, but he immediately gave away his ignorance when he said that the increase in atmospheric CO2 is also a random walk, which is 100% nonsense, contradicted by all kinds of physical evidence. So his subsequent tests for “co-integration” between the two, and all the other stuff he went into, are just a huge cloud of smoke. If you want to argue that CO2 increases are not a driver of T over that time period, fine, go ahead, but that’s not the way to go about it; you need to bring physical evidence into the discussion. He was shot down numerous times on this point, by many different people, but he never appeared to understand this simple fact. All this “co-integration test” focus is prominent in fields like economics, where a whole bunch of variables with unknown real-world relationships to each other are being data-mined. It has little value when you have other cause-and-effect information on your variables, which is why nobody performs such tests in that situation.

      The other, closely related point here is that a random walk (where a = 1.0) does not by itself prove anything, simply because no specific autocorrelation value proves anything about the underlying process causing that autocorrelation. It just increases the chance of a spurious correlation, relative to what you would get for the same sample size with no autocorrelation. That’s part of the point of my post.

    • Jim,
      There are a number of misconceptions here, not all of which I can address. Let’s start with the misspecification issue. I do not mean this as a vague term of abuse; I am using the term as it is used in econometrics. Econometrics textbooks these days are full of tests for misspecification, i.e. for residuals from the equation that are not iid. As far as I can see the residuals (r) from the process you describe are
      r = a*x(t-1) - b*t + w
      These are not iid, and therefore the usual significance tests will be misleading. The correct specification in this case we know, by construction, to be a simple regression of x on lagged values of itself. Using this specification will recover a good estimate of a, and if you include a linear time trend as well, the coefficient on that trend will be insignificantly different from zero, leading to the correct conclusion that there is no deterministic trend. I suggest you consult some textbook such as the one I have by me now, An Introduction to Applied Econometrics by Kerry Patterson, sections 6.3 and 6.4.
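      The residual check itself is easy to do in R; a small sketch, with a simulated series standing in for the real data:

          ## Fit a straight line to AR(1) data and inspect the residuals
          set.seed(3)
          x   <- as.numeric(arima.sim(model = list(ar = 0.9), n = 200))
          fit <- lm(x ~ seq_along(x))
          acf(resid(fit))                                   # strong lag-1 autocorrelation
          Box.test(resid(fit), lag = 1, type = "Ljung-Box") # residuals are not iid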

      You also mischaracterise economics. Before the unit root revolution the basic econometrics approach was not data mining. Instead it assumed the underlying theory was true and concentrated on using the data to derive the best estimates of the unknown parameters. The problem with this approach was that it did not allow theories themselves to be tested.

      VS did not say that CO2 was a random walk. The whole point of the Beenstock paper, which he was explaining, was that temperature was integrated of order one (meaning that taking first differences produces a stationary series), the simplest realisation of which is a random walk, but CO2 was integrated of order two, requiring differencing twice to get a stationary series. A random walk is not integrated of order two.

      Finally note that equations like my equation 1 are not theories. They are supposed to be concise descriptions of the data in terms of a stochastic process that has white noise residuals. If temperature numbers can be described as a random walk with white noise residuals that is a fact that requires explanation.

    • Mike, the first thing you need to do is to explain fully and carefully why you would be using terms and concepts from econometrics in a climate science discussion. When somebody like VS wanders into such a discussion using such terms and ideas, is it any wonder that he creates a lot of confusion? You’re doing exactly the same thing. I don’t walk into a room full of electrical engineers and start using biology terms and concepts to explain why they’re supposedly wrong on something or other. That’s not going to work. Same thing here. VS never had a clue as to that whole concept. Now you’re telling me to go read some econometrics textbook? Uh, no; I think I’ll instead tell you to go read any number of hundreds of books on how inferences and conclusions are made in science. Key point: you don’t do it by intensively analyzing the autocorrelation structure of single variables, or even of multiple variables.

      Mis-specification has nothing to do with residuals from a regression, but rather refers to getting the structural form of an equation wrong, such that no matter what choice of parameters is made, the estimate will have structurally-induced errors that get the relationship (i.e. the signal estimate) wrong, as opposed to just sampling errors. That’s the use of the term I’m familiar with in science.

      Second, your second paragraph there is exactly my point: autocorrelated data give misleading trend estimates, so the question arises, “Might there then be an estimator that gives a better estimate?” That’s the point of this post, but you don’t seem to understand this for some reason, because you’re hung up on something else altogether.

      Third, I never said anything about econometric practice before the “unit root revolution”; the point was that economic studies often have no solid idea of the cause-and-effect relations between observed variables, hence the many statistical tests designed to estimate the probability of such relations.

      Fourth, the guy most certainly did say that CO2 could be explained as a random walk; go back and read the thread, it’s somewhere in one of his first few comments, mixed in with the put-downs of various people and other derogatory comments that are the dead giveaway of somebody with an agenda behind their viewpoint. He then proceeds to try to defend it with complex discussions of Dickey-Fuller tests, co-integration and so forth.

      The bottom line is that it’s just about 100% unclear why both VS and you are hung up on the whole co-integration and random walk issues.

  2. The concepts I am using are not specific to econometrics but are common parts of time series analysis. The whole point about this literature is that the statistical analysis of observed data, by which I mean data that is generated by a system outside our control so that experiment is not possible, raises a host of difficult issues which do not arise in the statistics of controlled experiments. Parts of climate science would seem to share this situation with much of economics, including the analysis of global temperature.

    You were the one who chose to post about estimating trends with autocorrelated data. This is a problem that has been extensively studied in the econometrics literature and there is nothing peculiar to economics about this analysis. The “complex” discussions of Dickey Fuller tests and co-integration are second year textbook standard. My use of “misspecification” is standard statistical terminology.

    The example you gave is one where we know the process which generated the data. It is a simple autoregressive process with no trend. Fitting a trend to the data is simply not a sensible thing to do. The straight line, as you say, gives a very poor representation of the data. With real data we don’t know the generating process, but we can test to see if it is likely that a particular process has generated the data. Looking at the residuals from the postulated mechanism gives some evidence. It’s not a question of finding a better estimator but of finding a better specified model.

    • You are absolutely missing the point here, such that it’s difficult to know where to begin. Similar to VS at the other thread.

      We have an observed time series and we want to estimate the magnitude of the trend in that series. We do not know the process that generated the data; we simply want to estimate any possible trend in it. So we choose the simplest possible trend model, which is a straight line, and fit it to the data, estimating the slope, b. We know that the residuals are autocorrelated and that the standard error is therefore under-estimated, so the true confidence intervals are wider than they would be under a white noise condition. Whatever the slope estimate is, it’s still the most likely trend value, regardless of how much autocorrelation there is in the data.
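      One common rough way of accounting for that (a sketch only, not necessarily what anyone in this thread would do) is to estimate the lag-1 autocorrelation of the residuals and inflate the OLS standard error using an effective sample size:

          ## Widen the OLS slope's standard error for AR(1)-like residuals,
          ## using n_eff = n * (1 - r) / (1 + r)
          set.seed(11)
          n   <- 100
          x   <- as.numeric(arima.sim(model = list(ar = 0.75), n = n))
          tt  <- seq_len(n)
          fit <- lm(x ~ tt)

          r      <- acf(resid(fit), plot = FALSE)$acf[2]      # lag-1 autocorrelation
          n_eff  <- n * (1 - r) / (1 + r)                     # effective sample size
          se_ols <- summary(fit)$coefficients["tt", "Std. Error"]
          se_adj <- se_ols * sqrt(n / n_eff)                  # widened standard error
          c(se_ols = se_ols, se_adj = se_adj)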

      It’s not a matter of getting a “better specified” model; it’s a matter of coming up with a better estimate of the trend in the data and then looking for causes of that trend using external data.

Comments are closed.