Sometimes when you do science, it’s all immersion in the literature: reading what others have done. Other times it’s about exploring things on your own, maybe out of curiosity and because you’ve got the tools to do it, maybe because you get some important insights faster that way, sometimes much faster. There are lots of possible reasons. Lately I’ve been exploring trend estimation issues in R, probably because I’ve been doing a lot of work on trend estimation lately.
I’ve discussed the fact that the ordinary least squares estimator (“OLS”) for linear regression trend estimation does not always appear to be best when the data are highly autocorrelated. I gave previous examples with an extremely simple estimator, based on the mean of the lag-1 slopes, that gives a better answer in those situations. As Nick Stokes and HaroldW pointed out, this equates simply to the last value minus the first, divided by the number of intervals between them; can’t get much simpler than that. Since then, I’ve done some more exploring of exactly the conditions under which that one, and a few others, do better or worse than OLS. [Incidentally, one of them is a parabolic weighting as described mathematically by HaroldW, and sure enough, it follows the OLS estimate almost exactly. My thanks to him for pointing out that interesting fact.]
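To make those two equivalences concrete, here is a minimal Python sketch (not the R code behind this post). The function names are hypothetical, and the parabolic weights k * (n - k) are my assumption about the form of the weighting HaroldW described. On a small made-up series, the mean of the lag-1 slopes matches the endpoint difference divided by n - 1, and the parabolically weighted mean of the lag-1 slopes matches the OLS slope.

```python
import math

def lag1_slopes(y):
    """Differences between consecutive values (unit time step assumed)."""
    return [b - a for a, b in zip(y, y[1:])]

def mean_lag1_slope(y):
    """Unweighted mean of the lag-1 slopes."""
    d = lag1_slopes(y)
    return sum(d) / len(d)

def endpoint_slope(y):
    """Last value minus first, divided by the number of intervals."""
    return (y[-1] - y[0]) / (len(y) - 1)

def parabolic_weighted_slope(y):
    """Mean of lag-1 slopes with assumed parabolic weights k * (n - k)."""
    n = len(y)
    d = lag1_slopes(y)
    weights = [k * (n - k) for k in range(1, n)]
    return sum(w * s for w, s in zip(weights, d)) / sum(weights)

def ols_slope(y):
    """Ordinary least squares slope of y against t = 0, 1, ..., n - 1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    return sxy / sxx

# Made-up illustrative series, not data from the post.
y = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]
print(mean_lag1_slope(y), endpoint_slope(y))
print(parabolic_weighted_slope(y), ols_slope(y))
```

The first identity holds because the differences telescope, so every interior value cancels out of the sum.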
This issue is important because it affects confidence in trends in autocorrelated data, of which there are a lot. It’s well known that positive autocorrelation makes the standard errors and confidence intervals computed by standard methods too narrow, exaggerating the statistical significance of the estimates. There are ways to correct for this (see note at end on that). However, there is another problem here: the OLS-derived estimate of the most likely (point) value is itself often thrown off, i.e. it is biased. So you can correct for overly narrow confidence intervals, but you will often be doing so around a point estimate that is itself mis-estimated. The correction procedures give a more realistic confidence interval, but that interval surrounds a biased value. The value of doing this is therefore questionable.
The good news is that the problem appears to get serious only when the autocorrelation is toward the high end. Below a lag-1 alpha coefficient of about 0.75 in the AR1 model (x(t) = alpha * x(t-1) + w, where w is white noise), the OLS estimator appears to be as good as or better than the few others I’ve tested. Above 0.75, however, one or more of these estimators give superior estimates to OLS, as judged by both the accuracy and precision (variance) of the mean estimate. When a random walk state is reached (alpha = 1.0) the difference is substantial and sustained. The most extreme OLS estimates are the ones having the greatest bias. This is true for alpha < 0.75 also, and the gray areas all involve combinations of lower AR1 alpha values with the higher percentiles of the OLS-derived slope distribution.
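The kind of comparison described above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code behind these results, and it covers only OLS and the endpoint-difference estimator. The names `simulate_ar1`, `ols_slope`, `endpoint_slope` and `compare` are hypothetical. The series length of 100, the unit-variance Gaussian noise and the choice of starting value are all assumptions on my part.

```python
import random
import statistics

def simulate_ar1(n, alpha, rng):
    """Simulate x(t) = alpha * x(t-1) + w with no underlying trend."""
    # Assumption: w is Gaussian with unit variance; the series starts from
    # the stationary distribution when alpha < 1, and from zero when
    # alpha = 1 (random walk).
    x = rng.gauss(0.0, (1.0 - alpha ** 2) ** -0.5) if alpha < 1 else 0.0
    series = [x]
    for _ in range(n - 1):
        x = alpha * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series

def ols_slope(y):
    """Ordinary least squares slope of y against t = 0, 1, ..., n - 1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    return sxy / sxx

def endpoint_slope(y):
    """Last value minus first, divided by the number of intervals."""
    return (y[-1] - y[0]) / (len(y) - 1)

def compare(alpha, n=100, runs=1000, seed=1):
    """Spread of both slope estimates around the true trend of zero."""
    rng = random.Random(seed)
    ols, end = [], []
    for _ in range(runs):
        y = simulate_ar1(n, alpha, rng)
        ols.append(ols_slope(y))
        end.append(endpoint_slope(y))
    return {
        "ols_mean": statistics.fmean(ols),
        "ols_sd": statistics.pstdev(ols),
        "endpoint_mean": statistics.fmean(end),
        "endpoint_sd": statistics.pstdev(end),
    }

if __name__ == "__main__":
    for alpha in (0.0, 0.5, 0.9, 1.0):
        print(alpha, compare(alpha))
```

Since the simulated series have no trend, the standard deviation of each set of slopes measures how far that estimator typically strays from the true value of zero. Varying `alpha` and `n` shows where the two estimators trade places.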
Here are two results from 1000 simulation runs for alpha = 0.90, at the 5th and 95th percentiles of the distribution of OLS-estimated trends, showing several other trendline estimates as well. Although the spread of the estimates is large, the estimators whose trends fall closest to the true value of zero do so consistently. These results can be used to provide more accurate estimates of linear trends in highly autocorrelated data, which is more important than simply increasing the precision of a potentially biased estimate.
One other thing of note. The practice of computing an autocorrelation-corrected probability (p) value, by adjusting the effective degrees of freedom of the residuals, using the formula N = n * (1-r)/(1+r) where n is the observed d.f. and r is the lag-1 autocorrelation coefficient, does not seem to be effective. This method appears to give far higher numbers of statistically significant results than are obtained by simulation of red noise series and observed frequencies of OLS-fitted line slopes.
Correction: The above paragraph is wrong. The formula given above is referred to as Quenouille’s method, dating to the late 1940s. It provides a correction to the p values computed by OLS regression that, though not perfect, will often improve them greatly. I was applying it incorrectly to the F statistic and distribution. See Nick Stokes’ comments here and elsewhere in that thread, for a discussion.
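For reference, the adjustment itself is simple to compute. The sketch below, in Python with hypothetical function names, only estimates a lag-1 autocorrelation coefficient and applies the formula N = n * (1-r)/(1+r) as given above. It does not show how the adjusted value should be fed into a significance test, which is exactly where I went wrong, and the particular autocorrelation estimator used is an assumption.

```python
def lag1_autocorr(x):
    """Lag-1 autocorrelation of a series (e.g. regression residuals)."""
    # Assumption: the usual sample estimator, with the lag-1 products in the
    # numerator and the full sum of squared deviations in the denominator.
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

def effective_n(n, r):
    """Adjusted degrees of freedom, N = n * (1 - r) / (1 + r)."""
    return n * (1 - r) / (1 + r)

# With r = 0.5, 100 observed degrees of freedom shrink to about a third.
print(effective_n(100, 0.5))
# With r = 0.9, they shrink to roughly 5.
print(effective_n(100, 0.9))
```

The adjustment grows quickly as r approaches 1, which is the same high-autocorrelation region where the choice of slope estimator matters most.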