In this post I’m going to give an example, drawn from my current work, of the kinds of problems that scientists often have to confront and resolve in order to advance the state of knowledge in a field. The topic under investigation involves statistical and mathematical issues that lie at the crux of what scientific research is all about. I’ll give only the briefest introduction to the subject, so as to get quickly to these important, general issues.
The subject area involves the estimation of forest structural conditions in the United States just prior to the time of European settlement (roughly, over the late 18th to late 19th centuries, depending on longitude). By “structure” I mean essentially the tree density (number per unit area) and size class (i.e. diameter) distribution of forest trees, at various spatial scales across the landscape, including the very large (i.e. multi-state). Forest ecologists are interested in this question for primarily ecological reasons, but there are also important ties to climate-related questions, such as estimates of albedo, evapotranspiration and the carbon cycle.
We are lucky, as ecologists go, in that very large, systematically collected tree data sets exist across much of the United States for that time, providing at least the potential for a historical picture that other ecology subdisciplines just don’t have. These data sets were collected by the earliest federal (and sometimes private) land surveyors, and one of them is far and away the most important, encompassing most of the area between Ohio and Florida in the east, and the Pacific Ocean in the west. The central point for my purpose here is that although these data are exceedingly useful just as they are, the surveyors did not collect data on one crucial variable that would make them even more so, by allowing us to estimate forest structure with very high confidence. [This, together with the taxonomic information they did collect, would provide an essentially unequaled historical picture for any ecological variable.]
A logical scientific goal would thus be to try to estimate the parameters (mean, variance, distribution quantiles, etc.) of this missing variable (or at least to definitively rule out values that could not have occurred) from other data that they did collect. That is what I am trying to do; nobody has ever done so, and there is a reasonable possibility of success based on some known mathematics. So, some quick background on the math/stat issues involved, before getting to the problems encountered. One can estimate the density of objects in space if one has (1) measurements from a set of randomly selected points to randomly selected objects, and (2) the rank orders of those objects from those points (i.e. whether the object measured to is the 1st, 2nd, … nth closest object from the point). Our objects in this case are trees. The essential problem is that the surveyors always recorded data for the former variable, but never the latter: we have exact distances but no rank orders. So, my grand goal is to estimate the latter from the former.
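To make the density-from-distances idea concrete, here is a minimal sketch of one classic rank-order-based estimator, the unbiased point-to-k-th-nearest-object estimator attributed to Pollard (1971). This is an illustration of the general technique, not necessarily the estimator used in this work; it is in Python rather than the R the work is done in, and the plot size, density, and sample counts are invented for the example.

```python
import numpy as np

def pollard_density(r_k, k):
    """Unbiased density estimate from distances measured at n random
    sample points to each point's k-th nearest tree (Pollard 1971)."""
    r = np.asarray(r_k, dtype=float)
    n = r.size
    return (n * k - 1) / (np.pi * np.sum(r ** 2))

# Sanity check on a simulated random forest of known density.
rng = np.random.default_rng(42)
true_density = 0.05                       # trees per unit area (made up)
side = 200.0                              # square plot side length (made up)
trees = rng.uniform(0, side, size=(rng.poisson(true_density * side**2), 2))

k = 3                                     # rank order: 3rd-nearest tree
pts = rng.uniform(20, side - 20, size=(500, 2))  # buffer avoids edge bias
d2 = ((pts[:, None, :] - trees[None, :, :]) ** 2).sum(axis=2)
r3 = np.sqrt(np.sort(d2, axis=1)[:, k - 1])      # 3rd-nearest distances

estimate = pollard_density(r3, k)         # should land near true_density
```

The crucial point, visible in the code, is that the estimator needs to know which rank (`k`) each measured distance corresponds to; with the rank unknown, as in the survey data, this formula cannot be applied directly.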
Leaving aside field-based approaches (useful sometimes, but limited for various reasons), there are two very different (100% independent) ways to go about this: theoretical and empirical. These correspond very closely to mathematical versus statistical/simulated approaches (“empirical” in this context refers to simulated data with known statistical parameters). The central point here is that results from the two approaches must support each other if claims to the validity of the new method are to be made. We can also explore here exactly what we mean by the terms “theoretical” and “empirical”.
First, the theoretical approach. By “theoretical”, I mean the prediction of some unknown, more complex (or “higher-level”) relationship using known, lower-level mathematical relationships. In this case, there are two lower-level relationships that can be combined, and without going into any detail, these comprise (1) standard multinomial probability models with four possible outcomes in each of n independent trials, and (2) exact distributions of point-to-object distances as derived long ago by mathematical ecologists (e.g. Skellam, 1952; Pollard, 1971) from the much older Poisson distribution. Very briefly, by combining these two relationships mathematically, one can compute the exact distributions of the ratios of distances for any pair of trees (and there are from one to six such pairs of trees at each sample point in the surveyors’ data against which to test them).
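The post deliberately skips the details, but to give a flavor of the kind of lower-level result involved (a standard consequence of the Poisson model, not necessarily the exact derivation used in this work): under complete spatial randomness with intensity λ, the quantities λπr_k² behave like arrival times of a unit-rate Poisson process, so for j < k the squared distance ratio (r_j/r_k)² follows a Beta(j, k − j) distribution. In particular, (r₁/r₂)² is uniform on (0, 1), so the ratio r₁/r₂ has mean 2/3. A quick Python sketch sampling that theoretical distribution directly:

```python
import numpy as np

# Arrival times of a unit-rate Poisson process are sums of
# Exponential(1) gaps; t_j / t_k ~ Beta(j, k - j) for j < k.
rng = np.random.default_rng(0)
n = 1_000_000
t1 = rng.exponential(size=n)          # plays the role of lam*pi*r1^2
t2 = t1 + rng.exponential(size=n)     # plays the role of lam*pi*r2^2

sq_ratio = t1 / t2                    # (r1/r2)^2: theoretically Uniform(0,1)
ratio = np.sqrt(sq_ratio)             # r1/r2: theoretical mean 2/3
```

With a million draws, the sample means of `sq_ratio` and `ratio` settle very close to their theoretical values of 1/2 and 2/3.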
Second, the empirical approach, which is very straightforward, but not as exact. I simply create simulated tree locations for any designated spatial pattern, measure the ratios of distances between pairs of trees over many thousands of such pairs, and then estimate the expected distributions and associated parameters therefrom. This type of approach is generally referred to as a “Monte Carlo” method, referencing the element of chance involved (at the level of the individual trial, but not asymptotically, where definite predictability and repeatability emerge).
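A minimal sketch of such a Monte Carlo check (again in Python rather than the R used for the actual work, with invented plot dimensions and density): scatter trees at random, measure first- and second-nearest distances from many random sample points, and see whether the distance-ratio distribution matches the theoretical expectation for a random pattern, namely that (r₁/r₂)² is uniform on (0, 1).

```python
import numpy as np

rng = np.random.default_rng(1)
side, density = 100.0, 0.1                    # arbitrary example values
trees = rng.uniform(0, side, size=(rng.poisson(density * side**2), 2))

# Many random sample points, buffered away from the plot edges.
pts = rng.uniform(10, side - 10, size=(2000, 2))
d2 = ((pts[:, None, :] - trees[None, :, :]) ** 2).sum(axis=2)
d2.sort(axis=1)                               # squared distances, ascending

sq_ratio = d2[:, 0] / d2[:, 1]                # (r1/r2)^2 at each point
ratio = np.sqrt(sq_ratio)                     # r1/r2
# For a random (Poisson) pattern, the sample means should sit near the
# theoretical values: 1/2 for (r1/r2)^2 and 2/3 for r1/r2.
```

The element of chance is obvious at the level of any single sample point, but over thousands of points the empirical distribution converges on a repeatable answer, which is exactly what makes the comparison against the theoretical computation meaningful.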
And so this is exactly what I have done to date: computed expected distributions of distance ratios using these two completely independent approaches. Great! So the results from the two methods are very nearly identical (asymptotically), and a major advancement has been made, right? No, they are not, and no, it has not; not yet anyway. The numbers I’m getting from the theoretical approach range roughly from 1x to 2x of the values I’m getting from the Monte Carlo simulations.
And this unexpected and highly disappointing non-equality is exactly the point of this post.
Since I know the two sets of results must line up, I need to find out exactly why they do not. The programming algorithm (in R) for the theoretical computations was a multi-step process that I had to think through carefully at each step, whereas the Monte Carlo approach was pretty simple, so the likelihood of a mistake somewhere seems much greater in the former. But I have now gone over each step several times. I have found some results that should not occur, and cannot pinpoint their cause or origin; but they appear in only part of the output, and though troubling, they are small and do not seem to explain the discrepancy between the two approaches. Each step in the algorithm in fact appears to be completely fine.
It is at this point that certain things have been observed to occur. These include head grabbing, muttering, and swearing, among several others. I can’t figure out what’s wrong, but I know something is. And until I figure it out, I can’t publish the method, and of course I can’t apply it to the tree data sets it’s designed for. I’m stuck, and the swearing, though sometimes fun and creative, doesn’t seem to solve it. Nor does more coffee. And you can only watch so much hockey and baseball.
Addressing complex questions in science, or trying to solve previously unsolved problems, is not easy. There are almost always a whole host of points at which you can get something wrong: you made a bad assumption or a bad calculation, didn’t understand the domain of applicability of a mathematical relationship, programmed something wrong, forgot to include an important variable, and so forth. It’s not simple and it’s not straightforward.
What if I didn’t have the Monte Carlo results against which to compare the theoretically derived results? What impression would I be under? The answer to that is clear and obvious: since I’ve gone over the algorithm several times, I would be convinced that my theoretically derived numbers were correct. I would then proceed to apply the method to the old tree data sets, wherein, if they were not correct, I would of course come to wrong conclusions about the rank orders of tree distances and hence, of the desired forest structural estimates. I’d publish this and some people would be impressed and say, wow that’s nifty, novel and important, a significant breakthrough for sure.
But the numbers, and hence the message of the paper, would likely be wrong. And nobody would be the wiser until they went through the same process.