Analytical problems in science; an example

In this post I’m going to give an example, from the work I’m doing at the moment, of the kinds of problems that scientists often have to confront and resolve in order to advance the state of knowledge in a given field. The topic under investigation involves statistical and mathematical issues that are fundamental to what scientific research is all about. I’ll give the briefest of introductions to the subject so as to get quickly to these important, general issues.

The subject area involves the estimation of forest structural conditions in the United States just prior to the time of European settlement (roughly, over the late 18th to late 19th centuries, depending on longitude). By “structure” I mean essentially the tree density (number per unit area) and size class (i.e. diameter) distribution of forest trees, at various spatial scales across the landscape, including the very large (i.e. multi-state). Forest ecologists are interested in this question for primarily ecological reasons, but there are also important ties to climate-related questions, such as estimates of albedo, evapotranspiration and the carbon cycle.

We are lucky, as ecologists go, in that very large, systematically collected tree data sets exist across much of the United States for that period, providing at least the potential for a historical picture that other ecology subdisciplines just don’t have. These data sets were collected by the earliest federal (and sometimes private) land surveyors, and one of them is far and away the most important, encompassing most of the area between Ohio and Florida in the east, and the Pacific Ocean in the west. The central point for my purpose here is that although these data are exceedingly useful just as they are, the surveyors did not collect data on one crucial variable that would make them even more so, by allowing us to estimate forest structure with very high confidence. Together with the taxonomic information collected, that would provide an essentially unequaled historical picture for any ecological variable.

A logical scientific goal would thus be to try to estimate the parameters (mean, variance, distribution quantiles, etc.) of this missing variable, or at least to definitively rule out certain values that could not have occurred, from other data that the surveyors did collect. This is what I am trying to do; nobody has ever done so, and there is a reasonable possibility of success based on some known mathematics.

So, some quick background on the math/stat issues involved, before getting to the problems encountered. One can estimate the density of objects in space if one has (1) measurements from a set of randomly selected points to randomly selected objects, and (2) the rank orders of those objects from those points (i.e. whether the object measured to is the 1st, 2nd, … nth closest object to the point). Our objects are trees in this case. The essential problem is that surveyors always recorded data for the former, but never the latter, variable: we have exact distances but no rank orders. So, my grand goal is to estimate the latter from the former.
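To make the role of the rank order concrete, here is a minimal sketch in Python (not the code used for this work). It assumes a completely random (Poisson) tree pattern on a unit square, and it uses the standard unbiased estimator for that case, density = (n·k − 1) / (π · Σ r²), where r is the distance to the k-th nearest tree from each of n sample points. The function names, the density of 2000 and the sample size of 200 are all illustrative assumptions.

```python
import numpy as np

def kth_nearest_distances(points, trees, k):
    """Distance from each sample point to its k-th nearest tree (k = 1 is the nearest)."""
    diffs = points[:, None, :] - trees[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # np.partition puts the k-th smallest value of each row at index k - 1
    return np.partition(dists, k - 1, axis=1)[:, k - 1]

def density_estimate(distances, k):
    """Unbiased density estimator under a Poisson pattern: (n*k - 1) / (pi * sum r^2)."""
    n = len(distances)
    return (n * k - 1) / (np.pi * np.sum(distances ** 2))

rng = np.random.default_rng(42)
true_density = 2000.0                      # trees per unit area (assumed)
n_trees = rng.poisson(true_density)        # Poisson count on the unit square
trees = rng.uniform(0, 1, size=(n_trees, 2))
# Keep sample points away from the border to avoid edge effects
points = rng.uniform(0.2, 0.8, size=(200, 2))

r2 = kth_nearest_distances(points, trees, k=2)   # truly the 2nd-nearest tree
est_correct = density_estimate(r2, k=2)          # rank known
est_wrong_rank = density_estimate(r2, k=1)       # rank wrongly assumed to be 1

print(f"true density        : {true_density:.0f}")
print(f"estimate, rank known: {est_correct:.0f}")
print(f"estimate, rank wrong: {est_wrong_rank:.0f}")
```

With the correct rank the estimate should land close to the true density; treating second-nearest distances as nearest distances roughly halves it. That is the sense in which distances alone, without rank orders, are not enough.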

Leaving aside field based approaches (useful sometimes, but limited for various reasons), there are two very different (100% independent) ways to go about doing this: theoretical and empirical. These correspond very closely to mathematical, versus statistical/simulated, approaches (“empirical” in this context refers to simulated data with known statistical parameters). The central point here is that results from the two approaches must support each other if claims to validity of the new method are to be made. We can also explore here exactly what we mean by the terms “theoretical” and “empirical”.

First, the theoretical approach. By “theoretical”, I mean the prediction of some unknown, more complex (or “higher level”) relationship, using known, lower-level mathematical relationship(s). In this case, there are two lower-level relationships that can be combined. Without going into any detail, these are (1) standard multinomial probability models with four possible outcomes for each of n independent trials, and (2) exact distributions of point-to-object distances as derived long ago by mathematical ecologists (e.g. Skellam, 1952; Pollard, 1971) from the much older Poisson distribution. Very briefly, one can compute the exact distributions of ratios of distances of any pair of trees by combining these two relationships mathematically (and there are from one to six such pairs of trees at each sample point in the surveyors’ data against which to test them).
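As an illustration of what such an exact ratio distribution looks like, here is a sketch in Python for the simplest possible case only. It assumes a completely random (Poisson) pattern and no selection rules at all, so it leaves out the multinomial component and is not the multi-step computation described in this post. Under that assumption, given the distance to the k-th nearest tree, the k − 1 nearer trees are uniformly distributed within that radius, so each falls inside a fraction t of the radius with probability t², and the ratio of the j-th to the k-th distance has a simple binomial-sum distribution. The function names are hypothetical.

```python
from math import comb

def ratio_cdf(t, j, k):
    """P(r_j / r_k <= t) for the j-th and k-th nearest trees (j < k), Poisson pattern.

    r_j <= t * r_k exactly when at least j of the k - 1 nearer trees lie
    inside the smaller circle, each with probability t**2.
    """
    if not 1 <= j < k:
        raise ValueError("need 1 <= j < k")
    x = min(max(t, 0.0), 1.0) ** 2
    m = k - 1
    return sum(comb(m, i) * x**i * (1 - x) ** (m - i) for i in range(j, m + 1))

def ratio_mean(j, k, steps=20000):
    """E[r_j / r_k], by midpoint-rule integration of 1 - CDF over [0, 1]."""
    h = 1.0 / steps
    return sum((1 - ratio_cdf((i + 0.5) * h, j, k)) * h for i in range(steps))

for j, k in [(1, 2), (1, 4), (2, 4), (3, 4)]:
    print(f"j={j}, k={k}: median CDF check P(ratio <= 0.5) = {ratio_cdf(0.5, j, k):.4f}, "
          f"mean ratio = {ratio_mean(j, k):.4f}")
```

For the nearest and second-nearest trees this reduces to P(ratio ≤ t) = t², with a mean ratio of 2/3. The real problem is harder because the surveyors' tree choices were constrained, which is where the multinomial part comes in.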

Second, the empirical approach, which is very straightforward, but not as exact. I can simply create simulated tree locations for any designated spatial pattern, and measure the ratios of distances between any pair of trees, over many thousands of such pairs, and then estimate the expected distributions and associated parameters therefrom. This type of approach is referred to generally as a “Monte Carlo” method, referencing the element of chance involved (at the level of the individual trial, but not asymptotically, where definite predictability and repeatability emerges).
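A minimal Monte Carlo sketch of this idea, in Python rather than the R used for the actual work, and again assuming only a completely random (Poisson) pattern with no surveyor selection rules. Each trial generates a fresh pattern on a unit square and measures from the centre point; the density of 50, the number of trials and the function name are illustrative assumptions.

```python
import numpy as np

def simulate_ratios(j, k, trials=5000, density=50.0, seed=0):
    """Simulate r_j / r_k (j-th over k-th nearest-tree distance) from a central point."""
    rng = np.random.default_rng(seed)
    centre = np.array([0.5, 0.5])
    ratios = []
    while len(ratios) < trials:
        n = rng.poisson(density)           # Poisson number of trees in the unit square
        if n < k:
            continue                       # too few trees for this trial, redraw
        trees = rng.uniform(0, 1, size=(n, 2))
        d = np.sort(np.linalg.norm(trees - centre, axis=1))
        ratios.append(d[j - 1] / d[k - 1])
    return np.array(ratios)

ratios = simulate_ratios(j=1, k=2)
mean_ratio = ratios.mean()
frac_below_half = np.mean(ratios <= 0.5)

print(f"simulated mean of r1/r2     : {mean_ratio:.3f}")
print(f"simulated P(r1/r2 <= 0.5)   : {frac_below_half:.3f}")
```

In this unconstrained case the squared ratio of nearest to second-nearest distance is uniform on (0, 1), so the simulated mean should settle near 2/3 and the fraction at or below 0.5 near 0.25. That gives a known answer against which a simulation like this can be checked before it is trusted on harder cases.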

And so this is exactly what I have done to date: computed expected distributions of distance ratios using these two completely independent approaches. Great! So the results from the two methods are very nearly identical (asymptotically), and a major advancement has been made, right? No they are not, and no it has not, not yet anyway. The values I’m getting from the theoretical approach range from roughly 1x to 2x those I’m getting from the Monte Carlo simulations.

And this unexpected and highly disappointing non-equality is exactly the point of this post.

Since I know the two sets of results must line up, I need to find out exactly why they do not. The programming algorithm (in R) for the theoretical computations was a multi-step process that I had to think through carefully at each step, whereas the Monte Carlo approach was pretty simple, so the likelihood of a mistake somewhere seems much greater in the former. But I have gone over each step therein several times now. I have found some results that should not occur, and I cannot pinpoint their cause, but they occur in only part of the results and, though troubling, they are small and do not seem to be the cause of the discrepancy between the two approaches. Each step in the algorithm in fact appears to be completely fine.

It is at this point that certain things have been observed to occur. These include head grabbing, muttering, and swearing, among several others. I can’t figure out what’s wrong, but I know something is. And until I figure it out, I can’t publish it, and of course I can’t apply it to the tree data sets that it’s designed for. I’m stuck and the swearing, though sometimes fun and creative, doesn’t seem to solve it. Nor does more coffee. And you can only watch so much hockey and baseball.

Addressing complex questions in science, or trying to solve previously unsolved problems, is not easy. There are almost always a whole host of points at which you can get something wrong, either because you made a bad assumption, or a bad calculation, didn’t understand the domain of applicability of a mathematical relationship, programmed something wrong, forgot to include an important variable, and so forth. It’s not simple and it’s not straightforward.

What if I didn’t have the Monte Carlo results against which to compare the theoretically derived results? What impression would I be under? The answer to that is clear and obvious: since I’ve gone over the algorithm several times, I would be convinced that my theoretically derived numbers were correct. I would then proceed to apply the method to the old tree data sets, wherein, if they were not correct, I would of course come to wrong conclusions about the rank orders of tree distances and hence, of the desired forest structural estimates. I’d publish this and some people would be impressed and say, wow that’s nifty, novel and important, a significant breakthrough for sure.

But the numbers, and hence the message of the paper, would likely be wrong. And nobody would be the wiser until they went through the same process.

Refs:

Pollard, J.H. (1971). On distance estimators of density in randomly distributed forests. Biometrics 27:991-1002.

Skellam, J.G. (1952). Studies in statistical ecology: I. Spatial pattern. Biometrika 39:346-362.

7 thoughts on “Analytical problems in science; an example”

  1. See, there’s your problem: you’re actually trying to reconcile the two models, instead of just averaging them and declaring that that result must be right, or simply ignoring the one you didn’t like… [Sorry, couldn’t help myself – I’ve seen too much of both approaches. I do appreciate that you try for the right answer, not just an answer.]

    More seriously, you state that “…field based approaches [are] (useful sometimes, but limited for various reasons)”. Why? [My quick thought is that there is little old-growth left, and what is left is heavily skewed towards certain types of terrain and so as a whole would not be a good representation of the original.]

    Even if limited, would it not be possible to use a field based approach to generate some data that could be used as a rough check on the theoretical and empirical approaches? If only as a pointer to the correct approach?

    Also, sorry about the Red Wings. Could be worse, though – you could be a Leafs fan. 🙂

    • Great question as usual kch, and I like the way you think–specifically, that yes, we should always use whatever information we have, no matter how little or limited, if it will help give an idea or constraint on whatever it is we’re after.

      It’s kind of an involved answer so hang on.

    • Yes, you’re on the right track kch. The largest problem with a field-based approach is that you need to find at least some original bearing trees. That’s impossible in many places (they no longer exist) and difficult in others. However, I have done exactly that in Yosemite and elsewhere, and a couple of papers using this exact approach elsewhere have been published in the last couple of years. Their work only applies to ponderosa pine forests in the western US, though, and the overall conclusion they come to about which trees the surveyors selected can readily be shown not to apply to at least some, perhaps most, forests in the eastern US, and possibly elsewhere. They don’t recognize this fact, however, and seem to imply that their study is somehow definitive. It isn’t. They also don’t recognize some mistaken assumptions/methods in their work, even though I corresponded at length with the lead author, explaining exactly why.

      A method that looks strictly at parameters of the tree distance information is much more general, applying to any surveyor tree data in which two or more trees were sampled at each survey point, and especially so when four trees were sampled. But it’s analytically difficult, which is probably why it’s never been done. It’s difficult enough just to get across some fundamental points about the statistics of distance-based sampling, let alone apply those concepts to the surveyor’s data sets in a way people understand.

      and re: the Leafs: I feel your pain!

    • Thanks, it seems to be about what I thought. After I commented I looked around for old-growth forest maps, and yeah, there’s not much left, and what is, is pretty geographically bounded. Tough to extrapolate from, so I see why you’re having to fall back on models. Interesting topic, hopefully you’ll expand further as you progress.

      Oh, and I seem to have misled you – I’m not a Leafs fan (or even much of a hockey fan in general). They are simply my favorite example of hilarious futility in the NHL. I always seem to have some die-harders on my staff, and their annual cycle of boundless optimism followed by the long slide into depressing reality is a source of much amusement to me. It’s un-Canadian of me, but, really, I’d rather watch football…

    • Don’t they kick people out of Canada for that?

      As for progress, I’m now fairly sure (say maybe 95%), based on work I did today, that my theoretical computations are off and that the simulated data and results are in fact correct. This is frustrating because I spent lots of time on those and yet can’t seem to get them right. So it goes. Science frustrates more than it rewards.

      Another point about the field approach is that it’s waaaaaaaaay more time consuming. However it’s also waaaaaaaaaay more fun. 🙂

  2. RE: “(or even much of a hockey fan in general)”

    Unfortunately that pretty much covers Leaf fans . . . in general . . .

    I enjoy your Posts Jim, please keep it up.

    • Personally, if I were a Senators or Canadiens fan I’d be fairly irked that somehow or other the CBC has the Leafs in the early game on HNiC every single week! That would seem to be a fairly good way to piss your fellow countrymen off.

      And thanks for the good word again barn, appreciated.
