This post is about estimating a rate process, particularly when certain required input data are missing or critical assumptions are unmet. Although it’s quite unrelated to recent posts (here and here) I’m actually headed somewhere definite with the collection of them, if you can bear with me. If you can’t bear with me, I can bear that.
As we know, estimating rate processes is a very common task in science, where “rate” is defined generally as a measured change in one variable per unit measure of another, usually expressed as a ratio or fraction. In the discussion that follows I’ll use object density–the number of objects of interest per unit area–to illustrate the concepts, but they are applicable to rates of any type.
Obtaining rate estimates, with rare exceptions, requires empirical sampling over the domain of interest, followed by some mathematical operation on the samples, often an averaging. Sampling, however, can take two opposing approaches, depending on whether the denominator or the numerator of the rate is fixed in advance. Using our density example, in the first approach we tally the objects occurring within samples of some a priori defined area (the denominator thus fixed), and then simply average the values over all samples. In the second approach, distance sampling (DS), we instead measure to the nth closest objects from random starting points, with the value of n chosen a priori, so that the numerator is fixed and the denominator varies. Low values of n are typically chosen for convenience, and distances are converted to density using a suitable density estimator.
This DS approach has definite advantages. For one, it is time efficient, as no plot delineations are required, and decisions about which objects to tally, and which have already been tallied, are typically easier. A second advantage is that the total sample size is independent of object density, and a third is an often superior scatter of samples through the study area, leading to a better characterization of spatial pattern.
The data from the two approaches are described by two well-established statistical models, the Poisson and the gamma, respectively. With the first approach, the number of objects falling in each of a collection of plots will follow a Poisson distribution, the exact values determined by the overall density. With the second approach, the squared distances to the nth closest objects (equivalently, the circular areas those distances define) follow a gamma distribution with shape parameter n.
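A quick simulation can make the two models concrete. What follows is only a sketch under assumptions of my own choosing: the density, study area size, plot size, number of sample points and the choice n = 3 are all arbitrary illustration values, not anything from a real data set. It scatters objects at random, then checks that plot counts have variance equal to their mean (the Poisson signature) and that the circle areas to the 3rd closest object have the mean and shape expected of a gamma.

```python
import numpy as np

rng = np.random.default_rng(1)

density = 50.0   # assumed true density, objects per unit area
side = 10.0      # assumed side length of a square study area
n = 3            # measure to the 3rd closest object

# Randomly located objects (a homogeneous Poisson pattern).
n_objects = rng.poisson(density * side**2)
objects = rng.uniform(0.0, side, size=(n_objects, 2))

# Approach 1: tally objects in fixed 0.2 x 0.2 plots tiling the area.
counts, _, _ = np.histogram2d(objects[:, 0], objects[:, 1],
                              bins=50, range=[[0.0, side], [0.0, side]])
plot_area = (side / 50) ** 2
print("mean count per plot:", counts.mean(), "expected:", density * plot_area)
print("variance / mean of counts (1 for Poisson):", counts.var() / counts.mean())

# Approach 2: from random points, measure to the nth closest object.
# Points are kept 1 unit from the boundary to avoid edge effects.
points = rng.uniform(1.0, side - 1.0, size=(500, 2))
dx = points[:, None, 0] - objects[None, :, 0]
dy = points[:, None, 1] - objects[None, :, 1]
d2 = dx**2 + dy**2
r2_nth = np.partition(d2, n - 1, axis=1)[:, n - 1]   # nth smallest squared distance
areas = np.pi * r2_nth

# For a gamma with shape n and rate `density`: mean = n/density, mean^2/var = n.
print("mean circle area:", areas.mean(), "expected:", n / density)
shape_hat = areas.mean() ** 2 / areas.var()
print("mean^2 / variance of areas (n for gamma):", shape_hat)
```

Note that the sample size in the second approach (500 points) was set by me, not by the density, which is the second advantage of DS mentioned above.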
Either approach requires that objects be randomly located throughout the study area for unbiased estimates, but there is a second potential bias source with DS, and hence a downside. If each measurement is converted to a density as n objects per circle area, and those values are then averaged, the estimate is biased high by a factor of n/(n-1). So, e.g., measuring to the n = 2nd closest object will bias the density estimate by a factor of 2, to the n = 3rd closest by 3/2, etc., while for n = 1 the expected value is not even finite. This bias is easily removed simply by multiplying by the inverse, (n-1)/n, or equivalently, measuring to the 2nd closest objects will give the area corresponding to unit density, that is, the area at which object density equals one. Which is kind of interesting, assuming you’ve made it this far and nothing more enthralling occupies your mind at the moment.
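The size of this bias can be checked numerically. The sketch below rests on two assumptions that I should state loudly: the objects are randomly located, so the circle area to the nth closest object is gamma distributed with shape n, and the "naive" estimate is n divided by that area, averaged over sample points. Under those assumptions the expected value of 1/area is density/(n-1), which is where the n/(n-1) factor comes from. The density and the number of simulated points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
density = 1.0          # assumed true density
n_points = 200_000     # number of simulated sample points

ratios = {}
for n in (2, 3, 4, 5):
    # Circle areas to the nth closest object under a random pattern.
    areas = rng.gamma(shape=n, scale=1.0 / density, size=n_points)
    naive = np.mean(n / areas) / density            # n objects per circle area
    corrected = np.mean((n - 1) / areas) / density  # naive times (n-1)/n
    ratios[n] = (naive, corrected)
    print(f"n={n}: naive/true = {naive:.3f}, corrected/true = {corrected:.3f}, "
          f"n/(n-1) = {n / (n - 1):.3f}")
```

For n = 2 the corrected estimate is simply 1/area, which is the "area corresponding to unit density" statement in numerical form. It converges slowly there because 1/area has infinite variance when n = 2, one more reason that very low n values are convenient but noisy.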
The shape of the gamma distribution varies with n. When n = 1, the gamma assumes a negative exponential shape, and when n > 1 it is unimodal, strongly skewed at low values of n, but of decreasing skew (more normal) at higher n. The upshot is that one can diagnose n if unknown, because these distributions will all differ, at least slightly. However, the power to do so decreases with increasing n, because these differences shrink as n grows, and the approach also assumes the objects occur randomly.
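As one illustration of diagnosing n from shape alone, here is a method-of-moments sketch. It is my own stand-in, not necessarily the best diagnostic: for a gamma with shape n, the squared mean divided by the variance equals n regardless of density, so that ratio computed on the circle areas (pi times squared distance) estimates n. The function name `estimate_n`, the density and the sample size are all hypothetical, and the whole thing assumes a random pattern.

```python
import numpy as np

def estimate_n(areas):
    """Method-of-moments shape estimate: mean^2 / variance of the circle areas."""
    areas = np.asarray(areas, dtype=float)
    return areas.mean() ** 2 / areas.var(ddof=1)

rng = np.random.default_rng(3)
density = 2.0   # assumed; the shape estimate does not depend on it

estimates = {}
for true_n in range(1, 7):
    areas = rng.gamma(shape=true_n, scale=1.0 / density, size=20_000)
    estimates[true_n] = estimate_n(areas)
    # Theoretical skewness of a gamma is 2/sqrt(n): it falls quickly at first,
    # then slowly, which is why neighbouring n values get harder to separate.
    print(f"true n={true_n}: estimated n={estimates[true_n]:.2f}, "
          f"skewness 2/sqrt(n)={2 / np.sqrt(true_n):.2f}")
```

With a sample this large the estimates land close to the true values. With the few dozen or few hundred points of a real survey, the scatter of the estimate grows with n, which is the loss of power described above.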
If our rate denominator measurement space is defined on more than one dimension, as it is in our example (area), we can subdivide it into equally sized sectors and measure distances to the nth closest object within each sector. Sectors here would correspond to angles emanating from the sample point, and this “angle-order” method increases the sample size at each sample point. Supposing four sectors, for a given n value, the four measurements at a point give (four choose two) = six possible ratios between the distances, each ratio defined as the further distance over the closer. The distributions of these six ratios, over a collection of sample points, are then collectively discriminatory for n.
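To see why these ratios carry information about n, here is a small simulation sketch. It assumes a random pattern, in which case the wedge-shaped area out to the nth closest object in each sector is gamma with shape n and independent of the other sectors. The function name `angle_order_ratios` and all parameter values are hypothetical, and the median is just one convenient summary of the ratio distribution.

```python
import itertools
import numpy as np

def angle_order_ratios(n, n_points, rng, density=1.0, n_sectors=4):
    """Further/closer distance ratios among sectors, for a random pattern."""
    # In a random pattern each sector's wedge area out to its nth closest
    # object is gamma with shape n, independently of the other sectors.
    wedge_areas = rng.gamma(shape=n, scale=1.0 / density, size=(n_points, n_sectors))
    # Wedge area = (pi / n_sectors) * r^2, so invert for the distance r.
    r = np.sqrt(wedge_areas * n_sectors / np.pi)
    pairs = list(itertools.combinations(range(n_sectors), 2))  # 6 pairs for 4 sectors
    i, j = np.array(pairs).T
    far = np.maximum(r[:, i], r[:, j])
    near = np.minimum(r[:, i], r[:, j])
    return far / near   # shape (n_points, number of pairs)

rng = np.random.default_rng(4)
medians = {}
for n in (1, 2, 3, 4):
    ratios = angle_order_ratios(n, n_points=5000, rng=rng)
    medians[n] = float(np.median(ratios))
    print(f"n={n}: median further/closer ratio = {medians[n]:.3f}")
```

The ratios do not depend on density at all, since it cancels in each ratio, and they tighten toward one as n increases. That is what makes them useful for diagnosing n, and also why the separation between neighbouring n values shrinks as n grows.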
Usually both the rate and its homogeneity are unknown, and we need the latter to get the former. If a non-random pattern exists, the non-randomness can be quantified (in several ways) if we know n, and density estimates can then be made using suitable corrections. If we don’t know n but do know the objects are in fact randomly arranged, we can still infer density, although not with nearly as much precision as if we knew the value of n. Interestingly, the bias arising from a clustered non-random pattern is in the same direction as that from an underestimate of n, both leading to density underestimates. Similarly, a tendency toward spatial regularity gives a bias in the same direction as an overestimate of n.
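The shared direction of these two biases can be seen in a toy simulation. Everything specific here is an assumption of mine: the cluster settings (100 parent locations, 20 objects each, Gaussian scatter of 0.3), the wrapped square study area, the 400 sample points, and the pooled estimator (N*n - 1) / (sum of circle areas), which is unbiased for a random pattern because a sum of N such gamma areas is itself gamma with shape N*n. The helper names are hypothetical too.

```python
import numpy as np

def nth_nearest_areas(points, objects, n, side):
    """Circle areas pi*r^2 to the nth closest object on a wrapped (torus) square."""
    d = np.abs(points[:, None, :] - objects[None, :, :])
    d = np.minimum(d, side - d)                 # wrap around the edges
    d2 = (d ** 2).sum(axis=2)
    return np.pi * np.partition(d2, n - 1, axis=1)[:, n - 1]

def pooled_density(areas, n):
    """(N*n - 1) / sum of areas: unbiased when the pattern is random."""
    return (areas.size * n - 1) / areas.sum()

rng = np.random.default_rng(5)
side, n = 20.0, 3
n_parents, per_parent, spread = 100, 20, 0.3    # hypothetical cluster settings
n_objects = n_parents * per_parent
true_density = n_objects / side**2              # 5 objects per unit area

random_objects = rng.uniform(0.0, side, size=(n_objects, 2))
parents = rng.uniform(0.0, side, size=(n_parents, 2))
offsets = rng.normal(0.0, spread, size=(n_objects, 2))
clustered_objects = (np.repeat(parents, per_parent, axis=0) + offsets) % side

points = rng.uniform(0.0, side, size=(400, 2))
est_random = pooled_density(nth_nearest_areas(points, random_objects, n, side), n)
est_clustered = pooled_density(nth_nearest_areas(points, clustered_objects, n, side), n)
# Same random pattern, 3rd closest measured, but n wrongly taken to be 2.
est_wrong_n = pooled_density(nth_nearest_areas(points, random_objects, 3, side), 2)

print("true density:", true_density)
print("random pattern, correct n:", est_random)
print("clustered pattern, correct n:", est_clustered)
print("random pattern, n underestimated:", est_wrong_n)
```

Both the clustered case and the underestimated-n case come out below the true density, which is the point made above. How far below depends entirely on the made-up settings.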
These last facts can be useful, because our main interest is not n or the spatial pattern, but rather the rate (density), and the effects of uncertainties in them are not additive, and can even be compensating, depending on the situation. This serves to greatly constrain the total uncertainty in the rate estimate even when these critical pieces of input information are lacking. I’m going to try to expound on that in the next post in this utterly enthralling series.
Support for this post has been provided by the Society for Public Education of Not Very Well Recognized or Cared About Issues. Nothing in the post should be construed, or mis-construed, as in any way necessarily reflecting the views, opinions, sentiments, thoughts, conceptual leanings, quasi-conscious daydreaming or water cooler banter of said Society, or really of anyone in particular other than me, and even that is open to debate. You may now return to your regularly scheduled life.