Science night in America

Suppose you’re sitting there watching the Stanley Cup playoffs and you realize all at once, “Hey, I bet I can estimate an important missing parameter in old tree data sets using probability theory, and then use it to evaluate, and make, tree density estimates.” Of course that issue is plastered all over the internet–everybody’s sick of it frankly–and there are only about 2.36 million NHL playoff games, so you might not act on your idea at all. But you also might jump up and rush to the computer, startling the dog in the process, open up R and start whacking away at the keyboard, and be swearing furiously and grabbing your head in no time at all. There’s big, big money to be made on this, after all.

Now the question soon arises, “Just how the hell am I supposed to go about this anyway?”. Well, the answer to that is “Carefully, with patience, gobs of it”. Because you’ll learn a lot about some important science concepts if you do. And of course there’s the money, gobs of it, which should motivate a fair bit.

Some background here. One can estimate the density (or “intensity” in statistical parlance) of objects by measuring the distances from random points to a sample of those objects. The accuracy and precision of the estimate will depend on the sample size, the objects’ spatial pattern, and the distance rank (closest, second closest, etc.) of the objects measured to. All very exciting stuff, but old hat.
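
To make the idea concrete, here’s a minimal sketch in R. The simulation setup, variable names, and the choice g = 3 are all mine; it assumes a completely random (Poisson) tree pattern and ignores edge effects, both of which matter greatly with real data, so take it as illustration rather than a working method.

```r
## Illustrative sketch only: density estimation from point-to-tree distances,
## assuming a completely random (Poisson) pattern and ignoring edge effects.
set.seed(1)
lambda <- 0.05                                   # true density, trees per unit area
side   <- 200                                    # side length of square study area
n.tree <- rpois(1, lambda * side^2)
trees  <- cbind(runif(n.tree, 0, side), runif(n.tree, 0, side))

n.pt <- 50                                       # number of random sample points
pts  <- cbind(runif(n.pt, 0, side), runif(n.pt, 0, side))
g    <- 3                                        # distance rank used (g-th closest tree)

## distance from each sample point to its g-th closest tree
r.g <- apply(pts, 1, function(p)
  sort(sqrt((trees[, 1] - p[1])^2 + (trees[, 2] - p[2])^2))[g])

## standard unbiased density estimator for this idealized (Poisson) case
(lambda.hat <- (n.pt * g - 1) / (pi * sum(r.g^2)))
```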

Now, suppose some abject clowns went out and purposely didn’t record those distance ranks on, like, a whole bunch of trees over a biggish area, like, say, two-thirds of the United States, simply because they were distracted by avoiding arrows from the Native Americans, or were feeling the malaria and blood loss from the mosquitoes over the 50 miles of swamp they crossed, or some other lame excuse like that. Well, now if that isn’t frustrating for keyboard stats jockeys! The complete lack of concern for our needs 150 years later, trying to get publications! Wankers!

Should time allow, we shall investigate this question in coma-inducing detail in future posts. We shall however take periodic breaks to watch the Red Wings dispose of the Beantown “Bruins” in short order, and of course for beer runs. Might want to replace the batteries in the remote control if you get a chance.

Say what?

I’ll try to cut to the chase on this one, though there’s a lot to say actually.

I’ve had some collaboration and discussions with a Chinese (mainland) ecologist over the last few years, ever since he did a sabbatical here. He’s a member of the CAS (Chinese Academy of Sciences) and has been trying to talk me into going there (permanently) ever since; he says the research opportunities there are greater and the picture generally rosier. I have my doubts about that, but whatever; I’ve never been there and know nothing about the matter. Never had any desire to go, except maybe to Tibet or the Tian Shan mountains in western China.

This week I told him some things about my situation, including that I have a number of potential papers sitting there in various stages of completion (mostly, not very close), from various data analyses done over the last 15 years. In response he made the following offer: he would get me an affiliation with the CAS, and pay me $2000 apiece if I got six of them published in the next year. In return, I had to list him as either the lead or corresponding author on each. He said he’d assign one of his grad students to help me out if I needed it. He stated these things very plainly, in an email message.

Say what?

Each of these potential papers represents quite a lot of work already, including field work and computer data analyses of various types, and even more to bring them to completion of course. The guy knows absolutely nothing about most of the topics involved, some of which are very specific and technical, like my tree ring analysis work, other work on new statistical methods for the estimation of American pre-settlement forest conditions, and work on the effect of fire suppression on tree biomass/carbon accumulation rates in California pine forests. These represent essentially my life’s work over the last 15 years. And yet he’s requesting that I list him as an author on each, even lead author.

Aside from the fantastically unrealistic pipe dream of publishing six high quality articles in one year, what exactly does he plan to do when people start contacting him with various technical questions on the material, or for presentations of it at conferences, or to review papers in those areas, or similar academic-type things? And I won’t get into the numerous other things he’s told me that indicate pretty strongly that he’s attempting to build his research empire over there, except to note that, apparently, collaboration with American scientists counts pretty highly among the powers that be, and science is run “like a business”.

So, uh… thanks friend, but uh, no dice. I need the money pretty badly for sure, and the affiliation would be great also, but you know, I just really don’t need either that badly. Former friend, make that. Good luck over there.

Skellam, 1952

Been reading the old distance-based density estimation literature recently. I love how they understated things back in the day, including in article titles. For example, a 1952 paper by J.G. Skellam goes into great mathematical detail on the theory of spatial patterns and their analysis, and unless I’m mistaken, was the first to translate the Poisson distribution into a point-to-object framework, a significant achievement. Yet it is titled simply “Studies in statistical ecology: I. Spatial pattern”.

Anyway, here’s an interesting comment therein regarding the power of mathematically-based theory versus empirical analyses:

“In the world of organic nature there seems to exist an uneasy balance between the factors which increase randomness and those that oppose it. This is particularly true of the distribution in space of animals and plants. The broad outlines of the pattern are determined by the main structural features of the physical environment. But even under constant conditions neither uniformity nor complete randomness prevail.

On the one hand the reproduction of organisms and the interactions between them tend to develop a closely knit pattern; whilst on the other, locomotory movements and dispersive processes bring about an ever-increasing randomness. An ecological complex of interacting species is a dynamical system, which may not only display a regular seasonal rhythm, but also appears liable by reason of its intrinsic nature to undergo oscillations (Volterra, 1931) or cyclical changes (Watt, 1947), all of which are liable to be disturbed in an irregular manner by apparently unpredictable fluctuations in weather conditions or by the spasmodic arrival of additional components to the system from outside.

It is unfortunate however that the use of probability generating functions should not have featured more prominently in the literature on these and related topics, for by means of them the subject under consideration [the distribution of individuals in census sample units] can be given greater unity and understanding. Many statistical results already deduced with much labour by the pioneers of quantitative ecology can be immediately derived by this method, and the way opened for further generalization and development.”

Skellam, J.G. (1952). Studies in statistical ecology: I. Spatial pattern. Biometrika 39: 346-362.

Interpreting graphs with logarithmic scales

Graphical portrayals of data relationships are of course ubiquitous in science. Because graphs rely heavily on visual impressions, there are a number of issues to consider in order not to leave mistaken impressions (which is why I prefer tables over graphs, but I won’t get into that). A very important one involves the choice of scales, especially the use of logarithmic data transformations.

You have to be careful when looking at such graphs, and I’ll illustrate some of the main problems using three graphs from each of three different example situations, with particular (but not exclusive) reference to time series data and trend interpretation. In all cases, I allow both axes to span only the range of the data used, no more and no less. This avoids issues of appearance determined by the choice of axis scale range, which are potentially important but not my focus here.

Example 1
Suppose one has a simple linear function, say Y = 0.1*X, with X ranging from 1 to 1001 and a sample size of 50 (blue dots in graphs), Y therefore ranging from 0.1 to 100.1. If time is graphed such that the reference point time = 0 is in the past, we have the basic linear graph:
[Figure 1: Y vs. X, both axes on the original linear scales.]

If we take logarithms of each axis, regardless of the base chosen (I use base 10), we still get linearity:
[Figure 2: log10(Y) vs. log10(X), both axes transformed.]

The graph retains a visual linearity, and this is a prime reason why log transformations are done, i.e. to show relationships when several orders of magnitude are spanned by the data. However, the impression of the sampling coverage is no longer the same. The individual sampling points, which are evenly distributed (every 20 time units in the original scale), now appear to increasingly cluster as x increases.

The preceding is the best-case scenario for logarithmic data transformations, in terms of their visual fidelity to the original relationship. But very often (too often) in the science literature, only one axis (often the x), not both, is transformed. Doing so will give this:
[Figure 3: Y vs. log10(X), x-axis only transformed.]

Without an instinctive awareness of the effects of log transformation, the line shape here gives a strong visual distortion of the original data, one implying a strongly accelerating rate of increase in y with x, when we know that rate is in fact constant at 0.1.
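
For reference, here’s a minimal R sketch (plotting choices mine) that reproduces the three views of this example:

```r
## Example 1 in code: a straight line viewed on linear, log-log, and
## x-only-logged axes
X <- seq(1, 1001, length.out = 50)   # 50 evenly spaced sample points
Y <- 0.1 * X

par(mfrow = c(1, 3))
plot(X, Y, col = "blue", pch = 16)                # linear: straight line, even spacing
plot(log10(X), log10(Y), col = "blue", pch = 16)  # log-log: still straight, points bunch rightward
plot(log10(X), Y, col = "blue", pch = 16)         # x only: falsely "accelerating" curve
```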


Rock me mama like a wagon wheel

I believe somebody said Illinois River and Old Crow Medicine Show

Headed down south to the land of the pines
I’m thumbin’ my way into North Caroline
Stare on down the road and pray to God
I see headlights

Made it down the coast in 17 hours
Pickin’ me a bouquet of dogwood flowers
And I’m a hopin’ for Raleigh
So I can see my baby tonight

So rock me mama like a wagon wheel
Rock me mama anyway you feel
Hey, mama rock me
Rock me mama like the wind and the rain
Rock me mama like a south-bound train
Hey, mama rock me

Wagon Wheel, Bob Dylan / Old Crow Medicine Show

Ain’t no ash will burn

A gorgeous traditional tune that Matt Skaggs made me aware of some time back. It’s become a definite favorite; here’s my brief interp.

I have seen snow that fell in May
And I have seen rain on cloudless days
Some things are always bound to change
There ain’t no ash will burn

Love is a precious thing I’m told
It burns just like West Virginia coal
But when the fire dies down it’s cold
There ain’t no ash will burn

You say this life is not your lot
Well I can’t be something that I’m not
We can’t stoke a fire that we ain’t got
There ain’t no ash will burn

In every life there comes a time
Where there are no more tears to cry
We must leave something dear behind
There ain’t no ash will burn

There is one lesson I have learned
There ain’t no ash will burn

Hardy-Weinberg genetic equilibrium and species composition of the American pre-settlement forest landscape

This post is about how binomial probability models can, and cannot, be applied for inference in a couple of very unrelated biological contexts. The issue has (once again) made popular media headlines recently, been the focus of talk shows, etc., and so I thought it would be a good time to join in the discussions. We should, after all, always focus our attention wherever other large masses of people have focused theirs, particularly on the internet. No need to depart from the herd.

Binomial models give the coefficients/probabilities for all possible outcomes when repeated trials are performed of an event that has two possible outcomes occurring with known probabilities. The classic example is flipping a coin; each flip has two possible outcomes, heads or tails, occurring with probabilities h = t = 0.5. If you flip it, say, twice (two trials), you get 1:2:1 as the binomial coefficients for the three possible outcomes of (1) hh = two heads, (2) ht = one head and one tail, or (3) tt = two tails, which gives corresponding probabilities of {hh, ht, tt} = {0.25, 0.50, 0.25}. These probabilities are given by the three terms of (h + t)^2, where the exponent 2 gives the number of trials. The number of possible outcomes after all trials is always one greater than the number of trials, with the order of the outcomes being irrelevant. Simple and easy to understand.

The direct extension of this concept is found in multinomial models, in which more than two possible outcomes exist for each trial; the concept is identical, there are just more total probabilities to compute. Throwing a pair of dice would be a classic example.
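
For anyone who wants to verify the arithmetic, R’s built-in distribution functions give these numbers directly (the dice line just evaluates one arbitrary multinomial outcome):

```r
## Two coin flips: probabilities of 0, 1, 2 heads, i.e. {tt, ht, hh}
dbinom(0:2, size = 2, prob = 0.5)   # 0.25 0.50 0.25
choose(2, 0:2)                      # the 1:2:1 binomial coefficients

## Multinomial extension: two throws of a fair die, e.g. P(one "1" and one "2")
dmultinom(c(1, 1, 0, 0, 0, 0), prob = rep(1/6, 6))  # 2 * (1/6)^2, about 0.0556
```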

The most well-known application of binomial probability in biology is probably Hardy-Weinberg equilibrium (HWeq) analysis in population genetics, because chromosome segregation (in diploids) always gives a dichotomous result, each chromosome of a pair having an equal probability of occurrence in the haploid gametes. The binomial coefficients then apply to the expected gamete combinations (i.e. genotypes) in the diploid offspring, under conditions of random mating, no selection acting on the gene (or on closely linked genes), and no migration into or out of the defined population.
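
As a quick illustration (the allele frequency and genotype counts below are hypothetical, chosen only for the example), the expected genotype frequencies are just the terms of (p + q)^2, and observed counts can be checked against them:

```r
## Hardy-Weinberg expectations for a two-allele locus; p is a made-up value
p <- 0.7                                   # frequency of allele A
q <- 1 - p                                 # frequency of allele a
c(AA = p^2, Aa = 2 * p * q, aa = q^2)      # terms of (p + q)^2; they sum to 1

## Checking hypothetical observed genotype counts against HWeq
obs   <- c(AA = 50, Aa = 40, aa = 10)
n     <- sum(obs)
p.hat <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * n)  # allele frequency from the data
expd  <- n * c(p.hat^2, 2 * p.hat * (1 - p.hat), (1 - p.hat)^2)
sum((obs - expd)^2 / expd)                 # chi-square statistic, 1 df after estimating p
```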
