There are many things I don’t understand regarding how various folks do things, and here’s a good example. It falls, to my mind, within what Brian McGill at the Dynamic Ecology blog last year called “statistical machismo”: the use of statistical methods that are more complicated than necessary in order to appear advanced or sophisticated, when these are either unnecessary (but trendy), or worse, clearly not the best choice for the problem at hand. “Cutting edge”, their practitioners like to think of them, but I’ve got no problem in calling them sophistry that stems from a lack of real statistical understanding, combined with a willingness to do whatever will maximize the chances of getting published, of which there rarely seems to be a shortage.
I’ve had, for some time, a growing suspicion that much of the use of Bayesian statistical methods in science falls pretty squarely in this category. That of course doesn’t mean I’m right, especially since I do not fully understand everything about modern Bayesian methods, but I get the basic ideas, and the following example is a good illustration of why I think that way. It relates to my recent cogitations on the design of a general, model-based partitioning (clustering) algorithm for a common type of categorical data: data in which each sample is represented by only a small fraction of the total number of categories. In such cases, clear associations between the various categories is far from obvious.
I started thinking about the issue in relation to the estimation of forest tree community types in some widespread and very important historical tree data sets, where each sample contains individuals from, at most, only two to four tree taxa (usually, species), when there may be upwards of 10 to 30 such taxa in the population over some large landscape area (earlier post on this topic here) However, the issue has by far its greatest application in the field of population genetics, specifically w.r.t. the goal of identifying cryptic population structure–that is identifiable groups of individuals who are breeding primarily or entirely among themselves (“demes”), leading to allele and genotype frequencies that vary characteristically from deme to deme, demes which are not otherwise readily identifiable by external phenotypic characters. These are groups involved in the first step on the road to “incipient species”, to use Darwin’s phrase. The similarity with the tree data is that at each gene locus for any given diploid individual–which represents our sample–you have only two alleles, even though many such may occur in some larger, defined population.
In 2000, Pritchard et al. published what would have to be considered a landmark study, given that it’s been cited nearly 14,000 times since. This comes out to about 2.5 citations per day; I wouldn’t have guessed that so many popgen papers were even published at that kind of rate. The paper introduces a method and program (“STRUCTURE”) for the above-stated task, one based on Bayesian techniques, using Markov Chain Monte Carlo (MCMC), which is an iterative method for estimating parameters of the posterior distribution when no analytical techniques, or approximations thereof, are available. Furthermore, the paper has spawned several spin-off papers introducing various modifications, but all based on the same basic Bayesian/MCMC approach. And each of those has gotten hundreds to thousands of citations as well.
I freely admit that I am an unabashed maximum likelihood (ML) type of thinker when it comes to statistical inference and model selection. I’m far from convinced that Bayesianism offers any clear, definitive advantage over ML methods, while appearing to be saddled with some complex, time-consuming and uncertain estimation techniques (like MCMC), which is most definitely a disadvantage. To my view, Bayesianism as a whole might well fall within Brian’s machismo category, at least as employed in current practice, if not in its fundamental tenets. I very much doubt that many people who use it do so for any reason other than that a lot of others are using it, and so they just go with the flow, thinking all is kosher. Scientists do that a lot, at least until they develop their own understanding of the issues.
As I was thinking through the problem, it seemed to me pretty clear that, although a strict analytical solution was indeed not possible, one based on a ML approach, as heavily guided by expectations from binomial/multinomial probability and divisive clustering, was the way to go. Indeed, I can’t see any other logical and algorithmically efficient way to go about solving this type of problem. The underlying goal and assumptions remain the same as Pritchard et al’s, namely to find groups that approximate Hardy-Weinberg equilibrium, and which therefore represent approximately randomly mating groups. And there is also still a “Monte Carlo” procedure involved, but it’s quite different: always guided by a definite strategy, and much less intense and random than in the Bayesian/MCMC approach. As far as I can tell, nobody’s taken this approach (although I just found an Iowa State student’s dissertation from last year that might), and I don’t know why. I thought it was recognized that defaulting to a uniform (i.e. uninformative) prior probability distribution–because you really have no idea otherwise, or worse, when the idea of some “prior distribution” doesn’t even make sense to begin with–and you have quite a few parameters to estimate, that MCMC algorithms can be very slow to converge (if at all), and to do so to potentially unstable estimates at that. But that’s exactly what the authors did, and there are other limitations of the approach also, such as having to constrain the total number of possible demes to begin with–presumably because the algorithm would choke on the number of possible solutions otherwise.
These are the kinds of things I run into far more often than is desirable, and which generate a certain mix of confusion, frustration and depression. If I keep working on this topic–which I find fascinating and which, being statistical, generalizes to different fields, but which I really don’t have time for–I’ll post more details about the two approaches. The fan mail has been clamoring for posts on this topic. Or ask questions if you’re entirely desperate for something to do while waiting for the Final Four to go at it.