Bayesian Bad Boys

There are many things I don’t understand regarding how various folks do things, and here’s a good example. It falls, to my mind, within what Brian McGill at the Dynamic Ecology blog last year called “statistical machismo”: the use of statistical methods that are more complicated than necessary in order to appear advanced or sophisticated, when these are either unnecessary (but trendy), or worse, clearly not the best choice for the problem at hand. “Cutting edge”, their practitioners like to think of them, but I’ve got no problem in calling them sophistry that stems from a lack of real statistical understanding, combined with a willingness to do whatever will maximize the chances of getting published, of which there rarely seems to be a shortage.

I’ve had, for some time, a growing suspicion that much of the use of Bayesian statistical methods in science falls pretty squarely in this category. That of course doesn’t mean I’m right, especially since I do not fully understand everything about modern Bayesian methods, but I get the basic ideas, and the following example is a good illustration of why I think that way. It relates to my recent cogitations on the design of a general, model-based partitioning (clustering) algorithm for a common type of categorical data: data in which each sample is represented by only a small fraction of the total number of categories. In such cases, clear associations between the various categories are far from obvious.

I started thinking about the issue in relation to the estimation of forest tree community types in some widespread and very important historical tree data sets, where each sample contains individuals from, at most, only two to four tree taxa (usually, species), when there may be upwards of 10 to 30 such taxa in the population over some large landscape area (earlier post on this topic here). However, the issue has by far its greatest application in the field of population genetics, specifically w.r.t. the goal of identifying cryptic population structure–that is, identifiable groups of individuals who are breeding primarily or entirely among themselves (“demes”), leading to allele and genotype frequencies that vary characteristically from deme to deme, even though the demes are not otherwise readily identifiable by external phenotypic characters. These are groups involved in the first step on the road to “incipient species”, to use Darwin’s phrase. The similarity with the tree data is that at each gene locus for any given diploid individual–which represents our sample–you have only two alleles, even though many more alleles may occur in some larger, defined population.

In 2000, Pritchard et al. published what would have to be considered a landmark study, given that it’s been cited nearly 14,000 times since. This comes out to about 2.5 citations per day; I wouldn’t have guessed that so many popgen papers were even published at that kind of rate. The paper introduces a method and program (“STRUCTURE”) for the above-stated task, one based on Bayesian techniques, using Markov Chain Monte Carlo (MCMC), which is an iterative method for estimating parameters of the posterior distribution when no analytical techniques, or approximations thereof, are available. Furthermore, the paper has spawned several spin-off papers introducing various modifications, but all based on the same basic Bayesian/MCMC approach. And each of those has gotten hundreds to thousands of citations as well.
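For readers who have not seen MCMC in action, here is a minimal sketch of the simplest variant, a random-walk Metropolis sampler. It is not the STRUCTURE algorithm, just a toy: it samples the posterior of a single allele frequency under a uniform prior, given `k` copies of an allele among `n` sampled gene copies. The function names, step size, and chain length are all illustrative assumptions. In this toy case the posterior is known analytically (a Beta distribution with mean (k+1)/(n+2)), which makes it easy to check that the chain is doing something sensible.

```python
import math
import random


def log_posterior(p, k, n):
    """Unnormalized log posterior of allele frequency p under a uniform prior,
    given k copies of the allele among n gene copies (binomial likelihood)."""
    if p <= 0.0 or p >= 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(k, n, steps=20000, burn_in=2000, step_size=0.1, seed=1):
    """Random-walk Metropolis sampler (illustrative settings, not tuned)."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, k, n)
    samples = []
    for i in range(steps):
        # Propose a new value that depends only on the current one:
        # this dependence is what makes the sequence a Markov chain.
        proposal = p + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, k, n)
        log_ratio = candidate - current
        # Accept with probability min(1, posterior ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            p, current = proposal, candidate
        if i >= burn_in:
            samples.append(p)
    return samples


samples = metropolis(k=30, n=100)
posterior_mean = sum(samples) / len(samples)
exact_mean = (30 + 1) / (100 + 2)  # mean of the Beta(k + 1, n - k + 1) posterior
```

With one parameter this converges quickly. The concerns about slow or uncertain convergence apply when there are many parameters to estimate at once.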

I freely admit that I am an unabashed maximum likelihood (ML) type of thinker when it comes to statistical inference and model selection. I’m far from convinced that Bayesianism offers any clear, definitive advantage over ML methods, while appearing to be saddled with some complex, time-consuming and uncertain estimation techniques (like MCMC), which is most definitely a disadvantage. To my view, Bayesianism as a whole might well fall within Brian’s machismo category, at least as employed in current practice, if not in its fundamental tenets. I very much doubt that many people who use it do so for any reason other than that a lot of others are using it, and so they just go with the flow, thinking all is kosher. Scientists do that a lot, at least until they develop their own understanding of the issues.

As I was thinking through the problem, it seemed to me pretty clear that, although a strict analytical solution was indeed not possible, one based on an ML approach, as heavily guided by expectations from binomial/multinomial probability and divisive clustering, was the way to go. Indeed, I can’t see any other logical and algorithmically efficient way to go about solving this type of problem. The underlying goal and assumptions remain the same as Pritchard et al.’s, namely to find groups that approximate Hardy-Weinberg equilibrium, and which therefore represent approximately randomly mating groups. And there is also still a “Monte Carlo” procedure involved, but it’s quite different: always guided by a definite strategy, and much less intense and random than in the Bayesian/MCMC approach. As far as I can tell, nobody’s taken this approach (although I just found an Iowa State student’s dissertation from last year that might), and I don’t know why. I thought it was recognized that when you default to a uniform (i.e. uninformative) prior probability distribution–because you really have no idea otherwise, or worse, because the idea of some “prior distribution” doesn’t even make sense to begin with–and you have quite a few parameters to estimate, MCMC algorithms can be very slow to converge (if they converge at all), and may converge to potentially unstable estimates at that. But that’s exactly what the authors did, and there are other limitations of the approach also, such as having to constrain the total number of possible demes to begin with–presumably because the algorithm would choke on the number of possible solutions otherwise.
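To make the likelihood idea concrete, here is a rough sketch of the kind of comparison involved. It is not the full algorithm, which I haven’t described, and the data are made up. For one biallelic locus, each individual’s genotype is coded as the number of copies of one allele (0, 1 or 2). A group’s allele frequency is estimated by ML, and the log-likelihood of its genotypes is computed under Hardy-Weinberg proportions. The function name and the two hypothetical demes are assumptions for illustration only.

```python
import math


def hwe_log_likelihood(genotypes):
    """Log-likelihood of genotypes (0, 1 or 2 copies of an allele) under
    Hardy-Weinberg proportions, using the group's ML allele frequency."""
    n = len(genotypes)
    if n == 0:
        return 0.0
    p = sum(genotypes) / (2 * n)  # ML estimate of the allele frequency
    probs = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}
    # If p is 0 or 1, only the genotype with probability 1 can be present,
    # so math.log never receives zero here.
    return sum(math.log(probs[g]) for g in genotypes)


# Two hypothetical demes, each close to Hardy-Weinberg proportions on its own
# (allele frequencies 0.8 and 0.2).
deme_a = [2] * 16 + [1] * 8 + [0] * 1
deme_b = [0] * 16 + [1] * 8 + [2] * 1

pooled = hwe_log_likelihood(deme_a + deme_b)
split = hwe_log_likelihood(deme_a) + hwe_log_likelihood(deme_b)
```

Pooling the two demes gives too few heterozygotes relative to Hardy-Weinberg expectation, so the split scores a higher likelihood. Because a split has more parameters it can never fit worse than the pooled group, so any real procedure needs a penalty or test to decide when a split is justified.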

These are the kinds of things I run into far more often than is desirable, and which generate a certain mix of confusion, frustration and depression. If I keep working on this topic–which I find fascinating and which, being statistical, generalizes to different fields, but which I really don’t have time for–I’ll post more details about the two approaches. The fan mail has been clamoring for posts on this topic. Or ask questions if you’re entirely desperate for something to do while waiting for the Final Four to go at it.


12 thoughts on “Bayesian Bad Boys”

  1. I agree about the abuse of “Bayesian.” At its worst the term “Bayesian approach” can just mean “with the addition of my opinion presented as fact.” I think Bayesian is most useful when you have nothing else and you really just need some starting point, any starting point.
    As an aside, I don’t fully buy in to your statement “…demes which are not otherwise readily identifiable by external phenotypic characters. These are groups involved in the first step on the road to “incipient species”, to use Darwin’s phrase.” I tend towards a bit of quasi-Goldschmidtianism.

    • I’m adding quasi-Goldschmidtianism to the ol’ lexicon, which I’m pretty sure must differ somehow from semi-Goldschmidtianism 🙂

      Anyway, the use of “deme” and “incipient species” wasn’t meant to imply that this was the only road to speciation, only that it is one of them. The more specific goal really, is the identification of distinct breeding groups if they in fact exist, regardless of whether they’re on the road to separate taxa or not.

      I really am going to have to read some of Goldschmidt’s original stuff though, now that I’m drifting (pun intended) back into popgen. If I ever did in the past, I don’t remember it. But then again, there’s a lot of the past I don’t remember. 🙂

      I really hope somebody with some firm and definite reasons for Bayesian advantages chimes in, because my mind is still open on the issue, but I sure am skeptical.

    • Quasi- ; Semi- Goldschmidtianism look fine; but don’t forget Neo-Goldschmidtianism. This latter is surely related in some popgene fashion to Neo-Darwinism. Prior experience with ‘neo’ makes me think you’ll want to get on to a more nuanced sort of understanding. Maybe one day we’ll be talking about Bouldinism (then, upon tweaking that… Neo-Bouldinism). I suppose Dawkins will squawk about that too. Oops, I said ‘prior’… I must be a Bayesian (or a Neo-Bayesian??).

      But more seriously… if you are heading into popgene territory, what sort of data are you working with? Field phenotypes? What sorts of family relationships are you able to measure?

    • No worries, pretty much everyone but Bayes himself is a neo-Bayesian as far as I can tell 🙂

      Your question is a great one. If I were to develop a new method, which is very unlikely given time, energy etc. resources, it would be most applicable to taxa which are highly continuous, spatially, and in which, even though the breeding system is very well known, the actual estimation of who is mating with whom is almost impossible without molecular markers. Think coniferous trees: huge amounts of pollen completely at the mercy of the changing winds, drifting who knows where in unknown quantities. Or perhaps plants in general really. The methods of Pritchard et al., and the spinoffs therefrom, have most commonly (I believe) been applied to populations where it’s very clear they are spatially separated, to the extent that there’s little if any gene flow between them. Well, that’s well and good, but it’s kind of obvious that you’re going to get different allele/genotype frequencies in such a situation, at least if that structure’s been in place for any length of time. In terms of really learning something new, we know very little w.r.t. the situation I described, and with the continuing drop in DNA sequencing costs, there should be the potential for extensive genotyping of individuals in areas where their kin relations are completely unknown, or nearly so. The greater the number of genetic loci, and/or individuals, analyzed, the more the existing methods are problematic–the algorithm chokes. And no phenotypic analysis–way too problematic with plants and their notorious plasticity. Would have to stick strictly to DNA markers.

      Now, having said that, the first question I would actually like to apply a new method to is defining historic forest community types, based on the land surveyors’ data–because well, I’m a forest ecologist and I’ve done a lot of work with that kind of data.

      I really don’t want another method named after me though. I’ve got enough of them already, and constantly having to deal with people trying to throw money at me for various projects has just gotten tiresome.

  2. Jim wrote:

    “…the use of “deme” and “incipient species” wasn’t meant to imply that this was the only road to speciation, only that it is one of them”

    That makes sense, no arguments there. Speciation remains a tough nut to crack.

    A young colleague of mine used Bayesian reasoning in a reliability growth model he developed. I have to admit that I could not really follow it without considerable homework on the math, which I did not do. The problem he addressed was that the device of interest was going to be made by a new manufacturer, and the old manufacturer did not do so well. In order to model growth, he had to assume an initial reliability at the start of manufacturing, and there was no good way to estimate what that would be, since everyone felt that the numbers from the previous manufacturer should not be used. So he put together a few pages of Bayesian math and came up with actual estimates of initial reliability, almost but not quite out of thin air. Management was impressed, so we used his numbers and marched forward. Would a pure SWAG have been approximately as useful? I suspect that it would not have made much difference, but at least we avoided being criticized for being too simplistic (and pessimistic). So there is one value gained from Bayesian reasoning: it adds the imprimatur of sophistication!

    • 🙂

      I think you just made the point of my post much better than I did! We live in a very strange world.

      I’ve read people saying things to the effect of “Bayesianism is the more natural way to think about probability than frequentism/likelihood is–it makes more intuitive sense”. I about fall out of the chair when I read stuff like that. Apparently they’re Byzantine Catholics.

      As for the issue of possible speciation pathways–your point’s important actually, because I always think in terms of plant genetics and evolution, which is so much more complex and varied (and interesting!) than with animals. Polyploidy, apomixis, clonal propagation, etc etc. It’s by no means always Darwin’s gradual separation model at work–as we’ve previously discussed with the serpentine stuff.

  3. I am intrigued by the concept of a “Bayesian” MCMC. If you know of examples other than Pritchard et al. I would be interested.

    I agree that Bayesian methods are often misused. Applying Bayesian reasoning to short Monte Carlo runs, such as a truly random roulette wheel, is essentially the reasoning behind the Gambler’s Fallacy and the financial ruin of uncountable individuals.

    I am simply curious how Markov Chains are practically applied in the context of Bayesian statistics (presuming, of course, that it actually is practical).

    • Several of the papers spawned or inspired by Pritchard et al.’s approach use it, in fact all of them I think. I know the R package “Geneland” uses it; in fact its “MCMC” function is the heart of that package. I believe it’s fairly commonly used across a range of fields, but I’m not really sure, since well, I’m not a Bayesian.

      There are various actual algorithms under the MCMC umbrella, including Metropolis, Metropolis-Hastings, Gibbs sampling and perhaps some others, I don’t know. As to just exactly how each works I’m not sure; I just know that all jump around through the parameter space instead of gradually traversing it, attempting to sample the posterior distribution without getting stuck in various local minima that might exist. I assume that’s where the “Markov Chain” part of it comes from.

  4. Well, my interest was that a standard Markov Chain is a frequency distribution, but it appears that it is being used in a Bayesian context, so I didn’t know whether the frequency distribution was somehow being displaced by a Bayesian probability density, or what.

    I’ll look some more but I admit I got a bit lost trying to navigate Pritchard’s symbolic representation. My statistical education is only undergrad level but I did some work with Markov Chains back in the 80s.

    • I’m not fully sure myself. My understanding is that the MCMC algorithm is simply sampling the posterior distribution, so as to be able to make parameter estimates from it, like the mean or mode or whatever.

      You wouldn’t be the first one to get lost trying to follow the description of Bayesian methods. Bayesian methods are largely opaque to most people, whereas ML is clear and understandable, and logically coherent. I think most people just nod their heads when Bayesianism is discussed, but they have little idea what’s really going on.
