# On clustering, part one

In ecology and other sciences, grouping similar objects together for further analytical purposes, or just as an end in itself, is a fundamental task, one accomplished by cluster analysis, one of the most fundamental tools in statistics. In all but the smallest sample sizes, the number of possible groupings very rapidly gets enormous, and it is necessary therefore to both (1) have some way of efficiently avoiding the vast number of clearly non-optimal clusters, and (2) choosing the best solution from among those that seem at least reasonable.

First some background. There are (at least) three basic approaches to clustering. Two of these are inherently hierarchical in nature: they either aggregate individual objects into ever-larger groups (agglomerative methods), or successively divide the entire set into ever smaller ones (divisive methods). Hierarchical methods are based on a distance matrix that defines the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one’s definitions of “distance”. This distance matrix increases in size as a function of (n-1)(n/2), or roughly a squared function of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which typically the average scientist does not.

The k-means clustering algorithm works differently–it doesn’t use a distance matrix. Instead it chooses a number of random cluster starting points (“centers”) and then measures the distance to all objects from those points, and agglomerates stepwise according to which objects are closest to which centers. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then, the best solution is not guaranteed. Furthermore, one sets the number of final clusters desired beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all possible cluster numbers that one deems reasonable, with “reasonable” often being less than obvious.

When I first did a k-means cluster analysis, years ago, I did it in SPSS and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed to be a pretty major shortcoming, possibly inexcusable. Now I’m working in R, and I find…the same thing. In R, the two workhorse clustering algorithms, both in the main stats package are kmeans and hclust, corresponding to k-means and hierarchical clustering, respectively. In neither method is the probability of the solution given as part of the output. So, it wasn’t just SPSS–if R doesn’t provide it, then it’s quite possible that no statistical software program (SAS, S-Plus, SigmaStat, etc.) does so, although I don’t know for sure.

There is one function in R that attempts to identify what it calls the “optimal clustering”, function optCluster in the package of the same name. But that function, while definitely useful, only appears to provide a set of different metrics by which to evaluate the effectiveness of any given clustering solution, as obtained from 16 possible clustering methods, but with no actual probabilities attached to any of them. What I’m after is different, more defensible and definitely more probabilistic. It requires some careful thought regarding just what clustering should be all about in the first place.

If we talk about grouping objects together, we gotta be careful. This piece at Variance Explained gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from any data set, but the result doesn’t necessarily mean anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place say, 0 to 3 in one group and 4 to 9 in the other, or similar outcome (which they conceivably might–I haven’t tested them), the main point remains: there are no “natural” groupings in those ten numbers–they are as evenly spaced as is possible to be, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you’ve got the set {0,1,2,7,8,9} it’s quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of the means thereof, and with an obvious gap between the two.

This point is critical, as it indicates that we should seek a clustering evaluation method that is based in an algorithm capable of making this discrimination between a perfect gradient and tightly clustered data. Actually it has to do better than that–it has to be able to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will have a natural degree of clustering by definition, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about doing this, but the one that is most directly obvious and relevant is based on the Poisson distribution. The Poisson defines the expected values in a set of sub-samples, given a known value determined from the entire object collection, for the variable of interest. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values for each of the n groups resulting from a given clustering algorithm (of any method), follow the expectation defined by the Poisson distribution determined by that overall mean (the Poisson being defined by just one parameter). The lower that probability is, the more likely that the clusters returned by the algorithm do in fact represent a real feature of the data set, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Now maybe somebody’s already done this, I don’t know, but I’ve not seen it in any of the statistical software I’ve used, including R’s two workhorse packages stats and cluster.

More hideous detail to come so take cover and shield your eyes.

# How not to do it

This is a long post. It analyzes a paper that recently appeared in Nature. It’s not highly technical but does get into some important analytical subtleties. I often don’t know where to start (or stop) with the critiques of science papers, or what good it will do anyway. But nobody ever really knows what good any given action will do, so here goes. The study topic involves climate change, but climate change is not the focus of either the study or this post. The issues are, rather, mainly ecological and statistical, set in a climate change situation. The study illustrates some serious, and diverse problems.

Before I get to it, a few points:

1. The job of scientists, and science publishers, is to advance knowledge in a field
2. The highest profile journals cover the widest range of topics. This gives them the largest and most varied readerships, and accordingly, the greatest responsibilities for getting things right, and for publishing things of the highest importance
3. I criticize things because of the enormous deficit of critical commentary from scientists on published material, and the failures of peer review. The degree to which the scientific enterprise as a whole just ignores this issue is a very serious indictment upon it
4. I do it here because I’ve already been down the road–twice in two high profile journals–of doing it through journals’ established procedures (i.e. the peer-reviewed “comment”); the investment of time and energy, given the returns, is just not worth it. I’m not wasting any more of my already limited time and energy playing by rules that don’t appear to me designed to actually resolve serious problems. Life, in the end, boils down to determining who you can and cannot trust and acting accordingly

For those without access to the paper, here are the basics. It’s a transplant study, in which perennial plants are transplanted into new environments to see how they’ll perform. Such studies have, at least, a 100 year history, dating to genetic studies by Bateson, the Carnegie Institute, and others. In this case, the authors focused on four forbs (broad leaved, non-woody plants), occurring in mid-elevation mountain meadows in the Swiss Alps. They wanted to explore the effects of new plant community compositions and T change, alone and together, on three fitness indicators: survival rate, biomass, and fraction flowering. They attempted to simulate having either (1) entire plant communities, or (2) just the four target species, experience sudden temperature (T) increases, by moving them downslope 600 meters. [Of course, a real T change in a montane environment would move responsive taxa up slope, not down.] More specifically, they wanted to know whether competition with new plant taxa–in a new community assemblage–would make any observed effects of T increases worse, relative to those experienced under competition with species they currently co-occur with.

Their Figure 1 illustrates the strategy:

Figure 1: Scenarios for the competition experienced by a focal alpine plant following climate warming.
If the focal plant species (green) fails to migrate, it competes either with its current community (yellow) that also fails to migrate (scenario 1) or, at the other extreme, with a novel community (orange) that has migrated upwards from lower elevation (scenario 2). If the focal species migrates upwards to track climate, it competes either with its current community that has also migrated (scenario 3) or, at the other extreme, with a novel community (blue) that has persisted (scenario 4).

There are many things I don’t understand regarding how various folks do things, and here’s a good example. It falls, to my mind, within what Brian McGill at the Dynamic Ecology blog last year called “statistical machismo”: the use of statistical methods that are more complicated than necessary in order to appear advanced or sophisticated, when these are either unnecessary (but trendy), or worse, clearly not the best choice for the problem at hand. “Cutting edge”, their practitioners like to think of them, but I’ve got no problem in calling them sophistry that stems from a lack of real statistical understanding, combined with a willingness to do whatever will maximize the chances of getting published, of which there rarely seems to be a shortage.

I’ve had, for some time, a growing suspicion that much of the use of Bayesian statistical methods in science falls pretty squarely in this category. That of course doesn’t mean I’m right, especially since I do not fully understand everything about modern Bayesian methods, but I get the basic ideas, and the following example is a good illustration of why I think that way. It relates to my recent cogitations on the design of a general, model-based partitioning (clustering) algorithm for a common type of categorical data: data in which each sample is represented by only a small fraction of the total number of categories. In such cases, clear associations between the various categories is far from obvious.

I started thinking about the issue in relation to the estimation of forest tree community types in some widespread and very important historical tree data sets, where each sample contains individuals from, at most, only two to four tree taxa (usually, species), when there may be upwards of 10 to 30 such taxa in the population over some large landscape area (earlier post on this topic here) However, the issue has by far its greatest application in the field of population genetics, specifically w.r.t. the goal of identifying cryptic population structure–that is identifiable groups of individuals who are breeding primarily or entirely among themselves (“demes”), leading to allele and genotype frequencies that vary characteristically from deme to deme, demes which are not otherwise readily identifiable by external phenotypic characters. These are groups involved in the first step on the road to “incipient species”, to use Darwin’s phrase. The similarity with the tree data is that at each gene locus for any given diploid individual–which represents our sample–you have only two alleles, even though many such may occur in some larger, defined population.

In 2000, Pritchard et al. published what would have to be considered a landmark study, given that it’s been cited nearly 14,000 times since. This comes out to about 2.5 citations per day; I wouldn’t have guessed that so many popgen papers were even published at that kind of rate. The paper introduces a method and program (“STRUCTURE”) for the above-stated task, one based on Bayesian techniques, using Markov Chain Monte Carlo (MCMC), which is an iterative method for estimating parameters of the posterior distribution when no analytical techniques, or approximations thereof, are available. Furthermore, the paper has spawned several spin-off papers introducing various modifications, but all based on the same basic Bayesian/MCMC approach. And each of those has gotten hundreds to thousands of citations as well.

I freely admit that I am an unabashed maximum likelihood (ML) type of thinker when it comes to statistical inference and model selection. I’m far from convinced that Bayesianism offers any clear, definitive advantage over ML methods, while appearing to be saddled with some complex, time-consuming and uncertain estimation techniques (like MCMC), which is most definitely a disadvantage. To my view, Bayesianism as a whole might well fall within Brian’s machismo category, at least as employed in current practice, if not in its fundamental tenets. I very much doubt that many people who use it do so for any reason other than that a lot of others are using it, and so they just go with the flow, thinking all is kosher. Scientists do that a lot, at least until they develop their own understanding of the issues.

As I was thinking through the problem, it seemed to me pretty clear that, although a strict analytical solution was indeed not possible, one based on a ML approach, as heavily guided by expectations from binomial/multinomial probability and divisive clustering, was the way to go. Indeed, I can’t see any other logical and algorithmically efficient way to go about solving this type of problem. The underlying goal and assumptions remain the same as Pritchard et al’s, namely to find groups that approximate Hardy-Weinberg equilibrium, and which therefore represent approximately randomly mating groups. And there is also still a “Monte Carlo” procedure involved, but it’s quite different: always guided by a definite strategy, and much less intense and random than in the Bayesian/MCMC approach. As far as I can tell, nobody’s taken this approach (although I just found an Iowa State student’s dissertation from last year that might), and I don’t know why. I thought it was recognized that defaulting to a uniform (i.e. uninformative) prior probability distribution–because you really have no idea otherwise, or worse, when the idea of some “prior distribution” doesn’t even make sense to begin with–and you have quite a few parameters to estimate, that MCMC algorithms can be very slow to converge (if at all), and to do so to potentially unstable estimates at that. But that’s exactly what the authors did, and there are other limitations of the approach also, such as having to constrain the total number of possible demes to begin with–presumably because the algorithm would choke on the number of possible solutions otherwise.

These are the kinds of things I run into far more often than is desirable, and which generate a certain mix of confusion, frustration and depression. If I keep working on this topic–which I find fascinating and which, being statistical, generalizes to different fields, but which I really don’t have time for–I’ll post more details about the two approaches. The fan mail has been clamoring for posts on this topic. Or ask questions if you’re entirely desperate for something to do while waiting for the Final Four to go at it.

# Transient temperature change estimation under varying radiative forcing change scenarios, part two

Continuing from part one, this post looks at a specific method for estimating TCS (transient climate sensitivity), for any desired year and/or radiative forcing scenario, as predicted by any AOGCM climate model. And some associated topics.

The basic idea was devised by Good et. al (2010, 2011; links at end), and expanded upon by Caldeira and Myhrvold (2013), who fit various equations to the data. The basic idea is fairly simple, but clever, and integrates some nice mathematical solutions/approximations, including Gregory’s linear regression ECS estimation method. The basic idea is simply that if you have an idealized RF pulse or “step” increase (i.e. sudden, one-time increase, as with the instant 4X CO2 (= ~7.4 W/M^2) increase experiment in CMIP5), and run any given AOGCM for say 150-300 years from that point, you can record the temperature course resulting from the pulse, over that time (which will rise toward an asymptote determined by the climate sensitivity). That asymptote will be twice the ECS value (because the CO2 pulse is to 4X, not 2X, CO2). From these data one can fit various curves describing the T trend as a function of time. One then simply linearly scales that response curve to any more realistic RF increase of interest, corresponding to a 1.0% or 0.5% CO2 increase, or whatever. Lastly, if each year’s RF increase is considered as one small pulse, an overlay and summation of the temperature responses from all such, at each year, gives each year’s estimated temperature response, for however long the RF is increasing. The RF increase does not have to stop at any point, although it can. It can also increase or decrease at any rate over time.

The figure below from the paper, illustrate the method and the comparison (Fig. 1 of paper, original caption):

Fig. 1 Illustrating the method. a Global mean temperature evolution in a 4xCO2 step experiment (from the HadCM3 GCM; CMIP5 GCMs give qualitatively similar results). b Reconstruction method for years
1–5 of a 1pctCO2 experiment. Red, yellow, green, blue and purple curves temperature responses estimated for the forcing changes in years 1, 2, 3, 4 and 5 respectively. Each coloured curve is identical (for the case of the 1pctCO2 scenario) and is given by scaling the step experiment temperature response. Black curve reconstructed temperature response, given by the sum of the coloured curves (Eq. 1a).

Good et al (2011), did this for nine AOGCMs, testing the method against the results of the CMIP5 1% per year CO2 increase experiment. This is interesting; they are testing whether the basic functional response to an instant, 400% CO2 increase, is similar to that from a 1% per year increase over 140 years. And lo and behold, the overall agreement was very high, both for the collection of models, and individually, for both surface T and heat content. Their Fig. 2 is shown below:

Fig. 2 Validation against 1 % experiment (all models). a,b Ensemble mean time-series (black GCM simulations, red simple model). c,d ensemble spread: mean over years 111–140 of the GCM simulations (y-axis) against simple model prediction (xaxis) (Each cross represents one GCM). a,c Temperature change/
K; b,d heat uptake/1022 J

To me, this result is rather astounding, as it says that the time decay of the temperature response to a pulsed RF increase, is highly similar, no matter the magnitude of that increase. That is absolutely not a result I would have expected, given that the thermodynamic interaction between the ocean and the atmosphere is highly important and seemingly not likely to be in phase. Of course, this result does not prove this dynamic to be a reality–only that the AOGCM models tested consider, via their encoded physics, that the two responses to be highly similar in form, just differing in magnitude.

Caldeira and Myhrvold (2013) then extended this approach by fitting four different equation forms and evaluating best fits, via Akaike AIC and RMSE criteria. To do this they first used the Gregory ECS estimation method (ref at end) to define the temperature asymptote reached. They don’t give the details of their parameter estimation procedure, which must be some type of nonlinear optimization (and hence open to possible non-ML solutions), since the equation forms they tested were three (inverted) negative exponential forms and one other non-linear form (based on heat diffusion rates in the ocean). They also don’t provide any R^2 data indicating variance accounted for, but their figures (below) demonstrate that for all but one of their model forms (a one-parameter, inverted negative exponential) the fits are extremely good (and extremely similar) across most of the AOGCMs used in CMIP5:

Figure 2. Temperature results for CMIP5 models that have performed the abrupt4xCO2 simulations (black dots). Also shown are fits to this data using the functions described in the text: θ1-exp, green; θ2-exp, blue; θ3-exp, brown; θ1D, red. The left vertical axis shows the fraction of equilibrium temperature change (i.e., ΔT/ΔT4×); the right vertical axis indicates the absolute change in global mean temperature. Fit parameters are listed in SOM tables S3–S5 (available at stacks.iop.org/ERL/8/034039/mmedia).

Figure 5. Results from CMIP5 models (black dots) running simulations of the 1pctCO2 protocol. Projections made by simulations based on curve fits to the abrupt4xCO2 simulations as described in the text: θ1-exp, green; θ2-exp, blue; θ3-exp, brown; θ1D, red. All but θ1-exp provide similar approximations to the temperature results for most of the fully coupled, three-dimensional climate model simulations. Note that the GFDL-ESM2G and GFDL-ESM2M models did not continue with increasing atmospheric CO2 content after reaching twice the preindustrial concentration.

So, both Good et al. (2011, 2012), and Caldeira et al. (2013) provide strong evidence that the physical processes involving surface temperature change, as encoded in AOGCMs, are likely very similar across extremely widely varying radiative forcing increases per unit time, from unrealistically huge, to (presumably) however small. Note that in both cases, a very large percentage (roughly, 40-60%) of the total temperature response (at equilibrium) occurs within the first decade (when normalized to the pulse magnitude). This seems to have implications for the importance of various feedbacks, an issue which is complicated by the fact that some of the models tested are Earth System Models, which include e.g. integrated carbon cycle feedbacks, while others do not. Certainly there will be major potential differences in carbon cycle feedbacks between an earth surface that has just increased 3 degrees C, instantly, versus one that has warmed only a tiny fraction of that amount.

TBC; the next post will demonstrate application to various delta RF scenarios.

Refs:

Caldeira and Myhrvold, (2013). Projections of the pace of warming following an abrupt increase in atmospheric carbon dioxide concentration. Environ. Res. Lett. 8: 034039, doi:10.1088/1748-9326/8/3/034039.
Good et al., (2011). A step‐response simple climate model to reconstruct and interpret AOGCM projections. GRL, 38, L01703, doi:10.1029/2010GL045208
Good et al., (2012). Abrupt CO2 experiments as tools for predicting and understanding CMIP5 representative concentration pathway projections. Climate Dynamics (2013) 40:1041–1053 DOI 10.1007/s00382-012-1410-4
Gregory et al., (2004) A new method for diagnosing radiative forcing and climate sensitivity. doi:10.1029/2003GL018747

See also: Hooss et al., (2001). A nonlinear impulse response model of the coupled carbon cycle-climate system (NICCS). Climate Dynamics 18.3-4: 189-202.

# Transient temperature change estimation under varying radiative forcing change scenarios, part one

The estimation of transient climate sensitivity (TCS, defined below) has been in the back of my mind since writing a couple of posts a couple of months ago (here and here) on expected future global mean temperatures over this century. This post, and the one to follow, is thus a methods oriented post resulting from that thought process investigation. This one just introduces the basics of the problem and in the next one I’ll get into methods.

I use TCS here to refer to the realized, global mean, surface temperature change, at any given time point, resulting from a change in radiative forcing (RF) up to that point, regardless of whether either the thermal, or radiation, environments have re-equlibrated in response to this forcing change or not. It is a generalization of the transient climate response (TCR), which is defined as the expected mean surface temperature change at t = 70 years, of a 1% per year CO2 increase. Such a rate gives a CO2 doubling (1.01^69.66 = 2), and since CO2 RF is well-approximated by a logarithmic function of the CO2 concentration ratio at two time points, this results in a constant annual RF change rate (= 5.35 * ln(CO2.2/CO2.1)/70 = 0.053 W/m^2/yr). So, TCS is just a generalization of TCR, in that the time span needn’t be exactly 70 years, nor the forcing rate exactly 0.053 W/m^2/yr. Linear scaling, based on other delta RF rates, is allowed, but the reference time should be within, say, a couple decades or so of 70 years. In the CMIP5 climate model experiments, which form the input to the IPCC AR5 report, the 1% increase is extended over 140 years, reaching 4X CO2 (from pre-industrial), and the transient response at that point is simply divided by 2, to estimate TCR as just defined.

Although the concept itself is straight-forward, TCS estimation from empirical data is not, because of the several important time delays and/or feedbacks, not to mention forcing agents, in the climate system, for which the available data are not sufficient to this highly important task. Generally, global mean time series output for idealized, modeled RF scenarios is thus required, in particular the 1% per year, and instantaneous 400%, CMIP5 CO2 increase scenarios. For whatever reason, the annual time series output for these, and more importantly for the four more realistic Representative Concentration Pathway (RCP) scenarios analyzed, are rarely reported. Why this is so baffles me; it’s not hard and the AR5 seemingly should have done it, but whatever, I’m not in charge. To get them, one thus has to analyze each climate model’s raw output data. Finding these data, downloading them, aggregating them from native temporal and spatial scales to obtain yearly global means, etc., is a time-consuming process, and one requiring a fair bit of programming; it’s a lot of work. But useful and important work.

The equilibrium climate sensitivity (ECS) is the temperature change expected after this same RF increase (3.7 Watts/m^2, allowing for stratospheric/tropospheric adjustment) has been imposed, but only after both radiation and temperatures have reached equilibrium. What’s missing from most reported ECS estimates however, is the time scale over which the full temperature increase is realized. In a few cases, model-estimated ECS time scales have been determined by running models having lower spatio-temporal resolutions than typical AOGCMS, for one to five thousand years, to equilibrium. But that costs a lot of supercomputer computing time, and full resolution runs cost even more, so most often it is computed from shorter AOGCM model runs (~ 150 to 300 years). An important method for so doing involves linear regressions of delta T on the planetary, top-of-atmosphere radiative flux, extrapolated to the point where that flux is estimated to be zero: the so-called “Gregory method”, after it’s originator. ECS estimates vary, with the consensus central tendency value, for 35 years or more now, being estimated at around 3.0 degrees C, with a likely range between 1.5 and 4.5. But that issue is not the point of these posts.

But ECS, which while important, is not fully realized for several hundred years after the cessation of a RF increase, and why it should receive (typically) more attention than estimates for the next few decades, is another big puzzler (albeit one that CMIP5 addressed directly with its decadal forecasts). However, the main point of this post is that the idealized CMIP5 experiments mentioned above can be used to predict the annual time series of the expected warming for any imposed, realistic RF change, even though the idealized experiments are themselves decidedly unrealistic. An instantaneous 4X CO2 increase is obviously wildly unrealistic–nobody’s ever argued such a thing could actually happen short of some planetary natural disaster scenario. Even the 1% per year increase from pre-industrial for 70, or even 140, years is clearly too high; even from a 1950 baseline the mean annual CO2 increase has been only (400/310)^(1/64) = 1.004, or 0.4% per year. Only in the last couple decades has it exceeded 0.5% per year, although it’s certainly possible to hit 1% per year in the near future, from either a continuing increase in emission rates, decrease in aerosol production rates, strong carbon cycle feedback (or forcing, via land cover changes), or some combination of these. [Edit:referring to the equivalent RF here, not necessarily via CO2 increases.]

By any account, whether purely scientific or as policy input information, the estimated TCS for any given year, i.e. a time series, is an important thing to know, (of higher practical significance than is knowing ECS and/or it’s time scale, I would argue). I haven’t seen it commonly estimated however; in the IPCC AR5 report for example, just the TCR and ECS are reported, and decadal resolution estimates for the four RCP scenarios, in which several forcing agents are changing simultaneously, including various well-mixed greenhouse gases, non well-mixed atmospheric agents (e.g. aerosols, surface ozone), land cover, surface albedo, and sometimes other things.

TBC. Fire away if you have any comments/questions. I’ll do my best to answer the easy ones and dodge or obfuscate on the hard ones.

# What’s complex and what’s simple in an exponential model?

In the post on estimating the rate of spread in the current ebola epidemic, a commenter stated that using a monthly rate of disease spread in Liberia was a “simpler” model than what I had done, which was based on a daily rate. This is not correct and I want to clarify why here.

In fact I used a very simple model–an exponential model, which has the form y = b^ax. You can’t get any simpler than a one parameter model, and that fact doesn’t change just because you alter the value of the base b. Any base can model an exponential increase; changing it just requires a corresponding change in parameter a, for a given pair of y and x variables. Base choice ought to be done in a way that carries some meaning. For example, if you’re inherently interested in the doubling time of something, then 2 is the logical choice*. But when no particular base value is obvious, it’s still best if the value used carries meaning in terms of the values of x, i.e. where a = 1.0, presuming that x is measured on some scale that has inherent interest. In my case, that’s the per-day increase in ebola cases.

However, if you fit an exponential model to some data, most programs will use a base of e (~2.781) or 10 by default; the base is fixed and the rate of change is then determined with respect to the units of ax. That’s a bit backwards frankly, but not a big deal, because the base used can easily be converted to whatever base is more meaningful relative to the data at hand. Say for example, that your model fitting procedure gives y = e^(3.2x), where b = e and a = 3.2. But if your x variable is recorded in say, days, you may well not be interested in how y changes every 3.2 days: you want to know the per-day rate of change. Well, y = e^(ax) is simply y = (e^a)^x, and so in this case b = e^(3.2) = 24.5; it takes a larger base to return a given y value if the exponent is smaller. It’s just a straight mathematical transformation (e^a), where a is whatever value is returned in the exponential model fitting. It has nothing to do with model complexity. It has instead to do with scaling, ease of interpretation and convenience.

The relevance to the ebola transmission rate modeling and the original comment is that those rates could very well change within a month’s time due to radical changes in the population’s behavior (critical), or perhaps drug availability (unlikely in this case). In a disease epidemic what happens from day to day is critical. So you want to use a time scale that allows you to detect system changes quickly, while (in this case) also acknowledging the noise generated by the data reporting process (which complicates things and was the whole point of using loess to smooth the raw data before making the estimates). Note that I’ve not gone into the issue of how to detect when an exponential growth rate has changed to some other type of growth. That’s much more difficult.

*Exponential functions are also useful for analyzing outcomes of trials with categorical variables, a where a = 1 and b defines the number of possible outcomes of some repeated process. For example y = 2^25 gives the total number of possible permutations of 25 trials of an event having two possible outcomes. But that’s a different application than modeling a change rate (unless you want to consider the increase in the number of possible permutations a rate).

# Estimating the spread rate in the current ebola epidemic

I’ve now written several articles on the West African ebola outbreak (see e.g. here, here, here, and here). This time I want to get more analytical, by describing how I estimated the ebola basic reproduction rate Ro (“R zero”), i.e. the rate of infection spread. Almost certainly, various people are making these estimates, but I’ve not seen any yet, including at the WHO and CDC websites or the few articles that have come out to date.

Some background first. Ro is a fundamental parameter in epidemiology, conceptually similar to r, the “intrinsic rate of increase”, in population biology. In epidemiology, it’s defined as the (mean) number of secondary disease cases arising from some primary case. When an individual gets infected, he or she is a secondary case relative to the primary case that infected him or her, and in turn becomes a primary case capable of spreading the disease to others. It’s a lineage in that respect, and fractal. I’ll refer to it simply as R here.

The value of R depends strongly on the biology of the virus and the behavior of the infected. It is thus more context dependent than the r parameter of population biology, which is an idealized, or optimum, rate of population growth determined by intrinsic reproductive parameters (e.g. age to reproductive maturity, mean litter size, gestation time). Diseases which are highly contagious, such as measles, smallpox and the flu, have R values in the range of 3 to 8 or even much more, whereas those that require direct exchange of body fluids, like HIV, have much lower rates.

To slow an epidemic of any disease it is necessary to lower the value of R; to stop it completely, R must be brought to zero. Any value of R > 0.0 indicates a disease with at least some activity in the target population. When R = 1.0, there is a steady increase in the cumulative number of cases, but no change in the rate of infection (new cases per unit time): each infected person infects (on average) exactly 1.0 other person before either recovering or dying. Any R > 1.0 indicates a (necessarily exponential) increase in the infection rate, that is, the rate of new cases per unit time (not just the total number of cases), is increasing.

# Step one

The hardest part about gaining any new idea is sweeping out the false idea occupying that niche. As long as that niche is occupied, evidence and proof and logical demonstration get nowhere. But once the niche is emptied of the wrong idea that has been filling it — once you can honestly say, “I don’t know,” then it becomes possible to get at the truth.

Heinlein, R. A. 1985. The Cat Who Walks Through Walls: A Comedy of Manners, p. 230. G.P. Putnam’s Sons, New York. In: Gaither, C.C. and Cavazos-Gaither, A.E., 2008, Gaither’s Dictionary of Scientific Quotations, Springer.

# Is Popper responsible for this mess?

OK, admittedly this is a bit of a weird post, but otherwise I’d have to be actually working.

It’s just a question post really, because admittedly I’ve read very little of Karl Popper’s writings, and whatever little that was, it was a long time ago. I just know what everybody in science “knows”: he’s the “falsification” guy. That is, he reportedly believes that scientific advancement comes mainly via testing hypotheses (ideas, concepts, theories, call ’em whatever you like as far as I’m concerned) and then assessing whether the hypothesis withstood the test successfully or not. If it didn’t, chuck it and come up with another one; if it did, test it some more and scale your confidence in it with the number (and/or stringency) of the tests it’s passed.

Hmm, well OK I guess, but it leaves me with this image in my mind of some authority figure standing over me saying “Your idea has been falsified by group X doing unequivocal test Y. Your idea fails. Now get out of here.”

Not to go all Bayesian Bandwagon on the issue, since I have serious questions about that viewpoint also, but if you’re addressing a complex question and you carefully and repeatedly add a little bit of good evidence at a time, over time, thereby eventually narrowing down the list of likeliest explanations for your observations, then you don’t really need to worry about “falsifying” anything really, do you? I mean, lay a solid foundation, then add floor one, then two, etc…. and there you go. I get the feeling Popper thinks science is a bunch of wanna-be sand castle architects running amok on the beach trying to outdo each other but without much of a clue really, but then WHOA, here comes the sand castle judge and he’s going to wreck all but one. But then maybe it is, at least in some fields. Jimi Hendrix could have written a song about it.

I think my main question really is this: did the obsession with hypothesis testing–and all the problems arising therefrom–come from following Popper’s ideas, or did Popper just describe what the hypothesis testing fanatics were already doing? Chicken and egg question really.

If this post has been unsatisfactory to you, I am willing to tell Rodney Dangerfield jokes or discuss baseball. Thanks for your attention either way.

# On baseball (finally!)

I’ve discussed no baseball here yet, which is kind of surprising, given that I’ve been a big fan all my life. I played a lot growing up, through high school and even a little in college and afterwards. If I had the time, I would likely start a blog just devoted strictly to baseball (and not just analysis either), because I have a lot to say on a lot of topics. But alas…

To me, the real interest in any sport comes from actually playing the game, not watching it, and I watch very little baseball (now) because the games are just too time consuming (though I still have a hard time refraining in October). When I do watch, I’m not obsessively analytical–that takes the fun out of it for me. It’s an athletic contest, not a statistics class; I want to see the center fielder go full speed and lay out for a catch, or a base thief challenge the pitcher or whatever, not sit there with numbers in my head. Analysis is for later, and I do like it, so I wade in at times, thereby joining the SABR-metric (or “sabermetric”) revolution of the last 3-4 decades (the Society for American Baseball Research (SABR), initiated much of this). And baseball offers endless analytical opportunities, for (at least) two reasons.

Billy Hamilton of the Cincinnati Reds

# Help crowd source better science communication methods

Recently, studies aimed at better quantifying and communicating the “consensus” on climate change have become more popular. To take advantage of the increasing monetary flow in this direction, and to advance the science even further, our institute, meaning me, have been designing a new research protocol. In the spirit of the “open science” movement, we/I thought it would be good to get some public feedback on potential flaws and possible improvements to this protocol. There are several advantages of this “crowd sourcing” approach to blog science, not the least of which is avoidance of placing a rather tacky “Tip Jar” icon on one’s home page.

What we want to know is what sort of message will really stick with people, make them think. New research has shown that communication methods are very important in this regard. For example, van der Linden et al. (2014) showed that “simple text” and pie charts are slightly superior to mixed metaphors involving doctors’ opinions regarding bridge failures. This is an important advancement that we want to build upon, leveraging the moment to effectively effect an optimal messaging paradigm that cannot be falsified.

One improvement that can be made involves how the experimental units are chosen. van der Linden et al. (2014) queried about 1000 volunteers, but these were chosen from among a “nationwide panel of people who are willing to participate in online surveys”. Those people want to be asked random questions by unknown people having unknown motives, which being abnormal, is not representative of the entire population. A better approach is to just target everybody, and a good way to do that is to confront them on the street while they are minding their own business, before they have any idea what you’re up to really.

Another issue is the treatments themselves; van der Linden et al. used a set of treatments involving pie charts, metaphors and numbers. This is nice but c’mon we put a man on the moon; we believe we can achieve more here. Our design applies less pedestrian treatments to these pedestrian experimental units, each chosen after careful thought. Our procedure is similar however. That is, we first ask what each unit believes that scientists believe about the climate, record the response, then apply the randomly chosen treatment, repeat the original question, and record the second response. Pretty simple really; the whole thing hinges on the treatments, which are:

Treatment 1:
The unit is shown a pie chart with the AAAS logo below it, indicating that 97 percent of scientists believe that climate change is real.

Treatment 2:
The unit is shown a pie chart with images of kittens and Jesus below, with statement as above.

Treatment 3:
The unit is shown a rerun of an old Sesame Street episode featuring the numbers 9 and 7, in sequence, over a backdrop picture of a hurricane.

Treatment 4:
The unit is informed that only Australian Aborigines and Death Row inmates are unaware that 97 of scientists believe that climate change is real.

Treatment 5:
Free dinner and beer at a nice local pub is promised to the unit for all answers over 95 regarding what percentage of scientists believe in climate change.

Treatment 6:
“97% consensus” and “Mother” are tatooed prominently on the unit’s right inner forearm.

Treatment 7:
The unit’s face is situated ~ 0.3 meters proximate to the front end of an OSHA-certified megaphone and unit is informed three times, at the “riot control” setting, that 97 percent of scientists believe in climate change.

Treatment 8:
Justin Verlander is placed approximately 60.5 feet from unit, facing, and delivers a ~97 mph fastball to unit’s upper left rib cage quadrant, while yelling “Get a real-time feeling for what 97 is all about partner”

Many more treatments than this are possible of course. For example we can certainly improve upon the “Indirectness Factor” (IF) one or more steps by asking people what they think other people think scientists believe about the climate, what they think they would think if exposed to a particular treatment, and so forth. There is a rich garden for potential studies following this path.

Thank you in advance for any contributions to the science that you may have, the world will be a better place for it. If you would like to donate \$1000 or more that would be fine as well.

# Against the grain…

Scientific thinking, which is analytic and objective, goes against the grain of traditional human thinking, which is associative and subjective.
Alan Cromer, American physicist and educator

I am going to have a sign put up all over my plant, reading “There is no expedient to which a man will not resort to avoid the real labor of thinking.
Thomas Edison

…one result of unimaginative, mechanistic thinking was that societies eventually ceased to burn people at the stake for witchcraft.
George Sudarshan, Indian physicist

Refs:
Cromer, A., 1993. Uncommon Sense: The Heretical Nature of Science. Oxford University Press, New York.
Runes, D. D., 1948. The Diary and Sundry Observations of Thomas Alva Edison, p. 167. Philosophical Library, New York.
Sudarshan, G., 1998. Doubt and Certainty: The Celebrated Academy: Debates on Science, Mysticism, Reality, in General on the Knowable and Unknowable, Fourth Debates. Perseus Books, Reading, MA.

Yep, all once again taken from: Gaither, C.C. and Cavazos-Gaither, A.E., 2008, Gaither’s Dictionary of Scientific Quotations, Springer.

I’m having a blast reading this thing so bear with me.

# Accidental truth

The scientific spirit is of more value than its products, and irrationally held truths may be more harmful than reasoned errors.
Thomas Henry Huxley.

Accidental truth of a conclusion is no compensation for erroneous deduction.
Arthur Eddington

The very truth, and the nature of things, though repudiated and ordered into exile, sneaked in again through the back door, to be received by me under an unwonted guise.
Johannes Kepler

Refs:
Huxley, T.H., 1904. The Coming of Age of “The Origin of Species” p. 229. Macmillan & Company Ltd. London.
Eddington, A.S., 1921. Space, Time and Gravitation: An Outline of the General Relativity Theory, p. 29. The University Press. Cambridge, England.
Donahue, W.H., 1992. New Astronomy Part IV, p. 575. The University Press, Cambridge, England.

All as cited in: Gaither, C.C. and Cavazos-Gaither, A.E., 2008, Gaither’s Dictionary of Scientific Quotations, Springer.

# To arrive at the simplest truth…

…as Newton knew and practiced, requires years of contemplation. Not activity. Not reasoning. Not calculating. Not busy behavior of any kind. Not reading. Not talking. Not making an effort. Not thinking. Simply bearing in mind what it is one needs to know. And yet those with the courage to tread this path to real discovery are not only offered practically no guidance on how to do so, they are actively discouraged and have to set about it in secret, pretending meanwhile to be diligently engaged in the frantic diversions and to conform with the deadening personal opinions which are continually being thrust upon them.

George Spencer-Brown, English mathematician

Spencer-Brown, G. 1969. Laws of Form, Appendix I, p. 110. George Allen & Unwin Ltd., London, England. Quoted in: Gaither, C.C. and Cavazos-Gaither, A.E., 2008, Gaither’s Dictionary of Scientific Quotations, Springer.