What’s complex and what’s simple in an exponential model?

In the post on estimating the rate of spread in the current ebola epidemic, a commenter stated that using a monthly rate of disease spread in Liberia was a “simpler” model than what I had done, which was based on a daily rate. This is not correct and I want to clarify why here.

In fact I used a very simple model–an exponential model, which has the form y = b^(ax). You can’t get any simpler than a one-parameter model, and that fact doesn’t change just because you alter the value of the base b. Any base can model an exponential increase; changing it just requires a corresponding change in the parameter a, for a given pair of y and x variables. The choice of base ought to be made in a way that carries some meaning. For example, if you’re inherently interested in the doubling time of something, then 2 is the logical choice*. But when no particular base value is obvious, it’s still best if the value used carries meaning in terms of the values of x, i.e. where a = 1.0, presuming that x is measured on some scale that has inherent interest. In my case, that’s the per-day increase in ebola cases.
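
To make the “any base works” point concrete, here is a minimal sketch in Python (the doubling-every-three-days curve is just a hypothetical example, not anything fitted to data):

```python
import numpy as np

# The same exponential growth written with two different bases.
# y = b**(a*x) is unchanged when the base is swapped, provided the
# coefficient a is rescaled by the appropriate logarithm.
x = np.arange(0, 10)                   # time in days (hypothetical)

y_base2 = 2.0 ** (x / 3.0)             # doubling every 3 days, written with base 2
a_e = np.log(2.0) / 3.0                # equivalent coefficient for base e
y_base_e = np.exp(a_e * x)             # same curve, written with base e

assert np.allclose(y_base2, y_base_e)  # identical values; only the bookkeeping differs
```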

However, if you fit an exponential model to some data, most programs will use a base of e (~2.718) or 10 by default; the base is fixed and the rate of change is then determined with respect to the units of ax. That’s a bit backwards frankly, but not a big deal, because the base used can easily be converted to whatever base is more meaningful relative to the data at hand. Say, for example, that your model-fitting procedure gives y = e^(3.2x), where b = e and a = 3.2. But if your x variable is recorded in, say, days, you may well not be interested in how y changes every 3.2 days: you want to know the per-day rate of change. Well, y = e^(ax) is simply y = (e^a)^x, and so in this case b = e^3.2 = 24.5; it takes a larger base to return a given y value when the exponent is smaller. It’s just a straight mathematical transformation (e^a), where a is whatever value is returned by the exponential model fitting. It has nothing to do with model complexity. It has to do instead with scaling, ease of interpretation and convenience.
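
Here is that conversion as a short sketch, using the illustrative fitted value a = 3.2 from above and assuming x is measured in days:

```python
import numpy as np

a = 3.2                        # coefficient from a hypothetical fit of y = e**(a*x)
per_day_base = np.exp(a)       # b = e**a, the per-day multiplicative factor
print(round(per_day_base, 1))  # ~24.5: y is multiplied by about 24.5 each day

# Sanity check: e**(a*x) and (e**a)**x are the same function of x
x = np.linspace(0.0, 5.0, 6)
assert np.allclose(np.exp(a * x), per_day_base ** x)
```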

The relevance to the ebola transmission rate modeling and the original comment is that those rates could very well change within a month’s time due to radical changes in the population’s behavior (critical), or perhaps drug availability (unlikely in this case). In a disease epidemic what happens from day to day is critical. So you want to use a time scale that allows you to detect system changes quickly, while (in this case) also acknowledging the noise generated by the data reporting process (which complicates things and was the whole point of using loess to smooth the raw data before making the estimates). Note that I’ve not gone into the issue of how to detect when an exponential growth rate has changed to some other type of growth. That’s much more difficult.
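
For concreteness, here is a rough sketch of the smooth-then-estimate idea mentioned above. It is not the original analysis: the case counts are synthetic, the noise level and lowess span (frac) are arbitrary choices, and statsmodels’ lowess stands in for whatever loess implementation was actually used.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
days = np.arange(60)
true_cases = 50 * np.exp(0.05 * days)                    # hypothetical 5%-per-day growth
reported = true_cases * rng.normal(1.0, 0.1, days.size)  # multiplicative reporting noise

# Smooth on the log scale, where exponential growth is a straight line
log_smooth = lowess(np.log(reported), days, frac=0.3, return_sorted=False)

# Per-day growth factor from the slope of the smoothed log series
a_hat = np.polyfit(days, log_smooth, 1)[0]
print(f"estimated per-day factor: {np.exp(a_hat):.3f}")  # should be near exp(0.05) ~ 1.05
```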

*Exponential functions are also useful for analyzing outcomes of trials with categorical variables, where a = 1 and b is the number of possible outcomes of each trial. For example, y = 2^25 gives the total number of possible outcome sequences from 25 trials of an event having two possible outcomes. But that’s a different application than modeling a change rate (unless you want to consider the increase in the number of possible sequences a rate).

4 thoughts on “What’s complex and what’s simple in an exponential model?”

  1. I understood parts of this 😉

    What happens when a week or more goes by between updated numbers from official sources? Obviously the rate of change is harder to track… But is there any particular way(s) that it’s likely to be tracked incorrectly with less “resolution” in the input data?

    Thanks!

    • I’m glad it wasn’t a total failure. An abrupt change in how the symptomatic are cared for would be one, although it wouldn’t manifest itself for a week or two, assuming that to be the length of the asymptomatic period. So roughly weekly resolution data is probably about optimal, although we really do not know what’s going on with the data collecting and reporting schedule. So far, six days has been the longest interval.

  2. OK, Jim, I will bite. Mine was simpler because:

    1) it took me one sentence to describe and enumerate the process, calculation, & model vs your longer and involved explication & process
    2) I used the empirical smoothing process provided, with one-month nibbles of the actual data (for better & worse), rather than adding another step calculating a daily smoothing. After all, symptoms display in 2–21 days, the data rarely comes in daily, and so on.
    3) There is no need for more accuracy if the aim is a rough estimate of when the epidemic might become beyond control and my model process naturally dropped into actual and easily understood dates.

    Given the 2–4x potential variance of the data, who cares about non-existent accuracy? I will stipulate all of the various potential better micro-tracking, potential errors, better granularity, etc. You are certainly correct in that regard and there is a long history of how it is normally done. But a simpler process overall is simpler. Not better. Both have their place. Mine ain’t gonna be published, Jim. Don’t confuse the good-enuf with the better.

    BTW, great url for actual daily reports (& lots more) from the countries involved given to WHO:
    http://crofsblogs.typepad.com/h5n1/

    Best wishes, buddy.

    • I think you’ve missed some points, Bob. One of the main ones is that I’m trying to describe the process of how one might go about estimating R, not just making the estimate itself. On this blog, that’s the main science objective, whether it has to do with tree rings, baseball analysis or ebola. It’s mainly about the analytical process, not the result. The reason science gets itself into trouble (when it does)–even within its own community, let alone with the general public–is often because it pays insufficient attention to analytical process, in my experience.

      To the issue, again, if you use monthly resolution data, you’ve got no way of detecting a change in the signal (infection rate) on any time scale less than that. Or more exactly, a change of a given magnitude is harder to discern. A main point here was to derive an estimate of R zero, and a monthly scale is definitely wrong for that; people are only infectious for 6-12 days at most. Furthermore, if you use cumulative cases as you did, instead of new cases, it becomes mathematically more and more difficult to discern a given change in the infection rate as time increases, due to the asymptotic nature of slopes of exponential functions. Cumulative totals are not as sensitive to the infection rate dynamics. That’s less of a problem when using recent rates, because the report by report numbers are freer to vary–and that’s assuming you smooth out the variance arising from reporting timeline issues. Your approach combines these two problems.
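
      A rough numerical illustration of that last point (not the original analysis; the rates and the day-30 change point are made up):

      ```python
      import numpy as np

      # Hypothetical per-day growth that drops from 5% to 2% at day 30
      days = np.arange(60)
      rate = np.where(days < 30, 0.05, 0.02)
      new_cases = 10 * np.exp(np.cumsum(rate))  # daily new cases under the changing rate
      cumulative = np.cumsum(new_cases)

      # Day-over-day ratios around the change point: the new-case series shifts
      # immediately at day 30, while the cumulative series drifts toward the
      # new rate only slowly.
      print(np.round(new_cases[29:34] / new_cases[28:33], 3))
      print(np.round(cumulative[29:34] / cumulative[28:33], 3))
      ```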

      You also keep referring to this supposed 2-4x underestimate of case and death rates as if it’s some kind of established fact; it isn’t. Even in the WHO “roadmap” report that you pulled that figure from, it was given only as a rough guess, nothing more. I have no doubt that cases/deaths are going unreported because of the fear factor involved, but we do not know how many. And I doubt that anybody really does.

      Yes, that’s a good blog, I’ve looked at it a couple of times though not at any daily reports; thanks for the link.
