What’s complex and what’s simple in an exponential model?

In the post on estimating the rate of spread in the current ebola epidemic, a commenter stated that using a monthly rate of disease spread in Liberia was a “simpler” model than what I had done, which was based on a daily rate. This is not correct and I want to clarify why here.

In fact I used a very simple model, an exponential model, which has the form y = b^(ax). You can’t get any simpler than a one-parameter model, and that fact doesn’t change just because you alter the value of the base b. Any base can model an exponential increase; changing it just requires a corresponding change in the parameter a, for a given pair of y and x variables. The base should be chosen so that it carries some meaning. For example, if you’re inherently interested in the doubling time of something, then 2 is the logical choice*. But when no particular base value is obvious, it’s still best if the value used carries meaning in terms of the values of x, i.e. the base for which a = 1.0, presuming that x is measured on some scale that has inherent interest. In my case, that’s the per-day increase in ebola cases.
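To make the base-swapping point concrete, here is a minimal sketch in Python. The function name and the numbers (a = 0.05 per day, x = 30 days) are made up for illustration and are not taken from the ebola analysis. It shows that the same curve can be written with base e, 2, or 10 by rescaling a.

```python
import math

def rescale_exponent(a, old_base, new_base):
    """Return a2 such that new_base**(a2 * x) == old_base**(a * x) for every x."""
    return a * math.log(old_base) / math.log(new_base)

# Hypothetical curve: base e with a = 0.05 per day.
a_e = 0.05
a_2 = rescale_exponent(a_e, math.e, 2.0)    # same curve, base 2
a_10 = rescale_exponent(a_e, math.e, 10.0)  # same curve, base 10

x = 30.0  # days
y_e = math.e ** (a_e * x)
y_2 = 2.0 ** (a_2 * x)
y_10 = 10.0 ** (a_10 * x)

# With base 2, 1/a is the doubling time in the units of x.
doubling_time = 1.0 / a_2
```

All three expressions give the same y, so nothing about the model has changed; only the units in which the growth is read off are different.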

However, if you fit an exponential model to some data, most programs will use a base of e (~2.718) or 10 by default; the base is fixed and the rate of change is then determined with respect to the units of ax. That’s a bit backwards frankly, but not a big deal, because the base used can easily be converted to whatever base is more meaningful relative to the data at hand. Say, for example, that your model fitting procedure gives y = e^(3.2x), where b = e and a = 3.2. If your x variable is recorded in, say, days, you’re probably not interested in the fact that y grows by a factor of e every 1/3.2 (about 0.31) days: you want to know the per-day rate of change. Well, y = e^(ax) is simply y = (e^a)^x, and so in this case b = e^(3.2) ≈ 24.5; it takes a larger base to return a given y value if the exponent is smaller. It’s just a straight mathematical transformation (e^a), where a is whatever value is returned in the exponential model fitting. It has nothing to do with model complexity. It has instead to do with scaling, ease of interpretation and convenience.
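The conversion can be sketched as follows, in Python, with made-up data that grow by exactly 5% per day. The fit is an ordinary least-squares line through ln(y). That is one common way a program arrives at a base-e exponent, but it is an assumption here, not necessarily what any particular package does. The sketch also carries a starting-value term c, which the one-parameter form above leaves out, because real counts rarely start at 1.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = a*x + c; returns (a, c)."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    a = sxy / sxx
    c = mean_l - a * mean_x
    return a, c

# Made-up data: starts at 100 and grows by exactly 5% per day.
days = list(range(10))
counts = [100.0 * 1.05 ** d for d in days]

a, c = fit_exponential(days, counts)
per_day_base = math.exp(a)   # b = e^a: the multiplicative change per day
start_value = math.exp(c)    # fitted value at day 0

# The example from the text: a = 3.2 corresponds to a base of about 24.5.
base_from_text = math.exp(3.2)
```

The fitted a is about 0.0488, which is hard to read directly; e^a recovers the 1.05 per-day multiplier that generated the data.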

The relevance to the ebola transmission rate modeling and the original comment is that those rates could very well change within a month’s time due to radical changes in the population’s behavior (critical), or perhaps drug availability (unlikely in this case). In a disease epidemic what happens from day to day is critical. So you want to use a time scale that allows you to detect system changes quickly, while (in this case) also acknowledging the noise generated by the data reporting process (which complicates things and was the whole point of using loess to smooth the raw data before making the estimates). Note that I’ve not gone into the issue of how to detect when an exponential growth rate has changed to some other type of growth. That’s much more difficult.

*Exponential functions are also useful for analyzing outcomes of trials with categorical variables, where a = 1 and b defines the number of possible outcomes of some repeated process. For example y = 2^25 gives the total number of possible permutations of 25 trials of an event having two possible outcomes. But that’s a different application than modeling a change rate (unless you want to consider the increase in the number of possible permutations a rate).
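The counting in the footnote can be checked by brute force for a small case. This is a sketch in Python with a hypothetical helper name; it enumerates every outcome sequence for 10 two-outcome trials and compares the count with 2^10, then evaluates 2^25 directly because enumerating that many sequences is impractical.

```python
from itertools import product

def count_sequences(n_outcomes, n_trials):
    """Count outcome sequences by enumerating every one of them."""
    return sum(1 for _ in product(range(n_outcomes), repeat=n_trials))

enumerated_small = count_sequences(2, 10)  # brute-force count for 10 trials
formula_small = 2 ** 10                    # b^x with b = 2, x = 10
total_25 = 2 ** 25                         # the footnote's case, by formula only
```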

It just won’t get you there

I’ve got two little feet to get me across the mountain
Two little feet to carry me away into the woods
Two little feet
A big mountain
And a cloud coming down, a cloud a comin’ down

I hear the voice of the ancient ones
Chanting magic words from a different time
Well there is no time, there is only this rain
There is no time
That’s why I missed my plane

John Muir walked away into the mountains
With his old overcoat and a crust of bread in his pocket
We have no knowledge and so we have stuff
But stuff with no knowledge is never enough to get you there…
It just won’t get you there

Greg Brown, Two Little Feet

Estimating the spread rate in the current ebola epidemic

I’ve now written several articles on the West African ebola outbreak (see e.g. here, here, here, and here). This time I want to get more analytical, by describing how I estimated the ebola basic reproduction rate Ro (“R zero”), i.e. the rate of infection spread. Almost certainly various people are making these estimates, but I’ve not seen any yet, including at the WHO and CDC websites or the few articles that have come out to date.

Some background first. Ro is a fundamental parameter in epidemiology, conceptually similar to r, the “intrinsic rate of increase”, in population biology (I’ll refer to it as just R here). It’s defined as the mean number of secondary disease cases arising from a primary case. When an individual is infected, he or she is a secondary case relative to whoever infected him or her, and in turn becomes a primary case capable of spreading the disease to others. Estimates of R depend strongly on both the biology of the virus and the behavior of the infected. It is thus more context dependent than population biology’s r parameter, which assumes idealized conditions and depends more strictly on biologically limiting parameters (lifespan, age to first reproduction, gestation time etc.). Diseases which are highly contagious, like measles, smallpox and the flu, have relatively high R values, whereas those requiring direct contact or exchange of body fluids, like HIV, have rates which are at least potentially much lower, depending on the behavior of the infected.

To stop an epidemic of any disease, it is necessary to first lower R, and eventually bring it near zero. Any value of R > 0.0 indicates a disease with at least some activity in the population of concern. When R = 1.0, there is a steady increase in the total number of cases, but no change in the rate of infection (new cases per unit time): each infected person infects (on average) exactly 1.0 other person. Any R > 1.0 indicates a (necessarily exponential) increase in the infection rate; that is, the rate of new cases per unit time (not just the total number of cases) is increasing. It’s also possible to get a constant, rather than accelerating, increase in the number of new cases, but that’s an unstable equilibrium requiring a steady decrease of R from values > 1.0, and is thus uncommon.
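The relationship between R, new cases, and total cases can be illustrated with a deliberately idealized sketch in Python. It assumes non-overlapping generations of infection, a fixed R, and no chance variation, none of which holds in a real outbreak; the function names are hypothetical.

```python
def cases_by_generation(r, generations, initial_cases=1.0):
    """New cases in each generation when every case infects r others on average."""
    new_cases = [initial_cases]
    for _ in range(generations):
        new_cases.append(new_cases[-1] * r)
    return new_cases

def running_total(values):
    """Cumulative sum: total cases to date after each generation."""
    total, totals = 0.0, []
    for v in values:
        total += v
        totals.append(total)
    return totals

steady = cases_by_generation(1.0, 5)     # R = 1: new cases per generation stay constant
growing = cases_by_generation(2.0, 5)    # R = 2: new cases double each generation
shrinking = cases_by_generation(0.5, 5)  # R = 0.5: new cases halve each generation

steady_total = running_total(steady)     # rises by the same amount every generation
```

With R = 1 the total climbs linearly while the number of new cases stays flat; with R = 2 the new cases themselves grow exponentially; with R = 0.5 they shrink toward zero.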

Liberian ebola rate jumps

Many reports from on-the-ground workers with the WHO, Doctors Without Borders, state health and aid agencies, etc. have commented that the reported case and death rates in at least some locations have almost certainly been too low, because a substantial number of people are avoiding clinics and hospitals, primarily out of fear. This situation seems to be the worst in Liberia. See this article for example. Today’s WHO-released data from Liberia may be confirmation of this, with many new cases and deaths being reported there from August 16-18. Such a jump could be due to more intensive case tracking/finding. However, it is also possible that the epidemic is simply exploding there now, especially given that it is well established in the capital, Monrovia. Or it could be due to some combination of the two.

In the graphs below I used a pretty stiff “span” parameter (span = 1.0) in the loess smoothings (dark black lines) of the WHO-reported raw data (thin line). This choice gives about 35 deaths/day in Liberia. If I use something more flexible, span = 0.5 for example, the estimated rates are higher, about 47/day. However, it’s best to go stiff (i.e. conservative) here, because clearly there are major variations due to data gathering and reporting timelines that have been causing large fluctuations in the numbers (discussed more here). But there’s also clearly more than just that going on with this latest surge in numbers.
This situation is now extremely serious, if it wasn’t already. Note also that negative rates early on in the outbreak are presumably due to case retractions or re-classifications. The code generating the data and graphs is here, and the data table itself is here.
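The effect of the span parameter can be sketched in Python. This is not the R code behind the graphs, and it is not R's loess either: it is a simplified local linear smoother with tricube weights, written here only to illustrate the idea. The function name and the data (a linear trend with alternating reporting noise of plus or minus 8) are made up.

```python
import math

def local_linear_smooth(xs, ys, span):
    """Smooth ys with tricube-weighted local linear fits.

    span is the fraction of points used in each local fit (0 < span <= 1).
    """
    n = len(xs)
    k = max(2, min(n, math.ceil(span * n)))  # points in each neighbourhood
    smoothed = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0              # bandwidth: distance to the k-th nearest point
        ws = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed

# Made-up daily rates: a linear trend plus alternating reporting noise.
days = list(range(20))
trend = [10.0 + d for d in days]
reported = [t + (8.0 if d % 2 == 0 else -8.0) for d, t in zip(days, trend)]

stiff = local_linear_smooth(days, reported, span=1.0)     # uses every point in each fit
flexible = local_linear_smooth(days, reported, span=0.3)  # uses only nearby points

def max_error(smoothed):
    """Largest departure of a smoothed series from the known trend."""
    return max(abs(s - t) for s, t in zip(smoothed, trend))
```

On this toy series the stiff fit stays closer to the underlying trend because each local fit averages over more of the alternating noise. The cost, not shown here, is that a stiff fit is slower to follow a genuine change in the rate.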

Step one

The hardest part about gaining any new idea is sweeping out the false idea occupying that niche. As long as that niche is occupied, evidence and proof and logical demonstration get nowhere. But once the niche is emptied of the wrong idea that has been filling it — once you can honestly say, “I don’t know,” then it becomes possible to get at the truth.

Heinlein, R. A. 1985. The Cat Who Walks Through Walls: A Comedy of Manners, p. 230. G.P. Putnam’s Sons, New York. In: Gaither, C.C. and Cavazos-Gaither, A.E., 2008, Gaither’s Dictionary of Scientific Quotations, Springer.

“…a garden opposite the Half Dome”

The good old pioneer, Lamon, was the first of all the early Yosemite settlers who cordially and unreservedly adopted the Valley as his home.

He was born in the Shenandoah Valley…emigrated to Illinois…afterwards went to Texas and settled on the Brazos, where he raised melons and hunted alligators for a living. “Right interestin’ business,” he said; “especially the alligator part of it.” From the Brazos he went to the Comanche Indian country between Gonzales and Austin, twenty miles from his nearest neighbor…When the formidable Comanche Indians were on the war-path he left his cabin after dark and slept in the woods. From Texas he crossed the plains to California and worked in the Calaveras and Mariposa gold-fields.

He first heard Yosemite spoken of as a very beautiful mountain valley and after making two excursions in the summers of 1857 and 1858 to see the wonderful place, he made up his mind to quit roving and make a permanent home in it. In April, 1859, he moved into it, located a garden opposite the Half Dome, set out a lot of apple, pear and peach trees, planted potatoes, etc., that he had packed in on a “contrary old mule,”…For the first year or two lack of provisions compelled him to move out on the approach of winter, but in 1862 after he had succeeded in raising some fruit and vegetables he began to winter in the Valley…When the avalanches began to slip, he wondered where all the wild roaring and booming came from, the flying snow preventing them from being seen. But, upon the whole, he wondered most at the brightness, gentleness, and sunniness of the weather, and hopefully employed the calm days in clearing ground for an orchard and vegetable garden.

He was a fine, erect, whole-souled man, between six and seven feet high, with a broad, open face, bland and guileless as his pet oxen. No stranger to hunger and weariness, he knew well how to appreciate suffering of a like kind in others, and many there be, myself among the number, who can testify to his simple, unostentatious kindness that found expression in a thousand small deeds. After gaining sufficient means to enjoy a long afternoon of life in comparative affluence and ease, he died in the autumn of 1876. He sleeps in a beautiful spot near Galen Clark and a monument hewn from a block of Yosemite granite marks his grave.

John Muir, The Yosemite ch.14

JC Lamon
Lamon cabin 1861
Yos Falls 1900
Lamon monument
Images all courtesy of the NPS

See also Order of The Good Earth, The Yosemite Cemetery

Is Popper responsible for this mess?

OK, admittedly this is a bit of a weird post, but otherwise I’d have to be actually working.

It’s just a question post really, because admittedly I’ve read very little of Karl Popper’s writings, and whatever little that was, it was a long time ago. I just know what everybody in science “knows”: he’s the “falsification” guy. That is, he reportedly believes that scientific advancement comes mainly via testing hypotheses (ideas, concepts, theories, call ‘em whatever you like as far as I’m concerned) and then assessing whether the hypothesis withstood the test successfully or not. If it didn’t, chuck it and come up with another one; if it did, test it some more and scale your confidence in it with the number (and/or stringency) of the tests it’s passed.

Hmm, well OK I guess, but it leaves me with this image in my mind of some authority figure standing over me saying “Your idea has been falsified by group X doing unequivocal test Y. Your idea fails. Now get out of here.”

Not to go all Bayesian Bandwagon on the issue, since I have serious questions about that viewpoint also, but if you’re addressing a complex question and you carefully and repeatedly add a little bit of good evidence at a time, over time, thereby eventually narrowing down the list of likeliest explanations for your observations, then you don’t really need to worry about “falsifying” anything really, do you? I mean, lay a solid foundation, then add floor one, then two, etc…. and there you go. I get the feeling Popper thinks science is a bunch of wanna-be sand castle architects running amok on the beach trying to outdo each other but without much of a clue really, but then WHOA, here comes the sand castle judge and he’s going to wreck all but one. But then maybe it is, at least in some fields. Jimi Hendrix could have written a song about it.

I think my main question really is this: did the obsession with hypothesis testing–and all the problems arising therefrom–come from following Popper’s ideas, or did Popper just describe what the hypothesis testing fanatics were already doing? Chicken and egg question really.

If this post has been unsatisfactory to you, I am willing to tell Rodney Dangerfield jokes or discuss baseball. Thanks for your attention either way.

Didn’t know where to find you

I didn’t know where to look for you last night
Didn’t know where to find you
I didn’t know how I could touch that light
That’s always gathering behind you

I didn’t know that I would find a way
To find you in the morning
But love can pull you out of yesterday
As it takes you without warning

I want to be a long time friend to you
Want to be a long time known
Not one of your memory’s used-to-be’s
Not a summer’s fading song

John Gorka, Love is our cross to bear

Ebola rates, updated

Latest data from the WHO on the W. Africa Ebola outbreak (report of 8-22-14; data therein as of 8-20-14). For data table go here, and for R code generating data and graphs go here.

Reporting issues are likely responsible for the large fluctuations in the raw data, hence the loess smoothing (dark line) for a better approximation of the true rates. See here for a more in depth discussion of this issue.
Figure: recent Ebola case rates
Figure: recent Ebola death rates

On baseball (finally!)

I’ve discussed no baseball here yet, which is kind of surprising, given that I’ve been a big fan all my life. I played a lot growing up, through high school and even a little in college and afterwards. If I had the time, I would likely start a blog just devoted strictly to baseball (and not just analysis either), because I have a lot to say on a lot of topics. But alas…

To me, the real interest in any sport comes from actually playing the game, not watching it, and I watch very little baseball (now) because the games are just too time consuming (though I still have a hard time refraining in October). When I do watch, I’m not obsessively analytical–that takes the fun out of it for me. It’s an athletic contest, not a statistics class; I want to see the center fielder go full speed and lay out for a catch, or a base thief challenge the pitcher or whatever, not sit there with numbers in my head. Analysis is for later, and I do like it, so I wade in at times, thereby joining the SABR-metric (or “sabermetric”) revolution of the last 3-4 decades (the Society for American Baseball Research (SABR) initiated much of this). And baseball offers endless analytical opportunities, for (at least) two reasons.

Billy Hamilton of the Cincinnati Reds
