Ebola epidemic update

Today a new and more extensive WHO W. Africa ebola update was released, including data current as of Sept. 14, four days ago. I’ve therefore compiled new tables, and case and death rates. The new code and graphs are here, and the new data table is here. Liberia-specific graphs are here.

There’s been a slight drop in the transmission rate, based on these data. The daily rate is now estimated at about 1.038 (down from 1.043 a month ago). The 6-12 day rates, which correspond roughly with the estimate of R_zero, the per person rate of infection (depending on the mean infectiousness period, in days), range from 1.25 to 1.57. The midpoint value is 1.41. See here and here for my methodology.
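The 6-12 day range follows from compounding the daily rate. Here is a minimal sketch of that arithmetic, assuming a constant daily factor over the infectious period; the function name `multi_day_rate` is just illustrative and is not taken from the linked code.

```python
# Compound a per-day growth factor over an assumed infectious period.
daily_rate = 1.038  # approximate daily rate quoted in the post


def multi_day_rate(daily, days):
    """Growth factor over `days` days, assuming a constant daily factor."""
    return daily ** days


low = multi_day_rate(daily_rate, 6)    # 6-day infectious period
high = multi_day_rate(daily_rate, 12)  # 12-day infectious period
midpoint = (low + high) / 2

print(round(low, 2), round(high, 2), round(midpoint, 2))
```

Because the daily rate above is rounded to three decimals, the upper end comes out slightly below the quoted 1.57; the difference is presumably just rounding.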

It is almost certain that cases are going unreported, however, and possibly many of them; I don’t know how many. These estimates are therefore likely underestimates of the true rate, and hence of the severity of the outbreak. And this kind of thing is certainly tragic and not helping the situation.

This week’s puzzler

This week’s puzzler comes to us from John Storthwaite in Stonyfield, Minnesota, who has been wondering why there are so many trees blocking his view of the rocks up there.

Suppose you have been given the following problem. A number of objects are located in some given area, say trees in a forest for example, and one wishes to estimate their density D (number per unit area). Distance-based sampling involves estimating D by averaging a sample of squared, point-to-object distances (d), for objects of known integer rank distance (r) from the point. The distances are squared because one is converting from one dimensional measurements (distance) to a two dimensional variable (objects per unit area).

So here’s the puzzler. If you run a line through this arbitrary point, and choose the closest objects (r = 1) on each side of it, what will be the ratio of the squared distances of the two objects and how would you solve this, analytically? Would they be about the same distance away? If not, would there be a predictable relationship between them? The problem can be extended to any number of lines passing through said point, just with correspondingly more pairs of distances to evaluate.

The first questions one should ask here are clear: (1) “Why on earth would anybody want to do that?” and (2) “Is that the type of thing you clowns spend your time on?”. We have answers for those questions. Not necessarily satisfactory answers, but answers nonetheless. Giving an answer, that’s the important thing in life. So, if you know the answer, write it on the back of a $100 bill and send it to…

Anyway, there are two possible solutions here. The first one comes readily if one realizes that the densities within sectors must each be about the same as the overall density, since we assume a homogeneous overall density. But, for a given value of r, the squared distances in each of the two sectors must be, on average, about twice those for the collection of trees overall, because there are only half as many trees in each sector as there are overall. So, e.g. the r = 5th closest trees within each half are on average, 2X the squared distance of the r = 5th closest tree overall.

Knowing this, the relationship between the two r = 1 trees (label them r1.1 and r1.2 having squared distances d1.1 and d1.2) in the two sectors becomes clear. Since one of the two trees (r1.1) must necessarily be the r = 1 tree overall, and the mean squared distance of the two trees must be 2X that of the r = 1 tree, this translates to:

2*d1.1 = (d1.1 + d1.2)/2 and thus,
d1.2 = 3(d1.1),

i.e., one member of the pair will, on average, be exactly three times the squared distance of the other. This result can be confirmed by an entirely independent method involving asymptotic binomial/multinomial probability. That exercise is left, as they say in the ultimate cop-out, to the reader.
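For anyone who would rather check the factor of three by brute force than by binomial probability, here is a rough simulation sketch. It assumes trees scattered uniformly at random in a square around the sample point, with the dividing line taken as the vertical axis; the function name, the number of trees and the number of trials are all arbitrary choices of mine.

```python
import random


def nearest_pair_ratio(n_points=50, n_trials=10000, seed=1):
    """Ratio of mean squared distances (farther / nearer) for the two
    r = 1 trees on either side of a line through the sample point."""
    rng = random.Random(seed)
    sum_near = 0.0
    sum_far = 0.0
    for _ in range(n_trials):
        best_left = None   # smallest squared distance with x < 0
        best_right = None  # smallest squared distance with x >= 0
        for _ in range(n_points):
            x = rng.uniform(-1.0, 1.0)
            y = rng.uniform(-1.0, 1.0)
            d2 = x * x + y * y  # squared distance to the point at the origin
            if x < 0:
                if best_left is None or d2 < best_left:
                    best_left = d2
            else:
                if best_right is None or d2 < best_right:
                    best_right = d2
        if best_left is None or best_right is None:
            continue  # one side had no trees at all; skip this trial
        sum_near += min(best_left, best_right)
        sum_far += max(best_left, best_right)
    return sum_far / sum_near


ratio = nearest_pair_ratio()
print(round(ratio, 2))
```

The ratio of the two means should land near 3, subject to sampling noise. Note that this is the ratio of the averages, not the average of the per-trial ratios, which is a different and much less well-behaved quantity.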

This work has highly important implications with respect to cancer research, and for solutions to poverty, malnutrition, and climate change. It can also help one discern if tree samplers 150-200 years ago were often sampling the closest trees or not.

Funding for this work was provided by the Doris Duke Foundation, the Society for American Baseball Research, the American Bean and Tree Counters Society, the Society for Measuring Things Across From Other Things, and the Philosophy Department at the University of Hullaballo. All rights reserved, all obligations denied. Any re-use, re-broadcast, retransmission, regurgitation or other use of the accounts and descriptions herein, without the express written consent of the closest random stranger on the street, or the closest random stranger on the other side of said street, is strictly prohibited.

Golf course succession

A friend’s property, in the county my parents live in, is surrounded by a nine hole golf course that went out of business several years ago, and is about to be acquired by the US Fish and Wildlife Service. It is undergoing rapid ecological succession to a less managed state since they stopped mowing a few years back. This process is very common with abandoned farm land, but this is the first I’ve looked at a golf course. The place is interesting because the area is naturally wet, being originally part of a very large swamp/wetland complex (the “Great Black Swamp”) that stretched over many counties and caused this area to be the last settled in Ohio. The original vegetation, documented in 1820, was dominated by intermixed treeless wet prairie, and swamp or other northern wetland hardwoods, with standing water over the entire year common. The inherently wet soils might well have affected the course’s success, I don’t know.

Several tree species mentioned in the 1820 GLO land survey notes (see bottom image) are still present, including swamp white oak (Quercus bicolor), American elm (Ulmus americana), pin oak (Q. palustris), green ash (Fraxinus pennsylvanica), hickory (Carya cordiformis), eastern cottonwood (Populus deltoides), and unspecified willows (Salix spp.). Others have clearly come in post-settlement, including black walnut (Juglans nigra), northern catalpa (Catalpa speciosa), weeping willow (Salix babylonica), possibly silver maple (Acer saccharinum), and the completely misplaced jack pine (Pinus banksiana) and eastern redcedar (Juniperus virginiana) (most likely both as yard markers and fairway dividers). How the USFWS will manage the property will be interesting; it may be difficult to recreate the wet prairie habitat given that the natural drainage pattern is now highly altered by ditching and drain tiling.

Wet prairie and hardwood swamp, to farm, to golf course, to...


Goldenrod (Solidago spp.), a notorious and obvious late bloomer.



Camping among the tombs

I found a road which led me to the Bonaventure graveyard. If that burying-ground across the Sea of Galilee, mentioned in Scripture, was half as beautiful as Bonaventure, I do not wonder that a man should dwell among the tombs. It is only three or four miles from Savannah…Part of the grounds was cultivated and planted with live-oak, about a hundred years ago, by a wealthy gentleman who had his country residence here. But much the greater part is undisturbed. Even those spots which are disordered by art, Nature is ever at work to reclaim, and to make them look as if the foot of man had never known them. Only a small plot of ground is occupied with graves and the old mansion is in ruins.

Bonaventure Cemetery


“I ventured out…”

A wild scene, but not a safe one, is made by the moon as it appears through the edge of the Yosemite Fall when one is behind it. Once…I ventured out on the narrow ledge that extends back of the fall…and wishing to look at the moon through some of the denser portions of the fall, I ventured to creep further behind it. The effect was enchanting: fine, savage music sounding above, beneath, around me, while the moon, apparently in the very midst of the rushing waters, seemed to be struggling to keep her place, on account of the ever-varying form and density of the water masses through which she was seen…I was in fairy land between the dark wall and the wild throng of illumined waters, but suffered sudden disenchantment; for like the witch scene in Alloway Kirk, “in an instant all was dark”. Down came a dash of spent comets, thin and harmless looking in the distance, but they felt desperately solid and stony when they struck my shoulders, like a mixture of choking spray and gravel and big hailstones. Instinctively dropping on my knees, I gripped an angle of the rock, curled up like a fern frond with my face pressed against my breast, and in this way submitted as best I could to my thundering bath…How fast one’s thoughts burn in such times of stress. I was weighing chances of escape. Would the column be swayed a few inches from the wall, or would it come yet closer? The fall was in flood and not so lightly would its ponderous mass be swayed. My fate seemed to depend on the fate of the “idle wind”…

John Muir, The Yosemite, p.30

The Yosemite
Yos Falls 1900

Who’s “best” and how do you know it?

So suppose you have your basic Major League Baseball (MLB) structure, consisting of two leagues having three divisions of five teams each, each of which plays a 162 game, strongly unbalanced*, schedule. There are, of course, inherent quality differences in those teams; some are better than others, when assessed over some very large number of games, i.e. “asymptotically” **. The question thus arises in your mind as you ponder why the batter feels the need to step out of the batter’s box after each pitch ***: “how often will the truly best team(s) win their league championships and thus play each other in the World Series”. The current playoff structure involves having the two wild card teams play each other in a one game elimination, which gives four remaining playoff teams in each league. Two pairings are made and whoever wins three games advances to the league championship series, which in turn requires winning four games.

I simulated 1000 seasons of 162 games with leagues having this structure. Inherent team quality was set by a normal distribution with a mean of 81 wins and a standard deviation of ~7, such that the very best teams would occasionally win about 2/3 (108) of their games, and the worst would lose about that same fraction. Win percentages like those are pretty realistic, and the best record in each league frequently falls between 95 and 100 wins.

1) The truly best team in each league makes the playoffs about 80 percent of the time under the current system, less when only four teams make it.
2) That team wins its league championship roughly 20 to 30 percent of the time, getting knocked out in the playoffs over half the time. It wins the whole shebang about 10 to 15 percent of the time.
3) Whenever MLB expands to 32 teams, in which the playoff structure will very likely consist of the four division winners in each league and no wild card teams, the truly best (and second and third best) teams in each league will both make the playoffs, and advance to the World Series, less frequently than they do now.
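To give a feel for how such a simulation can be set up, here is a heavily simplified sketch; it is not the code behind the numbers above. It assumes one 15-team league, ignores the unbalanced schedule entirely, and draws each team's record independently against a hypothetical average opponent, so the league's wins and losses need not balance. It then asks only how often the truly best team reaches the playoffs as one of three division winners or two wild cards. The function name and the team indexing are mine.

```python
import random


def best_team_playoff_rate(n_seasons=500, seed=42):
    """Fraction of simulated seasons in which the truly best team
    (highest underlying quality) reaches the playoffs."""
    rng = random.Random(seed)
    n_teams, n_games = 15, 162
    made_playoffs = 0
    for _ in range(n_seasons):
        # True quality: expected wins ~ Normal(81, 7), expressed per game.
        quality = [min(max(rng.gauss(81.0, 7.0) / n_games, 0.0), 1.0)
                   for _ in range(n_teams)]
        # Observed record: 162 independent games vs. an average opponent.
        wins = [sum(rng.random() < p for _ in range(n_games))
                for p in quality]
        best = max(range(n_teams), key=lambda i: quality[i])
        # Divisions are teams 0-4, 5-9 and 10-14; ties are broken arbitrarily.
        winners = [max(range(d, d + 5), key=lambda i: wins[i])
                   for d in (0, 5, 10)]
        others = sorted((i for i in range(n_teams) if i not in winners),
                        key=lambda i: wins[i], reverse=True)
        playoff_teams = set(winners) | set(others[:2])  # plus two wild cards
        if best in playoff_teams:
            made_playoffs += 1
    return made_playoffs / n_seasons


rate = best_team_playoff_rate()
print(rate)
```

Extending this to the playoff rounds themselves means simulating each series game by game from the two teams' qualities, which needs some rule for head-to-head win probability; that choice is left open here.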

This type of analysis is generalizable to other types of competitions under structured systems, at least for those in which the losers of individual contests live to fight another day, or if they don’t, are replaced by others of the same basic quality. The inherent spread in team quality makes a very big difference in the results obtained however. It’ll apply very well to baseball and hockey, but not so well to the NBA, for example.

So the next time an MLB team wins its league, or the World Series, and you’re tempted to think this means they must be the best team in the league (or MLB overall), think about that again. Same for the NHL.

* Currently, each team plays around 3 times as many games against each intra-division opponent as inter-division opponents, not even including the 20 inter-league games (which I’ve ignored in these analyses, assuming all games are within-league).
** These records are conceived of as being amassed against some hypothetical, perfectly average team. This team is from Lake Wobegon, Minnesota.
*** It is perfectly OK to think other things of course, and we need not worry about the particulars of the language embodied therein.

What’s complex and what’s simple in an exponential model?

In the post on estimating the rate of spread in the current ebola epidemic, a commenter stated that using a monthly rate of disease spread in Liberia was a “simpler” model than what I had done, which was based on a daily rate. This is not correct and I want to clarify why here.

In fact I used a very simple model–an exponential model, which has the form y = b^(ax). You can’t get any simpler than a one parameter model, and that fact doesn’t change just because you alter the value of the base b. Any base can model an exponential increase; changing it just requires a corresponding change in parameter a, for a given pair of y and x variables. Base choice ought to be done in a way that carries some meaning. For example, if you’re inherently interested in the doubling time of something, then 2 is the logical choice*. But when no particular base value is obvious, it’s still best if the value used carries meaning in terms of the values of x, i.e. where a = 1.0, presuming that x is measured on some scale that has inherent interest. In my case, that’s the per-day increase in ebola cases.

However, if you fit an exponential model to some data, most programs will use a base of e (~2.718) or 10 by default; the base is fixed and the rate of change is then determined with respect to the units of ax. That’s a bit backwards frankly, but not a big deal, because the base used can easily be converted to whatever base is more meaningful relative to the data at hand. Say for example, that your model fitting procedure gives y = e^(3.2x), where b = e and a = 3.2. But if your x variable is recorded in say, days, you may well not be interested in the fact that y grows by a factor of e every 1/3.2 days: you want to know the per-day rate of change. Well, y = e^(ax) is simply y = (e^a)^x, and so in this case b = e^(3.2) = 24.5; it takes a larger base to return a given y value if the exponent is smaller. It’s just a straight mathematical transformation (e^a), where a is whatever value is returned in the exponential model fitting. It has nothing to do with model complexity. It has instead to do with scaling, ease of interpretation and convenience.
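The base conversion is easy to check numerically. A small sketch, using the a = 3.2 example from the paragraph above; the test value x = 2.5 is arbitrary.

```python
import math

a = 3.2          # fitted rate parameter with base e, as in the example
b = math.exp(a)  # equivalent per-unit-x base, so that y = b^x

x = 2.5                        # arbitrary x value for the comparison
y_base_e = math.exp(a * x)     # y = e^(ax)
y_base_b = b ** x              # y = (e^a)^x
a_back = math.log(b)           # converting back recovers a

print(round(b, 2), math.isclose(y_base_e, y_base_b), round(a_back, 2))
```

Both forms return the same y; only the scale on which the rate is read changes.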

The relevance to the ebola transmission rate modeling and the original comment is that those rates could very well change within a month’s time due to radical changes in the population’s behavior (critical), or perhaps drug availability (unlikely in this case). In a disease epidemic what happens from day to day is critical. So you want to use a time scale that allows you to detect system changes quickly, while (in this case) also acknowledging the noise generated by the data reporting process (which complicates things and was the whole point of using loess to smooth the raw data before making the estimates). Note that I’ve not gone into the issue of how to detect when an exponential growth rate has changed to some other type of growth. That’s much more difficult.

*Exponential functions are also useful for analyzing outcomes of trials with categorical variables, where a = 1 and b defines the number of possible outcomes of some repeated process. For example y = 2^25 gives the total number of possible permutations of 25 trials of an event having two possible outcomes. But that’s a different application than modeling a change rate (unless you want to consider the increase in the number of possible permutations a rate).

It just won’t get you there

I’ve got two little feet to get me across the mountain
Two little feet to carry me away into the woods
Two little feet
A big mountain
And a cloud coming down, a cloud a comin’ down

I hear the voices of the ancient ones
Chanting magic words from a different time
Well there is no time, there is only this rain
There is no time
That’s why I missed my plane

John Muir walked away into the mountains
With his old overcoat and a crust of bread in his pocket
We have no knowledge and so we have stuff
But stuff with no knowledge is never enough to get you there…
It just won’t get you there

Greg Brown, Two Little Feet

Estimating the spread rate in the current ebola epidemic

I’ve now written several articles on the West African ebola outbreak (see e.g. here, here, here, and here). This time I want to get more analytical, by describing how I estimated the ebola basic reproduction rate Ro (“R zero”), i.e. the rate of infection spread. Almost certainly various people are making these estimates, but I’ve not seen any yet, including at the WHO and CDC websites or the few articles that have come out to date.

Some background first. Ro is a fundamental parameter in epidemiology, conceptually similar to r, the “intrinsic rate of increase”, in population biology (I’ll refer to it as just R here). It’s defined as the mean number of secondary disease cases arising from a primary case. When an individual is infected, he or she is a secondary case relative to whoever infected him or her, and in turn becomes a primary case capable of spreading the disease to others. Estimates of R depend strongly on both the biology of the virus, and the behavior of the infected. It is thus more context dependent than population biology’s r parameter, which assumes idealized conditions and depends more strictly on biologically limiting parameters (lifespan, age to first reproduction, gestation time etc.). Diseases which are highly contagious, like measles, smallpox and the flu, have relatively high R values, whereas those requiring direct contact or exchange of body fluids, like HIV, have rates which are at least potentially much lower, depending on the behavior of the infected.

To stop an epidemic of any disease, it is necessary to first lower R, and eventually bring it near zero. Any value of R > 0.0 indicates a disease with at least some activity in the population of concern. When R = 1.0, there is a steady increase in the total number of cases, but no change in the rate of infection (new cases per unit time): each infected person infects (on average) exactly 1.0 other person. Any R > 1.0 indicates a (necessarily exponential) increase in the infection rate, that is, the rate of new cases per unit time (not just the total number of cases), is increasing. It’s also possible to get a constant, rather than accelerating, increase in the number of new cases, but that’s an unstable equilibrium requiring a steady decrease of R from values > 1.0, and is thus uncommon.
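A toy calculation makes the distinction between the number of new cases and the total number of cases concrete. This sketch assumes discrete, non-overlapping generations of infection and a fixed R, which real outbreaks do not obey; the function name is mine.

```python
def new_cases_by_generation(r, generations, initial_cases=1.0):
    """Expected new cases in each generation, assuming every case
    infects exactly r others in the next generation."""
    return [initial_cases * r ** g for g in range(generations + 1)]


for r in (0.5, 1.0, 1.5):
    new = new_cases_by_generation(r, 5)
    total = sum(new)  # cumulative cases after five generations
    print(r, [round(n, 2) for n in new], round(total, 2))
```

With R = 1.0 the new cases per generation stay flat while the total climbs steadily; with R = 1.5 the new cases themselves grow each generation; with R = 0.5 they shrink toward zero.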


Liberian ebola rate jumps

Updated as of 09-18-2014 WHO report.

Many reports from on-the-ground workers with the WHO, Doctors Without Borders, state health and aid agencies, etc. have commented that the case and death rates in at least some locations have almost certainly been too low, because of a substantial number of people avoiding going to clinics and hospitals, out of fear primarily. This situation seems to be the worst in Liberia. See this article for example. Today’s WHO-released data from Liberia may be confirmation of this, many new cases and deaths being reported there from August 16-18. Such an increase could be due to more intensive case tracking/finding. However, it is also possible that the epidemic is simply exploding there now, especially given that it is well established in the capital, Monrovia. Or it could be due to some combination of the two.

In the graphs below I used a pretty stiff “span” parameter (span = 1.0) in the loess smoothings (dark black lines) of the WHO-reported raw data (thin line). This choice gives about 35 deaths/day in Liberia. If I use something more flexible, span = 0.5 for example, the estimated rates are higher, about 47/day. However, it’s best to go stiff (i.e. conservative) here, because clearly there are major variations due to data gathering and reporting timelines that have been causing large fluctuations in the numbers (discussed more here). But there’s also clearly more than just that going on with this latest surge in numbers.

This situation is now extremely serious, if it wasn’t already. Note also that negative rates early on in the outbreak are presumably due to case retractions or re-classifications. Code generating data and graphs is here and data table itself is here.


Step one

The hardest part about gaining any new idea is sweeping out the false idea occupying that niche. As long as that niche is occupied, evidence and proof and logical demonstration get nowhere. But once the niche is emptied of the wrong idea that has been filling it — once you can honestly say, “I don’t know,” then it becomes possible to get at the truth.

Heinlein, R. A. 1985. The Cat Who Walks Through Walls: A Comedy of Manners, p. 230. G.P. Putnam’s Sons, New York. In: Gaither, C.C. and Cavazos-Gaither, A.E., 2008, Gaither’s Dictionary of Scientific Quotations, Springer.

“…a garden opposite the Half Dome”

The good old pioneer, Lamon, was the first of all the early Yosemite settlers who cordially and unreservedly adopted the Valley as his home.

He was born in the Shenandoah Valley…emigrated to Illinois…afterwards went to Texas and settled on the Brazos, where he raised melons and hunted alligators for a living. “Right interestin’ business,” he said; “especially the alligator part of it.” From the Brazos he went to the Comanche Indian country between Gonzales and Austin, twenty miles from his nearest neighbor…When the formidable Comanche Indians were on the war-path he left his cabin after dark and slept in the woods. From Texas he crossed the plains to California and worked in the Calaveras and Mariposa gold-fields.

He first heard Yosemite spoken of as a very beautiful mountain valley and after making two excursions in the summers of 1857 and 1858 to see the wonderful place, he made up his mind to quit roving and make a permanent home in it. In April, 1859, he moved into it, located a garden opposite the Half Dome, set out a lot of apple, pear and peach trees, planted potatoes, etc., that he had packed in on a “contrary old mule,”…For the first year or two lack of provisions compelled him to move out on the approach of winter, but in 1862 after he had succeeded in raising some fruit and vegetables he began to winter in the Valley…When the avalanches began to slip, he wondered where all the wild roaring and booming came from, the flying snow preventing them from being seen. But, upon the whole, he wondered most at the brightness, gentleness, and sunniness of the weather, and hopefully employed the calm days in clearing ground for an orchard and vegetable garden.

He was a fine, erect, whole-souled man, between six and seven feet high, with a broad, open face, bland and guileless as his pet oxen. No stranger to hunger and weariness, he knew well how to appreciate suffering of a like kind in others, and many there be, myself among the number, who can testify to his simple, unostentatious kindness that found expression in a thousand small deeds. After gaining sufficient means to enjoy a long afternoon of life in comparative affluence and ease, he died in the autumn of 1876. He sleeps in a beautiful spot near Galen Clark and a monument hewn from a block of Yosemite granite marks his grave.

John Muir, The Yosemite ch.14

JC Lamon
Lamon cabin 1861
Yos Falls 1900
Lamon monument
Images all courtesy of the NPS

See also Order of The Good Earth, The Yosemite Cemetary

Is Popper responsible for this mess?

OK, admittedly this is a bit of a weird post, but otherwise I’d have to be actually working.

It’s just a question post really, because admittedly I’ve read very little of Karl Popper’s writings, and whatever little that was, it was a long time ago. I just know what everybody in science “knows”: he’s the “falsification” guy. That is, he reportedly believes that scientific advancement comes mainly via testing hypotheses (ideas, concepts, theories, call ‘em whatever you like as far as I’m concerned) and then assessing whether the hypothesis withstood the test successfully or not. If it didn’t, chuck it and come up with another one; if it did, test it some more and scale your confidence in it with the number (and/or stringency) of the tests it’s passed.

Hmm, well OK I guess, but it leaves me with this image in my mind of some authority figure standing over me saying “Your idea has been falsified by group X doing unequivocal test Y. Your idea fails. Now get out of here.”

Not to go all Bayesian Bandwagon on the issue, since I have serious questions about that viewpoint also, but if you’re addressing a complex question and you carefully and repeatedly add a little bit of good evidence at a time, over time, thereby eventually narrowing down the list of likeliest explanations for your observations, then you don’t really need to worry about “falsifying” anything really, do you? I mean, lay a solid foundation, then add floor one, then two, etc…. and there you go. I get the feeling Popper thinks science is a bunch of wanna-be sand castle architects running amok on the beach trying to outdo each other but without much of a clue really, but then WHOA, here comes the sand castle judge and he’s going to wreck all but one. But then maybe it is, at least in some fields. Jimi Hendrix could have written a song about it.

I think my main question really is this: did the obsession with hypothesis testing–and all the problems arising therefrom–come from following Popper’s ideas, or did Popper just describe what the hypothesis testing fanatics were already doing? Chicken and egg question really.

If this post has been unsatisfactory to you, I am willing to tell Rodney Dangerfield jokes or discuss baseball. Thanks for your attention either way.