I’ve now written several articles on the West African ebola outbreak (see e.g. here, here, here, and here). This time I want to get more analytical, by describing how I estimated the ebola basic reproduction rate Ro (“R zero”), i.e. the rate of infection spread. Almost certainly various people are making these estimates, but I’ve not seen any yet, including at the WHO and CDC websites or the few articles that have come out to date.
Some background first. Ro is a fundamental parameter in epidemiology, conceptually similar to r, the “intrinsic rate of increase”, in population biology (I’ll refer to it as just R here). It’s defined as the mean number of secondary disease cases arising from a primary case. When an individual is infected, he or she is a secondary case relative to whoever infected him or her, and in turn becomes a primary case capable of spreading the disease to others. Estimates of R depend strongly on the both the biology of the virus, and the behavior of the infected. It is thus more context dependent than population biology’s r parameter, which assumes idealized conditions and depends more strictly on biologically limiting parameters (lifespan, age to first reproduction, gestation time etc.). Diseases which are highly contagious, like measles, smallpox and the flu, have relatively high R values, whereas those requiring direct contact or exchange of body fluids, like HIV, have rates which are at least potentially much lower, depending on the behavior of the infected.
To stop an epidemic of any disease, it is necessary to first lower R, and eventually bring it near zero. Any value of R > 0.0 indicates a disease with at least some activity in the population of concern. When R = 1.0, there is a steady increase in the total number of cases, but no change in the rate of infection (new cases per unit time): each infected person infects (on average) exactly 1.0 other person. Any R > 1.0 indicates a (necessarily exponential) increase in the infection rate, that is, the rate of new cases per unit time (not just the total number of cases), is increasing. It’s also possible to get a constant, rather than accelerating, increase in the number of new cases, but that’s an unstable equilibrium requiring a steady decrease of R from values > 1.0, and is thus uncommon.
Ebola virus can remain infective outside a warm body for a while, and also spread via aerosol mists, but the strongly dominant transmission mode (thankfully) is via direct contact. This fact will tend to favor relatively low R values, all else equal. However, it’s also a very new disease, known only since 1976, and as far as West Africa is concerned, entirely new (at least in the modern record), unexpected, unfamiliar and likely without much resistance in the human population. The same held when it showed up in Zaire and Uganda previously. Therefore, one might expect significantly higher spread rates than for a viral disease having the same basic biology but with which the population has historical familiarity, i.e. some degree of immunity or resistance.
So, on to making the estimates and the considerations thereto. There are four main issues here: (1) smoothing the data, (2) choosing an optimal mathematical model, (3) estimating that model’s parameter(s), and (4) converting these to an estimate of R, if necessary. I’ll take them in turn.
One always needs good data of course, and a major complication in this situation involves the data gathering and reporting timelines. It’s very clear that the reported new case and death rates jump fairly greatly between consecutive reports. There is no biological reason whatsoever to expect this; it is almost certainly due to how the data are being collected and reported. For estimating the total number of cases this is no big issue, but for estimating infection rates it certainly is. One has to smooth this noise out, and I’ve done this using loess, or locally weighted regression. Loess smoothing is subjective however, because you have to choose how “locally” you want the regression to be weighted, which is very ad-hoc.
As for issue (2), we know two very important facts here: (1) R > 1.0 (the number of new cases and deaths per unit time is increasing), and (2) R > 1.0 very strongly implies an exponential function (as mentioned above, a steadily but non-exponentially increasing case rate is unstable and thus unlikely, except for brief periods, such as at the inflection point of a logistic curve). Exponential functions where y = f(x) are of the structural form y = b^ax, where b is some chosen base, usually e (~2.78) or 10, but anything will work, with a being the single estimated parameter. In this case, the base chosen has direct relevance to estimates of R, as discussed below, and it’s actually better not to fit a model having a base of either e or 10.
Issue (3) necessary involves fitting the model to the data, and thus potentially involves issue (1). Should we fit the model to the raw data, or to the smoothed data (the latter being much more likely to reflect the actual biology rather than artifacts)?. If to the raw data, and those are greatly affected by a spike in case reports, that’s going to induce some error. So, should we wait for the next report then, or just smooth with a very stiff loess function or?… etc. Issue (4) involves very simple math in this case, simply converting one base to another; easy (but important).
In a nutshell then, this is what I did. First of all, I waited for the next WHO report after that of August 20 (with it’s enormous spike) to come out, because the typical pattern to date has been a spike followed by much lower numbers in the next report. Sure enough, this pattern was again followed, and fortunately I only had to wait two days. I then computed the recent new case and death rates from the raw data, and smoothed these using a relatively stiff loess function, exactly as in the many graphs I’ve now shown. In the R language, stiffness is controlled by the “span” argument of the loess function. I did a little experimenting with different span values to get the sense that span = 1.0 was relatively stiff, and hence minimally influenced by spikes in the numbers due to reporting issues.
Eyeballing the loess smooth confirmed that it was concave over the last 3 months or so, and thus that an exponential model was indeed reasonable. The standard way to proceed at this point would be to simply fit an exponential model to the raw data. This requires non-linear optimization, since there is no analytical solution. In R, this is done using either the “nlm” or “optim” functions, and for whatever reason both of these always give me headaches, which I was in no mood for, not that I ever am. So I worked around it using a “brute force” approach, but one which also has some very definite advantages, one of which in this case is information on the ability of a stiff loess function to capture the underlying epidemiology. Which, since loess is a really nice, easy and useful tool, is a good thing to know.
What I did was to (1) take the loess-estimated new case rate from the latest WHO report (Aug. 22), (2) compute the per-day rate that would be required to achieve that value over the last ~ three months, and (3) used that value as a starting point estimate for systematic searches for values giving better estimates (as evaluated using standard goodness of fit measures, i.e. the smallest sum of squared deviations from the model-estimated values). A key point in step (2) is that the parameter estimated (in model y = b^ax), is not in fact a, but is rather the base b; I do not assume a given base and then estimate a. I instead set a = 1.0, and since x is the number of days from the defined starting date, I estimate b, which gives the per-day transmission rate of ebola in the population. This is given simply by b = y^(1/x), with y as given by step (1). This does not give us R directly, but that conversion is simple if we know a little about ebola disease progression, which we do.
From an initial estimate of b, which from eyeballing I have confidence is entirely reasonable, I then compute the sums of squares for all values between 0.5b and 1.5b, in 200 increments (I had to experiment a little with the number of increments to use), which is where the so-called brute force comes in. Choosing the value giving the best fit, I then convert that to an estimate of R under the following logic. Since R as commonly defined is time-independent, being the mean of the total number of secondary cases initiated by each primary case, I simply need to know how many days ebola patients are infectious on average. They are presumed to be uninfectious before presenting symptoms. From what I have read, infectiousness ranges from 6-12 days; patients either die or recover fairly quickly after that time (although there is reportedly a long tail for infectiousness of some body fluids after recovery, up to 2-3 months). In the absence of any more precised data, R is therefore estimated simply by R = b^d, where d ranges from 6-12 days.
So, here are the results. These are for all three countries combined, with time zero of mid-May. That’s when the initial outbreak of March/April appeared to have subsided, before the increase from June until now. I’ve assumed that case identification accuracy is high, i.e. that “probable” and “suspected” cases are correct a high percentage of the time, which I think is entirely reasonable. I’ve also assumed that spread from any animal vectors to humans is not an important component of the epidemic, other than having started it of course.
The initial estimate of b, the per-day infection rate, was 1.042. The (maximum likelihood) rate obtained from step (3) above was very close to this: 1.043, meaning that for these data, the relatively stiff loess data smoothing (span = 1.0) of the raw data provides a very reasonable representation of the actual epidemiology. Lastly, the estimate of R, the time-independent, per-individual spread rate, thus ranges from 1.29 to 1.66 for d = 6 and 12 respectively, over the last ~ 3 months. It will be interesting to see how these rates change with time, and how they compare to estimates from past epidemics, if they’ve been made, and also between the three countries in this outbreak. What everyone involved desperately wants to see, is a transition to a logistic form, indicating a slowdown in spread.