Abstract
Many epidemics of plant diseases are characterized by large variability among individual outbreaks. However, individual epidemics often follow a well-defined trajectory which is much more predictable in the short term than the ensemble (collection) of potential epidemics. In this paper, we introduce a modelling framework that allows us to deal with individual replicated outbreaks, based upon a Bayesian hierarchical analysis. Information about ‘similar’ replicate epidemics can be incorporated into a hierarchical model, allowing both ensemble and individual parameters to be estimated. The model is used to analyse the data from a replicated experiment involving spread of Rhizoctonia solani on radish in the presence or absence of a biocontrol agent, Trichoderma viride. The rate of primary (soil-to-plant) infection is found to be the most variable factor determining the final size of epidemics. Breakdown of biological control in some replicates results in high levels of primary infection and increased variability. The model can be used to predict new outbreaks of disease based upon knowledge from a ‘library’ of previous epidemics and partial information about the current outbreak. We show that forecasting improves significantly with knowledge about the history of a particular epidemic, whereas the precision of hindcasting to identify the past course of the epidemic is largely independent of detailed knowledge of the epidemic trajectory. The results have important consequences for parameter estimation, inference and prediction for emerging epidemic outbreaks.
Keywords: epidemiology, compartmental modelling, Bayesian inference, Markov chain methods, plant–pathogen systems, biological control
1. Introduction
One of the main goals of epidemiological modelling is to provide guidelines for predicting and controlling disease outbreaks.This task can be achieved only if a reliable model is available and successfully parametrized. The model can subsequently be used to assess the impact of treatments on potential outbreaks and to predict future epidemics. Parameter estimation and subsequent testing of models against data form key steps in modelling ecological and epidemiological processes. However, these steps are not often given enough emphasis (Pascual & Kareiva 1996), owing to the lack of studies combining highly controllable experimental systems and involving biologically plausible yet tractable models.
Conventionally, epidemiologists consider single outbreaks for which some parameters are estimated from independent studies, while others are found based on the whole or part of the current outbreak (Anderson & May 1991; Britton 2001; Ferguson et al. 2001, 2003; Keeling et al. 2001; Bjornstad et al. 2002; Grenfell et al. 2002; Haydon et al. 2006). In practice, however, we might have additional information about other outbreaks of the disease, often occurring under similar conditions. Given that other epidemics provide additional information, we expect that if such information is incorporated into a modelling framework, parameter estimation and prediction of emerging epidemics could be significantly improved. There is a tendency, however, in conventional approaches to the analysis of replicated epidemics to average over replicates, whereby the results are represented by their mean with information about variability either completely missing or at best described in terms of a standard deviation (Grenfell et al. 1994; Kleczkowski et al. 1996). Thus, there is a need for a modelling protocol that enables us to study the properties of a single replicated outbreak and at the same time to describe properties of an ensemble of such outbreaks. Plant epidemiology, with its relative ease in generating replicate and highly controllable data, gives a unique opportunity to develop and test such a protocol (Kleczkowski et al. 1996; Gibson et al. 1999). The concept of an ensemble used here represents a set of actual or potential replicates of a biological system, created by repeating the experiment under the same macroscopic (or environmental) conditions without being able to control the microscopic detail.
Despite considerable insights in fitting models in ecology and epidemiology (Dennis et al. 1995; Pascual & Kareiva 1996; Finkenstadt & Grenfell 1998; Finkenstadt et al. 1998; Grenfell et al. 1998; Bjornstad et al. 2002; Ionides et al. 2006), parameter estimation in plant epidemiology is still often done under the (implicit or explicit) assumption that subsequent measurements are uncorrelated and that we only know the average behaviour of an ensemble of replicates. Very often this is not true and epidemic data are characterized by substantial levels of variability among the replicates and high correlation along each realization. Thus, there is an inherent conflict between the statistical methodology that ignores information contained in replicate data and the need to describe single replicates. For systems that possess a well-defined long-term behaviour, we can often exchange temporal and replicate averaging (Kleczkowski 2005). However, for epidemic outbreaks these two approaches are often not interchangeable. The number of infected individuals at any given time strongly depends upon the number at previous times, generating high degrees of temporal correlation along each disease progress curve. For such systems, predictability for a single replicate along its trajectory can be much higher than predictability for the ensemble of replicates (Kleczkowski 2005).
In this paper, we develop a modelling protocol that allows us to study the properties of an ensemble of epidemic outbreaks as a whole and at the same time describe individual outbreaks. We use a deterministic model to describe the main features of a single replicate behaviour, but augment this with probabilistic assumptions about parameter values (Kleczkowski et al. 1996; Kleczkowski 1998). The deterministic approach is chosen here over the fully stochastic one (Gibson et al. 1999, 2004) since differential equations are still the model form of choice for many studies describing plant, animal and human epidemics (Gilligan 2002). Information about replicate and population behaviour is combined within a hierarchical Bayesian approach. Such approaches have been widely used in social sciences (Goldstein 2003), ecology (Koop 2003; Clark & Gelfand 2006) and epidemiology (Zeger & Karim 1991; Taylor et al. 1994; Twisk et al. 1994), but have so far received little attention in the analysis of plant disease epidemics.
In addition to parameter estimation based upon single replicates, we study the predictability of the future of an epidemic outbreak based upon its history up to a certain point. This forecasting method is complemented by hindcasting, whereby we seek to reconstruct the past epidemic progress from later observations. This reconstruction allows us to estimate the initial levels of disease following introduction of a pathogen into the plant population. Such information about the early stages of epidemics or the levels and the potential of the initial inoculum is notoriously difficult to collect. Knowledge of the early epidemic stages is important, however, in estimating the risk of recurrent plant epidemics (Bailey & Gilligan 1999; Bailey et al. 2004).
We test the protocol using a series of well-controlled microcosm experiments involving replicated epidemics for the spread of a fungal pathogen, Rhizoctonia solani (Kühn) among radish (Raphanus sativus L.) plants in the presence or absence of a biocontrol agent, Trichoderma viride (Pers ex Grey). In previous papers (Kleczkowski et al. 1996; Gibson et al. 1999, 2004), we have presented experimental evidence for unexpectedly high levels of variability in the population dynamics of R. solani in the presence of the biological control agent. We have shown that small differences in the initiation of epidemics become magnified later in the season by the nonlinear character of the infection process. The variability in the progress of the disease is ‘quenched’ by changes in the host and pathogen behaviour—as plants become resistant, the rate of primary (soil-to-plant) and secondary (plant-to-plant) spread decreases. This results in substantial variability in the final size of the epidemic, affecting our ability to predict the severity of the outbreak and potential success of biological control. However, we show here that it is still possible to predict the level of disease in a single outbreak, provided we know its behaviour in the first few days.
Thus, the paper addresses five questions, all associated with the relationship between the between- and within-replicate variability. How can we fit a model to a single replicate and relate its parameters to other replicates? How can we combine information from an ensemble of replicated epidemics to study general features of this ensemble? How can we compare two sets of replicated data and assess the effects of treatments, exemplified here by the presence and absence of a biocontrol agent, T. viride? How can we use information about past outbreaks to predict an emerging epidemic? How long do we need to observe a new outbreak to be able to predict its future (or past) course with confidence? In addition, we also ask a more general question: can a deterministic model be augmented in such a way that it can be used to describe a replicated system characterized by a high degree of variability?
2. Material and methods
2.1 Experimental data
The experimental system was described by Kleczkowski et al. (1996) and Gibson et al. (1999, 2004). The data represent spread of a pathogen in a population of N=50 plant hosts. Seeds of radish were planted on a 5 by 10 regular (square) grid to a depth of 10 mm in clear plastic boxes (100 mm wide, 200 mm long and 100 mm deep), filled with 1.0 kg of white sand with a moisture content of 10% by weight. Ten inoculum units of mycelial discs, 1.0 mm in diameter, removed from the edge of a 5-day-old colony of R. solani growing in a Millipore filter over potato dextrose agar, and placed at a depth of 5 mm, were introduced at random locations into the population. Each treatment was replicated five times, with all conditions kept strictly identical in each treatment and replicate. The boxes were sealed and incubated in a growth chamber at 23°C and 16 hours light and 8 hours darkness cycle. The boxes were opened and scored daily for three weeks to record the number of infected (damped-off) plants in each replicate i=1, …, 5, Yij at each time tj, j=1, …, 14.
2.2 Model
Two phases can be identified in the spread of R. solani in the plant population (Kleczkowski et al. 1996, 1997). There is an initial lag following the emergence of plants after which plants targeted by initial inoculum in the soil become infected (primary infections). Disease foci form and subsequently spread, with the dynamics becoming progressively slower as the plants become more resistant to infection (Kleczkowski et al. 1996). We summarize these processes by a deterministic model in the form of an ordinary differential equation describing the epidemics spread in a plant population subject to primary (soil-to-plant) and secondary (plant-to-plant) infections and decay in infectivity and susceptibility (adapted from Kleczkowski et al. (1996))
| (2.1) |
where N is the total number of plants; Y is the number of infected plants; t is time; rp represents the rates of primary infection and incorporates information about the density of initial inoculum; rs describes the rate of secondary infections associated with focal expansion; and a corresponds to the quenching caused by the decay in susceptibility and is common to both primary and secondary infections. This equation belongs to a general class of S(E)IR models commonly used in plant, animal and human epidemiology (Anderson & May 1991; Gilligan & Kleczkowski 1997; Gilligan 2002). Here, we assume that all infected plants become diseased and there is no lag between infection and the expression of disease symptoms. The resulting disease progress curve has an initial rise towards a temporary plateau, followed by sigmoidal growth with the rate and asymptote varying among replicates,
| (2.2) |
where θ represents a vector of parameters.
2.3 Variability
In Kleczkowski et al. (1996), model (2.2) was used to describe the average behaviour of replicate epidemics. Thus, the ensemble average at each time point tj, , is computed from individual measurements and assumed to follow model (2.2). Gibson et al. (1999) used a stochastic ‘version’ of equation (2.2) to describe the dynamics of each replicate epidemic and then combined the individual likelihoods in order to estimate the parameters common to the whole ensemble of epidemics. The infection levels in each replicate epidemic, Yij, represent the sum of individual processes involved in epidemic progress in a population of N=50 plants and up to 10 infection foci, each initiated by a unit of inoculum. There is a fundamental difference between the approach where model (2.2) is used to describe each individual outbreak (averaging over N individuals) and average behaviour of an ensemble of outbreaks (additional averaging over K replicates). Here, we use a similar approach to Gibson et al. (1999, 2004), but use the deterministic model (2.2) to describe the behaviour of each replicated epidemic Yij over time that distinguishes two sources of variability, within- and between-replicate variability.
We first consider the estimation of parameters for a collection of replicate epidemics. The link between the data and the model is provided by a likelihood function
| (2.3) |
to describe the probability that in the ith replicate we obtain a particular (vector) result y(i)=(Yij)T, conditioned on a particular value of the (vector) parameter Θ. In a general formulation, the vector parameter Θ encompasses all possible choices, including parameters common to all replicates, θ, and parameters specific for each replicate, θ(i). We assume that individual realizations are statistically independent, resulting in further expansion in (2.3).
A traditional approach (‘fitting to the mean’) assumes that the vector of the parameters Θ is common to all replicates. Thus,
| (2.4) |
where . An alternative approach (‘fitting to the replicates’) is to consider a hierarchical model in which each replicate is described by a different (vector) parameter, . The individual parameters are then assumed to be drawn from a certain distribution, characterized by another set of parameters (ω) called hyperparameters. Consider the following expression for the likelihood
| (2.5) |
in which the (hyper)parameters, ω, describe the distribution of θ.
We use two models to describe the individual probability for a given (ith) replicate , representing the within-replicate variation. Since the data are given as low-integer values, it is natural to represent them by a binomial distribution, with an average given by and the population size of N. However, the binomial model fails to capture the essential features of the infection process and therefore significantly overestimates the within-replicate variability. We therefore also consider a normal distribution for , which has the added flexibility of a free parameter, a variance . The variance is assumed to be constant along each individual trajectory. The within-replicate variability represents the temporal uncertainty within the particular outbreak and can be associated with both demographic stochasticity and measurement error (Lande et al. 2003).
The distribution of individual parameters, , is assumed to be lognormal, so that is normally distributed and characterized by two hyperparameters each, mean μ and variance σ2 (these are vectors corresponding to three parameters ), so that . In the simple approach adopted here we ignore correlation among parameters, but the model can be easily extended to incorporate it. The choice of the lognormal distribution allows us to keep the parameters positive while benefiting from the flexibility of the normal distribution. The between-replicate variability represents the uncertainty about which particular replicate is realized, and this is associated with demographic stochasticity as well as with environmental factors that affect replicates differentially.
Within a Bayesian interpretation, μ reflects our belief about the values of the parameters characterizing the ensemble as a whole, whereas σ2 describes our uncertainty about these estimates. may be interpreted as our belief in the values of parameters characterizing the particular ith outbreak conditioned on the hyperparameters ω, whereas collectively reflect our belief in the size of the outbreak given the individual parameters θ(i). Thus, we divide the total variability in the experiment into two components, the between-replicate variability () and the within-replicate variability (fixed for the binomial model, fitted for the normal model). Model predictions for the ith replicate at time tj are denoted as yi(tj).
2.4 Priors
In specifying the priors, we assume that we possess minimal knowledge of past epidemic outbreaks. This leads to flat priors for the hyperparameters μ, represented in the simulations by a normal distribution with a large variance. We also consider a range of more informative priors by using past studies to guide us in the choice of the particular prior. Kleczkowski et al. (1996) and Gibson et al. (2004) have found that the primary and secondary infection rates are of an order of magnitude 0.01 d−1, whereas the decay rate is of order 0.1 d−1. We represent this knowledge as a normal prior for μ (centred at −4 for and −2 for a; note that μ is related to and not to θ(i)), with a large variance leading to a relatively broad prior distribution. For the replicate standard deviation (σc), we not only used a flat prior over the interval , but also compared the results to an approximate 1/x prior for (Barnett 1999). For the hyperparameter, σ2, we assume that we have a strong prior belief that all replicate epidemics occur in similar environmental conditions and, therefore, we expect small between-replicate variability. A range of contrasting priors is considered for , represented by an exponential distribution , m=1, …, 3, corresponding to . Small values of λ represent a relatively shallow prior (e.g. λ=5, reflecting relatively weak prior belief in uniformity of the replicates) to a steep prior (λ=15, reflecting strong prior belief in uniformity of the replicates). We use λ=15 in the paper, but sensitivity to the prior form is addressed later.
Thus, the full model is given by
| (2.6) |
where we listed the priors shown in the figures. The sensitivity of results to particular assumptions about the model structure and priors is discussed below.
2.5 Forecasting and hindcasting
Here, we focus on a single replicate epidemic for which we wish to estimate either the unknown future (forecasting) or unknown past (hindcasting) trajectories. Prediction is therefore performed on one replicate (i=K1, where K1 is selected from 1 to 5), assuming that we only possess limited information about the trajectory of the selected epidemic. The remaining K−1=4 replicates act as a ‘library’ of past outbreaks used to calibrate the model. Splitting the replicate set into the library and the emerging outbreak allows us to assess the ability of our model to predict out-of-sample. For forecasting, we assume that for the K1-th replicate all data points up to a given point are known and the remaining course of the epidemic is to be estimated. To quantify the predictability, we assign the last point (j=14) to be of the particular interest for forecasting, since it represents the final size of the replicate epidemic.
For hindcasting, knowledge of the final points of the epidemic in replicate K1 is assumed. Quantification of hindcasting is made at t=8 d. This time is known to be associated with a switch from primary to secondary infections for the R. solani–radish system (Kleczkowski et al. 1996, 1997) and is therefore of biological interest. Predictability (either forecasting or hindcasting) is quantified by computing a 95% highest predictive density region (HPDR) and its width (Barnett 1999). We use symmetric HPDRs and therefore compute values of Y corresponding to 2.5 and 97.5% quantiles of the marginal posterior distribution and then calculate the difference between the two points.
2.6 Parameter estimation
Parameter estimation and prediction were performed using Markov chain Monte Carlo methods with Gibbs sampling, implemented in WinBUGS (WinBUGS v. 1.4, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK). Typically, 50 000 iterations were performed to burn in and then the model was run for 50 000 iterations to collect the statistics. The Markov chains for lower tiers of the hierarchy (individual parameters) mixed well and the distributions were stable even after a very short burn-in period, with small correlations between subsequent Markov chain steps. In contrast, the chains for hyperparameters took longer to mix and hence we chose a long burn-in period. The results were also tested for consistency of estimates of posterior densities of parameters over replicate simulation runs. Convergence was assessed by running quantiles and the Gelman–Rubin convergence statistic, as modified by Brooks & Gelman (1998).
3. Results
3.1 Disease progress
Average and replicate disease curves have a general sigmoidal shape, exhibiting strong correlation between successive observations (figure 1a–d) and large variability between replicate curves (figure 1c,d). The model (2.2) fits each replicate epidemic remarkably well (figure 1c,d), with small and uniform residuals (figure 1e,f), despite marked differences in shapes of individual curves and sizes of individual outbreaks. For some replicates, there is not enough information to estimate all three parameters successfully, for example if no or very little infection occurs or if it is present at low levels (replicates 2, 4 and 5 in the presence of T. viride). If we had no other information about replicate epidemics, it would not be possible to describe or predict such outbreaks. However, we can use estimated parameters for other replicates (replicates 1 and 3 in figure 1d) to enable the fit for the replicates that would have caused a problem if hierarchical Bayesian model had not been used (replicates 2, 4 and 5 in figure 1d). Despite large ensemble uncertainty, individual curves are smooth and their short-term behaviour can be predicted with considerable precision. This is reflected in a small value of the within-replicate uncertainty, represented by (point estimates are and for the case without T. viride and for the case with the biocontrol, respectively; see the 95% credible intervals in figure 1c,d and the spread of residuals in figure 1e,f). The residual variance does not change through time (figure 1e,f). Neither of these two features is captured by the model (2.6) in which ϵij are distributed according to the binomial distribution and we therefore show the results for the normal distribution only, in figure 1 and subsequent figures. Analysis of the autocorrelation (not shown here) suggests that there is no significant temporal correlation between residuals, except for replicate 5 where T. viride completely stops the spread of Rhizoctonia. This shows that most of the temporal autocorrelation seen in the disease progress data is removed by the deterministic model (2.2).
Figure 1.
Disease progress curves and residuals for damping-off of radish plants caused by R. solani in the absence (a,c,e) and in the presence (b,d,f) of the biocontrol agent, T. viride (Kleczkowski et al. 1996). (a,b) The average behaviour; circles represent observed data and lines the fit based upon the model of Kleczkowski et al. (1996), fitted to averaged replicate data. (c,d) The results of fitting model (2.2) to individual replicates, with solid lines representing the median of the simulated marginal posterior distribution of yi(tj). Broken lines show boundaries of the 95% high-density region based upon the normal approximation and for clarity are shown for replicate 1 only. Error bars show the maximum extent of the within-replicate variability assumed by the binomial model, for replicate 1 only. (e,f) Residuals for the fit in (c,d), respectively. In (c)–(f), data points are labelled 1–5 to indicate which replicate epidemic they represent. (This labelling is consistent across all subsequent figures.)
3.2 Parameter estimation
Hierarchical modelling allows us to estimate the parameters of individual epidemic outbreaks and at the same time to study features common to all replicated outbreaks. Information about individual outbreaks is given by a posterior distribution of individual parameters, whereas common characteristics are characterized by posterior distributions of hyperparameters in the second tier of the hierarchy, Pr(μ) and . Figure 2 shows marginal posterior distributions of and Pr(μ) for each of three parameters (lines). Figure 2b,d,f show how Pr(μ) compares with point estimates corresponding to individual replicates, (points labelled as in figure 1). Point estimates (medians and 95% symmetrical credible intervals) are given in table 1 and compared with values obtained previously by a maximum-likelihood method (Gibson et al. 1999), which are similar to those obtained by Bayesian methods in Gibson et al. (2004). The parameter estimates are consistent with the earlier results, but our method allows simultaneous estimation of parameters for individual replicates.
Figure 2.
Marginal posterior densities for individual replicate parameters θ(i) (a,c,e) and for population parameters μ (b,d,f), in the absence (thick line) and in the presence (thin line) of T. viride. In (b,d,f), labelled points represent modes of marginal posterior densities for individual replicate parameters θ(i), while labels 1–5 indicate which replicate epidemic they represent. For ease of viewing, the labels are arbitrarily distributed vertically and separated by a dashed line. (a,b) Primary infection rate, rp; (c,d) secondary infection rate, rs; (e,f) decay rate, a.
Table 1.
Values of parameters obtained by fitting the model (2.2) to the replicate data. (Point estimates (medians) and symmetric 95% credible intervals are given here for the hyperparameters and for the within-replicate standard deviation σc. The binomial model assumes that within-replicate variability is described by a binomial distribution, whereas the normal model assumes a normal distribution. Two prior sets were used for the normal distribution to illustrate the relative insensitivity to the model assumptions. The flat prior corresponds to a large standard deviation in the normal distribution for μ(103) and an approximately 1/x prior for , whereas the informative prior assumes standard deviation of 1 for the μ priors and uniform prior for . The values are compared with those of Gibson et al. (1999, 2004).)
| model | rate of primary infection (rp) | rate of secondary infection (rs) | rate of decay (a) | s.d. (σc) |
|---|---|---|---|---|
| without biocontrol | ||||
| binomial | 0.018 (0.009–0.0420) | 0.013 (0.004–0.025) | 0.17 (0.07–0.25) | |
| normal (flat prior) | 0.014 (0.006–0.032) | 0.016 (0.013–0.022) | 0.19 (0.14–0.25) | 1.13 (0.95–1.38) |
| normal (informative prior) | 0.013 (0.005–0.029) | 0.018 (0.012–0.024) | 0.20 (0.15–0.27) | 1.15 (0.96–1.40) |
| Gibson et al. (1999, 2004) | 0.02265 | 0.0118 | 0.167 | |
| with biocontrol | ||||
| binomial | 0.006 (0.001–0.045) | 0.006 (0.002–0.02) | 0.09 (0.02–0.29) | |
| normal (flat prior) | 0.004 (0.000–0.024) | 0.011 (0.008–0.015) | 0.14 (0.08–0.25) | 0.67 (0.56–0.81) |
| normal (informative prior) | 0.005 (0.001–0.031) | 0.011 (0.007–0.015) | 0.14 (0.07–0.29) | 0.67 (0.56–0.83) |
| Gibson et al. (1999, 2004) | 0.0074 | 0.0102 | 0.127 | |
The rates of primary infection are the most variable factor affecting disease progress (figure 2a,b). The biocontrol agent mainly affects the primary infections by reducing the overall level of infection, although there is some reduction in the rate of secondary infections (figure 2c,d) and the rate of decay (figure 2e,f). These results are consistent with Gibson et al. (1999, 2004), rather than with Kleczkowski et al. (1996), showing that a deterministic model with a ‘proper’ treatment of replicate behaviour exhibits behaviour similar to a fully stochastic model.
Marginal posterior distributions for individual replicate estimates (figure 2a,c,e) are much narrower than Pr(μ) (figure 2b,d,f), reflecting the relative smoothness of each replicate disease progress curve. Modes of the distribution are shown in figure 2 as point estimates. Two out of five replicates with T. viride are characterized by levels of primary infections very similar to the case without the control agent (figure 2a,b), whereas the remaining three replicates have much smaller values. Other parameters are much more consistent across the individual replicate estimates. Individual replicate estimates for the case without T. viride also show clustering in the values of the primary infection rate (replicates 1, 2, 3 versus 4, 5). The clustering is reflected in the differences among the individual disease progress curves (cf. figure 2a,b with figure 1c,d).
3.3 Predicting ensemble variability
If no information is available about which individual replicate we are likely to follow, prediction is linked to the between-replicate variability (figure 3). This figure shows marginal posterior density distribution , representing our belief in the size of an epidemic based on the population of outbreaks rather than on an individual replicate. These predictions are necessarily very broad (as we do not know a priori which replicate will be realized) and strongly affected by the choice of the prior information about the hyperparameters σ2 (see below for further discussion). There is a very clear distinction between the posterior distribution for the between-replicate variability with and without T. viride. Introduction of the biocontrol agent results in a highly skewed distribution. Thus, although most of the epidemics are controlled by the biocontrol agent, there is a substantial chance of an uncontrolled outbreak. This result reflects the difference between replicates 1–3 and 4 and 5 in the data (figures 1d and 2a; cf. data points in figure 3). The distributions are qualitatively similar to the posterior distributions reported by Gibson et al. (2004) for the fully stochastic model.
Figure 3.
Marginal posterior distributions for disease levels in the absence (a,c,e) and in the presence (b,d,f) of the biocontrol agent. Solid lines represent predictions for the population behaviour (between-replicate variability), whereas broken lines show predictions for one of the replicates (K1=1; within-replicate variability) based upon the normal distribution model for . Points show the extent of the within-replicate variability for the binomial model. Labels 1–5 represent replicated data to indicate which replicate epidemic they represent.
The within-replicate variability is very small, as shown in figure 3 (broken line) by the posterior distribution of under the assumption of the normal distribution for . This probability represents our belief in the size of the current ith outbreak and can be interpreted as model error, i.e. inability of the model to describe the existing replicate data. The narrow distribution in figure 3 is yet another manifestation of the smoothness of individual replicate data as contrasted with the uncertainty of the ensemble behaviour.
3.4 Forecasting
If only partial information about an individual replicate is given, forecasting may improve significantly with the amount of information we can supply (figure 4). As an example, we show the results for replicate K1=4 without T. viride, but similar results are obtained for other replicates (cf. figure 5). When only the first three data points are known (including the trivial point Yij=0 for t=4 d), the HPDR is broad, reflecting relatively large uncertainty in forecasting. However, observing the outbreak for three more days (up to day 9) results in significant reduction of uncertainty associated with our predictions (quantified here by the 95% HPDR). It is worth noting that if we simply concentrate on point estimates of the disease progress instead of on the whole distribution (represented by the median of the predictive posterior distribution of yK1(tj) at each time point; broken line in figure 4), the prediction is remarkably good, even when based on the first 3 days only (figure 4a). In particular, the distinction between clusters of replicates (replicates 1, 2 and 3 versus 4 and 5) is clearly made, even if based on the first three points only (figures 4a and 1).
Figure 4.
Forecasting and hindcasting individual outbreaks. (a,b) show 95% highest predictive density regions (HPDRs) for forecasting one of the replicate epidemics from figure 1 (K1=4). The thick line represents the ‘known’ part of the trajectory, whereas filled circles show the actual ‘future’ (or ‘past’) outbreak. Forecasting is based on (a) three (4–6 d) or (b) six (4–9 d) initial days of data. Hindcasting in (c,d) estimates the outbreak levels in the past based upon the last (c) three (17–19 d) or (d) six (13–19 d) data points. Open circles show replicates used as a library. The width of the HPDR at t=19 d (for forecasting) and 8 d (for hindcasting) is subsequently used to characterize predictability for individual replicates (figure 5). Broken lines show the median of the predictive posterior density.
Figure 5.
Forecasting and hindcasting individual outbreaks. (a,b) The width of the 95% highest predictive density regions (HPDRs) for forecasting final levels (t=19 d) of replicate epidemics. (c,d) The width of the 95% high-density regions for hindcasting the disease levels at t=8 d. Cut-off time for forecasting is the date for the last measurement, whereas for hindcasting it is the earliest date for which data are assumed to exist. Plots (a,c) correspond to the case without and (b,d) with the biocontrol agent.
The results can be quantified by repeating the forecasting procedure for all lengths of periods of known data and for all replicates (figure 5a,b). Uncertainty associated with forecasting is very similar for all replicates, despite very different temporal progress and very different final sizes of the epidemics (compare figure 5 to figure 1c,d). When no information about the current outbreak is known (t=4 d in figure 5a,b), the uncertainty of our predictions for the final size of the epidemic is very high. However, once we observe the outbreak for several days, prediction uncertainty is significantly reduced. There appears to be a critical period (t=8–10 d) beyond which there is no significant improvement in the uncertainty of prediction. Predictability of epidemics with T. viride is initially lower than in the absence of biocontrol agent, but improves significantly with time. This is, paradoxically, due to the large separation of the disease progress curves seen in figure 1d and the large variability in the rate of primary infections (figure 2a,b). Thus, once the rate of primary infections can be estimated with sufficient precision, the future course of the epidemic can be predicted very well.
3.5 Hindcasting
In contrast to forecasting, hindcasting is characterized by low levels of variability (figures 4c,d and 5c,d). The uncertainty is also independent of the replicate and of the lengths of the period for which we know the data. This paradoxical behaviour (the knowledge of more data does not improve the predictability) is associated with a very strong link between the final level of the epidemic and the rate of primary infections. Thus, once we know the final level of the epidemic, we can estimate primary infections very well. Since in addition we also know the starting point (as all epidemics are initiated at t1=0 at the level of Yij=0), the whole trajectory is then well defined. However, even though the uncertainty is relatively low, we still might be making a significant error in our estimation of infection levels in the middle of the epidemic, see figure 5c for ti=9–14 d.
3.6 Sensitivity to model choice
The model is based upon three key assumptions: (i) normal distribution for the within-replicate variability, (ii) normal but very broad priors for the mean of the hyperparameters, and (iii) tight exponential priors for the variance of the hyperparameters. The sensitivity of the results to each of these assumptions is summarized below.
The within-replicate variability was modelled by two distributions: binomial and normal. There was no significant difference in the parameter estimates for either of the distributions, but the normal distribution modelled the residuals better (figure 1e,f), particularly the small variance of the residuals (figure 1c,d). The predictability (figure 5) was also qualitatively similar for models with normal and binomial distributions, although the uncertainty levels were higher in the binomial case.
We found that the prior assumption for σ2 is the only model detail which affected the results in a substantial way. If we have a strong prior belief that all outbreaks come from the same population of epidemic outbreaks (i.e. we expect no significant between-replicate variability), we can choose λ to be large (steep prior; λ=15 as used here). This results in a relatively tight prediction for (as shown in figure 3), but creates problems for chain convergence. However, if we have no prior reason to believe that future outbreaks come from the same population (small λ; shallow prior), a new outbreak might be very different from the ones we have observed so far and the prediction is very uncertain. In contrast, the individual fits (figures 1–3) were not affected. For a shallow prior (small λ), the quality of the forecast (figure 5) was significantly lower when predicting early in the outbreak. However, once enough information about the particular replicate was collected, the predictability was independent of λ.
Unless prior knowledge was strongly concentrated around a specific value of μ, the posterior distribution for parameters was not affected (table 1). Similarly, the results were not significantly affected when the prior for was changed. We conclude that the method presented here is not sensitive to a wide choice of priors or detailed model assumptions, although prior information can be used to narrow down the estimates in cases when data are variable.
The replication in this dataset is relatively low (five populations per treatment). However, parameter estimation worked very well in our case, despite large variability among the replicates, even if the priors were chosen to be uninformative.
4. Discussion
Analysis and prediction of disease outbreaks is central to our ability to prevent and control epidemics. Two strategies have been conventionally used to study disease spread. One approach is focused upon a single outbreak whereby information collected for other similar outbreaks is ignored. Alternatively, data are collected for replicated epidemics, but analysis is concentrated on average behaviour. This duality leads to a paradoxical situation whereby in some cases we do not even attempt to collect or search for replicate data (for large epidemic outbreaks) while the information about individual replicates is ignored when it is available (for controlled experiments, for example in plant epidemiology or livestock diseases). There are several problems associated with each of these approaches.
If we are dealing with a single realization, we need to make strong and often arbitrary assumptions about the underlying probabilistic model that has generated this realization. Without knowledge of other realizations it is impossible to estimate the parameters describing the error structure and magnitude and to validate the assumptions. Thus, while we can predict the future behaviour of the emerging epidemic based upon limited information available for the initial stages of the epidemic, we often cannot give realistic bounds on the risk associated with the particular outbreak. Incorporation of replicate information enables us to carry out the risk assessment.
However, when replicate data are available but the underlying stochastic model is oversimplified (like in the fitting-to-mean procedure), we encounter a problem of averaging nonlinear disease progress curves. In this case, averaging of individual disease curves produces an outcome that is not equivalent to the dynamics with average parameters (Kleczkowski 1998; Gibson et al. 1999). In many cases, the average behaviour might even belong to a completely different class of functions than the individual replicates (Kleczkowski 1998). As a result, parameter estimations based upon average disease progress curves cannot be used reliably for making inferences about the underlying biological processes.
Finally, data for replicate epidemics often contain much more information about the actual biological mechanisms than the average data or single realizations. When rp is estimated for the case with T. viride, we see clustering of disease outbreaks into two subsets, with replicates 1 and 3 characterized by high levels of primary infection, whereas replicates 2, 4 and 5 exhibit suppressed levels of primary infection (cf. figure 2), with similar values for rs and a. This result suggests that T. viride failed to protect against primary infections in outbreaks i=1 and 3 leading to particularly high levels of infection at the end of the respective epidemics (cf. figure 1d). The clustering of replicate estimates of rp is also responsible for a long tail of the distribution Pr(μ) (cf. figure 2b) and for the increased variability in the presence of the biocontrol agent (cf. figure 3).
In this paper, we have also shown how to combine information about some past outbreaks with partial information about the current emerging epidemic. The hierarchical Bayesian framework can be extended to incorporate more informative prior knowledge about the epidemiological parameters than can subsequently be built into the model via priors for μ and σ2. This will narrow down predictability for future outbreaks and lower the requirements for expensive collection of data.
The method we propose in this paper distinguishes between two types of variability. The within-replicate variability is associated with a single replicate and is characterized by the posterior distribution . The between-replicate variability describes uncertainty associated with repeated outbreaks and is characterized by the posterior distribution . We have used a deterministic model to describe the epidemiological dynamics, an approach still widely used in plant epidemiology. In this approach, variability is described phenomenologically, instead of emerging from the dynamics themselves. Thus, in the model formulation we do not provide any details about the mechanism generating the variability. It could be argued that three sources of variation can been identified in many ecological and epidemiological systems (Nisbet & Gurney 1982; Lande et al. 2003). Measurement error is linked to the sampling procedures, which is particularly imperfect in scoring plants for disease symptoms. We assume perfect scoring here and so ignore this element, but it can be incorporated either in the model (for example, as a cryptic phase; Gibson et al. 2004) or into the within-replicate variability. Demographic variability is conventionally associated with chance events leading to unpredictability even when run under identical conditions and can be potentially associated with the within-replicate variability in our model. Environmental variability is conventionally associated with aperiodic temporal fluctuations affecting all individuals in the population (Nisbet & Gurney 1982; Lande et al. 2003). The latter concept can be generalized to include spatial variability in which subpopulations are affected differentially by local variations in the environment. Gibson et al. (1999, 2004) have shown that between-replicate variability can be described by a stochastic model with no additional systematic variation among replicated epidemics. This result shows that a careful definition of environmental and demographic variability is needed for systems in which systematic variation may be present. We suggest that hierarchical Bayesian modelling provides a flexible framework in which to assign a part of the between-replicate variability to stochastic and another part to systematic (environmental) variation. To do this, however, we need to extend our approach to incorporate a stochastic version of model (2.2) (Gibson et al. 1999, 2004; Kleczkowski 2005).
In the Bayesian framework, posterior distributions can be interpreted in terms of probabilities of an outbreak reaching a certain size. Thus, the framework developed here can be extended to calculate risks associated with disease outbreaks (Keeling et al. 2001; Claessen et al. 2005a,b). It can also be built into expert systems that allow real-time monitoring and intervention based on partial information about disease outbreaks.
The modelling approach developed here for plant epidemics can also be applied for human and animal diseases. The model is based upon a standard SI model and extensions to more complicated systems can be considered, with primary infections corresponding to external imports of infective individuals (Keeling & Grenfell 2002). The concept of a replicate, however, needs careful consideration as it might be difficult to obtain data for outbreaks under carefully controlled and replicable conditions. Herds of animals, farms, wards, hospitals and regions can be approximately treated as replicated epidemics, even if small levels of coupling between them are present. It is also possible to study replicates over time in recurrent epidemics (e.g. Ionides et al. (2006)), provided the cohorts of individuals are sufficiently distinct so that the carry-over effects from one season to another can be neglected. Our method can be generalized to include the time when the epidemic starts. Hindcasting would then allow us to trace the origin and history of the current outbreak, for example AIDS and HIV infections (Lemey et al. 2003).
Central to our paper is the distinction between within- and between-replicate variability. In the Bayesian interpretation, the between-replicate variability corresponds to our degree of belief in the outbreak size without any reference to the particular replicate: what would the behaviour be, were the disease to invade again many times? The within-replicate variability reflects, in the Bayesian sense, our degree of belief in properties of each individual replicate: what is the future behaviour of a particular outbreak? Kleczkowski (2005) has shown that contrast between within- (or replicate) and between-replicate (or ensemble) variability has its origins in properties of a simple birth process. In the exponential phase of disease spread, small differences in the initial conditions are magnified by the subsequent dynamics. This generates large differences among realizations while keeping the individual growth curves relatively smooth. Since many biological processes are adequately approximated by the exponential function, we suggest that the method developed here can be applied to many epidemiological, ecological and even metabolic processes (Colman-Lerner et al. 2005).
Acknowledgments
We gratefully acknowledge funding from the Department of Environment, Food and Rural Affairs and the Biotechnology and Biological Sciences Research Council. We also thank Dr Douglas Bailey for data used in this paper.
Footnotes
One contribution of 20 to a Theme Issue ‘Cross-scale influences on epidemiological dynamics: from genes to ecosystems’.
References
- Anderson R.M, May R.M. Oxford University Press; Oxford, UK: 1991. Infectious diseases of humans: dynamics and control. [Google Scholar]
- Bailey D.J, Gilligan C.A. Dynamics of primary and secondary infection in take-all epidemics. Phytopathology. 1999;89:84–91. doi: 10.1094/PHYTO.1999.89.1.84. [DOI] [PubMed] [Google Scholar]
- Bailey D.J, Kleczkowski A, Gilligan C.A. Epidemiological dynamics and the efficiency of biological control of soil-borne disease in consecutive crops. New Phytol. 2004;161:569–575. doi: 10.1111/j.1469-8137.2004.00973.x. [DOI] [PubMed] [Google Scholar]
- Barnett V. Wiley; Chichester, UK: 1999. Comparative statistical inference. [Google Scholar]
- Bjornstad O.N, Finkenstadt B.F, Grenfell B.T. Dynamics of measles epidemics: estimating scaling of transmission rates using a time series SIR model. Ecol. Monogr. 2002;72:169–184. [Google Scholar]
- Britton T. Epidemics in heterogeneous communities: estimation of Ro and secure vaccination coverage. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2001;63:705–715. doi: 10.1111/1467-9868.00307. [DOI] [Google Scholar]
- Brooks S.P, Gelman A. General methods for monitoring convergence of iterative simulations. J. Computat. Graph. Stat. 1998;7:434–455. [Google Scholar]
- Claessen D, Gilligan C.A, Lutman P.J.W, van den Bosch F. Which traits promote persistence of feral GM crops? Part 1: implications of environmental stochasticity. Oikos. 2005a;110:20–29. doi: 10.1111/j.0030-1299.2005.13667.x. [DOI] [Google Scholar]
- Claessen D, Gilligan C.A, Van den Bosch F. Which traits promote persistence of feral GM crops? Part 2: implications of metapopulation structure. Oikos. 2005b;110:30–42. doi: 10.1111/j.0030-1299.2005.13668.x. [DOI] [Google Scholar]
- Clark J.S, Gelfand A.E. Oxford University Press; Oxford, UK: 2006. Hierarchical modelling for the environmental sciences. [Google Scholar]
- Colman-Lerner A, Gordon A, Serra E, Chin T, Resnekov O, Endy D, Pesce C.G, Brent R. Regulated cell-to-cell variation in a cell-fate decision system. Nature. 2005;437:699–706. doi: 10.1038/nature03998. [DOI] [PubMed] [Google Scholar]
- Dennis B, Desharnais R.A, Cushing J.M, Costantino R.F. Nonlinear demographic dynamics—mathematical-models, statistical-methods, and biological experiments. Ecol. Monogr. 1995;65:261–281. doi: 10.2307/2937060. [DOI] [Google Scholar]
- Ferguson N.M, Donnelly C.A, Anderson R.M. The foot-and-mouth epidemic in Great Britain: pattern of spread and impact of interventions. Science. 2001;292:1155–1160. doi: 10.1126/science.1061020. [DOI] [PubMed] [Google Scholar]
- Ferguson N.M, Keeling M.J, Edmunds W.J, Gant R, Grenfell B.T, Amderson R.M, Leach S. Planning for smallpox outbreaks. Nature. 2003;425:681–685. doi: 10.1038/nature02007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finkenstadt B, Grenfell B. Empirical determinants of measles metapopulation dynamics in England and Wales. Proc. R. Soc. B. 1998;265:211–220. doi: 10.1098/rspb.1998.0284. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finkenstadt B, Keeling M, Grenfell B. Patterns of density dependence in measles dynamics. Proc. R. Soc. B. 1998;265:753–762. doi: 10.1098/rspb.1998.0357. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gibson G.J, Gilligan C.A, Kleczkowski A. Predicting variability in biological control of a plant–pathogen system using stochastic models. Proc. R. Soc. B. 1999;266:1743–1753. doi: 10.1098/rspb.1999.0841. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gibson G.J, Kleczkowski A, Gilligan C.A. A Bayesian analysis of botanical epidemics using stochastic compartmental models. Proc. Natl Acad. Sci. USA. 2004;101:12 120–12 124. doi: 10.1073/pnas.0400829101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilligan C.A. An epidemiological framework for disease management. Adv. Bot. Res. 2002;38:1–64. [Google Scholar]
- Gilligan C.A, Kleczkowski A. Population dynamics of botanical epidemics involving primary and secondary infection. Phil. Trans. R. Soc. B. 1997;352:591–608. doi: 10.1098/rstb.1997.0180. [DOI] [Google Scholar]
- Goldstein H. Hodder Arnold; London, UK: 2003. Multilevel statistical models. [Google Scholar]
- Grenfell B.T, Kleczkowski A, Ellner S.P, Bolker B.M. Measles as a case-study in nonlinear forecasting and chaos. Phil. Trans. R. Soc. A. 1994;348:515–530. doi: 10.1098/rsta.1994.0108. [DOI] [Google Scholar]
- Grenfell B.T, Wilson K, Finkenstadt B.F, Coulson T.N, Murray S, Albon S.D, Pemberton J.M, Clutton-Brock T.H, Crawley M.J. Noise and determinism in synchronized sheep dynamics. Nature. 1998;394:674–677. doi: 10.1038/29291. [DOI] [Google Scholar]
- Grenfell B.T, Bjornstad O.N, Finkenstadt B.F. Dynamics of measles epidemics: scaling noise, determinism, and predictability with the TSIR model. Ecol. Monogr. 2002;72:185–202. [Google Scholar]
- Haydon D.T, et al. Low-coverage vaccination strategies for the conservation of endangered species. Nature. 2006;443:692–695. doi: 10.1038/nature05177. [DOI] [PubMed] [Google Scholar]
- Ionides E.L, Breto C, King A.A. Inference for nonlinear dynamical systems. Proc. Natl Acad. Sci. USA. 2006;103:18 438–18 443. doi: 10.1073/pnas.0603181103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Keeling M.J, Grenfell B.T. Understanding the persistence of measles: reconciling theory, simulation and observation. Proc. R. Soc. B. 2002;269:335–343. doi: 10.1098/rspb.2001.1898. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Keeling M.J, et al. Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape. Science. 2001;294:813–817. doi: 10.1126/science.1065973. [DOI] [PubMed] [Google Scholar]
- Kleczkowski A. Statistical properties of dynamical systems with disturbances: variation in parameters. Acta Physica Polonica B. 1998;29:1717–1735. [Google Scholar]
- Kleczkowski A. Population and replicate variability in an exponential growth model. Acta Physica Polonica B. 2005;36:1623–1634. [Google Scholar]
- Kleczkowski A, Bailey D.J, Gilligan C.A. Dynamically generated variability in plant–pathogen systems with biological control. Proc. R. Soc. B. 1996;263:777–783. doi: 10.1098/rspb.1996.0116. [DOI] [Google Scholar]
- Kleczkowski A, Gilligan C.A, Bailey D.J. Scaling and spatial dynamics in plant pathogen systems: from individual to populations. Proc. R. Soc. B. 1997;264:979–984. doi: 10.1098/rspb.1997.0135. [DOI] [Google Scholar]
- Koop G. Wiley; New York, NY: 2003. Bayesian econometrics. [Google Scholar]
- Lande R, Engen S, Saether B.E. Oxford series in ecology and evolution. Oxford University Press; Oxford, UK: 2003. Stochastic population dynamics in ecology and conservation. [Google Scholar]
- Lemey P, Pybus O.G, Wang B, Saksena N.K, Salemi M, Vandamme A.M. Tracing the origin and history of the HIV-2 epidemic. Proc. Natl Acad. Sci. USA. 2003;100:6588–6592. doi: 10.1073/pnas.0936469100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nisbet R.M, Gurney W.C.S. Wiley; Chichester, UK: 1982. Modelling fluctuating populations. [Google Scholar]
- Pascual M.A, Kareiva P. Predicting the outcome of competition using experimental data: maximum likelihood and Bayesian approaches. Ecology. 1996;77:337–349. doi: 10.2307/2265613. [DOI] [Google Scholar]
- Taylor J.M.G, Cumberland W.G, Sy J.P. A stochastic-model for analysis of longitudinal AIDS data. J. Am. Stat. Assoc. 1994;89:727–736. doi: 10.2307/2290898. [DOI] [Google Scholar]
- Twisk J.W.R, Kemper H.C.G, Mellenbergh G.J. Mathematical and analytical aspects of tracking. Epidemiol. Rev. 1994;16:165–183. doi: 10.1093/oxfordjournals.epirev.a036149. [DOI] [PubMed] [Google Scholar]
- Zeger S.L, Karim M.R. Generalized linear-models with random effects—a Gibbs sampling approach. J. Am. Stat. Assoc. 1991;86:79–86. doi: 10.2307/2289717. [DOI] [Google Scholar]





