Abstract
During infectious disease outbreaks, delays in case reporting mean that the time series of cases is unreliable, particularly for those cases occurring most recently. Real-time estimates of the time-varying reproduction number, $R_t$, are therefore often made using a time series of cases only up until a time period sufficiently far in the past that there is some confidence in the case counts. As a result, the most recent estimates are usually out of date, inducing lags in the response of public health authorities. Here, we introduce an estimation method which makes use of the retrospective updates to case time series that happen as more cases that occurred historically enter the health system; these data encode within them information about the reporting delays, which our method also estimates. These estimates, in turn, allow us to estimate the true count of cases occurring most recently, allowing up-to-date estimates of $R_t$. Our method simultaneously estimates the reporting delays, true historical case counts and $R_t$ in a single Bayesian framework, allowing the uncertainty in each of these quantities to be accounted for. We apply our method to both simulated and real outbreak data, which shows that the method substantially improves upon naive estimates of $R_t$ which do not account for reporting delays. Our method is available in an open-source, fully tested R package, incidenceinflation. Our research highlights the value of keeping historical time series of cases, since changes to these data can help to characterize nuisance processes, such as reporting delays, allowing these to be accounted for when estimating key epidemic quantities.
This article is part of the theme issue ‘Uncertainty quantification for healthcare and biological systems (Part 1)’.
Keywords: Bayesian inference, outbreaks, infectious diseases, reproduction number
1. Introduction
A range of delays obstructs real-time surveillance efforts during infectious disease outbreaks [1], including the period from infection to symptom onset, delays in seeking care after symptom onset and reporting mechanisms that result in variation in case data more reflective of imperfections in the health services than of the outbreak signal [2]. Here, we focus on the estimation of the time-varying reproduction number, $R_t$, when substantial reporting delays mean the case data are unlikely to be completely recorded until potentially days or weeks after the infections occurred.
Reporting delays typically deflate recent case counts artificially, as the most recently occurring infections are yet to enter health records. This means that naive attempts to estimate the rate of epidemic growth or $R_t$ probably understate the seriousness of the current outbreak situation. This issue is well recognized in outbreak modelling, and one approach is to delineate reporting of epidemic quantities according to whether they belong to a so-called trusted period, when case data are probably near to complete, and a more recent period, when additional extrapolations are necessary to estimate epidemic quantities [3]. Using data on the delay between illness onset and subsequent notification into the health system, the reporting delay period can be estimated and projections made about recent cases. However, these projections typically require further assumptions to be made, for example, that transmission strength remains the same [3]. Because of these assumptions, such approaches do not allow near real-time estimation of $R_t$. An alternative approach is to make full use of the history of incomplete cases stratified by symptom onset date to determine reporting delays and, in turn, estimate recent cases—these approaches have thus far used heuristic statistical models to impute cases [4–9], which do not natively allow the inclusion of knowledge of epidemic spreading rates (e.g. the serial interval distribution) into predictions; it is also not clear that $R_t$ estimation is appropriate since the statistical model encodes a different delay distribution from that of the serial interval (which is used to estimate $R_t$ from cases [7]). Similar heuristic approaches have also been used to ‘nowcast’ deaths owing to infection in the presence of reporting delays [10]. The trajectories of cases reported with onset on a particular date have also been used to estimate true cases and subsequently to estimate $R_t$ [11], although this approach neglects the feedback between the epidemic process and case reporting—i.e. that if $R_t$ is higher, we should expect more cases to come to light; it also neglects uncertainty in the delay period.
Our key contribution is to use such data to develop a more mechanistic version of case and $R_t$ estimation in the presence of unknown reporting delays. Our method is Bayesian and naturally handles and outputs the joint uncertainty in infections, $R_t$ and the reporting delays, facilitating robust real-time decision-making in unfolding outbreaks. We make our method available through a fully tested open-source R package called incidenceinflation.1
2. Method
(a). Epidemic model
We assume that the count of cases with symptom onset on day $t$, $C_t$, follows a renewal process of the form
$$C_t \sim \text{dist}\!\left(R_t \sum_{s=1}^{t-1} w_s C_{t-s}\right), \qquad (2.1)$$
where dist denotes either a Poisson or negative binomial probability distribution. Here, $w_s$ represents an element of a discrete serial interval distribution, which is a probability distribution representing the typical variation in the period from symptom onset of a primary case to symptom onset of the cases caused by it. Using a discretized serial interval distribution, as in equation (2.1), is standard in infectious disease epidemiology, and it implicitly assumes that the serial interval is never negative and that symptomatic cases are caused only by those with symptom onset on previous days; violations of this latter assumption can be handled by using a renewal equation model with time-steps shorter than a day [12].
In this work, we assume that $R_t$ is piecewise-constant with the number and length of pieces decided a priori; future work could allow these to be estimated from the data [13]. The widths of the pieces used in our analyses are shown in electronic supplementary material, table S1.
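To make the generative model concrete, the following R sketch simulates case counts from the renewal process in equation (2.1) with a Poisson renewal distribution, a discretized gamma serial interval and a piecewise-constant $R_t$. This is a minimal sketch under our own assumptions (the function name, serial interval values and $R_t$ values are illustrative), not the incidenceinflation API.

```r
# Minimal sketch of the renewal process in equation (2.1); illustrative only.
simulate_renewal <- function(n_days, R, w, c_init = 10) {
  cases <- numeric(n_days)
  cases[1] <- c_init
  for (t in 2:n_days) {
    s_max <- min(t - 1, length(w))
    lambda <- sum(w[1:s_max] * cases[t - (1:s_max)])  # weighted sum of past cases
    cases[t] <- rpois(1, R[t] * lambda)               # Poisson renewal draw
  }
  cases
}

# Discretized gamma serial interval (e.g. mean 4.6 days, s.d. 4.8 days)
si_mean <- 4.6; si_sd <- 4.8
shape <- (si_mean / si_sd)^2; rate <- si_mean / si_sd^2
w <- diff(pgamma(0:21, shape = shape, rate = rate))
w <- w / sum(w)

set.seed(1)
R_t <- rep(c(2, 0.8), each = 50)  # piecewise-constant R_t (values assumed)
cases <- simulate_renewal(100, R_t, w)
```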
(b). Reporting delays
We suppose that the number of cases thought to have symptom onset on day $t$, $C_t$, can be updated as time passes because of a delay in cases entering health systems. Here, we assume that, owing to these delays, the number of cases with onset on day $t$ will be less than or equal to the true number of cases with onset at that time. This means that, for cases occurring most recently, we tend to underestimate the case counts (figure 1a). In reality, retrospective case counts may be revised both upwards and downwards owing to imperfections in reporting which come to light later on; here, we consider only underestimated case counts that arise owing to reporting delays, since this case reporting mechanism plagues attempts to obtain contemporary estimates.
Figure 1.
A schematic which shows: (A) how reporting delays affect $R_t$ estimates and (B) our Bayesian approach. In (A), we show how different delay durations cause the cases reported up until the present (coloured lines) to deviate from the true case counts (dashed line); relying on these case counts then results in the most recent $R_t$ estimates underestimating the true values (dashed lines). In (B), we illustrate our approach, which relies on having access to historical reported case counts for each onset day that can be thought of as reported case trajectories; these trajectories contain information both about the delay periods (ii) and the true case counts (iii) and inform our estimates of these; our estimates of case counts (which incorporate uncertainty) are then used to (iv) estimate $R_t$, and we assume $R_t$ is piecewise-constant.
Our method requires that we have access to historical estimates of case time series (figure 1b). We consider the number of cases with symptom onset on day $t$ when observed from some later day, $s$, which we represent by $C_t(s)$. Since case reporting imperfections are thought only to delay their reception into health systems, we implicitly assume that $C_t(s) \le C_t$, the true number of cases having onset on day $t$, and that, as more time passes, the number of cases recorded as arising at that time can only increase: if $s' > s$ then $C_t(s') \ge C_t(s)$. The true number of cases with symptom onset on day $t$, $C_t$, is then given by the case count at that time when viewed from a time infinitely far into the future: $C_t = \lim_{s \to \infty} C_t(s)$. This implicitly means that we assume that, ultimately, all cases are reported—this is unlikely to be true, and future work may consider relaxing this additional assumption, although, without information regarding the degree of underreporting, it is unclear whether this could be estimated from the case data. If the fraction of cases unobserved remains relatively fixed over time, our method should produce low-bias estimates of $R_t$. In reality, this fraction probably varies, particularly at the start of outbreaks, although all predominant methods for $R_t$ estimation rely on consistent surveillance.
We assume that the cases detected between day $s$ and a later day $s'$ follow a binomial distribution:
$$C_t(s') - C_t(s) \sim \text{binomial}\!\left(C_t - C_t(s),\; p_t(s, s')\right), \qquad (2.2)$$
where the first term on the right-hand side of equation (2.2), $C_t - C_t(s)$, is the number of cases yet to be detected; $p_t(s, s')$ is the probability that a case arising on day $t$ is observed in the interval $(s, s']$ given that it has not been observed by day $s$; this probability can be calculated as follows:
$$p_t(s, s') = \frac{S(s - t \,|\, \theta) - S(s' - t \,|\, \theta)}{S(s - t \,|\, \theta)}, \qquad (2.3)$$
where $S(\cdot \,|\, \theta)$ is the survival function representing the probability that a case occurring on day $t$ has yet to be detected by day $s$. This survival probability depends on the parameters, $\theta$, of the distribution characterizing the typical reporting delays: $S(s - t \,|\, \theta) = 1 - \text{CDF}(s - t \,|\, \theta)$, where CDF is the cumulative distribution function. In our current software implementation and for generating all results in this article, we assume that the reporting delay distribution is characterized by a gamma distribution with a mean, standard deviation parameterization.
Of course, it is possible that all of the true cases arising on day $t$ have already been detected by some later day $s$, meaning that there are no cases remaining at large. We effectively modify equation (2.2) so that any detection of cases that would take the total beyond the true number is assigned probability zero.
In real epidemics, the reporting delay probably varies through time and, in our package, we allow the parameters, $\theta$, to be piecewise-constant, with the user specifying the number and width of pieces a priori. We assume that the day of symptom onset determines the reporting delay distribution.
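As a concrete illustration of equations (2.2) and (2.3), the R sketch below computes the interval detection probability from a gamma delay distribution and simulates a reported-case trajectory $C_t(s)$ for a single onset day by successive binomial thinning. The function names and parameter values are our own assumptions; this is not the incidenceinflation API.

```r
# Sketch of the observation process in equations (2.2)-(2.3); illustrative only.
# theta = c(mean, sd) of a gamma reporting delay distribution.
delay_survival <- function(tau, theta) {
  shape <- (theta[1] / theta[2])^2
  rate <- theta[1] / theta[2]^2
  1 - pgamma(tau, shape = shape, rate = rate)  # S(tau | theta) = 1 - CDF
}

# Probability a case with onset on day t, unreported by day s, is reported in (s, s']
p_detect <- function(t, s, s_prime, theta) {
  (delay_survival(s - t, theta) - delay_survival(s_prime - t, theta)) /
    delay_survival(s - t, theta)
}

# Simulate the cumulative reported counts C_t(s) for a single onset day;
# assumes obs_days are consecutive days starting at t + 1
observe_cases <- function(true_cases, t, obs_days, theta) {
  reported <- 0
  sapply(obs_days, function(s_prime) {
    s <- max(t, s_prime - 1)  # previous observation day (onset day for the first)
    new_reports <- rbinom(1, true_cases - reported, p_detect(t, s, s_prime, theta))
    reported <<- reported + new_reports
    reported
  })
}

set.seed(2)
observe_cases(true_cases = 50, t = 10, obs_days = 11:40, theta = c(10, 3))
```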
(c). Two inference algorithms
We develop a Bayesian method for simultaneously estimating $R_t$, the true case counts and the probability distribution representing the reporting delays, which is characterized by parameters, $\theta$. We have developed two distinct estimation approaches, each of which uses Markov chain Monte Carlo (MCMC) sampling [14,15] to estimate these quantities.
The first of these we term the Poisson model since it requires that the renewal distribution ( in equation (2.1)) is a Poisson distribution and makes other restrictive assumptions. The second approach, which we call the unrestricted model, allows either a Poisson or negative binomial renewal model, and a wider variety of prior choices. This approach is generally more expensive, though, and requires a more refined initialization to ensure efficient sampling, so we find utility in using both methods.
Both approaches are from the same family of Gibbs-type algorithms. In §2d, we describe the Poisson model; in §2e, we describe the unrestricted model.
(d). Poisson model
This approach follows a Gibbs-type sampling algorithm comprising the following steps:
1. Generate an estimate of $R_{1:T}$ conditional on estimates of the true case counts, $C_{1:T}$, where $T$ is the last day on which observations occur;
2. Generate estimates of the true case counts, $C_{1:T}$, conditional on $R_{1:T}$ from the previous step and $\theta$;
3. Generate an estimate of $\theta$ conditional on estimates of $C_{1:T}$ from the previous step.
Like all MCMC methods, the algorithm requires initialization of these three types of quantity, and this initialization is usually from an arbitrary distribution, ideally one that emulates the posterior distribution. We found inference for the Poisson model insensitive to these starting points and, in all examples we present here, we initialized $R_t$ at the same value in all time periods; the initial case histories were given by those reported in the most recent time period; and (all) reporting periods were characterized by a gamma distribution with a mean of 10 days and standard deviation of 3 days.
(i). Sampling $R_t$
We assume that $R_t$ is piecewise-constant with pieces of a given time-width. Within a given piece, $j$, the estimate of $R_j$ is informed by all the cases generated while that piece applies; since the cases within each piece also depend on the history of cases, the estimate additionally depends on all cases before the piece.
If the renewal model given by equation (2.1) follows a Poisson distribution, we can independently sample an $R_j$ value within a piece if we know the case history, $C_{1:T}$, and assume priors of the form $R_j \sim \text{gamma}(a, b)$ (shape–scale parameterization), representing a prior mean of $ab$. In this situation, we can analytically calculate the posterior distribution of $R_j$, which is given by
$$R_j \,|\, C_{1:T}, a, b \sim \text{gamma}\!\left(a + \sum_{t \in \mathcal{T}_j} C_t,\; \left(\frac{1}{b} + \sum_{t \in \mathcal{T}_j} \Lambda_t\right)^{-1}\right), \qquad (2.4)$$
where $\mathcal{T}_j$ denotes the set of days within piece $j$ and $\Lambda_t = \sum_{s=1}^{t-1} w_s C_{t-s}$. In each step of the algorithm, we draw all $R_j$ estimates from their conditional posterior distributions of the form given by equation (2.4).
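The following R sketch illustrates this conjugate Gibbs draw for a single piece. The helper name, the prior values and the assumption that the piece starts after day 1 are ours; a shape–scale gamma prior with mean $ab$ is assumed, consistent with equation (2.4).

```r
# Sketch of the conjugate update in equation (2.4); illustrative only.
# 'cases' is the full (true) case series, 'w' the discretized serial interval,
# 'piece' the indices of the days belonging to piece j, and (a, b) the gamma
# prior in its shape-scale parameterization (prior mean a * b).
sample_R_piece <- function(cases, w, piece, a = 1, b = 5) {
  lambda <- sapply(piece, function(t) {
    s_max <- min(t - 1, length(w))
    if (s_max < 1) return(0)
    sum(w[1:s_max] * cases[t - (1:s_max)])
  })
  shape_post <- a + sum(cases[piece])
  scale_post <- 1 / (1 / b + sum(lambda))
  rgamma(1, shape = shape_post, scale = scale_post)
}
```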
(ii). Sampling the true case counts
We now derive the conditional posterior for a true case count, $C_t$, from which we then independently draw a value in a Gibbs-type update step. To do so, we write out the numerator of Bayes’ rule considering a single unknown case count, $C_t$, dropping any terms that do not include it. Here, we assume a discrete uniform prior for $C_t$ up until a maximum, $C^{\max}$, which we describe later:
$$p\big(C_t \,|\, C_{-t}, C_t(s_1), \ldots, C_t(s_{K_t}), R_{1:T}, \theta\big) \propto p\big(C_t(s_1), \ldots, C_t(s_{K_t}) \,|\, C_t, \theta\big)\, p\big(C_t, C_{t+1:T} \,|\, C_{1:t-1}, R_{1:T}\big), \qquad (2.5)$$
where $K_t$ represents the number of observations made of the case counts arising on a particular day, and
$$p(C_t \,|\, \cdot) \propto \left[\prod_{k=1}^{K_t} p\big(C_t(s_k) \,|\, C_t(s_{k-1}), C_t, \theta\big)\right] \text{dist}\big(C_t \,|\, R_t \Lambda_t\big) \prod_{t' = t+1}^{T} \text{dist}\big(C_{t'} \,|\, R_{t'} \Lambda_{t'}\big), \qquad (2.6)$$
where $\text{dist}(x \,|\, \mu)$ indicates the probability mass function for the distribution dist evaluated at $x$ for parameters $\mu$; $C_{-t}$ indicates all true case counts except that for cases arising at time $t$. In equation (2.6), we show the three distinct contributions to the probability: that from the retrospective observations of the cases arising at time $t$ when viewed at later times $s_1, \ldots, s_{K_t}$; the next two terms are associated with the evolution of the underlying state: the first of these is the probability of obtaining $C_t$ cases at time $t$ given a past history of cases; the second of these concerns case counts after $t$, which depend on $C_t$ since it forms part of the case history.
The posterior defined by equation (2.6) is non-standard. But because $C_t$ is a discrete random variable, we can calculate the right-hand side of equation (2.6) across a finite range of integer values, $C_t(s_{K_t}) \le C_t \le C^{\max}$, where $s_{K_t}$ is the latest observation time for retrospective cases; this implicitly means our prior for $C_t$, given by the middle term on the right-hand side of equation (2.6), is truncated. Then, because we have only a finite set of unnormalized probabilities, we can determine their sum and form a (normalized) discrete probability distribution from which we can independently draw ([15], chapter 14). As long as the uppermost value of $C_t$ selected is high enough that the probability $p(C_t > C^{\max}) < \epsilon$, where $\epsilon$ is tiny, then this should not introduce substantial bias (i.e. underestimation) into estimates. In our current implementation, the user selects a maximum possible count of true cases, $C^{\max}$, although future work could refine this to allow a dynamic maximum that ensures $p(C_t > C^{\max}) < \epsilon$, improving the efficiency of the sampling.
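The sketch below illustrates this enumerate–normalize–sample step in R for a single onset day. For brevity it includes only the observation terms and the renewal term for $C_t$ itself, omitting the contribution of later days; names and inputs are our own assumptions rather than the package interface.

```r
# Sketch of the discrete Gibbs update for a true case count C_t; illustrative only.
# 'obs' holds the cumulative reported counts C_t(s_1), ..., C_t(s_K) and 'p_obs'
# the corresponding interval detection probabilities p_t(s_{k-1}, s_k).
sample_true_count <- function(obs, p_obs, R_t, lambda_t, c_max) {
  candidates <- max(obs):c_max
  log_post <- sapply(candidates, function(ct) {
    new_reports <- diff(c(0, obs))         # cases newly reported in each interval
    at_large <- ct - c(0, head(obs, -1))   # cases still undetected at each interval start
    sum(dbinom(new_reports, size = at_large, prob = p_obs, log = TRUE)) +
      dpois(ct, R_t * lambda_t, log = TRUE)  # renewal term for C_t itself
  })
  probs <- exp(log_post - max(log_post))   # stabilize on the log scale before normalizing
  candidates[sample.int(length(candidates), 1, prob = probs)]
}
```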
(iii). Sampling the parameters characterizing the reporting delays
We can again calculate the conditional posterior distribution up to a normalization constant:
$$p\big(\theta \,|\, C_{1:T}, \{C_t(s_k)\}\big) \propto \left[\prod_{t=1}^{T} \prod_{k=1}^{K_t} p\big(C_t(s_k) \,|\, C_t(s_{k-1}), C_t, \theta\big)\right] p(\theta), \qquad (2.7)$$
where $p(\theta)$ is the prior on $\theta$. In our current implementation of the method, we model the delays as being generated from a gamma distribution with parameters, $\theta = (\mu_d, \sigma_d)$, representing the mean delay ($\mu_d$) and the standard deviation in the delay ($\sigma_d$). In our current software, each of these parameters is assigned an independent gamma prior.
In each iteration, we use the random walk Metropolis algorithm ([15], chapter 13) to draw values of $\theta$ from the posterior defined by equation (2.7).
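A single random walk Metropolis update of this kind might look like the following R sketch, in which log_posterior stands for a user-supplied function evaluating the logarithm of equation (2.7); the proposal standard deviations are illustrative assumptions.

```r
# Sketch of a random walk Metropolis step for theta = c(mean, sd); illustrative only.
metropolis_step <- function(theta, log_posterior, proposal_sd = c(0.5, 0.2)) {
  theta_prop <- theta + rnorm(length(theta), sd = proposal_sd)
  if (any(theta_prop <= 0)) return(theta)  # outside the support: reject immediately
  log_ratio <- log_posterior(theta_prop) - log_posterior(theta)
  if (log(runif(1)) < log_ratio) theta_prop else theta
}
```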
(e). Unrestricted model
The unrestricted model is also a Gibbs-type algorithm but comprises two steps:
1. Generate estimates of $R_{1:T}$ and $\theta$ (and, if a negative binomial model is used, an estimate of the overdispersion parameter, $\phi$) conditional on estimates of $C_{1:T}$ from the previous step;
2. Generate estimates of the true case counts, $C_{1:T}$, conditional on $R_{1:T}$ from the previous step and $\theta$.
Step 2 of this algorithm is the same as step 2 described in §2d(ii), except we allow users to optionally replace the Poisson distribution in equation (2.6) by a negative binomial distribution; this introduces an overdispersion parameter, $\phi$, where, as $\phi \to \infty$, the distribution approaches a Poisson.
To allow for more general models of the process, we use Stan’s NUTS sampler in step 1 [16]. In all examples, we used 10 Stan iterations each time step 1 was called, and our updated parameter estimates were those output in the final of these iterations. Stan’s NUTS algorithm uses various adaptive phases where reasonable hyperparameters of this algorithm are first determined. To ensure that the algorithm ran efficiently and without biases owing to divergent iterations ([15], chapter 15), we determined appropriate hyperparameters through initial runs of the algorithm. Our approach requires that we repeatedly call Stan to perform sampling (i.e. each time step 1 is called), and we assumed that the hyperparameters were initially the same across each run, but we allowed Stan to automatically tune these.
Unlike for the Poisson model, we found efficient inference for the unrestricted model to depend more acutely on the starting points of the Markov chains. To minimize the length of the warm-up period, we initialized all Markov chains at the maximum a posteriori (MAP) estimates returned by an initial optimization step (see §2f). The risk of this is that we fail to explore the full posterior space, resulting in biased estimates ([15], chapter 12), but we found this risk was more than compensated for by the gains in sampling efficiency.
(f). Optimization
Both models permit rapid estimates of the parameters to be obtained using optimization. These estimates do not have uncertainties associated with them but provide a guide to the consequences of making different assumptions on the resultant estimates. For both models, we replace each of the algorithm’s Gibbs-type sampling steps with algorithms that maximize the conditional log-posterior; our aim is then to produce MAP estimates. This iterative approach may reach local maxima, but, in practice, we have not found this to occur.
For the Poisson model: in step 1, we use an analytical result for the MAP estimate of each $R_j$; in step 2, our search space is discrete and we simply take the case count for each onset time which maximizes the conditional posterior; in step 3, we use the default optimization algorithm provided by base R’s optim function.
For the unrestricted model: in step 1, we use the same approach to estimate cases as for the Poisson model (i.e. its step 2); in step 2, we use R’s optim function with the L-BFGS-B optimization algorithm [17].
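For the continuous parameters, such a step might be implemented roughly as follows in R, where conditional_log_posterior stands for a user-supplied function; the function name, starting values and bounds are illustrative assumptions rather than the package implementation.

```r
# Sketch of an L-BFGS-B optimization step for continuous parameters; illustrative only.
# Here theta = c(mean delay, sd of delay); the bounds keep both strictly positive.
fit <- optim(par = c(10, 3),
             fn = function(theta) -conditional_log_posterior(theta),
             method = "L-BFGS-B",
             lower = c(1e-3, 1e-3))
theta_map <- fit$par  # MAP estimate given the current case counts
```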
(g). Estimation of $R_t$ when assuming case data are perfect
In some results, we showcase the difference in $R_t$ estimates between our approach, which simultaneously estimates case counts, and a naive approach which treats the case data as being perfect. To do this, we used Stan’s NUTS algorithm [16] to perform $R_t$ estimation since it allowed us to make diverse sets of assumptions about the renewal model and reporting delay processes.
(h). Prior choice
Throughout this work, we used relatively wide and generally uninformative priors (see table 1).
Table 1.
Priors used in all analyses. Note that all parameters were assigned independent prior distributions; $\mu_d$ and $\sigma_d$ refer to the mean and standard deviation of the reporting delay distributions, which are assumed to be parameterized by gamma distributions throughout this work.
| parameter | prior | justification |
|---|---|---|
| $R_t$ | gamma (mean = 5, s.d. = 5) | conservative prior giving 82% weight for $R_t > 1$ |
| $\mu_d$ | gamma (mean = 10, s.d. = 15) | prior 2.5–97.5% interval: <0.1–53 days |
| $\sigma_d$ | gamma (mean = 5, s.d. = 15) | prior 2.5–97.5% interval: <0.1–47 days |
| $C_t$ | discrete-uniform (from the latest reported count to $C^{\max}$) | empirical prior across wide range |
(i). MCMC convergence monitoring
Throughout this study, we performed diagnostic checks that the MCMC had converged. We diagnosed convergence if the $\widehat{R}$ statistic was close to 1 (and, in the majority of cases, very close to 1) for all parameters, and we used the posterior R package [18] to calculate this statistic.
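As an illustration, $\widehat{R}$ can be computed with the posterior package roughly as follows; the draws, the matrix layout and the threshold shown are placeholder assumptions, since the exact cut-off used in our analyses is not reproduced here.

```r
# Sketch of a convergence check with the posterior package; illustrative only.
library(posterior)
draws <- matrix(rnorm(4000), nrow = 1000, ncol = 4)  # iterations x chains, placeholder values
rhat_value <- rhat(draws)
converged <- rhat_value < 1.01  # example threshold, not necessarily the one used here
```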
(j). Reproducibility and reliability
The methodological work underpinning this article is technical, and we used multiple approaches to ensure that our results are reproducible. We used the targets R package to set up a reproducible data analysis pipeline for producing all our figures [19]. We used the renv package to manage our R environment [20], which makes it easier for others to duplicate our software environment when rerunning our results.
To allow others to use our methods, all code underlying our inference algorithms was developed into two separate R packages: incidenceinflation [21], corresponding to the Poisson model; and incidenceinflationstan [22], corresponding to the unrestricted model. Each of these is straightforwardly installable using only a single line of R code.
To mitigate against the risk of coding errors, we wrote unit tests, resulting in high testing coverage for both R packages. We also used continuous integration testing to minimize the risk that changes introduced into the code at later times would break earlier tests.
The materials to reproduce the analyses in this paper are freely available in a GitHub repository [23].
3. Results
We illustrate how our method performs using a combination of simulated and real outbreak data. In §3a–c, we use simulated data. In §3d,e, we use real outbreak data for dengue fever and measles, respectively.
(a). Our approach can simultaneously estimate infections, reporting delays and $R_t$ given case histories
We begin by simulating an outbreak using equations (2.1) and (2.2) assuming a COVID-19-like serial interval distribution with a mean of 4.6 days and a standard deviation of 4.8 days [24]. We assume a time-invariant reporting delay distribution characterized by a gamma distribution with a mean of 10 days and a standard deviation of 3 days.
In figure 2a, we show a snapshot of the observed cases at an observation time of 100 days (orange line) and the true case count (black line). The stark divergence between these two series illustrates the unreliability of recent case observations, which, in our model, underestimate the true case counts. On the same plot, we show our estimates of the true case count (green line shows posterior median estimates; 95% central credible interval estimates shown by shading), which are reasonably close to the ground truth. This also illustrates that the uncertainty in our estimates of cases grows the closer to the present day we consider. This is because, intuitively, we have the least information about those cases occurring most recently.
Figure 2.
Simultaneously estimating reporting delays, case counts and $R_t$ for a simulated dataset. For this analysis, we consider ourselves at an observation date of 100 days post the start of the outbreak. In (A), we show the observed cases at this observation date (orange line) and the true cases which are eventually reported (black line); our estimates are shown in green (the line represents the posterior median). In (B), we show the true $R_t$ values (dashed black line); we also show three sets of estimates: those from a naive approach which assumes the case data are perfect; another representing the gold standard where we know the true case counts; and a final set using our method. In (C), we show the true reporting delay distribution (black line) and our recovered estimates (green shading). In all panels, the uncertainties shown represent the 2.5–97.5% posterior intervals.
In figure 2b, we show two sets of $R_t$ estimates produced using only information available 100 days after the outbreak onset: naive estimates assuming a standard Poisson renewal model where the reported cases are assumed perfect; and the estimates of $R_t$ from our incidenceinflation approach. We compare these with the ground truth values (dashed line). We also compare these with estimates under the gold standard scenario, where the case counts are perfectly known; this effectively represents the situation where we are looking back from a much later vantage point.
Unsurprisingly, this shows that blind reliance on the observed cases results in downwardly biased $R_t$ estimates. It also shows that our approach can generate reasonable estimates by leveraging the information inherent in the case histories. The point estimates we produce are similar to those in the gold standard scenario, although ours have wider uncertainty intervals for the most recent period owing to the uncertainty about the true case count.
In figure 2c, we display the reporting delay distribution (black line) and that recovered by our method (green line and shading), which shows them in good agreement.
Since we have access to both true and observed cases over time, we repeated the same exercise as above but considering different observation times. In figure 3, we show estimates of the cases (in (a)) and $R_t$ (in (b)) across a range of times throughout the outbreak. Figure 3a shows that for each vantage point considered, our uncertainty bounds for case counts contained the actual values. It also illustrates that our method is able to handle epidemics that are growing or shrinking in size.
Figure 3.
Reliable estimation of case counts and $R_t$ across a range of vantage points. In both (A) and (B), each plot shows estimates assuming an observer at the labelled number of days post the start of the outbreak; e.g. ‘50’ denotes an observer producing estimates at 50 days post the start of the outbreak.
Figure 3b shows that the naive approach which uses only the observed case counts leads to underestimates of $R_t$ at each vantage point. It also shows that our approach and the gold standard approach, given perfect information, produce broadly similar estimates of $R_t$, albeit our estimates typically have greater uncertainty associated with them, reflecting the additional uncertainty over the case counts.
(b). Our method can account for reporting delays which vary over the epidemic, but estimates of case counts and $R_t$ are sensitive to assumptions made about the time periods when the delays are constant
Reporting delays often shorten during outbreaks as health systems improve. We now simulate an outbreak with shortening reporting delays and an Ebola-like serial interval distribution characterized by a mean of 15.3 days and a standard deviation of 9.3 days (the all-countries serial interval reported in [25], table 6). We allow for superspreading by generating the outbreak assuming a negative binomial renewal model with a fixed overdispersion parameter, $\phi$. We also use a negative binomial model to fit the data.
In figure 4a, we show how the trajectories of retrospectively reported cases change throughout the outbreak: the reporting delays are initially long and shorten after the first 50 days of the outbreak, and this is characterized by more slowly rising trajectories and then more abruptly increasing trajectories.
Figure 4.
An Ebola-like outbreak with shortening reporting delays. (A) Shows the trajectories of retrospectively reported cases with onset on a particular day (where each line first stems from the horizontal axis). Here, we colour the lines according to whether they occur before or after a change in the delay period at an onset time of 50 days; before this point, the reporting delay distribution has a mean of 12 days and a standard deviation of 5 days; after it, the mean is 5 days and the standard deviation is 3 days. In both time periods, the reporting delay distribution is characterized by a gamma distribution. The inset figure in (A) shows the true reporting delay distributions (dashed black lines) and posterior draws of the densities (coloured lines). In (B), we show the true cases with onset on each day (black dashed line); we consider a vantage point of 90 days after the outbreak began and show the currently recorded time series of case counts (orange line). The green line shows our method's posterior median estimate of the case count and the ribbon shows the 2.5–97.5% posterior distribution estimates. In (C), we show the true $R_t$ values used to generate the outbreak (dashed lines) and the values recovered by our method (green lines show posterior medians and the uncertainty ribbon shows the 2.5–97.5% posterior distribution estimates).
We allow our model to estimate different reporting delay distributions in the first 50 and remaining days of the outbreak; in the inset panel in figure 4a, we show that our model can accurately estimate each of these distributions.
In figure 4b, we show the true case series (dashed black line) and the reported cases (orange line) from an observer’s point of view at 90 days since the outbreak began. This shows that the case counts nearest the current time are most underestimated. Our method explicitly accounts for these reporting delays and so can accurately estimate the true case counts (green line with uncertainty shown by shading). The result of this is that the $R_t$ estimates for the most recent time period remain reasonable (figure 4c).
In a real outbreak, of course, it is unknown when and if changes to the reporting delay distribution occur; this means that the results we show in figure 4 represent a best-case scenario. We now consider a more realistic situation when we do not know if and when the reporting delays change. We produce estimates of the various epidemiological quantities for the same Ebola-like outbreak under three scenarios: one matching the best-case scenario already described; another where we assume there are no changes in the reporting delays; and a final situation, where we assume that the reporting delay distributions remain the same within bins of width 30 days (shorter bins caused issues with estimation owing to insufficient data to produce reliable estimates). We also use this as an opportunity to show that our method allows for rapid estimates using optimization rather than sampling methods; these estimates do not have uncertainty associated with them but provide a quick way to get an idea of the implications of different assumptions on the resultant estimates. In our experience, these first optimization steps can be very useful as part of a full Bayesian workflow.
In electronic supplementary material, fig. S1, we show the estimates of the reporting delay distribution in each of these three circumstances. This shows that in all circumstances, fewer than 10 iterations (taking only a few minutes for each situation) of the optimization algorithm were sufficient to converge to reliable estimates. In those situations where the assumed reporting bins overlapped a regime in which only a single reporting delay distribution was present, our estimates of the reporting delay distribution were close to the actual values. In those situations where the assumed bins overlapped multiple reporting delay regimes, the estimated distribution was a mix of the corresponding reporting delay distributions.
A consequence of mistakenly assuming that reporting delays are constant when, in fact, they change, is biased estimates of the history of cases. In electronic supplementary material, fig. S2A–C, we show how our estimates of the true case counts progress as our optimization algorithm proceeds for each of the three differing assumptions being made about the periods when the reporting delays are constant. In panels b and c, the reporting delay periods are correctly assumed constant in the last of the subperiods, and the estimated case counts are close to the actual values. In panel a, corresponding to the situation when we fail to account for any change in reporting period over the outbreak, we underestimate the most recent reporting delays (see electronic supplementary material, fig. S1C) meaning that we generally overestimate the most recent case counts.
A consequence of overestimating cases is that we overestimate $R_t$, although the degree of overestimation here is not substantial (electronic supplementary material, fig. S2D). By contrast, the estimates of $R_t$ when we correctly characterize the reporting delay distributions are slightly below the true values (electronic supplementary material, figs. S2E and F). This figure also shows how information between estimates of case counts and $R_t$ is shared throughout the course of the algorithm: as case counts fall, so does $R_t$, and vice versa.
(c). Uncertainty in recent case data results in conservative estimates of the probability that an outbreak goes extinct
We now simulate a waning Ebola epidemic assuming reporting delays are described by a gamma distribution with a mean of 10 days and a standard deviation of 3 days. We simulate it such that no cases are reported to have occurred in the last week and consider determining the probability that the epidemic has finished. We compare three approaches; all are based on the same underlying epidemic but suppose differing access to information or use different approaches to calculate this probability: one which has perfect case and $R_t$ information as a gold standard; another which naively uses the reported case counts but has perfect $R_t$ information; and our approach, which uses information on reporting delays to inflate the case counts and accounts for uncertainty in both case counts and $R_t$.
In figure 5a, we show the projections for the gold standard scenario. In this scenario, the actual case count before an onset time of 100 days is perfectly known, and this approach produces daily case projections that are all lower than 10 for the next 50 days. In figure 5b, we show the naive scenario projections: these are similar, but the upper bound of the projections is lower than in the gold standard scenario. In figure 5c, we show the projections from our method, which reconstructs past cases and includes this uncertainty (and uncertainty in $R_t$) when making projections. Owing to these additional uncertainties, there is considerably more variability in the projections.
Figure 5.
Outbreak projections for a waning epidemic for three different scenarios. In each panel, we consider an observation time of 100 days after the onset of the outbreak; the black series represent the past case series: for each of panels (A) and (B), these are a single series; for panel (C), it represents a family of series since our method nowcasts cases; the blue lines show projections assuming transmission remains the same in the future. For the projections, 100 iterations were used.
We quantify the probability that an outbreak is over by performing 1000 projections for each scenario and counting the fraction of them with zero cases 50 days in the future. For the gold standard approach, 71% of projections resulted in the epidemic dying out. The naive approach results in overly confident determinations that the epidemic has died out, with 86% of projections resulting in it dying out. The corresponding probability for our method was 67% owing to the added uncertainties incorporated into our approach. Our method’s estimates are therefore conservative and risk-averse, which is appropriate for policymaking.
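The calculation behind these percentages can be sketched in R as below, reusing the renewal simulator from the earlier sketch; holding $R_t$ fixed at a single future value is a simplification of our approach, which instead propagates posterior draws of the case counts and $R_t$ into the projections.

```r
# Sketch of estimating the probability that the outbreak is over; illustrative only.
# 'cases' is the (reconstructed) case series up to the observation day, 'w' the
# serial interval and 'R_future' an assumed future reproduction number.
prob_extinct <- function(cases, w, R_future, horizon = 50, n_sims = 1000) {
  extinct <- replicate(n_sims, {
    traj <- c(cases, numeric(horizon))
    for (t in (length(cases) + 1):length(traj)) {
      s_max <- min(t - 1, length(w))
      lambda <- sum(w[1:s_max] * traj[t - (1:s_max)])
      traj[t] <- rpois(1, R_future * lambda)
    }
    traj[length(traj)] == 0  # no cases on the final projected day
  })
  mean(extinct)
}
```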
(d). Dengue fever in Puerto Rico
We now consider a time series of laboratory-confirmed dengue fever cases from Puerto Rico collected by the Puerto Rico Department of Health and Centers for Disease Control and Prevention; this dataset was originally analysed by [26] who developed a statistical model for predicting dengue cases accounting for the delay between the time of case onset according to a clinician and the time when the laboratory results were returned.
The data are weekly, and with such data comes the risk that cases occurring earlier in each week could spawn cases that were reported later in the week. We consider this a minor concern since the serial interval for dengue is thought to be well in excess of a week even at high temperatures where it is shortest [27,28], accounting for both the human and mosquito stages of the transmission cycle. Here, for simplicity, we assume a temperature-independent serial interval of four weeks which is the median estimate reported by [28] at 25°C; to allow a wide range of serial intervals potentially encompassing some fluctuation owing to temperature variation, we assume a standard deviation of two weeks.
We first assessed whether the reporting delay period was consistent throughout the analysis period by considering the empirical CDFs for each onset date within each subperiod. This showed there was minimal variation in the reporting day distribution over time (electronic supplementary material, fig. S3), and we assumed a fixed delay distribution throughout our period of analysis in our modelling.
We fitted our model to the dengue case data and used it to estimate the true case counts at a range of observation points. Since we had access to the numbers of cases that were eventually reported for each onset date, we were able to assess the performance of our model at recovering the true case counts. In figure 6, we show the true case counts (black line), the reported cases corresponding to eight different observation dates (dashed lines) and the case counts recovered by our method (coloured solid lines and uncertainty intervals). Our method was generally good at recovering the true cases the further they were back in time from the observation date. The estimates of case counts were more variable the closer these were to the observation date, although these always represented a substantial improvement over using the raw reported case counts. Generally, our case estimates were best for those periods of relative stasis in the case counts; in those periods where there was epidemic growth, our estimates overshot their true counts, probably owing to changes in the reproduction number over short periods.
Figure 6.
Dengue case estimation across a range of observation dates. The black line shows the true case count (i.e. the reported case counts at a time long after the corresponding onset date). We consider eight observation dates, where we assume there is access only to the cases reported up until that date (dashed coloured lines). For each of these observation dates, we used our method to infer the true case counts, and the posterior median estimates are shown by coloured lines with 2.5–97.5% posterior quantiles shown by uncertainty ribbons.
We next focused on $R_t$ estimation from these case data, and we considered the same eight observation dates. In electronic supplementary material, fig. S4, we show three sets of estimates: naive estimates, which assume the true case counts were given by the cases reported up until each observation date (i.e. the dashed orange lines in figure 6); the gold standard estimates, which had access to the true case counts for each date—these assume perfect foresight over the case counts; and a set of estimates from our method. For each observation date, the naive approach substantially underestimated $R_t$ for the most recent period. Our set of estimates was always higher than those produced by the naive approach, and the uncertainty from our method was generally higher than from the other two methods, owing to the additional uncertainty from not knowing the true case counts. Generally, the estimates output by our method were closer to the gold standard estimates when the estimated case counts were closer to the true counts: for example, for an observation date of 10 June 1991, the estimated case counts were similar to the true values (figure 6; red), and the estimates from our method and the gold standard were similar; on 19 August 1991, our case estimates overshot the true case counts (figure 6; blue), and our $R_t$ estimates also overshot.
(e). 2013–2014 measles outbreak in the Netherlands
We now explore the performance of our approach using daily data from a measles outbreak in the Netherlands in 2013−2014 [4]. This dataset, unlike most case count series, contains information on both the onset time of the case and the time the case was recorded. We assume a serial interval distribution parameterized by a gamma distribution with a mean of 11.7 days and standard deviation of 2.15 days (the mean is the overall estimate reported by [29]; the standard deviation is the mean across all standard deviations reported in their table 3).
We tried fitting the data with both the Poisson and negative binomial renewal models; the fits were similar, while the Poisson model estimates were much faster to generate, so we used the Poisson model to generate the results we present.
We constructed empirical CDFs for the reported cases, assuming the last reported case counts represented the true counts for each onset date for three distinct periods during the outbreak (figure 7). This showed that the delay periods generally increased throughout the outbreak; we allowed for this during inference by allowing separate estimates of the reporting delay distribution in three distinct periods (those shown in figure 7a).
Figure 7.
Netherlands measles outbreak and reporting delays. (A) Shows the daily measles case counts and labels three subperiods. (B) Visualises the reporting delay distributions in each of the corresponding subperiods; in each panel, the dashed orange line shows the mean empirical cumulative distribution function (eCDF) and the solid orange line indicates the median reporting delays. Each (non-orange) solid line indicates an eCDF corresponding to a particular onset date within that period. The (non-orange) lines are coloured according to the maximum cases observed for each of the onset dates.
In figure 8, we show the true and recovered case counts using our method across six different observation dates; we also show the cases reported up until that observation date. Our estimates were generally comparable with the true case counts with the exception of 25 August 2013, when our estimates undershot the true counts.
Figure 8.
Measles case estimation across a range of observation times. Each panel shows the cases reported up until a specific observation time (in orange); the black lines show the true case counts; the green lines show our estimates of the case counts (posterior medians) and the uncertainty ribbons show the 2.5–97.5% posterior intervals.
In figure 9, we show the corresponding $R_t$ estimates for each of the six observation dates. We also show naive estimates, corresponding to assuming the cases reported up until the observation date are correct, and gold standard estimates, which assume access to the true case counts. Across the six scenarios, our estimates always improved upon the naive estimates. The uncertainty bounds from our method, however, did not always contain the true values.
Figure 9.
Measles $R_t$ estimation across a range of observation times. Each panel corresponds to a particular observation date. Within each panel, we present three sets of estimates: naive estimates, where we assume the reported cases are perfect; gold standard estimates, where we have access to the true case counts; and a set produced by our method, incidenceinflation. The uncertainty ribbons show the 2.5–97.5% posterior estimates from each method; the lines show the posterior median estimates.
Overall, these results suggest that our method does not allow sufficient uncertainty in either the estimated case counts or the $R_t$ values, which suggests model misspecification. Changing the Poisson renewal model to a negative binomial model did not greatly change the estimated cases (see electronic supplementary material, fig. S5 for one example). This suggests that the problem may lie with the reporting delay model, and electronic supplementary material, fig. S6 suggests that this may be owing to bunching together of reported cases. A beta-binomial reporting model implicitly allows the probability of reporting to fluctuate, but the fits of the model to the data remained relatively poor (electronic supplementary material, fig. S7), and we discuss in §4 potential future research to handle such data.
4. Discussion
Outbreak data are imperfect, and one of the key corruptions of these data comes from reporting delays, which means that reliance on observed cases alone produces downwardly biased estimates of $R_t$. In real outbreak scenarios, analysts know that recent case data are not reliable, and analyses are then restricted to a so-called trusted region, which fails to produce contemporaneous estimates. Here, we present a framework that explicitly acknowledges reporting delays and makes use of data representing a shifting history of case series. Our method is Bayesian and jointly estimates the true cases, the reporting delays and $R_t$, accounting for uncertainty in each of these sources. Our method allows up-to-date estimates of $R_t$, providing policymakers with timely information on which to act.
Our method is only as good as the assumptions we make about the underlying disease transmission process and the reporting delays. Here, we assumed that $R_t$ was piecewise-constant, and these pieces must be decided upon a priori. For the dengue data we analysed, it looks as if there were often shifts in $R_t$ within the pieces assumed, which sometimes caused our estimates of cases to overshoot the mark (figure 6). Of course, we could choose to make these windows narrower, but the process of selecting appropriate piece widths could be automated using Bayesian non-parametric processes [13]; alternatively, adapting our method to estimate $R_t$ in a more continuous fashion, for example, by adapting EpiEstim [30] and EpiFilter [31], could be worthwhile. Similarly, we assume that the distributions characterizing reporting delays remain constant within given time intervals, and this must be decided upon before running our inference approach—this process can, however, be informed by visual inspections of the data such as those we show here (electronic supplementary material, fig. S3).
Throughout, we assumed that there was a binomial observation process which governed the numbers of cases observed in a fixed interval (equation (2.2)). We demonstrated that, for the measles data in particular, this was not a good approximation to the process governing how cases are reported, where they often appear to occur in groups. This is possibly because finding one additional case may lead to a whole group of their infected contacts also being located, violating an implicit assumption of the binomial distribution, and our estimates lacked uncertainty owing to this. We tried a beta-binomial distribution here, but this did not greatly improve the fit while adding substantially to the runtime. Further work building a more mechanistic model of how reporting delays arise should lead to more appropriate estimates of uncertainty. In addition, incorporating the flexibility of more heuristic approaches to case count estimation (for example, using the methods of [5]) could allow a more dynamic and accurate depiction of reporting delays.
The predominant approaches for $R_t$ estimation, like EpiEstim [30], are inherently backward looking. This means that estimates of $R_t$ are probably inefficient since they use case data only up until that time point. A better approach would be to use all the information in the case data; methods such as EpiFilter do this [31]. Assuming piecewise-constant $R_t$ values is a sort of halfway house between these two extremes, since assuming that $R_t$ is fixed within a piece forces case data later than a time point to be used to determine estimates at that point. But our estimates ignore all information after the end of a piece, and future work could consider how to incorporate this information. In addition, there is merit in trying to extend our method to handle more continuous prior distributions for $R_t$ natively. Our unrestricted model can handle this but at quite a cost to computational runtime. Efficiency savings by using different sampling strategies may be possible, and particle filtering approaches may be fruitful to consider.
Computational methods for estimating key epidemiological quantities such as $R_t$ either implicitly or explicitly make assumptions about the quality of the underlying data on which these estimates are based. Predominant approaches to $R_t$ estimation implicitly assume the case series are perfect, which ignores the manifold sources of error in these data. Here, we considered one predominant source of measurement imperfection for disease cases—that owing to delays in the time taken for cases to enter health systems—but there are many other types of measurement error. One such error comes from underreporting, and cure models may prove useful in extending our framework to account for this [32]; another form is idiosyncratic errors, where case counts may be either above or below their true values. Indeed, the measles data we considered here probably had a more complex form of reporting delay than that embedded in our model, and further work is needed to develop better models for these. More generally, the predominant statistical approaches to $R_t$ estimation make somewhat arbitrary assumptions about the nature of the noise within the renewal approaches. Future work to produce models that more mechanistically account for how this noise arises should lead to estimates with more appropriate uncertainties.
Acknowledgements
The authors wish to thank Anne Cori and Natsuko Imai for the initial conversations that inspired this work. RNT would like to thank members of the Infectious Disease Modelling group in the Mathematical Institute at the University of Oxford for useful discussions about this work.
Footnotes
Contributor Information
Sumali Bajaj, Email: sumali.bajaj@biology.ox.ac.uk.
Robin Thompson, Email: robin.thompson@st-hildas.ox.ac.uk.
Ben Lambert, Email: ben.lambert@stats.ox.ac.uk.
Data accessibility
All data underlying this study are available in our public GitHub repository [23].
Supplementary material is available online [33].
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors’ contributions
S.B.: formal analysis, investigation, methodology, software, visualization, writing— original draft, writing—review and editing; R.T.: conceptualization, data curation, formal analysis, investigation, methodology, writing—original draft, writing—review and editing; B.L.: conceptualization, data curation, formal analysis, investigation, methodology, project administration, resources, software, supervision, validation, visualization, writing—original draft, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
S.B. is supported by the Clarendon Scholarship and St Edmund Hall, University of Oxford, and the Natural Environment Research Council Doctoral Training Partnership (grant number NE/S007474/1).
References
- 1. Gostic KM, et al. 2020. Practical considerations for measuring the effective reproductive number, Rt. PLoS Comput. Biol. 16, e1008409. (doi:10.1371/journal.pcbi.1008409)
- 2. Gallagher K, Creswell R, Gavaghan D, Lambert B. 2023. Identification and attribution of weekly periodic biases in epidemiological time series data. medRxiv.
- 3. Barry A, et al. 2018. Outbreak of Ebola virus disease in the Democratic Republic of the Congo, April–May, 2018: an epidemiological study. Lancet 392, 213–221. (doi:10.1016/s0140-6736(18)31387-4)
- 4. van de Kassteele J, Eilers PHC, Wallinga J. 2019. Nowcasting the number of new symptomatic cases during infectious disease outbreaks using constrained p-spline smoothing. Epidemiology 30, 737–745. (doi:10.1097/ede.0000000000001050)
- 5. Bastos LS, Economou T, Gomes MFC, Villela DAM, Coelho FC, Cruz OG, Stoner O, Bailey T, Codeço CT. 2019. A modelling approach for correcting reporting delays in disease surveillance data. Stat. Med. 38, 4363–4377. (doi:10.1002/sim.8303)
- 6. Stoner O, Economou T. 2020. Multivariate hierarchical frameworks for modeling delayed reporting in count data. Biometrics 76, 789–798. (doi:10.1111/biom.13188)
- 7. Günther F, Bender A, Katz K, Küchenhoff H, Höhle M. 2021. Nowcasting the COVID-19 pandemic in Bavaria. Biom. J. 63, 490–502. (doi:10.1002/bimj.202000112)
- 8. Kline D, Hyder A, Liu E, Rayo M, Malloy S, Root E. 2022. A Bayesian spatiotemporal nowcasting model for public health decision-making and surveillance. Am. J. Epidemiol. 191, 1107–1115. (doi:10.1093/aje/kwac034)
- 9. Sahai SY, Gurukar S, KhudaBukhsh WR, Parthasarathy S, Rempała GA. 2022. A machine learning model for nowcasting epidemic incidence. Math. Biosci. 343, 108677. (doi:10.1016/j.mbs.2021.108677)
- 10. Bergström F, Günther F, Höhle M, Britton T. 2022. Bayesian nowcasting with leading indicators applied to COVID-19 fatalities in Sweden. PLoS Comput. Biol. 18, e1010767. (doi:10.1371/journal.pcbi.1010767)
- 11. Li T, White LF. 2021. Bayesian back-calculation and nowcasting for line list data during the COVID-19 pandemic. PLoS Comput. Biol. 17, e1009210. (doi:10.1371/journal.pcbi.1009210)
- 12. Ogi-Gittins I, Hart WS, Song J, Nash RK, Polonsky J, Cori A, Hill EM, Thompson RN. 2024. A simulation-based approach for estimating the time-dependent reproduction number from temporally aggregated disease incidence time series data. Epidemics 47, 100773. (doi:10.1016/j.epidem.2024.100773)
- 13. Creswell R, Robinson M, Gavaghan D, Parag KV, Lei CL, Lambert B. 2023. A Bayesian nonparametric method for detecting rapid changes in disease transmission. J. Theor. Biol. 558, 111351. (doi:10.1016/j.jtbi.2022.111351)
- 14. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. 2013. Bayesian data analysis. New York, NY: CRC Press.
- 15. Lambert B. 2018. A student’s guide to Bayesian statistics. London, UK: Sage Publications Ltd.
- 16. Carpenter B, et al. 2017. Stan: a probabilistic programming language. J. Stat. Softw. 76, 1. (doi:10.18637/jss.v076.i01)
- 17. Zhu C, Byrd RH, Lu P, Nocedal J. 1997. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23, 550–560.
- 18. Bürkner PC, Gabry J, Kay M, Vehtari A. 2023. posterior: tools for working with posterior distributions. R package version 1.4.1. See https://mc-stan.org/posterior/.
- 19. Landau W. 2021. The targets R package: a dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. J. Open Source Softw. 6, 2959. (doi:10.21105/joss.02959)
- 20. Ushey K, Wickham H. 2023. renv: project environments. R package version 1.0.3.
- 21. Bajaj S, Lambert B. 2024. incidenceinflation. See https://github.com/ben18785/incidenceinflation.
- 22. Bajaj S, Lambert B. 2024. incidenceinflationstan. See https://github.com/ben18785/incidenceinflationstan.
- 23. Bajaj S, Thompson R, Lambert B. 2024. Incidence inflation paper repository. See https://github.com/ben18785/incidenceinflation_paper_released.
- 24. Nishiura H, Linton NM, Akhmetzhanov AR. 2020. Serial interval of novel coronavirus (COVID-19) infections. Int. J. Infect. Dis. 93, 284–286. (doi:10.1016/j.ijid.2020.02.060)
- 25. Van Kerkhove MD, Bento AI, Mills HL, Ferguson NM, Donnelly CA. 2015. A review of epidemiological parameters from Ebola outbreaks to inform early public health decision-making. Sci. Data 2, 150019. (doi:10.1038/sdata.2015.19)
- 26. McGough SF, Johansson MA, Lipsitch M, Menzies NA. 2020. Nowcasting by Bayesian smoothing: a flexible, generalizable model for real-time epidemic tracking. PLoS Comput. Biol. 16, e1007735. (doi:10.1371/journal.pcbi.1007735)
- 27. Siraj AS, Oidtman RJ, Huber JH, Kraemer MUG, Brady OJ, Johansson MA, Perkins TA. 2017. Temperature modulates dengue virus epidemic growth rates through its effects on reproduction numbers and generation intervals. PLoS Negl. Trop. Dis. 11, e0005797. (doi:10.1371/journal.pntd.0005797)
- 28. Codeço CT, Villela DAM, Coelho FC. 2018. Estimating the effective reproduction number of dengue considering temperature-dependent generation intervals. Epidemics 25, 101–111. (doi:10.1016/j.epidem.2018.05.011)
- 29. Vink MA, Bootsma MCJ, Wallinga J. 2014. Serial intervals of respiratory infectious diseases: a systematic review and analysis. Am. J. Epidemiol. 180, 865–875. (doi:10.1093/aje/kwu209)
- 30. Thompson RN, et al. 2019. Improved inference of time-varying reproduction numbers during infectious disease outbreaks. Epidemics 29, 100356. (doi:10.1016/j.epidem.2019.100356)
- 31. Parag KV. 2021. Improved estimation of time-varying reproduction numbers at low case incidence and between epidemic waves. PLoS Comput. Biol. 17, e1009347. (doi:10.1371/journal.pcbi.1009347)
- 32. Othus M, Barlogie B, LeBlanc ML, Crowley JJ. 2012. Cure models as a useful statistical tool for analyzing survival. Clin. Cancer Res. 18, 3731–3736. (doi:10.1158/1078-0432.ccr-11-2859)
- 33. Bajaj S, Thompson RN, Lambert B. 2025. Supplementary material from: A renewal-equation approach to estimating Rt and infectious disease case counts in the presence of reporting delays. Figshare. (doi:10.6084/m9.figshare.c.7657351)