Philosophical transactions. Series A, Mathematical, physical, and engineering sciences. 2022 Aug 15;380(2233):20210303. doi: 10.1098/rsta.2021.0303

Estimation of local time-varying reproduction numbers in noisy surveillance data

Wenrui Li 1, Katia Bulekova 2, Brian Gregor 2, Laura F White 3, Eric D Kolaczyk 1,4
PMCID: PMC9376722  PMID: 35965456

Abstract

A valuable metric in understanding local infectious disease dynamics is the local time-varying reproduction number, i.e. the expected number of secondary local cases caused by each infected individual. Accurate estimation of this quantity requires distinguishing cases arising from local transmission from those imported from elsewhere. Realistically, we can expect identification of cases as local or imported to be imperfect. We study the propagation of such errors in estimation of the local time-varying reproduction number. In addition, we propose a Bayesian framework for estimation of the true local time-varying reproduction number when identification errors exist. And we illustrate the practical performance of our estimator through simulation studies and with outbreaks of COVID-19 in Hong Kong and Victoria, Australia.

This article is part of the theme issue ‘Technical challenges of modelling real-life epidemics and examples of overcoming these’.

Keywords: Bayesian modelling, local reproduction number, identification error, infectious disease epidemiology

1. Introduction

Epidemic modelling, while not at all new, has taken on renewed importance due to the COVID-19 pandemic. The local time-varying reproduction number is an important quantity for monitoring the infectiousness and transmissibility of diseases and, therefore, for designing and adjusting public health responses during an outbreak. Recent examples include monitoring transmission of the COVID-19 pandemic and demonstrating the efficacy of non-pharmaceutical interventions in more than 100 countries [1–6]. The value of the local time-varying reproduction number, Rlocal(t), represents the expected number of secondary local cases arising from a primary case infected at time t. Different formal definitions of Rlocal(t) have been proposed, and a number of methods are available to estimate this quantity. The widely used EpiEstim estimator is an estimator of the instantaneous reproduction number, defined as the ratio of the expected number of incident locally infected cases at time t to the expected total infectiousness of infected individuals at time t [7,8]. In implementing this estimator, we typically smooth cases over a sliding window. This can make the estimator less timely, but it smooths out much of the noise due to day-of-week effects in reporting and other random fluctuations, giving a clearer trend.

Distinguishing local cases from imported cases is essential to estimation of the local time-varying reproduction number [7,9]. However, surveillance data generally are available only up to some level of error. For example, if we are unable to identify the correct source of infection from contact tracing or genetic information, imported cases might be misclassified as local cases, and vice versa. Such misclassification error is recognized as one limitation of estimating Rlocal(t) in the COVID-19 outbreak [10,11]. We investigate how identification error affects the estimation of the instantaneous reproduction number and, thus, our understanding of disease transmission dynamics.

Extensive work has been done on improving inference for time-varying reproduction numbers. For instance, there have been efforts to estimate the serial interval that is used to compute the total infectiousness for Rlocal(t) estimation, including Bayesian parametric estimation using data augmentation Markov chain Monte Carlo (MCMC) [7,12], and a cure model for limited follow-up data [13]. Many studies have explored the effects of imperfect detection and estimated the true infection prevalence [11,14–16]. But, to the best of our knowledge, little attention has been given to date to accounting for identification errors of local and imported cases.

Our contribution in this paper is to quantify how such errors propagate to the local time-varying reproduction number, and to provide estimators for Rlocal(t) when contact tracing survey information is available. Adopting the definition of Rlocal(t) proposed by Thompson et al. [7], we characterize the impact of identification errors on the bias of noisy local time-varying reproduction numbers. Our work shows that, in general, the bias can be expected to be non-trivial. Accordingly, we propose a Bayesian framework to estimate the true local time-varying reproduction number. Numerical simulation suggests that high accuracy is possible for estimating local time-varying reproduction numbers in outbreaks of even modest size. We illustrate the practical use of our estimators in the context of the COVID-19 pandemic in Hong Kong and Victoria, Australia.

The organization of this paper is as follows. In §2, we show the bias of the noisy local time-varying reproduction number, and propose a Bayesian hierarchical framework to estimate the true local time-varying reproduction number with imperfect knowledge. Section 3 reports the practical performance of our estimators through simulation studies and with SARS-CoV-2 infections in Hong Kong and Australia. Finally, we conclude in §4 with a discussion of future directions for this work.

2. Methods

In this section, we first quantify the bias of the noisy local time-varying reproduction number when misidentification occurs in the surveillance data. We then build a Bayesian hierarchical framework to estimate true local time-varying reproduction numbers. We also propose a method to estimate misidentification rates based on contact tracing survey data, which informs the prior distribution in the model.

(a) Notation

Both the seminal Fraser [17] article and the Thompson et al. [7] article that we work from follow what seems to be a tendency in the epidemiology literature to conflate empirical processes and their means. For the purposes of designing the simulation study and handling other Bayesian aspects, we distinguish between processes and means more precisely in this paper. Specifically, we use the letter I to denote the empirical processes and the letter μ to denote their means. The (local) time-varying reproduction number involves μ only. The plug-in estimator of the time-varying reproduction number in [17] involves I only. The estimator of the local time-varying reproduction number proposed by Thompson et al. [7] involves both μ and I. One of the reasons that [7] used both empirical values and population values might be that this estimator is easier to work with. We note that we use the sum notation for empirical processes and the integral notation for their means.

To clarify the terminology, we state the technical differences among the terms error, bias and accuracy as used in this paper. If the surveillance data we have are not the same as the underlying truth, we say that the surveillance data contain some error. Here, error refers to the differences between the surveillance data and the truth. The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. We say an estimator is of high accuracy if the bias and variance of the estimator are relatively small.

The number of newly infected cases at time t, I(t), is the sum of the numbers of local (Ilocal(t)) and imported (Iimported(t)) cases. If one assumes independence between calendar time and the generation interval, g(s), then the local time-varying reproduction number is defined as [7]

$$R_{\text{local}}(t)=\frac{\mu_{\text{local}}(t)}{\int_{0}^{\infty} g(s)\,\mu(t-s)\,ds}, \tag{2.1}$$

where μlocal(t)=E[Ilocal(t)] and μ(t)=E[I(t)]. Note that from the perspective of simulation, the distinction between empirical values and population values seems potentially important, for the reason that ‘the expectation of a ratio is not the ratio of expectations’. Specifically, to calculate a true local time-varying reproduction number from simulation, we have expectations in the numerator and denominator, each of which can be approximated over a large number of trials through sample averages.
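The point that a ratio of expectations differs from an expectation of ratios can be illustrated with a minimal Python sketch. The trial values below are made up for illustration and do not come from the paper's simulations, and `ratio_of_means` and `mean_of_ratios` are hypothetical helper names. Each trial supplies a numerator (a local count) and a denominator (a total-infectiousness term) at a single time point.

```python
def ratio_of_means(numerators, denominators):
    """Approximate E[num] / E[den] by the ratio of sample averages."""
    n = len(numerators)
    return (sum(numerators) / n) / (sum(denominators) / n)


def mean_of_ratios(numerators, denominators):
    """Approximate E[num / den] by averaging the per-trial ratios."""
    ratios = [a / b for a, b in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


# Hypothetical values from three simulation trials at one time point.
local_counts = [2.0, 4.0, 6.0]
infectiousness = [1.0, 4.0, 4.0]

r_from_means = ratio_of_means(local_counts, infectiousness)   # (12/3) / (9/3)
r_from_ratios = mean_of_ratios(local_counts, infectiousness)  # (2 + 1 + 1.5) / 3
print(r_from_means, r_from_ratios)
```

The first quantity is the one that matches the form of (2.1), with each expectation replaced by a sample average over trials; the second is generally a different number.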

In reality, we only know the serial interval and the number of diagnosed cases. Let I(t), Ilocal(t) and Iimported(t) be the numbers of total diagnosed cases, local diagnosed cases and imported diagnosed cases at time t, respectively. Then we define a realistic local time-varying reproduction number as

$$R_{\text{local}}(t)=\frac{\mu_{\text{local}}(t)}{\int_{0}^{\infty} w(s)\,\mu(t-s)\,ds}, \tag{2.2}$$

where w(s) is the serial interval, μlocal(t)=E[Ilocal(t)] and μ(t)=E[I(t)]. Note that the serial interval corresponds to date of symptom onset. One can estimate symptom onset dates by back calculation of report dates [18].

Realistically, we can expect identification of cases as local or imported to be imperfect. Let $\tilde{I}_{\text{local}}(t)$ and $\tilde{I}_{\text{imported}}(t)$ be the numbers of new local and imported cases reported at time $t$, with identification error. Thus, we define a noisy local time-varying reproduction number as

$$\tilde{R}_{\text{local}}(t)=\frac{\tilde{\mu}_{\text{local}}(t)}{\int_{0}^{\infty} w(s)\,\mu(t-s)\,ds}, \tag{2.3}$$

where $\tilde{\mu}_{\text{local}}(t)=E[\tilde{I}_{\text{local}}(t)]$. The definition of $\tilde{R}_{\text{local}}(t)$ in (2.3) comes from an argument that mimics the original argument using Poisson arrivals in [17]. Specifically, we suppose that we observe a Poisson stream (also known as a Poisson process, i.e. a sequence of statistically independent and memoryless arrival times, the counts of which are Poisson distributed random variables) $\tilde{I}_{\text{local}}(t)$ whose dependence on calendar time $t$ is expressed through the transmissibility, denoted $\tilde{\beta}_{\text{local}}(t,s)$, an arbitrary function of calendar time $t$ and time since infection $s$. Then, $\tilde{\mu}_{\text{local}}(t)$ follows the so-called renewal equation

$$\tilde{\mu}_{\text{local}}(t)=\int_{0}^{\infty}\tilde{\beta}_{\text{local}}(t,s)\,\mu(t-s)\,ds. \tag{2.4}$$

Following [17], we have

$$\tilde{\beta}_{\text{local}}(t,s)=\tilde{R}_{\text{local}}(t)\,w(s). \tag{2.5}$$

Inserting (2.5) into (2.4) yields the definition of $\tilde{R}_{\text{local}}(t)$ in (2.3).

Our interest is in characterizing the manner in which the uncertainty in $\tilde{I}_{\text{local}}(t)$ and $\tilde{I}_{\text{imported}}(t)$ propagates to the local time-varying reproduction number, and in providing estimators of $R_{\text{local}}(t)$ that account for identification errors.

(b) Bias of the noisy local time-varying reproduction number

We quantify the bias of the noisy local time-varying reproduction number in (2.3) when misidentification occurs. We begin by defining a model for $\tilde{I}_{\text{local}}(t)$ and $\tilde{I}_{\text{imported}}(t)$. Let $\alpha_0$ denote the probability that an imported case is misidentified as local, and $\alpha_1$ the probability that a local case is misidentified as imported. Then, a simple model is

$$\left.\begin{aligned}
\tilde{I}_{\text{local}}(t)\mid I_{\text{local}}(t),I_{\text{imported}}(t),\alpha_0,\alpha_1 &\sim \text{Bin}(I_{\text{local}}(t),1-\alpha_1)+\text{Bin}(I_{\text{imported}}(t),\alpha_0)\\
\text{and}\quad \tilde{I}_{\text{imported}}(t) &= I_{\text{local}}(t)+I_{\text{imported}}(t)-\tilde{I}_{\text{local}}(t).
\end{aligned}\right\} \tag{2.6}$$

Under independence, the first relationship in (2.6) follows directly from the definitions of $\alpha_0$ and $\alpha_1$. The second equation in (2.6) holds because the total number of cases reported at time $t$ is not affected by the misidentification.
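As an illustration of (2.6), the following Python sketch draws noisy counts from given true counts. It is a minimal sketch using only the standard library; the function names, the seed and the example counts are assumptions made for illustration, and the binomial draw is implemented as a sum of Bernoulli trials rather than with any particular statistics package.

```python
import random


def binomial(n, p, rng):
    """Draw from Bin(n, p) as a sum of n independent Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def add_identification_error(i_local, i_imported, alpha0, alpha1, rng):
    """Apply model (2.6) to one day's true local and imported counts.

    alpha0: probability that an imported case is labelled local.
    alpha1: probability that a local case is labelled imported.
    """
    noisy_local = (binomial(i_local, 1.0 - alpha1, rng)
                   + binomial(i_imported, alpha0, rng))
    # The daily total is unchanged by misidentification.
    noisy_imported = i_local + i_imported - noisy_local
    return noisy_local, noisy_imported


# Hypothetical true daily counts (not from the paper's data).
rng = random.Random(1)
true_local = [5, 8, 12, 20]
true_imported = [10, 9, 7, 4]
noisy = [add_identification_error(l, m, 0.1, 0.3, rng)
         for l, m in zip(true_local, true_imported)]
print(noisy)
```

Whatever values are drawn, each noisy pair sums to the same daily total as the corresponding true pair, which is the content of the second line of (2.6).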

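By (2.6), the relationship between $\tilde{\mu}_{\text{local}}(t)$ and $\mu_{\text{local}}(t)$ is

$$\tilde{\mu}_{\text{local}}(t)=(1-\alpha_1)\,\mu_{\text{local}}(t)+\alpha_0\,\mu_{\text{imported}}(t), \tag{2.7}$$

where $\mu_{\text{imported}}(t)=E(I_{\text{imported}}(t))$. Direct computation yields

$$\tilde{R}_{\text{local}}(t)=\left(1-\alpha_1+\alpha_0\,\frac{\mu_{\text{imported}}(t)}{\mu_{\text{local}}(t)}\right)R_{\text{local}}(t) \tag{2.8}$$

when $\mu_{\text{local}}(t)\neq 0$. From (2.8), we can see that the bias of $\tilde{R}_{\text{local}}(t)$ depends on $\alpha_0$, $\alpha_1$ and the ratio of $\mu_{\text{imported}}(t)$ to $\mu_{\text{local}}(t)$. We will overestimate $R_{\text{local}}(t)$ if $\alpha_1/\alpha_0<\mu_{\text{imported}}(t)/\mu_{\text{local}}(t)$ and underestimate $R_{\text{local}}(t)$ if $\alpha_1/\alpha_0>\mu_{\text{imported}}(t)/\mu_{\text{local}}(t)$. The ratio of $\tilde{R}_{\text{local}}(t)$ to $R_{\text{local}}(t)$ is shown below.

$$\frac{\tilde{R}_{\text{local}}(t)}{R_{\text{local}}(t)}=1-\alpha_1+\alpha_0\,\frac{\mu_{\text{imported}}(t)}{\mu_{\text{local}}(t)}. \tag{2.9}$$

We can see that the ratio increases when $\alpha_0$ and $\mu_{\text{imported}}(t)/\mu_{\text{local}}(t)$ increase, and decreases when $\alpha_1$ increases. The absolute difference between $\tilde{R}_{\text{local}}(t)$ and $R_{\text{local}}(t)$ is as follows:

$$\left|\tilde{R}_{\text{local}}(t)-R_{\text{local}}(t)\right|=\left|-\alpha_1+\alpha_0\,\frac{\mu_{\text{imported}}(t)}{\mu_{\text{local}}(t)}\right|R_{\text{local}}(t). \tag{2.10}$$

This absolute difference is proportional to $R_{\text{local}}(t)$ and to the absolute difference between $\alpha_1$ and $\alpha_0\,\mu_{\text{imported}}(t)/\mu_{\text{local}}(t)$.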
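To make the direction of the bias concrete, the ratio in (2.9) can be evaluated for a few settings. The sketch below is illustrative only: the function name and the mean counts (20 imported, 10 local) are assumptions, while the error rates 0.1 and 0.3 mirror the values considered later in the simulation study.

```python
def noisy_to_true_ratio(alpha0, alpha1, mu_imported, mu_local):
    """Ratio of the noisy to the true local reproduction number, as in (2.9)."""
    if mu_local == 0:
        raise ValueError("the ratio is only defined when mu_local is non-zero")
    return 1.0 - alpha1 + alpha0 * mu_imported / mu_local


# Hypothetical mean counts: twice as many imported as local cases.
mu_imported, mu_local = 20.0, 10.0

# Small alpha0, large alpha1: here alpha1/alpha0 = 3 > 2, so the ratio is below one.
under = noisy_to_true_ratio(0.1, 0.3, mu_imported, mu_local)

# Large alpha0, small alpha1: here alpha1/alpha0 = 1/3 < 2, so the ratio is above one.
over = noisy_to_true_ratio(0.3, 0.1, mu_imported, mu_local)

print(under, over)  # approximately 0.9 and 1.5
```

In the first setting the noisy quantity is about 10% too small; in the second it is about 50% too large, in line with the over- and underestimation conditions stated after (2.8).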

(c) Bayesian hierarchical modelling to account for misidentification

We propose a Bayesian framework to estimate Rlocal(t) using noisy surveillance data. Figure 1 summarizes the general idea.

Figure 1.

Schematic of our method to account for misidentification. Note that we do not back-calculate the infection counts from the estimated diagnosed counts Ilocal(t) and Iimported(t) in this paper.

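The model for the data $\tilde{I}_{\text{local}}(t)$ and $\tilde{I}_{\text{imported}}(t)$ is defined in (2.6). Following [7,8,17], we specify

$$I_{\text{local}}(t)\mid R_{\text{local}}(t),n(t-1),w(s)\sim\text{Pois}\big(R_{\text{local}}(t)\,\Lambda(t)\big),\quad\text{for } t>0, \tag{2.11}$$

where $\Lambda(t)=\sum_{s=1}^{t}w(s)\,I(t-s)$ is the total infectiousness of infected individuals at time $t$, and $n(t-1)$ represents the historical data up to time $t-1$ (i.e. $I_{\text{local}}(0),I_{\text{imported}}(0),\ldots,I_{\text{local}}(t-1),I_{\text{imported}}(t-1)$). Note that $\Lambda(t)$ is undefined for $t=0$. So, we assume that

$$I_{\text{local}}(0)\mid\mu_{\text{local}}(0)\sim\text{Pois}\big(\mu_{\text{local}}(0)\big). \tag{2.12}$$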
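The total infectiousness Λ(t) that enters (2.11) is a weighted sum of past incidence. A minimal Python sketch follows; the function name, the three-day serial interval weights and the incidence values are hypothetical, and w(s) is treated as zero beyond the last supplied lag, which is an assumption of this sketch.

```python
def total_infectiousness(incidence, w, t):
    """Lambda(t) = sum_{s=1}^{t} w(s) * I(t - s).

    incidence[k] holds the total count I(k); w[s - 1] holds the serial
    interval weight w(s) for lag s = 1, 2, ..., len(w).
    """
    if t < 1:
        raise ValueError("Lambda(t) is undefined for t = 0")
    max_lag = min(t, len(w))  # lags beyond len(w) contribute nothing here
    return sum(w[s - 1] * incidence[t - s] for s in range(1, max_lag + 1))


# Hypothetical total daily counts I(0), ..., I(3) and serial interval weights.
incidence = [3, 4, 6, 8]
w = [0.5, 0.3, 0.2]

lam_1 = total_infectiousness(incidence, w, 1)  # 0.5 * 3
lam_3 = total_infectiousness(incidence, w, 3)  # 0.5 * 6 + 0.3 * 4 + 0.2 * 3

# Under (2.11), the Poisson mean for local cases at t = 3 is R_local(3) * Lambda(3).
r_local_3 = 1.25  # hypothetical value
poisson_mean_3 = r_local_3 * lam_3
print(lam_1, lam_3, poisson_mean_3)
```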

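We also assume that the imported case counts follow a Poisson distribution:

$$I_{\text{imported}}(t)\mid\mu_{\text{imported}}(t)\sim\text{Pois}\big(\mu_{\text{imported}}(t)\big). \tag{2.13}$$

Next, we define relevant prior distributions. We assume a distribution for $R_{\text{local}}(t)$ of the form

$$R_{\text{local}}(t)\mid n(t-1),w(s)\sim\text{Gamma}\big(a^{\text{local}}_{t|t-1},b^{\text{local}}_{t|t-1}\big),\quad\text{for } t>0. \tag{2.14}$$

This choice is similar to that in [7], but differs in that we specify the gamma distribution conditionally on the history, rather than marginally. The conditioning reflects the expectation that the evolution of $R_{\text{local}}(t)$ is likely to depend on the course of infection in the population and on any intervention measures that may result. One can set $a^{\text{local}}_{t|t-1}$ and $b^{\text{local}}_{t|t-1}$ based on the historical surveillance data, e.g. $a^{\text{local}}_{t|t-1}=\tilde{I}_{\text{local}}(t-1)$ and $b^{\text{local}}_{t|t-1}=\Lambda(t-1)$. Analogously, we also assume gamma distributed priors for $\mu_{\text{imported}}(t)$ and $\mu_{\text{local}}(0)$, that is,

$$\left.\begin{aligned}
\mu_{\text{imported}}(t)&\sim\text{Gamma}\big(a^{\text{imported}}_{t},b^{\text{imported}}_{t}\big)\\
\text{and}\quad\mu_{\text{local}}(0)&\sim\text{Gamma}\big(a^{\text{local}}_{0},b^{\text{local}}_{0}\big).
\end{aligned}\right\} \tag{2.15}$$

In addition, we assign beta distributed priors to the misidentification rates:

$$\left.\begin{aligned}
\alpha_0&\sim\text{Beta}(\zeta_{\alpha_0},\xi_{\alpha_0})\\
\text{and}\quad\alpha_1&\sim\text{Beta}(\zeta_{\alpha_1},\xi_{\alpha_1}).
\end{aligned}\right\} \tag{2.16}$$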

By using MCMC simulation, we can obtain both estimates of $R_{\text{local}}(t)$ and their uncertainty. We implement MCMC using the R package NIMBLE [19–21] with the default assignment of sampler algorithms. The samplers assigned to the variables are as follows: Gibbs samplers are assigned to $\mu_{\text{local}}(0)$ and $\mu_{\text{imported}}(t)$, $t\geq 0$, which have conjugate relationships between their prior distributions and the distributions of their stochastic dependents; slice samplers [22] are used for $I_{\text{local}}(t)$ and $I_{\text{imported}}(t)$, $t\geq 0$; Metropolis–Hastings adaptive random-walk samplers are assigned to $\alpha_0$, $\alpha_1$ and $R_{\text{local}}(t)$, $t>0$.

(d) Setting hyperparameters and initial values in Markov chain Monte Carlo

Without any information on the misidentification rates, it is difficult to get an accurate estimator of Rlocal(t). However, contact tracing data could provide adequate information to estimate the misidentification rates. Here we use contact tracing data to set informative priors on α0 and α1, and initial values of Ilocal(t) and Iimported(t).

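Let $p_i$ be the probability that we think individual $i$ is a local case based on the survey. Then, $p_i$ can be modelled as a mixture of $\alpha_0$ and $1-\alpha_1$. Note that $\alpha_1\sim\text{Beta}(\zeta_{\alpha_1},\xi_{\alpha_1})$ implies $1-\alpha_1\sim\text{Beta}(\xi_{\alpha_1},\zeta_{\alpha_1})$. See the appendix for the proof of this property of the beta distribution. We thus model the distribution of $p_i$ as a mixture of two beta distributions:

$$p_i\sim\pi_0\,\text{Beta}(\zeta_{\alpha_0},\xi_{\alpha_0})+(1-\pi_0)\,\text{Beta}(\xi_{\alpha_1},\zeta_{\alpha_1}), \tag{2.17}$$

where $\pi_0$ can be interpreted as the fraction of the diagnosed cases that are imported. By using an expectation–maximization algorithm, we can obtain estimators $\hat{\zeta}_{\alpha_0},\hat{\xi}_{\alpha_0},\hat{\zeta}_{\alpha_1}$ and $\hat{\xi}_{\alpha_1}$. We set $\alpha_0\sim\text{Beta}(\hat{\zeta}_{\alpha_0},\hat{\xi}_{\alpha_0})$ and $\alpha_1\sim\text{Beta}(\hat{\zeta}_{\alpha_1},\hat{\xi}_{\alpha_1})$ in the MCMC simulation.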

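Note that, if $1-\zeta_{\alpha_0}/(\zeta_{\alpha_0}+\xi_{\alpha_0})-\zeta_{\alpha_1}/(\zeta_{\alpha_1}+\xi_{\alpha_1})\neq 0$, we obtain unbiased estimators of $I_{\text{local}}(t)$ and $I_{\text{imported}}(t)$

$$\left.\begin{aligned}
\hat{I}_{\text{local}}(t)&=\frac{(1-\mu_{\alpha_0})\,\tilde{I}_{\text{local}}(t)-\mu_{\alpha_0}\,\tilde{I}_{\text{imported}}(t)}{1-\mu_{\alpha_0}-\mu_{\alpha_1}}\\
\text{and}\quad\hat{I}_{\text{imported}}(t)&=\frac{(1-\mu_{\alpha_1})\,\tilde{I}_{\text{imported}}(t)-\mu_{\alpha_1}\,\tilde{I}_{\text{local}}(t)}{1-\mu_{\alpha_0}-\mu_{\alpha_1}},
\end{aligned}\right\} \tag{2.18}$$

where $\mu_{\alpha_0}=\zeta_{\alpha_0}/(\zeta_{\alpha_0}+\xi_{\alpha_0})$ and $\mu_{\alpha_1}=\zeta_{\alpha_1}/(\zeta_{\alpha_1}+\xi_{\alpha_1})$. Thus, we set initial values of $I_{\text{local}}(t)$ and $I_{\text{imported}}(t)$ in the MCMC based on (2.18) and the estimators $\hat{\zeta}_{\alpha_0},\hat{\xi}_{\alpha_0},\hat{\zeta}_{\alpha_1}$ and $\hat{\xi}_{\alpha_1}$. To be specific, the initial values of $I_{\text{local}}(t)$ and $I_{\text{imported}}(t)$ are given by

$$\left.\begin{aligned}
I^{\text{initial}}_{\text{local}}(t)&=\max\left(0,\min\left(I(t),\left[\frac{(1-\hat{\mu}_{\alpha_0})\,\tilde{I}_{\text{local}}(t)-\hat{\mu}_{\alpha_0}\,\tilde{I}_{\text{imported}}(t)}{1-\hat{\mu}_{\alpha_0}-\hat{\mu}_{\alpha_1}}\right]\right)\right)\\
\text{and}\quad I^{\text{initial}}_{\text{imported}}(t)&=I(t)-I^{\text{initial}}_{\text{local}}(t),
\end{aligned}\right\} \tag{2.19}$$

where $\hat{\mu}_{\alpha_0}=\hat{\zeta}_{\alpha_0}/(\hat{\zeta}_{\alpha_0}+\hat{\xi}_{\alpha_0})$, $\hat{\mu}_{\alpha_1}=\hat{\zeta}_{\alpha_1}/(\hat{\zeta}_{\alpha_1}+\hat{\xi}_{\alpha_1})$, and $[\cdot]$ denotes the nearest integer.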
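The initial values in (2.19) amount to inverting the mean misclassification, rounding, and clamping to the feasible range. The following Python sketch illustrates this for one day; the function name and the example counts are hypothetical, the total I(t) is taken as the sum of the two noisy counts (as implied by (2.6)), and Python's built-in `round` is used for the nearest integer, which resolves exact ties towards the even integer.

```python
def initial_counts(noisy_local, noisy_imported, mean_alpha0, mean_alpha1):
    """Initial values of the true local and imported counts, following (2.19)."""
    denom = 1.0 - mean_alpha0 - mean_alpha1
    if denom == 0:
        raise ValueError("(2.18) requires 1 - mean_alpha0 - mean_alpha1 != 0")
    total = noisy_local + noisy_imported  # I(t) is unaffected by misidentification
    # Unbiased estimate of the true local count from (2.18).
    est_local = ((1.0 - mean_alpha0) * noisy_local
                 - mean_alpha0 * noisy_imported) / denom
    # Round to the nearest integer and clamp to [0, I(t)].
    init_local = max(0, min(total, round(est_local)))
    return init_local, total - init_local


# Hypothetical noisy counts with mean error rates 0.1 and 0.3.
print(initial_counts(10, 20, 0.1, 0.3))  # estimate near 11.67, rounded to 12
print(initial_counts(1, 30, 0.1, 0.3))   # negative estimate, clamped to 0
```

The clamping matters when few cases are labelled local: the unbiased estimate in (2.18) can be negative, which is not a valid starting value for a count.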

We choose the priors $(a^{\text{local}}_{t|t-1},b^{\text{local}}_{t|t-1})=(1,1)$ for $R_{\text{local}}(t)$, $\mu_{\text{imported}}(t)\sim\text{Gamma}(1,1)$ and $\mu_{\text{local}}(0)\sim\text{Gamma}(1,1)$, which are fairly uninformative.

3. Results

In this section, we conducted some simulations to illustrate the performance of the proposed estimation methods. And we applied our method to two real datasets. One is surveillance data of COVID-19 cases in Hong Kong that includes contact tracing information, including travel history data [23]. They collected information on 1038 SARS-CoV-2 cases confirmed between 23 January and 28 April 2020. And they identified 355 local cases and 683 imported cases. The other dataset is from the COVID-19 pandemic in Victoria, Australia, studied in [24]. There they had 1333 laboratory-confirmed cases of COVID-19 between 6 January and 14 April 2020. After excluding duplicate patients from cases, they identified 345 local cases and 558 imported cases.

We considered two settings, a simulation setting and an application setting. In the simulation setting, we first used surveillance data from Hong Kong and Victoria to create realistic simulated data. Then we added identification errors to the ‘true’ local and imported cases derived from the simulated epidemics. Finally, we estimated the local time-varying reproduction number using the noisy local and imported case counts. In the application setting, we assumed that the identified local and imported cases in the real datasets contained some error. The former results allow us to understand what properties can be expected of our estimators, while the latter are reflective of what would be observed in practice with such data.

(a) Simulation study

In this simulation study, we used Covasim [25], a stochastic individual-based model for transmission of SARS-CoV-2, calibrated to the epidemics in Hong Kong and Victoria. In Covasim, a susceptible–exposed–infectious–removed model dictates the progression of disease for individuals, and contact networks determine interactions between individuals that can cause infection. Covasim supports an extensive set of interventions, including both non-pharmaceutical interventions and pharmaceutical interventions. In the calibration, we set network connectivity and intervention strategies such that the simulated data are close to the epidemics in Hong Kong and Victoria. The details of parameter values we used are available at https://github.com/KolaczykResearch/EstimLocalRt.

Figure 2 shows the average daily local and imported diagnosed counts over 1000 trials. The noisy $\tilde{I}_{\text{local}}(t)$ and $\tilde{I}_{\text{imported}}(t)$ were generated according to (2.6). We set $\alpha_0\sim 0.1$ (beta distributed with mean 0.1), and $\alpha_1\sim 0.3$, $0.4$ or $0.5$ to see the effect of small $\alpha_0$ and large $\alpha_1$. This might happen if the definition of imported cases relies on travel history collected in the case investigation and some people are infected locally, even though they have a travel history within 14 days prior to symptom onset. We also considered $\alpha_1\sim 0.1$, and $\alpha_0\sim 0.3$, $0.4$ or $0.5$ (corresponding to small $\alpha_1$ and large $\alpha_0$, which might occur if cases are defined as local when we are not sure about their source of infection). We assumed that both $\alpha_0$ and $\alpha_1$ are unknown.

Figure 2.

The means of daily local and imported diagnosed counts in 1000 simulation trials for epidemics in Hong Kong and Victoria.

We evaluated the estimates of $R_{\text{local}}(t)$ in terms of the corresponding posterior means and 95% credible intervals. Figures 3 and 4 show the simulation results, in which we ran MCMC chains of 10 000 samples for each of 1000 simulated epidemic trials. The number of burn-in samples was 1000, and we used trace and autocorrelation plots to evaluate the samples. In each trial, we computed the posterior mean and 95% credible interval of the estimated local time-varying reproduction number at each time point. We then took the average over the 1000 trials to obtain the curves and error bands in figures 3 and 4. Figure 3 assumes that we are more likely to misclassify local cases as imported cases and figure 4 assumes that we are more likely to misclassify imported cases as local cases. Estimates of $R_{\text{local}}(t)$ are not shown in the left part of the right panel because there are few diagnosed counts there and the data are not sufficiently informative. The red curve represents the results obtained from our Bayesian model. For comparison purposes, we computed the infection-based $R_{\text{local}}(t)$ defined in (2.1) (blue curve) and the diagnosis-based $R_{\text{local}}(t)$ defined in (2.2) (purple curve) by approximating the infection-based $\mu_{\text{local}}(t)$, $\mu(t)$ and $g(s)$, and the diagnosis-based $\mu_{\text{local}}(t)$, $\mu(t)$ and $w(s)$, using 1000 simulation trials. We also calculated the widely used estimator of $\tilde{R}_{\text{local}}(t)$ defined in (2.3) (green curve), which is implemented in the R package EpiEstim [26]. We chose the weekly sliding window (the default setting in EpiEstim), so the green curve has a narrower credible interval than the red curve. We view it as a representative estimator that does not account for misidentification, i.e. it treats the noisy local and imported cases as true. Note that the blue curve, based on (2.1), is temporally accurate. However, we used the lagged case observations and the serial interval in our Bayesian framework and in EpiEstim. Thus, the diagnosis-based $R_{\text{local}}(t)$ of (2.2) (purple curve) is what we could estimate accurately using our Bayesian model.
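The per-trial summaries described above (posterior mean and 95% credible interval after burn-in) can be sketched in a few lines of Python. This is an illustrative sketch only: `summarize_chain` is a hypothetical helper, the chain below is a stand-in made of independent gamma draws rather than real MCMC output, and the equal-tailed interval with linear interpolation between order statistics is one common convention, assumed here.

```python
import random


def summarize_chain(samples, burn_in):
    """Posterior mean and equal-tailed 95% interval after discarding burn-in."""
    kept = sorted(samples[burn_in:])
    n = len(kept)
    if n == 0:
        raise ValueError("no samples left after burn-in")

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return kept[lo] + (pos - lo) * (kept[hi] - kept[lo])

    return sum(kept) / n, quantile(0.025), quantile(0.975)


# Stand-in for one chain of 10 000 samples of R_local(t) at a fixed t
# (hypothetical gamma draws, not output from the model in this paper).
rng = random.Random(0)
chain = [rng.gammavariate(4.0, 0.25) for _ in range(10_000)]
post_mean, lower, upper = summarize_chain(chain, burn_in=1_000)
print(post_mean, lower, upper)
```

Averaging `post_mean`, `lower` and `upper` over trials at each time point would give curves and bands of the kind described for figures 3 and 4.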

Figure 3.

Estimations of local time-varying reproduction numbers in simulated epidemics for Hong Kong and Victoria under three sets of misidentification rates: $\alpha_0\sim 0.1$, and $\alpha_1\sim 0.3$, $0.4$ or $0.5$. The error bands are the averages of 95% credible intervals over 1000 trials at each time point.

Figure 4.

Estimations of local time-varying reproduction numbers in simulated epidemics for Hong Kong and Victoria under three sets of misidentification rates: $\alpha_1\sim 0.1$, and $\alpha_0\sim 0.3$, $0.4$ or $0.5$. The error bands are the averages of 95% credible intervals over 1000 trials at each time point.

Recall that the mean of unlagged infection counts was used in the blue curve (the infection-based Rlocal(t) in (2.1)) and the mean of lagged diagnosed case counts was used in the purple curve (the diagnosis-based Rlocal(t) in (2.2)). When an intervention strategy such as a shutdown is adopted (e.g. in the middle of March in the simulated epidemic in Hong Kong), the infection counts decrease sharply at the same time, but the diagnosed case counts decrease smoothly with some time lag if we do not test all people every day. This is why we see sharp decreases in the blue curve and smooth decreases in the purple curve.

In the simulated epidemics for both Hong Kong and Victoria, if we ignore the misidentification, we will underestimate Rlocal(t) when the mean of α0 is small and the mean of α1 is relatively large (figure 3), and overestimate Rlocal(t) when the mean of α1 is small and the mean of α0 is relatively large (figure 4), with the biases growing as the mean of the larger misidentification rate increases (α1 in figure 3, α0 in figure 4). These results are consistent with (2.8). Such biases can lead to an inappropriate public health response, i.e. inadequate interventions or overreaction. We corrected the bias using our Bayesian hierarchical framework. The biases of our estimators are close to zero in all cases. The 95% credible intervals of our estimators are wide in the first two months because the number of incident cases is very low. For the last month or so, when the diagnosed counts are relatively high, the 95% credible intervals are narrow.

(b) Application

We applied our proposed methods to surveillance data of COVID-19 cases in Hong Kong and Victoria. Figure 5a,b shows the daily local and imported case counts in Hong Kong and Victoria. For the Hong Kong data, Adam et al. [23] modelled the serial intervals using a gamma distribution and estimated shape and rate parameters of 2.23 and 0.37, respectively (corresponding to a mean of around 6 days and a standard deviation of around 4 days). No serial interval has been estimated specifically for Victoria. Considering that the epidemic curve in Victoria is relatively similar to that in Hong Kong, we used the same serial interval distribution when we estimated Rlocal(t) in Victoria.
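The stated mean and standard deviation follow from the standard moments of a gamma distribution in the shape and rate parametrization, namely mean = shape/rate and standard deviation = sqrt(shape)/rate. A short Python check is given below; the function name is hypothetical.

```python
import math


def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a Gamma(shape, rate) distribution."""
    if shape <= 0 or rate <= 0:
        raise ValueError("shape and rate must be positive")
    return shape / rate, math.sqrt(shape) / rate


# Serial interval parameters reported for the Hong Kong data.
si_mean, si_sd = gamma_mean_sd(2.23, 0.37)
print(si_mean, si_sd)  # roughly 6.0 and 4.0 days
```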

Figure 5.

Epidemic curves of COVID-19 cases and estimations of local time-varying reproduction numbers in Hong Kong and Victoria. (a) The epidemic curve of daily cases of laboratory-confirmed SARS-CoV-2 infection in Hong Kong by symptom onset date and coloured by case category. Asymptomatic cases are included here by date of confirmation. (b) The epidemic curve of the coronavirus disease cases in Victoria by sample collection date and coloured by case category. (c,d) Estimations of local time-varying reproduction numbers under three assumed scenarios: (1) no identification error, (2) $\alpha_0\sim 0.1$ and $\alpha_1\sim 0.3$ (around 10% of imported cases are misclassified as local and around 30% of local cases are misclassified as imported), (3) $\alpha_0\sim 0.3$ and $\alpha_1\sim 0.1$ (around 30% of imported cases are misclassified as local and around 10% of local cases are misclassified as imported). The bands are the 95% credible intervals.

Since we did not have access to the contact tracing survey data mentioned in §2d to infer the misidentification rates, we investigated a range of plausible values. Figure 5c,d shows estimates for Rlocal(t) under three assumed scenarios: (1) no identification error, (2) small α0 and large α1, (3) small α1 and large α0. We ran MCMC chains of 10 000 samples and the error bands are the 95% credible intervals. We can see that the estimated local time-varying reproduction numbers are quite different when the two identification error rates are about 10% and 30%. If we think we are more likely to misclassify local cases as imported, then we should trust the curve corresponding to scenario (2). If imported cases are more likely to be misidentified as local, then the curve corresponding to scenario (3) is reliable. And if we believe the identification error is close to zero, we should trust the estimate under scenario (1). For example, in late March, the estimated local time-varying reproduction numbers and 95% credible intervals are below one under scenario (1), but are near or above one under scenario (2). The differences can lead to different public health policies.

Ultimately, we see that the ability to account for identification error appropriately in reporting the local time-varying reproduction number can lead to substantially different conclusions than use of the original, noisy local time-varying reproduction number. These differences can then in turn be translated to decision making for public health response.

4. Discussion

We have developed a general framework for estimation of the true local time-varying reproduction numbers in contexts wherein one has identified local and imported case counts with some error. Simulations demonstrate that substantial inferential accuracy by our estimators is possible when non-trivial error is present. And our application to epidemics in Hong Kong and Victoria shows that the gains offered by our approach over presenting the noisy local instantaneous reproduction number can be pronounced.

We have shown examples on a state/province level, but our method could be useful for cities, or more local settings, such as a university trying to determine if there is substantial local transmission occurring. Our approach requires daily numbers of local and imported cases, serial interval and contact tracing data or other data to provide adequate information to estimate the misidentification rates.

We have pursued a Bayesian approach to the problem of estimating the local instantaneous reproduction number. The credible intervals are relatively wide when the number of cases is low. To improve the performance at low case incidence, Kalman filtering is a natural approach. Estimating the time-varying reproduction number by Kalman filtering is an emerging topic. For instance, Parag [27] constructed a recursive Bayesian smoother for estimating the effective reproduction number from the incidence of an infectious disease in real time and retrospectively. However, one typically does not distinguish between local and imported cases in this setting.

The identification errors are informed by contact tracing survey data in our approach. If the data from the survey are categorical (e.g. we ask people where they were infected and attach some qualitative measure of our confidence that we think they are local cases), we can transform them into numerical values. For example, Patki et al. [28] proposed a method that converts categorical variables to numerical data for a Gaussian distribution. We could modify the method to convert categorical variables to Beta distributed data. If the survey data are unavailable, using genomic data is a natural alternative. Genomic surveillance has been used to detect transmission clusters and to provide information on the possible source of individual cases [29–34].

We assume the identification errors are constant over time in our model. One future direction is relaxing this assumption. The identification errors may vary over time as the quality of surveillance data may not be the same. And the errors may depend on the incidence of local and imported cases. If there are few imported cases, an imported case might be likely to be incorrectly classified as local but a local case will be less likely to be incorrectly classified as imported.

We have shown the results of retrospective estimation. And it is computationally feasible to run MCMC on each day to obtain real-time estimators; it takes about 5 minutes for the MCMC chain of 10 000 samples.

In the simulation study, we reported the mean of posterior means of estimated local time-varying reproduction numbers over 1000 trials. To see whether there is much variation in estimated values between simulations, we computed the standard deviation of posterior means from 1000 simulated epidemic trials at each time point. For the simulated epidemic in Hong Kong, the average of standard deviations (over time) ranges from 0.37 to 0.43 in the six misidentification error scenarios shown in figures 3 and 4. For the simulated epidemic in Victoria, the average of standard deviations (over time) ranges from 0.28 to 0.38 in the six misidentification error scenarios shown in figures 3 and 4.

We assume the serial interval for Victoria is the same as that in Hong Kong. There is variability in the serial interval among countries. Ali et al. [35] summarized 129 estimates of serial intervals reported for COVID-19, with means or medians ranging from 1.0 to 9.9. Also, serial interval observations for COVID-19 could be negative [36]. Exploring the robustness of our model to the serial interval could be a potential future direction.

The use of the lagged case observations and the serial interval can lead to temporal inaccuracies in the estimation of local time-varying reproduction numbers, which can hinder inference about the impact of changes in behaviour and policies on the local transmission. The best practice is to back-calculate unlagged infection counts from lagged case observations [37]. Thus, to improve the accuracy of the estimation of local time-varying reproduction numbers, we can first back-calculate the unlagged infection counts using the noisy surveillance data and then run the MCMC with those unlagged counts.

If contact tracing datasets contain cases with unknown classification as local or imported, we could use the information from other data (e.g. genomic data) to impute these cases. If no other information is available, we could randomly classify these cases as local or imported.

As shown in the simulation study, ignoring misclassification of local or imported cases can lead to substantially inaccurate estimation of local time-varying reproduction numbers. In our data application, the misidentification rates are relatively small and thus the incorrect classification of local or imported cases does not have a big impact on the estimation of local time-varying reproduction numbers. However, there may be other real-world examples where our modelling framework becomes important.

While this paper was awaiting review, we became aware of related work by Tsang et al. [38]. Those authors developed a Bayesian framework to estimate the local time-varying reproductive number, accounting for unlinked local cases and potentially different infectiousness among local and imported cases. One of the main differences between their work and ours is that they assumed misspecification of the source of infection for local cases, but perfect classification of cases as local or imported (i.e. $\alpha_0=\alpha_1=0$).

Appendix A

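If $X \sim \mathrm{Beta}(\alpha,\beta)$, then $1-X \sim \mathrm{Beta}(\beta,\alpha)$.

Proof.

Define $g(x)=1-x$ and let $Y=g(X)=1-X$. Applying theorem 2.1.5 (Distributions of functions of a random variable) in [39], for $y\in(0,1)$, we get

$$f_Y(y)=f_X\big(g^{-1}(y)\big)\left|\frac{\mathrm{d}}{\mathrm{d}y}g^{-1}(y)\right|=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}(1-y)^{\alpha-1}y^{\beta-1},$$

which is the probability density function of $\mathrm{Beta}(\beta,\alpha)$. ▪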
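The result can be checked numerically with a few lines of code. The sketch below is only a sanity check under chosen example parameters, not part of the proof. It evaluates the change-of-variables expression for the density of 1 − X and compares it with the Beta density with swapped parameters; the function names and the parameter values 2 and 5 are arbitrary choices for illustration.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x in (0, 1)."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)


def density_of_one_minus_x(y, a, b):
    """Density of Y = 1 - X for X ~ Beta(a, b), by change of variables."""
    g_inverse = 1.0 - y  # inverse of g(x) = 1 - x
    jacobian = 1.0       # absolute derivative of the inverse, |-1| = 1
    return beta_pdf(g_inverse, a, b) * jacobian


# Compare with the Beta density with swapped parameters at a few points.
alpha, beta = 2.0, 5.0
for y in (0.1, 0.5, 0.9):
    lhs = density_of_one_minus_x(y, alpha, beta)
    rhs = beta_pdf(y, beta, alpha)
    print(y, math.isclose(lhs, rhs, rel_tol=1e-9))
```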

Data accessibility

No primary data are used in this paper. Secondary data sources are taken from [23,24]. These data and the code necessary to reproduce the results in this paper are available at https://github.com/KolaczykResearch/EstimLocalRt.

Authors' contributions

W.L.: data curation, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft, writing—review and editing; K.B.: data curation, investigation, software, writing—review and editing; B.G.: data curation, investigation, software, writing—review and editing; L.F.W.: conceptualization, funding acquisition, methodology, project administration, supervision, writing—review and editing; E.D.K.: conceptualization, funding acquisition, methodology, project administration, supervision, writing—review and editing.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

This work was supported in part by Army Research Office award W911NF1810237. This work was also supported by the National Institutes of Health, R01 GM122878 and R35 GM141821. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  • 1. You C et al. 2020. Estimation of the time-varying reproduction number of COVID-19 outbreak in China. Int. J. Hyg. Environ. Health 228, 113555. (doi:10.1016/j.ijheh.2020.113555)
  • 2. Li Y, Campbell H, Kulkarni D, Harpur A, Nundy M, Wang X, Nair H, Usher Network for COVID. 2020. The temporal association of introducing and lifting non-pharmaceutical interventions with the time-varying reproduction number (R) of SARS-CoV-2: a modelling study across 131 countries. Lancet Infect. Dis. 21, 193-202. (doi:10.1016/S1473-3099(20)30785-4)
  • 3. Rubin D et al. 2020. Association of social distancing, population density, and temperature with the instantaneous reproduction number of SARS-CoV-2 in counties across the United States. JAMA Netw. Open 3, e2016099. (doi:10.1001/jamanetworkopen.2020.16099)
  • 4. Abbott S et al. 2020. Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts. Wellcome Open Res. 5, 112. (doi:10.12688/wellcomeopenres.16006.1)
  • 5. Ackland GJ, Ackland JA, Antonioletti M, Wallace DJ. 2021. Fitting the reproduction number from UK coronavirus case data, and why it is close to 1. medRxiv. (doi:10.1101/2021.09.23.21256065)
  • 6. Panovska-Griffiths J et al. 2022. Statistical and agent-based modelling of the transmissibility of different SARS-CoV-2 variants in England and impact of different interventions. medRxiv. 2021-12. (doi:10.1101/2021.12.30.21267090)
  • 7. Thompson RN et al. 2019. Improved inference of time-varying reproduction numbers during infectious disease outbreaks. Epidemics 29, 100356. (doi:10.1016/j.epidem.2019.100356)
  • 8. Cori A, Ferguson NM, Fraser C, Cauchemez S. 2013. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am. J. Epidemiol. 178, 1505-1512. (doi:10.1093/aje/kwt133)
  • 9. Creswell R et al. 2022. Heterogeneity in the onwards transmission risk between local and imported cases affects practical estimates of the time-dependent reproduction number. Phil. Trans. R. Soc. A 380, 20210308. (doi:10.1098/rsta.2021.0308)
  • 10. Chong KC et al. 2020. Transmissibility of coronavirus disease 2019 in Chinese cities with different dynamics of imported cases. PeerJ 8, e10350. (doi:10.7717/peerj.10350)
  • 11. Arroyo Marioli F, Bullano F, Kučinskas S, Rondón-Moreno C. 2020. Tracking R of COVID-19: a new real-time estimation using the Kalman filter. Available at SSRN 3581633. (doi:10.1101/2020.04.19.20071886)
  • 12. Reich N, Lessler J, Cummings D, Brookmeyer R. 2009. Estimating incubation period distributions with coarse data. Stat. Med. 28, 2769-2784. (doi:10.1002/sim.3659)
  • 13. Ma Y, Jenkins HE, Sebastiani P, Ellner JJ, Jones-López EC, Dietze R, Horsburgh CR Jr., White LF. 2020. Using cure models to estimate the serial interval of tuberculosis with limited follow-up. Am. J. Epidemiol. 189, 1421-1426. (doi:10.1093/aje/kwaa090)
  • 14. Miller DA, Talley BL, Lips KR, Campbell Grant EH. 2012. Estimating patterns and drivers of infection prevalence and intensity when detection is imperfect and sampling error occurs. Methods Ecol. Evol. 3, 850-859. (doi:10.1111/j.2041-210X.2012.00216.x)
  • 15. McClintock BT, Nichols JD, Bailey LL, MacKenzie DI, Kendall WL, Franklin AB. 2010. Seeking a second opinion: uncertainty in disease ecology. Ecol. Lett. 13, 659-674. (doi:10.1111/j.1461-0248.2010.01472.x)
  • 16. Cui N, Chen Y, Small DS. 2013. Modeling parasite infection dynamics when there is heterogeneity and imperfect detectability. Biometrics 69, 683-692. (doi:10.1111/biom.12050)
  • 17. Fraser C. 2007. Estimating individual and household reproduction numbers in an emerging epidemic. PLoS ONE 2, e758. (doi:10.1371/journal.pone.0000758)
  • 18. Li T, White LF. 2021. Bayesian back-calculation and nowcasting for line list data during the COVID-19 pandemic. PLoS Comput. Biol. 17, e1009210. (doi:10.1371/journal.pcbi.1009210)
  • 19. de Valpine P, Turek D, Paciorek C, Anderson-Bergman C, Temple Lang D, Bodik R. 2017. Programming with models: writing statistical algorithms for general model structures with NIMBLE. J. Comput. Graph. Stat. 26, 403-413. (doi:10.1080/10618600.2016.1172487)
  • 20. de Valpine P et al. 2020. Nimble: MCMC, particle filtering, and programmable hierarchical modeling. R package version 0.10.1. See https://cran.r-project.org/package=nimble.
  • 21. de Valpine P et al. 2020. NIMBLE user manual. R package manual version 0.10.1. See https://r-nimble.org.
  • 22. Neal RM. 2003. Slice sampling. Ann. Stat. 31, 705-741. (doi:10.1214/aos/1056562461)
  • 23. Adam DC, Wu P, Wong JY, Lau EHY, Tsang TK, Cauchemez S, Leung GM, Cowling BJ. 2020. Clustering and superspreading potential of SARS-CoV-2 infections in Hong Kong. Nat. Med. 26, 1714-1719. (doi:10.1038/s41591-020-1092-0)
  • 24. Seemann T et al. 2020. Tracking the COVID-19 pandemic in Australia using genomics. Nat. Commun. 11, 1-9. (doi:10.1038/s41467-020-18314-x)
  • 25. Kerr CC et al. 2020. Covasim: an agent-based model of COVID-19 dynamics and interventions. medRxiv. (doi:10.1101/2020.05.10.20097469)
  • 26. Cori A, Kamvar ZN, Stockwin JE, Jombart T, Thompson RN, Dahlqwist E. 2020. EpiEstim. (doi:10.5281/zenodo.3685977)
  • 27. Parag KV. 2020. Improved estimation of time-varying reproduction numbers at low case incidence and between epidemic waves. medRxiv. (doi:10.1101/2020.09.14.20194589)
  • 28. Patki N, Wedge R, Veeramachaneni K. 2016. The synthetic data vault. In 2016 IEEE Int. Conf. on Data Science and Advanced Analytics (DSAA), Montreal, Canada, 17–19 October, pp. 399–410. Manhattan, NY: IEEE. (doi:10.1109/dsaa.2016.49)
  • 29. Leavitt SV, Lee RS, Sebastiani P, Horsburgh CR, Jenkins HE, White LF. 2020. Estimating the relative probability of direct transmission between infectious disease patients. Int. J. Epidemiol. 49, 764-775. (doi:10.1093/ije/dyaa031)
  • 30. Meredith LW et al. 2020. Rapid implementation of SARS-CoV-2 sequencing to investigate cases of health-care associated COVID-19: a prospective genomic surveillance study. Lancet Infect. Dis. 20, 1263-1272. (doi:10.1016/S1473-3099(20)30562-4)
  • 31. Deng X et al. 2020. Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California. Science 369, 582-587. (doi:10.1126/science.abb9263)
  • 32. Poon AFY et al. 2016. Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. Lancet HIV 3, e231-e238. (doi:10.1016/S2352-3018(16)00046-1)
  • 33. Sansone M, Andersson M, Gustavsson L, Andersson LM, Nordén R, Westin J. 2020. Extensive hospital in-ward clustering revealed by molecular characterization of influenza A virus infection. Clin. Infect. Dis. 71, e377-e383. (doi:10.1093/cid/ciaa108)
  • 34. Peters PJ et al. 2016. HIV infection linked to injection use of oxymorphone in Indiana, 2014–2015. N. Engl. J. Med. 375, 229-239. (doi:10.1056/NEJMoa1515195)
  • 35. Ali ST et al. 2021. Serial intervals and case isolation delays for COVID-19: a systematic review and meta-analysis. Clin. Infect. Dis. 74, 685-694. (doi:10.1093/cid/ciab491)
  • 36. Du Z, Xu X, Wu Y, Wang L, Cowling BJ, Meyers LA. 2020. Serial interval of COVID-19 among publicly reported confirmed cases. Emerg. Infect. Dis. 26, 1341-1343. (doi:10.3201/eid2606.200357)
  • 37. Gostic KM et al. 2020. Practical considerations for measuring the effective reproductive number, Rt. PLoS Comput. Biol. 16, e1008409. (doi:10.1371/journal.pcbi.1008409)
  • 38. Tsang TK, Wu P, Lau EH, Cowling BJ. 2021. Accounting for imported cases in estimating the time-varying reproductive number of COVID-19 in Hong Kong. J. Infect. Dis. 224, 783-787. (doi:10.1093/infdis/jiab299)
  • 39. Casella G, Berger RL. 2002. Statistical inference. Duxbury Advanced Series in Statistics and Decision Sciences. Chicago, IL: Thomson Learning. See https://books.google.com/books?id=0x_vAAAAMAAJ.

Articles from Philosophical transactions. Series A, Mathematical, physical, and engineering sciences are provided here courtesy of The Royal Society
