Abstract
Mechanistic models fit to streaming surveillance data are critical to understanding the transmission dynamics of an outbreak as it unfolds in real-time. However, transmission model parameter estimation can be imprecise, and sometimes even impossible, because surveillance data are noisy and not informative about all aspects of the mechanistic model. To partially overcome this obstacle, Bayesian models have been proposed to integrate multiple surveillance data streams. We devised a modeling framework for integrating SARS-CoV-2 diagnostics test and mortality time series data, as well as seroprevalence data from cross-sectional studies, and tested the importance of individual data streams for both inference and forecasting. Importantly, our model for incidence data accounts for changes in the total number of tests performed. We model the transmission rate, infection-to-fatality ratio, and a parameter controlling a functional relationship between the true case incidence and the fraction of positive tests as time-varying quantities and estimate changes of these parameters nonparametrically. We compare our base model against modified versions which do not use diagnostics test counts or seroprevalence data to demonstrate the utility of including these often unused data streams. We apply our Bayesian data integration method to COVID-19 surveillance data collected in Orange County, California between March 2020 and February 2021 and find that 32–72% of the Orange County residents experienced SARS-CoV-2 infection by mid-January, 2021. Despite this high number of infections, our results suggest that the abrupt end of the winter surge in January 2021 was due to both behavioral changes and a high level of accumulated natural immunity.
1. Introduction
SARS-CoV-2 is a human coronavirus associated with high morbidity and mortality that caused a pandemic in 2020 (Cummings et al., 2020; Wu and McGoogan, 2020; Song et al., 2020). Like other human coronaviruses, SARS-CoV-2 is transmitted person to person through close contact and has high transmission potential in crowded indoor settings and around activities that generate aerosols (WHO, 2021). In the early stages of the COVID-19 pandemic, transmission dynamics modeling played an important role in alerting the public about the potential dangers of unmitigated virus spread (Prem et al., 2020; Ferguson et al., 2020; Davies et al., 2020). At later pandemic stages, this modeling helped evaluate intervention effectiveness (Knock et al., 2021) and quantify transmission advantages of genetic variants (Davies et al., 2021). We develop models that integrate diagnostics test and mortality time series data with cross-sectional seroprevalence data to estimate underlying transmission dynamics and forecast future case and death counts. These models are flexible in the data sources they incorporate. By comparing the forecasting capabilities of these models, we aim to investigate which data streams should be used in future modeling efforts.
Differences in mitigation strategies, surveillance efforts, and population characteristics across countries and even across different regions within one country prompted development of regional modeling of SARS-CoV-2 transmission (Anderson et al., 2020; Jewell et al., 2021; Morozova et al., 2021; Irons and Raftery, 2021). However, neither national nor subnational/regional modelers fully integrate all surveillance data available to them, because inclusion of each additional data source leads to an increase in model complexity, which complicates statistical inference and reduces computational efficiency of this inference. In addition, including more data sources necessitates additional modeling assumptions and risks misspecification in one part of the model influencing inference in other parts of the model. Modelers were faced with many questions about which data to use, which data to ignore, and how to best integrate them into their models. Incorporating case incidence data into inference proved particularly problematic because a data generating model for cases needs to account for preferential sampling of symptomatic individuals and dependence of case counts on the number of diagnostic tests performed, which varies significantly over time and space. However, even with delayed reporting, positive diagnostic tests (cases) are among the earliest indicators of changing disease dynamics, so we hypothesize that taking advantage of this source of information could be important for producing timely forecasts and for policy decision-making. Similarly, properly incorporating seroprevalence data into models may improve the accuracy of a model’s estimates of the underlying cumulative number of infections, which, in part, drives the effective reproduction number and is crucial for forecasting. We investigate these ideas by fitting and comparing the forecasting abilities of multiple models, some of which use these data, while others do not.
We show how to fit a mechanistic model of SARS-CoV-2 spread to incidence and mortality time series, while accounting for the time-varying number of diagnostic tests performed. The mechanistic model is a standard ordinary differential equation (ODE) model that describes changes in the proportions of the population residing in model compartments. Death counts are modeled with a negative binomial distribution that allows for over-dispersion often observed in surveillance data. Our first innovation is the model for cases, where we use a flexible beta-binomial distribution, whose mean is a product of the total number of tests performed and a non-linear function of unobserved infections modeled by the ODE model. This ensures that our estimates are not unduly influenced by large fluctuations of COVID-19 diagnostic test positivity fractions. Our second innovation is nonparametric estimation of time-varying parameters that control both the transmission model and surveillance model. By allowing our model to adapt to temporal changes in transmission and surveillance, we can identify how changes in policy affected the near-term progression of the outbreak. Our third contribution is a careful assessment of the usefulness of various data streams in the context of a case study in Orange County, California, USA.
To benchmark the ability of our model to capture temporal trends in transmission dynamics, we use simulated data to compare our estimation of changes in the effective reproductive number to analogous estimates produced by epidemia, a simpler semi-parametric method (Scott et al., 2020). This comparison shows that failing to account for fluctuations in the number of diagnostic tests performed can result in misleading inferences about effective reproduction numbers, a critical quantity in infectious disease epidemiology. Our data integration approach allows our model to provide a much more detailed picture of the spread of an infectious disease beyond the effective reproduction number, including estimating the time-varying infection fatality ratio, the total number of infected individuals, and changes in testing patterns. Data integration further allows us to produce reasonable short-term forecasts of deaths and testing positivity. We demonstrate these enhanced capabilities by fitting our compartmental model to COVID-19 surveillance data collected in Orange County, California — the sixth most populous county in the United States of America (U.S.A.), with an estimated 3.2 million inhabitants as of 2019 (United States Census Bureau, 2020). We analyze Orange County surveillance data collected between March 30, 2020 and February 14, 2021, prior to the start of widespread vaccine availability. We find both basic and effective reproductive numbers varied widely during the first year of the pandemic, which is expected in light of implementation and subsequent relaxation of mitigation measures. We compare several modifications to our primary model which omit negative diagnostic tests or seroprevalence data. We demonstrate that our models produce reasonable short term (up to 4 weeks ahead) probabilistic forecasts of mortality, but that different data streams may be more or less useful during different periods of the pandemic.
2. Methods
2.1. Data
We start with time series of daily numbers of SARS-CoV-2 diagnostic tests (positive and negative), case counts (positive tests), and deaths observed over some time period of interest. We aggregate the three types of counts in weekly intervals. Figure 1 shows such a collection of aggregated time series for Orange County, CA, corresponding to the observation period spanning days between March 30, 2020 and February 14, 2021. We end our modeling period in February because vaccines became more widely available around this time, and our model does not account for vaccine-induced immunity. The data was compiled from anonymized individual test results provided by the Orange County Health Care Agency (OCHCA). We define cases as either confirmed or presumed COVID-19 diagnoses that have been officially reported to the OCHCA. We used specimen collection dates and dates of deaths to tabulate test, case, and death counts. We denote the vector of binned tests by $\mathbf{T} = (T_1, \dots, T_L)$, the vector of case counts by $\mathbf{Y} = (Y_1, \dots, Y_L)$, and the vector of deaths by $\mathbf{M} = (M_1, \dots, M_L)$, where the weeks are indexed by $l = 1, \dots, L$ and $L$ is the total number of weeks. Additionally, we use data from Bruckner et al. (2021), a study conducted to estimate the seroprevalence in Orange County from a population-representative sample, which consists of 343 seropositive cases among 2979 tests conducted between July 10 and August 16, 2020. For simplicity, we lump all seroprevalence test dates to a single time point, August 16, 2020, corresponding to week $l_s$. To formulate the surveillance model for cases $\mathbf{Y}$, deaths $\mathbf{M}$, and seropositive cases $Z_{l_s}$, we first need a model for latent trajectories of incidence and prevalence of SARS-CoV-2 infections.
Figure 1:
COVID-19 surveillance data from Orange County, CA. The figure shows weekly counts of tests, cases (positive tests), reported deaths due to COVID-19, as well as testing positivity.
2.2. Transmission model
To model latent incidence and prevalence trajectories, we divide all individuals in the population of Orange County, CA into five compartments: S = susceptible individuals, E = infected, but not yet infectious individuals, I = infectious individuals, R = recovered individuals, D = individuals who died due to COVID-19. Possible progressions of an individual through the above compartments are depicted in Figure 2. We model the time-evolution of the proportions of individuals occupying the above compartments with a set of deterministic ordinary differential equations (ODEs). For simplicity, we assume a homogeneously mixing population of fixed size N, although it is possible to relax these assumptions, and we also assume that recovery confers immunity to subsequent infection over the duration of the modeling period. Let $\mathbf{X}(t) = (S(t), E(t), I(t), R(t), D(t))$ denote the population proportion in each compartment at time $t$, and let $\mathbf{X}(t_0) = \mathbf{x}_0$ denote the population proportions at time $t_0$, the start of the modeling period. By convention, we model the population at risk, i.e., those individuals who may still move throughout the model compartments. Hence, we take $\mathbf{X}(t)$ to be proportions of the at-risk population and normalize so that $S(t) + E(t) + I(t) + R(t) + D(t) = 1$, where $t \geq t_0$. Since we want to fit this model to incidence data, it is convenient to also keep track of the cumulative proportion of the population that experiences transitions between compartments from $t_0$ to $t$, which we denote by $\mathbf{N}(t) = (N_{SE}(t), N_{EI}(t), N_{IR}(t), N_{ID}(t))$. To describe mathematically how vectors $\mathbf{X}(t)$ and $\mathbf{N}(t)$ change through time, we first define rates of transitions between compartments, with possible transitions corresponding to the arrows in Figure 2:
$$\lambda_{SE}(t) = \beta(t)\, N\, S(t)\, I(t), \quad \lambda_{EI}(t) = \gamma\, E(t), \quad \lambda_{IR}(t) = \nu\, [1 - \eta(t)]\, I(t), \quad \lambda_{ID}(t) = \nu\, \eta(t)\, I(t), \tag{1}$$
where $\beta(t)$ is the transmission rate, which varies over time, $N$ is the constant population size, $1/\gamma$ is the mean latent period duration, $1/\nu$ is the mean infectious period duration, and $\eta(t)$ is the infection-to-fatality ratio (IFR), which varies over time. The time-varying transmission rate allows our model to capture the effects of interventions and changes in human behavior, while the time-varying IFR should capture changes in age profiles of infected individuals and stress on healthcare providers during surges. We demonstrate this property in Supplementary Section S-3.
Figure 2:
Model diagram depicting possible progressions between infection states. The model compartments are as follows: susceptible (S), infected, but not yet infectious (E), infectious (I), recovered (R), and deceased (D).
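Equipped with the population-level transition rates, we define ODEs for our model:
$$\begin{aligned}
\frac{dS}{dt} &= -\lambda_{SE}, & \frac{dN_{SE}}{dt} &= \lambda_{SE}, \\
\frac{dE}{dt} &= \lambda_{SE} - \lambda_{EI}, & \frac{dN_{EI}}{dt} &= \lambda_{EI}, \\
\frac{dI}{dt} &= \lambda_{EI} - \lambda_{IR} - \lambda_{ID}, & \frac{dN_{IR}}{dt} &= \lambda_{IR}, \\
\frac{dR}{dt} &= \lambda_{IR}, & \frac{dN_{ID}}{dt} &= \lambda_{ID}, \\
\frac{dD}{dt} &= \lambda_{ID}, & &
\end{aligned} \tag{2}$$
subject to initial conditions $\mathbf{X}(t_0) = \mathbf{x}_0$ and $\mathbf{N}(t_0) = \mathbf{0}$, where $\mathbf{x}_0 = (S(t_0), E(t_0), I(t_0), R(t_0), D(t_0))$ are initial compartment proportions. We set $R(t_0) = 0$ and $D(t_0) = 0$, because these proportions do not play a role in future dynamics of the epidemic, leaving $S(t_0)$, $E(t_0)$, and $I(t_0)$ as free model parameters.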
The above equations are redundant, and typically only the prevalence ODEs in the left column are used in mathematical modeling. However, the cumulative incidence/transition representation of the model, shown by the ODEs in the right column, is useful for statistical modeling of infectious disease dynamics (Bretó and Ionides, 2011). In practice, we solve the subset of the above ODEs that are needed to track $\mathbf{X}(t)$ and the parts of $\mathbf{N}(t)$ that “connect” our transmission model to data. We proceed to make this connection in the next subsection.
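As an illustration of the prevalence and cumulative-transition formulation, the following minimal Python sketch integrates a SEIRD system of this form with a hand-written fourth-order Runge-Kutta step. It is not the authors' implementation (which uses Julia): the function names, the constant transmission rate and IFR, the initial proportions, and the numerical values are assumptions chosen only for illustration, and `beta_n` stands for the product of the transmission rate and the population size.

```python
def seird_rhs(state, beta_n, gamma, nu, eta):
    """Derivatives of (S, E, I, R, D, N_SE, N_EI, N_IR, N_ID), all proportions.

    Assumption: transmission rate and IFR are held constant here, whereas the
    model in the text lets them vary over time.
    """
    s, e, i = state[0], state[1], state[2]
    lam_se = beta_n * s * i        # susceptible -> exposed
    lam_ei = gamma * e             # exposed -> infectious
    lam_ir = nu * (1.0 - eta) * i  # infectious -> recovered
    lam_id = nu * eta * i          # infectious -> dead
    return (
        -lam_se, lam_se - lam_ei, lam_ei - lam_ir - lam_id, lam_ir, lam_id,
        lam_se, lam_ei, lam_ir, lam_id,
    )


def rk4_step(state, dt, beta_n, gamma, nu, eta):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    def shifted(k, scale):
        return tuple(x + scale * dx for x, dx in zip(state, k))

    k1 = seird_rhs(state, beta_n, gamma, nu, eta)
    k2 = seird_rhs(shifted(k1, dt / 2), beta_n, gamma, nu, eta)
    k3 = seird_rhs(shifted(k2, dt / 2), beta_n, gamma, nu, eta)
    k4 = seird_rhs(shifted(k3, dt), beta_n, gamma, nu, eta)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def solve_weekly(x0, n_weeks, beta_n, gamma, nu, eta, steps_per_week=50):
    """Return the state at the end of each week; time is measured in weeks."""
    state = tuple(x0) + (0.0, 0.0, 0.0, 0.0)  # cumulative transitions start at 0
    weekly_states = [state]
    dt = 1.0 / steps_per_week
    for _ in range(n_weeks):
        for _ in range(steps_per_week):
            state = rk4_step(state, dt, beta_n, gamma, nu, eta)
        weekly_states.append(state)
    return weekly_states


# Hypothetical values: 0.8-week latent period, 1.2-week infectious period,
# basic reproduction number 1.5, IFR 0.4%.
gamma = 1.0 / 0.8
nu = 1.0 / 1.2
weekly = solve_weekly((0.99, 0.006, 0.004, 0.0, 0.0), 10, 1.5 * nu, gamma, nu, 0.004)

# Weekly increments of N_EI and N_ID are the quantities that link to data.
new_infectious = [b[6] - a[6] for a, b in zip(weekly, weekly[1:])]
new_deaths = [b[8] - a[8] for a, b in zip(weekly, weekly[1:])]
```

Differencing the cumulative columns at week boundaries gives the latent weekly incidence and mortality fractions directly, which is the practical reason for carrying the redundant right-hand column.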
2.3. Surveillance model
We fit our transmission model to seroprevalence data and two time series: numbers of new cases and deaths reported during some pre-specified time periods (e.g., weeks). We do not model changes in the numbers of diagnostic tests performed. Rather, we condition on test counts in the specification of the sampling model for the vector of case counts, which describes the probability of the observed case count given the observed number of tests and unobserved/latent incidence of cases over each time interval. First, we assume that, conditional on $\mathbf{N}(t)$, case and death counts are independent of each other and across time intervals, because they are just noisy realizations of information encoded by $\mathbf{N}(t)$. This leaves us with formulating models for cases and deaths in each individual observation interval.
Consider the number of deaths $M_l$ observed in time interval $(t_{l-1}, t_l]$, where $l = 1, \dots, L$. Since our ODEs track the latent cumulative fraction of deaths $N_{ID}(t)$, we can compute $\Delta N_{ID}(t_l) = N_{ID}(t_l) - N_{ID}(t_{l-1})$, the latent fraction of the population that died in the interval $(t_{l-1}, t_l]$. We model the observed death count as a realization from the following negative binomial distribution:
$$M_l \mid \Delta N_{ID}(t_l) \sim \text{Neg-Binomial}\left(\mu_l^D = \rho^D N\, \Delta N_{ID}(t_l),\; (\sigma_l^D)^2 = \mu_l^D \left(1 + \mu_l^D / \phi_D\right)\right), \tag{3}$$
where $N$ is the population size, $\mu_l^D$ and $(\sigma_l^D)^2$ are the mean and variance of the negative binomial distribution, $\rho^D$ is the mean overall death detection probability, and $\phi_D$ is an over-dispersion parameter. Informally, our mortality model says that, on average, the observed number of deaths, $M_l$, is a fraction of the true death count estimated by the model, $N\, \Delta N_{ID}(t_l)$, with some noise due to underreporting, delayed reporting, and sampling variability.
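The mean and variance form of the negative binomial death model can be sketched in Python as follows. This is a minimal illustration, assuming the common parameterization in which the variance equals the mean plus the squared mean divided by the over-dispersion parameter; the function names and example numbers are hypothetical.

```python
import math


def expected_deaths(detection_prob, population, latent_death_fraction):
    """Mean observed deaths: detection probability x population x latent fraction."""
    return detection_prob * population * latent_death_fraction


def neg_binomial_logpmf(k, mean, phi):
    """Log-probability of k under a negative binomial with the given mean
    and variance mean + mean**2 / phi (assumed parameterization)."""
    p = phi / (phi + mean)  # "success" probability in the standard form
    return (
        math.lgamma(k + phi) - math.lgamma(phi) - math.lgamma(k + 1)
        + phi * math.log(p) + k * math.log1p(-p)
    )


# Hypothetical week: 90% death detection, 3.2 million residents,
# latent weekly death fraction of 1e-5.
mu = expected_deaths(0.9, 3_200_000, 1e-5)
log_prob = neg_binomial_logpmf(25, mu, phi=10.0)
```

Smaller values of the over-dispersion parameter give a variance further above the Poisson-like value, and very large values approach the Poisson limit.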
Next, we develop a model for the number of positive tests (cases), $Y_l$, observed in the time interval $(t_{l-1}, t_l]$. We start with a simple binomial model with per-test positivity probability $\psi_l$:
$$Y_l \mid \psi_l \sim \text{Binomial}(T_l, \psi_l),$$
where $T_l$ is the number of COVID-19 diagnostic tests administered during the time interval $(t_{l-1}, t_l]$. We use another layer of randomness to account for unobserved factors affecting positivity probabilities (e.g., variable testing guidelines and test shortages) and assume that the positivity probability in interval $(t_{l-1}, t_l]$ follows the beta distribution:
$$\psi_l \sim \text{Beta}\left(\phi_C\, \mu_l^C,\; \phi_C\, (1 - \mu_l^C)\right), \tag{4}$$
where $\phi_C$ is an over-dispersion parameter and $\mu_l^C$ is the mean test positivity probability. We assume that mean test positivity odds is proportional to the unobserved odds of transitioning from exposed to infectious, in interval $(t_{l-1}, t_l]$:
$$\frac{\mu_l^C}{1 - \mu_l^C} = \alpha_l\, \frac{\Delta N_{EI}(t_l)}{1 - \Delta N_{EI}(t_l)}, \tag{5}$$
where $\Delta N_{EI}(t_l) = N_{EI}(t_l) - N_{EI}(t_{l-1})$ and $\alpha_l > 0$ is the value of the time-varying parameter $\alpha(t)$ in interval $(t_{l-1}, t_l]$. This functional form ensures that, on average, the probability of detecting a SARS-CoV-2 infection grows with the population incidence. Parameter $\alpha_l$ can be thought of as an effect of testing guidelines and practices. A model with $\log \alpha_l = 0$ (i.e., $\alpha_l = 1$) says that in interval $(t_{l-1}, t_l]$ testing is done approximately by sampling individuals uniformly at random, so that the positivity probability over a time interval is equal to the fraction of the population that transitions from the latent to infectious state. As we increase $\log \alpha_l$ above 0, the model mimics preferential testing of individuals who are more likely to have severe infection (e.g., testing only individuals with certain symptoms).
We can streamline our surveillance model for case counts by integrating over positivity probabilities and arriving at the following beta-binomial distribution:
$$Y_l \mid T_l, \Delta N_{EI}(t_l) \sim \text{Beta-Binomial}\left(T_l,\; \phi_C\, \mu_l^C,\; \phi_C\, (1 - \mu_l^C)\right). \tag{6}$$
Properties of the beta-binomial distribution imply that $\mathrm{E}(Y_l) = T_l\, \mu_l^C$. This means that our model predicts that, on average, cases grow linearly with the number of diagnostic tests administered. Keeping in mind our assumed relationship between $\mu_l^C$ and $\Delta N_{EI}(t_l)$, the average number of cases also grows with the accumulation of new infections. Furthermore, the variance of the fraction of tests that are positive under the beta-binomial distribution is
$$\mathrm{Var}\left(\frac{Y_l}{T_l}\right) = \frac{\mu_l^C (1 - \mu_l^C)}{T_l} \left[1 + \frac{T_l - 1}{\phi_C + 1}\right],$$
where the variance under an analogous pure binomial model would be $\mu_l^C (1 - \mu_l^C) / T_l$. Hence, the over-dispersion parameter, $\phi_C$, can be interpreted in terms of the excess variance of the beta-binomial model relative to a pure binomial distribution. In summary, our beta-binomial distribution for observed case counts ensures that we do not confuse an increase in testing with an increase in SARS-CoV-2 incidence and implicitly allows for heterogeneity in the mean test positivity probability.
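The case model above can be sketched numerically. The Python snippet below is an illustrative sketch only: it assumes the beta-binomial is parameterized by a mean positivity probability and a single over-dispersion parameter (so the two beta shape parameters are their product and its complement), and that mean positivity odds equal a proportionality constant times the latent incidence odds. All function names and numbers are hypothetical.

```python
import math


def positivity_mean(alpha, latent_incidence_fraction):
    """Mean test positivity from the proportional-odds link (assumed form)."""
    odds = alpha * latent_incidence_fraction / (1.0 - latent_incidence_fraction)
    return odds / (1.0 + odds)


def log_beta(a, b):
    """Logarithm of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(cases, tests, mu, phi):
    """Log-probability of `cases` positives among `tests` tests."""
    a, b = phi * mu, phi * (1.0 - mu)
    log_choose = (
        math.lgamma(tests + 1) - math.lgamma(cases + 1)
        - math.lgamma(tests - cases + 1)
    )
    return log_choose + log_beta(cases + a, tests - cases + b) - log_beta(a, b)


def positivity_variance(tests, mu, phi):
    """Variance of the positive fraction: binomial variance times an inflation factor."""
    return mu * (1.0 - mu) / tests * (1.0 + (tests - 1) / (phi + 1.0))


# Hypothetical week: 1% of the population becomes infectious and testing is
# four times as likely to reach an infected person (in odds terms).
mu_week = positivity_mean(4.0, 0.01)
log_prob = beta_binomial_logpmf(900, 20_000, mu_week, phi=100.0)
```

Doubling the number of tests doubles the expected case count in this model while leaving the expected positivity unchanged, which is how the model separates testing volume from incidence.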
Finally, we model the number of observed seropositive cases $Z_{l_s}$ among $U_{l_s}$ tests with a binomial distribution:
$$Z_{l_s} \mid R(t_{l_s}) \sim \text{Binomial}\left(U_{l_s},\; R(t_{l_s})\right). \tag{7}$$
This simple model assumes the seroprevalence data comes from a high-quality study based on random sampling, which does not exhibit the problems observed in the testing data.
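A short Python sketch of the binomial seroprevalence likelihood, using the 343 seropositive results among 2979 tests reported in the data section. The function name is hypothetical, and treating the success probability as a single latent seroprevalence fraction is an assumption of this sketch.

```python
import math


def binomial_logpmf(k, n, p):
    """Log-probability of k successes in n independent trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


seropositive, sero_tests = 343, 2979
raw_fraction = seropositive / sero_tests  # about 11.5%

# The log-likelihood is highest near the raw fraction and drops quickly
# for latent seroprevalence values far from it.
loglik_at_raw = binomial_logpmf(seropositive, sero_tests, raw_fraction)
loglik_at_20_percent = binomial_logpmf(seropositive, sero_tests, 0.20)
```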
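In addition, we also consider a more typical approach to modeling observed cases, which is not conditional on tests and is similar to (3):
$$Y_l \mid \Delta N_{EI}(t_l) \sim \text{Neg-Binomial}\left(\mu_l^Y = \rho_l^Y N\, \Delta N_{EI}(t_l),\; (\sigma_l^Y)^2 = \mu_l^Y \left(1 + \mu_l^Y / \phi_Y\right)\right), \tag{8}$$
where $N$ is the population size, $\mu_l^Y$ and $(\sigma_l^Y)^2$ are the mean and variance of the negative binomial distribution, $\rho_l^Y$ is the mean case detection probability, which varies over time, and $\phi_Y$ is an over-dispersion parameter.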
2.4. Putting all the pieces into a Bayesian model
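We now describe our inferential Bayesian procedure. First, we re-parameterize our model by replacing $\beta(t)$ with a basic reproductive number $R_0(t) = \beta(t) N / \nu$. We parameterize each of our time-varying parameters as piecewise constant functions, where each vector defining the constants a priori follows a Gaussian Markov random field (GMRF).
More precisely, we define the auxiliary vectors:
$$\tilde{\mathbf{R}}_0 = (\tilde{R}_{0,1}, \dots, \tilde{R}_{0,L}), \quad \tilde{\boldsymbol{\eta}} = (\tilde{\eta}_1, \dots, \tilde{\eta}_L), \quad \tilde{\boldsymbol{\alpha}} = (\tilde{\alpha}_1, \dots, \tilde{\alpha}_L), \quad \tilde{\boldsymbol{\rho}}^Y = (\tilde{\rho}^Y_1, \dots, \tilde{\rho}^Y_L),$$
which follow the Gaussian Markov random field priors:
$$\tilde{\theta}_1 \sim \text{Normal}\left(\mu_{\theta},\, \sigma_{\theta, 1}^2\right), \qquad \tilde{\theta}_l \mid \tilde{\theta}_{l-1} \sim \text{Normal}\left(\tilde{\theta}_{l-1},\, \sigma_{\theta}^2\right), \quad l = 2, \dots, L, \quad \theta \in \{R_0, \eta, \alpha, \rho^Y\}, \tag{9}$$
and define the piecewise constant functions, for $t \in (t_{l-1}, t_l]$:
$$R_0(t) = \exp(\tilde{R}_{0,l}), \tag{10}$$
$$\eta(t) = \text{expit}(\tilde{\eta}_l), \tag{11}$$
$$\alpha(t) = \exp(\tilde{\alpha}_l), \tag{12}$$
$$\rho^Y(t) = \text{expit}(\tilde{\rho}^Y_l). \tag{13}$$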
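The random-walk prior and the piecewise constant construction can be sketched as follows. This Python example is a hedged illustration rather than the authors' code: it assumes one piece per week, a first-order Gaussian random walk on the unconstrained scale, an exponential transform for positive parameters, and an inverse-logit transform for probabilities. All names and numerical values are hypothetical.

```python
import math
import random


def simulate_random_walk(n_pieces, start_mean, start_sd, step_sd, rng):
    """Draw one path: a Gaussian start, then Gaussian steps centered on the previous value."""
    values = [rng.gauss(start_mean, start_sd)]
    for _ in range(n_pieces - 1):
        values.append(rng.gauss(values[-1], step_sd))
    return values


def expit(x):
    """Inverse logit, mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def piecewise_constant(values, transform):
    """Return f with f(t) = transform(values[l - 1]) for t in (l - 1, l], t in weeks.

    Times outside the covered range are clamped to the first or last piece.
    """
    def f(t):
        index = min(max(math.ceil(t) - 1, 0), len(values) - 1)
        return transform(values[index])
    return f


rng = random.Random(2021)
log_r0 = simulate_random_walk(42, start_mean=0.0, start_sd=0.2, step_sd=0.1, rng=rng)
logit_ifr = simulate_random_walk(42, start_mean=-5.5, start_sd=0.2, step_sd=0.1, rng=rng)

r0_of_t = piecewise_constant(log_r0, math.exp)  # always positive
ifr_of_t = piecewise_constant(logit_ifr, expit)  # always in (0, 1)
```

Working on the log or logit scale keeps every draw inside its valid range while letting the random-walk standard deviation control how quickly a parameter can change from week to week.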
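In addition, we parameterize initial compartment fractions as $E(t_0) = [1 - S(t_0)]\,(1 - \tilde{I}_0)$ and $I(t_0) = [1 - S(t_0)]\, \tilde{I}_0$, where $S(t_0), \tilde{I}_0 \in [0, 1]$. This construction allows us to specify independent prior distributions for $S(t_0)$ and $\tilde{I}_0$ while preserving the sum-to-one constraint on the original initial compartmental fractions. Next, we collect all our model parameters into a vector $\boldsymbol{\theta}$. When using the traditional case-emission model (8), $\tilde{\boldsymbol{\rho}}^Y$, $\sigma_{\rho^Y}$, and $\phi_Y$ are substituted for $\tilde{\boldsymbol{\alpha}}$, $\sigma_{\alpha}$, and $\phi_C$. Our probabilistic construction described above implies that the likelihood function (the probability of observing incidence, mortality, and seroprevalence data) can be written in the following way:
$$\Pr(\mathbf{Y}, \mathbf{M}, Z_{l_s} \mid \mathbf{T}, U_{l_s}, \boldsymbol{\theta}) = \Pr\left(Z_{l_s} \mid R(t_{l_s})\right) \prod_{l=1}^{L} \Pr\left(M_l \mid \Delta N_{ID}(t_l)\right) \Pr\left(Y_l \mid \Delta N_{EI}(t_l)\right),$$
where $\Pr(M_l \mid \Delta N_{ID}(t_l))$, $\Pr(Y_l \mid \Delta N_{EI}(t_l))$, and $\Pr(Z_{l_s} \mid R(t_{l_s}))$ are the probability mass functions given by (3), (6) or (8), and (7) respectively.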
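The sum-to-one reparameterization of the initial compartment fractions can be illustrated with a few lines of Python. This sketch assumes one possible construction, in which the non-susceptible share of the population is split between the exposed and infectious compartments by a second fraction; the function and argument names are hypothetical.

```python
def initial_compartments(susceptible_fraction, infectious_share):
    """Map two independent fractions in [0, 1] to (S, E, I, R, D) at the start time.

    Assumed construction: everyone who is not susceptible is either exposed or
    infectious, and `infectious_share` is the infectious part of that group.
    Recovered and dead fractions start at zero.
    """
    not_susceptible = 1.0 - susceptible_fraction
    infectious = infectious_share * not_susceptible
    exposed = not_susceptible - infectious
    return (susceptible_fraction, exposed, infectious, 0.0, 0.0)


# Hypothetical values: 99.8% susceptible, 40% of initial infections infectious.
x0 = initial_compartments(0.998, 0.4)
```

Because any pair of values in the unit square maps to valid compartment fractions, the two inputs can be given independent priors without an explicit simplex constraint.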
We encode available information about our model parameters in a prior distribution with density $\pi(\boldsymbol{\theta})$. We assume that all univariate non-GMRF distributed parameters are a priori independent and list our prior assumptions in Table S-1.1. Since our model is highly parametric, we rely on informative prior distributions that we parameterize using existing scientific studies. We base all our inferences and predictions on the posterior distribution of all model parameters:
$$\pi(\boldsymbol{\theta} \mid \mathbf{Y}, \mathbf{M}, Z_{l_s}) \propto \Pr(\mathbf{Y}, \mathbf{M}, Z_{l_s} \mid \mathbf{T}, U_{l_s}, \boldsymbol{\theta})\, \pi(\boldsymbol{\theta}). \tag{14}$$
We sample from this posterior using the No-U-Turn Sampler (Hoffman and Gelman, 2014) as implemented in the Turing Julia package (Ge et al., 2018). Model code and data are available at the following GitHub repository: https://github.com/damonbayer/semi_parametric_COVID_19_OC_model.
3. Results
3.1. Simulation Study
To validate our model, we simulated 200 datasets with parameters given in Supplementary Table S-1.2. An example of one of these datasets is presented in Figure S-1.1. The number of tests at each time point is the same as in the Orange County data set, and the parameters were deliberately chosen to produce data similar to the Orange County data. Priors used for these model fits are the same as those used in the Orange County model (see Table S-1.1). The prior and posterior distributions of the model fit to the single simulated dataset from Figure S-1.1 are presented in Supplementary Figures S-1.2–S-1.3. For each simulated dataset, we used four Markov chains run in parallel to draw a total of 1000 posterior samples. In this single dataset example, most of the scalar parameter posteriors shift slightly toward the true parameters compared to the priors, without much posterior variance contraction relative to the prior. We define posterior variance contraction as one minus the ratio of the posterior standard deviation to the prior standard deviation, where negative contraction indicates that the posterior is wider than the prior, and 100% contraction indicates that the posterior is a degenerate distribution. For the time-varying parameters and compartments, the variance contraction and shift are much more apparent. We further explore our simulation study results with summary measurements, presented in Supplementary Figures S-1.5–S-1.9. Figure S-1.5 shows the coverage of the posterior 80% credible intervals constructed for the scalar parameters in the 200 simulated datasets. Most parameters demonstrate nearly 100% coverage, except for ϕC, which shows approximately 90% coverage, still above the nominal 80%. Similarly, Figure S-1.7 displays coverage of the posterior 80% credible intervals constructed for the time-varying parameters, with only one time point for α demonstrating less than nominal coverage.
We observe slightly less conservative coverage when examining the latent compartments in Figure S-1.9, with the D compartment falling to around 60% coverage at some points. In addition to coverage, we also consider posterior contraction. Figure S-1.6 shows contraction of the scalar parameters, with most parameters demonstrating nearly no contraction. Notable exceptions to this are σR0 and σα, which exhibit positive contraction. Figure S-1.8 shows contraction of the time-varying parameters, with all parameters exhibiting a high amount of positive contraction. Similarly, the latent compartments also demonstrate a large degree of positive contraction in Figure S-1.10.
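The two simulation summaries used above, posterior contraction and credible interval coverage, can be computed as in the following Python sketch. The function names and toy inputs are hypothetical; contraction follows the definition given in the text (one minus the ratio of posterior to prior standard deviation).

```python
import statistics


def contraction(prior_draws, posterior_draws):
    """One minus the ratio of posterior to prior standard deviation.

    Negative values mean the posterior is wider than the prior; a value of 1
    means the posterior is degenerate.
    """
    return 1.0 - statistics.stdev(posterior_draws) / statistics.stdev(prior_draws)


def coverage(intervals, true_values):
    """Fraction of (lower, upper) intervals that contain the matching true value."""
    hits = sum(lower <= truth <= upper
               for (lower, upper), truth in zip(intervals, true_values))
    return hits / len(true_values)


# Toy example: the posterior is half as dispersed as the prior.
toy_contraction = contraction([0.0, 2.0, 4.0], [1.0, 2.0, 3.0])

# Toy example: three of four intervals contain their true values.
toy_coverage = coverage(
    [(0.0, 1.0), (0.0, 1.0), (2.0, 3.0), (2.0, 3.0)],
    [0.5, 1.5, 2.5, 2.0],
)
```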
We used the epidemia R package (Scott et al., 2020) to fit a state-of-the-art method for effective reproduction number (Rt) estimation to the same 200 simulated datasets. This semi-mechanistic method does not attempt to estimate the unobserved number of susceptible and recovered individuals and does not account for changes in testing volume. See Supplementary Section S-2 for a brief description of the epidemia statistical model. Metrics based on estimates of Rt from the true model and epidemia for the simulation study are presented in Figure S-2.11. We assess the envelope, which is the proportion of time points at which the 80% posterior credible interval contains the true Rt value specified in the simulation. Mean credible interval width (MCIW) is calculated as the mean of credible interval widths across time points within a simulation replication. Absolute deviation is a measure of bias, and is calculated as the mean of the absolute difference between the posterior median and the true Rt value at each time point. The mean absolute sequential variation (MASV) is calculated as the mean of the absolute difference between the posterior median at a time point and the posterior median at the previous time point. In all metrics, the true model fit achieves the superior result. The envelope for epidemia is typically around 50%, while it is near 100% for the true model. The mean credible interval width for epidemia (around 0.50) is larger than for the true model (around 0.35). The epidemia results also indicate more bias compared to the true model, as measured by the absolute deviation (0.20 for epidemia vs 0.05 for the true model). Additionally, the true model has a mean absolute sequential variation close to that of the simulated parameters (around 0.085), while the MASV reported by epidemia is larger (around 0.10).
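The four Rt comparison metrics defined above translate directly into code. The Python sketch below computes them for a single simulation replication; the function names and the toy series are hypothetical and are not taken from the study.

```python
def envelope(lower, upper, truth):
    """Proportion of time points whose credible interval contains the true value."""
    hits = sum(lo <= value <= hi for lo, hi, value in zip(lower, upper, truth))
    return hits / len(truth)


def mean_interval_width(lower, upper):
    """Mean credible interval width (MCIW) across time points."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


def absolute_deviation(median, truth):
    """Mean absolute difference between posterior median and true value."""
    return sum(abs(m - value) for m, value in zip(median, truth)) / len(truth)


def masv(median):
    """Mean absolute sequential variation of the posterior median."""
    steps = [abs(b - a) for a, b in zip(median, median[1:])]
    return sum(steps) / len(steps)


# Toy replication with four time points.
truth = [1.0, 1.2, 0.9, 1.1]
median = [1.1, 1.1, 1.0, 1.3]
lower = [0.9, 0.9, 0.95, 1.0]
upper = [1.3, 1.3, 1.2, 1.5]

metrics = {
    "envelope": envelope(lower, upper, truth),
    "mciw": mean_interval_width(lower, upper),
    "absolute_deviation": absolute_deviation(median, truth),
    "masv": masv(median),
}
```

Applying `masv` to the true Rt series as well gives the reference value against which an estimator's smoothness can be judged.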
3.2. Application to Orange County, California Data
Next, we apply our Bayesian inferential procedure to COVID-19 surveillance data collected in Orange County, California between March 30, 2020 and January 17, 2021. We again used four Markov chains run in parallel to draw a total of 1000 posterior samples. By the end of the modeling period, approximately 4% of Orange County residents were at least partially vaccinated. Because our model does not incorporate vaccination directly, it does not make sense to use our model beyond January 2021. Throughout the modeling period, a variety of non-pharmaceutical interventions were enacted and sometimes lifted at the state, county, and city level. Notably, in-person school closures, indoor dining bans, and mask mandates were in effect for most or all of this period. Fitting the model took approximately 7.5 hours to generate 1000 posterior samples from 4 chains, totaling around 29 CPU hours. These run times show that our model can be fit frequently enough to be useful for a real-time policy response. This number of chains and posterior samples resulted in a satisfactory effective sample size for posterior inference. Convergence and mixing were assessed using potential scale reduction factors, effective posterior sample sizes, and traceplots of model parameters, which are presented in Appendix S-5. Our main interest is in understanding differences in transmission dynamics and surveillance efforts throughout this period. Since our model is highly parametric, we used existing knowledge of SARS-CoV-2 transmission dynamics to formulate informative priors for all model parameters that we list in Table S-1.1. We briefly highlight some of our assumptions. Our priors for the initial compartment sizes reflect our belief that the number of infections was small, but potentially underreported by a factor of 10, at the beginning of the pandemic.
Since our observation period starts close to the stay-at-home order taking effect in California, we assume that the March 2020 basic reproduction number should be around 1.0 to reflect reduced contacts during this period. Lengths of latent and infectious periods are a priori assumed to be 0.8 and 1.2 weeks respectively, with substantial variance. Based on the Orange County, CA seroprevalence study, we assume the initial infection-to-fatality ratio to be around 0.4% (Bruckner et al., 2021). We compare reported deaths in Orange County, CA with estimates of U.S. county-specific excess mortality to set the prior for death reporting probability to be around 0.9 (Stokes et al., 2021). Prior and posterior distributional summaries of all model parameters are available in Supplementary Figures S-4.17 and S-4.18. These figures also contain the results of our sensitivity analysis, where we examined the effects of our prior assumptions on our inference. Our main conclusion is that our results are not sensitive to reasonable prior perturbations.
The upper-left plot of Figure 3 presents the posterior distribution of the basic reproductive number (R0) for Orange County. Throughout the late spring and summer, the basic reproductive number is estimated to be slightly above 1.0, with some probability of being below 1.0 in the early fall. Beginning in October, the basic reproductive number begins to rise and surpasses 2.0 at the peak of the winter wave. This rise in the fall may be associated with the school re-openings that occurred around this time. Despite the high basic reproductive number throughout the modeling period, the upper-right plot of Figure 3 shows that the effective reproductive number fell below 1.0 for much of the summer and again in January, following the winter surge, allowing us to separate the effects of reducing the average community contact rate and accumulated infection-induced immunity. We also apply the epidemia method to the Orange County data and plot the results in Figure 4. From this, we observe that the two methods lead to similar conclusions about the posterior distribution of Rt. At all but one of the time points, the 80% credible intervals of the posteriors from both methods overlap with one another, but the full model appears to generally produce smaller credible intervals, especially near the beginning of the fitting period. Additionally, the full model produces a smoother posterior than the epidemia model.
Figure 3:
Posterior distributions of the time-varying basic reproductive number R0, effective reproductive number Re, infection-to-fatality ratio (IFR), proportion in the proportional log-odds model of the beta-binomial observational model for cases α, weekly latent:case ratio, and cumulative latent:case ratio. Solid blue lines show point-wise posterior medians, while shaded areas denote 50%, 80%, and 95% Bayesian credible intervals.
Figure 4:
Posterior inference for the effective reproduction number from the full model and epidemia fit to the Orange County data.
We proceed with describing inference results for the other two time-varying parameters in our model: infection-to-fatality ratio η and the parameter α that governs the relationship between testing positivity and the true proportion of newly infected individuals in the population. The posterior distribution of the infection-to-fatality ratio is presented in the middle-left plot of Figure 3. The IFR is estimated to be consistent over time, hovering around 0.3%, but our estimates are less certain near the end of the modeling period, where the credible intervals leave room for a rise. This potential rise in IFR could have been caused by a combination of the overwhelmed healthcare system and the increasing prevalence of the Alpha variant at this time, which has been tied to more severe outcomes (Grint et al., 2021). The middle-right plot and bottom plots of Figure 3 present three perspectives on testing policy and case detection: the posterior α, the weekly latent:case ratio, and the cumulative latent:case ratio. Generally, α drifts lower over time, indicating that testing policy became less preferential toward selecting infected individuals as testing became more accessible. This trend is reversed slightly during the summer and winter waves, which is reflected in the decreasing weekly latent:case ratio during these times. The cumulative latent:case ratio also drifts lower over time, eventually arriving at a final cumulative latent:case ratio of 4:1 – 9:1.
We plot posterior medians and Bayesian credible intervals of the latent cumulative death counts ($N \cdot N_{ID}(t)$) between March 2020 and January 2021, using three credibility levels, in the left plot of Figure 5. Reported death counts are shown as black circles in the same plot. The plot reflects an overall death reporting rate of 87% – 94%. The center plot of Figure 5 shows the posterior distributions of the cumulative number of infections ($N \cdot N_{SE}(t)$) that occurred in Orange County, with the cumulative observed cases displayed as black circles. We estimate that 32–72% of Orange County residents experienced SARS-CoV-2 infection by mid-January 2021. As in Figure 3, this shows a cumulative latent:case ratio of 4:1 – 9:1, with 1/3 – 2/3 of all Orange County residents having been infected by the end of January 2021. From this plot, we also note that our posterior estimate of seroprevalence in mid-August of 2020 (11.2%–13.7%) closely matches the 11.5% estimate from Bruckner et al. (2021). We explicitly used this seroprevalence data in our inference, so this is unsurprising. The right plot of Figure 5 shows the prevalence of SARS-CoV-2 infected individuals at a particular time, $E(t) + I(t)$. At the peak of the winter wave, we estimate that 7.8% – 14.9% of Orange County residents had an active infection at the same time.
Figure 5:
Latent and observed cumulative death (left) and incidence (center) trajectories and latent prevalence trajectories (right) in Orange County, CA (population 3.2 million). Solid blue lines show point-wise posterior medians, while shaded areas denote 50%, 80%, and 95% Bayesian credible intervals. Black circles denote observed data. Note that the posterior distributions of latent deaths and cases are not forecasts of their observed counterparts. Forecasts are plotted in Figure 6.
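As a quick consistency check on the figures quoted above, the following minimal sketch converts the 32–72% infection estimate into counts using the 3.2 million population from the Figure 5 caption, and then divides by the rounded 4:1 and 9:1 latent:case ratios. It assumes that the cumulative latent:case ratio is latent infections divided by observed cases, so that interval endpoints pair up; the function name is hypothetical and the ratios are rounded, so the implied case count is only approximate.

```python
POPULATION = 3_200_000  # Orange County population quoted in the Figure 5 caption


def infections_from_percent(percent: int, population: int = POPULATION) -> int:
    """Convert a percentage of the population into a number of people."""
    return population * percent // 100


# Lower and upper ends of the estimated cumulative infections (32% and 72%).
low_infections = infections_from_percent(32)
high_infections = infections_from_percent(72)

# Assumption: ratio = latent infections / observed cases, so the lower end of
# the infection interval pairs with 4:1 and the upper end pairs with 9:1.
implied_cases_low = low_infections / 4
implied_cases_high = high_infections / 9

print(low_infections, high_infections)        # 1024000 2304000
print(implied_cases_low, implied_cases_high)  # 256000.0 256000.0
```

Both ends imply roughly the same number of observed cases, which is what one would expect if the two intervals are summaries of the same posterior.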
Finally, we turn to model-based forecasting of observable quantities. In Figure 6, we present one week and four week ahead forecasts of observed deaths and test positivity. The credible intervals shown for a given date are generated from a model that is fit to data from March 30, 2020 up to 1 week or 4 weeks prior to the given date. The forecasts are produced by augmenting the posterior time-varying parameters by carrying forward the previous mean values in (9) and solving the ODEs from (2) into the future. Since this model makes use of the seroprevalence data, we only produce forecasts for times after these data are available, beginning in late August 2020. Because forecasting cases with this model is impossible without knowing how many tests will be conducted in the future, we focus on the positivity fraction (cases divided by the total number of tests) instead. As in the other figures, we use three credibility levels in Figure 6, and observed values are displayed as black circles. Our one-week ahead probabilistic forecasts for both observed deaths and test positivity in the upper half of Figure 6 generally capture the observed values, indicating that our method is precise and well calibrated. The four-week ahead forecasts predict the data well in times of relative stability, but perform poorly when time-varying parameters are changing rapidly, such as during the winter wave. In this case, when forecasting four weeks out, we tend to underestimate the rise at the beginning of the wave and overestimate the fall at the end of the wave. This is not of major concern because the four-week time horizon is long enough that interventions and behavioral changes not foreseen by the model may take effect. Scenario-based modeling, where some values are specified for the future time-varying parameters, rather than simulating from the prior, may be more appropriate for this task.
Figure 6:
Forecast distributions for observed deaths (left column) and testing positivity (right column). Solid blue lines show point-wise posterior medians, while shaded areas denote 50%, 80%, and 95% Bayesian credible intervals. Observed values are presented as black circles.
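The forecasting recipe described above (extend the time-varying parameters past the end of the data, then solve the ODEs forward) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a generic SEIR system rather than the paper's system (2), holds the last fitted transmission rate fixed instead of drawing future values from the prior in (9), and uses a hand-written Runge-Kutta integrator. All function names, parameter values, and the starting state are hypothetical.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, ...]  # (S, E, I, R) as population counts


def seir_rhs(state: State, beta: float, sigma: float, gamma: float) -> State:
    """Derivatives of a generic SEIR model (an assumed stand-in for system (2))."""
    s, e, i, r = state
    n = s + e + i + r
    infections = beta * s * i / n
    return (-infections, infections - sigma * e, sigma * e - gamma * i, gamma * i)


def _shift(state: State, slope: State, h: float) -> State:
    """Return state + h * slope, component by component."""
    return tuple(x + h * k for x, k in zip(state, slope))


def rk4_step(state: State, beta: float, sigma: float, gamma: float, dt: float) -> State:
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = seir_rhs(state, beta, sigma, gamma)
    k2 = seir_rhs(_shift(state, k1, dt / 2), beta, sigma, gamma)
    k3 = seir_rhs(_shift(state, k2, dt / 2), beta, sigma, gamma)
    k4 = seir_rhs(_shift(state, k3, dt), beta, sigma, gamma)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def forecast_draw(
    last_state: State,
    fitted_beta: Sequence[float],
    horizon_weeks: int,
    sigma: float,
    gamma: float,
    steps_per_day: int = 10,
) -> List[State]:
    """Forecast one posterior draw; returns the state at the end of each week."""
    beta = fitted_beta[-1]  # simplification: carry the last fitted value forward
    dt = 1.0 / steps_per_day
    state = last_state
    weekly_states = []
    for _ in range(horizon_weeks):
        for _ in range(7 * steps_per_day):
            state = rk4_step(state, beta, sigma, gamma, dt)
        weekly_states.append(state)
    return weekly_states


if __name__ == "__main__":
    # Hypothetical end-of-fit state and daily rates, for illustration only.
    start = (2_000_000.0, 20_000.0, 30_000.0, 1_150_000.0)
    path = forecast_draw(start, [0.25, 0.22, 0.20], horizon_weeks=4,
                         sigma=1 / 3, gamma=1 / 7)
    for week, (s, e, i, r) in enumerate(path, start=1):
        print(f"week {week}: active infections (E + I) = {e + i:,.0f}")
```

In a full analysis this calculation would be repeated for every posterior draw, and the resulting latent trajectories would be passed through the emission models to obtain forecast distributions of observed deaths and test positivity.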
We now compare our forecasting results to three variants of our model, which make use of different data streams. Each model can either be conditioned on tests or not conditioned on tests. When conditioning on tests, we use the case emission model given by (6). When not conditioning on tests, we use the case emission model given by (8). Each model can also make use of the seroprevalence data or not. When using the seroprevalence data, we use the emission distribution given by (7). When not using the seroprevalence data, no emission distribution is used. The model results discussed above are for the model which is conditioned on tests and uses the seroprevalence data. We compare these models by calculating the Continuous Ranked Probability Score (CRPS) (Matheson and Winkler, 1976), as implemented in the scoringRules package (Jordan et al., 2019), facilitated by the scoringutils package (Bosse et al., 2022). We present comparisons of these scores in Figure 7. We only show scores based on deaths because comparisons based on cases would require developing a method to forecast future test counts, which we have not considered in this work.
Figure 7:
Comparison of Continuous Ranked Probability Score for models fit to the Orange County data. Lower is better.
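For readers unfamiliar with the CRPS, the sketch below shows one standard way to estimate it from forecast samples, using the identity CRPS(F, y) = E|X − y| − ½ E|X − X′|, where X and X′ are independent draws from the forecast distribution F. The scores in this paper were computed with the scoringRules and scoringutils R packages; the function here is an independent, hypothetical illustration and does not reproduce those packages' code.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], observed: float) -> float:
    """Sample-based CRPS estimate: mean|X - y| - 0.5 * mean|X - X'|."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one forecast sample is required")
    # Average distance between the forecast draws and the observation.
    accuracy = sum(abs(x - observed) for x in samples) / n
    # Average distance between pairs of forecast draws (forecast spread).
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


# A single-draw forecast reduces to the absolute error.
print(crps_from_samples([3.0], 5.0))       # 2.0
# Two equally weighted draws at 0 and 2, with observation 1.
print(crps_from_samples([0.0, 2.0], 1.0))  # 0.5
```

The double sum costs O(n²) operations, which is fine for a few thousand draws; lower scores indicate forecasts that are both closer to the observation and appropriately concentrated.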
From Figure 7, we observe that all models appear to perform similarly throughout much of the assessed time period. However, in both one week and four week ahead forecasts, the models which are conditioned on tests tend to score slightly better in the late summer period, when testing policy was rapidly changing. During the winter surge, the differences in the model forecasting abilities are more pronounced, with the models not conditioned on tests appearing to be consistently superior to those which are conditioned on tests. There is no clear pattern differentiating the models which use the seroprevalence data from those that do not.
4. Discussion
We developed a Bayesian SARS-CoV-2 transmission model that integrates information from incidence, mortality, and seroprevalence data. Our approach combines an ODE-based SEIR compartmental model of SARS-CoV-2 transmission dynamics and a carefully constructed surveillance model for cases, deaths, and seroprevalence. Importantly, our method accounts for variability in the number of SARS-CoV-2 diagnostics tests across time, thus ensuring that we do not confuse increases in testing with increases in incidence. Another distinguishing feature of our approach is nonparametric modeling of changes in key transmission and surveillance model parameters. Since we are integrating multiple sources of information, we can afford to be fairly ambitious and to include three such parameters in our model. Changes in one of these parameters account for changes in the strength of preferential testing of SARS-CoV-2 infected individuals, which helps us avoid an important source of potential bias when inferring transmission model parameters. We reconstruct latent dynamics of the two pre-Delta COVID-19 waves in Orange County, CA and estimate that 32–72% of Orange County residents experienced SARS-CoV-2 infection by mid-January 2021. Retrospective analysis shows that our model produces accurate and well calibrated one week ahead and reasonable four week ahead forecasts, but the latter lack accuracy during periods when SARS-CoV-2 transmission dynamics and mitigation policies change rapidly. Additionally, we evaluated our forecasting performance when including or excluding certain data streams (test counts and seroprevalence study data) and found that incorporating negative test counts into our model was useful near the beginning of the modeling period, when testing policy was changing rapidly, but less useful when the policy became consistent. We also found that excluding seroprevalence data did not negatively affect our forecasting ability.
Our primary focus in this work was on developing a framework for integrating multiple data streams into a transmission model. However, there are a number of extensions we could pursue to improve the realism of the assumed transmission dynamics and strengthen the model’s forecasting skill. Our model assumes that the population of interest is well mixed and that all individuals in the population infect others and get infected at the same per capita rate. In fact, the actual SARS-CoV-2 transmission process is much more complex because individuals come into contact with each other based on their geographical and social network proximity. Furthermore, it is well established that the COVID-19 disease progression process depends on the individual’s age and other characteristics (Kim et al., 2021; Bhargava et al., 2020; Petrilli et al., 2020). Similarly, all transmission model parameters may depend on the vaccination status of an individual. Fortunately, compartmental models can be extended to account for these complexities. For example, we can stratify each model compartment by age, vaccination status, and geographical location, as is commonly done in epidemiological modeling (Li and Brauer, 2008; Van den Driessche, 2008).
We have addressed changes in control/mitigation measures and in human behavior by nonparametrically modeling variability of some of the SARS-CoV-2 transmission model parameters across time. Anderson et al. (2020) and Jewell et al. (2021) use parametric approaches to model the effects of mitigation measures on R0. It would be interesting to try a semiparametric approach that combines parametric and nonparametric components, which would allow us to include indicators of human behavior (e.g., mobility data as in Jewell et al. (2021)) in our inference and forecasting.
In this paper, we have sidestepped the thorny issue of reporting delays by restricting our analyses to time periods in which the data have stabilized. Hence, our analyses should be robust to reporting delays so long as we have either “run out the clock” on the extent of the delays or reporting delays do not differ between positive and negative COVID-19 diagnostic tests. An extension that would make our model more useful for real-time surveillance involves estimating the reporting delay distribution (Höhle and an der Heiden, 2014; Stoner and Economou, 2020) and using this distribution in our surveillance model.
Finally, we would like to point out that our deterministic representation of the latent epidemic process could be replaced with a fully stochastic model in which the latent epidemic is represented as a Markov jump process, albeit with some loss of computational efficiency. In our large population setting, this could be achieved via simulation-based methods (Bretó et al., 2009; Andrieu et al., 2010; Dukic et al., 2012), data augmentation (Pooley et al., 2015; Nguyen-Van-Yen et al., 2021), or a variety of approximations of the latent stochastic epidemic process (Lekone and Finkenstädt, 2006; Cauchemez and Ferguson, 2008; Fintzi et al., 2022). Scaling our model to the state or national level could be done by analyzing multiple counties independently or by building a Bayesian hierarchical model that would allow borrowing information among counties. An even more ambitious undertaking would be allowing importation/exportation events across county lines, as was done by Pei et al. (2021). We hope that our methodology and other works in this spirit, along with better quality of surveillance data, will provide us with better predictive analytics tools when the next pandemic strikes.
Acknowledgements
This work utilized the infrastructure for high-performance and high-throughput computing, research data storage and analysis, and scientific software tool integration built, operated, and updated by the Research Cyberinfrastructure Center (RCIC) at the University of California, Irvine. We are grateful for funding from the UCI Infectious Disease Science Initiative. This work was made possible in part through support from the UC CDPH Modeling Consortium. D.B., I.H.G., and V.M.M. were supported in part by NIH grant R01AI147336. V.M.M. was supported in part by NIH grant R01AI170204 and NSF grant DMS 1936833. E.R. was supported by the Division of Intramural Research, NIAID, NIH.
Footnotes
Disclaimers
This project has been funded in part with federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. 75N91019D00024, Task Order No. 75N91019F00130. This work was in part supported by the intramural research programs of the National Institutes of Health, Bethesda, MD. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
References
- Anderson S. C., Edwards A. M., Yerlanov M., Mulberry N., Stockdale J. E., Iyaniwura S. A., Falcao R. C., Otterstatter M. C., Irvine M. A., Janjua N. Z., Coombs D., and Colijn C. (2020), “Quantifying the impact of COVID-19 control measures using a Bayesian model of physical distancing,” PLOS Computational Biology, 16, 1–15.
- Andrieu C., Doucet A., and Holenstein R. (2010), “Particle Markov chain Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 269–342.
- Bhargava A., Fukushima E. A., Levine M., Zhao W., Tanveer F., Szpunar S. M., and Saravolatz L. (2020), “Predictors for severe COVID-19 infection,” Clinical Infectious Diseases, 71, 1962–1968.
- Bosse N. I., Gruson H., Cori A., van Leeuwen E., Funk S., and Abbott S. (2022), “Evaluating forecasts with scoringutils in R,” arXiv preprint arXiv:2205.07090.
- Bretó C., He D., Ionides E., and King A. (2009), “Time series analysis via mechanistic models,” The Annals of Applied Statistics, 3, 319–348.
- Bretó C. and Ionides E. (2011), “Compound Markov counting processes and their applications to modeling infinitesimally over–dispersed systems,” Stochastic Processes and their Applications, 121, 2571–2591.
- Bruckner T. A., Parker D. M., Bartell S. M., Vieira V. M., Khan S., Noymer A., Drum E., Albala B., Zahn M., and Boden-Albala B. (2021), “Estimated seroprevalence of SARS-CoV-2 antibodies among adults in Orange County, California,” Scientific Reports, 11, 3081.
- Byrne A. W., McEvoy D., Collins A. B., Hunt K., Casey M., Barber A., Butler F., Griffin J., Lane E. A., McAloon C., O’Brien K., Wall P., Walsh K. A., and More S. J. (2020), “Inferred duration of infectious period of SARS-CoV-2: rapid scoping review and analysis of available evidence for asymptomatic and symptomatic COVID-19 cases,” BMJ Open, 10.
- Bürkner P.-C., Gabry J., Kay M., and Vehtari A. (2022), “posterior: Tools for working with posterior distributions,” R package version 1.3.1.
- Cauchemez S. and Ferguson N. (2008), “Likelihood-based estimation of continuous-time epidemic models from time-series data: application to measles transmission in London,” Journal of the Royal Society Interface, 5, 885–897.
- Cummings M. J., Baldwin M. R., Abrams D., Jacobson S. D., Meyer B. J., Balough E. M., Aaron J. G., Claassen J., Rabbani L. E., Hastie J., et al. (2020), “Epidemiology, clinical course, and outcomes of critically ill adults with COVID-19 in New York City: a prospective cohort study,” The Lancet, 395, 1763–1770.
- Davies N. G., Abbott S., Barnard R. C., Jarvis C. I., Kucharski A. J., Munday J. D., Pearson C. A. B., Russell T. W., Tully D. C., Washburne A. D., Wenseleers T., Gimma A., Waites W., Wong K. L. M., van Zandvoort K., Silverman J. D., CMMID COVID-19 Working Group, COVID-19 Genomics UK (COG-UK) Consortium, Diaz-Ordaz K., Keogh R., Eggo R. M., Funk S., Jit M., Atkins K. E., and Edmunds W. J. (2021), “Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England,” Science, 372, ISSN 0036–8075.
- Davies N. G., Kucharski A. J., Eggo R. M., Gimma A., Edmunds W. J., Jombart T., O’Reilly K., Endo A., Hellewell J., Nightingale E. S., et al. (2020), “Effects of non-pharmaceutical interventions on COVID-19 cases, deaths, and demand for hospital services in the UK: a modelling study,” The Lancet Public Health, 5, e375–e385.
- Dukic V., Lopes H., and Polson N. (2012), “Tracking epidemics with Google flu trends data and a state-space SEIR model,” Journal of the American Statistical Association, 107, 1410–1426.
- Ferguson N. et al. (2020), “Report 9: Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand,” MRC Centre for Global Infectious Disease Analysis Reports, accessed: 2020-06-19.
- Fintzi J., Wakefield J., and Minin V. N. (2022), “A linear noise approximation for stochastic epidemic models fit to partially observed incidence counts,” Biometrics, 78, 1530–1541.
- Ge H., Xu K., and Ghahramani Z. (2018), “Turing: A language for flexible probabilistic inference,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, edited by Storkey A. and Perez-Cruz F., volume 84 of Proceedings of Machine Learning Research, pages 1682–1690, PMLR.
- Grint D. J., Wing K., Houlihan C., Gibbs H. P., Evans S. J., Williamson E., McDonald H. I., Bhaskaran K., Evans D., Walker A. J., et al. (2021), “Severity of SARS-CoV-2 alpha variant (B.1.1.7) in England,” Clinical Infectious Diseases, in press.
- Hoffman M. D. and Gelman A. (2014), “The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593–1623.
- Höhle M. and an der Heiden M. (2014), “Bayesian nowcasting during the STEC O104: H4 outbreak in Germany, 2011,” Biometrics, 70, 993–1002.
- Irons N. J. and Raftery A. E. (2021), “Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys,” Proceedings of the National Academy of Sciences, 118, e2103272118.
- Jewell S., Futoma J., Hannah L., Miller A. C., Foti N. J., and Fox E. B. (2021), “It’s complicated: characterizing the time-varying relationship between cell phone mobility and COVID-19 spread in the US,” npj Digital Medicine, 4, 152, ISSN 2398–6352.
- Jordan A., Krüger F., and Lerch S. (2019), “Evaluating probabilistic forecasts with scoringRules,” Journal of Statistical Software, 90, 1–37.
- Kim L., Garg S., O’Halloran A., Whitaker M., Pham H., Anderson E. J., Armistead I., Bennett N. M., Billing L., Como-Sabetti K., et al. (2021), “Risk factors for intensive care unit admission and in-hospital mortality among hospitalized adults identified through the US coronavirus disease 2019 (COVID-19)-associated hospitalization surveillance network (COVID-NET),” Clinical Infectious Diseases, 72, e206–e214.
- Knock E. S., Whittles L. K., Lees J. A., Perez-Guzman P. N., Verity R., FitzJohn R. G., Gaythorpe K. A. M., Imai N., Hinsley W., Okell L. C., Rosello A., Kantas N., Walters C. E., Bhatia S., Watson O. J., Whittaker C., Cattarino L., Boonyasiri A., Djaafara B. A., Fraser K., Fu H., Wang H., Xi X., Donnelly C. A., Jauneikaite E., Laydon D. J., White P. J., Ghani A. C., Ferguson N. M., Cori A., and Baguelin M. (2021), “Key epidemiological drivers and impact of interventions in the 2020 SARS-CoV-2 epidemic in England,” Science Translational Medicine, 13, eabg4262.
- Lekone P. and Finkenstädt B. (2006), “Statistical inference in a stochastic epidemic SEIR model with control intervention: Ebola as a case study,” Biometrics, 62, 1170–1177.
- Li J. and Brauer F. (2008), “Continuous-time age-structured models in population dynamics and epidemiology,” in Mathematical Epidemiology, edited by Brauer F., van den Driessche P., and Wu J., chapter 9, pages 205–227, Springer.
- Matheson J. E. and Winkler R. L. (1976), “Scoring rules for continuous probability distributions,” Management Science, 22, 1087–1096.
- Morozova O., Li Z. R., and Crawford F. W. (2021), “One year of modeling and forecasting COVID-19 transmission to support policymakers in Connecticut,” Scientific Reports, 11, 20271.
- Nguyen-Van-Yen B., Del Moral P., and Cazelles B. (2021), “Stochastic epidemic models inference and diagnosis with Poisson random measure data augmentation,” Mathematical Biosciences, 335, 108583.
- Pei S., Yamana T. K., Kandula S., Galanti M., and Shaman J. (2021), “Burden and characteristics of COVID-19 in the United States during 2020,” Nature, pages 1–18.
- Petrilli C. M., Jones S. A., Yang J., Rajagopalan H., O’Donnell L., Chernyak Y., Tobin K. A., Cerfolio R. J., Francois F., and Horwitz L. I. (2020), “Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study,” BMJ, 369.
- Pooley C., Bishop S., and Marion G. (2015), “Using model-based proposals for fast parameter inference on discrete state space, continuous-time Markov processes,” Journal of The Royal Society Interface, 12, 20150225.
- Prem K. et al. (2020), “The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study,” The Lancet Public Health, 5, e261–e270.
- Scott J. A., Gandy A., Mishra S., Unwin J., Flaxman S., and Bhatt S. (2020), “epidemia: Modeling of epidemics using hierarchical Bayesian models,” R package version 1.0.0.
- Song J.-W., Zhang C., Fan X., Meng F.-P., Xu Z., Xia P., Cao W.-J., Yang T., Dai X.-P., Wang S.-Y., et al. (2020), “Immunological and inflammatory profiles in mild and severe cases of COVID-19,” Nature Communications, 11, 1–10.
- Stan Development Team (2020), “RStan: the R interface to Stan,” R package version 2.21.2.
- Stokes A. C., Lundberg D. J., Elo I. T., Hempstead K., Bor J., and Preston S. H. (2021), “COVID-19 and excess mortality in the United States: A county-level analysis,” PLOS Medicine, 18, 1–18.
- Stoner O. and Economou T. (2020), “Multivariate hierarchical frameworks for modeling delayed reporting in count data,” Biometrics, 76, 789–798.
- United States Census Bureau (2020), “Quick facts: Orange County, California,” https://www.census.gov/quickfacts/orangecountycalifornia, accessed: 2020-09-05.
- Van den Driessche P. (2008), “Spatial structure: Patch models,” in Mathematical Epidemiology, edited by Brauer F., van den Driessche P., and Wu J., chapter 7, pages 179–189, Springer.
- Vehtari A., Gelman A., Simpson D., Carpenter B., and Bürkner P.-C. (2021), “Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing Convergence of MCMC (with Discussion),” Bayesian Analysis, 16, 667–718.
- WHO (2021), “World Health Organization Q&A: Coronavirus disease (COVID-19): How is it transmitted?” https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub/q-a-detail/coronavirus-disease-covid-19-how-is-it-transmitted, accessed: 2021-09-12.
- Wu Z. and McGoogan J. M. (2020), “Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention,” JAMA, 323, 1239–1242.
- Xin H., Li Y., Wu P., Li Z., Lau E. H. Y., Qin Y., Wang L., Cowling B. J., Tsang T. K., and Li Z. (2021), “Estimating the Latent Period of Coronavirus Disease 2019 (COVID-19),” Clinical Infectious Diseases, 74, 1678–1681, ISSN 1058–4838.