A machine learning model for nowcasting epidemic incidence

Saumya Yashmohini Sahai; Saket Gurukar; Wasiur R KhudaBukhsh; Srinivasan Parthasarathy; Grzegorz A Rempała

doi:10.1016/j.mbs.2021.108677

Math Biosci. 2021 Nov 27;343:108677. doi: 10.1016/j.mbs.2021.108677

PMCID: PMC8635898 PMID: 34848217

Abstract

Due to delay in reporting, the daily national and statewide COVID-19 incidence counts are often unreliable and need to be estimated from recent data. This process is known in economics as nowcasting. We describe in this paper a simple random forest statistical model for nowcasting the COVID-19 daily new infection counts based on historic data along with a set of simple covariates, such as the currently reported infection counts, day of the week, and time since first reporting. We apply the model to adjust the daily infection counts in Ohio, and show that the predictions from this simple data-driven method compare favorably both in quality and computational burden to those obtained from the state-of-the-art hierarchical Bayesian model employing a complex statistical algorithm. The interactive notebook for performing nowcasting is available online at https://tinyurl.com/simpleMLnowcasting.

MSC: 00-01, 99-00

Keywords: Nowcasting, Backfilling, COVID-19 incidence, Random forest

1. Introduction

The SARS-CoV-2 virus, first observed in the United States (USA) in January 2020 [1], [2], [3], is highly contagious [4] and has spread in both urban and rural regions [5], [6] of the USA. To gauge and combat the SARS-CoV-2 spread, governments and health organizations have set up public information systems such as COVID-19 dashboards [7], [8], [9], [10]. These dashboards are useful to brief the public [8] about the current state of COVID-19 in specific regions, make data-driven public health decisions [10], and improve transparency in governance [11]. Many of these dashboards show the number of daily new infections (daily incidence), where the infection count on a particular date refers to the number of people who started experiencing disease symptoms on that date (i.e., the onset date of illness). Whereas reporting onset dates is very useful from the viewpoint of contact tracing and disease spread monitoring, it is also challenging due to unavoidable delays [7], [12], [13]. These delays are often due to the time lags between experiencing initial symptoms and seeking care, receiving testing results, and updating the statewide records [14], [15]. As a consequence, incidence reporting based on onset counts leads to under-counting of the present and most recent cases. Dashboards often explicitly warn about this problem [16], [17]. Fig. 1 shows one such example from the COVID-19 Dashboard maintained by the Ohio Department of Health (ODH) [7], where the region of possible under-reporting is marked with a gray rectangle.

Fig. 1. Daywise COVID-19 cases in Ohio, as of 12-01-2020. The shaded area (comprising 21 days) is the preliminary case data and is likely under-reported to the ODH due to delayed reporting.

The incomplete current count data poses huge challenges for both local and national healthcare policymakers as they strive to make difficult public health decisions (e.g., introduce lockdowns, curfews, evaluate vaccination effects, etc.) in real time to limit the spread of the virus. The use of statistical methods to moderate the effects of incomplete data could help reduce uncertainty in public health decision-making during the COVID-19 pandemic and increase public awareness of the most recent disease trends.

While forecasting COVID-19 cases is typically concerned with predicting the future burden of the epidemic, nowcasting [18], [19], [20] addresses the problem of delayed reporting and focuses on the estimation of current case counts from not-too-distant historic data. Given the under-reported infection data for a particular date, nowcasting models estimate the total number of infections for that date that will eventually be reported. In the literature, there exist several sophisticated statistical methods for addressing the issue of nowcasting for COVID-19. For instance, Wu et al. [21] nowcast the probable size of the COVID-19 outbreak in Wuhan, China. The authors estimate the basic reproduction number $R_{0}$ from their proposed non-homogeneous counting process, which models the number of cases exported internationally from Wuhan together with the global human mobility data to and from Wuhan. The authors then use the estimated $R_{0}$ in the Susceptible–Exposed–Infected–Recovered (SEIR) model [22] for nowcasting and forecasting the outbreak's size. The nowcasting problem for delayed reporting of COVID-19 cases is also addressed by Silva et al. [23] and Greene et al. [13] using a Bayesian smoothing approach [19], where the authors model the delayed number of reported cases with their proposed Markov counting processes.

In this paper, we propose a simple yet efficient machine learning model that addresses the problem of nowcasting in a way that is easily understood by non-experts and therefore suitable for presenting to public health decision-makers. The only data our proposed model requires can be readily collected from publicly available dashboards. Despite its simplicity, the model is seen to predict, with high accuracy (measured with the typical regression-style $R^{2}$ value), the number of people who start experiencing COVID-19 symptoms on a particular date. We also show that our proposed model outperforms the state-of-the-art hierarchical Bayesian model [24] in terms of nowcasting accuracy while also being approximately 72000x faster. Our model predictions can also be utilized as input to other forecasting models, for instance, the ones created for ODH [25] that forecast the future number of infections and subsequent hospital burden in Ohio. Note that since the goal is to nowcast the state epidemic incidence curve, which is based on symptom onset, non-symptomatic cases are not accounted for.

2. Materials and methods

2.1. Data processing

To perform our analysis, we used the public data available on the ODH COVID-19 dashboard,^1 which is updated daily. It provides the daily partial incidence count, that is, the count $i_{t d}$ of all individuals reported on a given day $t$ to be confirmed COVID-19 cases with the day of onset $d$, where $d \leq t$. For our analysis we aggregated cases by the onset date to get the state-level progression of the onset reporting. This was done by pulling data from the dashboard every day: the dashboard provides the data for the onset days $d$, which we pull on each of the reporting days $t$. Accordingly, the infection count $I_{T D}$ on a specific day $T$ for a given specific onset date $D$ is given by

$$I_{T D} = \sum_{d \leq t \leq T} 1_{i_{t d}}, \tag{1}$$

where $1_{i_{t d}}$ is the indicator function

$$1_{i_{t d}} = \begin{cases} 1 & \text{where } t \leq T,\ d = D, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Note that for a given $D$, $I_{T D}$ is non-decreasing as a function of $T$ and, assuming that it is also bounded, it has a limit as $T \to \infty$. This is illustrated in Fig. 2, where we see that over the course of 52 days $I_{T D}$ becomes approximately constant.

Fig. 2 — Progression of infection count $I_{t d}$ for a specific $d$ value (11-01-2020) over $t$ ranging from 11-01-2020 to 12-22-2020. There is a steep rise in the infection count in the initial days of data collection as the data is backfilled, but it gradually stabilizes.
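The aggregation in Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the record layout (tuples of report day, onset day, and number of newly reported cases), the function name `cumulative_count`, and the toy numbers are all assumptions and are not taken from the ODH data.

```python
from datetime import date

def cumulative_count(reports, onset, as_of):
    """I_{TD}: cases with onset date `onset` reported on or before day `as_of`.

    `reports` is a hypothetical list of (t, d, n) tuples: n cases with onset
    date d that were newly reported on day t.
    """
    return sum(n for (t, d, n) in reports if d == onset and d <= t <= as_of)

# Toy records (made up for illustration only).
reports = [
    (date(2020, 11, 1), date(2020, 11, 1), 40),
    (date(2020, 11, 2), date(2020, 11, 1), 25),
    (date(2020, 11, 2), date(2020, 11, 2), 30),
    (date(2020, 11, 5), date(2020, 11, 1), 10),
]

D = date(2020, 11, 1)
series = [cumulative_count(reports, D, date(2020, 11, k)) for k in range(1, 6)]
print(series)  # [40, 65, 65, 65, 75]
```

The resulting series never decreases in $T$, which is the backfilling behaviour shown in Fig. 2.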

We denote the asymptotic stable value of $I_{T D}$ for an onset date $D$ by

$$I_{D}^{s} = \lim_{T \to \infty} I_{T D} \tag{3}$$

and define $F_{T D}$ as the amount of undercounting for a specific $D$ on day $T$ given by

$$F_{T D} = 1 - \frac{I_{T D}}{I_{D}^{s}}. \tag{4}$$

We may think about $F_{T D}$ as a standardized measure of undercounting that is also robust to changes in incidence rates during the course of the pandemic. In what follows, we therefore consider $F_{T D}$ in place of $I_{T D}$. Note that although in general $F_{T D} \to 0$ as $T \to \infty$, this convergence is not necessarily monotone and, in a fixed time window, $I_{T D}$ only approximately stabilizes as it approaches $I_{D}^{s}$. In order to improve data stability in the time windows of interest, we consider the $I_{T D}$ limit to be reached in practice as soon as $F_{T D} < 0.05$. This particular cutoff value was chosen by cross-validation over the range $[0, 0.5]$, as described in Section 2.2.
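As a small worked illustration of Eq. (4) and the 0.05 cutoff, the sketch below computes the undercounting fraction for a toy sequence of counts and finds the first day on which the count is treated as stable. The function names and the numbers are hypothetical, and using the last observed count as a stand-in for $I_{D}^{s}$ is an assumption made only for this example.

```python
def undercount_fraction(current, stable):
    """F_{TD} = 1 - I_{TD} / I_D^s, as in Eq. (4)."""
    return 1.0 - current / stable

def first_stable_index(counts, stable, cutoff=0.05):
    """Index of the first count with F_{TD} < cutoff, or None if never reached."""
    for k, c in enumerate(counts):
        if undercount_fraction(c, stable) < cutoff:
            return k
    return None

# Toy backfilled counts for one onset date (made up for illustration).
counts = [40, 65, 65, 65, 75, 78, 80]
stable = counts[-1]  # assumption: last observed value stands in for I_D^s

fractions = [undercount_fraction(c, stable) for c in counts]
k_stable = first_stable_index(counts, stable)
```

With these toy values the fraction falls from 0.5 on the first day to 0.0625 on the fifth, and the cutoff is first met on the sixth day (index 5).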

In order to cross-validate and measure the prediction testing error, the data to be used for nowcasting is split into a training and a validation (testing) set based on $t$, where all $F_{t d}$ with $t < T_{\text{train}}$ are in the former and those with $t > T_{\text{train}}$ are in the latter.
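A time-ordered split of this kind might look as follows. The boundary date is taken from the training and testing windows reported in Section 3 (training ends on 11-15-2020, testing starts on 11-16-2020); the row layout, the function name, and the choice to place rows with $t$ equal to the boundary in the test set are assumptions of this sketch.

```python
from datetime import date

# First day of the testing window reported in Section 3.
T_TRAIN = date(2020, 11, 16)

def split_by_report_day(rows, t_train=T_TRAIN):
    """Split rows on the reporting day t, never on the onset day d.

    Rows are hypothetical dicts with at least a "t" key (the reporting day).
    Rows with t == t_train go to the test set here (an assumption).
    """
    train = [r for r in rows if r["t"] < t_train]
    test = [r for r in rows if r["t"] >= t_train]
    return train, test

rows = [
    {"t": date(2020, 11, 15), "d": date(2020, 11, 10)},
    {"t": date(2020, 11, 16), "d": date(2020, 11, 10)},
    {"t": date(2020, 12, 1), "d": date(2020, 11, 27)},
]
train_rows, test_rows = split_by_report_day(rows)
```

Splitting on $t$ rather than at random keeps every test observation later in time than every training observation, which mirrors how a nowcast would be used in practice.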

2.2. Model

Covariates.

The model includes the following features to predict the $F_{t d}$ .

• Days since data collection ($Δ$). For any given infection count $I_{t d}$ reported on day $t$ with onset date $d$, we define this feature as $Δ_{t d} = t - d$. (5)
• Day of the week ($ω_{t}$). This categorical variable denotes the day of the week for $t$, on which the data is being reported, $ω_{t} \in$ {Mo, Tu, We, Th, Fr, Sa, Su}.
• Raw infection count ($I_{t d}$). This is the daily partial incidence count for the pandemic, as described in Eq. (1).
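The three covariates listed above can be derived from a single (report day, onset day, count) observation. The following is a minimal sketch; the function name `make_covariates` and the example values are hypothetical.

```python
from datetime import date

WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]

def make_covariates(t, d, count):
    """Return (Delta_td, omega_t, I_td) for report day t and onset day d."""
    if d > t:
        raise ValueError("onset date must not be later than the report date")
    delta = (t - d).days            # Eq. (5): days since data collection
    omega = WEEKDAYS[t.weekday()]   # date.weekday(): Monday is 0, Sunday is 6
    return delta, omega, count

# Example: onset on 11-27-2020, reported on 12-01-2020 (a Tuesday).
covariates = make_covariates(date(2020, 12, 1), date(2020, 11, 27), 120)
print(covariates)  # (4, 'Tu', 120)
```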

Random forest regression.

We train a random forest (RF) regression model [26] on the data partition defined in Section 2.1, to predict $F_{t d}$ from the covariates. Formally, we may write

$$F_{t d} = f(Δ_{t d}, Δ_{t d}^{2}, Δ_{t d}^{3}, ω_{t}, I_{t d}), \tag{6}$$

where $f$ is the RF model.
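To make Eq. (6) concrete, the sketch below fits a random forest regressor to synthetic data. Several things here are assumptions rather than details from the paper: the use of scikit-learn, the hyperparameters (`n_estimators=100`, `random_state=0`), the one-hot encoding of $ω_{t}$, and the synthetic reporting curve itself, which is invented purely so that the example runs without the Ohio data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the dashboard data (NOT the Ohio data).
rows = []
for d in range(60):                       # onset day index
    stable = 500 + 10 * d                 # hypothetical final count I_d^s
    for delta in range(21):               # Delta = t - d
        t = d + delta                     # reporting day index
        weekend = (t % 7) in (5, 6)
        # Hypothetical missing fraction: decays with Delta, slightly larger on weekends.
        frac = min(1.0, float(np.exp(-0.35 * delta)) * (1.05 if weekend else 1.0))
        count = stable * (1.0 - frac)     # partial count implied by Eq. (4)
        one_hot = [1.0 if t % 7 == k else 0.0 for k in range(7)]
        rows.append([t, frac, delta, delta**2, delta**3, count] + one_hot)

data = np.array(rows)
t, y, X = data[:, 0], data[:, 1], data[:, 2:]

# Time-ordered split on the reporting day, as in Section 2.1.
T_TRAIN = 45
train, test = t < T_TRAIN, t >= T_TRAIN

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[train], y[train])

r2_train = r2_score(y[train], model.predict(X[train]))
r2_test = r2_score(y[test], model.predict(X[test]))
importances = model.feature_importances_  # impurity-based importances
```

The columns of `X` are ordered as $Δ$, $Δ^{2}$, $Δ^{3}$, $I$, followed by seven day-of-week indicators. Because a tree split on $Δ$, $Δ^{2}$ or $Δ^{3}$ induces the same partition for non-negative $Δ$, the forest may spread importance across these three columns.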

3. Results

Goodness of fit.

The explained variance ( $R^{2}$ value) is used to evaluate the goodness of fit of the model on both the training data (time window from 10-01-2020 to 11-15-2020) and on the testing data (time window from 11-16-2020 to 12-15-2020). The predictions from the fitted model plotted against the true values in test data can be seen in Fig. 3. The explained variance is 0.99 on the training data and 0.89 on the testing data, which shows that the model’s prediction of $F_{t d}$ generalizes well to the unseen data.

Fig. 3 — Actual vs Predicted $F_{t d}$ on the testing dataset. Robust prediction of $F_{t d}$ is crucial for correct prediction of final infection count $I_{d}^{s}$ .
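For reference, the $R^{2}$ value can be computed directly from its usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a plain-Python sketch with made-up numbers; it is not the evaluation code behind the reported values of 0.99 and 0.89.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Toy example: SS_res = 1 and SS_tot = 5, so R^2 is 0.8.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
score = r_squared(y_true, y_pred)
```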

Importance of covariates.

The relative importance of covariates (the Gini importance or the mean decrease in impurity) in the fitted model described in Section 2.2 can be seen in Table 1. The covariate days since data collection ($Δ$), along with its quadratic and cubic transforms, turn out to be the most important features in determining the fraction of missing data $F_{t d}$. The day of the week $ω_{t}$ has much less relative importance.

Table 1.

Relative (Gini) importance of covariates. Days since data collection ( $Δ$ ) and its transformations are the most important, with day of the week $(ω_{t})$ having the least effect.

Covariate	$Δ^{2}$	$Δ^{3}$	$Δ$	$I$	$ω_{t} = Th$	$ω_{t} = Tu$	$ω_{t} = We$	$ω_{t} = Fr$	$ω_{t} = Mo$	$ω_{t} = Sa$	$ω_{t} = Su$
Importance	0.337	0.325	0.311	0.013	0.003	0.003	0.003	0.002	0.001	0.001	0.001
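Grouping the entries of Table 1 makes the pattern easier to see. The short sketch below only sums the values printed in the table; the dictionary keys are labels chosen for this example.

```python
# Values copied from Table 1.
importance = {
    "Delta^2": 0.337, "Delta^3": 0.325, "Delta": 0.311, "I": 0.013,
    "Th": 0.003, "Tu": 0.003, "We": 0.003, "Fr": 0.002,
    "Mo": 0.001, "Sa": 0.001, "Su": 0.001,
}

delta_total = sum(importance[k] for k in ("Delta", "Delta^2", "Delta^3"))
weekday_total = sum(importance[k] for k in ("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"))
total = sum(importance.values())

print(round(delta_total, 3), round(weekday_total, 3), round(total, 3))
# 0.973 0.014 1.0
```

Taken together, the three $Δ$ terms account for about 97% of the total importance, while the seven day-of-week indicators together account for about 1.4%.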

Prediction of missingness $F_{t d}$ .

Fig. 4 shows the prediction of $F_{t d}$ for different values of $Δ_{t d}$. As seen from the plot, the model predictions are close to the true $F_{t d}$ when $Δ > 4$. The good agreement at $Δ = 0$ is trivial, as on the first date of collection $F_{t d}$ is almost always close to 1.0 and thus easy to predict. It is also evident that the first 3–4 days of data collection are unreliable for predicting the correct $F_{t d}$ and should therefore be used cautiously in the nowcasting predictions.

Actual count prediction.

Based on the prediction of $F_{t d}$ and the current observed count $I_{t d}$, we use Eq. (4) to get the estimate of $I_{d}^{s}$, which is the stable value of the infection count on day $d$. The typical trends for 4 different days of the week can be seen in Fig. 5. The infection count from the model predicts the stable value $I_{d}^{s}$ robustly after five days (starting from $Δ = 5$), and in some cases even earlier. In Fig. 5 we see that, irrespective of the day of the week (Monday, Wednesday, Friday, Sunday), the model predicts the value of $I_{d}^{s}$ with good accuracy. We may also note that on Monday and Sunday the model predictions have higher uncertainty, likely due to the effect of the weekend slowdown in test processing.

Fig. 5 — Prediction of raw infection count.
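Solving Eq. (4) for the stable count gives $I_{d}^{s} = I_{t d} / (1 - F_{t d})$, so a predicted fraction can be turned into a nowcast with a single division. The sketch below shows this step; the function name and the example numbers are hypothetical.

```python
def nowcast_stable_count(observed, predicted_fraction):
    """Invert Eq. (4): I_d^s = I_td / (1 - F_td)."""
    if not 0.0 <= predicted_fraction < 1.0:
        raise ValueError("predicted fraction must lie in [0, 1)")
    return observed / (1.0 - predicted_fraction)

# Example: 150 cases observed so far, with a predicted missing fraction of 0.25.
estimate = nowcast_stable_count(150, 0.25)
print(estimate)  # 200.0
```

When the predicted fraction is close to 1, as on the first days after onset, the denominator is close to zero and small errors in the predicted fraction are strongly amplified in the estimate.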

Comparison with the Bayesian model.

In order to provide some context for assessing the quality of the RF model predictions, we compare our results with a state-of-the-art hierarchical Bayesian model proposed recently by Kline et al. [24], which has been used for the same purpose of nowcasting COVID-19 cases in the state of Ohio. The model, which we refer to as the Bayesian model (BM) in the following, is more elaborate than ours as it also has a spatial component. Specifically, it keeps track of COVID-19 cases over time in different geographical regions (counties in Ohio). Although in our comparison we aggregate BM spatial counts, for the sake of completeness we briefly describe here the entire model along with its spatial component. Denoting by $Y_{i, t}$ the true count of cases in county $i$ with onset date $t$, the BM assumes the following Poisson model for the dynamics of the disease:

$$Y_{i, t} \sim \mathrm{Poisson}(\exp(O_{i} + α_{i, t} + X_{t} η_{i})), \tag{7}$$

where $O_{i}$ is an offset of the logarithm of population of county $i$, the spatio-temporal random variables $α_{i, t}$ are the latent states of the process, the design vector $X_{t}$ indicates the day of the week, and the vector $η_{i}$ captures the day of the week effect. It is assumed that $Y_{i, t}$ is only partially observed for time $t > T_{max} - D$, where $T_{max}$ stands for the last onset date and $D$ (assumed $30$ in [24]) is the maximum reporting delay following onset. BM also uses a semi-local linear trend model [27] for the spatio-temporal random variables $α_{i, t}$. Further, the spatial correlation is accounted for using an intrinsic conditional auto-regressive model. The reporting delay is described by a Multinomial-Dirichlet model as follows. Denoting by $Z_{i, t, d}$ the count of cases in county $i$ with onset date $t$ that are observed $d$ days after $t$, one defines $Z_{i, t} = (Z_{i, t, 0}, Z_{i, t, 1}, \dots, Z_{i, t, D})$. Then, the Multinomial-Dirichlet model prescribes

$$Z_{i, t} \sim \mathrm{Multinomial}(p_{i, t}, Y_{i, t}),$$

$$p_{i, t} \sim \mathrm{GeneralizedDirichlet}(a_{i, t}, b_{i, t}),$$

where the vectors $a_{i, t}$ and $b_{i, t}$ are described in terms of mean and dispersion parameters [28]. The choice of a Generalized Dirichlet distribution allows for modeling potential overdispersion in $p_{i, t}$ (see [28]). Moreover, it leads to a convenient Beta-Binomial conditional distribution representation for the components $Z_{i, t, d}$. For the purpose of Bayesian analysis, the authors specify normally distributed priors for the parameters and use the R package nimble to perform a Markov chain Monte Carlo (MCMC) algorithm. The authors report a run time of approximately 20 h for 30,000 iterations.
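The Beta-Binomial representation mentioned above can be illustrated with a small forward simulation. The sketch assumes the standard stick-breaking construction of the Generalized Dirichlet distribution, in which each delay bin takes a Beta-distributed share of the cases not yet reported. It is not the nimble implementation of [24]; the function name and the parameter values are hypothetical, and for simplicity the total count is passed in directly rather than drawn from Eq. (7).

```python
import random

def sample_delay_counts(total, a, b, rng):
    """Split `total` cases across len(a) + 1 delay bins by stick-breaking.

    For each bin k, a Beta(a[k], b[k]) draw gives the conditional probability
    that a still-unreported case is reported with delay k; the number reported
    is then Binomial(remaining, v). The last bin receives whatever is left.
    """
    remaining = total
    counts = []
    for a_k, b_k in zip(a, b):
        v = rng.betavariate(a_k, b_k)
        z = sum(rng.random() < v for _ in range(remaining))  # Binomial(remaining, v)
        counts.append(z)
        remaining -= z
    counts.append(remaining)
    return counts

rng = random.Random(0)
a = [2.0, 2.0, 2.0]   # hypothetical Beta parameters
b = [5.0, 3.0, 1.0]
counts = sample_delay_counts(200, a, b, rng)
```

By construction the simulated counts are non-negative and add up to the total, which corresponds to the requirement that every case is eventually reported within the maximum delay.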

In Figs. 6 and 7 we visually compare the nowcasts of the two models and see in particular that the RF enjoys narrower uncertainty bounds and less bias than the corresponding BM. In order to quantify this difference more formally, we calculate the $L_{2}$ distance between the predictions made by the RF and the Bayesian model, respectively, and the actual known stable values in the Ohio COVID-19 daily counts dataset. We report the ratio of the two $L_{2}$ distance values as a measure of relative closeness of the models to the true (stable) data value for days $T - 10$ to $T$ and $T - 10$ to $T - 5$, where $T$ is the last available date in the data. The results are presented in Table 2. As can be seen in the table, the predictions by the random forest model are relatively closer to the true values than those generated using the Bayesian model estimates. The ratio is smaller in the full 10-day window, indicating that the RF model makes better predictions than BM for days that are close to data collection.

Table 2.

The ratio (RF/BM) of the $L_{2}$ distances of the nowcasted predictions of the random forest model (RF) and the Bayesian model (BM) from the true stable values, over two time windows ending at $T$ and $T - 5$. Ratio values below one indicate that in both cases the RF model performs better than BM.

Window	$T - 10 : T$	$T - 10 : T - 5$
RF/BM	0.565	0.726
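The ratio reported in Table 2 can be computed as sketched below. The function names and the toy numbers are hypothetical and unrelated to the Ohio data; the sketch only shows how such a ratio of $L_{2}$ distances is formed.

```python
import math

def l2_distance(pred, truth):
    """Euclidean (L2) distance between predictions and true stable values."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(pred, truth)))

def l2_ratio(pred_rf, pred_bm, truth):
    """Ratio RF/BM of L2 distances; values below one favour the RF model."""
    return l2_distance(pred_rf, truth) / l2_distance(pred_bm, truth)

# Toy stable counts and nowcasts over a three-day window (made up).
truth = [100, 110, 120]
pred_rf = [97, 114, 120]   # L2 distance sqrt(9 + 16) = 5
pred_bm = [94, 118, 120]   # L2 distance sqrt(36 + 64) = 10
ratio = l2_ratio(pred_rf, pred_bm, truth)
print(ratio)  # 0.5
```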

4. Summary and discussion

We presented here a simple method for nowcasting COVID-19 cases from historic data on daily incidence of new cases, as measured by the onset of symptoms. This type of data is now widely available for all states in the USA as well as for most countries in the world. When the need to take immediate decisions on governance or policy arises, nowcasting can be a useful tool in providing more accurate estimates of disease incidence and spread. Specifically, our proposed nowcasting algorithm uses a random forest (RF) regression methodology and leverages covariates that are based on the day of the week, the number of days passed since first data collection, and the total incidence so far.

The proposed algorithm is both conceptually simple and computationally efficient. Our results also suggest that it compares favorably with a much more elaborate Bayesian model. We have illustrated the application of our approach on publicly available data from COVID-19 daily onsets in Ohio, as available from the state's COVID-19 interactive dashboard. We observed that the model is able to predict the final incidence for a day within 3 to 4 days of data collection. We also find that the number of days passed since first data collection, along with its transformations (or derivatives), are the most important covariates in predicting the final incidence.

The proposed model learns from the specific epidemic curve (in our case COVID-19) and depends on how this curve is updated. In our study, we have nowcasted epidemic incidence for Ohio. The process of updating is highly dependent on the part of the country, population density, availability of testing, and reporting by local health departments. It is likely that data from a different geographic region will lead to a different learned model. There could be some level of nowcasting similarity across different geographical regions of the country, and our method could be used to help identify such cases. This can be a potential follow-up to our work.

In order to make our RF method predictions broadly available to the interested researchers and practitioners, we have created a publicly available and accessible interactive notebook (see below). As described in the repository, the notebook allows one to use our algorithm to nowcast current COVID-19 onset occurrences, based on any user-provided historic data supplied in appropriate format.

The problem of nowcasting historic data is an important one, especially during the current COVID-19 pandemic, when delays in reporting can snowball into sub-optimal policies and actions that can cost lives and create unnecessary societal burden. Our proposed method allows both the general public and health providers to carefully monitor the pandemic trends and make informed decisions. The ideas we presented, while focused on COVID-19, can be broadly applicable to similar public health problems in the future.

Software availability

The interactive self-contained notebook for performing the nowcasting using the random forest approach described in the paper, along with installation instructions, is freely available at https://zenodo.org/badge/latestdoi/346708110. Additionally, the web-based version of the interactive notebook is available at https://tinyurl.com/simpleMLnowcasting.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was partially funded by NSF, United States of America grants DMS-1853587 and DMS-2027001 to GAR. The work of WKB was supported by the President's Postdoctoral Scholars Program (PPSP) of the Ohio State University, United States of America. We would like to thank Harley Vossler for providing helpful feedback on the interactive notebook.

Footnotes

1. https://coronavirus.ohio.gov/wps/portal/gov/covid-19/dashboards/overview.

References

1. Bedford T., Greninger A.L., Roychoudhury P., Starita L.M., Famulare M., Huang M.-L., Nalla A., Pepper G., Reinhardt A., Xie H., et al. Cryptic transmission of SARS-CoV-2 in Washington state. Science. 2020;370(6516):571–575. doi: 10.1126/science.abc0523.
2. 2020. Centers for Disease Control and Prevention, First travel-related case of 2019 novel coronavirus detected in United States. https://www.cdc.gov/media/releases/2020/p0121-novel-coronavirus-travel-case.html.
3. Fauver J.R., Petrone M.E., Hodcroft E.B., Shioda K., Ehrlich H.Y., Watts A.G., Vogels C.B., Brito A.F., Alpert T., Muyombwe A., et al. Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States. Cell. 2020;181(5):990–996. doi: 10.1016/j.cell.2020.04.021.
4. 2020. Transmission of SARS-CoV-2: implications for infection prevention precautions, COVID-19 data dashboard. https://www.who.int/news-room/commentaries/detail/transmission-of-sars-cov-2-implications-for-infection-prevention-precautions.
5. Paul R., Arif A.A., Adeyemi O., Ghosh S., Han D. Progression of COVID-19 from urban to rural areas in the United States: a spatiotemporal analysis of prevalence rates. J. Rural Health. 2020;36(4):591–601. doi: 10.1111/jrh.12486.
6. Mueller J.T., McConnell K., Burow P.B., Pofahl K., Merdjanoff A.A., Farrell J. Impacts of the COVID-19 pandemic on rural America. Proc. Natl. Acad. Sci. 118 (1).
7. 2021. Ohio Department of Health, COVID-19 dashboard. https://coronavirus.ohio.gov/wps/portal/gov/covid-19/dashboards.
8. 2021. NYC Health, COVID-19: Data. https://www1.nyc.gov/site/doh/covid/covid-19-data.page.
9. 2021. California All, Tracking COVID-19 in California. https://covid19.ca.gov/state-dashboard/
10. 2021. Utah Department of Health, Phased guidelines for the general public and businesses to maximize public health and economic reactivation version 4.5. https://coronavirus-download.utah.gov/Health/Phased_Health_Guidelines_V4.5.3_05262020.pdf.
11. Fell L. Trust and COVID-19: Implications for interpersonal, workplace, institutional, and information-based trust. Dig. Gov.: Res. Pract. 2020;2(1):1–5.
12. Harris J.E. Overcoming reporting delays is critical to timely epidemic monitoring: The case of COVID-19 in New York City. MedRxiv.
13. Greene S.K., McGough S.F., Culp G.M., Graf L.E., Lipsitch M., Menzies N.A., Kahn R. Nowcasting for real-time COVID-19 tracking in New York City: An evaluation using reportable disease data from early in the pandemic. JMIR Public Health Surveill. 2021;7(1) doi: 10.2196/25538.
14. 2020. The Wall Street Journal, Covid-19 data reporting system gets off to rocky start. https://www.wsj.com/articles/covid-19-data-reporting-system-gets-off-to-rocky-start-11597178974.
15. 2020. Governor of Ohio, COVID-19 update: Antigen testing, K-12 education update, DataOhio portal. https://governor.ohio.gov/wps/portal/gov/governor/media/news-and-media/covid19-update-12072020.
16. 2021. World Health Organization, WHO coronavirus disease (COVID-19) dashboard. https://covid19.who.int/
17. 2021. Washington State Department of Health, COVID-19 data dashboard. https://www.doh.wa.gov/Emergencies/COVID19/DataDashboard.
18. van de Kassteele J., Eilers P.H., Wallinga J. Nowcasting the number of new symptomatic cases during infectious disease outbreaks using constrained P-spline smoothing. Epidemiol. (Cambridge Mass.) 2019;30(5):737. doi: 10.1097/EDE.0000000000001050.
19. McGough S.F., Johansson M.A., Lipsitch M., Menzies N.A. Nowcasting by Bayesian smoothing: A flexible, generalizable model for real-time epidemic tracking. PLoS Comput. Biol. 2020;16(4) doi: 10.1371/journal.pcbi.1007735.
20. Lawless J. Adjustments for reporting delays and the prediction of occurred but not reported events. Canad. J. Statist. 1994;22(1):15–31.
21. Wu J.T., Leung K., Leung G.M. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. Lancet. 2020;395(10225):689–697. doi: 10.1016/S0140-6736(20)30260-9.
22. Aron J.L., Schwartz I.B. Seasonality and period-doubling bifurcations in an epidemic model. J. Theoret. Biol. 1984;110(4):665–679. doi: 10.1016/s0022-5193(84)80150-2.
23. d. Silva A.A.M., Lima-Neto L.G., d. Costa L.M.M., Bragança M.L.B.M., Barros Filho A.K.D., Wittlin B.B., d. Souza B.F., d. Oliveira B.L.C.A., d. Carvalho C.A., Thomaz E.B.A.F., et al. Population-based seroprevalence of SARS-CoV-2 and the herd immunity threshold in Maranhão. Rev. Saúde Públ. 2020;54:131. doi: 10.11606/s1518-8787.2020054003278.
24. Kline D., Hyder A., Liu E., Rayo M., Malloy S., Root E. A Bayesian spatio-temporal nowcasting model for public health decision-making and surveillance. arXiv preprint arXiv:2102.04544.
25. 2021. Infectious Disease Institute (IDI) COVID-19 response modeling team at the Ohio State University, Predicting COVID-19 cases and subsequent hospital burden in Ohio. https://idi.osu.edu/assets/pdfs/covid_response_white_paper.pdf.
26. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
27. Brodersen K.H., Gallusser F., Koehler J., Remy N., Scott S.L. Inferring causal impact using Bayesian structural time-series models. Ann. Appl. Stat. 2015;9(1):247–274.
28. Stoner O., Economou T. Multivariate hierarchical frameworks for modeling delayed reporting in count data. Biometrics. 2020;76(3):789–798. doi: 10.1111/biom.13188.
