2021 Nov 27;343:108677. doi: 10.1016/j.mbs.2021.108677

A machine learning model for nowcasting epidemic incidence

Saumya Yashmohini Sahai a, Saket Gurukar a, Wasiur R KhudaBukhsh b, Srinivasan Parthasarathy a, Grzegorz A Rempała b
PMCID: PMC8635898  PMID: 34848217

Abstract

Due to delay in reporting, the daily national and statewide COVID-19 incidence counts are often unreliable and need to be estimated from recent data. This process is known in economics as nowcasting. We describe in this paper a simple random forest statistical model for nowcasting the COVID-19 daily new infection counts based on historic data along with a set of simple covariates, such as the currently reported infection counts, day of the week, and time since first reporting. We apply the model to adjust the daily infection counts in Ohio, and show that the predictions from this simple data-driven method compare favorably both in quality and computational burden to those obtained from the state-of-the-art hierarchical Bayesian model employing a complex statistical algorithm. The interactive notebook for performing nowcasting is available online at https://tinyurl.com/simpleMLnowcasting.

MSC: 00-01, 99-00

Keywords: Nowcasting, Backfilling, COVID-19 incidence, Random forest

1. Introduction

The SARS-CoV-2 virus, first observed in the United States (USA) in January 2020 [1], [2], [3], is highly contagious [4] and has spread in both urban and rural regions [5], [6] of the USA. To gauge and combat the SARS-CoV-2 spread, governments and health organizations have set up public information systems such as COVID-19 dashboards [7], [8], [9], [10]. These dashboards are useful to brief the public [8] about the current state of COVID-19 in specific regions, make data-driven public health decisions [10], and improve transparency in governance [11]. Many of these dashboards show the number of daily new infections (daily incidence), where the infection count on a particular date refers to the number of people who started experiencing disease symptoms on that date (i.e., the onset date of illness). Whereas reporting onset dates is very useful from the viewpoint of contact tracing and disease spread monitoring, it is also challenging due to unavoidable delays [7], [12], [13]. These delays are often due to the time lags between experiencing initial symptoms and seeking care, receiving testing results, and updating the statewide records [14], [15]. As a consequence, incidence reporting based on onset counts leads to under-counting of the present and most recent cases. Dashboards often explicitly warn about this problem [16], [17]. Fig. 1 shows one such example from the COVID-19 dashboard maintained by the Ohio Department of Health (ODH) [7], where the region of possible under-reporting is marked with a gray rectangle.

Fig. 1.

Daily COVID-19 cases in Ohio, as of 12-01-2020. The shaded area, comprising 21 days, is the preliminary case data and is likely under-reported to the ODH due to delayed reporting.

The incomplete current count data poses huge challenges for both local and national healthcare policymakers as they strive to make difficult public health decisions (e.g., introduce lockdowns, curfews, evaluate vaccination effects, etc.) in real time to limit the spread of the virus. The use of statistical methods to moderate the effects of incomplete data could help reduce uncertainty in public health decision-making during the COVID-19 pandemic and increase public awareness of the most recent disease trends.

While forecasting COVID-19 cases is typically concerned with predicting the future burden of the epidemic, nowcasting [18], [19], [20] addresses the problem of delayed reporting and focuses on the estimation of current case counts from not-too-distant historic data. Given the under-reported infection data for a particular date, the nowcasting models estimate the total number of current infections for that date, which will be reported eventually. In the literature, there exist several sophisticated statistical methods for addressing the issue of nowcasting for COVID-19. For instance, Wu et al. [21] nowcast the probable size of the COVID-19 outbreak in Wuhan, China. The authors estimate the basic reproduction number $R_0$ from their proposed non-homogeneous counting process modeling the exported number of international cases from Wuhan and the global human mobility data from/to Wuhan. The authors then used the estimated $R_0$ in the Susceptible–Exposed–Infected–Recovered (SEIR) model [22] for nowcasting and forecasting the outbreak’s size. The nowcasting problem for delayed reporting of COVID-19 cases is also addressed by Silva et al. [23] and Greene et al. [13] using a Bayesian smoothing approach [19], where the authors model the delayed number of reported cases with their proposed Markov counting processes.

In this paper, we propose a simple yet efficient machine learning model that addresses the problem of nowcasting in a way that is easily understood by non-experts and therefore suitable for presenting to public health decision-makers. The only data our proposed model requires can be readily collected from publicly available dashboards. Despite its simplicity, the model is seen to predict, with high accuracy (measured with the typical regression-style $R^2$ value), the number of people who start experiencing COVID-19 symptoms on a particular date. We also show that our proposed model outperforms the state-of-the-art hierarchical Bayesian model [24] in terms of nowcasting accuracy while also being approximately 72,000 times faster. Our model predictions can also be utilized as input to other forecasting models, for instance, the ones created for ODH [25] that forecast the future number of infections and subsequent hospital burden in Ohio. Note that since the goal is to nowcast the state epidemic incidence curve, there is no accounting for non-symptomatic cases.

2. Materials and methods

2.1. Data processing

To perform our analysis, we used the public data available at the ODH COVID-19 dashboard, which is updated daily. It provides the daily partial incidence count, that is, the count $i_{td}$ of all individuals reported on a given day $t$ to be confirmed COVID-19 cases with the day of onset $d$, where $d \le t$. For our analysis we aggregated cases by the onset date to get the state-level progression of the onset reporting. This was done by pulling data from the dashboard every day: each pull, made on a reporting day $t$, provides the counts for the onset dates $d$ up to that day. Accordingly, the infection count $I_{TD}$ on a specific day $T$ for a given onset date $D$ is given by

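$$I_{TD} = \sum_{d \le t \le T} \mathbb{1}_{i_{td}}, \tag{1}$$

where $\mathbb{1}_{i_{td}}$ is the indicator function

$$\mathbb{1}_{i_{td}} = \begin{cases} 1 & \text{where } t \le T,\ d = D, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$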

Note that for a given $D$, $I_{TD}$ is non-decreasing as a function of $T$ and, assuming that it is also bounded, it has a limit as $T \to \infty$. This is illustrated in Fig. 2, where we see that over the course of 52 days $I_{TD}$ becomes approximately constant.
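The accumulation behind Eq. (1) can be sketched in a few lines of Python. This is only an illustration under assumptions: it reads Eq. (1) as a running total of the cases newly reported on each day $t$, and the dictionary layout, the function name `cumulative_counts`, and the numbers are hypothetical. If the daily dashboard pulls already contain totals per onset date, each pull gives $I_{TD}$ directly and no summation is needed.

```python
from datetime import date

def cumulative_counts(partial_counts, onset):
    """Return {T: I_TD} for a single onset date D = onset.

    partial_counts maps (report_day t, onset_day d) -> number of cases
    newly reported on day t with onset day d (hypothetical layout).
    """
    daily = sorted(
        (t, n) for (t, d), n in partial_counts.items()
        if d == onset and t >= onset
    )
    running, series = 0, {}
    for t, n in daily:
        running += n         # add the cases first reported on day t
        series[t] = running  # running total up to and including T = t
    return series

# Made-up numbers, for illustration only.
partial = {
    (date(2020, 11, 1), date(2020, 11, 1)): 40,
    (date(2020, 11, 2), date(2020, 11, 1)): 25,
    (date(2020, 11, 3), date(2020, 11, 1)): 10,
    (date(2020, 11, 3), date(2020, 11, 2)): 30,  # different onset date, ignored
}
series = cumulative_counts(partial, date(2020, 11, 1))
```

The resulting series is non-decreasing by construction, which matches the monotonicity noted above.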

Fig. 2.

Progression of infection count $I_{td}$ for a specific $d$ value (11-01-2020) over $t$ ranging from 11-01-2020 to 12-22-2020. There is a steep rise in the infection count in the initial days of data collection as the data is backfilled, but it gradually stabilizes.

We denote the asymptotic stable value of $I_{TD}$ for an onset date $D$ by

$$I_D^s = \lim_{T \to \infty} I_{TD} \tag{3}$$

and define $F_{TD}$ as the amount of undercounting for a specific $D$ on day $T$, given by

$$F_{TD} = 1 - \frac{I_{TD}}{I_D^s}. \tag{4}$$

We may think of $F_{TD}$ as a standardized measure of undercounting that is also robust to changes in incidence rates during the course of the pandemic. In what follows, we therefore consider $F_{TD}$ in place of $I_{TD}$. Note that although in general $F_{TD} \to 0$ as $T \to \infty$, this convergence is not necessarily monotone, and in a fixed time window $I_{TD}$ only approximately stabilizes as it approaches $I_D^s$. In order to improve data stability in the time windows of interest, we consider the $I_{TD}$ limit to be reached in practice as soon as $F_{TD} < 0.05$. This particular cutoff value was chosen by cross-validation over the range $[0, 0.5]$, as described in Section 2.2.
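As a rough illustration of Eq. (4) and the 0.05 cutoff, the sketch below computes the missing fraction for a toy series and finds the first day on which the count would be treated as stable. The function names and the numbers are hypothetical, and the stable value is simply assumed to be known.

```python
def missing_fraction(i_td, stable):
    """Eq. (4): F_TD = 1 - I_TD / I_D^s."""
    if stable <= 0:
        raise ValueError("stable count must be positive")
    return 1.0 - i_td / stable

def first_stable_day(counts, stable, cutoff=0.05):
    """Return the first index T with F_TD < cutoff, or None if never reached."""
    for day, count in enumerate(counts):
        if missing_fraction(count, stable) < cutoff:
            return day
    return None

# Made-up I_TD values; the index is the number of days since onset.
counts = [40, 65, 75, 90, 96, 99, 100]
fractions = [round(missing_fraction(c, 100), 2) for c in counts]
stable_day = first_stable_day(counts, 100)
```

With these toy numbers the missing fraction drops below the cutoff on day 4, when 96 of the eventual 100 cases have been reported.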

In order to cross-validate and measure the prediction testing error, the data to be used for nowcasting are split into a training and a validation (testing) set based on $t$: all $F_{td}$ with $t < T_{\text{train}}$ are in the former and those with $t > T_{\text{train}}$ are in the latter.
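A minimal sketch of such a time-based split is given below. The row layout (a dict with a `"t"` key) and the function name are assumptions made for illustration. Following the inequalities in the text literally, rows reported exactly on the split day fall into neither set.

```python
from datetime import date

def split_by_report_day(rows, t_train):
    """Split rows into (train, test) by reporting day t.

    Rows with t < t_train go to train and rows with t > t_train go to test;
    rows with t == t_train are left out under this literal reading.
    """
    train = [r for r in rows if r["t"] < t_train]
    test = [r for r in rows if r["t"] > t_train]
    return train, test

# Made-up rows, for illustration only.
rows = [
    {"t": date(2020, 11, 14), "F": 0.30},
    {"t": date(2020, 11, 15), "F": 0.22},
    {"t": date(2020, 11, 16), "F": 0.18},
    {"t": date(2020, 11, 17), "F": 0.41},
]
train, test = split_by_report_day(rows, date(2020, 11, 16))
```

Splitting on the reporting day rather than at random keeps every training observation earlier than every test observation, so the test error reflects prediction on genuinely later data.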

2.2. Model

Covariates.

The model includes the following features to predict $F_{td}$.

  • Days since data collection ($\Delta$). For any given infection count $I_{td}$ reported on day $t$ with onset date $d$, we define this feature as
    $\Delta_{td} = t - d$. (5)
  • Day of the week ($\omega_t$). This categorical variable denotes the day of the week of $t$, the day on which the data are reported: $\omega_t \in \{\text{Mo, Tu, We, Th, Fr, Sa, Su}\}$.
  • Raw infection count ($I_{td}$). This is the daily partial incidence count for the pandemic, as described in Eq. (1).

Random forest regression.

We train a random forest (RF) regression model [26] on the data partition defined in Section 2.1 to predict $F_{td}$ from the covariates. Formally, we may write

$$F_{td} = f(\Delta_{td}, \Delta_{td}^2, \Delta_{td}^3, \omega_t, I_{td}), \tag{6}$$

where $f$ is the RF model.
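The input to $f$ in Eq. (6) can be assembled as one numeric row per $(t, d)$ pair. The sketch below is an assumption about how this might be done, not the authors' notebook code: the function name, the column order, and the one-hot encoding of the weekday are illustrative choices (the separate weekday entries in Table 1 suggest some such encoding, but the text does not spell it out).

```python
from datetime import date

# date.weekday() returns 0 for Monday, so this order matches its indices.
WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]

def feature_row(t, d, i_td):
    """Build [delta, delta^2, delta^3, one-hot weekday of t (7 values), I_td]."""
    delta = (t - d).days
    if delta < 0:
        raise ValueError("reporting day t must not precede onset day d")
    weekday_one_hot = [1 if t.weekday() == k else 0 for k in range(len(WEEKDAYS))]
    return [delta, delta ** 2, delta ** 3, *weekday_one_hot, i_td]

# Reporting day 2020-11-16 (a Monday), onset day 2020-11-12, made-up count 120.
row = feature_row(date(2020, 11, 16), date(2020, 11, 12), 120)
```

Rows of this form, paired with the corresponding $F_{td}$ targets, could then be passed to any random forest regression implementation.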

3. Results

Goodness of fit.

The explained variance ($R^2$ value) is used to evaluate the goodness of fit of the model on both the training data (time window from 10-01-2020 to 11-15-2020) and the testing data (time window from 11-16-2020 to 12-15-2020). The predictions from the fitted model plotted against the true values in the test data can be seen in Fig. 3. The explained variance is 0.99 on the training data and 0.89 on the testing data, which shows that the model’s prediction of $F_{td}$ generalizes well to unseen data.
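For readers who want to reproduce this kind of score, the sketch below computes $R^2$ as one minus the ratio of the residual to the total sum of squares. It assumes that the reported value is this usual coefficient of determination; the function name and the toy numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot

# Made-up missing fractions and predictions, for illustration only.
f_true = [0.9, 0.5, 0.2, 0.0]
f_pred = [0.8, 0.6, 0.2, 0.1]
score = r_squared(f_true, f_pred)
```

A perfect prediction gives a score of 1, and a model that always predicts the mean of the true values gives 0.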

Fig. 3.

Actual vs. predicted $F_{td}$ on the testing dataset. Robust prediction of $F_{td}$ is crucial for correct prediction of the final infection count $I_d^s$.

Importance of covariates.

The relative importance of the covariates (the Gini importance, or mean decrease in impurity) in the fitted model described in Section 2.2 can be seen in Table 1. The covariate days since data collection ($\Delta$), along with its quadratic and cubic transforms, turns out to be the most important feature in determining the fraction of missing data $F_{td}$. The day of the week $\omega_t$ has much less relative importance.

Table 1.

Relative (Gini) importance of covariates. Days since data collection ($\Delta$) and its transformations are the most important, with day of the week ($\omega_t$) having the least effect.

Covariate    Importance
Δ²           0.337
Δ³           0.325
Δ            0.311
I            0.013
ω_t = Th     0.003
ω_t = Tu     0.003
ω_t = We     0.003
ω_t = Fr     0.002
ω_t = Mo     0.001
ω_t = Sa     0.001
ω_t = Su     0.001
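Grouping the entries of Table 1 makes the ranking easier to read. The sums below use only the tabulated values; the grouping into three blocks is ours and is meant as a reading aid rather than a result of the paper.

```latex
\begin{align*}
\text{days since collection } (\Delta, \Delta^2, \Delta^3):&\quad 0.311 + 0.337 + 0.325 = 0.973,\\
\text{raw count } (I):&\quad 0.013,\\
\text{day of the week } (\omega_t,\ 7 \text{ levels}):&\quad 3 \times 0.003 + 0.002 + 3 \times 0.001 = 0.014,\\
\text{total}:&\quad 0.973 + 0.013 + 0.014 = 1.000.
\end{align*}
```

In other words, about 97% of the total importance is carried by $\Delta$ and its two transforms, with the raw count and all weekday indicators together accounting for under 3%.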

Prediction of missingness Ftd.

Fig. 4 shows the prediction of $F_{td}$ for different values of $\Delta_{td}$. As seen from the plot, the model predictions are close to the true $F_{td}$ when $\Delta > 4$. The good agreement at $\Delta = 0$ is trivial: on the first date of collection, $F_{td}$ is almost always close to 1.0 and thus easy to predict. It is also evident that the first 3–4 days of data collection are unreliable for predicting the correct $F_{td}$ and should therefore be used cautiously in the nowcasting predictions.

Fig. 4.

Predicted missing fraction $F_{td}$ at various $\Delta_{td}$.

Actual count prediction.

Based on the prediction of $F_{td}$ and the currently observed count $I_{td}$, we use Eq. (4) to obtain the estimate of $I_d^s$, the stable value of the infection count for onset day $d$. The typical trends for four different days of the week can be seen in Fig. 5. The model predicts the stable value $I_d^s$ robustly after five days (starting from $\Delta = 5$), and in some cases even earlier. Fig. 5 shows that, irrespective of the day of the week (Monday, Wednesday, Friday, Sunday), the model predicts the value of $I_d^s$ with good accuracy. We may also note that on Monday and Sunday the model predictions have higher uncertainty, likely due to the effect of the weekend slowdown in test processing.
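Solving Eq. (4) for the stable value gives $I_d^s = I_{td} / (1 - F_{td})$. The sketch below applies this inversion; the function name and the numbers are hypothetical. It also makes visible why very early observations are fragile: as the predicted $F_{td}$ approaches 1, the denominator approaches 0 and small prediction errors are strongly amplified.

```python
def nowcast_stable_count(i_td, f_pred):
    """Invert Eq. (4): I_d^s = I_td / (1 - F_td)."""
    if not 0.0 <= f_pred < 1.0:
        raise ValueError("predicted missing fraction must lie in [0, 1)")
    return i_td / (1.0 - f_pred)

# Made-up example: 300 cases observed so far, predicted missing fraction 0.25.
estimate = nowcast_stable_count(300, 0.25)
```

With these toy inputs the nowcast is 400 cases, that is, the 300 observed cases are taken to be 75% of the eventual total.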

Fig. 5.

Prediction of raw infection count.

Comparison with the Bayesian model.

In order to provide some context for assessing the quality of the RF model predictions, we compare our results with a state-of-the-art hierarchical Bayesian model proposed recently by Kline et al. [24], which has been used for the same purpose of nowcasting COVID-19 cases in the state of Ohio. The model, which we refer to as the Bayesian model (BM) in the following, is more elaborate than ours as it also has a spatial component. Specifically, it keeps track of COVID-19 cases over time in different geographical regions (counties in Ohio). Although in our comparison we aggregate BM spatial counts, for the sake of completeness we briefly describe here the entire model along with its spatial component. Denoting by $Y_{i,t}$ the true count of cases in county $i$ with onset date $t$, the BM assumes the following Poisson model for the dynamics of the disease:

$$Y_{i,t} \sim \text{Poisson}\left(\exp\left(O_i + \alpha_{i,t} + X_t \eta_i\right)\right), \tag{7}$$

where $O_i$ is an offset of the logarithm of the population of county $i$, the spatio-temporal random variables $\alpha_{i,t}$ are the latent states of the process, the design vector $X_t$ indicates the day of the week, and the vector $\eta_i$ captures the day of the week effect. It is assumed that $Y_{i,t}$ is only partially observed for time $t > T_{\max} - D$, where $T_{\max}$ stands for the last onset date and $D$ (assumed 30 in [24]) is the maximum reporting delay following onset. BM also uses a semi-local linear trend model [27] for the spatio-temporal random variables $\alpha_{i,t}$. Further, the spatial correlation is accounted for using an intrinsic conditional auto-regressive model. The reporting delay is described by a Multinomial-Dirichlet model as follows. Denoting by $Z_{i,t,d}$ the count of cases in county $i$ with onset date $t$ that are observed $d$ days after $t$, one defines $Z_{i,t} = (Z_{i,t,0}, Z_{i,t,1}, \ldots, Z_{i,t,D})$. Then, the Multinomial-Dirichlet model prescribes

$$Z_{i,t} \sim \text{Multinomial}\left(p_{i,t}, Y_{i,t}\right),$$
$$p_{i,t} \sim \text{Generalized Dirichlet}\left(a_{i,t}, b_{i,t}\right),$$

where the vectors $a_{i,t}$ and $b_{i,t}$ are described in terms of mean and dispersion parameters [28]. The choice of a Generalized Dirichlet distribution allows for modeling potential overdispersion in $p_{i,t}$ (see [28]). Moreover, it leads to a convenient Beta-Binomial conditional distribution representation for the components $Z_{i,t,d}$. For the purpose of Bayesian analysis, the authors specify normally distributed priors for the parameters and use the R package nimble to perform a Markov chain Monte Carlo (MCMC) algorithm. The authors report a run time of approximately 20 h for 30,000 iterations.
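The 20 h run time can be related to the speed-up of roughly 72,000 times quoted in the Introduction. The arithmetic below assumes that the quoted factor is measured against this 20 h MCMC run, which the text does not state explicitly.

```latex
\[
20\ \text{h} \times 3600\ \tfrac{\text{s}}{\text{h}} = 72{,}000\ \text{s},
\qquad
\frac{72{,}000\ \text{s}}{72{,}000} = 1\ \text{s}.
\]
```

Under that assumption, the quoted factor corresponds to an RF run time on the order of one second.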

In Figs. 6 and 7 we visually compare the nowcasts of the two models and see in particular that the RF enjoys narrower uncertainty bounds and less bias than the corresponding BM. In order to quantify this difference more formally, we calculate the $L_2$ distance between the predictions made by each model (RF and BM) and the actual known stable values in the Ohio COVID-19 daily counts dataset. We report the ratio of the two $L_2$ distance values as a measure of relative closeness of the models to the true (stable) data values for days $T-10$ to $T$ and $T-10$ to $T-5$, where $T$ is the last available date in the data. The results are presented in Table 2. As can be seen in the table, the predictions by the random forest model are closer to the true values than those generated using the Bayesian model estimates. The ratio is smaller in the full 10-day window, indicating that the advantage of the RF model over BM is largest for days that are close to data collection.
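The comparison metric can be sketched as follows. The function names and the numbers are hypothetical and are not taken from the Ohio data; the sketch only illustrates how a ratio of $L_2$ distances to the stable values is formed over a chosen window of days.

```python
import math

def l2_distance(pred, truth):
    """Euclidean (L2) distance between predictions and true stable values."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)))

def l2_ratio(rf_pred, bm_pred, truth):
    """Ratio of RF distance to BM distance; values below 1 favor RF."""
    bm_dist = l2_distance(bm_pred, truth)
    if bm_dist == 0:
        raise ValueError("ratio is undefined when the BM predictions are exact")
    return l2_distance(rf_pred, truth) / bm_dist

# Made-up nowcasts for a three-day window, for illustration only.
truth = [100, 200, 300]
rf = [103, 196, 300]   # distance sqrt(9 + 16) = 5
bm = [106, 192, 300]   # distance sqrt(36 + 64) = 10
ratio = l2_ratio(rf, bm, truth)
```

Restricting the three lists to a sub-window before calling `l2_ratio` gives the analogue of the second column of the comparison.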

Fig. 6.

Comparison between the Bayesian model (BM) and random forest (RF) model from 11-01-2020 to 12-09-2020. The vertical line indicates split between testing and training dataset (of 20 days) used by RF. The weekly fluctuations (weekend effects) are clearly visible in the data and are accounted for by both models.

Fig. 7.

Comparison between the Bayesian model (BM) and random forest (RF) model from 12-01-2020 to 12-09-2020 including the respective 95% uncertainty envelopes.

Table 2.

The ratio of the $L_2$ distances of the nowcasts from the random forest model (RF) and the Bayesian model (BM) to the true stable values, over the two time windows $T-10$ to $T$ and $T-10$ to $T-5$. Ratio values below one indicate that in both cases the RF model performs better than BM.

Window    T-10:T    T-10:T-5
RF/BM     0.565     0.726

4. Summary and discussion

We presented here a simple method for nowcasting COVID-19 cases from historic data on daily incidence of new cases, as measured by the onset of symptoms. This type of data is now widely available for all states in the USA as well as for most countries in the world. When the need to take immediate decisions on governance or policy arises, nowcasting can be a useful tool in providing more accurate estimates of disease incidence and spread. Specifically, our proposed nowcasting algorithm uses a random forest (RF) regression methodology and leverages covariates that are based on the day of the week, the number of days passed since first data collection, and the total incidence so far.

The proposed algorithm is both conceptually simple and computationally efficient. Our results also suggest that it compares favorably with a much more elaborate Bayesian model. We have illustrated the application of our approach on publicly available data on COVID-19 daily onsets in Ohio, as available from the state’s COVID-19 interactive dashboard. We observed that the model is able to predict the final incidence for a day within 3 to 4 days of data collection. We also find that the number of days passed since first data collection, along with its transformations, is the most important covariate in predicting the final incidence.

The proposed model learns from the specific epidemic curve (in our case COVID-19) and depends on how this curve is updated. In our study, we have nowcasted epidemic incidence for Ohio. The process of updating is highly dependent on the part of the country, population density, availability of testing, and reporting by local health departments. It is likely that data from a different geographic region will lead to a different learned model. There could be some level of nowcasting similarity across different geographical regions of the country, and our method could be used to help identify such cases. This can be a potential follow-up to our work.

In order to make our RF method predictions broadly available to the interested researchers and practitioners, we have created a publicly available and accessible interactive notebook (see below). As described in the repository, the notebook allows one to use our algorithm to nowcast current COVID-19 onset occurrences, based on any user-provided historic data supplied in appropriate format.

The problem of nowcasting historic data is an important one, especially during the current COVID-19 pandemic, when delays in reporting can snowball into sub-optimal policies and actions that can cost lives and create unnecessary societal burden. Our proposed method allows both the general public and health providers to carefully monitor the pandemic trends and make informed decisions. The ideas we presented, while focused on COVID-19, can be broadly applicable to similar public health problems in the future.

Software availability

The interactive self-contained notebook for performing the nowcasting using the random forest approach described in the paper, along with installation instructions, is freely available at https://zenodo.org/badge/latestdoi/346708110. Additionally, the web-based version of the interactive notebook is available at https://tinyurl.com/simpleMLnowcasting.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was partially funded by NSF, United States of America grants DMS-1853587 and DMS-2027001 to GAR. The work of WKB was supported by the President’s Postdoctoral Scholars Program (PPSP) of the Ohio State University, United States of America. We would like to thank Harley Vossler for providing helpful feedback on the interactive notebook.


Articles from Mathematical Biosciences are provided here courtesy of Elsevier