Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil

Matheus Henrique Dal Molin Ribeiro; Ramon Gomes da Silva; Viviana Cocco Mariani; Leandro dos Santos Coelho

doi:10.1016/j.chaos.2020.109853

Chaos Solitons Fractals. 2020 May 1;135:109853. doi: 10.1016/j.chaos.2020.109853

PMCID: PMC7252162 PMID: 32501370

Highlights

• Regression models are employed to forecast COVID-19 cases in the Brazilian context.
• Data from ten states with a high incidence of COVID-19 are adopted.
• Models for multi-step-ahead forecasting are evaluated.
• Out-of-sample forecasting errors lower than 6.9% are achieved by the best models.
• SVR and stacking ensemble are the most suitable tools for forecasting COVID-19 cases in the evaluated scenarios.

Keywords: ARIMA, COVID-19, Forecasting, Decision-making, Machine learning, Time-series

Abstract

The new Coronavirus (COVID-19) is an emerging disease that has infected millions of people since its first notification. Efficient short-term forecasting models make it possible to forecast the number of future cases and, in this context, to develop strategic planning in the public health system to avoid deaths. In this paper, autoregressive integrated moving average (ARIMA), cubist regression (CUBIST), random forest (RF), ridge regression (RIDGE), support vector regression (SVR), and stacking-ensemble learning are evaluated in the task of forecasting, one, three, and six days ahead, the COVID-19 cumulative confirmed cases in ten Brazilian states with a high daily incidence. In the stacking-ensemble learning approach, the CUBIST, RF, RIDGE, and SVR models are adopted as base-learners and a Gaussian process (GP) as meta-learner. The models’ effectiveness is evaluated based on the improvement index, mean absolute error, and symmetric mean absolute percentage error criteria. In most cases, SVR and stacking-ensemble learning perform better on the adopted criteria than the compared models. In general, the developed models generate accurate forecasts, achieving errors in the ranges of 0.87%–3.51%, 1.02%–5.63%, and 0.95%–6.90% for one, three, and six days ahead, respectively. The ranking of models, from best to worst regarding accuracy, in all scenarios is SVR, stacking-ensemble learning, ARIMA, CUBIST, RIDGE, and RF. The evaluated models are recommended for forecasting and monitoring the ongoing growth of COVID-19 cases, since they can assist managers in decision-making support systems.

1. Introduction

The new Coronavirus (COVID-19) is an emerging disease that has infected millions of people and killed thousands worldwide since its first notification, according to the World Health Organization (WHO) [1], [2]. Also according to WHO, Brazil had registered 40,581 confirmed cases by April 22nd, 2020, holding the 12th position in the world ranking of confirmed COVID-19 cases and the 2nd position in the Americas (behind the United States of America).

Due to the impacts of the COVID-19 pandemic on people’s lives and the world’s economy, governments and the population are most concerned with (i) when the COVID-19 outbreak will peak; (ii) how long the outbreak will last; and (iii) how many people will eventually be infected [3]. Further, Boccaletti et al. [4] have identified at least three scientific communities that may cooperate in the effort to deal with the current pandemic: (i) the community of applied mathematicians, virologists and epidemiologists, developing sophisticated diffusion models tailored to the specific properties of a given pathogen; (ii) the community of complex systems scientists, who study the spread of infections using compartmental models together with methods and principles from statistical mechanics and nonlinear dynamics; and (iii) the community of scientists who incorporate artificial intelligence (AI), and most specifically deep learning approaches, to produce accurate predictive models. Different studies are also evaluating the impacts of COVID-19 on society, whether through predictions of future cases or through variables capable of helping to understand the spread of this disease [5], [6], [7], [8], [9].

Moreover, epidemiological time series forecasting plays an important role in the public health system, since it allows managers to develop strategic planning to avoid possible epidemics. Forecasting diseases as accurately as possible is important because of their impact on the public health system. To pursue this accuracy, AI models have been widely used to forecast epidemiological time series over the years [10], [11], [12]. In the AI context, Vaishya et al. [13] presented a review of trends in COVID-19 data analysis.

In this context, the objective of this paper is to explore and compare the predictive capacity of machine learning regression and statistical models in the task of forecasting COVID-19 cumulative cases in Brazil one, three, and six days ahead. To this end, datasets from ten Brazilian states, some of them with a high incidence of COVID-19 so far, such as Sao Paulo and Rio de Janeiro, are adopted to evaluate the forecasting efficiency of the autoregressive integrated moving average (ARIMA), cubist regression (CUBIST), random forest (RF), ridge regression (RIDGE), support vector regression (SVR), and stacking-ensemble learning models. In the stacking-ensemble learning modelling, which is an effective ensemble learning approach [14], [15], CUBIST, RF, RIDGE, and SVR are used as base-learners (weak models), and a Gaussian process (GP) as meta-learner (strong model). The out-of-sample forecasting accuracy of each model is compared using the improvement percentage index (IP), mean absolute error (MAE), and symmetric mean absolute percentage error (sMAPE).

The contributions of this paper can be summarized as follows:

• The first contribution is a novel analysis of forecasting models for cumulative confirmed cases of COVID-19 in Brazil, whose accuracy can assist governors in decision-making to contain the pandemic and in strategies concerning the health system;
• The second contribution is the use of heterogeneous machine learning models, as well as the stacking-ensemble learning approach, to forecast the Brazilian cumulative confirmed cases of COVID-19;
• Also, this paper evaluates the models under a multi-day-ahead forecasting strategy with horizons of one, three, and six days ahead. This range of horizons makes it possible to verify the effectiveness of the models in different scenarios, helping in future strategies to fight COVID-19.

The remainder of this paper is organized as follows: Section 2.1 gives a brief description of the dataset adopted in this paper. The forecasting models applied in this study are described in Section 2.2. Section 3 details the procedures applied in the research methodology. The results obtained and the related discussion of the models’ forecasting performance are given in Section 4. Finally, Section 5 concludes this work with considerations and some directions for future research.

2. Material and methods

This section presents the description of the material analyzed (Section 2.1) as well as the models description applied in this paper (Section 2.2).

2.1. Dataset description

The collected dataset refers to the cumulative confirmed cases of COVID-19 that occurred in Brazil up to April 18 or 19, 2020, depending on the state. The dataset was collected from an application programming interface [16] that retrieves the daily information about COVID-19 cases from all 27 Brazilian State Health Offices, gathers it, and makes it publicly available. Among the 27 federative units (26 states and one federal district), ten states were chosen, some with a high incidence of COVID-19 cases and others with lower temperatures (states from the south of Brazil): Amazonas (AM), Bahia (BA), Ceara (CE), Minas Gerais (MG), Parana (PR), Rio de Janeiro (RJ), Rio Grande do Norte (RN), Rio Grande do Sul (RS), Santa Catarina (SC), and Sao Paulo (SP). The measurement period varies by state, since each state is counted from the day of its first case until the day of its last report. The cumulative confirmed cases and deaths of each state, as well as the dates of the first and last reports, are given in Table 1. Changes in the way the health departments account for the number of cases may change the data presented in this paper.

Table 1.

First and last report dates by state.

State	Number of observed days	First report	Last report	Cumulative confirmed cases	Cumulative deaths
AM	34	13/03/2020	19/04/2020	2044	182
BA	43	06/03/2020	19/04/2020	1249	45
CE	35	16/03/2020	19/04/2020	3306	189
MG	42	08/03/2020	19/04/2020	1154	39
PR	36	12/03/2020	18/04/2020	960	49
RJ	38	05/03/2020	19/04/2020	4675	402
RN	30	12/03/2020	18/04/2020	561	26
RS	38	10/03/2020	19/04/2020	869	26
SC	39	12/03/2020	19/04/2020	1025	35
SP	53	25/02/2020	19/04/2020	14267	1015


A heatmap of the cumulative confirmed cases is presented in Fig. 1.

2.2. Methodologies

This section briefly describes each model employed in the data analysis.

• ARIMA is a Box & Jenkins model usually employed to deal with non-stationary time series. The ARIMA model is fully specified by the autoregressive order (p), the degree of differencing (d), and the moving average order (q). These parameters define the model order and are usually chosen by grid-search, as well as by inspecting the autocorrelation and partial autocorrelation functions. In this context, the model is described as ARIMA(p,d,q) [17].
• CUBIST is a rule-based model that makes predictions following the regression tree principle [18]. The final forecast is obtained through a committee of rules combined with a neighborhood concept similar to k-nearest-neighbor modelling.
• GP is a collection of Gaussian-distributed random variables, fully specified by its mean and covariance (kernel) functions [19]. In this paper, the GP with a linear kernel is adopted.
• RIDGE is a regularized regression approach [20] that adds a penalization term to the ordinary least squares objective. It is an effective tool because it accepts a small bias in the parameter estimates in exchange for lower variance, that is, smaller standard errors. Moreover, the model can deal with multi-collinearity among the inputs.
• RF is a bagging ensemble-based model that combines the advantages of bagging, namely the creation of multiple samples drawn from the same dataset through the bootstrap technique, with the random selection of predictors at each node of the decision trees [21]. RF is a fast and robust supervised learning method able to deal with the randomness of the time series. Furthermore, it is attractive because, in addition to being an ensemble approach, only the number of predictors for each node needs to be tuned.
• SVR consists in determining support vectors (points) close to a hyperplane that maximizes the margin between two point classes obtained from the difference between the target value and a threshold. To deal with non-linear problems, SVR relies on kernel functions, which compute the similarity between two observations. In this paper, the linear kernel is adopted. The main advantage of SVR lies in its capacity to capture predictor non-linearity and use it to improve the forecasts. It is also advantageous in this case study, since the samples are small [22].
• Stacked Generalization, or stacking-ensemble learning, is an ensemble-based approach [23] that combines, through a meta-learner, the predictions of a set of weak models (base-learners) to obtain a stronger learner. This approach usually operates in two levels: in the first level, the base-learners are trained and their predictions are obtained; in the second level, a meta-learner is trained using the first-level predictions as inputs. The stacking predictions are obtained from the meta-learner. The main advantage of stacking-ensemble learning is that it can improve accuracy and additionally reduce error variance [14].
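For reference, the two parametric models in the list above can be written in their standard textbook forms. The notation used here (backshift operator $B$, coefficients $\phi_i$ and $\theta_j$, innovation $\varepsilon_t$, penalty weight $\lambda$, and $k$ predictors $x_i$) is an assumption introduced for illustration and is not taken from the paper.

```latex
% ARIMA(p,d,q): AR polynomial applied to the d-times differenced series,
% MA polynomial applied to the innovations
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d} \, y_t
  = \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t

% Ridge regression: ordinary least squares plus an L2 penalty on the coefficients
\hat{\beta}_{\mathrm{ridge}}
  = \arg\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^{\top} \beta\right)^{2}
  + \lambda \sum_{j=1}^{k} \beta_j^{2}
```

Under this reading, an order such as (0,1,0) describes a random walk (possibly with drift) on the cumulative series, and $\lambda = 0$ reduces ridge regression to ordinary least squares.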

3. Proposed forecasting framework

This section describes the main steps in the data analysis adopted by CUBIST, RF, RIDGE, SVR, and stacking-ensemble learning models. Also, the ARIMA modelling is described.

Step 1: First, the raw data are split into training and test datasets. The test dataset is composed of the last six observations, and the training dataset of the remaining samples [14]. The training data are centered by their mean value and divided by their standard deviation. To develop multi-days-ahead forecasts of COVID-19 cases, a recursive strategy is employed [24]. In this strategy, one model is fitted for one-day-ahead forecasting. The forecast value is then used as an input for the same model to forecast the next step, and this continues until the desired horizon is reached. The training structure adopted in this paper is stated as follows,

$$y_{t + 1} = f \{ y_{t}, \dots, y_{t + 1 - n_{y}} \} + \epsilon, \qquad \epsilon \sim N (0, \sigma^{2}), \tag{1}$$

in which f is a function related to the adopted model in the training stage, $y_{t + 1}$ is the number of COVID-19 cases one day ahead, $n_{y} = 5$ is the number of past confirmed cases (lags) used as inputs, and ϵ is the random error, which follows a normal distribution with zero mean and constant variance σ². In this paper, the aim is to obtain the cases up to H days ahead, specifically 1 (ODA, one-day-ahead), 3 (TDA, three-days-ahead), and 6 days ahead (SDA, six-days-ahead). The following structures are considered,

$$\hat{y}_{t + h} = \begin{cases} \hat{f} [y_{t}, y_{t - 1}, \dots, y_{t - n_{y} + 1}] & \text{if } h = 1 \\ \hat{f} [\hat{y}_{t + h - 1}, \dots, \hat{y}_{t + 1}, y_{t}, \dots, y_{t + h - n_{y}}] & \text{if } h \in [2, \dots, n_{y}] \\ \hat{f} [\hat{y}_{t + h - 1}, \dots, \hat{y}_{t + h - n_{y}}] & \text{if } h \in [n_{y} + 1, \dots, H], \end{cases} \tag{2}$$

where ${\hat{y}}_{t + h}$ is the value forecast at time t for horizon h, and $y_{t + h - n_{y}}$ and ${\hat{y}}_{t + h - n_{y}}$ are the previously observed and forecast cases, lagged up to $n_{y} = 5$ days. The value of $n_{y}$ is chosen through grid-search in order to best capture the behavior of the data.
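The training structure of Eq. (1) and the recursive strategy of Eq. (2) can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used in the study; the helper names, the toy series, and the `drift_model` stand-in for $\hat{f}$ are hypothetical and chosen only to make the mechanics visible.

```python
def make_lagged_pairs(series, n_lags=5):
    """Build (inputs, target) training pairs in the form of Eq. (1).

    Each input is [y_t, y_{t-1}, ..., y_{t+1-n_lags}] (most recent first)
    and the target is y_{t+1}.
    """
    pairs = []
    for t in range(n_lags - 1, len(series) - 1):
        lags = [series[t - k] for k in range(n_lags)]
        pairs.append((lags, series[t + 1]))
    return pairs


def recursive_forecast(model, history, horizon, n_lags=5):
    """Forecast `horizon` steps ahead with a one-step model, as in Eq. (2).

    `model` is any callable mapping the n_lags most recent values
    (most recent first) to a one-step-ahead forecast.
    """
    window = list(history[-n_lags:])  # oldest ... newest observed values
    forecasts = []
    for _ in range(horizon):
        y_hat = model(window[::-1])   # most recent value first
        forecasts.append(y_hat)
        # Slide the window: drop the oldest value, append the new forecast.
        window = window[1:] + [y_hat]
    return forecasts


def drift_model(lags):
    """Hypothetical stand-in for f-hat: repeat the last daily increase."""
    return lags[0] + (lags[0] - lags[1])


# Hypothetical cumulative counts, not data from the paper.
toy_series = [1, 2, 4, 7, 11, 16, 22, 29]
print(make_lagged_pairs(toy_series)[0])  # ([11, 7, 4, 2, 1], 16)

observed = [100, 110, 125, 140, 160]
print(recursive_forecast(drift_model, observed, horizon=6))
# [180, 200, 220, 240, 260, 280]
```

With $n_{y} = 5$, the sixth forecast is computed from forecasts only, which is the third case of Eq. (2) and the reason errors can accumulate at the SDA horizon.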

Step 2: In the stacking-ensemble learning modelling, the base-learners CUBIST, RF, RIDGE, and SVR are trained and their forecasts are used as inputs for the GP meta-learner. In the training stage, leave-one-out cross-validation with a time slice is adopted [14]. Finally, the out-of-sample forecasts are computed. These approaches are developed using the caret package [25]. The ARIMA modelling is performed with the auto.arima function of the forecast package [26], [27]. To define the ARIMA order, grid-search is adopted, and the most suitable order is the one that reaches the lowest Akaike and Bayesian information criteria. Both analyses are developed using R software [28]. All hyperparameters employed in this study are presented in Table B.1 in Appendix B.
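One way to read leave-one-out cross-validation with a time slice is as a rolling-origin scheme in which each validation fold holds out the single observation that follows the training window. The sketch below illustrates that idea in Python; the expanding window and the `initial_window` parameter are assumptions, since the paper does not report the exact resampling settings used in caret.

```python
def time_slices(n_obs, initial_window, horizon=1):
    """Yield (train_idx, test_idx) pairs that never look into the future."""
    for end in range(initial_window, n_obs - horizon + 1):
        train_idx = list(range(0, end))             # everything up to the origin
        test_idx = list(range(end, end + horizon))  # the next `horizon` points
        yield train_idx, test_idx


for train_idx, test_idx in time_slices(n_obs=6, initial_window=3):
    print(train_idx, test_idx)
# [0, 1, 2] [3]
# [0, 1, 2, 3] [4]
# [0, 1, 2, 3, 4] [5]
```

Keeping every training index strictly before the held-out index preserves the temporal order of the series, which ordinary shuffled cross-validation would break.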

Step 3: To evaluate the effectiveness of the adopted models, the MAE (3), sMAPE (4), and IP (5) performance criteria are computed from the out-of-sample forecasts (test set) as

$$\text{MAE} = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |, \tag{3}$$

$$\text{sMAPE} = \frac{2}{n} \sum_{i = 1}^{n} \frac{| y_{i} - {\hat{y}}_{i} |}{| y_{i} | + | {\hat{y}}_{i} |}, \tag{4}$$

$$\text{IP} = 100 \times \frac{M_{c} - M_{b}}{M_{c}}, \tag{5}$$

where n is the number of observations, and $y_{i}$ and ${\hat{y}}_{i}$ are the ith observed and predicted values, respectively. Also, $M_{c}$ and $M_{b}$ represent the performance measure of the compared and best models, respectively.
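The three criteria translate directly into code. The following is a minimal Python sketch of Eqs. (3) to (5); the observed and predicted values are hypothetical and serve only to exercise the functions. Note that Eq. (4) yields a fraction, so it is multiplied by 100 here to obtain a percentage like those reported in Table A.1.

```python
def mae(observed, predicted):
    """Mean absolute error, Eq. (3)."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)


def smape(observed, predicted):
    """Symmetric mean absolute percentage error, Eq. (4), as a fraction."""
    total = sum(abs(y - p) / (abs(y) + abs(p)) for y, p in zip(observed, predicted))
    return 2.0 * total / len(observed)


def improvement(m_compared, m_best):
    """Improvement percentage index, Eq. (5)."""
    return 100.0 * (m_compared - m_best) / m_compared


# Hypothetical observed and forecast values for a six-day test window.
y_true = [100, 110, 120, 130, 140, 150]
y_pred = [98, 112, 119, 133, 138, 151]

print(mae(y_true, y_pred))            # absolute errors 2, 2, 1, 3, 2, 1 average to 11/6
print(100 * smape(y_true, y_pred))    # sMAPE expressed in percent
print(improvement(10.0, 8.0))         # 20.0
```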

Fig. 2 presents the proposed forecasting framework.

4. Results

This section describes the results of the developed experiments for out-of-sample forecasts (test set). First, Section 4.1 compares the results of the evaluated models over the ten datasets and three forecasting horizons adopted. In Table A.1 in Appendix A, the best results regarding accuracy are presented in bold. Additionally, Figs. 3 and 4 illustrate the relation between the observed values and those predicted by the models with the best set of performance measures in Table A.1, and box-plots of the out-of-sample errors are shown in Fig. 5.

Fig. 3 — Predicted versus observed cumulative confirmed cases of COVID-19 for AM, BA, CE, and MG states.

Fig. 4 — Predicted versus observed cumulative confirmed cases of COVID-19 for PR, RJ, RN, RS, SC, and SP states.

Fig. 5 — Box-plot for absolute error according to model and state for COVID-19 forecasting up to SDA.

4.1. Performance measures for compared models

In this section, the main results achieved by the best model regarding the MAE and sMAPE criteria are presented for short-term multi-days-ahead forecasting of cumulative cases of COVID-19 in ten Brazilian states.

• AM: In this state, the CUBIST and RIDGE approaches could be considered for forecasting COVID-19 cases. For ODA and TDA, CUBIST outperforms the other models, while for SDA RIDGE achieves better accuracy regarding MAE and sMAPE than the others. The improvement in MAE achieved by CUBIST ranges between 6.58%–92.77% for ODA and 11.39%–88.54% for TDA. In terms of sMAPE, the RIDGE model outperforms the other models for the SDA horizon, reducing this criterion by 16.46%–91.88%.
• BA, MG, RS, and SP: For these states, in all forecasting windows, the SVR approach achieved better accuracy than the other models, for both the MAE and sMAPE criteria, in the multi-days-ahead forecasting of the number of confirmed COVID-19 cases. The improvement in sMAPE ranges between 13.26%–95.11%, 4.23%–94.88%, and 38.59%–95.24% for the ODA, TDA, and SDA forecasting horizons, respectively. The same behavior is observed for the improvement in the MAE criterion.
• CE and RN: In the CE state, the ARIMA model performs better out-of-sample than the other models for the ODA and TDA time windows. For the MAE criterion, the improvement ranges between 72.36%–98.03% for ODA and 45.93%–92.40% for TDA. For sMAPE, the improvement is 65.06%–97.84% for ODA and 32.81%–92.53% for TDA. For SDA, the SVR has better results than the ARIMA model. For the RN state, the same analysis holds for the ODA and TDA horizons. The exception is the SDA horizon, in which the CUBIST model is more effective in the MAE and sMAPE criteria than the remaining models.
• PR, RJ, and SC: For these states, located in the south region (PR and SC) and southeast region (RJ) of Brazil, the most appropriate approach to forecast cumulative cases of COVID-19 is stacking-ensemble learning, except in the ODA horizon, when the ARIMA model has better results. Stacking overcomes the drawbacks of single models and achieves better accuracy than the other models. For these states, the improvements in MAE and sMAPE are between 14.01%–94.68% and 17.48%–95.41%, respectively, for the ODA horizon. The improvement in the other forecasting horizons shows the same behavior as in ODA, with a greater magnitude of improvement for TDA and SDA.
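The improvement ranges quoted for AM can be checked against Table A.1 by applying Eq. (5) to the best model and each competitor in turn. The sketch below does this in Python for the MAE rows of AM; the values are copied from Table A.1, while the function name and data layout are assumptions made for illustration.

```python
# MAE values for Amazonas (AM) copied from Table A.1.
am_mae = {
    "ODA": {"ARIMA": 95, "CUBIST": 45, "RF": 622.17,
            "RIDGE": 48.17, "Stacking": 121.5, "SVR": 56.33},
    "TDA": {"ARIMA": 101.33, "CUBIST": 71.33, "RF": 622.17,
            "RIDGE": 83.67, "Stacking": 176.67, "SVR": 80.5},
    "SDA": {"ARIMA": 119.17, "CUBIST": 162.17, "RF": 622.17,
            "RIDGE": 62.33, "Stacking": 233.17, "SVR": 79.17},
}


def improvement_range(scores):
    """Return the best model and the (min, max) IP over the other models."""
    best = min(scores, key=scores.get)
    ips = [100 * (value - scores[best]) / value
           for name, value in scores.items() if name != best]
    return best, round(min(ips), 2), round(max(ips), 2)


print(improvement_range(am_mae["ODA"]))  # ('CUBIST', 6.58, 92.77)
print(improvement_range(am_mae["TDA"]))  # ('CUBIST', 11.39, 88.54)
print(improvement_range(am_mae["SDA"])[0])  # RIDGE
```

The lower end of each range comes from the closest competitor (RIDGE for ODA, SVR for TDA) and the upper end from RF, which has the largest error in every AM scenario.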

Remark: In this experiment, 180 scenarios (10 datasets, 3 forecasting horizons, and 6 models) were evaluated for the task of forecasting cumulative COVID-19 cases. Overall, the best models for each state obtained sMAPE values between 0.87%–3.51%, 1.02%–5.63%, and 0.95%–6.90% for ODA, TDA, and SDA forecasting, respectively. The ranking of models in all scenarios is SVR, stacking-ensemble, ARIMA, CUBIST, RIDGE, and RF. In contrast to the findings of [29], for the datasets evaluated in this paper, ARIMA modelling was effective in some situations for very short horizons. When the horizon is SDA, the ARIMA model performs worse than most of the compared models. However, for ODA the applications are limited. From a broader perspective, the efficiency of SVR is due to its ability to deal with small datasets, while stacking-ensemble learning combines the advantages of several single models to learn the data behavior and obtain forecasts similar to the observed values. On the other hand, the difficulty of the RF model in forecasting cumulative COVID-19 cases could be attributed to the fact that this approach requires more observations to effectively learn the data pattern.

According to Figs. 3 and 4, the behavior of the data is learned by the evaluated models, which produce forecasts compatible with the observed values. The good performance obtained in the training phase persists in the test stage. In Figs. 3a and 4c, the RIDGE and CUBIST models, and in Figs. 3d and 4f, SVR, had difficulty capturing the variability of the first observations. The dataset is small for all states, which explains the difficulty of the mathematical models in learning the behavior.

Fig. 5 shows the box-plots of out-of-sample forecasting errors in the SDA horizon for each model and dataset used. This horizon is chosen for analysis because of the recursive strategy adopted, since the errors increase as the forecasting horizon grows. The box diagram depicts the variation of absolute errors for each model, which reflects the stability of each model. In this context, the dots outside the boxes are considered outlier errors, and the black dot inside each box is the MAE of the corresponding model.

In the box-plot analysis, smaller boxes indicate models with lower variation in the errors, and the results presented in Table A.1 are corroborated by those depicted in Fig. 5. Models with lower errors also reach better stability, which means that the most suitable model for each state can maintain a learning pattern, achieving homogeneous prediction errors.
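The quantities summarized by each box can be computed from the absolute errors of a test window. The following Python sketch uses hypothetical observed and forecast values, and the quartile convention of `statistics.quantiles` is an assumption that may differ from the one used to draw Fig. 5.

```python
from statistics import mean, quantiles


def error_summary(observed, predicted):
    """Summarize absolute errors the way a box-plot does, plus the MAE."""
    abs_errors = [abs(y - p) for y, p in zip(observed, predicted)]
    q1, median, q3 = quantiles(abs_errors, n=4)  # quartile cut points
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,           # box height: a smaller value means more stable errors
        "mae": mean(abs_errors),  # the dot drawn inside each box in Fig. 5
    }


# Hypothetical six-day test window, not data from the paper.
summary = error_summary([100, 110, 120, 130, 140, 150],
                        [98, 112, 119, 133, 138, 151])
print(summary)
```

Comparing the `iqr` entries of two models on the same state gives a numerical counterpart to the visual comparison of box sizes.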

5. Conclusion and future research

In this paper, six approaches, namely the machine learning models CUBIST, RF, RIDGE, SVR, and stacking-ensemble learning, as well as the ARIMA statistical model, were employed in the task of forecasting, one, three, and six days ahead, the COVID-19 cumulative confirmed cases in ten Brazilian states with a high daily incidence. The COVID-19 cumulative confirmed cases for the AM, BA, CE, MG, PR, RJ, RN, RS, SC, and SP states were used. The IP, MAE, and sMAPE criteria were adopted to evaluate the performance of the compared approaches. Moreover, the stability of out-of-sample errors was evaluated through box-plots.

From the results obtained, it is possible to infer that SVR and stacking-ensemble learning are suitable tools to forecast COVID-19 cases for most of the adopted states, since these approaches were able to learn the nonlinearities inherent to the evaluated epidemiological time series. Also, ARIMA can be considered in some cases for ODA, while the CUBIST and RIDGE models deserve attention for this task in the TDA and SDA time windows. The ranking of models, from best to worst regarding accuracy, in all scenarios is SVR, stacking-ensemble learning, ARIMA, CUBIST, RIDGE, and RF. However, even though the models discussed in this paper produced forecasts similar to the observed cases, they should be used cautiously. This is due to the chaotic dynamics of the analyzed data, as well as the diversity of exogenous factors that can affect the daily notifications of COVID-19.

For future work, it is intended (i) to adopt deep learning approaches combined with stacking-ensemble learning, (ii) to employ copula functions for data augmentation when dealing with small samples, (iii) to use multi-objective optimization to tune the hyperparameters of the adopted forecasting models, and (iv) to adopt a set of features that can help to explain future cases of COVID-19.

CRediT authorship contribution statement

Matheus Henrique Dal Molin Ribeiro: Conceptualization, Methodology, Formal analysis, Validation, Writing - original draft, Writing - review & editing. Ramon Gomes da Silva: Conceptualization, Methodology, Formal analysis, Validation, Writing - original draft, Writing - review & editing. Viviana Cocco Mariani: Conceptualization, Writing - review & editing. Leandro dos Santos Coelho: Conceptualization, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the National Council of Scientific and Technologic Development of Brazil – CNPq (Grants number: 307958/2019-1-PQ, 307966/2019-4-PQ, 404659/2016-0-Univ, 405101/2016-3-Univ), PRONEX ‘Fundação Araucária’ 042/2018, and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 for financial support of this work.

Appendix A. Performance measures

Table A.1 presents the performance measures for each model in each state and forecasting horizon.

Appendix B. Hyperparameters

Table B.1 presents the hyperparameters obtained by grid-search for the models employed in this paper. In the stacking-ensemble learning modelling, there is no hyperparameter to be tuned for the GP meta-learner.

Table A.1.

Performance measures for each evaluated model.

State	Forecasting Horizon	Criteria	Model
			ARIMA	CUBIST	RF	RIDGE	Stacking	SVR
AM	ODA	MAE	95	45	622.17	48.17	121.5	56.33
		sMAPE	6.61%	2.80%	42.50%	2.83%	7.13%	3.18%
	TDA	MAE	101.33	71.33	622.17	83.67	176.67	80.5
		sMAPE	6.55%	4.50%	42.50%	4.49%	10.47%	4.19%
	SDA	MAE	119.17	162.17	622.17	62.33	233.17	79.17
		sMAPE	6.97%	9.55%	42.50%	3.45%	13.87%	4.13%

BA	ODA	MAE	12	93.83	366.33	45.33	107.67	42.33
		sMAPE	1.56%	9.16%	42.02%	4.36%	10.68%	4.15%
	TDA	MAE	70	132	366.33	74.33	171.67	59.67
		sMAPE	8.00%	12.92%	42.02%	7.46%	17.32%	5.63%
	SDA	MAE	155.67	152.33	366.33	152.83	215.83	73.17
		sMAPE	15.41%	15.08%	42.02%	15.16%	22.25%	6.90%

CE	ODA	MAE	18	65.17	916	70.33	220.83	87.67
		sMAPE	0.87%	2.49%	40.28%	2.81%	8.20%	3.17%
	TDA	MAE	69.66	128.83	916	149.83	382.17	136.67
		sMAPE	3.01%	4.48%	40.28%	5.39%	14.48%	4.78%
	SDA	MAE	257	118.17	916	98.17	484.33	164.17
		sMAPE	9.34%	4.11%	40.28%	3.52%	18.78%	5.77%

MG	ODA	MAE	32	17.5	235.5	24.33	56.5	16
		sMAPE	3.63%	1.81%	26.21%	2.50%	5.59%	1.57%
	TDA	MAE	26	21.33	235.5	21.67	78.17	21
		sMAPE	3.08%	2.20%	26.21%	2.13%	7.81%	2.04%
	SDA	MAE	55	36.83	235.5	32.17	97.83	14.33
		sMAPE	5.43%	3.58%	26.21%	3.14%	9.88%	1.41%

PR	ODA	MAE	31	27.33	163.5	38	23.5	35.33
		sMAPE	3.96%	3.26%	21.09%	4.50%	2.69%	4.18%
	TDA	MAE	51.66	57.33	163.5	76.5	28.17	60.17
		sMAPE	6.21%	6.56%	21.09%	8.61%	3.21%	6.89%
	SDA	MAE	73.67	118	163.5	151	24.17	117.17
		sMAPE	8.20%	12.56%	21.09%	15.75%	2.75%	12.53%

RJ	ODA	MAE	110	165.5	1305.67	273.67	69.5	360.83
		sMAPE	3.17%	3.82%	37.06%	6.25%	1.70%	8.09%
	TDA	MAE	120	275.67	1305.67	462.83	68	429.33
		sMAPE	3.18%	6.24%	37.06%	10.20%	1.65%	9.49%
	SDA	MAE	158.33	532.67	1305.67	696.17	65.17	529.5
		sMAPE	3.67%	11.34%	37.06%	14.67%	1.58%	11.43%

RN	ODA	MAE	6	17	152.5	24.83	30.33	18.33
		sMAPE	1.61%	3.87%	39.28%	5.56%	6.45%	4.14%
	TDA	MAE	8.33	30.83	152.5	37.67	54	35.5
		sMAPE	2.11%	6.54%	39.28%	8.51%	11.66%	7.69%
	SDA	MAE	36.33	15.83	152.5	62	54	18.5
		sMAPE	7.61%	3.42%	39.28%	12.76%	11.66%	4.15%

RS	ODA	MAE	12	12.83	146.67	11.33	45.5	8.17
		sMAPE	1.64%	1.62%	19.82%	1.43%	5.76%	0.97%
	TDA	MAE	24	19.17	147.33	18.67	71.33	8.5
		sMAPE	3.22%	2.47%	19.92%	2.42%	9.14%	1.02%
	SDA	MAE	34.5	34.17	147.5	37.67	91.83	7.83
		sMAPE	4.31%	4.26%	19.95%	4.74%	11.89%	0.95%

SC	ODA	MAE	21	93.67	179.5	180.5	33.83	177.67
		sMAPE	2.43%	9.66%	20.97%	17.53%	3.66%	17.27%
	TDA	MAE	44.33	100.33	179.5	277	41	257.33
		sMAPE	4.76%	10.30%	20.97%	25.34%	4.39%	23.79%
	SDA	MAE	56	102.83	179.5	338.5	43.83	330.33
		sMAPE	5.65%	10.53%	20.97%	29.95%	4.68%	29.23%

SP	ODA	MAE	436	1587	3799	537.33	1363.83	409
		sMAPE	4.65%	13.47%	35.85%	4.44%	11.44%	3.51%
	TDA	MAE	1485.66	2471.83	3801	579.17	2243	326.67
		sMAPE	14.56%	21.81%	35.88%	4.79%	19.47%	2.77%
	SDA	MAE	2779	3054.67	3801.5	591.83	2665.83	362.83
		sMAPE	24.74%	27.60%	35.88%	4.95%	23.55%	3.04%


Table B.1.

Hyperparameters selected by grid-search for each evaluated model.

State	Model
	ARIMA	CUBIST		SVR	RIDGE	RF
	(p,d,q)	Committees	Neighbors	Cost	Regularization	Number of randomly selected predictors
AM	(1,2,0)	10	5	1	3.16E-03	2
BA	(0,2,1)	20	9	1	1E-04	2
CE	(2,2,1)	1	9	1	0	4
MG	(0,2,1)	1	9	1	1E-04	2
PR	(0,2,1)	20	5	1	3.16E-03	3
RJ	(0,2,1)	1	9	1	1E-04	3
RN	(1,1,0)	1	9	1	3.16E-03	5
RS	(0,1,0)	1	9	1	1E-04	3
SC	(0,2,1)	10	0	1	3.16E-03	5
SP	(0,2,0)	20	9	1	1E-04	5


References

1.World Health Organization (WHO). Coronavirus (COVID-19). 2020. (accessed in 22nd April, 2020). https://covid19.who.int/.
2.Sohrabi C., Alsafi Z., O’Neill N., Khan M., Kerwan A., Al-Jabir A. World health organization declares global emergency: a review of the 2019 novel coronavirus (COVID-19) Int J Surg. 2020;76:71–76. doi: 10.1016/j.ijsu.2020.02.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Zhang X., Ma R., Wang L. Predicting turning point, duration and attack rate of COVID-19 outbreaks in major western countries. Chaos Solitons Fractals. 2020:109829. doi: 10.1016/j.chaos.2020.109829. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Boccaletti S., Ditto W., Mindlin G., Atangana A. Modeling and forecasting of epidemic spreading: the case of COVID-19 and beyond. Chaos Solitons Fractals. 2020 doi: 10.1016/j.chaos.2020.109794. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Becerra M., Jerez A., Aballay B., Garcés H.O., Fuentes A. Forecasting emergency admissions due to respiratory diseases in high variability scenarios using time series: a case study in Chile. Sci Total Environ. 2020;706:134978. doi: 10.1016/j.scitotenv.2019.134978. [DOI] [PubMed] [Google Scholar]
6.Fanelli D., Piazza F. Analysis and forecast of COVID-19 spreading in China, Italy and France. Chaos Solitons Fractals. 2020;134:109761. doi: 10.1016/j.chaos.2020.109761. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Fong S.J., Li G., Dey N., Crespo R.G., Herrera-Viedma E. Composite monte carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction. Appl Soft Comput. 2020 doi: 10.1016/j.asoc.2020.106282. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Roosa K., Lee Y., Luo R., Kirpich A., Rothenberg R., Hyman J. Real-time forecasts of the COVID-19 epidemic in China from february 5th to february 24th, 2020. Infect Dis Model. 2020;5:256–263. doi: 10.1016/j.idm.2020.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Effenberger M., Kronbichler A., Shin J.I., Mayer G., Tilg H., Perco P. Association of the COVID-19 pandemic with internet search volumes: a google trendstm analysis. Int J Infect Dis. 2020 doi: 10.1016/j.ijid.2020.04.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Davis J.K., Gebrehiwot T., Worku M., Awoke W., Mihretie A., Nekorchuk D. A genetic algorithm for identifying spatially-varying environmental drivers in a malaria time series model. Environ Model Softw. 2019;119:275–284. doi: 10.1016/j.envsoft.2019.06.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Ribeiro M.H.D.M., da Silva R.G., Fraccanabbia N., Mariani V.C., Coelho L.d.S. Forecasting epidemiological time series based on decomposition and optimization approaches. In: 14th Brazilian computational intelligence meeting (CBIC). Belém, PA, Brazil; 2019. pp. 1–8.
12.Scavuzzo J.M., Trucco F., Espinosa M., Tauro C.B., Abril M., Scavuzzo C.M. Modeling dengue vector population using remotely sensed data and machine learning. Acta Trop. 2018;185:167–175. doi: 10.1016/j.actatropica.2018.05.003.
13.Vaishya R., Javaid M., Khan I.H., Haleem A. Artificial intelligence (AI) applications for COVID-19 pandemic. Diabetes Metab Syndrome. 2020;14(4):337–339. doi: 10.1016/j.dsx.2020.04.012.
14.Ribeiro M.H.D.M., Coelho L.d.S. Ensemble approach based on bagging, boosting and stacking for short-term prediction in agribusiness time series. Appl Soft Comput. 2020;86:105837. doi: 10.1016/j.asoc.2019.105837.
15.Moreno S.R., da Silva R.G., Ribeiro M.H.D.M., Fraccanabbia N., Mariani V.C., Coelho L.d.S. Very short-term wind energy forecasting based on stacking ensemble. In: 14th Brazilian computational intelligence meeting (CBIC). Belém, PA, Brazil; 2019. pp. 1–8.
16.Justen A. COVID-19: coronavirus newsletters and cases by municipality per day. 2020. (accessed on 20 April 2020). https://brasil.io/api/dataset/covid19/caso/data/?place_type=state.
17.Box G.E., Jenkins G.M., Reinsel G.C., Ljung G.M. Time series analysis: forecasting and control. 5th ed. John Wiley & Sons; 2015.
18.Quinlan J.R. Combining instance-based and model-based learning. In: Proceedings of the 10th international conference on machine learning (ICML'93). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1993. pp. 236–243.
19.Rasmussen C.E. Gaussian processes in machine learning. Heidelberg, Germany: Springer; 2004. pp. 63–71.
20.Hoerl A.E., Kennard R.W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67. doi: 10.1080/00401706.1970.10488634.
21.Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi: 10.1023/A:1010933404324.
22.Drucker H., Burges C.J.C., Kaufman L., Smola A.J., Vapnik V. Support vector regression machines. In: Mozer M.C., Jordan M.I., Petsche T., editors. Advances in neural information processing systems 9. MIT Press; 1997. pp. 155–161.
23.Wolpert D.H. Stacked generalization. Neural Netw. 1992;5(2):241–259. doi: 10.1016/S0893-6080(05)80023-1.
24.Moreno S.R., da Silva R.G., Mariani V.C., Coelho L.d.S. Multi-step wind speed forecasting based on hybrid multi-stage decomposition model and long short-term memory neural network. Energy Convers Manage. 2020;213:112869. doi: 10.1016/j.enconman.2020.112869.
25.Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28(5):1–26. doi: 10.18637/jss.v028.i05.
26.Hyndman R., Athanasopoulos G., Bergmeir C., Caceres G., Chhay L., O’Hara-Wild M., et al. forecast: forecasting functions for time series and linear models. R package version 8.12; 2020. http://pkg.robjhyndman.com/forecast.
27.Hyndman R.J., Khandakar Y. Automatic time series forecasting: the forecast package for R. J Stat Softw. 2008;27(3):1–22. http://www.jstatsoft.org/article/view/v027i03
28.R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria; 2018.
29.Benvenuto D., Giovanetti M., Vassallo L., Angeletti S., Ciccozzi M. Application of the ARIMA model on the COVID-2019 epidemic dataset. Data Brief. 2020;29:105340. doi: 10.1016/j.dib.2020.105340.

