Skip to main content
Elsevier - PMC COVID-19 Collection logoLink to Elsevier - PMC COVID-19 Collection
. 2022 Jan 3;155:111789. doi: 10.1016/j.chaos.2021.111789

Prediction intervals of the COVID-19 cases by HAR models with growth rates and vaccination rates in top eight affected countries: Bootstrap improvement

Eunju Hwang 1
PMCID: PMC8720534  PMID: 35002103

Abstract

This paper is devoted to modeling and predicting COVID-19 confirmed cases through a multiple linear regression. Especially, prediction intervals of the COVID-19 cases are extensively studied. Due to long-memory feature of the COVID-19 data, a heterogeneous autoregression (HAR) is adopted with Growth rates and Vaccination rates; it is called HAR-G-V model. Top eight affected countries are taken with their daily confirmed cases and vaccination rates. Model criteria results such as root mean square error (RMSE), mean absolute error (MAE), R2, AIC and BIC are reported in the HAR models with/without the two rates. The HAR-G-V model performs better than other HAR models. Out-of-sample forecasting by the HAR-G-V model is conducted. Forecast accuracy measures such as RMSE, MAE, mean absolute percentage error and root relative square error are computed. Furthermore, three types of prediction intervals are constructed by approximating residuals to normal and Laplace distributions, as well as by employing bootstrap procedure. Empirical coverage probability, average length and mean interval score are evaluated for the three prediction intervals. This work contributes three folds: a novel trial to combine both growth rates and vaccination rates in modeling COVID-19; construction and comparison of three types of prediction intervals; and an attempt to improve coverage probability and mean interval score of prediction intervals via bootstrap technique.

Keywords: COVID-19, HAR model, Prediction interval, Mean interval score, Bootstrap procedure

1. Introduction

The COVID-19 pandemic has lasted for more than a year although vaccinations against the COVID-19 has been recently distributed around the world. The severe virus is still threatening human health, social activity and global economy, despite struggles and efforts of experts to overcome the crisis. Many researchers have attempted to help with health systems of COVID-19. Statisticians and econometricians contribute to preventing the spread of COVID-19 through a variety of time series models, by modeling and forecasting the COVID-19 confirmed and death cases. Notable time series studies have emerged to model the COVID-19 data. From the beginning of Benvenuto et al. [1], who used an ARIMA (autoregressive integrated moving average) model, there have been so many ARIMA time series analyses for the COVID-19 datasets [2], [3], [4], [5], [6], [7], [8], [9], [10]. Adding sophisticated theories of probability distributions and differential equations, mathematical models have been developed by [12], [13], [14], [15], [16], [17], [18], [19], [11] for efficient predictions of the COVID-19 pandemic. In the area of artificial intelligence and machine learning, stochastic modelings for the COVID-19 epidemic disease have been conducted to support the forecast analyses [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31].

Among recent remarkable works of the COVID-19 modeling, we refer to Chowell and Luo [32], Bracher et al. [33], Ekinci [34], Hwang and Yu [35]: Chowell and Luo [32] introduced ensemble methodology for forecasting growth process of epidemic outbreaks using differential equations. They achieved the improved coverage rate and mean interval score of prediction intervals. Bracher et al. [33] combined multiple models to improve reliability of forecast outputs. Ekinci [34] adopted an ARMA-GARCH (generalized autoregressive conditional heteroscedasticity) process for modeling the growth rate of daily COVID-19 confirmed cases, considering conditional variance and asymmetric effect, while [35] employed a HAR (heterogeneous AR) model to capture the long memory embedded in the COVID-19 data. The GARCH and HAR models in Ekinci [34], Hwang and Yu [35], which are popular and robust time series models for financial volatility, still show good predictive performance on the COVID-19 time series data.

Vaccinations against the COVID-19 have been widely distributed in the world. According to Tregoning et al. [36] it is reported that the development and deployment of vaccines for the COVID-19 are a remarkable science success story: within 16 months of the first vaccine trials, 2.8 billion vaccine doses have been administered. Thanks to Mathieu et al. [37] a global database of COVID-19 vaccinations are available. Shim [38] evaluated the impact of COVID-19 vaccination program through a mathematical model based on age-structured classification, while [39] studied forecasting of the number of fully vaccinated people and examined future vaccination rate by the ARIMA model. Tregoning et al. [36] and Higdon [40] conducted systematic reviews of COVID-19 vaccines efficacy and effectiveness against the SARS-CoV-2 infection and disease. Vaccines and booster shots affect the new COVID-19 spread in the local region. Vaccination rates are one of the primary components that influence strongly the COVID-19 confirmed cases.

This work concerns itself with a time series analysis for the COVID-19 in the world. Vaccination rates are considered as a regressor variable in a multiple time series regression for the purpose of the COVID-19 forecast analysis. In addition, the growth rates of daily COVID-19 confirmed cases are adopted as another regression factor because the growth rates directly determine new numbers in the near future. The COVID-19 confirmed cases have long-memory property with algebraically decaying autocorrelation coefficients. As in Hwang and Yu [35], we use the HAR model that is an additive cascade regression model with a main feature of the long memory. Overall, to forecast the COVID-19 confirmed cases, this paper suggests a HAR model with Growth rates and Vaccination rates, called the HAR-G-V model, which is a form of multiple linear regression. The multiple linear regression and its statistical theories are well-known (for example, see Montgomery and Peck [41]). Their coefficient parameter estimation and related analysis are frequently used, as a standard-level study, to achieve statistically significant results.

The HAR model is a powerful time series model especially for realized volatility of financial asset, (see Corsi [42], Hong et al. [43], Hwang and Hong [44], Wilms et al. [45]). Its background is deferred to a recent work [35] of author, which is also a HAR model-application to the COVID-19 data. The HAR model has moving-averages as regressor variables: for example, in a HAR(3) model with lags (1,7,14), the previous value, 7-day moving average and 14-day moving average are used as explanatory variables. The conventional HAR model is a kind of a linear AR model: for example, the HAR(3) model with lags (1,7,14) is theoretically a linear AR(14) model. However, adding the growth rates to the HAR model results in a nonlinear AR model. Not only the averages but also the growth rates are important factors that determine the COVID-19 spread. In the notable work [34], the growth rate itself of the daily COVID-19 confirmed cases is used for time series modeling. In contrast, the current work uses both averages and growth rates, in a form of nonlinear AR model, and at the same time, in a multiple linear regression model. The modified (nonlinear) HAR(p) model is employed with one-day, one-week, two-week moving growth rates as well as one-day, one-week, two-week moving averages, with lags (1,7,14), if p=3. In addition, two vaccination rates are imposed in our time series model: 1st and 2nd vaccination rates, which are different from country to country. The following three main issues: (i) reflecting long memory of COVID-19 data, (ii) adopting a nonlinear autoregressive effect with both moving averages and moving rates, and (iii) taking vaccination rates that play a decisive role in the COVID-19 situation, motivate us to choose the multiple regression of the HAR-G-V model in this study.

For the empirical analysis of the COVID-19 time series data in the world, top eight affected countries are taken: Brazil, France, Germany, India, Italy, Russia, UK and USA. Their daily COVID-19 confirmed cases and vaccination rates are analyzed through HAR-G-V models of order 2, 3 with lags (1,7), (1,7,14), respectively. In the HAR models with/without growth rates and vaccination rates, model criteria measures such as root mean square error (RMSE), mean absolute error (MAE), R2, AIC and BIC are evaluated for the comparison. In addition, these are compared to ARIMA models, which are popular in time series literature including COVID-19. The proposed HAR-G-V model outperforms other HAR models and ARIMA models with respect to RMSE and MAE. It is because the both rates have strong influences on the spread of COVID-19. The fitted HAR-G-V models are well-matched to real daily COVID-19 confirmed cases.

To investigate forecast performance of the HAR-G-V model, out-of-sample forecasting is conducted to evaluate RMSE, MAE, mean absolute percentage error (MAPE) and root relative square error (RRSE) of forecasts. As a main goal of this work, 100(1α)% prediction intervals are established. Their empirical coverage probability, average length and mean interval score are reported. The mean interval score was assessed in Chowell and Luo [32], to account the forecasting performances across a diversity of epidemic datasets. The interval score was suggested by Gneiting and Raftery [46]. It is a scoring rule for prediction interval, which rewards for a narrow prediction interval and incurs a penalty if the observation misses the interval. See [46] for its details, who dealt with the interval score and other scoring rules of prediction intervals.

Based on the residuals of the HAR-G-V model, 100(1α)% prediction intervals, (α=0.05,0.2), are constructed by approximating the residual distribution. The following three approaches are attempted: (i) Approximation by normal distribution; it does not provide the best performance because residuals do not follow the normal distribution. (ii) Approximation by Laplace distribution; it seems to fit well the residual distribution and to improve the normal approximation by enhancing coverage probability in some countries. However, the Laplace distribution makes longer prediction interval and thus it has the worst mean interval score in the comparison of 95% prediction intervals. (iii) In order to improve the previous two approximations, the bootstrap technique is applied to the residuals to construct the bootstrap prediction interval.

The validity of the bootstrap technique in a regression model was initially established by Freedman [47]. Also, the bootstrap procedure was used by [48] to estimate the mean squared error of forecast for the AR(p) model. As a useful application of the bootstrap, Thombs and Schucany [49] dealt with the bootstrap prediction intervals for the AR models. For weak dependence structure in a network vector autoregression, Hong and Hwang [50] implemented the bootstrap prediction intervals using the stationary bootstrap and the classical residual bootstrap. In the current work, in order to construct a better prediction interval, the classical residual-bootstrap method is used. This residual resampling-technique not only improves empirical coverage but also shortens average length of prediction interval, and thus the proposed bootstrap prediction interval yields the best mean interval score. So as to find how much the bootstrap method can improve the normal and Laplace approaches, bootstrap improvement ratio (BIR) of the mean interval score is suggested. In cases of India, Italy and Russia, which have so long prediction intervals by the normal and Laplace approximations, their BIRs are evaluated as more than 0.7, which means that mean interval scores of the bootstrap prediction intervals can be enhanced by more than 70% from those of other two prediction intervals.

This research has contributions in three aspects: (i) This present work combines both growth rates and vaccination rates, for modeling and forecasting the COVID-19 pandemic, through a nonlinear type of HAR models. It provides results of well-fitting to actual daily COVID-19 confirmed cases and good criteria measures such as small values of RMSE and MAE. (ii) Forecasting accuracy performances of RMSE, MAE, MAPE and RRSE are evaluated with small values and, moreover, in behalf of more informative interval estimation, three types of prediction intervals are constructed and compared. (iii) Bootstrap method can improve prediction intervals not only by yielding the empirical coverage, close to the nominal coverage, but also by shortening average length and thus resulting in better mean interval score. Consequently, the combination of the HAR-G-V model and the bootstrap technique achieves an accurate prediction interval with the best mean interval score. An outline of this study is drawn in the flow chart of Fig. 1 .

Fig. 1.

Fig. 1

Flow chart for this research of HAR-G-V model.

The rest of the paper is organized as follows: In Section 2, model and data are presented. In Section 3 empirical results of estimation and prediction are explored. Conclusion and discussion are stated in Section 4.

2. Model and data

2.1. HAR-G-V model for a triangular array

We first introduce the HAR-G-V model, which is used for a triangular array of observations. Suppose that a sample {X1,,XN} of size N is observed, and the dataset {Xt=XN,t:t=1,2,,N;NZ+} is a triangular array with mean zero and finite variance, but it is not necessarily stationary. In the expression XN,t, we omit subscript N and use Xt in the model just for notational simplicity.

The HAR-G-V model of order p, {Xt=XN,t:t=1,2,,N;NZ+}, is defined as follows:

Xt=α1Xt1(1)++αpXt1(p)+β1rt1(1)++βprt1(p)+γ1Vt1(1)+γ2Vt1(2)+ϵt (1)

where the HAR components Xt1(i) are given by

Xt1(i)=1hi(Xt1+Xt2++Xthi),i=1,...,p

with positive integers 1=h1<h2<<hp, and the growth rate components rt1(i) are given by log difference of shifted values:

rt1(1)=log(Xt1+cXt2+c),rt1(i)=log(Xt1+cXthi+c)i=2,...,p (2)

with some constant c>0, called a shift threshold, such that Xt+c>0 for all t=1,2,,N. The last two regressors Vt1(1) and Vt1(2) are used as the 1st and 2nd vaccination rates at time t1, respectively. Coefficients α1,,αp,β1, ,βp,γ1,γ2 are parameters to be estimated for model-fitting and forecasting, and {ϵt:tZ} is an i.i.d. sequence of random variables with mean zero and finite variance.

The HAR-G-V model of order p in (1) can be regarded as a nonlinear AR model of order hp with additional regression variables, which make the model a multiple regression form. As mentioned above, as explanatory variables, the model has moving averages and moving rates, both of which are functions of the observations {Xt1,Xthp}. In other words, we may write

Xt=f(Xt1,Xthp)+γ1Vt1(1)+γ2Vt1(2)+ϵt

for some real-valued function f(·) defined on hp-dimensional real space Rhp, i.e., f:RhpR is a composite of algebraic additions, divisions and logarithmic function, along with parameters α1,,αp,β1, ,βp,γ1,γ2 and c. The shift threshold c is estimated as any positive number only with condition Xt+c>0 for all t in the logarithm function. Instead of a general function f, we consider a specific semiparametric AR model with the form of (1).

In the HAR-G-V model (1), if β1==βp=0, then model (1) is denoted by HAR-V model, with only vaccination rates, and if γ1=γ2=0, then it is HAR-G model, with only growth rates. If all β1==βp=γ1=γ2=0, then model (1) is the HAR(p) model, which is used in the analysis of [35].

In the next section of an empirical study, the four models; HAR, HAR-V, HAR-G and HAR-G-V models of order p, are compared together, by applying eight datasets of COVID-19 cases. Order p is chosen as p=2,3, and lags (h1,,hp) are (1,7) for p=2 and (1,7,14) for p=3. According to Linton et al. [51] the incubation period of the COVID-19 is 14 to 21 days and so the recent work [35] of author dealt with (1,7,14) for p=3 and (1,7,14,21) for p=4. However, since this current work handles additionally growth rates and vaccination rates, (if we use p=4, then there are 10 coefficients in the multiple regression), and also since the results of order p=4 are expected to be similar to those of order p=3, we omit an empirical study for p=4.

2.2. COVID-19 data

This work considers the new daily COVID-19 confirmed cases in top eight affected countries with their growth rates and vaccination rates. The daily confirmed cases (7-day smoothed) and vaccination rates are taken from Our World in Data in https://ourworldindata.org/covid-vaccinations, (see Mathieu et al. [37]). The confirmed cases are applied by 7-day moving-averaging as in Ekinci [34] so as to smoothen the daily data. Eight countries: Brazil, France, Germany, India, Italy, Russia, UK and USA, are chosen to apply the HAR models and to construct the prediction intervals as a main goal. Statistics summaries of daily confirmed cases (7-day smoothed) are displayed in Table 1 along with those of their one-day, one-week and two-week growth rates.

Table 1.

Statistics of daily confirmed cases (7-day smoothed) and their growth rates; size N is the period from the starting date in 2021 to 08/24/2021; SD = standard deviation.

Brazil France Germany India Italy Russia UK USA
Starting 02/09 01/26 01/16 02/15 01/18 03/02 01/10 01/19
Size N 190 204 214 184 212 169 220 211
Daily confirmed cases Mean 56917.9 18765.2 8429.5 116687.6 9691.6 14028.4 14912.9 60591.1
SD 13699.6 10707.5 6303.9 117743.3 6890.7 6435.6 13467.5 40198.2
Min 28337.6 1840.6 538.0 12751.6 639.3 8016.4 1517.7 11299.3
Median 60438.8 20287.4 7864.8 50359.8 9430.5 9488.6 8662.2 57301.1
Max 77327.6 45641.6 21595.7 391232.0 23211.9 24490.6 48675.4 171103.9
Skewness 0.5867 0.0849 0.4389 1.2000 0.3819 0.5397 0.8334 0.8423
Kurtosis 0.7192 0.6760 1.0002 0.0274 1.0363 1.5331 0.5727 0.0264
One-day growth rate Mean 0.0027 0.0003 0.0033 0.0050 0.0031 0.0038 0.0018 0.0006
SD 0.0386 0.0815 0.0683 0.0428 0.0451 0.0163 0.0737 0.0460
Min 0.2419 0.3246 0.1898 0.1225 0.1636 0.0295 0.5146 0.1538
Median 0.0002 0.0017 0.0032 0.0013 0.0047 0.0010 0.0030 0.0086
Max 0.2165 0.2546 0.2958 0.1337 0.1284 0.0706 0.5197 0.1960
Skewness 0.5975 0.3307 0.02783 0.2650 0.4591 1.5086 0.1375 1.0606
Kurtosis 13.4556 2.4453 1.6424 0.4308 1.1372 2.6062 0.5964 2.9285
One-week growth rate Mean 0.0173 0.0057 0.0265 0.0353 0.0232 0.0309 0.0109 0.0035
SD 0.1298 0.3277 0.3176 0.2838 0.2847 0.1006 0.3320 0.2420
Min 0.3572 0.7531 0.7623 0.4066 0.5087 0.1171 0.9072 0.4321
Median 0.0070 0.0234 0.0027 0.0531 0.0147 0.0072 0.0481 0.0379
Max 0.2275 1.1725 0.5079 0.5356 0.7688 0.3528 0.6126 0.6937
Skewness 0.6097 0.8542 0.2229 0.2691 0.6975 1.5848 0.3378 0.7216
Kurtosis 0.2087 1.2571 0.7937 1.0847 0.3750 1.6817 0.9006 0.2771
Two-week growth rate Mean 0.0391 0.0141 0.0652 0.0722 0.0471 0.0693 0.0154 0.0005
SD 0.1889 0.6056 0.5886 0.5626 0.5492 0.1876 0.5574 0.4625
Min 0.4629 1.1577 1.2154 0.7770 0.9424 0.1296 1.2253 0.6608
Median 0.0260 0.0170 0.0161 0.0896 0.0907 0.0020 0.1433 0.1470
Max 0.3476 1.6405 0.8989 1.0268 1.3463 0.5938 1.0605 1.0772
Skewness 0.2535 0.7179 0.2469 0.2241 0.6688 1.4191 0.4948 0.7940
Kurtosis 0.4413 0.4999 1.0432 1.1951 0.0393 0.7863 1.0508 0.4554

Because the vaccine supply differs from country to country, the sample size N is different for each country. More specifically, the start date in the sample period is set as the first day of the 2nd vaccination, while the end date is all 08.24.2021. The size N in each country, shown in Table 1, is the length between the start date and 08.24.2021.

As introduced above, model (1) is assumed to have mean zero, and so we adopt Xt as standardized values of COVID-19 confirmed cases for each country. In other words, Xt=(Ctm^)/σ^, for t=1,2,,N, where Ct is daily COVID-19 confirmed case at time t, m^=m^N and σ^=σ^N are its sample mean and sample standard deviation, respectively. Note that {Xt=XN,t=(Ctm^)/σ^,t=1,,N} generates a triangular array with mean zero and variance one, (but not stationary).

With lags (1,7,14) for p=3, the growth rates rt1(i) in (2) are computed as one-day, one-week and two-week growth rates of daily numbers shifted by c. The shift threshold c may be chosen as some positive number so that Xt+c>0 for all t. Either c^=max|Xt|+ϵ, for any ϵ>0, or c^=m^/σ^ are possible. Here we select threshold c^=m^/σ^ because it implies that

rt1(i)=log(Xt1+c^Xthi+c^)=log(σ^Xt1+m^)log(σ^Xthi+m^)=log(Ct1Cthi)

which is log difference of original COVID-19 confirmed cases {Ct}. These growth rates are depicted in Fig. 2 , where we see one-day, one-week, two-week growth rates of the eight countries. Positive values during a time period imply that the numbers of COVID-19 confirmed cases increase during the period; for example, in India, during the period before 60 on time axis, growth rates are positive, which means new daily COVID-19 confirmed numbers increase, while during the period between 75 and 140, growth rates are negative, which means new daily COVID-19 numbers decrease. These facts are also shown on the plot of India in Fig. 5. In Russia, between 80 and 110 on time axis, growth rates are positive, which indicates increasing COVID-19 during the period, as seen on the plot of Russia in Fig. 5. In Fig. 3 , the 1st and 2nd vaccination rates of the eight countries are shown, until date 08.24.2021. India and Russia have low vaccination rates, whereas Italy, UK and USA have high rates. Data in Figs. 2 and 3 are used as regressor variables in the HAR-G-V model of a triangular array.

Fig. 2.

Fig. 2

Growth rates of daily COVID-19 confirmed cases (7-day smoothed).

Fig. 5.

Fig. 5

Daily confirmed cases (7-day smoothed) and HAR-G-V model fitting with its residuals.

Fig. 3.

Fig. 3

Vaccination rates.

A main characteristic of the HAR model is the long memory, which means that historical observation has a persistent impact on future value. The long memory can be identified by autocorrelation coefficient functions (ACFs) with algebraically decay rates. In order to see whether the COVID-19 data are suitable for the HAR model, their sample ACFs are illustrated in Fig. 4 . The COVID-19 confirmed cases have the long-memory effects as seen in Fig. 4, where the sample ACFs of the confirmed cases are depicted for the eight countries. The sample ACFs are decreasing very slowly. This feature leads us to adopt the HAR models for modeling the COVID-19 pandemic time series. In the next section, we discuss empirical results of the COVID-19 confirmed cases with their estimation and forecasting through the HAR models with/without the growth rates and vaccination rates. Because the proposed models are a type of multiple linear regression model, which is popular and standard in statistical analysis, and also because earlier works [43], [44] dealt with a simulation study in the HAR model, in this work, the simulation is omitted and real data analysis only is discussed.

Fig. 4.

Fig. 4

Sample autocorrelation coefficient function (ACF) of daily COVID-19 confirmed cases (7-day smoothed).

3. Empirical study

3.1. Parameter estimation

This section presents the empirical results of the proposed models with daily confirmed cases and vaccination rates. Four models; HAR, HAR-V, HAR-G and HAR-G-V models, are compared by reporting some criteria measures of the parameter estimation. Additionally, these results are compared with those by popular ARIMA models.

Suppose the observations of size N are relabelled by {Xhp+1,,X1,X0,X1,,Xn}, where n is chosen so that N=n+hp, with hp=7, 14, for p=2, 3, respectively. We use the least squares method that minimizes the sum of squared errors (SSE):

SSE=t=1n(Xtα1Xt1(1)αpXt1(p)β1rt1(1)βprt1(p)γ1Vt1(1)γ2Vt1(2))2.

The regression model in (1) is written as Xt=θYt1+ϵt where θ=(α1,,αp,β1,,βp,γ1,γ2) and Yt1=(Xt1(1),,Xt1(p),rt1(1),,rt1(p),Vt1(1),Vt1(2)). The ordinary least squared (OLS) estimator θ^ is obtained by minimizing the SSE and given as

θ^=argminθSSE=(t=1nYt1Yt1)1(t=1nYt1Xt).

To see the performance of the OLS regression, we compute criteria measures such as root mean square error (RMSE), mean absolute error (MAE), R2, AIC and BIC.

Let ϵ^t=Xtθ^Yt1. Using the residuals {ϵ^t:t=1,,n}, RMSE and MAE are given by

RMSE=(1nt=1nϵ^t2)1/2,MAE=1nt=1n|ϵ^t|.

Also, R2, AIC and BIC are given by

R2=1t=1nϵ^t2t=1n(XtX¯)2,AIC=2LL+2τ,BIC=2LL+τlog(n)

where LL is the log-likelihood function and τ is the number of parameters.

For the HAR(-G-V) models with order p=2,3, these performance measures in the eight countries are reported in Tables 2 and 3 . In order to compare the results with existing model, Table 4 gives the performances by ARIMA models. In the columns of RMSE and MAE of Table 2, Table 3, Table 4, the values in the parentheses are the errors multiplied by the standard deviations of the daily confirmed cases in each country, where the standard deviations are given in Table 1. Because HAR and HAR-V models have fewer parameters, AIC and BIC values in these models are somewhat smaller than those in the HAR-G and HAR-G-V models.

Table 2.

Criteria of HAR(p) models with/without growth rates and vaccination rates; Brazil, France, Germany and India.

RMSE MAE R2 AIC BIC
Brazil p=2 HAR 0.14876(2038.06) 0.09671(1324.91) 0.979 167.2 160.9
HAR-V 0.14634(2004.85) 0.09508(1302.64) 0.980 169.0 156.3
HAR-G 0.14782(2025.06) 0.09731(1333.16) 0.979 165.5 152.8
HAR-G-V 0.14493(1985.55) 0.09363(1282.75)* 0.980 168.4 149.4
p=3 HAR 0.14864(2036.32) 0.09726(1332.36) 0.979 165.5 156.0
HAR-V 0.14582(1997.74) 0.09575(1311.75) 0.980 168.3 152.4
HAR-G 0.14698(2013.58) 0.09684(1326.60) 0.979 163.5 144.5
HAR-G-V 0.14473(1982.75)* 0.09437(1292.96) 0.980 164.9 139.5
France p=2 HAR 0.18192(1948.01) 0.09762(1045.33) 0.969 104.4 97.88
HAR-V 0.17893(1915.97) 0.09195(984.65) 0.970 106.7 93.69
HAR-G 0.17065(1827.26) 0.09503(1017.59) 0.973 124.7 111.7
HAR-G-V 0.17042(1824.78) 0.09471(1014.13) 0.973 121.2 101.7
p=3 HAR 0.17907(1917.40) 0.09266(992.16) 0.970 108.4 98.65
HAR-V 0.17779(1903.79) 0.08936(956.84)* 0.971 107.1 90.87
HAR-G 0.17061(1826.83) 0.09438(1010.62) 0.973 120.8 101.3
HAR-G-V 0.17040(1824.58)* 0.09428(1009.51) 0.973 117.3 91.27
Germany p=2 HAR 0.10697(674.34) 0.05891(371.62) 0.989 322.5 315.9
HAR-V 0.10508(662.44) 0.05684(358.35) 0.989 325.6 312.4
HAR-G 0.10205(643.37) 0.05991(377.65) 0.990 337.3 324.1
HAR-G-V 0.10146(639.62) 0.05940(374.49) 0.990 335.6 315.9
p=3 HAR 0.10563(665.93) 0.05684(358.29) 0.989 325.5 315.6
HAR-V 0.10475(660.37) 0.05563(350.72)* 0.989 324.9 308.4
HAR-G 0.10194(642.59) 0.05952(375.18) 0.990 333.8 314.0
HAR-G-V 0.10124(638.23)* 0.05919(373.15) 0.990 332.5 306.1
India p=2 HAR 0.01323(1558.30) 0.00919(1082.32) 1.000 984.0 977.8
HAR-V 0.01303(1533.87) 0.00896(1055.56) 1.000 985.4 972.9
HAR-G 0.01318(1552.68) 0.00922(1085.87) 1.000 981.3 968.7
HAR-G-V 0.01299(1530.08) 0.00903(1062.89) 1.000 982.2 963.4
p=3 HAR 0.01186(1396.15) 0.00783(921.56) 1.000 1019 1010
HAR-V 0.01181(1390.10) 0.00776(913.84) 1.000 1017 1001
HAR-G 0.01139(1340.87) 0.00744(875.97) 1.000 1027 1008
HAR-G-V 0.01125(1324.41)* 0.00731(860.22)* 1.000 1027 1002

Table 3.

Criteria of HAR(p) models with/without growth rates and vaccination rates; Italy, Russia, UK and USA.

RMSE MAE R2 AIC BIC
Italy p=2 HAR 0.03619(249.42) 0.01897(130.77) 0.999 748.3 741.8
HAR-V 0.03608(248.61) 0.01882(129.71) 0.999 745.6 732.5
HAR-G 0.03565(245.71) 0.01891(130.32) 0.999 750.3 737.1
HAR-G-V 0.03553(244.85) 0.01881(129.64) 0.999 747.7 727.9
p=3 HAR 0.03613(248.97) 0.01888(130.15) 0.999 747.0 737.2
HAR-V 0.03603(248.27) 0.01866(128.57)* 0.999 744.2 727.7
HAR-G 0.03541(244.01) 0.01887(130.06) 0.999 749.0 729.3
HAR-G-V 0.03530(243.28)* 0.01871(128.96) 0.999 746.2 719.9
Russia p=2 HAR 0.01644(105.83) 0.01193(76.79) 1.000 829.5 823.4
HAR-V 0.01628(104.77) 0.01203(77.43) 1.000 828.6 816.5
HAR-G 0.01585(101.98) 0.01218(78.38) 1.000 827.0 824.8
HAR-G-V 0.01548(101.35) 0.01224(78.79) 1.000 834.9 816.7
p=3 HAR 0.01629(104.84) 0.01198(77.07) 1.000 803.5 821.3
HAR-V 0.01615(103.98) 0.01207(77.68) 1.000 829.0 813.8
HAR-G 0.01534(98.74) 0.01184(76.21)* 1.000 843.0 824.7
HAR-G-V 0.01530(98.50)* 0.01188(76.49) 1.000 839.8 815.4
UK p=2 HAR 0.03639(490.06) 0.02126(286.43) 0.998 776.6 769.9
HAR-V 0.03526(474.89) 0.02144(288.81) 0.999 785.5 772.2
HAR-G 0.03612(486.43) 0.02111(284.39) 0.998 775.6 762.3
HAR-G-V 0.03435(462.64) 0.02109(284.11) 0.999 792.3 772.3
p=3 HAR 0.03288(442.83) 0.01987(267.61) 0.999 816.3 806.3
HAR-V 0.03144(423.47) 0.01976(266.07) 0.999 830.7 814.1
HAR-G 0.03165(426.26) 0.01932(260.16) 0.999 826.0 806.1
HAR-G-V 0.03081(414.95)* 0.01927(259.54)* 0.999 833.1 806.5
USA p=2 HAR 0.03952(1588.63) 0.02708(1088.51) 0.998 709.9 703.4
HAR-V 0.03914(1573.53) 0.02707(1088.02) 0.998 709.7 696.6
HAR-G 0.03897(1566.44) 0.02712(1090.36) 0.998 711.5 698.3
HAR-G-V 0.03862(1552.69) 0.02700(1085.47) 0.998 710.9 691.2
p=3 HAR 0.03912(1572.87) 0.02690(1081.42) 0.998 711.9 702.0
HAR-V 0.03886(1562.19) 0.02692(1082.19) 0.998 710.5 694.1
HAR-G 0.03843(1544.88) 0.02624(1054.95)* 0.998 712.9 693.2
HAR-G-V 0.03821(1536.29)* 0.02633(1058.66) 0.998 711.1 684.9

Table 4.

Comparison with ARIMA models.

RMSE MAE R2 AIC BIC
Brazil ARIMA(1,1,1) 0.14527(1990.21) 0.09655(1322.72) 0.979 186.9 177.23
France ARIMA(1,1,1) 0.17485(1872.25) 0.10083(1079.63) 0.969 125.8 115.9
Germany ARIMA(1,2,1) 0.10476(660.43) 0.05947(374.92) 0.989 347.0 336.9
India ARIMA(1,1,1) 0.011297(1330.12) 0.00671(789.64)* 1.000 1112.7 1103.1
Italy ARIMA(1,2,1) 0.03474(239.36)* 0.01829(126.06)* 0.999 808.6 798.6
Russia ARIMA(1,2,1) 0.01615(103.95) 0.01252(80.57) 1.000 897.9 888.6
UK ARIMA(1,2,0) 0.02763(370.39)* 0.01674(224.51)* 0.999 942.2 935.4
USA ARIMA(1,2,1) 0.03854(1549.26) 0.02661(1069.80) 0.998 760.9 750.9

As for RMSE in Tables 2 and 3, HAR-V models have smaller errors than HAR models, and also HAR-G-V models have smaller errors than HAR-G models. It implies that vaccination rates affect significantly the model identification. In Tables 2 and 3, bold numbers are the smallest errors in the given order in each country, whereas mark * indicates the best one among the two smallest errors in both orders p=2,3 in each country. All eight countries have the smallest values of RMSE in HAR-G-V model of order 3. Consequently, HAR-G-V model of order p=3 is the most suitable model in the sense of RMSE. As for MAE in Tables 2 and 3, Brazil, India and UK have the smallest errors in HAR-G-V model of order 3. In other countries, we see that the models with vaccination rates, HAR-V or HAR-G-V, are better than the models without them, HAR or HAR-G, respectively, with respect to MAE. Table 4 presents empirical results of ARIMA(p,d,q) models for the COVID-19 data. RMSE, MAE, R2, AIC and BIC by our proposed HAR(-G-V) models are compared with those in Table 4. Orders (p,d,q) of ARIMA(p,d,q) models are selected through a unit-root test, i.e., ADF test, as well as ACF and PACF plots, and through consideration of smallest RMSE and MAE. Bold numbers in Table 4 of ARIMA models indicate smaller errors than HAR(-G-V) model errors in each country. Most cases demonstrate that HAR(-G-V) models fit with the COVID-19 data better than the ARIMA models. To see the model-fitting of HAR-G-V models, Fig. 5 depicts the actual daily COVID-19 confirmed cases and HAR-G-V models along with their residuals. For all the eight countries, HAR-G-V models are well-matched with the actual data. It is concluded that HAR-G-V model of order 3 is the best one with reduced errors, compared to other models. Therefore, in the next subsection of prediction, we adopt the HAR-G-V model of order 3 and discuss prediction intervals by approximating the residual distribution.

3.2. Prediction analysis

Now we investigate the forecasting performance by the HAR-G-V model of order 3, applied to the daily COVID-19 confirmed cases. For the first prediction result, one-, two-, three-step ahead forecasts are computed and their forecasting performance accuracy is reported. In order to provide the forecasting analysis, m rolling windows are used. Out of the given observations of size N, we take two subsamples; in-sample and out-of-sample. To assess the k-step ahead forecast, (k=1,2,), let n¯=Nmk, and let Ai(k)={Xi,Xi+1,,Xi+n¯} with relabelled observations, (i=1,2,,m). Ai(k) is the in-sample used to estimate parameters. The k-step ahead predicted value X^i+n¯+k is computed from Ai(k), (i=1,2,,m), and accuracy measures of RMSE, MAE, MAPE and RRSE are evaluated along with observation Xi+n¯+k in the out-of-sample as follows:

Root Mean Square Error (RMSE):

RMSEk=(1mi=1m(X^i+n¯+kXi+n¯+k)2)1/2.

Mean Absolute Error (MAE):

MAEk=1mi=1m|X^i+n¯+kXi+n¯+k|.

Mean Absolute Percentage Error (MAPE):

MAPEk=100mi=1m|X^i+n¯+kXi+n¯+kXi+n¯+k|.

Root Relative Square Error (RRSE):

RRSEk=(i=1m(X^i+n¯+kXi+n¯+k)2i=1m(X¯kXi+n¯+k)2)1/2,

where X¯k=1mi=1mXi+n¯+k.

In Table 5 , RMSE, MAE, MAPE and RRSE for the k-step ahead forecasts, (k=1,2,3), are evaluated with size m=30. The values in the parentheses on the columns of RMSE and MAE indicate the errors multiplied by the standard deviations of in-sample observations. As expected, it is shown in Table 5 that all the errors are reasonably small, and the farther step ahead, the larger errors.

Table 5.

Forecasting performance of one-, two-, three-step ahead (k=1,2,3) forecasts in HAR-G-V model with size 30.

k RMSE MAE MAPE RRSE
Brazil 1 0.20620(2824.94) 0.11714(1604.88) 4.35666 0.45181
2 0.31809(4357.75) 0.22947(3143.68) 8.57290 0.68883
3 0.37590(5149.75) 0.29034(3977.58) 11.25078 0.85273
France 1 0.05194(556.22) 0.04302(460.71) 2.19886 0.20366
2 0.07343(786.35) 0.06043(647.13) 2.99611 0.33470
3 0.09759(1045.04) 0.07635(817.59) 3.71851 0.52219
Germany 1 0.02556(161.13) 0.02103(132.58) 5.51255 0.12122
2 0.02581(162.73) 0.02124(133.93) 4.82855 0.11271
3 0.05738(361.72) 0.04367(275.31) 11.58729 0.23259
India 1 0.01886(2220.95) 0.01117(1315.23) 3.22558 0.71244
2 0.03623(4265.88) 0.02432(2864.33) 7.00961 1.28780
3 0.05698(6709.39) 0.03789(4462.14) 10.89532 1.88765
Italy 1 0.01446(99.64) 0.01134(78.14) 1.5562 0.12081
2 0.03627(249.95) 0.02876(198.19) 3.92301 0.33632
3 0.06402(441.16) 0.05005(344.91) 6.74034 0.65573
Russia 1 0.01622(104.42) 0.01343(86.45) 0.39060 0.11079
2 0.02954(190.16) 0.02418(155.64) 0.71247 0.20343
3 0.04626(297.73) 0.03795(244.27) 1.13091 0.31719
UK 1 0.06500(871.42) 0.04350(583.18) 1.78073 0.16298
2 0.15677(2101.66) 0.10299(1380.77) 4.32968 0.47361
3 0.27381(3670.67) 0.18412(2468.32) 8.04029 1.02424
USA 1 0.08251(3316.91) 0.06854(2755.29) 3.23027 0.10518
2 0.11920(4791.86) 0.09111(3662.80) 4.45301 0.15395
3 0.14519(5836.68) 0.11731(4715.98) 5.42386 0.19048

For the second prediction result, we construct 100(1α)% prediction intervals using the residual distributions. In the HAR-G-V model with estimated parameter vector θ^, the residuals {ϵ^t=Xtθ^Yt1,t=1,2,,n} are approximated by three approaches. The distributions of the residuals are plotted with histograms in Fig. 6 , where the normal and Laplace distributions by estimated mean and standard deviation of the residuals are shown as well. The 100(1α)% Box–Jenkins prediction interval by the normal approximation is constructed as μ^ϵ±zα/2σ^ϵ/n, with zα/2=1.96,1.28 for α=0.05,0.2, respectively, where μ^ϵ and σ^ϵ are the estimated mean and standard deviation of the residuals. However, as shown in Fig. 6, the normal density is not well fitted with distribution of the residuals. Hence we adopt the Laplace distribution for better fitting of the density. The Laplace distribution, which is well-known as two-sided exponential distribution, has the following probability density function with location parameter μ and scale parameter b>0: f(x|μ,b)=12bexp(|xμ|b). Its mean and variance are μ and 2b2, respectively. In Fig. 6, red curves indicate Laplace density functions with estimated parameters from the residuals. Using the Laplace distribution, 100(1α)% prediction intervals are constructed as μ^ϵ±b^ϵlog(α)/n with the estimated mean and scale parameter, μ^ϵ and b^ϵ=σ^ϵ/2. By rescaling these intervals, the one-step prediction intervals of the daily COVID-19 confirmed cases are constructed.

Fig. 6.

Fig. 6

Distribution of residuals by HAR-G-V models.

In Table 6 , for the last days, (=20,50,100), empirical coverage probability, average length and mean interval score of 80% prediction intervals are displayed. Moreover, those of 95% prediction intervals are given in Table 7 . The empirical coverage probability is the ratio of observations that lie in the prediction interval. The interval score is a scoring rule for prediction interval, which gives a penalty if observation misses the interval and rewards for a narrow interval, (see Chowell and Luo [32], Gneiting and Raftery [46]). The mean interval score (MIS) is defined by

MIS=1i=1[(UiLi)+2α(Lixi)I{xi<Li}+2α(xiUi)I{xi>Ui}] (3)

where Li and Ui are lower and upper bounds of the 100(1α)% prediction interval for level α, respectively, I{·} is the indicator function and xi is observation. If coverage probability is one, then the mean interval score is the same as the average length. Lower coverage probability yields more penalty as much as the last two terms in (3). For the case of Brazil in Tables 6 and 7, the normal and Laplace approximations address small coverage probabilities, not close to, but lower than the nominal coverage, and thus they make much larger (or worse) mean interval scores with high penalty, compared to the average lengths.

Table 6.

Coverage probability (CP), average length (AL) and mean interval score (MIS) of 80% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.

Normal

Laplace

Bootstrap
CP AL MIS CP AL MIS CP AL MIS
Brazil 20 0.7 2447 3790 0.65 2176 3924 0.85 3526 3749
50 0.6 2176 8631 0.54 1935 8887 0.76 3509 7780
100 0.53 2096 9145 0.46 1863 9475 0.72 3422 7926
France 20 1.0 2089 2089 1.0 1858 1858 1.0 2636 2636
50 0.9 2250 2588 0.88 2000 2472 0.9 2806 2985
100 0.66 2231 4910 0.62 1984 5105 0.81 3429 4244
Germany 20 1.0 1192 1192 1.0 1060 1060 1.0 1218 1218
50 1.0 1234 1234 0.94 1097 1126 1.0 1251 1251
100 0.77 1203 2022 0.66 1070 2073 0.84 1413 1861
India 20 1.0 24234 24234 1.0 21547 21547 0.95 2703 2706
50 1.0 26141 26141 1.0 23242 23242 0.78 2792 6798
100 1.0 31015 31015 1.0 27575 27575 0.76 3063 6000
Italy 20 1.0 1313 1313 1.0 1167 1167 1.0 359 359
50 1.0 1389 1389 1.0 1235 1235 0.82 368 486
100 1.0 1361 1361 1.0 1210 1210 0.82 390 599
Russia 20 1.0 1357 1357 1.0 1206 1206 0.7 253 335
50 1.0 1272 1272 1.0 1131 1131 0.74 251 366
100 0.78 787 1006 0.75 699 948 0.6 213 573
UK 20 1.0 2434 2434 1.0 2164 2164 0.8 688 979
50 0.86 2373 3235 0.82 2110 3190 0.56 569 4030
100 0.93 2413 2844 0.91 2146 2686 0.62 441 2495
USA 20 0.75 6952 9895 0.65 6181 10339 0.52 3245 14417
50 0.82 7186 9892 0.78 6389 9894 0.56 2871 13218
100 0.91 7597 8950 0.89 6755 8507 0.7 2663 8044

Table 7.

Coverage probability (CP), average length (AL) and mean interval score (MIS) of 95% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.

Normal

Laplace

Bootstrap
CP AL MIS CP AL MIS CP AL MIS
Brazil 20 0.9 3747 4584 0.95 4050 4465 1.0 8011 8011
50 0.74 3332 21773 0.78 3602 20856 0.96 8095 17968
100 0.69 3209 208 0.74 3469 22065 0.95 8137 16182
France 20 1.0 3200 3200 1.0 3458 3458 1.0 9449 9449
50 1.0 3445 3445 1.0 3723 3723 1.0 034 10034
100 0.82 3416 7901 0.87 3692 7342 1.0 10361 10361
Germany 20 1.0 1825 1825 1.0 1973 1973 1.0 3227 3227
50 1.0 1890 1890 1.0 2042 2042 1.0 3439 3439
100 0.89 1842 3190 0.9 1991 3054 0.99 3563 3602
India 20 1.0 37109 37109 1.0 40107 40107 1.0 6340 6340
50 1.0 40028 40028 1.0 43261 43261 0.94 5867 13861
100 1.0 47492 47492 1.0 51328 51328 0.94 5490 11140
Italy 20 1.0 2011 2011 1.0 2173 2173 1.0 869 869
50 1.0 2127 2127 1.0 2299 2299 1.0 981 981
100 1.0 2084 2084 1.0 2253 2253 0.98 1093 1156
Russia 20 1.0 2078 2078 1.0 2254 2254 0.95 437 478
50 1.0 1948 1948 1.0 2105 2105 0.92 433 525
100 0.88 1205 1716 0.91 1302 1771 0.80 355 1154
UK 20 1.0 3745 3745 1.0 4047 4047 0.95 1996 2007
50 0.94 3653 4344 0.96 3948 4393 0.78 1703 9220
100 0.97 3716 4061 0.98 4016 4239 0.83 1366 5315
USA 20 0.9 10645 11671 0.95 11505 11596 0.7 6284 25698
50 0.94 11004 14384 0.96 11892 14556 0.72 5264 24559
100 0.97 11634 13324 0.98 12573 13905 0.85 4835 14012

The Laplace distribution seems to fit the residual distributions better and to improve coverage probabilities of the 95% prediction intervals in Brazil, France, Germany and Russia (Table 7). Also, the Laplace distribution makes the mean interval scores better in most countries for the 80% prediction intervals (Table 6), whereas this is not true for the 95% prediction intervals (Table 7). However, average lengths of prediction intervals by both normal and Laplace distributions are large in some countries and so coverage probability is always equal to one. Too long intervals make always coverage one. Thus we need to improve prediction intervals with proper coverage probability close to the nominal coverage, as well as with shorter length. For this purpose, we consider the bootstrap procedure in the next subsection.

3.3. Bootstrap prediction interval

In order to assess the bootstrap forecast, we first generate a bootstrap sample of residuals. Using the residuals {ϵ^t:t=1,2,,n}, we define the empirical distribution F^ϵ(·) by

F^ϵ(x)=1nt=1nI(,ϵ^t)(x).

We draw an i.i.d. sequence {ϵ^t*:t=1,2,} from the empirical distribution F^ϵ, which is a bootstrap sample to be used in constructing a bootstrap prediction interval. One-step ahead bootstrap forecast is given by

X^n+1*X^n*(1)=θ^Yn+ϵ^1*.

An 100(1α)% bootstrap prediction interval is constructed by the α/2- and (1α/2)-bootstrap quantiles, qα/2* and q1α/2*, respectively, of the bootstrap sample {ϵ^t*:t=1,2,,B}, where B is the number of bootstrap replications. The 100(1α)% bootstrap prediction interval is given by

[X^n*(1)α/2,X^n*(1)1α/2]=[θ^Yn+qα/2*,θ^Yn+q1α/2*].

In Figs. 7 and 8 , the right columns depict the 95% bootstrap prediction intervals for the COVID-19 confirmed cases, where B=1000 is used to get the bootstrap quantiles, while the left columns illustrate the Box–Jenkins prediction intervals by the normal approximation. Because all the lengths by the Laplace approximation in Table 7 are larger than those by the normal approximation, plots of Laplace prediction intervals are omitted. The interval plots in Figs. 7 and 8 describe the numerical results in Table 7 with 95% one-step prediction intervals for the last 15 days. The left and right columns use the same scales on the vertical axis in each country so that the Box–Jenkins and bootstrap prediction intervals are compared with each other at sight. In most countries the prediction intervals include the actual values of the COVID-19 confirmed cases. On the left columns of Figs. 7 and 8, the prediction intervals by the Box–Jenkins are so large with coverage probability one in some countries. The lengths of the bootstrap prediction intervals on the right columns are not too large or too small because bootstrap distribution mimics the distribution of the residuals.

Fig. 7.

Fig. 7

One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Brazil, France, Germany and India.

Fig. 8.

Fig. 8

One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Italy, Russia, UK and USA.

To add some discussion of the bootstrap prediction intervals, we go back to Tables 6 and 7, whose last columns report results of the 80%, 95% bootstrap prediction intervals. In Tables 6 and 7 we see that the bootstrap procedure not only improves coverage probability but also makes better mean interval score of prediction intervals. In Brazil, France and Germany, the coverage probabilities are improved to the values close to the norminal coverage by enlarging the average length of the intervals. In India, Italy and Russia, the bootstrap prediction intervals are shortened with the improved coverage probability as well, which are in cases of so large average length by the normal and Laplace distributions. Consequently, the bootstrap technique produces the best mean interval scores: bold numbers in columns of MIS in Tables 6 and 7 indicate the best one among three scores of the normal, Laplace and bootstrap approaches.

Finally, in order to show how much the bootstrap method improves the normal and Laplace approximations, in the countries with the best mean interval scores in all three cases =20,50,100 of Table 7, i.e., in Brazil, India, Italy and Russia, the coverage probability and mean interval score of 80%, 95% prediction intervals are depicted in Fig. 9, Fig. 10, Fig. 11, Fig. 12 . Instead of =20,50,100 in Tables 6 and 7, =1,2,,50 are set on the horizontal axis in Fig. 9, Fig. 10, Fig. 11, Fig. 12. Using the last days, the empirical coverage probability and the mean interval score are computed and then the two measures are illustrated on the -axis of the horizon. Red graphs in the figures indicate the results by the bootstrap while green and blue ones are by the normal and Laplace, respectively. Red graphs in the coverage probability (left columns of Fig. 9, Fig. 10, Fig. 11, Fig. 12) are closer to the nominal coverage in both 80% and 95% intervals. In the plots of coverage probability, the grey horizontal lines in the middle lie on the nominal coverage. Coverage probabilities of the 80% bootstrap prediction intervals average to 0.8470, 0.8708, 0.9191 and 0.7717 in Brazil, India, Italy and Russia, respectively, whereas those of the normal and Laplace approximations are all one except for Brazil, which has the averages 0.7073 and 0.6564. Also, the averages of coverage probabilities of the 95% bootstrap prediction intervals are 0.9761, 0.9624, 1.0 and 0.9691 in the four countries, respectively, whereas those of others are all one, but for Brazil with 0.8543 and 8894.

Fig. 9.

Fig. 9

Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Brazil.

Fig. 10.

Fig. 10

Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for India.

Fig. 11.

Fig. 11

Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Italy.

Fig. 12.

Fig. 12

Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Russia.

As for the mean interval score (right columns of Fig. 9, Fig. 10, Fig. 11, Fig. 12), red graphs of the bootstrap are much smaller values than others except for a few values of Brazil. As shown in the figures, the Laplace approximation does not significantly improve the normal one. However, the bootstrap method considerably lowers and enhances the mean interval scores from both normal and Laplace approximations. To understand the bootstrap improvement with an evaluated measure, we suggest a bootstrap improvement ratio (BIR) of the mean interval score. The BIR is defined as

BIR(·)=AMIS(·)AMIS(Boot)AMIS(·)

where AMIS(·) is average of the mean interval scores by the underlying approach for =1,2,,50:

AMIS=150=150MIS

with MIS in (3), and Boot stands for the bootstrap. The BIR values are displayed in Table 8 , whose results are computed from the mean interval scores (right columns) of Fig. 9, Fig. 10, Fig. 11, Fig. 12. In the case of Brazil, it is reported that the BIR values of the 95% prediction intervals are not as good as those in others because the mean interval scores of the bootstrap are higher (or worse) for =1,2,,24, as seen in Fig. 9, than other two approaches. In India and Russia, BIR values are between 0.7 and 0.8, which means that even up to 70–80% of the scores are improved either from the normal to the bootstrap or from the Laplace to the bootstrap. In conclusion, we claim that the bootstrap technique can improve prediction intervals with having more proper empirical coverages (closer to nominal one) as well as shorter average lengths and thus better mean interval scores.

Table 8.

Bootstrap improvement ratio (BIR) of prediction interval (PI) on the mean interval scores in Fig. 9, Fig. 10, Fig. 11, Fig. 12.

Normal to Bootstrap


Laplace to Bootstrap
80% PI 95% PI 80% PI 95% PI
Brazil 0.0246 0.0101 0.0434 0.0172
India 0.8031 0.7132 0.7786 0.7346
Italy 0.7103 0.5754 0.6741 0.6071
Russia 0.7683 0.7781 0.7394 0.7947

4. Conclusion and discussion

The COVID-19 pandemic might continue over the next few years in spite of the vaccination supply all around the world. Health scientists and researchers will make more efforts to reverse the virus-contaminated Earth to the uncontaminated one again. For this purpose, more efficient models as well as more accurate predictions are needed to be reflected in social policies such as lockdown and social distancing. In the statistical aspect, rather than point-estimation of predicted values, interval-estimation such as prediction intervals can give more information with high probability. To this end, this study suggests a multiple regression with a form of nonlinear HAR along with growth rates and vaccination rates as its additional regressors, and aims to provide more accurate prediction intervals for the daily COVID-19 confirmed cases.

Model selections of the HAR models with/without the growth rates and vaccination rates are given by the criteria measures such as RMSE, MAE, R2, AIC and BIC. Our proposed HAR-G-V model addresses the smallest values of RMSE and MAE in top eight affected countries. We conclude that the HAR-G-V model is a suitable and useful tool for obtaining accurate prediction. For the forecasting, point-estimation of predicted values as well as interval-estimation through three kinds of prediction intervals are extensively provided. RMSE, MAE, MAPE and RRSE are evaluated for the forecast performance accuracy. In order to assess accurate prediction intervals, three approaches are carried out with the residuals of the HAR-G-V model. Normal and Laplace approximations, and bootstrap method are tried to construct prediction intervals. The last one improves prediction intervals with maximum 70–80% of mean interval score, because bootstrap procedure mimics the residual distribution.

This work has the following contributions: (i) A time series model collaborated with both growth rates and vaccination rates is proposed in order to forecast the daily COVID-19 confirmed cases. As a result, the proposed model yields smaller errors in RMSE and MAE. (ii) This work mainly focuses on the prediction intervals to be more informative, by establishing the 80%, 95% prediction intervals through the normal, Laplace and bootstrap approaches. All three prediction intervals produce good empirical coverage probabilities. (iii) The bootstrap technique can improve the mean interval score of prediction intervals. Thus we conclude that the HAR-G-V model combined with the bootstrap method provides accurate prediction intervals of the COVID-19 pandemic.

As a next step of the forecast analysis of the COVID-19, author plans to propose and study a HAR(-G-V) model with partially periodicity as mentioned as in the last section of Hwang and Yu [35]. This current work has handled the data of 7-day smoothed COVID-19 cases as in Ekinci [34]. Empirical results in Ekinci [34] and in this work can provide useful predictive information in deciding the COVID-19 social policies such as social distancing or pandemic relief money. It is because such a social policy is executed for a rather long period, specifically, much longer than 7 days. Due to irregular PCR tests from day to day, for example, rare on Sunday or frequent on Wednesday, the 7-day smoothed data analysis is reasonable and sufficient to understand the overall COVID-19 situation. However, as an interesting and challenging topic, author is interested in modeling the partially periodic COVID-19 cases. As seen in the patterns of COVID-19 data (unsmoothed) of USA, Germany, Brazil, India etc., there is a phenomenon such as strong oscillation with 7-day periodicity, which makes 7-day periodic ACFs. The oscillation seems to appear as the magnitude of the COVID-19 data is larger. In other words, if the magnitude becomes larger, then the oscillation becomes stronger; together with no oscillation on small magnitude. In order to reflect such a feature, partial oscillations not only having some truncation but also proportional to the magnitude might be more desirable than full oscillations. Such a time series modeling for the partially periodic COVID-19 cases will be a promising and challenging study. It will need a mathematical refinement or probability distributional complement (like [18]) as well as an experimental design along with the basis of the statistical inference.

CRediT authorship contribution statement

Eunju Hwang: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization, Investigation, Supervision, Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by National Research Foundation of Korea (NRF-2018R1D1A1B07048745).

References

  • 1.Benvenuto D., Giovanetti M., Vassallo L., Angeletti S., Ciccozzi M. Application of the ARIMA model on the COVID-2019 epidemic dataset. Date Brief. 2020;29:105340. doi: 10.1016/j.dib.2020.105340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Hernandez-Matamoros A., Fujita F., Hayashi T., Perez-Meana H. Forecasting of COVID19 per regions using ARIMA models and polynomial functions. Appl Soft Comput J. 2020;96:106610. doi: 10.1016/j.asoc.2020.106610. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Sahai A.K., Rath N., Sood V., Singh M. ARIMA modelling and forecasting of COVID-19 in top five affected countries. Diabetes Metab Syndr. 2020;14(5):1419–1427. doi: 10.1016/j.dsx.2020.07.042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Ceylan Z. Estimation of COVID-19 prevalence in Italy, Spain, and France. Sci Total Environ. 2020;729:138817. doi: 10.1016/j.scitotenv.2020.138817. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Kirbas I., Sozen A., Tuncer A.D., Kazancioglu F.S. Comparative analysis and forecasting of COVID-19 cases in various european countries with ARIMA, NARNN and LSTM approaches. Chaos, Solitons Fractals. 2020;138:110015. doi: 10.1016/j.chaos.2020.110015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ribeiro M.H.D.M., Silva R.G.D., Mariani V.C., Coelho L.D.S. Short-term forecasting COVID-19 cumulative confirmed cases: perspectives for Brazil. Chaos, Solitons Fractals. 2020;135:109853. doi: 10.1016/j.chaos.2020.109853. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Mustafa H.I., Fareed N.Y. 2nd Al-noor international conference for science and technology (NICST) 2020. COVID-19 cases in Iraq; forecasting incidents using Box–Jenkins ARIMA model; pp. 22–26. [Google Scholar]
  • 8.Soukhovolsky V., Kovalev A., Pitt A., Kessel B. A new modelling of the COVID 19 pandemic. Chaos, Solitons Fractals. 2020;139:110039. doi: 10.1016/j.chaos.2020.110039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Atangana A., Araz I.S. Modeling and forecasting the spread of COVID-19 with stochastic and deterministic approaches: Africa and Europe. Adv Differ Equ. 2021;57:2021. doi: 10.1186/s13662-021-03213-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Katoch R., Sidhu A. An application of ARIMA model to forecast the dynamics of COVID-19 epidemic in India. Global Bus Rev. 2021 doi: 10.1177/0972150920988653. [DOI] [Google Scholar]
  • 11.Chatterjee S., Asad A., Shayak B., Bhattacharya S., Alam S., Verma M.K. Evolution of COVID-19 pandemic: power-law growth and saturation. Int license. 2020 doi: 10.1101/2020.05.05.20091389. [DOI] [Google Scholar]
  • 12.Sanchez-Cabalero S., Selles M.A., Peydro M.A., Perez-Bernabeu E. An efficient COVID-19 prediction model validated with the cases of China, Italy and Spain: total or partial lockdowns? J Clin Med. 2020;9:1547. doi: 10.3390/jcm9051547. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Reis R.F., Quintela B.D.M., Campos J.D.O., Gomes J.M., Rocha B.M., Lobosco M., et al. Characterization of the COVID-19 pandemic and the impact of uncertainties, mitigation strategies, and underreporting of cases in South Korea, Italy, and Brazil. Chaos, Solitons Fractals. 2020;136:109888. doi: 10.1016/j.chaos.2020.109888. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Sarkar K., Khajanchi S., Nieto J.J. Modeling and forecasting the COVID-19 pandemic in India. Chaos, Solitons Fractals. 2020;139:110049. doi: 10.1016/j.chaos.2020.110049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Mohammed A.A., Ewees A.A., Fan H., Abd El Aziz M. Optimization method for forecasting confirmed cases of COVID-19 in China. J Clin Med. 2020;9:674. doi: 10.3390/jcm9030674. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Maleki M., Mahmoudi M.R., Heydari M.H. Modeling and forecasting the spread and death rate of coronavirus (COVID-19) in the world using time series models. Chaos, Solitons Fractals. 2020;140:110151. doi: 10.1016/j.chaos.2020.110151. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Petropoulos F., Makridakis S., Stylianou N. COVID-19: forecasting confirmed cases and deaths with a simple time series model. Int J Forecast. 2021 doi: 10.1016/j.ijforecast.2020.11.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Khan N.M., Bakouch H.S., Soobhug A.D., Scotto M.G. Insights on the trend of the novel coronavirus 2019 series in some small island developing states: a thinning-based modelling approarch. Alexandria Eng J. 2021;60:2535–2550. [Google Scholar]
  • 19.Pitchaimani M., Brasanna Devi M. Stochastic probical strategies in a delay virus infection model to combat COVID-19. Chaos, Solitons Fractals. 2021;152:111325. doi: 10.1016/j.chaos.2021.111325. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Mehra A.H.A., Shafieirad M., Abbasi Z., Zamani I. Parameter estimation and predicition of COVID-19 epidemic teruning point and ending time of a case study on SIR/SQAIR epidemic models. Comput Math Methods Med. 2020 doi: 10.1155/2020/1465923. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Chaurasia V., Pal S. Application of machine learning time series analysis for prediction COVID-19 pandemic. Res Biomed Eng. 2020 doi: 10.1007/s42600-020-00105-4. [DOI] [Google Scholar]
  • 22.Salgotra R., Gandomi M., Gandomi A.H. Time series analysis and forecast of the COVID-19 pandemic in India using genetic programming. Chaos, Solitons Fractals. 2020;138:109945. doi: 10.1016/j.chaos.2020.109945. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Feroze N. Forecasting the patterns of COVID-19 and causal impacts of lockdown in top five affected countries using Bayesian structural time series models. Chaos, Solitons Fractals. 2020;140:110196. doi: 10.1016/j.chaos.2020.110196. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Prasanth S., Singh U., Kumar A., Tikkiwal V.A., Chong P.H.J. Forecasting spread of COVID-19 using google trends: a hybrid GWO-deep learning approach. Chaos, Solitons Fractals. 2021;142:110336. doi: 10.1016/j.chaos.2020.110336. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Balli S. Data analysis of COVID-19 pandemic and short-term cumulative case forecasting using machine learning time series methods. Chaos, Solitons Fractals. 2021;142:110512. doi: 10.1016/j.chaos.2020.110512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Kalantari M. Forecasting COVID-19 pandemic using optimal singular spectrum analysis. Chaos, Solitons Fractals. 2021;142:110547. doi: 10.1016/j.chaos.2020.110547. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Katris C. A time series-based statistical approach for outbreak spread forecating: application of COVID-19 in Greece. Expert Syst Appl. 2021;166:114077. doi: 10.1016/j.eswa.2020.114077. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Marmarelis V.Z. Predictive modeling of COVID-19 data in the US: adaptive phase-space approach. J Eng Med Biol. 2020:3008313. doi: 10.1109/OJEMB.2020.3008313. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Nikolopoulos K., Punia S., Schäfers A., Tsinopoulos C., Vasilakis C. Forecasting and planning during a pandemic: COVID-19 growth rates, supply chain disruptions, and governmental decisions. Eur J Oper Res. 2021;290:99–115. doi: 10.1016/j.ejor.2020.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Liu L., Moon H.R., Schorfheide F. Panel forecasts of country-level COVID-19 infections. J Econom. 2021;220:2–22. doi: 10.1016/j.jeconom.2020.08.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Luo X., Duan H., Xu K. A novel grey model based on traditional richards model and its application in COVID-19. Chaos, Solitons Fractals. 2021;142:110480. doi: 10.1016/j.chaos.2020.110480. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Chowell G., Luo R. Ensenble bootstrap methodology for forecasting dynamic growth processes using differential equations: application to epidemic outbreaks. BMC Med Res Methodol. 2021;21:34. doi: 10.1186/s12874-021-01226-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Bracher J., Wolffram D., Deuschel J., et al. A pre-registered short-term forecasting study of COVID-19 in Germany and Poland during the second wave. Nat Commun. 2021;12:5173. doi: 10.1038/s41467-021-25207-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Ekinci A. Modeling and forecasting of growth rate of new COVID-19 cases in top nine affected countries: considering conditional variance and asymmetric effect. Chaos, Solitons Fractals. 2021;151:111227. doi: 10.1016/j.chaos.2021.111227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Hwang E., Yu S. Modeling and forecasting the COVID-19 pandemic with heterogeneous autoregression approaches: South Korea. Results Phys. 2021;29:104631. doi: 10.1016/j.rinp.2021.104631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Tregoning J.S., Flight K.E., Higham S.L., et al. Progress of the COVID-19 vaccine effort: viruses, vaccines and variants versus efficacy, effectiveness and escape. Nat Rev Immunol. 2021;21:626–636. doi: 10.1038/s41577-021-00592-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Mathieu E., Ritchie H., Ortiz-Ospina E., et al. A global database of COVID-19 vaccinations. Nat Hum Behav. 2021;5:947–953. doi: 10.1038/s41562-021-01122-8. [DOI] [PubMed] [Google Scholar]
  • 38.Shim E. Projecting the impact of SARS-CoV-2 variants and the vaccination program on the fourth wave of the COVID-19 pandemic in South Korea. Int J Environ Res Public Health. 2021;18:7578. doi: 10.3390/ijerph18147578. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Cihan P. Forecasting fully vaccinated people against COVID-19 and examining future vaccination rate for herd immunity in the US, Asia, Europe, Africa, South America, and the world. Appl Soft Comput. 2021;111:107708. doi: 10.1016/j.asoc.2021.107708. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Higdon M.M., et al. A systematic review of COVID-19 vaccines efficacy and effectiveness against the SARS-CoV-2 infection and disease. medRxiv2021. Preprint. doi: 10.1101/2021.09.17.21263549. [DOI]
  • 41.Montgomery D.C., Peck E.A. 2nd ed. Wiley Series in Probability and Mathematical Statistics; 1991. Introduction to linear regression analysis. [Google Scholar]
  • 42.Corsi F. A simple approximate long-memory model of realized volatility. J Financ Econom. 2009;7:174–196. [Google Scholar]
  • 43.Hong W.T., Lee J., Hwang E. A note on the asymptotic normality theory of the least squares estimates in multivariate HAR-RV models. Mathematics. 2020;8:2083. [Google Scholar]
  • 44.Hwang E., Hong W.T. A multivariate HAR-RV model with heteroscedastic errors and its WLS estimation. Econ Lett. 2021;203:109855. [Google Scholar]
  • 45.Wilms I., Rombouts J., Croux C. Multivariate volatility forecasts for stock market indices. Int J Forecast. 2021;37:484–499. [Google Scholar]
  • 46.Gneiting T., Raftery A.E. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;477:359–378. [Google Scholar]
  • 47.Freedman D.A. Bootstrapping regression models. Ann Stat. 1981;9:1218–1228. [Google Scholar]
  • 48.Stine R.A. Estimating properties of autoregressive forecasts. J Am Stat Assoc. 1987;82:1072–1078. [Google Scholar]
  • 49.Thombs L.A., Schucany W.R. Bootstrap prediction intervals for autoregression. J Am Stat Assoc. 1990;85:486–492. [Google Scholar]
  • 50.Hong M., Hwang E. Bootstrap inference for network vector autoregression in large-scale social network. J Korean Stat Soc. 2021;50:1238–1258. doi: 10.1007/s42952-021-00115-7. [DOI] [Google Scholar]
  • 51.Linton N.M., Kobayashi T., Yang Y., Hayashi K., Akhmetzhanov A.R., Jung S., et al. Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: astatistical analysis of publicly available case data. J Clin Med. 2020;9:538. doi: 10.3390/jcm9020538. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Chaos, Solitons, and Fractals are provided here courtesy of Elsevier

RESOURCES