Abstract
Since the seminal paper by Bates and Granger in 1969, a vast number of ensemble methods that combine different base regressors to generate a unique one have been proposed in the literature. The resulting combined regressor may be more accurate than its components, but at the same time it may overfit, it may be distorted by base regressors with low accuracy, and it may be too complex to understand and explain. This paper proposes and studies a novel Mathematical Optimization model to build a sparse ensemble, which trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes regressors with a poor individual performance. Our approach is flexible enough to incorporate desirable properties one may require of the ensemble, such as controlling the performance of the ensemble in critical groups of records, or the costs associated with the base regressors involved in the ensemble. We illustrate our approach with real data sets arising in the COVID-19 context.
Keywords: Machine Learning, Ensemble Method, Mathematical Optimization, Selective Sparsity, COVID-19
1. Introduction
A plethora of methodologies of very different nature is currently available for predicting a continuous response variable, as is the case in regression as well as in time series forecasting. Those methodologies come mainly from Machine Learning, such as Support Vector Machines (Carrizosa & Romero Morales, 2013; Vapnik, 1995), Random Forests (Breiman, 2001), Optimal Trees (Bertsimas & Dunn, 2017; Blanquero, Carrizosa, Molero-Río, & Romero Morales, 2021; Carrizosa, Molero-Río, & Romero Morales, 2021), and Deep Learning (Gambella, Ghaddar, & Naoum-Sawaya, 2021); or from Statistics, such as Generalized Linear Models (Hastie, Tibshirani, & Wainwright, 2015), Semi- and Nonparametric approaches to regression, such as smoothing techniques (Härdle, 1990), Regression models for time series analysis (Kedem & Fokianos, 2005), or Random Effects models (Lee, Nelder, & Pawitan, 2018). Some of these techniques have shown a relatively high degree of success in COVID-19 time series forecasting (Benítez-Peña, Carrizosa, Guerrero, Jiménez-Gamero, Martín-Barragán, Molero-Río, Ramírez-Cobo, Romero Morales, & Sillero-Denamiel, 2020b; Nikolopoulos, Punia, Schäfers, Tsinopoulos, & Vasilakis, 2021), which is the application that has inspired this work.
In this way, the user has at hand a long list of fitted regression models, referred to in what follows as base regressors, and faces the problem of deciding which one to choose, or alternatively, how to combine (some of) the competing approaches, that is, how to build an ensemble. While a thorough computational study of the different models may help the user to identify the most convenient one, such an approach becomes unworkable when predicting new phenomena in real-time, like the evolution of the COVID-19 counts (confirmed cases, hospitalized patients, ICU patients, recovered patients, and fatalities). Here, the most accurate method will probably change over time since we are dealing with a dynamic setting, but also because of the non-stationarity of the data caused, for instance, by the different interventions of authorities to flatten the curve.
Hence, it may be more convenient to build an ensemble where some accuracy measure, such as a (cross-validation) estimate of the expected squared error or of the absolute error (Ando & Li, 2014; Bates & Granger, 1969), is optimized at each forecast origin. With this approach, other relevant issues can be modeled, such as sparsity in the feature space (Bertsimas, King, & Mazumder, 2016; Carrizosa, Mortensen, Romero Morales, & Sillero-Denamiel, 2020a; Carrizosa, Olivares-Nadal, & Ramírez-Cobo, 2017b; Fountoulakis & Gondzio, 2016), interpretability (Carrizosa, Nogales-Gómez, & Romero Morales, 2016; Carrizosa, Nogales-Gómez, & Romero Morales, 2017a; Carrizosa, Olivares-Nadal, & Ramírez-Cobo, 2020b; Martín-Barragán, Lillo, & Romo, 2014), critical values of features (Carrizosa, Martín-Barragán, & Romero Morales, 2010; Carrizosa, Martín-Barragán, & Romero Morales, 2011), measurement costs (Carrizosa, Martín-Barragán, & Romero Morales, 2008), or cost-sensitive performance constraints (Benítez-Peña, Blanquero, Carrizosa, & Ramírez-Cobo, 2019a; Benítez-Peña, Blanquero, Carrizosa, & Ramírez-Cobo, 2020a; Blanquero, Carrizosa, Ramírez-Cobo, & Sillero-Denamiel, 2020c). See Friese, Bartz-Beielstein, and Emmerich (2016), Mendes-Moreira, Soares, Jorge, and Sousa (2012), Ren, Zhang, and Suganthan (2016) and references therein for the role of mathematical optimization when constructing ensembles, and Friese, Bartz-Beielstein, Bäck, Naujoks, and Emmerich (2019) for the use of ensembles to enhance the optimization of black-box expensive functions.
In this paper, we propose an optimization approach to build a sparse ensemble. In contrast to existing proposals in the literature, our paper focuses on an innovative definition of sparsity, the so-called selective sparsity. Our goal is to build a sparse ensemble which takes into account the individual performance of each base regressor, in such a way that only good base regressors are allowed to take part in the ensemble. This is done with the aim of adapting to dynamic settings, such as COVID-19 counts, where the composition of the ensemble may change over time, but also to avoid the ensemble being distorted by base regressors with low accuracy or becoming too complex to understand and explain. Ours can be seen as a form of what Mendes-Moreira et al. (2012) call ensemble pruning, where the ensemble is constructed by using a subset of all available base regressors. The novelty of our approach resides in the fact that the selection of the subset and the weights in the ensemble are simultaneously optimized.
We propose a Mathematical Optimization model that trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes regressors with a poor individual performance. Our approach is flexible enough to incorporate desirable properties one may require of the ensemble, such as controlling the performance of the ensemble in critical groups of records, or the costs associated with the base regressors involved in the ensemble. Our data-driven approach is applied to short-term predictions of the evolution of COVID-19, as an alternative to model-based prediction algorithms as in Achterberg et al. (2020) and references therein.
The remainder of the paper is structured as follows. Section 2 formulates the Mathematical Optimization problem to construct the sparse ensemble. Theoretical properties of the optimal solution are studied, and how to accommodate some desirable properties on the ensemble is also discussed. Section 3 illustrates our approach with real data sets arising in the COVID-19 context, where one can see how the ensemble composition changes over time. The paper ends with some concluding remarks and lines for future research in Section 4.
2. The optimization model
This section presents the new ensemble approach. Section 2.1 describes the formulation of the model in terms of an optimization problem with linear constraints. Section 2.2 establishes the connection of the approach with the constrained Lasso (Blanquero, Carrizosa, Ramírez-Cobo, & Sillero-Denamiel, 2020c; Gaines, Kim, & Zhou, 2018) and derives some theoretical results on the solution. Finally, Section 2.3 considers some extensions of the model concerning the control of the set of base regressors or the control of the performance in critical groups.
2.1. The formulation
Let $\mathcal{R}$ be a finite set of base regressors for the response variable $y$. No restriction is imposed on the collection of base regressors $\mathcal{R}$. It may include a variety of state-of-the-art models and methodologies for setting their parameters and hyperparameters. It may even use alternative samples for training, for example where individuals are characterized by different sets of features. By taking convex combinations of the base regressors in $\mathcal{R}$ we obtain a broader class of regressors, namely

$$\Big\{ \textstyle\sum_{r \in \mathcal{R}} w_r\, r \;:\; \mathbf{w} = (w_r)_{r \in \mathcal{R}} \in \Delta \Big\},$$

where $\Delta$ is the unit simplex defined below. Throughout this section, vectors will be denoted with bold typesetting, e.g., $\mathbf{w}$.
The selection of one combined regressor from this class will be made by optimizing a function which takes into account two criteria. The first and fundamental criterion is the overall accuracy of the combined regressor, measured through a loss function $\ell$ defined on the class of regressors.
For each base regressor $r \in \mathcal{R}$, we assume its individual loss $\ell_r$ is given. This may simply be defined as $\ell_r = \ell(r)$, but other options are possible too, in which, for instance, $\ell$ and $\ell_r$ are both empirical losses, as in Section 2.2, but use different training samples.
With the second criterion, a selective sparsity is pursued to make the method more reluctant to choose base regressors with lower reliability, i.e., with higher individual loss $\ell_r$, thus reducing overfitting. To achieve this, we add a regularization term in which the weight of base regressor $r$, say $w_r$, is multiplied by its individual loss $\ell_r$. The selective sparse ensemble is obtained by solving the following Mathematical Optimization problem with linear constraints:

$$\min_{\mathbf{w} \in \Delta} \; \ell\Big(\sum_{r \in \mathcal{R}} w_r\, r\Big) + \lambda \sum_{r \in \mathcal{R}} w_r\, \ell_r, \tag{1}$$

where $\Delta$ is the unit simplex in $\mathbb{R}^{|\mathcal{R}|}$,

$$\Delta = \Big\{\mathbf{w} \in \mathbb{R}^{|\mathcal{R}|} \;:\; w_r \ge 0 \ \ \forall r \in \mathcal{R}, \ \ \sum_{r \in \mathcal{R}} w_r = 1\Big\},$$

and $\lambda \ge 0$ is a regularization parameter, which trades off the importance given to the loss of the ensemble regressor and to the selective sparsity of the base regressors used.
2.2. Theoretical results
In general, Problem (1) has a nonlinear objective function and linear constraints. For loss functions commonly used in the literature, we can rewrite its objective as a linear or a convex quadratic function while the constraints remain linear. Therefore, for these loss functions, Problem (1) is easily tractable with commercial solvers. In addition, and under some mild assumptions, we characterize the behavior of the optimal solution with respect to the parameter $\lambda$.
First, we will rewrite the second term in the objective function, so that the proposed model can be seen as a particular case of the constrained Lasso. As with Lasso models and their extensions, having a sparse model reduces the danger of overfitting.
Remark 1
The so-called selective norm in $\mathbb{R}^{|\mathcal{R}|}$ is defined as $\|\mathbf{w}\|_{\mathrm{sel}} = \sum_{r \in \mathcal{R}} \ell_r\, |w_r|$.
Since $w_r \ge 0$ for all $\mathbf{w} \in \Delta$, the objective function in Problem (1) can be written as $\ell\big(\sum_{r \in \mathcal{R}} w_r\, r\big) + \lambda\, \|\mathbf{w}\|_{\mathrm{sel}}$. With this, and for well-known losses $\ell$, Problem (1) can be seen as a constrained Lasso problem (Blanquero, Carrizosa, Ramírez-Cobo, & Sillero-Denamiel, 2020c; Gaines, Kim, & Zhou, 2018), in which a selective sparsity is sought, as opposed to a plain sparsity with as few nonzero coefficients as possible. □
Remark 2
Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be a training sample, in which each individual $i$ is characterized by its feature vector $\mathbf{x}_i$ and its response $y_i$. Let $\ell$ be the empirical loss of quantile regression (Koenker & Hallock, 2001), for a regressor $f$:

$$\ell(f) = \frac{1}{n}\sum_{i=1}^{n} \rho_\tau\big(y_i - f(\mathbf{x}_i)\big), \tag{2}$$

where $\rho_\tau(e) = e\,\big(\tau - \mathbb{I}(e < 0)\big)$ for some $\tau \in (0,1)$. Then, as in e.g. Koenker and Ng (2005), Problem (1) can be expressed as a linear program and thus efficiently solved with Linear Programming solvers. □
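To make Remark 2 concrete, the LP reformulation can be assembled explicitly. The sketch below is in Python with SciPy (an assumption; the paper itself works with R and Gurobi), splitting each residual into its positive and negative parts; `P` (the matrix of base-regressor predictions), `y`, `ell`, `tau` and `lam` are illustrative names.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_ensemble(P, y, ell, tau=0.5, lam=1.0):
    """Solve Problem (1) with the quantile-regression empirical loss (2) as an LP.

    P   : (n, R) array, P[i, r] = prediction of base regressor r on record i
    y   : (n,) array of responses
    ell : (R,) array of individual losses of the base regressors
    Decision variables are stacked as x = (w, u, v), with the residual split
    y_i - sum_r w_r P[i, r] = u_i - v_i and u_i, v_i >= 0.
    """
    n, R = P.shape
    # Objective: (1/n) * sum_i (tau*u_i + (1-tau)*v_i) + lam * sum_r w_r * ell_r
    c = np.concatenate([lam * ell, tau / n * np.ones(n), (1 - tau) / n * np.ones(n)])
    # Equality constraints: sum_r w_r = 1  and  P w + u - v = y
    A_eq = np.zeros((1 + n, R + 2 * n))
    A_eq[0, :R] = 1.0
    A_eq[1:, :R] = P
    A_eq[1:, R:R + n] = np.eye(n)
    A_eq[1:, R + n:] = -np.eye(n)
    b_eq = np.concatenate([[1.0], y])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (R + 2 * n), method="highs")
    return res.x[:R]  # optimal ensemble weights w
```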
Remark 3
Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be a training sample, in which each individual $i$ is characterized by its feature vector $\mathbf{x}_i$ and its response $y_i$. Let $\ell$ be the empirical loss of Ordinary Least Squares (OLS) regression for a regressor $f$, i.e.,

$$\ell(f) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(\mathbf{x}_i)\big)^2. \tag{3}$$

Hence, Problem (1) is a convex quadratic problem with linear constraints, which, by Remark 1, can be seen as a constrained Lasso. In particular, the results in Gaines, Kim, and Zhou (2018) apply, and thus, we can assert that, if the design matrix has full rank, then,

1. For any $\lambda \ge 0$, Problem (1) has a unique optimal solution $\mathbf{w}^\ast(\lambda)$.
2. The path of optimal solutions $\lambda \mapsto \mathbf{w}^\ast(\lambda)$ is piecewise linear in $\lambda$. □
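Similarly, with the OLS loss (3), Problem (1) is a convex quadratic program over the unit simplex. A minimal sketch under the same assumptions (Python with SciPy rather than the R/Gurobi stack used in Section 3); `P`, `y`, `ell` and `lam` are as in the previous sketch.

```python
import numpy as np
from scipy.optimize import minimize

def ols_ensemble(P, y, ell, lam=1.0):
    """Solve Problem (1) with the OLS empirical loss:
       min_w (1/n)*||y - P w||^2 + lam * sum_r w_r * ell_r,  w in the unit simplex."""
    n, R = P.shape

    def objective(w):
        resid = y - P @ w
        return resid @ resid / n + lam * (w @ ell)

    w0 = np.full(R, 1.0 / R)  # start at the uniform ensemble
    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * R, constraints=cons)
    return res.x

# Toy usage: two accurate and one poor base regressor; a large lam drops the poor one.
rng = np.random.default_rng(0)
y = rng.normal(size=50)
P = np.column_stack([y + 0.1 * rng.normal(size=50),
                     y + 0.1 * rng.normal(size=50),
                     rng.normal(size=50)])
ell = np.array([np.mean((y - P[:, r]) ** 2) for r in range(3)])
print(ols_ensemble(P, y, ell, lam=1.0))
```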
Under mild conditions on $\ell$, applicable in particular for the quantile and OLS empirical loss functions, we characterize the optimal solution of Problem (1) for large values of the parameter $\lambda$. Intuitively speaking, for $\lambda$ growing to infinity, the first term in the objective function becomes negligible, and thus we only need to solve the Linear Programming problem of minimizing $\sum_{r \in \mathcal{R}} w_r\, \ell_r$ in the simplex $\Delta$. This problem attains its optimum at one of the extreme points of the feasible region, i.e., at some vertex of $\Delta$ putting all its weight on a base regressor $r^\ast$ for which $\ell_{r^\ast} = \min_{r \in \mathcal{R}} \ell_r$. We formalize this intuition in the following proposition, where under the assumption of convexity of the loss we show that a finite value of $\lambda$ exists for which such a sparse solution is optimal. Before stating it, notice that, since the set $\mathcal{R}$ is given, we can define

$$\ell_{\min} = \min_{r \in \mathcal{R}} \ell_r = \ell_{r^\ast}$$

for some $r^\ast \in \mathcal{R}$ such that $\ell_{r^\ast} \le \ell_r$ for all $r \in \mathcal{R}$.
Proposition 1
Assume that the function $\mathbf{w} \mapsto \ell\big(\sum_{r \in \mathcal{R}} w_r\, r\big)$ is convex in an open convex set containing $\Delta$. Furthermore, assume that there exists a base regressor $r^\ast \in \mathcal{R}$ such that $\ell_{r^\ast} < \ell_r$ for all $r \in \mathcal{R}$, $r \neq r^\ast$. Then, there exists $\lambda_0 \ge 0$ such that, for any $\lambda \ge \lambda_0$, the vertex $\mathbf{e}_{r^\ast}$ of $\Delta$ is an optimal solution to Problem (1).
Proof. Let $r^\ast$ be as in the statement of the proposition, and let $\mathbf{e}_{r^\ast}$ denote the vector with 1 in its component corresponding to $r^\ast$ and 0 otherwise. Write $F(\mathbf{w}) = \ell\big(\sum_{r \in \mathcal{R}} w_r\, r\big)$ and $\boldsymbol{\ell} = (\ell_r)_{r \in \mathcal{R}}$. Since $F$ is convex in an open set containing $\Delta$, the subdifferential of the convex function $F$ at $\mathbf{e}_{r^\ast}$ is not empty. Let $\mathbf{s} \in \partial F(\mathbf{e}_{r^\ast})$ and let $N_\Delta(\mathbf{e}_{r^\ast})$ denote the normal cone of $\Delta$ at $\mathbf{e}_{r^\ast}$. Then,

$$\mathbf{0} \in \mathbf{s} + \lambda\,\boldsymbol{\ell} + N_\Delta(\mathbf{e}_{r^\ast}) \;\Longleftrightarrow\; s_{r^\ast} + \lambda\,\ell_{r^\ast} \,\le\, s_r + \lambda\,\ell_r \quad \forall r \in \mathcal{R}, \tag{4}$$

which is satisfied iff

$$\lambda \;\ge\; \max_{r \in \mathcal{R},\, r \neq r^\ast} \; \frac{s_{r^\ast} - s_r}{\ell_r - \ell_{r^\ast}}. \tag{5}$$

Setting $\lambda_0$ equal to the value on the right-hand side of (5), and taking into account that the condition on the left-hand side of (4) is sufficient for the optimality of $\mathbf{e}_{r^\ast}$ in Problem (1), the result follows. □
2.3. Extensions
Problem (1) can be enriched to address some desirable properties one may seek for the ensemble. Three of them are discussed in what follows. The first two properties relate to the transparency and interpretability of the ensemble (Deng, 2019; Florez-Lopez & Ramon-Jeronimo, 2015), while the third one relates to the performance of the ensemble in critical groups.
As mentioned in the introduction, the ensemble may contain base regressors built with several methodologies of very diverse nature. Therefore, one may want to control the number of methodologies used in the final ensemble. For instance, in the application described in Section 3, we consider four methodologies, namely, Support Vector Regression, Random Forests, Optimal Trees, and Linear Regression. Let $\mathcal{R} = \bigcup_{m=1}^{M} \mathcal{R}_m$, where $\mathcal{R}_m$ is the set of base regressors using methodology $m$, and let $\mathbf{w}_{(m)}$ be the corresponding subvector of $\mathbf{w}$, namely, the one containing the components in $\mathbf{w}$ referring to methodology $m$. With this, we can extend the objective function of Problem (1) to

(6)
In a similar fashion, one may want to control the set of features used by the ensemble. Let $\mathcal{R}^{(j)}$ be the set of base regressors using feature $j$, and let $\mathbf{w}^{(j)}$ be the corresponding subvector of $\mathbf{w}$, namely, the one containing the components in $\mathbf{w}$ referring to feature $j$. With this, we can extend the objective function of Problem (1) to

(7)
In both cases, the terms can be rewritten using new decision variables and linear constraints, and thus the structure of the problem is not changed. This way, if is the quantile regression (respectively, the Ordinary Least Squares) empirical loss, the optimization problem with objective as in (6) is written as a linear problem (respectively, as a convex quadratic problem with linear constraints). The same holds for the optimization problem with objective as in (7).
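The exact form of the group terms in (6) and (7) is not shown above; one choice consistent with the preceding remark, that these terms can be rewritten using new decision variables and linear constraints, is an $\ell_\infty$-type penalty on each methodology (or feature) subvector. The following sketch states this assumed form, with illustrative penalty weights $\mu_m$, for the methodology case.

```latex
% Assumed group term (one possible choice, not necessarily the paper's Eq. (6)):
% an l_infinity penalty on each methodology subvector w_(m), weighted by mu_m >= 0.
\min_{\mathbf{w}\in\Delta,\; t_1,\dots,t_M \ge 0}\;
   \ell\Big(\sum_{r\in\mathcal{R}} w_r\, r\Big)
   + \lambda \sum_{r\in\mathcal{R}} w_r\,\ell_r
   + \sum_{m=1}^{M}\mu_m\, t_m
\quad\text{s.t.}\quad
   w_r \le t_m \quad \forall\, m=1,\dots,M,\ \forall\, r\in\mathcal{R}_m .
% At an optimum, t_m = \max_{r\in\mathcal{R}_m} w_r = \|\mathbf{w}_{(m)}\|_\infty,
% so only linear constraints and variables are added and the problem class is unchanged.
```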
In addition, our approach can easily incorporate cost-sensitive performance constraints to ensure that we control not only the overall accuracy of the regressor, but also the accuracy on a number of critical groups, as in Benítez-Peña et al. (2019a), Benítez-Peña, Blanquero, Carrizosa, and Ramírez-Cobo (2019b), Blanquero et al. (2020) and Datta and Das (2015). With this, if $\tau_g$ denotes the threshold on the loss for critical group $g$, we can add to the feasible region of Problem (1) the constraints

$$\ell^{(g)}\Big(\sum_{r \in \mathcal{R}} w_r\, r\Big) \;\le\; \tau_g \quad \text{for each critical group } g, \tag{8}$$

where $\ell^{(g)}$ denotes the loss evaluated on the records of group $g$.
For the quantile and Ordinary Least Squares empirical loss functions, these constraints are linear or convex quadratic, respectively, and thus the optimization problems can be addressed with the very same numerical tools as before.
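Under the OLS loss, a constraint of type (8) is convex quadratic in $\mathbf{w}$ and can simply be appended to the solver sketch given after Remark 3. A minimal illustration in the same Python/SciPy setting; `idx_g` (the records of critical group $g$) and `tau_g` (its threshold) are hypothetical names.

```python
import numpy as np

# Group-wise version of constraint (8): mean squared error on group g below tau_g.
# SLSQP encodes inequality constraints as fun(w) >= 0.
def group_mse_constraint(P, y, idx_g, tau_g):
    P_g, y_g = P[idx_g], y[idx_g]
    return {"type": "ineq",
            "fun": lambda w: tau_g - np.mean((y_g - P_g @ w) ** 2)}

# Usage: cons = cons + (group_mse_constraint(P, y, idx_g, tau_g),) before minimize(...).
```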
3. Short-term predictions of the evolution of COVID-19
The purpose of this section is to illustrate how, thanks to the selective sparsity term in Problem (1), we can provide good ensembles in terms of accuracy. For this, we use data sets arising in the context of COVID-19.
3.1. The data
COVID-19 was first identified in China in December 2019 and subsequently started to spread broadly. Quickly after this, data started to be collected daily by the different countries. Several variables of interest, such as confirmed cases, hospitalized patients, ICU patients, recovered patients, and fatalities, among others, were considered. Different initiatives emerged around the world to better understand this new scenario.
In this section, we focus on the evolution of the pandemic in Spain and Denmark. The first cases were confirmed in Spain and Denmark in late February 2020 and early March 2020, respectively. In this paper, the considered variable of interest is the cumulative number of hospitalized patients in the regions of Andalusia (Spain) and Sjælland (Denmark). Figs. 1 and 2 display the data in the periods 10/03/2020-20/05/2020 for Andalusia and 06/03/2020-20/05/2020 for Sjælland, which can be found at the repositories in Fernández-Casal (2020) and Statens Serum Institut (2020), respectively.
Fig. 1.
Cumulative number of hospitalized patients in Andalusia (Spain) for COVID-19 in the period 10/03/2020–20/05/2020.
Fig. 2.
Cumulative number of hospitalized patients in Sjælland (Denmark) for COVID-19 in the period 06/03/2020–20/05/2020.
The univariate time series $\{y_t\}$, with $y_t$ representing the cumulative number of hospitalized patients in the region under consideration on day $t$, is converted into a multivariate series using seven lags. In other words, the data fed to the base regressors are not the time series itself, but the vectors of covariates $(y_{t-7}, \ldots, y_{t-1})$ and responses $y_t$ in Fig. 3. This training set is just one of the different options we have considered to create base regressors. In the next section, we discuss other data choices, which we will refer to as Country, Transformation and Differences.
Fig. 3.
Covariates (in parentheses) and response variable for the cumulative number of hospitalized patients in the region under consideration.
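The conversion in Fig. 3 is a standard lag embedding of the series into a supervised data set with seven lagged covariates. A minimal numpy sketch (illustrative code, not the authors' implementation):

```python
import numpy as np

def lag_embed(series, n_lags=7):
    """Turn a univariate series (y_1, ..., y_T) into covariate/response pairs:
    covariates (y_{t-7}, ..., y_{t-1}) and response y_t, for t = n_lags+1, ..., T."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([y[lag:len(y) - n_lags + lag] for lag in range(n_lags)])
    target = y[n_lags:]
    return X, target
```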
3.2. Options for feeding the data
We first discuss the Country data choice. Let $K$ be the number of regions of the country under consideration, and, without loss of generality, let us assume that the first one is the region under consideration. The time series for the remaining $K-1$ regions were also available. Such time series are correlated with the one under consideration. We had to decide whether to incorporate these additional time series in our forecasting model. If we do so, the feeding data contain the 7-uples in Fig. 3 from the region under consideration, as well as the ones from the other regions, see Fig. 4. We now move to the Transformation choice. For the two choices in Figs. 3 and 4, either the crude data are used or they are transformed using some standard Box-Cox transformations (Hastie, Tibshirani, & Wainwright, 2015). Finally, with respect to the Differences choice, we have also considered whether information about the monotonicity (first differences) and the curvature (second differences) of the lagged values is added to the feeding data as predictors, thus yielding 6 and 5 new predictors because of monotonicity and curvature, respectively.
Fig. 4.
Covariates (in parentheses) and response variable for the cumulative number of hospitalized patients in each of the regions of the country.
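One way to realize the Differences choice, consistent with the counts of 6 and 5 new predictors, is to append the first and second differences of the seven lagged covariates; a small sketch building on `lag_embed` above (again an illustrative assumption rather than the authors' code):

```python
import numpy as np

def add_differences(X):
    """Append first differences (monotonicity, 6 columns for 7 lags) and second
    differences (curvature, 5 columns) of the lagged covariates as extra predictors."""
    d1 = np.diff(X, n=1, axis=1)
    d2 = np.diff(X, n=2, axis=1)
    return np.hstack([X, d1, d2])
```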
To end this section, observe that the time series of the cumulative number of hospitalized patients in the region under consideration is, by nature, nondecreasing. However, some of the methodologies used in the next section to build base regressors do not guarantee such monotonicity. To ensure that the predictions show the monotonicity property present in the data, we use a transformed series as response variable instead of $y_t$. Once the procedure is completed, we undo this transformation to predict the original response variable $y_t$. Figs. 5 and 6 display the transformed series for Andalusia and Sjælland, respectively, where $y_t$ is as in Figs. 1 and 2.
Fig. 5.
Representation of the transformed response variable, where $y_t$ denotes the cumulative number of hospitalized patients in Andalusia for COVID-19 in the period 10/03/2020–20/05/2020.
Fig. 6.
Representation of the transformed response variable, where $y_t$ denotes the cumulative number of hospitalized patients in Sjælland for COVID-19 in the period 06/03/2020–20/05/2020.
3.3. The base regressors
We consider four base methodologies to build the set of base regressors $\mathcal{R}$. This includes three state-of-the-art Machine Learning tools, namely Support Vector Regression (SVR) (Carrizosa & Romero Morales, 2013), Random Forest (RF) (Breiman, 2001), and Sparse Optimal Randomized Regression Trees (S-ORRT) (Blanquero, Carrizosa, Molero-Río, & Romero Morales, 2020a), as well as the classic Linear Regression (LR). Each of them is fed each time with one of the data choices described in Section 3.2. See Table 1 for a description of the elements of $\mathcal{R}$ according to their methodology and the data choices. These methodologies have some parameters which must be tuned, and we explain below the tuning we have performed together with other computational details.
Table 1.
Description of the chosen base regressors according to the data choices on Country, Transformation and Differences and the four methodologies used, with tuning parameters as in Section 3.3.
(Table body not reproduced in this version: rows correspond to the data choices (Country No/Yes, the three Transformation options, and Differences Yes/No) and columns to the base regressors built with each of the four methodologies.)
To tune the parameters, the different base regressors are trained using all the available data except for the last four days, i.e., these models are trained on the data up to day $T-4$, where $T$ denotes the last day of the training period. The e1071 (Meyer, Dimitriadou, Hornik, Weingessel, & Leisch, 2019) and randomForest (Liaw & Wiener, 2002) R packages have been used for training SVR and RF, respectively, while the lm routine in R is used for LR. The computational details for training S-ORRT are those in Blanquero et al. (2020a). For SVR, we use the RBF kernel and perform a grid search for both parameters, cost and gamma. For RF, we fix one parameter and try out five random values for the other. If only information from the region under consideration is included (‘Country No’ data option in Table 1), eightfold cross-validation is used. However, when information from all regions in the country is included, we limit this to fivefold cross-validation, due to the small amount of data and the lack of observations in some regions. Such cross-validation estimates are used to select the best values of the parameters. With those best values, for each combination of feeding data and methodology, the base regressors are built using the information up to day $T-4$, see Fig. 7.
Fig. 7.
The timeline of building the base regressors, solving Problem (1) to obtain the sparse ensemble for a given value of $\lambda$, and making the out-of-sample predictions.
3.4. The pseudocode of the complete procedure
The complete procedure for making short-term predictions with our selective sparse ensemble methodology is summarized in Algorithm 1 and can be visualized in Fig. 7. The tradeoff parameter $\lambda$ in Problem (1) is varied over a grid of values. For the tests considered in this section, this grid is wide enough. On one extreme, we have included the trivial value $\lambda = 0$, for which the selective sparsity term does not play a role. On the other extreme, with this grid we ensure that $\lambda \ge \lambda_0$ is reached, for which, by Proposition 1, the ensemble shows the highest level of sparsity.
Algorithm 1.
Pseudocode for the complete procedure.
We start by training the base regressors in Table 1, with tuning parameters as in Section 3.3, using the data available up to day $T-4$. We then move to solve Problem (1) for the different values of $\lambda$ in the grid. For this, we have chosen the loss $\ell$ as in (3), where the sample consists of the data in the four days left out when tuning the base regressors, namely, days $T-3, \ldots, T$, while the individual losses $\ell_r$ are taken as the corresponding losses $\ell(r)$ of the base regressors on those same days. For each value of $\lambda$, we obtain the optimal weights $\mathbf{w}$ returned by Problem (1). With these weights, the final ensemble regressor is built using all the data up to day $T$, and this final ensemble regressor is used to make fourteen-day-ahead predictions for days $T+1, \ldots, T+14$.
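The listing of Algorithm 1 is not reproduced here; the sketch below condenses the steps just described, assuming the OLS loss used in this section. The helpers `train_base_regressors`, `predictions_on`, `retrain` and `forecast`, as well as the grid `LAMBDAS`, are hypothetical placeholders for the corresponding steps of the pipeline.

```python
import numpy as np

def sparse_ensemble_pipeline(data, T, LAMBDAS):
    """Sketch of the complete procedure for forecast origin T (hypothetical helpers)."""
    # 1. Tune and train the base regressors of Table 1 on the data up to day T - 4.
    base = train_base_regressors(data, last_day=T - 4)
    # 2. Evaluate them on the four held-out days to get the matrix of predictions P,
    #    the realized values y, and the individual losses ell_r.
    P, y = predictions_on(base, data, days=range(T - 3, T + 1))
    ell = np.array([np.mean((y - P[:, r]) ** 2) for r in range(P.shape[1])])
    forecasts = {}
    for lam in LAMBDAS:
        # 3. Solve Problem (1) with the OLS loss (see the QP sketch in Section 2.2).
        w = ols_ensemble(P, y, ell, lam)
        # 4. Rebuild the base regressors on all data up to day T and combine them
        #    with weights w to produce fourteen-day-ahead predictions.
        final = retrain(base, data, last_day=T, weights=w)
        forecasts[lam] = forecast(final, horizon=14)
    return forecasts
```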
The commercial optimization package Gurobi (Gurobi Optimization, 2018) has been used to solve the convex quadratic problems with linear constraints arising when solving Problem (1) with the loss in (3). Our experiments have been conducted on a PC with an Intel® Core i7-8550U CPU at 1.80GHz and 8 GB RAM, running a 64-bit operating system.
3.5. The numerical results
The out-of-sample prediction performance of our approach is illustrated in three training and testing splits, with all training periods starting on 10/03/2020 for Andalusia and on 06/03/2020 for Sjælland, and all testing periods containing 14 days. For Andalusia, we have 10/03/2020–03/04/2020 (Training Period 1) and 04/04/2020–17/04/2020 (Testing Period 1), 10/03/2020–14/04/2020 (Training Period 2) and 15/04/2020–28/04/2020 (Testing Period 2), and 10/03/2020–06/05/2020 (Training Period 3) and 07/05/2020–20/05/2020 (Testing Period 3). Similar periods are chosen for Sjælland, where all training periods start on 06/03/2020.
For each value of $\lambda$ in the considered grid, the fourteen-day-ahead predictions made by the ensemble, together with the realized values of the variable, can be found in Tables 2–7 for each period and region, while Tables 8 and 9 report the Mean Squared Error (MSE) and the Mean Absolute Error (MAE) over the fourteen days. In Tables 8 and 9, we highlight in bold the best MSE performance of the ensemble across all the values of $\lambda$ considered, and denote by $\lambda^\ast$ the value of the parameter where the minimum MSE is achieved. Note that, in this case, for each period and region combination, the best MAE is also achieved at $\lambda^\ast$. Figs. 14 and 15 present the weights of the base regressors in the ensembles as a function of $\lambda$ by means of heatmaps. The color bar of each heatmap transitions from white to black, where a darker color means a higher weight.
Table 2.
For each value of $\lambda$, fourteen-day-ahead predictions of the ensemble for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 1. Last row shows the actual values.
04/04 | 05/04 | 06/04 | 07/04 | 08/04 | 09/04 | 10/04 | 11/04 | 12/04 | 13/04 | 14/04 | 15/04 | 16/04 | 17/04 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4132 | 4337 | 4536 | 4713 | 4871 | 5020 | 5162 | 5297 | 5427 | 5554 | 5677 | 5799 | 5919 | 6038 |
4073 | 4233 | 4386 | 4527 | 4655 | 4776 | 4892 | 5005 | 5115 | 5225 | 5333 | 5442 | 5552 | 5662 | |
3985 | 4067 | 4146 | 4220 | 4290 | 4356 | 4419 | 4481 | 4541 | 4601 | 4659 | 4717 | 4775 | 4833 | |
3961 | 4021 | 4079 | 4132 | 4183 | 4231 | 4277 | 4321 | 4365 | 4407 | 4447 | 4488 | 4528 | 4568 | |
3960 | 4021 | 4078 | 4132 | 4183 | 4230 | 4277 | 4321 | 4364 | 4407 | 4447 | 4488 | 4527 | 4567 | |
3980 | 4064 | 4148 | 4228 | 4307 | 4385 | 4462 | 4537 | 4613 | 4688 | 4761 | 4835 | 4908 | 4981 | |
4014 | 4138 | 4265 | 4391 | 4518 | 4646 | 4776 | 4905 | 5035 | 5166 | 5295 | 5425 | 5555 | 5685 | |
4066 | 4246 | 4434 | 4628 | 4829 | 5037 | 5250 | 5468 | 5689 | 5911 | 6132 | 6351 | 6564 | 6772 | |
4121 | 4341 | 4579 | 4813 | 5040 | 5280 | 5514 | 5759 | 6001 | 6239 | 6482 | 6718 | 6957 | 7195 | |
4102 | 4302 | 4515 | 4722 | 4920 | 5127 | 5326 | 5534 | 5739 | 5938 | 6144 | 6343 | 6548 | 6754 | |
4106 | 4308 | 4524 | 4734 | 4935 | 5145 | 5348 | 5559 | 5767 | 5969 | 6178 | 6380 | 6588 | 6797 | |
4112 | 4320 | 4543 | 4760 | 4966 | 5183 | 5391 | 5609 | 5822 | 6030 | 6245 | 6453 | 6668 | 6883 | |
4125 | 4344 | 4581 | 4810 | 5028 | 5257 | 5477 | 5707 | 5934 | 6153 | 6381 | 6600 | 6827 | 7055 | |
4149 | 4390 | 4654 | 4908 | 5147 | 5401 | 5643 | 5898 | 6148 | 6390 | 6641 | 6883 | 7134 | 7386 | |
4149 | 4390 | 4654 | 4908 | 5147 | 5401 | 5643 | 5898 | 6148 | 6390 | 6641 | 6883 | 7134 | 7386 | |
Actual | 4107 | 4227 | 4335 | 4463 | 4599 | 4715 | 4808 | 4950 | 4993 | 5054 | 5147 | 5226 | 5298 | 5341 |
Table 3.
For each value of $\lambda$, fourteen-day-ahead predictions of the ensemble for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 2. Last row shows the actual values.
15/04 | 16/04 | 17/04 | 18/04 | 19/04 | 20/04 | 21/04 | 22/04 | 23/04 | 24/04 | 25/04 | 26/04 | 27/04 | 28/04 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5415 | 5602 | 5729 | 5849 | 5949 | 6031 | 6115 | 6186 | 6252 | 6321 | 6389 | 6457 | 6532 | 6610 |
5412 | 5598 | 5724 | 5843 | 5943 | 6025 | 6109 | 6179 | 6246 | 6315 | 6383 | 6451 | 6525 | 6603 | |
5411 | 5596 | 5722 | 5840 | 5939 | 6021 | 6105 | 6175 | 6241 | 6310 | 6378 | 6446 | 6520 | 6598 | |
5407 | 5590 | 5715 | 5832 | 5930 | 6011 | 6095 | 6165 | 6231 | 6300 | 6368 | 6436 | 6510 | 6587 | |
5346 | 5491 | 5597 | 5699 | 5787 | 5862 | 5939 | 6006 | 6070 | 6136 | 6201 | 6265 | 6334 | 6405 | |
5221 | 5294 | 5360 | 5420 | 5478 | 5534 | 5590 | 5644 | 5698 | 5751 | 5804 | 5856 | 5909 | 5961 | |
5219 | 5290 | 5359 | 5425 | 5490 | 5553 | 5616 | 5677 | 5738 | 5798 | 5857 | 5916 | 5975 | 6033 | |
5220 | 5290 | 5358 | 5424 | 5489 | 5551 | 5614 | 5675 | 5735 | 5794 | 5853 | 5911 | 5969 | 6024 | |
5220 | 5292 | 5361 | 5429 | 5495 | 5560 | 5624 | 5686 | 5749 | 5809 | 5870 | 5929 | 5989 | 6046 | |
5221 | 5293 | 5363 | 5431 | 5498 | 5563 | 5628 | 5691 | 5754 | 5815 | 5876 | 5936 | 5996 | 6054 | |
5221 | 5293 | 5363 | 5431 | 5498 | 5563 | 5628 | 5691 | 5754 | 5815 | 5876 | 5936 | 5996 | 6054 | |
5221 | 5293 | 5363 | 5431 | 5498 | 5563 | 5628 | 5691 | 5754 | 5815 | 5876 | 5936 | 5996 | 6054 | |
5221 | 5293 | 5363 | 5431 | 5498 | 5563 | 5628 | 5691 | 5754 | 5815 | 5876 | 5936 | 5996 | 6054 | |
Actual | 5226 | 5298 | 5341 | 5424 | 5473 | 5509 | 5565 | 5615 | 5675 | 5715 | 5747 | 5767 | 5792 | 5831 |
Table 4.
For each value of $\lambda$, fourteen-day-ahead predictions of the ensemble for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 3. Last row shows the actual values.
07/05 | 08/05 | 09/05 | 10/05 | 11/05 | 12/05 | 13/05 | 14/05 | 15/05 | 16/05 | 17/05 | 18/05 | 19/05 | 20/05 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6046 | 6059 | 6073 | 6090 | 6104 | 6117 | 6129 | 6139 | 6147 | 6156 | 6163 | 6169 | 6174 | 6179 |
6043 | 6054 | 6062 | 6067 | 6069 | 6069 | 6068 | 6063 | 6057 | 6047 | 6035 | 6020 | 6002 | 5982 | |
6043 | 6054 | 6062 | 6067 | 6069 | 6069 | 6068 | 6064 | 6057 | 6048 | 6035 | 6020 | 6003 | 5983 | |
6047 | 6063 | 6077 | 6089 | 6100 | 6110 | 6120 | 6128 | 6136 | 6142 | 6148 | 6152 | 6156 | 6159 | |
6039 | 6048 | 6055 | 6062 | 6069 | 6075 | 6082 | 6088 | 6094 | 6099 | 6104 | 6108 | 6112 | 6116 | |
6043 | 6054 | 6065 | 6074 | 6084 | 6092 | 6099 | 6106 | 6113 | 6120 | 6126 | 6131 | 6136 | 6141 | |
6045 | 6055 | 6066 | 6080 | 6098 | 6110 | 6116 | 6122 | 6131 | 6141 | 6151 | 6155 | 6159 | 6168 | |
6050 | 6056 | 6071 | 6091 | 6124 | 6144 | 6147 | 6151 | 6164 | 6180 | 6197 | 6199 | 6202 | 6218 | |
6049 | 6056 | 6070 | 6091 | 6125 | 6145 | 6148 | 6152 | 6165 | 6182 | 6199 | 6201 | 6204 | 6220 | |
6050 | 6056 | 6071 | 6093 | 6128 | 6148 | 6152 | 6156 | 6169 | 6187 | 6204 | 6207 | 6209 | 6226 | |
6050 | 6056 | 6071 | 6093 | 6129 | 6150 | 6153 | 6157 | 6170 | 6188 | 6206 | 6209 | 6211 | 6228 | |
6050 | 6056 | 6071 | 6094 | 6131 | 6153 | 6156 | 6159 | 6173 | 6192 | 6210 | 6212 | 6214 | 6232 | |
6051 | 6055 | 6071 | 6095 | 6135 | 6158 | 6161 | 6164 | 6178 | 6198 | 6218 | 6219 | 6221 | 6240 | |
6051 | 6054 | 6070 | 6098 | 6144 | 6170 | 6171 | 6172 | 6188 | 6210 | 6233 | 6234 | 6235 | 6256 | |
6052 | 6053 | 6070 | 6100 | 6152 | 6181 | 6181 | 6181 | 6198 | 6223 | 6248 | 6248 | 6248 | 6272 | |
Actual | 6038 | 6069 | 6080 | 6092 | 6101 | 6114 | 6128 | 6146 | 6161 | 6174 | 6178 | 6182 | 6196 | 6210 |
Table 5.
For each value of $\lambda$, fourteen-day-ahead predictions of the ensemble for the cumulative number of hospitalized patients in Sjælland for COVID-19 in Testing Period 1. Last row shows the actual values.
04/04 | 05/04 | 06/04 | 07/04 | 08/04 | 09/04 | 10/04 | 11/04 | 12/04 | 13/04 | 14/04 | 15/04 | 16/04 | 17/04 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 257 | 263 | 268 | 274 | 279 | 285 | 289 | 294 | 299 | 304 | 309 | 313 | 318 | 323 |
259 | 266 | 272 | 279 | 285 | 290 | 295 | 299 | 304 | 307 | 311 | 314 | 317 | 319 | |
259 | 266 | 273 | 279 | 285 | 291 | 296 | 301 | 305 | 308 | 312 | 316 | 318 | 321 | |
259 | 267 | 274 | 281 | 287 | 293 | 298 | 303 | 307 | 311 | 315 | 319 | 322 | 325 | |
260 | 266 | 274 | 279 | 285 | 290 | 295 | 300 | 304 | 308 | 312 | 316 | 319 | 322 | |
261 | 271 | 281 | 289 | 297 | 305 | 313 | 321 | 327 | 334 | 340 | 347 | 353 | 358 | |
262 | 271 | 282 | 290 | 298 | 307 | 315 | 323 | 329 | 336 | 343 | 350 | 356 | 362 | |
262 | 272 | 283 | 291 | 300 | 309 | 318 | 326 | 333 | 340 | 347 | 354 | 360 | 367 | |
262 | 272 | 284 | 292 | 300 | 309 | 318 | 326 | 333 | 340 | 347 | 354 | 361 | 367 | |
262 | 272 | 284 | 292 | 301 | 310 | 319 | 327 | 334 | 341 | 348 | 355 | 361 | 367 | |
263 | 273 | 285 | 293 | 302 | 311 | 320 | 328 | 335 | 342 | 349 | 356 | 362 | 368 | |
263 | 273 | 285 | 293 | 302 | 311 | 320 | 328 | 335 | 342 | 349 | 356 | 362 | 368 | |
263 | 273 | 285 | 293 | 302 | 311 | 320 | 328 | 335 | 342 | 349 | 356 | 362 | 368 | |
263 | 273 | 285 | 293 | 302 | 311 | 320 | 328 | 335 | 342 | 349 | 356 | 362 | 368 | |
263 | 273 | 285 | 293 | 302 | 311 | 320 | 328 | 335 | 342 | 349 | 356 | 362 | 368 | |
Actual | 257 | 262 | 272 | 280 | 285 | 292 | 299 | 304 | 309 | 316 | 324 | 335 | 341 | 351 |
Table 6.
For each value of $\lambda$, fourteen-day-ahead predictions of the ensemble for the cumulative number of hospitalized patients in Sjælland for COVID-19 in Testing Period 2. Last row shows the actual values.
15/04 | 16/04 | 17/04 | 18/04 | 19/04 | 20/04 | 21/04 | 22/04 | 23/04 | 24/04 | 25/04 | 26/04 | 27/04 | 28/04 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 335 | 346 | 356 | 367 | 378 | 392 | 405 | 417 | 430 | 444 | 459 | 471 | 481 | 493 |
334 | 345 | 355 | 365 | 376 | 389 | 401 | 413 | 424 | 438 | 452 | 463 | 473 | 484 | |
333 | 343 | 352 | 361 | 371 | 382 | 393 | 403 | 413 | 425 | 437 | 447 | 457 | 466 | |
331 | 339 | 346 | 353 | 361 | 369 | 376 | 384 | 391 | 400 | 408 | 416 | 424 | 431 | |
330 | 336 | 342 | 348 | 353 | 359 | 364 | 369 | 375 | 381 | 386 | 392 | 398 | 405 | |
330 | 336 | 342 | 347 | 353 | 359 | 365 | 370 | 376 | 382 | 388 | 393 | 399 | 405 | |
330 | 337 | 343 | 350 | 356 | 363 | 369 | 376 | 382 | 389 | 395 | 401 | 408 | 414 | |
330 | 337 | 343 | 349 | 356 | 362 | 369 | 375 | 382 | 388 | 395 | 400 | 407 | 414 | |
330 | 336 | 343 | 349 | 355 | 362 | 368 | 374 | 381 | 387 | 394 | 399 | 405 | 412 | |
330 | 336 | 342 | 348 | 355 | 361 | 367 | 373 | 379 | 385 | 391 | 396 | 403 | 409 | |
330 | 336 | 342 | 348 | 354 | 360 | 366 | 372 | 378 | 384 | 390 | 395 | 401 | 407 | |
330 | 336 | 342 | 348 | 354 | 360 | 366 | 372 | 378 | 384 | 390 | 395 | 401 | 407 | |
330 | 336 | 342 | 348 | 354 | 360 | 366 | 372 | 378 | 384 | 390 | 395 | 401 | 407 | |
Actual | 335 | 341 | 351 | 360 | 369 | 380 | 393 | 402 | 413 | 423 | 430 | 432 | 443 | 445 |
Table 7.
For each value of $\lambda$, fourteen-day-ahead predictions of the ensemble for the cumulative number of hospitalized patients in Sjælland for COVID-19 in Testing Period 3. Last row shows the actual values.
07/05 | 08/05 | 09/05 | 10/05 | 11/05 | 12/05 | 13/05 | 14/05 | 15/05 | 16/05 | 17/05 | 18/05 | 19/05 | 20/05 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 505 | 528 | 553 | 574 | 593 | 610 | 623 | 637 | 650 | 664 | 677 | 690 | 704 | 715 |
504 | 526 | 550 | 571 | 589 | 607 | 619 | 633 | 646 | 659 | 673 | 685 | 698 | 710 | |
503 | 525 | 548 | 568 | 586 | 603 | 616 | 629 | 642 | 655 | 668 | 680 | 693 | 704 | |
501 | 522 | 544 | 563 | 580 | 596 | 608 | 621 | 634 | 646 | 659 | 671 | 683 | 694 | |
495 | 516 | 539 | 557 | 573 | 588 | 602 | 616 | 629 | 643 | 657 | 671 | 684 | 697 | |
491 | 510 | 535 | 550 | 565 | 579 | 592 | 605 | 617 | 631 | 645 | 659 | 672 | 686 | |
488 | 503 | 523 | 535 | 547 | 558 | 568 | 579 | 588 | 600 | 611 | 623 | 634 | 645 | |
483 | 490 | 500 | 505 | 511 | 517 | 522 | 527 | 533 | 539 | 545 | 551 | 557 | 563 | |
483 | 487 | 491 | 494 | 498 | 501 | 505 | 508 | 511 | 515 | 518 | 522 | 525 | 528 | |
483 | 487 | 491 | 494 | 498 | 501 | 505 | 508 | 511 | 515 | 518 | 521 | 524 | 527 | |
483 | 487 | 490 | 494 | 498 | 501 | 504 | 508 | 511 | 514 | 518 | 521 | 524 | 527 | |
483 | 487 | 490 | 494 | 498 | 501 | 504 | 508 | 511 | 514 | 518 | 521 | 524 | 527 | |
483 | 487 | 490 | 494 | 498 | 501 | 504 | 508 | 511 | 514 | 518 | 521 | 524 | 527 | |
483 | 487 | 490 | 494 | 498 | 501 | 504 | 508 | 511 | 514 | 518 | 521 | 524 | 527 | |
483 | 487 | 490 | 494 | 498 | 501 | 504 | 508 | 511 | 514 | 518 | 521 | 524 | 527 | |
Actual | 483 | 485 | 488 | 490 | 491 | 491 | 493 | 495 | 502 | 503 | 503 | 505 | 507 | 510 |
Table 8.
For each value of $\lambda$, Mean Squared Error (MSE) and Mean Absolute Error (MAE) of the ensemble for Testing Periods 1, 2 and 3 in Andalusia. For each period, the best performance is highlighted in bold. Last row contains the MSE and MAE of the persistence model tested.
 | Testing Period 1 (04/04/2020–17/04/2020) || Testing Period 2 (15/04/2020–28/04/2020) || Testing Period 3 (07/05/2020–20/05/2020) ||
---|---|---|---|---|---|---
 | MSE | MAE | MSE | MAE | MSE | MAE
0 | 309188.29 | 532.71 | 174813.93 | 372.79 | 188.86 | 11.00 |
302755.21 | 526.93 | 22697.21 | 120.07 | 12713.93 | 88.64 | |
298320.21 | 523.07 | 154944.93 | 369.50 | 12623.93 | 88.36 | |
288510.14 | 514.14 | 311559.21 | 518.21 | 585.00 | 18.57 | |
151329.79 | 368.50 | 311996.36 | 518.64 | 3353.57 | 51.43 | |
3290.07 | 40.50 | 105662.43 | 311.86 | 1635.64 | 35.36 | |
9174.07 | 71.21 | 21515.93 | 118.07 | 565.00 | 20.43 | |
8477.29 | 68.29 | 554612.29 | 585.43 | 214.64 | 12.21 | |
10905.86 | 78.86 | 1034287.57 | 841.14 | 243.57 | 13.29 | |
11893.29 | 82.86 | 580431.07 | 625.79 | 351.07 | 16.50 | |
11893.29 | 82.86 | 620786.79 | 648.36 | 397.14 | 17.57 | |
11893.29 | 82.86 | 705260.29 | 694.43 | 498.00 | 19.86 | |
11893.29 | 82.86 | 890921.71 | 786.86 | 737.07 | 24.36 | |
11893.29 | 82.86 | 1310319.64 | 964.93 | 1387.93 | 33.36 | |
11893.29 | 82.86 | 1310319.64 | 964.93 | 2236.71 | 42.14 | |
Persistence | 3243399.00 | 1429.89 | 183250.60 | 347.38 | 228.52 | 10.98 |
Table 9.
For each value of $\lambda$, Mean Squared Error (MSE) and Mean Absolute Error (MAE) of the ensemble for Testing Periods 1, 2 and 3 in Sjælland. For each period, the best performance is highlighted in bold. Last row contains the MSE and MAE of the persistence model tested.
 | Testing Period 1 (04/04/2020–17/04/2020) || Testing Period 2 (15/04/2020–28/04/2020) || Testing Period 3 (07/05/2020–20/05/2020) ||
---|---|---|---|---|---|---
 | MSE | MAE | MSE | MAE | MSE | MAE
0 | 538.07 | 18.36 | 186.00 | 11.00 | 19171.64 | 126.93 |
327.50 | 14.07 | 170.14 | 8.71 | 18097.14 | 123.14 | |
66.71 | 5.00 | 146.79 | 7.93 | 17080.29 | 119.57 | |
228.71 | 13.43 | 103.14 | 6.57 | 15200.14 | 112.57 | |
947.93 | 27.07 | 141.79 | 8.21 | 14582.07 | 108.64 | |
905.43 | 26.57 | 164.14 | 12.14 | 12379.21 | 99.36 | |
600.29 | 21.71 | 217.79 | 14.07 | 7189.43 | 75.43 | |
622.57 | 22.14 | 313.21 | 16.79 | 1055.79 | 28.36 | |
671.57 | 23.00 | 319.29 | 17.00 | 134.14 | 10.00 | |
761.57 | 24.43 | 343.14 | 17.57 | 126.79 | 9.79 | |
818.00 | 25.29 | 379.29 | 18.57 | 123.14 | 9.57 | |
818.00 | 25.29 | 379.29 | 18.57 | 123.14 | 9.57 | |
818.00 | 25.29 | 379.29 | 18.57 | 123.14 | 9.57 | |
818.00 | 25.29 | 379.29 | 18.57 | 123.14 | 9.57 | |
818.00 | 25.29 | 379.29 | 18.57 | 123.14 | 9.57 | |
Persistence | 593.83 | 19.88 | 36.38 | 4.92 | 5.15 | 1.91 |
Fig. 14.
For each value of $\lambda$, heatmap of the weights of the base regressors in the ensemble in Training Periods 1, 2 and 3 in Andalusia. We highlight $\lambda = 0$ in blue, $\lambda^\ast$ in black, and the value of $\lambda$ yielding the sparsest ensemble in green. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 15.
For each value of $\lambda$, heatmap of the weights of the base regressors in the ensemble in Training Periods 1, 2 and 3 in Sjælland. We highlight $\lambda = 0$ in blue, $\lambda^\ast$ in black, and the value of $\lambda$ yielding the sparsest ensemble in green. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Figs. 8–13 depict the realized values of the variable at hand, the cumulative number of hospitalized patients in the respective region (in red), as well as the fourteen-day-ahead predictions for three different ensembles. In the first ensemble, with $\lambda = 0$, the selective sparsity term does not play a role by construction (blue line). The second ensemble, with $\lambda = \lambda^\ast$, is the one that performs the best in terms of MSE among all values of $\lambda$ considered (black line). Finally, the third ensemble, with the largest value of $\lambda$ in the grid, is the one showing the highest level of sparsity (green line).
Fig. 8.
Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 1 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 13.
Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Sjælland for COVID-19 in Testing Period 3 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
We start by discussing the results obtained for Period 1 in Andalusia. In Fig. 8, we can see that it is possible to improve the out-of-sample prediction performance by taking a strictly positive value of $\lambda$. As pointed out in the introduction, this is one of the advantages of our approach, namely, when seeking selective sparsity one may also obtain improvements in the out-of-sample prediction performance. A great benefit is observed with the ensemble that performs the best (black line), which is rather close to the actual values (red line). While the ensemble with $\lambda = 0$ presents an MAE of 532.71, for $\lambda = \lambda^\ast$ the MAE is reduced to 40.50. This ensemble consists of three base regressors with respective weights 0.71, 0.14 and 0.15. In Fig. 9, we plot the out-of-sample information for Andalusia and Period 2. Similar conclusions hold. In addition, the best ensemble is the one with $\lambda = \lambda^\ast$ and consists of two base regressors with respective weights 0.25 and 0.75. This means that the ensemble composition has changed over time, which can be explained by the non-stationarity of the data. If, after having built the best ensemble for Training Period 1, one had discarded these two base regressors because they were not selected, we would have lost the best combination for Training Period 2. This illustrates another advantage of our approach, namely, its adaptability. The ensemble composition changes again in Training Period 3 in Andalusia, where two base regressors compose the best ensemble, see Fig. 10. Note that, for this particular period, $\lambda^\ast = 0$, although this is not in general the case.
Fig. 9.
Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 2 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 10.
Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 3 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Regarding Sjælland, similar conclusions are obtained, see Table 9 and Figs. 11–13. The best ensembles are achieved for strictly positive values of $\lambda$ in each of Testing Periods 1, 2 and 3. Their compositions also differ across the three periods. This again illustrates the advantage of our approach in terms of adaptability.
Fig. 11.
Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Sjælland for COVID-19 in Testing Period 1 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 12.
Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Sjælland for COVID-19 in Testing Period 2 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
We end the section with a few words about the set of base regressors. In their last row, Tables 8 and 9 report the MSE and MAE of a persistence model in which the increase in the variable is kept constant throughout the testing period and equal to the last increase in the training period. As for any forecasting model, the persistence model might yield good results in some cases, such as in Testing Period 2 and 3 in Sjælland, but very poor ones in other situations, such as in Testing Period 1 and 2 in Andalusia. We could have easily embedded this persistence model, or any other one, by enlarging the set of base regressors. Again, because of the adaptability of our approach, the persistence model would have been chosen or not to be part of the sparse ensemble, depending on the period and the region being considered.
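For completeness, the persistence baseline reported in the last row of Tables 8 and 9 keeps the last observed increase of the training period constant over the horizon, i.e., $\hat{y}_{T+h} = y_T + h\,(y_T - y_{T-1})$. A one-line sketch:

```python
import numpy as np

def persistence_forecast(y, horizon=14):
    """Persistence baseline: repeat the last observed increase for each future day."""
    last_increase = y[-1] - y[-2]
    return y[-1] + last_increase * np.arange(1, horizon + 1)
```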
4. Conclusions
In this paper we have addressed the problem of building ensembles of regression methods with selective sparsity, which is suitable in changing circumstances such as those related to the COVID-19 pandemic. The construction of the ensemble amounts to solving an optimization problem, which is convex quadratic with linear constraints for the empirical Ordinary Least Squares regression loss and can be written as a linear problem for the empirical loss of quantile regression. Under convexity assumptions on the loss, we show that, by varying the parameter $\lambda$ in $[0, +\infty)$, we move from the ensemble minimizing the overall loss to the ensemble with one single base regressor, namely, the one with lowest individual loss $\ell_r$. Moreover, different types of desirable properties of the ensemble can be easily accommodated by modifying the penalty term or the constraints. The application to data on hospitalized patients in Andalusia (Spain) and Sjælland (Denmark) shows the advantage of using an ensemble with selective sparsity instead of the full, non-sparse ensemble or one single base regressor.
The computational experience reported is limited to the problem motivating this work. For other types of problems, it may be interesting to combine the selective sparsity suggested in this paper (number of regressors used) with the feature sparsity (number of features used), by adding penalties as in Section 2.3 and in Blanquero et al. (2020a) and Blanquero, Carrizosa, Molero-Río, and Romero Morales (2020b). It may also be attractive to use different measures for the individual losses $\ell_r$ and the overall loss $\ell$. For instance, one can build the ensemble with lowest least squares errors, but being reluctant to use base regressors with high least absolute deviations, or more generally, quantile errors.
Even if we knew the probabilistic mechanism generating the data, sound probability assessments are rather difficult in the setting considered in this paper. Those probability assessments are what Efron (2020) calls “attributions”. As recognized in that paper, prediction is much easier than attribution. The use of an adequate bootstrap procedure (see Bühlmann, 2002, for a review of bootstraps for time series) could yield probability attributions. The consistency of the bootstrap for Support Vector Machines, when the data can be assumed to be independent and identically distributed, has been shown in Christmann and Hable (2013). To the best of our knowledge, an analogous result for time series in a general setting such as the one considered here has not been stated yet, and it certainly constitutes a field for future research.
Another challenging line of research is the construction of sparse ensembles (sparse both in base regressors and in features) for classification problems. Although some attempts have been made to address this problem using Linear Programming (Zhang & Zhou, 2011), natural losses yield versions of Problem (1) with (many) binary variables, and thus new strategies are to be defined to cope with data sets of realistic size. This challenging problem is now under study.
Acknowledgements
We thank the reviewers for their thorough comments and suggestions, which have been very valuable to strengthen the quality of the paper. This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214); FQM-329 and P18-FR-2369 (Junta de Andalucía, Spain); MTM2017-89422-P (Ministerio de Economía, Industria y Competitividad, Spain); PID2019-110886RB-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain); PR2019-029 (Universidad de Cádiz, Spain); PITUFLOW-CM-UC3M (Comunidad de Madrid and Universidad Carlos III de Madrid, Spain); and EP/R00370X/1 (EPSRC, United Kingdom). This support is gratefully acknowledged.
References
- Achterberg M., Prasse B., Ma L., Trajanovski S., Kitsak M., Van Mieghem P. Comparing the accuracy of several network-based COVID-19 prediction algorithms. Forthcoming in International Journal of Forecasting. 2020. doi: 10.1016/j.ijforecast.2020.10.001.
- Ando T., Li K.C. A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association. 2014;109:254–265.
- Bates J., Granger C. The combination of forecasts. Operations Research Quarterly. 1969;20:451–468.
- Benítez-Peña S., Blanquero R., Carrizosa E., Ramírez-Cobo P. Cost-sensitive feature selection for support vector machines. Computers & Operations Research. 2019;106:169–178.
- Benítez-Peña S., Blanquero R., Carrizosa E., Ramírez-Cobo P. On support vector machines under a multiple-cost scenario. Advances in Data Analysis and Classification. 2019;13:663–682.
- Benítez-Peña S., Blanquero R., Carrizosa E., Ramírez-Cobo P. Cost-sensitive probabilistic predictions for support vector machines. Technical Report IMUS, Sevilla, Spain. 2020. https://www.researchgate.net/publication/341103637_Cost-sensitive_probabilistic_predictions_for_support_vector_machines.
- Benítez-Peña S., Carrizosa E., Guerrero V., Jiménez-Gamero M.D., Martín-Barragán B., Molero-Río C., Ramírez-Cobo P., Romero Morales D., Sillero-Denamiel M.R. Short-term predictions of the evolution of COVID-19 in Andalusia. An ensemble method. Technical Report IMUS, Sevilla, Spain. 2020. https://www.researchgate.net/publication/340716304_Short-Term_Predictions_of_the_Evolution_of_COVID-19_in_Andalusia_An_Ensemble_Method.
- Bertsimas D., Dunn J. Optimal classification trees. Machine Learning. 2017;106:1039–1082.
- Bertsimas D., King A., Mazumder R. Best subset selection via a modern optimization lens. The Annals of Statistics. 2016;44:813–852.
- Blanquero R., Carrizosa E., Molero-Río C., Romero Morales D. On sparse optimal regression trees. Technical Report IMUS, Sevilla, Spain. 2020. https://www.researchgate.net/publication/341099512_On_Sparse_Optimal_Regression_Trees.
- Blanquero R., Carrizosa E., Molero-Río C., Romero Morales D. Sparsity in optimal randomized classification trees. European Journal of Operational Research. 2020;284:255–272.
- Blanquero R., Carrizosa E., Molero-Río C., Romero Morales D. Optimal randomized classification trees. Computers & Operations Research. 2021;132:105281.
- Blanquero R., Carrizosa E., Ramírez-Cobo P., Sillero-Denamiel M.R. A cost-sensitive constrained lasso. Advances in Data Analysis and Classification. 2020;15:121–158.
- Breiman L. Random forests. Machine Learning. 2001;45:5–32.
- Bühlmann P. Bootstraps for time series. Statistical Science. 2002;17:52–72.
- Carrizosa E., Martín-Barragán B., Romero Morales D. Multi-group support vector machines with measurement costs: A biobjective approach. Discrete Applied Mathematics. 2008;156:950–966.
- Carrizosa E., Martín-Barragán B., Romero Morales D. Binarized support vector machines. INFORMS Journal on Computing. 2010;22:154–167.
- Carrizosa E., Martín-Barragán B., Romero Morales D. Detecting relevant variables and interactions in supervised classification. European Journal of Operational Research. 2011;213:260–269.
- Carrizosa E., Molero-Río C., Romero Morales D. Mathematical optimization in classification and regression trees. TOP. 2021;29:5–33. doi: 10.1007/s11750-021-00594-1.
- Carrizosa E., Mortensen L.H., Romero Morales D., Sillero-Denamiel M.R. On linear regression models with hierarchical categorical variables. Technical Report IMUS, Sevilla, Spain. 2020. https://www.researchgate.net/publication/341042405_On_linear_regression_models_with_hierarchical_categorical_variables.
- Carrizosa E., Nogales-Gómez A., Romero Morales D. Strongly agree or strongly disagree?: Rating features in support vector machines. Information Sciences. 2016;329:256–273.
- Carrizosa E., Nogales-Gómez A., Romero Morales D. Clustering categories in support vector machines. Omega. 2017;66:28–37.
- Carrizosa E., Olivares-Nadal A., Ramírez-Cobo P. A sparsity-controlled vector autoregressive model. Biostatistics. 2017;18:244–259. doi: 10.1093/biostatistics/kxw042.
- Carrizosa E., Olivares-Nadal A., Ramírez-Cobo P. Novel constraints for enhancing interpretability in linear regression. SORT (Statistics and Operations Research Transactions). 2020;44:67–98.
- Carrizosa E., Romero Morales D. Supervised classification and mathematical optimization. Computers and Operations Research. 2013;40:150–165.
- Christmann A., Hable R. On the consistency of the bootstrap approach for support vector machines and related kernel-based methods. In: Schölkopf B., Luo Z., Vovk V., editors. Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik. Springer; Berlin, Heidelberg: 2013. pp. 231–244.
- Datta S., Das S. Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks. 2015;70:39–52. doi: 10.1016/j.neunet.2015.06.005.
- Deng H. Interpreting tree ensembles with intrees. International Journal of Data Science and Analytics. 2019;7:277–287.
- Efron B. Prediction, estimation, and attribution. Journal of the American Statistical Association. 2020;115:636–655.
- Fernández-Casal R. COVID-19 github repository. 2020. Accessed on: September. https://github.com/rubenfcasal/COVID-19.
- Florez-Lopez R., Ramon-Jeronimo J. Enhancing accuracy and interpretability of ensemble strategies in credit risk assessment. A correlated-adjusted decision forest proposal. Expert Systems with Applications. 2015;42:5737–5753.
- Fountoulakis K., Gondzio J. A second-order method for strongly convex $\ell_1$-regularization problems. Mathematical Programming. 2016;156:189–219.
- Friese M., Bartz-Beielstein T., Bäck T., Naujoks B., Emmerich M. Weighted ensembles in model-based global optimization. AIP Conference Proceedings. 2019.
- Friese M., Bartz-Beielstein T., Emmerich M. Building ensembles of surrogate models by optimal convex combination. Technical Report. 2016. http://nbn-resolving.de/urn:nbn:de:hbz:832-cos4-3480.
- Gaines B.R., Kim J., Zhou H. Algorithms for fitting the constrained lasso. Journal of Computational and Graphical Statistics. 2018;27:861–871. doi: 10.1080/10618600.2018.1473777.
- Gambella C., Ghaddar B., Naoum-Sawaya J. Optimization models for machine learning: A survey. European Journal of Operational Research. 2021;290:807–828.
- Gurobi Optimization, LLC. Gurobi optimizer reference manual. 2018. http://www.gurobi.com.
- Härdle W. Applied nonparametric regression, vol. 19. Cambridge University Press; 1990.
- Hastie T., Tibshirani R., Wainwright M. Statistical learning with sparsity: The lasso and generalizations. CRC Press; 2015.
- Statens Serum Institut. COVID-19 SSI repository. 2020. Accessed on: September. https://covid19.ssi.dk/overvagningsdata.
- Kedem B., Fokianos K. Regression models for time series analysis, vol. 488. John Wiley & Sons; 2005.
- Koenker R., Hallock K. Quantile regression. Journal of Economic Perspectives. 2001;15:143–156.
- Koenker R., Ng P. Inequality constrained quantile regression. Sankhyā: The Indian Journal of Statistics. 2005;67:418–440.
- Lee Y., Nelder J., Pawitan Y. Generalized linear models with random effects: Unified analysis via H-likelihood, vol. 153. CRC Press; 2018.
- Liaw A., Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
- Martín-Barragán B., Lillo R., Romo J. Interpretable support vector machines for functional data. European Journal of Operational Research. 2014;232:146–155.
- Mendes-Moreira J., Soares C., Jorge A.M., Sousa J.F.D. Ensemble approaches for regression: A survey. ACM Computing Surveys. 2012;45:1–40.
- Meyer D., Dimitriadou E., Hornik K., Weingessel A., Leisch F. e1071: Misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien. R package version 1.7-1. 2019. https://CRAN.R-project.org/package=e1071.
- Nikolopoulos K., Punia S., Schäfers A., Tsinopoulos C., Vasilakis C. Forecasting and planning during a pandemic: COVID-19 growth rates, supply chain disruptions, and governmental decisions. European Journal of Operational Research. 2021;290:99–115. doi: 10.1016/j.ejor.2020.08.001.
- Ren Y., Zhang L., Suganthan P. Ensemble classification and regression: Recent developments, applications and future directions. IEEE Computational Intelligence Magazine. 2016;11:41–53.
- Vapnik V. The nature of statistical learning theory. Springer-Verlag; 1995.
- Zhang L., Zhou W.D. Sparse ensembles using weighted combination methods based on linear programming. Pattern Recognition. 2011;44:97–106.