ABSTRACT
To handle multicollinearity issues in regression analysis, a class of 'strictly concave penalty functions' is described in this paper. As an example, a new penalty function called the 'modified log penalty' is introduced. The penalized estimator based on strictly concave penalties enjoys the oracle property under certain regularity conditions discussed in the literature. In the multicollinearity cases where such conditions are not applicable, the behavior of the strictly concave penalties is discussed through examples involving strongly correlated covariates. Real data examples and simulation studies are provided to show the finite-sample performance of the modified log penalty in terms of prediction error under scenarios exhibiting multicollinearity.
KEYWORDS: Grouping effect, modified log penalty, multicollinearity, penalized regression, strictly concave penalty function
1. Introduction
In regression analysis, multicollinearity occurs when two or more covariates are strongly correlated; see [3,24] for general discussions of multicollinearity issues. Multicollinearity leads to computational difficulties related to the inversion of a nearly singular matrix and results in low efficiency in model estimation and prediction. To eliminate multicollinearity, it is of paramount importance to select a parsimonious model that excludes redundant covariates, that is, covariates that can be predicted from the other covariates.
In the literature, several approaches have been proposed to overcome the difficulties caused by multicollinearity. One approach is the best subset selection method proposed in [18,23,24]. The model selection problem is reformulated as a constrained integer quadratic programming problem involving indicators of multicollinearity, such as the condition number of the correlation matrix and the variance inflation factor. However, solving such integer quadratic programming problems can be computationally intensive. Another approach is the partial least squares regression method discussed in [4,17,27]. The idea underlying this approach is to reduce the correlations among the covariates by means of orthogonal transformations.
Over the past two decades, penalized regression methods have been widely studied for the purpose of variable selection; see, for example, [8,10,15,19,25,28–30]. The idea is to use a penalty function that is non-differentiable at zero to shrink small regression coefficients towards zero. Parsimony and the grouping effect are two important criteria for evaluating the performance of a penalty function, and they can conflict with each other in multicollinearity cases. The grouping effect (see [30]) means that strongly correlated covariates tend to be selected or deselected together. Parsimony can be described through the ability of a variable selection method to recover the so-called 'true subset' of covariates that is relevant to the response; for example, the idea of the oracle property described in [8,10,12] has been widely used for this purpose. However, the definition of the true subset can be ambiguous in multicollinearity cases where some covariates can be predicted from other covariates. Consider the example where two covariates, say $X_1$ and $X_2$, are identical and $X_1$ is relevant to the response Y. In such a situation, the model containing $X_1$ alone and the model containing both $X_1$ and $X_2$ are equivalent, and the first one is more parsimonious. On the other hand, the grouping effect requires that both $X_1$ and $X_2$ be selected. In certain applications such as microarray data analysis, the grouping effect is considered a desirable property. However, in applications where prediction is the main goal, the situation can be different because the parsimonious model with redundant covariates removed tends to give smaller prediction error.
There is a lack of literature discussing penalty functions that achieve parsimony in the variable selection problem in the presence of multicollinearity. The Elastic net penalty in [30] is designed to achieve the grouping effect in multicollinearity cases. The Ridge penalty (see [14,15]), the LASSO penalty (see [25]), and the Elastic net penalty (see [30]) do not guarantee the oracle properties in [8,10,11]. Under some regularity conditions on the minimum singular value of the design matrix, the non-concave penalty functions SCAD (see [8,9]) and MCP (see [28]) lead to approximately unbiased estimates and guarantee the oracle properties. However, such regularity conditions do not cover multicollinearity cases with strong correlations among the covariates.
The aim of this paper is to introduce a new class of strictly concave penalty functions that achieve parsimony even in multicollinearity cases. It is illustrated that in situations without multicollinearity, for example, under the regularity conditions in [10], these penalties perform as well as the SCAD penalty in terms of estimation error, prediction error, mean number of false positives, and mean number of false negatives. In the cases where some covariates are identical, at most one among these identical covariates is selected. This means that the redundant covariates can be removed automatically from the model. Moreover, the local quadratic approximation method or majorization-minimization algorithm (MM-algorithm) proposed in [8] and [16] can be used to obtain the estimates. As an example of a 'strictly concave penalty function', a new penalty function called the 'modified log penalty' is introduced.
The paper is organized as follows. The strictly concave penalized likelihood estimator and its properties are discussed in Section 2. The modified log penalty is introduced in Section 3. The simulation studies are given in Section 4 to compare the finite-sample performances of the proposed penalty and other penalties, including the Elastic net, the LASSO, and the SCAD. Some real data examples are given in Section 5. The concluding remarks are presented in Section 6.
2. Penalized linear regression with strictly concave penalty
In this section, the strictly concave penalties are introduced to promote parsimonious model selection in multicollinearity cases. In contrast, the Elastic net penalty of [30] is strictly convex and exhibits the grouping effect.
2.1. The strictly concave penalized likelihood estimator
Consider the linear regression model:
(1)
where Y is the response vector, X is the design matrix, $\beta$ is the vector of unknown parameters, and $\epsilon$ is the vector of model errors whose components are independent random variables. The strictly concave penalty function is defined below.
Definition 2.1 The strictly concave penalty function —
Let $\lambda \ge 0$ be a tuning parameter. A function $p_\lambda$ is called a strictly concave penalty function if the following conditions are satisfied:
has continuous second order derivative on ,
as ,
for all , and
.
A strictly concave penalty function is a non-concave penalty function as described in [8,10]. The seemingly confusing use of the terms can be resolved by noting that 'strictly concave' here refers to the behavior of the penalty on the domain $(0,\infty)$, while 'non-concave' in [8,10] refers to the whole real line. It can be checked that the SCAD penalty, the MCP penalty, and the penalties of [1] are non-concave but not strictly concave.
Consider the following penalized least squares problem: To minimize
(2)
with respect to θ, where t is the observed signal and θ is the unknown parameter. It is suggested in [1] that the conditions in Definition 2.1 guarantee the existence and uniqueness of $\hat\theta_\lambda(t)$, the solution to the optimization problem (2). Moreover, $\hat\theta_\lambda(t)$ is a continuous function of t. These conditions are necessary for reducing model complexity and model bias in prediction (see [2,10]).
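To make the role of problem (2) concrete, the following sketch evaluates its solution numerically. It assumes the standard univariate objective $(t-\theta)^2/2 + p_\lambda(|\theta|)$ of [1] and uses the logarithmic function $\lambda\log(1+|\theta|)$ purely as an illustrative strictly concave choice (it is not the MLOG penalty of Section 3); the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pen_log(theta, lam):
    # Illustrative strictly concave function; not the paper's MLOG penalty.
    return lam * np.log1p(abs(theta))

def theta_hat(t, lam, pen=pen_log):
    """Numerically minimize 0.5*(t - theta)^2 + pen(theta, lam) over theta."""
    bound = abs(t) + 1.0                      # the minimizer always lies within [-|t|, |t|]
    obj = lambda th: 0.5 * (t - th) ** 2 + pen(th, lam)
    res = minimize_scalar(obj, bounds=(-bound, bound), method="bounded")
    # The penalty is non-differentiable at zero, so compare with the exact-zero candidate.
    return 0.0 if obj(0.0) <= res.fun else res.x

lam = 1.0
for t in np.linspace(-4, 4, 9):
    # Small signals are thresholded to zero; large signals are shrunk continuously.
    print(f"t = {t:+.1f} -> theta_hat = {theta_hat(t, lam):+.4f}")
```

Tracing this solution over a fine grid of t reproduces the thresholding and continuity behaviour described above.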
Suppose that the data set has n observations and p covariates. Let Y be the n-dimensional response vector and $X = (X_1, \ldots, X_p)$ be the $n \times p$ design matrix, where $X_1, \ldots, X_p$ are the covariates. The strictly concave penalized likelihood estimator is defined as follows.
Definition 2.2 The strictly concave penalized likelihood estimator —
Let $p_\lambda$ be a strictly concave penalty. For any fixed non-negative λ, the strictly concave penalized likelihood estimator of β in Model (1) is defined as
(3)
For simplicity, if no confusion is caused, we write $\hat\beta$ instead of $\hat\beta_\lambda$. Since a strictly concave penalty is also a non-concave penalty, the majorization-minimization algorithm of [16] can be applied to obtain the penalized least squares estimator (3). Similar to the SCAD penalty, if the design matrix X and the model error ϵ satisfy all regularity conditions described in [11], the penalized likelihood estimator always exists and fulfills the so-called oracle properties. Such properties are not guaranteed for the LASSO and the Elastic net.
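A minimal sketch of the local quadratic approximation (MM) iteration of [8,16] for problem (3) is given below. The ridge-type update with weights $p_\lambda'(|\beta_j|)/(|\beta_j|+\delta)$, the scaling of the penalty by n, and the logarithmic derivative used in place of the MLOG derivative are assumptions of this sketch rather than the exact scheme of the paper.

```python
import numpy as np

def log_pen_deriv(theta, lam):
    # Derivative of the illustrative strictly concave function lam*log(1+|theta|);
    # replace with the derivative of the penalty actually used, e.g. the MLOG in (6).
    return lam / (1.0 + abs(theta))

def mm_penalized_ls(X, y, lam, pen_deriv=log_pen_deriv, delta=1e-8,
                    n_iter=200, tol=1e-8):
    """Local quadratic approximation / MM iteration for the penalized least squares (3)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # least squares starting value
    for _ in range(n_iter):
        w = np.array([pen_deriv(abs(b), lam) / (abs(b) + delta) for b in beta])
        # Each step solves the ridge-type system (X'X + n*diag(w)) beta = X'y.
        beta_new = np.linalg.solve(X.T @ X + n * np.diag(w), X.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0                      # treat tiny coefficients as deselected
    return beta
```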
2.2. Parsimonious variable selection in the multicollinearity case
In this subsection, the properties of the strictly concave penalized estimator are discussed under general multicollinearity cases. In such situations, the regularity conditions in [11] can be violated and the penalized estimation methods based on the SCAD penalty of [8] and the MCP penalty of [28] are not guaranteed to select the true model.
To illustrate the ideas, consider a simple example in which two covariates are identical, say $X_1 = X_2$, and both corresponding coefficient estimates are large. Since both the SCAD penalty and the MCP penalty are constant beyond some point, shifting an amount K from one of the two identical coefficients to the other always gives the same penalized likelihood value when K is smaller than some critical value. This means that along this direction the penalized likelihood is flat, which creates difficulties in the numerical optimization of the penalized likelihood. If the LASSO penalty is used and both coefficients are positive, similar difficulties occur because the penalized likelihood is constant in K for sufficiently small K. The situation is very different if a strictly concave penalty is used instead, because such penalties are no longer flat away from zero.
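The flatness issue can be checked numerically. In the sketch below, the design, the coefficients, and the logarithmic stand-in for a strictly concave penalty are hypothetical choices, and the SCAD penalty is the standard one of [8]; the penalized least squares objective is evaluated along the direction that moves weight between two identical covariates.

```python
import numpy as np

def scad(theta, lam, a=3.7):
    # Standard SCAD penalty of [8]; constant for |theta| > a*lam.
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return -(t**2 - 2*a*lam*t + lam**2) / (2*(a - 1))
    return (a + 1) * lam**2 / 2

def log_pen(theta, lam):
    # Illustrative strictly concave function; not the paper's MLOG penalty.
    return lam * np.log1p(abs(theta))

rng = np.random.default_rng(0)
n, lam = 50, 0.5
x = rng.normal(size=n)
X = np.column_stack([x, x])                 # two identical covariates
beta0 = np.array([3.0, 3.0])                # both coefficients beyond the SCAD constancy point
y = X @ beta0 + 0.1 * rng.normal(size=n)

def objective(beta, pen):
    return 0.5 * np.sum((y - X @ beta) ** 2) + n * sum(pen(b, lam) for b in beta)

for K in [0.0, 0.2, 0.4, 0.6]:
    b = beta0 + K * np.array([1.0, -1.0])   # shift an amount K between the identical columns
    print(f"K={K:.1f}  SCAD objective={objective(b, scad):10.4f}"
          f"  concave objective={objective(b, log_pen):10.4f}")
```

The SCAD objective stays constant in K, whereas the strictly concave objective changes strictly as the two coefficients become more unequal, which is the mechanism that concentrates the weight on a single copy.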
To describe the general multicollinearity cases, suppose that the design matrix X is generated by perturbing a non-full-rank matrix U with a small quantity, say $X = U + \epsilon Z$ for a small $\epsilon \ge 0$. Here, the dimensions of both U and Z are the same as those of X. Parsimony requires no linear dependence among the columns of U corresponding to the selected covariates. Detailed results are given in the following theorem.
Theorem 2.3
For any integers n, p > 0 and $n \times p$ matrices U and Z, there exists a positive constant ϰ, depending on U and Z, such that the system of column vectors of U corresponding to the chosen covariates is linearly independent for all $0 \le \epsilon < \varkappa$, where the chosen covariates are those with nonzero components in the strictly concave penalized likelihood estimator of β in Model (1).
The proof is given in Appendix A.2. Further results for the special case where $\epsilon = 0$ or Z is a zero matrix (so that X = U) are summarized in the following proposition.
Proposition 2.4
Let $\hat\beta$ be the strictly concave penalized likelihood estimator of β in Model (1).
(a) The system of column vectors corresponding to the nonzero components of $\hat\beta$ is linearly independent. In particular, the number of nonzero estimated coefficients does not exceed the rank of X.
(b) Without loss of generality, assume that the first k estimated coefficients are nonzero and that the corresponding columns of X are linearly independent. Then,
(4)
Result (a) illustrates the crucial difference between the penalized regression methods based on the proposed strictly concave penalties and other commonly used convex penalties, including the Ridge penalty and the Elastic net penalty. The grouping effect of the Elastic net [30] is in contradiction with the linear independence of the selected columns. Roughly speaking, if a strictly concave penalty is used, there is no redundancy among the selected variables.
Result (b) suggests that the properties of the penalized likelihood estimator in the non-full-rank X case can be studied indirectly through an equivalent full-rank model. To see this, write $r = \mathrm{rank}(X)$ and let $X_{(1)}$ be a submatrix of X formed by a maximal set of r linearly independent columns. Since the columns of $X_{(1)}$ are maximal linearly independent, there exists an $r \times p$ matrix C such that $X = X_{(1)} C$.
The true model is then equivalent to $Y = X_{(1)} \tilde\beta + \epsilon$,
where $\tilde\beta = C\beta$. This means that if we are able to show the oracle properties under the design matrix $X_{(1)}$, the model selected by the proposed method selects the covariates corresponding to the non-zero components of $\tilde\beta$ in the equivalent model with probability going to one. Following the arguments of Fan and Lv in [11], we state without proof the following proposition on the oracle property of the penalized likelihood estimates based on strictly concave penalties.
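The reduction to an equivalent full-rank model can be illustrated with a small numerical sketch; the greedy column-selection routine, the specific redundant column, and all names below are hypothetical.

```python
import numpy as np

def maximal_independent_columns(X, tol=1e-10):
    """Greedily pick a maximal linearly independent set of columns of X."""
    keep = []
    for j in range(X.shape[1]):
        if np.linalg.matrix_rank(X[:, keep + [j]], tol=tol) == len(keep) + 1:
            keep.append(j)
    return keep

rng = np.random.default_rng(1)
n = 30
X1 = rng.normal(size=(n, 3))
X = np.column_stack([X1, X1 @ np.array([1.0, 2.0, -1.0])])  # fourth column is redundant

idx = maximal_independent_columns(X)            # -> [0, 1, 2]
X_sub = X[:, idx]
C = np.linalg.lstsq(X_sub, X, rcond=None)[0]    # C satisfies X = X_sub @ C (up to rounding)

beta = np.array([1.0, -2.0, 0.5, 1.5])
beta_tilde = C @ beta                           # coefficients of the equivalent reduced model
print(np.allclose(X @ beta, X_sub @ beta_tilde))   # True: identical mean vectors
```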
Proposition 2.5
Let the quantities be defined as in Proposition 2.4. If the reduced design matrix and the error terms ϵ satisfy all regularity conditions of [11], then the estimator defined by (4) fulfills the so-called oracle properties in [11]. That means,
With probability tending to 1 as the penalized likelihood estimator satisfies:where is a subvector of formed by components in and s is the size of
where is a matrix such that and G is a symmetric positive definite matrix, and is the submatrix of corresponding to .
To compare the strictly concave penalties to the Elastic net penalty, consider the cases with identical covariates that lead to the so-called grouping effect described in [30]. The following proposition follows immediately from Proposition 2.4 and is stated without proof.
Proposition 2.6
Let $\hat\beta$ be the strictly concave penalized likelihood estimator (3). Then, the following hold:
If and then for sufficiently small δ, .
Suppose that some of the covariates are identical, for some q<p. Without loss of generality, assume that these are the last p−q covariates. Then, $\hat\beta$ fulfills the following, where
(5)
2.3. Feasibility of the majorization-minimization algorithm
The majorization-minimization algorithm of [16], which is closely related to the local quadratic approximation method in [8], can be employed to minimize the penalized likelihood function (3) when the penalty function is non-concave. Note that the majorization-minimization algorithm is applicable only when the matrix described below is invertible, where X is the design matrix and δ is a given small positive value. We have the following results.
Proposition 2.7
The matrix is invertible if and only if the corresponding system of columns is linearly independent. In particular, if the penalty derivative is strictly positive at every estimated coefficient or the design matrix X has full column rank, the matrix is invertible.
The proof is given in Appendix A.4. Below, the invertibility of this matrix is discussed for different kinds of penalty functions. Note that, for any strictly concave penalty, the derivative $p_\lambda'(\theta)$ is strictly positive for all θ > 0. Consequently, the conclusion of Proposition 2.7 holds trivially. However, this is not true for the SCAD, the MCP, and the HARD penalty because these penalties are constant beyond some critical point. As a result, $p_\lambda'(|\hat\beta_j|)$ can be zero if $|\hat\beta_j|$ is large, the corresponding set of such indices can be nonempty, and the linear independence requirement in Proposition 2.7 is not guaranteed in the multicollinearity cases.
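The contrast can be verified numerically. The sketch below assumes the working matrix takes the standard local quadratic approximation form $X^\top X + n\,\mathrm{diag}\{p_\lambda'(|\beta_j|)/(|\beta_j|+\delta)\}$, uses the standard SCAD derivative of [8], and uses a logarithmic function as a stand-in for a strictly concave penalty; it compares the smallest eigenvalue of this matrix for a design with two identical columns and large coefficients.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    # Standard SCAD derivative of [8]; zero for |theta| > a*lam.
    t = abs(theta)
    if t <= lam:
        return lam
    if t <= a * lam:
        return (a * lam - t) / (a - 1)
    return 0.0

def log_pen_deriv(theta, lam):
    # Derivative of the illustrative strictly concave function lam*log(1+|theta|).
    return lam / (1.0 + abs(theta))

def working_matrix(X, beta, pen_deriv, lam, delta=1e-6):
    n = X.shape[0]
    w = np.array([pen_deriv(abs(b), lam) / (abs(b) + delta) for b in beta])
    return X.T @ X + n * np.diag(w)

rng = np.random.default_rng(2)
x = rng.normal(size=100)
X = np.column_stack([x, x])                 # two identical covariates
beta = np.array([3.0, 3.0])                 # both above the SCAD constancy point a*lam
lam = 0.5
for name, d in [("SCAD", scad_deriv), ("strictly concave", log_pen_deriv)]:
    A = working_matrix(X, beta, d, lam)
    print(name, "smallest eigenvalue:", np.linalg.eigvalsh(A).min())
```

The SCAD weights vanish for large coefficients, so the matrix inherits the singularity of $X^\top X$, whereas the strictly positive derivative of a strictly concave penalty keeps the matrix positive definite.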
3. Modified log penalty
In this section, the modified log penalty (MLOG), a special case of strictly concave penalty, is introduced.
Definition 3.1 Modified log penalty function —
The modified log penalty (MLOG) is defined as
(6) where $\lambda \ge 0$ is a tuning parameter.
Note that and for all θ. Therefore, the modified log penalized likelihood estimate behaves like the ordinary least squares estimate when λ is close to 0 and behaves like the LASSO estimate when λ goes to infinity. When and λ goes to zero, . Neglecting the constant term, it becomes the logarithmic function.
In the modified log penalty, one is added to the term inside the logarithm to avoid the singularity of the logarithm at zero. One can consider a more general penalty function in which this one is replaced by a constant μ > 0. In this paper, μ = 1 is chosen because it is the greatest possible value of μ that guarantees the uniqueness and existence of the solution to the minimization problem (2). To see this, consider the first order condition
The existence and uniqueness of the solution can be established by noting that the derivative of the right-hand-side for all only when .
Following [1], the thresholding rule of the MLOG penalty refers to the map from the observed signal t to the minimizer $\hat\theta_\lambda(t)$ of (2) and is given by
(7)
Note that the function is the unique solution to (2) for all . It is a continuous function of t. Moreover, as . To see this, note that
Since the thresholding rule is an odd function, we have
(8)
The plots of the modified log penalty function and its thresholding rule are shown in Figure 1.
Figure 1.
The plots of the modified log penalty functions and their thresholding rules for three values of the tuning parameter. (a) The MLOG penalty functions (MLOG1, MLOG2, MLOG3). (b) The corresponding thresholding rules.
4. Simulation studies
In this section, the finite-sample performance of the proposed penalty is compared to that of the Elastic net, the LASSO, and the SCAD penalties in examples exhibiting multicollinearity.
To obtain the penalized likelihood estimates, the LARS-EN algorithm of [30] is used for the Elastic net, while the majorization-minimization algorithm of [16], which is closely related to the local quadratic approximation method of [8], is used for all other penalties. A covariate is deselected if its estimated coefficient falls below a small threshold in absolute value. The tuning parameter λ is chosen based on the Bayesian information criterion (BIC) of [13,26]. The optimal λ value is obtained using a grid search over 100 grid points. For the SCAD penalty, the tuning parameter a=3.7 is used as suggested in [8].
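A minimal sketch of the tuning step is given below. The criterion log(RSS/n) + df·log(n)/n is a common BIC-type choice in the spirit of [13,26], but the exact criterion, grid, and degrees-of-freedom definition used in the paper are not reproduced here, so treat the details as assumptions.

```python
import numpy as np

def bic_select(X, y, lam_grid, fit, tol=1e-6):
    """Grid search for the tuning parameter minimizing a BIC-type criterion.

    `fit(X, y, lam)` must return a coefficient vector (e.g. the MM fitter
    sketched earlier); `df` counts the nonzero estimated coefficients.
    """
    n = len(y)
    best = None
    for lam in lam_grid:
        beta = fit(X, y, lam)
        rss = np.sum((y - X @ beta) ** 2)
        df = np.sum(np.abs(beta) > tol)
        bic = np.log(rss / n) + df * np.log(n) / n
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best[1], best[2]

# Usage (hypothetical grid of 100 points):
# lam_opt, beta_opt = bic_select(X, y, np.linspace(0.01, 2.0, 100), mm_penalized_ls)
```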
In each example, the simulated dataset consists of a training set and a test set. The models are fitted using the training sets and the prediction errors are obtained from the test sets. N=500 replicates are used in the simulation. The sample sizes of the training sets are chosen as n=200, n=400, and n=800. The number of covariates (p) grows with n. The sample sizes of test sets are 100.
The following measures of estimation efficiency, prediction efficiency, and selection consistency are used to compare the performance of different penalties. Let $\hat\beta$ be an estimate of β.
Median of relative model errors (MRME) of [8].
Model error (ME): .
Relative model error (RME): , where is the least squares estimator (LS).
Mean squared error (MSE): .
Prediction error (PE): .
Mean prediction squared error (MPSE): .
False positives (FP) and false negatives (FN). FP is the mean number of irrelevant covariates misclassified as relevant and FN is the mean number of relevant covariates misclassified as irrelevant. In the simulation examples, the average FP and FN values are reported.
NONZERO: the average number of selected covariates (see the sketch following this list).
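The selection measures can be computed as in the sketch below; the zero threshold and the function name are illustrative assumptions.

```python
import numpy as np

def selection_summary(beta_hat, beta_true, tol=1e-6):
    """NONZERO, FP, and FN as defined above (estimates below `tol` count as zero)."""
    selected = np.abs(beta_hat) > tol
    relevant = np.abs(beta_true) > 0
    return {"NONZERO": int(np.sum(selected)),
            "FP": int(np.sum(selected & ~relevant)),   # irrelevant covariates selected
            "FN": int(np.sum(~selected & relevant))}   # relevant covariates missed

# Example: selection_summary(np.array([2.9, 0.0, 1e-9]), np.array([3.0, 0.0, -1.9]))
# returns {'NONZERO': 1, 'FP': 0, 'FN': 1}; averaging over replicates gives the reported FP and FN.
```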
Example 4.1
In this example, the covariates are strongly correlated but not exactly identical. Suppose that . The number of covariates is . Let . The rows of are sampled from , where . The rows of and are generated by
where . The rows of and are sampled from and respectively.
Consider the following two sets of true regression coefficients ,
Here, the five numbers 3, -1.9, 2.5, -2.2, 1.5 are repeated in κ. In Scenario (a), the covariate that is strongly correlated with another covariate is relevant, while in Scenario (b) both covariates of the strongly correlated pair are irrelevant.
Table 1 summarizes the simulation results. First, both MLOG and SCAD perform well in terms of FP in Scenario (b) where the strong pairwise correlations occur only in the irrelevant covariates. On the other hand, the MLOG outperforms the SCAD in general under Scenario (a) where strong pairwise correlations occur between both relevant and irrelevant covariates. The Elastic net and the LASSO perform well in terms of estimation efficiency, but FP is large in general. This means that the Elastic net and the LASSO tend to select more variables than other penalties.
Table 1. The simulation results of Example 4.1: mean squared error (MSE) of the estimates, model error (ME), mean number of selected covariates (NONZERO), median of relative model errors (MRME), prediction error (PE), mean prediction squared error (MPSE), and mean numbers of false positives (FP) and false negatives (FN).
Scenario | (n,p,q) | Method | ME | MRME | MSE | PE | MPSE | NONZERO | FP | FN |
---|---|---|---|---|---|---|---|---|---|---|
LS | 0.0741 | 100.00 | 0.9498 | 0.0795 | 0.5420 | 14.38 | 9.38 | 0.00 | ||
n=200 | MLOG | 0.0302 | 40.18 | 0.9863 | 0.0314 | 0.5186 | 5.39 | 0.39 | 0.00 | |
p=15 | ENET | 0.0464 | 62.93 | 0.9768 | 0.0482 | 0.5265 | 13.69 | 8.69 | 0.00 | |
q=5 | LASSO | 0.0476 | 65.38 | 0.9816 | 0.0498 | 0.5271 | 12.80 | 7.80 | 0.00 | |
SCAD | 1.8508 | 2972.39 | 2.8072 | 2.0629 | 1.5218 | 6.25 | 1.56 | 0.31 | ||
LS | 0.0427 | 100.00 | 0.9485 | 0.0462 | 0.2621 | 16.02 | 11.02 | 0.00 | ||
n=400 | MLOG | 0.0148 | 34.75 | 0.9714 | 0.0155 | 0.2533 | 5.47 | 0.47 | 0.00 | |
(a) | p=17 | ENET | 0.0271 | 63.71 | 0.9634 | 0.0286 | 0.2561 | 14.83 | 9.83 | 0.00 |
q=5 | LASSO | 0.0269 | 64.20 | 0.9657 | 0.0284 | 0.2557 | 13.99 | 8.99 | 0.00 | |
SCAD | 1.2158 | 3045.90 | 2.1819 | 1.3219 | 0.5807 | 6.24 | 1.36 | 0.12 | ||
LS | 0.0261 | 100.00 | 0.9720 | 0.0275 | 0.1279 | 19.48 | 12.48 | 0.00 | ||
n=800 | MLOG | 0.0104 | 40.47 | 0.9852 | 0.0109 | 0.1258 | 7.28 | 0.28 | 0.00 | |
p=21 | ENET | 0.0156 | 60.72 | 0.9836 | 0.0165 | 0.1269 | 16.81 | 9.81 | 0.00 | |
q=7 | LASSO | 0.0158 | 61.43 | 0.9839 | 0.0167 | 0.1269 | 16.56 | 9.56 | 0.00 | |
SCAD | 0.2829 | 900.17 | 1.2566 | 0.2848 | 0.1594 | 7.26 | 0.29 | 0.03 | ||
LS | 0.0732 | 100.00 | 0.9300 | 0.0798 | 0.5411 | 14.98 | 9.98 | 0.00 | ||
n=200 | MLOG | 0.0273 | 36.50 | 0.9729 | 0.0285 | 0.5128 | 5.37 | 0.37 | 0.00 | |
p=15 | ENET | 0.0433 | 58.84 | 0.9580 | 0.0461 | 0.5229 | 10.49 | 5.49 | 0.00 | |
q=5 | LASSO | 0.0436 | 60.09 | 0.9662 | 0.0465 | 0.5228 | 8.20 | 3.20 | 0.00 | |
SCAD | 0.0509 | 67.52 | 0.9651 | 0.0524 | 0.5238 | 5.55 | 0.55 | 0.00 | ||
LS | 0.0436 | 100.00 | 0.9572 | 0.0461 | 0.2589 | 16.70 | 11.70 | 0.00 | ||
n=400 | MLOG | 0.0143 | 32.12 | 0.9840 | 0.0145 | 0.2519 | 5.52 | 0.52 | 0.00 | |
(b) | p=17 | ENET | 0.0261 | 59.84 | 0.9751 | 0.0268 | 0.2547 | 9.97 | 4.97 | 0.00 |
q=5 | LASSO | 0.0257 | 59.60 | 0.9789 | 0.0262 | 0.2549 | 8.27 | 3.27 | 0.00 | |
SCAD | 0.0179 | 41.61 | 0.9905 | 0.0184 | 0.2528 | 5.52 | 0.52 | 0.00 | ||
LS | 0.0262 | 100.00 | 0.9711 | 0.0269 | 0.1293 | 20.98 | 13.98 | 0.00 | ||
n=800 | MLOG | 0.0086 | 33.03 | 0.9880 | 0.0087 | 0.1271 | 7.24 | 0.24 | 0.00 | |
p=21 | ENET | 0.0141 | 54.63 | 0.9843 | 0.0142 | 0.1278 | 10.51 | 3.51 | 0.00 | |
q=7 | LASSO | 0.0142 | 54.97 | 0.9846 | 0.0143 | 0.1278 | 10.39 | 3.39 | 0.00 | |
SCAD | 0.0115 | 44.20 | 0.9913 | 0.0117 | 0.1274 | 7.54 | 0.54 | 0.00 |
Example 4.2
In this example, some covariates are exactly identical. Consider three simulation settings. In each simulation setting, the number of covariates (p) is chosen the same as that in Example 4.1 and the number of identical covariates in the true model is chosen as p−q. The design matrix X is partitioned into a block of distinct covariates and a block of identical covariates, and the rows of the distinct block are sampled from a multivariate normal distribution. The nonzero coefficients are chosen from the first elements of the sequence κ defined in Example 4.1. In this example, both the least squares estimate (LS) and the SCAD encounter computational difficulties related to the inverse of a singular matrix, so the performances of these two methods are not reported. Consider the following true regression coefficient vectors:
In Scenario (a), all identical covariates are irrelevant. In Scenario (b), all identical covariates are relevant. In Scenario (c), some of identical covariates are relevant while the other identical covariates are irrelevant.
Note that the identical covariates cannot be individually identified. To avoid the difficulties related to model identification, the mean numbers of false positives, false negatives, and selected covariates are computed based on the equivalent reduced model. Let k be the number of non-zeros in the reduced coefficient vector; NONZERO refers to the average number of non-zeros in the fitted reduced model. To show detailed information about the identical covariates, additional measures are used: the number of selected covariates among the p−q identical covariates, the number of identical covariates removed from the model, and 'prob', the estimated probability of correctly eliminating all redundant identical covariates. The true values of these quantities in the three scenarios are shown in Tables 2 and 3.
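A sketch of how the group-level summaries can be computed per replicate is given below. The elimination rule (at most one copy kept when the identical group is relevant, none when it is irrelevant) is inferred from Tables 2 and 3, and the names are hypothetical, so treat the details as assumptions.

```python
import numpy as np

def identical_group_summary(beta_hat, group_idx, group_relevant, tol=1e-6):
    """Summaries for a block of identical covariates.

    group_idx      : indices of the identical covariates within beta_hat
    group_relevant : True if the common covariate is relevant to the response
    """
    n_sel = int(np.sum(np.abs(np.asarray(beta_hat)[group_idx]) > tol))
    target = 1 if group_relevant else 0      # copies that should remain in the model
    return {"selected_copies": n_sel,
            "removed_copies": len(group_idx) - n_sel,
            "correctly_eliminated": n_sel == target}

# Averaging `correctly_eliminated` over the simulation replicates gives the
# quantity reported as `prob` in Tables 4-6 (under the stated assumption).
```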
The simulation results are summarized in Table 4. In Scenario (a), where all identical covariates are irrelevant, the performance of the MLOG penalty is similar to those of the Elastic net and the LASSO in terms of both estimation and prediction efficiency. On the other hand, the MLOG outperforms both the Elastic net and the LASSO in Scenarios (b) and (c), where some or all of the identical covariates are relevant. In terms of 'prob', the MLOG is more likely to eliminate the redundant identical covariates, while the Elastic net is more likely to exhibit the so-called grouping effect.
Table 2. The true values of the number of nonzeros (k), the number of selected covariates among the identical covariates, and the number of identical covariates removed from the model.
, , , | ||||
---|---|---|---|---|
Scenario (a) | k=p−q | |||
Scenario (b) | k=1 | |||
Scenario (c) | k=r+1 |
Table 3. The true values of k, , and with n=200, n=400 and n=800.
n | Scenario (a) | Scenario (b) | Scenario (c) | ||||||
---|---|---|---|---|---|---|---|---|---|
200 | k=7, | , | k=1, | , | k=4, | , | |||
400 | k=8, | , | k=1, | , | , | , | |||
800 | k=10, | , | k=1, | , | k=6, | , |
Table 4. The simulation results of Example 4.2: mean squared error (MSE) of the estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates, and the probability of correctly eliminating the redundant identical covariates (prob).
Scenario | (n,p,k) | Method | ME | MSE | PE | MPSE | NONZERO | FP | FN | prob | |
---|---|---|---|---|---|---|---|---|---|---|---|
n=200 | MLOG | 0.0334 | 0.9496 | 0.0349 | 1.0301 | 7.16 | 0.16 | 0.00 | 0.06 | 0.94 | |
p=15 | ENET | 0.0426 | 0.9425 | 0.0451 | 1.0393 | 8.50 | 1.50 | 0.00 | 2.58 | 0.63 | |
k=7 | LASSO | 0.0422 | 0.9433 | 0.0453 | 1.0392 | 8.53 | 1.53 | 0.00 | 0.73 | 0.27 | |
n=400 | MLOG | 0.0222 | 0.9847 | 0.0224 | 1.0209 | 8.17 | 0.17 | 0.00 | 0.14 | 0.86 | |
(a) | p=17 | ENET | 0.0273 | 0.9819 | 0.0276 | 1.0245 | 9.34 | 1.34 | 0.00 | 1.79 | 0.78 |
k=8 | LASSO | 0.0271 | 0.9824 | 0.0276 | 1.0257 | 9.39 | 1.39 | 0.00 | 0.72 | 0.29 | |
n=800 | MLOG | 0.0130 | 0.9845 | 0.0132 | 0.9995 | 10.11 | 0.11 | 0.00 | 0.08 | 0.92 | |
p=21 | ENET | 0.0156 | 0.9830 | 0.0158 | 1.0025 | 11.38 | 1.38 | 0.00 | 0.16 | 0.98 | |
k=10 | LASSO | 0.0161 | 0.9839 | 0.0165 | 1.0041 | 11.28 | 1.28 | 0.00 | 0.63 | 0.37 | |
n=200 | MLOG | 0.0064 | 0.9787 | 0.0065 | 0.9988 | 1.51 | 0.51 | 0.00 | 1.00 | 1.00 | |
p=15 | ENET | 2.4730 | 3.4459 | 2.4401 | 3.4150 | 1.58 | 0.74 | 0.16 | 5.88 | 0.00 | |
k=1 | LASSO | 0.0188 | 0.9758 | 0.0191 | 1.0102 | 2.66 | 1.66 | 0.00 | 1.00 | 1.00 | |
n=400 | MLOG | 0.0046 | 0.9973 | 0.0045 | 0.9948 | 1.50 | 0.50 | 0.00 | 1.00 | 1.00 | |
(b) | p=17 | ENET | 0.3049 | 1.2999 | 0.3276 | 1.3041 | 1.83 | 0.84 | 0.01 | 7.94 | 0.00 |
k=1 | LASSO | 0.0114 | 0.9941 | 0.0113 | 1.0014 | 2.97 | 1.97 | 0.00 | 1.00 | 1.00 | |
n=800 | MLOG | 0.0021 | 0.9866 | 0.0020 | 0.9810 | 1.42 | 0.42 | 0.00 | 1.00 | 1.00 | |
p=21 | ENET | 0.0049 | 0.9893 | 0.0048 | 0.9831 | 1.51 | 0.51 | 0.00 | 10.00 | 0.00 | |
k=1 | LASSO | 0.0055 | 0.9858 | 0.0054 | 0.9846 | 2.50 | 1.50 | 0.00 | 1.00 | 1.00 | |
n=200 | MLOG | 0.0239 | 0.9589 | 0.0243 | 1.0003 | 4.35 | 0.35 | 0.00 | 1.00 | 1.00 | |
p=15 | ENET | 0.0378 | 0.9474 | 0.0387 | 1.0117 | 6.75 | 2.75 | 0.00 | 7.00 | 0.00 | |
k=4 | LASSO | 0.0388 | 0.9465 | 0.0399 | 1.0118 | 7.01 | 3.01 | 0.00 | 1.01 | 0.99 | |
n=400 | MLOG | 0.0129 | 0.9853 | 0.0131 | 1.0376 | 5.28 | 0.28 | 0.00 | 1.00 | 1.00 | |
(c) | p=17 | ENET | 0.0228 | 0.9781 | 0.0236 | 1.0489 | 7.92 | 2.92 | 0.00 | 8.00 | 0.00 |
k=5 | LASSO | 0.0227 | 0.9782 | 0.0235 | 1.0481 | 8.07 | 3.07 | 0.00 | 1.00 | 1.00 | |
n=800 | MLOG | 0.0079 | 0.9884 | 0.0080 | 1.0080 | 6.25 | 0.25 | 0.00 | 1.00 | 1.00 | |
p=21 | ENET | 0.0131 | 0.9848 | 0.0133 | 1.0134 | 8.78 | 2.78 | 0.00 | 10.00 | 0.00 | |
k=6 | LASSO | 0.0135 | 0.9843 | 0.0137 | 1.0138 | 9.18 | 3.18 | 0.00 | 1.00 | 1.00 |
Example 4.3
This example is the same as Example 4.2 except that the covariance matrix
is used instead of , where
with . Unlike Example 4.2, there are both strongly correlated and weakly correlated pairs of covariates in .
The simulation results are shown in Table 5. Similar to Example 4.2, the MLOG outperforms the LASSO and the Elastic net penalties in terms of the probability of eliminating redundant covariates and prediction error. The Elastic net in general exhibits the grouping effect.
Table 5. The simulation results of Example 4.3: mean squared error (MSE) of the estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates, and the probability of correctly eliminating the redundant identical covariates (prob).
Scenario | (n,p,k) | Method | ME | MSE | PE | MPSE | NONZERO | FP | FN | prob | |
---|---|---|---|---|---|---|---|---|---|---|---|
n=200 | MLOG | 0.0383 | 0.9588 | 0.0403 | 1.0596 | 7.05 | 0.05 | 0.00 | 0.03 | 0.97 | |
p=15 | ENET | 0.0453 | 0.9531 | 0.0476 | 1.0648 | 8.27 | 1.27 | 0.00 | 2.25 | 0.54 | |
k=7 | LASSO | 0.0452 | 0.9530 | 0.0477 | 1.0650 | 8.36 | 1.36 | 0.00 | 0.67 | 0.33 | |
n=400 | MLOG | 0.0188 | 0.9725 | 0.0190 | 1.0202 | 8.02 | 0.02 | 0.00 | 0.01 | 0.99 | |
(a) | p=17 | ENET | 0.0246 | 0.9700 | 0.0251 | 1.0268 | 9.43 | 1.43 | 0.00 | 3.06 | 0.49 |
k=8 | LASSO | 0.0228 | 0.9702 | 0.0232 | 1.0235 | 9.31 | 1.31 | 0.00 | 0.70 | 0.30 | |
n=800 | MLOG | 0.0126 | 0.9899 | 0.0127 | 1.0341 | 10.02 | 0.02 | 0.00 | 0.02 | 0.98 | |
p=21 | ENET | 0.0153 | 0.9889 | 0.0156 | 1.0361 | 11.26 | 1.26 | 0.00 | 2.64 | 0.74 | |
k=10 | LASSO | 0.0152 | 0.9897 | 0.0155 | 1.0370 | 11.12 | 1.12 | 0.00 | 0.55 | 0.45 | |
n=200 | MLOG | 0.0061 | 0.9707 | 0.0059 | 1.0228 | 1.22 | 0.22 | 0.00 | 1.00 | 1.00 | |
p=15 | ENET | 4.6020 | 5.5639 | 4.5730 | 5.6197 | 1.24 | 0.54 | 0.30 | 4.93 | 0.00 | |
k=1 | LASSO | 0.0205 | 0.9613 | 0.0203 | 1.0379 | 3.23 | 2.23 | 0.00 | 1.01 | 0.99 | |
n=400 | MLOG | 0.0031 | 0.9865 | 0.0032 | 0.9974 | 1.31 | 0.31 | 0.00 | 1.00 | 1.00 | |
(b) | p=17 | ENET | 5.1911 | 6.1341 | 5.3456 | 6.3323 | 1.91 | 1.04 | 0.13 | 6.98 | 0.00 |
k=1 | LASSO | 0.0108 | 0.9822 | 0.0110 | 1.0071 | 3.13 | 2.13 | 0.00 | 1.01 | 0.99 | |
n=800 | MLOG | 0.0016 | 0.9978 | 0.0016 | 0.9941 | 1.24 | 0.24 | 0.00 | 1.00 | 1.00 | |
p=21 | ENET | 0.0159 | 1.0087 | 0.0143 | 1.0037 | 2.05 | 1.05 | 0.00 | 10.00 | 0.00 | |
k=1 | LASSO | 0.0059 | 0.9952 | 0.0061 | 0.9963 | 3.18 | 2.18 | 0.00 | 1.00 | 1.00 | |
n=200 | MLOG | 0.0299 | 0.9768 | 0.0301 | 1.0326 | 4.02 | 0.33 | 0.31 | 0.02 | 0.98 | |
p=15 | ENET | 0.0339 | 0.9648 | 0.0349 | 1.0405 | 6.04 | 2.14 | 0.10 | 6.05 | 0.00 | |
k=4 | LASSO | 0.0310 | 0.9634 | 0.0315 | 1.0347 | 5.81 | 1.90 | 0.10 | 0.92 | 0.89 | |
n=400 | MLOG | 0.0144 | 0.9871 | 0.0147 | 0.9927 | 5.04 | 0.04 | 0.00 | 1.00 | 1.00 | |
(c) | p=17 | ENET | 0.0208 | 0.9823 | 0.0216 | 1.0012 | 6.81 | 1.81 | 0.00 | 8.00 | 0.00 |
k=5 | LASSO | 0.0203 | 0.9846 | 0.0209 | 1.0011 | 6.39 | 1.39 | 0.00 | 1.00 | 1.00 | |
n=800 | MLOG | 0.0077 | 0.9938 | 0.0077 | 0.9937 | 6.04 | 0.04 | 0.00 | 1.00 | 1.00 | |
p=21 | ENET | 0.0110 | 0.9914 | 0.0112 | 0.9969 | 7.93 | 1.93 | 0.00 | 10.00 | 0.00 | |
k=6 | LASSO | 0.0114 | 0.9928 | 0.0116 | 0.9975 | 7.54 | 1.54 | 0.00 | 1.01 | 0.99 |
Example 4.4
In this example, the situation in which the number of covariates (p) is much larger than the number of observations (n) is considered under three simulation settings. The MLOG penalty is compared with the LASSO and the Elastic net. Here, n=50, p=100, and the same true coefficient vector is used for all three simulation settings,
Here, the 49 nonzero coefficients are chosen as the first 49 elements of the sequence κ introduced in Example 4.1. The simulation settings are as follows,
Model (a): The rows of X are sampled from , where and ;
Model (b): Let , , and . The rows of are sampled from , where and . , where Z are sampled from and ;
Model (c): The rows of are sampled from , where and . The remaining covariates are exactly equal to . That means .
Obviously, there is no strong correlation between the covariates in Model (a). In Model (b), the relevant covariates are strongly correlated with the irrelevant covariates. In Model (c), some covariates are exactly identical.
The simulation results are shown in Table 6. The MLOG performs best in all cases, while the LASSO and the Elastic net cannot select the true model. In the setting where some covariates are identical, the MLOG outperforms both the LASSO and the Elastic net penalties in terms of the probability of eliminating redundant covariates and prediction error.
Table 6. The simulation results of Example 4.4: mean squared error (MSE) of the estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates, and the probability of correctly eliminating the redundant identical covariates (prob).
Scenario | Method | ME | MSE | PE | MPSE | NONZERO | FP | FN | prob | |
---|---|---|---|---|---|---|---|---|---|---|
MLOG | 0.2448 | 0.0005 | 1.0077 | 0.4732 | 49.29 | 0.68 | 0.39 | |||
(a) | ENET | 3.2564 | 5.1164 | 4.0410 | 6.0826 | 45.15 | 18.83 | 22.68 | ||
LASSO | 5.1726 | 6.9262 | 5.2391 | 7.2317 | 27.86 | 10.45 | 31.61 | |||
MLOG | 0.2491 | 0.0009 | 1.1881 | 0.0632 | 49.17 | 0.63 | 0.46 | |||
(b) | ENET | 5.9036 | 7.5343 | 6.4448 | 7.8515 | 57.42 | 26.43 | 18.01 | ||
LASSO | 7.1262 | 7.7524 | 7.2141 | 8.4866 | 35.17 | 12.98 | 26.81 | |||
MLOG | 0.2455 | 0.0017 | 0.9168 | 0.9205 | 25.76 | 0.82 | 0.06 | 48.76 | 0.88 | |
(c) | ENET | 3.1427 | 3.6126 | 3.9263 | 3.9810 | 32.74 | 11.39 | 3.65 | 36.99 | 0.00 |
LASSO | 3.6124 | 4.1524 | 3.9787 | 4.0974 | 31.41 | 10.62 | 4.21 | 48.14 | 0.65 |
To summarize, the simulation results in Tables 1–6 suggest that in the absence of multicollinearity, both the SCAD and the MLOG perform well. However, when the covariates are strongly correlated or even identical, the MLOG performs better in terms of prediction error. Figures A1–A4 show more detailed simulation results of the above examples.
5. Real data examples
In this section, linear models are used and the proposed MLOG penalty is applied to the diabetes dataset in [6] and the prostate dataset in Tibshirani [25] and She [21,22]. To illustrate the performance in the case where the number of covariates (p) is much larger than the number of observations (n), the gene expression dataset of Scheetz et al. [20] is also used. The performance of the MLOG penalty is compared with the Elastic net, the LASSO, and the SCAD based on the number of selected covariates (NONZERO) and mean squared error. For each covariate, the standard error of the estimated coefficient is computed from 500 bootstrap samples (see [7]) as
where $\hat\beta_j^{(b)}$ is the estimate of $\beta_j$ at the b-th bootstrap sample and $\bar\beta_j$ is the average of the bootstrap estimates. The size of the bootstrap is chosen as B=1000.
The following is used to evaluate the standard error of the estimator,
where $\hat\beta^{(b)}$ is the estimate of β at the b-th bootstrap sample and $\bar\beta$ is the average of the bootstrap estimates. To validate the penalized estimators, training data and test data are considered. The model is fitted on a random subsample of size 300 and the remaining 142 observations are used as test data. This procedure is repeated 500 times. The mean prediction squared error (MPSE) and the standard deviation of the prediction squared errors (sd(PSE)) are reported.
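A minimal sketch of the bootstrap standard errors and the repeated split-sample prediction error is given below. For brevity the tuning parameter is held fixed across resamples (the paper re-tunes by BIC), and the function names are hypothetical.

```python
import numpy as np

def bootstrap_se(X, y, fit, lam, B=500, seed=0):
    """Bootstrap standard errors of the penalized coefficient estimates."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot = np.empty((B, X.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # resample observations with replacement
        boot[b] = fit(X[idx], y[idx], lam)
    return boot.std(axis=0, ddof=1)

def split_sample_mpse(X, y, fit, lam, n_train, n_rep=500, seed=0):
    """Mean and s.d. of prediction squared errors over repeated random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pse = np.empty(n_rep)
    for r in range(n_rep):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        beta = fit(X[train], y[train], lam)
        pse[r] = np.mean((y[test] - X[test] @ beta) ** 2)
    return pse.mean(), pse.std(ddof=1)

# e.g. split_sample_mpse(X, y, mm_penalized_ls, lam_opt, n_train=300) for the diabetes data.
```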
Example 5.1 Diabetes data —
The diabetes dataset contains n=442 observations from diabetes patients. There are ten baseline covariates, namely age, sex, body mass index (bmi), average blood pressure (bp), and six blood serum measurements. The response is a quantitative measure of disease progression one year after the baseline. Before the statistical analysis, the data are standardized so that the means of all variables are zero and the variances are one.
Some covariates are strongly correlated. For example, pairwise correlations of 0.897, 0.66, and −0.738 are observed among the covariates.
Tables 7 and 8 show the results of estimation and prediction computed from the bootstrap samples. The MLOG outperforms the LASSO, the SCAD, and the Elastic net in terms of both MSE and MPSE. Though the results of estimation and prediction are satisfactory for both the LASSO and the Elastic net, these methods tend to select more covariates than the MLOG and the SCAD. This is consistent with the results of [5].
Table 7. The bootstrap means and standard deviations of the estimate of the diabetes data.
Covariates | LS | MLOG | ENET | LASSO | SCAD |
---|---|---|---|---|---|
age | −0.006(0.035) | 0.000(0.012) | 0.000(0.034) | 0.000(0.012) | 0.000(0.013) |
sex | −0.148(0.039) | −0.094(0.052) | −0.146(0.041) | −0.093(0.042) | −0.104(0.056) |
bmi | 0.321(0.041) | 0.333(0.037) | 0.323(0.042) | 0.319(0.045) | 0.336(0.050) |
bp | 0.200(0.038) | 0.171(0.041) | 0.198(0.039) | 0.168(0.041) | 0.191(0.057) |
−0.489(0.223) | −0.011(0.046) | −0.347(0.258) | −0.028(0.041) | −0.021(0.072) | |
0.294(0.178) | 0.000(0.043) | 0.181(0.199) | 0.000(0.018) | 0.000(0.066) | |
0.062(0.112) | −0.136(0.059) | −0.000(0.126) | −0.129(0.045) | −0.136(0.082) | |
0.109(0.095) | 0.000(0.055) | 0.091(0.093) | 0.000(0.036) | 0.000(0.087) | |
0.464(0.090) | 0.297(0.051) | 0.411(0.105) | 0.296(0.049) | 0.313(0.063) | |
0.042(0.037) | 0.000(0.019) | 0.041(0.036) | 0.019(0.027) | 0.000(0.012) |
Table 8. The estimation results for diabetes data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).
Method | Selected Covariates | NONZERO | MSE | se(estimator) | MPSE | sd(PSE) |
---|---|---|---|---|---|---|
LS | age, sex, bmi, bp, ,…, | 10 | 0.4811 | 0.034 | 0.5106 | 0.0455 |
MLOG | sex, bmi, bp, , , | 6 | 0.4832 | 0.016 | 0.5118 | 0.0445 |
ENET | sex, bmi, bp, , , , , | 8 | 0.4935 | 0.039 | 0.5148 | 0.0452 |
LASSO | sex, bmi, bp, , , , | 7 | 0.4909 | 0.012 | 0.5165 | 0.0442 |
SCAD | sex, bmi, bp, , , | 6 | 0.4898 | 0.019 | 0.5224 | 0.0449 |
Example 5.2 Prostate data —
The prostate dataset has n=97 observations and 9 clinical measures. Following She [21], log(cancer volume) (lcavol) is taken as the response variable and a full quadratic model is considered: the 43 covariates are the 8 main effects, 7 squares, and 28 interactions of the eight original variables lweight, age, lbph, svi, lcp, gleason, pgg45, and lpsa, where svi is binary (so its square is excluded). To validate the estimation methods, 80 observations are randomly selected for model fitting and the remaining 17 observations are used for testing.
The covariates in the full quadratic model exhibit even stronger correlations than those in Example 5.1; for example, some within-group correlations exceed 0.98 and others exceed 0.93. The results of She [21] suggest that the LASSO does not give stable and accurate solutions in the presence of many highly correlated covariates.
Tables 9 and 10 show the performance of estimation and prediction based on the bootstrap samples. Since there are strongly correlated pairs of covariates, the Elastic net selects many redundant covariates and its performance in terms of MSE and MPSE is not as good as that of the MLOG. The MLOG tends to select fewer covariates than the LASSO and the Elastic net.
Table 9. The bootstrap means and standard deviations of the estimate of the prostate data.
Selected | MLOG | ENET | LASSO | Selected | MLOG | ENET | LASSO |
---|---|---|---|---|---|---|---|
Covariates | (3) | (12) | (4) | Covariates | (3) | (9) | (4) |
age | 0.0177(0.119) | age*gleason | 0.0005(0.021) | 0.0008(0.014) | |||
lcp | 0.0051(0.017) | 0.3059(0.536) | 0.0849(0.630) | age*pgg45 | −0.0001(0.001) | −0.0001(0.001) | |
gleason | 0.0816(1.148) | age*lpsa | 0.0002(0.009) | ||||
lpsa | 0.0079(0.018) | 0.1967(0.773) | 0.4809(0.773) | lbph*svi | 0.0931(0.222) | ||
lweight2 | −0.0187(0.321) | lbph*lcp | −0.0051(0.069) | ||||
lbph2 | 0.0515(0.090) | lbph*pgg45 | −0.0014(0.003) | ||||
lcp2 | 0.0526(0.074) | 0.0019(0.056) | lbph*lpsa | −0.0189(0.086) | −0.0043(0.061) | ||
pgg452 | −0.0001(0.010) | svi*pgg45 | −0.0038(0.016) | ||||
lpsa2 | 0.0311(0.108) | lcp*pgg45 | 0.0032(0.005) | 0.0001(0.004) | |||
lweight*lcp | 0.0886(0.140) | 0.0494(0.147) | lcp*lpsa | −0.1362(0.233) | |||
lweight*lpsa | 0.0184(0.155) | gleason*pgg45 | 0.0001(0.021) | ||||
age*lbph | −0.0009(0.010) | −0.0007(0.007) | gleason*lpsa | 0.0204(0.087) |
Table 10. The estimation results for prostate data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).
Method | Selected Covariates | NONZERO | MSE | se(estimator) | MPSE | sd(PSE) |
---|---|---|---|---|---|---|
lcp, lpsa, | ||||||
MLOG | age*lbph, age*gleason, | 6 | 0.4906 | 0.128 | 0.5355 | 0.1587 |
age*pgg45, gleason*pgg45 | ||||||
age, lcp, gleason, lpsa, lweight2, lbph2, | ||||||
pgg452, lpsa2, lweight*lcp, lweight*lpsa, | ||||||
ENET | age*lbph, age*pgg45, lbph*svi, lbph*lcp, | 21 | 1.8905 | 0.257 | 7.7054 | 22.8274 |
lbph*pgg45, lbph*lpsa, svi*pgg45, | ||||||
lcp*pgg45, lcp*lpsa, gleason*lpsa, lcp2 | ||||||
lcp, lpsa, lcp2, | ||||||
LASSO | lweight*lcp, age*gleason, age*lpsa | 8 | 0.5216 | 0.174 | 3.2233 | 13.3297 |
lbph*lpsa, lcp*pgg45 |
Example 5.3 Gene expression data —
The microarray data of Scheetz et al. [20] contains the gene expression levels of 200 TRIM32 genes collected from eye tissue samples of 120 rats. In this example, the number of covariates (p) is much greater than the number of observations (n). Moreover, some covariates are strongly correlated. In such a situation, the previous discussions suggest that the LASSO and the SCAD penalties cannot select the true model. To validate the estimation methods, we randomly select 100 observations for model fitting and the remaining 20 observations for testing.
Table 11 summarizes the estimation and prediction results for the gene expression data. The MLOG outperforms all other penalties in variable selection, estimation, and prediction. The Elastic net penalty leads to large biases. For the LASSO and the SCAD penalties, although the biases of the estimates are smaller than those of the Elastic net penalty, they select too many irrelevant covariates and give larger errors than the MLOG in both estimation and prediction.
Table 11. The results for gene expression data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).
Method | NONZERO | MSE | se(estimator) | MPSE | sd(PSE) |
---|---|---|---|---|---|
MLOG | 16 | 0.0047 | 0.0341 | 0.0139 | 0.0060 |
ENET | 29 | 5.8597 | 0.1127 | 4.9840 | 1.3391 |
LASSO | 120 | 0.7360 | 0.0414 | 0.0478 | 0.0143 |
SCAD | 122 | 0.1098 | 0.0816 | 0.0901 | 0.0686 |
6. Conclusion
In this paper, we introduce a new class of strictly concave penalty functions, in particular the modified log penalty, to improve prediction performance under multicollinearity. The proposed penalties exhibit certain nice properties, as described in Section 2, even in multicollinearity cases. In the weakly correlated cases, these penalties perform as well as the SCAD penalty. In the multicollinearity or highly correlated cases, the proposed penalties tend to select fewer covariates. Real data analysis and simulation studies show that the modified log penalty outperforms the LASSO, the SCAD, and the Elastic net in terms of prediction error in general.
Appendix 1. Proofs
A.1. Technical lemmas
Proposition A.1
A vector $\hat\beta$ is a solution to the minimization problem (3) only if the following conditions are satisfied,
(A1) and
(A2)
Proof.
First, we have the following lemma.
Lemma A.2
Let be a function on . Suppose that attains minimum value at . Then, the function
attains minimum value at .
The proof of Lemma A.2 is trivial. Below, the proof of Proposition A.1 is given. Let
For all with , we have
Let be a solution to the minimization problem . Define and . Without loss of generality assume that
According to Lemma A.2, is a solution to the minimization problem
Therefore,
It is equivalent to
That means and thus
Now, consider j>m. For all , let
where α is the jth element. Since is the global minimizer of , we have
On the other hand, simple algebraic manipulations show that
Therefore,
Since for all j>m and , we have
Choose , we have
Let . Then,
It is equivalent to
Choose sufficiently small such that . Then,
(A3)
The condition (A3) holds for any small γ. Taking we have
Therefore, the stated condition follows. Since $p_\lambda$ is a strictly concave penalty, the derivative $p_\lambda'$ is non-increasing on $(0,\infty)$. This completes the proof.
A.2. Proof of Theorem 2.3
Let and . Denote the number of components of J by h. Obviously, the system of column vectors is linearly independent if h=1.
Consider h=q+1,q>0. By contradiction assume that the system of column vectors is linearly dependent. Without loss of generality, assume that
Since is linearly dependent and , the system of vectors is also linearly dependent. Then, there exist real values not all zero such that
Without loss of generality, assume that
Define , . We get , and
(A4)
Since is the solution and , , Proposition A.1 suggests that
From (A4), we have
Then,
Therefore,
(A5)
From (A4), . For any , define
We have
Since by assumption, we have , . Consider
and
where
To obtain a contradiction and complete the proof, we need to show that . We have
where
and
Let , , and . Since is strictly concave penalty, we have M<0. Note that , we have
If , choose arbitrarily. Otherwise, choose
For all we have
Then, the function is strictly decreasing on . Therefore,
However,
From (A5), we have and . Therefore,
That means the function is strictly increasing on . Therefore, . It is easy to see that
Therefore,
That means . This completes the proof.
A.3. Proof of Proposition 2.4
Result (a) is a direct consequence of Theorem 2.3. The proof of (b) is given in the following. Let
We have
(A6)
On the other hand, for all , let . We have
Then, . Therefore
(A7)
A.4. Proof of Proposition 2.7
Since the matrix is non-negative definite, it is invertible if and only if it is positive definite. That means
(A8)
We have
Then,
This completes the proof.
Appendix 2. Figures.
Figure A.1.
Simulation results of Example 4.1 – The mean number of false positives (FP) and false negatives (FN). Panels (a), (c), (e) show the results of Scenario (a). Panels (b), (d), (f) show the results of Scenario (b).
Figure A.2.
Simulation results of Example 4.2 – The mean number of false positives (FP) and false negatives (FN): Panels (a), (b) and (c) show the results of Scenario (a). Panels (d), (e) and (f) show the results of Scenario (b). Panels (g), (h) and (i) show the results of Scenario (c).
Figure A.3.
Simulation results of Example 4.3 – The mean number of false positives (FP) and false negatives (FN): Panels (a), (b) and (c) show the results of Scenario (a). Panels (d), (e) and (f) show the results of Scenario (b). Panels (g), (h) and (i) show the results of Scenario (c).
Figure A.4.
Simulation results of Example 4.4 – The mean number of false positives (FP) and false negatives (FN): Panel (a) shows the results of Model (a). Panel (b) shows the results of Model (b). Panel (c) shows the results of Model (c).
Funding Statement
Chi Tim Ng's work is supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2017R1C1B2011652).
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- 1.Antoniadis A. and Fan J., Regularization of wavelet approximations, J. Am. Stat. Assoc. 96 (2001), pp. 939–967. doi: 10.1198/016214501753208942 [DOI] [Google Scholar]
- 2.Breiman L., Heuristics of instability and stabilization in model selection, Ann. Statist. 24 (1996), pp. 2350–2383. doi: 10.1214/aos/1032181158 [DOI] [Google Scholar]
- 3.Chatterjee S. and Hadi A.S., Regression Analysis by Example, 5th ed., John Wiley & Sons, Inc., Hoboken, New Jersey, 2012, 424p. [Google Scholar]
- 4.Chong I.-G. and Jun C.-H., Performance of some variable selection methods when multicollinearity is present, Chemometr. Intell. Lab. Syst. 78 (2005), pp. 103–112. doi: 10.1016/j.chemolab.2004.12.011 [DOI] [Google Scholar]
- 5.Dalayan A., Hebiri M., and Lederer J., On the prediction performance of the LASSO, Bernoulli 23 (2017), pp. 552–581. doi: 10.3150/15-BEJ756 [DOI] [Google Scholar]
- 6.Efron B., Hastie T., Johnstone I., and Tibshirani R., Least angle regression, Ann. Statist. 32 (2004), pp. 407–499. doi: 10.1214/009053604000000067 [DOI] [Google Scholar]
- 7.Efron B. and Tibshirani R.J., An Introduction to the Bootstrap, 1st ed., Chapman & Hall, New York, 1993, 456p. [Google Scholar]
- 8.Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273 [DOI] [Google Scholar]
- 9.Fan J. and Lv J., Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B 70 (2008), pp. 849–911. doi: 10.1111/j.1467-9868.2008.00674.x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Fan J. and Lv J., A selective overview of variable selection in high dimensional feature space, Stat. Sin. 20 (2010), pp. 101–148. [PMC free article] [PubMed] [Google Scholar]
- 11.Fan J. and Lv J., Nonconcave penalized likelihood with NP-dimensionality, IEEE Trans. Inform. Theory 57 (2011), pp. 5467–5484. doi: 10.1109/TIT.2011.2158486 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Fan J. and Peng H., Nonconcave penalized likelihood with a diverging number of parameters, Ann. Statist. 32 (2004), pp. 928–961. doi: 10.1214/009053604000000256 [DOI] [Google Scholar]
- 13.Fan Y. and Tang C.Y., Tuning parameter selection in high dimensional penalized likelihood, J. R. Stat. Soc. Ser. B 75 (2013), pp. 531–552. doi: 10.1111/rssb.12001 [DOI] [Google Scholar]
- 14.Fitrianto A. and Lee C.Y., Performance of Ridge regression estimator methods on small sample size by varying correlation coefficients: A simulation study, J. Math. Statist. 10 (2014), pp. 25–29. doi: 10.3844/jmssp.2014.25.29 [DOI] [Google Scholar]
- 15.Hoerl A.E. and Kennard R.W., Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970), pp. 55–67. doi: 10.1080/00401706.1970.10488634 [DOI] [Google Scholar]
- 16.Hunter D.R. and Li R., Variable selection using MM algorithms, Ann. Statist. 33 (2005), pp. 1617–1642. doi: 10.1214/009053605000000200 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Jolliffe I.T., A note on the use of principal components in regression, Appl. Stat. 31 (1982), pp. 300–303. doi: 10.2307/2348005 [DOI] [Google Scholar]
- 18.Konno H. and Takaya Y., Multi-step methods for choosing the best set of variables in regression analysis, Comput. Optim. Appl. 46 (2010), pp. 417–426. doi: 10.1007/s10589-008-9193-6 [DOI] [Google Scholar]
- 19.Ng C.T., Oh S., and Lee Y., Going beyond oracle property: Selection consistency and uniqueness of local solution of the generalized linear model, Stat. Methodol. 32 (2016), pp. 147–160. doi: 10.1016/j.stamet.2016.05.006 [DOI] [Google Scholar]
- 20.Scheetz T., Kim K., Swiderski R., Philp A., Braun T., Knudtson K., Dorrance A., DiBona G., Huang J., and Casavant T., Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proc. Natl. Acad. Sci. 103 (2006), pp. 14429–14434. doi: 10.1073/pnas.0602562103 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.She Y., Thresholding-based iterative selection procedures for model selection and shrinkage, Electron. J. Stat. 3 (2009), pp. 384–415. doi: 10.1214/08-EJS348 [DOI] [Google Scholar]
- 22.She Y., An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors, Comput. Stat. Data. Anal. 56 (2012), pp. 2976–2990. doi: 10.1016/j.csda.2011.11.013 [DOI] [Google Scholar]
- 23.Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., and Matsui T., Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor, Optimization Online (2016). Available at http://www.optimization-online.org/DB_HTML/2016/09/5655.html
- 24.Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., and Matsui T., Best subset selection for eliminating multicollinearity, J. Oper. Res. Soc. Japan 60 (2017), pp. 321–336. [Google Scholar]
- 25.Tibshirani R., Regression shrinkage and selection via the LASSO, J. R. Stat. Soc. Ser. B 58 (1996), pp. 267–288. [Google Scholar]
- 26.Wang H., Li R., and Tsai C.L., Tuning parameter selectors for the smoothly clipped absolute deviation method, Biometrika 94 (2007), pp. 553–568. doi: 10.1093/biomet/asm053 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Wold S., Ruhe A., Wold H., and Dunn III W.J., The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM J. Sci. Stat. Comput. 5 (1984), pp. 735–743. doi: 10.1137/0905052 [DOI] [Google Scholar]
- 28.Zhang C.-H., Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729 [DOI] [Google Scholar]
- 29.Zou H., The adaptive LASSO and its oracle properties, J. Am. Stat. Assoc. 101 (2006), pp. 1418–1429. doi: 10.1198/016214506000000735 [DOI] [Google Scholar]
- 30.Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B 67 (2005), pp. 301–320. doi: 10.1111/j.1467-9868.2005.00503.x [DOI] [Google Scholar]