Journal of Applied Statistics
2019 Jul 3;47(2):201–230. doi: 10.1080/02664763.2019.1637829

Variable selection under multicollinearity using modified log penalty

Van Cuong Nguyen, Chi Tim Ng
PMCID: PMC9041714  PMID: 35706515

ABSTRACT

To handle the multicollinearity issues in the regression analysis, a class of ‘strictly concave penalty function’ is described in this paper. As an example, a new penalty function called ‘modified log penalty’ is introduced. The penalized estimator based on strictly concave penalties enjoys the oracle property under certain regularity conditions discussed in the literature. In the multicollinearity cases where such conditions are not applicable, the behaviors of the strictly concave penalties are discussed through examples involving strongly correlated covariates. Real data examples and simulation studies are provided to show the finite-sample performance of the modified log penalty in terms of prediction error under scenarios exhibiting multicollinearity.

KEYWORDS: Grouping effect, modified log penalty, multicollinearity, penalized regression, strictly concave penalty function

1. Introduction

In regression analysis, multicollinearity occurs when two or more covariates are strongly correlated; see [3,24] for general discussions of the multicollinearity issues. Multicollinearity leads to computational difficulties related to the inversion of a nearly singular matrix and results in low efficiency in model estimation and prediction. To eliminate the multicollinearity, it is of paramount importance to select a parsimonious model that excludes redundant covariates that can be predicted from other covariates.

In the literature, several approaches have been proposed to overcome the difficulties of multicollinearity. One approach is the best subset selection method proposed in [18,23,24]. The model selection problem is reformulated as a constrained integer quadratic programming problem involving indicators of multicollinearity, such as the condition number of the correlation matrix and the variance inflation factor. However, solving such integer quadratic programming problems can be computationally intensive. Another approach is the partial least squares regression method discussed in [4,17,27]. The idea underlying this approach is to reduce the correlations in the covariates by means of orthogonal transformations.

Over the past two decades, penalized regression methods have been widely studied for the purpose of variable selection, to name a few, [8,10,15,19,25,28–30]. The idea is to use a penalty function that is non-differentiable at zero to shrink small regression coefficients towards zero. Parsimony and the grouping effect are two important criteria for evaluating the performance of a penalty function. These two criteria can conflict with each other in multicollinearity cases. The grouping effect (see, [30]) means that strongly correlated covariates tend to be selected or deselected together. Parsimony can be described through the ability of a variable selection method to recover the so-called 'true subset' that is relevant to the response. For example, the idea of the oracle property described in [8,10,12] has been widely used for such a purpose. However, the definition of the true subset can be ambiguous in the multicollinearity cases where some covariates can be predicted from other covariates. Consider the example where two covariates $X_1$ and $X_2$ are identical and $X_1$ is relevant to the response $Y$. In such a situation, the models $E(Y)=2X_1$ and $E(Y)=X_1+X_2$ are equivalent. The first one is more parsimonious. On the other hand, the grouping effect requires that both $X_1$ and $X_2$ be selected. In certain applications such as microarray data analysis, the grouping effect is considered to be a desirable property. However, in applications where prediction is the main goal, the situation can be different because the parsimonious model with the redundant covariates removed tends to give a smaller prediction error.

There is a lack of literature discussing the penalty functions that achieve parsimony in the variable selection problem under the presence of multicollinearity. The Elastic net penalty in [30] is designed to achieve grouping effect in the multicollinearity cases. The Ridge penalty (see, [14,15]), the LASSO penalty (see, [25]), and the Elastic net penalty (see, [30]) do not guarantee the oracle properties in [8,10,11]. Under some regularity conditions on the minimum singular value of the design matrix, the non-concave penalty functions, the SCAD (see, [8,9]) and the MCP (see, [28]), lead to approximately unbiased estimates and guarantee the oracle properties. However, such regularity conditions cannot cover the multicollinearity cases with strong correlations in the covariates.

The aim of this paper is to introduce a new class of strictly concave penalty functions that achieve parsimony even in the multicollinearity cases. It is illustrated that in situations without multicollinearity, for example, fulfilling the regularity conditions in [10], these penalties perform as well as the SCAD penalty in terms of estimation error, prediction error, mean number of false positives, and mean number of false negatives. In the cases where some covariates are identical, at most one among these identical covariates is selected. This means that the redundant covariates can be removed automatically from the model. Moreover, the local quadratic approximation method or majorization-minimization algorithm (MM-algorithm) proposed in [8] and [16] can be used to obtain the estimates. As an example of a 'strictly concave penalty function', a new penalty function called the 'modified log penalty' is introduced.

The paper is organized as follows. The strictly concave penalized likelihood estimator and its properties are discussed in Section 2. The modified log penalty is introduced in Section 3. The simulation studies are given in Section 4 to compare the finite-sample performances of the proposed penalty and other penalties, including the Elastic net, the LASSO, and the SCAD. Some real data examples are given in Section 5. The concluding remarks are presented in Section 6.

2. Penalized linear regression with strictly concave penalty

In this section, the strictly concave penalties are introduced to enhance parsimonious model selection in multicollinearity cases. By contrast, the Elastic net penalty of [30] is strictly convex and exhibits the grouping effect.

2.1. The strictly concave penalized likelihood estimator

Consider the linear regression model:

Y=Xβ+ε, (1)

where $Y$ is the $n\times 1$ response vector, $X$ is the $n\times p$ design matrix, $\beta=(\beta_1,\beta_2,\ldots,\beta_p)^T$ is the vector of unknown parameters, $\varepsilon=(\varepsilon_1,\ldots,\varepsilon_n)^T$ is the model error, and $\varepsilon_i$, $i=1,2,\ldots,n$, are independent $N(0,\sigma^2)$ random variables. The strictly concave penalty function is defined below.

Definition 2.1 The strictly concave penalty function —

Let $\lambda>0$ be a tuning parameter. A function $P(\cdot,\lambda)$ is called a strictly concave penalty function if the following conditions are satisfied,

  1. $P(\cdot,\lambda)$ has a continuous second order derivative on $[0,\infty)$,

  2. $P'(\theta,\lambda)\to 0$ as $\theta\to+\infty$,

  3. $-1<P''(\theta,\lambda)<0$ for all $\theta>0$, and

  4. $P(0,\lambda)=0$.

A strictly concave penalty function is a non-concave penalty function described in [8,10]. The seemingly confusing use of the terms can be resolved by noting that ‘strictly concave’ here refers to the domain [0,) while ‘non-concave’ in [8,10] refers to the domain (,). It can be checked that the SCAD penalty, the MCP penalty, and Lr penalties of [1] are non-concave but not strictly concave.

Consider the following penalized least squares problem: To minimize

$\ell(\theta)=\frac{1}{2}(\theta-t)^{2}+P(|\theta|,\lambda)$ (2)

with respect to $\theta$, where $t$ is the observed signal and $\theta$ is the unknown. It is suggested in [1] that the conditions in Definition 2.1 guarantee the existence and uniqueness of $\hat\theta(t)$, the solution to the optimization problem (2). Moreover, $\hat\theta(t)$ is a continuous function of $t$ and $\hat\theta(t)-t\to 0$ as $t\to\infty$. These conditions are necessary for reducing model complexity and model bias in prediction (see, [2,10]).
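As a quick numerical illustration of problem (2), the minimizer $\hat\theta(t)$ can be approximated by a direct search over a fine grid for any candidate penalty. The helper below is only a sketch under that assumption; the function name `scalar_threshold` and the example penalty parameterization are illustrative, not taken from the paper.

```python
import numpy as np

def scalar_threshold(t, penalty, lam, half_width=None, num=200001):
    """Approximate the minimizer of (2): 0.5*(theta - t)^2 + P(|theta|, lam).

    `penalty` is any callable P(theta, lam) defined for theta >= 0; the search
    is carried out over a fine symmetric grid around zero.
    """
    w = half_width if half_width is not None else abs(t) + 10.0 * np.sqrt(lam) + 1.0
    theta = np.linspace(-w, w, num)
    objective = 0.5 * (theta - t) ** 2 + penalty(np.abs(theta), lam)
    return theta[np.argmin(objective)]

# Illustration with a log-type penalty of the kind introduced in Section 3:
P = lambda th, lam: lam * np.log(1.0 + th / np.sqrt(lam))
print([round(scalar_threshold(t, P, 1.0), 3) for t in (0.5, 1.5, 5.0, 50.0)])
```

For large $t$ the output approaches $t$, in line with the requirement $\hat\theta(t)-t\to 0$.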

Suppose that the data set has $n$ observations and $p$ covariates. Let $Y=(y_1,\ldots,y_n)^T$ be the response and $X=[X_1,\ldots,X_p]$ be the design matrix, where $X_j=(x_{1j},\ldots,x_{nj})^T$, $j=1,\ldots,p$, are the covariates. The strictly concave penalized likelihood estimator is defined as follows.

Definition 2.2 The strictly concave penalized likelihood estimator —

Let P(,λ) be a strictly concave penalty. For any fixed non-negative λ, the strictly concave penalized likelihood estimator of β in Model (1) is defined as

$\hat\beta(\lambda)=\arg\min_{\beta}\Big\{\frac{1}{2n}\|Y-X\beta\|^{2}+\sum_{j=1}^{p}P(|\beta_j|,\lambda)\Big\}.$ (3)

For simplicity, if no confusion is caused, we write $\hat\beta$ instead of $\hat\beta(\lambda)$. Since a strictly concave penalty is also a nonconcave penalty, the majorization-minimization algorithm of [16] can be applied to obtain the penalized least squares estimator (3). Similar to the SCAD penalty, if the design matrix $X$ and the model error $\varepsilon$ satisfy all regularity conditions described in [11], the penalized likelihood estimator $\hat\beta(\lambda)$ always exists and fulfills the so-called oracle properties. Such a property is not guaranteed for the LASSO and the Elastic net.

2.2. Parsimonious variable selection in the multicollinearity case

In this subsection, the properties of the strictly concave penalized estimator are discussed under general multicollinearity cases. In such situations, the regularity conditions in [11] can be violated and the penalized estimation methods based on the SCAD penalty of [8] and the MCP penalty of [28] are not guaranteed to select the true model.

To illustrate the ideas, consider the following simple example. Suppose that $Y=\beta_1X_1+\beta_2X_2+\varepsilon$ and $X_1=X_2$. Since both the SCAD penalty and the MCP penalty are constant beyond some point, the local solution $\hat\beta=(X_1^TY/X_1^TX_1-K,\,K)$ always gives the same penalized likelihood value when $K$ is smaller than some critical value. This means that along the direction $(1,-1)$, the penalized likelihood is flat, which creates difficulties in the numerical optimization of the penalized likelihood. If the LASSO penalty is used and $X_1^TY/X_1^TX_1$ is positive, similar difficulties occur because the penalized likelihood is constant for sufficiently small $K$. The situation is very different if a strictly concave penalty is used instead because such penalties are no longer flat far away from zero.
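The flat direction can be checked numerically. The snippet below is a sketch with hypothetical values of $X_1^TY/X_1^TX_1$, $K$, and $\lambda$: it evaluates the penalty part of the objective along $\beta=(c-K,K)$ for the standard SCAD formula (with $a=3.7$) and for the modified log penalty introduced later in Section 3; the fitted part does not depend on $K$ because $X_1=X_2$.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty function (Fan and Li, 2001)."""
    theta = np.abs(theta)
    mid = (2 * a * lam * theta - theta ** 2 - lam ** 2) / (2 * (a - 1))
    return np.where(theta <= lam, lam * theta,
                    np.where(theta <= a * lam, mid, (a + 1) * lam ** 2 / 2))

def mlog_penalty(theta, lam):
    """Modified log penalty, in the parameterization of Section 3."""
    return lam * np.log(1.0 + np.abs(theta) / np.sqrt(lam))

c, lam = 10.0, 1.0          # hypothetical X1'Y / X1'X1 and tuning parameter
for K in (4.0, 4.5, 5.0):   # both coordinates stay beyond the SCAD flat point
    scad_val = float(scad_penalty(c - K, lam) + scad_penalty(K, lam))
    mlog_val = float(mlog_penalty(c - K, lam) + mlog_penalty(K, lam))
    print(K, round(scad_val, 4), round(mlog_val, 4))  # SCAD column is constant
```

The SCAD column stays constant as $K$ moves, whereas the MLOG column changes, illustrating why the strictly concave penalty does not leave a flat direction.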

To describe the general multicollinearity cases, suppose that the design matrix $X$ is generated by perturbing a non-full rank matrix $U$ with a small quantity $\delta Z$. Here, the dimensions of both $U$ and $Z$ are the same as those of $X$. Parsimony requires no linear dependence between the columns of $U$ corresponding to the selected covariates. Detailed results are given in the following theorem.

Theorem 2.3

For any integers $n,p>0$ and $n\times p$ matrices $U=[U_1,\ldots,U_p]$ and $Z=[Z_1,\ldots,Z_p]$, there exists a positive constant $\varkappa$ (depending on $X$ and $Z$) such that the system of column vectors corresponding to the chosen covariates $\{U_j=X_j-\delta Z_j\,|\,\hat\beta_j\neq 0,\ j=1,\ldots,p\}$ is linearly independent for all $0\le\delta\le\varkappa$, where $\hat\beta=(\hat\beta_1,\ldots,\hat\beta_p)^T$ is the strictly concave penalized likelihood estimator of $\beta$ in Model (1).

The proof is given in Appendix A.2. Further results of the special case where δ=0 or Z is a zero matrix are summarized in the following proposition.

Proposition 2.4

Let $\hat\beta=(\hat\beta_1,\ldots,\hat\beta_p)^T$ be the strictly concave penalized likelihood estimator of $\beta$ in Model (1).

  (a) The system of column vectors $\{X_j\,|\,\hat\beta_j\neq 0,\ j=1,\ldots,p\}$ is linearly independent. Let $m=\mathrm{rank}(X)$. Then, the number of nonzero estimated coefficients satisfies
    $h=\#\{j\in\overline{1,p}\,|\,\hat\beta_j\neq 0\}\le m.$
  (b) Without loss of generality assume that $\hat\beta_1,\ldots,\hat\beta_h$ are nonzero and the system $X_1,\ldots,X_h,X_{h+1},\ldots,X_m$ is linearly independent. Let $X'=[X_1,\ldots,X_m]$ and $\hat\beta'=(\hat\beta_1,\ldots,\hat\beta_m)^T$. Then,
    $\hat\beta'=\arg\min_{u\in\mathbb{R}^m}\Big\{\frac{1}{2n}\|Y-X'u\|^{2}+\sum_{j=1}^{m}P(|u_j|,\lambda)\Big\}.$ (4)

Result (a) illustrates the crucial difference between the penalized regression methods based on the proposed strictly concave penalty and other commonly used convex penalties, including the Ridge penalty and the Elastic net penalty. The grouping effect of the Elastic net [30] is incompatible with the linear independence of $\{X_j\,|\,\hat\beta_j\neq 0,\ j=1,\ldots,p\}$. Roughly speaking, if a strictly concave penalty is used, there is no redundancy in the selected variables.

Result (b) suggests that the properties of the penalized likelihood estimator in the non-full rank $X$ cases can be studied indirectly through $X'$. To see this, introduce the notations $\beta_{(1)}=(\beta_1,\ldots,\beta_m)^T$, $\beta_{(2)}=(\beta_{m+1},\ldots,\beta_p)^T$, and $X^{*}=[X_{m+1},\ldots,X_p]$. Since the columns of $X'$ form a maximal linearly independent system, there exists an $m\times(p-m)$-dimensional matrix $C$ such that

$X^{*}=X'C.$

The true model is equivalent to

$Y=X\beta+\varepsilon=X'\beta'+\varepsilon,$

where $\beta'=\beta_{(1)}+C\beta_{(2)}$. This means that if we are able to show the oracle properties under the design matrix $X'$, the model selected based on the proposed method selects the covariates corresponding to the non-zero entries of $\beta'$ in the equivalent model with probability going to one. Following the arguments of Fan and Lv in [11], we state without proof the following proposition on the oracle property of the penalized likelihood estimates based on a strictly concave penalty.
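The reduction above is easy to verify numerically. The following sketch builds a rank-deficient design from a hypothetical $X'$ and $C$ and checks that the two parameterizations give the same mean vector; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 50, 3, 5
X_prime = rng.normal(size=(n, m))          # X': m linearly independent columns
C = rng.normal(size=(m, p - m))            # coefficients expressing the rest
X_star = X_prime @ C                       # X* = X'C (redundant columns)
X = np.hstack([X_prime, X_star])           # full design of rank m < p

beta = rng.normal(size=p)
beta_prime = beta[:m] + C @ beta[m:]       # beta' = beta_(1) + C beta_(2)

print(np.linalg.matrix_rank(X))                      # m = 3
print(np.allclose(X @ beta, X_prime @ beta_prime))   # True: X beta = X' beta'
```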

Proposition 2.5

Let $X'$ be defined as in Proposition 2.4. If $X'$ and the error terms $\varepsilon$ satisfy all regularity conditions of [11], then $\hat\beta'$ defined by (4) fulfills the so-called oracle properties in [11]. That means,

  (a) With probability tending to 1 as $n\to\infty$, the penalized likelihood estimator $\hat\beta'=(\hat\beta_{(1)}^{\prime T},\hat\beta_{(2)}^{\prime T})^T$ satisfies:
    $\hat\beta'_{(2)}=0\quad\text{and}\quad\|\hat\beta'-\beta'\|_2=O_P(\sqrt{s}\,n^{-1/2}),$
    where $\hat\beta'_{(1)}$ is the subvector of $\hat\beta'$ formed by the components in $\mathrm{supp}(\beta')$ and $s$ is the size of $\hat\beta'_{(1)}$;
  (b) $A_n(X_{(1)}^{\prime T}X'_{(1)})^{1/2}(\hat\beta'_{(1)}-\beta'_{(1)})\to_d N(0,\sigma^2G),$
    where $\sigma^2=\mathrm{var}(\varepsilon_i)$, $A_n$ is an $m\times s$ matrix such that $A_nA_n^T\to G$, $G$ is an $m\times m$ symmetric positive definite matrix, and $X'_{(1)}$ is the submatrix of $X'$ corresponding to $\beta'_{(1)}$.

To compare the strictly concave penalty to the Elastic net penalty, consider the cases with identical covariates that lead to the so-called grouping effects described in [30]. The following proposition follows immediately from Proposition 2.4 and is stated without proof.

Proposition 2.6

Let $\hat\beta=\hat\beta(\lambda)$ be the strictly concave penalized likelihood estimator (3). Then, the following hold,

  (a) If $X_i=X_j+\delta V$, $i\neq j$, and $\mathrm{Var}(V)=1$, then for sufficiently small $\delta$, $\hat\beta_i\hat\beta_j=0$.

  (b) Suppose that $X_{q+1}=X_{q+2}=\cdots=X_p$ for some $q<p$. Without loss of generality assume that $\hat\beta_{q+1}\neq 0$. Then, $\hat\beta''=(\hat\beta_1,\ldots,\hat\beta_q,\sum_{j=q+1}^{p}\hat\beta_j)^T$ fulfills
    $\hat\beta''=\arg\min_{u\in\mathbb{R}^{q+1}}\Big\{\frac{1}{2n}\|Y-X''u\|^{2}+\sum_{j=1}^{q+1}P(|u_j|,\lambda)\Big\},$ (5)
    where $X''=[X_1,\ldots,X_{q+1}]$.

2.3. Feasibility of the majorization minimization algorithm

The majorization-minimization algorithm of [16], which is closely related to the local quadratic approximation method in [8], can be employed to minimize the penalized likelihood function (3) when the penalty function $P(\cdot,\lambda)$ is non-concave. Note that the majorization-minimization algorithm is applicable only when the matrix $X^TX+n\Omega$ is invertible, where $X$ is the design matrix, $\Omega=\mathrm{diag}(\omega_1,\ldots,\omega_p)$, $\omega_j=P'(|\beta_j|,\lambda)/(\delta+|\beta_j|)\ge 0$, and $\delta$ is a given small positive value, say $10^{-8}$. Let

$J_1=\{j\,|\,j=1,\ldots,p,\ \omega_j=0\}\quad\text{and}\quad J_2=\{j\,|\,j=1,\ldots,p,\ \omega_j\neq 0\}.$

We have the following results.

Proposition 2.7

The matrix $X^TX+n\Omega$ is invertible if and only if the system of columns $X_j$, $j\in J_1$, is linearly independent. In particular, if $\omega_j>0$ for all $j$ or the design matrix $X$ has full column rank, the matrix $X^TX+n\Omega$ is invertible.

The proof is given in Appendix A.4. Below, the invertibility of $X^TX+n\Omega$ is discussed for different kinds of penalty functions. Note that, for any strictly concave penalty $P(\cdot,\lambda)$, $\omega_j>0$, $j=1,\ldots,p$. Consequently, $J_1=\emptyset$ and the conclusion of Proposition 2.7 holds trivially. However, this is not true for the SCAD, the MCP, and the HARD penalty because these penalties are constant beyond some critical point. As a result, $\omega_j$ can be zero if $\beta_j$ is large. Then, the set $J_1$ can be nonempty. The linear independence assumption on $X_j$, $j\in J_1$, in Proposition 2.7 is not guaranteed in the multicollinearity cases.
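To make the iteration concrete, the following is a minimal sketch of the majorization-minimization update described above, assuming a differentiable penalty supplied through its derivative `dP`; the function name, initialization, and stopping rule are illustrative rather than the authors' implementation.

```python
import numpy as np

def mm_penalized_ls(X, y, lam, dP, n_iter=200, delta=1e-8, tol=1e-10):
    """Majorization-minimization sketch for problem (3).

    dP(theta, lam) is the penalty derivative P'(theta, lam) for theta >= 0.
    Each step solves the ridge-type system (X^T X + n*Omega) beta = X^T y with
    Omega = diag( P'(|beta_j|, lam) / (delta + |beta_j|) ), as in Section 2.3.
    """
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # least squares start
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        omega = dP(np.abs(beta), lam) / (delta + np.abs(beta))
        beta_new = np.linalg.solve(XtX + n * np.diag(omega), Xty)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0                  # deselection rule of Section 4
    return beta
```

For a strictly concave penalty every $\omega_j$ is positive, so the linear system above is always solvable even when $X$ is rank deficient; for the modified log penalty of Section 3, `dP(theta, lam)` can be taken as `lam / (np.sqrt(lam) + theta)` under the parameterization in (6).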

3. Modified log penalty

In this section, the modified log penalty (MLOG), a special case of strictly concave penalty, is introduced.

Definition 3.1 Modified log penalty function —

The modified log penalty (MLOG) is defined as

$P^{(\mathrm{MLOG})}(\theta,\lambda)=\lambda\log\Big(1+\frac{|\theta|}{\sqrt{\lambda}}\Big),$ (6)

where $\lambda>0$ is a tuning parameter.

Note that $\lim_{\lambda\to 0^{+}}P^{(\mathrm{MLOG})}(\theta,\lambda)=0$ and $\lim_{\lambda\to\infty}P^{(\mathrm{MLOG})}(\theta,\lambda)/(\sqrt{\lambda}\,|\theta|)=1$ for all $\theta\neq 0$. Therefore, the modified log penalized likelihood estimate behaves like the ordinary least squares estimate when $\lambda$ is close to 0 and behaves like the LASSO estimate when $\lambda$ goes to infinity. For fixed $\theta\neq 0$, as $\lambda$ goes to zero, $\lambda\log(1+|\theta|/\sqrt{\lambda})\approx\lambda\log(|\theta|/\sqrt{\lambda})=\lambda\log|\theta|-\frac{1}{2}\lambda\log\lambda$. Neglecting the constant term, it becomes the logarithmic function.

In the modified log penalty, one is added to the term $|\theta|/\sqrt{\lambda}$ to avoid the singularity at zero. One can consider a more general penalty function $\lambda\log(1+\mu|\theta|)$, where $\mu>0$ is given. In this paper, $\mu=\mu_0=1/\sqrt{\lambda}$ is chosen because it is the greatest possible value of $\mu$ that guarantees the uniqueness and existence of $\hat\theta(t)$, the solution to the minimization problem (2). To see this, consider the first order condition

$0=\frac{d}{d\theta}\Big(\frac{1}{2}(\theta-t)^{2}+\lambda\log(1+\mu\theta)\Big)=\theta-t+\frac{\lambda\mu}{1+\mu\theta}.$

The existence and uniqueness of the solution can be established by noting that the derivative of the right-hand side, $1-\lambda\mu^{2}/(1+\mu\theta)^{2}$, is non-negative for all $\theta\in[0,\infty)$ only when $\mu\le 1/\sqrt{\lambda}$.

Following [1], the thresholding rule of the MLOG penalty refers to the function $\Phi^{(\mathrm{MLOG})}(t,\lambda)=\hat\theta(t)$ and is given by

$\Phi^{(\mathrm{MLOG})}(t,\lambda)=\begin{cases}\dfrac{t-\sqrt{\lambda}}{2}+\sqrt{\dfrac{(|t|+\sqrt{\lambda})^{2}}{4}-\lambda}, & t\ge\sqrt{\lambda},\\[4pt] 0, & |t|<\sqrt{\lambda},\\[4pt] \dfrac{t+\sqrt{\lambda}}{2}-\sqrt{\dfrac{(|t|+\sqrt{\lambda})^{2}}{4}-\lambda}, & t\le-\sqrt{\lambda}.\end{cases}$ (7)
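The penalty (6) and the thresholding rule (7) are straightforward to implement. The sketch below assumes the $\sqrt{\lambda}$ parameterization given above; the function names are illustrative.

```python
import numpy as np

def mlog_penalty(theta, lam):
    """Modified log penalty (6)."""
    return lam * np.log(1.0 + np.abs(theta) / np.sqrt(lam))

def mlog_threshold(t, lam):
    """Thresholding rule (7); vectorized over t."""
    t = np.asarray(t, dtype=float)
    root = np.sqrt(np.maximum((np.abs(t) + np.sqrt(lam)) ** 2 / 4.0 - lam, 0.0))
    out = np.where(t >= np.sqrt(lam), (t - np.sqrt(lam)) / 2.0 + root, 0.0)
    return np.where(t <= -np.sqrt(lam), (t + np.sqrt(lam)) / 2.0 - root, out)

print(mlog_threshold([-3.0, -0.5, 0.5, 1.5, 3.0], lam=1.0))
# approximately [-2.732, 0, 0, 1.0, 2.732]
```

These values agree with a direct numerical minimization of (2) for the same penalty.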

Note that the function $\Phi^{(\mathrm{MLOG})}(t,\lambda)=\hat\theta(t)$ is the unique solution to (2) for all $t\in(-\infty,\infty)$. It is a continuous function of $t$. Moreover, $\hat\theta(t)-t\to 0$ as $t\to\infty$. To see this, note that for $|t|\ge\sqrt{\lambda}$,

$\Phi^{(\mathrm{MLOG})}(|t|,\lambda)-|t|=\sqrt{\tfrac{1}{4}(|t|+\sqrt{\lambda})^{2}-\lambda}-\frac{|t|+\sqrt{\lambda}}{2}=\frac{-\lambda}{\sqrt{\tfrac{1}{4}(|t|+\sqrt{\lambda})^{2}-\lambda}+\frac{|t|+\sqrt{\lambda}}{2}}=\frac{-\lambda}{(|t|+\sqrt{\lambda})\Big(\tfrac{1}{2}+\sqrt{\tfrac{1}{4}-\tfrac{\lambda}{(|t|+\sqrt{\lambda})^{2}}}\Big)}=-\frac{\lambda}{|t|+\sqrt{\lambda}}+O\Big(\Big(\frac{\sqrt{\lambda}}{|t|+\sqrt{\lambda}}\Big)^{3}\Big).$

Since the thresholding rule is an odd function, we have

$\Phi^{(\mathrm{MLOG})}(t,\lambda)=t-\frac{\lambda\,\mathrm{sgn}(t)}{|t|+\sqrt{\lambda}}+O\Big(\Big(\frac{\sqrt{\lambda}}{|t|+\sqrt{\lambda}}\Big)^{3}\Big),\quad\text{as }|t|\to\infty.$ (8)

The plots of the modified log penalty function and its thresholding rule are shown in Figure 1.

Figure 1. The plots of the modified log penalty function and the corresponding thresholding rules. (a) The MLOG penalties: MLOG1 is $\lambda=1$, MLOG2 is $\lambda=0.01$ and MLOG3 is $\lambda=4$. (b) The thresholding rules: MLOG1 is $\lambda=1$, MLOG2 is $\lambda=0.01$ and MLOG3 is $\lambda=4$.

4. Simulation studies

In this section, the finite-sample performance of the proposed modified log penalty is compared to those of the Elastic net, the LASSO, and the SCAD penalties under some examples exhibiting multicollinearity.

To obtain the penalized likelihood estimates, the LARS-EN algorithm of [30] is used for the Elastic net while the majorization-minimization algorithm of [16], which is closely related to the local quadratic approximation method of [8], is used for all other penalties. For $j=1,2,\ldots,p$, the covariate $X_j$ is deselected if the estimated coefficient satisfies $|\hat\beta_j|<10^{-6}$. The tuning parameter $\lambda$ is chosen based on the Bayesian information criterion (BIC) of [13,26]. The optimal $\lambda$ value is obtained using a grid-point search over the 100 grid points $\{10^{-5+7l/99},\ l=0,1,\ldots,99\}$. For the SCAD penalty, the tuning parameter $a=3.7$ is used as suggested in [8].
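As an illustration of this tuning step, the sketch below scores each grid value with a BIC-type criterion, built on the `mm_penalized_ls` sketch of Section 2.3 and the MLOG derivative $P'(\theta,\lambda)=\lambda/(\sqrt{\lambda}+\theta)$; the exact criterion of [13,26] may differ in detail, so this is an assumption-laden sketch rather than the authors' code.

```python
import numpy as np

def dP_mlog(theta, lam):
    """Derivative of the MLOG penalty (6) for theta >= 0."""
    return lam / (np.sqrt(lam) + theta)

def select_lambda_bic(X, y):
    """Pick lambda on the grid of Section 4 by a BIC-type criterion (sketch)."""
    n = X.shape[0]
    grid = 10.0 ** (-5.0 + 7.0 * np.arange(100) / 99.0)
    best = None
    for lam in grid:
        beta = mm_penalized_ls(X, y, lam, dP_mlog)     # sketch from Section 2.3
        rss = max(float(np.sum((y - X @ beta) ** 2)), 1e-12)
        df = int(np.sum(beta != 0))
        bic = n * np.log(rss / n) + np.log(n) * df
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best[1], best[2]
```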

In each example, the simulated dataset consists of a training set and a test set. The models are fitted using the training sets and the prediction errors are obtained from the test sets. N=500 replicates are used in the simulation. The sample sizes of the training sets are chosen as n=200, n=400, and n=800. The number of covariates (p) grows with n. The sample sizes of test sets are 100.

The following measures of estimation efficiency, prediction efficiency, and selection consistency are used to compare the performance of the different penalties. Let $\hat\beta$ be an estimate of $\beta$; an illustrative helper computing several of these measures is sketched after the list.

  1. Median of relative model errors (MRME) of [8].

  2. Model error (ME): $\mathrm{ME}(\hat\beta)=(\hat\beta-\beta)^{T}\mathrm{Cov}(X)(\hat\beta-\beta)$.

  3. Relative model error (RME): $\mathrm{ME}(\hat\beta)/\mathrm{ME}(\hat\beta_{LS})$, where $\hat\beta_{LS}$ is the least squares estimator (LS).

  4. Mean squared error (MSE): $\mathrm{MSE}(\hat\beta)=\|Y-X\hat\beta\|^{2}/n$.

  5. Prediction error (PE): $\mathrm{PE}(\hat\beta)=(\hat\beta-\beta)^{T}\mathrm{Cov}(X_{\mathrm{test}})(\hat\beta-\beta)$.

  6. Mean squared prediction error (MPSE): $\mathrm{MPSE}(\hat\beta)=\|Y_{\mathrm{test}}-X_{\mathrm{test}}\hat\beta\|^{2}/n$.

  7. False positives (FP) and false negatives (FN). FP is the mean number of irrelevant covariates misclassified as relevant and FN is the mean number of relevant covariates misclassified as irrelevant. In the simulation examples, the average FP and FN values are reported.

  8. NONZERO: the average number of selected covariates.
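The following is a minimal sketch of these measures, with the sample covariance standing in for $\mathrm{Cov}(X)$ and the $10^{-6}$ deselection threshold of this section; the helper names are illustrative.

```python
import numpy as np

def model_error(beta_hat, beta_true, X):
    """ME / PE: quadratic form of the estimation error in the covariance metric."""
    d = beta_hat - beta_true
    return float(d @ np.cov(X, rowvar=False) @ d)

def mean_squared_error(beta_hat, X, y):
    """MSE / MPSE: average squared residual."""
    return float(np.sum((y - X @ beta_hat) ** 2) / X.shape[0])

def fp_fn(beta_hat, beta_true, tol=1e-6):
    """False positives and false negatives of the selected model."""
    selected = np.abs(beta_hat) >= tol
    relevant = beta_true != 0
    return int(np.sum(selected & ~relevant)), int(np.sum(~selected & relevant))
```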

Example 4.1

In this example, the covariates are strongly correlated but not exactly identical. Suppose that $X=[X^{(1)},X^{(2)},X^{(3)}]$. The number of covariates is $p=p_n=[4n^{1/4}]$. Let $q=[p/3]$. The rows of $X^{(1)}$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(0.5^{|i-j|})_{i,j=1,2,\ldots,q}$. The rows of $X^{(2)}$ and $X^{(3)}$ are generated by

$X^{(2)}=\sqrt{1-\tau^{2}}\,X^{(1)}+\tau Z^{(1)}\quad\text{and}\quad X^{(3)}=Z^{(2)},$

where $\tau=0.1$. The rows of $Z^{(1)}$ and $Z^{(2)}$ are sampled from $N(0,I_q)$ and $N(0,I_{p-2q})$ respectively.

Consider the following two sets of true regression coefficients β(true),

$\text{Scenario (a):}\ \beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_q}_{q},\underbrace{0,0,\ldots,0}_{p-q})^{T},\qquad\text{Scenario (b):}\ \beta^{(\mathrm{true})}=(\underbrace{0,0,\ldots,0}_{p-q},\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_q}_{q})^{T}.$

Here, $\kappa=\{3,-1.9,2.5,-2.2,1.5,3,-1.9,2.5,-2.2,1.5,\ldots\}$, that is, the five numbers 3, −1.9, 2.5, −2.2, 1.5 are repeated in $\kappa$. In Scenario (a), $X^{(1)}$, which is strongly correlated with $X^{(2)}$, is relevant, while in Scenario (b), both $X^{(1)}$ and $X^{(2)}$ are irrelevant.
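For concreteness, the design of this example can be generated as follows; the function name and random seed are illustrative.

```python
import numpy as np

def make_design_ex41(n, rng, tau=0.1):
    """Design matrix of Example 4.1: X(2) is a noisy copy of X(1), X(3) is noise."""
    p = int(4 * n ** 0.25)
    q = p // 3
    Sigma0 = 0.5 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    X1 = rng.multivariate_normal(np.zeros(q), Sigma0, size=n)
    X2 = np.sqrt(1 - tau ** 2) * X1 + tau * rng.normal(size=(n, q))
    X3 = rng.normal(size=(n, p - 2 * q))
    return np.hstack([X1, X2, X3])

X = make_design_ex41(200, np.random.default_rng(1))
print(X.shape)                                          # (200, 15)
print(round(np.corrcoef(X[:, 0], X[:, 5])[0, 1], 3))    # close to 1: X1 vs its copy
```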

Table 1 summarizes the simulation results. First, both MLOG and SCAD perform well in terms of FP in Scenario (b) where the strong pairwise correlations occur only in the irrelevant covariates. On the other hand, the MLOG outperforms the SCAD in general under Scenario (a) where strong pairwise correlations occur between both relevant and irrelevant covariates. The Elastic net and the LASSO perform well in terms of estimation efficiency, but FP is large in general. This means that the Elastic net and the LASSO tend to select more variables than other penalties.

Table 1. The simulation results of Example 4.1. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), median of relative model errors (MRME), prediction error (PE), mean prediction squared error (MPSE), and mean numbers of false positives (FP) and false negatives (FN).

Scenario (n,p,q) Method ME MRME MSE PE MPSE NONZERO FP FN
    LS 0.0741 100.00 0.9498 0.0795 0.5420 14.38 9.38 0.00
  n=200 MLOG 0.0302 40.18 0.9863 0.0314 0.5186 5.39 0.39 0.00
  p=15 ENET 0.0464 62.93 0.9768 0.0482 0.5265 13.69 8.69 0.00
  q=5 LASSO 0.0476 65.38 0.9816 0.0498 0.5271 12.80 7.80 0.00
    SCAD 1.8508 2972.39 2.8072 2.0629 1.5218 6.25 1.56 0.31
    LS 0.0427 100.00 0.9485 0.0462 0.2621 16.02 11.02 0.00
  n=400 MLOG 0.0148 34.75 0.9714 0.0155 0.2533 5.47 0.47 0.00
(a) p=17 ENET 0.0271 63.71 0.9634 0.0286 0.2561 14.83 9.83 0.00
  q=5 LASSO 0.0269 64.20 0.9657 0.0284 0.2557 13.99 8.99 0.00
    SCAD 1.2158 3045.90 2.1819 1.3219 0.5807 6.24 1.36 0.12
    LS 0.0261 100.00 0.9720 0.0275 0.1279 19.48 12.48 0.00
  n=800 MLOG 0.0104 40.47 0.9852 0.0109 0.1258 7.28 0.28 0.00
  p=21 ENET 0.0156 60.72 0.9836 0.0165 0.1269 16.81 9.81 0.00
  q=7 LASSO 0.0158 61.43 0.9839 0.0167 0.1269 16.56 9.56 0.00
    SCAD 0.2829 900.17 1.2566 0.2848 0.1594 7.26 0.29 0.03
    LS 0.0732 100.00 0.9300 0.0798 0.5411 14.98 9.98 0.00
  n=200 MLOG 0.0273 36.50 0.9729 0.0285 0.5128 5.37 0.37 0.00
  p=15 ENET 0.0433 58.84 0.9580 0.0461 0.5229 10.49 5.49 0.00
  q=5 LASSO 0.0436 60.09 0.9662 0.0465 0.5228 8.20 3.20 0.00
    SCAD 0.0509 67.52 0.9651 0.0524 0.5238 5.55 0.55 0.00
    LS 0.0436 100.00 0.9572 0.0461 0.2589 16.70 11.70 0.00
  n=400 MLOG 0.0143 32.12 0.9840 0.0145 0.2519 5.52 0.52 0.00
(b) p=17 ENET 0.0261 59.84 0.9751 0.0268 0.2547 9.97 4.97 0.00
  q=5 LASSO 0.0257 59.60 0.9789 0.0262 0.2549 8.27 3.27 0.00
    SCAD 0.0179 41.61 0.9905 0.0184 0.2528 5.52 0.52 0.00
    LS 0.0262 100.00 0.9711 0.0269 0.1293 20.98 13.98 0.00
  n=800 MLOG 0.0086 33.03 0.9880 0.0087 0.1271 7.24 0.24 0.00
  p=21 ENET 0.0141 54.63 0.9843 0.0142 0.1278 10.51 3.51 0.00
  q=7 LASSO 0.0142 54.97 0.9846 0.0143 0.1278 10.39 3.39 0.00
    SCAD 0.0115 44.20 0.9913 0.0117 0.1274 7.54 0.54 0.00

Example 4.2

In this example, some covariates are exactly identical. Consider three simulation settings. In each simulation setting, the number of covariates $p$ is chosen the same as that in Example 4.1 and the number of identical covariates in the true model is chosen as $p-q$, where $q=[p/2]+1$. Set $X_{q+1}=\cdots=X_p$. Let $X'=[X_1,X_2,\ldots,X_{q+1}]$ be the first $(q+1)$ columns of $X$; the remaining $(p-q-1)$ columns of $X$ are copies of $X_{q+1}$. The rows of $X'$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,q+1$. The $p-q$ nonzero coefficients are chosen from the first $(p-q)$ elements of the sequence $\kappa$ defined in Example 4.1. In this example, both the least squares estimate (LS) and the SCAD result in computational difficulties related to the inverse of a singular matrix, so the performances of these two methods are not reported. Consider the following true regression coefficient vectors $\beta^{(\mathrm{true})}$,

$\text{Scenario (a):}\ \beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{p-q}}_{p-q},\underbrace{0,0,\ldots,0}_{q})^{T},\qquad\text{Scenario (b):}\ \beta^{(\mathrm{true})}=(\underbrace{0,0,\ldots,0}_{q},\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{p-q}}_{p-q})^{T},$
$\text{Scenario (c):}\ \beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{r}}_{r},\underbrace{0,0,\ldots,0}_{q-r},\underbrace{\kappa_{r+1},\kappa_{r+2},\ldots,\kappa_{p-q}}_{p-q-r},\underbrace{0,0,\ldots,0}_{r})^{T},\quad\text{where }r=[(p-q)/2].$

In Scenario (a), all identical covariates are irrelevant. In Scenario (b), all identical covariates are relevant. In Scenario (c), some of identical covariates are relevant while the other identical covariates are irrelevant.

Note that $X_{q+1}=\cdots=X_p$. To avoid the difficulties related to model identification, the mean numbers of false positives, false negatives, and selected covariates are computed based on $\beta'=(\beta_1^{(\mathrm{true})},\ldots,\beta_q^{(\mathrm{true})},\sum_{j=q+1}^{p}\beta_j^{(\mathrm{true})})^{T}$. Let $k$ be the number of non-zeros in $\beta'$. NONZERO refers to the average number of non-zeros in $\hat\beta'$. To show the detailed information about the identical covariates, additional measures are used. Let $I_1$ be the number of selected covariates among the $p-q$ identical covariates $X_{q+1},\ldots,X_p$, $I_2$ be the number of identical covariates removed from the model, and 'prob' be the estimated probability of correctly eliminating all redundant identical covariates. The true values of $k$, $I_1$, and $I_2$ in the three scenarios are shown in Tables 2 and 3.

The simulation results are summarized in Table 4. In Scenario (a), where all identical covariates are irrelevant, the performance of the MLOG penalty is similar to those of the Elastic net and the LASSO in terms of both estimation and prediction efficiency. On the other hand, the MLOG outperforms both the Elastic net and the LASSO in Scenarios (b) and (c), where some or all of the identical covariates are relevant. In terms of 'prob', the MLOG is more likely to eliminate the redundant identical covariates while the Elastic net more often exhibits the so-called grouping effect.

Table 2. The true values of the number of nonzeros ($k$), the number of selected covariates among the identical covariates in the model ($I_1$), and the number of identical covariates that are removed from the model ($I_2$).

  $X_1,X_2,\ldots,X_q$, $X_{q+1}=\cdots=X_p$, $q=[p/2]+1$, $r=[(p-q)/2]$
Scenario (a) $\beta'=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{p-q}}_{p-q},\underbrace{0,0,\ldots,0}_{2q-p+1})^{T}$ $k=p-q$ $I_1=0$ $I_2=p-q$
Scenario (b) $\beta'=(\underbrace{0,0,\ldots,0}_{q},\ \sum_{j=1}^{p-q}\kappa_j)^{T}$ $k=1$ $I_1=1$ $I_2=p-q-1$
Scenario (c) $\beta'=(\underbrace{\kappa_1,\ldots,\kappa_r}_{r},\underbrace{0,\ldots,0}_{q-r},\ \sum_{j=r+1}^{p-q}\kappa_j)^{T}$ $k=r+1$ $I_1=1$ $I_2=p-q-1$

Table 3. The true values of k, I1, and I2 with n=200, n=400 and n=800.

n Scenario (a) Scenario (b) Scenario (c)
200 k=7, I1=0, I2=7 k=1, I1=1, I2=6 k=4, I1=1, I2=6
400 k=8, I1=0, I2=8 k=1, I1=1, I2=7 k=5, I1=1, I2=7
800 k=10, I1=0, I2=10 k=1, I1=1, I2=9 k=6, I1=1, I2=9

Table 4. The simulation results of Example 4.2. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates in the model ($I_1$), and the probability of correctly eliminating the redundant identical covariates (prob).

Scenario (n,p,k) Method ME MSE PE MPSE NONZERO FP FN I1 prob
  n=200 MLOG 0.0334 0.9496 0.0349 1.0301 7.16 0.16 0.00 0.06 0.94
  p=15 ENET 0.0426 0.9425 0.0451 1.0393 8.50 1.50 0.00 2.58 0.63
  k=7 LASSO 0.0422 0.9433 0.0453 1.0392 8.53 1.53 0.00 0.73 0.27
  n=400 MLOG 0.0222 0.9847 0.0224 1.0209 8.17 0.17 0.00 0.14 0.86
(a) p=17 ENET 0.0273 0.9819 0.0276 1.0245 9.34 1.34 0.00 1.79 0.78
  k=8 LASSO 0.0271 0.9824 0.0276 1.0257 9.39 1.39 0.00 0.72 0.29
  n=800 MLOG 0.0130 0.9845 0.0132 0.9995 10.11 0.11 0.00 0.08 0.92
  p=21 ENET 0.0156 0.9830 0.0158 1.0025 11.38 1.38 0.00 0.16 0.98
  k=10 LASSO 0.0161 0.9839 0.0165 1.0041 11.28 1.28 0.00 0.63 0.37
  n=200 MLOG 0.0064 0.9787 0.0065 0.9988 1.51 0.51 0.00 1.00 1.00
  p=15 ENET 2.4730 3.4459 2.4401 3.4150 1.58 0.74 0.16 5.88 0.00
  k=1 LASSO 0.0188 0.9758 0.0191 1.0102 2.66 1.66 0.00 1.00 1.00
  n=400 MLOG 0.0046 0.9973 0.0045 0.9948 1.50 0.50 0.00 1.00 1.00
(b) p=17 ENET 0.3049 1.2999 0.3276 1.3041 1.83 0.84 0.01 7.94 0.00
  k=1 LASSO 0.0114 0.9941 0.0113 1.0014 2.97 1.97 0.00 1.00 1.00
  n=800 MLOG 0.0021 0.9866 0.0020 0.9810 1.42 0.42 0.00 1.00 1.00
  p=21 ENET 0.0049 0.9893 0.0048 0.9831 1.51 0.51 0.00 10.00 0.00
  k=1 LASSO 0.0055 0.9858 0.0054 0.9846 2.50 1.50 0.00 1.00 1.00
  n=200 MLOG 0.0239 0.9589 0.0243 1.0003 4.35 0.35 0.00 1.00 1.00
  p=15 ENET 0.0378 0.9474 0.0387 1.0117 6.75 2.75 0.00 7.00 0.00
  k=4 LASSO 0.0388 0.9465 0.0399 1.0118 7.01 3.01 0.00 1.01 0.99
  n=400 MLOG 0.0129 0.9853 0.0131 1.0376 5.28 0.28 0.00 1.00 1.00
(c) p=17 ENET 0.0228 0.9781 0.0236 1.0489 7.92 2.92 0.00 8.00 0.00
  k=5 LASSO 0.0227 0.9782 0.0235 1.0481 8.07 3.07 0.00 1.00 1.00
  n=800 MLOG 0.0079 0.9884 0.0080 1.0080 6.25 0.25 0.00 1.00 1.00
  p=21 ENET 0.0131 0.9848 0.0133 1.0134 8.78 2.78 0.00 10.00 0.00
  k=6 LASSO 0.0135 0.9843 0.0137 1.0138 9.18 3.18 0.00 1.00 1.00

Example 4.3

This example is the same as Example 4.2 except that the covariance matrix

$\Sigma_0'=\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{pmatrix}$

is used instead of $\Sigma_0$, where

$\Sigma_{11}=(1-\rho)I_r+\rho 1_r1_r^{T},\quad \Sigma_{12}=O_{r\times(q+1-r)},\quad \Sigma_{21}=\Sigma_{12}^{T},\quad \Sigma_{22}=\rho I_{q+1-r}+(1-\rho)1_{q+1-r}1_{q+1-r}^{T}$

with $\rho=0.1$. Unlike Example 4.2, there are both strongly correlated and weakly correlated pairs of covariates in $X'$.

The simulation results are shown in Table 5. Similar to Example 4.2, the MLOG outperforms the LASSO and Elastic net penalty in terms of the probability of eliminating redundant covariates and prediction error. The Elastic net in general exhibits grouping effect.

Table 5. The simulation results of Example 4.3. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates in the model ($I_1$), and the probability of correctly eliminating the redundant identical covariates (prob).

Scenario (n,p,k) Method ME MSE PE MPSE NONZERO FP FN I1 prob
  n=200 MLOG 0.0383 0.9588 0.0403 1.0596 7.05 0.05 0.00 0.03 0.97
  p=15 ENET 0.0453 0.9531 0.0476 1.0648 8.27 1.27 0.00 2.25 0.54
  k=7 LASSO 0.0452 0.9530 0.0477 1.0650 8.36 1.36 0.00 0.67 0.33
  n=400 MLOG 0.0188 0.9725 0.0190 1.0202 8.02 0.02 0.00 0.01 0.99
(a) p=17 ENET 0.0246 0.9700 0.0251 1.0268 9.43 1.43 0.00 3.06 0.49
  k=8 LASSO 0.0228 0.9702 0.0232 1.0235 9.31 1.31 0.00 0.70 0.30
  n=800 MLOG 0.0126 0.9899 0.0127 1.0341 10.02 0.02 0.00 0.02 0.98
  p=21 ENET 0.0153 0.9889 0.0156 1.0361 11.26 1.26 0.00 2.64 0.74
  k=10 LASSO 0.0152 0.9897 0.0155 1.0370 11.12 1.12 0.00 0.55 0.45
  n=200 MLOG 0.0061 0.9707 0.0059 1.0228 1.22 0.22 0.00 1.00 1.00
  p=15 ENET 4.6020 5.5639 4.5730 5.6197 1.24 0.54 0.30 4.93 0.00
  k=1 LASSO 0.0205 0.9613 0.0203 1.0379 3.23 2.23 0.00 1.01 0.99
  n=400 MLOG 0.0031 0.9865 0.0032 0.9974 1.31 0.31 0.00 1.00 1.00
(b) p=17 ENET 5.1911 6.1341 5.3456 6.3323 1.91 1.04 0.13 6.98 0.00
  k=1 LASSO 0.0108 0.9822 0.0110 1.0071 3.13 2.13 0.00 1.01 0.99
  n=800 MLOG 0.0016 0.9978 0.0016 0.9941 1.24 0.24 0.00 1.00 1.00
  p=21 ENET 0.0159 1.0087 0.0143 1.0037 2.05 1.05 0.00 10.00 0.00
  k=1 LASSO 0.0059 0.9952 0.0061 0.9963 3.18 2.18 0.00 1.00 1.00
  n=200 MLOG 0.0299 0.9768 0.0301 1.0326 4.02 0.33 0.31 0.02 0.98
  p=15 ENET 0.0339 0.9648 0.0349 1.0405 6.04 2.14 0.10 6.05 0.00
  k=4 LASSO 0.0310 0.9634 0.0315 1.0347 5.81 1.90 0.10 0.92 0.89
  n=400 MLOG 0.0144 0.9871 0.0147 0.9927 5.04 0.04 0.00 1.00 1.00
(c) p=17 ENET 0.0208 0.9823 0.0216 1.0012 6.81 1.81 0.00 8.00 0.00
  k=5 LASSO 0.0203 0.9846 0.0209 1.0011 6.39 1.39 0.00 1.00 1.00
  n=800 MLOG 0.0077 0.9938 0.0077 0.9937 6.04 0.04 0.00 1.00 1.00
  p=21 ENET 0.0110 0.9914 0.0112 0.9969 7.93 1.93 0.00 10.00 0.00
  k=6 LASSO 0.0114 0.9928 0.0116 0.9975 7.54 1.54 0.00 1.01 0.99

Example 4.4

In this example, the situation where the number of covariates $p$ is much larger than the number of observations $n$ is considered under three simulation settings. The MLOG penalty is compared with the LASSO and the Elastic net. Here, $n=50$, $p=100$, and the same $\beta^{(\mathrm{true})}$ is used for all three simulation settings,

$\beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\ldots,\kappa_{24}}_{24},\underbrace{0,0,\ldots,0}_{26},\underbrace{\kappa_{25},\ldots,\kappa_{49}}_{25},\underbrace{0,0,\ldots,0}_{25})^{T}.$

Here, the 49 nonzero coefficients are chosen as the first 49 elements of the sequence κ introduced in Example 4.1. The simulation settings are as follows,

Model (a): The rows of $X$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,100$;

Model (b): Let $X^{(1)}=[X_1,\ldots,X_{50}]$, $X^{(2)}=[X_{51},\ldots,X_{75}]$, and $X^{(3)}=[X_{76},\ldots,X_{100}]$. The rows of $[X^{(1)},X^{(2)}]$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,75$. $X^{(3)}=\sqrt{1-\tau^{2}}\,X^{(2)}+\tau Z$, where the rows of $Z$ are sampled from $N(0,I_{25})$ and $\tau=0.1$;

Model (c): The rows of $[X_1,\ldots,X_{51}]$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,51$. The remaining covariates are exactly equal to $X_{51}$. That means $X_{51}=\cdots=X_{100}$.

Obviously, there is no strong correlation between the covariates in Model (a). In Model (b), the relevant covariates $X^{(2)}$ are strongly correlated with the irrelevant covariates $X^{(3)}$. In Model (c), some covariates are exactly identical.

The simulation results are shown in Table 6. The MLOG performs the best in all cases while the LASSO and the Elastic net cannot select the true model. In the situation where some covariates are identical, the MLOG outperforms both LASSO and Elastic net penalty in terms of the probability of eliminating redundant covariates and prediction error.

Table 6. The simulation results of Example 4.4. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates in the model ($I_1$), and the probability of correctly eliminating the redundant identical covariates (prob).

Scenario Method ME MSE PE MPSE NONZERO FP FN I1 prob
  MLOG 0.2448 0.0005 1.0077 0.4732 49.29 0.68 0.39    
(a) ENET 3.2564 5.1164 4.0410 6.0826 45.15 18.83 22.68    
  LASSO 5.1726 6.9262 5.2391 7.2317 27.86 10.45 31.61    
  MLOG 0.2491 0.0009 1.1881 0.0632 49.17 0.63 0.46    
(b) ENET 5.9036 7.5343 6.4448 7.8515 57.42 26.43 18.01    
  LASSO 7.1262 7.7524 7.2141 8.4866 35.17 12.98 26.81    
  MLOG 0.2455 0.0017 0.9168 0.9205 25.76 0.82 0.06 48.76 0.88
(c) ENET 3.1427 3.6126 3.9263 3.9810 32.74 11.39 3.65 36.99 0.00
  LASSO 3.6124 4.1524 3.9787 4.0974 31.41 10.62 4.21 48.14 0.65

To summarize, the simulation results in Tables 1–6 suggest that in the absence of multicollinearity, both the SCAD and the MLOG perform well. However, when the covariates are strongly correlated or even identical, the MLOG performs better in terms of prediction error. Figures A.1–A.4 show more detailed simulation results of the above examples.

5. Real data examples

In this section, linear models are used and the proposed MLOG penalty is applied to the diabetes dataset in [6] and the prostate dataset in Tibshirani [25] and She [21,22]. To illustrate the performance in the case where the number of covariates $p$ is much larger than the number of observations $n$, the gene expression dataset of Scheetz et al. [20] is also used. The performance of the MLOG penalty is compared with the Elastic net, the LASSO, and the SCAD based on the number of selected covariates (NONZERO) and the mean squared error. For $j=1,2,\ldots,p$, the standard error of the estimated coefficient $\hat\beta_j$ is computed from the bootstrap samples (see, [7]) as

$se_{\mathrm{boot}}(\hat\beta_j)=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\tilde\beta_{b,j}-\bar{\tilde\beta}_{j})^{2}},$

where $\tilde\beta_{b,j}$ is the estimate of $\beta_j$ at the $b$th bootstrap sample and $\bar{\tilde\beta}_{j}=(1/B)\sum_{b=1}^{B}\tilde\beta_{b,j}$. The size of the bootstrap is chosen as $B=1000$.

The following is used to evaluate the standard error of the estimator $\hat\beta$,

$se_{\mathrm{boot}}(\hat\beta)=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\|\tilde\beta_{b}-\bar{\tilde\beta}\|^{2}},$

where $\tilde\beta_{b}$ is the estimate of $\beta$ at the $b$th bootstrap sample and $\bar{\tilde\beta}=(1/B)\sum_{b=1}^{B}\tilde\beta_{b}$. To validate the penalized estimators, training data and test data are considered. The model is fitted on a random subsample of size 300 and the remaining 142 observations are used as the test data. This procedure is repeated 500 times. The mean prediction squared error (MPSE) and the standard deviation of the prediction squared errors (sd(PSE)) are reported.
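The bootstrap standard errors above can be computed with a generic wrapper such as the sketch below, where `fit` is any estimation routine returning a coefficient vector (for example, the `mm_penalized_ls` sketch of Section 2.3); the wrapper is illustrative rather than the authors' implementation.

```python
import numpy as np

def bootstrap_se(X, y, fit, B=1000, seed=0):
    """Bootstrap standard errors of the coefficients and of the whole estimator."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    betas = np.empty((B, X.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # resample observations with replacement
        betas[b] = fit(X[idx], y[idx])
    centered = betas - betas.mean(axis=0)
    se_coef = np.sqrt(np.sum(centered ** 2, axis=0) / (B - 1))          # per coefficient
    se_estimator = np.sqrt(np.sum(np.sum(centered ** 2, axis=1)) / (B - 1))
    return se_coef, se_estimator
```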

Example 5.1 Diabetes data —

The diabetes dataset contains $n=442$ observations from diabetes patients. There are ten baseline covariates $X_1,\ldots,X_{10}$, namely age, sex, body mass index (bmi), average blood pressure (bp), and six blood serum measurements: s1, s2, s3, s4, s5, s6. The response is a quantitative measure of disease progression one year after the baseline. Before the statistical analysis, the data are standardized so that the means of all variables are zero and the variances are one.

Some covariates are strongly correlated. For example, the pairwise correlation between s1 and s2 is 0.897, between s2 and s4 is 0.66, and between s3 and s4 is −0.738. The sample correlation matrix of the covariates is shown below:

        age     sex     bmi     bp      s1      s2      s3      s4      s5      s6
age    1.000   0.174   0.185   0.335   0.260   0.219  -0.075   0.204   0.271   0.302
sex    0.174   1.000   0.088   0.241   0.035   0.143  -0.379   0.332   0.150   0.208
bmi    0.185   0.088   1.000   0.395   0.250   0.261  -0.367   0.414   0.446   0.389
bp     0.335   0.241   0.395   1.000   0.242   0.186  -0.179   0.258   0.393   0.390
s1     0.260   0.035   0.250   0.242   1.000   0.897   0.052   0.542   0.516   0.326
s2     0.219   0.143   0.261   0.186   0.897   1.000  -0.196   0.660   0.318   0.291
s3    -0.075  -0.379  -0.367  -0.179   0.052  -0.196   1.000  -0.738  -0.399  -0.274
s4     0.204   0.332   0.414   0.258   0.542   0.660  -0.738   1.000   0.618   0.417
s5     0.271   0.150   0.446   0.393   0.516   0.318  -0.399   0.618   1.000   0.465
s6     0.302   0.208   0.389   0.390   0.326   0.291  -0.274   0.417   0.465   1.000

Tables 7 and 8 show the results of estimation and prediction computed from the bootstrap samples. The MLOG outperforms the LASSO, the SCAD, and the Elastic net in terms of both MSE and MPSE. Though the results of estimation and prediction are satisfactory for both the LASSO and the Elastic net, they tend to select more covariates than the MLOG and the SCAD. This is consistent with the results of [5].

Table 7. The bootstrap means and standard deviations of the estimate of the diabetes data.

Covariates LS MLOG ENET LASSO SCAD
age −0.006(0.035) 0.000(0.012) 0.000(0.034) 0.000(0.012) 0.000(0.013)
sex −0.148(0.039) −0.094(0.052) −0.146(0.041) −0.093(0.042) −0.104(0.056)
bmi 0.321(0.041) 0.333(0.037) 0.323(0.042) 0.319(0.045) 0.336(0.050)
bp 0.200(0.038) 0.171(0.041) 0.198(0.039) 0.168(0.041) 0.191(0.057)
s1 −0.489(0.223) −0.011(0.046) −0.347(0.258) −0.028(0.041) −0.021(0.072)
s2 0.294(0.178) 0.000(0.043) 0.181(0.199) 0.000(0.018) 0.000(0.066)
s3 0.062(0.112) −0.136(0.059) −0.000(0.126) −0.129(0.045) −0.136(0.082)
s4 0.109(0.095) 0.000(0.055) 0.091(0.093) 0.000(0.036) 0.000(0.087)
s5 0.464(0.090) 0.297(0.051) 0.411(0.105) 0.296(0.049) 0.313(0.063)
s6 0.042(0.037) 0.000(0.019) 0.041(0.036) 0.019(0.027) 0.000(0.012)

Table 8. The estimation results for diabetes data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).

Method Selected Covariates NONZERO MSE se(estimator) MPSE sd(PSE)
LS age, sex, bmi, bp, s1,…, s6 10 0.4811 0.034 0.5106 0.0455
MLOG sex, bmi, bp, s1, s3, s5 6 0.4832 0.016 0.5118 0.0445
ENET sex, bmi, bp, s1, s2, s4, s5, s6 8 0.4935 0.039 0.5148 0.0452
LASSO sex, bmi, bp, s1, s3, s5,s6 7 0.4909 0.012 0.5165 0.0442
SCAD sex, bmi, bp, s1, s3, s5 6 0.4898 0.019 0.5224 0.0449

Example 5.2 Prostate data —

The prostate dataset has $n=97$ observations and 9 clinical measures. Following She [21], we take log(cancer volume) (lcavol) as the response variable and consider a full quadratic model: the 43 covariates are 8 main effects, 7 squares, and 28 interactions of the eight original variables – lweight, age, lbph, svi, lcp, gleason, pgg45, and lpsa, where svi is binary. To validate the estimation methods, 80 observations are randomly selected for model fitting and the remaining 17 observations are used for testing.

The covariates in the full quadratic model exhibit even stronger correlations than those in Example 5.1. For example, the within-group correlations are very high: >0.98 for the group {lcp, lweight*lcp, age*lcp, gleason*lcp}, and >0.93 for the group {lpsa, lweight*lpsa, age*lpsa, gleason*lpsa}. The results of She [21] suggest that the LASSO does not give stable and accurate solutions in the presence of many highly correlated covariates.

Tables 9 and 10 show the performances of estimation and prediction based on the bootstrap samples. Since there are strongly correlated pairs of covariates, the Elastic net selects many redundant covariates and its performance in terms of prediction, MSE, and MPSE is not as good as that of the MLOG. The MLOG tends to select fewer covariates than the LASSO and the Elastic net.

Table 9. The bootstrap means and standard deviations of the estimate of the prostate data.

Selected MLOG ENET LASSO Selected MLOG ENET LASSO
Covariates (3) (12) (4) Covariates (3) (9) (4)
age   0.0177(0.119)   age*gleason 0.0005(0.021)   0.0008(0.014)
lcp 0.0051(0.017) 0.3059(0.536) 0.0849(0.630) age*pgg45 −0.0001(0.001) −0.0001(0.001)  
gleason   0.0816(1.148)   age*lpsa     0.0002(0.009)
lpsa 0.0079(0.018) 0.1967(0.773) 0.4809(0.773) lbph*svi   0.0931(0.222)  
lweight2   −0.0187(0.321)   lbph*lcp   −0.0051(0.069)  
lbph2   0.0515(0.090)   lbph*pgg45   −0.0014(0.003)  
lcp2   0.0526(0.074) 0.0019(0.056) lbph*lpsa   −0.0189(0.086) −0.0043(0.061)
pgg452   −0.0001(0.010)   svi*pgg45   −0.0038(0.016)  
lpsa2   0.0311(0.108)   lcp*pgg45   0.0032(0.005) 0.0001(0.004)
lweight*lcp   0.0886(0.140) 0.0494(0.147) lcp*lpsa   −0.1362(0.233)  
lweight*lpsa   0.0184(0.155)   gleason*pgg45 0.0001(0.021)    
age*lbph −0.0009(0.010) −0.0007(0.007)   gleason*lpsa   0.0204(0.087)  

Table 10. The estimation results for prostate data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).

Method Selected Covariates NONZERO MSE se(estimator) MPSE sd(PSE)
  lcp, lpsa,          
MLOG age*lbph, age*gleason, 6 0.4906 0.128 0.5355 0.1587
  age*pgg45, gleason*pgg45          
  age, lcp, gleason, lpsa, lweight2, lbph2,          
  pgg452, lpsa2, lweight*lcp, lweight*lpsa,          
ENET age*lbph, age*pgg45, lbph*svi, lbph*lcp, 21 1.8905 0.257 7.7054 22.8274
  lbph*pgg45, lbph*lpsa, svi*pgg45,          
  lcp*pgg45, lcp*lpsa, gleason*lpsa, lcp2          
  lcp, lpsa, lcp2,          
LASSO lweight*lcp, age*gleason, age*lpsa 8 0.5216 0.174 3.2233 13.3297
  lbph*lpsa, lcp*pgg45          

Example 5.3 Gene expression data —

The microarray data of Scheetz et al. [20] contain the expression levels of 200 genes related to the TRIM32 gene, collected from eye tissue samples of 120 rats. In this example, the number of covariates $p$ is much greater than the number of observations $n$. Moreover, some covariates are very strongly correlated. In such a situation, our previous discussions suggest that the LASSO and the SCAD penalties cannot select the true model. To validate the estimation methods, we randomly select 100 observations for model fitting and the remaining 20 observations for testing.

Table 11 summarizes the estimation and prediction results of the gene expression data. The MLOG outperforms all other penalties in variable selection, estimation, and prediction. The Elastic net penalty leads to large biases. For the LASSO and the SCAD penalty, although the biases of the estimates are smaller than those of the Elastic net penalty, they select too many irrelevant covariates and give larger errors than the MLOG in both estimation and prediction.

Table 11. The results for gene expression data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).

Method NONZERO MSE se(estimator) MPSE sd(PSE)
MLOG 16 0.0047 0.0341 0.0139 0.0060
ENET 29 5.8597 0.1127 4.9840 1.3391
LASSO 120 0.7360 0.0414 0.0478 0.0143
SCAD 122 0.1098 0.0816 0.0901 0.0686

6. Conclusion

In this paper, we introduce a new class of strictly concave penalty functions, in particular, the modified log penalty, to improve the performance of prediction under multicollinearity. The proposed penalties exhibit certain nice properties, as described in Section 2, even under the multicollinearity cases. In the weakly correlated cases, these penalties perform as well as the SCAD penalty. In the multicollinearity or highly correlated cases, the proposed penalties tend to select fewer covariates. Real data analysis and simulation studies show that the modified log penalty outperforms the LASSO, the SCAD, and the Elastic net in terms of prediction error in general.

Appendix 1. Proofs

A.1. Technical lemmas

Proposition A.1

$\hat\beta=(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_p)^T$ is a solution to the minimization problem (3) only if the following conditions are satisfied,

$\frac{1}{n}X_j^{T}(Y-X\hat\beta)=P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j),\quad\text{for all }\hat\beta_j\neq 0$ (A1)

and

$\frac{1}{n}\big|X_j^{T}(Y-X\hat\beta)\big|\le P'(0+,\lambda),\quad\text{for all }j=1,\ldots,p.$ (A2)
Proof.

First, we have the following lemma.

Lemma A.2

Let $f(x_1,x_2,\ldots,x_d)$ be a function on $\mathbb{R}^d$. Suppose that $f(x_1,x_2,\ldots,x_d)$ attains its minimum value at $(x_1^0,x_2^0,\ldots,x_d^0)$. Then, the function

$g(x_1,\ldots,x_k)=f(x_1,\ldots,x_k,x_{k+1}^0,\ldots,x_d^0)$

attains its minimum value at $(x_1^0,x_2^0,\ldots,x_k^0)$, $k=1,2,\ldots,d$.

The proof of Lemma A.2 is trivial. Below, the proof of Proposition A.1 is given. Let

$F(\beta)=\frac{1}{2n}\|Y-X\beta\|_2^{2}+\sum_{j=1}^{p}P(|\beta_j|,\lambda)=\frac{1}{2n}\Big\|Y-\sum_{j=1}^{p}X_j\beta_j\Big\|^{2}+\sum_{j=1}^{p}P(|\beta_j|,\lambda).$

For all $j=1,2,\ldots,p$ with $\beta_j\neq 0$, we have

$\frac{\partial F}{\partial\beta_j}=-\frac{1}{n}X_j^{T}(Y-X\beta)+P'(|\beta_j|,\lambda)\,\mathrm{sgn}(\beta_j).$

Let $\hat\beta=(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_p)^T$ be a solution to the minimization problem $\min_\beta F(\beta)$. Define $J_0(\hat\beta)=\{j=1,\ldots,p\,|\,\hat\beta_j\neq 0\}$ and $m=\#J_0(\hat\beta)$. Without loss of generality assume that

$J_0(\hat\beta)=\{1,2,\ldots,m\},\quad\text{i.e.}\quad\hat\beta=(\hat\beta_1,\ldots,\hat\beta_m,0,\ldots,0)^T.$

According to Lemma A.2, $(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_m)$ is a solution to the minimization problem

$\min_{(\beta_1,\beta_2,\ldots,\beta_m)}G(\beta_1,\beta_2,\ldots,\beta_m)=\min_{(\beta_1,\beta_2,\ldots,\beta_m)}F(\beta_1,\beta_2,\ldots,\beta_m,0,\ldots,0).$

Therefore,

$\frac{\partial G}{\partial\beta_j}(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_m)=0,\quad j=1,\ldots,m.$

It is equivalent to

$\frac{\partial F}{\partial\beta_j}(\hat\beta_1,\ldots,\hat\beta_m,0,\ldots,0)=0,\quad j=1,\ldots,m.$

That means $-\frac{1}{n}X_j^{T}(Y-X\hat\beta)+P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j)=0$ and thus

$\frac{1}{n}X_j^{T}(Y-X\hat\beta)=P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j),\quad j=1,\ldots,m.$

Now, consider $j>m$. For all $\alpha\in\mathbb{R}$, let

$\beta_\alpha^{(j)}=(\hat\beta_1,\ldots,\hat\beta_m,0,\ldots,0,\alpha,0,\ldots,0)^T,$

where $\alpha$ is the $j$th element. Since $\hat\beta$ is the global minimizer of $F(\beta)$, we have

$F(\hat\beta)\le F(\beta_\alpha^{(j)}),\quad\forall j>m,\ \alpha\in\mathbb{R}.$

On the other hand, simple algebraic manipulations show that

$F(\beta_\alpha^{(j)})=\frac{1}{2n}\Big\|Y-\sum_{k=1}^{m}X_k\hat\beta_k-X_j\alpha\Big\|^{2}+\sum_{k=1}^{m}P(|\hat\beta_k|,\lambda)+P(|\alpha|,\lambda).$

Therefore,

$F(\beta_\alpha^{(j)})=F(\hat\beta)+\frac{\alpha^{2}}{2n}\|X_j\|^{2}-\frac{1}{n}\alpha X_j^{T}(Y-X\hat\beta)+P(|\alpha|,\lambda).$

Since $F(\hat\beta)\le F(\beta_\alpha^{(j)})$ for all $j>m$ and $\alpha\in\mathbb{R}$, we have

$\frac{\alpha^{2}}{2n}\|X_j\|^{2}-\frac{1}{n}\alpha X_j^{T}(Y-X\hat\beta)+P(|\alpha|,\lambda)\ge 0.$

Choosing $\alpha=\gamma X_j^{T}(Y-X\hat\beta)$, $0<\gamma<1$, we have

$\big(X_j^{T}(Y-X\hat\beta)\big)^{2}\Big[\frac{\gamma^{2}}{2n}\|X_j\|^{2}-\frac{\gamma}{n}\Big]+P\big(|X_j^{T}(Y-X\hat\beta)|\gamma,\lambda\big)\ge 0.$

Let $A=X_j^{T}(Y-X\hat\beta)$. Then,

$A^{2}\Big[\frac{\gamma^{2}}{2n}\|X_j\|^{2}-\frac{\gamma}{n}\Big]+P(|A|\gamma,\lambda)\ge 0.$

It is equivalent to

$\frac{A^{2}}{n}\Big[\gamma-\frac{\gamma^{2}}{2}\|X_j\|^{2}\Big]\le P(|A|\gamma,\lambda).$

Choose $\gamma\in(0,1)$ sufficiently small such that $\gamma-\gamma^{2}\|X_j\|^{2}/2>0$. Then,

$\frac{A^{2}}{n}\le\frac{P(|A|\gamma,\lambda)}{|A|\gamma}\cdot\frac{|A|}{1-\frac{\gamma}{2}\|X_j\|^{2}}.$ (A3)

The condition (A3) holds for any small $\gamma$. Taking $\gamma\to 0$, we have

$\frac{A^{2}}{n}\le|A|\,P'(0+,\lambda)\quad\Longrightarrow\quad\frac{1}{n}|A|\le P'(0+,\lambda).$

Therefore, $(1/n)|X_j^{T}(Y-X\hat\beta)|\le P'(0+,\lambda)$, $\forall j>m$. Since $P(\cdot,\lambda)$ is a strictly concave penalty and the derivative $P'(\cdot,\lambda)$ is non-increasing on $[0,\infty)$, $P'(u,\lambda)\le P'(0+,\lambda)$, $\forall u\in[0,\infty)$. This completes the proof.

A.2. Proof of Theorem 2.3

Let $J=\{j\in\overline{1,p}\,|\,\hat\beta_j\neq 0\}$ and $U(\delta)=X-\delta Z$. Denote the number of components of $J$ by $h$. Obviously, the system of column vectors $\{U_j(\delta)\,|\,j\in J\}$ is linearly independent if $h=1$.

Consider $h=q+1$, $q>0$. By contradiction, assume that the system of column vectors $\{U_j(\delta)\,|\,j\in J\}$ is linearly dependent. Without loss of generality, assume that

$J=\{1,2,\ldots,h\}.$

Since $\{U_j(\delta)\,|\,j\in\overline{1,h}\}$ is linearly dependent and $\hat\beta_j\neq 0$, $j=1,\ldots,h$, the system of vectors $\{\hat\beta_jU_j(\delta)\,|\,j=1,\ldots,h\}$ is also linearly dependent. Then, there exist real values $\gamma_1,\ldots,\gamma_h$, not all zero, such that

$\sum_{j=1}^{q+1}\gamma_j\hat\beta_jU_j(\delta)=0.$

Without loss of generality assume that

$|\gamma_h|=\max_{1\le j\le h}|\gamma_j|.$

Define $\alpha_j=-\gamma_j/\gamma_h$, $j=1,\ldots,h$. We get $|\alpha_j|\le 1$, $j=1,\ldots,h$, and

$\hat\beta_hU_h(\delta)=\sum_{j=1}^{q}\alpha_j\hat\beta_jU_j(\delta).$ (A4)

Since $\hat\beta$ is the solution and $\hat\beta_j\neq 0$, $j=1,\ldots,h$, Proposition A.1 suggests that

$\frac{1}{n}X_j^{T}(Y-X\hat\beta)=P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j),\quad j\in\overline{1,h}.$

From (A4), we have

$\hat\beta_h\frac{1}{n}U_h(\delta)^{T}(Y-X\hat\beta)=\sum_{j=1}^{q}\alpha_j\frac{1}{n}\hat\beta_jU_j(\delta)^{T}(Y-X\hat\beta).$

Then,

$\hat\beta_h\frac{1}{n}(X_h-\delta Z_h)^{T}(Y-X\hat\beta)=\sum_{j=1}^{q}\alpha_j\frac{1}{n}\hat\beta_j(X_j-\delta Z_j)^{T}(Y-X\hat\beta).$

Therefore,

$|\hat\beta_h|P'(|\hat\beta_h|,\lambda)=\sum_{j=1}^{q}\alpha_j|\hat\beta_j|P'(|\hat\beta_j|,\lambda)+\frac{\delta}{n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta).$ (A5)

From (A4), $U_h(\delta)=\sum_{j=1}^{q}U_j(\delta)\alpha_j(\hat\beta_j/\hat\beta_h)$. For any $\tau>1$, define

$\tilde\beta_j(\tau)=\begin{cases}\hat\beta_j\Big(1-\dfrac{\alpha_j}{\tau}\Big), & j=1,\ldots,q,\\[4pt] \hat\beta_h\Big(1+\dfrac{1}{\tau}\Big), & j=h,\\[4pt] 0, & j>h.\end{cases}$

We have

$U(\delta)\tilde\beta=\sum_{j=1}^{q}U_j(\delta)\tilde\beta_j(\tau)+U_h(\delta)\tilde\beta_h(\tau)=\sum_{j=1}^{q}U_j(\delta)\hat\beta_j\Big(1-\frac{\alpha_j}{\tau}\Big)+\sum_{j=1}^{q}U_j(\delta)\alpha_j\hat\beta_j\Big(1+\frac{1}{\tau}\Big)=\sum_{j=1}^{q}U_j(\delta)\hat\beta_j(1+\alpha_j)=U(\delta)\hat\beta.$

Since $\tau>1$ by assumption, we have $1-\alpha_j/\tau>0$, $j=1,\ldots,q$, so that $|\tilde\beta_j(\tau)|=|\hat\beta_j|(1-\alpha_j/\tau)$ and $|\tilde\beta_h(\tau)|=|\hat\beta_h|(1+1/\tau)$. Consider

$F(\hat\beta)=\frac{1}{2n}\|Y-X\hat\beta\|^{2}+\sum_{j=1}^{q}P(|\hat\beta_j|,\lambda)+P(|\hat\beta_h|,\lambda)$

and

$F(\tilde\beta)=\frac{1}{2n}\|Y-X\tilde\beta\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=\frac{1}{2n}\|Y-X\hat\beta+\delta Z(\hat\beta-\tilde\beta)\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=\frac{1}{2n}\|Y-X\hat\beta\|^{2}+F^{(\delta)}(\tau),$

where

$F^{(\delta)}(\tau)=\frac{\delta}{n}(\hat\beta-\tilde\beta)^{T}Z^{T}(Y-X\hat\beta)+\frac{\delta^{2}}{2n}\|Z(\hat\beta-\tilde\beta)\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=-\frac{\delta}{n\tau}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)+\frac{\delta^{2}}{2n\tau^{2}}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big).$

To obtain a contradiction and complete the proof, we need to show that $F(\tilde\beta)<F(\hat\beta)$. We have

$\frac{d}{d\tau}F^{(\delta)}(\tau)=\frac{\delta}{\tau^{2}n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)-\frac{\delta^{2}}{\tau^{3}n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\frac{1}{\tau^{2}}\sum_{j=1}^{q}\alpha_j|\hat\beta_j|P'\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)-\frac{|\hat\beta_h|}{\tau^{2}}P'\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=\frac{1}{\tau^{2}}G^{(\delta)}(\tau),$

where

$G^{(\delta)}(\tau)=\frac{\delta}{n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)-\frac{\delta^{2}}{\tau n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\sum_{j=1}^{q}\alpha_j|\hat\beta_j|P'\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)-|\hat\beta_h|P'\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)$

and

$\frac{d}{d\tau}G^{(\delta)}(\tau)=\frac{\delta^{2}}{\tau^{2}n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\sum_{j=1}^{q}\frac{\alpha_j^{2}}{\tau^{2}}\hat\beta_j^{2}P''\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+\frac{\hat\beta_h^{2}}{\tau^{2}}P''\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big).$

Let $u=\min_{1\le j\le h}|\hat\beta_j|$, $v=\max_{1\le j\le h}|\hat\beta_j|$, and $M=\max_{0\le\theta\le 2v}P''(\theta,\lambda)$. Since $P(\cdot,\lambda)$ is a strictly concave penalty, we have $M<0$. Noting that $|\alpha_j|\le 1$, $j=1,\ldots,q$, we have

$\frac{d}{d\tau}G^{(\delta)}(\tau)<\frac{1}{\tau^{2}}\Big(Mu^{2}\Big(1+\sum_{j=1}^{q}\alpha_j^{2}\Big)+\frac{\delta^{2}}{n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}\Big).$

If $\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j=0$, choose $\varkappa>0$ arbitrarily. Otherwise, choose

$\varkappa=\frac{u\sqrt{-nM}}{\big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\big\|}.$

For all $\delta\in[0,\varkappa]$, we have

$\frac{d}{d\tau}G^{(\delta)}(\tau)<\frac{1}{\tau^{2}}Mu^{2}\sum_{j=1}^{q}\alpha_j^{2}\le 0.$

Then, the function $G^{(\delta)}(\tau)$ is strictly decreasing on $(1,\infty)$. Therefore,

$G^{(\delta)}(\tau)>\lim_{\tau\to\infty}G^{(\delta)}(\tau),\quad\forall\tau>1.$

However,

$\lim_{\tau\to\infty}G^{(\delta)}(\tau)=\frac{\delta}{n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)+\sum_{j=1}^{q}\alpha_j|\hat\beta_j|\,P'(|\hat\beta_j|,\lambda)-|\hat\beta_h|\,P'(|\hat\beta_h|,\lambda).$

From (A5), we have $\lim_{\tau\to\infty}G^{(\delta)}(\tau)=0$ and hence $G^{(\delta)}(\tau)>0$, $\forall\tau>1$. Therefore,

$\frac{d}{d\tau}F^{(\delta)}(\tau)=\frac{1}{\tau^{2}}G^{(\delta)}(\tau)>0,\quad\forall\tau>1.$

That means the function $F^{(\delta)}(\tau)$ is strictly increasing on $(1,\infty)$. Therefore, $F^{(\delta)}(\tau)<\lim_{\tau\to\infty}F^{(\delta)}(\tau)$, $\forall\tau>1$. It is easy to see that

$\lim_{\tau\to\infty}F^{(\delta)}(\tau)=\sum_{j=1}^{q}P(|\hat\beta_j|,\lambda)+P(|\hat\beta_h|,\lambda).$

Therefore,

$F^{(\delta)}(\tau)<\sum_{j=1}^{q}P(|\hat\beta_j|,\lambda)+P(|\hat\beta_h|,\lambda).$

That means $F(\tilde\beta)<F(\hat\beta)$. This completes the proof.

A.3. Proof of Proposition 2.4

Result (a) is a direct consequence of Theorem 2.3. The proof of (b) is given in the following. Let

$G(u)=\frac{1}{2n}\|Y-X'u\|^{2}+\sum_{j=1}^{m}P(|u_j|,\lambda),\quad u=(u_1,\ldots,u_m)^{T}\in\mathbb{R}^{m}.$

We have

$F(\hat\beta)=\frac{1}{2n}\|Y-X\hat\beta\|^{2}+\sum_{j=1}^{p}P(|\hat\beta_j|,\lambda)=\frac{1}{2n}\|Y-X'\hat\beta'\|^{2}+\sum_{j=1}^{m}P(|\hat\beta_j|,\lambda)=G(\hat\beta')\ge\min_{u\in\mathbb{R}^{m}}G(u).$ (A6)

On the other hand, for all $u=(u_1,\ldots,u_m)^{T}\in\mathbb{R}^{m}$, let $\tilde u=(u^{T},0,\ldots,0)^{T}\in\mathbb{R}^{p}$. We have

$F(\tilde u)=\frac{1}{2n}\|Y-X'u\|^{2}+\sum_{j=1}^{m}P(|u_j|,\lambda)=G(u).$

Then, $G(u)=F(\tilde u)\ge\min_{\beta\in\mathbb{R}^{p}}F(\beta)=F(\hat\beta)$, $\forall u\in\mathbb{R}^{m}$. Therefore

$\min_{u\in\mathbb{R}^{m}}G(u)\ge F(\hat\beta).$ (A7)

From (A6) and (A7), we get $\min_{u\in\mathbb{R}^{m}}G(u)=F(\hat\beta)=G(\hat\beta')$.

A.4. Proof of Proposition 2.7

Since $X^{T}X+n\Omega$ is a non-negative definite matrix, it is invertible if and only if it is positive definite. That means

$u^{T}(X^{T}X+n\Omega)u>0,\quad\forall u=(u_1,\ldots,u_p)^{T}\neq 0.$ (A8)

We have

$u^{T}(X^{T}X+n\Omega)u=(Xu)^{T}(Xu)+n\sum_{j=1}^{p}u_j^{2}\omega_j=\|Xu\|^{2}+n\sum_{j\in J_2}u_j^{2}\omega_j.$

Then,

$u^{T}(X^{T}X+n\Omega)u=0\iff\begin{cases}Xu=0,\\ u_j=0,\ \forall j\in J_2\end{cases}\iff\begin{cases}\sum_{j\in J_1}u_jX_j=0,\\ u_j=0,\ \forall j\in J_2.\end{cases}$

This completes the proof.

Appendix 2. Figures.

Figure A.1. Simulation results of Example 4.1 – the mean number of false positives (FP) and false negatives (FN). Panels (a), (c), (e) show the results of Scenario (a). Panels (b), (d), (f) show the results of Scenario (b).

Figure A.2. Simulation results of Example 4.2 – the mean number of false positives (FP) and false negatives (FN). Panels (a), (b) and (c) show the results of Scenario (a). Panels (d), (e) and (f) show the results of Scenario (b). Panels (g), (h) and (i) show the results of Scenario (c).

Figure A.3. Simulation results of Example 4.3 – the mean number of false positives (FP) and false negatives (FN). Panels (a), (b) and (c) show the results of Scenario (a). Panels (d), (e) and (f) show the results of Scenario (b). Panels (g), (h) and (i) show the results of Scenario (c).

Figure A.4. Simulation results of Example 4.4 – the mean number of false positives (FP) and false negatives (FN). Panel (a) shows the results of Model (a). Panel (b) shows the results of Model (b). Panel (c) shows the results of Model (c).

Funding Statement

Chi Tim Ng's work is supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2017R1C1B2011652).

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • 1. Antoniadis A. and Fan J., Regularization of wavelet approximations, J. Am. Stat. Assoc. 96 (2001), pp. 939–967. doi: 10.1198/016214501753208942
  • 2. Breiman L., Heuristics of instability and stabilization in model selection, Ann. Statist. 24 (1996), pp. 2350–2383. doi: 10.1214/aos/1032181158
  • 3. Chatterjee S. and Hadi A.S., Regression Analysis by Example, 5th ed., John Wiley & Sons, Inc., Hoboken, New Jersey, 2012, 424p.
  • 4. Chong I.-G. and Jun C.-H., Performance of some variable selection methods when multicollinearity is present, Chemometr. Intell. Lab. Syst. 78 (2005), pp. 103–112. doi: 10.1016/j.chemolab.2004.12.011
  • 5. Dalalyan A., Hebiri M., and Lederer J., On the prediction performance of the LASSO, Bernoulli 23 (2017), pp. 552–581. doi: 10.3150/15-BEJ756
  • 6. Efron B., Hastie T., Johnstone I., and Tibshirani R., Least angle regression, Ann. Statist. 32 (2004), pp. 407–499. doi: 10.1214/009053604000000067
  • 7. Efron B. and Tibshirani R.J., An Introduction to the Bootstrap, 1st ed., Chapman & Hall, New York, 1993, 456p.
  • 8. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273
  • 9. Fan J. and Lv J., Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B 70 (2008), pp. 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
  • 10. Fan J. and Lv J., A selective overview of variable selection in high dimensional feature space, Stat. Sin. 20 (2010), pp. 101–148.
  • 11. Fan J. and Lv J., Nonconcave penalized likelihood with NP-dimensionality, IEEE Trans. Inform. Theory 57 (2011), pp. 5467–5484. doi: 10.1109/TIT.2011.2158486
  • 12. Fan J. and Peng H., Nonconcave penalized likelihood with a diverging number of parameters, Ann. Statist. 32 (2004), pp. 928–961. doi: 10.1214/009053604000000256
  • 13. Fan Y. and Tang C.Y., Tuning parameter selection in high dimensional penalized likelihood, J. R. Stat. Soc. Ser. B 75 (2013), pp. 531–552. doi: 10.1111/rssb.12001
  • 14. Fitrianto A. and Lee C.Y., Performance of Ridge regression estimator methods on small sample size by varying correlation coefficients: A simulation study, J. Math. Statist. 10 (2014), pp. 25–29. doi: 10.3844/jmssp.2014.25.29
  • 15. Hoerl A.E. and Kennard R.W., Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970), pp. 55–67. doi: 10.1080/00401706.1970.10488634
  • 16. Hunter D.R. and Li R., Variable selection using MM algorithms, Ann. Statist. 33 (2005), pp. 1617–1642. doi: 10.1214/009053605000000200
  • 17. Jolliffe I.T., A note on the use of principal components in regression, Appl. Stat. 31 (1982), pp. 300–303. doi: 10.2307/2348005
  • 18. Konno H. and Takaya Y., Multi-step methods for choosing the best set of variables in regression analysis, Comput. Optim. Appl. 46 (2010), pp. 417–426. doi: 10.1007/s10589-008-9193-6
  • 19. Ng C.T., Oh S., and Lee Y., Going beyond oracle property: Selection consistency and uniqueness of local solution of the generalized linear model, Stat. Methodol. 32 (2016), pp. 147–160. doi: 10.1016/j.stamet.2016.05.006
  • 20. Scheetz T., Kim K., Swiderski R., Philp A., Braun T., Knudtson K., Dorrance A., DiBona G., Huang J., and Casavant T., Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proc. Natl. Acad. Sci. 103 (2006), pp. 14429–14434. doi: 10.1073/pnas.0602562103
  • 21. She Y., Thresholding-based iterative selection procedures for model selection and shrinkage, Electron. J. Stat. 3 (2009), pp. 384–415. doi: 10.1214/08-EJS348
  • 22. She Y., An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors, Comput. Stat. Data. Anal. 56 (2012), pp. 2976–2990. doi: 10.1016/j.csda.2011.11.013
  • 23. Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., and Matsui T., Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor, Optimization Online (2016). Available at http://www.optimization-online.org/DB_HTML/2016/09/5655.html
  • 24. Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., and Matsui T., Best subset selection for eliminating multicollinearity, J. Oper. Res. Soc. Japan 60 (2017), pp. 321–336.
  • 25. Tibshirani R., Regression shrinkage and selection via the LASSO, J. R. Stat. Soc. Ser. B 58 (1996), pp. 267–288.
  • 26. Wang H., Li R., and Tsai C.L., Tuning parameter selectors for the smoothly clipped absolute deviation method, Biometrika 94 (2007), pp. 553–568. doi: 10.1093/biomet/asm053
  • 27. Wold S., Ruhe A., Wold H., and Dunn III W.J., The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM J. Sci. Stat. Comput. 5 (1984), pp. 735–743. doi: 10.1137/0905052
  • 28. Zhang C.-H., Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729
  • 29. Zou H., The adaptive LASSO and its oracle properties, J. Am. Stat. Assoc. 101 (2006), pp. 1418–1429. doi: 10.1198/016214506000000735
  • 30. Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B 67 (2005), pp. 301–320. doi: 10.1111/j.1467-9868.2005.00503.x
