Journal of Applied Statistics
2019 Jul 3;47(2):201–230. doi: 10.1080/02664763.2019.1637829

Variable selection under multicollinearity using modified log penalty

Van Cuong Nguyen, Chi Tim Ng
PMCID: PMC9041714  PMID: 35706515

ABSTRACT

To handle the multicollinearity issues in the regression analysis, a class of ‘strictly concave penalty function’ is described in this paper. As an example, a new penalty function called ‘modified log penalty’ is introduced. The penalized estimator based on strictly concave penalties enjoys the oracle property under certain regularity conditions discussed in the literature. In the multicollinearity cases where such conditions are not applicable, the behaviors of the strictly concave penalties are discussed through examples involving strongly correlated covariates. Real data examples and simulation studies are provided to show the finite-sample performance of the modified log penalty in terms of prediction error under scenarios exhibiting multicollinearity.

KEYWORDS: Grouping effect, modified log penalty, multicollinearity, penalized regression, strictly concave penalty function

1. Introduction

In regression analysis, multicollinearity occurs when two or more covariates are strongly correlated; see [3,24] for general discussions of the multicollinearity issues. Multicollinearity leads to computational difficulties related to the inversion of a nearly singular matrix and results in low efficiency in model estimation and prediction. To eliminate the multicollinearity, it is of paramount importance to select a parsimonious model that excludes redundant covariates that can be predicted from other covariates.

In the literature, several approaches have been proposed to overcome the difficulties of multicollinearity. One approach is the best subset selection method proposed in [18,23,24]. The model selection problem is reformulated as a constrained integer quadratic programming problem involving indicators of multicollinearity, such as the condition number of the correlation matrix and the variance inflation factor. However, solving such integer quadratic programming problems can be computationally intensive. Another approach is the partial least squares regression method discussed in [4,17,27]. The idea underlying this approach is to reduce the correlations in the covariates by means of orthogonal transformations.

Over the past two decades, penalized regression methods have been widely studied for the purpose of variable selection, to name a few, [8,10,15,19,25,28–30]. The idea is to use a penalty function that is non-differentiable at zero to shrink small regression coefficients towards zero. Parsimony and the grouping effect are two important criteria for evaluating the performance of a penalty function. These two criteria can conflict with each other in multicollinearity cases. The grouping effect (see, [30]) means that strongly correlated covariates tend to be selected or deselected together. Parsimony can be described through the ability of a variable selection method to recover the so-called 'true subset' that is relevant to the response. For example, the idea of the oracle property described in [8,10,12] has been widely used for such a purpose. However, the definition of the true subset can be ambiguous in the multicollinearity cases where some covariates can be predicted from other covariates. Consider the example where two covariates $X_1$ and $X_2$ are identical and $X_1$ is relevant to the response $Y$. In such a situation, the models $E(Y)=2X_1$ and $E(Y)=X_1+X_2$ are equivalent. The first one is more parsimonious. On the other hand, the grouping effect requires that both $X_1$ and $X_2$ be selected. In certain applications such as microarray data analysis, the grouping effect is considered to be a desirable property. However, in applications where prediction is the main goal, the situation can be different because the parsimonious model with the redundant covariates removed tends to give a smaller prediction error.

There is a lack of literature discussing the penalty functions that achieve parsimony in the variable selection problem under the presence of multicollinearity. The Elastic net penalty in [30] is designed to achieve grouping effect in the multicollinearity cases. The Ridge penalty (see, [14,15]), the LASSO penalty (see, [25]), and the Elastic net penalty (see, [30]) do not guarantee the oracle properties in [8,10,11]. Under some regularity conditions on the minimum singular value of the design matrix, the non-concave penalty functions, the SCAD (see, [8,9]) and the MCP (see, [28]), lead to approximately unbiased estimates and guarantee the oracle properties. However, such regularity conditions cannot cover the multicollinearity cases with strong correlations in the covariates.

The aim of this paper is to introduce a new class of strictly concave penalty functions that achieve parsimony even in the multicollinearity cases. It is illustrated that in situations without multicollinearity, for example, fulfilling the regularity conditions in [10], these penalties perform as well as the SCAD penalty in terms of estimation error, prediction error, mean number of false positives, and mean number of false negatives. In the cases where some covariates are identical, at most one among these identical covariates is selected. This means that the redundant covariates can be removed automatically from the model. Moreover, the local quadratic approximation method or majorization-minimization algorithm (MM-algorithm) proposed in [8] and [16] can be used to obtain the estimates. As an example of a 'strictly concave penalty function', a new penalty function called the 'modified log penalty' is introduced.

The paper is organized as follows. The strictly concave penalized likelihood estimator and its properties are discussed in Section 2. The modified log penalty is introduced in Section 3. The simulation studies are given in Section 4 to compare the finite-sample performances of the proposed penalty and other penalties, including the Elastic net, the LASSO, and the SCAD. Some real data examples are given in Section 5. The concluding remarks are presented in Section 6.

2. Penalized linear regression with strictly concave penalty

In this section, the strictly concave penalties are introduced to enhance parsimonious model selection in multicollinearity cases. By contrast, the Elastic net penalty of [30] is strictly convex and exhibits the grouping effect.

2.1. The strictly concave penalized likelihood estimator

Consider the linear regression model:

Y=Xβ+ε, (1)

where $Y$ is the $n\times 1$ response vector, $X$ is the $n\times p$ design matrix, $\beta=(\beta_1,\beta_2,\ldots,\beta_p)^T$ is the vector of unknown parameters, $\varepsilon=(\varepsilon_1,\ldots,\varepsilon_n)^T$ is the model error, and $\varepsilon_i$, $i=1,2,\ldots,n$, are independent $N(0,\sigma^2)$ random variables. The strictly concave penalty function is defined below.

Definition 2.1 The strictly concave penalty function —

Let $\lambda>0$ be a tuning parameter. A function $P(\cdot,\lambda)$ is called a strictly concave penalty function if the following conditions are satisfied,

  1. $P(\cdot,\lambda)$ has a continuous second order derivative on $[0,\infty)$,

  2. $P'(\theta,\lambda)\to 0$ as $\theta\to+\infty$,

  3. $-1<P''(\theta,\lambda)<0$ for all $\theta>0$, and

  4. $P(0,\lambda)=0$.

A strictly concave penalty function is a non-concave penalty function described in [8,10]. The seemingly confusing use of the terms can be resolved by noting that ‘strictly concave’ here refers to the domain [0,) while ‘non-concave’ in [8,10] refers to the domain (,). It can be checked that the SCAD penalty, the MCP penalty, and Lr penalties of [1] are non-concave but not strictly concave.

Consider the following penalized least squares problem: To minimize

$\ell(\theta)=\frac{1}{2}(\theta-t)^{2}+P(|\theta|,\lambda)$ (2)

with respect to $\theta$, where $t$ is the observed signal and $\theta$ is the unknown. It is suggested in [1] that the conditions in Definition 2.1 guarantee the existence and uniqueness of $\hat\theta(t)$, the solution to the optimization problem (2). Moreover, $\hat\theta(t)$ is a continuous function of $t$ and $\hat\theta(t)-t\to 0$ as $t\to\infty$. These conditions are necessary for reducing model complexity and model bias in prediction (see, [2,10]).
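As a quick numerical illustration of problem (2), the minimizer $\hat\theta(t)$ can be approximated by a direct search over a fine grid for any candidate penalty. The helper below is only a sketch under that assumption; the function name `scalar_threshold` and the example penalty parameterization are illustrative, not taken from the paper.

```python
import numpy as np

def scalar_threshold(t, penalty, lam, half_width=None, num=200001):
    """Approximate the minimizer of (2): 0.5*(theta - t)^2 + P(|theta|, lam).

    `penalty` is any callable P(theta, lam) defined for theta >= 0; the search
    is carried out over a fine symmetric grid around zero.
    """
    w = half_width if half_width is not None else abs(t) + 10.0 * np.sqrt(lam) + 1.0
    theta = np.linspace(-w, w, num)
    objective = 0.5 * (theta - t) ** 2 + penalty(np.abs(theta), lam)
    return theta[np.argmin(objective)]

# Illustration with a log-type penalty of the kind introduced in Section 3:
P = lambda th, lam: lam * np.log(1.0 + th / np.sqrt(lam))
print([round(scalar_threshold(t, P, 1.0), 3) for t in (0.5, 1.5, 5.0, 50.0)])
```

For large $t$ the output approaches $t$, in line with the requirement $\hat\theta(t)-t\to 0$.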

Suppose that the data set has $n$ observations and $p$ covariates. Let $Y=(y_1,\ldots,y_n)^T$ be the response and $X=[X_1,\ldots,X_p]$ be the design matrix, where $X_j=(x_{1j},\ldots,x_{nj})^T$, $j=1,\ldots,p$, are the covariates. The strictly concave penalized likelihood estimator is defined as follows.

Definition 2.2 The strictly concave penalized likelihood estimator —

Let P(,λ) be a strictly concave penalty. For any fixed non-negative λ, the strictly concave penalized likelihood estimator of β in Model (1) is defined as

$\hat\beta(\lambda)=\arg\min_{\beta}\Big\{\frac{1}{2n}\|Y-X\beta\|^{2}+\sum_{j=1}^{p}P(|\beta_j|,\lambda)\Big\}.$ (3)

For simplicity, if no confusion is caused, we write $\hat\beta$ instead of $\hat\beta(\lambda)$. Since a strictly concave penalty is also a nonconcave penalty, the majorization-minimization algorithm of [16] can be applied to obtain the penalized least squares estimator (3). Similar to the SCAD penalty, if the design matrix $X$ and the model error $\varepsilon$ satisfy all regularity conditions described in [11], the penalized likelihood estimator $\hat\beta(\lambda)$ always exists and fulfills the so-called oracle properties. Such a property is not guaranteed for the LASSO and the Elastic net.

2.2. Parsimonious variable selection in the multicollinearity case

In this subsection, the properties of the strictly concave penalized estimator are discussed under general multicollinearity cases. In such situations, the regularity conditions in [11] can be violated and the penalized estimation methods based on the SCAD penalty of [8] and the MCP penalty of [28] are not guaranteed to select the true model.

To illustrate the ideas, consider the following simple example. Suppose that $Y=\beta_1X_1+\beta_2X_2+\varepsilon$ and $X_1=X_2$. Since both the SCAD penalty and the MCP penalty are constant beyond some point, the local solution $\hat\beta=(X_1^TY/X_1^TX_1-K,\,K)$ always gives the same penalized likelihood value when $K$ is smaller than some critical value. This means that along the direction $(1,-1)$, the penalized likelihood is flat, which creates difficulties in the numerical optimization of the penalized likelihood. If the LASSO penalty is used and $X_1^TY/X_1^TX_1$ is positive, similar difficulties occur because the penalized likelihood is constant for sufficiently small $K$. The situation is very different if a strictly concave penalty is used instead because such penalties are no longer flat far away from zero.
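The flat direction can be checked numerically. The snippet below is a sketch with hypothetical values of $X_1^TY/X_1^TX_1$, $K$, and $\lambda$: it evaluates the penalty part of the objective along $\beta=(c-K,K)$ for the standard SCAD formula (with $a=3.7$) and for the modified log penalty introduced later in Section 3; the fitted part does not depend on $K$ because $X_1=X_2$.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty function (Fan and Li, 2001)."""
    theta = np.abs(theta)
    mid = (2 * a * lam * theta - theta ** 2 - lam ** 2) / (2 * (a - 1))
    return np.where(theta <= lam, lam * theta,
                    np.where(theta <= a * lam, mid, (a + 1) * lam ** 2 / 2))

def mlog_penalty(theta, lam):
    """Modified log penalty, in the parameterization of Section 3."""
    return lam * np.log(1.0 + np.abs(theta) / np.sqrt(lam))

c, lam = 10.0, 1.0          # hypothetical X1'Y / X1'X1 and tuning parameter
for K in (4.0, 4.5, 5.0):   # both coordinates stay beyond the SCAD flat point
    scad_val = float(scad_penalty(c - K, lam) + scad_penalty(K, lam))
    mlog_val = float(mlog_penalty(c - K, lam) + mlog_penalty(K, lam))
    print(K, round(scad_val, 4), round(mlog_val, 4))  # SCAD column is constant
```

The SCAD column stays constant as $K$ moves, whereas the MLOG column changes, illustrating why the strictly concave penalty does not leave a flat direction.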

To describe the general multicollinearity cases, suppose that the design matrix $X$ is generated by perturbing a non-full rank matrix $U$ with a small quantity $\delta Z$. Here, the dimensions of both $U$ and $Z$ are the same as those of $X$. Parsimony requires no linear dependence between the columns of $U$ corresponding to the selected covariates. Detailed results are given in the following theorem.

Theorem 2.3

For any integers $n,p>0$ and $n\times p$ matrices $U=[U_1,\ldots,U_p]$ and $Z=[Z_1,\ldots,Z_p]$, there exists a positive constant $\varkappa$ (depending on $X$ and $Z$) such that the system of column vectors corresponding to the chosen covariates $\{U_j=X_j-\delta Z_j\,|\,\hat\beta_j\neq 0,\ j=1,\ldots,p\}$ is linearly independent for all $0\le\delta\le\varkappa$, where $\hat\beta=(\hat\beta_1,\ldots,\hat\beta_p)^T$ is the strictly concave penalized likelihood estimator of $\beta$ in Model (1).

The proof is given in Appendix A.2. Further results of the special case where δ=0 or Z is a zero matrix are summarized in the following proposition.

Proposition 2.4

Let $\hat\beta=(\hat\beta_1,\ldots,\hat\beta_p)^T$ be the strictly concave penalized likelihood estimator of $\beta$ in Model (1).

  (a) The system of column vectors $\{X_j\,|\,\hat\beta_j\neq 0,\ j=1,\ldots,p\}$ is linearly independent. Let $m=\mathrm{rank}(X)$. Then, the number of nonzero estimated coefficients satisfies
    $h=\#\{j\in\overline{1,p}\,|\,\hat\beta_j\neq 0\}\le m.$
  (b) Without loss of generality assume that $\hat\beta_1,\ldots,\hat\beta_h$ are nonzero and the system $X_1,\ldots,X_h,X_{h+1},\ldots,X_m$ is linearly independent. Let $X'=[X_1,\ldots,X_m]$ and $\hat\beta'=(\hat\beta_1,\ldots,\hat\beta_m)^T$. Then,
    $\hat\beta'=\arg\min_{u\in\mathbb{R}^m}\Big\{\frac{1}{2n}\|Y-X'u\|^{2}+\sum_{j=1}^{m}P(|u_j|,\lambda)\Big\}.$ (4)

Result (a) illustrates the crucial difference between the penalized regression methods based on the proposed strictly concave penalty and other commonly used convex penalties, including the Ridge penalty and the Elastic net penalty. The grouping effect of the Elastic net [30] is incompatible with the linear independence of $\{X_j\,|\,\hat\beta_j\neq 0,\ j=1,\ldots,p\}$. Roughly speaking, if a strictly concave penalty is used, there is no redundancy in the selected variables.

Result (b) suggests that the properties of the penalized likelihood estimator in the non-full rank $X$ cases can be studied indirectly through $X'$. To see this, introduce the notations $\beta_{(1)}=(\beta_1,\ldots,\beta_m)^T$, $\beta_{(2)}=(\beta_{m+1},\ldots,\beta_p)^T$, and $X^{*}=[X_{m+1},\ldots,X_p]$. Since the columns of $X'$ form a maximal linearly independent system, there exists an $m\times(p-m)$-dimensional matrix $C$ such that

$X^{*}=X'C.$

The true model is equivalent to

$Y=X\beta+\varepsilon=X'\beta'+\varepsilon,$

where $\beta'=\beta_{(1)}+C\beta_{(2)}$. This means that if we are able to show the oracle properties under the design matrix $X'$, the model selected based on the proposed method selects the covariates corresponding to the non-zero entries of $\beta'$ in the equivalent model with probability going to one. Following the arguments of Fan and Lv in [11], we state without proof the following proposition on the oracle property of the penalized likelihood estimates based on a strictly concave penalty.
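The reduction above is easy to verify numerically. The following sketch builds a rank-deficient design from a hypothetical $X'$ and $C$ and checks that the two parameterizations give the same mean vector; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 50, 3, 5
X_prime = rng.normal(size=(n, m))          # X': m linearly independent columns
C = rng.normal(size=(m, p - m))            # coefficients expressing the rest
X_star = X_prime @ C                       # X* = X'C (redundant columns)
X = np.hstack([X_prime, X_star])           # full design of rank m < p

beta = rng.normal(size=p)
beta_prime = beta[:m] + C @ beta[m:]       # beta' = beta_(1) + C beta_(2)

print(np.linalg.matrix_rank(X))                      # m = 3
print(np.allclose(X @ beta, X_prime @ beta_prime))   # True: X beta = X' beta'
```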

Proposition 2.5

Let $X'$ be defined as in Proposition 2.4. If $X'$ and the error terms $\varepsilon$ satisfy all regularity conditions of [11], then $\hat\beta'$ defined by (4) fulfills the so-called oracle properties in [11]. That means,

  (a) With probability tending to 1 as $n\to\infty$, the penalized likelihood estimator $\hat\beta'=(\hat\beta_{(1)}^{\prime T},\hat\beta_{(2)}^{\prime T})^T$ satisfies:
    $\hat\beta'_{(2)}=0\quad\text{and}\quad\|\hat\beta'-\beta'\|_2=O_P(\sqrt{s}\,n^{-1/2}),$
    where $\hat\beta'_{(1)}$ is the subvector of $\hat\beta'$ formed by the components in $\mathrm{supp}(\beta')$ and $s$ is the size of $\hat\beta'_{(1)}$;
  (b) $A_n(X_{(1)}^{\prime T}X'_{(1)})^{1/2}(\hat\beta'_{(1)}-\beta'_{(1)})\to_d N(0,\sigma^2G),$
    where $\sigma^2=\mathrm{var}(\varepsilon_i)$, $A_n$ is an $m\times s$ matrix such that $A_nA_n^T\to G$, $G$ is an $m\times m$ symmetric positive definite matrix, and $X'_{(1)}$ is the submatrix of $X'$ corresponding to $\beta'_{(1)}$.

To compare the strictly concave penalty to the Elastic net penalty, consider the cases with identical covariates that lead to the so-called grouping effects described in [30]. The following proposition follows immediately from Proposition 2.4 and is stated without proof.

Proposition 2.6

Let $\hat\beta=\hat\beta(\lambda)$ be the strictly concave penalized likelihood estimator (3). Then, the following hold,

  (a) If $X_i=X_j+\delta V$, $i\neq j$, and $\mathrm{Var}(V)=1$, then for sufficiently small $\delta$, $\hat\beta_i\hat\beta_j=0$.

  (b) Suppose that $X_{q+1}=X_{q+2}=\cdots=X_p$ for some $q<p$. Without loss of generality assume that $\hat\beta_{q+1}\neq 0$. Then, $\hat\beta''=(\hat\beta_1,\ldots,\hat\beta_q,\sum_{j=q+1}^{p}\hat\beta_j)^T$ fulfills
    $\hat\beta''=\arg\min_{u\in\mathbb{R}^{q+1}}\Big\{\frac{1}{2n}\|Y-X''u\|^{2}+\sum_{j=1}^{q+1}P(|u_j|,\lambda)\Big\},$ (5)
    where $X''=[X_1,\ldots,X_{q+1}]$.

2.3. Feasibility of the majorization minimization algorithm

The majorization-minimization algorithm of [16], which is closely related to the local quadratic approximation method in [8], can be employed to minimize the penalized likelihood function (3) when the penalty function $P(\cdot,\lambda)$ is non-concave. Note that the majorization-minimization algorithm is applicable only when the matrix $X^TX+n\Omega$ is invertible, where $X$ is the design matrix, $\Omega=\mathrm{diag}(\omega_1,\ldots,\omega_p)$, $\omega_j=P'(|\beta_j|,\lambda)/(\delta+|\beta_j|)\ge 0$, and $\delta$ is a given small positive value, say $10^{-8}$. Let

$J_1=\{j\,|\,j=1,\ldots,p,\ \omega_j=0\}\quad\text{and}\quad J_2=\{j\,|\,j=1,\ldots,p,\ \omega_j\neq 0\}.$

We have the following results.

Proposition 2.7

The matrix $X^TX+n\Omega$ is invertible if and only if the system of columns $X_j$, $j\in J_1$, is linearly independent. In particular, if $\omega_j>0$ for all $j$ or the design matrix $X$ has full column rank, the matrix $X^TX+n\Omega$ is invertible.

The proof is given in Appendix A.4. Below, the invertibility of $X^TX+n\Omega$ is discussed for different kinds of penalty functions. Note that, for any strictly concave penalty $P(\cdot,\lambda)$, $\omega_j>0$, $j=1,\ldots,p$. Consequently, $J_1=\emptyset$ and the conclusion of Proposition 2.7 holds trivially. However, this is not true for the SCAD, the MCP, and the HARD penalty because these penalties are constant beyond some critical point. As a result, $\omega_j$ can be zero if $\beta_j$ is large. Then, the set $J_1$ can be nonempty. The linear independence assumption on $X_j$, $j\in J_1$, in Proposition 2.7 is not guaranteed in the multicollinearity cases.
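To make the iteration concrete, the following is a minimal sketch of the majorization-minimization update described above, assuming a differentiable penalty supplied through its derivative `dP`; the function name, initialization, and stopping rule are illustrative rather than the authors' implementation.

```python
import numpy as np

def mm_penalized_ls(X, y, lam, dP, n_iter=200, delta=1e-8, tol=1e-10):
    """Majorization-minimization sketch for problem (3).

    dP(theta, lam) is the penalty derivative P'(theta, lam) for theta >= 0.
    Each step solves the ridge-type system (X^T X + n*Omega) beta = X^T y with
    Omega = diag( P'(|beta_j|, lam) / (delta + |beta_j|) ), as in Section 2.3.
    """
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # least squares start
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        omega = dP(np.abs(beta), lam) / (delta + np.abs(beta))
        beta_new = np.linalg.solve(XtX + n * np.diag(omega), Xty)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0                  # deselection rule of Section 4
    return beta
```

For a strictly concave penalty every $\omega_j$ is positive, so the linear system above is always solvable even when $X$ is rank deficient; for the modified log penalty of Section 3, `dP(theta, lam)` can be taken as `lam / (np.sqrt(lam) + theta)` under the parameterization in (6).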

3. Modified log penalty

In this section, the modified log penalty (MLOG), a special case of strictly concave penalty, is introduced.

Definition 3.1 Modified log penalty function —

The modified log penalty (MLOG) is defined as

$P^{(\mathrm{MLOG})}(\theta,\lambda)=\lambda\log\Big(1+\frac{|\theta|}{\sqrt{\lambda}}\Big),$ (6)

where $\lambda>0$ is a tuning parameter.

Note that $\lim_{\lambda\to 0^{+}}P^{(\mathrm{MLOG})}(\theta,\lambda)=0$ and $\lim_{\lambda\to\infty}P^{(\mathrm{MLOG})}(\theta,\lambda)/(\sqrt{\lambda}\,|\theta|)=1$ for all $\theta\neq 0$. Therefore, the modified log penalized likelihood estimate behaves like the ordinary least squares estimate when $\lambda$ is close to 0 and behaves like the LASSO estimate when $\lambda$ goes to infinity. For fixed $\theta\neq 0$, as $\lambda$ goes to zero, $\lambda\log(1+|\theta|/\sqrt{\lambda})\approx\lambda\log(|\theta|/\sqrt{\lambda})=\lambda\log|\theta|-\frac{1}{2}\lambda\log\lambda$. Neglecting the constant term, it becomes the logarithmic function.

In the modified log penalty, one is added to the term $|\theta|/\sqrt{\lambda}$ to avoid the singularity at zero. One can consider a more general penalty function $\lambda\log(1+\mu|\theta|)$, where $\mu>0$ is given. In this paper, $\mu=\mu_0=1/\sqrt{\lambda}$ is chosen because it is the greatest possible value of $\mu$ that guarantees the uniqueness and existence of $\hat\theta(t)$, the solution to the minimization problem (2). To see this, consider the first order condition

$0=\frac{d}{d\theta}\Big(\frac{1}{2}(\theta-t)^{2}+\lambda\log(1+\mu\theta)\Big)=\theta-t+\frac{\lambda\mu}{1+\mu\theta}.$

The existence and uniqueness of the solution can be established by noting that the derivative of the right-hand side, $1-\lambda\mu^{2}/(1+\mu\theta)^{2}$, is non-negative for all $\theta\in[0,\infty)$ only when $\mu\le 1/\sqrt{\lambda}$.

Following [1], the thresholding rule of the MLOG penalty refers to the function $\Phi^{(\mathrm{MLOG})}(t,\lambda)=\hat\theta(t)$ and is given by

$\Phi^{(\mathrm{MLOG})}(t,\lambda)=\begin{cases}\dfrac{t-\sqrt{\lambda}}{2}+\sqrt{\dfrac{(|t|+\sqrt{\lambda})^{2}}{4}-\lambda}, & t\ge\sqrt{\lambda},\\[4pt] 0, & |t|<\sqrt{\lambda},\\[4pt] \dfrac{t+\sqrt{\lambda}}{2}-\sqrt{\dfrac{(|t|+\sqrt{\lambda})^{2}}{4}-\lambda}, & t\le-\sqrt{\lambda}.\end{cases}$ (7)
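The penalty (6) and the thresholding rule (7) are straightforward to implement. The sketch below assumes the $\sqrt{\lambda}$ parameterization given above; the function names are illustrative.

```python
import numpy as np

def mlog_penalty(theta, lam):
    """Modified log penalty (6)."""
    return lam * np.log(1.0 + np.abs(theta) / np.sqrt(lam))

def mlog_threshold(t, lam):
    """Thresholding rule (7); vectorized over t."""
    t = np.asarray(t, dtype=float)
    root = np.sqrt(np.maximum((np.abs(t) + np.sqrt(lam)) ** 2 / 4.0 - lam, 0.0))
    out = np.where(t >= np.sqrt(lam), (t - np.sqrt(lam)) / 2.0 + root, 0.0)
    return np.where(t <= -np.sqrt(lam), (t + np.sqrt(lam)) / 2.0 - root, out)

print(mlog_threshold([-3.0, -0.5, 0.5, 1.5, 3.0], lam=1.0))
# approximately [-2.732, 0, 0, 1.0, 2.732]
```

These values agree with a direct numerical minimization of (2) for the same penalty.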

Note that the function $\Phi^{(\mathrm{MLOG})}(t,\lambda)=\hat\theta(t)$ is the unique solution to (2) for all $t\in(-\infty,\infty)$. It is a continuous function of $t$. Moreover, $\hat\theta(t)-t\to 0$ as $t\to\infty$. To see this, note that for $|t|\ge\sqrt{\lambda}$,

$\Phi^{(\mathrm{MLOG})}(|t|,\lambda)-|t|=\sqrt{\tfrac{1}{4}(|t|+\sqrt{\lambda})^{2}-\lambda}-\frac{|t|+\sqrt{\lambda}}{2}=\frac{-\lambda}{\sqrt{\tfrac{1}{4}(|t|+\sqrt{\lambda})^{2}-\lambda}+\frac{|t|+\sqrt{\lambda}}{2}}=\frac{-\lambda}{(|t|+\sqrt{\lambda})\Big(\tfrac{1}{2}+\sqrt{\tfrac{1}{4}-\tfrac{\lambda}{(|t|+\sqrt{\lambda})^{2}}}\Big)}=-\frac{\lambda}{|t|+\sqrt{\lambda}}+O\Big(\Big(\frac{\sqrt{\lambda}}{|t|+\sqrt{\lambda}}\Big)^{3}\Big).$

Since the thresholding rule is an odd function, we have

$\Phi^{(\mathrm{MLOG})}(t,\lambda)=t-\frac{\lambda\,\mathrm{sgn}(t)}{|t|+\sqrt{\lambda}}+O\Big(\Big(\frac{\sqrt{\lambda}}{|t|+\sqrt{\lambda}}\Big)^{3}\Big),\quad\text{as }|t|\to\infty.$ (8)

The plots of the modified log penalty function and its thresholding rule are shown in Figure 1.

Figure 1. The plots of the modified log penalty function and the corresponding thresholding rules. (a) The MLOG penalties: MLOG1 is $\lambda=1$, MLOG2 is $\lambda=0.01$ and MLOG3 is $\lambda=4$. (b) The thresholding rules: MLOG1 is $\lambda=1$, MLOG2 is $\lambda=0.01$ and MLOG3 is $\lambda=4$.

4. Simulation studies

In this section, the finite-sample performance of the proposed modified log penalty is compared to those of the Elastic net, the LASSO, and the SCAD penalties under some examples exhibiting multicollinearity.

To obtain the penalized likelihood estimates, the LARS-EN algorithm of [30] is used for the Elastic net while the majorization-minimization algorithm of [16], which is closely related to the local quadratic approximation method of [8], is used for all other penalties. For $j=1,2,\ldots,p$, the covariate $X_j$ is deselected if the estimated coefficient satisfies $|\hat\beta_j|<10^{-6}$. The tuning parameter $\lambda$ is chosen based on the Bayesian information criterion (BIC) of [13,26]. The optimal $\lambda$ value is obtained using a grid-point search over the 100 grid points $\{10^{-5+7l/99},\ l=0,1,\ldots,99\}$. For the SCAD penalty, the tuning parameter $a=3.7$ is used as suggested in [8].
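As an illustration of this tuning step, the sketch below scores each grid value with a BIC-type criterion, built on the `mm_penalized_ls` sketch of Section 2.3 and the MLOG derivative $P'(\theta,\lambda)=\lambda/(\sqrt{\lambda}+\theta)$; the exact criterion of [13,26] may differ in detail, so this is an assumption-laden sketch rather than the authors' code.

```python
import numpy as np

def dP_mlog(theta, lam):
    """Derivative of the MLOG penalty (6) for theta >= 0."""
    return lam / (np.sqrt(lam) + theta)

def select_lambda_bic(X, y):
    """Pick lambda on the grid of Section 4 by a BIC-type criterion (sketch)."""
    n = X.shape[0]
    grid = 10.0 ** (-5.0 + 7.0 * np.arange(100) / 99.0)
    best = None
    for lam in grid:
        beta = mm_penalized_ls(X, y, lam, dP_mlog)     # sketch from Section 2.3
        rss = max(float(np.sum((y - X @ beta) ** 2)), 1e-12)
        df = int(np.sum(beta != 0))
        bic = n * np.log(rss / n) + np.log(n) * df
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best[1], best[2]
```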

In each example, the simulated dataset consists of a training set and a test set. The models are fitted using the training sets and the prediction errors are obtained from the test sets. N=500 replicates are used in the simulation. The sample sizes of the training sets are chosen as n=200, n=400, and n=800. The number of covariates (p) grows with n. The sample sizes of test sets are 100.

The following measures of estimation efficiency, prediction efficiency, and selection consistency are used to compare the performance of the different penalties. Let $\hat\beta$ be an estimate of $\beta$; an illustrative helper computing several of these measures is sketched after the list.

  1. Median of relative model errors (MRME) of [8].

  2. Model error (ME): $\mathrm{ME}(\hat\beta)=(\hat\beta-\beta)^{T}\mathrm{Cov}(X)(\hat\beta-\beta)$.

  3. Relative model error (RME): $\mathrm{ME}(\hat\beta)/\mathrm{ME}(\hat\beta_{LS})$, where $\hat\beta_{LS}$ is the least squares estimator (LS).

  4. Mean squared error (MSE): $\mathrm{MSE}(\hat\beta)=\|Y-X\hat\beta\|^{2}/n$.

  5. Prediction error (PE): $\mathrm{PE}(\hat\beta)=(\hat\beta-\beta)^{T}\mathrm{Cov}(X_{\mathrm{test}})(\hat\beta-\beta)$.

  6. Mean squared prediction error (MPSE): $\mathrm{MPSE}(\hat\beta)=\|Y_{\mathrm{test}}-X_{\mathrm{test}}\hat\beta\|^{2}/n$.

  7. False positives (FP) and false negatives (FN). FP is the mean number of irrelevant covariates misclassified as relevant and FN is the mean number of relevant covariates misclassified as irrelevant. In the simulation examples, the average FP and FN values are reported.

  8. NONZERO: the average number of selected covariates.
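The following is a minimal sketch of these measures, with the sample covariance standing in for $\mathrm{Cov}(X)$ and the $10^{-6}$ deselection threshold of this section; the helper names are illustrative.

```python
import numpy as np

def model_error(beta_hat, beta_true, X):
    """ME / PE: quadratic form of the estimation error in the covariance metric."""
    d = beta_hat - beta_true
    return float(d @ np.cov(X, rowvar=False) @ d)

def mean_squared_error(beta_hat, X, y):
    """MSE / MPSE: average squared residual."""
    return float(np.sum((y - X @ beta_hat) ** 2) / X.shape[0])

def fp_fn(beta_hat, beta_true, tol=1e-6):
    """False positives and false negatives of the selected model."""
    selected = np.abs(beta_hat) >= tol
    relevant = beta_true != 0
    return int(np.sum(selected & ~relevant)), int(np.sum(~selected & relevant))
```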

Example 4.1

In this example, the covariates are strongly correlated but not exactly identical. Suppose that $X=[X^{(1)},X^{(2)},X^{(3)}]$. The number of covariates is $p=p_n=[4n^{1/4}]$. Let $q=[p/3]$. The rows of $X^{(1)}$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(0.5^{|i-j|})_{i,j=1,2,\ldots,q}$. The rows of $X^{(2)}$ and $X^{(3)}$ are generated by

$X^{(2)}=\sqrt{1-\tau^{2}}\,X^{(1)}+\tau Z^{(1)}\quad\text{and}\quad X^{(3)}=Z^{(2)},$

where $\tau=0.1$. The rows of $Z^{(1)}$ and $Z^{(2)}$ are sampled from $N(0,I_q)$ and $N(0,I_{p-2q})$ respectively.

Consider the following two sets of true regression coefficients β(true),

$\text{Scenario (a):}\ \beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_q}_{q},\underbrace{0,0,\ldots,0}_{p-q})^{T},\qquad\text{Scenario (b):}\ \beta^{(\mathrm{true})}=(\underbrace{0,0,\ldots,0}_{p-q},\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_q}_{q})^{T}.$

Here, $\kappa=\{3,-1.9,2.5,-2.2,1.5,3,-1.9,2.5,-2.2,1.5,\ldots\}$, that is, the five numbers 3, −1.9, 2.5, −2.2, 1.5 are repeated in $\kappa$. In Scenario (a), $X^{(1)}$, which is strongly correlated with $X^{(2)}$, is relevant, while in Scenario (b), both $X^{(1)}$ and $X^{(2)}$ are irrelevant.
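For concreteness, the design of this example can be generated as follows; the function name and random seed are illustrative.

```python
import numpy as np

def make_design_ex41(n, rng, tau=0.1):
    """Design matrix of Example 4.1: X(2) is a noisy copy of X(1), X(3) is noise."""
    p = int(4 * n ** 0.25)
    q = p // 3
    Sigma0 = 0.5 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    X1 = rng.multivariate_normal(np.zeros(q), Sigma0, size=n)
    X2 = np.sqrt(1 - tau ** 2) * X1 + tau * rng.normal(size=(n, q))
    X3 = rng.normal(size=(n, p - 2 * q))
    return np.hstack([X1, X2, X3])

X = make_design_ex41(200, np.random.default_rng(1))
print(X.shape)                                          # (200, 15)
print(round(np.corrcoef(X[:, 0], X[:, 5])[0, 1], 3))    # close to 1: X1 vs its copy
```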

Table 1 summarizes the simulation results. First, both MLOG and SCAD perform well in terms of FP in Scenario (b) where the strong pairwise correlations occur only in the irrelevant covariates. On the other hand, the MLOG outperforms the SCAD in general under Scenario (a) where strong pairwise correlations occur between both relevant and irrelevant covariates. The Elastic net and the LASSO perform well in terms of estimation efficiency, but FP is large in general. This means that the Elastic net and the LASSO tend to select more variables than other penalties.

Table 1. The simulation results of Example 4.1. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), median of relative model errors (MRME), prediction error (PE), mean prediction squared error (MPSE), and mean numbers of false positives (FP) and false negatives (FN).

Scenario (n,p,q) Method ME MRME MSE PE MPSE NONZERO FP FN
    LS 0.0741 100.00 0.9498 0.0795 0.5420 14.38 9.38 0.00
  n=200 MLOG 0.0302 40.18 0.9863 0.0314 0.5186 5.39 0.39 0.00
  p=15 ENET 0.0464 62.93 0.9768 0.0482 0.5265 13.69 8.69 0.00
  q=5 LASSO 0.0476 65.38 0.9816 0.0498 0.5271 12.80 7.80 0.00
    SCAD 1.8508 2972.39 2.8072 2.0629 1.5218 6.25 1.56 0.31
    LS 0.0427 100.00 0.9485 0.0462 0.2621 16.02 11.02 0.00
  n=400 MLOG 0.0148 34.75 0.9714 0.0155 0.2533 5.47 0.47 0.00
(a) p=17 ENET 0.0271 63.71 0.9634 0.0286 0.2561 14.83 9.83 0.00
  q=5 LASSO 0.0269 64.20 0.9657 0.0284 0.2557 13.99 8.99 0.00
    SCAD 1.2158 3045.90 2.1819 1.3219 0.5807 6.24 1.36 0.12
    LS 0.0261 100.00 0.9720 0.0275 0.1279 19.48 12.48 0.00
  n=800 MLOG 0.0104 40.47 0.9852 0.0109 0.1258 7.28 0.28 0.00
  p=21 ENET 0.0156 60.72 0.9836 0.0165 0.1269 16.81 9.81 0.00
  q=7 LASSO 0.0158 61.43 0.9839 0.0167 0.1269 16.56 9.56 0.00
    SCAD 0.2829 900.17 1.2566 0.2848 0.1594 7.26 0.29 0.03
    LS 0.0732 100.00 0.9300 0.0798 0.5411 14.98 9.98 0.00
  n=200 MLOG 0.0273 36.50 0.9729 0.0285 0.5128 5.37 0.37 0.00
  p=15 ENET 0.0433 58.84 0.9580 0.0461 0.5229 10.49 5.49 0.00
  q=5 LASSO 0.0436 60.09 0.9662 0.0465 0.5228 8.20 3.20 0.00
    SCAD 0.0509 67.52 0.9651 0.0524 0.5238 5.55 0.55 0.00
    LS 0.0436 100.00 0.9572 0.0461 0.2589 16.70 11.70 0.00
  n=400 MLOG 0.0143 32.12 0.9840 0.0145 0.2519 5.52 0.52 0.00
(b) p=17 ENET 0.0261 59.84 0.9751 0.0268 0.2547 9.97 4.97 0.00
  q=5 LASSO 0.0257 59.60 0.9789 0.0262 0.2549 8.27 3.27 0.00
    SCAD 0.0179 41.61 0.9905 0.0184 0.2528 5.52 0.52 0.00
    LS 0.0262 100.00 0.9711 0.0269 0.1293 20.98 13.98 0.00
  n=800 MLOG 0.0086 33.03 0.9880 0.0087 0.1271 7.24 0.24 0.00
  p=21 ENET 0.0141 54.63 0.9843 0.0142 0.1278 10.51 3.51 0.00
  q=7 LASSO 0.0142 54.97 0.9846 0.0143 0.1278 10.39 3.39 0.00
    SCAD 0.0115 44.20 0.9913 0.0117 0.1274 7.54 0.54 0.00

Example 4.2

In this example, some covariates are exactly identical. Consider three simulation settings. In each simulation setting, the number of covariates $p$ is chosen the same as that in Example 4.1 and the number of identical covariates in the true model is chosen as $p-q$, where $q=[p/2]+1$. Set $X_{q+1}=\cdots=X_p$. Let $X'=[X_1,X_2,\ldots,X_{q+1}]$ be the first $(q+1)$ columns of $X$; the remaining $(p-q-1)$ columns of $X$ are copies of $X_{q+1}$. The rows of $X'$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,q+1$. The $p-q$ nonzero coefficients are chosen from the first $(p-q)$ elements of the sequence $\kappa$ defined in Example 4.1. In this example, both the least squares estimate (LS) and the SCAD result in computational difficulties related to the inverse of a singular matrix, so the performances of these two methods are not reported. Consider the following true regression coefficient vectors $\beta^{(\mathrm{true})}$,

$\text{Scenario (a):}\ \beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{p-q}}_{p-q},\underbrace{0,0,\ldots,0}_{q})^{T},\qquad\text{Scenario (b):}\ \beta^{(\mathrm{true})}=(\underbrace{0,0,\ldots,0}_{q},\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{p-q}}_{p-q})^{T},$
$\text{Scenario (c):}\ \beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{r}}_{r},\underbrace{0,0,\ldots,0}_{q-r},\underbrace{\kappa_{r+1},\kappa_{r+2},\ldots,\kappa_{p-q}}_{p-q-r},\underbrace{0,0,\ldots,0}_{r})^{T},\quad\text{where }r=[(p-q)/2].$

In Scenario (a), all identical covariates are irrelevant. In Scenario (b), all identical covariates are relevant. In Scenario (c), some of identical covariates are relevant while the other identical covariates are irrelevant.

Note that $X_{q+1}=\cdots=X_p$. To avoid the difficulties related to model identification, the mean numbers of false positives, false negatives, and selected covariates are computed based on $\beta'=(\beta_1^{(\mathrm{true})},\ldots,\beta_q^{(\mathrm{true})},\sum_{j=q+1}^{p}\beta_j^{(\mathrm{true})})^{T}$. Let $k$ be the number of non-zeros in $\beta'$. NONZERO refers to the average number of non-zeros in $\hat\beta'$. To show the detailed information about the identical covariates, additional measures are used. Let $I_1$ be the number of selected covariates among the $p-q$ identical covariates $X_{q+1},\ldots,X_p$, $I_2$ be the number of identical covariates removed from the model, and 'prob' be the estimated probability of correctly eliminating all redundant identical covariates. The true values of $k$, $I_1$, and $I_2$ in the three scenarios are shown in Tables 2 and 3.

The simulation results are summarized in Table 4. In Scenario (a), where all identical covariates are irrelevant, the performance of the MLOG penalty is similar to those of the Elastic net and the LASSO in terms of both estimation and prediction efficiency. On the other hand, the MLOG outperforms both the Elastic net and the LASSO in Scenarios (b) and (c), where some or all of the identical covariates are relevant. In terms of 'prob', the MLOG is more likely to eliminate the redundant identical covariates while the Elastic net more often exhibits the so-called grouping effect.

Table 2. The true values of the number of nonzeros ($k$), the number of selected covariates among the identical covariates in the model ($I_1$), and the number of identical covariates that are removed from the model ($I_2$).

  $X_1,X_2,\ldots,X_q$, $X_{q+1}=\cdots=X_p$, $q=[p/2]+1$, $r=[(p-q)/2]$
Scenario (a) $\beta'=(\underbrace{\kappa_1,\kappa_2,\ldots,\kappa_{p-q}}_{p-q},\underbrace{0,0,\ldots,0}_{2q-p+1})^{T}$ $k=p-q$ $I_1=0$ $I_2=p-q$
Scenario (b) $\beta'=(\underbrace{0,0,\ldots,0}_{q},\ \sum_{j=1}^{p-q}\kappa_j)^{T}$ $k=1$ $I_1=1$ $I_2=p-q-1$
Scenario (c) $\beta'=(\underbrace{\kappa_1,\ldots,\kappa_r}_{r},\underbrace{0,\ldots,0}_{q-r},\ \sum_{j=r+1}^{p-q}\kappa_j)^{T}$ $k=r+1$ $I_1=1$ $I_2=p-q-1$

Table 3. The true values of k, I1, and I2 with n=200, n=400 and n=800.

n Scenario (a) Scenario (b) Scenario (c)
200 k=7, I1=0, I2=7 k=1, I1=1, I2=6 k=4, I1=1, I2=6
400 k=8, I1=0, I2=8 k=1, I1=1, I2=7 k=5, I1=1, I2=7
800 k=10, I1=0, I2=10 k=1, I1=1, I2=9 k=6, I1=1, I2=9

Table 4. The simulation results of Example 4.2. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates in the model ($I_1$), and the probability of correctly eliminating the redundant identical covariates (prob).

Scenario (n,p,k) Method ME MSE PE MPSE NONZERO FP FN I1 prob
  n=200 MLOG 0.0334 0.9496 0.0349 1.0301 7.16 0.16 0.00 0.06 0.94
  p=15 ENET 0.0426 0.9425 0.0451 1.0393 8.50 1.50 0.00 2.58 0.63
  k=7 LASSO 0.0422 0.9433 0.0453 1.0392 8.53 1.53 0.00 0.73 0.27
  n=400 MLOG 0.0222 0.9847 0.0224 1.0209 8.17 0.17 0.00 0.14 0.86
(a) p=17 ENET 0.0273 0.9819 0.0276 1.0245 9.34 1.34 0.00 1.79 0.78
  k=8 LASSO 0.0271 0.9824 0.0276 1.0257 9.39 1.39 0.00 0.72 0.29
  n=800 MLOG 0.0130 0.9845 0.0132 0.9995 10.11 0.11 0.00 0.08 0.92
  p=21 ENET 0.0156 0.9830 0.0158 1.0025 11.38 1.38 0.00 0.16 0.98
  k=10 LASSO 0.0161 0.9839 0.0165 1.0041 11.28 1.28 0.00 0.63 0.37
  n=200 MLOG 0.0064 0.9787 0.0065 0.9988 1.51 0.51 0.00 1.00 1.00
  p=15 ENET 2.4730 3.4459 2.4401 3.4150 1.58 0.74 0.16 5.88 0.00
  k=1 LASSO 0.0188 0.9758 0.0191 1.0102 2.66 1.66 0.00 1.00 1.00
  n=400 MLOG 0.0046 0.9973 0.0045 0.9948 1.50 0.50 0.00 1.00 1.00
(b) p=17 ENET 0.3049 1.2999 0.3276 1.3041 1.83 0.84 0.01 7.94 0.00
  k=1 LASSO 0.0114 0.9941 0.0113 1.0014 2.97 1.97 0.00 1.00 1.00
  n=800 MLOG 0.0021 0.9866 0.0020 0.9810 1.42 0.42 0.00 1.00 1.00
  p=21 ENET 0.0049 0.9893 0.0048 0.9831 1.51 0.51 0.00 10.00 0.00
  k=1 LASSO 0.0055 0.9858 0.0054 0.9846 2.50 1.50 0.00 1.00 1.00
  n=200 MLOG 0.0239 0.9589 0.0243 1.0003 4.35 0.35 0.00 1.00 1.00
  p=15 ENET 0.0378 0.9474 0.0387 1.0117 6.75 2.75 0.00 7.00 0.00
  k=4 LASSO 0.0388 0.9465 0.0399 1.0118 7.01 3.01 0.00 1.01 0.99
  n=400 MLOG 0.0129 0.9853 0.0131 1.0376 5.28 0.28 0.00 1.00 1.00
(c) p=17 ENET 0.0228 0.9781 0.0236 1.0489 7.92 2.92 0.00 8.00 0.00
  k=5 LASSO 0.0227 0.9782 0.0235 1.0481 8.07 3.07 0.00 1.00 1.00
  n=800 MLOG 0.0079 0.9884 0.0080 1.0080 6.25 0.25 0.00 1.00 1.00
  p=21 ENET 0.0131 0.9848 0.0133 1.0134 8.78 2.78 0.00 10.00 0.00
  k=6 LASSO 0.0135 0.9843 0.0137 1.0138 9.18 3.18 0.00 1.00 1.00

Example 4.3

This example is the same as Example 4.2 except that the covariance matrix

$\Sigma_0'=\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{pmatrix}$

is used instead of $\Sigma_0$, where

$\Sigma_{11}=(1-\rho)I_r+\rho 1_r1_r^{T},\quad \Sigma_{12}=O_{r\times(q+1-r)},\quad \Sigma_{21}=\Sigma_{12}^{T},\quad \Sigma_{22}=\rho I_{q+1-r}+(1-\rho)1_{q+1-r}1_{q+1-r}^{T}$

with $\rho=0.1$. Unlike Example 4.2, there are both strongly correlated and weakly correlated pairs of covariates in $X'$.

The simulation results are shown in Table 5. Similar to Example 4.2, the MLOG outperforms the LASSO and Elastic net penalty in terms of the probability of eliminating redundant covariates and prediction error. The Elastic net in general exhibits grouping effect.

Table 5. The simulation results of Example 4.3. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates in the model ($I_1$), and the probability of correctly eliminating the redundant identical covariates (prob).

Scenario (n,p,k) Method ME MSE PE MPSE NONZERO FP FN I1 prob
  n=200 MLOG 0.0383 0.9588 0.0403 1.0596 7.05 0.05 0.00 0.03 0.97
  p=15 ENET 0.0453 0.9531 0.0476 1.0648 8.27 1.27 0.00 2.25 0.54
  k=7 LASSO 0.0452 0.9530 0.0477 1.0650 8.36 1.36 0.00 0.67 0.33
  n=400 MLOG 0.0188 0.9725 0.0190 1.0202 8.02 0.02 0.00 0.01 0.99
(a) p=17 ENET 0.0246 0.9700 0.0251 1.0268 9.43 1.43 0.00 3.06 0.49
  k=8 LASSO 0.0228 0.9702 0.0232 1.0235 9.31 1.31 0.00 0.70 0.30
  n=800 MLOG 0.0126 0.9899 0.0127 1.0341 10.02 0.02 0.00 0.02 0.98
  p=21 ENET 0.0153 0.9889 0.0156 1.0361 11.26 1.26 0.00 2.64 0.74
  k=10 LASSO 0.0152 0.9897 0.0155 1.0370 11.12 1.12 0.00 0.55 0.45
  n=200 MLOG 0.0061 0.9707 0.0059 1.0228 1.22 0.22 0.00 1.00 1.00
  p=15 ENET 4.6020 5.5639 4.5730 5.6197 1.24 0.54 0.30 4.93 0.00
  k=1 LASSO 0.0205 0.9613 0.0203 1.0379 3.23 2.23 0.00 1.01 0.99
  n=400 MLOG 0.0031 0.9865 0.0032 0.9974 1.31 0.31 0.00 1.00 1.00
(b) p=17 ENET 5.1911 6.1341 5.3456 6.3323 1.91 1.04 0.13 6.98 0.00
  k=1 LASSO 0.0108 0.9822 0.0110 1.0071 3.13 2.13 0.00 1.01 0.99
  n=800 MLOG 0.0016 0.9978 0.0016 0.9941 1.24 0.24 0.00 1.00 1.00
  p=21 ENET 0.0159 1.0087 0.0143 1.0037 2.05 1.05 0.00 10.00 0.00
  k=1 LASSO 0.0059 0.9952 0.0061 0.9963 3.18 2.18 0.00 1.00 1.00
  n=200 MLOG 0.0299 0.9768 0.0301 1.0326 4.02 0.33 0.31 0.02 0.98
  p=15 ENET 0.0339 0.9648 0.0349 1.0405 6.04 2.14 0.10 6.05 0.00
  k=4 LASSO 0.0310 0.9634 0.0315 1.0347 5.81 1.90 0.10 0.92 0.89
  n=400 MLOG 0.0144 0.9871 0.0147 0.9927 5.04 0.04 0.00 1.00 1.00
(c) p=17 ENET 0.0208 0.9823 0.0216 1.0012 6.81 1.81 0.00 8.00 0.00
  k=5 LASSO 0.0203 0.9846 0.0209 1.0011 6.39 1.39 0.00 1.00 1.00
  n=800 MLOG 0.0077 0.9938 0.0077 0.9937 6.04 0.04 0.00 1.00 1.00
  p=21 ENET 0.0110 0.9914 0.0112 0.9969 7.93 1.93 0.00 10.00 0.00
  k=6 LASSO 0.0114 0.9928 0.0116 0.9975 7.54 1.54 0.00 1.01 0.99

Example 4.4

In this example, the situation where the number of covariates $p$ is much larger than the number of observations $n$ is considered under three simulation settings. The MLOG penalty is compared with the LASSO and the Elastic net. Here, $n=50$, $p=100$, and the same $\beta^{(\mathrm{true})}$ is used for all three simulation settings,

$\beta^{(\mathrm{true})}=(\underbrace{\kappa_1,\ldots,\kappa_{24}}_{24},\underbrace{0,0,\ldots,0}_{26},\underbrace{\kappa_{25},\ldots,\kappa_{49}}_{25},\underbrace{0,0,\ldots,0}_{25})^{T}.$

Here, the 49 nonzero coefficients are chosen as the first 49 elements of the sequence κ introduced in Example 4.1. The simulation settings are as follows,

Model (a): The rows of $X$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,100$;

Model (b): Let $X^{(1)}=[X_1,\ldots,X_{50}]$, $X^{(2)}=[X_{51},\ldots,X_{75}]$, and $X^{(3)}=[X_{76},\ldots,X_{100}]$. The rows of $[X^{(1)},X^{(2)}]$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,75$. $X^{(3)}=\sqrt{1-\tau^{2}}\,X^{(2)}+\tau Z$, where the rows of $Z$ are sampled from $N(0,I_{25})$ and $\tau=0.1$;

Model (c): The rows of $[X_1,\ldots,X_{51}]$ are sampled from $N(0,\Sigma_0)$, where $\Sigma_0=(\rho_{ij})$ and $\rho_{ij}=0.5^{|i-j|}$, $i,j=1,\ldots,51$. The remaining covariates are exactly equal to $X_{51}$. That means $X_{51}=\cdots=X_{100}$.

Obviously, there is no strong correlation between the covariates in Model (a). In Model (b), the relevant covariates $X^{(2)}$ are strongly correlated with the irrelevant covariates $X^{(3)}$. In Model (c), some covariates are exactly identical.

The simulation results are shown in Table 6. The MLOG performs the best in all cases while the LASSO and the Elastic net cannot select the true model. In the situation where some covariates are identical, the MLOG outperforms both LASSO and Elastic net penalty in terms of the probability of eliminating redundant covariates and prediction error.

Table 6. The simulation results of Example 4.4. Mean squared error (MSE) of estimates, model error (ME), mean number of selected covariates (NONZERO), prediction error (PE), mean prediction squared error (MPSE), mean numbers of false positives (FP) and false negatives (FN), the number of selected covariates among the identical covariates in the model ($I_1$), and the probability of correctly eliminating the redundant identical covariates (prob).

Scenario Method ME MSE PE MPSE NONZERO FP FN I1 prob
  MLOG 0.2448 0.0005 1.0077 0.4732 49.29 0.68 0.39    
(a) ENET 3.2564 5.1164 4.0410 6.0826 45.15 18.83 22.68    
  LASSO 5.1726 6.9262 5.2391 7.2317 27.86 10.45 31.61    
  MLOG 0.2491 0.0009 1.1881 0.0632 49.17 0.63 0.46    
(b) ENET 5.9036 7.5343 6.4448 7.8515 57.42 26.43 18.01    
  LASSO 7.1262 7.7524 7.2141 8.4866 35.17 12.98 26.81    
  MLOG 0.2455 0.0017 0.9168 0.9205 25.76 0.82 0.06 48.76 0.88
(c) ENET 3.1427 3.6126 3.9263 3.9810 32.74 11.39 3.65 36.99 0.00
  LASSO 3.6124 4.1524 3.9787 4.0974 31.41 10.62 4.21 48.14 0.65

To summarize, the simulation results in Tables 1–6 suggest that in the absence of multicollinearity, both the SCAD and the MLOG perform well. However, when the covariates are strongly correlated or even identical, the MLOG performs better in terms of prediction error. Figures A.1–A.4 show more detailed simulation results of the above examples.

5. Real data examples

In this section, linear models are used and the proposed MLOG penalty is applied to the diabetes dataset in [6] and the prostate dataset in Tibshirani [25] and She [21,22]. To illustrate the performance in the case where the number of covariates $p$ is much larger than the number of observations $n$, the gene expression dataset of Scheetz et al. [20] is also used. The performance of the MLOG penalty is compared with the Elastic net, the LASSO, and the SCAD based on the number of selected covariates (NONZERO) and the mean squared error. For $j=1,2,\ldots,p$, the standard error of the estimated coefficient $\hat\beta_j$ is computed from the bootstrap samples (see, [7]) as

$se_{\mathrm{boot}}(\hat\beta_j)=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\tilde\beta_{b,j}-\bar{\tilde\beta}_{j})^{2}},$

where $\tilde\beta_{b,j}$ is the estimate of $\beta_j$ at the $b$th bootstrap sample and $\bar{\tilde\beta}_{j}=(1/B)\sum_{b=1}^{B}\tilde\beta_{b,j}$. The size of the bootstrap is chosen as $B=1000$.

The following is used to evaluate the standard error of the estimator $\hat\beta$,

$se_{\mathrm{boot}}(\hat\beta)=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\|\tilde\beta_{b}-\bar{\tilde\beta}\|^{2}},$

where $\tilde\beta_{b}$ is the estimate of $\beta$ at the $b$th bootstrap sample and $\bar{\tilde\beta}=(1/B)\sum_{b=1}^{B}\tilde\beta_{b}$. To validate the penalized estimators, training data and test data are considered. The model is fitted on a random subsample of size 300 and the remaining 142 observations are used as the test data. This procedure is repeated 500 times. The mean prediction squared error (MPSE) and the standard deviation of the prediction squared errors (sd(PSE)) are reported.
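The bootstrap standard errors above can be computed with a generic wrapper such as the sketch below, where `fit` is any estimation routine returning a coefficient vector (for example, the `mm_penalized_ls` sketch of Section 2.3); the wrapper is illustrative rather than the authors' implementation.

```python
import numpy as np

def bootstrap_se(X, y, fit, B=1000, seed=0):
    """Bootstrap standard errors of the coefficients and of the whole estimator."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    betas = np.empty((B, X.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # resample observations with replacement
        betas[b] = fit(X[idx], y[idx])
    centered = betas - betas.mean(axis=0)
    se_coef = np.sqrt(np.sum(centered ** 2, axis=0) / (B - 1))          # per coefficient
    se_estimator = np.sqrt(np.sum(np.sum(centered ** 2, axis=1)) / (B - 1))
    return se_coef, se_estimator
```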

Example 5.1 Diabetes data —

The diabetes dataset contains $n=442$ observations from diabetes patients. There are ten baseline covariates $X_1,\ldots,X_{10}$, namely age, sex, body mass index (bmi), average blood pressure (bp), and six blood serum measurements: s1, s2, s3, s4, s5, s6. The response is a quantitative measure of disease progression one year after the baseline. Before the statistical analysis, the data are standardized so that the means of all variables are zero and the variances are one.

Some covariates are strongly correlated. For example, the pairwise correlation between s1 and s2 is 0.897, between s2 and s4 is 0.66, and between s3 and s4 is −0.738. The sample correlation matrix of the covariates is shown below:

        age     sex     bmi     bp      s1      s2      s3      s4      s5      s6
age    1.000   0.174   0.185   0.335   0.260   0.219  -0.075   0.204   0.271   0.302
sex    0.174   1.000   0.088   0.241   0.035   0.143  -0.379   0.332   0.150   0.208
bmi    0.185   0.088   1.000   0.395   0.250   0.261  -0.367   0.414   0.446   0.389
bp     0.335   0.241   0.395   1.000   0.242   0.186  -0.179   0.258   0.393   0.390
s1     0.260   0.035   0.250   0.242   1.000   0.897   0.052   0.542   0.516   0.326
s2     0.219   0.143   0.261   0.186   0.897   1.000  -0.196   0.660   0.318   0.291
s3    -0.075  -0.379  -0.367  -0.179   0.052  -0.196   1.000  -0.738  -0.399  -0.274
s4     0.204   0.332   0.414   0.258   0.542   0.660  -0.738   1.000   0.618   0.417
s5     0.271   0.150   0.446   0.393   0.516   0.318  -0.399   0.618   1.000   0.465
s6     0.302   0.208   0.389   0.390   0.326   0.291  -0.274   0.417   0.465   1.000

Tables 7 and 8 show the results of estimation and prediction computed from the bootstrap samples. The MLOG outperforms the LASSO, the SCAD, and the Elastic net in terms of both MSE and MPSE. Though the results of estimation and prediction are satisfactory for both the LASSO and the Elastic net, they tend to select more covariates than the MLOG and the SCAD. This is consistent with the results of [5].

Table 7. The bootstrap means and standard deviations of the estimate of the diabetes data.

Covariates LS MLOG ENET LASSO SCAD
age −0.006(0.035) 0.000(0.012) 0.000(0.034) 0.000(0.012) 0.000(0.013)
sex −0.148(0.039) −0.094(0.052) −0.146(0.041) −0.093(0.042) −0.104(0.056)
bmi 0.321(0.041) 0.333(0.037) 0.323(0.042) 0.319(0.045) 0.336(0.050)
bp 0.200(0.038) 0.171(0.041) 0.198(0.039) 0.168(0.041) 0.191(0.057)
s1 −0.489(0.223) −0.011(0.046) −0.347(0.258) −0.028(0.041) −0.021(0.072)
s2 0.294(0.178) 0.000(0.043) 0.181(0.199) 0.000(0.018) 0.000(0.066)
s3 0.062(0.112) −0.136(0.059) −0.000(0.126) −0.129(0.045) −0.136(0.082)
s4 0.109(0.095) 0.000(0.055) 0.091(0.093) 0.000(0.036) 0.000(0.087)
s5 0.464(0.090) 0.297(0.051) 0.411(0.105) 0.296(0.049) 0.313(0.063)
s6 0.042(0.037) 0.000(0.019) 0.041(0.036) 0.019(0.027) 0.000(0.012)

Table 8. The estimation results for diabetes data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).

Method Selected Covariates NONZERO MSE se(estimator) MPSE sd(PSE)
LS age, sex, bmi, bp, s1,…, s6 10 0.4811 0.034 0.5106 0.0455
MLOG sex, bmi, bp, s1, s3, s5 6 0.4832 0.016 0.5118 0.0445
ENET sex, bmi, bp, s1, s2, s4, s5, s6 8 0.4935 0.039 0.5148 0.0452
LASSO sex, bmi, bp, s1, s3, s5,s6 7 0.4909 0.012 0.5165 0.0442
SCAD sex, bmi, bp, s1, s3, s5 6 0.4898 0.019 0.5224 0.0449

Example 5.2 Prostate data —

The prostate dataset has $n=97$ observations and 9 clinical measures. Following She [21], we take log(cancer volume) (lcavol) as the response variable and consider a full quadratic model: the 43 covariates are 8 main effects, 7 squares, and 28 interactions of the eight original variables – lweight, age, lbph, svi, lcp, gleason, pgg45, and lpsa, where svi is binary. To validate the estimation methods, 80 observations are randomly selected for model fitting and the remaining 17 observations are used for testing.

The covariates in the full quadratic model exhibit even stronger correlations than those in Example 5.1. For example, the within-group correlations are very high: >0.98 for the group {lcp, lweight*lcp, age*lcp, gleason*lcp}, and >0.93 for the group {lpsa, lweight*lpsa, age*lpsa, gleason*lpsa}. The results of She [21] suggest that the LASSO does not give stable and accurate solutions in the presence of many highly correlated covariates.

Tables 9 and 10 show the performances of estimation and prediction based on the bootstrap samples. Since there are strongly correlated pairs of covariates, the Elastic net selects many redundant covariates and its performance in terms of prediction, MSE, and MPSE is not as good as that of the MLOG. The MLOG tends to select fewer covariates than the LASSO and the Elastic net.

Table 9. The bootstrap means and standard deviations of the estimate of the prostate data.

Selected MLOG ENET LASSO Selected MLOG ENET LASSO
Covariates (3) (12) (4) Covariates (3) (9) (4)
age   0.0177(0.119)   age*gleason 0.0005(0.021)   0.0008(0.014)
lcp 0.0051(0.017) 0.3059(0.536) 0.0849(0.630) age*pgg45 −0.0001(0.001) −0.0001(0.001)  
gleason   0.0816(1.148)   age*lpsa     0.0002(0.009)
lpsa 0.0079(0.018) 0.1967(0.773) 0.4809(0.773) lbph*svi   0.0931(0.222)  
lweight2   −0.0187(0.321)   lbph*lcp   −0.0051(0.069)  
lbph2   0.0515(0.090)   lbph*pgg45   −0.0014(0.003)  
lcp2   0.0526(0.074) 0.0019(0.056) lbph*lpsa   −0.0189(0.086) −0.0043(0.061)
pgg452   −0.0001(0.010)   svi*pgg45   −0.0038(0.016)  
lpsa2   0.0311(0.108)   lcp*pgg45   0.0032(0.005) 0.0001(0.004)
lweight*lcp   0.0886(0.140) 0.0494(0.147) lcp*lpsa   −0.1362(0.233)  
lweight*lpsa   0.0184(0.155)   gleason*pgg45 0.0001(0.021)    
age*lbph −0.0009(0.010) −0.0007(0.007)   gleason*lpsa   0.0204(0.087)  

Table 10. The estimation results for prostate data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).

Method Selected Covariates NONZERO MSE se(estimator) MPSE sd(PSE)
  lcp, lpsa,          
MLOG age*lbph, age*gleason, 6 0.4906 0.128 0.5355 0.1587
  age*pgg45, gleason*pgg45          
  age, lcp, gleason, lpsa, lweight2, lbph2,          
  pgg452, lpsa2, lweight*lcp, lweight*lpsa,          
ENET age*lbph, age*pgg45, lbph*svi, lbph*lcp, 21 1.8905 0.257 7.7054 22.8274
  lbph*pgg45, lbph*lpsa, svi*pgg45,          
  lcp*pgg45, lcp*lpsa, gleason*lpsa, lcp2          
  lcp, lpsa, lcp2,          
LASSO lweight*lcp, age*gleason, age*lpsa 8 0.5216 0.174 3.2233 13.3297
  lbph*lpsa, lcp*pgg45          

Example 5.3 Gene expression data —

The microarray data of Scheetz et al. [20] contain the expression levels of 200 genes related to the TRIM32 gene, collected from eye tissue samples of 120 rats. In this example, the number of covariates $p$ is much greater than the number of observations $n$. Moreover, some covariates are very strongly correlated. In such a situation, our previous discussions suggest that the LASSO and the SCAD penalties cannot select the true model. To validate the estimation methods, we randomly select 100 observations for model fitting and the remaining 20 observations for testing.

Table 11 summarizes the estimation and prediction results of the gene expression data. The MLOG outperforms all other penalties in variable selection, estimation, and prediction. The Elastic net penalty leads to large biases. For the LASSO and the SCAD penalty, although the biases of the estimates are smaller than those of the Elastic net penalty, they select too many irrelevant covariates and give larger errors than the MLOG in both estimation and prediction.

Table 11. The results for gene expression data. The mean of the number of selected covariates (NONZERO), mean squared errors (MSE), standard error of estimator, mean prediction squared errors (MPSE), and the standard deviation of prediction squared errors (sd(PSE)).

Method NONZERO MSE se(estimator) MPSE sd(PSE)
MLOG 16 0.0047 0.0341 0.0139 0.0060
ENET 29 5.8597 0.1127 4.9840 1.3391
LASSO 120 0.7360 0.0414 0.0478 0.0143
SCAD 122 0.1098 0.0816 0.0901 0.0686

6. Conclusion

In this paper, we introduce a new class of strictly concave penalty functions, in particular, the modified log penalty, to improve the performance of prediction under multicollinearity. The proposed penalties exhibit certain nice properties, as described in Section 2, even under the multicollinearity cases. In the weakly correlated cases, these penalties perform as well as the SCAD penalty. In the multicollinearity or highly correlated cases, the proposed penalties tend to select fewer covariates. Real data analysis and simulation studies show that the modified log penalty outperforms the LASSO, the SCAD, and the Elastic net in terms of prediction error in general.

Appendix 1. Proofs

A.1. Technical lemmas

Proposition A.1

$\hat\beta=(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_p)^T$ is a solution to the minimization problem (3) only if the following conditions are satisfied,

$\frac{1}{n}X_j^{T}(Y-X\hat\beta)=P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j),\quad\text{for all }\hat\beta_j\neq 0$ (A1)

and

$\frac{1}{n}\big|X_j^{T}(Y-X\hat\beta)\big|\le P'(0+,\lambda),\quad\text{for all }j=1,\ldots,p.$ (A2)
Proof.

First, we have the following lemma.

Lemma A.2

Let $f(x_1,x_2,\ldots,x_d)$ be a function on $\mathbb{R}^d$. Suppose that $f(x_1,x_2,\ldots,x_d)$ attains its minimum value at $(x_1^0,x_2^0,\ldots,x_d^0)$. Then, the function

$g(x_1,\ldots,x_k)=f(x_1,\ldots,x_k,x_{k+1}^0,\ldots,x_d^0)$

attains its minimum value at $(x_1^0,x_2^0,\ldots,x_k^0)$, $k=1,2,\ldots,d$.

The proof of Lemma A.2 is trivial. Below, the proof of Proposition A.1 is given. Let

$F(\beta)=\frac{1}{2n}\|Y-X\beta\|_2^{2}+\sum_{j=1}^{p}P(|\beta_j|,\lambda)=\frac{1}{2n}\Big\|Y-\sum_{j=1}^{p}X_j\beta_j\Big\|^{2}+\sum_{j=1}^{p}P(|\beta_j|,\lambda).$

For all $j=1,2,\ldots,p$ with $\beta_j\neq 0$, we have

$\frac{\partial F}{\partial\beta_j}=-\frac{1}{n}X_j^{T}(Y-X\beta)+P'(|\beta_j|,\lambda)\,\mathrm{sgn}(\beta_j).$

Let $\hat\beta=(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_p)^T$ be a solution to the minimization problem $\min_\beta F(\beta)$. Define $J_0(\hat\beta)=\{j=1,\ldots,p\,|\,\hat\beta_j\neq 0\}$ and $m=\#J_0(\hat\beta)$. Without loss of generality assume that

$J_0(\hat\beta)=\{1,2,\ldots,m\},\quad\text{i.e.}\quad\hat\beta=(\hat\beta_1,\ldots,\hat\beta_m,0,\ldots,0)^T.$

According to Lemma A.2, $(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_m)$ is a solution to the minimization problem

$\min_{(\beta_1,\beta_2,\ldots,\beta_m)}G(\beta_1,\beta_2,\ldots,\beta_m)=\min_{(\beta_1,\beta_2,\ldots,\beta_m)}F(\beta_1,\beta_2,\ldots,\beta_m,0,\ldots,0).$

Therefore,

$\frac{\partial G}{\partial\beta_j}(\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_m)=0,\quad j=1,\ldots,m.$

It is equivalent to

$\frac{\partial F}{\partial\beta_j}(\hat\beta_1,\ldots,\hat\beta_m,0,\ldots,0)=0,\quad j=1,\ldots,m.$

That means $-\frac{1}{n}X_j^{T}(Y-X\hat\beta)+P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j)=0$ and thus

$\frac{1}{n}X_j^{T}(Y-X\hat\beta)=P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j),\quad j=1,\ldots,m.$

Now, consider $j>m$. For all $\alpha\in\mathbb{R}$, let

$\beta_\alpha^{(j)}=(\hat\beta_1,\ldots,\hat\beta_m,0,\ldots,0,\alpha,0,\ldots,0)^T,$

where $\alpha$ is the $j$th element. Since $\hat\beta$ is the global minimizer of $F(\beta)$, we have

$F(\hat\beta)\le F(\beta_\alpha^{(j)}),\quad\forall j>m,\ \alpha\in\mathbb{R}.$

On the other hand, simple algebraic manipulations show that

$F(\beta_\alpha^{(j)})=\frac{1}{2n}\Big\|Y-\sum_{k=1}^{m}X_k\hat\beta_k-X_j\alpha\Big\|^{2}+\sum_{k=1}^{m}P(|\hat\beta_k|,\lambda)+P(|\alpha|,\lambda).$

Therefore,

$F(\beta_\alpha^{(j)})=F(\hat\beta)+\frac{\alpha^{2}}{2n}\|X_j\|^{2}-\frac{1}{n}\alpha X_j^{T}(Y-X\hat\beta)+P(|\alpha|,\lambda).$

Since $F(\hat\beta)\le F(\beta_\alpha^{(j)})$ for all $j>m$ and $\alpha\in\mathbb{R}$, we have

$\frac{\alpha^{2}}{2n}\|X_j\|^{2}-\frac{1}{n}\alpha X_j^{T}(Y-X\hat\beta)+P(|\alpha|,\lambda)\ge 0.$

Choosing $\alpha=\gamma X_j^{T}(Y-X\hat\beta)$, $0<\gamma<1$, we have

$\big(X_j^{T}(Y-X\hat\beta)\big)^{2}\Big[\frac{\gamma^{2}}{2n}\|X_j\|^{2}-\frac{\gamma}{n}\Big]+P\big(|X_j^{T}(Y-X\hat\beta)|\gamma,\lambda\big)\ge 0.$

Let $A=X_j^{T}(Y-X\hat\beta)$. Then,

$A^{2}\Big[\frac{\gamma^{2}}{2n}\|X_j\|^{2}-\frac{\gamma}{n}\Big]+P(|A|\gamma,\lambda)\ge 0.$

It is equivalent to

$\frac{A^{2}}{n}\Big[\gamma-\frac{\gamma^{2}}{2}\|X_j\|^{2}\Big]\le P(|A|\gamma,\lambda).$

Choose $\gamma\in(0,1)$ sufficiently small such that $\gamma-\gamma^{2}\|X_j\|^{2}/2>0$. Then,

$\frac{A^{2}}{n}\le\frac{P(|A|\gamma,\lambda)}{|A|\gamma}\cdot\frac{|A|}{1-\frac{\gamma}{2}\|X_j\|^{2}}.$ (A3)

The condition (A3) holds for any small $\gamma$. Taking $\gamma\to 0$, we have

$\frac{A^{2}}{n}\le|A|\,P'(0+,\lambda)\quad\Longrightarrow\quad\frac{1}{n}|A|\le P'(0+,\lambda).$

Therefore, $(1/n)|X_j^{T}(Y-X\hat\beta)|\le P'(0+,\lambda)$, $\forall j>m$. Since $P(\cdot,\lambda)$ is a strictly concave penalty and the derivative $P'(\cdot,\lambda)$ is non-increasing on $[0,\infty)$, $P'(u,\lambda)\le P'(0+,\lambda)$, $\forall u\in[0,\infty)$. This completes the proof.

A.2. Proof of Theorem 2.3

Let $J=\{j\in\overline{1,p}\,|\,\hat\beta_j\neq 0\}$ and $U(\delta)=X-\delta Z$. Denote the number of components of $J$ by $h$. Obviously, the system of column vectors $\{U_j(\delta)\,|\,j\in J\}$ is linearly independent if $h=1$.

Consider $h=q+1$, $q>0$. By contradiction, assume that the system of column vectors $\{U_j(\delta)\,|\,j\in J\}$ is linearly dependent. Without loss of generality, assume that

$J=\{1,2,\ldots,h\}.$

Since $\{U_j(\delta)\,|\,j\in\overline{1,h}\}$ is linearly dependent and $\hat\beta_j\neq 0$, $j=1,\ldots,h$, the system of vectors $\{\hat\beta_jU_j(\delta)\,|\,j=1,\ldots,h\}$ is also linearly dependent. Then, there exist real values $\gamma_1,\ldots,\gamma_h$, not all zero, such that

$\sum_{j=1}^{q+1}\gamma_j\hat\beta_jU_j(\delta)=0.$

Without loss of generality assume that

$|\gamma_h|=\max_{1\le j\le h}|\gamma_j|.$

Define $\alpha_j=-\gamma_j/\gamma_h$, $j=1,\ldots,h$. We get $|\alpha_j|\le 1$, $j=1,\ldots,h$, and

$\hat\beta_hU_h(\delta)=\sum_{j=1}^{q}\alpha_j\hat\beta_jU_j(\delta).$ (A4)

Since $\hat\beta$ is the solution and $\hat\beta_j\neq 0$, $j=1,\ldots,h$, Proposition A.1 suggests that

$\frac{1}{n}X_j^{T}(Y-X\hat\beta)=P'(|\hat\beta_j|,\lambda)\,\mathrm{sgn}(\hat\beta_j),\quad j\in\overline{1,h}.$

From (A4), we have

$\hat\beta_h\frac{1}{n}U_h(\delta)^{T}(Y-X\hat\beta)=\sum_{j=1}^{q}\alpha_j\frac{1}{n}\hat\beta_jU_j(\delta)^{T}(Y-X\hat\beta).$

Then,

$\hat\beta_h\frac{1}{n}(X_h-\delta Z_h)^{T}(Y-X\hat\beta)=\sum_{j=1}^{q}\alpha_j\frac{1}{n}\hat\beta_j(X_j-\delta Z_j)^{T}(Y-X\hat\beta).$

Therefore,

$|\hat\beta_h|P'(|\hat\beta_h|,\lambda)=\sum_{j=1}^{q}\alpha_j|\hat\beta_j|P'(|\hat\beta_j|,\lambda)+\frac{\delta}{n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta).$ (A5)

From (A4), $U_h(\delta)=\sum_{j=1}^{q}U_j(\delta)\alpha_j(\hat\beta_j/\hat\beta_h)$. For any $\tau>1$, define

$\tilde\beta_j(\tau)=\begin{cases}\hat\beta_j\Big(1-\dfrac{\alpha_j}{\tau}\Big), & j=1,\ldots,q,\\[4pt] \hat\beta_h\Big(1+\dfrac{1}{\tau}\Big), & j=h,\\[4pt] 0, & j>h.\end{cases}$

We have

$U(\delta)\tilde\beta=\sum_{j=1}^{q}U_j(\delta)\tilde\beta_j(\tau)+U_h(\delta)\tilde\beta_h(\tau)=\sum_{j=1}^{q}U_j(\delta)\hat\beta_j\Big(1-\frac{\alpha_j}{\tau}\Big)+\sum_{j=1}^{q}U_j(\delta)\alpha_j\hat\beta_j\Big(1+\frac{1}{\tau}\Big)=\sum_{j=1}^{q}U_j(\delta)\hat\beta_j(1+\alpha_j)=U(\delta)\hat\beta.$

Since $\tau>1$ by assumption, we have $1-\alpha_j/\tau>0$, $j=1,\ldots,q$, so that $|\tilde\beta_j(\tau)|=|\hat\beta_j|(1-\alpha_j/\tau)$ and $|\tilde\beta_h(\tau)|=|\hat\beta_h|(1+1/\tau)$. Consider

$F(\hat\beta)=\frac{1}{2n}\|Y-X\hat\beta\|^{2}+\sum_{j=1}^{q}P(|\hat\beta_j|,\lambda)+P(|\hat\beta_h|,\lambda)$

and

$F(\tilde\beta)=\frac{1}{2n}\|Y-X\tilde\beta\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=\frac{1}{2n}\|Y-X\hat\beta+\delta Z(\hat\beta-\tilde\beta)\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=\frac{1}{2n}\|Y-X\hat\beta\|^{2}+F^{(\delta)}(\tau),$

where

$F^{(\delta)}(\tau)=\frac{\delta}{n}(\hat\beta-\tilde\beta)^{T}Z^{T}(Y-X\hat\beta)+\frac{\delta^{2}}{2n}\|Z(\hat\beta-\tilde\beta)\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=-\frac{\delta}{n\tau}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)+\frac{\delta^{2}}{2n\tau^{2}}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\sum_{j=1}^{q}P\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+P\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big).$

To obtain a contradiction and complete the proof, we need to show that $F(\tilde\beta)<F(\hat\beta)$. We have

$\frac{d}{d\tau}F^{(\delta)}(\tau)=\frac{\delta}{\tau^{2}n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)-\frac{\delta^{2}}{\tau^{3}n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\frac{1}{\tau^{2}}\sum_{j=1}^{q}\alpha_j|\hat\beta_j|P'\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)-\frac{|\hat\beta_h|}{\tau^{2}}P'\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)=\frac{1}{\tau^{2}}G^{(\delta)}(\tau),$

where

$G^{(\delta)}(\tau)=\frac{\delta}{n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)-\frac{\delta^{2}}{\tau n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\sum_{j=1}^{q}\alpha_j|\hat\beta_j|P'\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)-|\hat\beta_h|P'\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big)$

and

$\frac{d}{d\tau}G^{(\delta)}(\tau)=\frac{\delta^{2}}{\tau^{2}n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}+\sum_{j=1}^{q}\frac{\alpha_j^{2}}{\tau^{2}}\hat\beta_j^{2}P''\Big(|\hat\beta_j|\Big(1-\frac{\alpha_j}{\tau}\Big),\lambda\Big)+\frac{\hat\beta_h^{2}}{\tau^{2}}P''\Big(|\hat\beta_h|\Big(1+\frac{1}{\tau}\Big),\lambda\Big).$

Let $u=\min_{1\le j\le h}|\hat\beta_j|$, $v=\max_{1\le j\le h}|\hat\beta_j|$, and $M=\max_{0\le\theta\le 2v}P''(\theta,\lambda)$. Since $P(\cdot,\lambda)$ is a strictly concave penalty, we have $M<0$. Noting that $|\alpha_j|\le 1$, $j=1,\ldots,q$, we have

$\frac{d}{d\tau}G^{(\delta)}(\tau)<\frac{1}{\tau^{2}}\Big(Mu^{2}\Big(1+\sum_{j=1}^{q}\alpha_j^{2}\Big)+\frac{\delta^{2}}{n}\Big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big\|^{2}\Big).$

If $\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j=0$, choose $\varkappa>0$ arbitrarily. Otherwise, choose

$\varkappa=\frac{u\sqrt{-nM}}{\big\|\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\big\|}.$

For all $\delta\in[0,\varkappa]$, we have

$\frac{d}{d\tau}G^{(\delta)}(\tau)<\frac{1}{\tau^{2}}Mu^{2}\sum_{j=1}^{q}\alpha_j^{2}\le 0.$

Then, the function $G^{(\delta)}(\tau)$ is strictly decreasing on $(1,\infty)$. Therefore,

$G^{(\delta)}(\tau)>\lim_{\tau\to\infty}G^{(\delta)}(\tau),\quad\forall\tau>1.$

However,

$\lim_{\tau\to\infty}G^{(\delta)}(\tau)=\frac{\delta}{n}\Big(\hat\beta_hZ_h-\sum_{j=1}^{q}\alpha_j\hat\beta_jZ_j\Big)^{T}(Y-X\hat\beta)+\sum_{j=1}^{q}\alpha_j|\hat\beta_j|\,P'(|\hat\beta_j|,\lambda)-|\hat\beta_h|\,P'(|\hat\beta_h|,\lambda).$

From (A5), we have $\lim_{\tau\to\infty}G^{(\delta)}(\tau)=0$ and hence $G^{(\delta)}(\tau)>0$, $\forall\tau>1$. Therefore,

$\frac{d}{d\tau}F^{(\delta)}(\tau)=\frac{1}{\tau^{2}}G^{(\delta)}(\tau)>0,\quad\forall\tau>1.$

That means the function $F^{(\delta)}(\tau)$ is strictly increasing on $(1,\infty)$. Therefore, $F^{(\delta)}(\tau)<\lim_{\tau\to\infty}F^{(\delta)}(\tau)$, $\forall\tau>1$. It is easy to see that

$\lim_{\tau\to\infty}F^{(\delta)}(\tau)=\sum_{j=1}^{q}P(|\hat\beta_j|,\lambda)+P(|\hat\beta_h|,\lambda).$

Therefore,

$F^{(\delta)}(\tau)<\sum_{j=1}^{q}P(|\hat\beta_j|,\lambda)+P(|\hat\beta_h|,\lambda).$

That means $F(\tilde\beta)<F(\hat\beta)$. This completes the proof.

A.3. Proof of Proposition 2.4

Result (a) is a direct consequence of Theorem 2.3. The proof of (b) is given in the following. Let

$G(u)=\frac{1}{2n}\|Y-X'u\|^{2}+\sum_{j=1}^{m}P(|u_j|,\lambda),\quad u=(u_1,\ldots,u_m)^{T}\in\mathbb{R}^{m}.$

We have

$F(\hat\beta)=\frac{1}{2n}\|Y-X\hat\beta\|^{2}+\sum_{j=1}^{p}P(|\hat\beta_j|,\lambda)=\frac{1}{2n}\|Y-X'\hat\beta'\|^{2}+\sum_{j=1}^{m}P(|\hat\beta_j|,\lambda)=G(\hat\beta')\ge\min_{u\in\mathbb{R}^{m}}G(u).$ (A6)

On the other hand, for all $u=(u_1,\ldots,u_m)^{T}\in\mathbb{R}^{m}$, let $\tilde u=(u^{T},0,\ldots,0)^{T}\in\mathbb{R}^{p}$. We have

$F(\tilde u)=\frac{1}{2n}\|Y-X'u\|^{2}+\sum_{j=1}^{m}P(|u_j|,\lambda)=G(u).$

Then, $G(u)=F(\tilde u)\ge\min_{\beta\in\mathbb{R}^{p}}F(\beta)=F(\hat\beta)$, $\forall u\in\mathbb{R}^{m}$. Therefore

$\min_{u\in\mathbb{R}^{m}}G(u)\ge F(\hat\beta).$ (A7)

From (A6) and (A7), we get $\min_{u\in\mathbb{R}^{m}}G(u)=F(\hat\beta)=G(\hat\beta')$.

A.4. Proof of Proposition 2.7

Since $X^{T}X+n\Omega$ is a non-negative definite matrix, it is invertible if and only if it is positive definite. That means

$u^{T}(X^{T}X+n\Omega)u>0,\quad\forall u=(u_1,\ldots,u_p)^{T}\neq 0.$ (A8)

We have

$u^{T}(X^{T}X+n\Omega)u=(Xu)^{T}(Xu)+n\sum_{j=1}^{p}u_j^{2}\omega_j=\|Xu\|^{2}+n\sum_{j\in J_2}u_j^{2}\omega_j.$

Then,

$u^{T}(X^{T}X+n\Omega)u=0\iff\begin{cases}Xu=0,\\ u_j=0,\ \forall j\in J_2\end{cases}\iff\begin{cases}\sum_{j\in J_1}u_jX_j=0,\\ u_j=0,\ \forall j\in J_2.\end{cases}$

This completes the proof.

Appendix 2. Figures.

Figure A.1. Simulation results of Example 4.1 – the mean number of false positives (FP) and false negatives (FN). Panels (a), (c), (e) show the results of Scenario (a). Panels (b), (d), (f) show the results of Scenario (b).

Figure A.2. Simulation results of Example 4.2 – the mean number of false positives (FP) and false negatives (FN). Panels (a), (b) and (c) show the results of Scenario (a). Panels (d), (e) and (f) show the results of Scenario (b). Panels (g), (h) and (i) show the results of Scenario (c).

Figure A.3. Simulation results of Example 4.3 – the mean number of false positives (FP) and false negatives (FN). Panels (a), (b) and (c) show the results of Scenario (a). Panels (d), (e) and (f) show the results of Scenario (b). Panels (g), (h) and (i) show the results of Scenario (c).

Figure A.4. Simulation results of Example 4.4 – the mean number of false positives (FP) and false negatives (FN). Panel (a) shows the results of Model (a). Panel (b) shows the results of Model (b). Panel (c) shows the results of Model (c).

Funding Statement

Chi Tim Ng's work is supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2017R1C1B2011652).

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • 1. Antoniadis A. and Fan J., Regularization of wavelet approximations, J. Am. Stat. Assoc. 96 (2001), pp. 939–967. doi: 10.1198/016214501753208942
  • 2. Breiman L., Heuristics of instability and stabilization in model selection, Ann. Statist. 24 (1996), pp. 2350–2383. doi: 10.1214/aos/1032181158
  • 3. Chatterjee S. and Hadi A.S., Regression Analysis by Example, 5th ed., John Wiley & Sons, Inc., Hoboken, New Jersey, 2012, 424p.
  • 4. Chong I.-G. and Jun C.-H., Performance of some variable selection methods when multicollinearity is present, Chemometr. Intell. Lab. Syst. 78 (2005), pp. 103–112. doi: 10.1016/j.chemolab.2004.12.011
  • 5. Dalalyan A., Hebiri M., and Lederer J., On the prediction performance of the LASSO, Bernoulli 23 (2017), pp. 552–581. doi: 10.3150/15-BEJ756
  • 6. Efron B., Hastie T., Johnstone I., and Tibshirani R., Least angle regression, Ann. Statist. 32 (2004), pp. 407–499. doi: 10.1214/009053604000000067
  • 7. Efron B. and Tibshirani R.J., An Introduction to the Bootstrap, 1st ed., Chapman & Hall, New York, 1993, 456p.
  • 8. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273
  • 9. Fan J. and Lv J., Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B 70 (2008), pp. 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
  • 10. Fan J. and Lv J., A selective overview of variable selection in high dimensional feature space, Stat. Sin. 20 (2010), pp. 101–148.
  • 11. Fan J. and Lv J., Nonconcave penalized likelihood with NP-dimensionality, IEEE Trans. Inform. Theory 57 (2011), pp. 5467–5484. doi: 10.1109/TIT.2011.2158486
  • 12. Fan J. and Peng H., Nonconcave penalized likelihood with a diverging number of parameters, Ann. Statist. 32 (2004), pp. 928–961. doi: 10.1214/009053604000000256
  • 13. Fan Y. and Tang C.Y., Tuning parameter selection in high dimensional penalized likelihood, J. R. Stat. Soc. Ser. B 75 (2013), pp. 531–552. doi: 10.1111/rssb.12001
  • 14. Fitrianto A. and Lee C.Y., Performance of Ridge regression estimator methods on small sample size by varying correlation coefficients: A simulation study, J. Math. Statist. 10 (2014), pp. 25–29. doi: 10.3844/jmssp.2014.25.29
  • 15. Hoerl A.E. and Kennard R.W., Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970), pp. 55–67. doi: 10.1080/00401706.1970.10488634
  • 16. Hunter D.R. and Li R., Variable selection using MM algorithms, Ann. Statist. 33 (2005), pp. 1617–1642. doi: 10.1214/009053605000000200
  • 17. Jolliffe I.T., A note on the use of principal components in regression, Appl. Stat. 31 (1982), pp. 300–303. doi: 10.2307/2348005
  • 18. Konno H. and Takaya Y., Multi-step methods for choosing the best set of variables in regression analysis, Comput. Optim. Appl. 46 (2010), pp. 417–426. doi: 10.1007/s10589-008-9193-6
  • 19. Ng C.T., Oh S., and Lee Y., Going beyond oracle property: Selection consistency and uniqueness of local solution of the generalized linear model, Stat. Methodol. 32 (2016), pp. 147–160. doi: 10.1016/j.stamet.2016.05.006
  • 20. Scheetz T., Kim K., Swiderski R., Philp A., Braun T., Knudtson K., Dorrance A., DiBona G., Huang J., and Casavant T., Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proc. Natl. Acad. Sci. 103 (2006), pp. 14429–14434. doi: 10.1073/pnas.0602562103
  • 21. She Y., Thresholding-based iterative selection procedures for model selection and shrinkage, Electron. J. Stat. 3 (2009), pp. 384–415. doi: 10.1214/08-EJS348
  • 22. She Y., An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors, Comput. Stat. Data. Anal. 56 (2012), pp. 2976–2990. doi: 10.1016/j.csda.2011.11.013
  • 23. Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., and Matsui T., Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor, Optimization Online (2016). Available at http://www.optimization-online.org/DB_HTML/2016/09/5655.html
  • 24. Tamura R., Kobayashi K., Takano Y., Miyashiro R., Nakata K., and Matsui T., Best subset selection for eliminating multicollinearity, J. Oper. Res. Soc. Japan 60 (2017), pp. 321–336.
  • 25. Tibshirani R., Regression shrinkage and selection via the LASSO, J. R. Stat. Soc. Ser. B 58 (1996), pp. 267–288.
  • 26. Wang H., Li R., and Tsai C.L., Tuning parameter selectors for the smoothly clipped absolute deviation method, Biometrika 94 (2007), pp. 553–568. doi: 10.1093/biomet/asm053
  • 27. Wold S., Ruhe A., Wold H., and Dunn III W.J., The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM J. Sci. Stat. Comput. 5 (1984), pp. 735–743. doi: 10.1137/0905052
  • 28. Zhang C.-H., Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729
  • 29. Zou H., The adaptive LASSO and its oracle properties, J. Am. Stat. Assoc. 101 (2006), pp. 1418–1429. doi: 10.1198/016214506000000735
  • 30. Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B 67 (2005), pp. 301–320. doi: 10.1111/j.1467-9868.2005.00503.x
