Abstract
Grouping structures arise naturally in many high-dimensional problems. Incorporation of such information can improve model fitting and variable selection. Existing group selection methods, such as the group Lasso, require correct membership. However, in practice it can be difficult to correctly specify group membership of all variables. Thus, it is important to develop group selection methods that are robust against group mis-specification. Also, it is desirable to select groups as well as individual variables in many applications. We propose a class of concave 1-norm group penalties that is robust to grouping structure and can perform bi-level selection. A coordinate descent algorithm is developed to calculate solutions of the proposed group selection method. Theoretical convergence of the algorithm is proved under certain regularity conditions. Comparison with other methods suggests the proposed method is the most robust approach under membership mis-specification. Simulation studies and real data application indicate that the 1-norm concave group selection approach achieves better control of false discovery rates. An R package grppenalty implementing the proposed method is available at CRAN.
Keywords: Bi-level selection, Concave penalties, Coordinate descent, Sparse group Lasso, p > n problems
1. Introduction
Grouping structures exist in many high-dimensional problems. For example, genes in the same biological pathway naturally form a group. In genome-wide association studies, single-nucleotide polymorphisms (SNPs) from the same exon can also be considered as a group. Typically, variables of the same membership share similar characteristics. Hence higher within-group correlations are often observed for group members. Incorporation of group information can improve model fitting and lead to better interpretation. Possible applications of grouped approaches include, but are not limited to, (i) genetic studies assessing association between biomarkers (such as gene expression level, SNP mutation, and copy number variation) and phenotypes of interest; (ii) studies using multiple questions (instruments) to measure a particular feature. For example, multiple cognitive tests are generally used in Alzheimer's studies to quantify cognitive function. The membership of variables can be determined either by analytical methods or by knowledge from field science.
Yuan and Lin (2006) proposed the group Lasso for group selection. Meier and others (2008) extended the method to logistic regression and applied it to detect splice sites in DNA sequences. Breheny and Huang (2009) proposed a class of bi-level selection methods using concave composite penalties. Huang and others (2009) proposed a group bridge approach for group and individual variable selection. Friedman and others (2010) and Simon and others (2012) proposed the sparse group Lasso (SGL). The SGL bridges the individual selection feature of the Lasso and the group selection nature of the group Lasso via a convex combination. Huang and others (2012) reviewed several group selection methods, including the 2-norm concave group selection methods. Both the group Lasso and the concave 2-norm group penalty select variables at the group level, that is, the members of a group are either all selected or all dropped. Therefore, the grouping structure has a great impact on the results. The true grouping structure is, however, difficult to specify or not available in many applications. Hence, it is important to develop a group selection method that is robust to possibly mis-specified membership.
We propose a class of concave 1-norm group selection methods for high-dimensional linear and generalized linear models in which the number of covariates can exceed the sample size. These methods have two attractive features. First, they are capable of selecting variables at both group and individual levels, that is, they have the bi-level selection property. Second, they are robust against possibly mis-specified grouping structure. These methods can be efficiently implemented via a coordinate descent type algorithm. Our convergence analysis shows that this algorithm is guaranteed to converge to a minimum of the objective function.
The rest of the article is organized as follows. Section 2 first provides a brief review of related penalties, then proposes the concave 1-norm group penalty. Section 2.3 illustrates the robustness with two examples and establishes the bi-level selection feature through two propositions. Section 3 details the computation of the concave 1-norm group penalized solution using the coordinate descent algorithm (CDA). Section 3.3 extends the concave 1-norm group penalty to generalized linear models (GLMs) and develops an algorithm for computing solutions in GLMs based on the majorization minimization (MM) approach and CDA. Section 3.4 establishes the theoretical convergence of the CDA for linear models and GLMs. Section 4 performs simulation studies to understand the robustness of the concave 1-norm group penalty and compares it with the concave 2-norm group penalty and the SGL. A comprehensive comparison with related methods is also conducted to study the empirical behavior of the proposed method. Section 5 applies the 1-norm group penalty to a motivating example and compares the results with other methods. Section 6 concludes the article with a discussion.
2. Methods
2.1. A brief review of group penalties
We briefly review the existing group selection methods, namely the group Lasso, the concave 2-norm group penalty and the sparse group Lasso (SGL). Denote the coefficients of a group of variables as $\theta$ and let $d$ be its dimensionality. The group Lasso (Yuan and Lin, 2006) is defined as

$$\lambda \sqrt{d}\, \|\theta\|_2, \qquad (2.1)$$

with $\|\cdot\|_2$ being the Euclidean norm. When the group size $d = 1$, the group Lasso reduces to the Lasso penalty. By imposing a concave penalty $\rho(\cdot; \lambda, \gamma)$ on the Euclidean norm of $\theta$, Huang and others (2012) proposed the concave 2-norm group penalty, which has the form

$$\rho(\|\theta\|_2; \sqrt{d}\lambda, \gamma). \qquad (2.2)$$

The concave 2-norm group penalty reduces to the standard concave penalty when $d = 1$. The group Lasso can be viewed as a special case of the 2-norm concave group penalty with $\gamma = \infty$. Both the group Lasso and the concave 2-norm group penalty rely on the non-differentiability of $\|\theta\|_2$ at $\theta = 0$ to perform group-only selection. The tuning parameter $\lambda$ controls model sparsity. As group selection procedures, both the group Lasso and the 2-norm concave group penalty are sensitive to the specified membership. The SGL (Friedman and others, 2010; Simon and others, 2012) uses the penalty function
$$(1-\alpha)\lambda \sqrt{d}\, \|\theta\|_2 + \alpha\lambda \|\theta\|_1, \qquad (2.3)$$

with $\|\theta\|_1 = \sum_{j=1}^{d} |\theta_j|$, i.e., the $\ell_1$ norm of $\theta$, and $\theta_j$ the coefficient of an individual group member. The convex combination of the group Lasso and Lasso imposes both group and individual sparsity. The tuning parameter $\lambda$ controls the degree of sparsity and $\alpha \in [0, 1]$ controls the weights of the group Lasso and Lasso. The SGL becomes the Lasso when $\alpha = 1$ and the group Lasso when $\alpha = 0$. According to our results, the SGL is also sensitive to mis-specified group information. This may be due to the group Lasso component. The SGL method seems to prefer a larger model compared with the concave 1-norm group penalty, which could be related to the rate consistency property of the Lasso under the sparse Riesz condition (Zhang and Huang, 2008).
2.2. Concave 1-norm group penalty
Consider a linear regression model, $y = X\beta + \varepsilon$, where $y$ is a response vector, $X$ is an $n \times p$ design matrix and $\varepsilon$ is the vector of error terms. Here $\beta = (\beta_1, \ldots, \beta_p)'$ is the coefficient vector. We are interested in cases where $p \gg n$ and $\beta$ is sparse in the sense that many of its elements are zero. Denote the $j$th covariate vector by $x_j$. Without loss of generality, we assume the response is centered and the covariates are standardized so that $\|x_j\|_2^2 / n = 1$. We also assume that the $p$ covariates are divided into $K$ groups and the size of the $k$th group is $d_k$. Denote the coefficients of the $k$th group by $\beta_{(k)}$ and the corresponding design matrix by $X_{(k)}$.
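The concave 1-norm group penalized least squares criterion is defined, summing over the $K$ groups, as

$$\frac{1}{2n}\Big\|y - \sum_{k=1}^{K} X_{(k)}\beta_{(k)}\Big\|_2^2 + \sum_{k=1}^{K} \rho(\|\beta_{(k)}\|_1; d_k\lambda, \gamma). \qquad (2.4)$$

The penalty level for the $k$th group is $d_k\lambda$, which is proportional to its group size. This avoids the situation where large groups overwhelm small groups.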
Multiple penalties can be chosen for $\rho$. We use the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010). Both the SCAD and MCP gradually reduce the degree of penalization for large coefficients. Such (nearly) unbiased estimation of coefficients enables the SCAD and MCP to correctly select important variables and estimate their coefficients with high probability under certain sparsity conditions and other appropriate regularity conditions, a property known as the oracle property. The SCAD penalty function is

$$\rho(t; \lambda, \gamma) = \lambda \int_0^{|t|} \Big\{ 1(x \le \lambda) + \frac{(\gamma\lambda - x)_+}{(\gamma - 1)\lambda}\, 1(x > \lambda) \Big\}\, dx$$

with $\lambda \ge 0$ and $\gamma > 2$. Here $1(\cdot)$ is the indicator function and $x_+$ denotes the non-negative part of $x$. The MCP is defined as

$$\rho(t; \lambda, \gamma) = \lambda \int_0^{|t|} \Big(1 - \frac{x}{\gamma\lambda}\Big)_+ dx$$

for $\lambda \ge 0$ and $\gamma > 1$. The regularization parameter $\gamma$ controls the concavity of both penalties, with smaller $\gamma$ being more concave. When $\gamma \to \infty$, both SCAD and MCP reduce to the Lasso penalty. Throughout the article, we abbreviate the 1-norm group SCAD and MCP as the 1-norm gSCAD and 1-norm gMCP.
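The penalties above can be evaluated numerically with a few lines of code. The following is a minimal sketch, not the grppenalty implementation: it uses the standard closed forms of the SCAD and MCP integrals, and it assumes group penalty levels of $d\lambda$ for the 1-norm version and $\sqrt{d}\lambda$ for the 2-norm version. All function names are hypothetical.

```python
import math

def mcp(t, lam, gamma):
    """MCP (Zhang, 2010): closed form of lam * int_0^|t| (1 - x/(gamma*lam))_+ dx."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat region: no further penalization

def scad(t, lam, gamma):
    """SCAD (Fan and Li, 2001) in closed form; gamma > 2 is assumed."""
    t = abs(t)
    if t <= lam:
        return lam * t  # Lasso-like region
    if t <= gamma * lam:
        return (gamma * lam * t - 0.5 * (t * t + lam * lam)) / (gamma - 1.0)
    return 0.5 * lam * lam * (gamma + 1.0)  # flat region

def group_penalty_1norm(beta_group, lam, gamma, rho=mcp):
    """Concave penalty applied to the 1-norm of a group; level d*lam is an assumption."""
    d = len(beta_group)
    return rho(sum(abs(b) for b in beta_group), d * lam, gamma)

def group_penalty_2norm(beta_group, lam, gamma, rho=mcp):
    """Concave penalty applied to the 2-norm of a group; level sqrt(d)*lam is an assumption."""
    d = len(beta_group)
    norm2 = math.sqrt(sum(b * b for b in beta_group))
    return rho(norm2, math.sqrt(d) * lam, gamma)
```

For a very large `gamma`, both `mcp(t, lam, gamma)` and `scad(t, lam, gamma)` approach `lam * abs(t)`, which mirrors the reduction to the Lasso penalty described in the text.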
Figure 1 shows solution paths of the 1-norm and 2-norm gMCP, group Lasso, and SGL for a simple example. The example has four groups of variables; the first group contains one zero coefficient (the 1st member) and two non-zero coefficients (the 2nd and 3rd members), and the remaining three groups all have zero coefficients. The bold solid line is the path of the zero coefficient (the 1st member) and the bold dashed lines are the paths of the non-zero coefficients (the 2nd and 3rd members) in the 1st group. The dotted lines are the paths of the remaining variables. The 1-norm gMCP and the SGL have bi-level selection features with properly chosen tuning parameters, while the 2-norm gMCP and the group Lasso perform group selection only.
Fig. 1.
Solution paths of different group selection methods. The bold solid line is the path of the zero coefficient (the 1st member) and the bold dashed lines are the paths of the non-zero coefficients (the 2nd and 3rd members) in the first group. The dotted lines are the paths of the remaining variables. The 1-norm gMCP and the SGL have bi-level selection features with properly chosen tuning parameters, while the 2-norm gMCP and group Lasso perform group selection only.
2.3. Properties of the concave 1-norm group penalty
2.3.1. Robustness to mis-specified grouping structure
An advantage of the concave 1-norm group penalty is its robustness to mis-specified group information. This property is closely related to the bi-level selection feature discussed later. A group selection method, such as the 2-norm group penalty, has only two possible estimates for a group: all coefficients are zero or all coefficients are non-zero. This puts the 2-norm group penalty at a disadvantage when some null variables are mis-grouped with one or more non-zero variables. The method either misses non-zero variables or falsely identifies null variables as causal ones.
Table 1 illustrates the limitation of the 2-norm gMCP in some settings. Example A shows that the 2-norm gMCP fails to identify two null variables, x6–x7, when they are mis-grouped with a causal variable x5. This leads to the false discovery of x6–x7 as causal variables by the 2-norm gMCP method, while the 1-norm gMCP identifies x6–x7 correctly as null variables. Example B shows that when a null variable x4 and a causal one x5 are mis-grouped with a causal one x6, the 2-norm gMCP fails to identify x5–x6 as causal variables, while the 1-norm gMCP fails to identify only x6 as a causal variable.
Table 1.
Two examples

| Example | Variables | True structure | Working structure | True value | 2-norm gMCP estimate | 1-norm gMCP estimate |
|---|---|---|---|---|---|---|
| A | x1 | 1 | 1 | 0.4 | 0.285 | 0.311 |
| | x2 | 1 | 1 | 0.4 | 0.382 | 0.416 |
| | x3 | 1 | 1 | 0.4 | 0.390 | 0.424 |
| | x4 | 1 | 1 | 0.3 | 0.297 | 0.327 |
| | x5 | 1 | 2 | 0.2 | 0.214 | 0.156 |
| | x6 | 2 | 2 | 0 | -0.023 | 0 |
| | x7 | 2 | 2 | 0 | -0.041 | 0 |
| B | x1 | 1 | 1 | 0.3 | 0.227 | 0.212 |
| | x2 | 1 | 1 | 0.3 | 0.338 | 0.305 |
| | x3 | 1 | 1 | 0.3 | 0.345 | 0.309 |
| | x4 | 1 | 2 | 0 | 0 | 0 |
| | x5 | 1 | 2 | 0.1 | 0 | 0.093 |
| | x6 | 2 | 2 | 0.05 | 0 | 0 |

Example A shows the 2-norm gMCP falsely identifies the null variables x6–x7 as causal ones due to mis-specification. Example B shows that the 2-norm gMCP misses the causal variables x5–x6.
2.3.2. Bi-level selection feature
The following proposition shows that the concave 1-norm group penalty could have zero and non-zero solutions within a group under proper conditions. Thus the method has the bi-level selection feature. Note that the right-hand side of the second expression could be zero. Under that scenario, the proposed penalty performs group selection only. Therefore, bi-level selection of the concave 1-norm group penalty requires a proper $\gamma$. We prefer a data-driven approach to select an optimal $\gamma$.
Proposition 1 (Bi-level selection) —
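Let $\hat\beta$ be the solution of the concave 1-norm group penalized regression as defined in (2.4), and let $\hat\beta_{kj}$ and $x_{kj}$ denote the coefficient and the covariate vector of the $j$th member of the $k$th group, whose size is $d_k$. Then a necessary condition for $\hat\beta$ to be a minimizer is that

$$\frac{1}{n} x_{kj}'(y - X\hat\beta) = \dot\rho(\|\hat\beta_{(k)}\|_1; d_k\lambda, \gamma)\,\mathrm{sgn}(\hat\beta_{kj}) \ \text{ if } \hat\beta_{kj} \ne 0, \qquad \Big|\frac{1}{n} x_{kj}'(y - X\hat\beta)\Big| \le \dot\rho(\|\hat\beta_{(k)}\|_1; d_k\lambda, \gamma) \ \text{ if } \hat\beta_{kj} = 0, \qquad (2.5)$$

where $\dot\rho(t; \lambda, \gamma)$ is the first derivative of $\rho(t; \lambda, \gamma)$ w.r.t. $t$.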
Proposition 2 (Invariance property of the
gMCP) —
Given a group of standardized variables with size of $d$, the 1-norm gMCP has the following invariance property,
(2.6)
Notice that the 1-norm gSCAD does not have the invariance property. The proofs of both propositions are simple and thus omitted.
2.3.3. Model sparsity of the 1-norm group penalty
The following proposition shows that at the same penalty level $\lambda$, the concave 1-norm group penalty has a higher group sparsity than the 2-norm group penalty. That means that in order to achieve the same level of group sparsity, we need a larger $\lambda$ for the concave 2-norm group penalty.
Proposition 3 (Model sparsity of 1-norm group penalty) —
Let
be the coefficients of a group of variables with dimensionality
, then given the same penalty level
,
implies
.
This proposition holds because $\|\theta\|_1 \le \sqrt{d}\, \|\theta\|_2$ by the Cauchy–Schwarz inequality.
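For completeness, the norm inequality behind this argument follows in one line. The derivation below is a sketch in which $\theta = (\theta_1, \ldots, \theta_d)$ is assumed notation for the coefficients of a group of size $d$; it applies the Cauchy–Schwarz inequality to the vectors $(|\theta_1|, \ldots, |\theta_d|)$ and $(1, \ldots, 1)$.

```latex
\|\theta\|_1
  = \sum_{j=1}^{d} |\theta_j| \cdot 1
  \le \Big( \sum_{j=1}^{d} \theta_j^2 \Big)^{1/2} \Big( \sum_{j=1}^{d} 1^2 \Big)^{1/2}
  = \sqrt{d}\, \|\theta\|_2 .
```

Equality holds exactly when all $|\theta_j|$ are equal, so the bound is tight only for groups whose members have coefficients of the same magnitude.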
3. Computation
3.1. Coordinate descent algorithm
Over the last few years, CDA has been shown to be an efficient approach for solving high-dimensional penalized regression problems such as the Lasso (Wu and Lange, 2008; Friedman and others, 2007, 2010). We apply the idea of CDA to compute the solutions of the concave 1-norm group selection problems.
Let
. We want to update
to
and update
to
within the
th group. That is we want to update
to
using the preceding notation. CDA minimizes the criterion function (2.4)
![]() |
(3.1) |
as a function of
. The solution of
for the
gSCAD and gMCP are
![]() |
(3.2) |
![]() |
(3.3) |
where
with
,
. The notation $S(z, t) = \mathrm{sgn}(z)(|z| - t)_+$ for $t \ge 0$ is the soft-thresholding operator (Donoho and Johnstone, 1994). The solution form of (3.2) and (3.3) resembles a simple soft-thresholding operator if we set
. The
reflects the grouping effect in the penalty.
CDA for concave 1-norm group penalty: We summarize the CDA for computing the solution of the concave 1-norm group penalized regression as follows:
Step 1. Given an initial value $\beta^{(0)}$, CDA computes the corresponding residual $r = y - X\beta^{(0)}$.
Step 2. For the first group, CDA updates each coefficient $\beta_{kj}^{(m)}$ to $\beta_{kj}^{(m+1)}$ by using (3.2) or (3.3) for the $j$th coordinate of the $k$th group. Then repeat the same process for the other groups until $\beta^{(m)}$ is updated to $\beta^{(m+1)}$.
Step 3. CDA checks the convergence criterion. If the algorithm has converged, CDA stops; otherwise it repeats Step 2 until the algorithm converges.
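The three steps above can be sketched in plain Python for the 1-norm gMCP. This is an illustration under stated assumptions rather than the grppenalty implementation: the criterion is assumed to be the least squares loss divided by $2n$ plus an MCP applied to the 1-norm of each group at level $d\lambda$, columns are assumed standardized so that the sum of squares divided by $n$ is 1, and $\gamma > 1$. The coordinate update is derived from that assumed criterion and is not claimed to reproduce (3.2) and (3.3) verbatim. The function names and the `groups` argument (a list of lists of column indices) are hypothetical.

```python
def soft(z, t):
    """Soft-thresholding operator S(z, t) = sgn(z) * (|z| - t)_+."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def gmcp1_update(z, c, d, lam, gamma):
    """Minimize 0.5*(b - z)^2 + MCP(|b| + c; d*lam, gamma) over b.

    z: unpenalized coordinate solution; c: sum of |beta| over the other
    members of the same group (the grouping effect); d: group size.
    """
    tau = d * lam - c / gamma  # effective threshold, shrinks as c grows
    if tau > 0 and abs(z) <= gamma * tau:
        return soft(z, tau) / (1.0 - 1.0 / gamma)
    return z  # flat part of the MCP: no shrinkage

def cda_gmcp1(X, y, groups, lam, gamma, max_iter=200, tol=1e-8):
    """Coordinate descent for the assumed 1-norm gMCP least squares criterion."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    r = list(y)  # Step 1: residual for the initial value beta = 0
    for _ in range(max_iter):
        max_change = 0.0
        for g in groups:  # Step 2: sweep over groups, then coordinates
            for k in g:
                c = sum(abs(beta[l]) for l in g if l != k)
                z = sum(X[i][k] * r[i] for i in range(n)) / n + beta[k]
                b_new = gmcp1_update(z, c, len(g), lam, gamma)
                delta = b_new - beta[k]
                if delta != 0.0:
                    for i in range(n):
                        r[i] -= X[i][k] * delta  # keep the residual in sync
                    beta[k] = b_new
                    max_change = max(max_change, abs(delta))
        if max_change < tol:  # Step 3: convergence check
            break
    return beta
```

With two orthogonal standardized columns in one group, `X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]` and `y = [0.52, 0.48, -0.48, -0.52]`, a call such as `cda_gmcp1(X, y, [[0, 1]], 0.1, 3.0)` keeps the first coefficient and sets the second to zero, which illustrates selection within a group.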
3.2. Solution surface
It is a common practice to compute a solution path for a sequence of
with a chosen
when applying the standard concave penalties. For example, in linear models it has been suggested one uses
for the SCAD (Fan and Li, 2001) and
(Zhang, 2010) for the MCP. For the proposed group penalties, it is not clear which
is appropriate. Therefore, we treat
and
both as tuning parameters and compute solution surface over a rectangle of
.
Let
and the grid values of a rectangle in
to be
and
. The number of grid points
and
are pre-specified with
. It can be shown that
. We let
, with
if
and
otherwise. Denote the solution corresponding to
as
. We first compute
with
as the initial value. Then for a given
, we compute
by using
as the initial value. The solution surface calculated in this manner is referred to as the solution surface along
. In general, it provides a smoother fit than other alternatives. For more details of the solution surface along
, we refer to Mazumder and others (2011) and Jiang and Huang (2014).
3.3. Extension to the generalized linear models
The concave 1-norm group penalty can be easily extended to other models by using a different loss function. In this article, we extend it to the GLM family, with a focus on the logistic model. For a GLM, the criterion is defined as
![]() |
(3.4) |
with
being the vector of the
th observation. The form
depends on the specified GLM. For a logistic model
. Direct application of CDA is possible but not stable for large
in GLMs. Hence, we apply MM approach together with CDA to compute solutions of (3.4). The main idea of MM approach is to optimize a majorization function of
such that each iteration forces
downward until numerical minimum is reached. For more details about MM method, we refer to Hunter and Lange (2004), Hunter and Li (2005), and Lange and others (2000).
We assume the following two conditions hold in order to apply MM approach.
The second partial derivative of loss
w.r.t.
is uniformly bounded for standardized
, i.e., there exists a real number
such that
for all
,
and
.
, with
being the second derivative of
w.r.t
.
For a logistic model, the condition (i) can be met by choosing
. The condition (ii) is met by choosing
for the
gSCAD and
for the
gMCP. Some calculation shows that the coordinate-wise solution forms in GLM are as follows:
![]() |
(3.5) |
![]() |
(3.6) |
where
,
and
being the first derivative of
.
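To make the MM idea concrete for the logistic model, the sketch below builds a quadratic majorizer of the logistic loss in one standardized coordinate and then solves the majorized penalized problem in closed form for the 1-norm gMCP. It rests on assumptions: the bound `V = 0.25` is the standard upper bound on the second derivative of the logistic loss, the penalty is assumed to be an MCP on the group 1-norm at level $d\lambda$, and the update requires $\gamma > 1/V = 4$ so that the majorized problem stays convex. The names are hypothetical and the formulas are derived from these assumptions rather than copied from (3.5) and (3.6).

```python
import math

V = 0.25  # upper bound on pi * (1 - pi), the curvature of the logistic loss

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def logistic_loss(x, y, b):
    """Average negative log-likelihood in one coordinate; x has sum(x^2)/n = 1."""
    n = len(y)
    return sum(math.log(1.0 + math.exp(xi * b)) - yi * xi * b
               for xi, yi in zip(x, y)) / n

def gradient(x, y, b):
    """First derivative of logistic_loss with respect to b."""
    n = len(y)
    return -sum(xi * (yi - sigmoid(xi * b)) for xi, yi in zip(x, y)) / n

def majorizer(x, y, b0, b):
    """Quadratic upper bound of the loss, tight at b = b0."""
    return (logistic_loss(x, y, b0) + gradient(x, y, b0) * (b - b0)
            + 0.5 * V * (b - b0) ** 2)

def soft(z, t):
    """Soft-thresholding operator S(z, t) = sgn(z) * (|z| - t)_+."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def mm_gmcp1_update(b0, grad, c, d, lam, gamma):
    """Minimize (V/2)*(b - z)^2 + MCP(|b| + c; d*lam, gamma), z = b0 - grad/V.

    c is the sum of |beta| over the other group members; gamma > 1/V is assumed.
    """
    z = b0 - grad / V
    tau = d * lam - c / gamma
    if tau > 0 and abs(z) <= gamma * tau:
        return soft(V * z, tau) / (V - 1.0 / gamma)
    return z
```

Because the majorizer lies above the loss everywhere and touches it at the current value, each such update cannot increase the penalized criterion, which is the descent property the MM approach relies on.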
3.4. Convergence analysis
Theorem 3.1 establishes that, under certain regularity conditions, CDA converges to a minimum of the objective function (2.4) for a concave 1-norm group penalized linear model. Theorem 3.2 states that the solution computed by the CDA and MM approach converges to a minimum of the objective function for a concave 1-norm group penalized GLM. The proofs of both theorems are provided in the Appendix of the supplementary material available at Biostatistics online.
Theorem 3.1 (Convergence in linear model) —
Consider the objective function (2.4), where the given data
lies on a compact set and no two columns of
are identical. Suppose the penalty
satisfies
,
is non-negative, uniformly bounded, with
being the first derivative (assuming existence) of
w.r.t
.
Then the sequence
generated by the CDA converges to a minimum of the function
defined in (2.4).
Theorem 3.2 (Convergence in GLM) —
Consider the objective function (3.4), where the given data
lie on a compact set and no two columns of
are identical. Suppose the penalty
satisfies
,
is non-negative, uniformly bounded, with
being the first derivative (assuming existence) of
w.r.t.
. Also assume two conditions listed below hold.
The second partial derivative of loss
w.r.t.
is uniformly bounded for standardized
, i.e., there exists a real number
such that
for all
,
and
.
, with
being the second derivative of
w.r.t.
.
Then the sequence
generated by the aforementioned algorithm converges to a minimum of the function
defined in (3.4).
4. Simulation studies
We first compare the 1-norm gMCP, 2-norm gMCP, and SGL under group mis-specification. A comprehensive comparison between the 1-norm gMCP and related penalties is then presented under correct group information. For both simulation studies, the
penalized covariates
. To avoid the complexity of tuning parameter selection, we use a validation approach to select final model for comparison. That is for each
, we compute a predictive measure based on a validation dataset with
. For a linear regression, we use the predictive mean square error (PMSE) defined as
. For a logistic regression, we first compute the predictive probability by
. Then based on
, we compute the predictive area under ROC curve (PAUC). For details of computing PAUC, we refer to Jiang and others (2013). The
corresponding to the smallest PMSE and the largest PAUC are selected for comparison across different methods.
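The validation-based selection described above can be sketched as follows. This is an illustration under assumptions: the PMSE is taken to be the average squared prediction error on the validation set, the AUC is computed with the usual rank-based (Mann–Whitney) estimate, which may differ in detail from the PAUC computation of Jiang and others (2013), and `results` is a hypothetical dictionary mapping each tuning parameter pair to its validation score.

```python
import math

def pmse(y, y_hat):
    """Average squared prediction error on the validation set."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)

def predictive_prob(x_row, beta):
    """Logistic predictive probability exp(eta) / (1 + exp(eta)), eta = x'beta."""
    eta = sum(xj * bj for xj, bj in zip(x_row, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def auc(labels, probs):
    """Rank-based AUC: share of (case, control) pairs ordered correctly, ties count 0.5."""
    pos = [p for l, p in zip(labels, probs) if l == 1]
    neg = [p for l, p in zip(labels, probs) if l == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def pick_best(results, larger_is_better):
    """Return the tuning parameter pair with the best validation score."""
    chooser = max if larger_is_better else min
    return chooser(results, key=results.get)
```

For a linear model one would call `pick_best(results, larger_is_better=False)` on PMSE values, and for a logistic model `pick_best(results, larger_is_better=True)` on AUC values.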
4.1. Simulation with group mis-specification
Set
,
and
, with
,
,
,
,
and zero for the rest coefficients. Let
, with
being the covariance matrix for groups 1 and 2 and
the covariance matrix for groups 3 and 4, and
being the covariance matrix for group
. We set
such that within-group correlation is 0.5 and between-group correlation is
for groups 1 and 2. Similarly,
such that within-group correlation and between-group correlation are both 0.5 for groups 3 and 4. For
, we choose a compound symmetry structure with
. The working group information is mis-specified in the sense that the causal variables X9–X10 are grouped with the null variables X11–X20 and the causal variables X29–X30 are grouped with the null variables X31–X40. Hence,
,
,
, and
for the working group information.
Table 2 presents the results in linear and logistic models. We report the false discovery rate (FDR) as well as the percentage of X9–X20 and X29–X40 being selected over the
replications. The results show that the 1-norm gMCP largely avoids the false-positive discovery of X11–X20 and X31–X40; hence, it achieves the lowest FDR. We do not report the results of the Lasso because of space limits and because its performance is similar to that of the SGL (
).
Table 2.
Comparison of the 1-norm gMCP, 2-norm gMCP, and SGL with mis-specified group information

| Model | Results | 1-norm gMCP | 2-norm gMCP | SGL | SGL | SGL | SGL |
|---|---|---|---|---|---|---|---|
| Linear | FDR | 0.272 | 0.613 | 0.544 | 0.544 | 0.520 | 0.453 |
| | Pct. X9 | 0.872 | 0.714 | 0.984 | 0.984 | 0.994 | 0.988 |
| | Pct. X10 | 0.856 | 0.714 | 0.984 | 0.982 | 0.992 | 0.984 |
| | Pct. range X11–X20 | 0.182–0.208 | 0.714 | 0.972–0.980 | 0.926–0.944 | 0.744–0.812 | 0.430–0.480 |
| | Pct. X29 | 0.866 | 0.684 | 0.968 | 0.976 | 0.996 | 0.982 |
| | Pct. X30 | 0.860 | 0.684 | 0.968 | 0.976 | 0.994 | 0.976 |
| | Pct. range X31–X40 | 0.172–0.218 | 0.684 | 0.956–0.964 | 0.910–0.940 | 0.716–0.804 | 0.390–0.472 |
| Logistic | FDR | 0.167 | 0.369 | 0.452 | 0.438 | 0.429 | 0.477 |
| | Pct. X9 | 0.274 | 0.372 | 0.652 | 0.654 | 0.686 | 0.730 |
| | Pct. X10 | 0.260 | 0.372 | 0.654 | 0.660 | 0.684 | 0.724 |
| | Pct. range X11–X20 | 0.054–0.100 | 0.372 | 0.634–0.652 | 0.594–0.640 | 0.534–0.596 | 0.364–0.422 |
| | Pct. X29 | 0.284 | 0.372 | 0.624 | 0.620 | 0.656 | 0.704 |
| | Pct. X30 | 0.288 | 0.372 | 0.624 | 0.620 | 0.648 | 0.692 |
| | Pct. range X31–X40 | 0.064–0.092 | 0.372 | 0.606–0.620 | 0.578–0.604 | 0.530–0.588 | 0.348–0.428 |

The FDR and the percentages of X9–X20 and X29–X40 being selected are reported. Causal variables are X9–X10 (mis-grouped with the null variables X11–X20) and X29–X30 (mis-grouped with the null variables X31–X40). The 1-norm gMCP achieves the smallest FDR.
4.2. A comparison with related methods
We compare the proposed 1-norm group penalty with the SGL, group Lasso, the standard concave penalty and the 2-norm concave group penalty in this subsection.
4.2.1. Simulation models
Set
and
. The
is a
compound symmetric matrix with
, representing a background correlation among predictors. The
is a
compound symmetric covariance matrix of the
th group with
as a median level of within-group correlation. We consider two scenarios (1) equal group size, with
for
, and (2) unequal group size with
for
, and
for
. For coefficients, set
for
and
, with
being a vector of length
. The value of
is chosen such that signal-to-noise ratio (SNR) is approximately in the range of
. We consider five types of
as listed below to represent five settings,
, representing a situation where effects of group members are relative small but similar.
, representing a situation where effects of some group members are small but not zero.
, representing a situation where only one or two members have strong effect with other members have small effect.
, representing a situation where effects of group members are median with some null members having zero coefficients.
, representing a situation where only one or two members have strong effect with other members having small or zero coefficients.
Denote the causal variables set as
with dimension
, and the estimated version as
with dimension
. Define the set of false-positive variables as
and
with dimension
. Similar concepts are defined at group level. Let the causal group sets
with dimension
, and the estimated version as
with dimension
. Denote the set of false-positive groups as
and
with dimension
. We report our results in terms of model size (
), false discovery rate (
), group model size (
), and group false discovery rate (
) to evaluate selection performance together with PMSE/PAUC.
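The four selection measures can be computed directly from the estimated and true supports. The sketch below is illustrative: `selected` and `causal` are sets of variable indices, `group_of` is a hypothetical mapping from variable index to group label, a group counts as selected (or causal) when at least one of its members is, and the convention that an empty selection has a false discovery rate of 0 is an assumption.

```python
def selection_metrics(selected, causal, group_of):
    """Model size, FDR, group model size and group FDR for one fitted model."""
    selected, causal = set(selected), set(causal)
    false_pos = selected - causal  # selected but not causal variables

    sel_groups = {group_of[j] for j in selected}
    causal_groups = {group_of[j] for j in causal}
    false_groups = sel_groups - causal_groups  # selected groups with no causal member

    ms, gms = len(selected), len(sel_groups)
    return {
        "MS": ms,
        "FDR": len(false_pos) / ms if ms else 0.0,
        "GMS": gms,
        "GFDR": len(false_groups) / gms if gms else 0.0,
    }
```

Averaging these quantities over replications gives summary entries of the kind reported for each method.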
Tables 3 and 4 present the results from 500 replications in linear and logistic models under five different settings. For the sake of space, we only report the results with unequal group size with
. For the same reason, we only report methods based on the MCP penalty because of the similarity between the SCAD and MCP penalties. The computation of the 1-norm and 2-norm group penalties and the group Lasso is done by the R package grppenalty, and that of the SGL by the R package SGL. Below we provide a summary of major findings.
Table 3.
Comparison of the concave 1-norm group penalties with other methods in linear models

| Setting | SNR | Method | PMSE (SE) | GMS (SE) | GFDR (SE) | MS (SE) | FDR (SE) |
|---|---|---|---|---|---|---|---|
| 1 | 2.88 | 1-norm gMCP | 1.24 (2.6) | 5.49 (0.6) | 0.06 (6.0) | 60.57 (0.7) | 0.01 (1.0) |
| | | SGL | 2.01 (6.0) | 19.58 (1.7) | 0.73 (2.8) | 78.84 (4.5) | 0.35 (3.4) |
| | | SGL | 1.65 (4.8) | 8.2 (1.0) | 0.35 (7.5) | 83.89 (8.4) | 0.28 (6.6) |
| | | SGL | 1.55 (4.6) | 6.6 (0.7) | 0.21 (7.1) | 79.62 (8.5) | 0.21 (7.2) |
| | | Lasso | 1.91 (6.5) | 36.82 (1.0) | 0.86 (0.4) | 133.76 (7.7) | 0.59 (2.0) |
| | | MCP | 1.91 (6.5) | 36.78 (1.1) | 0.86 (0.5) | 133.53 (7.8) | 0.59 (2.1) |
| | | Group Lasso | 1.56 (5.2) | 22.22 (0.6) | 0.77 (0.7) | 277.37 (7.4) | 0.78 (0.6) |
| | | 2-norm gMCP | 1.24 (2.6) | 5.62 (0.6) | 0.08 (6.1) | 67.33 (7.0) | 0.08 (6.1) |
| 2 | 2.61 | 1-norm gMCP | 1.24 (2.7) | 5.53 (0.6) | 0.06 (6.1) | 60.61 (0.7) | 0.01 (1.1) |
| | | SGL | 1.66 (5.6) | 20.16 (1.8) | 0.74 (2.8) | 69.03 (4.8) | 0.42 (3.8) |
| | | SGL | 1.52 (4.8) | 10.78 (1.4) | 0.50 (6.8) | 96.06 (11.0) | 0.44 (6.5) |
| | | SGL | 1.59 (5.2) | 9.06 (1.1) | 0.41 (7.3) | 108.31 (13.9) | 0.41 (7.5) |
| | | Lasso | 1.59 (5.2) | 35.60 (1.1) | 0.86 (0.5) | 109.39 (6.1) | 0.62 (2.0) |
| | | MCP | 1.51 (4.7) | 26.86 (3.2) | 0.79 (4.4) | 72.28 (10.5) | 0.47 (5.5) |
| | | Group Lasso | 1.54 (5.2) | 22.15 (0.7) | 0.77 (0.8) | 276.42 (8.0) | 0.78 (0.7) |
| | | 2-norm gMCP | 1.24 (2.6) | 5.74 (0.7) | 0.09 (6.7) | 68.89 (8.8) | 0.09 (6.8) |
| 3 | 2.76 | 1-norm gMCP | 1.24 (2.6) | 5.44 (0.9) | 0.04 (5.5) | 59.69 (1.7) | 0.01 (2.0) |
| | | SGL | 1.46 (4.2) | 16.21 (1.8) | 0.67 (4.2) | 51.27 (4.5) | 0.39 (4.6) |
| | | SGL | 1.48 (4.6) | 10.92 (1.4) | 0.50 (6.8) | 92.53 (11.4) | 0.47 (6.7) |
| | | SGL | 1.69 (6.3) | 9.95 (1.2) | 0.46 (6.8) | 118.77 (15.4) | 0.46 (6.9) |
| | | Lasso | 1.40 (3.6) | 32.44 (1.4) | 0.84 (0.7) | 83.70 (4.9) | 0.61 (2.2) |
| | | MCP | 1.28 (2.0) | 15.99 (2.7) | 0.62 (8.7) | 35.15 (6.2) | 0.36 (6.3) |
| | | Group Lasso | 1.55 (5.3) | 22.03 (0.7) | 0.77 (0.8) | 275.23 (8.5) | 0.78 (0.8) |
| | | 2-norm gMCP | 1.24 (2.6) | 5.74 (0.7) | 0.09 (7.0) | 68.95 (9.3) | 0.08 (7.1) |
| 4 | 2.55 | 1-norm gMCP | 1.24 (2.7) | 5.57 (0.6) | 0.07 (6.4) | 60.66 (0.8) | 0.41 (0.7) |
| | | SGL | 1.64 (5.4) | 21.04 (1.8) | 0.75 (2.4) | 69.83 (4.9) | 0.52 (3.3) |
| | | SGL | 1.50 (4.7) | 11.58 (1.5) | 0.53 (6.3) | 100.72 (11.8) | 0.62 (4.5) |
| | | SGL | 1.58 (5.1) | 9.64 (1.2) | 0.44 (7.1) | 115.08 (14.9) | 0.66 (4.4) |
| | | Lasso | 1.58 (5.1) | 35.62 (1.1) | 0.86 (0.5) | 107.89 (6.1) | 0.68 (1.6) |
| | | MCP | 1.50 (4.7) | 26.44 (3.4) | 0.78 (5.1) | 69.95 (10.3) | 0.50 (6.4) |
| 5 | 2.71 | 1-norm gMCP | 1.24 (2.6) | 5.55 (1.1) | 0.05 (6.1) | 59.62 (1.9) | 0.40 (1.4) |
| | | SGL | 1.44 (4.0) | 16.97 (1.8) | 0.68 (4.1) | 51.21 (4.6) | 0.51 (4.1) |
| | | SGL | 1.46 (4.5) | 11.72 (1.5) | 0.54 (6.3) | 97.23 (12.1) | 0.65 (4.5) |
| | | SGL | 1.67 (6.2) | 10.63 (1.4) | 0.49 (6.6) | 126.79 (16.9) | 0.69 (4.2) |
| | | Lasso | 1.39 (3.5) | 32.30 (1.4) | 0.84 (0.7) | 80.96 (5.0) | 0.68 (1.8) |
| | | MCP | 1.25 (1.9) | 13.89 (2.8) | 0.54 (11.1) | 29.79 (5.8) | 0.34 (7.9) |

PMSE is the predictive mean square error, MS is model size, FDR is false discovery rate, GMS is group model size, and GFDR is the group false discovery rate. SE is the standard error computed from 500 replications.
Table 4.
Comparison of the concave 1-norm group penalties with other methods in logistic models

| Setting | SNR | Method | PAUC (SE) | GMS (SE) | GFDR (SE) | MS (SE) | FDR (SE) |
|---|---|---|---|---|---|---|---|
| 1 | 2.88 | 1-norm gMCP | 0.851 (0.78) | 8.80 (2.1) | 0.34 (10.8) | 61.97 (8.7) | 0.16 (7.0) |
| | | SGL | 0.841 (0.59) | 16.94 (2.0) | 0.68 (4.1) | 53.5 (5.4) | 0.39 (4.7) |
| | | SGL | 0.872 (0.46) | 10.17 (1.4) | 0.46 (7.4) | 89.84 (11.0) | 0.41 (6.9) |
| | | SGL | 0.879 (0.43) | 10.48 (1.3) | 0.49 (6.3) | 125.35 (15.9) | 0.49 (6.4) |
| | | Lasso | 0.832 (0.63) | 21.42 (2.3) | 0.75 (3.0) | 52.10 (5.5) | 0.43 (4.1) |
| | | MCP | 0.832 (0.63) | 21.42 (2.3) | 0.75 (3.0) | 52.10 (5.5) | 0.43 (4.1) |
| | | Group Lasso | 0.838 (0.74) | 12.73 (1.5) | 0.58 (5.2) | 155.92 (19.6) | 0.59 (5.3) |
| | | 2-norm gMCP | 0.857 (0.70) | 6.78 (1.2) | 0.20 (9.6) | 81.55 (15.2) | 0.20 (9.7) |
| 2 | 2.61 | 1-norm gMCP | 0.832 (0.83) | 9.98 (2.7) | 0.37 (12.0) | 58.32 (8.7) | 0.19 (8.3) |
| | | SGL | 0.828 (0.67) | 16.91 (2.3) | 0.67 (4.8) | 48.2 (6.4) | 0.44 (5.5) |
| | | SGL | 0.849 (0.53) | 12.62 (1.6) | 0.57 (5.6) | 104.33 (12.3) | 0.53 (5.7) |
| | | SGL | 0.851 (0.52) | 13.13 (1.4) | 0.60 (4.6) | 156.79 (17.5) | 0.60 (4.8) |
| | | Lasso | 0.820 (0.72) | 20.44 (2.6) | 0.73 (3.6) | 45.05 (6.1) | 0.46 (4.7) |
| | | MCP | 0.820 (0.71) | 20.31 (2.6) | 0.73 (3.8) | 44.67 (6.2) | 0.46 (4.7) |
| | | Group Lasso | 0.820 (0.82) | 12.42 (1.6) | 0.57 (5.9) | 151.71 (20.8) | 0.57 (6.1) |
| | | 2-norm gMCP | 0.841 (0.77) | 6.56 (1.2) | 0.17 (9.5) | 78.48 (14.8) | 0.17 (9.6) |
| 3 | 2.76 | 1-norm gMCP | 0.847 (0.77) | 11.17 (3.4) | 0.38 (13.0) | 56.35 (6.8) | 0.22 (10.3) |
| | | SGL | 0.849 (0.69) | 17.82 (2.5) | 0.69 (4.9) | 46.38 (6.8) | 0.50 (5.4) |
| | | SGL | 0.853 (0.62) | 15.02 (1.6) | 0.65 (4.2) | 118.96 (13.1) | 0.63 (4.4) |
| | | SGL | 0.846 (0.60) | 15.61 (1.5) | 0.66 (3.5) | 187.17 (18.7) | 0.67 (3.6) |
| | | Lasso | 0.843 (0.73) | 20.57 (2.9) | 0.73 (4.7) | 41.63 (6.7) | 0.50 (5.2) |
| | | MCP | 0.850 (0.85) | 14.96 (3.3) | 0.56 (10.5) | 27.96 (6.8) | 0.38 (8.4) |
| | | Group Lasso | 0.830 (0.80) | 12.20 (1.4) | 0.56 (5.3) | 149.62 (18.2) | 0.57 (5.4) |
| | | 2-norm gMCP | 0.852 (0.68) | 6.35 (1.1) | 0.14 (8.8) | 76.44 (13.8) | 0.14 (8.9) |
| 4 | 2.55 | 1-norm gMCP | 0.827 (0.84) | 9.72 (2.7) | 0.36 (12.0) | 56.97 (8.2) | 0.48 (4.1) |
| | | SGL | 0.822 (0.68) | 17.07 (2.4) | 0.67 (4.7) | 47.91 (6.6) | 0.55 (4.8) |
| | | SGL | 0.843 (0.57) | 12.99 (1.6) | 0.58 (5.5) | 107.22 (12.6) | 0.69 (3.6) |
| | | SGL | 0.845 (0.55) | 13.46 (1.5) | 0.61 (4.5) | 160.91 (17.7) | 0.76 (2.7) |
| | | Lasso | 0.814 (0.73) | 20.48 (2.8) | 0.73 (3.9) | 44.57 (6.7) | 0.55 (4.7) |
| | | MCP | 0.814 (0.72) | 20.10 (2.8) | 0.72 (5.0) | 43.59 (6.9) | 0.54 (5.2) |
| 5 | 2.71 | 1-norm gMCP | 0.842 (0.82) | 11.38 (3.4) | 0.39 (13.0) | 56.64 (7.2) | 0.50 (5.1) |
| | | SGL | 0.845 (0.69) | 17.86 (2.4) | 0.69 (4.9) | 45.86 (6.9) | 0.60 (4.8) |
| | | SGL | 0.849 (0.64) | 15.18 (1.6) | 0.65 (4.1) | 119.78 (13.1) | 0.75 (2.9) |
| | | SGL | 0.840 (0.62) | 15.84 (1.6) | 0.68 (3.6) | 189.59 (19.6) | 0.80 (2.2) |
| | | Lasso | 0.839 (0.73) | 20.90 (2.8) | 0.73 (4.5) | 41.80 (6.6) | 0.59 (4.8) |
| | | MCP | 0.847 (0.87) | 15.23 (3.3) | 0.57 (10.6) | 28.15 (6.9) | 0.44 (9.2) |

PAUC is the predictive area under ROC curve, MS is model size, FDR is false discovery rate, GMS is group model size, and GFDR is the group false discovery rate. SE is the standard error computed from 500 replications.
4.2.2. Comparison with the SGL
In linear models, the 1-norm gMCP achieves smaller PMSE than the SGL, while in logistic models the PAUC values of these methods are similar. The concave 1-norm group penalties have smaller FDR and GFDR across all settings. The MS and GMS of the concave 1-norm group penalties are smaller than those of the SGL with
and
.
4.2.3. Comparison with the standard concave penalty
The PMSE of the concave 1-norm group penalties is smaller than that of the standard ones, while the PAUC of these methods is close. The 1-norm gMCP has smaller GMS and GFDR in all the settings. This is expected since the standard penalties do not make use of group information. The MS and FDR of the concave 1-norm group penalties are smaller than those of the standard concave penalties under settings 1–4. Under setting 5, with one or two dominating members, the standard MCP penalty ends up with a smaller MS.
4.2.4. Comparison with the group Lasso and the concave 2-norm group selection
We compare the concave 1-norm, the group Lasso, and the concave 2-norm group penalties only under settings 1–3, because the group Lasso and the 2-norm group penalties perform selection at the group level only. The 2-norm group penalty in general has a smaller GMS, while the 1-norm group penalty has a smaller MS. The GFDR and FDR of the 1-norm and 2-norm group penalties are close to each other, and both are smaller than those of the group Lasso.
5. Data example
Our illustrative example comes from a published study exploring the association between genes and the prognosis of breast cancer (van’t Veer and others, 2002; van de Vijver and others, 2002). Tumor samples from
women with breast cancer were selected for microarray expression profiling. To be eligible, women had to be 52 years or younger at diagnosis. Fluorescence intensities of 25 000 human genes were quantified and normalized, and the ratio of these values to the intensity of a reference pool was calculated for analysis. Further details can be found in the references above.
For our purposes, a binary variable indicating whether a patient developed metastasis within 5 years of surgery is modeled as the outcome. Seventy-eight patients developed metastasis within 5 years. A total of
genes with the top Spearman correlation coefficients with the outcome were used for illustration. (Note: the method can handle problems with
and
.) The membership of the genes was determined by hierarchical clustering using the Gap statistic. The idea of the Gap statistic is as follows: (1) group the genes into a given number of clusters and calculate the total within-block sum of squares; (2) create new resampled datasets by separately permuting the measurements of each gene, repeat Step (1) on each resampled dataset, and average the resulting within-block sums of squares. The number of clusters is then chosen to maximize the Gap statistic computed from these two quantities. For details about the Gap statistic, we refer to Tibshirani and others (2001) and Ma and Huang (2007). In our example, the optimal number of clusters is 33. Hence, we have 33 groups with group sizes ranging from 2 to 68. We use the cross-validated area under the ROC curve (CV-AUC) approach to select the tuning parameters. This approach computes the average predictive AUC over the validation datasets created by cross-validation and uses it to select the tuning parameters. We refer to Jiang and others (2013) for more details of the CV-AUC method.
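The permutation-based Gap procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the exact formula did not survive in this text, so the log form of Tibshirani and others (2001) is assumed (average log within-cluster sum of squares over permuted datasets minus the observed log value); the clustering is a naive average-linkage routine written out for self-containment; and the names `average_linkage_labels`, `within_ss`, and `gap_curve` are hypothetical.

```python
import numpy as np

def average_linkage_labels(points, k):
    """Naive agglomerative clustering (average linkage) into k clusters."""
    clusters = [[i] for i in range(len(points))]
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the two closest clusters
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for lab, members in enumerate(clusters):
        labels[members] = lab
    return labels

def within_ss(points, labels):
    """Total within-cluster sum of squares around each cluster centroid."""
    return sum(((points[labels == lab] - points[labels == lab].mean(axis=0)) ** 2).sum()
               for lab in np.unique(labels))

def gap_curve(x, k_values, n_perm=20, seed=0):
    """x: samples-by-genes matrix. Returns one gap value per K in k_values."""
    rng = np.random.default_rng(seed)
    genes = x.T  # genes are the objects being clustered
    # Step (1): observed log within-cluster sum of squares for each K.
    log_w = np.array([np.log(within_ss(genes, average_linkage_labels(genes, k)))
                      for k in k_values])
    log_w_perm = np.zeros((n_perm, len(k_values)))
    for b in range(n_perm):
        # Step (2): permute the measurements of each gene separately.
        shuffled = np.array([rng.permutation(g) for g in genes])
        log_w_perm[b] = [np.log(within_ss(shuffled, average_linkage_labels(shuffled, k)))
                         for k in k_values]
    # Assumed gap: permutation average minus observed value (log scale).
    return log_w_perm.mean(axis=0) - log_w

# Toy data: 30 samples, 12 genes driven by 3 latent signals (4 genes each).
rng = np.random.default_rng(1)
factors = rng.normal(size=(30, 3))
x = np.repeat(factors, 4, axis=1) + 0.1 * rng.normal(size=(30, 12))
k_values = [1, 2, 3, 4, 5]
gaps = gap_curve(x, k_values)
```

On data like this the gap rises sharply once the number of clusters reaches the number of latent signals and then flattens, which is why Tibshirani and others (2001) pair the statistic with a standard-error rule rather than a plain maximum. Every K passed in must be smaller than the number of genes so that the within-cluster sum of squares stays positive.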
Table 5 presents the results of the different penalties, based on 20 replications of 5-fold CV-AUC. The median and median absolute deviation (MAD) of CV-AUC, MS, and GMS are reported. The 1-norm gSCAD and gMCP have greater CV-AUC than the other methods. The
gSCAD and gMCP prefer models with small GMS. The SGL methods have similar results with three different choices of
. The standard concave penalties (Lasso, SCAD, and MCP) have smaller CV-AUC than the group methods. The results suggest that incorporating group information in general improves predictive performance. Among the grouped approaches, the 1-norm group penalty outperformed the others.
Table 5.
Results of different penalties in breast cancer study
| Method | CV-AUC (MAD) | GMS (MAD) | MS (MAD) |
|---|---|---|---|
| Lasso | 0.776 (0.013) | 29 (2) | 52 (6) |
| SCAD | 0.776 (0.013) | 29 (2) | 55 (9) |
| MCP | 0.776 (0.013) | 29 (2) | 52 (7) |
| SGL | 0.782 (0.008) | 27 (2) | 74.5 (15) |
| SGL | 0.803 (0.012) | 28 (1) | 80 (8.5) |
| SGL | 0.810 (0.011) | 30 (1) | 94 (4) |
| Group Lasso | 0.794 (0.009) | 12 (1) | 71 (11) |
| 2-norm gSCAD | 0.802 (0.007) | 6 (1) | 27 (7) |
| 2-norm gMCP | 0.802 (0.008) | 5 (0) | 20 (0) |
| 1-norm gSCAD | 0.825 (0.011) | 11.5 (4) | 44 (13) |
| 1-norm gMCP | 0.824 (0.011) | 11.5 (4.5) | 44 (13) |
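The summary statistics in Table 5 come from repeating a 5-fold CV-AUC computation 20 times and reporting the median and MAD. A minimal sketch of that workflow follows, with several assumptions made explicit: the folds are stratified by outcome so every fold contains both classes, the MAD is unscaled, and `fit_mean_difference` is a hypothetical placeholder scorer standing in for a penalized logistic fit. It is not the CV-AUC implementation of Jiang and others (2013).

```python
import random
import statistics

def auc(scores, labels):
    """AUC = P(random case outscores random control), ties counted as 1/2."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

def fit_mean_difference(train_x, train_y):
    """Placeholder scorer (NOT a penalized model): project onto the difference of class means."""
    def class_mean(label):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    weights = [a - b for a, b in zip(class_mean(1), class_mean(0))]
    return lambda row: sum(w * v for w, v in zip(weights, row))

def cv_auc(x, y, fit, n_folds=5, rng=random):
    """Average validation AUC over one stratified K-fold split (stratification is an assumption)."""
    cases = [i for i, label in enumerate(y) if label == 1]
    controls = [i for i, label in enumerate(y) if label == 0]
    rng.shuffle(cases)
    rng.shuffle(controls)
    folds = [cases[k::n_folds] + controls[k::n_folds] for k in range(n_folds)]
    fold_aucs = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        score = fit([x[i] for i in train], [y[i] for i in train])
        fold_aucs.append(auc([score(x[i]) for i in fold], [y[i] for i in fold]))
    return sum(fold_aucs) / n_folds

def median_and_mad(values):
    """Median and unscaled median absolute deviation."""
    med = statistics.median(values)
    return med, statistics.median([abs(v - med) for v in values])

# Toy data: 60 subjects, one informative and one noise feature.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
x = [[label + rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]
replicates = [cv_auc(x, y, fit_mean_difference, n_folds=5, rng=rng) for _ in range(20)]
summary = median_and_mad(replicates)
```

For tuning-parameter selection, the same `cv_auc` value would be computed for each candidate parameter setting and the setting with the largest value retained.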
6. Discussion
The proposed concave 1-norm group penalty has a bi-level selection feature under proper conditions. Its robustness to membership mis-specification is of particular interest in practice, since the true group information is usually unavailable. The recent SGL method also has a bi-level selection feature. However, it is sensitive to mis-specification because of its group Lasso component. The robustness of the proposed method is related to the 1-norm penalty within each group. Assuming the same probability of being identified at the group level, the 1-norm still gives freedom to individual members, while the 2-norm does not. Individual-level selection protects against over-control at the group level. Hence, under mis-specification, a causal variable is still likely to be picked even if the group it belongs to is not identified. Likewise, a null variable is less likely to be identified even if its group is selected. More work is needed to better understand the theoretical properties of the method.
Compared with the standard concave penalty, the 1-norm group penalty incorporates group information and thus achieves better control of the false discovery rate at both the group level and the individual level. When group information is correctly specified, the 1-norm group penalty is still capable of achieving a better FDR and GFDR in most of the cases explored. The R package “grppenalty” implements the proposed algorithms with sufficient efficiency and stability. Hence, the 1-norm group penalty is a valuable tool for variable selection in p > n problems.
Although the 1-norm group penalty is robust to mis-specified group information, we still want to approximate the true grouping structure as accurately as possible. How to achieve this goal is still an open question. We offer several possibilities below. For studies that use questionnaires to collect variable information, we suggest defining the group structure based on the design of the questionnaire. Most of these studies have blocks of questions that measure similar quantities of the study subjects. For example, questions attempting to quantify the intake of fat-rich foods are usually organized in one block and can be considered a group. Such a grouping structure is consistent with the perception of researchers and is easy to interpret. However, no statistical procedure justifies the membership information. Another approach is to perform a numerical exploration using index statistics such as the Gap statistic and then establish a group structure based on the index statistic. Such a structure takes the correlation among predictors into account and therefore in general leads to improved performance. A disadvantage of this grouping method is that it is difficult to interpret. A third way, which is more specific to genomic data, is to use available biological information about genes. Gene Ontology (GO) and the many databases on biological pathways would be a good place to start collecting such group information.
7. Funding
The work of Huang is supported in part by NIH Grant R01CA142774 and NSF Grant DMS-1208225.
Supplementary Material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Acknowledgement
We thank the editor, associate editor, and two referees for their helpful comments, which led to considerable improvements in the revision of the paper. Conflict of Interest: None declared.
References
- Breheny P., Huang J. (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface 2, 369–380.
- Donoho D. L., Johnstone I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455.
- Fan J., Li R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association 96(456), 1348–1360.
- Friedman J., Hastie T., Höfling H., Tibshirani R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics 1(2), 302–332.
- Friedman J., Hastie T., Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22.
- Friedman J., Hastie T., Tibshirani R. (2010). A note on the group lasso and a sparse group lasso. Technical report. http://arxiv.org/abs/1001.0736.
- Huang J., Breheny P., Ma S. (2012). A selective review of group selection in high dimensional model. Statistical Science 27(4), 481–499.
- Huang J., Ma S., Xie H. L., Zhang C.-H. (2009). A group bridge approach for variable selection. Biometrika 96, 339–355.
- Hunter D. R., Lange K. (2004). A tutorial on MM algorithms. American Statistician 58(1), 30–37.
- Hunter D. R., Li R. (2005). Variable selection using MM algorithms. Annals of Statistics 33(4), 1617–1642.
- Jiang D., Huang J., Zhang Y. (2013). The cross-validated AUC for MCP-logistic regression with high-dimensional data. Statistical Methods in Medical Research 22(5), 505–518.
- Jiang D., Huang J. (2014). Majorization minimization by coordinate descent for concave penalized generalized linear models. Statistics and Computing 24(5), 871–883.
- Lange K., Hunter D., Yang I. (2000). Optimization transfer using surrogate objective functions (with discussion). Journal of Computational and Graphical Statistics 9(1), 1–59.
- Ma S., Huang J. (2007). Clustering threshold gradient descent regularization: with applications to microarray studies. Bioinformatics 23(4), 466–472.
- Mazumder R., Friedman J., Hastie T. (2011). SparseNet coordinate descent with non-convex penalties. Journal of American Statistical Association 106(495), 1125–1138.
- Meier L., van de Geer S., Bühlmann P. (2008). The group lasso for logistic regression. Journal of Royal Statistical Society Series B 70(1), 53–71.
- Simon N., Friedman J., Hastie T., Tibshirani R. A sparse-group lasso. Technical report. Stanford University.
- Tibshirani R., Walther G., Hastie T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of Royal Statistical Society Series B 63(2), 411–423.
- van’t Veer L. J., Dai H., van de Vijver M. J. and others (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
- van de Vijver M. J., He Y. D., van’t Veer L. J. and others (2002). A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine 347(25), 1999–2009.
- Wu T. T., Lange K. (2008). Coordinate descent algorithms for Lasso penalized regression. Annals of Applied Statistics 2(1), 224–244.
- Yuan M., Lin Y. (2006). Model selection and estimation in regression with grouped variables. Journal of Royal Statistical Society Series B 68(1), 49–67.
- Zhang C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38(2), 894–942.
- Zhang C. H., Huang J. (2008). The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics 36(4), 1567–1594.