Published in final edited form as: Ann Stat. 2011;39(4):2021–2046. doi: 10.1214/11-aos897

The Sparse Laplacian Shrinkage Estimator for High-Dimensional Regression

Jian Huang 1,*, Shuangge Ma 2, Hongzhe Li 3, Cun-Hui Zhang 4
PMCID: PMC3217586  NIHMSID: NIHMS323903  PMID: 22102764

Abstract

We propose a new penalized method for variable selection and estimation that explicitly incorporates the correlation patterns among predictors. This method is based on a combination of the minimax concave penalty and the Laplacian quadratic associated with a graph as the penalty function. We call it the sparse Laplacian shrinkage (SLS) method. The SLS uses the minimax concave penalty for encouraging sparsity and the Laplacian quadratic penalty for promoting smoothness among coefficients associated with the correlated predictors. The SLS has a generalized grouping property with respect to the graph represented by the Laplacian quadratic. We show that the SLS possesses an oracle property in the sense that it is selection consistent and equal to the oracle Laplacian shrinkage estimator with high probability. This result holds in sparse, high-dimensional settings with p ≫ n under reasonable conditions. We derive a coordinate descent algorithm for computing the SLS estimates. Simulation studies are conducted to evaluate the performance of the SLS method and a real data example is used to illustrate its application.

Keywords: Graphical structure, minimax concave penalty, penalized regression, high-dimensional data, variable selection, oracle property

1 Introduction

There has been much work on penalized methods for variable selection and estimation in high-dimensional regression models. Several important methods have been proposed. Examples include estimators based on the bridge penalty (Frank and Friedman 1993), the ℓ1 penalty or the least absolute shrinkage and selection operator (LASSO, Tibshirani 1996; Chen, Donoho and Saunders 1998), the smoothly clipped absolute deviation (SCAD) penalty (Fan 1997; Fan and Li 2001), and the minimax concave penalty (MCP, Zhang 2010). These methods are able to do estimation and automatic variable selection simultaneously and provide a computationally feasible way to carry out variable selection in high-dimensional settings. Much progress has been made in understanding the theoretical properties of these methods, and efficient algorithms have been developed for implementing them.

A common feature of the methods mentioned above is that the penalty does not depend on the correlation among predictors. This can lead to unsatisfactory selection results, especially in p ≫ n settings. For example, as pointed out by Zou and Hastie (2005), the LASSO tends to select only one variable among a group of highly correlated variables, and its prediction performance may not be as good as that of ridge regression when there is high correlation among predictors. To overcome these limitations, Zou and Hastie (2005) proposed the elastic net (Enet) method, which uses a combination of the ℓ1 and ℓ2 penalties. Selection properties of the Enet and adaptive Enet have been studied by Jia and Yu (2010) and Zou and Zhang (2009). Bondell and Reich (2008) proposed the OSCAR (octagonal shrinkage and clustering algorithm for regression) approach, which uses a combination of the ℓ1 norm and a pairwise ℓ∞ norm for the coefficients. Huang et al. (2010) proposed the Mnet method, which uses a combination of the MCP and ℓ2 penalties. The Mnet estimator is equal to the oracle ridge estimator with high probability under certain conditions. These methods are effective in dealing with certain types of collinearity among predictors and have the useful grouping property of selecting or dropping highly correlated predictors together. Still, these combination penalties do not use any specific information on the correlation pattern among the predictors.

Li and Li (2008) proposed a network-constrained regularization procedure for variable selection and estimation in linear regression models, where the predictors are genomic data measured on genetic networks. Li and Li (2010) considered the general problem of regression analysis when predictors are measured on an undirected graph, which is assumed to be known a priori. They called their method a graph-constrained estimation procedure, or GRACE. The GRACE penalty is a combination of the ℓ1 penalty and a penalty that is the Laplacian quadratic associated with the graph. Because the GRACE uses the ℓ1 penalty for selection and sparsity, it has the same drawbacks as the Enet discussed above. In addition, full knowledge of the graphical structure for the predictors is usually not available, especially in high-dimensional problems. Daye and Jeng (2009) proposed the weighted fusion method, which also uses a combination of the ℓ1 penalty and a quadratic form that can incorporate information among correlated variables for estimation and variable selection. Tutz and Ulbricht (2009) studied a form of correlation-based penalty, which can be considered a special case of the general quadratic penalty; this approach by itself does not do variable selection, so the authors proposed a blockwise boosting procedure in combination with the correlation-based penalty for variable selection. Hebiri and van de Geer (2010) studied the theoretical properties of the smooth-Lasso and other ℓ1 + ℓ2-penalized methods in p ≫ n models. Pan, Xie and Shen (2009) studied a grouped penalty based on the Lγ-norm for γ > 1 that smoothes the regression coefficients over a network. In particular, when γ = 2 and after appropriate rescaling of the regression coefficients, this group Lγ penalty simplifies to the group Lasso (Yuan and Lin 2006) with the nodes in the network as groups. This method is capable of group selection, but it does not do individual variable selection. Also, because the group Lγ penalty is convex for γ > 1, it does not lead to consistent variable selection, even at the group level.

We propose a new penalized method for variable selection and estimation in sparse, high-dimensional settings that takes into account certain correlation patterns among predictors. We consider a combination of the MCP and Laplacian quadratic as the penalty function. We call the proposed approach the sparse Laplacian shrinkage (SLS) method. The SLS uses the MCP to promote sparsity and Laplacian quadratic penalty to encourage smoothness among coefficients associated with the correlated predictors. An important advantage of the MCP over the ℓ1 penalty is that it leads to estimators that are nearly unbiased and achieve selection consistency under weaker conditions (Zhang, 2010).

The contributions of this paper are as follows.

  • First, unlike the existing methods that use an ℓ1 penalty for selection and a ridge penalty or a general ℓ2 penalty for dealing with correlated predictors, we use the MCP to achieve nearly unbiased selection and propose a concrete class of quadratics, the Laplacians, for incorporating correlation patterns among predictors in a local fashion. In particular, we suggest employing approaches from network analysis for specifying the Laplacians. This provides an implementable strategy for incorporating correlation structures in high-dimensional data analysis.

  • Second, we prove that the SLS estimator is sign consistent and equal to the oracle Laplacian shrinkage estimator under reasonable conditions. This result holds for a large class of Laplacian quadratics. An important aspect of this result is that it allows the number of predictors to be larger than the sample size. In contrast, the works of Daye and Jeng (2009) and Tutz and Ulbricht (2009) do not contain such results in p ≫ n models. The selection consistency result of Hebiri and van de Geer (2010) requires certain strong assumptions on the magnitude of the smallest regression coefficient (their Assumption C) and on the correlation between important and unimportant predictors (their Assumption D), in addition to a variant of the restricted eigenvalue condition (their Assumption B). In comparison, our assumption involving the magnitude of the regression coefficients is weaker, and we use a sparse Riesz condition instead of imposing restrictions on the correlations among predictors. In addition, our selection results are stronger in that the SLS estimator is not only sign consistent, but also equal to the oracle Laplacian shrinkage estimator with high probability. In general, similar results are not available with the use of the ℓ1 penalty.

  • Third, we show that the SLS method is potentially capable of incorporating correlation structure in the analysis without incurring extra bias. The Enet and the more general ℓ1 + ℓ2 methods in general introduce extra bias due to the quadratic penalty, in addition to the bias resulting from the ℓ1 penalty. To the best of our knowledge, this point has not been discussed in the existing literature. We also demonstrate that the SLS has a certain local smoothing property with respect to the graphical structure of the predictors.

  • Fourth, unlike in the GRACE method, the SLS does not assume that the graphical structure for the predictors is known a priori. The SLS uses the existing data to construct the graph Laplacian or to augment partial knowledge of the graph structure.

  • Fifth, our simulation studies demonstrate that the SLS method outperforms the approach combining the ℓ1 penalty with a quadratic penalty as studied in Daye and Jeng (2009) and Hebiri and van de Geer (2010). In our simulation examples, the SLS in general has smaller empirical false discovery rates with comparable false negative rates. It also has smaller prediction errors.

This paper is organized as follows. In Section 2 we define the SLS estimator. In Section 3 we discuss ways to construct graph Laplacian, or equivalently, its corresponding adjacency matrix. In Section 4 we study the selection properties of the SLS estimators. In Section 5 we investigate the properties of Laplacian shrinkage. In Section 6 we describe a coordinate descent algorithm for computing the SLS estimators, present simulation results and an application of the SLS method to a microarray gene expression dataset. Discussions of the proposed method and results are given in Section 7. Proofs for the oracle properties of the SLS and other technical details are provided in the Appendix.

2 The sparse Laplacian shrinkage estimator

Consider the linear regression model

y = ∑_{j=1}^p x_j β_j + ε, (2.1)

with n observations and p potential predictors, where y = (y_1, …, y_n)′ is the vector of n response variables, x_j = (x_{1j}, …, x_{nj})′ is the jth predictor, β_j is the jth regression coefficient and ε = (ε_1, …, ε_n)′ is the vector of random errors. Let X = (x_1, …, x_p) be the n × p design matrix. Throughout, we assume that the response and predictors are centered and the predictors are standardized so that ∑_{i=1}^n x_{ij}^2 = n, j = 1, …, p. For λ = (λ1, λ2) with λ1 ≥ 0 and λ2 ≥ 0, we propose the penalized least squares criterion

M(b; λ, γ) = (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2, (2.2)

where || · || denotes the ℓ2 norm, ρ is the MCP with penalty parameter λ1 and regularization parameter γ, |a_{jk}| measures the strength of the connection between x_j and x_k, and s_{jk} = sgn(a_{jk}) is the sign of a_{jk}, with sgn(t) = −1, 0, or 1 for t < 0, = 0, or > 0, respectively. The two penalty terms in (2.2) play different roles. The first term promotes sparsity in the estimated model. The second term encourages smoothness of the estimated coefficients of the connected predictors. We can associate the quadratic form in this term with the Laplacian of a suitably defined undirected weighted graph for the predictors; see the description below. For any given (λ, γ), the SLS estimator is

β̂(λ, γ) = argmin_b M(b; λ, γ). (2.3)

The SLS uses the MCP, defined as

ρ(t; λ1, γ) = λ1 ∫_0^{|t|} (1 − x/(γλ1))_+ dx, (2.4)

where for any a ∈ ℝ, a_+ is the nonnegative part of a, i.e., a_+ = a 1{a ≥ 0}. The MCP can be easily understood by considering its derivative,

ρ̇(t; λ1, γ) = λ1 (1 − |t|/(γλ1))_+ sgn(t). (2.5)

We observe that the MCP begins by applying the same level of penalization as the ℓ1 penalty, but continuously reduces that level to 0 for |t| > γλ1. The regularization parameter γ controls the degree of concavity: larger values of γ make ρ less concave. By sliding the value of γ from 1 to ∞, the MCP provides a continuum of penalties, with the hard-thresholding penalty as γ → 1+ and the convex ℓ1 penalty at γ = ∞. A detailed discussion of the MCP can be found in Zhang (2010).
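The following minimal Python sketch illustrates (2.4) and (2.5). The closed form of the integral and the vectorized interface are our own additions for illustration; they are not part of the paper.

```python
import numpy as np

def mcp_penalty(t, lam1, gamma):
    """Minimax concave penalty rho(t; lam1, gamma) of (2.4), evaluated elementwise."""
    t = np.abs(np.asarray(t, dtype=float))
    # Closed form of the integral in (2.4):
    #   lam1*|t| - t^2/(2*gamma)   for |t| <= gamma*lam1
    #   gamma*lam1^2/2             for |t| >  gamma*lam1
    inside = t <= gamma * lam1
    return np.where(inside, lam1 * t - t**2 / (2.0 * gamma), 0.5 * gamma * lam1**2)

def mcp_derivative(t, lam1, gamma):
    """Derivative (2.5): lam1*(1 - |t|/(gamma*lam1))_+ * sgn(t)."""
    t = np.asarray(t, dtype=float)
    return lam1 * np.clip(1.0 - np.abs(t) / (gamma * lam1), 0.0, None) * np.sign(t)

# The penalization level starts at lam1 (like the l1 penalty) and decays to 0 at gamma*lam1.
print(mcp_derivative([0.1, 1.0, 5.0], lam1=0.5, gamma=3.0))
```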

The SLS also allows the use of different penalties than the MCP for ρ, including the SCAD (Fan 1997; Fan and Li 2001) and other quadratic splines. Because the MCP minimizes the maximum concavity measure and has the simplest form among nearly unbiased penalties in this family, we choose it as the default penalty for the SLS. Further discussion of the MCP and its comparison with the LASSO and SCAD can be found in Zhang (2010) and Mazumder, Friedman and Hastie (2009).

We express the nonnegative quadratic form in the second penalty term in (2.2) using a positive semi-definite matrix L, which satisfies

b′Lb = ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2, ∀ b ∈ ℝ^p.

For simplicity, we confine our discussion to the symmetric case where a_{kj} = a_{jk}, 1 ≤ j < k ≤ p. Since the diagonal elements a_{jj} do not appear in the quadratic form, we can define them any way we like for convenience. Let A = (a_{jk}, 1 ≤ j, k ≤ p) and D = diag(d_1, …, d_p), where d_j = ∑_{k=1}^p |a_{jk}|. We have ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2 = b′(D − A)b. Therefore, L = D − A. This matrix is associated with a labeled weighted graph G = (V, E) with vertex set V = {1, …, p} and edge set E = {(j, k) ∈ V × V : a_{jk} ≠ 0}. Here |a_{jk}| is the weight of edge (j, k) and d_j is the degree of vertex j; d_j is also called the connectivity of vertex j. The matrix L is called the Laplacian of G and A its signed adjacency matrix (Chung 1997). The edge (j, k) is labeled with the “+” or “−” sign, but its weight |a_{jk}| is always nonnegative. We use a labeled graph to accommodate the case where two predictors have a nonzero adjacency coefficient but are negatively correlated. Note that the usual adjacency matrix can be considered a special case of the signed adjacency matrix when all a_{jk} ≥ 0. For simplicity, we will use the term adjacency matrix below.

We usually require that the adjacency matrix to be sparse in the sense that many of its entries are zero or nearly zero. With a sparse adjacency matrix, the main characteristic of the shrinkage induced by the Laplacian penalty is that it occurs locally for the coefficients associated with the predictors connected in the graph. Intuitively, this can be seen by writing

λ2 ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2 = (1/2) λ2 ∑_{(j,k): a_{jk}≠0} |a_{jk}| (b_j − s_{jk} b_k)^2.

Thus for λ2 > 0, the Laplacian penalty shrinks b_j − s_{jk} b_k towards zero for a_{jk} ≠ 0. This can also be considered a type of local smoothing on the graph G associated with the adjacency matrix A. In comparison, the shrinkage induced by the ridge penalty used in the Enet is global, in that it shrinks all the coefficients towards zero regardless of the correlation structure among the predictors. We will discuss the Laplacian shrinkage in more detail in Section 5.

Using the matrix notation, the SLS criterion (2.2) can be written as

M(b; λ, γ) = (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 b′(D − A)b. (2.6)

Here the Laplacian is not normalized, meaning that the weight d_j is not standardized to 1. In problems where predictors should be treated without preference with respect to connectivity, we can first normalize the Laplacian, L* = I_p − A* with A* = D^{−1/2} A D^{−1/2}, and use the criterion

M(b; λ, γ) = (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 b′(I_p − A*)b.

Technically, a normalized Laplacian L* can be considered a special case of a general L. We only consider the SLS estimator based on the criterion (2.6) when studying its properties. In network analysis of gene expression data, genes with large connectivity also tend to have important biological functions (Zhang and Horvath 2005). Therefore, it is prudent to provide more protection for such genes in the selection process.
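To make the notation in (2.6) concrete, the sketch below builds L = D − A from a signed adjacency matrix and evaluates the SLS criterion at a given coefficient vector. It is a minimal illustration under our own naming conventions and reuses mcp_penalty from the sketch above.

```python
import numpy as np

def laplacian_from_signed_adjacency(A_signed):
    """Given a symmetric signed adjacency matrix with entries a_jk (zero diagonal),
    return L = D - A, where d_j = sum_k |a_jk|."""
    A = np.asarray(A_signed, dtype=float)
    D = np.diag(np.abs(A).sum(axis=1))
    return D - A

def sls_criterion(b, X, y, A_signed, lam1, lam2, gamma):
    """Penalized least squares criterion (2.6):
    (2n)^{-1}||y - Xb||^2 + sum_j rho(b_j; lam1, gamma) + (1/2) lam2 b'(D - A)b."""
    n = X.shape[0]
    L = laplacian_from_signed_adjacency(A_signed)
    rss = np.sum((y - X @ b) ** 2) / (2.0 * n)
    penalty1 = np.sum(mcp_penalty(b, lam1, gamma))   # MCP term, from the earlier sketch
    penalty2 = 0.5 * lam2 * b @ L @ b                # Laplacian quadratic term
    return rss + penalty1 + penalty2
```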

3 Construction of adjacency matrix

In this section, we describe several simple forms of adjacency measures proposed by Zhang and Horvath (2005), which have been successfully used in network analysis of gene expression data. The adjacency measure is often defined based on a notion of dissimilarity or similarity.

  1. A basic and widely used dissimilarity measure is the Euclidean distance. Based on this distance, we can define the adjacency coefficient as a_{jk} = φ(||x_j − x_k||/√n), where φ: [0, ∞) ↦ [0, ∞). A simple adjacency function is the threshold function φ(x) = 1{x ≤ 2r}. Then
    a_{jk} = 1 if ||x_j − x_k||/√n ≤ 2r, and a_{jk} = 0 if ||x_j − x_k||/√n > 2r. (3.1)

    It is convenient to express a_{jk} in terms of the Pearson correlation coefficient r_{jk} between x_j and x_k, where r_{jk} = x_j′x_k/(||x_j|| ||x_k||). For predictors that are standardized with ||x_j||^2 = n, 1 ≤ j ≤ p, we have ||x_j − x_k||^2/n = 2 − 2r_{jk}. Thus, in terms of correlation coefficients, we can write a_{jk} = 1{r_{jk} > r}. We determine the value of r based on the Fisher transformation z_{jk} = 0.5 log((1 + r_{jk})/(1 − r_{jk})). If the correlation between x_j and x_k is zero, √(n − 3) z_{jk} is approximately distributed as N(0, 1). We can use this to determine a threshold c for √(n − 3) z_{jk}. The corresponding threshold for r_{jk} is r = (exp(2c/√(n − 3)) − 1)/(exp(2c/√(n − 3)) + 1).

    We note that here we use the Fisher transformation to change the scale of the correlation coefficients from [−1, 1] to the normal scale for determining the threshold value r, so that the adjacency matrix is relatively sparse. We are not trying to test the significance of correlation coefficients.

  2. The adjacency coefficient in (3.1) is defined based on a dissimilarity measure. Adjacency coefficient can also be defined based on similarity measures. An often used similarity measure is Pearson’s correlation coefficient rjk. Other correlation measures such as Spearman’s correlation can also be used. Let
    s_{jk} = sgn(r_{jk}) and a_{jk} = s_{jk} 1{|r_{jk}| > r}.

    Here r can be determined using the Fisher transformation as above.

  3. With the power adjacency function considered in Zhang and Horvath (2005),
    a_{jk} = max(0, r_{jk})^α and s_{jk} = 1.

    Here α > 0 and can be determined by, for example, the scale-free topology criterion.

  4. A variation of the above power adjacency function is
    a_{jk} = |r_{jk}|^α and s_{jk} = sgn(r_{jk}).

For the adjacency matrices given above, (i) and (ii) use dichotomized measures, whereas (iii) and (iv) use continuous measures. Under (i) and (iii), two covariates are either positively connected/correlated or not connected at all. In contrast, under (ii) and (iv), two covariates are allowed to be negatively connected/correlated.
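A minimal sketch of the four rules above, together with the Fisher-transformation threshold of rule (i), is given below. The function names, the convention of returning the signed entries s_{jk}a_{jk} in a single matrix, and the rule labels are our own choices for illustration.

```python
import numpy as np

def fisher_threshold(n, c):
    """Correlation threshold r obtained from a cutoff c on sqrt(n-3)*z, z the Fisher transform."""
    return (np.exp(2.0 * c / np.sqrt(n - 3.0)) - 1.0) / (np.exp(2.0 * c / np.sqrt(n - 3.0)) + 1.0)

def adjacency(X, rule="power-signed", r=None, alpha=6.0):
    """Signed adjacency coefficients for the four rules of Section 3.
    X is assumed standardized so that each column has mean 0 and ||x_j||^2 = n."""
    n = X.shape[0]
    R = (X.T @ X) / n                      # Pearson correlations r_jk
    np.fill_diagonal(R, 0.0)               # diagonal entries are not used in the Laplacian quadratic
    if rule == "threshold":                # (i)   a_jk = 1{r_jk > r}, s_jk = 1
        A = (R > r).astype(float)
    elif rule == "threshold-signed":       # (ii)  a_jk = sgn(r_jk) 1{|r_jk| > r}
        A = np.sign(R) * (np.abs(R) > r)
    elif rule == "power":                  # (iii) a_jk = max(0, r_jk)^alpha, s_jk = 1
        A = np.maximum(R, 0.0) ** alpha
    elif rule == "power-signed":           # (iv)  a_jk = sgn(r_jk) |r_jk|^alpha
        A = np.sign(R) * np.abs(R) ** alpha
    else:
        raise ValueError("unknown rule")
    return A
```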

There are many other ways of constructing an adjacency matrix. For example, a popular adjacency measure in cluster analysis is a_{jk} = exp(−||x_j − x_k||^2/(nτ^2)) for τ > 0. The resulting adjacency matrix A = [a_{jk}] is the Gram matrix associated with the Gaussian kernel. For discrete covariates, the Pearson correlation coefficient can still be used as a measure of correlation or association between two discrete predictors or between a discrete predictor and a continuous one. For example, for single nucleotide polymorphism data, Pearson's correlation coefficient is often used as a measure of linkage disequilibrium (i.e., association) between two markers. Other measures, such as the odds ratio or measures of association based on contingency tables, can also be used for r_{jk}.

We note that how to construct the adjacency matrix is problem specific. Different applications may require different adjacency matrices. Since construction of adjacency matrix is not the focus of the present paper, we will only consider the use of the four adjacency matrices described above in our numerical studies in Section 6.

4 Oracle properties

In this section, we study the theoretical properties of the SLS estimator. Let the true value of the regression coefficient be β^o = (β_1^o, …, β_p^o)′. Denote O = {j : β_j^o ≠ 0}, the set of indices of the nonzero coefficients. Let do = |O| be the cardinality of O. Define

β̂^o(λ2) = argmin_b {(2n)^{-1} ||y − Xb||^2 + (1/2) λ2 b′Lb : b_j = 0, j ∉ O}. (4.1)

This is the oracle Laplacian shrinkage estimator on the set O. Theorems 1 and 2 below provide sufficient conditions under which P(sgn(β̂) ≠ sgn(β^o) or β̂ ≠ β̂^o) → 0. Thus under those conditions, the SLS estimator is sign consistent and equal to β̂^o with high probability.

We need the following notation in stating our results. Let Σ = n^{−1}X′X. For any A, B ⊆ {1, …, p}, any vector v, the design matrix X and any matrix V = (v_{ij})_{p×p}, define

v_B = (v_j, j ∈ B), X_B = (x_j, j ∈ B), V_{A,B} = (v_{ij}, i ∈ A, j ∈ B), V_B = V_{B,B}.

For example, Σ_B = X_B′X_B/n and Σ_B(λ2) = Σ_B + λ2 L_B. Let |B| denote the cardinality of B. Let cmin(λ2) be the smallest eigenvalue of Σ + λ2 L. We use the following constants to bound the bias of the Laplacian:

C1 = ||Σ_O^{−1}(λ2) L_O β_O^o||, C2 = ||{Σ_{O^c,O}(λ2) Σ_O^{−1}(λ2) L_O − L_{O^c,O}} β_O^o||. (4.2)

We make the following sub-Gaussian assumption on the error terms in (2.1).

Condition (A): For a certain constant ε ∈ (0, 1/3),

sup_{||u||=1} P{|u′ε| > σt} ≤ e^{−t^2/2}, 0 < t ≤ √(2 log(p/ε)).

4.1 Convex penalized loss

We first consider the case where Σ(λ2) = Σ + λ2 L is positive definite. Since (4.1) is the minimizer of the Laplacian shrinkage criterion restricted to the support O, it can be written explicitly as

β̂_O^o = (Σ_O + λ2 L_O)^{−1} X_O′ y/n, β̂_{O^c}^o = 0, (4.3)

provided that Σ_O(λ2) is invertible. Its expectation β* = E β̂^o, considered as a target of the SLS estimator, must satisfy

β*_O = (Σ_O + λ2 L_O)^{−1} Σ_O β_O^o, β*_{O^c} = 0. (4.4)

Condition (B): (i) cmin(λ2) > 1/γ, with ρ(t; λ1, γ) as in (2.2). (ii) The penalty levels satisfy

λ1 ≥ λ2 C2 + σ√(2 log((p − do)/ε)) max_{j≤p} ||x_j||/n,

with C2 in (4.2). (iii) With {v_j, j ∈ O} being the diagonal elements of Σ_O^{−1}(λ2) Σ_O {Σ_O^{−1}(λ2)}′,

min_{j∈O} {|β*_j| (n/v_j)^{1/2}} ≥ σ√(2 log(do/ε)).

Define β_* = min{|β_j^o| : j ∈ O}. If O is an empty set, that is, when all the regression coefficients are zero, we set β_* = ∞.

Theorem 1

Suppose Conditions (A) and (B) hold. Then,

P({j : β̂_j ≠ 0} ≠ O or β̂ ≠ β̂^o) ≤ 3ε. (4.5)

If β_* ≥ λ2 C1 + σ max_{j∈O} √((2v_j/n) log(do/ε)) instead of Condition (B) (iii), then

P(sgn(β̂) ≠ sgn(β^o) or β̂ ≠ β̂^o) ≤ 3ε. (4.6)

Here note that p, do, γ and cmin(λ2) are all allowed to depend on n.

The probability bound on the selection error in Theorem 1 is nonasymptotic. If the conditions of Theorem 1 hold with ε → 0, then (4.5) implies selection consistency of the SLS estimator and (4.6) implies sign consistency. The conditions are mild. Condition (A) concerns the tail probabilities of the error distribution and is satisfied if the errors are normally distributed. Condition (B) (i) ensures that the SLS criterion is strictly convex, so that the solution is unique. The oracle estimator β̂^o is biased due to the Laplacian shrinkage. Condition (B) (ii) requires a penalty level λ1 large enough to prevent this bias and the noise from causing false selection of variables in O^c. Condition (B) (iii) requires that the nonzero coefficients not be too small, so that the SLS estimator is able to distinguish nonzero from zero coefficients.

In Theorem 1, we only require cmin(λ2) > 0, or equivalently, Σ + λ2 L to be positive definite. The matrix Σ itself can be singular. This can be seen as follows. The adjacency matrix partitions the graph into disconnected cliques V_g, 1 ≤ g ≤ J, for some J ≥ 1. Let node j_g be a (representative) member of V_g. A node k belongs to the same clique V_g iff (if and only if) a_{j_g k_1} a_{k_1 k_2} ··· a_{k_m k} ≠ 0 through a certain chain j_g → k_1 → k_2 → ··· → k_m → k. Define x̄_g = ∑_{k∈V_g} a_{j_g k_1} a_{k_1 k_2} ··· a_{k_m k} x_k/|V_g|, where |V_g| is the cardinality of V_g. The matrix Σ + λ2 L is positive definite iff b′Σb = b′Lb = 0 implies b = 0. Since b′Lb = 0 implies ∑_{k∈V_g} b_k x_k = b_{j_g} |V_g| x̄_g, Σ + λ2 L is positive definite iff the vectors x̄_g are linearly independent. This does not require n ≥ p. In other words, Theorem 1 is applicable to p > n problems as long as the vectors x̄_g are linearly independent.
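For concreteness, the following is a minimal sketch of the oracle Laplacian shrinkage estimator in (4.3), which is computable only when the oracle set O is known (for example, in simulations); it assumes that Σ_O(λ2) is invertible, and the function name is ours.

```python
import numpy as np

def oracle_laplacian_shrinkage(X, y, L, O, lam2):
    """Oracle estimator (4.3): beta_O = (Sigma_O + lam2*L_O)^{-1} X_O'y/n, zero off O.
    O is an index array of the true support; L is the p x p Laplacian."""
    n, p = X.shape
    X_O = X[:, O]
    Sigma_O = X_O.T @ X_O / n
    L_O = L[np.ix_(O, O)]
    beta = np.zeros(p)
    beta[O] = np.linalg.solve(Sigma_O + lam2 * L_O, X_O.T @ y / n)
    return beta
```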

4.2 The nonconvex case

When Σ(λ2) = Σ + λ2 L is singular, Theorem 1 is not applicable. In this case, further conditions are required for the oracle property to hold. The key condition needed is the sparse Riesz condition, or SRC (Zhang and Huang 2008), in (4.9) below. It restricts the spectrum of the diagonal subblocks of Σ(λ2) up to a certain dimension.

Let X̃ = X̃(λ2) be a matrix satisfying X̃′X̃/n = Σ(λ2) = X′X/n + λ2 L and let ỹ = ỹ(λ2) be a vector satisfying X̃′ỹ = X′y. Define

M̃(b; λ, γ) = (2n)^{-1} ||ỹ − X̃b||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ). (4.7)

Since M(b; λ, γ) − M̃(b; λ, γ) = (||y||^2 − ||ỹ||^2)/(2n), the two penalized loss functions have the same set of local minimizers. For the penalized loss (4.7) with the data (X̃, ỹ), let

β̂(λ) = δ(X̃(λ2), ỹ(λ2), λ1), (4.8)

where the map δ(X, y, λ1) ∈ ℝ^p defines the MC+ estimator (Zhang, 2010) with data (X, y) and penalty level λ1. It was shown in Zhang (2010) that δ(X, y, λ1) depends on (X, y) only through X′y/n and X′X/n, so that different choices of X̃ and ỹ are allowed. One way is to pick ỹ = (y′, 0′)′ and X̃ = (X′, (nλ2 L)^{1/2})′. Another way is to pick X̃ with X̃′X̃/n = Σ(λ2) and ỹ = (X̃′)^+ X′y of smaller dimensions, where (X̃′)^+ is the Moore–Penrose inverse of X̃′.

Condition (C)

  1. For an integer d* and spectrum bounds 0 < c_*(λ2) ≤ c^*(λ2) < ∞,
    0 < c_*(λ2) ≤ u_B′ Σ_B(λ2) u_B ≤ c^*(λ2) < ∞, ∀ B with |B ∪ O| ≤ d*, ||u_B|| = 1, (4.9)

    with d* ≥ do(K_* + 1), γ ≥ c_*^{−1}(λ2) √(4 + c^*(λ2)/c_*(λ2)) in (2.2), and K_* = c^*(λ2)/c_*(λ2) − 1/2.

  2. With C̃2 = ||{Σ_{B,O}(λ2) Σ_O^{−1}(λ2) L_O − L_{B,O}} β_O^o||,
    max{1, √(c_*(λ2) K_*/(K_* + 1))} λ1 ≥ λ2 C̃2 + σ√(2 log(p/ε)) max_{j≤p} ||x_j||/n.
  3. With {v_j, j ∈ O} being the diagonal elements of Σ_O^{−1}(λ2) Σ_O {Σ_O^{−1}(λ2)}′,
    min_{j∈O} {|β*_j| − 2γ c^*(λ2) λ1} (n/v_j)^{1/2} ≥ σ√(2 log(do/ε)).

Theorem 2

  1. Suppose Conditions (A) and (C) hold. Let β̂(λ) be as in (4.8). Then,
    P({j : β̂_j ≠ 0} ≠ O or β̂ ≠ β̂^o) ≤ 3ε. (4.10)
    If β_* ≥ λ2 C1 + 2γ c^*(λ2) λ1 + σ max_{j∈O} √((2v_j/n) log(do/ε)) instead of Condition (C) (iii), then
    P(sgn(β̂) ≠ sgn(β^o) or β̂ ≠ β̂^o) ≤ 3ε. (4.11)

    Here note that p, γ, do, d*, K_*, ε, c_*(λ2) and c^*(λ2) are all allowed to depend on n, including the case c_*(λ2) → 0, as long as the conditions hold as stated.

  2. The statements in (i) also hold for all local minimizers β̂ of (2.6) or (4.7) satisfying #{j ∉ O : β̂_j ≠ 0} + do ≤ d*.

If the conditions of Theorem 2 hold with ε → 0, then (4.10) implies selection consistency of the SLS estimator and (4.11) implies sign consistency.

Condition (C), designed to handle the nonconvexity of the penalized loss, is a weaker version of Condition (B) in the sense of allowing a singular Σ(λ2). The SRC (4.9), which depends on X or X̃ only through the regularized Gram matrix X̃′X̃/n = Σ(λ2) = Σ + λ2 L, ensures that the model is identifiable in a lower, d*-dimensional space. When p > n, the smallest singular value of X is always zero. However, the requirement c_*(λ2) > 0 only concerns the d* × d* diagonal submatrices of Σ(λ2), not the Gram matrix Σ of the design matrix X. We can have p ≫ n but still require d*/do ≥ K_* + 1 as in (4.9). Since p, do, γ, d*, K_*, c_*(λ2) and c^*(λ2) can depend on n, we allow the case c_*(λ2) → 0 as long as Conditions (A) and (C) hold as stated. Thus, we allow p ≫ n but require that the model be sparse, in the sense that the number of nonzero coefficients do is smaller than d*/(1 + K_*). For example, if c_*(λ2) ≍ O(n^{−α}) for a small α > 0 and c^*(λ2) ≍ O(1), then we require γ of order n^{3α/2} or greater, K_* ≍ O(n^α) and d*/do of order n^α or greater. So all these quantities can depend on n, as long as the other requirements in Condition (C) are met.

By examining Conditions C (ii) and C (iii), for standardized predictors with ||x_j|| = √n, we can have log(p/ε) = o(n), or p = ε exp(o(n)), as long as Condition C (ii) is satisfied. As in Zhang (2010), under a somewhat stronger version of Condition (C), Theorem 2 can be extended to quadratic spline concave penalties satisfying ρ(t; λ1, γ) = λ1^2 ρ(t/λ1; γ), with a penalty function ρ(·; γ) satisfying (∂/∂t)ρ(t; γ) = 1 at t = 0+ and = 0 for t > γ.

Also, comparing our results with the selection consistency results of Hebiri and van de Geer (2010) on the smooth-Lasso and other ℓ1 + ℓ2-penalized methods, our conditions tend to be weaker. Notably, Hebiri and van de Geer (2010) require a condition on the Gram matrix which assumes that the correlations between the truly relevant variables and the irrelevant ones are small. No such assumption is required for our selection consistency results. In addition, our selection results are stronger in the sense that the SLS estimator is not only sign consistent, but also equal to the oracle Laplacian shrinkage estimator with high probability. In general, similar results are not available with the use of the ℓ1 penalty for sparsity.

Theorem 2 shows that the SLS estimator automatically adapts to the sparseness of the p-dimensional model and the denseness of a true submodel. From a sparse p-dimensional model, it correctly selects the true underlying model O. This underlying model is a dense model in the sense that all its coefficients are nonzero. In this dense model, the SLS estimator behaves like the oracle Laplacian shrinkage estimator in (4.1). As in the convex penalized loss setting, here the results do not require a correct specification of a population correlation structure of the predictors.

4.3 Unbiased Laplacian and variance reduction

There are two natural questions concerning the SLS. First, what are the benefits from introducing the Laplacian penalty? Second, what kind of Laplacian L constitutes a reasonable choice? Since the SLS estimator is equal to the oracle Laplacian estimator with high probability by Theorem 1 or 2, these questions can be answered by examining the oracle Laplacian shrinkage estimator (4.1), whose nonzero part is

β̂_O^o(λ2) = Σ_O^{−1}(λ2) X_O′ y/n.

Without the Laplacian, i.e., when λ2 = 0, it becomes the least squares (LS) estimator,

β̂_O^o(0) = Σ_O^{−1} X_O′ y/n.

If some of the predictors in {x_j, j ∈ O} are highly correlated or |O| ≥ n, the LS estimator β̂_O^o(0) is not stable or not unique. In comparison, as discussed below Theorem 1, Σ_O(λ2) = Σ_O + λ2 L_O can be a full rank matrix under a reasonable condition, even if the predictors in {x_j, j ∈ O} are highly correlated or |O| ≥ n.

For the second question, we examine the bias of β̂_O^o(λ2). Since the bias of the target vector (4.4) is β_O^o − β*_O(λ2) = λ2 Σ_O^{−1}(λ2) L_O β_O^o, the estimator β̂_O^o(λ2) is unbiased iff L_O β_O^o = 0. Therefore, in terms of bias reduction, a Laplacian L is most appropriate if the condition L_O β_O^o = 0 is satisfied. We shall say that a Laplacian L is unbiased if L_O β_O^o = 0. It follows from the discussion at the end of Subsection 4.1 that L_O β_O^o = 0 if β_k^o = β_{j_g}^o a_{j_g k_1} a_{k_1 k_2} ··· a_{k_m k}, where j_g is a representative member of the clique V_g ⊆ O and {k_1, …, k_m, k} ⊆ V_g ⊆ O.

With an unbiased Laplacian, the mean square error of β̂_O^o(λ2) is

E||β̂_O^o(λ2) − β_O^o||^2 = (σ^2/n) trace(Σ_O^{−1}(λ2) Σ_O Σ_O^{−1}(λ2)).

The mean square error of β̂_O^o(0) is

E||β̂_O^o(0) − β_O^o||^2 = (σ^2/n) trace(Σ_O^{−1}).

We always have E||β̂_O^o(λ2) − β_O^o||^2 < E||β̂_O^o(0) − β_O^o||^2 for λ2 > 0. Therefore, an unbiased Laplacian reduces variance without incurring any bias on the estimator.
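The two mean squared errors above can be compared numerically with a small helper of our own (a sketch, assuming an unbiased Laplacian so that both estimators are unbiased and the traces capture the full MSE):

```python
import numpy as np

def oracle_mse_traces(X_O, L_O, lam2, sigma2):
    """Compare (sigma^2/n) trace(Sigma_O^{-1}(lam2) Sigma_O Sigma_O^{-1}(lam2))
    with (sigma^2/n) trace(Sigma_O^{-1}), i.e., the MSEs of the oracle Laplacian
    shrinkage and oracle least squares estimators under an unbiased Laplacian."""
    n = X_O.shape[0]
    Sigma_O = X_O.T @ X_O / n
    M_inv = np.linalg.inv(Sigma_O + lam2 * L_O)
    mse_laplacian = sigma2 / n * np.trace(M_inv @ Sigma_O @ M_inv)
    mse_ls = sigma2 / n * np.trace(np.linalg.inv(Sigma_O))
    return mse_laplacian, mse_ls
```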

5 Laplacian shrinkage

The results in Section 4 show that the SLS estimator is equal to the oracle Laplacian shrinkage estimator with probability tending to one under certain conditions. In addition, an unbiased Laplacian reduces variance but does not increase bias. Therefore, to study the shrinkage effect of the Laplacian penalty on β̂, we can consider the oracle estimator β̂_O^o. To simplify the notation and without causing confusion, in this section we consider a generic model with q predictors and study some other basic properties of the Laplacian shrinkage, comparing it with the ridge shrinkage. The Laplacian shrinkage estimator is defined as

β̃(λ2) = argmin_b {G(b; λ2) ≡ (2n)^{-1} ||y − Xb||^2 + (1/2) λ2 b′Lb, b ∈ ℝ^q}. (5.1)

The following proposition shows that the Laplacian penalty shrinks a coefficient towards the center of all the coefficients connected to it.

Proposition 1

Let r̃ = y − Xβ̃.

  1. λ2 max_{1≤j≤q} d_j |β̃_j − a_j′β̃/d_j| ≤ n^{−1/2} ||r̃|| ≤ n^{−1/2} ||y||.
  2. λ2 |d_j β̃_j − a_j′β̃ − (d_k β̃_k − a_k′β̃)| ≤ (1/n) ||x_j − x_k|| ||y||.

Note that a_j′β̃/d_j = ∑_{k=1}^q a_{jk} β̃_k/d_j = ∑_{k=1}^q sgn(a_{jk}) |a_{jk}| β̃_k/d_j is a signed weighted average of the β̃_k's connected to β̃_j, since d_j = ∑_k |a_{jk}|. Part (i) of Proposition 1 provides an upper bound on the difference between β̃_j and the center of all the coefficients connected to it. When ||r̃||/(λ2 d_j) → 0, this difference converges to zero. For standardized d_j = 1, part (ii) implies that the difference between the centered β̃_j and β̃_k converges to zero when ||x_j − x_k|| ||y||/(λ2 n) → 0.

When there are certain local structures in the adjacency matrix A, shrinkage occurs at the local level. As an example, we consider the adjacency matrix based on a partition of the predictors into 2r-balls as defined via (3.1). Correspondingly, the index set {1, …, q} is divided into disjoint neighborhoods/cliques V_1, …, V_J. We consider the normalized Laplacian L = I_q − A, where I_q is a q × q identity matrix and A = diag(A_1, …, A_J) with A_g = v_g^{−1} 1_g 1_g′. Here v_g = |V_g|, 1 ≤ g ≤ J. Let b_g = (b_j, j ∈ V_g)′. We can write the objective function as

G(b; λ2) = (2n)^{-1} ||y − Xb||^2 + (1/2) λ2 ∑_{g=1}^J b_g′(I_g − v_g^{−1} 1_g 1_g′) b_g. (5.2)

For the Laplacian shrinkage estimator based on this criterion, we have the following grouping properties.

Proposition 2

  1. For any j, k ∈ Vg, 1 ≤ g ≤ J,
    λ2 |β̃_j − β̃_k| ≤ (1/n) ||x_j − x_k|| · ||y||, ∀ j, k ∈ V_g.
  2. Let β̄g be the average of the estimates in Vg. For any j ∈ Vg and k ∈ Vh, g ≠ h,
    λ2 |β̃_j − β̄_g − (β̃_k − β̄_h)| ≤ (1/n) ||x_j − x_k|| · ||y||, j ∈ V_g, k ∈ V_h.

This proposition characterizes the smoothing effect and grouping property of the Laplacian penalty in (5.2). Consider the case ||y||^2/n = O(1). Part (i) implies that, for j and k in the same neighborhood and λ2 > 0, the difference β̃_j − β̃_k → 0 if ||x_j − x_k||/(λ2 n^{1/2}) → 0. Part (ii) implies that, for j and k in different neighborhoods and λ2 > 0, the difference between the centered β̃_j and β̃_k converges to zero if ||x_j − x_k||/(λ2 n^{1/2}) → 0.

We now compare the Laplacian shrinkage and ridge shrinkage. The discussion at the end of Section 4 about the requirement for the unbiasedness of Laplacian can be put in a wider context when a general positive definite or semidefinite matrix Q is used in the place of L. This wider context includes the Laplacian shrinkage and ridge shrinkage as special cases. Specifically, let

β̂_Q(λ, γ) = argmin_b (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 b′Qb.

For Q = Ip, β̂Q becomes the Mnet estimator (Huang et al. 2010). With some modifications on the conditions in Theorem 1 or Theorem 2, it can be shown that β̂Q is equal to the oracle estimator defined as

β̂_Q^o(λ2) = argmin_b {(2n)^{-1} ||y − Xb||^2 + (1/2) λ2 b′Qb : b_j = 0, j ∉ O}.

Then, in a way similar to the discussion in Section 4, β̂_Q is nearly unbiased iff Q_O β_O^o = 0. Therefore, for ||β_O^o|| ≠ 0, Q_O must be a rank deficient matrix, which in turn implies that Q must be rank deficient. Note that any Laplacian L is rank deficient. This rank deficiency requirement excludes the ridge penalty with Q = I_p. For the ridge penalty to yield an unbiased estimator, it must hold that ||β^o|| = 0 in the underlying model.

We now give a simple example that illustrates the basic characteristics of Laplacian shrinkage and its differences from ridge shrinkage.

Example 5.1

Consider a linear regression model with two predictors satisfying ||xj||2 = n, j = 1, 2. The Laplacian shrinkage and ridge estimators are defined as

(b̂_{L1}(λ2), b̂_{L2}(λ2)) = argmin_{b1,b2} (2n)^{-1} ∑_{i=1}^n (y_i − x_{i1} b_1 − x_{i2} b_2)^2 + (1/2) λ2 (b_1 − b_2)^2,

and

(b̂_{R1}(λ2), b̂_{R2}(λ2)) = argmin_{b1,b2} (2n)^{-1} ∑_{i=1}^n (y_i − x_{i1} b_1 − x_{i2} b_2)^2 + (1/2) λ2 (b_1^2 + b_2^2).

Denote r1 = cor(x1, y), r2 = cor(x2, y) and r12 = cor(x1, x2). The Laplacian shrinkage estimates are

b̂_{L1}(λ2) = {(1 + λ2) r1 − (r12 − λ2) r2}/{(1 + λ2)^2 − (r12 − λ2)^2}, b̂_{L2}(λ2) = {(1 + λ2) r2 − (r12 − λ2) r1}/{(1 + λ2)^2 − (r12 − λ2)^2}.

Let

b̂_{ols1} = (r1 − r12 r2)/(1 − r12^2), b̂_{ols2} = (r2 − r12 r1)/(1 − r12^2), b̂_L(∞) = (r1 + r2)/(2(1 + r12)),

where (b̂_{ols1}, b̂_{ols2}) is the ordinary least squares (OLS) estimator for the bivariate regression, and b̂_L(∞) is the OLS estimator that assumes the two coefficients are equal, that is, it minimizes ∑_{i=1}^n (y_i − (x_{i1} + x_{i2}) b)^2. Let w_L = 2λ2/(1 − r12 + 2λ2). After some simple algebra, we have

b̂_{L1}(λ2) = (1 − w_L) b̂_{ols1} + w_L b̂_L(∞) and b̂_{L2}(λ2) = (1 − w_L) b̂_{ols2} + w_L b̂_L(∞).

Thus for any fixed λ2, b̂_L(λ2) is a weighted average of b̂_ols and b̂_L(∞), with the weights depending on λ2. When λ2 → ∞, b̂_{L1} → b̂_L(∞) and b̂_{L2} → b̂_L(∞). Therefore, the Laplacian penalty shrinks the OLS estimates towards a common value, which is the OLS estimate obtained by assuming equal regression coefficients.

Now consider the ridge regression estimator. We have

b̂_{R1}(λ2) = {(1 + λ2) r1 − r12 r2}/{(1 + λ2)^2 − r12^2} and b̂_{R2}(λ2) = {(1 + λ2) r2 − r12 r1}/{(1 + λ2)^2 − r12^2}.

The ridge estimator converges to zero as λ2 → ∞. For it to converge to a nontrivial solution, we need to rescale it by a factor of 1 + λ2. Let w_R = λ2/(1 + λ2 − r12^2). Let b̂_{u1} = r1 and b̂_{u2} = r2. Because n^{−1} ∑_{i=1}^n x_{i1}^2 = 1 and n^{−1} ∑_{i=1}^n x_{i2}^2 = 1, r1 and r2 are also the OLS estimators of the univariate regressions of y on x1 and y on x2, respectively. We can write

(1 + λ2) b̂_{R1}(λ2) = c_{λ2} (1 − w_R) b̂_{ols1} + c_{λ2} w_R b̂_{u1}, (1 + λ2) b̂_{R2}(λ2) = c_{λ2} (1 − w_R) b̂_{ols2} + c_{λ2} w_R b̂_{u2},

where c_{λ2} = {(1 + λ2)^2 − (1 + λ2) r12^2}/{(1 + λ2)^2 − r12^2}. Note that c_{λ2} ≈ 1. Thus (1 + λ2) b̂_R is a weighted average of the OLS and the univariate regression estimators. The ridge penalty shrinks the (rescaled) ridge estimates towards the individual univariate regression estimates.
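The closed forms in Example 5.1 can be verified numerically. The sketch below (our own illustration, with arbitrary values of r1, r2, r12 and λ2) solves the two 2 × 2 systems directly and checks the weighted-average representation of the Laplacian shrinkage estimate.

```python
import numpy as np

def two_predictor_shrinkage(r1, r2, r12, lam2):
    """Laplacian and ridge shrinkage estimates for the two-predictor model of Example 5.1."""
    # Laplacian shrinkage: normal equations of (1/2n)||y - x1 b1 - x2 b2||^2 + (1/2)lam2(b1 - b2)^2
    A_lap = np.array([[1 + lam2, r12 - lam2], [r12 - lam2, 1 + lam2]])
    b_lap = np.linalg.solve(A_lap, np.array([r1, r2]))
    # Ridge shrinkage: penalty (1/2)lam2(b1^2 + b2^2)
    A_ridge = np.array([[1 + lam2, r12], [r12, 1 + lam2]])
    b_ridge = np.linalg.solve(A_ridge, np.array([r1, r2]))
    return b_lap, b_ridge

r1, r2, r12, lam2 = 0.6, 0.4, 0.5, 2.0
b_lap, b_ridge = two_predictor_shrinkage(r1, r2, r12, lam2)
# Weighted-average representation of the Laplacian estimate:
b_ols = np.linalg.solve(np.array([[1, r12], [r12, 1]]), np.array([r1, r2]))
b_common = (r1 + r2) / (2 * (1 + r12))
w_L = 2 * lam2 / (1 - r12 + 2 * lam2)
print(np.allclose(b_lap, (1 - w_L) * b_ols + w_L * b_common))  # True
```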

6 Simulation studies

We use a coordinate descent algorithm to compute the SLS estimate. This algorithm optimizes a target function with respect to a single parameter at a time and iteratively cycles through all parameters until convergence. It was originally proposed for criteria with convex penalties such as the LASSO (Fu 1998; Genkin et al. 2004; Friedman et al. 2007; Wu and Lange 2007), and it has also been used to compute MCP estimates (Breheny and Huang 2011). Detailed steps of this algorithm for computing the SLS estimates can be found in the technical report accompanying this paper (Huang et al. 2010).
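A minimal sketch of one plausible coordinate descent cycle for criterion (2.6) is given below. The one-dimensional update is the standard firm-thresholding solution of the MCP-penalized scalar problem, with the Laplacian term absorbed into the quadratic part; it assumes standardized columns and a sufficiently convex coordinate-wise problem. The exact algorithm used in the paper is described in the accompanying technical report (Huang et al. 2010), and this sketch should not be read as that implementation.

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * max(abs(u) - lam, 0.0)

def sls_coordinate_descent(X, y, A_signed, lam1, lam2, gamma=3.0,
                           max_iter=1000, tol=1e-6):
    """One plausible coordinate descent scheme for the SLS criterion (2.6).
    Columns of X are assumed standardized so that x_j'x_j = n."""
    n, p = X.shape
    a = np.asarray(A_signed, dtype=float)          # signed adjacency entries, zero diagonal
    d = np.abs(a).sum(axis=1)                      # connectivities d_j
    b = np.zeros(p)
    r = y.copy()                                   # current residual y - Xb
    for _ in range(max_iter):
        b_old = b.copy()
        for j in range(p):
            # partial residual correlation plus the Laplacian pull toward connected coefficients
            z = X[:, j] @ r / n + b[j]
            u = z + lam2 * (a[j] @ b - a[j, j] * b[j])
            v = 1.0 + lam2 * d[j]
            if abs(u) <= v * gamma * lam1:         # MCP (firm) thresholding region
                b_new = soft_threshold(u, lam1) / (v - 1.0 / gamma)
            else:                                  # beyond gamma*lam1*v the penalty is flat
                b_new = u / v
            r += X[:, j] * (b[j] - b_new)          # keep the residual up to date
            b[j] = b_new
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b
```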

In simulation studies, we consider the following ways of defining the adjacency measure. (N.1) a_{jk} = I(r_{jk} > r) and s_{jk} = 1, where r is computed as described in Section 3 from the normal-scale cutoff c = 3.09, corresponding to a p-value of 10^{−3}; (N.2) a_{jk} = I(|r_{jk}| > r) and s_{jk} = sgn(r_{jk}), where r is computed from the cutoff c = 3.29, again corresponding to a p-value of 10^{−3}; (N.3) a_{jk} = max(0, r_{jk})^α and s_{jk} = 1, where we set α = 6, which satisfies the scale-free topology criterion (Zhang and Horvath 2005); (N.4) a_{jk} = |r_{jk}|^α and s_{jk} = sgn(r_{jk}), where we again set α = 6.

The penalty levels λ1 and λ2 are selected using V-fold cross validation. In our numerical study, we set V = 5. To reduce the computational cost, we search over the discrete grid {2^s : s = …, −1, −0.5, 0, 0.5, …}. For comparison, we also consider the MCP estimate and the approach proposed in Daye and Jeng (2009; referred to as D-J hereafter). Both the SLS and the MCP involve the regularization parameter γ. For the MCP, Zhang (2010) suggested using γ = 2/(1 − max_{j≠k} |x_j′x_k|/n) for standardized covariates; the average γ value of this choice is 2.69 in his simulation studies. The simulation studies in Breheny and Huang (2011) suggest that γ = 3 is a reasonable choice. We have experimented with different γ values and reached the same conclusion. Therefore, we set γ = 3.
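A hedged sketch of the five-fold cross-validation grid search described above is given below; it reuses sls_coordinate_descent from the earlier sketch, and the specific exponent range and fold-splitting details are illustrative (in practice the adjacency matrix could also be recomputed within each training fold).

```python
import numpy as np

def cv_select(X, y, A_signed, gamma=3.0, V=5, seed=0):
    """Select (lam1, lam2) by V-fold cross-validation over a grid of the form 2^{...,-1,-0.5,0,0.5,...}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = rng.permutation(n) % V
    grid = 2.0 ** np.arange(-3.0, 3.5, 0.5)        # illustrative range of exponents
    best = (np.inf, None)
    for lam1 in grid:
        for lam2 in grid:
            err = 0.0
            for v in range(V):
                tr, te = folds != v, folds == v
                b = sls_coordinate_descent(X[tr], y[tr], A_signed, lam1, lam2, gamma)
                err += np.sum((y[te] - X[te] @ b) ** 2)
            if err < best[0]:
                best = (err, (lam1, lam2))
    return best[1]
```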

We set n = 100 and p = 500. Among the 500 covariates, there are 100 clusters, each of size 5. We consider two different correlation structures. (I) Covariates in different clusters are independent, whereas covariates i and j within the same cluster have correlation coefficient ρ^{|i−j|}; and (II) covariates i and j have correlation coefficient ρ^{|i−j|}. Under structure I, zero and nonzero effects are independent, whereas under structure II, they are correlated. Covariates have marginal normal distributions with mean zero and variance one. We consider different levels of correlation with ρ = 0.1, 0.5, 0.9. Among the 500 covariates, the first 25 (5 clusters) have nonzero regression coefficients. We consider the following scenarios for the nonzero coefficients: (a) all the nonzero coefficients are equal to 0.5; and (b) the nonzero coefficients are randomly generated from the uniform distribution on [0.25, 0.75]. In (a), the Laplacian matrices satisfy the unbiasedness property L_O β_O^o = 0 discussed in Section 4. We have experimented with other levels of nonzero regression coefficients and reached similar conclusions.

We examine the accuracy of identifying nonzero covariate effects and the prediction performance. For this purpose, for each simulated dataset, we simulate an independent testing dataset with sample size 100. We conduct cross validation (for tuning parameter selection) and estimation using the training set only. We then make prediction for subjects in the testing set and compute the PMSE (prediction mean squared error).

We simulate 500 replicates and present the summary statistics in Table 1. We can see that the MCP performs satisfactorily when the correlation is small. However, when the correlation is high, it may miss a considerable number of true positives and have large prediction errors. The D-J approach, which can also accommodate the correlation structure, is able to identify all the true positives. However, it also identifies a large number of false positives, caused by the over-selection of the Lasso penalty. The proposed SLS approach outperforms the MCP and D-J methods in the sense that it has smaller empirical false discovery rates with comparable false negative rates. It also has significantly smaller prediction errors.

Table 1.

Simulation study: median based on 500 replicates. In each cell, the three numbers are positive findings, true positives, and PMSE ×100, respectively.

Coefficient | ρ | MCP | D-J N.1 | D-J N.2 | D-J N.3 | D-J N.4 | SLS N.1 | SLS N.2 | SLS N.3 | SLS N.4
Correlation structure I
0.5 | 0.1 | 27, 25, 41.33 | 61, 25, 125.34 | 53, 25, 46.64 | 55, 25, 60.14 | 59, 25, 51.24 | 27, 25, 40.53 | 27, 25, 39.84 | 26, 25, 41.74 | 27, 25, 39.34
0.5 | 0.5 | 28, 25, 54.10 | 51, 25, 66.38 | 67, 25, 66.84 | 72, 25, 56.22 | 63, 25, 53.43 | 27, 25, 37.71 | 28, 25, 39.18 | 28, 25, 33.87 | 27, 25, 36.00
0.5 | 0.9 | 22, 15, 137.52 | 66, 25, 55.51 | 55, 25, 56.94 | 61, 25, 49.22 | 74, 25, 51.41 | 29, 25, 48.89 | 28, 25, 49.96 | 29, 25, 45.16 | 27, 25, 41.49
U[.25, .75] | 0.1 | 37, 25, 52.24 | 72, 25, 54.28 | 61, 25, 88.00 | 59, 25, 70.00 | 78, 25, 60.51 | 33, 25, 51.80 | 36, 25, 52.19 | 30, 25, 53.03 | 30, 25, 52.22
U[.25, .75] | 0.5 | 29, 24, 65.12 | 66, 25, 78.76 | 54, 25, 72.34 | 63, 25, 63.55 | 57, 25, 66.33 | 28, 25, 42.24 | 28, 25, 43.96 | 27, 24, 54.72 | 28, 24, 58.77
U[.25, .75] | 0.9 | 17, 13, 152.42 | 67, 25, 63.43 | 62, 25, 57.30 | 50, 25, 53.88 | 74, 25, 57.98 | 29, 25, 47.73 | 29, 25, 49.14 | 27, 25, 48.49 | 28, 25, 50.83
Correlation structure II
0.5 | 0.1 | 26, 25, 38.22 | 62, 25, 121.69 | 58, 25, 117.10 | 63, 25, 127.34 | 72, 25, 122.34 | 27, 25, 40.33 | 27, 25, 40.65 | 27, 25, 41.49 | 27, 25, 37.40
0.5 | 0.5 | 29, 25, 53.01 | 52, 25, 55.99 | 49, 25, 62.04 | 66, 25, 62.70 | 65, 25, 64.41 | 27, 25, 36.97 | 28, 25, 39.47 | 28, 25, 38.53 | 27, 25, 39.53
0.5 | 0.9 | 15, 13, 140.69 | 48, 25, 55.75 | 34, 25, 56.71 | 32, 25, 60.27 | 38, 25, 59.78 | 29, 25, 66.79 | 29, 25, 60.52 | 29, 25, 57.91 | 30, 25, 60.19
U[.25, .75] | 0.1 | 37, 25, 54.31 | 77, 25, 60.02 | 72, 25, 66.14 | 74, 25, 78.32 | 66, 25, 74.50 | 29, 25, 50.05 | 32, 25, 51.34 | 37, 25, 50.74 | 29, 25, 49.47
U[.25, .75] | 0.5 | 27, 24, 57.66 | 74, 25, 61.71 | 66, 25, 67.54 | 75, 25, 62.01 | 74, 25, 66.91 | 28, 25, 44.92 | 28, 25, 46.65 | 28, 25, 41.35 | 28, 25, 41.17
U[.25, .75] | 0.9 | 14, 13, 136.49 | 33, 25, 61.50 | 35, 25, 55.08 | 34, 25, 54.54 | 38, 25, 60.67 | 29, 25, 56.87 | 29, 25, 57.03 | 30, 25, 53.28 | 30, 25, 56.79

6.1 Application to a microarray study

In the study reported in Scheetz et al. (2006), F1 animals were intercrossed and 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and microarray analysis using the Affymetrix GeneChip Rat Genome 230 2.0 Array. The intensity values were normalized using the RMA (robust multi-chip averaging; Bolstad 2003, Irizarry 2003) method to obtain summary expression values for each probe set. Gene expression levels were analyzed on a logarithmic scale. For the probe sets on the array, we first excluded those that were not expressed in the eye or that lacked sufficient variation. The definition of expressed was based on the empirical distribution of the RMA normalized values. For a probe set to be considered expressed, the maximum expression value observed for that probe among the 120 F2 rats was required to be greater than the 25th percentile of the entire set of RMA expression values. For a probe to be considered "sufficiently variable," it had to exhibit at least 2-fold variation in expression level among the 120 F2 animals.

We are interested in finding the genes whose expression is most variable and correlated with that of gene TRIM32. This gene was recently found to cause Bardet–Biedl syndrome (Chiang et al. 2006), which is a genetically heterogeneous disease of multiple organ systems, including the retina. One approach to finding the genes related to TRIM32 is to use regression analysis. Since it is expected that the number of genes associated with TRIM32 is small, and since we are mainly interested in genes whose expression values across samples are most variable, we conduct the following initial screening. We compute the variances of the gene expressions and select the top 1,000. We then standardize the gene expressions to have zero mean and unit variance.

We analyze the data using the MCP, D-J, and the proposed approach. In cross validation, we set V = 5. The numbers of genes identified are MCP: 23; D-J: 31 (N.1), 41 (N.2), 34 (N.3), 30 (N.4); SLS: 25 (N.1), 26 (N.2), 16 (N.3) and 17 (N.4). More detailed results are available from the authors. Different approaches and different ways of defining the adjacency measure lead to the identification of different genes. As expected, the SLS identifies shorter lists of genes than the D-J, which may lead to more parsimonious models and more focused hypotheses for confirmation. As the proposed approach pays special attention to the correlation among genes, we also compute the median of the absolute values of the correlations among the identified genes, which are MCP: 0.171; D-J: 0.201 (N.1), 0.207 (N.2), 0.215 (N.3), 0.206 (N.4); SLS: 0.247 (N.1), 0.208 (N.2), 0.228 (N.3), 0.212 (N.4). The D-J and SLS, which incorporate correlation in the penalty, identify genes that are more strongly correlated than those identified by the MCP. The SLS-identified genes have slightly higher correlations than those identified by the D-J.

Unlike in the simulation study, we are not able to evaluate true and false positives. This limitation is shared by most existing studies. We use the following V-fold (V = 5) cross-validation-based approach to evaluate prediction: (a) randomly split the data into V subsets of equal size; (b) remove one subset from the data; (c) conduct cross validation and estimation using the remaining V − 1 subsets; (d) make predictions for the removed subset; (e) repeat steps (b)–(d) over all subsets and compute the prediction error. The sums of squared prediction errors are MCP: 1.876; D-J: 1.951 (N.1), 1.694 (N.2), 1.534 (N.3) and 1.528 (N.4); SLS: 1.842 (N.1), 1.687 (N.2), 1.378 (N.3) and 1.441 (N.4). The SLS has smaller cross-validated prediction errors, which may indirectly suggest better selection properties.
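Steps (a)-(e) amount to an outer cross-validation loop with tuning done inside each training portion. A minimal sketch, reusing cv_select and sls_coordinate_descent from the sketches in Section 6 (the fold-splitting details are illustrative):

```python
import numpy as np

def cv_prediction_error(X, y, A_signed, gamma=3.0, V=5, seed=1):
    """Outer V-fold split for prediction error; (lam1, lam2) are tuned
    within each training portion by an inner cross-validation."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = rng.permutation(n) % V
    sse = 0.0
    for v in range(V):
        tr, te = folds != v, folds == v
        lam1, lam2 = cv_select(X[tr], y[tr], A_signed, gamma=gamma, V=V)
        b = sls_coordinate_descent(X[tr], y[tr], A_signed, lam1, lam2, gamma)
        sse += np.sum((y[te] - X[te] @ b) ** 2)
    return sse
```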

7 Discussion

In this article, we propose the SLS method for variable selection and estimation in high-dimensional data analysis. The most important feature of the SLS is that it explicitly incorporates the graph/network structure in predictors into the variable selection procedure through the Laplacian quadratic. It provides a systematic framework for connecting penalized methods for consistent variable selection and those for network and correlation analysis. As can be seen from the methodological development, the application of the SLS variable selection is relatively independent of the graph/network construction. Thus, although graph/network construction is of significant importance, it is not the focus of this study and not thoroughly pursued.

An important feature of the SLS method is that it incorporates the correlation patterns of the predictors into variable selection through the Laplacian quadratic. We have considered two simple approaches for determining the Laplacian based on dissimilarity and similarity measures. Our simulation studies demonstrate that incorporating correlation patterns improves selection results and prediction performance. Our theoretical results on the selection properties of the SLS are applicable to a general class of Laplacians and do not require the underlying graph for the predictors to be correctly specified.

We provide sufficient conditions under which the SLS estimator possesses an oracle property, meaning that it is sign consistent and equal to the oracle Laplacian shrinkage estimator with high probability. We also study the grouping properties of the SLS estimator. Our results show that the SLS is adaptive to the sparseness of the original p-dimensional model with pn and the denseness of the underlying do-dimensional model, where do < n is the number of nonzero coefficients. The asymptotic rates of the penalty parameters are derived. However, as in many recent studies, it is not clear whether the penalty parameters selected using cross validation or other procedures can match the asymptotic rate. This is an important and challenging problem that requires further investigation, but is beyond the scope of the current paper. Our numerical study shows a satisfactory finite-sample performance of the SLS. Particularly, we note that the cross validation selected tuning parameters seem sufficient for our simulated data. We are only able to experiment with four different adjacency measures. It is not our intention to draw conclusions on different ways of defining adjacency. More adjacency measures are hence not explored.

We have focused on the linear regression model in this article. However, the SLS method can be extended to more general regression models. Specifically, the SLS criterion can be formulated as

(1/n) ∑_{i=1}^n ℓ(y_i, b_0 + ∑_j x_{ij} b_j) + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2,

where ℓ is a given loss function. For instance, for generalized linear models such as logistic regression, we can take ℓ to be the negative log-likelihood function. For Cox regression, we can use the negative partial likelihood as the loss function. Computationally, for loss functions other than least squares, the coordinate descent algorithm can be applied iteratively to quadratic approximations to the loss function. However, further work is needed to study theoretical properties of the SLS estimators for general linear models.
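For instance, with the logistic negative log-likelihood as ℓ, the penalized objective can be written down directly; the sketch below (our own illustration, reusing mcp_penalty and laplacian_from_signed_adjacency from Section 2) only evaluates the criterion, with the optimization left to a quadratic-approximation coordinate descent as noted above.

```python
import numpy as np

def sls_logistic_objective(b0, b, X, y, A_signed, lam1, lam2, gamma):
    """SLS criterion with the logistic negative log-likelihood as the loss; y is coded 0/1."""
    eta = b0 + X @ b
    # negative log-likelihood of the logistic model, averaged over observations:
    # log(1 + exp(eta)) - y*eta, computed stably via logaddexp
    nll = np.mean(np.logaddexp(0.0, eta) - y * eta)
    L = laplacian_from_signed_adjacency(A_signed)
    return nll + np.sum(mcp_penalty(b, lam1, gamma)) + 0.5 * lam2 * b @ L @ b
```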

There is a large literature on the analysis of network data and much work has also been done on estimating sparse covariance matrices in high-dimensional settings. See for example, Zhang and Horvath (2005), Chung and Lu (2006), Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Friedman, Hastie and Tibshirani (2008), Fan, Feng and Wu (2009), among others. It would be useful to study ways to incorporate these methods and results into the proposed SLS approach. In some problems such as genomic data analysis, partial external information may also be available on the graphical structure of some genes used as predictors in the model. It would be interesting to consider approaches for combining external information on the graphical structure with existing data in constructing the Laplacian quadratic penalty.

Acknowledgments

We wish to thank two anonymous referees, the associate editor and editor for their helpful comments which led to considerable improvements in the presentation of the paper. The research of Huang is partially supported by NIH grants R01CA120988, R01CA142774 and NSF grant DMS 0805670. The research of Ma is partially supported by NIH grants R01CA120988, R01CA142774, R03LM009754 and R03LM009828. The research of Li is partially supported by NIH grants R01ES009911 and R01CA127334. The research of Zhang is partially supported by NSF grants DMS 0604571, DMS 0804626 and NSA grant MDS 904-02-1-0063.

8 Appendix

In the appendix, we give proofs of Theorems 1 and 2 and Propositions 1 and 2.

Proof of Theorem 1

Since cmin(λ2) > 1/γ, the criterion (2.2) is strictly convex and its minimizer is unique. Let X̃ = X̃(λ2) = √n (Σ + λ2 L)^{1/2}, ỹ = ỹ(λ2) = (X̃′)^{−1} X′y and

M̃(b; λ, γ) = (2n)^{−1} ||ỹ − X̃b||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ).

Since (X̃′X̃/n, X̃′ỹ) = (Σ + λ2 L, X′y), M(b; λ, γ) − M̃(b; λ, γ) = (||y||^2 − ||ỹ||^2)/(2n) does not depend on b. Thus, β̂ is the minimizer of M̃(b; λ, γ).

Since |β̂_j^o| ≥ γλ1 gives ρ̇(|β̂_j^o|; λ1, γ) = 0, the KKT conditions hold for M̃(b; λ, γ) at β̂(λ) = β̂^o(λ) in the intersection of the events

Ω1 = {||X̃_{O^c}′(ỹ − X̃β̂^o)/n||_∞ ≤ λ1}, Ω2 = {min_{j∈O} sgn(β*_j) β̂_j^o ≥ γλ1}. (8.1)

Let ε̃* = ỹ − X̃β* = ε̃ + Eε̃*, with ε̃ = ỹ − Eỹ. Since X̃′ỹ = X′y and both β^o and β* are supported in O,

X̃_B′Eε̃*/n = X_B′Xβ^o/n − X̃_B′X̃β*/n = Σ_{B,O} β_O^o − Σ_{B,O}(λ2) Σ_O^{−1}(λ2) Σ_O β_O^o = λ2 {Σ_{B,O}(λ2) Σ_O^{−1}(λ2) L_O − L_{B,O}} β_O^o, (8.2)

which describes the effect of the bias of β̂^o on the gradient in the linear model ỹ = X̃β* + ε̃*. Since X̃_O′Eε̃*/n = 0, we have ||X̃_{O^c}′Eε̃*/n|| = λ2 C2.

Since X̃′ε̃ = X̃′ỹ − E X̃′ỹ = X′y − E X′y = X′ε, (8.2) gives

Ω1 ⊇ {||X_{O^c}′ε/n||_∞ < λ1 − λ2 C2}. (8.3)

The estimator β̂_O^o = Σ_O^{−1}(λ2) X_O′y/n can be written as β*_O + ((v_j/n)^{1/2} u_j′ε, j ∈ O)′, where ||u_j|| = 1 and {v_j, j ∈ O} are the diagonal elements of Σ_O^{−1}(λ2) Σ_O {Σ_O^{−1}(λ2)}′. Thus,

Ω2^c ⊆ ⋃_{j∈O} {sgn(β*_j) u_j′ε ≤ (n/v_j)^{1/2} (γλ1 − |β*_j|) ≤ −σ√(2 log(do/ε))}. (8.4)

Since λ1 ≥ λ2 C2 + σ√(2 log((p − do)/ε)) max_{j≤p} ||x_j||/n, the sub-Gaussian condition (A) yields

1 − P{Ω1 ∩ Ω2} ≤ P{||X_{O^c}′ε/n||_∞ > σ√(2 log((p − do)/ε)) max_{j≤p} ||x_j||/n} + ∑_{j∈O} P{sgn(β*_j) u_j′ε ≤ −σ√(2 log(do/ε))} ≤ 2(p − do) ε/(p − do) + do ε/do = 3ε.

The proof of (4.5) is complete, since β̂_j^o ≠ 0 for all j ∈ O in Ω2.

For the proof of (4.6), we have ||β*_O − β_O^o|| = λ2 C1 due to

β*_O − β_O^o = Σ_O^{−1}(λ2) Σ_O β_O^o − β_O^o = −λ2 Σ_O^{−1}(λ2) L_O β_O^o. (8.5)

It follows that the condition on β_* implies Condition (B) (iii), with sgn(β*_O) = sgn(β_O^o) = sgn(β̂_O^o) in Ω2.

Proof of Theorem 2

For m ≥ 1 and vectors v in the range of X̃, define

ζ̃(v; m, O, λ2) = max{||(P̃_B − P̃_O)v|| (mn)^{−1/2} : O ⊂ B ⊂ {1, …, p}, |B| = m + |O|}, (8.6)

where P̃_B = X̃_B(X̃_B′X̃_B)^{−1}X̃_B′. Here ζ̃ depends on λ2 through X̃. Since β̂(λ) is the MC+ estimator based on the data (X̃, ỹ) at penalty level λ1 and (4.9) holds for Σ(λ2) = X̃′X̃/n, the proof of Theorem 5 in Zhang (2010) gives β̂(λ) = β̂^o(λ) in the event Ω = ∩_{j=1}^3 Ω_j, where Ω1 = {||X̃_{O^c}′(ỹ − X̃β̂^o)/n||_∞ ≤ λ1} is as in (8.1) and

Ω2 = {min_{j∈O} sgn(β*_j) β̂_j^o > 2γ c^*(λ2) λ1}, Ω3 = {ζ̃(y − Xβ*; d* − |O|, O, λ2) ≤ λ1}.

Note that (λ1,ε, λ2,ε, λ3,ε, α) in Zhang (2010) is identified with (λ1, 2c^*(λ2)λ1, λ1, 1/2) here.

Let ε̃* = ỹ − X̃β* = ε̃ + Eε̃*, with ε̃ = ỹ − Eỹ. Since X̃′ỹ = X′y, (8.2) still holds, now with ||X̃_B′Eε̃*/n||_∞ ≤ λ2 C̃2. Since X̃′ε̃ = X′y − E X′y = X′ε, (8.2) still gives (8.3). A slight modification of the argument for (8.4) yields

Ω2^c ⊆ ⋃_{j∈O} {sgn(β*_j) u_j′ε ≤ (n/v_j)^{1/2} (2γ c^*(λ2) λ1 − |β*_j|) ≤ −σ√(2 log(do/ε))}. (8.7)

For |B| ≤ d*, we have ||P̃_B Eε̃*||/√n = ||Σ_B^{−1/2}(λ2) X̃_B′Eε̃*||/n ≤ ||X̃_B′Eε̃*/n||_∞ √(|B|/c_*(λ2)) and ||P̃_B ε̃||/√n = ||Σ_B^{−1/2}(λ2) X̃_B′ε̃||/n ≤ ||X̃_B′ε̃/n||_∞ √(|B|/c_*(λ2)). Thus, by (8.6),

ζ̃(y − Xβ*; d* − |O|, O, λ2) = ζ̃(ε̃ + Eε̃*; d* − |O|, O, λ2) ≤ (||X̃′ε̃/n||_∞ + λ2 C̃2) √(d*/((d* − |O|) c_*(λ2))).

Since |O| ≤ d*/(K_* + 1), this gives

Ω3 ⊇ {||X̃′ε̃/n||_∞ < √(c_*(λ2) K_*/(K_* + 1)) λ1 − λ2 C̃2}. (8.8)

Since max{1, √(c_*(λ2) K_*/(K_* + 1))} λ1 ≥ λ2 C̃2 + σ√(2 log(p/ε)) max_{j≤p} ||x_j||/n, (8.3), (8.7), (8.8) and Condition (A) imply

1 − P{Ω1 ∩ Ω3} + P{Ω2^c} ≤ P{||X′ε/n||_∞ > σ√(2 log(p/ε)) max_{j≤p} ||x_j||/n} + ∑_{j∈O} P{sgn(β*_j) u_j′ε ≤ −σ√(2 log(do/ε))} ≤ 2p(ε/p) + do ε/do = 3ε.

The proof of (4.10) is complete, since β̂_j^o ≠ 0 for all j ∈ O in Ω2. We omit the proof of (4.11), since it is identical to that of (4.6).

Proof of Proposition 1

The β̃ satisfies

−(1/n) x_j′(y − Xβ̃) + λ2 (d_j β̃_j − a_j′β̃) = 0, 1 ≤ j ≤ q. (8.9)

Therefore, by the Cauchy–Schwarz inequality and using ||x_j||^2 = n, we have

λ2 max_{1≤j≤q} |d_j β̃_j − a_j′β̃| ≤ (1/n) max_{1≤j≤q} |x_j′(y − Xβ̃)| ≤ n^{−1/2} ||r̃||.

Now because G(β̃; λ2) ≤ G(0; λ2), we have ||r̃|| ≤ ||y||. This proves part (i).

For part (ii), note that we have

λ2 (d_j β̃_j − a_j′β̃ − (d_k β̃_k − a_k′β̃)) = (1/n) (x_j − x_k)′ r̃.

Thus

λ2 |d_j β̃_j − a_j′β̃ − (d_k β̃_k − a_k′β̃)| ≤ (1/n) ||x_j − x_k|| ||r̃||.

Part (ii) follows.

Proof of Proposition 2

The β̃ must satisfy

−(1/n) x_j′(y − Xβ̃) + λ2 (β̃_j − v_g^{−1} 1_g′β̃_g) = 0, j ∈ V_g, 1 ≤ g ≤ J. (8.10)

Taking the difference between the jth and kth equations in (8.10) for j, k ∈ V_g, we get

λ2 (β̃_j − β̃_k) = (1/n) (x_j − x_k)′(y − Xβ̃), j, k ∈ V_g.

Therefore,

λ2 |β̃_j − β̃_k| ≤ (1/n) ||x_j − x_k|| · ||y − Xβ̃||, j, k ∈ V_g.

Part (i) follows from this inequality.

Define β̄_g = v_g^{−1} 1_g′β̃_g. This is the average of the elements of β̃_g. For any j ∈ V_g and k ∈ V_h, g ≠ h, we have

λ2 (β̃_j − β̄_g − (β̃_k − β̄_h)) = (1/n) (x_j − x_k)′(y − Xβ̃), j ∈ V_g, k ∈ V_h.

Thus part (ii) follows. This completes the proof of Proposition 2.

Contributor Information

Jian Huang, Email: jian-huang@uiowa.edu.

Shuangge Ma, Email: shuangge.ma@yale.edu.

Hongzhe Li, Email: hongzhe@upenn.edu.

Cun-Hui Zhang, Email: cunhui@stat.rutgers.edu.

References

  • 1. Bondell HD, Reich BJ. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008;64:115–123.
  • 2. Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression methods. Ann Appl Statist. 2011;5:232–253.
  • 3. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J Sci Comput. 1998;20:33–61.
  • 4. Chiang AP, Beck JS, Yen HJ, Tayeh MK, Scheetz TE, Swiderski R, Nishimura D, Braun TA, Kim KY, Huang J, Elbedour K, Carmi R, Slusarski DC, Casavant TL, Stone EM, Sheffield VC. Homozygosity mapping with SNP arrays identifies a novel gene for Bardet–Biedl syndrome (BBS10). Proceedings of the National Academy of Sciences. 2006;103:6287–6292.
  • 5. Chung FRK. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics. Amer Math Soc; 1997.
  • 6. Chung FRK, Lu L. Complex Graphs and Networks. CBMS Regional Conference Series in Mathematics. Amer Math Soc; 2006.
  • 7. Daye JZ, Jeng JX. Shrinkage and model selection with correlated variables via weighted fusion. Computational Statistics and Data Analysis. 2009;53:1284–1298.
  • 8. Fan J. Comments on "Wavelets in statistics: a review" by A. Antoniadis. J Italian Statist Assoc. 1997;6:131–138.
  • 9. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  • 10. Fan J, Feng Y, Wu Y. Network exploration via the adaptive LASSO and SCAD penalties. Ann Appl Statist. 2009;3:521–541.
  • 11. Frank IE, Friedman JH. A statistical view of some chemometrics regression tools (with discussion). Technometrics. 1993;35:109–148.
  • 12. Friedman J, Hastie T, Hoefling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Statist. 2007;1:302–332.
  • 13. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441.
  • 14. Fu WJ. Penalized regressions: the bridge versus the LASSO. J Comp Graph Statist. 1998;7:397–416.
  • 15. Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technical Report. DIMACS, Rutgers University; 2004.
  • 16. Hebiri M, van de Geer S. The smooth-Lasso and other ℓ1 + ℓ2-penalized methods. 2010. Preprint. Available at http://arxiv4.library.cornell.edu/PScache/arxiv/pdf/1003/1003.4885v1.pdf.
  • 17. Huang J, Breheny P, Ma S, Zhang C-H. The Mnet method for variable selection. Technical Report #402. Department of Statistics and Actuarial Science, University of Iowa; 2010.
  • 18. Huang J, Ma S, Li H, Zhang C-H. The Sparse Laplacian Shrinkage Estimator for High-Dimensional Regression. Technical Report #403. Department of Statistics and Actuarial Science, University of Iowa; 2010.
  • 19. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264.
  • 20. Jia J, Yu B. On model selection consistency of elastic net when p ≫ n. Statistica Sinica. 2010;20:595–611.
  • 21. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–1182.
  • 22. Li C, Li H. Variable selection and regression analysis for covariates with graphical structure. Ann Appl Statist. 2010;4:1498–1516.
  • 23. Mazumder R, Friedman J, Hastie T. SparseNet: Coordinate descent with non-convex penalties. Technical Report. Department of Statistics, Stanford University; 2009.
  • 24. Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34:1436–1462.
  • 25. Pan W, Xie B, Shen X. Incorporating predictor network in penalized regression with application to microarray data. Biometrics. 2009. In press.
  • 26. Scheetz TE, Kim KYA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434.
  • 27. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Statist Soc B. 1996;58:267–288.
  • 28. Tutz G, Ulbricht J. Penalized regression with correlation-based penalty. Statist Comput. 2009;19:239–253.
  • 29. Wu T, Lange K. Coordinate descent procedures for lasso penalized regression. Ann Appl Statist. 2007;2:224–244.
  • 30. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Statist Soc B. 2006;68:49–67.
  • 31. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
  • 32. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statist Appl Genet Mol Biol. 2005;4: Article 17.
  • 33. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942.
  • 34. Zhang CH, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann Statist. 2008;36:1567–1594.
  • 35. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Statist Soc B. 2005;67:301–320.
  • 36. Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Ann Statist. 2009;37:1733–1751.
