Published in final edited form as: Ann Stat. 2011;39(4):2021–2046. doi: 10.1214/11-aos897

The Sparse Laplacian Shrinkage Estimator for High-Dimensional Regression

Jian Huang 1,*, Shuangge Ma 2, Hongzhe Li 3, Cun-Hui Zhang 4
PMCID: PMC3217586  NIHMSID: NIHMS323903  PMID: 22102764

Abstract

We propose a new penalized method for variable selection and estimation that explicitly incorporates the correlation patterns among predictors. This method is based on a combination of the minimax concave penalty and the Laplacian quadratic associated with a graph as the penalty function. We call it the sparse Laplacian shrinkage (SLS) method. The SLS uses the minimax concave penalty for encouraging sparsity and the Laplacian quadratic penalty for promoting smoothness among coefficients associated with the correlated predictors. The SLS has a generalized grouping property with respect to the graph represented by the Laplacian quadratic. We show that the SLS possesses an oracle property in the sense that it is selection consistent and equal to the oracle Laplacian shrinkage estimator with high probability. This result holds in sparse, high-dimensional settings with p ≫ n under reasonable conditions. We derive a coordinate descent algorithm for computing the SLS estimates. Simulation studies are conducted to evaluate the performance of the SLS method and a real data example is used to illustrate its application.

Keywords: Graphical structure, minimax concave penalty, penalized regression, high-dimensional data, variable selection, oracle property

1 Introduction

There has been much work on penalized methods for variable selection and estimation in high-dimensional regression models. Several important methods have been proposed. Examples include estimators based on the bridge penalty (Frank and Friedman 1993), the ℓ1 penalty or the least absolute shrinkage and selection operator (LASSO, Tibshirani 1996; Chen, Donoho and Saunders 1998), the smoothly clipped absolute deviation (SCAD) penalty (Fan 1997; Fan and Li 2001), and the minimax concave penalty (MCP, Zhang 2010). These methods are able to do estimation and automatic variable selection simultaneously and provide a computationally feasible way to carry out variable selection in high-dimensional settings. Much progress has been made in understanding the theoretical properties of these methods, and efficient algorithms have been developed for implementing them.

A common feature of the methods mentioned above is that the penalty does not depend on the correlation among predictors. This can lead to unsatisfactory selection results, especially in p ≫ n settings. For example, as pointed out by Zou and Hastie (2005), the LASSO tends to select only one variable among a group of highly correlated variables, and its prediction performance may not be as good as that of ridge regression when there is high correlation among predictors. To overcome these limitations, Zou and Hastie (2005) proposed the elastic net (Enet) method, which uses a combination of the ℓ1 and ℓ2 penalties. Selection properties of the Enet and adaptive Enet have been studied by Jia and Yu (2010) and Zou and Zhang (2009). Bondell and Reich (2008) proposed the OSCAR (octagonal shrinkage and clustering algorithm for regression) approach, which uses a combination of the ℓ1 norm and a pairwise ℓ∞ norm for the coefficients. Huang et al. (2010) proposed the Mnet method, which uses a combination of the MCP and ℓ2 penalties. The Mnet estimator is equal to the oracle ridge estimator with high probability under certain conditions. These methods are effective in dealing with certain types of collinearity among predictors and have the useful grouping property of selecting or dropping highly correlated predictors together. Still, these combination penalties do not use any specific information on the correlation pattern among the predictors.

Li and Li (2008) proposed a network-constrained regularization procedure for variable selection and estimation in linear regression models, where the predictors are genomic data measured on genetic networks. Li and Li (2010) considered the general problem of regression analysis when predictors are measured on an undirected graph, which is assumed to be known a priori. They called their method a graph-constrained estimation procedure, or GRACE. The GRACE penalty is a combination of the ℓ1 penalty and a penalty that is the Laplacian quadratic associated with the graph. Because the GRACE uses the ℓ1 penalty for selection and sparsity, it has the same drawbacks as the Enet discussed above. In addition, full knowledge of the graphical structure for the predictors is usually not available, especially in high-dimensional problems. Daye and Jeng (2009) proposed the weighted fusion method, which also uses a combination of the ℓ1 penalty and a quadratic form that can incorporate information among correlated variables for estimation and variable selection. Tutz and Ulbricht (2009) studied a form of correlation-based penalty, which can be considered a special case of the general quadratic penalty; this approach by itself does not do variable selection, so the authors proposed a blockwise boosting procedure in combination with the correlation-based penalty for variable selection. Hebiri and van de Geer (2010) studied the theoretical properties of the smooth-Lasso and other ℓ1 + ℓ2-penalized methods in p ≫ n models. Pan, Xie and Shen (2009) studied a grouped penalty based on the Lγ-norm for γ > 1 that smoothes the regression coefficients over a network. In particular, when γ = 2 and after appropriate rescaling of the regression coefficients, this group Lγ penalty simplifies to the group Lasso (Yuan and Lin 2006) with the nodes in the network as groups. This method is capable of group selection, but it does not do individual variable selection. Also, because the group Lγ penalty is convex for γ > 1, it does not lead to consistent variable selection, even at the group level.

We propose a new penalized method for variable selection and estimation in sparse, high-dimensional settings that takes into account certain correlation patterns among predictors. We consider a combination of the MCP and Laplacian quadratic as the penalty function. We call the proposed approach the sparse Laplacian shrinkage (SLS) method. The SLS uses the MCP to promote sparsity and Laplacian quadratic penalty to encourage smoothness among coefficients associated with the correlated predictors. An important advantage of the MCP over the ℓ1 penalty is that it leads to estimators that are nearly unbiased and achieve selection consistency under weaker conditions (Zhang, 2010).

The contributions of this paper are as follows.

  • First, unlike the existing methods that use an ℓ1 penalty for selection and a ridge penalty or a general ℓ2 penalty for dealing with correlated predictors, we use the MCP to achieve nearly unbiased selection and propose a concrete class of quadratics, the Laplacians, for incorporating correlation patterns among predictors in a local fashion. In particular, we suggest employing approaches from network analysis for specifying the Laplacians. This provides an implementable strategy for incorporating correlation structures in high-dimensional data analysis.

  • Second, we prove that the SLS estimator is sign consistent and equal to the oracle Laplacian shrinkage estimator under reasonable conditions. This result holds for a large class of Laplacian quadratics. An important aspect of this result is that it allows the number of predictors to be larger than the sample size. In contrast, the works of Daye and Jeng (2009) and Tutz and Ulbricht (2009) do not contain such results in p ≫ n models. The selection consistency result of Hebiri and van de Geer (2010) requires certain strong assumptions on the magnitude of the smallest regression coefficient (their Assumption C) and on the correlation between important and unimportant predictors (their Assumption D), in addition to a variant of the restricted eigenvalue condition (their Assumption B). In comparison, our assumption involving the magnitude of the regression coefficients is weaker, and we use a sparse Riesz condition instead of imposing restrictions on the correlations among predictors. In addition, our selection results are stronger in that the SLS estimator is not only sign consistent, but also equal to the oracle Laplacian shrinkage estimator with high probability. In general, similar results are not available with the use of the ℓ1 penalty.

  • Third, we show that the SLS method is potentially capable of incorporating correlation structure in the analysis without incurring extra bias. The Enet and the more general ℓ1 + ℓ2 methods in general introduce extra bias due to the quadratic penalty, in addition to the bias resulting from the ℓ1 penalty. To the best of our knowledge, this point has not been discussed in the existing literature. We also demonstrate that the SLS has a certain local smoothing property with respect to the graphical structure of the predictors.

  • Fourth, unlike in the GRACE method, the SLS does not assume that the graphical structure for the predictors is known a priori. The SLS uses the existing data to construct the graph Laplacian or to augment partial knowledge of the graph structure.

  • Fifth, our simulation studies demonstrate that the SLS method outperforms the approach combining the ℓ1 penalty with a quadratic penalty as studied in Daye and Jeng (2009) and Hebiri and van de Geer (2010). In our simulation examples, the SLS in general has smaller empirical false discovery rates with comparable false negative rates. It also has smaller prediction errors.

This paper is organized as follows. In Section 2 we define the SLS estimator. In Section 3 we discuss ways to construct graph Laplacian, or equivalently, its corresponding adjacency matrix. In Section 4 we study the selection properties of the SLS estimators. In Section 5 we investigate the properties of Laplacian shrinkage. In Section 6 we describe a coordinate descent algorithm for computing the SLS estimators, present simulation results and an application of the SLS method to a microarray gene expression dataset. Discussions of the proposed method and results are given in Section 7. Proofs for the oracle properties of the SLS and other technical details are provided in the Appendix.

2 The sparse Laplacian shrinkage estimator

Consider the linear regression model

y = ∑_{j=1}^p x_j β_j + ε, (2.1)

with n observations and p potential predictors, where y = (y_1, …, y_n)′ is the vector of n response variables, x_j = (x_{1j}, …, x_{nj})′ is the jth predictor, β_j is the jth regression coefficient and ε = (ε_1, …, ε_n)′ is the vector of random errors. Let X = (x_1, …, x_p) be the n × p design matrix. Throughout, we assume that the response and predictors are centered and the predictors are standardized so that ∑_{i=1}^n x_{ij}^2 = n, j = 1, …, p. For λ = (λ1, λ2) with λ1 ≥ 0 and λ2 ≥ 0, we propose the penalized least squares criterion

M(b; λ, γ) = (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2, (2.2)

where || · || denotes the ℓ2 norm, ρ is the MCP with penalty parameter λ1 and regularization parameter γ, |a_{jk}| measures the strength of the connection between x_j and x_k, and s_{jk} = sgn(a_{jk}) is the sign of a_{jk}, with sgn(t) = −1, 0, or 1 for t < 0, = 0, or > 0, respectively. The two penalty terms in (2.2) play different roles. The first term promotes sparsity in the estimated model. The second term encourages smoothness of the estimated coefficients of the connected predictors. We can associate the quadratic form in this term with the Laplacian of a suitably defined undirected weighted graph for the predictors; see the description below. For any given (λ, γ), the SLS estimator is

β̂(λ, γ) = argmin_b M(b; λ, γ). (2.3)

The SLS uses the MCP, defined as

ρ(t; λ1, γ) = λ1 ∫_0^{|t|} (1 − x/(γλ1))_+ dx, (2.4)

where for any a ∈ ℝ, a_+ is the nonnegative part of a, i.e., a_+ = a 1{a ≥ 0}. The MCP can be easily understood by considering its derivative,

ρ̇(t; λ1, γ) = λ1 (1 − |t|/(γλ1))_+ sgn(t). (2.5)

We observe that the MCP begins by applying the same level of penalization as the ℓ1 penalty, but continuously reduces that level to 0 for |t| > γλ1. The regularization parameter γ controls the degree of concavity: larger values of γ make ρ less concave. By sliding the value of γ from 1 to ∞, the MCP provides a continuum of penalties, with the hard-thresholding penalty as γ → 1+ and the convex ℓ1 penalty at γ = ∞. A detailed discussion of the MCP can be found in Zhang (2010).
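The following minimal Python sketch illustrates (2.4) and (2.5). The closed form of the integral and the vectorized interface are our own additions for illustration; they are not part of the paper.

```python
import numpy as np

def mcp_penalty(t, lam1, gamma):
    """Minimax concave penalty rho(t; lam1, gamma) of (2.4), evaluated elementwise."""
    t = np.abs(np.asarray(t, dtype=float))
    # Closed form of the integral in (2.4):
    #   lam1*|t| - t^2/(2*gamma)   for |t| <= gamma*lam1
    #   gamma*lam1^2/2             for |t| >  gamma*lam1
    inside = t <= gamma * lam1
    return np.where(inside, lam1 * t - t**2 / (2.0 * gamma), 0.5 * gamma * lam1**2)

def mcp_derivative(t, lam1, gamma):
    """Derivative (2.5): lam1*(1 - |t|/(gamma*lam1))_+ * sgn(t)."""
    t = np.asarray(t, dtype=float)
    return lam1 * np.clip(1.0 - np.abs(t) / (gamma * lam1), 0.0, None) * np.sign(t)

# The penalization level starts at lam1 (like the l1 penalty) and decays to 0 at gamma*lam1.
print(mcp_derivative([0.1, 1.0, 5.0], lam1=0.5, gamma=3.0))
```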

The SLS also allows the use of different penalties than the MCP for ρ, including the SCAD (Fan 1997; Fan and Li 2001) and other quadratic splines. Because the MCP minimizes the maximum concavity measure and has the simplest form among nearly unbiased penalties in this family, we choose it as the default penalty for the SLS. Further discussion of the MCP and its comparison with the LASSO and SCAD can be found in Zhang (2010) and Mazumder, Friedman and Hastie (2009).

We express the nonnegative quadratic form in the second penalty term in (2.2) using a positive semi-definite matrix L, which satisfies

b′Lb = ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2, ∀ b ∈ ℝ^p.

For simplicity, we confine our discussion to the symmetric case where a_{kj} = a_{jk}, 1 ≤ j < k ≤ p. Since the diagonal elements a_{jj} do not appear in the quadratic form, we can define them any way we like for convenience. Let A = (a_{jk}, 1 ≤ j, k ≤ p) and D = diag(d_1, …, d_p), where d_j = ∑_{k=1}^p |a_{jk}|. We have ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2 = b′(D − A)b. Therefore, L = D − A. This matrix is associated with a labeled weighted graph G = (V, E) with vertex set V = {1, …, p} and edge set E = {(j, k) ∈ V × V : a_{jk} ≠ 0}. Here |a_{jk}| is the weight of edge (j, k) and d_j is the degree of vertex j; d_j is also called the connectivity of vertex j. The matrix L is called the Laplacian of G and A its signed adjacency matrix (Chung 1997). The edge (j, k) is labeled with the “+” or “−” sign, but its weight |a_{jk}| is always nonnegative. We use a labeled graph to accommodate the case where two predictors have a nonzero adjacency coefficient but are negatively correlated. Note that the usual adjacency matrix can be considered a special case of the signed adjacency matrix when all a_{jk} ≥ 0. For simplicity, we will use the term adjacency matrix below.

We usually require that the adjacency matrix to be sparse in the sense that many of its entries are zero or nearly zero. With a sparse adjacency matrix, the main characteristic of the shrinkage induced by the Laplacian penalty is that it occurs locally for the coefficients associated with the predictors connected in the graph. Intuitively, this can be seen by writing

λ2 ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2 = (1/2) λ2 ∑_{(j,k): a_{jk}≠0} |a_{jk}| (b_j − s_{jk} b_k)^2.

Thus for λ2 > 0, the Laplacian penalty shrinks b_j − s_{jk} b_k towards zero for a_{jk} ≠ 0. This can also be considered a type of local smoothing on the graph G associated with the adjacency matrix A. In comparison, the shrinkage induced by the ridge penalty used in the Enet is global, in that it shrinks all the coefficients towards zero regardless of the correlation structure among the predictors. We will discuss the Laplacian shrinkage in more detail in Section 5.

Using the matrix notation, the SLS criterion (2.2) can be written as

M(b; λ, γ) = (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 b′(D − A)b. (2.6)

Here the Laplacian is not normalized, meaning that the weight d_j is not standardized to 1. In problems where predictors should be treated without preference with respect to connectivity, we can first normalize the Laplacian, L* = I_p − A* with A* = D^{−1/2} A D^{−1/2}, and use the criterion

M(b; λ, γ) = (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 b′(I_p − A*)b.

Technically, a normalized Laplacian L* can be considered a special case of a general L. We only consider the SLS estimator based on the criterion (2.6) when studying its properties. In network analysis of gene expression data, genes with large connectivity also tend to have important biological functions (Zhang and Horvath 2005). Therefore, it is prudent to provide more protection for such genes in the selection process.
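To make the notation in (2.6) concrete, the sketch below builds L = D − A from a signed adjacency matrix and evaluates the SLS criterion at a given coefficient vector. It is a minimal illustration under our own naming conventions and reuses mcp_penalty from the sketch above.

```python
import numpy as np

def laplacian_from_signed_adjacency(A_signed):
    """Given a symmetric signed adjacency matrix with entries a_jk (zero diagonal),
    return L = D - A, where d_j = sum_k |a_jk|."""
    A = np.asarray(A_signed, dtype=float)
    D = np.diag(np.abs(A).sum(axis=1))
    return D - A

def sls_criterion(b, X, y, A_signed, lam1, lam2, gamma):
    """Penalized least squares criterion (2.6):
    (2n)^{-1}||y - Xb||^2 + sum_j rho(b_j; lam1, gamma) + (1/2) lam2 b'(D - A)b."""
    n = X.shape[0]
    L = laplacian_from_signed_adjacency(A_signed)
    rss = np.sum((y - X @ b) ** 2) / (2.0 * n)
    penalty1 = np.sum(mcp_penalty(b, lam1, gamma))   # MCP term, from the earlier sketch
    penalty2 = 0.5 * lam2 * b @ L @ b                # Laplacian quadratic term
    return rss + penalty1 + penalty2
```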

3 Construction of adjacency matrix

In this section, we describe several simple forms of adjacency measures proposed by Zhang and Horvath (2005), which have been successfully used in network analysis of gene expression data. The adjacency measure is often defined based on a notion of dissimilarity or similarity.

  1. A basic and widely used dissimilarity measure is the Euclidean distance. Based on this distance, we can define the adjacency coefficient as a_{jk} = φ(||x_j − x_k||/√n), where φ: [0, ∞) ↦ [0, ∞). A simple adjacency function is the threshold function φ(x) = 1{x ≤ 2r}. Then
    a_{jk} = 1 if ||x_j − x_k||/√n ≤ 2r, and a_{jk} = 0 if ||x_j − x_k||/√n > 2r. (3.1)

    It is convenient to express a_{jk} in terms of the Pearson correlation coefficient r_{jk} between x_j and x_k, where r_{jk} = x_j′x_k/(||x_j|| ||x_k||). For predictors that are standardized with ||x_j||^2 = n, 1 ≤ j ≤ p, we have ||x_j − x_k||^2/n = 2 − 2r_{jk}. Thus, in terms of correlation coefficients, we can write a_{jk} = 1{r_{jk} > r}. We determine the value of r based on the Fisher transformation z_{jk} = 0.5 log((1 + r_{jk})/(1 − r_{jk})). If the correlation between x_j and x_k is zero, √(n − 3) z_{jk} is approximately distributed as N(0, 1). We can use this to determine a threshold c for √(n − 3) z_{jk}. The corresponding threshold for r_{jk} is r = (exp(2c/√(n − 3)) − 1)/(exp(2c/√(n − 3)) + 1).

    We note that here we use the Fisher transformation to change the scale of the correlation coefficients from [−1, 1] to the normal scale for determining the threshold value r, so that the adjacency matrix is relatively sparse. We are not trying to test the significance of correlation coefficients.

  2. The adjacency coefficient in (3.1) is defined based on a dissimilarity measure. Adjacency coefficient can also be defined based on similarity measures. An often used similarity measure is Pearson’s correlation coefficient rjk. Other correlation measures such as Spearman’s correlation can also be used. Let
    s_{jk} = sgn(r_{jk}) and a_{jk} = s_{jk} 1{|r_{jk}| > r}.

    Here r can be determined using the Fisher transformation as above.

  3. With the power adjacency function considered in Zhang and Horvath (2005),
    a_{jk} = max(0, r_{jk})^α and s_{jk} = 1.

    Here α > 0 and can be determined by, for example, the scale-free topology criterion.

  4. A variation of the above power adjacency function is
    a_{jk} = |r_{jk}|^α and s_{jk} = sgn(r_{jk}).

For the adjacency matrices given above, (i) and (ii) use dichotomized measures, whereas (iii) and (iv) use continuous measures. Under (i) and (iii), two covariates are either positively connected/correlated or not connected at all. In contrast, under (ii) and (iv), two covariates are allowed to be negatively connected/correlated.
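A minimal sketch of the four rules above, together with the Fisher-transformation threshold of rule (i), is given below. The function names, the convention of returning the signed entries s_{jk}a_{jk} in a single matrix, and the rule labels are our own choices for illustration.

```python
import numpy as np

def fisher_threshold(n, c):
    """Correlation threshold r obtained from a cutoff c on sqrt(n-3)*z, z the Fisher transform."""
    return (np.exp(2.0 * c / np.sqrt(n - 3.0)) - 1.0) / (np.exp(2.0 * c / np.sqrt(n - 3.0)) + 1.0)

def adjacency(X, rule="power-signed", r=None, alpha=6.0):
    """Signed adjacency coefficients for the four rules of Section 3.
    X is assumed standardized so that each column has mean 0 and ||x_j||^2 = n."""
    n = X.shape[0]
    R = (X.T @ X) / n                      # Pearson correlations r_jk
    np.fill_diagonal(R, 0.0)               # diagonal entries are not used in the Laplacian quadratic
    if rule == "threshold":                # (i)   a_jk = 1{r_jk > r}, s_jk = 1
        A = (R > r).astype(float)
    elif rule == "threshold-signed":       # (ii)  a_jk = sgn(r_jk) 1{|r_jk| > r}
        A = np.sign(R) * (np.abs(R) > r)
    elif rule == "power":                  # (iii) a_jk = max(0, r_jk)^alpha, s_jk = 1
        A = np.maximum(R, 0.0) ** alpha
    elif rule == "power-signed":           # (iv)  a_jk = sgn(r_jk) |r_jk|^alpha
        A = np.sign(R) * np.abs(R) ** alpha
    else:
        raise ValueError("unknown rule")
    return A
```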

There are many other ways of constructing an adjacency matrix. For example, a popular adjacency measure in cluster analysis is a_{jk} = exp(−||x_j − x_k||^2/(nτ^2)) for τ > 0. The resulting adjacency matrix A = [a_{jk}] is the Gram matrix associated with the Gaussian kernel. For discrete covariates, the Pearson correlation coefficient can still be used as a measure of correlation or association between two discrete predictors or between a discrete predictor and a continuous one. For example, for single nucleotide polymorphism data, Pearson's correlation coefficient is often used as a measure of linkage disequilibrium (i.e., association) between two markers. Other measures, such as the odds ratio or measures of association based on contingency tables, can also be used for r_{jk}.

We note that how to construct the adjacency matrix is problem specific. Different applications may require different adjacency matrices. Since construction of adjacency matrix is not the focus of the present paper, we will only consider the use of the four adjacency matrices described above in our numerical studies in Section 6.

4 Oracle properties

In this section, we study the theoretical properties of the SLS estimator. Let the true value of the regression coefficient be β^o = (β_1^o, …, β_p^o)′. Denote O = {j : β_j^o ≠ 0}, the set of indices of the nonzero coefficients. Let do = |O| be the cardinality of O. Define

β̂^o(λ2) = argmin_b {(2n)^{-1} ||y − Xb||^2 + (1/2) λ2 b′Lb : b_j = 0, j ∉ O}. (4.1)

This is the oracle Laplacian shrinkage estimator on the set O. Theorems 1 and 2 below provide sufficient conditions under which P(sgn(β̂) ≠ sgn(β^o) or β̂ ≠ β̂^o) → 0. Thus under those conditions, the SLS estimator is sign consistent and equal to β̂^o with high probability.

We need the following notation in stating our results. Let Σ = n^{−1}X′X. For any A, B ⊆ {1, …, p}, any vector v, the design matrix X and any matrix V = (v_{ij})_{p×p}, define

v_B = (v_j, j ∈ B), X_B = (x_j, j ∈ B), V_{A,B} = (v_{ij}, i ∈ A, j ∈ B), V_B = V_{B,B}.

For example, Σ_B = X_B′X_B/n and Σ_B(λ2) = Σ_B + λ2 L_B. Let |B| denote the cardinality of B. Let cmin(λ2) be the smallest eigenvalue of Σ + λ2 L. We use the following constants to bound the bias of the Laplacian:

C1 = ||Σ_O^{−1}(λ2) L_O β_O^o||, C2 = ||{Σ_{O^c,O}(λ2) Σ_O^{−1}(λ2) L_O − L_{O^c,O}} β_O^o||. (4.2)

We make the following sub-Gaussian assumption on the error terms in (2.1).

Condition (A): For a certain constant ε ∈ (0, 1/3),

sup_{||u||=1} P{|u′ε| > σt} ≤ e^{−t^2/2}, 0 < t ≤ √(2 log(p/ε)).

4.1 Convex penalized loss

We first consider the case where Σ(λ2) = Σ + λ2 L is positive definite. Since (4.1) is the minimizer of the Laplacian shrinkage criterion restricted to the support O, it can be written explicitly as

β̂_O^o = (Σ_O + λ2 L_O)^{−1} X_O′ y/n, β̂_{O^c}^o = 0, (4.3)

provided that Σ_O(λ2) is invertible. Its expectation β* = E β̂^o, considered as a target of the SLS estimator, must satisfy

β*_O = (Σ_O + λ2 L_O)^{−1} Σ_O β_O^o, β*_{O^c} = 0. (4.4)

Condition (B): (i) cmin(λ2) > 1/γ, with ρ(t; λ1, γ) as in (2.2). (ii) The penalty levels satisfy

λ1 ≥ λ2 C2 + σ√(2 log((p − do)/ε)) max_{j≤p} ||x_j||/n,

with C2 in (4.2). (iii) With {v_j, j ∈ O} being the diagonal elements of Σ_O^{−1}(λ2) Σ_O {Σ_O^{−1}(λ2)}′,

min_{j∈O} {|β*_j| (n/v_j)^{1/2}} ≥ σ√(2 log(do/ε)).

Define β_* = min{|β_j^o| : j ∈ O}. If O is an empty set, that is, when all the regression coefficients are zero, we set β_* = ∞.

Theorem 1

Suppose Conditions (A) and (B) hold. Then,

P({j : β̂_j ≠ 0} ≠ O or β̂ ≠ β̂^o) ≤ 3ε. (4.5)

If β_* ≥ λ2 C1 + σ max_{j∈O} √((2v_j/n) log(do/ε)) instead of Condition (B) (iii), then

P(sgn(β̂) ≠ sgn(β^o) or β̂ ≠ β̂^o) ≤ 3ε. (4.6)

Here note that p, do, γ and cmin(λ2) are all allowed to depend on n.

The probability bound on the selection error in Theorem 1 is nonasymptotic. If the conditions of Theorem 1 hold with ε → 0, then (4.5) implies selection consistency of the SLS estimator and (4.6) implies sign consistency. The conditions are mild. Condition (A) concerns the tail probabilities of the error distribution and is satisfied if the errors are normally distributed. Condition (B) (i) ensures that the SLS criterion is strictly convex, so that the solution is unique. The oracle estimator β̂^o is biased due to the Laplacian shrinkage. Condition (B) (ii) requires a penalty level λ1 large enough to prevent this bias and the noise from causing false selection of variables in O^c. Condition (B) (iii) requires that the nonzero coefficients not be too small, so that the SLS estimator is able to distinguish nonzero from zero coefficients.

In Theorem 1, we only require cmin(λ2) > 0, or equivalently, Σ + λ2 L to be positive definite. The matrix Σ itself can be singular. This can be seen as follows. The adjacency matrix partitions the graph into disconnected cliques V_g, 1 ≤ g ≤ J, for some J ≥ 1. Let node j_g be a (representative) member of V_g. A node k belongs to the same clique V_g iff (if and only if) a_{j_g k_1} a_{k_1 k_2} ··· a_{k_m k} ≠ 0 through a certain chain j_g → k_1 → k_2 → ··· → k_m → k. Define x̄_g = ∑_{k∈V_g} a_{j_g k_1} a_{k_1 k_2} ··· a_{k_m k} x_k/|V_g|, where |V_g| is the cardinality of V_g. The matrix Σ + λ2 L is positive definite iff b′Σb = b′Lb = 0 implies b = 0. Since b′Lb = 0 implies ∑_{k∈V_g} b_k x_k = b_{j_g} |V_g| x̄_g, Σ + λ2 L is positive definite iff the vectors x̄_g are linearly independent. This does not require n ≥ p. In other words, Theorem 1 is applicable to p > n problems as long as the vectors x̄_g are linearly independent.
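For concreteness, the following is a minimal sketch of the oracle Laplacian shrinkage estimator in (4.3), which is computable only when the oracle set O is known (for example, in simulations); it assumes that Σ_O(λ2) is invertible, and the function name is ours.

```python
import numpy as np

def oracle_laplacian_shrinkage(X, y, L, O, lam2):
    """Oracle estimator (4.3): beta_O = (Sigma_O + lam2*L_O)^{-1} X_O'y/n, zero off O.
    O is an index array of the true support; L is the p x p Laplacian."""
    n, p = X.shape
    X_O = X[:, O]
    Sigma_O = X_O.T @ X_O / n
    L_O = L[np.ix_(O, O)]
    beta = np.zeros(p)
    beta[O] = np.linalg.solve(Sigma_O + lam2 * L_O, X_O.T @ y / n)
    return beta
```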

4.2 The nonconvex case

When Σ(λ2) = Σ + λ2 L is singular, Theorem 1 is not applicable. In this case, further conditions are required for the oracle property to hold. The key condition needed is the sparse Riesz condition, or SRC (Zhang and Huang 2008), in (4.9) below. It restricts the spectrum of the diagonal subblocks of Σ(λ2) up to a certain dimension.

Let X̃ = X̃(λ2) be a matrix satisfying X̃′X̃/n = Σ(λ2) = X′X/n + λ2 L and let ỹ = ỹ(λ2) be a vector satisfying X̃′ỹ = X′y. Define

M̃(b; λ, γ) = (2n)^{-1} ||ỹ − X̃b||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ). (4.7)

Since M(b; λ, γ) − M̃(b; λ, γ) = (||y||^2 − ||ỹ||^2)/(2n), the two penalized loss functions have the same set of local minimizers. For the penalized loss (4.7) with the data (X̃, ỹ), let

β̂(λ) = δ(X̃(λ2), ỹ(λ2), λ1), (4.8)

where the map δ(X, y, λ1) ∈ ℝ^p defines the MC+ estimator (Zhang, 2010) with data (X, y) and penalty level λ1. It was shown in Zhang (2010) that δ(X, y, λ1) depends on (X, y) only through X′y/n and X′X/n, so that different choices of X̃ and ỹ are allowed. One way is to pick ỹ = (y′, 0′)′ and X̃ = (X′, (nλ2 L)^{1/2})′. Another way is to pick X̃ with X̃′X̃/n = Σ(λ2) and ỹ = (X̃′)^+ X′y of smaller dimensions, where (X̃′)^+ is the Moore–Penrose inverse of X̃′.

Condition (C)

  1. For an integer d* and spectrum bounds 0 < c_*(λ2) ≤ c^*(λ2) < ∞,
    0 < c_*(λ2) ≤ u_B′ Σ_B(λ2) u_B ≤ c^*(λ2) < ∞, ∀ B with |B ∪ O| ≤ d*, ||u_B|| = 1, (4.9)

    with d* ≥ do(K_* + 1), γ ≥ c_*^{−1}(λ2) √(4 + c^*(λ2)/c_*(λ2)) in (2.2), and K_* = c^*(λ2)/c_*(λ2) − 1/2.

  2. With C̃2 = ||{Σ_{B,O}(λ2) Σ_O^{−1}(λ2) L_O − L_{B,O}} β_O^o||,
    max{1, √(c_*(λ2) K_*/(K_* + 1))} λ1 ≥ λ2 C̃2 + σ√(2 log(p/ε)) max_{j≤p} ||x_j||/n.
  3. With {v_j, j ∈ O} being the diagonal elements of Σ_O^{−1}(λ2) Σ_O {Σ_O^{−1}(λ2)}′,
    min_{j∈O} {|β*_j| − 2γ c^*(λ2) λ1} (n/v_j)^{1/2} ≥ σ√(2 log(do/ε)).

Theorem 2

  1. Suppose Conditions (A) and (C) hold. Let β̂(λ) be as in (4.8). Then,
    P({j : β̂_j ≠ 0} ≠ O or β̂ ≠ β̂^o) ≤ 3ε. (4.10)
    If β_* ≥ λ2 C1 + 2γ c^*(λ2) λ1 + σ max_{j∈O} √((2v_j/n) log(do/ε)) instead of Condition (C) (iii), then
    P(sgn(β̂) ≠ sgn(β^o) or β̂ ≠ β̂^o) ≤ 3ε. (4.11)

    Here note that p, γ, do, d*, K_*, ε, c_*(λ2) and c^*(λ2) are all allowed to depend on n, including the case c_*(λ2) → 0, as long as the conditions hold as stated.

  2. The statements in (i) also hold for all local minimizers β̂ of (2.6) or (4.7) satisfying #{j ∉ O : β̂_j ≠ 0} + do ≤ d*.

If the conditions of Theorem 2 hold with ε → 0, then (4.10) implies selection consistency of the SLS estimator and (4.11) implies sign consistency.

Condition (C), designed to handle the nonconvexity of the penalized loss, is a weaker version of Condition (B) in the sense of allowing a singular Σ(λ2). The SRC (4.9), which depends on X or X̃ only through the regularized Gram matrix X̃′X̃/n = Σ(λ2) = Σ + λ2 L, ensures that the model is identifiable in a lower, d*-dimensional space. When p > n, the smallest singular value of X is always zero. However, the requirement c_*(λ2) > 0 only concerns the d* × d* diagonal submatrices of Σ(λ2), not the Gram matrix Σ of the design matrix X. We can have p ≫ n but still require d*/do ≥ K_* + 1 as in (4.9). Since p, do, γ, d*, K_*, c_*(λ2) and c^*(λ2) can depend on n, we allow the case c_*(λ2) → 0 as long as Conditions (A) and (C) hold as stated. Thus, we allow p ≫ n but require that the model be sparse, in the sense that the number of nonzero coefficients do is smaller than d*/(1 + K_*). For example, if c_*(λ2) ≍ O(n^{−α}) for a small α > 0 and c^*(λ2) ≍ O(1), then we require γ of order n^{3α/2} or greater, K_* ≍ O(n^α) and d*/do of order n^α or greater. So all these quantities can depend on n, as long as the other requirements in Condition (C) are met.

By examining Conditions C (ii) and C (iii), for standardized predictors with ||x_j|| = √n, we can have log(p/ε) = o(n), or p = ε exp(o(n)), as long as Condition C (ii) is satisfied. As in Zhang (2010), under a somewhat stronger version of Condition (C), Theorem 2 can be extended to quadratic spline concave penalties satisfying ρ(t; λ1, γ) = λ1^2 ρ(t/λ1; γ), with a penalty function ρ(·; γ) satisfying (∂/∂t)ρ(t; γ) = 1 at t = 0+ and = 0 for t > γ.

Also, comparing our results with the selection consistency results of Hebiri and van de Geer (2010) on the smooth-Lasso and other ℓ1 + ℓ2-penalized methods, our conditions tend to be weaker. Notably, Hebiri and van de Geer (2010) require a condition on the Gram matrix which assumes that the correlations between the truly relevant variables and the irrelevant ones are small. No such assumption is required for our selection consistency results. In addition, our selection results are stronger in the sense that the SLS estimator is not only sign consistent, but also equal to the oracle Laplacian shrinkage estimator with high probability. In general, similar results are not available with the use of the ℓ1 penalty for sparsity.

Theorem 2 shows that the SLS estimator automatically adapts to the sparseness of the p-dimensional model and the denseness of a true submodel. From a sparse p-dimensional model, it correctly selects the true underlying model O. This underlying model is a dense model in the sense that all its coefficients are nonzero. In this dense model, the SLS estimator behaves like the oracle Laplacian shrinkage estimator in (4.1). As in the convex penalized loss setting, here the results do not require a correct specification of a population correlation structure of the predictors.

4.3 Unbiased Laplacian and variance reduction

There are two natural questions concerning the SLS. First, what are the benefits from introducing the Laplacian penalty? Second, what kind of Laplacian L constitutes a reasonable choice? Since the SLS estimator is equal to the oracle Laplacian estimator with high probability by Theorem 1 or 2, these questions can be answered by examining the oracle Laplacian shrinkage estimator (4.1), whose nonzero part is

β̂_O^o(λ2) = Σ_O^{−1}(λ2) X_O′ y/n.

Without the Laplacian, i.e., when λ2 = 0, it becomes the least squares (LS) estimator,

β̂_O^o(0) = Σ_O^{−1} X_O′ y/n.

If some of the predictors in {x_j, j ∈ O} are highly correlated or |O| ≥ n, the LS estimator β̂_O^o(0) is not stable or not unique. In comparison, as discussed below Theorem 1, Σ_O(λ2) = Σ_O + λ2 L_O can be a full rank matrix under a reasonable condition, even if the predictors in {x_j, j ∈ O} are highly correlated or |O| ≥ n.

For the second question, we examine the bias of β̂_O^o(λ2). Since the bias of the target vector (4.4) is β_O^o − β*_O(λ2) = λ2 Σ_O^{−1}(λ2) L_O β_O^o, the estimator β̂_O^o(λ2) is unbiased iff L_O β_O^o = 0. Therefore, in terms of bias reduction, a Laplacian L is most appropriate if the condition L_O β_O^o = 0 is satisfied. We shall say that a Laplacian L is unbiased if L_O β_O^o = 0. It follows from the discussion at the end of Subsection 4.1 that L_O β_O^o = 0 if β_k^o = β_{j_g}^o a_{j_g k_1} a_{k_1 k_2} ··· a_{k_m k}, where j_g is a representative member of the clique V_g ⊆ O and {k_1, …, k_m, k} ⊆ V_g ⊆ O.

With an unbiased Laplacian, the mean square error of β̂_O^o(λ2) is

E||β̂_O^o(λ2) − β_O^o||^2 = (σ^2/n) trace(Σ_O^{−1}(λ2) Σ_O Σ_O^{−1}(λ2)).

The mean square error of β̂_O^o(0) is

E||β̂_O^o(0) − β_O^o||^2 = (σ^2/n) trace(Σ_O^{−1}).

We always have E||β̂_O^o(λ2) − β_O^o||^2 < E||β̂_O^o(0) − β_O^o||^2 for λ2 > 0. Therefore, an unbiased Laplacian reduces variance without incurring any bias on the estimator.
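The two mean squared errors above can be compared numerically with a small helper of our own (a sketch, assuming an unbiased Laplacian so that both estimators are unbiased and the traces capture the full MSE):

```python
import numpy as np

def oracle_mse_traces(X_O, L_O, lam2, sigma2):
    """Compare (sigma^2/n) trace(Sigma_O^{-1}(lam2) Sigma_O Sigma_O^{-1}(lam2))
    with (sigma^2/n) trace(Sigma_O^{-1}), i.e., the MSEs of the oracle Laplacian
    shrinkage and oracle least squares estimators under an unbiased Laplacian."""
    n = X_O.shape[0]
    Sigma_O = X_O.T @ X_O / n
    M_inv = np.linalg.inv(Sigma_O + lam2 * L_O)
    mse_laplacian = sigma2 / n * np.trace(M_inv @ Sigma_O @ M_inv)
    mse_ls = sigma2 / n * np.trace(np.linalg.inv(Sigma_O))
    return mse_laplacian, mse_ls
```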

5 Laplacian shrinkage

The results in Section 4 show that the SLS estimator is equal to the oracle Laplacian shrinkage estimator with probability tending to one under certain conditions. In addition, an unbiased Laplacian reduces variance but does not increase bias. Therefore, to study the shrinkage effect of the Laplacian penalty on β̂, we can consider the oracle estimator β̂_O^o. To simplify the notation and without causing confusion, in this section we consider a generic model with q predictors and study some other basic properties of the Laplacian shrinkage, comparing it with the ridge shrinkage. The Laplacian shrinkage estimator is defined as

β̃(λ2) = argmin_b {G(b; λ2) ≡ (2n)^{-1} ||y − Xb||^2 + (1/2) λ2 b′Lb, b ∈ ℝ^q}. (5.1)

The following proposition shows that the Laplacian penalty shrinks a coefficient towards the center of all the coefficients connected to it.

Proposition 1

Let r̃ = y − Xβ̃.

  1. λ2 max_{1≤j≤q} d_j |β̃_j − a_j′β̃/d_j| ≤ n^{−1/2} ||r̃|| ≤ n^{−1/2} ||y||.
  2. λ2 |d_j β̃_j − a_j′β̃ − (d_k β̃_k − a_k′β̃)| ≤ (1/n) ||x_j − x_k|| ||y||.

Note that a_j′β̃/d_j = ∑_{k=1}^q a_{jk} β̃_k/d_j = ∑_{k=1}^q sgn(a_{jk}) |a_{jk}| β̃_k/d_j is a signed weighted average of the β̃_k's connected to β̃_j, since d_j = ∑_k |a_{jk}|. Part (i) of Proposition 1 provides an upper bound on the difference between β̃_j and the center of all the coefficients connected to it. When ||r̃||/(λ2 d_j) → 0, this difference converges to zero. For standardized d_j = 1, part (ii) implies that the difference between the centered β̃_j and β̃_k converges to zero when ||x_j − x_k|| ||y||/(λ2 n) → 0.

When there are certain local structures in the adjacency matrix A, shrinkage occurs at the local level. As an example, we consider the adjacency matrix based on a partition of the predictors into 2r-balls as defined via (3.1). Correspondingly, the index set {1, …, q} is divided into disjoint neighborhoods/cliques V_1, …, V_J. We consider the normalized Laplacian L = I_q − A, where I_q is a q × q identity matrix and A = diag(A_1, …, A_J) with A_g = v_g^{−1} 1_g 1_g′. Here v_g = |V_g|, 1 ≤ g ≤ J. Let b_g = (b_j, j ∈ V_g)′. We can write the objective function as

G(b; λ2) = (2n)^{-1} ||y − Xb||^2 + (1/2) λ2 ∑_{g=1}^J b_g′(I_g − v_g^{−1} 1_g 1_g′) b_g. (5.2)

For the Laplacian shrinkage estimator based on this criterion, we have the following grouping properties.

Proposition 2

  1. For any j, k ∈ Vg, 1 ≤ g ≤ J,
    λ2 |β̃_j − β̃_k| ≤ (1/n) ||x_j − x_k|| · ||y||, ∀ j, k ∈ V_g.
  2. Let β̄g be the average of the estimates in Vg. For any j ∈ Vg and k ∈ Vh, g ≠ h,
    λ2 |β̃_j − β̄_g − (β̃_k − β̄_h)| ≤ (1/n) ||x_j − x_k|| · ||y||, j ∈ V_g, k ∈ V_h.

This proposition characterizes the smoothing effect and grouping property of the Laplacian penalty in (5.2). Consider the case ||y||^2/n = O(1). Part (i) implies that, for j and k in the same neighborhood and λ2 > 0, the difference β̃_j − β̃_k → 0 if ||x_j − x_k||/(λ2 n^{1/2}) → 0. Part (ii) implies that, for j and k in different neighborhoods and λ2 > 0, the difference between the centered β̃_j and β̃_k converges to zero if ||x_j − x_k||/(λ2 n^{1/2}) → 0.

We now compare the Laplacian shrinkage and ridge shrinkage. The discussion at the end of Section 4 about the requirement for the unbiasedness of Laplacian can be put in a wider context when a general positive definite or semidefinite matrix Q is used in the place of L. This wider context includes the Laplacian shrinkage and ridge shrinkage as special cases. Specifically, let

β̂_Q(λ, γ) = argmin_b (2n)^{-1} ||y − Xb||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 b′Qb.

For Q = Ip, β̂Q becomes the Mnet estimator (Huang et al. 2010). With some modifications on the conditions in Theorem 1 or Theorem 2, it can be shown that β̂Q is equal to the oracle estimator defined as

β̂_Q^o(λ2) = argmin_b {(2n)^{-1} ||y − Xb||^2 + (1/2) λ2 b′Qb : b_j = 0, j ∉ O}.

Then, in a way similar to the discussion in Section 4, β̂_Q is nearly unbiased iff Q_O β_O^o = 0. Therefore, for ||β_O^o|| ≠ 0, Q_O must be a rank deficient matrix, which in turn implies that Q must be rank deficient. Note that any Laplacian L is rank deficient. This rank deficiency requirement excludes the ridge penalty with Q = I_p. For the ridge penalty to yield an unbiased estimator, it must hold that ||β^o|| = 0 in the underlying model.

We now give a simple example that illustrates the basic characteristics of Laplacian shrinkage and its differences from ridge shrinkage.

Example 5.1

Consider a linear regression model with two predictors satisfying ||xj||2 = n, j = 1, 2. The Laplacian shrinkage and ridge estimators are defined as

(b̂_{L1}(λ2), b̂_{L2}(λ2)) = argmin_{b1,b2} (2n)^{-1} ∑_{i=1}^n (y_i − x_{i1} b_1 − x_{i2} b_2)^2 + (1/2) λ2 (b_1 − b_2)^2,

and

(b̂_{R1}(λ2), b̂_{R2}(λ2)) = argmin_{b1,b2} (2n)^{-1} ∑_{i=1}^n (y_i − x_{i1} b_1 − x_{i2} b_2)^2 + (1/2) λ2 (b_1^2 + b_2^2).

Denote r1 = cor(x1, y), r2 = cor(x2, y) and r12 = cor(x1, x2). The Laplacian shrinkage estimates are

b̂_{L1}(λ2) = {(1 + λ2) r1 − (r12 − λ2) r2}/{(1 + λ2)^2 − (r12 − λ2)^2}, b̂_{L2}(λ2) = {(1 + λ2) r2 − (r12 − λ2) r1}/{(1 + λ2)^2 − (r12 − λ2)^2}.

Let

b̂_{ols1} = (r1 − r12 r2)/(1 − r12^2), b̂_{ols2} = (r2 − r12 r1)/(1 − r12^2), b̂_L(∞) = (r1 + r2)/(2(1 + r12)),

where (b̂_{ols1}, b̂_{ols2}) is the ordinary least squares (OLS) estimator for the bivariate regression, and b̂_L(∞) is the OLS estimator that assumes the two coefficients are equal, that is, it minimizes ∑_{i=1}^n (y_i − (x_{i1} + x_{i2}) b)^2. Let w_L = 2λ2/(1 − r12 + 2λ2). After some simple algebra, we have

b̂_{L1}(λ2) = (1 − w_L) b̂_{ols1} + w_L b̂_L(∞) and b̂_{L2}(λ2) = (1 − w_L) b̂_{ols2} + w_L b̂_L(∞).

Thus for any fixed λ2, b̂_L(λ2) is a weighted average of b̂_ols and b̂_L(∞), with the weights depending on λ2. When λ2 → ∞, b̂_{L1} → b̂_L(∞) and b̂_{L2} → b̂_L(∞). Therefore, the Laplacian penalty shrinks the OLS estimates towards a common value, which is the OLS estimate obtained by assuming equal regression coefficients.

Now consider the ridge regression estimator. We have

b̂_{R1}(λ2) = {(1 + λ2) r1 − r12 r2}/{(1 + λ2)^2 − r12^2} and b̂_{R2}(λ2) = {(1 + λ2) r2 − r12 r1}/{(1 + λ2)^2 − r12^2}.

The ridge estimator converges to zero as λ2 → ∞. For it to converge to a nontrivial solution, we need to rescale it by a factor of 1 + λ2. Let w_R = λ2/(1 + λ2 − r12^2). Let b̂_{u1} = r1 and b̂_{u2} = r2. Because n^{−1} ∑_{i=1}^n x_{i1}^2 = 1 and n^{−1} ∑_{i=1}^n x_{i2}^2 = 1, r1 and r2 are also the OLS estimators of the univariate regressions of y on x1 and y on x2, respectively. We can write

(1 + λ2) b̂_{R1}(λ2) = c_{λ2} (1 − w_R) b̂_{ols1} + c_{λ2} w_R b̂_{u1}, (1 + λ2) b̂_{R2}(λ2) = c_{λ2} (1 − w_R) b̂_{ols2} + c_{λ2} w_R b̂_{u2},

where c_{λ2} = {(1 + λ2)^2 − (1 + λ2) r12^2}/{(1 + λ2)^2 − r12^2}. Note that c_{λ2} ≈ 1. Thus (1 + λ2) b̂_R is a weighted average of the OLS and the univariate regression estimators. The ridge penalty shrinks the (rescaled) ridge estimates towards the individual univariate regression estimates.
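The closed forms in Example 5.1 can be verified numerically. The sketch below (our own illustration, with arbitrary values of r1, r2, r12 and λ2) solves the two 2 × 2 systems directly and checks the weighted-average representation of the Laplacian shrinkage estimate.

```python
import numpy as np

def two_predictor_shrinkage(r1, r2, r12, lam2):
    """Laplacian and ridge shrinkage estimates for the two-predictor model of Example 5.1."""
    # Laplacian shrinkage: normal equations of (1/2n)||y - x1 b1 - x2 b2||^2 + (1/2)lam2(b1 - b2)^2
    A_lap = np.array([[1 + lam2, r12 - lam2], [r12 - lam2, 1 + lam2]])
    b_lap = np.linalg.solve(A_lap, np.array([r1, r2]))
    # Ridge shrinkage: penalty (1/2)lam2(b1^2 + b2^2)
    A_ridge = np.array([[1 + lam2, r12], [r12, 1 + lam2]])
    b_ridge = np.linalg.solve(A_ridge, np.array([r1, r2]))
    return b_lap, b_ridge

r1, r2, r12, lam2 = 0.6, 0.4, 0.5, 2.0
b_lap, b_ridge = two_predictor_shrinkage(r1, r2, r12, lam2)
# Weighted-average representation of the Laplacian estimate:
b_ols = np.linalg.solve(np.array([[1, r12], [r12, 1]]), np.array([r1, r2]))
b_common = (r1 + r2) / (2 * (1 + r12))
w_L = 2 * lam2 / (1 - r12 + 2 * lam2)
print(np.allclose(b_lap, (1 - w_L) * b_ols + w_L * b_common))  # True
```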

6 Simulation studies

We use a coordinate descent algorithm to compute the SLS estimate. This algorithm optimizes a target function with respect to a single parameter at a time and iteratively cycles through all parameters until convergence. It was originally proposed for criteria with convex penalties such as the LASSO (Fu 1998; Genkin et al. 2004; Friedman et al. 2007; Wu and Lange 2007), and it has also been used to compute MCP estimates (Breheny and Huang 2011). Detailed steps of this algorithm for computing the SLS estimates can be found in the technical report accompanying this paper (Huang et al. 2010).
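A minimal sketch of one plausible coordinate descent cycle for criterion (2.6) is given below. The one-dimensional update is the standard firm-thresholding solution of the MCP-penalized scalar problem, with the Laplacian term absorbed into the quadratic part; it assumes standardized columns and a sufficiently convex coordinate-wise problem. The exact algorithm used in the paper is described in the accompanying technical report (Huang et al. 2010), and this sketch should not be read as that implementation.

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * max(abs(u) - lam, 0.0)

def sls_coordinate_descent(X, y, A_signed, lam1, lam2, gamma=3.0,
                           max_iter=1000, tol=1e-6):
    """One plausible coordinate descent scheme for the SLS criterion (2.6).
    Columns of X are assumed standardized so that x_j'x_j = n."""
    n, p = X.shape
    a = np.asarray(A_signed, dtype=float)          # signed adjacency entries, zero diagonal
    d = np.abs(a).sum(axis=1)                      # connectivities d_j
    b = np.zeros(p)
    r = y.copy()                                   # current residual y - Xb
    for _ in range(max_iter):
        b_old = b.copy()
        for j in range(p):
            # partial residual correlation plus the Laplacian pull toward connected coefficients
            z = X[:, j] @ r / n + b[j]
            u = z + lam2 * (a[j] @ b - a[j, j] * b[j])
            v = 1.0 + lam2 * d[j]
            if abs(u) <= v * gamma * lam1:         # MCP (firm) thresholding region
                b_new = soft_threshold(u, lam1) / (v - 1.0 / gamma)
            else:                                  # beyond gamma*lam1*v the penalty is flat
                b_new = u / v
            r += X[:, j] * (b[j] - b_new)          # keep the residual up to date
            b[j] = b_new
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b
```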

In simulation studies, we consider the following ways of defining the adjacency measure. (N.1) a_{jk} = I(r_{jk} > r) and s_{jk} = 1, where r is computed as described in Section 3 from the normal-scale cutoff c = 3.09, corresponding to a p-value of 10^{−3}; (N.2) a_{jk} = I(|r_{jk}| > r) and s_{jk} = sgn(r_{jk}), where r is computed from the cutoff c = 3.29, again corresponding to a p-value of 10^{−3}; (N.3) a_{jk} = max(0, r_{jk})^α and s_{jk} = 1, where we set α = 6, which satisfies the scale-free topology criterion (Zhang and Horvath 2005); (N.4) a_{jk} = |r_{jk}|^α and s_{jk} = sgn(r_{jk}), where we again set α = 6.

The penalty levels λ1 and λ2 are selected using V-fold cross validation. In our numerical study, we set V = 5. To reduce the computational cost, we search over the discrete grid {2^s : s = …, −1, −0.5, 0, 0.5, …}. For comparison, we also consider the MCP estimate and the approach proposed in Daye and Jeng (2009; referred to as D-J hereafter). Both the SLS and the MCP involve the regularization parameter γ. For the MCP, Zhang (2010) suggested using γ = 2/(1 − max_{j≠k} |x_j′x_k|/n) for standardized covariates; the average γ value of this choice is 2.69 in his simulation studies. The simulation studies in Breheny and Huang (2011) suggest that γ = 3 is a reasonable choice. We have experimented with different γ values and reached the same conclusion. Therefore, we set γ = 3.
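A hedged sketch of the five-fold cross-validation grid search described above is given below; it reuses sls_coordinate_descent from the earlier sketch, and the specific exponent range and fold-splitting details are illustrative (in practice the adjacency matrix could also be recomputed within each training fold).

```python
import numpy as np

def cv_select(X, y, A_signed, gamma=3.0, V=5, seed=0):
    """Select (lam1, lam2) by V-fold cross-validation over a grid of the form 2^{...,-1,-0.5,0,0.5,...}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = rng.permutation(n) % V
    grid = 2.0 ** np.arange(-3.0, 3.5, 0.5)        # illustrative range of exponents
    best = (np.inf, None)
    for lam1 in grid:
        for lam2 in grid:
            err = 0.0
            for v in range(V):
                tr, te = folds != v, folds == v
                b = sls_coordinate_descent(X[tr], y[tr], A_signed, lam1, lam2, gamma)
                err += np.sum((y[te] - X[te] @ b) ** 2)
            if err < best[0]:
                best = (err, (lam1, lam2))
    return best[1]
```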

We set n = 100 and p = 500. Among the 500 covariates, there are 100 clusters, each of size 5. We consider two different correlation structures. (I) Covariates in different clusters are independent, whereas covariates i and j within the same cluster have correlation coefficient ρ^{|i−j|}; and (II) covariates i and j have correlation coefficient ρ^{|i−j|}. Under structure I, zero and nonzero effects are independent, whereas under structure II, they are correlated. Covariates have marginal normal distributions with mean zero and variance one. We consider different levels of correlation with ρ = 0.1, 0.5, 0.9. Among the 500 covariates, the first 25 (5 clusters) have nonzero regression coefficients. We consider the following scenarios for the nonzero coefficients: (a) all the nonzero coefficients are equal to 0.5; and (b) the nonzero coefficients are randomly generated from the uniform distribution on [0.25, 0.75]. In (a), the Laplacian matrices satisfy the unbiasedness property L_O β_O^o = 0 discussed in Section 4. We have experimented with other levels of nonzero regression coefficients and reached similar conclusions.

We examine the accuracy of identifying nonzero covariate effects and the prediction performance. For this purpose, for each simulated dataset, we simulate an independent testing dataset with sample size 100. We conduct cross validation (for tuning parameter selection) and estimation using the training set only. We then make prediction for subjects in the testing set and compute the PMSE (prediction mean squared error).

We simulate 500 replicates and present the summary statistics in Table 1. We can see that the MCP performs satisfactorily when the correlation is small. However, when the correlation is high, it may miss a considerable number of true positives and have large prediction errors. The D-J approach, which can also accommodate the correlation structure, is able to identify all the true positives. However, it also identifies a large number of false positives, caused by the over-selection of the Lasso penalty. The proposed SLS approach outperforms the MCP and D-J methods in the sense that it has smaller empirical false discovery rates with comparable false negative rates. It also has significantly smaller prediction errors.

Table 1.

Simulation study: median based on 500 replicates. In each cell, the three numbers are positive findings, true positives, and PMSE ×100, respectively.

Coefficient | ρ | MCP | D-J N.1 | D-J N.2 | D-J N.3 | D-J N.4 | SLS N.1 | SLS N.2 | SLS N.3 | SLS N.4
Correlation structure I
0.5 | 0.1 | 27, 25, 41.33 | 61, 25, 125.34 | 53, 25, 46.64 | 55, 25, 60.14 | 59, 25, 51.24 | 27, 25, 40.53 | 27, 25, 39.84 | 26, 25, 41.74 | 27, 25, 39.34
0.5 | 0.5 | 28, 25, 54.10 | 51, 25, 66.38 | 67, 25, 66.84 | 72, 25, 56.22 | 63, 25, 53.43 | 27, 25, 37.71 | 28, 25, 39.18 | 28, 25, 33.87 | 27, 25, 36.00
0.5 | 0.9 | 22, 15, 137.52 | 66, 25, 55.51 | 55, 25, 56.94 | 61, 25, 49.22 | 74, 25, 51.41 | 29, 25, 48.89 | 28, 25, 49.96 | 29, 25, 45.16 | 27, 25, 41.49
U[.25, .75] | 0.1 | 37, 25, 52.24 | 72, 25, 54.28 | 61, 25, 88.00 | 59, 25, 70.00 | 78, 25, 60.51 | 33, 25, 51.80 | 36, 25, 52.19 | 30, 25, 53.03 | 30, 25, 52.22
U[.25, .75] | 0.5 | 29, 24, 65.12 | 66, 25, 78.76 | 54, 25, 72.34 | 63, 25, 63.55 | 57, 25, 66.33 | 28, 25, 42.24 | 28, 25, 43.96 | 27, 24, 54.72 | 28, 24, 58.77
U[.25, .75] | 0.9 | 17, 13, 152.42 | 67, 25, 63.43 | 62, 25, 57.30 | 50, 25, 53.88 | 74, 25, 57.98 | 29, 25, 47.73 | 29, 25, 49.14 | 27, 25, 48.49 | 28, 25, 50.83
Correlation structure II
0.5 | 0.1 | 26, 25, 38.22 | 62, 25, 121.69 | 58, 25, 117.10 | 63, 25, 127.34 | 72, 25, 122.34 | 27, 25, 40.33 | 27, 25, 40.65 | 27, 25, 41.49 | 27, 25, 37.40
0.5 | 0.5 | 29, 25, 53.01 | 52, 25, 55.99 | 49, 25, 62.04 | 66, 25, 62.70 | 65, 25, 64.41 | 27, 25, 36.97 | 28, 25, 39.47 | 28, 25, 38.53 | 27, 25, 39.53
0.5 | 0.9 | 15, 13, 140.69 | 48, 25, 55.75 | 34, 25, 56.71 | 32, 25, 60.27 | 38, 25, 59.78 | 29, 25, 66.79 | 29, 25, 60.52 | 29, 25, 57.91 | 30, 25, 60.19
U[.25, .75] | 0.1 | 37, 25, 54.31 | 77, 25, 60.02 | 72, 25, 66.14 | 74, 25, 78.32 | 66, 25, 74.50 | 29, 25, 50.05 | 32, 25, 51.34 | 37, 25, 50.74 | 29, 25, 49.47
U[.25, .75] | 0.5 | 27, 24, 57.66 | 74, 25, 61.71 | 66, 25, 67.54 | 75, 25, 62.01 | 74, 25, 66.91 | 28, 25, 44.92 | 28, 25, 46.65 | 28, 25, 41.35 | 28, 25, 41.17
U[.25, .75] | 0.9 | 14, 13, 136.49 | 33, 25, 61.50 | 35, 25, 55.08 | 34, 25, 54.54 | 38, 25, 60.67 | 29, 25, 56.87 | 29, 25, 57.03 | 30, 25, 53.28 | 30, 25, 56.79

6.1 Application to a microarray study

In the study reported in Scheetz et al. (2006), F1 animals were intercrossed and 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and microarray analysis using the Affymetrix GeneChip Rat Genome 230 2.0 Array. The intensity values were normalized using the RMA (robust multi-chip averaging; Bolstad 2003, Irizarry 2003) method to obtain summary expression values for each probe set. Gene expression levels were analyzed on a logarithmic scale. For the probe sets on the array, we first excluded those that were not expressed in the eye or that lacked sufficient variation. The definition of expressed was based on the empirical distribution of the RMA normalized values. For a probe set to be considered expressed, the maximum expression value observed for that probe among the 120 F2 rats was required to be greater than the 25th percentile of the entire set of RMA expression values. For a probe to be considered "sufficiently variable," it had to exhibit at least 2-fold variation in expression level among the 120 F2 animals.

We are interested in finding the genes whose expression is most variable and correlated with that of gene TRIM32. This gene was recently found to cause Bardet–Biedl syndrome (Chiang et al. 2006), which is a genetically heterogeneous disease of multiple organ systems, including the retina. One approach to finding the genes related to TRIM32 is to use regression analysis. Since it is expected that the number of genes associated with TRIM32 is small, and since we are mainly interested in genes whose expression values across samples are most variable, we conduct the following initial screening. We compute the variances of the gene expressions and select the top 1,000. We then standardize the gene expressions to have zero mean and unit variance.

We analyze the data using the MCP, D-J, and the proposed approach. In cross validation, we set V = 5. The numbers of genes identified are MCP: 23; D-J: 31 (N.1), 41 (N.2), 34 (N.3), 30 (N.4); SLS: 25 (N.1), 26 (N.2), 16 (N.3) and 17 (N.4). More detailed results are available from the authors. Different approaches and different ways of defining the adjacency measure lead to the identification of different genes. As expected, the SLS identifies shorter lists of genes than the D-J, which may lead to more parsimonious models and more focused hypotheses for confirmation. As the proposed approach pays special attention to the correlation among genes, we also compute the median of the absolute values of the correlations among the identified genes, which are MCP: 0.171; D-J: 0.201 (N.1), 0.207 (N.2), 0.215 (N.3), 0.206 (N.4); SLS: 0.247 (N.1), 0.208 (N.2), 0.228 (N.3), 0.212 (N.4). The D-J and SLS, which incorporate correlation in the penalty, identify genes that are more strongly correlated than those identified by the MCP. The SLS-identified genes have slightly higher correlations than those identified by the D-J.

Unlike in the simulation study, we are not able to evaluate true and false positives. This limitation is shared by most existing studies. We use the following V-fold (V = 5) cross-validation-based approach to evaluate prediction: (a) randomly split the data into V subsets of equal size; (b) remove one subset from the data; (c) conduct cross validation and estimation using the remaining V − 1 subsets; (d) make predictions for the removed subset; (e) repeat steps (b)–(d) over all subsets and compute the prediction error. The sums of squared prediction errors are MCP: 1.876; D-J: 1.951 (N.1), 1.694 (N.2), 1.534 (N.3) and 1.528 (N.4); SLS: 1.842 (N.1), 1.687 (N.2), 1.378 (N.3) and 1.441 (N.4). The SLS has smaller cross-validated prediction errors, which may indirectly suggest better selection properties.
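Steps (a)-(e) amount to an outer cross-validation loop with tuning done inside each training portion. A minimal sketch, reusing cv_select and sls_coordinate_descent from the sketches in Section 6 (the fold-splitting details are illustrative):

```python
import numpy as np

def cv_prediction_error(X, y, A_signed, gamma=3.0, V=5, seed=1):
    """Outer V-fold split for prediction error; (lam1, lam2) are tuned
    within each training portion by an inner cross-validation."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = rng.permutation(n) % V
    sse = 0.0
    for v in range(V):
        tr, te = folds != v, folds == v
        lam1, lam2 = cv_select(X[tr], y[tr], A_signed, gamma=gamma, V=V)
        b = sls_coordinate_descent(X[tr], y[tr], A_signed, lam1, lam2, gamma)
        sse += np.sum((y[te] - X[te] @ b) ** 2)
    return sse
```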

7 Discussion

In this article, we propose the SLS method for variable selection and estimation in high-dimensional data analysis. The most important feature of the SLS is that it explicitly incorporates the graph/network structure in predictors into the variable selection procedure through the Laplacian quadratic. It provides a systematic framework for connecting penalized methods for consistent variable selection and those for network and correlation analysis. As can be seen from the methodological development, the application of the SLS variable selection is relatively independent of the graph/network construction. Thus, although graph/network construction is of significant importance, it is not the focus of this study and not thoroughly pursued.

An important feature of the SLS method is that it incorporates the correlation patterns of the predictors into variable selection through the Laplacian quadratic. We have considered two simple approaches for determining the Laplacian based on dissimilarity and similarity measures. Our simulation studies demonstrate that incorporating correlation patterns improves selection results and prediction performance. Our theoretical results on the selection properties of the SLS are applicable to a general class of Laplacians and do not require the underlying graph for the predictors to be correctly specified.

We provide sufficient conditions under which the SLS estimator possesses an oracle property, meaning that it is sign consistent and equal to the oracle Laplacian shrinkage estimator with high probability. We also study the grouping properties of the SLS estimator. Our results show that the SLS is adaptive to the sparseness of the original p-dimensional model with pn and the denseness of the underlying do-dimensional model, where do < n is the number of nonzero coefficients. The asymptotic rates of the penalty parameters are derived. However, as in many recent studies, it is not clear whether the penalty parameters selected using cross validation or other procedures can match the asymptotic rate. This is an important and challenging problem that requires further investigation, but is beyond the scope of the current paper. Our numerical study shows a satisfactory finite-sample performance of the SLS. Particularly, we note that the cross validation selected tuning parameters seem sufficient for our simulated data. We are only able to experiment with four different adjacency measures. It is not our intention to draw conclusions on different ways of defining adjacency. More adjacency measures are hence not explored.

We have focused on the linear regression model in this article. However, the SLS method can be extended to more general regression models. Specifically, the SLS criterion can be formulated as

(1/n) ∑_{i=1}^n ℓ(y_i, b_0 + ∑_j x_{ij} b_j) + ∑_{j=1}^p ρ(b_j; λ1, γ) + (1/2) λ2 ∑_{1≤j<k≤p} |a_{jk}| (b_j − s_{jk} b_k)^2,

where ℓ is a given loss function. For instance, for generalized linear models such as logistic regression, we can take ℓ to be the negative log-likelihood function. For Cox regression, we can use the negative partial likelihood as the loss function. Computationally, for loss functions other than least squares, the coordinate descent algorithm can be applied iteratively to quadratic approximations to the loss function. However, further work is needed to study theoretical properties of the SLS estimators for general linear models.
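For instance, with the logistic negative log-likelihood as ℓ, the penalized objective can be written down directly; the sketch below (our own illustration, reusing mcp_penalty and laplacian_from_signed_adjacency from Section 2) only evaluates the criterion, with the optimization left to a quadratic-approximation coordinate descent as noted above.

```python
import numpy as np

def sls_logistic_objective(b0, b, X, y, A_signed, lam1, lam2, gamma):
    """SLS criterion with the logistic negative log-likelihood as the loss; y is coded 0/1."""
    eta = b0 + X @ b
    # negative log-likelihood of the logistic model, averaged over observations:
    # log(1 + exp(eta)) - y*eta, computed stably via logaddexp
    nll = np.mean(np.logaddexp(0.0, eta) - y * eta)
    L = laplacian_from_signed_adjacency(A_signed)
    return nll + np.sum(mcp_penalty(b, lam1, gamma)) + 0.5 * lam2 * b @ L @ b
```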

There is a large literature on the analysis of network data and much work has also been done on estimating sparse covariance matrices in high-dimensional settings. See for example, Zhang and Horvath (2005), Chung and Lu (2006), Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Friedman, Hastie and Tibshirani (2008), Fan, Feng and Wu (2009), among others. It would be useful to study ways to incorporate these methods and results into the proposed SLS approach. In some problems such as genomic data analysis, partial external information may also be available on the graphical structure of some genes used as predictors in the model. It would be interesting to consider approaches for combining external information on the graphical structure with existing data in constructing the Laplacian quadratic penalty.

Acknowledgments

We wish to thank two anonymous referees, the associate editor and editor for their helpful comments which led to considerable improvements in the presentation of the paper. The research of Huang is partially supported by NIH grants R01CA120988, R01CA142774 and NSF grant DMS 0805670. The research of Ma is partially supported by NIH grants R01CA120988, R01CA142774, R03LM009754 and R03LM009828. The research of Li is partially supported by NIH grants R01ES009911 and R01CA127334. The research of Zhang is partially supported by NSF grants DMS 0604571, DMS 0804626 and NSA grant MDS 904-02-1-0063.

8 Appendix

In the appendix, we give proofs of Theorems 1 and 2 and Propositions 1 and 2.

Proof of Theorem 1

Since cmin(λ2) > 1/γ, the criterion (2.2) is strictly convex and its minimizer is unique. Let X̃ = X̃(λ2) = √n (Σ + λ2 L)^{1/2}, ỹ = ỹ(λ2) = (X̃′)^{−1} X′y and

M̃(b; λ, γ) = (2n)^{−1} ||ỹ − X̃b||^2 + ∑_{j=1}^p ρ(b_j; λ1, γ).

Since (X̃′X̃/n, X̃′ỹ) = (Σ + λ2 L, X′y), M(b; λ, γ) − M̃(b; λ, γ) = (||y||^2 − ||ỹ||^2)/(2n) does not depend on b. Thus, β̂ is the minimizer of M̃(b; λ, γ).

Since |β̂_j^o| ≥ γλ1 gives ρ̇(|β̂_j^o|; λ1, γ) = 0, the KKT conditions hold for M̃(b; λ, γ) at β̂(λ) = β̂^o(λ) in the intersection of the events

Ω1 = {||X̃_{O^c}′(ỹ − X̃β̂^o)/n||_∞ ≤ λ1}, Ω2 = {min_{j∈O} sgn(β*_j) β̂_j^o ≥ γλ1}. (8.1)

Let ε̃* = ỹ − X̃β* = ε̃ + Eε̃*, with ε̃ = ỹ − Eỹ. Since X̃′ỹ = X′y and both β^o and β* are supported in O,

X̃_B′Eε̃*/n = X_B′Xβ^o/n − X̃_B′X̃β*/n = Σ_{B,O} β_O^o − Σ_{B,O}(λ2) Σ_O^{−1}(λ2) Σ_O β_O^o = λ2 {Σ_{B,O}(λ2) Σ_O^{−1}(λ2) L_O − L_{B,O}} β_O^o, (8.2)

which describes the effect of the bias of β̂^o on the gradient in the linear model ỹ = X̃β* + ε̃*. Since X̃_O′Eε̃*/n = 0, we have ||X̃_{O^c}′Eε̃*/n|| = λ2 C2.

Since X̃′ε̃ = X̃′ỹ − E X̃′ỹ = X′y − E X′y = X′ε, (8.2) gives

Ω1 ⊇ {||X_{O^c}′ε/n||_∞ < λ1 − λ2 C2}. (8.3)

The estimator β̂_O^o = Σ_O^{−1}(λ2) X_O′y/n can be written as β*_O + ((v_j/n)^{1/2} u_j′ε, j ∈ O)′, where ||u_j|| = 1 and {v_j, j ∈ O} are the diagonal elements of Σ_O^{−1}(λ2) Σ_O {Σ_O^{−1}(λ2)}′. Thus,

Ω2^c ⊆ ⋃_{j∈O} {sgn(β*_j) u_j′ε ≤ (n/v_j)^{1/2} (γλ1 − |β*_j|) ≤ −σ√(2 log(do/ε))}. (8.4)

Since λ1 ≥ λ2 C2 + σ√(2 log((p − do)/ε)) max_{j≤p} ||x_j||/n, the sub-Gaussian condition (A) yields

1 − P{Ω1 ∩ Ω2} ≤ P{||X_{O^c}′ε/n||_∞ > σ√(2 log((p − do)/ε)) max_{j≤p} ||x_j||/n} + ∑_{j∈O} P{sgn(β*_j) u_j′ε ≤ −σ√(2 log(do/ε))} ≤ 2(p − do) ε/(p − do) + do ε/do = 3ε.

The proof of (4.5) is complete, since β̂_j^o ≠ 0 for all j ∈ O in Ω2.

For the proof of (4.6), we have ||β*_O − β_O^o|| = λ2 C1 due to

β*_O − β_O^o = Σ_O^{−1}(λ2) Σ_O β_O^o − β_O^o = −λ2 Σ_O^{−1}(λ2) L_O β_O^o. (8.5)

It follows that the condition on β_* implies Condition (B) (iii), with sgn(β*_O) = sgn(β_O^o) = sgn(β̂_O^o) in Ω2.

Proof of Theorem 2

For m ≥ 1 and vectors v in the range of X̃, define

ζ̃(v; m, O, λ2) = max{||(P̃_B − P̃_O)v|| (mn)^{−1/2} : O ⊂ B ⊂ {1, …, p}, |B| = m + |O|}, (8.6)

where P̃_B = X̃_B(X̃_B′X̃_B)^{−1}X̃_B′. Here ζ̃ depends on λ2 through X̃. Since β̂(λ) is the MC+ estimator based on the data (X̃, ỹ) at penalty level λ1 and (4.9) holds for Σ(λ2) = X̃′X̃/n, the proof of Theorem 5 in Zhang (2010) gives β̂(λ) = β̂^o(λ) in the event Ω = ∩_{j=1}^3 Ω_j, where Ω1 = {||X̃_{O^c}′(ỹ − X̃β̂^o)/n||_∞ ≤ λ1} is as in (8.1) and

Ω2 = {min_{j∈O} sgn(β*_j) β̂_j^o > 2γ c^*(λ2) λ1}, Ω3 = {ζ̃(y − Xβ*; d* − |O|, O, λ2) ≤ λ1}.

Note that (λ1,ε, λ2,ε, λ3,ε, α) in Zhang (2010) is identified with (λ1, 2c^*(λ2)λ1, λ1, 1/2) here.

Let ε̃* = ỹ − X̃β* = ε̃ + Eε̃*, with ε̃ = ỹ − Eỹ. Since X̃′ỹ = X′y, (8.2) still holds, now with ||X̃_B′Eε̃*/n||_∞ ≤ λ2 C̃2. Since X̃′ε̃ = X′y − E X′y = X′ε, (8.2) still gives (8.3). A slight modification of the argument for (8.4) yields

Ω2^c ⊆ ⋃_{j∈O} {sgn(β*_j) u_j′ε ≤ (n/v_j)^{1/2} (2γ c^*(λ2) λ1 − |β*_j|) ≤ −σ√(2 log(do/ε))}. (8.7)

For |B| ≤ d*, we have ||P̃_B Eε̃*||/√n = ||Σ_B^{−1/2}(λ2) X̃_B′Eε̃*||/n ≤ ||X̃_B′Eε̃*/n||_∞ √(|B|/c_*(λ2)) and ||P̃_B ε̃||/√n = ||Σ_B^{−1/2}(λ2) X̃_B′ε̃||/n ≤ ||X̃_B′ε̃/n||_∞ √(|B|/c_*(λ2)). Thus, by (8.6),

ζ̃(y − Xβ*; d* − |O|, O, λ2) = ζ̃(ε̃ + Eε̃*; d* − |O|, O, λ2) ≤ (||X̃′ε̃/n||_∞ + λ2 C̃2) √(d*/((d* − |O|) c_*(λ2))).

Since |O| ≤ d*/(K_* + 1), this gives

Ω3 ⊇ {||X̃′ε̃/n||_∞ < √(c_*(λ2) K_*/(K_* + 1)) λ1 − λ2 C̃2}. (8.8)

Since max{1, √(c_*(λ2) K_*/(K_* + 1))} λ1 ≥ λ2 C̃2 + σ√(2 log(p/ε)) max_{j≤p} ||x_j||/n, (8.3), (8.7), (8.8) and Condition (A) imply

1 − P{Ω1 ∩ Ω3} + P{Ω2^c} ≤ P{||X′ε/n||_∞ > σ√(2 log(p/ε)) max_{j≤p} ||x_j||/n} + ∑_{j∈O} P{sgn(β*_j) u_j′ε ≤ −σ√(2 log(do/ε))} ≤ 2p(ε/p) + do ε/do = 3ε.

The proof of (4.10) is complete, since β̂_j^o ≠ 0 for all j ∈ O in Ω2. We omit the proof of (4.11), since it is identical to that of (4.6).

Proof of Proposition 1

The β̃ satisfies

−(1/n) x_j′(y − Xβ̃) + λ2 (d_j β̃_j − a_j′β̃) = 0, 1 ≤ j ≤ q. (8.9)

Therefore, by the Cauchy–Schwarz inequality and using ||x_j||^2 = n, we have

λ2 max_{1≤j≤q} |d_j β̃_j − a_j′β̃| ≤ (1/n) max_{1≤j≤q} |x_j′(y − Xβ̃)| ≤ n^{−1/2} ||r̃||.

Now because G(β̃; λ2) ≤ G(0; λ2), we have ||r̃|| ≤ ||y||. This proves part (i).

For part (ii), note that we have

λ2 (d_j β̃_j − a_j′β̃ − (d_k β̃_k − a_k′β̃)) = (1/n) (x_j − x_k)′ r̃.

Thus

λ2 |d_j β̃_j − a_j′β̃ − (d_k β̃_k − a_k′β̃)| ≤ (1/n) ||x_j − x_k|| ||r̃||.

Part (ii) follows.

Proof of Proposition 2

The β̃ must satisfy

−(1/n) x_j′(y − Xβ̃) + λ2 (β̃_j − v_g^{−1} 1_g′β̃_g) = 0, j ∈ V_g, 1 ≤ g ≤ J. (8.10)

Taking the difference between the jth and kth equations in (8.10) for j, k ∈ V_g, we get

λ2 (β̃_j − β̃_k) = (1/n) (x_j − x_k)′(y − Xβ̃), j, k ∈ V_g.

Therefore,

λ2 |β̃_j − β̃_k| ≤ (1/n) ||x_j − x_k|| · ||y − Xβ̃||, j, k ∈ V_g.

Part (i) follows from this inequality.

Define β̄_g = v_g^{−1} 1_g′β̃_g. This is the average of the elements of β̃_g. For any j ∈ V_g and k ∈ V_h, g ≠ h, we have

λ2 (β̃_j − β̄_g − (β̃_k − β̄_h)) = (1/n) (x_j − x_k)′(y − Xβ̃), j ∈ V_g, k ∈ V_h.

Thus part (ii) follows. This completes the proof of Proposition 2.

Contributor Information

Jian Huang, Email: jian-huang@uiowa.edu.

Shuangge Ma, Email: shuangge.ma@yale.edu.

Hongzhe Li, Email: hongzhe@upenn.edu.

Cun-Hui Zhang, Email: cunhui@stat.rutgers.edu.

References

  • 1. Bondell HD, Reich BJ. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008;64:115–123.
  • 2. Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression methods. Ann Appl Statist. 2011;5:232–253.
  • 3. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J Sci Comput. 1998;20:33–61.
  • 4. Chiang AP, Beck JS, Yen HJ, Tayeh MK, Scheetz TE, Swiderski R, Nishimura D, Braun TA, Kim KY, Huang J, Elbedour K, Carmi R, Slusarski DC, Casavant TL, Stone EM, Sheffield VC. Homozygosity mapping with SNP arrays identifies a novel gene for Bardet–Biedl syndrome (BBS10). Proceedings of the National Academy of Sciences. 2006;103:6287–6292.
  • 5. Chung FRK. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics. Amer Math Soc; 1997.
  • 6. Chung FRK, Lu L. Complex Graphs and Networks. CBMS Regional Conference Series in Mathematics. Amer Math Soc; 2006.
  • 7. Daye JZ, Jeng JX. Shrinkage and model selection with correlated variables via weighted fusion. Computational Statistics and Data Analysis. 2009;53:1284–1298.
  • 8. Fan J. Comments on "Wavelets in statistics: a review" by A. Antoniadis. J Italian Statist Assoc. 1997;6:131–138.
  • 9. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  • 10. Fan J, Feng Y, Wu Y. Network exploration via the adaptive LASSO and SCAD penalties. Ann Appl Statist. 2009;3:521–541.
  • 11. Frank IE, Friedman JH. A statistical view of some chemometrics regression tools (with discussion). Technometrics. 1993;35:109–148.
  • 12. Friedman J, Hastie T, Hoefling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Statist. 2007;1:302–332.
  • 13. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441.
  • 14. Fu WJ. Penalized regressions: the bridge versus the LASSO. J Comp Graph Statist. 1998;7:397–416.
  • 15. Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technical Report. DIMACS, Rutgers University; 2004.
  • 16. Hebiri M, van de Geer S. The smooth-Lasso and other ℓ1 + ℓ2-penalized methods. 2010. Preprint. Available at http://arxiv4.library.cornell.edu/PScache/arxiv/pdf/1003/1003.4885v1.pdf.
  • 17. Huang J, Breheny P, Ma S, Zhang C-H. The Mnet method for variable selection. Technical Report #402. Department of Statistics and Actuarial Science, University of Iowa; 2010.
  • 18. Huang J, Ma S, Li H, Zhang C-H. The Sparse Laplacian Shrinkage Estimator for High-Dimensional Regression. Technical Report #403. Department of Statistics and Actuarial Science, University of Iowa; 2010.
  • 19. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264.
  • 20. Jia J, Yu B. On model selection consistency of elastic net when p ≫ n. Statistica Sinica. 2010;20:595–611.
  • 21. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–1182.
  • 22. Li C, Li H. Variable selection and regression analysis for covariates with graphical structure. Ann Appl Statist. 2010;4:1498–1516.
  • 23. Mazumder R, Friedman J, Hastie T. SparseNet: Coordinate descent with non-convex penalties. Technical Report. Department of Statistics, Stanford University; 2009.
  • 24. Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34:1436–1462.
  • 25. Pan W, Xie B, Shen X. Incorporating predictor network in penalized regression with application to microarray data. Biometrics. 2009. In press.
  • 26. Scheetz TE, Kim KYA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434.
  • 27. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Statist Soc B. 1996;58:267–288.
  • 28. Tutz G, Ulbricht J. Penalized regression with correlation-based penalty. Statist Comput. 2009;19:239–253.
  • 29. Wu T, Lange K. Coordinate descent procedures for lasso penalized regression. Ann Appl Statist. 2007;2:224–244.
  • 30. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Statist Soc B. 2006;68:49–67.
  • 31. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
  • 32. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statist Appl Genet Mol Biol. 2005;4: Article 17.
  • 33. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942.
  • 34. Zhang CH, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann Statist. 2008;36:1567–1594.
  • 35. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Statist Soc B. 2005;67:301–320.
  • 36. Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Ann Statist. 2009;37:1733–1751.
