Abstract
This paper explores the following question: what kind of statistical guarantees can be given when doing variable selection in high dimensional models? In particular, we look at the error rates and power of some multi-stage regression methods. In the first stage we fit a set of candidate models. In the second stage we select one model by cross-validation. In the third stage we use hypothesis testing to eliminate some variables. We refer to the first two stages as “screening” and the last stage as “cleaning.” We consider three screening methods: the lasso, marginal regression, and forward stepwise regression. Our method gives consistent variable selection under certain conditions.
Keywords: Lasso, Stepwise Regression, Sparsity
1. Introduction
Several methods have been developed lately for high dimensional linear regression such as the lasso (Tibshirani 1996), Lars (Efron et al. 2004) and boosting (Bühlmann 2006). There are at least two different goals when using these methods. The first is to find models with good prediction error. The second is to estimate the true “sparsity pattern,” that is, the set of covariates with nonzero regression coefficients. These goals are quite different and this paper will deal with the second goal. (Some discussion of prediction is in the appendix.) Other papers on this topic include Meinshausen and Bühlmann (2006), Candes and Tao (2007), Wainwright (2006), Zhao and Yu (2006), Zou (2006), Fan and Lv (2008), Meinshausen and Yu (2008), Tropp (2004, 2006), Donoho (2006) and Zhang and Huang (2006). In particular, the current paper builds on ideas in Meinshausen and Yu (2008) and Meinshausen (2007).
Let (X1, Y1),…,(Xn, Yn) be iid observations from the regression model
(1) Yi = XiTβ + εi,  i = 1, …, n,
where εi ~ N(0, σ2), Xi = (Xi1,…, Xip)T ∈ ℝp and p = pn > n. Let X be the n × p design matrix with jth column X•j = (X1j,…, Xnj)T and let Y = (Y1,…, Yn)T. Let
D = {j: βj ≠ 0}
be the set of covariates with nonzero regression coefficients. Without loss of generality, assume that D = {1,…, s} for some s. A variable selection procedure D̂ n maps the data into subsets of S = {1,…, p}.
The main goal of this paper is to derive a procedure D̂ n such that
(2) lim supn→∞ ℙ(D̂ n ⊄ D) ≤ α,
that is, the asymptotic type I error is no more than α. Note that throughout the paper we use ⊂ to denote non-strict set-inclusion. Moreover, we want D̂ n to have nontrivial power. Meinshausen and Bühlmann (2006) control a different error measure. Their method guarantees lim supn→∞ ℙ(D̂ n ∩V ≠∅) ≤ α where V is the set of variables not connected to Y by any path in an undirected graph.
Our procedure involves three stages. In stage I we fit a suite of candidate models, each depending on a tuning parameter λ,
Ŝ n(λ) ⊂ S,  λ ∈ Λn.
In stage II we select one of those models, Ŝ n = Ŝ n(λ̂ ), using cross-validation to choose λ̂ . In stage III we eliminate some variables by hypothesis testing. Schematically:
screen: {Ŝ n(λ): λ ∈ Λn}  →  cross-validate: Ŝ n = Ŝ n(λ̂ )  →  clean: D̂ n ⊂ Ŝ n.
Genetic epidemiology provides a natural setting for applying screen and clean. Typically the number of subjects, n, is in the thousands, while p ranges from tens of thousands to hundreds of thousands of genetic features. The number of genes exhibiting a detectable association with a trait is extremely small. Indeed, for Type I diabetes only ten genes have exhibited a reproducible signal (Wellcome Trust 2007). Hence it is natural to assume that the true model is sparse. A common experimental design involves a 2-stage sampling of data, with stages 1 and 2 corresponding to the screening and cleaning processes, respectively.
In stage 1 of a genetic association study, n1 subjects are sampled and one or more traits such as bone mineral density are recorded. Each subject is also measured at p locations on the chromosomes. These genetic covariates usually have two forms in the population due to variability at a single nucleotide and hence are called single nucleotide polymorphisms (SNPs). The distinct forms are called alleles. Each covariate takes on a value (0, 1 or 2) indicating the number of copies of the less common allele observed. For a well designed genetic study, individual SNPs are nearly uncorrelated unless they are physically located in very close proximity. This feature makes it much easier to draw causal inferences about the relationship between SNPs and quantitative traits. It is standard in the field to infer that an association discovered between a SNP and a quantitative trait implies a causal genetic variant is physically located near the one exhibiting association. In stage 2, n2 subjects are sampled at a subset of the SNPs assessed in stage 1. SNPs measured in stage 2 are often those that achieved a test statistic that exceeded a predetermined threshold of significance in stage 1. In essence, the two stage design pairs naturally with a screen and clean procedure.
For the screen and clean procedure it is essential that Ŝ n has two properties as n → ∞:
(3) ℙ(D ⊂ Ŝ n) → 1
and
(4) |Ŝ n| = oP(n),
where |M| denotes the number of elements in a set M. Condition (3) ensures the validity of the test in stage III while condition (4) ensures that the power of the test is not too small. Without condition (3), the hypothesis test in stage III would be biased. We will see that the power goes to 1, so taking α = αn → 0 implies consistency: ℙ(D̂ n = D) → 1. For fixed α, the method also produces a confidence sandwich for D, namely,
lim infn→∞ ℙ(D̂ n ⊂ D ⊂ Ŝ n) ≥ 1 − α.
To fit the suite of candidate models, we consider three methods. In Method 1,
Ŝ n(λ) = {j: β̃j(λ) ≠ 0},
where β̃(λ) is the lasso estimator, the value of β that minimizes
Σi (Yi − XiTβ)2 + λ Σj |βj|.
In Method 2, take Ŝ n(λ) to be the set of variables chosen by forward stepwise regression after λ steps. In Method 3, marginal regression, we take
Ŝ n(λ) = {j: |μ̂ j| ≥ λ},
where μ̂ j is the marginal regression coefficient from regressing Y on Xj. (This is equivalent to ordering by the absolute t-statistics since we will assume that the covariates are standardized.) These three methods are very similar to basis pursuit, orthogonal matching pursuit and thresholding; see, for example, Tropp (2004, 2006) and Donoho (2006).
Notation
Let ψ = minj∈D|βj|. Define the loss of any estimator β̂ by
(5) L(β̂ ) = (β̂ − β)T Σ̂ n (β̂ − β),
where Σ̂ n = n−1XT X. For convenience, when β̂ ≡ β̂ (λ) depends on λ we write L(λ) instead of L(β̂ (λ)). If M ⊂ S, let XM be the design matrix with columns (X•j: j ∈ M) and let β̂ M = (XMT XM)−1 XMT Y denote the least squares estimator, assuming it is well-defined. Note that our use of X•j differs from standard ANOVA notation. Write Xλ instead of XM when M = Ŝ n(λ). When convenient, we extend β̂ M to length p by setting β̂ M (j) = 0 for j ∉ M. We use the norms:
||v||2 = Σj vj2 (squared Euclidean norm),  ||v||1 = Σj |vj|,  ||v||∞ = maxj |vj|.
If C is any square matrix, let φ(C) and Φ(C) denote the smallest and largest eigenvalues of C. Also, if k is an integer define
φn(k) = min{φ(n−1 XMT XM): |M| ≤ k}  and  Φn(k) = max{Φ(n−1 XMT XM): |M| ≤ k}.
We will write zu for the upper quantile of a standard Normal, so that ℙ(Z > zu) = u where Z ~ N (0, 1).
Our method will involve splitting the data randomly into three groups 𝒟1, 𝒟2 and 𝒟3. For ease of notation, assume the total sample size is 3n and that the sample size of each group is n.
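As a concrete illustration of this notation, the following sketch (our own code, not part of the paper) computes the loss L(β̂ ) = (β̂ − β)TΣ̂ n(β̂ − β) and brute-forces the restricted eigenvalues φn(k), Φn(k) as defined above; the function names are ours and the brute-force search is only feasible for small p and k.

```python
import itertools
import numpy as np

def loss(beta_hat, beta, X):
    """L(beta_hat) = (beta_hat - beta)^T Sigma_hat_n (beta_hat - beta), Sigma_hat_n = X^T X / n."""
    n = X.shape[0]
    d = beta_hat - beta
    return d @ (X.T @ X) @ d / n

def sparse_eigenvalues(X, k):
    """Brute-force phi_n(k), Phi_n(k): extreme eigenvalues of n^{-1} X_M^T X_M over |M| <= k."""
    n, p = X.shape
    phi, Phi = np.inf, 0.0
    for size in range(1, k + 1):
        for M in itertools.combinations(range(p), size):
            eig = np.linalg.eigvalsh(X[:, M].T @ X[:, M] / n)  # ascending order
            phi = min(phi, eig[0])
            Phi = max(Phi, eig[-1])
    return phi, Phi

# Small worked example.
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:2] = [2.0, -1.0]
y = X @ beta + rng.standard_normal(n)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print("loss:", loss(beta_ls, beta, X))
print("phi_n(3), Phi_n(3):", sparse_eigenvalues(X, 3))
```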
Summary of Assumptions
We will use the following assumptions throughout except in Section 8.
(A1) Yi = XiTβ + εi where εi ~ N (0, σ2), for i = 1, …, n.
(A2) The dimension pn of X satisfies pn → ∞ and pn ≤ c1 exp(n^c2) for some c1 > 0 and 0 ≤ c2 < 1.
(A3) s ≡ |{j: βj ≠ 0}| = O(1) and ψ = min{|βj|: βj ≠ 0} > 0.
(A4) There exist positive constants C0, C1 and κ such that ℙ (lim supn→ ∞ Φn(n) ≤ C0) = 1 and ℙ(lim infn→ ∞ φn(C1 log n) ≥ κ) = 1. Also, ℙ(φn(n) > 0) = 1 for all n.
(A5) The covariates are standardized: 𝔼(Xij) = 0 and 𝔼(Xij2) = 1. Also, there exists 0 < B < ∞ such that ℙ(|Xij| ≤ B) = 1.
For simplicity, we include no intercepts in the regressions. The assumptions can be weakened at the expense of more complicated proofs. In particular, we can let s increase with n and ψ decrease with n. Similarly, the Normality and constant variance assumptions can be relaxed.
2. Error Control
Define the type I error rate q(D̂ n) = ℙ(D̂ n ∩ Dc ≠ ∅) and the asymptotic error rate lim supn→∞ q(D̂ n). We define the power π(D̂ n) = ℙ(D ⊂ D̂ n) and the average power
s−1 Σj∈D ℙ(j ∈ D̂ n).
It is well known that controlling the error rate is difficult for at least three reasons: correlation among covariates, high dimensionality of the covariates and unfaithfulness (cancellation of correlations due to confounding). Let us briefly review these issues.
It is easy to construct examples where q(D̂ n) ≤ α forces π(D̂ n) ≈ α. Consider two models for the random vector Z = (Y, X1, X2):
| Model 1 | Model 2 |
|---|---|
| X1 ~ N(0, 1) | X2 ~ N(0, 1) |
| Y = ψX1 + N(0, 1) | Y = ψX2 + N(0, 1) |
| X2 = ρX1 + N(0, τ2) | X1 = ρX2 + N(0, τ2) |
Under models 1 and 2, the marginal distribution of Z is P1 = N(0, Σ1) and P2 = N(0, Σ2), respectively, where Σ1 and Σ2 are the implied covariance matrices.
Given any ε > 0 we can choose ρ sufficiently close to 1 and τ sufficiently close to 0 such that Σ1 and Σ2, and hence the distributions P1 and P2, are as close as we like in the total variation distance d. It follows that the two models cannot be reliably distinguished. Thus, if q ≤ α then the power is less than α + ε.
Dimensionality is less of an issue thanks to recent methods. Most methods, including those in this paper, allow pn to grow exponentially. But all the methods require some restrictions on the number s of nonzero βj’s. In other words, some sparsity assumption is required. In this paper we take s fixed and allow pn to grow.
False negatives can occur during screening due to cancellations of correlations. For example, the correlation between Y and X1 can be 0 even when β1 is huge. This problem is called unfaithfulness in the causality literature; see Spirtes, Glymour and Scheines (2001) and Robins, Spirtes, Scheines and Wasserman (2003). False negatives during screening can lead to false positives during the second stage.
Let μ̂ j denote the regression coefficient from regressing Y on Xj. Fix j ≤ s and note that, since the covariates are standardized,
μj ≡ 𝔼(μ̂ j) = βj + Σk≤s, k≠j βk ρkj,
where ρkj = corr(Xk, Xj). If
Σk≤s, k≠j βk ρkj ≈ −βj,
then μj ≈ 0 no matter how large βj is. This problem can occur even when n is large and p is small.
For example, suppose that β = (10, −10, 0, 0) and that ρ(Xi, Xj) = 0 except that ρ(X1, X2) = ρ(X1, X3) = ρ(X2, X4) = 1 − ε where ε > 0 is small. Then
μ1 = 10 − 10(1 − ε) = 10ε,  μ2 = −10ε,  μ3 = 10(1 − ε),  μ4 = −10(1 − ε),
so the marginal coefficients of the irrelevant covariates X3 and X4 dwarf those of the relevant covariates X1 and X2.
Marginal regression is extremely susceptible to unfaithfulness. The lasso and forward stepwise, less so. However, unobserved covariates can induce unfaithfulness in all the methods.
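To make the cancellation in the example above concrete, here is a small numeric check we add for illustration: with standardized covariates the population marginal coefficient is μj = Σk βk ρkj, so plugging in the stated β and correlations shows that μ1 and μ2 nearly vanish while μ3 and μ4 are large. (The stated pairwise correlations are idealized; the point is the near-cancellation, not an exact joint model.)

```python
import numpy as np

eps = 0.01
beta = np.array([10.0, -10.0, 0.0, 0.0])

# Stated (idealized) correlations: rho(X1,X2) = rho(X1,X3) = rho(X2,X4) = 1 - eps, others 0.
R = np.eye(4)
R[0, 1] = R[1, 0] = 1 - eps
R[0, 2] = R[2, 0] = 1 - eps
R[1, 3] = R[3, 1] = 1 - eps

# mu_j = sum_k beta_k * rho_{kj} for standardized covariates.
mu = R @ beta
print(mu)  # approx [ 0.1, -0.1, 9.9, -9.9 ]: marginal screening ranks X3, X4 far above X1, X2
```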
3. Loss and Cross-validation
Let Xλ = (X•j: j ∈ Ŝ n(λ)) denote the design matrix corresponding to the covariates in Ŝ n(λ) and let β̂ (λ) be the least squares estimator for the regression restricted to Ŝ n(λ), assuming the estimator is well defined. Hence, β̂ (λ) = (XλT Xλ)−1 XλT Y. More generally, β̂ M is the least squares estimator for any subset of variables M. When convenient, we extend β̂ (λ) to length p by setting β̂ j(λ) = 0 for j ∉ Ŝ n(λ).
3.1. Loss
Now we record some properties of the loss function. The first part of the following lemma is essentially Lemma 3 of Meinshausen and Yu (2008).
Lemma 3.1
Let . Then,
(6)
Let . Then,
(7)
3.2. Cross-validation
Recall that the data have been split into groups 𝒟1, 𝒟2 and 𝒟3, each of size n. Construct β̂ (λ) from 𝒟1 and let
(8) L̂ (λ) = n−1 Σi∈𝒟2 (Yi − XiT β̂ (λ))2.
We would like L̂ (λ) to order the models the same way as the true loss L(λ) (defined after equation (5)). This requires that, asymptotically, L̂ (λ) − L(λ) ≈ δn where δn does not involve λ. The following bounds will be useful. Note that L(λ) and L̂ (λ) are both step functions that only change value when a variable enters or leaves the model.
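A minimal sketch of the stage II selection (our own code, with hypothetical argument names): for each λ, β̂ (λ) has been fit on 𝒟1 and L̂ (λ) is the held-out mean squared error (8) computed on 𝒟2; λ̂ is the minimizer of L̂ .

```python
import numpy as np

def validation_loss(X2, y2, beta_hat):
    """L_hat(lambda) = n^{-1} * sum_{i in D2} (Y_i - X_i^T beta_hat(lambda))^2."""
    resid = y2 - X2 @ beta_hat
    return np.mean(resid ** 2)

def select_lambda(candidates, X2, y2):
    """candidates: dict mapping lambda -> beta_hat(lambda), each fit on D1 and padded to length p."""
    losses = {lam: validation_loss(X2, y2, b) for lam, b in candidates.items()}
    lam_hat = min(losses, key=losses.get)
    return lam_hat, losses
```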
Theorem 3.2
Suppose that maxλ∈Λn |Ŝ n(λ)| ≤ kn. Then there exists a sequence of random variables δn = OP (1) that do not depend on λ or X, such that, with probability tending to 1,
(9)
4. Multi-Stage Methods
The multi-stage methods use the following steps. As mentioned earlier, we randomly split the data into three parts 𝒟1, 𝒟2 and 𝒟3, which we take to be of equal size.
Stage I. Use 𝒟1 to find Ŝ n(λ) for each λ.
Stage II. Use 𝒟2 to find λ̂ by cross-validation and let Ŝ n = Ŝ n(λ̂ ).
Stage III. Use 𝒟3 to find the least squares estimate β̂ for the model Ŝ n. Let
D̂ n = {j ∈ Ŝ n: |Tj| > cn},
where Tj is the usual t-statistic, cn = zα/(2m) and m = |Ŝ n|. (A code sketch of the full procedure follows.)
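The sketch below is our own illustration of the three stages, not the authors' software: `screen_fn` is a placeholder for any of the three screening methods (it is assumed to return the selected set and a length-p coefficient vector fitted on 𝒟1), and, as in the text, the Normal quantile is used for the Bonferroni-style cut cn = zα/(2m).

```python
import numpy as np
from scipy.stats import norm

def clean(X3, y3, S_hat, alpha):
    """Stage III: OLS on the selected columns, keep j with |T_j| > z_{alpha/(2m)}."""
    m = len(S_hat)
    if m == 0:
        return []
    XM = X3[:, S_hat]
    n = XM.shape[0]
    beta_hat = np.linalg.lstsq(XM, y3, rcond=None)[0]
    resid = y3 - XM @ beta_hat
    sigma2 = resid @ resid / (n - m)                  # sigma_hat^2
    cov = sigma2 * np.linalg.inv(XM.T @ XM)           # covariance of beta_hat given X
    t_stats = beta_hat / np.sqrt(np.diag(cov))
    c_n = norm.ppf(1 - alpha / (2 * m))               # Bonferroni cut, Normal approximation
    return [j for j, t in zip(S_hat, t_stats) if abs(t) > c_n]

def screen_and_clean(X, y, screen_fn, lambdas, alpha=0.05, seed=0):
    """Three-way split: D1 screens, D2 cross-validates, D3 cleans."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    d1, d2, d3 = np.array_split(idx, 3)
    # Stage I: candidate models, each fit on D1.  screen_fn -> (S_hat(lambda), beta_hat of length p)
    fits = {lam: screen_fn(X[d1], y[d1], lam) for lam in lambdas}
    # Stage II: pick lambda by held-out squared error on D2.
    def cv_loss(beta_hat):
        r = y[d2] - X[d2] @ beta_hat
        return np.mean(r ** 2)
    lam_hat = min(fits, key=lambda lam: cv_loss(fits[lam][1]))
    S_hat = list(fits[lam_hat][0])
    # Stage III: clean on D3.
    return clean(X[d3], y[d3], S_hat, alpha)
```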
4.1. The Lasso
The lasso estimator (Tibshirani 1996) β̃(λ) minimizes
Σi (Yi − XiTβ)2 + λ Σj |βj|,
and we let Ŝ n(λ) = {j: β̃j (λ) ≠ 0}. Recall that β̂ (λ) is the least squares estimator using the covariates in Ŝ n(λ).
Let kn = A log n where A is a positive constant.
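One practical way to produce the family {Ŝ n(λ)} is to compute the lasso coefficient path on 𝒟1 and keep the supports of size at most kn = A log n. The sketch below is ours and uses scikit-learn's `lasso_path`; its penalty parameter rescales the squared-error term by 1/(2n), which only reparametrizes λ and does not change the set of supports.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def lasso_candidates(X1, y1, A=5.0):
    """Return {lambda: S_hat(lambda)} for supports no larger than k_n = A log n."""
    n = X1.shape[0]
    k_n = int(A * np.log(n))
    alphas, coefs, _ = lasso_path(X1, y1)       # coefs has shape (p, n_alphas)
    candidates = {}
    for lam, beta in zip(alphas, coefs.T):
        support = tuple(np.flatnonzero(beta))   # S_hat(lambda) = {j: beta_tilde_j(lambda) != 0}
        if 0 < len(support) <= k_n:
            candidates[lam] = support
    return candidates
```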
Theorem 4.1
Assume that (A1)–(A5) hold. Let Λn = {λ: |Ŝ n(λ)| ≤ kn}. Then:
1. The true loss overfits: ℙ(D ⊂ Ŝ n(λ*)) → 1, where λ* = argminλ∈Λn L(λ).
2. Cross-validation also overfits: ℙ(D ⊂ Ŝ n(λ̂ )) → 1, where λ̂ = argminλ∈Λn L̂ (λ).
3. Type I error is controlled: lim supn→∞ ℙ(Dc ∩ D̂ n ≠ ∅) ≤ α.
4. If we let α = αn → 0, then D̂ n is consistent for variable selection.
Theorem 4.2
Assume that (A1)–(A5) hold. Let αn → 0 sufficiently slowly. Then, the multi-stage lasso is consistent,
(10) ℙ(D̂ n = D) → 1.
The next result follows directly. The proof is thus omitted.
Theorem 4.3
Assume that (A1)–(A5) hold. Let α be fixed. Then (D̂ n; Ŝ n) forms a confidence sandwich:
(11) lim infn→∞ ℙ(D̂ n ⊂ D ⊂ Ŝ n) ≥ 1 − α.
Remark 4.4
This confidence sandwich is expected to be conservative in the sense that the coverage can be much larger than 1 − α.
4.2. Stepwise Regression
The version of stepwise regression we consider is as follows. Let kn = A log n for some A > 0.
1. Initialize: Res = Y, λ = 0, Ŷ = 0, and Ŝ n(λ) = ∅.
2. Let λ ← λ + 1. Compute μ̂ j = n−1〈Xj, Res〉 for j = 1, …, p.
3. Let J = argmaxj |μ̂ j|. Set Ŝ n(λ) = Ŝ n(λ − 1) ∪ {J}. Set Ŷ = Xλβ̂ (λ), where β̂ (λ) is the least squares estimator based on the covariates in Ŝ n(λ), and let Res = Y − Ŷ .
4. If λ = kn stop. Otherwise, go to step 2.
For technical reasons, we assume that the final estimator xTβ̂ is truncated to be no larger than B. Note that λ is discrete and Λn = {0, 1, …, kn}.
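A direct transcription of these steps (our own sketch; the truncation of the final fit at B is omitted): at each step the variable most correlated with the current residual enters, and the fit is refreshed by least squares on the enlarged set.

```python
import numpy as np

def forward_stepwise(X, y, k_n):
    """Return the nested supports S_hat(0), ..., S_hat(k_n) from the greedy steps above."""
    n, p = X.shape
    supports = [[]]                       # S_hat(0) is empty
    res = y.copy()
    selected = []
    for _ in range(k_n):
        mu = X.T @ res / n                # mu_hat_j = <X_j, Res> / n
        if selected:
            mu[selected] = 0.0            # refitted residual is orthogonal to these; guard against ties
        J = int(np.argmax(np.abs(mu)))
        selected = selected + [J]
        XM = X[:, selected]
        beta_hat = np.linalg.lstsq(XM, y, rcond=None)[0]   # refit by least squares on S_hat(lambda)
        res = y - XM @ beta_hat
        supports.append(list(selected))
    return supports
```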
Theorem 4.5
With Ŝ n(λ) defined as above, the statements of Theorems 4.1, 4.2 and 4.3 hold.
4.3. Marginal Regression
This is probably the oldest, simplest and most common method. It is quite popular in gene expression analysis. It used to be regarded with some derision but has enjoyed a revival; a version appears in a recent paper by Fan and Lv (2008). Let Ŝ n(λ) = {j: |μ̂ j| ≥ λ} where μ̂ j = n−1〈Y, X•j〉.
Let μj = 𝔼(μ̂ j) and let μ(j) denote the μ's ordered by their absolute values: |μ(1)| ≥ |μ(2)| ≥ ⋯ ≥ |μ(p)|.
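Marginal screening is a one-liner in this notation. The sketch below is ours: it computes μ̂ j = n−1〈Y, X•j〉 for every column and keeps the kn largest in absolute value, which is the same as thresholding at λ = |μ̂ (kn)| when the columns are standardized.

```python
import numpy as np

def marginal_screen(X, y, k_n):
    """S_hat(lambda) = {j: |mu_hat_j| >= lambda} with lambda equal to the k_n-th largest |mu_hat_j|."""
    n = X.shape[0]
    mu_hat = X.T @ y / n                    # marginal coefficients for standardized columns
    order = np.argsort(-np.abs(mu_hat))     # indices sorted by decreasing |mu_hat_j|
    return sorted(order[:k_n].tolist())
```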
Theorem 4.6
Let kn → ∞ with . Let Λn = {λ: |Ŝ n(λ)| ≤ kn}. Assume that
(12)
Then, the statements of Theorems 4.1, 4.2 and 4.3 hold.
The assumption (12) limits the degree of unfaithfulness (small partial correlations induced by cancellation of parameters). Large values of kn weaken assumption (12), thus making the method more robust to unfaithfulness, but at the expense of lower power. Fan and Lv (2008) make similar assumptions. They assume that there is a C > 0 such that |μj| ≥ C|βj| for all j, which rules out unfaithfulness. However, they do not explicitly relate the values of μj for j ∈ D to the values outside D as we have done. On the other hand, they assume that Z = Σ−1/2 X has a spherically symmetric distribution. Under this assumption and their faithfulness assumption, they deduce that the μj’s outside D cannot strongly dominate the μj’s within D. We prefer to simply make this an explicit assumption without placing distributional assumptions on X. At any rate, any method that uses marginal regressions as a starting point must make some sort of faithfulness assumption to succeed.
4.4. Modifications
Let us now discuss a few modifications of the basic method. First, consider splitting the data into only two groups, 𝒟1 and 𝒟2. Then do these steps:
Stage I. Find Ŝ n(λ) for λ ∈ Λn, where |Ŝ n(λ)| ≤ kn for each λ ∈ Λn, using 𝒟1.
Stage II. Find λ̂ by cross-validation and let Ŝ n = Ŝ n(λ̂ ), using 𝒟2.
Stage III. Find the least squares estimate β̂ Ŝ n using 𝒟2. Let D̂ n = {j ∈ Ŝ n: |Tj| > cn} where Tj is the usual t-statistic.
Theorem 4.7
Choosing
(13)
controls asymptotic type I error.
The critical value in (13) is hopelessly large and it does not appear that it can be substantially reduced. We present this result mainly to show the value of the extra data-splitting step. It is tempting to use the same critical value as in the tri-split case, namely cn = zα/(2m) where m = |Ŝ n|, but we suspect this will not work in general. However, it may work under extra conditions.
5. Application
As an example we illustrate an analysis based on part of the Osteoporotic Fractures in Men Study (MrOS, Orwoll et al. 2005). A sample of 860 men was measured at a large number of genes and outcome measures. We consider only 296 SNPs which span 30 candidate genes for bone mineral density. An aim of the study was to identify genes associated with bone mineral density that could help in understanding the genetic basis of osteoporosis in men. Initial analyses of this subset of the data revealed no SNPs with a clear pattern of association with the phenotype; however, three SNPs, numbered (67, 277, 289), exhibited some association in the screening of the data. To further explore the efficacy of the lasso screen and clean procedure we modified the phenotype to enhance this weak signal and then reanalyzed the data to see if we could detect this planted signal.
We were interested in testing for main effects and pairwise interactions in these data; however, including all interactions results in a model with 43,660 additional terms, which is not practical for this sample size. As a compromise we selected 2 SNPs per gene to model potential interaction effects. This resulted in a model with a total of 2066 potential coefficients, including 296 main effects and 1770 interaction terms. With this model our initial screen detected 10 terms, including the three enhanced signals, 2 other main effects and 5 interactions. After cleaning, the final model detected the 3 enhanced signals, and no other terms.
6. Simulations
To further explore the screen and clean procedures, we conducted simulation experiments with four models. For each model the measurement errors εi are iid Normal(0, 1) and the covariates Xij are Normal(0, 1) (except for model D). The models differ in how Yi is linked to Xi and in the dependence structure of the Xi’s. Models A, B and C explore scenarios with moderate and large p, while Model D focuses on confounding and unfaithfulness. (A data-generating sketch follows the list of models below.)
Model A (Null): β = (0,…,0) and the Xij’s are iid.
Model B (Triangle): βj = δ(10 − j) for j = 1,…, 10, βj = 0 for j > 10, and the Xij’s are iid.
Model C (Correlated Triangle): as B, but with corr(Xij, Xi(j−1)) = ρ for j > 1 (a Markov dependence), with ρ = 0.5.
Model D (Unfaithful): Yi = β1Xi1 + β2Xi2 + εi with β1 = −β2 = 10, where the Xij’s are iid for j ∈ {1, 5, 6, 7, 8, 9, 10}, while Xi2, Xi3 and Xi4 are noisy linear functions of the other covariates (with τ = 0.01 and ρ = 0.95), inducing the near-cancellation described in Section 2.
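The generator below is our own reconstruction of these designs, added for concreteness. The covariate dependence in Models C and D is only partly legible in the text, so those parts, the default value of δ, and the parameter names are assumptions marked in the comments.

```python
import numpy as np

def simulate(model, n, p, delta=0.5, rho=0.5, tau=0.01, rho_d=0.95, seed=0):
    """Draw (X, y) from models A-D (requires p >= 10). Errors are iid N(0,1); covariates N(0,1).
    delta: signal strength for models B/C (its value is not specified in the text)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    if model == "A":                       # Null model
        pass
    elif model in ("B", "C"):              # (Correlated) Triangle model
        beta[:10] = delta * (10 - np.arange(1, 11))
        if model == "C":
            # Assumption: Markov (AR(1)-type) dependence X_j = rho X_{j-1} + sqrt(1-rho^2) Z_j.
            for j in range(1, p):
                X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    elif model == "D":                     # Unfaithful model (p = 10 in the paper)
        beta[0], beta[1] = 10.0, -10.0
        # Assumption (patterned on Section 2's example): columns j = 2, 3, 4 are noisy
        # near-copies of the signal columns, inducing near-cancellation of the marginals.
        X[:, 1] = rho_d * X[:, 0] + tau * rng.standard_normal(n)
        X[:, 2] = rho_d * X[:, 0] + tau * rng.standard_normal(n)
        X[:, 3] = rho_d * X[:, 1] + tau * rng.standard_normal(n)
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```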
We used a maximum model size of kn = n1/2, which technically goes beyond the theory but works well in practice. Prior to analysis the covariates are scaled so that each has mean 0 and variance 1. The tests were initially performed using a third of the data for each of the three stages of the procedure (Table 1, top half, 3 splits). For models A, B and C each approach has Type I error less than α, except the stepwise procedure, which has trouble with model C when n = p = 100. We also calculated the false positive rate and found it to be very low (about 10−4 when p = 100 and 10−5 when p = 1000), indicating that even when a Type I error occurs, only a very small number of terms are included erroneously. The lasso screening procedure exhibited a slight power advantage over the stepwise procedure. Both methods dominated the marginal approach. The Markov dependence structure in model C clearly challenged the marginal approach. For Model D none of the approaches controlled the Type I error.
Table 1.
| Splits | n | p | Model | Size: Lasso | Size: Step | Size: Marg | Power: Lasso | Power: Step | Power: Marg |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 100 | 100 | A | 0.005 | 0.001 | 0.004 | 0.00 | 0.00 | 0.00 |
| 2 | 100 | 100 | B | 0.01 | 0.02 | 0.03 | 0.62 | 0.62 | 0.31 |
| 2 | 100 | 100 | C | 0.001 | 0.01 | 0.01 | 0.77 | 0.57 | 0.21 |
| 2 | 100 | 10 | D | 0.291 | 0.283 | 0.143 | 0.08 | 0.08 | 0.04 |
| 2 | 100 | 1000 | A | 0.001 | 0.002 | 0.010 | 0.00 | 0.00 | 0.00 |
| 2 | 100 | 1000 | B | 0.002 | 0.020 | 0.010 | 0.17 | 0.09 | 0.11 |
| 2 | 100 | 1000 | C | 0.02 | 0.14 | 0.01 | 0.27 | 0.15 | 0.11 |
| 2 | 1000 | 10 | D | 0.291 | 0.283 | 0.143 | 0.08 | 0.08 | 0.04 |
| 3 | 100 | 100 | A | 0.040 | 0.050 | 0.030 | 0.00 | 0.00 | 0.00 |
| 3 | 100 | 100 | B | 0.02 | 0.01 | 0.02 | 0.91 | 0.90 | 0.56 |
| 3 | 100 | 100 | C | 0.03 | 0.04 | 0.03 | 0.91 | 0.88 | 0.41 |
| 3 | 100 | 10 | D | 0.382 | 0.343 | 0.183 | 0.16 | 0.18 | 0.09 |
| 3 | 100 | 1000 | A | 0.035 | 0.045 | 0.040 | 0.00 | 0.00 | 0.00 |
| 3 | 100 | 1000 | B | 0.045 | 0.020 | 0.035 | 0.57 | 0.66 | 0.29 |
| 3 | 100 | 1000 | C | 0.06 | 0.070 | 0.020 | 0.74 | 0.65 | 0.19 |
| 3 | 1000 | 10 | D | 0.481 | 0.486 | 0.187 | 0.17 | 0.17 | 0.13 |
To determine the sensitivity of the approach to using distinct data for each stage of the analysis, simulations were conducted screening on the first half of the data and cleaning on the second half (2 splits). The tuning parameter was selected using leave-one-out cross-validation (Table 1, bottom half). As expected, this approach led to a dramatic increase in the power of all the procedures. More surprising is the fact that the Type I error was near α or below for models A, B and C. Clearly this approach has advantages over data splitting and merits further investigation.
A natural competitor to the screen and clean procedure is a two-stage adaptive lasso (Zou 2006). In our implementation we split the data and used one half for each stage of the analysis. At stage one, the lasso, tuned by leave-one-out cross-validation, screens the data. In stage two, the adaptive lasso, with weights wj = |β̂ j|−1, cleans the data; its tuning parameter was again chosen using leave-one-out cross-validation. Table 2 provides the size, power and false positive rate (FPR) for this procedure, and a code sketch of this competitor appears after the table. Naturally, the adaptive lasso does not control the size of the test, but the FPR is small. The power of the test is greater than we found for our lasso screen and clean procedure, but this extra power comes at the cost of a much higher Type I error rate.
Table 2.
n | p | model | Size | Power | FPR |
---|---|---|---|---|---|
100 | 100 | A | 0.93 | 0 | 0.032 |
100 | 100 | B | 0.84 | 0.97 | 0.034 |
100 | 100 | C | 0.81 | 0.96 | 0.031 |
100 | 10 | D | 0.67 | 0.21 | 0.114 |
100 | 1000 | A | 0.96 | 0 | 0.004 |
100 | 1000 | B | 0.89 | 0.65 | 0.004 |
100 | 1000 | C | 0.76 | 0.77 | 0.002 |
1000 | 10 | D | 0.73 | 0.24 | 0.013 |
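For completeness, here is our own sketch of the two-stage adaptive-lasso competitor described above. It uses scikit-learn's `LassoCV` (cv set to the sample size gives leave-one-out cross-validation), assumes the columns of X are already standardized, and implements the weighted second-stage lasso by the standard column-rescaling trick.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def adaptive_lasso_two_stage(X, y, seed=0):
    """Stage 1: lasso (LOO CV) on one half screens; stage 2: adaptive lasso with w_j = 1/|beta_hat_j|."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    d1, d2 = np.array_split(idx, 2)

    # Stage 1: initial lasso fit on the first half.
    beta1 = LassoCV(cv=len(d1)).fit(X[d1], y[d1]).coef_
    keep = np.flatnonzero(beta1)                      # screened variables
    if keep.size == 0:
        return np.array([], dtype=int)

    # Stage 2: adaptive lasso on the second half via column rescaling:
    # penalizing sum_j |b_j| / |beta1_j| is an ordinary lasso in c_j = b_j / |beta1_j|.
    Xw = X[np.ix_(d2, keep)] * np.abs(beta1[keep])
    beta2 = LassoCV(cv=len(d2)).fit(Xw, y[d2]).coef_ * np.abs(beta1[keep])
    return keep[np.flatnonzero(beta2)]                # final selected variables
```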
7. Proofs
Recall that if A is a square matrix then φ(A) and Φ(A) denote the smallest and largest eigenvalues of A. Throughout the proofs we make use of the following fact. If v is a vector and A is a square matrix then
(14) φ(A) ||v||2 ≤ vT A v ≤ Φ(A) ||v||2.
We use the following standard tail bound: if Z ~ N(0, 1) then ℙ(|Z| > t) ≤ t−1e−t2/2. We will also use the following results about the lasso from Meinshausen and Yu (2008). Their results are stated and proved for fixed X but, under the conditions (A1)–(A5), it is easy to see that their conditions hold with probability tending to one and so their results hold for random X as well.
Theorem 7.1 (Meinshausen and Yu, 2008)
Let β̃(λ) be the lasso estimator.
The squared error satisfies:
(15) where m = |Ŝ n(λ)| and c > 0 is a constant.
The size of Ŝ n(λ) satisfies
(16) where .
Proof of Lemma 3.1
Let D ⊂ M and . Then
where . Conditional on X, where . Let . By Hoeffding’s inequality, (A2) and (A5), ℙ(En) → 1 where . So
But and (6) follows.
Now we lower bound L(β̂ M). Let M be such that D ⊄ M. Let A = {j: β̂ (j) ≠ 0} ∪ D. Then |A| ≤ m + s. Therefore, with probability tending to 1,
Proof of Theorem 3.2
Let Ỹ denote the responses, and X̃ the design matrix, for the second half of the data. Then Ỹ = X̃β + ε̃. Now
and
where δn = ||ε̃||2/n, and and Σ̃n = n−1 X̃T X̃. By Hoeffding’s inequality
for some c > 0 and so
Choose εn = 4/(cn1−c2). It follows that
Note that
Hence, with probability tending to 1,
for all λ ∈ Λn, where
and . Now since ||β̂ (λ)||2 = OP (kn/φ(kn)). Thus, ||β̂ (λ) − β||1 ≤ C(kn + s) with probability tending to 1, for some C > 0. Also, |μi(λ)| ≤ B||β̂ (λ) − β||1 ≤ BC(kn + s) with probability tending to 1. Let W ~ N (0, 1). Conditional on ,
so .
Proof of Theorem 4.1
(1) Let , M = Ŝ n(λn) and m = |M |. Then, ℙ(m ≤ kn) → 1 due to (16). Hence, ℙ(λn ∈ Λn) → 1. From (15),
Hence, . So, for each j ∈ D,
and hence ℙ(minj∈D|β̃j(λn)| > 0) → 1. Therefore, Γn = {λ ∈ Λn: D ⊂ Ŝ n(λ)} is nonempty. By Lemma 3.1,
(17)
On the other hand, from Lemma 3.1,
(18)
Now, nφn(kn)/(kn log pn) → ∞ and so, (17) and (18) imply that
Thus, if λ* denotes the minimizer of L(λ) over Λn, we conclude that ℙ(λ* ∈ Γn) → 1 and hence, ℙ(D ⊂Ŝ n(λ*)) → 1.
(2) This follows from part (1) and Theorem 3.2.
(3) Let A = Ŝ n ∩ Dc. We want to show that
Now,
Conditional on ( , ), β̂ A is Normally distributed with mean 0 and variance matrix when D ⊂ Ŝ n. Recall that
where M = Ŝ n, and ej= (0, …, 0, 1, 0, …, 0)T where the 1 is in the jth coordinate. When D ⊂ Ŝ n, each Tj, for j ∈ A, has a t-distribution with n − m degrees of freedom where m = |Ŝ n|. Also, cn/tα/2m → 1 where tu denotes the upper tail critical value for the t-distribution. Hence,
where an = o(1), since |A| ≤ m. It follows that
Proof of Theorem 4.2
From Theorem 4.1, ℙ(D̂ n ∩ Dc ≠ ∅) ≤ αn and so ℙ(D̂ n ∩ Dc ≠ ∅) → 0. Hence, ℙ(D̂ n ⊂ D) → 1. It remains to be shown that
(19)
The test statistic for testing βj = 0 when Ŝ n = M is
For simplicity in the proof, let us take σ̂ = σ, the extension to unknown σ being straightforward. Let j ∈ D, ℳ = {M: |M| ≤ kn, D ⊂ M}. Then,
Conditional on 𝒟1 ∪ 𝒟2, for each M ∈ ℳ, Tj(M) = (βj/sj) + Z where Z ~ N (0, 1). Without loss of generality assume that βj > 0. Hence,
Fix a small ε > 0. Note that . It follows that, for all large n, . So,
The number of models in ℳ is
where we used the inequality
So,
by (A2). We have thus shown that ℙ(j ∉ D̂ n) → 0 for each j ∈ D. Since |D| is finite, it follows that ℙ(j ∉ D̂ n for some j ∈ D) → 0 and hence (19).
Proof of Theorem 4.5
A simple modification of Theorem 3.1 of Barron, Cohen, Dahmen and DeVore (2008) shows that
(The modification is needed because Barron, Cohen, Dahmen and DeVore (2008) require Y to be bounded while we have assumed that Y is Normal. By a truncation argument, we can still derive the bound on L(kn).) So
Hence, for any ε > 0, with probability tending to 1, ||β̂ (kn) − β||2 < ε so that |β̂ j| > ψ/2 > 0 for all j ∈ D. Thus, ℙ(D ⊂ Ŝ n(kn)) → 1. The remainder of the proof of part 1 is the same as in Theorem 4.1. Part 2 follows from the previous result together with Theorem 3.2. The proof of Part 3 is the same as for Theorem 4.1.
Proof of Theorem 4.6
Note that . Hence, . So, for any δ > 0,
By (12), conclude that D ⊂ Ŝ n(λ) when λ = μ̂ (kn). The remainder of the proof is the same as the proof of Theorem 4.5.
Proof of Theorem 4.7
Let A = Ŝ n ∩ Dc. We want to show that
For fixed A, β̂ A is Normal with mean 0 but this is not true for random A. Instead we need to bound Tj. Recall that
where M = Ŝn, and ej = (0, …, 0, 1, 0, …, 0)T where the 1 is in the jth coordinate. The probabilities that follow are conditional on the screening data, but this is suppressed for notational convenience. First, write
When D ⊂ Ŝ n,
where , and βŜ n (j) = 0 for j ∈A. Now, so that
for j ∈ Ŝ n. Therefore,
Let γ = n−1XTε. Then,
It follows that
since κ > 0. So,
Note that γj ~ N (0, σ2/n) and hence
There exists εn → 0 such that ℙ(Bn) → 1 where Bn = {(1 − εn) ≤ σ̂ /σ ≤ (1 + εn)}. So,
8. Discussion
The multi-stage method presented in this paper successfully controls type I error while giving reasonable power. The lasso and stepwise screening methods have similar performance. Although the theoretical results assume independent data for each of the three stages, simulations suggest that leave-one-out cross-validation leads to valid Type I error rates and greater power. Screening the data in one phase of the experiment and cleaning in a followup phase leads to an efficient experimental design. Certainly this approach deserves further theoretical investigation; in particular, optimality remains an open question.
The literature on high dimensional variable selection is growing quickly. The most important deficiency in much of this work, including this paper, is the assumption that the model Y = XTβ + ε is correct. In reality, the model is at best an approximation. It is possible to study linear procedures when the linear model is not assumed to hold as in Greenshtein and Ritov (2004). We discuss this point in the appendix. Nevertheless, it seems useful to study the problem under the assumption of linearity to gain insight into these methods. Future work should be directed at exploring the robustness of the results when the model is wrong.
Other possible extensions include: dropping the Normality of the errors, permitting non-constant variance, investigating the optimal sample sizes for each stage, and considering methods other than cross-validation for selecting among the candidate models.
Finally, let us note that examples involving unfaithfulness, that is, cancellations of parameters that make the marginal correlation very different from the regression coefficient, pose a challenge for all the methods and deserve more attention, even when p is small.
Acknowledgments
The authors are grateful for the use of a portion of the sample from the Osteoporotic Fractures in Men (MrOS) Study to illustrate their methodology. MrOS is supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), the National Institute on Aging (NIA), and the National Cancer Institute (NCI) through grants U01 AR45580, U01 AR45614, U01 AR45632, U01 AR45647, U01 AR45654, U01 AR45583, U01 AG18197, and M01 RR000334. Genetic analyses in MrOS were supported by R01-AR051124. This work was supported by NIH grant MH057881. We also thank two referees and an AE for helpful suggestions.
Appendix
Prediction
Realistically, there is little reason to believe that the linear model is correct. Even if we drop the assumption that the linear model is correct, sparse methods like the lasso can still have good properties as shown in Greenshtein and Ritov (2004). In particular, they showed that the lasso satisfies a risk consistency property. In this appendix we show that this property continues to hold if λ is chosen by cross-validation.
The lasso estimator is the minimizer of Σi (Yi − XiTβ)2 + λ||β||1. This is equivalent to minimizing Σi (Yi − XiTβ)2 subject to ||β||1 ≤ Ω, for some Ω. (More precisely, the set of estimators as λ varies is the same as the set of estimators as Ω varies.) We use this second, constrained version throughout this section.
The predictive risk of a linear predictor ℓ(x) = xTβ is R(β) = 𝔼(Y − ℓ(X))2, where (X, Y) denotes a new observation. Let γ = γ(β) = (−1, β1, …, βp)T and let Γ = 𝔼(ZZT) where Z = (Y, X1, …, Xp). Then we can write R(β) = γTΓγ. The lasso estimator can now be written as β̂ (Ωn) = argminβ∈B(Ωn) R̂ (β), where R̂ (β) = γTΓ̂ γ and Γ̂ = n−1 Σi ZiZiT is the empirical version of Γ.
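The identity R(β) = γTΓγ is just the expansion of the squared error in terms of Z; for clarity, here is the short derivation (our addition, using only the definitions above):

```latex
\begin{align*}
R(\beta) &= \mathbb{E}\,(Y - X^{T}\beta)^{2}
          = \mathbb{E}\,\bigl(\gamma^{T} Z\bigr)^{2}
          && \text{since } \gamma^{T}Z = -Y + \beta^{T}X \text{ and } (-1)^{2} = 1 \\
         &= \mathbb{E}\,\bigl(\gamma^{T} Z Z^{T} \gamma\bigr)
          = \gamma^{T}\,\mathbb{E}\bigl(Z Z^{T}\bigr)\,\gamma
          = \gamma^{T}\Gamma\gamma .
\end{align*}
```

The empirical version R̂ (β) = γTΓ̂ γ follows by replacing Γ with Γ̂.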
Define
β* = argminβ∈B(Ωn) R(β),
where
B(Ωn) = {β: ||β||1 ≤ Ωn}.
Thus, ℓ*(x) = xTβ* is the best linear predictor in the set B(Ωn). The best linear predictor is well defined even though 𝔼(Y | X) is no longer assumed to be linear. Greenshtein and Ritov (2004) call an estimator β̂ n persistent, or predictive risk consistent, if
R(β̂ n) − R(β*) → 0 in probability
as n → ∞.
The assumptions we make in this section are:
(B1) pn ≤ exp(n^ξ) for some 0 ≤ ξ < 1, and
(B2) The elements of Γ̂ satisfy an exponential inequality:
for some c3, c4 > 0 and
(B3) There exists B0 < ∞ such that, for all n, maxj;k (|ZjZk|) ≤ B0.
Condition (B2) can easily be deduced from more primitive assumptions as in Greenshtein and Ritov (2004), but for simplicity we take (B2) as an assumption. Let us review one of the results in Greenshtein and Ritov (2004). For the moment, replace (B1) with the assumption that pn ≤ nb for some b. Under these conditions, it follows that
Hence,
The latter term is oP (1) as long as Ωn = o((n/log n)1/4). Thus we have:
Theorem 8.1 (Greenshtein and Ritov 2004)
If Ωn = o((n/log n)1/4) then the lasso estimator is persistent.
For future reference, let us state a slightly different version of their result that we will need. We omit the proof.
Theorem 8.2
Let γ > 0 be such that ξ + γ < 1. Let Ωn = O(n(1−ξ−γ)/4). Then, under (B1) and (B2),
(20)
for some c > 0.
The estimator β̂ (Ωn) lies on the boundary of the ball B(Ωn) and is very sensitive to the exact choice of Ωn. A potential improvement (and something that reflects actual practice) is to compute the set of lasso estimators β̂ (ℓ) for 0 ≤ ℓ ≤ Ωn and then select from that set by cross-validation. We now confirm that the resulting estimator preserves persistence. As before we split the data into 𝒟1 and 𝒟2. Construct the lasso estimators {β̂ (ℓ): 0 ≤ ℓ ≤ Ωn} from 𝒟1. Choose ℓ̂ by cross-validation using 𝒟2. Let β̂ = β̂ (ℓ̂ ).
Theorem 8.3
Let γ > 0 be such that ξ + γ < 1. Under (B1), (B2) and (B3), if Ωn = O(n(1−ξ−γ)/4), then the cross-validated lasso estimator β̂ is persistent. Moreover,
(21)
Proof
Let β*(ℓ) = argminβ∈B(ℓ) R(β). Define h(ℓ) = R(β*(ℓ)), g(ℓ) = R(β̂ (ℓ)) and c(ℓ) = L̂ (β̂ (ℓ)). Note that, for any vector b, we can write R(b) = τ2 + bTΣb − 2bTρ where τ2 = 𝔼(Y2), Σ = 𝔼(XXT) and ρ = (𝔼(Y X1), …, 𝔼(Y Xp))T.
Clearly, h is monotone nonincreasing on [0, Ωn]. We claim that |h(ℓ + δ) − h(ℓ)| ≤ cΩnδ where c depends only on Γ. To see this, let u = β*(ℓ), v = β*(ℓ + δ) and a = ℓ β*(ℓ + δ)/(ℓ + δ) so that a ∈ B(ℓ). Then,
where C = maxj,k |Γj,k| = O(1).
Next we claim that g(ℓ) is Lipschitz on [0, Ωn] with probability tending to 1. Let β̂ (ℓ) = argminβ∈B̂ (ℓ)R̂ (β) denote the lasso estimator and set û = β̂ (ℓ) and v̂ = β̂ (ℓ + δ). Let εn = n−γ/4. From (20), the following chain of equations hold except on a set of exponentially small probability:
A similar argument can be applied in the other direction. Conclude that
except on a set of small probability.
Now let A = {0, δ, 2δ, …, mδ} where m is the smallest integer such that mδ ≥ Ωn. Thus, m ~ Ωn/δn. Choose δ = δn = n−3(1−ξ−γ)/8. Then Ωnδn → 0 and Ωn/δn ≤ n3(1−ξ−γ)/4. Using the same argument as in the proof of Theorem 3.2,
where σn = oP (1). Then,
and persistence follows. To show the second result, let β̃ = argmin0≤ℓ≤Ωn g(ℓ) and β̄ = argminℓ∈A g(ℓ). Then,
and the claim follows.
References
- Barron A, Cohen A, Dahmen W, DeVore R. Approximation and learning by greedy algorithms. The Annals of Statistics. 2008;36:64–94.
- Bühlmann P. Boosting for high-dimensional linear models. The Annals of Statistics. 2006;34:559–583.
- Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics. 2007;35:2313–2351.
- Donoho D. For most large underdetermined systems of linear equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics. 2006;59:797–829.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32:407–499.
- Fan J, Lv J. Sure independence screening for ultra-high dimensional feature space. Journal of the Royal Statistical Society, Series B. 2008. doi:10.1111/j.1467-9868.2008.00674.x.
- Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
- Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association. 1963;58:13–30.
- Meinshausen N. Relaxed lasso. Computational Statistics and Data Analysis. 2007;52:374–393.
- Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
- Meinshausen N, Yu B. Lasso-type recovery of sparse representations of high-dimensional data. The Annals of Statistics. 2008 (to appear).
- Orwoll E, Blank JB, Barrett-Connor E, Cauley J, Cummings S, Ensrud K, Lewis C, Cawthon PM, Marcus R, Marshall LM, McGowan J, Phipps K, Sherman S, Stefanick ML, Stone K. Design and baseline characteristics of the osteoporotic fractures in men (MrOS) study – a large observational study of the determinants of fracture in older men. Contemp Clin Trials. 2005;26:569–585. doi:10.1016/j.cct.2005.05.006.
- Robins J, Scheines R, Spirtes P, Wasserman L. Uniform consistency in causal inference. Biometrika. 2003;90:491–515.
- Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. MIT Press; 2001.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tropp JA. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory. 2004;50:2231–2242.
- Tropp JA. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory. 2006;52:1030–1051.
- Wainwright M. Sharp thresholds for high-dimensional and noisy recovery of sparsity. 2006. arxiv.org/math.ST/0605740.
- Wellcome Trust. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi:10.1038/nature05911.
- Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zhang CH, Huang J. Model selection consistency of the lasso in high-dimensional linear regression. The Annals of Statistics. 2006 (to appear).
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.