Published in final edited form as: Stat Sci. 2021 Oct 11;36(4):562–577. doi: 10.1214/20-sts815

In Defense of the Indefensible: A Very Naïve Approach to High-Dimensional Inference

Sen Zhao 1, Daniela Witten 2, Ali Shojaie 3

Abstract

A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naïve two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naïve two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naïve confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naïve score test, which can be used to test the hypotheses regarding the full-model regression coefficients.

Keywords: confidence interval, lasso, p-value, post-selection inference, significance testing

1. INTRODUCTION

In this paper, we consider the linear model

y=Xβ*+ϵ, (1.1)

where $X = (x_1, \ldots, x_p)$ is an $n \times p$ deterministic design matrix, $\epsilon$ is a vector of independent and identically distributed errors with $\mathrm{E}[\epsilon_i] = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma_\epsilon^2$, and $\beta^*$ is a $p$-vector of coefficients. Without loss of generality, we assume that the columns of $X$ are centered and standardized, such that $\sum_{i=1}^n X_{(i,k)} = 0$ and $\|x_k\|_2^2 = n$ for $k = 1, \ldots, p$.

When the number of variables p is much smaller than the sample size n, estimation and inference for the vector β* are straightforward. For instance, estimation can be performed using ordinary least squares, and inference can be conducted using classical approaches (see, e.g., Gelman and Hill, 2006; Weisberg, 2013).

As the scope and scale of data collection have increased across virtually all fields, there is an increase in data sets that are high dimensional, in the sense that the number of variables, p, is larger than the number of observations, n. In this setting, classical approaches for estimation and inference of β* cannot be directly applied. In the past 20 years, a vast statistical literature has focused on estimating β* in high dimensions. In particular, penalized regression methods, such as the lasso (Tibshirani, 1996),

$$\hat\beta_\lambda = \operatorname*{argmin}_{b \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|y - Xb\|_2^2 + \lambda \|b\|_1 \right\}, \qquad (1.2)$$

can be used to estimate β*. However, the topic of inference in the high-dimensional setting remains relatively less explored, despite promising recent work in this area. Roughly speaking, recent work on inference in the high-dimensional setting falls into two classes: (i) methods that examine the null hypothesis H0,j*:βj*=0; and (ii) methods that make inference based on a sub-model. We will review these two classes of methods in turn.

First, we review methods that examine the null hypothesis H0,j*:βj*=0, i.e. that the variable xj is unassociated with the outcome y, conditional on all other variables. It might be tempting to estimate β* using the lasso (1.2), and then (for instance) to construct a confidence interval around βˆλ,j. Unfortunately, such an approach is problematic, because βˆλ is a biased estimate of β*. To remedy this problem, we can apply a one-step adjustment to βˆλ, such that under appropriate assumptions, the resulting debiased estimator is asymptotically unbiased for β*. This is similar to the idea of two-step estimation in nonparametric and semiparametric inference (see, e.g., Hahn, 1998; Hirano et al., 2003). With the one-step adjustment, p-values and confidence intervals can be constructed around this debiased estimator. Such an approach is taken by the low dimensional projection estimator (LDPE; Zhang and Zhang, 2014; van de Geer et al., 2014), the debiased lasso test with unknown population covariance (SSLasso; Javanmard and Montanari, 2013, 2014a), the debiased lasso test with known population covariance (SDL; Javanmard and Montanari, 2014b), and the decorrelated score test (dScore; Ning and Liu, 2017). See Dezeure et al. (2015) for a review of such procedures. In what follows, we will refer to these and related approaches for testing H0,j*:βj*=0 as debiased lasso tests.

Next, we review recent work that makes statistical inference based on a sub-model. Recall that the challenge in high dimensions stems from the fact that when $p > n$, classical statistical methods cannot be applied; for instance, we cannot even perform ordinary least squares (OLS). This suggests a simple approach: given an index set $\mathcal{M} \subseteq \{1, \ldots, p\}$, let $X_{\mathcal{M}}$ denote the columns of $X$ indexed by $\mathcal{M}$. Then, we can consider performing inference based on the sub-model composed only of the features in the index set $\mathcal{M}$. That is, rather than considering the model (1.1), we consider the sub-model

$$y = X_{\mathcal{M}} \beta^{(\mathcal{M})} + \epsilon^{(\mathcal{M})}. \qquad (1.3)$$

In (1.3), the notation $\beta^{(\mathcal{M})}$ and $\epsilon^{(\mathcal{M})}$ emphasizes that the true regression coefficients and the corresponding noise are functions of the set $\mathcal{M}$.

Now, provided that $|\mathcal{M}| < n$, we can perform estimation and inference on the vector $\beta^{(\mathcal{M})}$ using classical statistical approaches. For instance, we can consider building confidence intervals $\mathrm{CI}_j^{(\mathcal{M})}$ such that for any $j \in \mathcal{M}$,

$$\Pr\left(\beta_j^{(\mathcal{M})} \in \mathrm{CI}_j^{(\mathcal{M})}\right) \ge 1 - \alpha. \qquad (1.4)$$

At first blush, the problems associated with high dimensionality have been solved!

Of course, there are some problems with the aforementioned approach. The first problem is that the coefficients in the sub-model (1.3) typically are not the same as the coefficients in the original model (1.1) (Berk et al., 2013). Roughly speaking, the problem is that the coefficients in the model (1.1) quantify the linear association between a given variable and the response, conditional on the other $p - 1$ variables, whereas the coefficients in the sub-model (1.3) quantify the linear association between a variable and the response, conditional on the other $|\mathcal{M}| - 1$ variables in the sub-model. The true regression coefficients in the sub-model are of the form

$$\beta^{(\mathcal{M})} \equiv \left(X_{\mathcal{M}}^\top X_{\mathcal{M}}\right)^{-1} X_{\mathcal{M}}^\top X \beta^*. \qquad (1.5)$$

Thus, $\beta^{(\mathcal{M})} \ne \beta^*_{\mathcal{M}}$ unless $X_{\mathcal{M}}^\top X_{\mathcal{M}^c} \beta^*_{\mathcal{M}^c} = 0$. To see this more concretely, consider the following example with $p = 4$ deterministic variables. Let

$$\frac{1}{n} X^\top X = \begin{pmatrix} 1 & 0 & 0.6 & 0 \\ 0 & 1 & 0.6 & 0 \\ 0.6 & 0.6 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$

and set $\beta^* = (1, 1, 0, 0)^\top$. The above design matrix does not satisfy the strong irrepresentable condition needed for selection consistency of the lasso (Zhao and Yu, 2006). Thus, if we take $\mathcal{M}$ to equal the support of the lasso estimate, i.e.,

$$\mathcal{M} = \hat{\mathcal{A}}_\lambda \equiv \operatorname{supp}\left(\hat\beta_\lambda\right) \equiv \left\{ j : \hat\beta_{\lambda,j} \ne 0 \right\}, \qquad (1.6)$$

then it is easy to verify that for some $\lambda$, $\mathcal{M} = \{2, 3\}$, in which case

$$\beta^{(\mathcal{M})} = \left(\tfrac{1}{n} X^\top X\right)_{(\{2,3\},\{2,3\})}^{-1} \left(\tfrac{1}{n} X^\top X\right)_{(\{2,3\},\{1,2,3,4\})} \beta^* = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}^{-1} \begin{pmatrix} 0 & 1 & 0.6 & 0 \\ 0.6 & 0.6 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.4375 \\ 0.9375 \end{pmatrix} \ne \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \beta^*_{\mathcal{M}}.$$
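The arithmetic above can be checked directly; the short R snippet below (ours, not part of the original analysis) reproduces the sub-model coefficients from (1.5) for $\mathcal{M} = \{2, 3\}$.

```r
# Verify beta^(M) = (Sigma_{M,M})^{-1} Sigma_{M,.} beta* for M = {2, 3}.
Sigma <- matrix(c(1,   0,   0.6, 0,
                  0,   1,   0.6, 0,
                  0.6, 0.6, 1,   0,
                  0,   0,   0,   1), nrow = 4, byrow = TRUE)  # (1/n) X'X
beta_star <- c(1, 1, 0, 0)
M <- c(2, 3)
beta_M <- solve(Sigma[M, M], Sigma[M, , drop = FALSE] %*% beta_star)
beta_M  # 0.4375, 0.9375 -- not equal to beta_star[M] = (1, 0)
```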

The second problem that arises in restricting our attention to the sub-model (1.3) is that in practice, the index set $\mathcal{M}$ is not pre-specified. Instead, it is typically chosen based on the data. The problem is that if we construct the index set $\mathcal{M}$ based on the data, and then apply classical inference approaches to the vector $\beta^{(\mathcal{M})}$, the resulting p-values and confidence intervals will not be valid (see, e.g., Pötscher, 1991; Kabaila, 1998; Leeb and Pötscher, 2003, 2005, 2006a,b, 2008; Kabaila, 2009; Berk et al., 2013). This is because we peeked at the data twice: once to determine which variables to include in $\mathcal{M}$, and then again to test hypotheses associated with those variables. Consequently, an extensive recent body of literature has focused on the task of performing inference on $\beta^{(\mathcal{M})}$ in (1.3) given that $\mathcal{M}$ was chosen based on the data. Cox (1975) proposed the idea of sample-splitting to break the dependence between variable selection and hypothesis testing. Wasserman and Roeder (2009) studied sample-splitting in application to the lasso, marginal regression and forward stepwise regression. Meinshausen et al. (2009) extended the single-splitting proposal of Wasserman and Roeder (2009) to multi-splitting, which improved statistical power and reduced the number of falsely selected variables. Berk et al. (2013) instead considered simultaneous inference, which is universally valid under all possible model selection procedures without sample-splitting. More recently, Lee et al. (2016) and Tibshirani et al. (2016) studied the geometry of the lasso and sequential regression, respectively, and proposed exact post-selection inference methods conditional on the random set of selected variables. See Taylor and Tibshirani (2015) for a review of post-selection inference procedures.

The procedures outlined above successfully address the second problem arising from restricting attention to a sub-model, namely the randomness of the set $\mathcal{M}$. However, they do not address the first problem, regarding the difference in the target of inference (1.5), unless $X_{\mathcal{M}}^\top X_{\mathcal{M}^c} \beta^*_{\mathcal{M}^c} = 0$. In the case of the lasso, valid inference for $\beta^*_{\mathcal{A}^*}$ can be obtained if $\mathcal{M} = \hat{\mathcal{A}}_\lambda = \mathcal{A}^*$. However, among other conditions, this requires the strong irrepresentable condition, which is known to be a restrictive condition that is not likely to hold in high dimensions (Zhao and Yu, 2006).

In a recent Statistical Science paper, Leeb et al. (2015) performed a simulation study, in which they obtained a set $\mathcal{M}$ using variable selection, and then calculated “naïve” confidence intervals for $\beta^{(\mathcal{M})}$ using ordinary least squares, without accounting for the fact that the set $\mathcal{M}$ was chosen based on the data. Of course, conventional wisdom dictates that the resulting confidence intervals will be much too narrow. In fact, this is what Leeb et al. (2015) found when they used best subset selection to construct the set $\mathcal{M}$. However, surprisingly, when the lasso was used to construct the set $\mathcal{M}$, the confidence intervals induced by (1.4) had approximately correct coverage. This is in stark contrast to the existing literature!

In this paper, we present a theoretical justification for the empirical finding in Leeb et al. (2015). The main idea of our paper is to establish selection consistency of the lasso estimate with respect to its noiseless counterpart. This result allows us to perform valid inference for the support of the noiseless lasso without needing post-selection or sample splitting strategies. Furthermore, we use our theoretical findings to also develop the naïve score test, a simple procedure for testing the null hypothesis H0,j*:βj*=0 for all j=1,,p.

The rest of this paper is organized as follows. In Sections 2 and 3, we focus on post-selection inference: we seek to perform inference on $\beta^{(\mathcal{M})}$ in (1.3), where $\mathcal{M}$ is selected by the lasso, i.e., $\mathcal{M} = \hat{\mathcal{A}}_\lambda$ in (1.6). In Section 2, we point out a previously overlooked scenario in selection consistency theory: although $\hat{\mathcal{A}}_\lambda$ in (1.6) is random, with high probability it is equal to the support of the noiseless lasso under relatively mild regularity conditions. This result implies that we can use classical methods for inference on $\beta^{(\mathcal{M})}$ when $\mathcal{M} = \hat{\mathcal{A}}_\lambda$. In Section 3, we provide empirical evidence in support of these theoretical findings. In Sections 4 and 5, we instead focus on the task of performing inference on $\beta^*$ in (1.1). We propose the naïve score test in Section 4, and study its empirical performance in Section 5. We end with a discussion of future research directions in Section 6. Technical proofs are relegated to the online Supplementary Materials.

We now introduce some notation that will be used throughout the paper. We use “$\equiv$” to denote equalities by definition, and “$\asymp$” for the asymptotic order. We use $1\{\cdot\}$ for the indicator function; “$\vee$” and “$\wedge$” denote the maximum and minimum of two real numbers, respectively. For any real number $a \in \mathbb{R}$, $a_+ \equiv a \vee 0$. Given a set $\mathcal{S}$, $|\mathcal{S}|$ denotes its cardinality and $\mathcal{S}^c$ denotes its complement. We use bold upper case fonts to denote matrices, bold lower case fonts for vectors, and normal fonts for scalars. We use symbols with a superscript “*”, e.g., $\beta^*$ and $\mathcal{A}^* \equiv \operatorname{supp}(\beta^*)$, to denote the true population parameters associated with the full linear model (1.1); we use symbols superscripted by a set in parentheses, e.g., $\beta^{(\mathcal{M})}$, to denote quantities related to the sub-model (1.3). Symbols subscripted by “$\lambda$” and with a hat, e.g., $\hat\beta_\lambda$ and $\hat{\mathcal{A}}_\lambda$, denote parameter estimates from the lasso estimator (1.2) with tuning parameter $\lambda > 0$; symbols subscripted by “$\lambda$” and without a hat, e.g., $\beta_\lambda$ and $\mathcal{A}_\lambda \equiv \operatorname{supp}(\beta_\lambda)$, are associated with the noiseless lasso estimator (van de Geer and Bühlmann, 2009; van de Geer, 2017),

$$\beta_\lambda \equiv \operatorname*{argmin}_{b \in \mathbb{R}^p} \left\{ \frac{1}{2n} \mathrm{E}\left[\|y - Xb\|_2^2\right] + \lambda \|b\|_1 \right\}. \qquad (1.7)$$

For any vector $b$, matrix $\Sigma$, and index sets $\mathcal{S}_1$ and $\mathcal{S}_2$, we use $b_{\mathcal{S}_1}$ to denote the subvector of $b$ consisting of the elements indexed by $\mathcal{S}_1$, and $\Sigma_{(\mathcal{S}_1, \mathcal{S}_2)}$ to denote the sub-matrix of $\Sigma$ with rows in $\mathcal{S}_1$ and columns in $\mathcal{S}_2$.
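Because $\beta_\lambda$ in (1.7) depends only on $X$, $\beta^*$ and $\lambda$, it can be computed in simulations by applying the lasso to the noise-free response $X\beta^*$. The R sketch below illustrates this; it assumes the glmnet parameterization, which matches the $\frac{1}{2n}$ scaling in (1.2), and the function name is ours.

```r
# The noiseless lasso (1.7) minimizes (1/(2n)) || X beta* - X b ||_2^2 + lambda ||b||_1
# (up to an additive constant), so its support A_lambda can be obtained by running
# the lasso on the noise-free response X beta*.
library(glmnet)

noiseless_lasso_support <- function(X, beta_star, lambda) {
  fit <- glmnet(X, X %*% beta_star, lambda = lambda,
                intercept = FALSE, standardize = FALSE)
  which(as.numeric(coef(fit))[-1] != 0)   # supp(beta_lambda), intercept row dropped
}
```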

2. THEORETICAL JUSTIFICATION FOR NAÏVE CONFIDENCE INTERVALS

Recall that β(𝒜ˆλ) was defined in (1.5). The simulation results of Leeb et al. (2015) suggest that if we perform ordinary least squares using the variables contained in the support set of the lasso, 𝒜ˆλ, then the classical confidence intervals associated with the least squares estimator,

$$\tilde\beta^{(\hat{\mathcal{A}}_\lambda)} \equiv \left(X_{\hat{\mathcal{A}}_\lambda}^\top X_{\hat{\mathcal{A}}_\lambda}\right)^{-1} X_{\hat{\mathcal{A}}_\lambda}^\top y, \qquad (2.1)$$

have approximately correct coverage, where correct coverage means that for all j𝒜ˆλ,

$$\Pr\left(\beta_j^{(\hat{\mathcal{A}}_\lambda)} \in \mathrm{CI}_j^{(\hat{\mathcal{A}}_\lambda)}\right) \ge 1 - \alpha. \qquad (2.2)$$

We reiterate that in (2.2), CIj(𝒜ˆλ) is the confidence interval output by standard least squares software applied to the data (y,X𝒜ˆλ). This goes against our statistical intuition: it seems that by fitting a lasso model and then performing least squares on the selected set, we are peeking at the data twice, and thus we would expect the confidence interval CIj(𝒜ˆλ) to be much too narrow.

In this section, we present a theoretical result that suggests that, in fact, this “double-peeking” might not be so bad. Our key insight is as follows: under certain assumptions, the set of variables selected by the lasso is deterministic and non-data-dependent with high probability. Thus, fitting a least squares model on the variables selected by the lasso does not really constitute peeking at the data twice: effectively, with high probability, we are only peeking at it once. That means that the naïve confidence intervals obtained from ordinary least squares will have approximately correct coverage, in the sense of (2.2).

We first introduce the required conditions for our theoretical result.

(M1) The design matrix $X$ is deterministic, with columns in general position (Rosset et al., 2004; Zhang, 2010; Dossal, 2012; Tibshirani, 2013). Columns of $X$ are standardized, i.e., for any $j = 1, \ldots, p$, $x_j^\top x_j = n$. The error $\epsilon$ in (1.1) has independent entries and sub-Gaussian tails. The sample size $n$, dimension $p$ and tuning parameter $\lambda$ satisfy

$$\sqrt{\log(p)/n}\;\lambda^{-1} \vee \lambda \to 0.$$

(E) Let $\hat\Sigma = X^\top X / n$. For any index set $\mathcal{M}$ with $|\mathcal{M}| = \mathcal{O}(q^*)$, let $\mathcal{N}$ be any index set such that $\mathcal{M} \subseteq \mathcal{N}$ and $|\mathcal{N} \setminus \mathcal{M}| \le |\mathcal{M}|$. Then for all $a \in \mathbb{R}^p$ that satisfy $\|a_{\mathcal{N}^c}\|_1 \le \|a_{\mathcal{N}}\|_1$ and $\|a_{\mathcal{N}^c}\|_\infty \le \min_{j \in \mathcal{N} \setminus \mathcal{M}} |a_j|$,

$$\phi_*^2 \equiv \liminf_{n \to \infty} \frac{a^\top \hat\Sigma a}{\|a\|_2^2} > 0.$$

In addition, the restricted sparse eigenvalue,

$$\phi^2(q) \equiv \sup_{\|b_{\mathcal{A}^{*c}}\|_0 \le q} \frac{b^\top \hat\Sigma b}{\|b\|_2^2},$$

satisfies $\phi^2(q^*) = \mathcal{O}\big(\log(p)/q^*\big)$.

(M2) Recall that $\mathcal{A}^* \equiv \operatorname{supp}(\beta^*)$. Let $\mathcal{A}_\lambda \equiv \operatorname{supp}(\beta_\lambda)$ and $b_\lambda^{\min} \equiv \min_{j \in \mathcal{A}_\lambda} |\beta_{\lambda,j}|$, with $\beta_\lambda$ defined in (1.7). The signal strength in $\mathcal{A}_\lambda$ satisfies

$$\lim_{n \to \infty} \frac{b_\lambda^{\min}}{\lambda} = \xi > 0, \qquad (2.3)$$

and the signal strength outside of $\mathcal{A}_\lambda$ satisfies

$$\left\| X_{\mathcal{A}_\lambda^c} \beta^*_{\mathcal{A}_\lambda^c} \right\|_2 = \mathcal{O}\big(\log(p)\big). \qquad (2.4)$$

(T) The strong irrepresentable condition with respect to $\mathcal{A}_\lambda$ and $\beta_\lambda$ holds, i.e., there exists $\delta > 0$ such that

$$\lim_{n \to \infty} \left\| X_{\mathcal{A}_\lambda^c}^\top X_{\mathcal{A}_\lambda} \left(X_{\mathcal{A}_\lambda}^\top X_{\mathcal{A}_\lambda}\right)^{-1} \operatorname{sign}\left(\beta_{\lambda, \mathcal{A}_\lambda}\right) \right\|_\infty < 1 - \delta. \qquad (2.5)$$

Condition (M1) is mild and standard in the literature. Note that in (M1), we require the lasso tuning parameter $\lambda$ to approach zero at a slightly slower rate than $\sqrt{\log(p)/n}$, in order to control the randomness of the error $\epsilon$. Most of the standard literature requires $\lambda > C\sqrt{\log(p)/n}$ for some constant $C > 0$, a requirement that is also satisfied under our condition.

In (M2), the requirement that $\lim_{n\to\infty} b_\lambda^{\min}/\lambda > 0$ indicates that the noiseless lasso regression coefficients are either 0, or asymptotically no smaller than $\lambda$, which is larger than $\sqrt{\log(p)/n}$. Note that this assumption concerns variables chosen by the noiseless lasso, not the true model. The second part of (M2), $\|X_{\mathcal{A}_\lambda^c} \beta^*_{\mathcal{A}_\lambda^c}\|_2 = \mathcal{O}(\log(p))$, implies that the total signal strength of the weak-signal variables that are not selected by the noiseless lasso cannot be so large as to be distinguishable from the sub-Gaussian noise. As shown in the proof of Proposition 2.1, the signal strength condition (2.3) is important in showing that $\lim_{n\to\infty} \Pr[\mathcal{A}_\lambda \subseteq \hat{\mathcal{A}}_\lambda] = 1$, whereas the condition (2.4) is instrumental in showing that $\lim_{n\to\infty} \Pr[\hat{\mathcal{A}}_\lambda \subseteq \mathcal{A}_\lambda] = 1$.

Condition (E) concerns the behavior of the eigenvalues of $\hat\Sigma$. Specifically, the first part of (E) is the restricted eigenvalue condition (Bickel et al., 2009; van de Geer and Bühlmann, 2009), except that instead of requiring $|\mathcal{M}| = q^*$ as in Bickel et al. (2009) and van de Geer and Bühlmann (2009), we here require $|\mathcal{M}| = \mathcal{O}(q^*)$. The second part of (E) is the sparse Riesz, or sparse eigenvalue, condition (Zhang and Huang, 2008; Belloni and Chernozhukov, 2013). Both the restricted eigenvalue and sparse Riesz conditions are standard and mild conditions in the literature. (E) implies that $\log(p)/q^* \ge \psi > 1$.

Condition (T) is the strong irrepresentable condition with respect to $\mathcal{A}_\lambda$ and $\beta_\lambda$. Condition (T) is likely weaker than the classical irrepresentable condition with respect to $\beta^*$ and $\mathcal{A}^*$, proposed in Zhao and Yu (2006): the classical irrepresentable condition implies that $\mathcal{A}_\lambda = \mathcal{A}^*$ for large $n$, in which case (T) holds, as it becomes identical to the classical irrepresentable condition. Lemma 2.1, which is proven in Section S2 of the Supplementary Materials, gives a sufficient condition for (T).

Lemma 2.1.

Suppose conditions (M1), (M2) and (E) hold. Condition (T) holds if

$$\lim_{n \to \infty} \frac{2\sqrt{2 q^*}}{n \phi_*^2} \left\| X_{\mathcal{A}_\lambda^c}^\top X_{\mathcal{A}_\lambda} \right\|_\infty < 1 - \delta. \qquad (2.6)$$

Condition (2.6) allows $\|X_{\mathcal{A}_\lambda^c}^\top X_{\mathcal{A}_\lambda}\|_\infty$ to diverge to infinity at a slower rate than $n/\sqrt{q^*}$. A more stringent version of condition (2.6) is presented in Voorman et al. (2014).

We use (T) to prove that $\lim_{n\to\infty} \|\tau_{\lambda, \mathcal{A}_\lambda^c}\|_\infty \le 1 - \delta$, where $\lambda n \tau_\lambda = X^\top(X\beta^* - X\beta_\lambda)$ is the stationarity condition of (1.7). Note that $\lim_{n\to\infty} \|\tau_{\lambda, \mathcal{A}_\lambda^c}\|_\infty \le 1 - \delta$ is the condition actually required for Proposition 2.1, so (T) is sufficient but may not be necessary. However, (T) is more standard and interpretable than the condition $\lim_{n\to\infty} \|\tau_{\lambda, \mathcal{A}_\lambda^c}\|_\infty \le 1 - \delta$.

We now present Proposition 2.1, which is proven in Section S1 of the online Supplementary Materials.

PROPOSITION 2.1.

Suppose conditions (M1), (M2), (E) and (T) hold. Then, we have $\lim_{n\to\infty} \Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda] = 1$, where $\mathcal{A}_\lambda \equiv \operatorname{supp}(\beta_\lambda)$, with $\beta_\lambda$ defined in (1.7).

The proof of Proposition 2.1 is in the same spirit as Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Tropp (2006) and Wainwright (2009), and is based on absorbing the contribution of the weak signals $X_{\mathcal{A}_\lambda^c} \beta^*_{\mathcal{A}_\lambda^c}$ into the noise vector. However, variable selection consistency asserts that $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}^*] \to 1$, whereas Proposition 2.1 states that $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda] \to 1$. Consequently, variable selection consistency requires the irrepresentable and signal strength conditions with respect to $\beta^*$, whereas Proposition 2.1 requires the irrepresentable and signal strength conditions with respect to $\beta_\lambda$. These conditions are likely much milder than those for the variable selection consistency of the lasso, as confirmed in our simulations in Section 3. In these simulations, we estimate $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda]$ in 36 settings with two choices of $\lambda$. As shown in Table 4, in most settings, $\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda$ with high probability, especially in cases with large $n$. We also estimated $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}^*]$ in the same 36 settings, with the same two choices of $\lambda$, as well as an additional choice of $\lambda$ that minimizes the cross-validated MSE; we found that the estimated $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}^*]$ was 0 in all 36 × 3 settings.

TABLE 4.

The proportion of $\hat{\mathcal{A}}_\lambda^b$ that equal $\mathcal{A}_\lambda^b$, $b = 1, \ldots, 1000$, under the scale-free graph and stochastic block model settings with tuning parameters $\lambda_{\sup}$ and $\lambda_{1\mathrm{SE}}$. In the simulation, $\rho \in \{0.2, 0.6\}$, sample size $n \in \{100, 300, 500\}$, dimension $p = 100$ and signal-to-noise ratio $\mathrm{SNR} \in \{0.1, 0.3, 0.5\}$.

ρ = 0.2; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

Scale-free λsup 0.958 0.725 0.928 0.649 1.000 1.000 0.971 0.999 0.998
Scale-free λ1SE 0.417 0.665 0.829 0.567 0.955 0.940 0.804 0.976 0.964

Stochastic block λsup 0.964 0.736 0.931 0.678 1.000 1.000 0.964 1.000 1.000
Stochastic block λ1SE 0.394 0.661 0.815 0.582 0.964 0.957 0.582 0.559 0.964

ρ = 0.6; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

Scale-free λsup 0.952 0.713 0.924 0.687 1.000 1.000 0.968 0.999 0.998
Scale-free λ1SE 0.396 0.657 0.799 0.572 0.950 0.941 0.779 0.976 0.963

Stochastic block λsup 0.950 0.750 0.935 0.687 1.000 1.000 0.967 1.000 0.999
Stochastic block λ1SE 0.445 0.661 0.806 0.559 0.959 0.948 0.805 0.987 0.974

Proposition 2.1 suggests that asymptotically, we “pay no price” for peeking at our data by performing the lasso: we should be able to perform downstream analyses on the subset of variables in $\hat{\mathcal{A}}_\lambda$ as though we had obtained that subset without looking at the data. In particular, Proposition 2.1 allows us to build asymptotically valid confidence intervals. This intuition is formalized in Theorem 2.1.

Theorem 2.1, which is proven in Section S3 in the online Supplementary Materials, shows that β˜(𝒜ˆλ) in (2.1) is asymptotically normal, with mean and variance suggested by classical least squares theory: that is, the fact that 𝒜ˆλ was selected by peeking at the data has no effect on the asymptotic distribution of β˜(𝒜ˆλ). This result requires that λ be chosen in a non-data-adaptive way. Otherwise, 𝒜λ will be affected by the random error ϵ through λ, which complicates the distribution of β˜(𝒜ˆλ). Theorem 2.1 also requires Condition (W), which is used to apply the Lindeberg-Feller Central Limit Theorem. This condition can be relaxed if the noise ϵ is normally distributed.

(W) $\lambda$, $\beta^*$ and $X$ are such that $\lim_{n\to\infty} \|r_w\|_\infty / \|r_w\|_2 = 0$, where

$$r_w \equiv e_j \left(X_{\mathcal{A}_\lambda}^\top X_{\mathcal{A}_\lambda}\right)^{-1} X_{\mathcal{A}_\lambda}^\top,$$

and $e_j$ is the row vector of length $|\mathcal{A}_\lambda|$ with the entry corresponding to $\beta_j^*$ equal to one, and zero otherwise.

Theorem 2.1.

Suppose $\lim_{n\to\infty} \Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda] = 1$ and (W) holds. Then, for any $j \in \hat{\mathcal{A}}_\lambda$,

$$\frac{\tilde\beta_j^{(\hat{\mathcal{A}}_\lambda)} - \beta_j^{(\hat{\mathcal{A}}_\lambda)}}{\sigma_\epsilon \sqrt{\left[\left(X_{\hat{\mathcal{A}}_\lambda}^\top X_{\hat{\mathcal{A}}_\lambda}\right)^{-1}\right]_{(j,j)}}} \overset{d}{\to} \mathcal{N}(0, 1), \qquad (2.7)$$

where $\tilde\beta^{(\hat{\mathcal{A}}_\lambda)}$ is defined in (2.1) and $\beta^{(\hat{\mathcal{A}}_\lambda)}$ in (1.5), and $\sigma_\epsilon^2$ is the variance of the entries of $\epsilon$ in (1.1).

The error standard deviation $\sigma_\epsilon$ in (2.7) is usually unknown. It can be estimated using various high-dimensional estimation methods, e.g., the scaled lasso (Sun and Zhang, 2012), cross-validation (CV) based methods (Fan et al., 2012) or method-of-moments based methods (Dicker, 2014); see Reid et al. (2016) for a comparison of high-dimensional error variance estimation methods. Alternatively, Theorem 2.2 shows that we can also consistently estimate the error variance using the residual sum of squares (RSS) from the post-selection OLS fit.

Theorem 2.2.

Suppose $\lim_{n\to\infty} \Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda] = 1$ and $\log(p)/(n - q_\lambda) \to 0$, where $q_\lambda \equiv |\mathcal{A}_\lambda|$. Then

$$\frac{1}{n - \hat q_\lambda} \left\| y - X_{\hat{\mathcal{A}}_\lambda} \tilde\beta^{(\hat{\mathcal{A}}_\lambda)} \right\|_2^2 \overset{p}{\to} \sigma_\epsilon^2, \qquad (2.8)$$

where $\hat q_\lambda \equiv |\hat{\mathcal{A}}_\lambda|$.

Theorem 2.2 is proven in Section S4 in the online Supplementary Materials. In (2.8), $y - X_{\hat{\mathcal{A}}_\lambda} \tilde\beta^{(\hat{\mathcal{A}}_\lambda)}$ is the vector of OLS residuals from the sub-model (1.3). Also, $\log(p)/(n - q_\lambda) \to 0$ is a weak condition: since $\log(p)/n \to 0$, the condition $\log(p)/(n - q_\lambda) \to 0$ is satisfied whenever $\lim_{n\to\infty} q_\lambda / n < 1$.

To summarize, in this section, we have provided a theoretical justification for a procedure that seems, intuitively, to be statistically unjustifiable:

  1. Perform the lasso in order to obtain the support set 𝒜ˆλ;

  2. Use least squares to fit the sub-model containing just the features in 𝒜ˆλ;

  3. Use the classical confidence intervals from that least squares model, without accounting for the fact that 𝒜ˆλ was obtained by peeking at the data.

Theorem 2.1 guarantees that the naïve confidence intervals in Step 3 will indeed have approximately correct coverage, in the sense of (2.2).
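As a concrete illustration, the following R sketch carries out Steps 1–3 on simulated data. It is a minimal sketch rather than the authors' code: $\lambda$ is fixed at an arbitrary non-data-adaptive value, and the variable names are ours.

```r
# Naive two-step inference: (1) lasso selection, (2) OLS on the selected columns,
# (3) classical confidence intervals with no adjustment for selection.
library(glmnet)

set.seed(1)
n <- 200; p <- 500
X <- scale(matrix(rnorm(n * p), n, p)) * sqrt(n / (n - 1))  # centered, ||x_k||_2^2 = n
beta_star <- c(1, rep(0.1, 4), rep(0, p - 5))
y <- as.numeric(X %*% beta_star + rnorm(n))

lambda <- 2 * sqrt(log(p) / n)                    # fixed, non-data-adaptive choice
fit <- glmnet(X, y, lambda = lambda, intercept = FALSE, standardize = FALSE)
A_hat <- which(as.numeric(coef(fit))[-1] != 0)    # Step 1: lasso support

if (length(A_hat) > 0) {
  ols <- lm(y ~ X[, A_hat, drop = FALSE] - 1)     # Step 2: OLS on the sub-model
  # Step 3: naive confidence intervals for beta^(A_hat); lm's standard errors use
  # RSS / (n - |A_hat|), i.e., the error variance estimator of Theorem 2.2.
  # (confint uses t rather than normal quantiles, which is immaterial for large n.)
  print(confint(ols, level = 0.95))
}
```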

3. NUMERICAL EXAMINATION OF NAÏVE CONFIDENCE INTERVALS

In this section, we perform simulation studies to examine the coverage probability (2.2) of the naïve confidence intervals obtained by applying standard least squares software to the data y,X𝒜ˆλ.

Recall from Section 1 that (2.2) involves the probability that the confidence interval contains the quantity β(𝒜ˆλ), which in general does not equal the population regression coefficient vector β𝒜ˆλ*. Inference for β* is discussed in Sections 4 and 5.

The results in this section complement simulation findings in Leeb et al. (2015).

3.1. Methods for Comparison

Following Theorem 2.1, for β˜(𝒜ˆλ) defined in (2.1), and for each j𝒜ˆλ, the 95% naïve confidence interval takes the form

$$\mathrm{CI}_j^{(\hat{\mathcal{A}}_\lambda)} \equiv \left[ \tilde\beta_j^{(\hat{\mathcal{A}}_\lambda)} - 1.96 \times \hat\sigma_\epsilon \sqrt{\left[\left(X_{\hat{\mathcal{A}}_\lambda}^\top X_{\hat{\mathcal{A}}_\lambda}\right)^{-1}\right]_{(j,j)}},\; \tilde\beta_j^{(\hat{\mathcal{A}}_\lambda)} + 1.96 \times \hat\sigma_\epsilon \sqrt{\left[\left(X_{\hat{\mathcal{A}}_\lambda}^\top X_{\hat{\mathcal{A}}_\lambda}\right)^{-1}\right]_{(j,j)}} \right]. \qquad (3.1)$$

In order to obtain the set $\hat{\mathcal{A}}_\lambda$, we must apply the lasso using some value of $\lambda$. By (M1) we need $\lambda \gg \sqrt{\log(p)/n}$, which is slightly larger than the prediction-optimal rate, $\lambda \asymp \sqrt{\log(p)/n}$ (Bickel et al., 2009; van de Geer and Bühlmann, 2009). A data-driven way to obtain a larger tuning parameter is to use $\lambda_{1\mathrm{SE}}$, which is the largest value of $\lambda$ for which the 10-fold CV prediction mean squared error (PMSE) is within one standard error of the minimum CV PMSE (see Section 7.10.1 in Hastie et al., 2009). However, the $\lambda$ selected by cross-validation depends on $y$ and may induce additional randomness in the set of selected coefficients, $\hat{\mathcal{A}}_\lambda$. This randomness can also impact the exact post-selection procedure of Lee et al. (2016). To address this issue, the authors proposed $\lambda_{\sup} \equiv 2\,\mathrm{E}\left[\|X^\top e\|_\infty\right]/n$, where $e \sim \mathcal{N}_n(0, \hat\sigma_\epsilon^2 I)$. As an alternative to $\lambda_{1\mathrm{SE}}$, we also evaluate $\lambda_{\sup}$, where we simulate $e$ and approximate the expectation by the average over 1000 replicates. We compare the confidence intervals for $\beta^{(\hat{\mathcal{A}}_\lambda)}$ with those based on the exact lasso post-selection inference procedure of Lee et al. (2016), which is implemented in the R package selectiveInference.

In both approaches, the standard deviation of the errors, $\sigma_\epsilon$ in (1.1), is estimated either using the scaled lasso (Sun and Zhang, 2012) or by applying Theorem 2.2. However, we do not examine the combination of $\lambda_{\sup}$ with the error variance estimate based on Theorem 2.2: computing $\lambda_{\sup}$ requires an estimate of the error standard deviation, whereas the estimator of Theorem 2.2 in turn requires a suitable choice of $\lambda$ to be valid.
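For concreteness, the Monte Carlo approximation of $\lambda_{\sup}$ described above might be coded as follows; this is our own sketch, with sigma_hat assumed to come from the scaled lasso and 1000 draws as in the text.

```r
# Approximate lambda_sup = 2 E[ || X' e ||_inf ] / n with e ~ N(0, sigma_hat^2 I),
# averaging over B Monte Carlo draws of e.
lambda_sup <- function(X, sigma_hat, B = 1000) {
  n <- nrow(X)
  draws <- replicate(B, {
    e <- rnorm(n, mean = 0, sd = sigma_hat)
    max(abs(crossprod(X, e)))       # || X' e ||_inf for one draw
  })
  2 * mean(draws) / n
}
```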

3.2. Simulation Set-Up

For the simulations, we consider two partial correlation settings for X, generated based on (i) a scale-free graph and (ii) a stochastic block model (see, e.g., Kolaczyk, 2009), each containing p=100 nodes. These settings are relaxations of the simple orthogonal and block-diagonal settings, and are displayed in Figure 1.

FIG 1. The scale-free graph and stochastic block model settings. The size of a given node indicates the magnitude of the corresponding element of $\beta^*$.

In the scale-free graph setting, we used the igraph package in R to simulate an undirected, scale-free network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with power-law exponent parameter $\gamma = 5$ and edge density 0.05. Here, $\mathcal{V} = \{1, \ldots, p\}$ is the set of nodes in the graph, and $\mathcal{E}$ is the set of edges. This resulted in a total of $|\mathcal{E}| = 247$ edges in the graph. We then ordered the indices of the nodes in the graph so that the first, second, third, fourth, and fifth nodes correspond to the 10th, 20th, 30th, 40th, and 50th least-connected nodes in the graph.

In the stochastic block model setting, we first generate two dense Erdős-Rényi graphs (Erdős and Rényi, 1959; Gilbert, 1959) with five and 95 nodes, respectively. In each graph, the edge density is 0.3. We then add edges randomly between these two graphs to achieve an inter-graph edge density of 0.05. The indices of the nodes are ordered so that the nodes in the five-node graph precede the remaining nodes.

Next, for both graph settings, we define the weighted adjacency matrix, A, as follows:

$$A_{(j,k)} = \begin{cases} 1 & \text{for } j = k, \\ \rho & \text{for } (j,k) \in \mathcal{E}, \\ 0 & \text{otherwise,} \end{cases} \qquad (3.2)$$

where $\rho \in \{0.2, 0.6\}$. We then set $\Sigma = A^{-1}$, and standardize $\Sigma$ so that $\Sigma_{(j,j)} = 1$ for all $j = 1, \ldots, p$. We simulate observations $x_1, \ldots, x_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}_p(0, \Sigma)$, and generate the outcome $y \sim \mathcal{N}_n(X\beta^*, \sigma_\epsilon^2 I_n)$, with $n \in \{100, 300, 500\}$, where

$$\beta_j^* = \begin{cases} 1 & \text{for } j = 1, \\ 0.1 & \text{for } 2 \le j \le 5, \\ 0 & \text{otherwise.} \end{cases}$$

A range of error variances $\sigma_\epsilon^2$ are used to produce signal-to-noise ratios $\mathrm{SNR} \equiv (\beta^{*\top} \Sigma \beta^*)/\sigma_\epsilon^2 \in \{0.1, 0.3, 0.5\}$.

Throughout the simulations, Σ and β* are held fixed over B=1000 repetitions of the simulation study, while X and y vary.

3.3. Simulation Results

We calculate the average length and coverage proportion of the 95% naïve confidence intervals, where the coverage proportion is defined as

$$\text{Coverage Proportion} \equiv \frac{1}{B} \sum_{b=1}^{B} \frac{\sum_{j \in \hat{\mathcal{A}}_\lambda^b} 1\left\{\beta_j^{(\hat{\mathcal{A}}_\lambda^b)} \in \mathrm{CI}_j^{(\hat{\mathcal{A}}_\lambda^b), b}\right\}}{\left|\hat{\mathcal{A}}_\lambda^b\right|}, \qquad (3.3)$$

where $\hat{\mathcal{A}}_\lambda^b$ and $\mathrm{CI}_j^{(\hat{\mathcal{A}}_\lambda^b), b}$ are the set of variables selected by the lasso in the $b$th repetition, and the 95% naïve confidence interval (3.1) for the $j$th variable in the $b$th repetition, respectively. Recall that $\beta_j^{(\hat{\mathcal{A}}_\lambda^b)}$ was defined in (1.5). In order to calculate the average length and coverage proportion associated with the exact lasso post-selection procedure of Lee et al. (2016), we replace $\mathrm{CI}_j^{(\hat{\mathcal{A}}_\lambda^b), b}$ in (3.3) with the confidence interval output by the selectiveInference R package.
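As an illustration of how each term of (3.3) is computed, the sketch below (with our own variable names) evaluates $\beta^{(\hat{\mathcal{A}}_\lambda^b)}$ via (1.5) and the per-repetition coverage for one simulated data set.

```r
# Per-repetition coverage: fraction of selected variables whose sub-model
# coefficient beta_j^(A_hat) from (1.5) lies in its naive confidence interval.
coverage_one_rep <- function(X, beta_star, A_hat, ci) {  # ci: |A_hat| x 2 matrix
  XA <- X[, A_hat, drop = FALSE]
  beta_M <- solve(crossprod(XA), crossprod(XA, X %*% beta_star))  # (1.5)
  mean(ci[, 1] <= beta_M & beta_M <= ci[, 2])
}
# The coverage proportion in (3.3) averages coverage_one_rep over the repetitions
# with a nonempty selected set.
```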

Tables 1 and 2 show the coverage proportion and average length of 95% naïve confidence intervals and 95% exact lasso post-selection confidence intervals under the scale-free graph and stochastic block model settings, respectively. The results show that the coverage of the exact post-selection confidence intervals is closer to the nominal level than that of the naïve confidence intervals when the sample size is small and the signal is weak; when the sample size is large and/or the signal is relatively strong, both types of confidence intervals have approximately correct coverage. This corroborates the findings in Leeb et al. (2015), in which the authors consider settings with n = 30 and p = 10. The coverage of the naïve confidence intervals with tuning parameter $\lambda_{1\mathrm{SE}}$ is a bit smaller than the desired level, especially when the signal is weak. This may be due to the randomness in $\lambda_{1\mathrm{SE}}$. The naïve error variance estimator of Theorem 2.2 performs similarly to the scaled lasso across all settings. In addition, Tables 1 and 2 also show that the naïve confidence intervals are substantially narrower than the exact lasso post-selection confidence intervals, especially when the signal is weak.

TABLE 1.

Coverage proportions (Cov) and average lengths (Len) of 95% naïve confidence intervals with tuning parameters λsup and λ1SE, and 95% exact post-selection confidence intervals under the scale-free graph setting with partial correlation ρ{0.2,0.6}, sample size n{100,300,500}, dimension p=100 and signal-to-noise ratio SNR{0.1,0.3,0.5}. The error variance is estimated either through the scaled lasso (Sun and Zhang, 2012) (SL) or Theorem 2.2 (NL).

ρ = 0.2; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

exact λsup SL 0.905 0.942 0.936 0.949 0.953 0.953 0.946 0.949 0.949
Cov naïve λsup SL 0.683 0.961 0.944 0.951 0.952 0.951 0.971 0.948 0.947
naïve λ1SE SL 0.596 0.876 0.887 0.905 0.917 0.910 0.944 0.935 0.932
naïve λ1SE NL 0.587 0.870 0.883 0.905 0.919 0.911 0.945 0.939 0.933

exact λsup SL 4.260 1.634 0.823 1.907 0.437 0.330 0.814 0.327 0.255
Len naïve λsup SL 1.122 0.693 0.552 0.717 0.421 0.326 0.562 0.325 0.253
naïve λ1SE SL 1.199 0.705 0.555 0.718 0.420 0.326 0.560 0.326 0.253
naïve λ1SE NL 1.201 0.711 0.559 0.723 0.424 0.329 0.563 0.328 0.255

ρ = 0.6; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

exact λsup SL 0.956 0.938 0.945 0.922 0.944 0.945 0.948 0.945 0.947
Cov naïve λsup SL 0.625 0.943 0.953 0.942 0.942 0.942 0.964 0.944 0.944
naïve λ1SE SL 0.643 0.873 0.876 0.918 0.934 0.929 0.954 0.941 0.935
naïve λ1SE NL 0.639 0.870 0.873 0.919 0.934 0.929 0.957 0.944 0.938

exact λsup SL 6.433 1.616 0.770 1.892 0.429 0.328 0.787 0.326 0.253
Len naïve λsup SL 1.108 0.692 0.553 0.713 0.420 0.326 0.560 0.325 0.252
naïve λ1SE SL 1.200 0.705 0.554 0.720 0.420 0.326 0.561 0.326 0.253
naïve λ1SE NL 1.203 0.709 0.557 0.725 0.423 0.329 0.564 0.327 0.254

TABLE 2.

Coverage proportions (Cov) and average lengths (Len) of 95% naïve confidence intervals with tuning parameters λsup and λ1SE, and 95% exact post-selection confidence intervals under the stochastic block model setting with partial correlation ρ{0.2,0.6}, sample size n{100,300,500}, dimension p=100 and signal-to-noise ratio SNR{0.1,0.3,0.5}. The error variance is estimated either through the scaled lasso (Sun and Zhang, 2012) (SL) or Theorem 2.2 (NL).

ρ = 0.2; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

exact λsup SL 0.921 0.934 0.948 0.951 0.934 0.933 0.953 0.958 0.958
Cov naïve λsup SL 0.540 0.955 0.960 0.953 0.931 0.930 0.966 0.957 0.957
naïve λ1SE SL 0.585 0.850 0.858 0.911 0.922 0.917 0.961 0.944 0.940
naïve λ1SE NL 0.586 0.854 0.865 0.913 0.924 0.918 0.961 0.945 0.942

exact λsup SL 4.721 1.692 0.857 1.873 0.425 0.325 0.758 0.325 0.252
Len naïve λsup SL 1.111 0.682 0.545 0.709 0.416 0.323 0.559 0.323 0.251
naïve λ1SE SL 1.188 0.703 0.552 0.715 0.416 0.323 0.557 0.323 0.251
naïve λ1SE NL 1.195 0.710 0.558 0.721 0.420 0.326 0.560 0.325 0.253

ρ = 0.6; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

exact λsup SL 0.923 0.945 0.949 0.959 0.956 0.958 0.943 0.948 0.948
Cov naïve λsup SL 0.569 0.959 0.964 0.964 0.955 0.956 0.963 0.946 0.946
naïve λ1SE SL 0.568 0.841 0.852 0.922 0.933 0.927 0.947 0.946 0.941
naïve λ1SE NL 0.561 0.841 0.850 0.922 0.935 0.929 0.950 0.947 0.943

exact λsup SL 4.744 1.662 0.832 1.987 0.424 0.325 0.807 0.324 0.251
Len naïve λsup SL 1.119 0.690 0.545 0.712 0.416 0.323 0.557 0.322 0.250
naïve λ1SE SL 1.194 0.704 0.554 0.713 0.416 0.323 0.555 0.322 0.250
naïve λ1SE NL 1.194 0.707 0.557 0.718 0.420 0.326 0.558 0.324 0.252

In addition, to evaluate whether $\hat{\mathcal{A}}_\lambda$ is deterministic, we also calculate, among the repetitions in which $\hat{\mathcal{A}}_\lambda^b \ne \emptyset$ (there is no confidence interval if $\hat{\mathcal{A}}_\lambda^b = \emptyset$), the proportion of repetitions in which $\hat{\mathcal{A}}_\lambda^b = \mathcal{D}$, where $\mathcal{D}$ is the most common value of $\hat{\mathcal{A}}_\lambda^b$, $b = 1, \ldots, 1000$. The result is summarized in Table 3, which shows that $\hat{\mathcal{A}}_\lambda$ is almost deterministic with tuning parameter $\lambda_{\sup}$. With $\lambda_{1\mathrm{SE}}$, due to the randomness in the tuning parameter, $\hat{\mathcal{A}}_\lambda$ is less deterministic, which may explain why the coverage probability falls below the desired level in this case. We also examined the proportion of repetitions in which $\hat{\mathcal{A}}_\lambda^b = \mathcal{A}_\lambda^b$, which is shown in Table 4. The observation is similar to that in Table 3. As mentioned in Section 2, the estimated $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}^*]$ was 0 in all the settings considered.

TABLE 3.

Among repetitions in which $\hat{\mathcal{A}}_\lambda^b \ne \emptyset$, the proportion of $\hat{\mathcal{A}}_\lambda^b$ that equal the most common $\hat{\mathcal{A}}_\lambda^b$, $b = 1, \ldots, 1000$, under the scale-free graph and stochastic block model settings with tuning parameters $\lambda_{\sup}$ and $\lambda_{1\mathrm{SE}}$. In the simulation, $\rho \in \{0.2, 0.6\}$, sample size $n \in \{100, 300, 500\}$, dimension $p = 100$ and signal-to-noise ratio $\mathrm{SNR} \in \{0.1, 0.3, 0.5\}$.

ρ = 0.2; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

Scale-free; λsup 0.000 0.999 0.993 1.000 0.999 0.998 1.000 0.999 0.996
Scale-free; λ1SE 0.832 0.958 0.851 0.961 0.948 0.934 0.988 0.980 0.966

Stochastic block; λsup 1.000 0.997 0.992 0.998 1.000 0.999 1.000 1.000 1.000
Stochastic block; λ1SE 0.838 0.839 0.831 0.964 0.957 0.946 0.992 0.979 0.968

ρ = 0.6; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 300 (SNR 0.1, 0.3, 0.5), and n = 500 (SNR 0.1, 0.3, 0.5).

Scale-free; λsup 0.995 0.996 0.996 0.999 0.998 0.998 0.999 1.000 1.000
Scale-free; λ1SE 0.854 0.869 0.855 0.964 0.947 0.935 0.994 0.981 0.962

Stochastic block; λsup 0.998 0.998 0.998 1.000 1.000 1.000 1.000 0.999 0.999
Stochastic block; λ1SE 0.834 0.843 0.826 0.969 0.976 0.968 0.988 0.979 0.971

4. INFERENCE FOR β* WITH THE NAÏVE SCORE TEST

Sections 2 and 3 focused on the task of developing confidence intervals for $\beta^{(\mathcal{M})}$ in (1.3), where $\mathcal{M} = \hat{\mathcal{A}}_\lambda$, the set of variables selected by the lasso. However, recall from (1.5) that typically $\beta^{(\mathcal{M})} \ne \beta^*_{\mathcal{M}}$, where $\beta^*$ was introduced in (1.1).

In this section, we shift our focus to performing inference on β*. We will exploit Proposition 2.1 to develop a simple approach for testing H0,j*:βj*=0, for j=1,,p.

Recall that in the low-dimensional setting, the classical score statistic for the hypothesis $H_{0,j}^*: \beta_j^* = 0$ is proportional to $x_j^\top(y - \hat y_0)$, where $\hat y_0$ is the vector of fitted values that results from least squares regression of $y$ onto the $p - 1$ features $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$. In order to adapt the classical score test statistic to the high-dimensional setting, we define the naïve score test statistic for testing $H_{0,j}^*: \beta_j^* = 0$ as

$$S_j \equiv x_j^\top\left(y - \tilde y^{(\hat{\mathcal{A}}_\lambda \setminus \{j\})}\right) = x_j^\top\left(I_n - P_j\right) y, \qquad (4.1)$$

where

$$\tilde y^{(\hat{\mathcal{A}}_\lambda \setminus \{j\})} \equiv X_{\hat{\mathcal{A}}_\lambda \setminus \{j\}} \tilde\beta^{(\hat{\mathcal{A}}_\lambda \setminus \{j\})},$$

and

$$P_j \equiv X_{\hat{\mathcal{A}}_\lambda \setminus \{j\}} \left(X_{\hat{\mathcal{A}}_\lambda \setminus \{j\}}^\top X_{\hat{\mathcal{A}}_\lambda \setminus \{j\}}\right)^{-1} X_{\hat{\mathcal{A}}_\lambda \setminus \{j\}}^\top$$

is the orthogonal projection matrix onto the column space of $X_{\hat{\mathcal{A}}_\lambda \setminus \{j\}}$. Here, $\tilde\beta^{(\hat{\mathcal{A}}_\lambda \setminus \{j\})}$ is as defined in (2.1), with $\hat{\mathcal{A}}_\lambda$ replaced by $\hat{\mathcal{A}}_\lambda \setminus \{j\}$. In (4.1), the notation $\hat{\mathcal{A}}_\lambda \setminus \{j\}$ represents the set $\hat{\mathcal{A}}_\lambda$ in (1.6) with $j$ removed, if $j \in \hat{\mathcal{A}}_\lambda$. If $j \notin \hat{\mathcal{A}}_\lambda$, then $\hat{\mathcal{A}}_\lambda \setminus \{j\} = \hat{\mathcal{A}}_\lambda$.

In Theorem 4.1, we will derive the asymptotic distribution of Sj under H0,j*:βj*=0. We first introduce two new conditions.

First, we require that the total signal strength of variables not selected by the noiseless lasso, (1.7), is small.

(M2*) Recall that $\mathcal{A}^* \equiv \operatorname{supp}(\beta^*)$. Let $\mathcal{A}_\lambda \equiv \operatorname{supp}(\beta_\lambda)$ and $b_\lambda^{\min} \equiv \min_{j \in \mathcal{A}_\lambda} |\beta_{\lambda,j}|$, with $\beta_\lambda$ defined in (1.7). The signal strength in $\mathcal{A}_\lambda$ satisfies

$$\lim_{n \to \infty} \frac{b_\lambda^{\min}}{\lambda} = \xi > 0, \qquad (4.2)$$

and the signal strength outside of $\mathcal{A}_\lambda$ satisfies

$$\left\| X_{\mathcal{A}_\lambda^c} \beta^*_{\mathcal{A}_\lambda^c} \right\|_2 = \mathcal{O}(1). \qquad (4.3)$$

Condition (M2*) closely resembles (M2), which was required for Theorem 2.1 in Section 2. The only difference between the two is that (M2*) requires $\|X_{\mathcal{A}_\lambda^c} \beta^*_{\mathcal{A}_\lambda^c}\|_2 = \mathcal{O}(1)$, whereas (M2) requires only that $\|X_{\mathcal{A}_\lambda^c} \beta^*_{\mathcal{A}_\lambda^c}\|_2 = \mathcal{O}(\log(p))$. Recall that in Section 2, we considered inference for the parameters in the sub-model (1.3), for which the weaker condition (M2) sufficed. In other words, testing the population regression coefficients $\beta^*$ in (1.1) requires more stringent assumptions than constructing confidence intervals for the parameters in the sub-model (1.3).

The following condition, required to apply the Lindeberg-Feller Central Limit Theorem, can be relaxed if the noise ϵ in (1.1) is normally distributed.

(S) $\lambda$, $\beta^*$ and $X$ satisfy $\lim_{n\to\infty} \|r_s\|_\infty / \|r_s\|_2 = 0$, where $r_s \equiv (I_n - P_{\mathcal{A}_\lambda \setminus \{j\}}) x_j$.

We now present Theorem 4.1, which is proven in Section S5 of the online Supplementary Materials.

Theorem 4.1.

Suppose (M2*) and (S) hold and $\lim_{n\to\infty} \Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda] = 1$. For any $j = 1, \ldots, p$, under the null hypothesis $H_{0,j}^*: \beta_j^* = 0$,

$$T \equiv \frac{S_j}{\sigma_\epsilon \sqrt{x_j^\top (I_n - P_j) x_j}} \overset{d}{\to} \mathcal{N}(0, 1), \qquad (4.4)$$

where $S_j$ was defined in (4.1), and where $\sigma_\epsilon^2$ is the variance of the entries of $\epsilon$ in (1.1).

Theorem 4.1 states that the distribution of the naïve score test statistic $S_j$ is asymptotically the same as if $\hat{\mathcal{A}}_\lambda$ were a fixed set, as opposed to being selected by fitting a lasso model on the data. Based on (4.4), we reject the null hypothesis $H_{0,j}^*: \beta_j^* = 0$ at level $\alpha > 0$ if $|T| > \Phi^{-1}(1 - \alpha/2)$, where $\Phi^{-1}(\cdot)$ is the quantile function of the standard normal distribution.

We emphasize that Theorem 4.1 holds for any variable $j = 1, \ldots, p$, and thus can be used to test $H_{0,j}^*: \beta_j^* = 0$ for all $j = 1, \ldots, p$. (This is in contrast to Theorem 2.1, which concerns confidence intervals for the parameters in the sub-model (1.3) consisting of the variables in $\hat{\mathcal{A}}_\lambda$, and hence holds only for $j \in \hat{\mathcal{A}}_\lambda$.)
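To make the test concrete, a minimal R sketch of the naïve score test based on (4.1) and (4.4) is given below. The function name is ours, and sigma_hat is assumed to be supplied externally (e.g., from the scaled lasso or Theorem 2.2).

```r
# Naive score test of H_{0,j}: beta_j^* = 0, following (4.1) and (4.4).
# Returns the two-sided p-value based on the N(0, 1) limit.
naive_score_test <- function(X, y, j, A_hat, sigma_hat) {
  xj <- X[, j]
  A_minus_j <- setdiff(A_hat, j)
  if (length(A_minus_j) > 0) {
    XA <- X[, A_minus_j, drop = FALSE]
    # r = (I - P_j) x_j, so that S_j = r' y and x_j' (I - P_j) x_j = ||r||^2.
    r <- xj - XA %*% solve(crossprod(XA), crossprod(XA, xj))
  } else {
    r <- xj
  }
  T_stat <- sum(r * y) / (sigma_hat * sqrt(sum(r^2)))
  2 * pnorm(-abs(T_stat))
}
```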

5. NUMERICAL EXAMINATION OF THE NAÏVE SCORE TEST

In this section, we compare the performance of the naïve score test (4.1) to three recent proposals from the literature for testing $H_{0,j}^*: \beta_j^* = 0$: namely, LDPE (Zhang and Zhang, 2014; van de Geer et al., 2014), SSLasso (Javanmard and Montanari, 2014a), and the decorrelated score test (dScore; Ning and Liu, 2017). (Since, with high probability, $\hat{\mathcal{A}}_\lambda \ne \mathcal{A}^*$, we do not include the exact post-selection procedure in this comparison.) R code for SSLasso and dScore was provided by the authors; LDPE is implemented in the R package hdi. For the naïve score test, we estimate $\sigma_\epsilon$, the standard deviation of the errors in (1.1), using either the scaled lasso (Sun and Zhang, 2012) or Theorem 2.2.

All four of these methods require us to select the value of the lasso tuning parameter. For LDPE, SSLasso, and dScore, we use 10-fold cross-validation to select the tuning parameter value that produces the smallest cross-validated mean squared error, $\lambda_{\min}$. As in the numerical study of the naïve confidence intervals in Section 3, we implement the naïve score test using the tuning parameter values $\lambda_{1\mathrm{SE}}$ and $\lambda_{\sup}$. Unless otherwise noted, all tests are performed at a significance level of 0.05.

In Section 5.1, we investigate the powers and type-I errors of the above tests in simulation experiments. Section 5.2 contains an analysis of a glioblastoma gene expression dataset.

5.1. Power and Type-I Error

5.1.1. Simulation Set-Up

In this section, we adapt the scale-free graph and the stochastic block model presented in Section 3.2 to have p=500.

In the scale-free graph setting, we generate a scale-free graph with $\gamma = 5$, edge density 0.05, and $p = 500$ nodes. The resulting graph has $|\mathcal{E}| = 6237$ edges. We order the nodes in the graph so that the $j$th node is the $(30 \times j)$th least-connected node in the graph, for $1 \le j \le 10$. For example, the 4th node is the 120th least-connected node in the graph.

In the stochastic block model setting, we generate two dense Erdős-Rényi graphs with ten nodes and 490 nodes, respectively; each has an intra-graph edge density of 0.3. The node indices are ordered so that the nodes in the smaller graph precede those in the larger graph. We then randomly connect nodes between the two graphs in order to obtain an inter-graph edge density of 0.05.

Next, for both graph settings, we generate $A$ as in (3.2), where $\rho \in \{0.2, 0.6\}$. We then set $\Sigma = A^{-1}$, and standardize $\Sigma$ so that $\Sigma_{(j,j)} = 1$ for all $j = 1, \ldots, p$. We simulate observations $x_1, \ldots, x_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}_p(0, \Sigma)$, and generate the outcome $y \sim \mathcal{N}_n(X\beta^*, \sigma_\epsilon^2 I_n)$, with $n \in \{100, 200, 400\}$, where

$$\beta_j^* = \begin{cases} 1 & \text{for } 1 \le j \le 3, \\ 0.1 & \text{for } 4 \le j \le 10, \\ 0 & \text{otherwise.} \end{cases}$$

A range of error variances $\sigma_\epsilon^2$ are used to produce signal-to-noise ratios $\mathrm{SNR} \equiv (\beta^{*\top} \Sigma \beta^*)/\sigma_\epsilon^2 \in \{0.1, 0.3, 0.5\}$.

We hold Σ and β* fixed over B=100 repetitions of the simulation, while X and y vary.

5.1.2. Simulation Results

For each test, the average power on the strong signal variables, the average power on the weak signal variables, and the average type-I error rate are defined as

$$\text{Power}_{\text{strong}} \equiv \frac{1}{B} \cdot \frac{1}{3} \sum_{b=1}^{B} \sum_{j:\, \beta_j^* = 1} 1\left\{ p_j^b < 0.05 \right\}, \qquad (5.1)$$

$$\text{Power}_{\text{weak}} \equiv \frac{1}{B} \cdot \frac{1}{7} \sum_{b=1}^{B} \sum_{j:\, \beta_j^* = 0.1} 1\left\{ p_j^b < 0.05 \right\}, \qquad (5.2)$$

$$\text{Type-I Error} \equiv \frac{1}{B} \cdot \frac{1}{490} \sum_{b=1}^{B} \sum_{j:\, \beta_j^* = 0} 1\left\{ p_j^b < 0.05 \right\}, \qquad (5.3)$$

respectively. In (5.1)–(5.3), $p_j^b$ is the p-value associated with the null hypothesis $H_{0,j}^*: \beta_j^* = 0$ in the $b$th simulated data set.
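Given a B × p matrix of p-values from any of the tests, the summaries (5.1)–(5.3) reduce to simple averages; below is a sketch with our own names, assuming the $\beta^*$ layout of Section 5.1.1 (strong signals j = 1–3, weak signals j = 4–10).

```r
# Average power and type-I error rate as in (5.1)-(5.3); pvals is a B x p matrix.
reject <- pvals < 0.05
power_strong <- mean(reject[, 1:3])              # strong signals: beta_j^* = 1
power_weak   <- mean(reject[, 4:10])             # weak signals:   beta_j^* = 0.1
type1_error  <- mean(reject[, 11:ncol(reject)])  # nulls:          beta_j^* = 0
```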

Tables 5 and 6 summarize the results in the two simulation settings. The naïve score test with $\lambda_{\sup}$ has slightly worse control of the type-I error rate, and somewhat better power, than the other four methods, which have approximate control of the type-I error rate and comparable power. The performance with the scaled lasso is similar to that with the error variance estimator of Theorem 2.2.

TABLE 5.

Average power and type-I error rate for the hypotheses $H_{0,j}^*: \beta_j^* = 0$ for $j = 1, \ldots, p$, as defined in (5.1)–(5.3), under the scale-free graph setting with $p = 500$. Results are shown for various values of $\rho$, $n$, and SNR. Methods for comparison include LDPE, SSLasso, dScore, and the naïve score test with tuning parameters $\lambda_{\min}$, $\lambda_{1\mathrm{SE}}$ and $\lambda_{\sup}$. The error variance is estimated either through the scaled lasso (Sun and Zhang, 2012) (S-Z) or Theorem 2.2 (T2.2). Note that, as mentioned in Section 3, we do not combine $\lambda_{\sup}$ with T2.2.

ρ = 0.2; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 200 (SNR 0.1, 0.3, 0.5), and n = 400 (SNR 0.1, 0.3, 0.5).

LDPE λmin S-Z 0.400 0.773 0.910 0.627 0.973 1.000 0.923 1.000 1.000
SSLasso λmin S-Z 0.410 0.770 0.950 0.650 0.970 1.000 0.910 1.000 1.000
Powerstrong dScore λmin S-Z 0.330 0.643 0.857 0.547 0.957 1.000 0.887 1.000 1.000
nScore λsup S-Z 0.403 0.847 0.960 0.727 0.990 0.997 0.940 1.000 1.000
nScore λ1SE S-Z 0.427 0.763 0.893 0.677 0.977 1.000 0.957 1.000 1.000
nScore λ1SE T2.2 0.357 0.763 0.890 0.680 0.977 1.000 0.910 1.000 1.000

LDPE λmin S-Z 0.064 0.083 0.056 0.054 0.059 0.079 0.070 0.079 0.113
SSLasso λmin S-Z 0.081 0.087 0.060 0.066 0.061 0.086 0.069 0.086 0.113
Powerweak dScore λmin S-Z 0.044 0.056 0.036 0.039 0.039 0.060 0.046 0.056 0.093
nScore λsup S-Z 0.061 0.091 0.109 0.070 0.109 0.107 0.097 0.103 0.114
nScore λ1SE S-Z 0.080 0.077 0.059 0.060 0.061 0.061 0.083 0.076 0.101
nScore λ1SE T2.2 0.067 0.070 0.070 0.054 0.071 0.069 0.079 0.094 0.123

LDPE λmin S-Z 0.051 0.052 0.051 0.049 0.051 0.047 0.050 0.051 0.049
SSLasso λmin S-Z 0.056 0.056 0.056 0.054 0.055 0.053 0.053 0.054 0.054
T1 Error dScore λmin S-Z 0.035 0.040 0.040 0.033 0.036 0.034 0.035 0.037 0.034
nScore λsup S-Z 0.069 0.082 0.095 0.064 0.083 0.079 0.065 0.068 0.050
nScore λ1SE S-Z 0.061 0.057 0.048 0.056 0.055 0.040 0.060 0.046 0.046
nScore λ1SE T2.2 0.049 0.052 0.049 0.054 0.053 0.050 0.056 0.049 0.050

ρ = 0.6; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 200 (SNR 0.1, 0.3, 0.5), and n = 400 (SNR 0.1, 0.3, 0.5).

LDPE λmin S-Z 0.330 0.783 0.947 0.627 0.980 1.000 0.887 1.000 1.000
SSLasso λmin S-Z 0.347 0.790 0.957 0.623 0.987 1.000 0.867 1.000 1.000
Powerstrong dScore λmin S-Z 0.270 0.673 0.883 0.533 0.960 0.993 0.863 1.000 1.000
nScore λsup S-Z 0.430 0.790 0.933 0.707 0.977 1.000 0.923 1.000 1.000
nScore λ1SE S-Z 0.357 0.767 0.887 0.677 0.980 0.997 0.937 1.000 1.000
nScore λ1SE T2.2 0.340 0.697 0.907 0.637 0.973 1.000 0.950 1.000 1.000

LDPE λmin S-Z 0.031 0.046 0.063 0.064 0.074 0.076 0.054 0.077 0.119
SSLasso λmin S-Z 0.047 0.063 0.076 0.063 0.090 0.099 0.053 0.076 0.121
Powerweak dScore λmin S-Z 0.021 0.037 0.047 0.036 0.060 0.044 0.034 0.050 0.083
nScore λsup S-Z 0.071 0.089 0.136 0.081 0.121 0.104 0.114 0.113 0.123
nScore λ1SE S-Z 0.039 0.060 0.050 0.076 0.074 0.066 0.070 0.067 0.104
nScore λ1SE T2.2 0.056 0.070 0.073 0.093 0.087 0.080 0.107 0.096 0.111

LDPE λmin S-Z 0.050 0.051 0.051 0.050 0.049 0.051 0.051 0.050 0.047
SSLasso λmin S-Z 0.056 0.056 0.056 0.054 0.055 0.053 0.053 0.054 0.054
T1 Error dScore λmin S-Z 0.033 0.036 0.034 0.031 0.031 0.035 0.036 0.035 0.033
nScore λsup S-Z 0.065 0.080 0.093 0.064 0.084 0.088 0.070 0.071 0.054
nScore λ1SE S-Z 0.056 0.060 0.045 0.061 0.051 0.040 0.058 0.048 0.047
nScore λ1SE T2.2 0.049 0.053 0.051 0.060 0.061 0.049 0.066 0.049 0.048
TABLE 6.

Average power and type-I error rate for the hypotheses $H_{0,j}^*: \beta_j^* = 0$ for $j = 1, \ldots, p$, as defined in (5.1)–(5.3), under the stochastic block model setting with $p = 500$. Details are as in Table 5.

ρ = 0.2; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 200 (SNR 0.1, 0.3, 0.5), and n = 400 (SNR 0.1, 0.3, 0.5).

LDPE λmin S-Z 0.370 0.793 0.937 0.687 0.990 1.000 0.914 1.000 1.000
SSLasso λmin S-Z 0.393 0.803 0.933 0.687 0.990 1.000 0.892 1.000 1.000
Powerstrong dScore λmin S-Z 0.333 0.783 0.917 0.693 0.993 1.000 0.905 1.000 1.000
nScore λsup S-Z 0.473 0.857 0.953 0.697 0.993 0.993 0.943 1.000 1.000
nScore λ1SE S-Z 0.400 0.797 0.903 0.713 0.997 1.000 0.910 1.000 1.000
nScore λ1SE T2.2 0.437 0.787 0.923 0.720 0.987 1.000 0.940 1.000 1.000

LDPE λmin S-Z 0.041 0.044 0.051 0.057 0.050 0.071 0.050 0.093 0.071
SSLasso λmin S-Z 0.054 0.056 0.074 0.071 0.056 0.089 0.071 0.101 0.101
Powerweak dScore λmin S-Z 0.037 0.044 0.057 0.060 0.046 0.077 0.056 0.101 0.094
nScore λsup S-Z 0.059 0.071 0.107 0.043 0.083 0.070 0.069 0.094 0.106
nScore λ1SE S-Z 0.047 0.059 0.060 0.059 0.047 0.059 0.062 0.106 0.105
nScore λ1SE T2.2 0.057 0.064 0.067 0.047 0.069 0.086 0.043 0.079 0.113

LDPE λmin S-Z 0.051 0.049 0.048 0.050 0.050 0.050 0.051 0.050 0.049
SSLasso λmin S-Z 0.057 0.056 0.058 0.054 0.054 0.054 0.054 0.053 0.054
T1ER dScore λmin S-Z 0.043 0.040 0.041 0.041 0.044 0.042 0.042 0.042 0.041
nScore λsup S-Z 0.064 0.074 0.090 0.059 0.076 0.076 0.060 0.060 0.049
nScore λ1SE S-Z 0.062 0.058 0.048 0.056 0.052 0.040 0.054 0.047 0.046
nScore λ1SE T2.2 0.052 0.050 0.050 0.050 0.050 0.049 0.049 0.047 0.047

ρ = 0.6; the nine columns of values correspond to n = 100 (SNR 0.1, 0.3, 0.5), n = 200 (SNR 0.1, 0.3, 0.5), and n = 400 (SNR 0.1, 0.3, 0.5).

LDPE λmin S-Z 0.327 0.827 0.960 0.700 0.983 0.997 0.968 1.000 1.000
SSLasso λmin S-Z 0.350 0.853 0.957 0.687 0.990 0.997 0.945 0.996 1.000
Powerstrong dScore λmin S-Z 0.297 0.787 0.937 0.697 0.987 0.993 0.968 0.996 1.000
nScore λsup S-Z 0.420 0.870 0.957 0.720 0.987 1.000 0.947 1.000 1.000
nScore λ1SE S-Z 0.350 0.800 0.927 0.717 0.980 1.000 0.968 1.000 1.000
nScore λ1SE T2.2 0.373 0.797 0.890 0.653 0.980 1.000 0.917 1.000 1.000

LDPE λmin S-Z 0.043 0.049 0.046 0.041 0.077 0.063 0.053 0.066 0.083
SSLasso λmin S-Z 0.061 0.054 0.070 0.053 0.086 0.083 0.067 0.099 0.105
Powerweak dScore λmin S-Z 0.044 0.047 0.046 0.040 0.081 0.069 0.063 0.077 0.098
nScore λsup S-Z 0.054 0.073 0.093 0.063 0.093 0.094 0.061 0.094 0.096
nScore λ1SE S-Z 0.059 0.056 0.044 0.054 0.087 0.074 0.067 0.086 0.103
nScore λ1SE T2.2 0.046 0.054 0.060 0.059 0.064 0.084 0.051 0.087 0.116

LDPE λmin S-Z 0.049 0.049 0.049 0.049 0.050 0.049 0.049 0.047 0.048
SSLasso λmin S-Z 0.057 0.056 0.056 0.053 0.054 0.054 0.053 0.053 0.053
T1 Error dScore λmin S-Z 0.033 0.039 0.036 0.031 0.033 0.033 0.032 0.030 0.031
nScore λsup S-Z 0.063 0.079 0.089 0.062 0.077 0.075 0.060 0.062 0.048
nScore λ1SE S-Z 0.057 0.051 0.047 0.056 0.049 0.039 0.055 0.045 0.046
nScore λ1SE T2.2 0.051 0.051 0.051 0.049 0.048 0.046 0.047 0.045 0.045

5.2. Application to Glioblastoma Data

We investigate a glioblastoma gene expression data set previously studied in Horvath et al. (2006). For each of 130 patients, a survival outcome is available; we removed the twenty patients who were still alive at the end of the study. This resulted in a data set with n=110 observations. The gene expression measurements were normalized using the method of Gautier et al. (2004). We limited our analysis to p=3600 highly-connected genes (Zhang and Horvath, 2005; Horvath and Dong, 2008). The normalized data can be found at the website of Dr. Steve Horvath of UCLA Biostatistics. We log-transformed the survival response and centered it to have mean zero. Furthermore, we log-transformed the expression data, and then standardized each gene to have mean zero and standard deviation one across the n=110 observations.

Our goal is to identify individual genes whose expression levels are associated with survival time, after adjusting for the other 3599 genes in the data set. With family-wise error rate (FWER) controlled at level 0.1 using the Holm procedure (Holm, 1979), the naïve score test identifies three such genes: CKS2, H2AFZ, and RPA3. You et al. (2015) observed that CKS2 is highly expressed in glioma. Vardabasso et al. (2014) found that histone genes, of which H2AFZ is one, are related to cancer progression. Jin et al. (2015) found that RPA3 is associated with glioma development. As a comparison, SSLasso finds two genes associated with patient survival: PPAP2C and RGS3. LDPE and dScore identify no genes at FWER of 0.1.
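For reference, the Holm correction used above can be applied to the vector of naïve score test p-values in one line of R; here pvals is assumed to hold the 3600 per-gene p-values, in the same order as the columns of X.

```r
# Genes whose FWER-adjusted p-values (Holm, 1979) fall below 0.1.
selected <- which(p.adjust(pvals, method = "holm") <= 0.1)
colnames(X)[selected]
```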

6. DISCUSSION

In this paper, we examined a very naïve two-step approach to high-dimensional inference:

  1. Perform the lasso in order to select a small set of variables, 𝒜ˆλ.

  2. Fit a least squares regression model using just the variables in 𝒜ˆλ, and make use of standard regression inference tools. Make no adjustment for the fact that 𝒜ˆλ was selected based on the data.

It seems clear that this naïve approach is problematic, since we have peeked at the data twice, but are not accounting for this double-peeking in our analysis.

In this paper, we have shown that under certain assumptions, $\hat{\mathcal{A}}_\lambda$ equals, with probability tending to one, a deterministic set, $\mathcal{A}_\lambda$. A similar result for random design matrices is presented in Zhao et al. (2019). This key insight allows us to establish that the confidence intervals resulting from the aforementioned naïve two-step approach have asymptotically correct coverage, in the sense of (1.4). This constitutes a theoretical justification for the recent simulation findings of Leeb et al. (2015). Furthermore, we used this key insight to establish that the score test that results from the naïve two-step approach has asymptotically the same distribution as though the selected set of variables had been fixed in advance; thus, it can be used to test the null hypotheses $H_{0,j}^*: \beta_j^* = 0$, $j = 1, \ldots, p$.

Our simulation results corroborate our theoretical findings. In fact, we find essentially no difference between the empirical performance of these naïve proposals, and a host of other recent proposals in the literature for high-dimensional inference (Javanmard and Montanari, 2014a; Zhang and Zhang, 2014; van de Geer et al., 2014; Lee et al., 2016; Ning and Liu, 2017).

From a bird’s-eye view, the recent literature on high-dimensional inference falls into two camps. The work of Wasserman and Roeder (2009); Meinshausen et al. (2009); Berk et al. (2013); Lee et al. (2016); Tibshirani et al. (2016) focuses on performing inference on the sub-model (1.3), whereas the work of Javanmard and Montanari (2013, 2014a,b); Zhang and Zhang (2014); van de Geer et al. (2014); Zhao and Shojaie (2016); Ning and Liu (2017) focuses on testing hypotheses associated with (1.1). In this paper, we have shown that the confidence intervals that result from the naïve approach can be used to perform inference on the sub-model (1.3), whereas the score test that results from the naïve approach can be used to test hypotheses associated with (1.1).

In the era of big data, simple analyses that are easy to apply and easy to understand are especially attractive to scientific investigators. Therefore, a careful investigation of such simple approaches is worthwhile, in order to determine which ones have the potential to yield accurate results, and which do not. We do not advocate applying the naïve two-step approach described above in most practical data analysis settings: we are confident that in practice, our intuition is correct, and this approach will perform poorly when the sample size is small or moderate, and/or the assumptions, which are unfortunately unverifiable, are not met. However, in very large data settings, our results suggest that this naïve approach may indeed be viable for high-dimensional inference, or at least warrants further investigation.

When choosing among existing lasso-based inference procedures, the target of inference should be taken into consideration. The target of inference can either be the population parameters, $\beta^*$ in (1.1), or the parameters induced by the sub-model chosen by the lasso, $\beta^{(\mathcal{M})}$ in (1.5). Sample-splitting (Wasserman and Roeder, 2009; Meinshausen et al., 2009) and exact post-selection (Lee et al., 2016; Tibshirani et al., 2016) methods provide valid inference for $\beta^{(\mathcal{M})}$. The latter is a particularly appealing choice for inference on $\beta^{(\mathcal{M})}$, as it provides non-asymptotic confidence intervals under minimal assumptions. However, as we discussed in Section 1, $\beta^{(\mathcal{M})}$ is, in general, different from $\beta^*_{\mathcal{M}}$. A set of sufficient conditions for $\beta^{(\mathcal{M})} = \beta^*_{\mathcal{M}}$ is the irrepresentable condition together with a beta-min condition. Unfortunately, these assumptions are unverifiable and may not hold in practice. Our theoretical analysis and empirical studies suggest that the naïve two-step approach described above facilitates inference for $\beta^*$ under less stringent assumptions and without any conditioning or sample splitting. However, this method is also asymptotic and relies on unverifiable assumptions. Debiased lasso tests (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2013, 2014a; Ning and Liu, 2017) provide asymptotically valid inference for entries of $\beta^*$ without requiring a beta-min or irrepresentable condition. However, they require more restrictive sparsity of $\beta^*$, as well as sparsity of the inverse covariance matrix of the covariates, $\Sigma^{-1}$, both of which are also unverifiable. These limitations underscore the importance of recent efforts to relax these sparsity assumptions (e.g., Zhu et al., 2018; Wang et al., 2020).

We close with some suggestions for future research. One reviewer brought up an interesting point: methods with folded-concave penalties (e.g., Fan and Li, 2001; Zhang, 2010) require milder conditions than the lasso to achieve variable selection consistency, i.e., $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}^*] \to 1$. Inspired by this observation, we wonder whether the proposals of Fan and Li (2001) and Zhang (2010) also require milder conditions to achieve $\Pr[\hat{\mathcal{A}}_\lambda = \mathcal{A}_\lambda] \to 1$. If so, then we could replace the lasso with a folded-concave penalty in the variable selection step, and improve the robustness of the naïve approaches. We believe this could be a fruitful area of future research. In addition, extending the proposed theory and methods to generalized linear models and M-estimators may also be promising areas for future research.

Supplementary Material

SupplementaryMaterial

ACKNOWLEDGEMENTS

We thank the Editor, Associate Editor, and four anonymous reviewers for their incredibly insightful comments, which led to substantial improvements of the manuscript. We thank the authors of Javanmard and Montanari (2014a) and Ning and Liu (2017) for providing code for their proposals. We are grateful to Joshua Loftus, Jonathan Taylor, Robert Tibshirani and Ryan Tibshirani for helpful responses to our inquiries.

Contributor Information

Sen Zhao, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA.

Daniela Witten, University of Washington, Health Sciences Building, Box 357232, Seattle, Washington 98195, USA.

Ali Shojaie, University of Washington, Health Sciences Building, Box 357232, Seattle, Washington 98195, USA.

REFERENCES

1. Belloni A and Chernozhukov V (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547.
2. Berk R, Brown L, Buja A, Zhang K, and Zhao L (2013). Valid post-selection inference. The Annals of Statistics, 41(2):802–837.
3. Bickel P, Ritov Y, and Tsybakov A (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.
4. Cox D (1975). A note on data-splitting for the evaluation of significance levels. Biometrika, 62(2):441–444.
5. Dezeure R, Bühlmann P, Meier L, and Meinshausen N (2015). High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–558.
6. Dicker LH (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284.
7. Dossal C (2012). A necessary and sufficient condition for exact sparse recovery by ℓ1 minimization. Comptes Rendus Mathematique, 350(1–2):117–120.
8. Erdős P and Rényi A (1959). On random graphs I. Publicationes Mathematicae, 6:290–297.
9. Fan J, Guo S, and Hao N (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B, 74(1):37–65.
10. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
11. Gautier L, Cope L, Bolstad BM, and Irizarry RA (2004). affy – analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3):307–315.
12. Gelman A and Hill J (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press, New York, NY, USA.
13. Gilbert EN (1959). Random graphs. Annals of Mathematical Statistics, 30(4):1141–1144.
14. Hahn J (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331.
15. Hastie T, Tibshirani R, and Friedman J (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag New York, New York, NY, USA.
16. Hirano K, Imbens GW, and Ridder G (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
17. Holm S (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.
18. Horvath S and Dong J (2008). Geometric interpretation of gene co-expression network analysis. PLOS Computational Biology, 4(8):e1000117.
19. Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Qi S, Chen Z, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, and Mischel PS (2006). Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proceedings of the National Academy of Sciences, 103(46):17402–17407.
20. Javanmard A and Montanari A (2013). Confidence intervals and hypothesis testing for high-dimensional statistical models. In Burges CJC, Bottou L, Welling M, Ghahramani Z, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 26, pages 1187–1195. Curran Associates, Inc.
21. Javanmard A and Montanari A (2014a). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(Oct):2869–2909.
22. Javanmard A and Montanari A (2014b). Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554.
23. Jin T, Wang Y, Li G, Du S, Yang H, Geng T, Hou P, and Gong Y (2015). Analysis of difference of association between polymorphisms in the XRCC5, RPA3 and RTEL1 genes and glioma, astrocytoma and glioblastoma. American Journal of Cancer Research, 5(7):2294–2300.
24. Kabaila P (1998). Valid confidence intervals in regression after variable selection. Econometric Theory, 14(4):463–482.
25. Kabaila P (2009). The coverage properties of confidence regions after model selection. International Statistical Review, 77(3):405–414.
26. Kolaczyk ED (2009). Statistical Analysis of Network Data: Methods and Models. Springer Series in Statistics. Springer Science+Business Media, New York, NY, USA.
27. Lee JD, Sun DL, Sun Y, and Taylor JE (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927.
28. Leeb H and Pötscher BM (2003). The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Econometric Theory, 19(1):100–142.
29. Leeb H and Pötscher BM (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21(1):21–59.
30. Leeb H and Pötscher BM (2006a). Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5):2554–2591.
31. Leeb H and Pötscher BM (2006b). Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results. Econometric Theory, 22(1):69–97.
32. Leeb H and Pötscher BM (2008). Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory, 24(2):338–376.
33. Leeb H, Pötscher BM, and Ewald K (2015). On various confidence intervals post-model-selection. Statistical Science, 30(2):216–227.
34. Meinshausen N and Bühlmann P (2006). High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
35. Meinshausen N, Meier L, and Bühlmann P (2009). p-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681.
36. Ning Y and Liu H (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195.
37. Pötscher BM (1991). Effects of model selection on inference. Econometric Theory, 7(2):163–185.
38. Reid S, Tibshirani R, and Friedman J (2016). A study of error variance estimation in Lasso regression. Statistica Sinica, 26(1):35–67.
39. Rosset S, Zhu J, and Hastie T (2004). Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973.
40. Sun T and Zhang C-H (2012). Scaled sparse linear regression. Biometrika, 99(4):879–898.
41. Taylor J and Tibshirani RJ (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634.
42. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.
43. Tibshirani RJ (2013). The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490.
44. Tibshirani RJ, Taylor J, Lockhart R, and Tibshirani R (2016). Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620.
45. Tropp J (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051.
46. van de Geer S (2017). Some exercises with the lasso and its compatibility constant. arXiv preprint arXiv:1701.03326.
47. van de Geer S and Bühlmann P (2009). On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392.
48. van de Geer S, Bühlmann P, Ritov Y, and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
49. Vardabasso C, Hasson D, Ratnakumar K, Chung C-Y, Duarte LF, and Bernstein E (2014). Histone variants: emerging players in cancer biology. Cellular and Molecular Life Sciences, 71(3):379–404.
50. Voorman A, Shojaie A, and Witten DM (2014). Inference in high dimensions with the penalized score test. arXiv preprint arXiv:1401.2678.
51. Wainwright MJ (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202.
52. Wang J, He X, and Xu G (2020). Debiased inference on treatment effect in a high-dimensional model. Journal of the American Statistical Association, 115(529):442–454.
53. Wasserman L and Roeder K (2009). High-dimensional variable selection. The Annals of Statistics, 37(5A):2178–2201.
54. Weisberg S (2013). Applied Linear Regression. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, NJ, USA.
55. You H, Lin H, and Zhang Z (2015). CKS2 in human cancers: Clinical roles and current perspectives (review). Molecular and Clinical Oncology, 3(3):459–463.
56. Zhang B and Horvath S (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(Article 17).
57. Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
58. Zhang C-H and Huang J (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594.
59. Zhang C-H and Zhang SS (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B, 76(1):217–242.
60. Zhao P and Yu B (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563.
61. Zhao S, Ottinger S, Peck S, Mac Donald C, and Shojaie A (2019). Network differential connectivity analysis. arXiv preprint arXiv:1909.13464.
62. Zhao S and Shojaie A (2016). A significance test for graph-constrained estimation. Biometrics, 72(2):484–493.
63. Zhu Y and Bradic J (2018). Significance testing in non-sparse high-dimensional linear models. Electronic Journal of Statistics, 12(2):3312–3364.
