Published in final edited form as: Technometrics. 2018 Oct 31;61(2):154–164. doi: 10.1080/00401706.2018.1513380

Assessing Tuning Parameter Selection Variability in Penalized Regression

Wenhao Hu a, Eric B Laber a, Clay Barker b, Leonard A Stefanski a
PMCID: PMC6750234  NIHMSID: NIHMS987505  PMID: 31534281

Abstract

Penalized regression methods that perform simultaneous model selection and estimation are ubiquitous in statistical modeling. The use of such methods is often unavoidable as manual inspection of all possible models quickly becomes intractable when there are more than a handful of predictors. However, automated methods usually fail to incorporate domain-knowledge, exploratory analyses, or other factors that might guide a more interactive model-building approach. A hybrid approach is to use penalized regression to identify a set of candidate models and then to use interactive model-building to examine this candidate set more closely. To identify a set of candidate models, we derive point and interval estimators of the probability that each model along a solution path will minimize a given model selection criterion, for example, Akaike information criterion, Bayesian information criterion (AIC, BIC), etc., conditional on the observed solution path. Then models with a high probability of selection are considered for further examination. Thus, the proposed methodology attempts to strike a balance between algorithmic modeling approaches that are computationally efficient but fail to incorporate expert knowledge, and interactive modeling approaches that are labor intensive but informed by experience, intuition, and domain knowledge. Supplementary materials for this article are available online.

Keywords: Conditional distribution, Lasso, Prediction sets

1. Introduction

Penalized estimation is a popular means of regression model fitting that is quickly becoming a standard tool among quantitative researchers working across nearly all areas of science. Examples include the Lasso (Tibshirani 1996), SCAD (Fan and Li 2001), Elastic Net (Zou and Hastie 2005), and the adaptive Lasso (Zou 2006). One appealing feature of these methods is that they perform simultaneous model selection and estimation, thereby automating model-building at least partially. This is especially beneficial in settings where the number of predictors is large, precluding manual inspection of all possible models. However, a consequence is that the analyst becomes increasingly dependent on an estimation algorithm that has neither the subject-matter knowledge nor the intuition that might guide a less automated and more interactive model-building process (Henderson and Velleman 1981; Cox 2001). A hybrid approach is to use penalized estimation to construct a small subset of models, for example, the sequence of models occurring on a solution path, and then to apply interactive model-building techniques to choose a model from among these. We develop and advocate such a hybrid approach wherein a set of candidate models are identified using a solution path, and then models along this path are prioritized using their conditional probability of selection according to one or more tuning parameter selection methods. We envision this approach as being useful in at least two ways: (i) it facilitates interactive, expert-knowledge-driven exploration of high-quality candidate models even when the initial pool of models is large; and (ii) it provides valid conditional prediction sets for a data-driven tuning parameter given the observed design matrix and solution path, which is applicable for a large class of tuning parameter selection methods.

There is a vast literature on tuning parameter selection methods. Classical methods include Mallow’s Cp (Mallows 1973), Akaike information criterion (AIC; Akaike 1974), Bayesian information criterion (BIC; Schwarz 1978), cross-validation, and generalized cross-validation (Golub, Heath, and Wahba 1979). More recent work on tuning parameter selection, driven by interest in high-dimensional data, includes new information-theoretic selection methods (Chen and Chen 2008; Wang, Li, and Leng 2009; Zhang, Li, and Tsai 2010; Wang and Zhu 2011; Kim, Kwon, and Choi 2012; Fan and Tang 2013; Hui, Warton, and Foster 2015) as well as resampling-based approaches (Hall, Lee, and Park 2009; Meinshausen and Bühlmann 2010; Feng and Yu 2013; Sun, Wang, and Fang 2013; Shah and Samworth 2013). The foregoing methods select a single tuning parameter and hence a single fitted model. Our goal is to quantify the stability of these methods by constructing conditional prediction sets for data-driven tuning parameters and to use these prediction sets to prioritize models for further, expert-guided exploration. Given one or more tuning parameter selection methods, we identify all models with sufficiently large conditional probability of being selected given the design matrix and observed solution path.

In Section 2, we review penalized linear regression. In Section 3, we derive exact and asymptotic estimators of the sampling distribution of a data-driven tuning parameter. We examine the performance of the proposed methods through simulation studies in Section 4. In Section 5, we illustrate the proposed methods using two data examples. A concluding discussion is given in Section 6. Technical details are relegated to the supplementary materials.

2. Penalized Linear Regression

We assume that the data are generated according to the linear model $Y_i = X_i^\top \beta_0 + \epsilon_i$, for i = 1, ..., n, where $\epsilon_1, \ldots, \epsilon_n$ are independent, identically distributed errors with expectation zero, $\beta_0 = (\beta_{01}, \ldots, \beta_{0p})^\top$, and $X_1, \ldots, X_n$ are predictors that can be regarded as either fixed or random. Let $Y = (Y_1, Y_2, \ldots, Y_n)^\top$ be the vector of responses and $X = (X_1, X_2, \ldots, X_n)^\top$ the design matrix with the first column equal to $1_{n\times 1}$. Let $\mathbb{P}_n$ denote the empirical distribution. We consider penalized least-squares estimators

\[
\hat{\beta}(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \tfrac{1}{2}\| Y - X\beta \|^2 + \lambda \sum_{j=2}^{p} f_j(\beta_j; \mathbb{P}_n) \Big\},
\]

where $f_j(\cdot)$, j = 2, ..., p, are penalty functions. For example, $f_j(\beta_j; \mathbb{P}_n) = |\beta_j|$ corresponds to the Lasso, and $f_j(\beta_j; \mathbb{P}_n) = |\beta_j| / |\hat{\beta}_{\mathrm{ols},j}|^{\gamma}$ corresponds to the adaptive Lasso, where $\hat{\beta}_{\mathrm{ols},j}$ is the ordinary least-squares estimator and γ > 0 is a constant.

For any Λ ⊆ [0, ∞) define the solution path along Λ as $\hat{S}(\Lambda) = \{\hat{\beta}(\lambda) : \lambda \in \Lambda\}$; we write $\hat{S}$ to denote $\hat{S}([0, \infty))$. While the solution path along Λ may contain a continuum of coefficient vectors, it is commonly viewed as containing a finite set of unique models corresponding to each unique combination of nonzero coefficients in $\hat{S}(\Lambda)$, that is, the set of models $\mathcal{M}\{\hat{S}(\Lambda)\} = \{M \in \{0,1\}^p : M = \mathbb{1}_{\hat{\beta}(\lambda) \neq 0} \text{ for some } \lambda \in \Lambda\}$. The number of models in $\mathcal{M}\{\hat{S}\}$, for example $O_p\{\min(n, p)\}$, is typically much smaller than the set of all $2^p$ possible models. Thus, the models along the solution path are a natural and computationally manageable subset of models for further investigation. Standard practice is to choose a single value of the tuning parameter, say $\hat{\lambda}$, that optimizes some prespecified criterion and subsequently a single model $\mathcal{M}\{\hat{S}(\hat{\lambda})\}$. However, the selected tuning parameter is a random variable and there may be multiple models along the solution path where the support of the selected tuning parameter is large, for example, $\mathcal{M}\{\hat{S}(L_\tau)\}$, where $L_\tau$ is a τ upper-level set of the conditional distribution of $\hat{\lambda}$ given $\hat{S}$ and X. If these models can be identified from the observed data, then they can be reported as potential candidate models, or a single model can be chosen from among them using expert judgment and other factors not captured in the estimation algorithm. Also, unlikely models can be ruled out. To formalize this procedure, we consider selection methods within the framework of the generalized information criterion.

Define the generalized information criterion as

\[
\mathrm{GIC}_{\lambda} = \log(\hat{\sigma}^2_{\lambda}) + w_n \widehat{df}_{\lambda}, \qquad (1)
\]

where $\hat{\sigma}^2_\lambda = n^{-1}\sum_{i=1}^n \{Y_i - X_i^\top \hat{\beta}(\lambda)\}^2$, $\widehat{df}_\lambda = \sum_{j=1}^p \mathbb{1}_{|\hat{\beta}_j(\lambda)| > 0}$, and $w_n$ is a sequence of positive constants, with $w_n = \log(n)/n$ and $w_n = 2/n$ yielding BIC and AIC, respectively. We consider data-driven tuning parameters of the form $\hat{\lambda}_{\mathrm{GIC}} = \arg\min_\lambda \{\log(\hat{\sigma}^2_\lambda) + w_n \widehat{df}_\lambda\}$. We focus primarily on the setting where n > p, as the GIC is not well-defined if p ≥ n. However, we provide an illustrative example in Section 5 where p > n, wherein our method is applied after an initial screening step; this two-stage procedure is in line with our vision for using automated methods to identify a small set of candidate models for further consideration. We also present extensions of key distributional approximations to the setting where p diverges with n in the Appendix.
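As a concrete illustration of Equation (1), the following is a minimal sketch of computing the GIC along a Lasso solution path. It is not the authors' implementation: it assumes scikit-learn's lasso_path and numpy are available, that X and y have been centered so the unpenalized intercept can be dropped, and that sklearn's alpha plays the role of λ up to the 1/(2n) scaling of its objective; the function name gic_along_path is ours.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def gic_along_path(X, y, w_n):
    """Return path tuning parameters, coefficients, and GIC values (Equation (1))."""
    alphas, coefs, _ = lasso_path(X, y)        # coefs has shape (p, n_alphas)
    gic = np.empty(len(alphas))
    for k in range(len(alphas)):
        beta = coefs[:, k]
        resid = y - X @ beta
        sigma2_lam = np.mean(resid ** 2)       # sigma_hat_lambda^2
        df_lam = int(np.sum(beta != 0))        # df_hat_lambda
        gic[k] = np.log(sigma2_lam) + w_n * df_lam
    return alphas, coefs, gic

# BIC corresponds to w_n = log(n)/n and AIC to w_n = 2/n, for example:
# alphas, coefs, bic = gic_along_path(X, y, np.log(len(y)) / len(y))
# lam_bic = alphas[int(np.argmin(bic))]
```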

3. Estimating the Conditional Distribution of $\hat{\lambda}_{\mathrm{GIC}}$

In this section, we characterize and derive estimators of the conditional distribution of $\hat{\lambda}_{\mathrm{GIC}}$ given $\hat{S}$ and X. We first show that conditioning on $\hat{S}$ and X is equivalent to conditioning on $X^\top Y$ and X. We then show that $\hat{\lambda}_{\mathrm{GIC}}$ is a nondecreasing function of the error sum of squares of the full model scaled by n, $\hat{\sigma}^2_0 = n^{-1}\sum_{i=1}^n (Y_i - X_i^\top \hat{\beta}_{\mathrm{ols}})^2$. Therefore, the conditional distribution of $\hat{\lambda}_{\mathrm{GIC}}$ is completely determined by the conditional distribution of $\hat{\sigma}^2_0$.

3.1. Conditioning on the Solution Path

We assume that $f_j(\beta_j; \mathbb{P}_n)$, j = 2, ..., p, depends on the observed data only through $X^\top Y$ and $X^\top X$; this assumption is natural as $X^\top X$ and $X^\top Y$ are sufficient statistics for the conditional mean of Y given X under the assumed linear model. Under this assumption, $\hat{\beta}(\lambda) = \arg\min_\beta \{\tfrac{1}{2}\beta^\top X^\top X \beta - Y^\top X \beta + \lambda \sum_{j=2}^p f_j(\beta_j; \mathbb{P}_n)\}$, from which it can be seen that the solution path is completely determined by $X^\top X$ and $X^\top Y$. On the other hand, given $\hat{S}$ and X, we can recover $X^\top Y$ using $X^\top X \hat{\beta}(0) = X^\top X \{(X^\top X)^{-1} X^\top Y\} = X^\top Y$. Therefore, conditioning on the solution path and design matrix is equivalent to conditioning on $X^\top Y$ and X (see Lemma C.1 in the Appendix).

In the case of the adaptive Lasso, we assume that X has full column rank so that $f_j(\beta_j; \mathbb{P}_n)$, which depends on $\hat{\beta}_{\mathrm{ols},j}$, is well-defined. It can be seen that if X has full column rank, then the entire solution path is determined by $X^\top X$ and $\hat{\beta}_{\mathrm{ols}}$. Conditioning on the solution path is also practically relevant because it is consistent with the common practice wherein an analyst is presented with a full solution path and then proceeds to identify a model as a point along this path.
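The identity $X^\top X\hat{\beta}(0) = X^\top Y$ used above is easy to verify numerically. The snippet below is a small illustration on simulated data; it is not from the paper, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = 1
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)     # beta_hat(0), the unpenalized end of the path
print(np.allclose(X.T @ X @ beta_ols, X.T @ y))  # True: X'Y is recovered from X and the path
```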

3.2. Exact Distribution of $\hat{\lambda}_{\mathrm{GIC}} \mid (\hat{S}, X)$

We assume that the models along the solution path are determined by the sequence of tuning parameters $\hat{\lambda}_{(1)} < \hat{\lambda}_{(2)} < \cdots < \hat{\lambda}_{(\hat{m})}$, so that $\hat{m}$ is the total number of tuning parameters to be considered. The following lemma characterizes the conditional distribution of $\hat{\lambda}_{\mathrm{GIC}}$.

Lemma 1.

The selected tuning parameter, $\hat{\lambda}_{\mathrm{GIC}}$, is completely determined by $(\hat{S}, X, \hat{\sigma}^2_0)$. Furthermore, assume that $\|Y - X\hat{\beta}(\lambda)\|^2$ is a nondecreasing function of λ and write $\hat{\lambda}_{\mathrm{GIC}} = \lambda(\hat{S}, X, \hat{\sigma}^2_0)$; then for each fixed $\hat{S} = s$ and X = x, the map $\sigma^2 \mapsto \lambda(s, x, \sigma^2)$ is nondecreasing.

The assumption that $\|Y - X\hat{\beta}(\lambda)\|^2$ is a nondecreasing function of λ holds under mild conditions, for example, if $\sum_{j=2}^p f_j\{\hat{\beta}_j(\lambda); \mathbb{P}_n\}$ is a decreasing function of λ and the original penalized problem can be recast as a constrained minimization problem of the form: minimize $\|Y - X\beta\|^2$ subject to the constraint $\sum_{j=2}^p f_j(\beta_j; \mathbb{P}_n) \le \sum_{j=2}^p f_j\{\hat{\beta}_j(\lambda); \mathbb{P}_n\}$. It is well known that the Lasso satisfies this property. If the errors are normally distributed, then $n\hat{\sigma}^2_0/\sigma^2_0$ is independent of $(\hat{S}, X)$ and follows a chi-square distribution with n − p degrees of freedom. Therefore, the preceding lemma shows that, under normal errors, the conditional distribution of $\hat{\lambda}_{\mathrm{GIC}}$ given $(\hat{S}, X)$ is that of a nondecreasing function of a chi-square random variable. The remaining results stated in this section do not require the assumption that $\|Y - X\hat{\beta}(\lambda)\|^2$ is nondecreasing in λ; rather, the results are stated in terms of a finite but arbitrary sequence of tuning parameter values.

Define $\hat{D}_\lambda = \{\hat{\beta}_{\mathrm{ols}} - \hat{\beta}(\lambda)\}^\top X^\top X \{\hat{\beta}_{\mathrm{ols}} - \hat{\beta}(\lambda)\}$. For k = 1, ..., $\hat{m}$, define $\hat{A}_k = \{i : \widehat{df}_{\hat{\lambda}_{(i)}} < \widehat{df}_{\hat{\lambda}_{(k)}}\}$, $\hat{B}_k = \{i : \widehat{df}_{\hat{\lambda}_{(i)}} > \widehat{df}_{\hat{\lambda}_{(k)}}\}$, $\hat{C}_k = \{i : i \neq k, \text{ and } \widehat{df}_{\hat{\lambda}_{(i)}} = \widehat{df}_{\hat{\lambda}_{(k)}}\}$, and

\[
\hat{\ell}_{i,k} = \frac{\hat{D}_{\hat{\lambda}_{(k)}} \exp\{w_n(\widehat{df}_{\hat{\lambda}_{(k)}} - \widehat{df}_{\hat{\lambda}_{(i)}})\} - \hat{D}_{\hat{\lambda}_{(i)}}}{1 - \exp\{w_n(\widehat{df}_{\hat{\lambda}_{(k)}} - \widehat{df}_{\hat{\lambda}_{(i)}})\}}, \quad \text{for } 1 \le i, k \le \hat{m},
\]

where wn is from Equation (1). The quantities in the foregoing definitions are all measurable with respect to X and S^ and thus, for probability statements conditional on X and S^, they are regarded as constants.

The following proposition gives the exact conditional distribution of λ^GIC given S^ and X.

Proposition 1.

Define $\hat{I}_k = \mathbb{1}(\hat{D}_{\hat{\lambda}_{(k)}} < \hat{D}_{\hat{\lambda}_{(i)}} \text{ for all } i \in \hat{C}_k)$ with the convention that $\hat{I}_k = 1$ if $\hat{C}_k$ is empty, and $p_k = P(\max_{i \in \hat{B}_k} \hat{\ell}_{i,k} \le n\hat{\sigma}^2_0 \le \min_{i \in \hat{A}_k} \hat{\ell}_{i,k} \mid \hat{S}, X)$. Then,

\[
P(\hat{\lambda}_{\mathrm{GIC}} = \hat{\lambda}_{(k)} \mid \hat{S}, X) = \min(p_k, \hat{I}_k).
\]

Provided that the conditional distribution of $\hat{\sigma}^2_0$ given $(\hat{S}, X)$ is known or can be consistently estimated, the preceding proposition can be used to construct conditional prediction sets for $\hat{\lambda}_{\mathrm{GIC}}$. A (1 − α) × 100% conditional prediction set is $\{\hat{\lambda}_{(i)} : i \in \Gamma\}$, where $\sum_{i \in \Gamma} P(\hat{\lambda}_{\mathrm{GIC}} = \hat{\lambda}_{(i)} \mid \hat{S}, X) \ge 1 - \alpha$. Alternatively, as discussed previously, one can construct the τ upper-level set $L_\tau = \{\hat{\lambda}_{(i)} : P(\hat{\lambda}_{\mathrm{GIC}} = \hat{\lambda}_{(i)} \mid \hat{S}, X) > \tau\}$, for any τ ∈ (0, 1).

Define $\hat{a}_k = \min_{i \in \hat{A}_k} \hat{\ell}_{i,k}$ and $\hat{b}_k = \max_{i \in \hat{B}_k} \hat{\ell}_{i,k}$. If the errors are normally distributed, then

\[
p_k = F_{\chi^2_{n-p}}(\hat{a}_k / \sigma^2_0) - F_{\chi^2_{n-p}}(\hat{b}_k / \sigma^2_0), \quad \text{for } \hat{a}_k \ge \hat{b}_k. \qquad (2)
\]

Plugging $\hat{\sigma}^2_0$ into this expression in place of $\sigma^2_0$ yields an estimator $\hat{p}_k$ of $p_k$.

Define $g_k(t) = F_{\chi^2_{n-p}}(\hat{a}_k / t) - F_{\chi^2_{n-p}}(\hat{b}_k / t)$. Then a (1 − α) × 100% projection confidence interval (Berger and Boos 1994) for $p_k$ (Equation (2)) is

\[
\Big( \inf_{t \in C} g_k(t),\ \sup_{t \in C} g_k(t) \Big), \qquad (3)
\]

where $C = \big(n\hat{\sigma}^2_0 / \chi^2_{\alpha/2, n-p},\ n\hat{\sigma}^2_0 / \chi^2_{1-\alpha/2, n-p}\big)$ is a (1 − α) × 100% confidence interval for $\sigma^2_0$. Thus, an estimator of $L_\tau$ is

\[
\hat{L}_\tau = \{\hat{\lambda}_{(k)} : \sup_{t \in C} g_k(t) > \tau\}. \qquad (4)
\]

Remark 1. The assumption that X has full rank is not necessary for Proposition 1. Note that the conclusions depend only on the quantities $X\hat{\beta}_{\mathrm{ols}}$, $X\hat{\beta}(\lambda)$, and $\hat{\sigma}^2_0$, which are computable even when X is not of full rank.
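The following sketch illustrates how Proposition 1 and Equations (2)–(4) could be computed under normal errors. It is a hedged illustration rather than the authors' R package: the function names selection_probs and projection_ci are ours, it assumes numpy/scipy, a full-column-rank design with n > p, and the inf/sup over C in Equation (3) is approximated by a simple grid search.

```python
import numpy as np
from scipy.stats import chi2

def selection_probs(X, y, betas, w_n):
    """Plug-in estimates of P(lambda_hat_GIC = lambda_(k) | path, X) under normal errors.

    betas: list of coefficient vectors, one per distinct path point lambda_(1) < ... < lambda_(m).
    """
    n, p = X.shape
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma02_hat = np.mean((y - X @ beta_ols) ** 2)        # sigma_hat_0^2
    G = X.T @ X
    D = np.array([(beta_ols - b) @ G @ (beta_ols - b) for b in betas])
    df = np.array([int(np.sum(b != 0)) for b in betas])
    m = len(betas)
    probs = np.empty(m)
    for k in range(m):
        with np.errstate(divide="ignore", invalid="ignore"):
            e = np.exp(w_n * (df[k] - df))
            ell = (D[k] * e - D) / (1.0 - e)              # ell_{i,k}; entries with equal df unused
        A = df < df[k]
        B = df > df[k]
        C = (df == df[k]) & (np.arange(m) != k)
        a_k = ell[A].min() if A.any() else np.inf         # a_hat_k
        b_k = ell[B].max() if B.any() else -np.inf        # b_hat_k
        I_k = 1.0 if (not C.any()) or np.all(D[k] < D[C]) else 0.0
        # Equation (2) with sigma_0^2 replaced by its estimate; clamp at 0 when a_k < b_k
        p_k = chi2.cdf(a_k / sigma02_hat, n - p) - chi2.cdf(b_k / sigma02_hat, n - p)
        probs[k] = min(max(p_k, 0.0), I_k)                # Proposition 1: min(p_k, I_k)
    return probs

def projection_ci(a_k, b_k, sigma02_hat, n, p, alpha=0.10, grid=200):
    """Grid-search approximation to the projection interval (3) over C."""
    lo = n * sigma02_hat / chi2.ppf(1 - alpha / 2, n - p)
    hi = n * sigma02_hat / chi2.ppf(alpha / 2, n - p)
    t = np.linspace(lo, hi, grid)
    g = chi2.cdf(a_k / t, n - p) - chi2.cdf(b_k / t, n - p)
    return float(g.min()), float(g.max())
```

The upper endpoints returned by projection_ci can then be thresholded at τ to form the estimator in Equation (4).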

3.3. Limiting Conditional Distribution of $\hat{\sigma}^2_0$

As discussed above, if the errors are assumed to be normally distributed, then exact distribution theory for $\hat{\lambda}_{\mathrm{GIC}}$ is possible using a transformed chi-square random variable. Here, we consider asymptotic approximations that apply more generally.

Denote the third and fourth moments of ϵ by $\mu_{3,\epsilon}$ and $\mu_{4,\epsilon}$, respectively. Define

\[
\Sigma = \begin{pmatrix} \sigma^2_0 C^{-1} & \mu_x \mu_{3,\epsilon} \\ \mu_x^\top \mu_{3,\epsilon} & \mu_{4,\epsilon} - \sigma^4_0 \end{pmatrix},
\]

where $C = \lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i X_i^\top$. Write $\Phi_{p+1}(t)$ to denote the cumulative distribution function of a standard (p + 1)-dimensional multivariate normal distribution evaluated at t. For u, v ∈ ℝ^{p+1}, write u ≤ v to mean component-wise inequality. The following are standard results from ordinary linear regression under common regularity conditions summarized in Section C of the Appendix (see the proof of Proposition 2).

Proposition 2.

The asymptotic joint distribution of $\hat{\beta}_{\mathrm{ols}} - \beta_0$ and $\hat{\sigma}^2_0 - \sigma^2_0$ is multivariate normal with mean zero and covariance Σ, that is,

\[
\sup_{t \in \mathbb{R}^{p+1}} \left| P\left\{ \sqrt{n}\, \Sigma^{-1/2} \begin{pmatrix} \hat{\beta}_{\mathrm{ols}} - \beta_0 \\ \hat{\sigma}^2_0 - \sigma^2_0 \end{pmatrix} \le t \right\} - \Phi_{p+1}(t) \right| \to 0.
\]

Because we assume that X has full column rank, conditioning on $(\hat{S}, X)$ is equivalent to conditioning on $(\hat{\beta}_{\mathrm{ols}}, X)$ (in the sense that they generate the same σ-algebra). Therefore, to approximate the conditional distribution of $\hat{\sigma}^2_0$ given $(\hat{S}, X)$, we construct an estimator of Σ, say $\hat{\Sigma}$, and then use the above proposition to form a plug-in estimator of the distribution of $\hat{\sigma}^2_0$ given $(\hat{\beta}_{\mathrm{ols}}, X)$. Define

\[
\hat{e}_i = Y_i - X_i^\top \hat{\beta}_{\mathrm{ols}}, \quad i = 1, 2, \ldots, n, \qquad (5)
\]

and subsequently $\hat{\sigma}^2_0 = n^{-1}\sum_{i=1}^n \hat{e}_i^2$, $\hat{\mu}_{3,\epsilon} = n^{-1}\sum_{i=1}^n \hat{e}_i^3$, $\hat{\mu}_{4,\epsilon} = n^{-1}\sum_{i=1}^n \hat{e}_i^4$, $\hat{\mu}_x = n^{-1}\sum_{i=1}^n X_i$, and $\hat{C} = n^{-1}\sum_{i=1}^n X_i X_i^\top$. The estimated conditional distribution of $\hat{\sigma}^2_0$ is

\[
N\left[ \hat{\sigma}^2_0,\ \frac{1}{n}\left\{ (\hat{\mu}_{4,\epsilon} - \hat{\sigma}^4_0) - \hat{\mu}_{3,\epsilon}^2\, \hat{\sigma}_0^{-2}\, \hat{\mu}_x^\top \hat{C} \hat{\mu}_x \right\} \right]. \qquad (6)
\]

This approximation, coupled with Proposition 1, can be used to approximate the conditional distribution of λ^GIC when a chi-squared approximation is not feasible.

Henceforth, we assume that the errors are symmetric about zero, in which case the third moment of $\epsilon_i$, $\mu_{3,\epsilon}$, is zero, which implies that $\hat{\sigma}^2_0$ is asymptotically independent of $\hat{\beta}_{\mathrm{ols}}$. Therefore,

\[
p_k = P(\hat{b}_k \le n\hat{\sigma}^2_0 \le \hat{a}_k \mid \hat{S}, X) = \Phi\left\{ \frac{\sqrt{n}(\hat{a}_k/n - \sigma^2_0)}{\sqrt{\mu_{4,\epsilon} - \sigma^4_0}} \right\} - \Phi\left\{ \frac{\sqrt{n}(\hat{b}_k/n - \sigma^2_0)}{\sqrt{\mu_{4,\epsilon} - \sigma^4_0}} \right\} + o_p(1), \quad \text{for } \hat{a}_k \ge \hat{b}_k, \qquad (7)
\]

where $\mu_{4,\epsilon}$ is the fourth moment of $\epsilon_i$. Define

\[
h_k(t_1, t_2) = \Phi\left\{ \frac{\sqrt{n}(\hat{a}_k/n - t_1)}{\sqrt{t_2}} \right\} - \Phi\left\{ \frac{\sqrt{n}(\hat{b}_k/n - t_1)}{\sqrt{t_2}} \right\}. \qquad (8)
\]

Suppose that $E_y$ is a (1 − α) × 100% asymptotic confidence region for $\mu_{4,\epsilon} - \sigma^4_0$ and $\sigma^2_0$; then

\[
\Big[ \inf_{(t_1, t_2) \in E_y} h_k(t_1, t_2),\ \sup_{(t_1, t_2) \in E_y} h_k(t_1, t_2) \Big] \qquad (9)
\]

is an approximate (1 − α) × 100% projection confidence interval for $p_k$ (see Proposition C.1 in Section C of the Appendix).

We construct the confidence set $E_y$ using a Wald confidence region,

\[
E_y = \left\{ (t_1, t_2) : \begin{pmatrix} t_1 - \hat{\sigma}^2_0 \\ t_2 - \hat{\mu}_{4,\epsilon} + \hat{\sigma}^4_0 \end{pmatrix}^{\!\top} \hat{V}^{-1} \begin{pmatrix} t_1 - \hat{\sigma}^2_0 \\ t_2 - \hat{\mu}_{4,\epsilon} + \hat{\sigma}^4_0 \end{pmatrix} \le \chi^2_{1-\alpha, 2} \right\},
\]

where $\hat{V}$ is the estimated covariance matrix of $(\hat{\sigma}^2_0, \hat{\mu}_{4,\epsilon} - \hat{\sigma}^4_0)$. The optimization problems in Equation (9) are then solved using an augmented Lagrangian method (Bertsekas 2014). An estimator of $L_\tau$ is

\[
\hat{L}_\tau = \{\hat{\lambda}_{(k)} : \sup_{(t_1, t_2) \in E_y} h_k(t_1, t_2) > \tau\}. \qquad (10)
\]

Proposition 2 is stated in terms of fixed p and diverging n. We show in the Appendix that the approximation in Equation (7) remains valid in the setting $p = o(n^{1/2})$, as well as in the setting where $p = O\{\exp(n^c)\}$ provided that an appropriate screening step is applied first. We illustrate this screening approach with a high-dimensional example in our simulation experiments.
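As an illustration of the plug-in use of Equation (7), the following sketch estimates $p_k$ from the OLS residuals in Equation (5). It assumes numpy/scipy; the function name pk_normal_approx and its arguments are ours, and the projection interval over the Wald region $E_y$ (Equations (9)–(10)) is not implemented here.

```python
import numpy as np
from scipy.stats import norm

def pk_normal_approx(a_k, b_k, X, y):
    """Plug-in normal approximation (7) for p_k, using moment estimates from Equation (5)."""
    n, p = X.shape
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_ols                           # residuals e_hat_i (Equation (5))
    sigma02_hat = np.mean(e ** 2)
    mu4_hat = np.mean(e ** 4)
    scale = np.sqrt(mu4_hat - sigma02_hat ** 2)    # estimate of sqrt(mu_{4,eps} - sigma_0^4)
    z_a = np.sqrt(n) * (a_k / n - sigma02_hat) / scale
    z_b = np.sqrt(n) * (b_k / n - sigma02_hat) / scale
    return norm.cdf(z_a) - norm.cdf(z_b)
```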

3.4. Bootstrap Approximation to the Distribution of $\hat{\lambda}_{\mathrm{GIC}} \mid (\hat{S}, X)$

In small samples, it may be preferable to estimate the conditional distribution of $\hat{\sigma}^2_0$ using the bootstrap. Let $\gamma^{(b)} = (\gamma_1^{(b)}, \ldots, \gamma_n^{(b)})^\top$ be a sample drawn with replacement from $\{\hat{e}_1, \ldots, \hat{e}_n\}$. Define $Y^{(b)} = X\hat{\beta}_{\mathrm{ols}} + (I - P_x)\gamma^{(b)}$, where $P_x = X(X^\top X)^{-1}X^\top$. This bootstrap method differs from the usual residual bootstrap in ordinary linear regression because our goal is to estimate the conditional distribution of $\hat{\sigma}^2_0$. We accomplish this by multiplying the error vector by $(I - P_x)$, which ensures that $\hat{\beta}_{\mathrm{ols}}^{(b)} = (X^\top X)^{-1}X^\top Y^{(b)} = \hat{\beta}_{\mathrm{ols}}$, so that $Y^{(b)}$ produces the same solution path as the original sample Y. The conditional distribution of the tuning parameter is estimated by generating b = 1, ..., B bootstrap samples and calculating the corresponding tuning parameter for each bootstrap sample. See Proposition C.2 in Section C of the Appendix for a statement of the asymptotic equivalence between the proposed bootstrap method and the normal approximation given in Equation (6).
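The following is a minimal sketch of this bootstrap under our reading of Section 3.4: residuals are resampled, projected by $(I - P_x)$ so that $\hat{\beta}_{\mathrm{ols}}$, and hence the path quantities $D_\lambda$ and $\widehat{df}_\lambda$, are unchanged, and the GIC minimizer along the fixed path is recorded for each replicate. Function and variable names are ours, not the authors' R package.

```python
import numpy as np

def bootstrap_lambda_dist(X, y, betas, w_n, B=2000, rng=None):
    """Empirical conditional distribution of lambda_hat_GIC over a fixed solution path."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_ols                                   # residuals e_hat
    I_minus_P = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
    G = X.T @ X
    D = np.array([(beta_ols - b_) @ G @ (beta_ols - b_) for b_ in betas])
    df = np.array([int(np.sum(b_ != 0)) for b_ in betas])
    counts = np.zeros(len(betas))
    for _ in range(B):
        gamma = rng.choice(e, size=n, replace=True)        # resampled residuals
        s2_b = np.mean((I_minus_P @ gamma) ** 2)           # sigma_hat_0^2 for Y^(b)
        gic_b = np.log(s2_b + D / n) + w_n * df            # GIC along the fixed path
        counts[np.argmin(gic_b)] += 1
    return counts / B
```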

4. Simulation Studies

In this section, we investigate the finite-sample performance of the proposed methods using a series of simulation experiments. We focus on the Lasso tuned using BIC. Simulated datasets are generated from the model $Y_i = X_i^\top \beta_0 + \epsilon_i$, where $\epsilon_i$, i = 1, ..., n, are generated independently from a standard normal distribution and $X_i$, i = 1, ..., n, are generated independently from a multivariate normal distribution with mean zero and autoregressive covariance structure, $C_{j,k} = \rho^{|j-k|}$, with ρ = 0 or 0.5 and 1 ≤ j, k ≤ p, where p = 20 or 100. For the regression coefficients $\beta_0$, we consider the following four settings:

Model 1: β0 = c1×(1, 1, 1, 1, 0, 0, 0, 0, … , 0);

Model 2: β0 = c2×(1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, … , 0);

Model 3: β0 = c3×(3, 2, 1, 0, 0, 0, 0, … , 0);

Model 4: β0 = c4×(3, 2, 1, 0, 0, 0, 0, 0, 3, 2, 1, 0, … , 0);

where c1, … , c4 are constants chosen so that the population R2 of each model is 0.5 under the definition R2 = 1 − var(Y|X)/var(Y). For each combination of parameter settings, 10,000 datasets were generated; the bootstrap estimator was constructed using 5000 bootstrap replications.
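The following is a sketch of this data-generating design under our reading of the setup (Model 1 pattern shown): the constant c is chosen so that the population $R^2 = \beta_0^\top C \beta_0 / (\beta_0^\top C \beta_0 + 1)$ equals the target value. All names are illustrative.

```python
import numpy as np

def simulate(n=50, p=20, rho=0.5, r2=0.5, rng=None):
    """Generate one dataset with AR(rho) predictor covariance and population R^2 = r2."""
    rng = np.random.default_rng(rng)
    beta_shape = np.zeros(p)
    beta_shape[:4] = 1.0                                      # Model 1 pattern
    C = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    # R^2 = b'Cb / (b'Cb + 1)  =>  b'Cb = r2 / (1 - r2); solve for the scaling constant c
    c = np.sqrt(r2 / (1 - r2) / (beta_shape @ C @ beta_shape))
    beta0 = c * beta_shape
    X = rng.multivariate_normal(np.zeros(p), C, size=n)
    y = X @ beta0 + rng.normal(size=n)
    return X, y, beta0
```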

For estimating the τ upper-level set, $L_\tau = \{\lambda : P(\hat{\lambda}_{\mathrm{GIC}} = \lambda \mid \hat{S}, X) > \tau\}$, we consider:

  1. (AsympNor) the plug-in estimator based on the normal approximation to the distribution of pk;

  2. (Bootstrap) the estimator based on the bootstrap approximation to the sampling distribution of pk as described in Section 3.4;

  3. (UP1) the estimator based on a 90% projection confidence set as in Equation (4);

  4. (UP2) the estimator based on a 90% projection confidence set as in Equation (10);

  5. (Akaike) the estimator based on Akaike weights: $\hat{L}_\tau = \{\lambda_i : \exp(-0.5\, n\, \mathrm{GIC}_{\lambda_i}) / \sum_{i'=1}^{\hat{m}} \exp(-0.5\, n\, \mathrm{GIC}_{\lambda_{i'}}) > \tau\}$, with $w_n = 2/n$ (Burnham and Anderson 2003);

  6. (ApproxPost) the estimator based on the approximate posterior distribution: $\hat{L}_\tau = \{\lambda_i : \exp(-0.5\, n\, \mathrm{GIC}_{\lambda_i}) / \sum_{i'=1}^{\hat{m}} \exp(-0.5\, n\, \mathrm{GIC}_{\lambda_{i'}}) > \tau\}$, with $w_n = \log(n)/n$ (Burnham and Anderson 2003).

We define the performance of these estimators in terms of their true and false discovery rates. Provided that $L_\tau$ is nonempty, define the true discovery rate of an estimator $\hat{L}_\tau$ as

\[
\mathrm{TDR}(\hat{L}_\tau) = \mathbb{E}\left[ \frac{\#\{\hat{L}_\tau \cap L_\tau\}}{\# L_\tau} \right],
\]

where # denotes the number of elements in a set. Provided $\hat{L}_\tau$ is nonempty with probability one, define the false discovery rate of an estimator $\hat{L}_\tau$ as

\[
\mathrm{FDR}(\hat{L}_\tau) = 1 - \mathbb{E}\left[ \frac{\#\{\hat{L}_\tau \cap L_\tau\}}{\# \hat{L}_\tau} \right].
\]
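For concreteness, the per-dataset quantities inside these expectations can be computed as follows; these are hypothetical helper functions of ours, with L_hat and L_true passed as Python sets of tuning-parameter indices.

```python
def tdr_one(L_hat: set, L_true: set) -> float:
    """Fraction of the true level set recovered (the quantity averaged in TDR)."""
    return len(L_hat & L_true) / len(L_true) if L_true else float("nan")

def fdr_one(L_hat: set, L_true: set) -> float:
    """Fraction of the reported set that is spurious (the quantity averaged in FDR)."""
    return 1.0 - len(L_hat & L_true) / len(L_hat) if L_hat else 0.0
```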

Here, we present results for τ = 0.05; results for τ = 0.1 and τ = 0.2 are presented in the supplemental materials. The results for p = 20, n = 50 and p = 100, n = 200 are presented in Tables 1 and 2, respectively. AsympNor and Bootstrap perform similarly, with a TDR above 0.9 and an FDR below 0.10. As expected, methods based on the upper bound of the confidence interval achieve higher TDR but at the price of higher FDR. The methods based on Akaike weights and the approximate posterior perform worst in terms of recovering $L_\tau$. This poor performance is not surprising, as these methods were not designed for conditional inference.

Table 1.

Discovery rate for p = 20, n = 50.

| Model | ρ   | AsympNor (TDR/FDR) | Bootstrap (TDR/FDR) | UP1 (TDR/FDR) | UP2 (TDR/FDR) | Akaike (TDR/FDR) | ApproxPost (TDR/FDR) |
|-------|-----|--------------------|---------------------|---------------|---------------|------------------|----------------------|
| 1     | 0   | 0.90 / 0.10        | 0.90 / 0.10         | 1.00 / 0.32   | 0.99 / 0.36   | 0.54 / 0.83      | 0.90 / 0.58          |
| 1     | 0.5 | 0.92 / 0.09        | 0.92 / 0.09         | 1.00 / 0.29   | 0.99 / 0.34   | 0.61 / 0.83      | 0.95 / 0.58          |
| 2     | 0   | 0.88 / 0.11        | 0.88 / 0.10         | 0.99 / 0.30   | 0.99 / 0.34   | 0.46 / 0.83      | 0.82 / 0.59          |
| 2     | 0.5 | 0.89 / 0.10        | 0.90 / 0.10         | 1.00 / 0.35   | 0.99 / 0.36   | 0.55 / 0.82      | 0.89 / 0.59          |
| 3     | 0   | 0.92 / 0.09        | 0.92 / 0.09         | 1.00 / 0.29   | 0.99 / 0.34   | 0.57 / 0.84      | 0.94 / 0.58          |
| 3     | 0.5 | 0.93 / 0.08        | 0.93 / 0.08         | 1.00 / 0.27   | 0.99 / 0.32   | 0.64 / 0.83      | 0.96 / 0.57          |
| 4     | 0   | 0.89 / 0.11        | 0.90 / 0.10         | 0.99 / 0.32   | 0.99 / 0.36   | 0.50 / 0.84      | 0.88 / 0.59          |
| 4     | 0.5 | 0.91 / 0.10        | 0.91 / 0.09         | 0.99 / 0.33   | 0.99 / 0.35   | 0.58 / 0.83      | 0.92 / 0.58          |

Table 2.

Discovery rate for p = 100, n = 200.

| Model | ρ   | AsympNor (TDR/FDR) | Bootstrap (TDR/FDR) | UP1 (TDR/FDR) | UP2 (TDR/FDR) | Akaike (TDR/FDR) | ApproxPost (TDR/FDR) |
|-------|-----|--------------------|---------------------|---------------|---------------|------------------|----------------------|
| 1     | 0   | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.12   | 1.00 / 0.13   | 0.11 / 0.98      | 1.00 / 0.58          |
| 1     | 0.5 | 0.98 / 0.03        | 0.98 / 0.02         | 1.00 / 0.09   | 1.00 / 0.10   | 0.24 / 0.95      | 1.00 / 0.56          |
| 2     | 0   | 0.95 / 0.04        | 0.95 / 0.04         | 1.00 / 0.17   | 1.00 / 0.18   | 0.05 / 0.99      | 0.98 / 0.59          |
| 2     | 0.5 | 0.96 / 0.03        | 0.96 / 0.03         | 1.00 / 0.14   | 1.00 / 0.15   | 0.12 / 0.97      | 0.99 / 0.58          |
| 3     | 0   | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.12   | 1.00 / 0.13   | 0.16 / 0.97      | 1.00 / 0.57          |
| 3     | 0.5 | 0.98 / 0.02        | 0.98 / 0.02         | 1.00 / 0.09   | 1.00 / 0.10   | 0.28 / 0.94      | 1.00 / 0.56          |
| 4     | 0   | 0.96 / 0.04        | 0.96 / 0.04         | 1.00 / 0.14   | 1.00 / 0.16   | 0.07 / 0.98      | 0.99 / 0.59          |
| 4     | 0.5 | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.12   | 1.00 / 0.13   | 0.16 / 0.97      | 1.00 / 0.57          |

We also evaluated the coverage of the proposed confidence intervals based on the normality assumption as well as the asymptotic approximation. In calculating the coverage probabilities, we restricted calculations to the set $\{\lambda : 0.9999 > P(\hat{\lambda}_{\mathrm{GIC}} = \lambda \mid \hat{S}, X) > 0.0001\}$. Nominal coverage is set at 0.90. The results are presented in Table 3. The confidence intervals based on normality (Equation (3)) achieve nominal coverage in all cases. The confidence intervals based on the asymptotic approximation undercover slightly, though coverage approaches nominal levels as n increases.

Table 3.

Coverage probability. Results are based on 10,000 replicated datasets.

| Setting          | Model   | Approximate, ρ = 0 | Approximate, ρ = 0.5 | Normality, ρ = 0 | Normality, ρ = 0.5 |
|------------------|---------|--------------------|----------------------|------------------|--------------------|
| n = 50, p = 20   | Model 1 | 0.86               | 0.85                 | 0.92             | 0.91               |
|                  | Model 2 | 0.86               | 0.85                 | 0.91             | 0.91               |
|                  | Model 3 | 0.85               | 0.85                 | 0.91             | 0.91               |
|                  | Model 4 | 0.86               | 0.85                 | 0.92             | 0.91               |
| n = 200, p = 100 | Model 1 | 0.88               | 0.88                 | 0.91             | 0.90               |
|                  | Model 2 | 0.87               | 0.88                 | 0.91             | 0.91               |
|                  | Model 3 | 0.87               | 0.87                 | 0.91             | 0.90               |
|                  | Model 4 | 0.87               | 0.88                 | 0.91             | 0.90               |

5. Illustrative Data Examples

In this section, we apply the proposed methods to two datasets. The first dataset concerns the relationship between pollution, other urban-living factors, and age-adjusted mortality (McDonald and Schwing 1973; Luo, Stefanski, and Boos 2006); the second concerns the relationship between gene mutations and drug-resistance level (Rhee et al. 2006). We demonstrate that reporting a single model may not be appropriate in these two examples and that the proposed methods have the potential to identify interesting models warranting further examination. As in the simulation experiments, we consider the Lasso estimator tuned using BIC.

5.1. Pollution and Mortality

As our first illustrative example we consider data on mortality rates recorded in 60 metropolitan areas. Prior analyses of these data focused on the regression of age-adjusted mortality on 15 predictors that are grouped into three broad categories: weather, socioeconomic factors, and pollution. A copy of the dataset and a detailed description of each predictor are provided in the supplemental materials.

Ignoring uncertainty in the tuning parameter selection, the LASSO estimator tuned using BIC leads to a model with six variables, Percent Non-White, Education, SO2 Pollution Potential, Precipitation, Mean January Temperature, and Population Per Mile. However, the estimated conditional sampling distribution of the tuning parameter indicates that a larger model with eight predictors is approximately equally probable. Figure 1 displays the solution path with the estimated selection probabilities using both the asymptotic normal and bootstrap approximations. The estimated conditional distribution of the tuning parameter is displayed in Table 4; 90% confidence intervals based on Equations (3) and (9) are presented. The LASSO coefficient estimates are presented in Table B.1 of the Appendix.

Figure 1.

The top figure shows the LASSO solution path for the mortality rates data; the vertical lines above and below the x-axis correspond to the distributions estimated by the bootstrap and the asymptotic normal approximation, respectively. The bottom figure shows BIC values for candidate models along the solution path. The solid vertical line corresponds to the model with six variables and the smallest BIC value; the dashed vertical line corresponds to a model with eight variables.

Table 4.

Estimated conditional distribution of the tuning parameter for the mortality rates data. Both the asymptotic normal approximation and the bootstrap had only two support points, {124.21, 288.20}.

| λ                             | 288.20         | 124.21         |
|-------------------------------|----------------|----------------|
| Model size                    | 6              | 8              |
| Probability mass (normal)     | 0.51           | 0.49           |
| Probability mass (bootstrap)  | 0.48           | 0.52           |
| 90% CI based on normality     | (0.052, 0.950) | (0.048, 0.948) |
| 90% approximate CI            | (0.009, 0.848) | (0.154, 1.000) |

We further investigate the model with eight predictors. The eighth predictor added to the model is Mean July Temperature. Fitting a simple linear regression of mortality on Mean July Temperature yields a p-value of 0.03 compared with a p-value of 0.81 in the regression of mortality on Mean January Temperature. This is in line with Katsouyanni et al. (1993) who concluded that high temperatures are related to the mortality rate. It can be seen that in this case, reporting a single model may not be appropriate. Rather, it may be more informative to report the two models that contain essentially all of the mass of the conditional distribution of the tuning parameter.

For comparison, results from forward stepwise regression are presented in Figure 2. The smallest BIC value corresponds to Step 5 of the procedure which corresponds to a model with the five predictors: Percent Non-White, Education, Mean January Temperature, SO2 Pollution Potential, and Precipitation. This model is smaller than the model selected by LASSO tuned using BIC. This may be because forward stepwise regression is greedy in that at each step it seeks a variable that captures maximum variation in the residuals. Thus, if a candidate variable is correlated with those selected in previous steps, it may be difficult to see the improvement in the fitted model. In such cases, it might be preferable to use the solution path to generate a candidate set of models.

Figure 2.

The top figure shows the forward stepwise regression solution path for the mortality rates data; the vertical line corresponds to the model that minimizes BIC. The bottom figure shows BIC values for candidate models along the solution path; the solid vertical line corresponds to the smallest BIC value.

5.2. ATV Drug Resistance

Our second example considers mutations that affect resistance to Atazanavir (ATV), a protease inhibitor for HIV (Rhee et al. 2006; Barber et al. 2015). After preprocessing, the dataset contains 328 observations and 361 gene-mutation predictors. The response is a measure of drug resistance to ATV. Because p > n in this example, we use 100 observations to screen for the 50 most important predictors, ranked by Pearson correlation with the response. We then fit a linear model using the Lasso applied to the remaining 228 observations and the 50 predictors selected at screening.

The estimated conditional distribution of the tuning parameter and 90% confidence intervals are presented in Table 5; the estimated distribution is overlaid on the solution path in Figure 3. It can be seen that the estimated distribution of the tuning parameter mainly favors two models.

Table 5.

Estimated conditional distribution of the tuning parameter for ATV drug resistance data.

| λ                             | 1347.92        | 1101.71        | 805.53         |
|-------------------------------|----------------|----------------|----------------|
| Model size                    | 10             | 12             | 15             |
| Probability mass (normal)     | 0.443          | 0.045          | 0.472          |
| Probability mass (bootstrap)  | 0.494          | 0.047          | 0.454          |
| 90% CI based on normality     | (0.019, 0.896) | (0.026, 0.127) | (0.059, 0.959) |
| 90% approximate CI            | (0.000, 0.751) | (0.020, 0.064) | (0.020, 1.000) |

Figure 3.

The top figure shows the LASSO solution path for the ATV drug resistance data; the vertical lines above and below the x-axis correspond to the distributions estimated by the bootstrap and the asymptotic normal approximation, respectively. The bottom figure shows BIC values for candidate models along the solution path. The solid vertical line corresponds to the model with 15 variables and the smallest BIC value; the dashed vertical lines correspond to the models with 12 and 10 variables.

Similar to Barber et al. (2015), we evaluate candidate models based on treatment-selected mutation (TSM) panels, which provide a surrogate for the truly important mutations. The model minimizing BIC contains 15 variables, two of which correspond to the same mutation. This leads to 14 unique mutation locations, of which four are potential false discoveries (as assessed by TSM); see Table 5. Therefore, the surrogate-based estimated false discovery rate is 4/14 ≈ 0.29. For tuning parameter λ = 1101.71, 11 unique positions are identified, two of which are potential false discoveries. Tuning parameter λ = 1347.92 leads to nine unique locations with one potential false discovery. The corresponding surrogate-based estimated false discovery rate is 1/9 ≈ 0.11, a decrease of 20% compared with the model minimizing BIC. Thus, in this case it might not be appropriate to report the single model selected by BIC.

6. Conclusion

We proposed two simple procedures for estimating the conditional distribution of a data-driven tuning parameter in penalized regression given the observed solution path and design matrix. Our objective was to quantify the stability of selected models and thereby identify a set of potential models for consideration by domain experts. A plot of the solution path with the estimated selection probabilities or upper confidence bounds overlaid, for example, Figures 1 and 3, is one means of easily conveying uncertainty in the tuning parameter and identifying models that warrant additional investigation. It is noteworthy that in both examples the identified sets of likely models are not contiguous in size. Thus, our methods provide a theoretically motivated, confidence-set-based alternative to the practice of considering models near in size to the BIC-chosen LASSO model.


Acknowledgments

Funding

The authors gratefully acknowledge funding from the National Science Foundation (DMS-1555141, DMS-1557733, DMS-1513579) and the National Institutes of Health (P01 CA142538).

Appendix A: Simulation Results

Table A.1.

Discovery rate for p = 20, n = 50; τ = 0.1.

| Model | ρ   | AsympNor (TDR/FDR) | Bootstrap (TDR/FDR) | UP1 (TDR/FDR) | UP2 (TDR/FDR) | Akaike (TDR/FDR) | ApproxPost (TDR/FDR) |
|-------|-----|--------------------|---------------------|---------------|---------------|------------------|----------------------|
| 1     | 0   | 0.90 / 0.10        | 0.90 / 0.10         | 1.00 / 0.33   | 0.99 / 0.37   | 0.42 / 0.80      | 0.89 / 0.48          |
| 1     | 0.5 | 0.92 / 0.08        | 0.92 / 0.08         | 1.00 / 0.30   | 0.99 / 0.33   | 0.48 / 0.78      | 0.93 / 0.48          |
| 2     | 0   | 0.87 / 0.11        | 0.88 / 0.11         | 0.99 / 0.33   | 0.99 / 0.36   | 0.35 / 0.81      | 0.81 / 0.48          |
| 2     | 0.5 | 0.89 / 0.10        | 0.89 / 0.11         | 1.00 / 0.35   | 0.99 / 0.36   | 0.45 / 0.78      | 0.87 / 0.47          |
| 3     | 0   | 0.91 / 0.09        | 0.92 / 0.09         | 1.00 / 0.29   | 0.99 / 0.34   | 0.44 / 0.80      | 0.93 / 0.48          |
| 3     | 0.5 | 0.92 / 0.07        | 0.93 / 0.08         | 1.00 / 0.26   | 0.99 / 0.30   | 0.50 / 0.78      | 0.96 / 0.49          |
| 4     | 0   | 0.89 / 0.10        | 0.90 / 0.10         | 1.00 / 0.34   | 0.99 / 0.37   | 0.39 / 0.80      | 0.87 / 0.48          |
| 4     | 0.5 | 0.90 / 0.10        | 0.91 / 0.10         | 1.00 / 0.34   | 0.99 / 0.35   | 0.46 / 0.79      | 0.91 / 0.48          |

Table A.2.

Discovery rate for p = 100, n = 200; τ = 0.1.

| Model | ρ   | AsympNor (TDR/FDR) | Bootstrap (TDR/FDR) | UP1 (TDR/FDR) | UP2 (TDR/FDR) | Akaike (TDR/FDR) | ApproxPost (TDR/FDR) |
|-------|-----|--------------------|---------------------|---------------|---------------|------------------|----------------------|
| 1     | 0   | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.13   | 1.00 / 0.14   | 0.07 / 0.97      | 0.99 / 0.42          |
| 1     | 0.5 | 0.97 / 0.02        | 0.97 / 0.02         | 1.00 / 0.10   | 1.00 / 0.10   | 0.15 / 0.93      | 1.00 / 0.40          |
| 2     | 0   | 0.95 / 0.04        | 0.95 / 0.04         | 1.00 / 0.17   | 1.00 / 0.18   | 0.03 / 0.99      | 0.98 / 0.42          |
| 2     | 0.5 | 0.96 / 0.04        | 0.96 / 0.04         | 1.00 / 0.14   | 1.00 / 0.15   | 0.07 / 0.97      | 0.99 / 0.42          |
| 3     | 0   | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.12   | 1.00 / 0.13   | 0.09 / 0.96      | 0.99 / 0.42          |
| 3     | 0.5 | 0.98 / 0.02        | 0.98 / 0.02         | 1.00 / 0.09   | 1.00 / 0.10   | 0.17 / 0.93      | 1.00 / 0.40          |
| 4     | 0   | 0.96 / 0.04        | 0.96 / 0.04         | 1.00 / 0.14   | 1.00 / 0.15   | 0.04 / 0.98      | 0.99 / 0.42          |
| 4     | 0.5 | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.12   | 1.00 / 0.13   | 0.09 / 0.96      | 0.99 / 0.42          |

Table A.3.

Discovery rate for p = 20, n = 50; τ = 0.2.

| Model | ρ   | AsympNor (TDR/FDR) | Bootstrap (TDR/FDR) | UP1 (TDR/FDR) | UP2 (TDR/FDR) | Akaike (TDR/FDR) | ApproxPost (TDR/FDR) |
|-------|-----|--------------------|---------------------|---------------|---------------|------------------|----------------------|
| 1     | 0   | 0.89 / 0.11        | 0.89 / 0.11         | 0.99 / 0.34   | 0.99 / 0.39   | 0.25 / 0.69      | 0.84 / 0.22          |
| 1     | 0.5 | 0.91 / 0.08        | 0.91 / 0.08         | 1.00 / 0.29   | 0.99 / 0.34   | 0.29 / 0.65      | 0.90 / 0.22          |
| 2     | 0   | 0.86 / 0.13        | 0.86 / 0.13         | 0.99 / 0.36   | 0.99 / 0.41   | 0.22 / 0.73      | 0.77 / 0.23          |
| 2     | 0.5 | 0.88 / 0.11        | 0.88 / 0.12         | 0.99 / 0.37   | 0.99 / 0.40   | 0.28 / 0.65      | 0.83 / 0.23          |
| 3     | 0   | 0.91 / 0.09        | 0.91 / 0.09         | 1.00 / 0.30   | 0.99 / 0.35   | 0.26 / 0.68      | 0.89 / 0.23          |
| 3     | 0.5 | 0.92 / 0.07        | 0.92 / 0.08         | 1.00 / 0.26   | 0.99 / 0.31   | 0.31 / 0.64      | 0.92 / 0.23          |
| 4     | 0   | 0.88 / 0.11        | 0.88 / 0.11         | 0.99 / 0.35   | 0.99 / 0.40   | 0.23 / 0.71      | 0.83 / 0.22          |
| 4     | 0.5 | 0.89 / 0.10        | 0.90 / 0.10         | 0.99 / 0.34   | 0.99 / 0.38   | 0.28 / 0.66      | 0.86 / 0.23          |

Table A.4.

Discovery rate for p = 100, n = 200; τ = 0.2.

| Model | ρ   | AsympNor (TDR/FDR) | Bootstrap (TDR/FDR) | UP1 (TDR/FDR) | UP2 (TDR/FDR) | Akaike (TDR/FDR) | ApproxPost (TDR/FDR) |
|-------|-----|--------------------|---------------------|---------------|---------------|------------------|----------------------|
| 1     | 0   | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.13   | 1.00 / 0.13   | 0.03 / 0.94      | 0.98 / 0.22          |
| 1     | 0.5 | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.10   | 1.00 / 0.11   | 0.07 / 0.88      | 0.99 / 0.21          |
| 2     | 0   | 0.95 / 0.05        | 0.95 / 0.05         | 1.00 / 0.17   | 1.00 / 0.18   | 0.01 / 0.98      | 0.96 / 0.21          |
| 2     | 0.5 | 0.96 / 0.04        | 0.96 / 0.04         | 1.00 / 0.14   | 1.00 / 0.15   | 0.03 / 0.93      | 0.98 / 0.21          |
| 3     | 0   | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.12   | 1.00 / 0.12   | 0.04 / 0.91      | 0.99 / 0.21          |
| 3     | 0.5 | 0.98 / 0.03        | 0.98 / 0.03         | 1.00 / 0.09   | 1.00 / 0.10   | 0.09 / 0.86      | 0.99 / 0.20          |
| 4     | 0   | 0.96 / 0.04        | 0.96 / 0.04         | 1.00 / 0.15   | 1.00 / 0.16   | 0.02 / 0.96      | 0.97 / 0.21          |
| 4     | 0.5 | 0.97 / 0.03        | 0.97 / 0.03         | 1.00 / 0.12   | 1.00 / 0.13   | 0.04 / 0.92      | 0.99 / 0.21          |

Appendix B: Additional Results for Real-Data

Table B.1.

Coefficient estimates for the pollution and mortality data; columns BIC-8 and BIC-6 correspond to lasso estimates with eight and six predictors, respectively; columns Refit-8 and Refit-6 correspond to refitted coefficients with eight and six predictors; column FS-5 corresponds to the model selected by forward stepwise regression.

| Variable                              | Lasso BIC-8 | Lasso Refit-8 | Lasso BIC-6 | Lasso Refit-6 | FS-5   |
|---------------------------------------|-------------|---------------|-------------|---------------|--------|
| Mean annual precipitation             | 14.95       | 17.76         | 11.78       | 14.85         | 14.86  |
| Mean January temperature              | −12.04      | −14.26        | −8.80       | −16.61        | −16.50 |
| Mean July temperature                 | −5.80       | −11.71        | 0           | 0             | 0      |
| Median school years                   | −8.89       | −8.05         | −9.99       | −9.75         | −10.80 |
| Pct of housing units with facilities  | −2.55       | −4.69         | 0           | 0             | 0      |
| Population per square mile            | 5.15        | 7.11          | 2.62        | 6.039         | 0      |
| Pct of non-White                      | 35.38       | 39.98         | 30.04       | 36.98         | 36.27  |
| Pollution potential of sulfur dioxide | 14.47       | 14.92         | 13.66       | 15.51         | 18.00  |

Appendix C: Proof and Technical Details

Lemma C.1.

If the penalty function $f_j(\beta_j; \mathbb{P}_n)$ depends on the data only through $X^\top Y$ and X, then the distribution of $\hat{\lambda}_{\mathrm{GIC}}$ conditional on $\hat{S}$ and X is equal to the distribution of $\hat{\lambda}_{\mathrm{GIC}}$ conditional on $X^\top Y$ and X.

Proof. $\hat{\beta}(\lambda) = \arg\min_\beta \{\tfrac{1}{2}\beta^\top X^\top X\beta - Y^\top X\beta + \lambda\sum_{j=2}^p f_j(\beta_j; \mathbb{P}_n)\}$, from which it can be seen that the solution path is completely determined by $X^\top X$ and $X^\top Y$. On the other hand, given $\hat{S}$ and X, we can recover $X^\top Y$ using $X^\top X\hat{\beta}(0) = X^\top X\{(X^\top X)^{-1}X^\top Y\} = X^\top Y$. □

Lemma C.2.

Suppose that b(·) and c(·) are nonnegative-valued functions defined on [0, ∞] such that b(λ) is nondecreasing for λ ≥ 0 with b(0) = 0. For x ≥ 0, define

H(x,λ)=log{x+b(λ)}+c(λ)

and

λ(x)=argminλH(x,λ).

Then λ(x) is nondecreasing in x ≥ 0.

Proof of Lemma C.2. Suppose $x_1 \le x_2$; we need to show that $\lambda(x_1) \le \lambda(x_2)$. First, consider the difference of $H(x_2, \lambda)$ and $H(x_1, \lambda)$:

\[
H(x_2, \lambda) - H(x_1, \lambda) = \log\{x_2 + b(\lambda)\} - \log\{x_1 + b(\lambda)\} = \log\left\{1 + \frac{x_2 - x_1}{x_1 + b(\lambda)}\right\},
\]

which is nonnegative for every λ and nonincreasing in λ. Therefore, if $\lambda(x_1) = \arg\min_\lambda H(x_1, \lambda)$, it follows that

\[
\lambda(x_2) = \arg\min_\lambda H(x_2, \lambda) = \arg\min_\lambda \left[ H(x_1, \lambda) + \log\left\{1 + \frac{x_2 - x_1}{x_1 + b(\lambda)}\right\} \right] \ge \lambda(x_1).
\]

The last inequality follows because $\log\{1 + (x_2 - x_1)/(x_1 + b(\lambda))\}$ is nonnegative and nonincreasing with respect to λ. □

Proof of Lemma 1. Recall that the information criterion can also be expressed as

\[
\mathrm{GIC}_\lambda = \log\left(\frac{\|Y - X\hat{\beta}(\lambda)\|^2}{n}\right) + w_n \widehat{df}_\lambda = \log\left(\frac{\|Y - X\hat{\beta}_{\mathrm{ols}} + X\hat{\beta}_{\mathrm{ols}} - X\hat{\beta}(\lambda)\|^2}{n}\right) + w_n \widehat{df}_\lambda = \log\left(\hat{\sigma}^2_0 + \frac{D_\lambda}{n}\right) + w_n \widehat{df}_\lambda,
\]

where $D_\lambda = \{\hat{\beta}_{\mathrm{ols}} - \hat{\beta}(\lambda)\}^\top X^\top X \{\hat{\beta}_{\mathrm{ols}} - \hat{\beta}(\lambda)\}$.

Because $D_\lambda$ is a deterministic function of λ conditional on the solution path and design matrix, the only variability in $\hat{\lambda}_{\mathrm{GIC}}$ is due to $\hat{\sigma}^2_0$. Therefore, conditional on $(\hat{\beta}_{\mathrm{ols}}, X)$, $\hat{\lambda}_{\mathrm{GIC}}$ is a function of $\hat{\sigma}^2_0$.

The monotonicity then follows immediately by observing that $D_\lambda$ is a nondecreasing function for λ ≥ 0 with $D_0 = 0$ and invoking Lemma C.2. □

Lemma C.3.

Let $\hat{S}$ and X be fixed. If $\widehat{df}_{\hat{\lambda}_{(k)}} < \widehat{df}_{\hat{\lambda}_{(i)}}$, then $\mathrm{GIC}_{\hat{\lambda}_{(i)}} \ge \mathrm{GIC}_{\hat{\lambda}_{(k)}}$ if and only if $n\hat{\sigma}^2_0 \ge \hat{\ell}_{i,k}$; and if $\widehat{df}_{\hat{\lambda}_{(k)}} > \widehat{df}_{\hat{\lambda}_{(i)}}$, then $\mathrm{GIC}_{\hat{\lambda}_{(i)}} \ge \mathrm{GIC}_{\hat{\lambda}_{(k)}}$ if and only if $n\hat{\sigma}^2_0 \le \hat{\ell}_{i,k}$.

Proof of Lemma C.3. Consider the case $\widehat{df}_{\hat{\lambda}_{(k)}} < \widehat{df}_{\hat{\lambda}_{(i)}}$:

\[
\begin{aligned}
P(\mathrm{GIC}_{\hat{\lambda}_{(i)}} \ge \mathrm{GIC}_{\hat{\lambda}_{(k)}} \mid \hat{S}, X)
&= P\!\left[ \log\!\left\{\hat{\sigma}^2_0 + \frac{D_{\hat{\lambda}_{(i)}}}{n}\right\} + w_n \widehat{df}_{\hat{\lambda}_{(i)}} \ge \log\!\left\{\hat{\sigma}^2_0 + \frac{D_{\hat{\lambda}_{(k)}}}{n}\right\} + w_n \widehat{df}_{\hat{\lambda}_{(k)}} \,\Big|\, \hat{S}, X \right] \\
&= P\!\left[ \log\!\left\{\hat{\sigma}^2_0 + \frac{D_{\hat{\lambda}_{(i)}}}{n}\right\} - \log\!\left\{\hat{\sigma}^2_0 + \frac{D_{\hat{\lambda}_{(k)}}}{n}\right\} \ge w_n\{\widehat{df}_{\hat{\lambda}_{(k)}} - \widehat{df}_{\hat{\lambda}_{(i)}}\} \,\Big|\, \hat{S}, X \right] \\
&= P\!\left[ \frac{n\hat{\sigma}^2_0 + D_{\hat{\lambda}_{(i)}}}{n\hat{\sigma}^2_0 + D_{\hat{\lambda}_{(k)}}} \ge \exp\{w_n(\widehat{df}_{\hat{\lambda}_{(k)}} - \widehat{df}_{\hat{\lambda}_{(i)}})\} \,\Big|\, \hat{S}, X \right] \\
&= P\!\left( n\hat{\sigma}^2_0 \ge \frac{D_{\hat{\lambda}_{(k)}}\exp\{w_n(\widehat{df}_{\hat{\lambda}_{(k)}} - \widehat{df}_{\hat{\lambda}_{(i)}})\} - D_{\hat{\lambda}_{(i)}}}{1 - \exp\{w_n(\widehat{df}_{\hat{\lambda}_{(k)}} - \widehat{df}_{\hat{\lambda}_{(i)}})\}} \,\Big|\, \hat{S}, X \right).
\end{aligned}
\]

The case df^λ^(k)>df^λ^(i) follows by a similar argument. □

Proof of Proposition 1. The proof follows from the fact that

\[
\mathrm{GIC}_{\hat{\lambda}_{(k)}} < \mathrm{GIC}_{\hat{\lambda}_{(i)}} \ \text{for all } i \in \hat{A}_k \cup \hat{B}_k \quad \text{if and only if} \quad \max_{i \in \hat{B}_k} \hat{\ell}_{i,k} \le n\hat{\sigma}^2_0 \le \min_{i \in \hat{A}_k} \hat{\ell}_{i,k}.
\]

To prove Proposition 2, we assume:

(A1F): under a fixed design, $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i X_i^\top = C$ and $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i = \mu_x$, where $C \in \mathbb{R}^{p\times p}$ is nonnegative definite and $\mu_x \in \mathbb{R}^p$;

(A1R): under a random design, with probability one, $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i X_i^\top = C$ and $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i = \mu_x$, where $C \in \mathbb{R}^{p\times p}$ is nonnegative definite and $\mu_x \in \mathbb{R}^p$;

(A2): $\mathbb{E}|\epsilon_i|^4 < \infty$.

Under assumptions (A1F) and (A2), we have the following well-known results, which facilitate the proof of Proposition 2.

Lemma C.4.

\[
\hat{\beta}_{\mathrm{ols}} \xrightarrow{a.s.} \beta_0; \qquad \sqrt{n}(\hat{\beta}_{\mathrm{ols}} - \beta_0) \xrightarrow{d} N(0_{p\times 1},\, \sigma^2_0 C^{-1}).
\]

Proof of Proposition 2. First consider the fixed design model. Let

\[
\psi(Y_i, X_i, \beta, \sigma^2) = \begin{pmatrix} (Y_i - X_i^\top\beta)X_i \\ (Y_i - X_i^\top\beta)^2 - \sigma^2 \end{pmatrix}.
\]

Then $(\hat{\beta}_{\mathrm{ols}}, \hat{\sigma}^2)$ is a solution to the equation

\[
\sum_{i=1}^n \psi(Y_i, X_i, \beta, \sigma^2) = 0.
\]

A Taylor series expansion around the true value $(\beta_0, \sigma^2_0)$ results in

\[
\sum_{i=1}^n \psi(Y_i, X_i, \hat{\beta}_{\mathrm{ols}}, \hat{\sigma}^2) = \sum_{i=1}^n \psi(Y_i, X_i, \beta_0, \sigma^2_0) + \sum_{i=1}^n \psi'(Y_i, X_i, \beta_0, \sigma^2_0)\begin{pmatrix} \hat{\beta}_{\mathrm{ols}} - \beta_0 \\ \hat{\sigma}^2 - \sigma^2_0 \end{pmatrix} + R_n,
\]

where ψ′ is the derivative of ψ and

\[
R_n = \sum_{i=1}^n \begin{pmatrix} 0_{p\times 1} \\ (\hat{\beta}_{\mathrm{ols}} - \beta_0)^\top X_i X_i^\top (\hat{\beta}_{\mathrm{ols}} - \beta_0) \end{pmatrix}.
\]

Rearranging leads to

\[
\left\{ -\frac{1}{n}\sum_{i=1}^n \psi'(Y_i, X_i, \beta_0, \sigma^2_0) \right\} \sqrt{n}\begin{pmatrix} \hat{\beta}_{\mathrm{ols}} - \beta_0 \\ \hat{\sigma}^2 - \sigma^2_0 \end{pmatrix} = \left\{ \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(Y_i, X_i, \beta_0, \sigma^2_0) \right\} + \frac{R_n}{\sqrt{n}}.
\]

Because $-\psi'(Y_i, X_i, \beta_0, \sigma^2_0) = \begin{pmatrix} X_i X_i^\top & 0_{p\times 1} \\ 2(Y_i - X_i^\top\beta_0)X_i^\top & 1 \end{pmatrix}$, it follows that

\[
-\frac{1}{n}\sum_{i=1}^n \psi'(Y_i, X_i, \beta_0, \sigma^2_0) \xrightarrow{p} \begin{pmatrix} C & 0_{p\times 1} \\ 0_{1\times p} & 1 \end{pmatrix}
\]

by the consistency of $\hat{\beta}_{\mathrm{ols}}$.

Then, by the multivariate Lindeberg–Feller central limit theorem,

\[
\frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(Y_i, X_i, \beta_0, \sigma^2_0) \xrightarrow{d} N\left[ \begin{pmatrix} 0_{p\times 1} \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_0 C & \mu_x \mu_{3,\epsilon} \\ \mu_x^\top \mu_{3,\epsilon} & \mu_{4,\epsilon} - \sigma^4_0 \end{pmatrix} \right].
\]

Finally, $R_n/\sqrt{n}$ is $o_p(1)$ because

\[
\frac{1}{\sqrt{n}}\sum_{i=1}^n (\hat{\beta}_{\mathrm{ols}} - \beta_0)^\top X_i X_i^\top (\hat{\beta}_{\mathrm{ols}} - \beta_0) = \sqrt{n}\,(\hat{\beta}_{\mathrm{ols}} - \beta_0)^\top \left\{ \frac{1}{n}\sum_{i=1}^n X_i X_i^\top \right\}(\hat{\beta}_{\mathrm{ols}} - \beta_0).
\]

Therefore, by Slutsky's theorem,

\[
\sqrt{n}\begin{pmatrix} \hat{\beta}_{\mathrm{ols}} - \beta_0 \\ \hat{\sigma}^2_0 - \sigma^2_0 \end{pmatrix} \xrightarrow{d} N\left[ \begin{pmatrix} 0_{p\times 1} \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_0 C^{-1} & \mu_x \mu_{3,\epsilon} \\ \mu_x^\top \mu_{3,\epsilon} & \mu_{4,\epsilon} - \sigma^4_0 \end{pmatrix} \right]. \qquad (C.1)
\]

Then, for the random design, because $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i X_i^\top = C$ and $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n X_i = \mu_x$ almost surely, assumption (A1F) holds for almost every sequence $x_1, x_2, \ldots$. Therefore Equation (C.1) holds for almost every sequence $x_1, x_2, \ldots$. □

Lemma C.5.

Suppose $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are normally distributed with mean zero and variance $\sigma^2_0$; then the plug-in estimator $\hat{p}_k = F_{\chi^2_{n-p}}(\hat{a}_k/\hat{\sigma}^2_0) - F_{\chi^2_{n-p}}(\hat{b}_k/\hat{\sigma}^2_0)$ need not be a consistent estimator of $p_k$, where $\hat{a}_k$ and $\hat{b}_k$ are defined in Section 3.2.

Proof. We show that there exist sequences for which $\hat{p}_k$ is not consistent for $p_k$. Consider the case $X^\top X = n \times I_2$ with $X_1 = 1_{n\times 1}$ and $\sigma^2_0 = 1$; then

\[
p_1 = P(\mathrm{GIC}_{+} > \mathrm{GIC}_{0} \mid \hat{\beta}_{\mathrm{ols}}, X) = P\!\left( \frac{\hat{\beta}^2_{\mathrm{ols},2}}{\exp(w_n) - 1} > \hat{\sigma}^2_0 \,\Big|\, \hat{\beta}_{\mathrm{ols}}, X \right) = \Phi\!\left( \frac{\hat{a}_1 - n}{\sqrt{2n}} \right) + o_p(1),
\]

where $\hat{a}_1 = n\hat{\beta}^2_{\mathrm{ols},2}/\{\exp(w_n) - 1\}$ and $\hat{\beta}_{\mathrm{ols},2}$ is the least-squares estimate of the coefficient of $X_2$. Moreover,

\[
\hat{p}_1 = \Phi\!\left( \frac{\hat{a}_1/\hat{\sigma}^2_0 - n}{\sqrt{2n}} \right) + o_p(1).
\]

For sequences with $\hat{\beta}^2_{\mathrm{ols},2} = \{\exp(w_n) - 1\}(1 + cn^{-1/2})$, where c is any constant, it follows that $\hat{a}_1 = n + c\sqrt{n}$. Then

\[
\frac{\hat{a}_1 - n}{\sqrt{2n}} - \frac{\hat{a}_1/\hat{\sigma}^2_0 - n}{\sqrt{2n}} = \frac{\hat{a}_1(\hat{\sigma}^2_0 - 1)}{\sqrt{2n}\,\hat{\sigma}^2_0} = \frac{(n + c\sqrt{n})(\hat{\sigma}^2_0 - 1)}{\sqrt{2n}\,\hat{\sigma}^2_0},
\]

which is not $o_p(1)$. Therefore, $\hat{p}_1$ is not a consistent estimator of $p_1$. □

Proposition C.1.

Assume the distribution of $\epsilon_i$, i = 1, ..., n, is symmetric about zero; then for any ϵ > 0,

\[
P\left\{ \inf_{(t_1,t_2)\in E_y} |p_k - h_k(t_1, t_2)| > \epsilon \right\} \le \alpha + o(1), \qquad (C.2)
\]

where $E_y$ is an asymptotic (1 − α) × 100% confidence region for $\mu_{4,\epsilon} - \sigma^4_0$ and $\sigma^2_0$.

Proof. Denote by A the event that $(\sigma^2_0, \mu_{4,\epsilon} - \sigma^4_0) \in E_y$. Then

\[
P\left( \inf_{(t_1,t_2)\in E_y} |p_k - h_k(t_1, t_2)| > \epsilon \right) \le P\left( \inf_{(t_1,t_2)\in E_y} |p_k - h_k(t_1, t_2)| > \epsilon \,\Big|\, A \right) P(A) + P(A^c) \le 0 \times (1 - \alpha) + \alpha + o(1) = \alpha + o(1). \qquad \square
\]

Lemma C.6.

For any s ≥ 1, assume $n^{-1}\sum_{i=1}^n \|X_i\|^s = O(1)$; then

\[
n^{-1}\sum_{i=1}^n |\hat{e}_i|^s \xrightarrow{a.s.} m_s,
\]

where $m_s = E|\epsilon_1|^s$.

Proof of Lemma C.6.

\[
\left| \Big( n^{-1}\sum_{i=1}^n |\hat{e}_i|^s \Big)^{1/s} - \Big( n^{-1}\sum_{i=1}^n |\epsilon_i|^s \Big)^{1/s} \right|^s \le n^{-1}\sum_{i=1}^n |\hat{e}_i - \epsilon_i|^s = n^{-1}\sum_{i=1}^n |X_i^\top(\hat{\beta}_{\mathrm{ols}} - \beta_0)|^s \le n^{-1}\sum_{i=1}^n \|X_i\|^s\, \|\hat{\beta}_{\mathrm{ols}} - \beta_0\|^s.
\]

But $\hat{\beta}_{\mathrm{ols}} \xrightarrow{a.s.} \beta_0$ and $n^{-1}\sum_{i=1}^n \|X_i\|^s = O(1)$, so that $(n^{-1}\sum_{i=1}^n |\hat{e}_i|^s)^{1/s} - (n^{-1}\sum_{i=1}^n |\epsilon_i|^s)^{1/s} \xrightarrow{a.s.} 0$. Then, by the strong law of large numbers, $n^{-1}\sum_{i=1}^n |\epsilon_i|^s \xrightarrow{a.s.} E|\epsilon_1|^s$, and thus $n^{-1}\sum_{i=1}^n |\hat{e}_i|^s \xrightarrow{a.s.} E|\epsilon_1|^s$. □

Lemma C.7.

Assume (A1F) and (A2); then

\[
\frac{1}{\sqrt{n}}\,\gamma^{(b)\top} P_x\, \gamma^{(b)} \xrightarrow{p} 0
\]

conditionally almost surely.

Proof of Lemma C.7. Denote $\hat{\beta}^{*}_{\mathrm{ols}} = (X^\top X)^{-1}X^\top(X\hat{\beta}_{\mathrm{ols}} + \gamma^{(b)})$; then we have

\[
\frac{1}{\sqrt{n}}\,\gamma^{(b)\top} P_x\, \gamma^{(b)} = \sqrt{n}\,(\hat{\beta}^{*}_{\mathrm{ols}} - \hat{\beta}_{\mathrm{ols}})^\top \left( \frac{1}{n} X^\top X \right)(\hat{\beta}^{*}_{\mathrm{ols}} - \hat{\beta}_{\mathrm{ols}}).
\]

Then, by noting that $\sqrt{n}(\hat{\beta}^{*}_{\mathrm{ols}} - \hat{\beta}_{\mathrm{ols}}) \xrightarrow{d} N(0, \sigma^2_0 C^{-1})$ conditionally almost surely (Theorem 2.2 of Freedman 1981), $n^{-1/2}\gamma^{(b)\top} P_x \gamma^{(b)}$ is $o_p(1)$ conditionally almost surely. □

Proposition C.2.

Under assumptions (A1F) and (A2), and further assuming that $E|\epsilon_i|^{4+\delta} < \infty$ and $n^{-1}\sum_{i=1}^n \|X_i\|^{4+\delta} < \infty$ for any δ > 0, $\sqrt{n}(\hat{\sigma}^{2(b)} - \hat{\sigma}^2_0) \xrightarrow{d} N(0, \mu_{4,\epsilon} - \sigma^4_0)$ conditionally almost surely, where $\hat{\sigma}^{2(b)}$ denotes the bootstrap analogue of $\hat{\sigma}^2_0$ defined below.

Proof of Proposition C.2. Recall

\[
\hat{\sigma}^{2(b)} = \frac{1}{n}\gamma^{(b)\top}(I - P_x)\gamma^{(b)} = \frac{1}{n}\gamma^{(b)\top}\gamma^{(b)} - \frac{1}{n}\gamma^{(b)\top}P_x\gamma^{(b)}.
\]

Because $n^{-1/2}\gamma^{(b)\top}P_x\gamma^{(b)} \xrightarrow{p} 0$ conditionally almost surely by Lemma C.7, $\hat{\sigma}^{2(b)}$ has the same asymptotic distribution as $n^{-1}\gamma^{(b)\top}\gamma^{(b)}$. Then, because $\gamma_1^{(b)}, \gamma_2^{(b)}, \ldots, \gamma_n^{(b)}$ are sampled from a different distribution for every n, the Lindeberg central limit theorem is used to obtain the asymptotic distribution. The conditional mean of $(\gamma_1^{(b)})^2$ is

\[
E\{(\gamma_1^{(b)})^2 \mid Y\} = \frac{1}{n}\sum_{i=1}^n \hat{e}_i^2 = \hat{\sigma}^2_0.
\]

The conditional variance is

\[
\mathrm{var}\{(\gamma_1^{(b)})^2 \mid Y\} = \frac{1}{n}\sum_{i=1}^n \hat{e}_i^4 - \left\{ \frac{1}{n}\sum_{i=1}^n \hat{e}_i^2 \right\}^2.
\]

By Lemma C.6, $n^{-1}\sum_{i=1}^n \hat{e}_i^4 \xrightarrow{a.s.} \mu_{4,\epsilon}$ and $\{n^{-1}\sum_{i=1}^n \hat{e}_i^2\}^2 \xrightarrow{a.s.} \sigma^4_0$, so the conditional variance converges to $\mu_{4,\epsilon} - \sigma^4_0$ almost surely.

Then, to verify the Lyapunov condition,

\[
\frac{1}{\{\mathrm{var}((\gamma_1^{(b)})^2 \mid Y)\}^{2+\delta}} \sum_{i=1}^n E\left\{ \left| \frac{(\gamma_i^{(b)})^2}{\sqrt{n}} \right|^{2+\delta} \Big|\, Y \right\} = \frac{1}{\{\mathrm{var}((\gamma_1^{(b)})^2 \mid Y)\}^{2+\delta}}\, \frac{1}{n^{\delta/2}}\, E\{|\gamma_1^{(b)}|^{4+2\delta} \mid Y\} = \frac{1}{\{\mathrm{var}((\gamma_1^{(b)})^2 \mid Y)\}^{2+\delta}}\, \frac{1}{n^{1+\delta/2}}\, \sum_{i=1}^n |\hat{e}_i|^{4+2\delta},
\]

which is o(1) almost surely by Lemma C.6. Thus $\sqrt{n}(\hat{\sigma}^{2(b)} - \hat{\sigma}^2_0) \xrightarrow{d} N(0, \mu_{4,\epsilon} - \sigma^4_0)$ conditionally almost surely. □

C.1. Theoretical Results for High Dimensions

Proposition C.3.

If $p = o(n^{1/2})$, then

\[
\sqrt{n}(\hat{\sigma}^2_0 - \sigma^2_0) \xrightarrow{d} N(0, \mu_{4,\epsilon} - \sigma^4_0).
\]

Proof of Proposition C.3. We know $\hat{\sigma}^2_0 = n^{-1}\epsilon^\top\epsilon - n^{-1}\epsilon^\top P_x \epsilon$, where $\sqrt{n}(n^{-1}\epsilon^\top\epsilon - \sigma^2_0)$ converges to $N(0, \mu_{4,\epsilon} - \sigma^4_0)$ in distribution. It remains to prove that $n^{-1/2}\epsilon^\top P_x\epsilon \xrightarrow{p} 0$.

By the expectation of a quadratic form, we have $n^{-1/2}E(\epsilon^\top P_x\epsilon) = n^{-1/2}\sigma^2_0\,\mathrm{tr}(P_x) \le n^{-1/2}\sigma^2_0\, p = o(1)$. Since the quadratic form is nonnegative, Markov's inequality gives $n^{-1/2}\epsilon^\top P_x\epsilon \xrightarrow{p} 0$. This completes the proof. □

Now we study the distribution of the variance estimator after screening. First, we restate Theorem 1 of Fan and Lv (2008) with slight modification. Denote by $A_0$ the index set of the truly nonzero regression coefficients and by S the screened subset. Assume Conditions 1–4 in Fan and Lv (2008) hold for some 2κ + τ < 1/2; then we have the following result.

Theorem C.1 (Accuracy of SIS). Under Conditions 1–4 in Fan and Lv (2008), if 2κ + τ < 1/2, then there exists θ > 1/2 such that

\[
P(A_0 \subset S) = 1 - O\{\exp(-Cn^{1-2\kappa}/\log n)\},
\]

where C is a positive constant, and the size of S is $O(n^{1-\theta})$.

From the above result, screening reduces the number of predictors from a huge scale, $O\{\exp(n^c)\}$, to a smaller scale, o(n). Denote $X = (X^{(1)}, X^{(2)})$, where $X^{(1)}$ and $X^{(2)}$ correspond to the first and second halves of the design matrix, respectively; similarly define $Y^{(1)}$ and $Y^{(2)}$. Then the variance estimator is defined as

\[
\hat{\sigma}^2_0 = \frac{1}{m}\{Y^{(2)}\}^\top (I - P^{(2)}_{X_S}) Y^{(2)},
\]

where m = n/2 and $P^{(2)}_{X_S}$ is the projection matrix constructed from the screened subset S and the second half of the design matrix, $X^{(2)}$.

Proposition C.4.

Under Conditions 1–4 in Fan and Lv (2008), if 2κ + τ < 1/2, then

\[
\sqrt{m}(\hat{\sigma}^2_0 - \sigma^2_0) \xrightarrow{d} N(0, \mu_{4,\epsilon} - \sigma^4_0).
\]

Proof of Proposition C.4.

\[
\begin{aligned}
\sqrt{m}\,\hat{\sigma}^2_0 &= \frac{1}{\sqrt{m}}\{Y^{(2)}\}^\top (I - P^{(2)}_{X_S}) Y^{(2)} = \frac{1}{\sqrt{m}}\{X^{(2)}\beta_0 + \epsilon^{(2)}\}^\top (I - P^{(2)}_{X_S})\{X^{(2)}\beta_0 + \epsilon^{(2)}\} \\
&= \frac{1}{\sqrt{m}}\{\epsilon^{(2)}\}^\top \epsilon^{(2)} - \frac{1}{\sqrt{m}}\{\epsilon^{(2)}\}^\top P^{(2)}_{X_S}\epsilon^{(2)} + \frac{1}{\sqrt{m}}\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\{X^{(2)}\beta_0\} + \frac{2}{\sqrt{m}}\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\,\epsilon^{(2)}.
\end{aligned}
\]

For the first term, we know that $\sqrt{m}\{m^{-1}(\epsilon^{(2)})^\top\epsilon^{(2)} - \sigma^2_0\}$ converges to $N(0, \mu_{4,\epsilon} - \sigma^4_0)$. It remains to prove that the remaining terms are $o_p(1)$. For the second term, we know

\[
E\!\left( \frac{1}{\sqrt{m}}\{\epsilon^{(2)}\}^\top P^{(2)}_{X_S}\epsilon^{(2)} \right) = \frac{\sigma^2_0}{\sqrt{m}}\, E\{\mathrm{tr}(P^{(2)}_{X_S})\} = \frac{1}{\sqrt{m}} \times O(n^{1-\theta}).
\]

Therefore it is $o_p(1)$. For the third term,

\[
\begin{aligned}
\frac{1}{\sqrt{m}} E\big(\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\{X^{(2)}\beta_0\}\big)
&= \frac{1}{\sqrt{m}} E\big(\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\{X^{(2)}\beta_0\} \mid A_0 \subset S\big) P(A_0 \subset S) \\
&\quad + \frac{1}{\sqrt{m}} E\big(\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\{X^{(2)}\beta_0\} \mid A_0 \not\subset S\big) P(A_0 \not\subset S) \\
&= 0 + \frac{1}{\sqrt{m}} E\big(\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\{X^{(2)}\beta_0\} \mid A_0 \not\subset S\big) P(A_0 \not\subset S) \\
&\le \frac{1}{\sqrt{m}} E\big(\{X^{(2)}\beta_0\}^\top \{X^{(2)}\beta_0\}\big)\, P(A_0 \not\subset S) \\
&= \sqrt{m}\,\beta_0^\top C \beta_0\, P(A_0 \not\subset S) \le \sqrt{m}\,\mathrm{var}(Y)\, P(A_0 \not\subset S) = O\{\sqrt{n}\exp(-Cn^{1-2\kappa}/\log n)\},
\end{aligned}
\]

where the last inequality follows from Condition 3 in Fan and Lv (2008), var(Y) = O(1). Thus it is $o_p(1)$. For the last term,

\[
\mathrm{var}\!\left( \frac{2}{\sqrt{m}}\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\,\epsilon^{(2)} \right) = \frac{4\sigma^2_0}{m}\, E\big(\{X^{(2)}\beta_0\}^\top (I - P^{(2)}_{X_S})\{X^{(2)}\beta_0\}\big),
\]

which is o(1). This completes the proof. □

Footnotes

Color versions of one or more of the figures in this article are available online at www.tandfonline.com/r/TECH.

Supplementary materials for this article are available online. Please go to http://www.tandfonline.com/r/TECH

Supplementary Materials

Simulation results: Simulation results for τ = 0.1 and 0.2 are presented in the online supplement to this article.

Proofs and technical details: Detailed proofs are provided in the online supplement to this article.

R package: R package for proposed methods.

References

  1. Akaike, H. (1974), "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, 19, 716–723.
  2. Barber, R. F., and Candès, E. J. (2015), "Controlling the False Discovery Rate Via Knockoffs," The Annals of Statistics, 43, 2055–2085.
  3. Berger, R. L., and Boos, D. D. (1994), "P Values Maximized Over a Confidence Set for the Nuisance Parameter," Journal of the American Statistical Association, 89, 1012–1016.
  4. Bertsekas, D. P. (2014), Constrained Optimization and Lagrange Multiplier Methods, Boston, MA: Academic Press.
  5. Burnham, K. P., and Anderson, D. (2003), Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach, New York: Springer.
  6. Chen, J., and Chen, Z. (2008), "Extended Bayesian Information Criteria for Model Selection With Large Model Spaces," Biometrika, 95, 759–771.
  7. Cox, D. R. (2001), "Statistical Modeling: The Two Cultures: Comment," Statistical Science, 16, 216–218.
  8. Fan, J., and Li, R. (2001), "Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
  9. Fan, J., and Lv, J. (2008), "Sure Independence Screening for Ultrahigh Dimensional Feature Space," Journal of the Royal Statistical Society, Series B, 70, 849–911.
  10. Fan, Y., and Tang, C. Y. (2013), "Tuning Parameter Selection in High Dimensional Penalized Likelihood," Journal of the Royal Statistical Society, Series B, 75, 531–552.
  11. Feng, Y., and Yu, Y. (2013), "Consistent Cross-Validation for Tuning Parameter Selection in High-Dimensional Variable Selection," arXiv preprint arXiv:1308.5390.
  12. Freedman, D. A. (1981), "Bootstrapping Regression Models," The Annals of Statistics, 9, 1218–1228.
  13. Golub, G. H., Heath, M., and Wahba, G. (1979), "Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter," Technometrics, 21, 215–223.
  14. Hall, P., Lee, E. R., and Park, B. U. (2009), "Bootstrap-Based Penalty Choice for the Lasso, Achieving Oracle Performance," Statistica Sinica, 19, 449–471.
  15. Henderson, H. V., and Velleman, P. F. (1981), "Building Multiple Regression Models Interactively," Biometrics, 37, 391–411.
  16. Hui, F. K., Warton, D. I., and Foster, S. D. (2015), "Tuning Parameter Selection for the Adaptive Lasso Using ERIC," Journal of the American Statistical Association, 110, 262–269.
  17. Katsouyanni, K., Pantazopoulou, A., Touloumi, G., Tselepidaki, I., Moustris, K., Asimakopoulos, D., Poulopoulou, G., and Trichopoulos, D. (1993), "Evidence for Interaction Between Air Pollution and High Temperature in the Causation of Excess Mortality," Archives of Environmental Health: An International Journal, 48, 235–242.
  18. Kim, Y., Kwon, S., and Choi, H. (2012), "Consistent Model Selection Criteria on High Dimensions," Journal of Machine Learning Research, 13, 1037–1057.
  19. Luo, X., Stefanski, L. A., and Boos, D. D. (2006), "Tuning Variable Selection Procedures by Adding Noise," Technometrics, 48, 165–175.
  20. Mallows, C. L. (1973), "Some Comments on Cp," Technometrics, 15, 661–675.
  21. McDonald, G. C., and Schwing, R. C. (1973), "Instabilities of Regression Estimates Relating Air Pollution to Mortality," Technometrics, 15, 463–481.
  22. Meinshausen, N., and Bühlmann, P. (2010), "Stability Selection," Journal of the Royal Statistical Society, Series B, 72, 417–473.
  23. Rhee, S.-Y., Taylor, J., Wadhera, G., Ben-Hur, A., Brutlag, D. L., and Shafer, R. W. (2006), "Genotypic Predictors of Human Immunodeficiency Virus Type 1 Drug Resistance," Proceedings of the National Academy of Sciences, 103, 17355–17360.
  24. Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
  25. Shah, R. D., and Samworth, R. J. (2013), "Variable Selection With Error Control: Another Look at Stability Selection," Journal of the Royal Statistical Society, Series B, 75, 55–80.
  26. Sun, W., Wang, J., and Fang, Y. (2013), "Consistent Selection of Tuning Parameters Via Variable Selection Stability," The Journal of Machine Learning Research, 14, 3419–3440.
  27. Tibshirani, R. (1996), "Regression Shrinkage and Selection Via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
  28. Wang, H., Li, B., and Leng, C. (2009), "Shrinkage Tuning Parameter Selection With a Diverging Number of Parameters," Journal of the Royal Statistical Society, Series B, 71, 671–683.
  29. Wang, T., and Zhu, L. (2011), "Consistent Tuning Parameter Selection in High Dimensional Sparse Linear Regression," Journal of Multivariate Analysis, 102, 1141–1151.
  30. Zhang, Y., Li, R., and Tsai, C.-L. (2010), "Regularization Parameter Selections Via Generalized Information Criterion," Journal of the American Statistical Association, 105, 312–323.
  31. Zou, H. (2006), "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, 101, 1418–1429.
  32. Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection Via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.
