Abstract
Penalized regression methods that perform simultaneous model selection and estimation are ubiquitous in statistical modeling. The use of such methods is often unavoidable as manual inspection of all possible models quickly becomes intractable when there are more than a handful of predictors. However, automated methods usually fail to incorporate domain-knowledge, exploratory analyses, or other factors that might guide a more interactive model-building approach. A hybrid approach is to use penalized regression to identify a set of candidate models and then to use interactive model-building to examine this candidate set more closely. To identify a set of candidate models, we derive point and interval estimators of the probability that each model along a solution path will minimize a given model selection criterion, for example, Akaike information criterion, Bayesian information criterion (AIC, BIC), etc., conditional on the observed solution path. Then models with a high probability of selection are considered for further examination. Thus, the proposed methodology attempts to strike a balance between algorithmic modeling approaches that are computationally efficient but fail to incorporate expert knowledge, and interactive modeling approaches that are labor intensive but informed by experience, intuition, and domain knowledge. Supplementary materials for this article are available online.
Keywords: Conditional distribution, Lasso, Prediction sets
1. Introduction
Penalized estimation is a popular means of regression model fitting that is quickly becoming a standard tool among quantitative researchers working across nearly all areas of science. Examples include the Lasso (Tibshirani 1996), SCAD (Fan and Li 2001), Elastic Net (Zou and Hastie 2005), and the adaptive Lasso (Zou 2006). One appealing feature of these methods is that they perform simultaneous model selection and estimation, thereby automating model-building at least partially. This is especially beneficial in settings where the number of predictors is large, precluding manual inspection of all possible models. However, a consequence is that the analyst becomes increasingly dependent on an estimation algorithm that has neither the subject-matter knowledge nor the intuition that might guide a less automated and more interactive model-building process (Henderson and Velleman 1981; Cox 2001). A hybrid approach is to use penalized estimation to construct a small subset of models, for example, the sequence of models occurring on a solution path, and then to apply interactive model-building techniques to choose a model from among these. We develop and advocate such a hybrid approach wherein a set of candidate models are identified using a solution path, and then models along this path are prioritized using their conditional probability of selection according to one or more tuning parameter selection methods. We envision this approach as being useful in at least two ways: (i) it facilitates interactive, expert-knowledge-driven exploration of high-quality candidate models even when the initial pool of models is large; and (ii) it provides valid conditional prediction sets for a data-driven tuning parameter given the observed design matrix and solution path, which is applicable for a large class of tuning parameter selection methods.
There is a vast literature on tuning parameter selection methods. Classical methods include Mallows' Cp (Mallows 1973), Akaike information criterion (AIC; Akaike 1974), Bayesian information criterion (BIC; Schwarz 1978), cross-validation, and generalized cross-validation (Golub, Heath, and Wahba 1979). More recent work on tuning parameter selection, driven by interest in high-dimensional data, includes new information-theoretic selection methods (Chen and Chen 2008; Wang, Li, and Leng 2009; Zhang, Li, and Tsai 2010; Wang and Zhu 2011; Kim, Kwon, and Choi 2012; Fan and Tang 2013; Hui, Warton, and Foster 2015) as well as resampling-based approaches (Hall, Lee, and Park 2009; Meinshausen and Bühlmann 2010; Feng and Yu 2013; Sun, Wang, and Fang 2013; Shah and Samworth 2013). The foregoing methods select a single tuning parameter and hence a single fitted model. Our goal is to quantify the stability of these methods by constructing conditional prediction sets for data-driven tuning parameters and to use these prediction sets to prioritize models for further, expert-guided exploration. Given one or more tuning parameter selection methods, we identify all models with sufficiently large conditional probability of being selected given the design matrix and observed solution path.
In Section 2, we review penalized linear regression. In Section 3, we derive exact and asymptotic estimators of the sampling distribution of a data-driven tuning parameter. We examine the performance of the proposed methods through simulation studies in Section 4. In Section 5, we illustrate the proposed methods using two data examples. A concluding discussion is given in Section 6. Technical details are relegated to the supplementary materials.
2. Penalized Linear Regression
We assume that the data are generated according to the linear model Yi = Xi⊺β0 + ϵi, for i = 1, … , n, where ϵ1, … , ϵn are independent, identically distributed errors with expectation zero, β0 = (β01, … , β0p)⊺, and X1, … , Xn are predictors that can be regarded as either fixed or random. Let Y = (Y1, Y2, … , Yn)⊺ be the vector of responses and X = (X1, X2, … , Xn)⊺ the design matrix with the first column equal to 1n×1. Let ℙn denote the empirical distribution. We consider penalized least-squares estimators of the form

β̂λ = argminβ ||Y − Xβ||2 + λ Σj=2p fj(βj; ℙn),

where fj(·), j = 2, … , p, are penalty functions. For example, fj(βj; ℙn) = |βj| corresponds to the Lasso, and fj(βj; ℙn) = |βj|/|β̃j|γ corresponds to the adaptive Lasso, where β̃ is the ordinary least-squares estimator and γ > 0 is a constant.
For any Λ ⊆ [0, ∞) define the solution path along Λ as β̂Λ = {β̂λ : λ ∈ Λ}; we write β̂ to denote β̂[0, ∞). While the solution path along Λ may contain a continuum of coefficient vectors, it is commonly viewed as containing a finite set of unique models corresponding to each unique combination of nonzero elements of the coefficient vectors in β̂Λ, that is, the set of distinct supports {j : β̂λj ≠ 0}, λ ∈ Λ. The number of models along the path is typically much smaller, for example, Op{min(n, p)}, than the set of all 2p possible models. Thus, the set of models along the solution path is a natural and computationally manageable subset of models for further investigation. Standard practice is to choose a single value of the tuning parameter, say λ̂, that optimizes some prespecified criterion and subsequently to report the single model corresponding to β̂λ̂. However, the selected tuning parameter is a random variable and there may be multiple models along the solution path that receive substantial support under its conditional distribution, for example, the models indexed by a τ upper-level set of the conditional distribution of λ̂ given β̂ and X. If these models can be identified from the observed data, then they can be reported as potential candidate models or a single model can be chosen from among them using expert judgment and other factors not captured in the estimation algorithm. Also, unlikely models can be ruled out. To formalize this procedure, we consider selection methods within the framework of the generalized information criterion.
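To make the candidate-set idea concrete, the following minimal sketch enumerates the distinct models along a Lasso solution path. It uses the glmnet package on simulated data purely for illustration; glmnet is not the software accompanying this article, and all object names are ours.

```r
# Sketch: enumerate the distinct models (support sets) along a Lasso solution
# path, the candidate set considered for closer, expert-guided examination.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(n)

fit <- glmnet(X, y)                      # Lasso path over a grid of lambda values
nz  <- as.matrix(fit$beta) != 0          # p x nlambda indicator of nonzero coefficients
supports <- lapply(seq_len(ncol(nz)), function(k) unname(which(nz[, k])))
models <- unique(supports)               # distinct candidate models along the path
length(models)                           # typically O(min(n, p)), far fewer than 2^p
```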
Define the generalized information criterion as

GIC(λ) = log(σ̂2λ) + wn dλ, (1)

where σ̂2λ = n−1||Y − Xβ̂λ||2, dλ is the number of nonzero coefficients in β̂λ, and wn is a sequence of positive constants, with wn = log(n)/n and wn = 2/n yielding BIC and AIC, respectively. We consider data-driven tuning parameters of the form λ̂ = argminλ∈Λ GIC(λ). We focus primarily on the setting where n > p, as the GIC is not well defined if p ≥ n. However, we provide an illustrative example in Section 5 where p > n, wherein our method is applied after an initial screening step; this two-stage procedure is in line with our vision for using automated methods to identify a small set of candidate models for further consideration. We also present extensions of key distributional approximations to the setting where p diverges with n in the Appendix.
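As a minimal sketch of BIC-type selection along the path, assuming the log-residual-sum-of-squares form of the GIC described above, the criterion can be evaluated at every value of λ returned by the path algorithm; this continues the objects X, y, and fit from the previous sketch.

```r
# Sketch: evaluate the GIC of Equation (1) along the Lasso path and select
# lambda_hat as its minimizer.
n  <- nrow(X)
wn <- log(n) / n                          # BIC weight; wn <- 2 / n gives AIC

yhat <- predict(fit, newx = X)            # n x nlambda matrix of fitted values
rss  <- colSums((y - yhat)^2)             # ||Y - X beta_lambda||^2 along the path
dfs  <- fit$df                            # number of nonzero coefficients at each lambda
gic  <- log(rss / n) + wn * dfs

lambda_hat <- fit$lambda[which.min(gic)]  # the data-driven tuning parameter
```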
3. Estimating the Conditional Distribution of λ̂
In this section, we characterize and derive estimators of the conditional distribution of λ̂ given β̂ and X. We first show that conditioning on β̂ and X is equivalent to conditioning on X⊺Y and X. We then show that λ̂ is a nondecreasing function of the full-model residual variance estimator σ̂2 = ||Y − Xβ̃||2/(n − p), where β̃ is the ordinary least-squares estimator. Therefore, the conditional distribution of λ̂ is completely determined by the conditional distribution of σ̂2.
3.1. Conditioning on the Solution Path
We assume that fj(βj; ℙn), j = 2, … , p, depends on the observed data only through X⊺Y and X⊺X; this assumption is natural as X⊺X and X⊺Y are sufficient statistics for the conditional mean of Y given X under the assumed linear model. Under this assumption, the objective function defining β̂λ depends on Y only through X⊺Y (apart from an additive constant ||Y||2 that does not affect the minimizer), from which it can be seen that the solution path is completely determined by X⊺X and X⊺Y. On the other hand, given β̂ and X, we can recover X⊺Y via X⊺Xβ̂0 = X⊺Y, where β̂0 is the unpenalized solution at λ = 0. Therefore, conditioning on the solution path and design matrix is equivalent to conditioning on X⊺Y and X (see Lemma C.1 in the Appendix).
In the case of the adaptive Lasso, we assume that X is full column rank so that fj(βj; ℙn), which depends on the ordinary least-squares estimator β̃, is well defined. It can be seen that if X is full column rank, then the entire solution path is determined by X⊺X and X⊺Y. Conditioning on the solution path is also practically relevant because it is consistent with the common practice wherein an analyst is presented with a full solution path and then proceeds to identify a model as a point along this path.
3.2. Exact Distribution of λ̂
We assume that the models along the solution path are determined by the finite sequence of tuning parameters λ1, … , λK, so that K is the total number of tuning parameters to be considered. The following lemma characterizes the conditional distribution of λ̂.
Lemma 1.
The selected tuning parameter, λ̂, is completely determined by β̂, X, and σ̂2. Furthermore, assume that ||Y − Xβ̂λ||2 is a nondecreasing function of λ and write λ̂ = λ(β̂, X, σ̂2); then, for each fixed β̂ = s and X = x, the map σ2 ↦ λ(s, x, σ2) is nondecreasing.
The assumption that ||Y − Xβ̂λ||2 is a nondecreasing function of λ holds under mild conditions, for example, if the original penalized problem can be recast as a constrained minimization problem of the form: minimize ||Y − Xβ||2 subject to the constraint Σj=2p fj(βj; ℙn) ≤ c(λ), where c(λ) is a decreasing function of λ. It is well known that the Lasso satisfies this property. If the errors are normally distributed, then (n − p)σ̂2/σ2 is independent of (β̂, X) and follows a chi-square distribution with n − p degrees of freedom. Therefore, the preceding lemma shows that, under normal errors, the conditional distribution of λ̂ given (β̂, X) is the distribution of a nondecreasing function of a chi-square random variable. The remaining results stated in this section do not require the assumption that ||Y − Xβ̂λ||2 is nondecreasing in λ; rather, the results are stated in terms of a finite but arbitrary sequence of tuning parameter values.
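Under our reading of Lemma 1 and its proof, conditional on the path the criterion at λk depends on the data only through the full-model error sum of squares s = ||Y − Xβ̃||2, via log{(s + Dk)/n} + wn dk with Dk = ||X(β̃ − β̂λk)||2. The following sketch implements this monotone selection map; it continues the objects from the previous sketches, and the decomposition used here is an assumption consistent with, but not copied from, the article.

```r
# Sketch: lambda_hat as a function of the full-model error sum of squares s,
# holding the observed solution path and design matrix fixed.
ols <- lm(y ~ X)                            # full least-squares fit (intercept included)
sse <- sum(resid(ols)^2)                    # ||Y - X beta_ols||^2
D_k <- colSums((fitted(ols) - yhat)^2)      # ||X(beta_ols - beta_lambda_k)||^2 along the path
d_k <- fit$df

lambda_of_s <- function(s) fit$lambda[which.min(log((s + D_k) / n) + wn * d_k)]
lambda_of_s(sse)                            # recovers lambda_hat above (up to numerical error)
```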
Define . For k = 1, … , , define , , , and
where wn is from Equation (1). The quantities in the foregoing definitions are all measurable with respect to X and β̂ and thus, for probability statements conditional on X and β̂, they are regarded as constants.
The following proposition gives the exact conditional distribution of λ̂ given β̂ and X.
Proposition 1.
Define with the convention that if is empty, and . Then,
Provided that the conditional distribution of σ̂2 given (β̂, X) is known or can be consistently estimated, the preceding proposition can be used to construct conditional prediction sets for λ̂. A (1 − α) × 100% conditional prediction set is a collection of tuning parameter values whose conditional probabilities sum to at least 1 − α. Alternatively, as discussed previously, one can construct the τ upper-level set {λk : pk ≥ τ}, where pk = P(λ̂ = λk | β̂, X), for any τ ∈ (0, 1).
Define and . If the errors are normally distributed then
(2)
Plugging σ̂2 in for σ2 in this expression yields an estimator of pk.
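A Monte Carlo counterpart of this plug-in idea, assuming normal errors: draw the full-model error sum of squares as σ̂2 times a chi-square variable with n − p degrees of freedom and push each draw through the selection map from the previous sketch. This is only a simulation-based stand-in for the closed-form expression in Equation (2); it continues the earlier objects.

```r
# Sketch: Monte Carlo plug-in estimate of p_k = P(lambda_hat = lambda_k | path).
p_full <- ncol(X) + 1                       # design columns, intercept included
sigma2 <- sse / (n - p_full)                # plug-in estimate of the error variance

sse_draws <- sigma2 * rchisq(1e4, df = n - p_full)
lam_draws <- vapply(sse_draws, lambda_of_s, numeric(1))
p_hat     <- prop.table(table(lam_draws))   # estimated conditional probabilities p_k
```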
Define . Then a (1 − α) × 100% projection confidence interval (Berger and Boos 1994) for pk (Equation (2)) is
(3)
where the interval in Equation (3) is obtained by evaluating pk over a (1 − α) × 100% confidence interval for σ2. Thus, an estimator of the τ upper-level set is
(4)
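A rough sketch of the projection idea behind Equations (3) and (4): evaluate pk(σ2) over a grid spanning a 90% chi-square confidence interval for σ2, take the envelope, and keep every λk whose upper bound exceeds τ. The grid-and-Monte-Carlo construction stands in for the exact computation and omits any level adjustment the article may use; it continues the objects from the previous sketches.

```r
# Sketch: projection-style upper bounds for p_k and the resulting candidate set.
p_k_at <- function(s2, B = 5000) {          # Monte Carlo p_k(sigma^2) at a given sigma^2
  lam <- vapply(s2 * rchisq(B, df = n - p_full), lambda_of_s, numeric(1))
  vapply(fit$lambda, function(l) mean(lam == l), numeric(1))
}

alpha <- 0.10
ci_s2 <- sse / qchisq(c(1 - alpha / 2, alpha / 2), df = n - p_full)  # 90% CI for sigma^2
grid  <- seq(ci_s2[1], ci_s2[2], length.out = 25)
p_env <- sapply(grid, p_k_at)               # nlambda x 25 matrix of p_k over the interval
upper <- apply(p_env, 1, max)               # upper envelope, in the spirit of Equation (3)

tau <- 0.05
fit$lambda[upper >= tau]                    # candidate tuning parameters, as in Equation (4)
```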
Remark 1. The assumption that X is full rank is not necessary for Proposition 1. Note that the conclusions depend only on the quantities , , and , which are computable even when X is not full rank.
3.3. Limiting Conditional Distribution of λ̂
As discussed above, if the errors are assumed to be normally distributed, then exact distribution theory for λ̂ is possible using a transformed chi-square random variable. Here, we consider asymptotic approximations that apply more generally.
Denote the third and fourth moments of ϵ by μ3,ϵ and μ4,ϵ, respectively. Define
Write Φp+1(t) to denote the cumulative distribution function of a standard (p + 1)-dimensional multivariate normal distribution evaluated at t. For u, v ∈ ℝp+1, write u ≤ v to mean component-wise inequality. The following proposition collects standard results from ordinary linear regression; the regularity conditions are summarized in Section C of the Appendix (see the proof of Proposition 2).
Proposition 2.
The asymptotic joint distribution of and is multivariate normal with mean zero and covariance Σ, that is,
Because we assume that X is full column rank, conditioning on (β̂, X) is equivalent to conditioning on (β̃, X) (in the sense that they generate the same σ-algebra). Therefore, to approximate the conditional distribution of λ̂ given (β̂, X), we construct an estimator of Σ, say Σ̂, and then use the above proposition to form a plug-in estimator of the conditional distribution. Define
(5)
and subsequently , , , , and . The estimated conditional distribution of is
(6)
This approximation, coupled with Proposition 1, can be used to approximate the conditional distribution of λ̂ when a chi-squared approximation is not feasible.
Henceforth, we assume that the errors are symmetric about zero, in which case the third moment of ϵi, μ3,ϵ, is zero, which implies that the two components in Proposition 2 are asymptotically independent. Therefore,
(7)
where μ4,ϵ is the fourth moment of ϵi. Define
Suppose that we have a (1 − α) × 100% asymptotic confidence region for μ4,ϵ − σ4 and σ2; then
(8)
is an approximate (1 − α) × 100% projection confidence interval for pk (see Proposition C.1 in Section C of the Appendix).
We construct the confidence set using a Wald confidence region:
where the covariance matrix appearing in the Wald region is estimated from the data. The optimization problem in Equation (9) is then solved using an augmented Lagrangian method (Bertsekas 2014). An estimator of the τ upper-level set is
(9)
Proposition 2 is stated in terms of fixed p and diverging n. We show in the Appendix that the approximation in Equation (7) remains valid when p = o(n1/2), as well as in the setting where p > n provided that an appropriate screening step is applied first. We illustrate this screening approach with the high-dimensional example of Section 5.2.
3.4. Bootstrap Approximation to the Distribution of λ̂
In small samples, it may be preferable to estimate the conditional distribution of λ̂ using the bootstrap. Let γ(b) = (γ1(b), … , γn(b))⊺ be a sample drawn with replacement from the residuals of the full least-squares fit. Define Y(b) = PxY + (I − Px)γ(b), where Px = X(X⊺X)−X⊺. This bootstrap method differs from the usual residual bootstrap in ordinary linear regression because our goal is to estimate the conditional distribution of λ̂. We accomplish this by multiplying the resampled error vector by (I − Px), which ensures that X⊺Y(b) = X⊺Y, so that Y(b) produces the same solution path as the original sample Y. The conditional distribution of the tuning parameter is estimated by generating b = 1, … , B bootstrap samples and calculating the corresponding tuning parameter for each bootstrap sample. See Proposition C.2 in Section C of the Appendix for a statement of the asymptotic equivalence between the proposed bootstrap method and the normal approximation given in Equation (6).
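A sketch of this bootstrap under our reading of the construction above: resample the full-model least-squares residuals, project the resampled vector off the column space of X so that the solution path is unchanged, and re-select the tuning parameter from the resulting error sum of squares. It reuses ols and lambda_of_s from the earlier sketches; raw (uncentered) residuals are used here for simplicity, which the article may or may not do.

```r
# Sketch of the bootstrap of Section 3.4: only the full-model SSE changes across
# bootstrap samples, so only the selection map needs to be re-evaluated.
res <- resid(ols)                            # full-model least-squares residuals
n_boot <- 5000
lam_boot <- replicate(n_boot, {
  gamma_b <- sample(res, n, replace = TRUE)  # resample the residuals
  e_b <- resid(lm(gamma_b ~ X))              # (I - P_X) gamma^(b)
  lambda_of_s(sum(e_b^2))                    # SSE of Y^(b); the path itself is unchanged
})
prop.table(table(lam_boot))                  # bootstrap estimate of the conditional pmf
```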
4. Simulation Studies
In this section, we investigate the finite-sample performance of the proposed methods using a series of simulation experiments. We focus on the Lasso tuned using BIC. Simulated datasets are generated from the model Yi = Xi⊺β0 + ϵi, where ϵi, i = 1, … , n, are generated independently from a standard normal distribution and Xi, i = 1, … , n, are generated independently from a multivariate normal distribution with mean zero and autoregressive covariance structure Cj,k = ρ|j−k|, with ρ = 0 or 0.5 and 1 ≤ j, k ≤ p, where p = 20 or 100. For the regression coefficients β0, we consider the following four settings:
Model 1: β0 = c1×(1, 1, 1, 1, 0, 0, 0, 0, … , 0)⊺;
Model 2: β0 = c2×(1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, … , 0)⊺;
Model 3: β0 = c3×(3, 2, 1, 0, 0, 0, 0, … , 0)⊺;
Model 4: β0 = c4×(3, 2, 1, 0, 0, 0, 0, 0, 3, 2, 1, 0, … , 0)⊺;
where c1, … , c4 are constants chosen so that the population R2 of each model is 0.5 under the definition R2 = 1 − var(Y|X)/var(Y). For each combination of parameter settings, 10,000 datasets were generated; the bootstrap estimator was constructed using 5000 bootstrap replications.
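A standalone sketch that generates one dataset from Model 1 with ρ = 0.5; the constant c1 is set so that β0⊺Cβ0 = var(ϵ) = 1, which gives population R2 = 0.5 under the definition above.

```r
# Sketch: one simulated dataset from Model 1 of the simulation study.
library(MASS)

n <- 50; p <- 20; rho <- 0.5
C <- rho^abs(outer(1:p, 1:p, "-"))         # AR(1) covariance C_{j,k} = rho^|j - k|
b <- c(rep(1, 4), rep(0, p - 4))           # Model 1 sparsity pattern
c1 <- 1 / sqrt(drop(t(b) %*% C %*% b))     # beta' C beta = 1 = var(eps) gives R^2 = 0.5
beta0 <- c1 * b

X <- mvrnorm(n, mu = rep(0, p), Sigma = C)
y <- drop(X %*% beta0) + rnorm(n)
```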
For estimating the τ upper-level set, we consider the following estimators:
(AsympNor) the plug-in estimator based on the normal approximation to the distribution of pk;
(Bootstrap) the estimator based on the bootstrap approximation to the sampling distribution of pk as described in Section 3.4;
(UP1) the estimator based on a 90% projection confidence set as in Equation (4);
(UP2) the estimator based on a 90% projection confidence set as in Equation (10);
(Akaike) the estimator based on Akaike weights: , with wn = 2/n (Burnham and Anderson 2003);
(ApproxPost) the estimator based on the approximate posterior distribution, with wn = log(n)/n (Burnham and Anderson 2003).
We define the performance of these estimators in terms of their true and false discovery rates. Provided that is nonempty, define the true discovery rate of an estimator as
where # denotes the number of elements in a set. Provided is nonempty with probability one, define the false discovery rate of an estimator as
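Under one plausible reading of these definitions, the two rates can be computed for each simulated dataset with the following helper functions, where Lambda_tau denotes the true τ upper-level set and Lambda_hat its estimate; the reported rates average these ratios over the replications.

```r
# Sketch of the discovery-rate summaries (one plausible reading of the text).
tdr <- function(Lambda_hat, Lambda_tau)       # share of the true set that is recovered
  length(intersect(Lambda_hat, Lambda_tau)) / length(Lambda_tau)

fdr <- function(Lambda_hat, Lambda_tau)       # share of reported values that are spurious
  length(setdiff(Lambda_hat, Lambda_tau)) / max(length(Lambda_hat), 1)
```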
Here, we present results for τ = 0.05; results for τ = 0.1 and τ = 0.2 are presented in the supplemental materials. The results for p = 20, n = 50 and p = 100, n = 200 are presented in Tables 1 and 2, respectively. AsympNor and Bootstrap perform similarly, with TDR above 0.90 and FDR below 0.10. As expected, methods based on the upper bound of the confidence interval achieve higher TDR, but at the price of higher FDR. The methods based on Akaike weights and the approximate posterior have the worst performance in recovering the upper-level set. This poor performance is not surprising, as these methods were not designed for conditional inference.
Table 1. True discovery rate (TDR) and false discovery rate (FDR) for estimating the τ = 0.05 upper-level set; n = 50, p = 20.

| Model | ρ | AsympNor TDR | AsympNor FDR | Bootstrap TDR | Bootstrap FDR | UP1 TDR | UP1 FDR | UP2 TDR | UP2 FDR | Akaike TDR | Akaike FDR | ApproxPost TDR | ApproxPost FDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.90 | 0.10 | 0.90 | 0.10 | 1.00 | 0.32 | 0.99 | 0.36 | 0.54 | 0.83 | 0.90 | 0.58 |
| 1 | 0.5 | 0.92 | 0.09 | 0.92 | 0.09 | 1.00 | 0.29 | 0.99 | 0.34 | 0.61 | 0.83 | 0.95 | 0.58 |
| 2 | 0 | 0.88 | 0.11 | 0.88 | 0.10 | 0.99 | 0.30 | 0.99 | 0.34 | 0.46 | 0.83 | 0.82 | 0.59 |
| 2 | 0.5 | 0.89 | 0.10 | 0.90 | 0.10 | 1.00 | 0.35 | 0.99 | 0.36 | 0.55 | 0.82 | 0.89 | 0.59 |
| 3 | 0 | 0.92 | 0.09 | 0.92 | 0.09 | 1.00 | 0.29 | 0.99 | 0.34 | 0.57 | 0.84 | 0.94 | 0.58 |
| 3 | 0.5 | 0.93 | 0.08 | 0.93 | 0.08 | 1.00 | 0.27 | 0.99 | 0.32 | 0.64 | 0.83 | 0.96 | 0.57 |
| 4 | 0 | 0.89 | 0.11 | 0.90 | 0.10 | 0.99 | 0.32 | 0.99 | 0.36 | 0.50 | 0.84 | 0.88 | 0.59 |
| 4 | 0.5 | 0.91 | 0.10 | 0.91 | 0.09 | 0.99 | 0.33 | 0.99 | 0.35 | 0.58 | 0.83 | 0.92 | 0.58 |
Table 2. True discovery rate (TDR) and false discovery rate (FDR) for estimating the τ = 0.05 upper-level set; n = 200, p = 100.

| Model | ρ | AsympNor TDR | AsympNor FDR | Bootstrap TDR | Bootstrap FDR | UP1 TDR | UP1 FDR | UP2 TDR | UP2 FDR | Akaike TDR | Akaike FDR | ApproxPost TDR | ApproxPost FDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.12 | 1.00 | 0.13 | 0.11 | 0.98 | 1.00 | 0.58 |
| 1 | 0.5 | 0.98 | 0.03 | 0.98 | 0.02 | 1.00 | 0.09 | 1.00 | 0.10 | 0.24 | 0.95 | 1.00 | 0.56 |
| 2 | 0 | 0.95 | 0.04 | 0.95 | 0.04 | 1.00 | 0.17 | 1.00 | 0.18 | 0.05 | 0.99 | 0.98 | 0.59 |
| 2 | 0.5 | 0.96 | 0.03 | 0.96 | 0.03 | 1.00 | 0.14 | 1.00 | 0.15 | 0.12 | 0.97 | 0.99 | 0.58 |
| 3 | 0 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.12 | 1.00 | 0.13 | 0.16 | 0.97 | 1.00 | 0.57 |
| 3 | 0.5 | 0.98 | 0.02 | 0.98 | 0.02 | 1.00 | 0.09 | 1.00 | 0.10 | 0.28 | 0.94 | 1.00 | 0.56 |
| 4 | 0 | 0.96 | 0.04 | 0.96 | 0.04 | 1.00 | 0.14 | 1.00 | 0.16 | 0.07 | 0.98 | 0.99 | 0.59 |
| 4 | 0.5 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.12 | 1.00 | 0.13 | 0.16 | 0.97 | 1.00 | 0.57 |
We also evaluated the coverage of the proposed confidence intervals based on the normality assumption as well as the asymptotic approximation. In calculating the coverage probabilities, we restricted calculations to the set {λ : 0.0001 < pk < 0.9999}. Nominal coverage is set at 0.90. The results are presented in Table 3. The confidence intervals based on normality (Equation (3)) achieve nominal coverage in all cases. The confidence intervals based on the asymptotic approximation undercover slightly, though coverage approaches the nominal level as n increases.
Table 3. Empirical coverage of nominal 90% confidence intervals based on the asymptotic approximation and on the normality assumption.

| Setting | Model | Approximate, ρ = 0 | Approximate, ρ = 0.5 | Normality, ρ = 0 | Normality, ρ = 0.5 |
|---|---|---|---|---|---|
| n = 50, p = 20 | Model 1 | 0.86 | 0.85 | 0.92 | 0.91 |
| | Model 2 | 0.86 | 0.85 | 0.91 | 0.91 |
| | Model 3 | 0.85 | 0.85 | 0.91 | 0.91 |
| | Model 4 | 0.86 | 0.85 | 0.92 | 0.91 |
| n = 200, p = 100 | Model 1 | 0.88 | 0.88 | 0.91 | 0.90 |
| | Model 2 | 0.87 | 0.88 | 0.91 | 0.91 |
| | Model 3 | 0.87 | 0.87 | 0.91 | 0.90 |
| | Model 4 | 0.87 | 0.88 | 0.91 | 0.90 |
5. Illustrative Data Examples
In this section, we apply the proposed methods to two datasets. The first concerns the relationship between age-adjusted mortality and pollution along with other factors related to urban living (McDonald and Schwing 1973; Luo, Stefanski, and Boos 2006); the second concerns the relationship between gene mutations and drug resistance level (Rhee et al. 2006). We demonstrate that reporting a single model may not be appropriate in these two examples and that the proposed methods have the potential to identify interesting models warranting further examination. As in the simulation experiments, we consider the Lasso estimator tuned using BIC.
5.1. Pollution and Mortality
As our first illustrative example we consider data on mortality rates recorded in 60 metropolitan areas. Prior analyses of these data focused on the regression of age-adjusted mortality on 15 predictors that are grouped into three broad categories: weather, socioeconomic factors, and pollution. A copy of the dataset and a detailed description of each predictor are provided in the supplemental materials.
Ignoring uncertainty in the tuning parameter selection, the Lasso estimator tuned using BIC leads to a model with six variables: Percent Non-White, Education, SO2 Pollution Potential, Precipitation, Mean January Temperature, and Population Per Mile. However, the estimated conditional sampling distribution of the tuning parameter indicates that a larger model with eight predictors is approximately equally probable. Figure 1 displays the solution path with the estimated selection probabilities using both the asymptotic normal and bootstrap approximations. The estimated conditional distribution of the tuning parameter is displayed in Table 4; 90% confidence intervals based on Equations (3) and (9) are presented. The Lasso coefficient estimates are presented in Table B.1 of the Appendix.
Table 4. Estimated conditional distribution of the tuning parameter for the pollution and mortality data, with 90% confidence intervals.

| λ | 288.20 | 124.21 |
|---|---|---|
| Model size | 6 | 8 |
| Probability mass (normal) | 0.51 | 0.49 |
| Probability mass (bootstrap) | 0.48 | 0.52 |
| 90% CI based on normality | (0.052, 0.950) | (0.048, 0.948) |
| 90% approximate CI | (0.009, 0.848) | (0.154, 1.000) |
We further investigate the model with eight predictors. The eighth predictor added to the model is Mean July Temperature. Fitting a simple linear regression of mortality on Mean July Temperature yields a p-value of 0.03 compared with a p-value of 0.81 in the regression of mortality on Mean January Temperature. This is in line with Katsouyanni et al. (1993) who concluded that high temperatures are related to the mortality rate. It can be seen that in this case, reporting a single model may not be appropriate. Rather, it may be more informative to report the two models that contain essentially all of the mass of the conditional distribution of the tuning parameter.
For comparison, results from forward stepwise regression are presented in Figure 2. The smallest BIC value occurs at Step 5, which corresponds to a model with five predictors: Percent Non-White, Education, Mean January Temperature, SO2 Pollution Potential, and Precipitation. This model is smaller than the one selected by the Lasso tuned using BIC. This may be because forward stepwise regression is greedy: at each step it seeks the variable that captures the maximum remaining variation in the residuals, so if a candidate variable is correlated with those selected in previous steps, it may appear to offer little improvement to the fitted model. In such cases, it might be preferable to use the solution path to generate a candidate set of models.
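For reference, a forward stepwise fit of this kind can be sketched with base R's step() using k = log(n), which makes the criterion BIC; the data frame pollution with response mortality is a placeholder for the dataset described above.

```r
# Sketch: forward stepwise selection under BIC with base R.
preds <- setdiff(names(pollution), "mortality")
fwd <- step(lm(mortality ~ 1, data = pollution),
            scope = reformulate(preds, response = "mortality"),
            direction = "forward",
            k = log(nrow(pollution)))        # k = log(n) gives BIC-based steps
```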
5.2. ATV Drug Resistance
Our second example considers mutations that affect resistance to Atazanavir (ATV), a protease inhibitor for HIV (Rhee et al. 2006; Barber and Candès 2015). After preprocessing, the dataset contains 328 observations and 361 gene-mutation predictors. The response is a measure of drug resistance to ATV. Because p > n in this example, we use 100 observations to screen for the 50 most important predictors, ranked by Pearson correlation with the response. We then fit a linear model using the Lasso applied to the remaining 228 observations and the 50 predictors selected at screening.
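A sketch of this two-stage analysis; X_atv (328 × 361) and y_atv are placeholder names for the preprocessed data, and glmnet again stands in for the software accompanying the article.

```r
# Sketch: correlation screening on 100 held-out observations, then a Lasso fit
# on the remaining observations restricted to the 50 screened predictors.
set.seed(2)
idx <- sample(nrow(X_atv), 100)                    # observations used only for screening
cors <- abs(cor(X_atv[idx, ], y_atv[idx]))         # marginal Pearson correlations
keep <- order(cors, decreasing = TRUE)[1:50]       # 50 top-ranked mutation indicators

fit_atv <- glmnet(X_atv[-idx, keep], y_atv[-idx])  # Lasso on the remaining 228 observations
# BIC evaluation along fit_atv then proceeds as in the earlier sketches.
```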
The estimated conditional distribution of the tuning parameter and 90% confidence intervals are presented in Table 5; the estimated distribution is overlaid on the solution path in Figure 3. It can be seen that the estimated distribution of the tuning parameter mainly favors two models.
Table 5. Estimated conditional distribution of the tuning parameter for the ATV drug resistance data, with 90% confidence intervals.

| λ | 1347.92 | 1101.71 | 805.53 |
|---|---|---|---|
| Model size | 10 | 12 | 15 |
| Probability mass (normal) | 0.443 | 0.045 | 0.472 |
| Probability mass (bootstrap) | 0.494 | 0.047 | 0.454 |
| 90% CI based on normality | (0.019, 0.896) | (0.026, 0.127) | (0.059, 0.959) |
| 90% approximate CI | (0.000, 0.751) | (0.020, 0.064) | (0.020, 1.000) |
Similar to Barber and Candès (2015), we evaluate candidate models using treatment-selected mutation (TSM) panels, which provide a surrogate for the truly important mutations. The model minimizing BIC contains 15 variables, two of which correspond to the same mutation. This leads to 14 unique mutation locations, of which four are potential false discoveries (as assessed by TSM); see Table 5. Therefore, the surrogate-based estimated false discovery rate is 4/14 ≈ 0.29. For tuning parameter λ = 1101.71, 11 unique positions are identified, two of which are potential false discoveries. Tuning parameter λ = 1347.92 leads to nine unique locations with one potential false discovery, for a surrogate-based estimated false discovery rate of 1/9 ≈ 0.11, compared with 0.29 for the model minimizing BIC. Thus, in this case it might not be appropriate to report only the single model selected by BIC.
6. Conclusion
We proposed two simple procedures for estimating the conditional distribution of a data-driven tuning parameter in penalized regression given the observed solution path and design matrix. Our objective was to quantify the stability of selected models and thereby identify a set of potential models for consideration by domain experts. A plot of the solution path with the estimated selection probabilities or upper confidence bounds overlaid, for example, Figures 1 and 3, is one means of easily conveying uncertainty in the tuning parameter and identifying models that warrant additional investigation. It is noteworthy that in both examples the identified sets of likely models are not contiguous in size. Thus, our methods provide a theoretically motivated, confidence-set-based alternative to the practice of considering models near in size to the BIC-chosen Lasso model.
Acknowledgments
Funding
The authors gratefully acknowledge funding from the National Science Foundation (DMS-1555141, DMS-1557733, DMS-1513579) and the National Institutes of Health (P01 CA142538).
Appendix A: Simulation Results
Table A.1.

| Model | ρ | AsympNor TDR | AsympNor FDR | Bootstrap TDR | Bootstrap FDR | UP1 TDR | UP1 FDR | UP2 TDR | UP2 FDR | Akaike TDR | Akaike FDR | ApproxPost TDR | ApproxPost FDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.90 | 0.10 | 0.90 | 0.10 | 1.00 | 0.33 | 0.99 | 0.37 | 0.42 | 0.80 | 0.89 | 0.48 |
| 1 | 0.5 | 0.92 | 0.08 | 0.92 | 0.08 | 1.00 | 0.30 | 0.99 | 0.33 | 0.48 | 0.78 | 0.93 | 0.48 |
| 2 | 0 | 0.87 | 0.11 | 0.88 | 0.11 | 0.99 | 0.33 | 0.99 | 0.36 | 0.35 | 0.81 | 0.81 | 0.48 |
| 2 | 0.5 | 0.89 | 0.10 | 0.89 | 0.11 | 1.00 | 0.35 | 0.99 | 0.36 | 0.45 | 0.78 | 0.87 | 0.47 |
| 3 | 0 | 0.91 | 0.09 | 0.92 | 0.09 | 1.00 | 0.29 | 0.99 | 0.34 | 0.44 | 0.80 | 0.93 | 0.48 |
| 3 | 0.5 | 0.92 | 0.07 | 0.93 | 0.08 | 1.00 | 0.26 | 0.99 | 0.30 | 0.50 | 0.78 | 0.96 | 0.49 |
| 4 | 0 | 0.89 | 0.10 | 0.90 | 0.10 | 1.00 | 0.34 | 0.99 | 0.37 | 0.39 | 0.80 | 0.87 | 0.48 |
| 4 | 0.5 | 0.90 | 0.10 | 0.91 | 0.10 | 1.00 | 0.34 | 0.99 | 0.35 | 0.46 | 0.79 | 0.91 | 0.48 |
Table A.2.

| Model | ρ | AsympNor TDR | AsympNor FDR | Bootstrap TDR | Bootstrap FDR | UP1 TDR | UP1 FDR | UP2 TDR | UP2 FDR | Akaike TDR | Akaike FDR | ApproxPost TDR | ApproxPost FDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.13 | 1.00 | 0.14 | 0.07 | 0.97 | 0.99 | 0.42 |
| 1 | 0.5 | 0.97 | 0.02 | 0.97 | 0.02 | 1.00 | 0.10 | 1.00 | 0.10 | 0.15 | 0.93 | 1.00 | 0.40 |
| 2 | 0 | 0.95 | 0.04 | 0.95 | 0.04 | 1.00 | 0.17 | 1.00 | 0.18 | 0.03 | 0.99 | 0.98 | 0.42 |
| 2 | 0.5 | 0.96 | 0.04 | 0.96 | 0.04 | 1.00 | 0.14 | 1.00 | 0.15 | 0.07 | 0.97 | 0.99 | 0.42 |
| 3 | 0 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.12 | 1.00 | 0.13 | 0.09 | 0.96 | 0.99 | 0.42 |
| 3 | 0.5 | 0.98 | 0.02 | 0.98 | 0.02 | 1.00 | 0.09 | 1.00 | 0.10 | 0.17 | 0.93 | 1.00 | 0.40 |
| 4 | 0 | 0.96 | 0.04 | 0.96 | 0.04 | 1.00 | 0.14 | 1.00 | 0.15 | 0.04 | 0.98 | 0.99 | 0.42 |
| 4 | 0.5 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.12 | 1.00 | 0.13 | 0.09 | 0.96 | 0.99 | 0.42 |
Table A.3.

| Model | ρ | AsympNor TDR | AsympNor FDR | Bootstrap TDR | Bootstrap FDR | UP1 TDR | UP1 FDR | UP2 TDR | UP2 FDR | Akaike TDR | Akaike FDR | ApproxPost TDR | ApproxPost FDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.89 | 0.11 | 0.89 | 0.11 | 0.99 | 0.34 | 0.99 | 0.39 | 0.25 | 0.69 | 0.84 | 0.22 |
| 1 | 0.5 | 0.91 | 0.08 | 0.91 | 0.08 | 1.00 | 0.29 | 0.99 | 0.34 | 0.29 | 0.65 | 0.90 | 0.22 |
| 2 | 0 | 0.86 | 0.13 | 0.86 | 0.13 | 0.99 | 0.36 | 0.99 | 0.41 | 0.22 | 0.73 | 0.77 | 0.23 |
| 2 | 0.5 | 0.88 | 0.11 | 0.88 | 0.12 | 0.99 | 0.37 | 0.99 | 0.40 | 0.28 | 0.65 | 0.83 | 0.23 |
| 3 | 0 | 0.91 | 0.09 | 0.91 | 0.09 | 1.00 | 0.30 | 0.99 | 0.35 | 0.26 | 0.68 | 0.89 | 0.23 |
| 3 | 0.5 | 0.92 | 0.07 | 0.92 | 0.08 | 1.00 | 0.26 | 0.99 | 0.31 | 0.31 | 0.64 | 0.92 | 0.23 |
| 4 | 0 | 0.88 | 0.11 | 0.88 | 0.11 | 0.99 | 0.35 | 0.99 | 0.40 | 0.23 | 0.71 | 0.83 | 0.22 |
| 4 | 0.5 | 0.89 | 0.10 | 0.90 | 0.10 | 0.99 | 0.34 | 0.99 | 0.38 | 0.28 | 0.66 | 0.86 | 0.23 |
Table A.4.

| Model | ρ | AsympNor TDR | AsympNor FDR | Bootstrap TDR | Bootstrap FDR | UP1 TDR | UP1 FDR | UP2 TDR | UP2 FDR | Akaike TDR | Akaike FDR | ApproxPost TDR | ApproxPost FDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.13 | 1.00 | 0.13 | 0.03 | 0.94 | 0.98 | 0.22 |
| 1 | 0.5 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.10 | 1.00 | 0.11 | 0.07 | 0.88 | 0.99 | 0.21 |
| 2 | 0 | 0.95 | 0.05 | 0.95 | 0.05 | 1.00 | 0.17 | 1.00 | 0.18 | 0.01 | 0.98 | 0.96 | 0.21 |
| 2 | 0.5 | 0.96 | 0.04 | 0.96 | 0.04 | 1.00 | 0.14 | 1.00 | 0.15 | 0.03 | 0.93 | 0.98 | 0.21 |
| 3 | 0 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.12 | 1.00 | 0.12 | 0.04 | 0.91 | 0.99 | 0.21 |
| 3 | 0.5 | 0.98 | 0.03 | 0.98 | 0.03 | 1.00 | 0.09 | 1.00 | 0.10 | 0.09 | 0.86 | 0.99 | 0.20 |
| 4 | 0 | 0.96 | 0.04 | 0.96 | 0.04 | 1.00 | 0.15 | 1.00 | 0.16 | 0.02 | 0.96 | 0.97 | 0.21 |
| 4 | 0.5 | 0.97 | 0.03 | 0.97 | 0.03 | 1.00 | 0.12 | 1.00 | 0.13 | 0.04 | 0.92 | 0.99 | 0.21 |
Appendix B: Additional Results for Real-Data
Table B.1. Coefficient estimates for the pollution and mortality data: Lasso fits (and refits) at the eight- and six-variable models, and the five-predictor forward stepwise model (FS-5).

| Variable | Lasso BIC-8 | Lasso Refit-8 | Lasso BIC-6 | Lasso Refit-6 | FS-5 |
|---|---|---|---|---|---|
| Mean annual precipitation | 14.95 | 17.76 | 11.78 | 14.85 | 14.86 |
| Mean January temperature | −12.04 | −14.26 | −8.80 | −16.61 | −16.50 |
| Mean July temperature | −5.80 | −11.71 | 0 | 0 | 0 |
| Median school years | −8.89 | −8.05 | −9.99 | −9.75 | −10.80 |
| Pct of housing units with facilities | −2.55 | −4.69 | 0 | 0 | 0 |
| Population per square mile | 5.15 | 7.11 | 2.62 | 6.039 | 0 |
| Pct of non-White | 35.38 | 39.98 | 30.04 | 36.98 | 36.27 |
| Pollution potential of sulfur dioxide | 14.47 | 14.92 | 13.66 | 15.51 | 18.00 |
Appendix C: Proof and Technical Details
Lemma C.1.
If the penalty function fj(βj; ℙn) depends on the data only through X⊺Y and X, then the distribution of conditional on and X is equal to the distribution of conditional on X⊺Y and X.
Proof. Under the stated assumption, the objective function defining β̂λ depends on the data only through X⊺X and X⊺Y, from which it can be seen that the solution path is completely determined by X⊺X and X⊺Y. On the other hand, given β̂ and X, we can recover X⊺Y using X⊺Xβ̂0 = X⊺X{(X⊺X)−1X⊺Y} = X⊺Y. □
Lemma C.2.
Suppose that b(·) and c(·) are nonnegative-valued functions defined on [0, ∞] such that b(λ) is nondecreasing for λ ≥ 0 with b(0) = 0. For x ≥ 0, define

H(x, λ) = log{x + b(λ)} + c(λ)

and

λ(x) = argminλ≥0 H(x, λ).
Then λ(x) is nondecreasing in x ≥ 0.
Proof of Lemma C.2. Suppose x1 ≤ x2; we need to show that λ(x1) ≤ λ(x2). First, consider the difference of H(x2, λ) and H(x1, λ),
which is nonnegative for every λ and nonincreasing in λ. Therefore if λ(x1) = argminλ H(x1, λ), it follows that
The last inequality follows from the fact that log{1 + (x2 − x1)/(x1 + b(λ))} is nonnegative and nonincreasing with respect to λ. □
Proof of Lemma 1. Recall that the information criterion can also be expressed as

GIC(λ) = log[{||Y − Xβ̃||2 + Dλ}/n] + wn dλ,

where Dλ = ||X(β̃ − β̂λ)||2.
Because Dλ is a deterministic function of λ conditional on the solution path and design matrix, the only variability in GIC(λ) is due to ||Y − Xβ̃||2. Therefore, λ̂ is a function of β̂, X, and ||Y − Xβ̃||2 (equivalently, of σ̂2).
Monotonicity then follows immediately by observing that Dλ is a nondecreasing function of λ ≥ 0 with D0 = 0 and invoking Lemma C.2. □
Lemma C.3.
Let and X be fixed. If , then iff and if , then iff .
Proof of Lemma C.3. Consider the case ,
The case follows by a similar argument. □
Proof of Proposition 1. The proof follows from the fact that
□
To prove Proposition 2, we assume:
(A1F): under a fixed design , , where C ∈ ℝp×p is nonnegative definite and μxx ∈ ℝp;
(A1R): under a random design, with probability one, , and , where C ∈ ℝp×p is nonnegative definite and μx ∈ ℝp;
(A2): .
Under assumptions (A1F) and (A2), we have the following well-known results, which facilitate the proof of Proposition 2.
Lemma C.4.
Proof of Proposition 2. First consider the fixed design model. Let
Then is a solution to the equation
A Taylor series expansion around the true value (β0, ) results in
where ψ′ is the derivative of ψ and
Rearranging it leads to
Because −, it follows that
by consistency of .
Then, by the multivariate Lindeberg–Feller CLT,
Finally, is op(1) as of
Therefore by Slutsky’s theorem,
(C.1) |
Then, for the random design, because and almost surely, assumption A1F holds for almost every sequence x1, x2, …. Therefore Equation (C.1) holds for almost every sequence x1, x2,…. □
Lemma C.5.
Suppose ϵ1, ϵ2, … , ϵn are normally distributed with mean zero and variance σ2; then the plug-in estimator of pk need not be consistent, where pk and its estimator are as defined in Section 3.2.
Proof. We show that there exist sequences for which is not consistent for pk. Considering the case X⊺X = n × I2 with X1 = 1n×1 and = 1, then
where and corresponds least-square estimate for X2. And
For the sequence of , where c is any constant, it follows that . And then
which is not op(1). Therefore, is not a consistent estimator for p1. □
Proposition C.1.
Assume the distribution of ϵi, i = 1, … , n is symmetric about zero, then for any ϵ > 0,
(C.2) |
where the region is an asymptotic (1 − α) × 100% confidence region for μ4,ϵ − σ4 and σ2.
Proof. Denote the event that as A,
□
Lemma C.6.
For any s ≥ 1, assume , then
where ms = E|ϵ1|s.
Proof of Lemma C.6.
But and , so that . Then by Strong Law of Large Numbers, . And thus . □
Lemma C.7.
Assume (A1F) and (A2), then
conditionally almost surely.
Proof of Lemma C.7. Denote , then we have
Then by noting conditionally almost surely (Theorem 2.2 of Freedman 1981), is op(1) conditionally almost surely. □
Proposition C.2.
Under the assumptions (A1F) and (A2), and further assuming that E|ϵi|4+δ < ∞ for some δ > 0, then conditionally almost surely.
Proof of Proposition C.2. Recall
By Lemma C.7, it has, almost surely, the same asymptotic distribution as n−1γ(b)⊺γ(b). Then, because γ1(b), γ2(b), … , γn(b) are sampled from a different distribution for every n, the Lindeberg central limit theorem is used to obtain the asymptotic distribution. The conditional mean of n−1γ(b)⊺γ(b) is
The conditional variance is
Use Lemma C.6, and . So the conditional variance converges to almost surely.
Then to verify the Lyapunov condition,
which is o(1) almost surely by invoking Lemma C.6. And thus conditionally almost surely. □
C1. Theoretical Results for High Dimensions
Proposition C.3.
If p = o(n1/2), then
Proof of Proposition C.3. We know , where converges to in distribution. It remains to prove .
By expectation of quadratic form, we have . Therefore, . This completes the proof. □
Now we study the distribution of the variance estimator after screening. First, we restate Theorem 1 of Fan and Lv (2008) with a slight modification. Let A0 denote the index set of the true nonzero regression coefficients and S the screened subset. Assume Conditions 1–4 in Fan and Lv (2008) hold for some 2κ + τ < 1/2; then we have the following result.
Theorem C.1 (Accuracy of SIS). Under Conditions 1–4 in Fan and Lv (2008), if 2κ + τ < 1/2, then there exists θ > 1/2 such that
where C is a positive constant, and the size of S is O(n1−θ).
From the above result, we have a screening approach that reduces the number of predictors from a huge scale, O(exp(nc)), to a smaller scale, O(n1−θ) with θ > 1/2. Denote X = (X(1)⊺, X(2)⊺)⊺, where X(1) and X(2) correspond to the first and second halves of the design matrix, respectively. Similarly, define Y(1) and Y(2). Then the variance estimator is defined as
where m = n/2 and the projection matrix is constructed from the screened subset S and the second half of the design matrix, X(2).
Proposition C.4.
Under Conditions 1–4 in Fan and Lv (2008), if 2κ + τ < 1/2, then
Proof of Proposition C.4.
For the first term, we know converges to . It remains to prove the remaining term are op(1). For the second term, we know
Therefore it is op(1). For the third term,
where the last inequality follows Condition 3 in Fan and Lv (2008), var(Y ) = O(1). And thus it is op(1). For the last term,
which is o(1). This completes the proof. □
Footnotes
Color versions of one or more of the figures in this article are available online at www.tandfonline.com/r/TECH.
Supplementary materials for this article are available online. Please go to http://www.tandfonline.com/r/TECH
Supplementary Materials
Simulation results: Simulation results for τ = 0.1 and 0.2 are presented in the online supplement to this article.
Proofs and technical details: Detailed proofs are provided in the online supplement to this article.
R package: R package for proposed methods.
References
- Akaike H (1974), "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, 19, 716–723.
- Barber RF, and Candès EJ (2015), "Controlling the False Discovery Rate Via Knockoffs," The Annals of Statistics, 43, 2055–2085.
- Berger RL, and Boos DD (1994), "P Values Maximized Over a Confidence Set for the Nuisance Parameter," Journal of the American Statistical Association, 89, 1012–1016.
- Bertsekas DP (2014), Constrained Optimization and Lagrange Multiplier Methods, Boston, MA: Academic Press.
- Burnham KP, and Anderson D (2003), Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach, New York: Springer.
- Chen J, and Chen Z (2008), "Extended Bayesian Information Criteria for Model Selection with Large Model Spaces," Biometrika, 95, 759–771.
- Cox DR (2001), "Statistical Modeling: The Two Cultures: Comment," Statistical Science, 16, 216–218.
- Fan J, and Li R (2001), "Variable Selection Via Nonconcave Penalized Likelihood and its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
- Fan J, and Lv J (2008), "Sure Independence Screening for Ultrahigh Dimensional Feature Space," Journal of the Royal Statistical Society, Series B, 70, 849–911.
- Fan Y, and Tang CY (2013), "Tuning Parameter Selection in High Dimensional Penalized Likelihood," Journal of the Royal Statistical Society, Series B, 75, 531–552.
- Feng Y, and Yu Y (2013), "Consistent Cross-Validation for Tuning Parameter Selection in High-Dimensional Variable Selection," arXiv preprint arXiv:1308.5390.
- Freedman DA (1981), "Bootstrapping Regression Models," The Annals of Statistics, 9, 1218–1228.
- Golub GH, Heath M, and Wahba G (1979), "Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter," Technometrics, 21, 215–223.
- Hall P, Lee ER, and Park BU (2009), "Bootstrap-Based Penalty Choice for the Lasso, Achieving Oracle Performance," Statistica Sinica, 19, 449–471.
- Henderson HV, and Velleman PF (1981), "Building Multiple Regression Models Interactively," Biometrics, 37, 391–411.
- Hui FK, Warton DI, and Foster SD (2015), "Tuning Parameter Selection for the Adaptive Lasso Using ERIC," Journal of the American Statistical Association, 110, 262–269.
- Katsouyanni K, Pantazopoulou A, Touloumi G, Tselepidaki I, Moustris K, Asimakopoulos D, Poulopoulou G, and Trichopoulos D (1993), "Evidence for Interaction Between Air Pollution and High Temperature in the Causation of Excess Mortality," Archives of Environmental Health: An International Journal, 48, 235–242.
- Kim Y, Kwon S, and Choi H (2012), "Consistent Model Selection Criteria on High Dimensions," Journal of Machine Learning Research, 13, 1037–1057.
- Luo X, Stefanski LA, and Boos DD (2006), "Tuning Variable Selection Procedures by Adding Noise," Technometrics, 48, 165–175.
- Mallows CL (1973), "Some Comments on Cp," Technometrics, 15, 661–675.
- McDonald GC, and Schwing RC (1973), "Instabilities of Regression Estimates Relating Air Pollution to Mortality," Technometrics, 15, 463–481.
- Meinshausen N, and Bühlmann P (2010), "Stability Selection," Journal of the Royal Statistical Society, Series B, 72, 417–473.
- Rhee S-Y, Taylor J, Wadhera G, Ben-Hur A, Brutlag DL, and Shafer RW (2006), "Genotypic Predictors of Human Immunodeficiency Virus Type 1 Drug Resistance," Proceedings of the National Academy of Sciences, 103, 17355–17360.
- Schwarz G (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
- Shah RD, and Samworth RJ (2013), "Variable Selection with Error Control: Another Look at Stability Selection," Journal of the Royal Statistical Society, Series B, 75, 55–80.
- Sun W, Wang J, and Fang Y (2013), "Consistent Selection of Tuning Parameters Via Variable Selection Stability," The Journal of Machine Learning Research, 14, 3419–3440.
- Tibshirani R (1996), "Regression Shrinkage and Selection Via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
- Wang H, Li B, and Leng C (2009), "Shrinkage Tuning Parameter Selection with a Diverging Number of Parameters," Journal of the Royal Statistical Society, Series B, 71, 671–683.
- Wang T, and Zhu L (2011), "Consistent Tuning Parameter Selection in High Dimensional Sparse Linear Regression," Journal of Multivariate Analysis, 102, 1141–1151.
- Zhang Y, Li R, and Tsai C-L (2010), "Regularization Parameter Selections Via Generalized Information Criterion," Journal of the American Statistical Association, 105, 312–323.
- Zou H (2006), "The Adaptive Lasso and its Oracle Properties," Journal of the American Statistical Association, 101, 1418–1429.
- Zou H, and Hastie T (2005), "Regularization and Variable Selection Via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.