Author manuscript; available in PMC: 2012 Jul 25.
Published in final edited form as: J Am Stat Assoc. 2012 Jan 24;106(496):1371–1382. doi: 10.1198/jasa.2011.tm10382

A Perturbation Method for Inference on Regularized Regression Estimates

Jessica Minnier 1, Lu Tian 2, Tianxi Cai 3
PMCID: PMC3404855  NIHMSID: NIHMS390937  PMID: 22844171

Abstract

Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high dimensional data. These methods simultaneously select important features and provide stable estimation of their effects. The adaptive LASSO and SCAD, for instance, give consistent and asymptotically normal estimates with oracle properties. However, in finite samples, it remains difficult to obtain interval estimators for the regression parameters. In this paper, we propose perturbation resampling based procedures to approximate the distribution of a general class of penalized parameter estimates. Our proposal, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions. Through finite sample simulations, we verify the ability of this method to give accurate inference and compare it to other widely used standard deviation and confidence interval estimates. We also illustrate our proposals with a data set used to study the association of HIV drug resistance and a large number of genetic mutations.

Keywords: High dimensional regression, Interval estimation, Oracle property, Regularized estimation, Resampling methods

1. INTRODUCTION

Accurate prediction of disease outcomes is fundamental for successful disease prevention and treatment selection. Recent advancement in biological and genomic research has led to the discovery of a vast number of new markers that can potentially be used to develop molecular disease prevention and intervention strategies. For example, gene expression analyses have identified molecular subtypes that are associated with differential prognosis and response to treatment for breast cancer patients (Perou et al. 2000; Dent et al. 2007). For non-small cell lung cancer patients, a composite score consisting of several biological markers including cyclin E and Ki-67 was shown to be highly predictive of patient survival (Dosaka-Akita et al. 2001). However, construction of accurate prediction models with a panel of markers is a difficult task in general. For example, statistical models for calculating individual cancer risk have been developed for a few types of cancer in the past two decades (Gail et al. 1989; Thompson et al. 2006; Cassidy et al. 2008; Freedman et al. 2009). However, much refinement is needed even for the best of these models due to their limited discriminatory accuracy (Spiegelman et al. 1994; Gail and Costantino 2001).

The increasing availability of new potential markers, while holding great promise for better prediction of disease outcomes, imposes challenges to model development due to the high dimensionality in the feature space and the relatively small sample size. To improve prediction with a large number of promising genomic or biological markers, an important step is to build a parsimonious model that only includes important markers. Such a model could reduce the cost associated with unnecessary marker measurements and improve the prediction precision for future patients. For such purposes, various regularization procedures such as the LASSO (Tibshirani 1996; Knight and Fu 2000), the SCAD (Fan and Li 2001, 2002, 2004; Zhang et al. 2006), the adaptive LASSO (ALASSO; Zou 2006; Wang and Leng 2007), the Elastic Net (Zou and Hastie 2005; Zou and Zhang 2009), and one-step local linear approximation (LLA; Zou and Li 2008) have been developed in recent years. These procedures simultaneously identify non-informative variables and produce coefficient estimates for the selected variables to induce a model for prediction.

These regularization procedures, while effective for variable selection and stable estimation, yield estimators whose distributions are difficult to approximate. LASSO type estimators have a non-standard limiting distribution that depends on which components of the coefficient vector are zero. Since the LASSO type estimator is not consistent in variable selection, the limiting distribution cannot be estimated directly. Furthermore, standard bootstrap methods fail when the true coefficient vector is sparse (Knight and Fu 2000). Recently, Chatterjee and Lahiri (2010) proposed a truncated LASSO estimator whose distribution can be approximated using a residual bootstrap procedure. To overcome the difficulties in LASSO estimators, other regularized procedures such as the SCAD and ALASSO have been proposed. These estimators possess asymptotic oracle properties including perfect variable selection and super efficiency. However, our simulation results suggest that in finite samples, such oracle properties are far from being true, and inference procedures based on asymptotic properties such as those given in Zou (2006) perform poorly, especially when the signal to noise ratio (SNR) is high and the between-covariate correlations are not low. Recently, Pötscher and Schneider (2009, 2010) developed theory on the coverage probabilities of the confidence intervals for ALASSO type estimators under the orthogonal design. It was shown that estimating the distribution function of the ALASSO estimator is not feasible when the true parameter is of similar magnitude to n^{-1/2}, where n is the sample size. It is thus generally difficult to develop well-performing confidence regions (CRs) and hypothesis testing procedures based on these regularized estimators. Such difficulties limit applicability to clinical studies where confidence in statistical evidence is crucial for clinical decision making.

In this paper, we propose resampling methods to derive CR and testing procedures for marker effects estimated from regularized procedures such as the ALASSO and one-step SCAD estimator when the true parameter is fixed. Our preliminary studies suggest that CRs constructed from such resampling procedures perform much better than their asymptotic based counterparts. When the fitted model is merely a working model, many frequently used estimation procedures may fail to produce stable parameter estimates. Procedures that can provide stable parameter estimates and valid interval estimates under a possibly misspecified working model are highly valuable when building a prediction model with high dimensional data. Our proposed procedures remain valid even if the fitted model fails to hold, provided that the employed objective function satisfies mild regularity conditions. The rest of the paper is organized as follows. In Section 2, we introduce the proposed perturbation resampling procedures and describe various methods for constructing confidence regions. In Section 3, we demonstrate the validity of the proposed procedures in finite samples via simulation studies. In Section 4, we illustrate our proposed procedure with an HIV drug resistance study where the goal is to predict phenotypic drug resistance levels using genotypic viral mutations.

2. RESAMPLING PROCEDURES

Suppose that y = (y_1, …, y_n)^T is the n × 1 vector of response variables and x_i = (x_{1i}, …, x_{pi})^T, i = 1, …, n, are the predictors. Let X = (x_1, …, x_n)^T be the n × p matrix of these covariates. Assume that the effect of x on y is determined via an objective function L(θ; D) = ℓ(y, α + β^T x), where θ = (α, β^T)^T, α is an unknown location parameter, β is an unknown p × 1 vector of covariate effects, and D = (y, x^T)^T. To assess the association between x and y, let L(θ) = n^{-1} Σ_{i=1}^n L(θ; D_i) be the objective function used to fit a regression model and θ̃ = (α̃, β̃^T)^T = argmin_θ L(θ). To obtain a regularized estimator for θ_0, we minimize the regularized objective function

L̂(θ) = L(θ) + Σ_{j=1}^p p′_{λ_{nj}}(|β̃_j|) |β_j|   (1)

where p′_{λ_{nj}}(|β̃_j|) is the derivative of a penalty p_{λ_{nj}}(|β_j|) evaluated at the initial estimate β̃_j of β_{0j}. We consider the cases where p_{λ_{nj}}(|β_j|) is the concave SCAD penalty or the L_q penalty for 0 < q < 1, and utilize a one-step estimator of these penalties with the local linear approximation (LLA) method proposed by Zou and Li (2008). Additionally, we consider the ALASSO penalty of Zou (2006), which arises when p′_{λ_{nj}}(|β̃_j|) = n^{-1/2} λ_n |β̃_j|^{-1}.
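For illustration, with squared-error loss the ALASSO fit in (1) can be obtained from standard LASSO software by rescaling the design matrix by the initial estimates. A minimal Python sketch (scikit-learn is used for convenience; the tuning value 0.05 is illustrative rather than the BIC-selected λ_n used later in the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta0 = np.array([1, 1, 0.5, 0.5] + [0.0] * (p - 4))
y = X @ beta0 + rng.normal(size=n)

# Initial (OLS) estimate of beta, used to build the adaptive weights
beta_init = LinearRegression().fit(X, y).coef_

# ALASSO via a weighted LASSO: rescaling column j by |beta_init_j| makes the
# uniform L1 penalty act as penalty / |beta_init_j| on the original coefficient
w = np.abs(beta_init)
fit = Lasso(alpha=0.05).fit(X * w, y)
beta_hat = fit.coef_ * w  # back-transform to the original scale
```

Coefficients whose initial estimates are small receive a heavy effective penalty and are typically set exactly to zero, while large signals are only lightly shrunk.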

2.1 Regularity Conditions

To ensure the asymptotic oracle properties of the regularized estimators and the validity of the proposed resampling procedures, we require the following set of conditions:

  • C1

ℙ{L(θ; D)} has a unique minimum at θ_0 and a continuous second derivative with a positive definite A = ∂²ℙ{L(θ; D)}/∂θ∂θ^T |_{θ=θ_0} > 0, where ℙ is the probability measure generated by the data 𝒟 = {D_i, i = 1, …, n}.

  • C2

The class of functions indexed by θ, {L(θ; D) | θ ∈ Ω}, is Glivenko–Cantelli (Kosorok 2008), where D = (y, x^T)^T and Ω is the compact parameter space containing θ_0.

  • C3

There exists a “quasi-derivative” function U(θ; D) for L(θ; D) such that for any positive sequence δ_n → 0:

1. ℙ{U(θ_0; D) U(θ_0; D)^T} = B, a positive definite matrix.

2. ℙ{L(θ; D) − L(θ_0; D) − U(θ_0; D)^T(θ − θ_0)} = ½(θ − θ_0)^T A (θ − θ_0) + o(‖θ − θ_0‖²), where ‖θ − θ_0‖ ≤ δ_n.

3. ℙ_n{L(θ_1; D) − L(θ_2; D) − U(θ_2; D)^T(θ_1 − θ_2)} = ½(θ_1 − θ_2)^T A (θ_1 − θ_2) + o(‖θ_1 − θ_2‖² + n^{-1/2}‖θ_1 − θ_2‖), almost surely, uniformly in ‖θ_1 − θ_0‖ ≤ δ_n, ‖θ_2 − θ_0‖ ≤ δ_n.

These conditions are parallel to the conditions required in Propositions A1–A3 in Jin et al. (2001). These regularity conditions hold for the commonly used L_2 minimization with L(β; D) = (y − β^T x)² and L_1 minimization with L(β; D) = |y − β^T x|. Details of the justification for these two cases can be found in Section 3 of Jin et al. (2001). These conditions also guarantee that θ̃ is a consistent estimator of θ_0 and that n^{1/2}(θ̃ − θ_0) converges in distribution to N(0, A^{-1} B A^{-1}). Let 𝒜 = {j : β_{0j} ≠ 0}, of size q, and 𝒜^c = {j : β_{0j} = 0}, where a_j denotes the jth component of a vector a.

Following arguments similar to those given in Zou (2006), Zou and Li (2008), and the unconditional arguments given in the appendix, θ̂ = argmin_θ L̂(θ) has ‘good’ properties for certain choices of λ_n, including the oracle property:

Lemma 1: (Oracle properties)

Suppose that λ_n → 0 and n^{1/2}λ_n → ∞. Then the regularized estimates must satisfy the following:

1. Consistency in variable selection: lim_{n→∞} ℙ{I(𝒜̂ = 𝒜) = 1} = 1, where 𝒜̂ = {j : β̂_j ≠ 0}.

2. Asymptotic normality: n^{1/2}(θ̂_𝒜 − θ_{0𝒜}) →_d N(0, A_{11}^{-1} B_{11} A_{11}^{-1}), where A_{11} and B_{11} are the respective q × q submatrices of A and B corresponding to 𝒜.

This lemma guarantees that the regularized estimate asymptotically chooses the correct model and has the optimal estimation rate. However, estimating the distribution of n^{1/2}(θ̂ − θ_0) in finite samples remains difficult. To estimate the standard errors of the SCAD estimates θ̂ = argmin_θ {L(θ) + Σ_{j=1}^p p_{λ_{nj}}(|β_j|)} when L(θ) = n^{-1} Σ_{i=1}^n L(θ; D_i) is smooth in θ, Fan and Li (2001) proposed a local quadratic approximation (LQA) method. This gives a sandwich estimator for the covariance matrix of the estimated nonzero parameters:

côv(θ̂_𝒜̂) = {∇²L(θ̂_𝒜̂) + Σ_λ(θ̂_𝒜̂)}^{-1} côv{∇L(θ̂_𝒜̂)} {∇²L(θ̂_𝒜̂) + Σ_λ(θ̂_𝒜̂)}^{-1}   (2)

where ∇L(θ̂_𝒜̂) = ∂L(θ̂_𝒜̂)/∂θ, ∇²L(θ̂_𝒜̂) = ∂²L(θ̂_𝒜̂)/∂θ∂θ^T, and Σ_λ(θ̂_𝒜̂) is a diagonal matrix with (j, j)th element I(β̂_j ≠ 0) p′_{λ_{nj}}(|β̂_j|)/|β̂_j|. The LQA approach can also be used to construct a covariance estimate for the ALASSO estimates, where p′_{λ_{nj}}(|β̂_j|) = n^{-1/2}λ_n|β̂_j|^{-1}. Similar to the covariance estimates in Tibshirani (1996) and Fan and Li (2001) for penalized estimates, this procedure estimates the standard errors for variables with β̂_j = 0 as 0. Although this sandwich estimator has been proven to be consistent (Fan and Peng 2004) under the linear regression model, it tends to underestimate the standard errors, and normal confidence regions (CRs) using this estimate often do not provide acceptable coverage in finite samples.
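Under squared-error loss, the sandwich formula (2) is straightforward to evaluate numerically. The sketch below is illustrative rather than the paper's exact implementation: the scaling conventions and the ALASSO-type form of Σ_λ (with derivative λ/|β̂_j|) are assumptions made for concreteness.

```python
import numpy as np

def lqa_sandwich(X, y, beta_hat, lam):
    """Sandwich covariance in the spirit of (2), restricted to the selected
    (nonzero) coefficients, for L(beta; D) = (y - x'beta)^2."""
    A = np.flatnonzero(beta_hat)         # estimated active set
    Xa, ba = X[:, A], beta_hat[A]
    n = len(y)
    H = 2 * Xa.T @ Xa / n                # Hessian of the average loss
    Sig = np.diag(lam / ba**2)           # Sigma_lambda: p'(|b_j|)/|b_j| with p'(|b_j|) = lam/|b_j|
    r = y - Xa @ ba                      # residuals
    U = -2 * Xa * r[:, None]             # per-observation score vectors
    B = np.cov(U, rowvar=False) / n      # covariance of the averaged score
    M = np.linalg.inv(H + Sig)
    return M @ B @ M

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)
V = lqa_sandwich(X, y, np.array([0.95, 0.0, 0.45, 0.0]), lam=0.05)
```

Only the selected coefficients receive a variance estimate; zeroed coefficients implicitly get standard error 0, which is exactly the behavior the perturbation method below is designed to improve upon.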

To approximate the covariance of θ̂ more accurately, we propose a perturbation method to estimate the distribution of n^{1/2}(θ̂ − θ_0) for a general class of objective functions and penalties. Let 𝒢 = {G_i, i = 1, …, n} be a set of independent and identically distributed (i.i.d.) positive random variables with mean and variance equal to one. We first perturb the initial objective function and obtain

L*(θ) = n^{-1} Σ_{i=1}^n L(θ; D_i)G_i,  and  θ̃* = argmin_θ L*(θ).   (3)

Then with the same set Inline graphic, we obtain the minimizer of a stochastically perturbed version of the regularized objective function:

L̂*(θ) = L*(θ) + Σ_{j=1}^p p′_{λ*_{nj}}(|β̃*_j|) |β_j|   (4)

where λ*_n satisfies the same order constraints as λ_n, as discussed in Lemma 1. In practice, one may select λ_n and λ*_n based on the BIC criterion detailed in the appendix with the corresponding objective functions. In the appendix we first show that n^{1/2}(θ̂*_𝒜 − θ_{0𝒜}) converges in distribution to N(0, A_{11}^{-1} B_{11} A_{11}^{-1}), the same limiting distribution as that of n^{1/2}(θ̂_𝒜 − θ_{0𝒜}). Furthermore, ℙ*(θ̂*_{𝒜^c} = 0) → 1, where ℙ* is the probability measure generated by both 𝒟 and 𝒢. In addition, we show that the distribution of n^{1/2}(θ̂*_𝒜 − θ̂_𝒜) conditional on the data can be used to approximate the unconditional distribution of n^{1/2}(θ̂_𝒜 − θ_{0𝒜}), and that ℙ*(θ̂*_{𝒜^c} = 0 | 𝒟) → 1. In practice, these results allow us to estimate the distribution of n^{1/2}(θ̂ − θ_0) by generating a large number, M say, of random samples of 𝒢. We obtain θ̂*_m by minimizing the perturbed objective function for each sample m = 1, …, M, and then approximate the distribution of θ̂ by the empirical distribution of {θ̂*_m, m = 1, …, M}. Specifically, the covariance matrix of θ̂ can be estimated by the sample covariance matrix constructed from {θ̂*_m, m = 1, …, M}.
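A hedged sketch of the perturbation scheme in (3)–(4) for least squares with the ALASSO penalty: each observation's contribution to the objective is reweighted by G_i ~ Exp(1) (mean = variance = 1) and the fit is repeated M times. For simplicity the penalty level is held at an illustrative fixed value rather than reselected by BIC within each perturbation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def alasso(X, y, lam, weights=None):
    # ALASSO via a weighted LASSO; the initial estimate is (weighted) OLS
    init = LinearRegression().fit(X, y, sample_weight=weights).coef_
    w = np.abs(init)
    fit = Lasso(alpha=lam).fit(X * w, y, sample_weight=weights)
    return fit.coef_ * w

rng = np.random.default_rng(1)
n, p, M = 200, 10, 200
X = rng.normal(size=(n, p))
beta0 = np.array([1, 1, 0.5, 0.5] + [0.0] * (p - 4))
y = X @ beta0 + rng.normal(size=n)

beta_hat = alasso(X, y, lam=0.05)

# Perturb: multiply each observation's loss by G_i ~ Exp(1) and refit
boots = np.empty((M, p))
for m in range(M):
    G = rng.exponential(1.0, size=n)
    boots[m] = alasso(X, y, lam=0.05, weights=G)

se = boots.std(axis=0)          # perturbation standard error estimates
p0 = (boots == 0).mean(axis=0)  # proportion of perturbed estimates set to 0
```

The empirical covariance of the rows of `boots` estimates the covariance of θ̂, and the proportions `p0` drive the thresholded confidence regions described next.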

Estimating the distribution of n^{1/2}(θ̂ − θ_0) based on the distribution of n^{1/2}(θ̂* − θ̂) | 𝒟 leads to the construction of three possible (1 − α)100% confidence regions for θ_0. For the first, let σ̂*_j² = M^{-1} Σ_{m=1}^M (β̂*_{mj} − β̂_j)². We construct a normal CR for β_{0j}, CR_j^N, centered at β̂_j with standard deviation σ̂*_j. Since n^{1/2}(θ̂* − θ̂) | 𝒟 and n^{1/2}(θ̂ − θ_0) converge to the same normal distribution, nσ̂*_j² consistently estimates the variance of n^{1/2}(β̂_j − β_{0j}). This method is in contrast to CR^Asym, obtained with standard deviations σ̂_j^Asym estimated with the asymptotically consistent LQA sandwich estimator in Fan and Li (2001) and Zou (2006). In contrast to setting the standard error to 0 when β̂_j = 0, we set CR_j^N = {0} if the proportion of the β̂*_j equal to 0 is larger than a threshold p̂_high, where p̂_high → p_high < 1. This method accounts for the superefficiency due to the oracle property and results in a shorter interval with valid coverage. Secondly, we simply take the (α/2)100th and (1 − α/2)100th quantiles of the β̂*_j as the lower and upper bounds of CR_j^Q. For the third, we estimate the density of the β̂*_j with a kernel density estimator and choose the (1 − α)100% highest density region, CR_j^HDR. We estimate the density of β̂*_j | 𝒟 as a mixed distribution f̂_j(β) = P̂_{0j} I(β = 0) + (1 − P̂_{0j}) f̃_j(β), where P̂_{0j} is the proportion of the β̂*_j set to 0 and f̃_j(β) is the unknown density of β̂*_j given that it is not set to 0. To construct a highest density confidence region with accurate coverage under this mixed distribution, we adjust the definition of the region depending on thresholds that reflect the strength of evidence for β_{0j} = 0. Our highest density confidence region CR_j^HDR is defined as

CR_j^HDR =
  {0}                              if P̂_{0j} ≥ p̂_high
  {β : f̃_j(β) ≥ ĉ1} ∪ {0}          if p̂_low ≤ P̂_{0j} < p̂_high
  {β : f̃_j(β) ≥ ĉ2} ∪ {0}          if α ≤ P̂_{0j} < max(α, p̂_low)
  {β : f̃_j(β) ≥ ĉ3}                if P̂_{0j} < α

where ĉ1, ĉ2, and ĉ3 are chosen such that, for H(c) = ∫ I{f̃_j(β) ≥ c} f̃_j(β) dβ, we have H(ĉ1) = (1 − α − P̂_{0j})/(1 − P̂_{0j}), H(ĉ2) = 1 − α + α(P̂_{0j} + p̂_low), and H(ĉ3) = 1 − α, while p̂_low → 0 and p̂_high → p_high = 1 − α. When P̂_{0j}, the proportion of the β̂*_j set to zero, is greater than the upper threshold p̂_high, we have strong evidence that β_{0j} = 0 and thus take {0} as the confidence interval. When P̂_{0j} is between the low and high thresholds, we have moderately strong evidence that β_{0j} = 0 and thus take the mass at 0 together with a highest density region of mass H(ĉ1) from the samples with β̂*_j ≠ 0. The occurrence of α ≤ P̂_{0j} < max(α, p̂_low) suggests that β_{0j} is likely to be a weak signal. For such cases, it would be difficult to make inference about β_{0j} due to shrinkage. Thus, we inflate the highest density region from the samples with β̂*_j ≠ 0. Finally, when P̂_{0j} < α, we have strong evidence that β_{0j} is nonzero, and so we take the (1 − α) highest density region of the continuous empirical distribution of the nonzero β̂*_j samples. The justification of this method and the choices of p̂_high and p̂_low are relegated to the appendix.
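The four-way decision rule above can be summarized in code. This sketch (assuming p̂_low ≥ α) only classifies which case of CR_j^HDR applies for a given coefficient; in the full procedure the thresholds ĉ1, ĉ2, ĉ3 would be computed from a kernel density estimate of the nonzero perturbed draws.

```python
import numpy as np

def hdr_region_case(boots_j, p_low, p_high, alpha=0.05):
    """Classify which case of CR_j^HDR applies, given the perturbed
    draws for coefficient j (assumes p_low >= alpha)."""
    p0 = np.mean(boots_j == 0)     # hat P_{0j}: proportion set exactly to 0
    if p0 >= p_high:
        return "{0}"               # strong evidence that beta_0j = 0
    if p0 >= p_low:
        return "{f >= c1} U {0}"   # moderately strong evidence: HDR plus {0}
    if p0 >= alpha:
        return "{f >= c2} U {0}"   # likely weak signal: inflated HDR plus {0}
    return "{f >= c3}"             # strong evidence that beta_0j != 0
```

For example, a coefficient whose perturbed draws are all zero falls in the first case, while one whose draws are never zero falls in the last.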

In practice, when assessing the effects of multiple features, it is often important to adjust for multiple comparisons. For interval estimation, we may construct a (1 − α)100% simultaneous confidence region to cover the entire parameter vector θ_0. We may then make statements about the importance of each of the covariates in the presence of the other covariates while maintaining a type I error of α. For the regularized estimator, we define the normal simultaneous region as CR^Sim = Π_{j∉𝒜̂} {0} × Π_{j∈𝒜̂} (β̂_j − γ_α σ̂*_j, β̂_j + γ_α σ̂*_j), where 𝒜̂ = {j : P̂_{0j} < p̂_high} and γ_α is the (1 − α)100% quantile of {max_{j∈𝒜̂} |β̂*_{mj} − β̂_j|/σ̂*_j}_{m=1}^M. We define the (1 − α)100% HDR simultaneous region as CR^SimHDR = Π_j CR_{j,α_s}^HDR, where CR_{j,α_s}^HDR is the (1 − α_s) CR_j^HDR for β̂_j and α_s = 2(1 − Φ(γ_α)). We compare the performance of these confidence regions with numerical examples in Sections 3 and 4.
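The construction of the normal simultaneous region from the perturbed draws can be sketched as follows; the function name is ours, standard errors are taken as the sample standard deviations of the draws, and γ_α is the empirical quantile of the max standardized deviation over the active set, as above.

```python
import numpy as np
from math import erf, sqrt

def simultaneous_normal_region(boots, beta_hat, p_high, alpha=0.05):
    """CR^Sim from M x p perturbed draws: coefficients with hat P_{0j} >= p_high
    collapse to {0}; the rest get +/- gamma_alpha * se_j intervals."""
    p0 = (boots == 0).mean(axis=0)
    se = boots.std(axis=0)
    active = np.flatnonzero(p0 < p_high)            # estimated active set
    Z = np.abs(boots[:, active] - beta_hat[active]) / se[active]
    gamma = np.quantile(Z.max(axis=1), 1 - alpha)   # gamma_alpha
    lower = beta_hat[active] - gamma * se[active]
    upper = beta_hat[active] + gamma * se[active]
    alpha_s = 1 - erf(gamma / sqrt(2))              # = 2 * (1 - Phi(gamma))
    return active, lower, upper, gamma, alpha_s
```

The returned α_s is the per-coefficient level used to assemble the corresponding HDR simultaneous region.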

3. SIMULATION STUDIES

To examine the validity of our procedures in finite samples, we performed simulation studies to assess the performance of the corresponding confidence regions. For each setting, we simulated 1500 data sets with n observations generated under the linear model y = Xβ + ε, where x_{ij} ~ N(0, 1), the pairwise correlation between the covariates was set to cor(x_j, x_k) = ρ, ε_i ~ N(0, σ²), and β, ρ, and σ were varied between settings. In each setting, β was sparse and included medium and high signals. We obtained ALASSO estimators via LARS (Efron et al. 2004) for each simulated data set with OLS initial estimates and λ chosen by the BIC as described in the appendix, and then obtained M = 500 perturbed samples using our proposed method with 𝒢 generated from a mean 1 exponential distribution. The sample size n was set to 100, 200, 400, or 1000; ρ was 0, 0.2, or 0.5; and σ was 1 or 2. To compute the highest density regions CR*^HDR we utilized the hdrcde package in R with the “nrd” bandwidth estimator presented in Scott (1992), based on Silverman’s rule of thumb (Silverman 1986). We chose p̂_low = min{√(2/π) exp(−nλ/(4σ̂²)), 0.49} and p̂_high = min{1 − √(2/π) exp(−nλ/σ̂²), 0.95}, as justified in the appendix, for CR*^HDR, CR*^N, CR*^Sim, and CR*^SimHDR. We substituted the σ used in the standard deviation estimate from Zou (2006), analogous to equation (2), with the known σ from the simulations. We present the results for simulations with n = 100, 200, and 400 when σ = 1 or 2 and p = 10 or 20. In these cases, the true β_0 contains two large effects of β_{0j} = 1, two moderate effects of β_{0j} = 0.5, and six (for p = 10) or sixteen (for p = 20) noise parameters with β_{0j} = 0. To examine the effect of regularization, we compare our CRs for the regularized estimators to CR^OLS, the normal CR based on the empirical standard error of the perturbed ordinary least squares (OLS) estimates.
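The data-generating design above can be reproduced in outline as follows. This is a Python sketch (the paper's computations used R); the function name and the placement of the nonzero coefficients within β are illustrative assumptions.

```python
import numpy as np

def simulate_setting(n, p, rho, sigma, seed=0):
    """One simulated data set: compound-symmetric cor(x_j, x_k) = rho,
    errors N(0, sigma^2), and a sparse beta with two large (1), two
    moderate (0.5), and p - 4 zero coefficients."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((p, p), float(rho))
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.array([1, 1, 0.5, 0.5] + [0.0] * (p - 4))
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y, beta
```

Each of the paper's settings corresponds to one choice of (n, p, ρ, σ), with 1500 replicates per setting.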

In Tables 1 and 2 we see that when σ = 1, most regions perform similarly for nonzero parameters. When σ = 2, the perturbation regions usually have higher coverage than CR^Asym and sacrifice very little in length. The asymmetric CR*^HDR has the shortest length when β_{0j} = 0 in all settings. Coverage for CR*^HDR and the simultaneous confidence regions can be low when n = 100 due to the difficulty of estimating P̂_{0j} at such a small sample size, but coverage reaches nominal levels by n = 200. The standard deviation estimate from Zou (2006), σ̂^Asym (also see Table 3), is not large enough to cover β_{0j} sufficiently, and while the coverage probability of CR^OLS is not extremely low, it is notably outperformed by the other confidence regions when β_{0j} = 0. We omit the results from the settings where n = 1000 because they show similar patterns to those with n = 400. For these large sample cases with n ≥ 400 we saw convergence to 95% coverage for the normal CRs, highest density regions, and OLS CRs in all settings when the true parameter was nonzero. For true zero parameters, the coverage probabilities of our confidence regions converged to 1, while the OLS CR converged to 0.95. A tradeoff associated with our method is that while the coverage of our perturbation confidence regions tends to be higher than that of CR^OLS and CR^Asym, some power is sacrificed for moderate signals of β_{0j} = 0.5. This loss is minimal, however, and only appears in difficult cases when the sample size is low and ρ and σ are high. When β_{0j} = 0, CR^OLS has coverage lower than 95% for small samples, while our methods produce regions with coverage probability near 1 and very short lengths, reflecting the oracle properties. Overall, the greatest disparity between our methods and previous methods is seen when the SNR is low.

Table 1.

Coverage probabilities (lengths) of confidence regions when σ = 1.

p β0 n = 100 n = 200 n = 400
ρ = 0 ρ = 0.2 ρ = 0.5 ρ = 0 ρ = 0.2 ρ = 0.5 ρ = 0 ρ = 0.2 ρ = 0.5
10 1 CR*N 91.6 (38) 92.7 (41) 92.5 (52) 92.9 (27) 94.1 (29) 94.2 (37) 93.9 (19) 95.1 (21) 94.9 (26)
CR*HDR 91.4 (38) 91.7 (40) 91.5 (51) 92.4 (27) 93.7 (29) 93.9 (36) 93.4 (19) 94.5 (21) 94.1 (26)
CR*Q 91.7 (38) 91.0 (40) 91.5 (51) 92.5 (27) 93.6 (29) 93.8 (36) 93.5 (19) 94.3 (20) 93.9 (26)
CRAsym 93.9 (41) 94.1 (43) 93.7 (53) 94.1 (28) 94.2 (30) 94.1 (36) 94.4 (20) 95.0 (21) 94.9 (25)
CROLS 91.4 (38) 91.7 (40) 90.6 (51) 93.0 (27) 93.6 (29) 93.9 (36) 93.0 (19) 94.3 (21) 93.7 (26)
0.5 CR*N 93.0 (40) 93.3 (43) 92.0 (54) 93.9 (28) 94.6 (30) 95.1 (38) 93.7 (20) 94.2 (21) 95.2 (27)
CR*HDR 91.9 (38) 92.5 (41) 93.4 (51) 93.5 (27) 94.0 (29) 93.8 (37) 93.7 (19) 93.8 (21) 94.7 (26)
CR*Q 91.7 (39) 92.3 (42) 90.7 (53) 93.3 (27) 93.7 (29) 93.9 (37) 93.8 (19) 93.6 (21) 94.7 (26)
CRAsym 93.3 (41) 93.5 (43) 91.5 (52) 94.5 (28) 94.3 (30) 93.7 (36) 95.0 (20) 94.0 (21) 94.1 (25)
CROLS 92.4 (38) 91.6 (41) 90.9 (50) 93.5 (27) 94.3 (29) 93.8 (36) 94.3 (19) 93.7 (21) 94.9 (26)
0 CR*N 97.6 (23) 98.5 (25) 98.1 (31) 98.4 (17) 98.3 (19) 97.9 (23) 98.7 (13) 98.7 (13) 98.7 (16)
CR*HDR 99.1 (17) 99.3 (18) 99.4 (23) 99.6 (12) 99.3 (13) 99.0 (16) 99.7 (8) 99.3 (8) 99.5 (11)
CR*Q 99.5 (31) 99.7 (33) 99.8 (43) 99.7 (22) 99.7 (24) 99.7 (30) 99.8 (16) 99.7 (17) 99.8 (21)
CROLS 92.9 (38) 92.9 (40) 91.7 (51) 93.4 (27) 93.0 (29) 92.6 (36) 93.4 (19) 93.7 (21) 93.6 (26)
CR*SimHDR 91.9 (36) 92.5 (39) 91.7 (49) 93.1 (26) 94.9 (28) 93.8 (36) 94.9 (19) 94.8 (20) 96.0 (25)
CR*Sim 92.5 (42) 92.9 (46) 91.9 (58) 93.7 (31) 95.5 (33) 95.2 (42) 95.5 (23) 95.7 (24) 96.6 (30)
CR*SimOLS 87.1 (54) 86.5 (58) 85.9 (72) 89.8 (38) 90.0 (41) 90.9 (52) 92.3 (28) 91.6 (30) 92.6 (37)
20 1 CR*N 91.7 (38) 92.4 (42) 92.5 (53) 93.3 (27) 92.9 (30) 94.3 (38) 95.4 (19) 94.6 (21) 94.5 (27)
CR*HDR 90.2 (37) 90.9 (41) 90.8 (51) 92.4 (26) 91.9 (29) 91.9 (36) 95.1 (19) 93.6 (21) 93.3 (26)
CR*Q 90.2 (37) 90.7 (41) 90.7 (51) 92.3 (26) 92.3 (29) 91.9 (36) 95.1 (19) 93.2 (21) 93.1 (26)
CRAsym 93.9 (41) 94.5 (44) 93.3 (54) 95.1 (28) 93.9 (30) 94.0 (37) 95.9 (20) 94.7 (21) 93.3 (26)
CROLS 90.3 (38) 89.9 (42) 90.0 (52) 92.6 (27) 92.1 (29) 92.0 (37) 95.3 (19) 93.4 (21) 93.3 (26)
0.5 CR*N 91.1 (40) 91.5 (44) 91.0 (56) 93.7 (28) 93.5 (30) 92.7 (39) 93.1 (20) 94.7 (21) 95.0 (27)
CR*HDR 89.7 (38) 90.3 (42) 92.5 (52) 93.1 (27) 93.1 (30) 91.7 (38) 92.9 (19) 93.3 (21) 94.4 (27)
CR*Q 89.7 (39) 89.7 (43) 89.2 (54) 92.7 (27) 92.5 (30) 91.5 (38) 92.7 (19) 93.5 (21) 94.5 (27)
CRAsym 91.7 (41) 92.0 (44) 89.7 (53) 94.5 (28) 93.2 (30) 92.3 (37) 93.7 (20) 94.3 (21) 94.2 (26)
CROLS 89.7 (38) 89.7 (42) 89.6 (52) 92.9 (27) 92.9 (29) 91.8 (37) 92.9 (19) 92.7 (21) 94.2 (26)
0 CR*N 96.6 (29) 97.3 (32) 96.8 (40) 98.6 (21) 98.5 (23) 98.7 (29) 98.8 (15) 99.1 (17) 99.0 (21)
CR*HDR 97.7 (25) 98.3 (28) 98.3 (34) 98.9 (17) 99.3 (19) 99.1 (24) 99.3 (12) 99.5 (13) 99.4 (16)
CR*Q 99.0 (31) 99.4 (34) 99.2 (43) 99.5 (22) 99.9 (24) 99.5 (30) 99.8 (15) 99.9 (17) 99.7 (21)
CROLS 89.5 (38) 90.1 (41) 90.1 (52) 92.4 (27) 92.2 (29) 92.6 (37) 94.7 (19) 93.8 (21) 94.2 (26)
CR*SimHDR 90.5 (42) 92.0 (47) 92.1 (58) 95.4 (31) 95.7 (34) 95.3 (44) 96.9 (22) 96.9 (25) 96.9 (32)
CR*Sim 92.5 (51) 91.9 (57) 91.7 (71) 96.5 (38) 97.5 (42) 96.5 (54) 97.7 (28) 97.8 (31) 98.1 (40)
CR*SimOLS 80.1 (59) 78.4 (64) 77.9 (80) 87.5 (41) 87.1 (45) 84.9 (57) 90.6 (29) 90.0 (32) 91.1 (41)

NOTE: We multiply values by 100. The lengths of the simultaneous confidence regions are averaged over the number of parameters.

Table 2.

Coverage probabilities (lengths) of confidence regions when σ = 2.

p β0 n = 100 n = 200 n = 400
ρ = 0 ρ = 0.2 ρ = 0.5 ρ = 0 ρ = 0.2 ρ = 0.5 ρ = 0 ρ = 0.2 ρ = 0.5
10 1 CR*N 92.6 (79) 94.3 (85) 92.9 (110) 93.4 (55) 94.4 (59) 94.9 (76) 94.4 (39) 94.2 (42) 94.7 (53)
CR*HDR 91.7 (76) 93.9 (82) 94.0 (104) 92.7 (55) 94.2 (59) 94.4 (74) 94.0 (39) 93.6 (42) 93.7 (52)
CR*Q 91.9 (77) 93.7 (84) 91.4 (107) 92.7 (55) 93.8 (59) 94.3 (75) 93.9 (39) 93.4 (42) 93.8 (52)
CRAsym 80.7 (57) 82.7 (60) 78.8 (73) 82.3 (40) 83.6 (42) 80.5 (51) 83.7 (28) 83.5 (30) 80.9 (36)
CROLS 91.7 (75) 93.4 (81) 91.1 (102) 92.7 (54) 93.4 (58) 94.0 (73) 94.0 (39) 93.7 (42) 93.4 (52)
0.5 CR*N 87.5 (80) 87.6 (86) 81.3 (100) 93.5 (59) 94.2 (63) 90.3 (79) 94.7 (41) 95.7 (44) 94.5 (57)
CR*HDR 90.3 (71) 91.4 (76) 83.9 (88) 95.9 (54) 96.5 (58) 92.3 (69) 93.5 (39) 94.5 (42) 96.5 (52)
CR*Q 91.5 (76) 92.3 (81) 90.5 (97) 93.4 (57) 93.9 (61) 92.8 (75) 94.0 (40) 94.6 (43) 94.0 (56)
CRAsym 76.5 (51) 76.3 (53) 67.3 (57) 78.8 (39) 78.8 (42) 78.3 (48) 79.9 (28) 80.9 (29) 78.1 (36)
CROLS 91.1 (75) 92.1 (81) 90.7 (101) 93.0 (54) 93.5 (58) 93.1 (72) 94.1 (38) 94.5 (41) 93.5 (52)
0 CR*N 97.1 (47) 97.1 (50) 97.7 (65) 98.1 (33) 98.0 (37) 98.1 (47) 98.3 (25) 98.4 (27) 98.8 (34)
CR*HDR 98.3 (40) 98.2 (40) 99.1 (52) 99.0 (25) 99.4 (28) 99.5 (36) 99.4 (17) 99.5 (19) 99.6 (24)
CR*Q 99.1 (64) 98.9 (68) 99.4 (86) 99.7 (45) 99.7 (49) 99.7 (61) 99.9 (32) 99.7 (34) 99.9 (43)
CROLS 91.3 (75) 91.2 (81) 91.7 (102) 92.9 (54) 92.5 (58) 92.3 (73) 94.1 (39) 94.7 (42) 94.2 (52)
CR*SimHDR 85.3 (71) 85.7 (77) 75.1 (96) 94.1 (51) 94.4 (56) 90.8 (70) 96.3 (38) 95.2 (41) 95.2 (52)
CR*Sim 84.1 (83) 84.1 (91) 73.9 (116) 93.4 (60) 93.5 (66) 90.1 (84) 96.6 (45) 95.9 (49) 95.8 (62)
CR*SimOLS 85.2 (108) 86.9 (116) 87.5 (146) 90.9 (77) 90.7 (83) 91.1 (104) 92.7 (55) 92.5 (59) 92.6 (74)
20 1 CR*N 91.2 (80) 91.7 (87) 90.0 (112) 92.1 (55) 93.9 (60) 92.9 (78) 94.3 (39) 93.7 (43) 94.6 (54)
CR*HDR 90.2 (76) 91.2 (83) 92.1 (104) 92.1 (54) 93.1 (59) 91.7 (75) 94.1 (38) 93.1 (42) 93.9 (53)
CR*Q 90.1 (77) 90.7 (84) 89.6 (107) 92.1 (54) 93.3 (59) 91.9 (76) 94.4 (38) 92.6 (42) 93.8 (53)
CRAsym 79.2 (58) 78.9 (62) 75.9 (76) 80.3 (40) 82.2 (43) 77.3 (53) 83.5 (28) 80.5 (30) 81.9 (37)
CROLS 89.7 (76) 90.1 (83) 89.7 (104) 92.3 (54) 92.9 (58) 91.3 (74) 94.4 (38) 92.9 (42) 93.8 (53)
0.5 CR*N 86.1 (81) 82.7 (86) 80.7 (103) 91.6 (59) 91.9 (64) 88.4 (81) 93.9 (41) 94.9 (45) 93.1 (58)
CR*HDR 91.0 (74) 87.8 (79) 84.7 (94) 95.7 (55) 94.7 (59) 92.1 (73) 93.3 (39) 94.3 (43) 96.1 (54)
CR*Q 89.1 (75) 88.8 (80) 88.6 (97) 92.0 (57) 92.1 (61) 90.9 (75) 93.3 (40) 94.1 (44) 92.5 (56)
CRAsym 72.5 (50) 71.6 (51) 64.2 (57) 77.2 (39) 76.6 (41) 73.8 (48) 81.3 (28) 79.2 (30) 76.8 (36)
CROLS 89.5 (76) 88.9 (82) 89.3 (104) 92.5 (54) 92.5 (59) 91.8 (74) 94.1 (38) 94.4 (42) 93.1 (53)
0 CR*N 97.3 (57) 96.8 (61) 97.0 (79) 98.5 (40) 97.3 (45) 97.7 (58) 98.8 (29) 98.9 (33) 98.9 (41)
CR*HDR 97.9 (53) 97.5 (55) 97.7 (70) 99.1 (35) 98.2 (39) 98.7 (50) 99.3 (24) 99.3 (27) 99.3 (33)
CR*Q 98.9 (63) 98.7 (69) 98.8 (87) 99.6 (44) 99.2 (49) 99.3 (62) 99.6 (31) 99.7 (34) 99.7 (43)
CROLS 90.1 (76) 90.2 (83) 89.8 (104) 92.8 (54) 91.9 (59) 91.9 (74) 93.5 (38) 93.5 (42) 93.7 (53)
CR*SimHDR 87.1 (81) 83.6 (89) 78.1 (112) 95.1 (60) 94.4 (66) 93.7 (84) 97.1 (45) 97.1 (50) 97.3 (62)
CR*Sim 86.1 (99) 82.8 (109) 78.3 (138) 94.3 (73) 94.4 (81) 94.1 (103) 97.8 (55) 98.3 (61) 97.1 (77)
CR*SimOLS 77.7 (117) 76.3 (128) 76.6 (160) 86.1 (82) 85.4 (90) 86.9 (114) 90.5 (59) 89.8 (64) 90.3 (81)

NOTE: We multiply values by 100. The lengths of the simultaneous confidence regions are averaged over the number of parameters.

Table 3.

Empirical s.d. of the parameter estimates (σ̃) and average s.e. estimates (σ̂).

p β0 n = 100 n = 200 n = 400
ρ= 0 ρ = 0.2 ρ = 0.5 ρ = 0 ρ = 0.2 ρ = 0.5 ρ = 0 ρ = 0.2 ρ = 0.5
10 1 σ̃ 21.7 22.2 29.9 14.7 15.2 19.3 10.1 10.8 13.7
σ̃OLS 21.1 21.6 28.5 14.4 15.2 19.1 10.1 10.9 13.9
σ̂* 20.0 21.7 28.1 14.1 15.2 19.4 10.0 10.8 13.5
σ̂OLS 19.1 20.6 26.0 13.8 14.8 18.5 9.8 10.6 13.2
σ̂Asym 14.7 15.4 18.7 10.2 10.8 13.1 7.1 7.5 9.2
0.5 σ̃ 24.2 25.5 32.3 16.1 16.8 21.8 10.6 11.2 14.7
σ̃OLS 21.6 22.8 29.2 14.8 15.5 19.4 10.2 10.8 13.9
σ̂* 21.1 22.8 27.9 15.1 16.2 20.5 10.3 11.2 14.5
σ̂OLS 19.1 20.7 25.9 13.8 14.8 18.4 9.8 10.6 13.2
σ̂Asym 13.0 13.6 14.5 10.0 10.6 12.1 7.1 7.5 9.1
0 σ̃ 17.0 17.2 20.3 10.7 11.5 14.2 6.9 7.2 9.2
σ̃OLS 22.4 23.5 28.4 14.9 16.2 19.9 10.2 10.9 13.8
σ̂* 18.6 19.9 25.1 13.2 14.2 17.8 9.4 10.1 12.6
σ̂OLS 19.2 20.6 26.1 13.8 14.9 18.5 9.9 10.6 13.3
σ̂Asym 5.0 4.5 5.4 2.6 2.9 3.5 1.5 1.5 2.0
20 1 σ̃ 23.2 24.8 33.4 15.4 16.2 21.6 9.9 11.4 14.0
σ̃OLS 23.0 24.6 31.7 15.4 16.1 21.3 9.8 11.4 13.9
σ̂* 20.3 22.2 28.5 14.1 15.3 19.9 9.9 10.9 13.9
σ̂OLS 19.3 21.1 26.5 13.7 14.9 18.9 9.7 10.7 13.4
σ̂Asym 14.9 15.9 19.4 10.3 10.9 13.5 7.1 7.6 9.4
0.5 σ̃ 24.7 27.1 32.8 16.4 18.1 23.1 10.5 11.7 15.7
σ̃OLS 22.8 25.3 31.6 15.1 16.6 20.8 10.0 11.2 14.4
σ̂* 21.2 22.8 28.1 15.2 16.5 20.8 10.5 11.5 14.9
σ̂OLS 19.3 21.0 26.4 13.7 14.9 18.8 9.8 10.7 13.5
σ̂Asym 12.8 13.1 14.5 10.1 10.5 12.2 7.2 7.6 9.2
0 σ̃ 15.3 16.7 21.1 9.3 10.8 13.7 6.1 6.6 8.1
σ̃OLS 22.5 25.0 31.7 14.7 16.6 21.2 10.2 11.4 14.1
σ̂* 18.5 20.2 25.6 12.9 14.2 18.1 9.1 10.2 12.7
σ̂Asym 4.7 5.1 6.1 2.6 2.8 3.5 1.4 1.6 1.9

NOTE: We present results for settings where σ, the standard deviation of ε, is 2. All values are multiplied by 100. Note that σ̂_j^Asym = 0 when β̂_j = 0, but β̂_j and β̂*_j are not always 0 in the simulations, and therefore the average σ̂_j^Asym is nonzero.

The coverage probabilities and lengths of our simultaneous confidence regions are also displayed in Tables 1 and 2. We compared our methods to CR*^SimOLS, constructed analogously to CR*^Sim except that 𝒜̂ = {j : j = 1, …, p}, CR*^SimOLS is centered at the OLS estimates, and the standard error is the sample standard deviation of the perturbed OLS estimates. Our regularized CR*^Sim and CR*^SimHDR have the advantage of shrinking the dimension of the region by reducing some CRs to the point {0} when P̂_{0j} is large. We see that our CR*^Sim and CR*^SimHDR outperform CR*^SimOLS in coverage and have shorter lengths. For large sample settings with n = 1000, CR*^SimOLS converges further toward 95% coverage, with levels around 90% for p = 20, while CR*^Sim and CR*^SimHDR have coverage almost always over 95%.

In Table 3 we also present the standard error estimates when σ = 2. For notation, let the empirical standard deviations of the estimators β̂j and β̃j be denoted by σ̃j and σ̃jOLS, respectively. We see that our perturbation-based standard error estimate σ̂*j does well in estimating σ̃j. However, the standard error σ̂jAsym proposed by Zou (2006) underestimates the true standard error of the parameter estimates, especially when σ = 2 and β0j = 0.5 or 0. When the SNR is higher, σ̂jAsym estimates σ̃j well except when β0j = 0, because σ̂jAsym = 0 there whereas σ̃j and σ̂*j are clearly nonzero.

4. EXAMPLE: HIV DRUG RESISTANCE

We illustrate our methods in a real example using the HIV antiretroviral drug susceptibility data described in Rhee et al. (2003). This dataset was refined from the Stanford HIV Drug Resistance Database (available at http://hivdb.stanford.edu/), and is used to study the association of protease mutations with susceptibility to the protease inhibitor antiretroviral (ARV) drug amprenavir. The data consist of mutation information at 99 protease codons in the viral genome, of which 79 contain mutations, and ARV drug resistance assays for n = 702 HIV infected patients. Drug resistance was measured in units of IC50, the amount of drug needed to inhibit viral replication by 50%, expressed as fold increase compared to drug-sensitive wildtype virus. Researchers are interested in determining which protease mutations are associated with ARV resistance so that they may develop a genotype test for resistance that looks for these mutations in the patient's infecting HIV strain. Therefore, we aim to examine the effect of the presence of any of the mutations at the 79 codons on IC50, where higher IC50 measurements indicate higher levels of drug resistance. We log-transformed the non-negative IC50 outcome and represented the presence of each mutation as a binary predictor in our regression model. We removed the fifteen mutations that occurred in less than 0.5% of the samples. Recently, Wu (2009) analyzed these data with a permutation test for the regression coefficients of LASSO. Here, we analyze the data using ALASSO and draw inference by using our perturbation methods to construct CRs and standard errors.

For this analysis, we used LARS to fit an ALASSO linear model with initial parameters β̃ estimated by OLS, and with λ and λ* chosen to minimize the BIC as described in the appendix. We generated M = 500 perturbation sets, each consisting of n = 702 i.i.d. variables from an exponential distribution with mean and variance equal to 1, and for each set we minimized the perturbed objective function to obtain β̂*(m). We constructed 95% CRs using our perturbation method and compared the inference from CR*N, CR*HDR, and CR*Q to that from CRAsym and CROLS. We estimated the σ used in the standard deviation estimate of Zou (2006), analogous to equation (2), with the residual variance from the unregularized linear regression model, and chose p̂high and p̂low as described in the simulation studies.
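A minimal sketch of this perturbation scheme, assuming simulated data in place of the HIV dataset and a plain coordinate-descent LASSO solver in place of LARS (all names and settings here are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)

def lasso_cd(X, y, lam, n_iter=200):
    """Plain LASSO via cyclic coordinate descent for ||y - Xb||^2 + lam*sum|b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    r = y.astype(float).copy()                       # residual for b = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            r += X[:, j] * b[j]                      # add coordinate back
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
            r -= X[:, j] * b[j]
    return b

def perturbed_alasso(X, y, lam, G=None):
    """One ALASSO fit of sum_i G_i (y_i - x_i'b)^2 + lam * sum_j |b_j|/|btilde_j|,
    with btilde the (perturbed) initial OLS estimate; G=None gives the unperturbed fit."""
    n = len(y)
    w = np.sqrt(np.ones(n) if G is None else G)
    Xw, yw = X * w[:, None], y * w                   # absorb weights G_i
    btilde = np.linalg.lstsq(Xw, yw, rcond=None)[0]  # initial estimate
    Z = Xw * np.abs(btilde)                          # column-rescaling trick
    return lasso_cd(Z, yw, lam) * np.abs(btilde)     # back to original scale

# Toy data and a small perturbation run (M kept small for speed)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(n)
beta_hat = perturbed_alasso(X, y, lam=2.0)
beta_star = np.array([perturbed_alasso(X, y, 2.0, rng.exponential(1.0, n))
                      for _ in range(100)])
se_perturb = beta_star.std(axis=0, ddof=1)           # perturbation-based SEs
```

The column-rescaling step turns the adaptively weighted penalty into an ordinary LASSO problem, which is why a single generic solver suffices for every perturbed fit.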

We present a graphical summary of the analysis results in Figure 1. Previous studies by Prado et al. (2002) and results collected by Johnson et al. (2005) found that mutations at codons 10, 32, 46, 47, 50, 54, 73, 82, 84 and 90 emerge in amprenavir-resistant viral genomes. Using a permutation-based p-value adjusted for multiple testing, Wu (2009) determined these mutations (except 73 and 82), as well as additional codon mutations, to be significantly associated with amprenavir susceptibility at the α = 0.05 level, for a total of thirteen significant associations. The ALASSO estimator obtained with λ = 0.56 from BIC estimated 36 coefficients as nonzero. The confidence region from nonregularized estimates, CROLS, was significant for twenty-six mutations. Our perturbation-based CR*N, CR*HDR, and CR*Q for the mutations found significant by Wu (2009) did not include zero, and three new mutations (37, 64, 93) had significant perturbation confidence regions. We see in Figure 2 that the parameters for codons 71 and 89 have marginally significant Normal and HDR confidence regions and marginally nonsignificant quantile confidence regions; the corresponding proportion of zero perturbed estimates is marginally close to 0.05.

Figure 1. Perturbation methods results denoting significant associations between genetic mutations and drug susceptibility.

Figure 2. 95% perturbation CRs (CR*N and CR*HDR) for the association between genetic mutations and antiretroviral drug susceptibility. Estimated coefficients β̂j are represented with a circle on each CR line, and a star at zero signifies that the CR includes the point mass at zero. The shaded region denotes the simultaneous confidence regions CR*Sim and CR*SimHDR. Note that even coefficients estimated as zero may have CRs around their estimates, and that CR*HDR may be asymmetric and noncontiguous.

Our use of ALASSO provides estimates of the effect of each mutation while adjusting for the presence of other mutations. Several studies have shown that mutations associated with resistance to protease inhibitors can have varying effects when combined with other mutations (Schumi and DeGruttola 2008; Van Marck et al. 2009). For instance, the mutation at codon 32 has been found to have no effect on resistance to the protease inhibitor drug darunavir when a mutation at codon 84 is present (Van Marck et al. 2009). Our method allows us to estimate the size of associations without orthogonalizing predictors, and we adjust for multiple testing with the simultaneous confidence region CR*Sim. Differences from previous results may reflect that some studies summarized in Johnson et al. (2005) did not adjust for other mutations, and that Wu (2009) used LASSO estimators, which do not have oracle properties. Our methods highlight three new mutations that had not previously been found to be associated with drug susceptibility. Furthermore, our methods produce CRs for the coefficients of mutations that were estimated as zero. These CRs quantify the uncertainty in our estimation and can aid scientists who wish to conduct future drug therapy studies involving these codons.

5. DISCUSSION

In this paper, we address the problem of constructing a covariance estimate for parameter estimates obtained with a general objective function and concave penalty functions including adaptive LASSO and SCAD. The proposed methods for covariance estimates are simple to implement and possess the attractive property that parameters estimated as zero have nonzero standard errors. We may then construct confidence regions for each parameter estimate and obtain more meaningful inference.

We have shown through extensive simulation studies using the ALASSO penalty that our perturbation method results in confidence regions with accurate coverage probabilities. The perturbation-based Normal CR sacrifices little in length and has reasonable coverage for small sample sizes. We set the CR to {0} when the proportion of perturbed estimates set to 0 exceeds a threshold, thereby shortening the length by exploiting the oracle property. The perturbation-based highest density region has even shorter length and good coverage probability, especially for the moderate signal β0j = 0.5, in comparison to all other confidence regions. The asymptotic Normal interval based on the standard error estimate of Zou (2006) fails to reach nominal coverage levels due to underestimation of the standard error, most notably because the standard error is estimated as 0 whenever β̂j = 0. In contrast, our perturbation-based estimate of the standard error is close to the empirical standard error of the ALASSO estimates, even for parameters estimated as 0. Additionally, we propose two types of simultaneous CRs that adjust for multiple comparisons. We again utilize the oracle property and reduce the dimension of the region by setting component intervals to {0} when the proportion of zero perturbed parameter estimates is high. Therefore, the average length of our Normal simultaneous region will often be shorter than that of the simultaneous OLS region. For instance, when all covariates are independent, the OLS length is asymptotically proportional to γOLS = max_{1≤j≤p} |(β̃j − β0j)/σ|, whereas the perturbation region length is asymptotically proportional to (q/p)γ, where γ = max_{j: β0j ≠ 0} |(β̂j − β0j)/σ|. Note that γ ≤ γOLS, and so the length of the perturbation region will be shorter than the OLS length when the true model is sparse. Similarly, when the covariates are not independent, {(β̃j − β0j)/σ}_{j=1}^p is approximately N(0, Corr(β̃)), and the perturbation region generally has shorter average length than the OLS region. Simple simulations show that when q parameters are estimated as nonzero, we expect the perturbation region length to be approximately 0.36 times the OLS region length when p = 10 and q = 4, and approximately 0.16 times the OLS region length when p = 20 and q = 4, in both the independent case and the compound symmetry case with ρ = 0.5 and σ = 1. However, in finite samples, the gain in interval length for the shrinkage estimators may be substantially less than the theoretical gain, as the oracle properties may be far from holding and the intervals may need to be enlarged to ensure proper coverage levels.
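A rough Monte Carlo check of the independent-case length comparison (our own sketch; it approximates the maxima by independent standard normals and ignores finite-sample shrinkage effects, so the resulting ratio is of the same order as, though not identical to, the figures quoted above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Independent case: the OLS simultaneous length scales with E max_{j<=p}|Z_j|,
# while the perturbation region's average length scales with (q/p) times the
# expected max over the q nonzero components (zero components collapse to {0}).
p, q, B = 10, 4, 20000
Z = np.abs(rng.standard_normal((B, p)))
gamma_ols = Z.max(axis=1).mean()          # E max over all p components
gamma = Z[:, :q].max(axis=1).mean()       # E max over the q nonzero ones
ratio = (q / p) * gamma / gamma_ols       # expected length ratio
```

The ratio shrinks both because the maximum over fewer components is stochastically smaller and because only q of the p intervals have nonzero length.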

When the SNR is low, much larger sample sizes may be required for the resampling procedure to yield confidence intervals with proper coverage levels. We conducted further simulations for the case β = (1, 1, 0.1, 0.1, 0_{1×(p−4)}). In general, we find that the standard error estimates perform well even with sample sizes around 100. The confidence intervals have reasonable coverage levels for β3 when σ = 1 and the correlation ρ is low, with sample size 400 or larger. For example, when σ = 1, ρ = 0.2, and p = 20, the coverage level of the 95% CR*HDR of β3 is about 90% for n = 400 and 94% for n = 1000. As we increase ρ and σ, interval estimation of β3 becomes more difficult. For example, in the most difficult case with σ = 2, ρ = 0.5, and p = 20, the empirical coverage level of CR*HDR is about 60%, 84%, and 90% for n = 400, 1000, and 2000, respectively. This is a particularly hard case, as it has been shown that estimating the distribution function of an ALASSO-type estimator is not feasible when the effect size is of magnitude similar to n^{−1/2} (Pötscher and Schneider 2009). Note that when σ = 2, the effect size corresponding to β3 is 0.05, whereas n^{−1/2} = 0.1 when n = 100 and n^{−1/2} ≈ 0.032 when n = 1000.

Additionally, it is well known that regularized estimators, while possessing asymptotic oracle properties, are prone to bias in finite samples. Bias correction for the ALASSO estimator can be achieved based on our perturbation samples; we present the technical details of the bias estimation in the appendix. We find that this bias correction works well in practice, especially when the signal is small or moderate, as when β0j = 0.5. For example, in our simulations with p = 20, n = 200, ρ = 0.2, σ = 2, and β0j = 0.5, the bias of β̂j is −0.067 while the bias of β̂jBC is −0.034. Similar gains are seen in most settings. The bias-corrected estimator has empirical standard error similar to that of the original ALASSO estimator but with smaller bias. Analogous bias-corrected estimators could be constructed for other penalties and objective functions. The model size with the ALASSO and bias-corrected ALASSO estimators in our simulations is close to 5 when σ = 1, except in the difficult cases with n = 100 and p = 20, for which the average model size is closer to 5.5. In the low-SNR settings with σ = 2, the oracle property is weak in finite samples, and the model size is between 5 and 6 when p = 10 and between 6 and 9 when p = 20.

We note that when p is large relative to n, initial parameter estimates obtained with ridge regression can produce more stable results. Furthermore, our results may be extended to the case where p tends to ∞ at some rate slower than n. We expect that the theory could be derived using similar arguments as given in Fan and Peng (2004) and Zou and Zhang (2009). Lastly, we note that our methods are robust to misspecification of the model and are valid provided that regularity conditions given in Section 2 hold.

Acknowledgments

The authors thank the editor, the associate editor, and two referees for their insightful and constructive comments that greatly improved the article.

This research was supported by National Institutes of Health Grants T32 AI007358, R01 GM079330, R01 HL089778 and DMS 0854970.

APPENDIX

A.1 Justification for the resampling method

To show that the distribution of n^{1/2}(θ̂ − θ0) can be estimated by that of n^{1/2}(θ̂* − θ̂) | X under conditions C1–C3, we first consider the distribution of n^{1/2}(θ̂* − θ0) under the product probability measure ℙ* generated by the data 𝒟 and the perturbation variables 𝒢 = {G_i, i = 1, …, n}. Throughout, we assume that the parameter space for θ, denoted by Ω, is a compact set and that θ0 is an interior point of Ω. Note that this compactness condition may be nontrivial in practice. The condition is needed for this proof of our proposed method, and the validity of the method without it warrants further investigation. We let ℙn denote the empirical measure generated by 𝒟, and 𝔾n = n^{1/2}(ℙn − ℙ). We use →p to denote convergence in probability.

We first show that θ̃* →p θ0, where θ̃* is the perturbed initial parameter estimate obtained by minimizing the perturbed unregularized objective function L̃*(θ) in (3). For f ∈ {L(θ; D) : θ ∈ Ω}, denote the perturbed empirical measure by ℙ*n f = n^{−1}Σ_{i=1}^n G_i f(X_i), and let L̃(θ) = ℙn{L(θ; D)}. Since {L(θ; D) : θ ∈ Ω} is ℙ-Glivenko–Cantelli, by Corollary 10.14 of Kosorok (2008),

sup_θ |L̃*(θ) − ℙ{L(θ; D)}| ≤ sup_θ |L̃*(θ) − L̃(θ)| + sup_θ |L̃(θ) − ℙ{L(θ; D)}| = sup_θ |(ℙ*n − ℙn){L(θ; D)}| + sup_θ |(ℙn − ℙ){L(θ; D)}| → 0.

Then, under condition C1, ℙ{L(θ; D)} has a unique minimum at θ0, and so θ̃* →p θ0 (Newey and McFadden 1994, Theorem 2.1).

We now show that θ̂* →p θ0. First note that sup_θ Σ_{j=1}^p pλn,j(|β̃*j|)|βj| → 0 in probability. When the penalty is Lq, pλn,j(|β̃*j|) = λn|β̃*j|^q; here |β̃*j|^q →p |β0j|^q by the continuous mapping theorem, and λn → 0. For the SCAD penalty, pλn,j(|β̃*j|) = λn I(|β̃*j| ≤ λn) + (aλn − |β̃*j|)+ I(|β̃*j| > λn)/(a − 1). We consider two cases: (i) β0j ≠ 0, and (ii) β0j = 0. In case (i), λn → 0 and β̃*j →p β0j, so I(|β̃*j| ≤ λn) →p 0 and (aλn − |β̃*j|)+ →p 0. In case (ii), λn → 0 and (aλn − |β̃*j|)+ ≤ aλn → 0. Finally, for the ALASSO penalty, pλn,j(|β̃*j|) = λn(n^{1/2}|β̃*j|)^{−1}, where (n^{1/2}|β̃*j|)^{−1} = OP(1) and λn → 0. Then, since θ lies in a compact set, Σ_{j=1}^p pλn,j(|β̃*j|)|βj| ≤ τ Σ_{j=1}^p pλn,j(|β̃*j|) ≡ τBn, where τ = max_j sup_{θ∈Ω} |βj| and Bn = oP(1), since pλn,j(|β̃*j|) →p 0 for each j; hence sup_θ Σ_{j=1}^p pλn,j(|β̃*j|)|βj| →p 0 (Newey and McFadden 1994, Lemma 2.9). Now, with arguments similar to those used for θ̃* →p θ0, sup_θ |L̂*(θ) − ℙ{L(θ; D)}| ≤ sup_θ |L̃*(θ) − ℙ{L(θ; D)}| + sup_θ Σ_{j=1}^p pλn,j(|β̃*j|)|βj| → 0 in probability. This implies θ̂* →p θ0.

We next show that ||θ̂* − θ0|| = OP(n^{−1/2}). It suffices to show that for any ε > 0, there exists C > 0 such that

ℙ*{ inf_{||θ − θ0|| = Cn^{−1/2}} L̂*(θ) > L̂*(θ0) } > 1 − ε.  (A.1)

Consider θ = θ0 + n^{−1/2}u. Condition C3(c) implies that

[ℙn{L(θ0 + n^{−1/2}u; D) − L(θ0; D) − n^{−1/2}u′U(θ0; D)} − ½n^{−1}u′Au] / ||n^{−1/2}u|| = oP(1)

uniformly in u. By the multiplier central limit theorem (Kosorok 2008, Theorem 10.1),

[ℙn({L(θ0 + n^{−1/2}u; D) − L(θ0; D) − n^{−1/2}u′U(θ0; D)}G) − ½n^{−1}u′Au] / ||n^{−1/2}u|| = oP*(1)

uniformly in u. It follows that, uniformly for θ ∈ {θ : ||θ − θ0|| ≤ n^{−1/2}||u||},

L̃*(θ0 + n^{−1/2}u) − L̃*(θ0) = n^{−1/2}ℙn{U(θ0; D)G}′u + ½n^{−1}u′Au + oP*(n^{−1}||u||),  (A.2)

and thus we may approximate n{L̂*(θ0 + n^{−1/2}u) − L̂*(θ0)} with 𝔾n{U(θ0; D)G}′u + ½u′Au + nΣ_{j=1}^p pλn,j(|β̃*j|)(|β0j + n^{−1/2}uj| − |β0j|) + oP*(||u||² + ||u||).

Now we show the consistency of variable selection, that is, ℙ*(θ̂*𝒜c = 0) → 1 as n → ∞. It suffices to show that for any constant C and any given θ̃𝒜 such that ||θ̃𝒜 − θ0𝒜|| = OP*(n^{−1/2}),

ℙ*[ argmin_{||θ𝒜c|| ≤ Cn^{−1/2}} L̂*{(θ̃𝒜, θ𝒜c)} = 0 ] → 1.  (A.3)

Let ũ𝒜 and u𝒜c denote n^{1/2}(θ̃𝒜 − θ0𝒜) and n^{1/2}θ𝒜c, respectively. It follows from (A.2) that

n[L̂*{(θ0𝒜 + n^{−1/2}ũ𝒜, n^{−1/2}u𝒜c)} − L̂*{(θ0𝒜 + n^{−1/2}ũ𝒜, 0)}] = [𝔾n{U(θ0; D)𝒜c G}′ + ũ𝒜′A12]u𝒜c + ½u𝒜c′A22 u𝒜c + nΣ_{j∈𝒜c} pλn,j(|β̃*j|)n^{−1/2}|uj| + oP*(||u𝒜c||² + ||u𝒜c||) = Σ_{j∈𝒜c} n^{1/2}pλn,j(|β̃*j|)|uj| + Rn(u𝒜c),

where sup_{||u𝒜c|| ≤ C} |Rn(u𝒜c)|/(||u𝒜c||² + ||u𝒜c||) = OP*(1). Zou and Li (2008) consider the limiting behavior of n^{1/2}pλn,j(|β̃j|) for the SCAD and Lq penalties in their proof of the oracle properties of the one-step LLA estimator. They show that in both cases, when j ∈ 𝒜c, n^{1/2}pλn,j(|β̃*j|) →p ∞. Additionally, for the ALASSO penalty, n^{1/2}pλn,j(|β̃*j|) = n^{1/2}λn(n^{1/2}|β̃*j|)^{−1}; when j ∈ 𝒜c, we have n^{1/2}λn → ∞ and n^{1/2}β̃*j = OP*(1). Hence, for all three types of penalties, n^{1/2}pλn,j(|β̃*j|) →p ∞. Thus, for any ε > 0, there exist C1 > C0 > 0 and N0 such that ℙ*{Σ_{j∈𝒜c} n^{1/2}pλn,j(|β̃*j|)|uj| ≥ C1 Σ_{j∈𝒜c}|uj|} ≥ 1 − ε and ℙ*{C0 Σ_{j∈𝒜c}|uj| ≥ |Rn(u𝒜c)|} ≥ 1 − ε for ||u𝒜c|| ≤ C and n ≥ N0. This implies that, with probability greater than 1 − 2ε, n[L̂*{(θ̃𝒜, n^{−1/2}u𝒜c)} − L̂*{(θ̃𝒜, 0)}] ≥ (C1 − C0)Σ_{j∈𝒜c}|uj| ≥ 0, which implies (A.3).

Lastly, we justify the oracle property of θ̂*𝒜. Since ℙ*(θ̂*𝒜c = 0) → 1, θ̂*𝒜 can be considered as the minimizer of L̂*𝒜(θ𝒜) = L̂*{(θ𝒜, 0)}. Following the approach of Zou (2006), we consider the reparametrization

L̂*𝒜(θ0𝒜 + n^{−1/2}u𝒜) = ℙn[ L{(θ0𝒜 + n^{−1/2}u𝒜, 0); D}G ] + Σ_{j∈𝒜} pλn,j(|β̃*j|)|β0j + n^{−1/2}uj|.  (A.4)

Let û*𝒜(n) = argmin_{u𝒜} L̂*𝒜(θ0𝒜 + n^{−1/2}u𝒜). Note that û*𝒜(n) = n^{1/2}(θ̂*𝒜 − θ0𝒜) is also the minimizer of Vn(u𝒜) ≡ L̂*𝒜(θ0𝒜 + n^{−1/2}u𝒜) − L̂*(θ0), as L̂*(θ0) is a constant. Again, it follows from (A.2) that

Vn(u𝒜) = n^{1/2}u𝒜′ℙn{U𝒜(θ0; D)G} + ½u𝒜′A11 u𝒜 + nΣ_{j∈𝒜} pλn,j(|β̃*j|)(|β0j + n^{−1/2}uj| − |β0j|) + oP*(||u𝒜||² + ||u𝒜||).

To examine the limiting behavior of the third term of Vn(u𝒜), note that for j ∈ 𝒜 we have β0j ≠ 0 and n^{1/2}(|β0j + n^{−1/2}uj| − |β0j|) → uj sgn(β0j). Also, as Zou and Li (2008) proved in their appendix, n^{1/2}pλn,j(|β̃*j|) →p 0 when j ∈ 𝒜 for the SCAD and Lq penalties. For the ALASSO penalty, n^{1/2}pλn,j(|β̃*j|) = λn|β̃*j|^{−1}, with λn → 0 and |β̃*j|^{−1} →p |β0j|^{−1} for β0j ≠ 0. Therefore, by Slutsky's theorem, nΣ_{j∈𝒜} pλn,j(|β̃*j|)(|β0j + n^{−1/2}uj| − |β0j|) = oP*(1) and

Vn(u𝒜) = u𝒜′𝔾n{U𝒜(θ0; D)G} + ½u𝒜′A11 u𝒜 + oP*(1 + ||u𝒜||² + ||u𝒜||).

Thus, û*𝒜(n) = −A11^{−1}𝔾n{U𝒜(θ0; D)G} + oP*(1). Since 𝔾n{U𝒜(θ0; D)G} converges in distribution to N(0, B11), we have n^{1/2}(θ̂*𝒜 − θ0𝒜) →d N(0, A11^{−1}B11 A11^{−1}) and ℙ*(θ̂*𝒜c = 0) → 1. Thus the perturbed regularized estimator θ̂* is asymptotically normal on the set of true nonzero parameters.

Similar arguments as given above, along with the conditional multiplier central limit theorem (Kosorok 2008, Theorem 10.4), can be used to justify that the distribution of n^{1/2}(θ̂* − θ̂) | X approximates that of n^{1/2}(θ̂ − θ0). Specifically, we can similarly obtain n^{1/2}(θ̂𝒜 − θ0𝒜) = −A11^{−1}𝔾n{U𝒜(θ0; D)} + oP(1) and ℙ*(θ̂*𝒜c = 0) → 1. Therefore, n^{1/2}(θ̂*𝒜 − θ̂𝒜) = −A11^{−1}𝔾n{U𝒜(θ0; D)(G − 1)} + oP*(1). Since −A11^{−1}𝔾n{U𝒜(θ0; D)(G − 1)} | X →d N(0, A11^{−1}B̂11 A11^{−1}) and B̂11 →p B11, n^{1/2}(θ̂*𝒜 − θ̂𝒜) | X and n^{1/2}(θ̂𝒜 − θ0𝒜) converge in distribution to the same limit. Furthermore, ℙ*(θ̂*𝒜c = 0 | X) → 1.
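The multiplier idea underlying this argument can be seen in a toy example with no penalty at all, the sample mean, where the conditional spread of the perturbed estimator around θ̂ matches the sampling variability of θ̂ around θ0 (our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy multiplier bootstrap for the sample mean: conditional on the data,
# theta* = sum_i G_i X_i / sum_i G_i fluctuates around theta_hat with the
# same sqrt(n)-scale variability that theta_hat has around theta_0.
n, M = 400, 2000
x = rng.exponential(2.0, size=n)                 # data, theta_0 = 2
theta_hat = x.mean()
G = rng.exponential(1.0, size=(M, n))            # multipliers, E G = Var G = 1
theta_star = (G * x).sum(axis=1) / G.sum(axis=1)
se_perturb = theta_star.std(ddof=1)              # perturbation-based SE
se_plugin = x.std(ddof=1) / np.sqrt(n)           # analytic benchmark
```

The two standard errors agree closely, which is the one-dimensional, penalty-free version of the equivalence of limits established above.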

A.2 Choice of thresholding values high and low for confidence regions

We choose the thresholding values p̂high and p̂low to converge at rates tied to the order of the tuning parameter λ and bounded using the probability that the perturbed estimates are set to zero. For illustration, consider the univariate β case with one predictor under orthonormal design. Consider the standardized parameters γ̂ = β̂/σ, γ = β/σ, and λ̃n = λn/σ², where λn → 0 and n^{1/2}λn → ∞. Then

γ̂ ∼̇ N(γ0, n^{−1}),  γ̂1 = γ̂(1 − λ̃n/γ̂²)+,  γ*1 ∼̇ γ*(1 − λ̃n/γ*²)+,

where γ* ∼̇ N(γ̂, 1/n). Thus γ*1 = 0 with probability

P̂0 = ℙ*{|γ*| < λ̃n^{1/2}} = Φ{n^{1/2}(λ̃n^{1/2} − γ̂)} − Φ{−n^{1/2}(λ̃n^{1/2} + γ̂)}.

First consider nonzero parameters. Without loss of generality, assume that γ0 > 0. Let ε = 2λ̃n^{1/2} and assume that γ0 > 2ε. Then

E(P̂0) = E[Φ{n^{1/2}(λ̃n^{1/2} − γ̂)} − Φ{n^{1/2}(−λ̃n^{1/2} − γ̂)}] ≤ [Φ{n^{1/2}(λ̃n^{1/2} − ε)} − Φ{n^{1/2}(−λ̃n^{1/2} − ε)}]ℙ(γ̂ > ε) + ℙ(γ̂ ≤ ε) ≤ 2Φ(−n^{1/2}ε) ≤ (2/π)^{1/2} exp(−nλ̃n) = (2/π)^{1/2} exp(−nλn/σ²).

Thus, we propose the lower threshold p̂low = min[0.49, (2/π)^{1/2} exp{−nλn/(4σ²)}], so that p̂low ≥ (2/π)^{1/2} exp(−nλn/σ²). On the other hand, if γ0 = 0, then

E(1 − P̂0) = E[Φ{n^{1/2}(−λ̃n^{1/2} + γ̂)} + Φ{n^{1/2}(−λ̃n^{1/2} − γ̂)}] ≤ 2Φ(−n^{1/2}λ̃n^{1/2}/2) ≤ (2/π)^{1/2} exp(−nλ̃n/4) = (2/π)^{1/2} exp{−nλn/(4σ²)}.

Thus, we choose p̂high = 1 − (2/π)^{1/2} exp(−nλn/σ²), so that p̂high ≥ 1 − (2/π)^{1/2} exp{−nλn/(4σ²)}. Note that we chose p̂low and p̂high such that p̂low goes to 0 at a much slower rate than P̂0 when γ0 ≠ 0. On the other hand, when γ0 = 0, P̂0 goes to 1 at a much faster rate than p̂high, and thus P̂0 > min(1 − α, p̂high) occurs with probability approaching 1 as n → ∞, for any fixed α > 0. Consequently, P̂0 > p̂high indicates strong evidence that γ0 = 0. When σ is unknown, it is replaced by a consistent estimate σ̂.
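The thresholding probability P̂0 in this subsection can be evaluated in closed form; the following sketch uses an illustrative rate λ̃n = n^{−0.4}, chosen only so that λ̃n → 0 and n^{1/2}λ̃n → ∞ hold along the sequence:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_zero(gamma_hat, lam_tilde, n):
    """P(gamma*_1 = 0) from the display above: the probability that the
    perturbed univariate ALASSO estimate is thresholded to zero."""
    r = sqrt(lam_tilde)
    rn = sqrt(n)
    return Phi(rn * (r - gamma_hat)) - Phi(-rn * (r + gamma_hat))

# Illustrative rate: lam_tilde = n^{-0.4} -> 0 while sqrt(n)*lam_tilde -> inf
n = 200
lt = n ** -0.4
```

At this sample size the probability is essentially 1 for a zero parameter and essentially 0 for a parameter well separated from zero, which is what makes the p̂low and p̂high thresholds discriminate between the two cases.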

A.3 Justification of highest density region and bias estimate

For j ∈ 𝒜c, ℙ*(β̂*j = 0) → 1, and thus for any α > 0, ℙ*(P̂j0 > α) → 1 and ℙ(P̂j0 < p̂high) + ℙ(P̂j0 < p̂low) → 0, where P̂j0 denotes the proportion of perturbed estimates of βj equal to zero. Hence, ℙ(0 ∈ CR*jHDR) → 1; we include {0} in our CR when P̂j0 > p̂low, and the coverage of CR*jHDR converges to 1 when β0j = 0. For j ∈ 𝒜, P̂j0 →p 0, and the perturbed estimates converge to a continuous distribution; specifically, n^{1/2}(β̂*j − β̂j) | X →d N(0, σj²), where σj² is the asymptotic variance of n^{1/2}(β̂j − β0j). It follows that sup_x |n^{−1/2}f̂j(β̂j + n^{−1/2}x) − φσj(x)| →p 0, where φσ(x) = φ(x/σ)/σ and φ(·) is the density function of the standard normal. Therefore, sup_β |n^{−1/2}f̂j(β) − φσj{n^{1/2}(β − β̂j)}| →p 0 and n^{−1/2}ĉ3 →p c30, where c30 is the solution to ∫ I{φσj(x) > c30}φσj(x)dx = 1 − α. It follows that the coverage of our CR converges to the nominal level since, with respect to the probability measure ℙ*, pr(β0j ∈ CR*jHDR) = pr{f̂j(β0j) ≥ ĉ3} + oP(1) = pr{n^{−1/2}f̂j(β0j) ≥ n^{−1/2}ĉ3} + oP(1) = pr[φσj{n^{1/2}(β0j − β̂j)} ≥ c30] + oP(1) → 1 − α.
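A self-contained sketch of building such a highest-density region from perturbed draws (our own simplified construction: a hand-rolled Gaussian KDE with Silverman's bandwidth and an empirical density cutoff; it reports the hull of the retained points even though a true HDR may be noncontiguous):

```python
import numpy as np

rng = np.random.default_rng(3)

def hdr_pieces(samples, alpha=0.05, p_low=0.05):
    """Sketch of an HDR from perturbed draws: handle the point mass at zero
    separately, then keep the continuous draws whose estimated density
    exceeds the alpha-quantile of the density values at the draws."""
    samples = np.asarray(samples, dtype=float)
    p0 = np.mean(samples == 0.0)                     # mass at exactly zero
    nz = samples[samples != 0.0]
    h = 1.06 * nz.std(ddof=1) * len(nz) ** (-0.2)    # Silverman bandwidth
    dens = np.exp(-0.5 * ((nz[:, None] - nz) / h) ** 2).mean(axis=1)
    dens /= h * np.sqrt(2.0 * np.pi)                 # Gaussian KDE at draws
    c = np.quantile(dens, alpha)                     # density cutoff
    kept = nz[dens >= c]
    return (kept.min(), kept.max()), p0 > p_low      # interval, include {0}?

draws = np.concatenate([np.zeros(50), rng.normal(1.0, 0.1, 450)])
(lo, hi), with_zero = hdr_pieces(draws)
```

Cutting at a density level rather than at symmetric quantiles is what allows the region to be short and, when the perturbed distribution is skewed or bimodal, asymmetric or noncontiguous.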

Here we define our bias-corrected estimator for β0j as β̂jBC = β̂j + I(β̂j ≠ 0)·b̂iasj, where b̂iasj = (M^{−1}Σ_{m=1}^M β̂*j,m) × (−1)^{I[Σ_{m=1}^M {I(β̂*j,m > 0) − I(β̂*j,m < 0)} < 0]} × (Âλ^{−1})jj/{n max(ξ̂7.5, ξ̂97.5)}, Âλ = n^{−1}(X𝒜̂′X𝒜̂ + n^{−1/2}λn diag{1/β̃j²}_{j=1}^p), and ξ̂r is the rth percentile of {β̂*j,m, m = 1, …, M}. We estimate A for ALASSO with Âλ following the methods of Cai et al. (2009), where a stabilized estimate of the covariance of coefficients from an accelerated failure time model is used.

A.4 Selection of λ with Bayes Information Criterion

In Section 2, we suggest choosing the tuning parameter λn by minimizing the BIC. Here we explicitly present the BIC for the linear regression objective function and the ALASSO penalty used in the simulations and data example of Sections 3 and 4. First, assume the data have been centered so that there is no intercept. We implement a least-squares approximation of the likelihood for BIC(λ), as in Wang and Leng (2007). For a given λ,

BIC(λ) = (β̂(λ) − β̃)′Σ̂λ^{−1}(β̂(λ) − β̃) + q̂λ ωn,

where β̂(λ) minimizes the least-squares objective function L̂(β) = (y − Xβ)′(y − Xβ) + Σ_{j=1}^p pλ(|β̃j|)|βj| based on (1), Σ̂λ^{−1} = (σ̂²n)^{−1}[X′X + λ diag{I(β̂j(λ) ≠ 0)/|β̃j β̂j(λ)|}_{j=1}^p] is a stabilized estimate of the inverse covariance similar to that in Zou (2006), σ̂² is a consistent estimate of σ² based on the residual variance from the linear regression model, and q̂λ estimates the degrees of freedom of ALASSO by the number of nonzero elements of β̂(λ) (Zou et al. 2007). We choose ωn = min(n^{0.1}, log(n)) because numerical results suggest that log(n) is much greater than n^{0.1} for practical sample sizes and leads to excessive shrinkage of moderately sized parameters in finite samples.
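A simplified stand-in for this selection rule (our own sketch: it replaces the LSA quadratic form with the classical residual-based log-likelihood term but keeps q̂λ and ωn = min(n^{0.1}, log n) as described above):

```python
import numpy as np

rng = np.random.default_rng(4)

def bic_score(X, y, beta):
    """Simplified BIC-type score for a candidate fit beta: a residual
    log-likelihood term plus q_hat * omega_n (a stand-in for the
    LSA-based criterion in the text, not the exact formula)."""
    n = len(y)
    rss = float(np.sum((y - X @ beta) ** 2))
    q_hat = int(np.count_nonzero(beta))
    omega_n = min(n ** 0.1, np.log(n))
    return n * np.log(rss / n) + q_hat * omega_n

# Pick the candidate fit with the smallest score on toy data
n = 100
X = rng.standard_normal((n, 3))
y = X[:, 0] + 0.1 * rng.standard_normal(n)
candidates = [np.array([1.0, 0.0, 0.0]),     # true sparse model
              np.array([1.0, 0.5, 0.5]),     # badly overfit dense guess
              np.zeros(3)]                   # null model
best = min(candidates, key=lambda b: bic_score(X, y, b))
```

In practice one would evaluate the criterion along a grid or path of λ values and keep the fit β̂(λ) that minimizes the score.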

Contributor Information

Jessica Minnier, Email: jminnier@hsph.harvard.edu, Ph.D. candidate, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115.

Lu Tian, Email: lutian@stanford.edu, Assistant Professor, Department of Health Research & Policy, Stanford University School of Medicine, Palo Alto, CA 94304.

Tianxi Cai, Email: tcai@hsph.harvard.edu, Associate Professor, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115.

References

1. Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics. 2009;65:394–404. doi:10.1111/j.1541-0420.2008.01074.x.
2. Cassidy A, Myles J, van Tongeren M, Page R, Liloglou T, Duffy S, Field J. The LLP risk model: an individual risk prediction model for lung cancer. British Journal of Cancer. 2008;98:270. doi:10.1038/sj.bjc.6604158.
3. Chatterjee A, Lahiri S. Asymptotic properties of the residual bootstrap for lasso estimators. Proceedings of the American Mathematical Society. 2010 (accepted).
4. Dent R, Trudeau M, Pritchard K, Hanna W, Kahn H, Sawka C, Lickley L, Rawlinson E, Sun P, Narod S. Triple-negative breast cancer: clinical features and patterns of recurrence. Clinical Cancer Research. 2007;13:4429. doi:10.1158/1078-0432.CCR-06-3045.
5. Dosaka-Akita H, Hommura F, Mishina T, Ogura S, Shimizu M, Katoh H, Kawakami Y. A risk-stratification model of non-small cell lung cancers using cyclin E, Ki-67, and ras p21: different roles of G1 cyclins in cell proliferation and prognosis. Cancer Research. 2001;61:2500.
6. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407–451.
7. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
8. Fan J, Li R. Variable Selection for Cox's Proportional Hazards Model and Frailty Model. The Annals of Statistics. 2002;30:74–99.
9. Fan J, Li R. New Estimation and Model Selection Procedures for Semiparametric Modeling in Longitudinal Data Analysis. Journal of the American Statistical Association. 2004;99:710–723.
10. Fan J, Peng H. On Nonconcave Penalized Likelihood With Diverging Number of Parameters. The Annals of Statistics. 2004;32:928–961.
11. Freedman A, Slattery M, Ballard-Barbash R, Willis G, Cann B, Pee D, Gail M, Pfeiffer R. Colorectal cancer risk prediction tool for white men and women without known susceptibility. Journal of Clinical Oncology. 2009;27:686. doi:10.1200/JCO.2008.17.4797.
12. Gail M, Brinton L, Byar D, Corle D, Green S, Schairer C, Mulvihill J. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute. 1989;81:1879. doi:10.1093/jnci/81.24.1879.
13. Gail M, Costantino J. Validating and improving models for projecting the absolute risk of breast cancer. Journal of the National Cancer Institute. 2001;93:334. doi:10.1093/jnci/93.5.334.
14. Jin Z, Ying Z, Wei L. A simple resampling method by perturbing the minimand. Biometrika. 2001;88:381–390.
15. Johnson V, Brun-Vézinet F, Clotet B, Conway B, Kuritzkes D, Pillay D, Schapiro J, Telenti A, Richman D. Update of the drug resistance mutations in HIV-1: Fall 2005. Top HIV Med. 2005;13:125–131.
16. Knight K, Fu W. Asymptotics for Lasso-Type Estimators. The Annals of Statistics. 2000;28:1356–1378.
17. Kosorok M. Introduction to Empirical Processes and Semiparametric Inference. New York: Springer; 2008.
18. Newey W, McFadden D. Large sample estimation and hypothesis testing. Handbook of Econometrics. 1994;4:2111–2245.
19. Perou C, Sørlie T, Eisen M, van de Rijn M, Jeffrey S, Rees C, Pollack J, Ross D, Johnsen H, Akslen L, et al. Molecular portraits of human breast tumours. Nature. 2000;406:747–752. doi:10.1038/35021093.
20. Pötscher BM, Schneider U. On the distribution of the adaptive LASSO estimator. Journal of Statistical Planning and Inference. 2009;139:2775–2790.
21. Pötscher BM, Schneider U. Confidence Sets Based on Penalized Maximum Likelihood Estimators in Gaussian Regression. Electronic Journal of Statistics. 2010;4:334–360.
22. Prado J, Wrin T, Beauchaine J, Ruiz L, Petropoulos C, Frost S, Clotet B, D'Aquila R, Martinez-Picado J. Amprenavir-resistant HIV-1 exhibits lopinavir cross-resistance and reduced replication capacity. AIDS. 2002;16:1009. doi:10.1097/00002030-200205030-00007.
23. Rhee S, Gonzales M, Kantor R, Betts B, Ravela J, Shafer R. HIV reverse transcriptase and sequence database. Nucleic Acids Res. 2003;31:298–303. doi:10.1093/nar/gkg100.
24. Schumi J, DeGruttola V. Resampling-based analyses of the effects of combinations of HIV genetic mutations on drug susceptibility. Statistics in Medicine. 2008;27. doi:10.1002/sim.3181.
25. Scott D. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley-Interscience; 1992.
26. Silverman B. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC; 1986.
27. Spiegelman D, Colditz G, Hunter D, Hertzmark E. Validation of the Gail et al. model for predicting individual breast cancer risk. Journal of the National Cancer Institute. 1994;86:600. doi:10.1093/jnci/86.8.600.
28. Thompson I, Ankerst D, Chi C, Goodman P, Tangen C, Lucia M, Feng Z, Parnes H, Coltman C, Jr. Assessing prostate cancer risk: results from the Prostate Cancer Prevention Trial. Journal of the National Cancer Institute. 2006;98:529. doi:10.1093/jnci/djj131.
29. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
30. Van Marck H, Dierynck I, Kraus G, Hallenberger S, Pattery T, Muyldermans G, Geeraert L, Borozdina L, Bonesteel R, Aston C, et al. The Impact of Individual Human Immunodeficiency Virus Type 1 Protease Mutations on Drug Susceptibility Is Highly Influenced by Complex Interactions with the Background Protease Sequence. Journal of Virology. 2009;83:9512. doi:10.1128/JVI.00291-09.
31. Wang H, Leng C. Unified LASSO estimation via least squares approximation. Journal of the American Statistical Association. 2007;102:1039–1048.
32. Wu M. A parametric permutation test for regression coefficients in LASSO regularized regression. PhD thesis, Department of Biostatistics, Harvard School of Public Health, Boston, MA; 2009.
33. Zhang H, Ahn J, Lin X, Park C. Gene Selection using Support Vector Machines with Non-convex Penalty. Bioinformatics. 2006;22:88–95. doi:10.1093/bioinformatics/bti736.
34. Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101:1418–1429.
35. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.
36. Zou H, Hastie T, Tibshirani R. On the "degrees of freedom" of the lasso. Annals of Statistics. 2007;35:2173–2192.
37. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 2008;36:1509–1533. doi:10.1214/009053607000000802.
38. Zou H, Zhang H. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics. 2009;37:1733–1751. doi:10.1214/08-AOS625.
