Published in final edited form as: J Am Stat Assoc. 2016 Jan 15;110(512):1422–1433. doi: 10.1080/01621459.2015.1095099

An adaptive resampling test for detecting the presence of significant predictors

Ian W. McKeague*, Min Qian

Abstract

This paper investigates marginal screening for detecting the presence of significant predictors in high-dimensional regression. Screening large numbers of predictors is a challenging problem due to the non-standard limiting behavior of post-model-selected estimators. There is a common misconception that the oracle property for such estimators is a panacea, but the oracle property only holds away from the null hypothesis of interest in marginal screening. To address this difficulty, we propose an adaptive resampling test (ART). Our approach provides an alternative to the popular (yet conservative) Bonferroni method of controlling family-wise error rates. ART is adaptive in the sense that thresholding is used to decide whether the centered percentile bootstrap applies; otherwise it adapts to the non-standard asymptotics in the tightest way possible. The performance of the approach is evaluated using a simulation study, and the method is illustrated with applications to gene expression data and HIV drug resistance data.

Keywords: Bootstrap, Family-wise error rate, Marginal regression, Non-regular asymptotics, Screening covariates

1 Introduction

The problem of selecting significant predictors is a central aspect of scientific discovery, and has become increasingly important in an era in which massive data sets are readily available (Fan and Li, 2006). Much of the modern statistical literature in this area focuses on consistency of variable selection in high-dimensional settings based on machine learning and data mining techniques (e.g., Fan and Li 2001; Zou and Hastie 2005; Huang et al. 2008; Fan and Lv 2008; Genovese et al. 2012). A major gap in this literature, however, has been the scarcity of formal hypothesis testing procedures that take variable selection into account; the oracle property enjoyed by many variable selection methods in the presence of high dimensionality cannot be applied directly for testing whether a post-model-selected variable is significant. In bioinformatics, for example, variable selection techniques based on penalization (such as the lasso, SCAD, etc.) are routinely used to produce lists of differentially-expressed genes that are most related to disease risk, but few methods for obtaining valid p-values have been developed.

A more traditional approach to the selection of significant predictors is multiple testing to control either family-wise error rate (FWER), or false-discovery rate (Benjamini and Hochberg 1995; Dudoit et al. 2003; Efron 2006; Dudoit and van der Laan 2008; Efron 2010). Procedures that control FWER (e.g., Bonferroni, or Holm's procedure) are often criticized as being too conservative (in the sense of having low power). False-discovery rate methods, on the other hand, although having greater power, incur the cost of inflated FWER. Our aim in the present paper is to introduce a more powerful single test that can be used as an alternative screening procedure to detect the presence of some significant predictor while rigorously controlling FWER.

The proposed procedure uses marginal linear regression to select the predictor (from among covariates $X_1, \ldots, X_p$) that has maximal sample correlation with a scalar outcome Y (as in marginal screening or correlation learning, Genovese et al. 2012). The test is based on $\hat\theta_n$, the estimated marginal regression coefficient of the selected predictor. If there is a unique predictor, say $X_{k_0}$, maximally correlated with the outcome, then the selection procedure consistently estimates $k_0$, and $\hat\theta_n$ is asymptotically normal; if all predictors are uncorrelated with the outcome, then the selected predictor does not converge (in probability) and $\hat\theta_n$ has a non-normal limiting distribution. In particular, the limiting distribution is discontinuous (at zero) as a function of the regression coefficient of $X_{k_0}$ (where $k_0$ is not identifiable), and this "non-regularity" causes non-uniform convergence.

Breiman (1992) drew early attention to the issue of invalid post-model-selection inference, calling it the “quiet scandal” of Statistics; even earlier references are mentioned in Berk et al. (2013). Samworth (2003) gave a detailed account of the inaccuracy of bootstrap methods applied to super-efficient estimators. Leeb and Pötscher (2006) (and other papers by the same authors) established that non-uniform limiting behavior of post-model-selected estimators is at the root of the problem, and that estimates of asymptotic null distributions in such settings can give a misleading picture of finite-sample performance. In particular, calibrating a test based on θ^n in a way that does not adapt to the implicit post-model-selection will be extremely inaccurate. This type of non-regularity occurs in various other settings as well, e.g., when a nuisance parameter is only defined under an alternative hypothesis (Davies, 1977), and when the parameter of interest under the null hypothesis is on the boundary of the parameter space (Andrews, 2000). McCloskey (2012) surveyed non-standard testing problems in econometrics, and introduced some Bonferroni-based size-correction methods designed to improve power. As far as we know, however, there is not yet a resolution of these issues for marginal screening.

In this paper we introduce an adaptive resampling test (ART) for marginal screening that adapts to the small sample behavior of θ^n in terms of a local model. Under local alternatives, we find an explicit representation of the asymptotic distribution of θ^n and construct a suitable bootstrap estimator of this distribution that is consistent, thus circumventing the non-regularity mentioned above. Under non-local alternatives, we show that the critical values obtained in this way agree asymptotically with those used by the oracle (who is given knowledge of k0), so ART can be expected to provide good power as well.

Several new approaches to post-model selection inference for linear regression have been proposed in recent years. Meinshausen et al. (2009) introduced a random sample splitting procedure in the high-dimensional setting to obtain (conservative) Bonferroni-adjusted p-values following variable selection. Chatterjee and Lahiri (2011) developed a modified bootstrap method that provides an asymptotically valid confidence region for the regression parameters based on the lasso estimator; this method depends on the presence of at least one active predictor, so it is not applicable to marginal screening (under the null hypothesis there is no active predictor).

More relevant to marginal screening, the covariance test recently introduced by Lockhart et al. (2014) uses a forward stepwise lasso procedure to test for active predictors entering a sparse linear model under the assumption of normal errors. Also in the sparse linear model setting with normal errors, but further assuming that the predictors are nearly uncorrelated, Ingster et al. (2010) and Arias-Castro et al. (2011) have studied the detection boundary and optimality properties of general classes of multiple testing procedures (including Bonferroni and Higher Criticism). Berk et al. (2013) developed a valid method of post-model selection inference that is feasible for up to about p = 20 predictors, also assuming normal errors. In various sparse high-dimensional settings, Belloni et al. (2013), Bühlmann (2013), Zhang and Zhang (2014) and Ning and Liu (2015) have established asymptotically valid confidence intervals for a preconceived regression parameter after variable selection on the remaining predictors, but this does not apply to marginal screening (where no regression parameter is singled-out a priori).

This paper is organized as follows. We formulate the problem and discuss the issue of non-regularity in Section 2. In Section 3, we develop the ART procedure and establish the consistency of the underlying bootstrap. Simulation studies and applications to gene expression data and HIV drug resistance data are presented in Section 4. Concluding discussion appears in Section 5, and proofs are collected in the Appendix.

2 Marginal regression and non-regularity

Consider a scalar outcome Y and a p-dimensional vector of covariates X = (X1, . . . , Xp)T such that the marginal variance of each covariate is finite and non-zero. Marginal regression consists in using separate linear models to predict Y from each Xk. Let k0 be the label of a covariate that maximizes the absolute correlation with Y :

$$k_0 \in \operatorname*{arg\,max}_{k=1,\ldots,p} |\mathrm{Corr}(X_k, Y)|,$$

and let α0 + θ0Xk0 be the best linear predictor based on Xk0, i.e.,

$$(\alpha_0, \theta_0) = \operatorname*{arg\,min}_{\alpha,\theta \in \mathbb{R}} E(Y - \alpha - \theta X_{k_0})^2 = \left(EY - \theta_0 EX_{k_0},\ \frac{\mathrm{Cov}(X_{k_0}, Y)}{\mathrm{Var}(X_{k_0})}\right). \tag{1}$$

We are interested in testing whether at least one of the covariates is correlated with Y, for which it suffices to check whether Xk0 and Y are correlated. This is equivalent to testing

$$H_0\colon \theta_0 = 0 \quad \text{versus} \quad H_a\colon \theta_0 \neq 0.$$

Given an iid sample of size n, let $\hat\alpha_n$, $\hat\theta_n$, and $\hat k_n$ be the least squares estimates of $\alpha_0$, $\theta_0$, and $k_0$, respectively:

$$\hat\alpha_n = \mathbb{P}_n Y - \hat\theta_n \mathbb{P}_n X_{\hat k_n}, \qquad \hat\theta_n = \frac{\widehat{\mathrm{Cov}}(X_{\hat k_n}, Y)}{\widehat{\mathrm{Var}}(X_{\hat k_n})}, \qquad \hat k_n \in \operatorname*{arg\,max}_{k=1,\ldots,p} |\widehat{\mathrm{Corr}}(X_k, Y)|,$$

where $\mathbb{P}_n$ is the empirical distribution, and the hats indicate sample versions. It is natural to base the test on $\hat\theta_n$, but calibration is problematic because the distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$ does not converge uniformly with respect to $\theta_0$, as mentioned in the Introduction. The non-uniformity occurs in the neighborhood of $\theta_0 = 0$. Specifically, there exists a bounded continuous function $h\colon \mathbb{R} \to \mathbb{R}$ such that $f_n(\theta_0) \equiv Eh(\sqrt{n}(\hat\theta_n - \theta_0))$ does not converge uniformly in any neighborhood of $\theta_0 = 0$, despite converging pointwise. To see this, first note that under mild conditions

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} U \equiv \begin{cases} \dfrac{Z_{k_0}}{V_{k_0}} & \text{if } \theta_0 \neq 0, \\[1ex] \dfrac{Z_K}{V_K} & \text{if } \theta_0 = 0, \end{cases}$$

where $V_k = \mathrm{Var}(X_k)$, $K = \operatorname*{arg\,max}_{k=1,\ldots,p} Z_k^2/V_k$, and $(Z_1, \ldots, Z_p)^T$ is a mean-zero normal random vector with covariance matrix depending on parameters of the full linear model (this is a special case of Theorem 1 below). From the form of the distribution of U, we can choose h so that $f(\theta_0) \equiv Eh(U)$ is discontinuous at $\theta_0 = 0$ (this is the non-regularity mentioned in the Introduction). If $f_n$ were to converge uniformly to $f$ on some compact neighborhood of zero, we would have a contradiction because each $f_n$ is continuous, and the uniform limit of a sequence of continuous functions on a compact interval is continuous.
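To make the selection step concrete, the following is a minimal numerical sketch (ours, not code from the paper) of marginal screening and of the max-type null behavior just described; the function name marginal_screen and the simulated sample sizes are illustrative only.

```python
# Minimal sketch of marginal screening: fit a simple linear regression of Y on
# each X_k and keep the predictor with the largest absolute sample correlation.
import numpy as np

def marginal_screen(X, Y):
    """Return (k_hat, theta_hat, alpha_hat) for the selected predictor."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    cov = Xc.T @ Yc / len(Y)                 # sample Cov(X_k, Y)
    var = Xc.var(axis=0)                     # sample Var(X_k)
    corr = cov / np.sqrt(var * Yc.var())     # sample Corr(X_k, Y)
    k_hat = int(np.argmax(np.abs(corr)))     # selected predictor
    theta_hat = cov[k_hat] / var[k_hat]      # marginal slope estimate
    alpha_hat = Y.mean() - theta_hat * X[:, k_hat].mean()
    return k_hat, theta_hat, alpha_hat

# Under the null (all covariates uncorrelated with Y) the distribution of
# sqrt(n) * theta_hat is a max-type mixture of normals, not a single normal.
rng = np.random.default_rng(0)
n, p = 100, 10
draws = [np.sqrt(n) * marginal_screen(rng.standard_normal((n, p)),
                                      rng.standard_normal(n))[1]
         for _ in range(2000)]
print(np.std(draws))   # noticeably larger spread than a single-slope normal
```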

To address this problem, in the next section we develop a formal test procedure (ART) inspired by work of Cheng (2008, 2015) concerning robust confidence intervals for non-linear regression parameters in the presence of weak identifiability. Other variations of this approach have been used by Laber and Murphy (2011) to construct a confidence interval for the classification error, by Laber et al. (2014) in a sequential decision making problem, and by Laber and Murphy (2013) to provide robust confidence intervals for the adaptive lasso. As already noted, the distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$ does not converge uniformly in the neighborhood of $\theta_0 = 0$, so its small sample behavior can be very far from normal when the true parameter is close to zero. Therefore an understanding of the asymptotic behavior of $\hat\theta_n$ under local alternatives plays a crucial role in devising a suitable test, or more generally in providing robust confidence intervals for $\theta_0$.

3 Adaptive resampling test

In this section, we develop the proposed ART procedure for detecting the presence of a significant predictor. The idea is to adapt to the inherent non-regular behavior of the post-model-selected estimator $\hat\theta_n$ in a way that accurately captures its asymptotic behavior in $n^{-1/2}$-neighborhoods of the null hypothesis.

We frame the problem in terms of the general local linear model

$$Y = \alpha_0 + X^T\beta_n + \epsilon, \tag{2}$$

where $\alpha_0 \in \mathbb{R}$, $\beta_n \in \mathbb{R}^p$, the noise $\epsilon$ has mean 0, finite variance, and is uncorrelated with X, and $\beta_n = \beta_0 + n^{-1/2} b_0$, where $b_0 \in \mathbb{R}^p$ is the local parameter. The distributions of $\epsilon$ and X are assumed to be fixed, so only the distribution of Y depends on n (although we suppress n in the notation for Y). The relevant hypotheses are now

$$H_0\colon \theta_n = 0 \quad \text{versus} \quad H_a\colon \theta_n \neq 0,$$

where $\theta_n = \mathrm{Cov}(X_{k_n}, Y)/\mathrm{Var}(X_{k_n})$ and $k_n$ is the label of a component of X that maximizes the absolute correlation with Y.

Our first result gives the asymptotic distribution of θ^n. To state the result, we need the notation

$$\bar k(b) \in \operatorname*{arg\,max}_{k=1,\ldots,p} |\mathrm{Corr}(X_k, X^T b)|$$

for any $b \in \mathbb{R}^p$. Note that $k_n = \bar k(\beta_n)$ under the local model. If $k_0 \equiv \bar k(\beta_0)$ is unique (so $\beta_0 \neq 0$), then $k_n \to k_0$, and $\theta_n$ is asymptotically bounded away from zero (a non-local alternative). On the other hand, if $\beta_0 = 0$ and $\bar k(b_0)$ is unique, then $k_n = \bar k(b_0)$; also $\theta_n$ is in the neighborhood of zero and represents a local alternative. Finally, if $\beta_0 = b_0 = 0$, then $k_n$ is not well-defined and the null hypothesis $\theta_n = 0$ holds. We need the uniqueness of the most active predictor $k_0$ (away from the null hypothesis), but this seems to be a very mild condition because the likelihood that two or more predictors would have exactly the same maximal correlation with Y seems remote in practice. Moreover, as we will see in the simulation study, non-uniqueness of the maximally correlated predictor does not adversely affect power.

Theorem 1

Suppose that $k_0 = \bar k(\beta_0)$ is unique when $\beta_0 \neq 0$, and $\bar k(b_0)$ is unique when $\beta_0 = 0$ and $b_0 \neq 0$. Then, under the local model (2),

$$\sqrt{n}(\hat\theta_n - \theta_n) \xrightarrow{d} \begin{cases} \dfrac{Z_{k_0}(\beta_0)}{V_{k_0}} & \text{if } \beta_0 \neq 0, \\[1.5ex] \dfrac{Z_K(0)}{V_K} + \left(\dfrac{C_K}{V_K} - \dfrac{C_{\bar k(b_0)}}{V_{\bar k(b_0)}}\right)^T b_0 & \text{if } \beta_0 = 0, \end{cases}$$

where $K = \operatorname*{arg\,max}_{k=1,\ldots,p} [Z_k(0) + C_k^T b_0]^2/V_k$, $C_k = \mathrm{Cov}(X_k, X)$, and $(Z_k(\beta))_{k=1}^p$ is a mean-zero normal random vector with covariance matrix $\Sigma(\beta)$ given by that of the random vector with components

$$\left((X - EX)^T\beta - (X_k - EX_k)\frac{C_k^T\beta}{V_k} + \epsilon\right)(X_k - EX_k),$$

for k = 1, . . . , p, and Σ(β0) is assumed to exist.

The non-regularity at $\beta_0 = 0$ is explained by the dependence of the limiting distribution on the (non-identifiable) local parameter $b_0$. The limiting distribution is nevertheless continuous as a function of $b_0 \in \mathbb{R}^p$ into the space of distribution functions (this is a simple consequence of Lemma 3 in the Appendix), and the convergence is uniform over compact subsets of $\mathbb{R}^p$, unlike the limiting behavior discussed in the previous section, so finite-sample accuracy should be less of an issue when designing a screening test using this result. On the other hand, naive resampling methods that do not take into account the local asymptotic behavior will fail to provide consistent estimates of the distribution of $\sqrt{n}(\hat\theta_n - \theta_n)$, as discussed in the Introduction for the non-local case.

To get around this problem, we decompose $\sqrt{n}(\hat\theta_n - \theta_n)$ in a way that isolates the possibility that $\beta_0 \neq 0$ by comparing $|T_n|$ to some threshold $\lambda_n$ (to be specified later), where $T_n = \hat\theta_n/s_n$ is the post-model-selected t-statistic and $s_n$ is the standard error of the slope estimator when regressing Y on $X_{\hat k_n}$. Specifically,

$$\begin{aligned}
\sqrt{n}(\hat\theta_n - \theta_n) &= \sqrt{n}(\hat\theta_n - \theta_n)\,\mathbb{1}_{|T_n| > \lambda_n \text{ or } \beta_0 \neq 0} + \sqrt{n}(\hat\theta_n - \theta_n)\,\mathbb{1}_{|T_n| \le \lambda_n,\ \beta_0 = 0}\\
&= \sqrt{n}(\hat\theta_n - \theta_n)\,\mathbb{1}_{|T_n| > \lambda_n \text{ or } \beta_0 \neq 0} + \left[\frac{Z_{n,\hat k_n} + \widehat{\mathrm{Cov}}(X_{\hat k_n}, X^T b_0)}{\widehat{\mathrm{Var}}(X_{\hat k_n})} - \frac{\mathrm{Cov}(X_{k_n}, X^T b_0)}{\mathrm{Var}(X_{k_n})}\right]\mathbb{1}_{|T_n| \le \lambda_n,\ \beta_0 = 0},
\end{aligned} \tag{3}$$

where $Z_{n,k} = \mathbb{G}_n[\epsilon(X_k - P_n X_k)]$, $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P_n)$ is the empirical process, and $P_n$ is the distribution of (X, Y). It is clear that the nonparametric bootstrap is consistent for the first term in (3) if $\lambda_n = o(\sqrt{n})$ and $\lambda_n \to \infty$, since it is easily shown that $P(|T_n| > \lambda_n) \to \mathbb{1}_{\beta_0 \neq 0}$. The second term is more problematic, though, because $\hat k_n$ does not converge in probability when $\beta_0 = 0$. Denote the term in the square brackets by $V_n(b)$, indexed by $b = b_0 \in \mathbb{R}^p$. Note that when this term is active (under $\beta_0 = 0$), $\hat k_n = K_n(b_0)$ and $k_n = \bar k(b_0)$, where

$$K_n(b) = \operatorname*{arg\,max}_{k=1,\ldots,p} \frac{[Z_{n,k} + \widehat{\mathrm{Cov}}(X_k, X^T b)]^2}{\widehat{\mathrm{Var}}(X_k)}$$

and

$$\bar k(b) = \operatorname*{arg\,max}_{k=1,\ldots,p} \frac{[\mathrm{Cov}(X_k, X^T b)]^2}{\mathrm{Var}(X_k)},$$

so

$$V_n(b) = \frac{Z_{n,K_n(b)} + \widehat{\mathrm{Cov}}(X_{K_n(b)}, X^T b)}{\widehat{\mathrm{Var}}(X_{K_n(b)})} - \frac{\mathrm{Cov}(X_{\bar k(b)}, X^T b)}{\mathrm{Var}(X_{\bar k(b)})}. \tag{4}$$

All parts of $V_n(b)$ are now seen to be smooth functions of $\mathbb{P}_n$ and $P_n$, so it is reasonable to expect that a consistent bootstrap can be constructed by replacing $\mathbb{P}_n$ by its nonparametric bootstrap version $\mathbb{P}_n^*$, and replacing $P_n$ by $\mathbb{P}_n$. In such a construction, the event indicated in the second term of (3) is naturally replaced by the event that $|T_n^*| \le \lambda_n$ and $|T_n| \le \lambda_n$.

Here and throughout the paper, a superscript * is used to indicate the nonparametric bootstrap (sometimes called "bootstrapping in pairs" in regression settings, to distinguish it from the residual bootstrap). The above arguments lead to our main result showing that $\sqrt{n}(\hat\theta_n - \theta_n)$ can indeed be consistently bootstrapped under the general local model. The precise definition of $V_n^*$ is given at the start of the proof.

Theorem 2

Suppose all assumptions in Theorem 1 hold, and the tuning parameter $\lambda_n$ satisfies $\lambda_n = o(\sqrt{n})$ and $\lambda_n \to \infty$ as $n \to \infty$. Then, under the local model (2),

$$\sqrt{n}(\hat\theta_n^* - \hat\theta_n)\,\mathbb{1}_{|T_n^*| > \lambda_n \text{ or } |T_n| > \lambda_n} + V_n^*(b_0)\,\mathbb{1}_{|T_n^*| \le \lambda_n,\ |T_n| \le \lambda_n}$$

converges to the limiting distribution of $\sqrt{n}(\hat\theta_n - \theta_n)$ conditionally (on the data) in probability.

ART procedure

ART provides a bootstrap calibration for the test statistic $\sqrt{n}\,\hat\theta_n$ based on a special case of the above theorem. Under $H_0$ we have the simplification $V_n^*(b_0) = V_n^*(0)$. For some nominal level $\gamma$, let $c_l$ and $c_u$ be the lower and upper $\gamma/2$ quantiles, respectively, of

$$A_n = \sqrt{n}(\hat\theta_n^* - \hat\theta_n)\,\mathbb{1}_{|T_n^*| > \lambda_n \text{ or } |T_n| > \lambda_n} + V_n^*(0)\,\mathbb{1}_{|T_n^*| \le \lambda_n,\ |T_n| \le \lambda_n}.$$

If $\sqrt{n}\,\hat\theta_n$ falls outside the interval $[c_l, c_u]$, then we reject $H_0$ and conclude that there is at least one significant predictor.
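The following is a minimal sketch (our illustration, not the authors' implementation) of this calibration under $H_0$. It assumes a pre-specified threshold $\lambda_n$ (the double bootstrap used below to choose $\lambda_n$ is omitted), and the helper name art_pvalue is ours. With $b_0 = 0$ the covariance terms in (4) vanish, so $V_n^*(0)$ reduces to the bootstrapped $Z^*_{n,k}$ term maximized over $k$.

```python
# Sketch of the ART bootstrap calibration with a fixed threshold lam.
import numpy as np

def slope_and_tstat(x, y):
    """Marginal slope of y on x and its t-statistic (simple linear regression)."""
    n = len(y)
    xc, yc = x - x.mean(), y - y.mean()
    vx = np.mean(xc ** 2)
    theta = np.mean(xc * yc) / vx
    resid = yc - theta * xc
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / (n * vx))
    return theta, theta / se

def art_pvalue(X, Y, lam, B=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    corr = np.array([np.corrcoef(X[:, k], Y)[0, 1] for k in range(p)])
    k_hat = int(np.argmax(np.abs(corr)))
    theta_hat, T_n = slope_and_tstat(X[:, k_hat], Y)
    eps_hat = Y - Y.mean() - theta_hat * (X[:, k_hat] - X[:, k_hat].mean())
    stat = np.sqrt(n) * theta_hat
    A = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)                          # bootstrap resample
        Xb, Yb, eb = X[idx], Y[idx], eps_hat[idx]
        corr_b = np.array([np.corrcoef(Xb[:, k], Yb)[0, 1] for k in range(p)])
        k_b = int(np.argmax(np.abs(corr_b)))
        theta_b, T_b = slope_and_tstat(Xb[:, k_b], Yb)
        if abs(T_b) > lam or abs(T_n) > lam:
            A[b] = np.sqrt(n) * (theta_b - theta_hat)        # CPB branch
        else:
            # V_n^*(0): bootstrapped Z_{n,k} maximized over k (eq. (4) with b = 0)
            Z = np.sqrt(n) * (np.mean(eb[:, None] * (Xb - X.mean(axis=0)), axis=0)
                              - np.mean(eps_hat[:, None] * (X - X.mean(axis=0)), axis=0))
            Vb = Xb.var(axis=0)
            K = int(np.argmax(Z ** 2 / Vb))
            A[b] = Z[K] / Vb[K]
    return np.mean(np.abs(A) >= abs(stat))   # two-sided bootstrap p-value
```

The sketch returns a symmetric two-sided bootstrap p-value for simplicity; the quantile-interval form $[c_l, c_u]$ described above can be used instead by taking the $\gamma/2$ and $1-\gamma/2$ quantiles of the stored values of $A_n$.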

Before applying ART, it is advisable to standardize all the variables Xk and Y (by sample mean and standard deviation), which has the advantage of making the procedure scale invariant (θ^n is then the maximal sample correlation); our results naturally extend, but we develop the theory only for the unstandardized variables to keep the presentation simple.

Robust confidence intervals

The above theorem also allows the construction of a robust confidence interval for θn by treating b0 as unknown, then finding the widest bootstrap quantiles over all b0. Here by “robust” we mean asymptotically valid uniformly over b0. For testing purposes, however, this approach would be too conservative and also computationally intensive (grid search over Rp is needed); for this reason, in ART we set b0 = 0 under the null, so the critical values can be readily computed from An. In contrast, Laber and Murphy (2013) propose using almost sure bounds over their local parameter b0 to find robust confidence intervals for adaptive lasso; this involves less computation than distributional bounds, but is still computationally intensive, and it produces more conservative confidence intervals than the distributional approach.

Choice of the tuning parameter λn

The above theorem requires that $\lambda_n = o(\sqrt{n})$ and $\lambda_n \to \infty$ as $n \to \infty$. Under this condition, the thresholding provides a consistent pre-test (for $\theta_n = 0$) with asymptotically negligible type I error rate: $\lim_{n\to\infty} P(|T_n| > \lambda_n \mid \theta_n = 0) = 0$. On the other hand, if $\lambda_n$ increases too quickly, the pre-test will be conservative. One simple choice would be to set $\lambda_n = \sqrt{a \log n}$, for some constant $a > 0$, but it is also desirable that $\lambda_n$ increase with p; see Section 5 for discussion about the null limiting behavior of $T_n$ as both p and $n \to \infty$. To that end, note that by Theorem 1, in the special case that $\epsilon$ and X are independent, under $\theta_n = 0$ (or $b_0 = 0$ and $\beta_0 = 0$) we have $T_n \xrightarrow{d} \tilde Z_K$, where $K = \operatorname*{arg\,max}_{k=1,\ldots,p} \tilde Z_k^2$, and $(\tilde Z_1, \ldots, \tilde Z_p)^T$ is a vector of standard normal random variables. Thus, for any fixed $\lambda > 0$,

$$P(|T_n| > \lambda \mid \theta_n = 0) \to P\Big(\max_{k=1,\ldots,p} |\tilde Z_k| > \lambda\Big) \le \sum_{k=1}^p P(|\tilde Z_k| > \lambda).$$

Hence the pre-test type I error rate can be asymptotically controlled below level γ, without sacrificing consistency, by choosing

$$\lambda_n = \max\Big\{\sqrt{a \log n},\ \text{upper } \gamma/(2p)\text{-quantile of } N(0,1)\Big\}. \tag{5}$$

In the simulation study below we describe a way of specifying the constant a via the double bootstrap, and this is used whenever we refer to ART in the sequel.
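As a small worked illustration of (5) (our sketch, assuming $\gamma = 0.05$), the threshold is straightforward to compute with a standard normal quantile function; for $a = 2$, $n = 10{,}000$ and $p = 1{,}000$ it gives $\lambda_n \approx 4.3$, consistent with the fixed thresholds quoted for Figure 5 in Section 4.2.

```python
# Computing the threshold lambda_n in (5); norm.ppf is the standard normal quantile.
import numpy as np
from scipy.stats import norm

def threshold(n, p, a, gamma=0.05):
    return max(np.sqrt(a * np.log(n)), norm.ppf(1 - gamma / (2 * p)))

print(threshold(10_000, 1_000, a=2))   # ~4.29
```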

Forward stepwise ART

If we find a significant predictor using ART, it would be reasonable to continue applying the procedure in a forward stepwise fashion until no more significant predictors are detected. That is, in successive stages the residual $Y - \hat\alpha_n - \hat\theta_n X_{\hat k_n}$ is treated as a new outcome variable and marginal regression carried out on the remaining predictors. Although it would be challenging to extend our theoretical results to this procedure, we find that in real data applications it performs well, and in a similar way to the covariance test of Lockhart et al. (2014), as we discuss in the HIV drug resistance example considered in the next section.
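A minimal sketch of this stepwise loop (ours; it reuses the hypothetical art_pvalue helper from the ART sketch above, and simply residualizes on the selected predictor at each step):

```python
# Forward stepwise ART sketch: stop when ART no longer rejects on the residual.
import numpy as np

def forward_stepwise_art(X, Y, lam, alpha=0.05, max_steps=20):
    remaining = list(range(X.shape[1]))
    resid, selected = Y.copy(), []
    for _ in range(max_steps):
        if art_pvalue(X[:, remaining], resid, lam) > alpha:   # ART sketch above
            break
        corr = [np.corrcoef(X[:, k], resid)[0, 1] for k in remaining]
        k = remaining[int(np.argmax(np.abs(corr)))]
        selected.append(k)
        remaining.remove(k)
        xc = X[:, k] - X[:, k].mean()
        theta = np.mean(xc * (resid - resid.mean())) / np.mean(xc ** 2)
        resid = resid - resid.mean() - theta * xc   # residual becomes the new outcome
    return selected
```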

4 Numerical studies

In this section, we study the performance of the proposed ART procedure using simulated data, and give illustrations of the approach in two real data examples.

4.1 Finite sample simulations

We compare the performance of ART with four procedures that are commonly used for detecting the presence of a significant predictor:

Likelihood ratio test (LRT)

This test is based on assuming a full linear model involving all of the covariates, and is applicable when n > p. Under the null hypothesis, all the regression coefficients are zero. The reduction in the residual sum of squares is compared to the residual sum of squares for the full model using an F-ratio [see, e.g. Section 7.4 of Johnson and Wichern (2007)]. When the full linear model holds, it can be seen that both null and alternative hypotheses are identical to those used in ART.

Multiple testing with Bonferroni correction

As in ART, marginal linear models are used to predict Y from each Xk. A t-test with Bonferroni correction is then carried out to detect whether each regression coefficient is non-zero. The intersection of the p null hypotheses coincides with the null used in ART.

Centered percentile bootstrap (CPB)

This procedure is similar to ART, except $\sqrt{n}(\hat\theta_n^* - \hat\theta_n)$ is used to estimate the upper and lower quantiles of $\sqrt{n}(\hat\theta_n - \theta_0)$, providing critical values for the test statistic $\sqrt{n}\,\hat\theta_n$; see Efron and Tibshirani (1993).
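For reference, a small sketch (ours) of this centered percentile bootstrap calibration; the function name cpb_reject and defaults are illustrative.

```python
# CPB sketch: bootstrap sqrt(n)*(theta* - theta_hat) and compare sqrt(n)*theta_hat
# with the resulting gamma/2 and 1 - gamma/2 quantiles.
import numpy as np

def cpb_reject(X, Y, B=1000, gamma=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(Y)
    def theta(Xm, Ym):
        corr = np.array([np.corrcoef(Xm[:, k], Ym)[0, 1] for k in range(Xm.shape[1])])
        k = int(np.argmax(np.abs(corr)))
        xc = Xm[:, k] - Xm[:, k].mean()
        return np.mean(xc * (Ym - Ym.mean())) / np.mean(xc ** 2)
    th = theta(X, Y)
    boot = np.array([theta(X[idx], Y[idx]) for idx in
                     (rng.integers(0, n, n) for _ in range(B))])
    A = np.sqrt(n) * (boot - th)
    cl, cu = np.quantile(A, [gamma / 2, 1 - gamma / 2])
    return np.sqrt(n) * th < cl or np.sqrt(n) * th > cu
```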

Higher Criticism (HC)

This is a test originally proposed by John Tukey for determining the overall significance of a collection of independent p-values. We apply the statistic HCN+ developed by Donoho and Jin (2004, 2015), which is expected to perform well if the predictors are nearly uncorrelated.

We consider three examples for the data generating model: i) $Y = \epsilon$, ii) $Y = X_1/4 + \epsilon$, and iii) $Y = \sum_{k=1}^p \beta_k X_k + \epsilon$, where $\beta_1 = \ldots = \beta_5 = 0.15$, $\beta_6 = \ldots = \beta_{10} = -0.1$, and $\beta_k = 0$ for $k = 11, \ldots, p$. In the first example, there is no active predictor, in the second there is a single active predictor, and in the third there are 10 active predictors and the maximally correlated predictor is not unique. The covariate vector X is distributed as p-dimensional normal with each component $X_k \sim N(0, 1)$, an exchangeable correlation structure $\mathrm{Corr}(X_j, X_k) = \rho$ for $j \neq k$, where $\rho$ takes values 0, 0.5 and 0.8, and the noise $\epsilon \sim N(0, 1)$ is independent of X.
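For concreteness, a sketch (ours) of the three data-generating models; the function name simulate and the default arguments are illustrative only.

```python
# Simulate (X, Y) under models i), ii), iii) with exchangeable correlation rho.
import numpy as np

def simulate(model, n=100, p=50, rho=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # Corr(X_j, X_k) = rho
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.standard_normal(n)
    if model == "i":            # no active predictor
        Y = eps
    elif model == "ii":         # one active predictor
        Y = X[:, 0] / 4 + eps
    else:                       # model iii: 10 active predictors
        beta = np.zeros(p)
        beta[:5], beta[5:10] = 0.15, -0.1
        Y = X @ beta + eps
    return X, Y

X, Y = simulate("iii", n=200, p=100, rho=0.8)
```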

We consider two sample sizes (n = 100 and 200), and five values of the dimension (p = 10, 50, 100, 150 and 200). A nominal 5% significance level is used throughout. The bootstrap sample size is taken as 1,000. To specify the threshold $\lambda_n$ in ART the double bootstrap is implemented by generating 1,000 bootstrap estimates $\hat\theta_n^*$, then choosing $\lambda_n$ so that 5% of the ARTs (based on 1,000 nested bootstrap samples) with test statistic $\sqrt{n}(\hat\theta_n^* - \hat\theta_n)$ reject.

Empirical rejection rates based on 1,000 Monte Carlo replications are reported in Figures 1–3. For model i), the figures provide type I error rates, which should be compared with the 5% nominal rate; for models ii) and iii), the figures provide the power of each test. The ART procedure has good control of the type I error rate throughout (compared to all the other methods), while consistently maintaining relatively high power. Comparing the results of models ii) and iii), non-uniqueness of the maximally correlated predictor has no adverse effect on the power of ART.

Figure 1. Empirical rejection rates based on 1,000 samples generated from models i), ii) and iii) as the dimension ranges from p = 10 to p = 200, for n = 100 (top row) and n = 200 (bottom row), and ρ = 0.8.

Figure 3. Empirical rejection rates as in Figure 1 except for independent predictors.

Bonferroni is highly conservative when ρ = 0.5 and 0.8, see the left panels of Figures 1 and 2. The CPB method is highly anti-conservative, with empirical type I error rates exceeding 15% for both sample sizes (and thus out of range for most of the panels on the left). The LRT effectively controls the type I error rate at around the nominal 5% level when it is applicable, but it has very low power compared with all the other methods, except under model iii) in the “classical case” of small numbers of predictors that are not highly correlated, see the right panels of Figures 2 and 3. Higher Criticism fails to control type I error except when the predictors are independent (Figure 3), in which case it is slightly anti-conservative and has excellent power under model iii), but very poor under model ii). That is, HC performs well (under zero correlation) when there are multiple active predictors, but not in the sparse case of only one active predictor. Except in the case of independent predictors, when Bonferroni is slightly better, ART outperforms all the competing procedures when both type I error and power are taken into account, and the improvement increases with the correlation between predictors.

Figure 2. Empirical rejection rates as in Figure 1 except with lower correlation between predictors: ρ = 0.5.

4.2 Asymptotic power

In this section, we carry out a simulation study to assess the asymptotic power of ART compared with that of the Bonferroni procedure. The computational expense of implementing ART is high because of the double bootstrap, so our full simulation study of the previous section is only feasible for small sample sizes. Nevertheless, we are able to assess asymptotic power by making use of our results on the local model in Section 3.

Consider the local model $Y = (n^{-1/2} b_0) X_1 + \epsilon$, where $b_0 \in \mathbb{R}$. Here X and $\epsilon$ are generated in the same way as in Section 4.1, but now we only consider ρ = 0.5. The local parameter vector takes the special form $(b_0, 0, \ldots, 0)^T$, and we allow $b_0$ to vary over a grid in [0, 5], in increments of 0.5. We set $\beta_0 = 0$, $b_0 = (b_0, 0, \ldots, 0)^T$ and make use of the given covariance structure of X and the explicit form of the limiting distribution in Theorem 1 to generate draws from the asymptotic distribution of $\sqrt{n}\,\hat\theta_n$. Specifically, we carry out the following steps:

  1. For each value of $b_0$ on the grid, take 5,000 draws from the limiting distribution of $\sqrt{n}(\hat\theta_n - \theta_n)$ given in Theorem 1 (this distribution only depends on $b_0$ and the given distribution of (X, Y)), then add $b_0$ to obtain draws from the limiting distribution of $\sqrt{n}\,\hat\theta_n$; a minimal sketch of this step appears after this list. Based on these draws, we can obtain the (approximate) rejection rate of the test statistic $\sqrt{n}\,\hat\theta_n$ for any given rejection region. In particular, the asymptotic rejection rate of ART (for any given $b_0$ on the grid) can be calculated by referring to the rejection rate corresponding to the particular critical values $c_l$ and $c_u$ generated by ART.

  2. To assess the asymptotic power of ART at each given b0, we generate 10 independent large samples (with n = 5,000) from the local model, find cl and cu for each sample, and display in a boxplot the corresponding asymptotic rejection rates (using the results of step 1).

  3. For comparison, we also plot the asymptotic power of the Bonferroni procedure, which is approximated using 1,000 samples each of size n = 5,000.
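The following is a minimal sketch (ours) of step 1 for this setting: under $\beta_0 = 0$ with standardized covariates and $\epsilon$ independent of X, the covariance $\Sigma(0)$ in Theorem 1 reduces to the correlation matrix of X, $V_k = 1$, and $C_k$ is the k-th row of that matrix, so draws from the limiting distribution can be generated directly.

```python
# Step 1 sketch: draws from the Theorem 1 limit of sqrt(n)*theta_hat under
# beta_0 = 0 with local parameter (b0, 0, ..., 0) and exchangeable correlation rho.
import numpy as np

def limit_draws_sqrt_n_theta(b0, p=10, rho=0.5, ndraws=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    C = Sigma                                     # C_k = Cov(X_k, X) is row k
    b = np.zeros(p); b[0] = b0
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=ndraws)   # Z(0) draws
    shift = Z + C @ b                             # Z_k(0) + C_k^T b0  (V_k = 1)
    K = np.argmax(shift ** 2, axis=1)             # selected index per draw
    kbar = 0                                      # k-bar(b0): the first predictor
    lim = Z[np.arange(ndraws), K] + (C[K] - C[kbar]) @ b
    return lim + b0                               # add b0 to get sqrt(n)*theta_hat
```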

The results are presented in Figure 4 for p = 10 and 50. The main source of variation within each boxplot is due to randomness over the 10 independent samples drawn from the local model, rather than bootstrap randomness (in view of bootstrap consistency and the large sample size n = 5,000). The median of each boxplot provides a suitable reference point to compare with the asymptotic power of Bonferroni (indicated by the circle). Note that ART provides accurate control of asymptotic type I error, and, as expected, Bonferroni is slightly conservative. In terms of median power, ART always outperforms Bonferroni, and can provide an additional 25% power (e.g., at b0 = 3 for p = 10, and at b0 = 3.5 for p = 50).

Figure 4. Asymptotic type I error and power of ART (box plots) compared with Bonferroni (circles) as a function of the local parameter b0, for p = 10 and 50, ρ = 0.5, calculated using steps 1–3 in Section 4.2.

The cost of implementing the double bootstrap part of ART makes it prohibitive to extend the results in Figure 4 to larger p, but if we fix λn, then it becomes practical to run the simulations for p = 1000. Figure 5 shows how the asymptotic power of ART compares with Bonferroni as the constant a used to specify λn takes values 2, 4, 5 and 8 (the corresponding λn are 4.3, 6.1, 6.8 and 8.6). Note that as a increases (going from one panel to the next), ART becomes more stable and provides more accurate type I error control, but the overall power decreases. At small values of a, ART behaves like the CPB, which is anti-conservative (as we have already seen in the previous section), whereas at larger values the influence of CPB is diluted. For the CPB (which corresponds to setting λn = 0), the plot (not shown) appears very similar to that for a = 2; also, for a > 8 the plots appear very similar to a = 8. The best choice of a, therefore, is a trade-off between type I error control and power; comparing with Figure 4, ART with double bootstrapping appears to achieve a satisfactory balance in this regard. Also note that, even at the largest value a = 8, ART can provide an additional 20% power over Bonferroni, and thus outperform Bonferroni by a considerable margin in high-dimensional settings as well, at least when there is a high degree of correlation among the components of X.

Figure 5. Asymptotic type I error and power of ART compared with Bonferroni for p = 1,000 and ρ = 0.5, where ART is implemented using a fixed threshold λn specified by a = 2, 4, 5, 8, and each box plot is based on 20 independent replications with n = 10,000.

4.3 Gene expression example

We consider gene expression profiles from the tumors of n = 156 patients diagnosed with a common type of adult brain cancer (glioblastoma), collected as part of the Cancer Genome Atlas pilot project (TCGA, 2008). Our analysis is based on log gene expression levels X at p = 181 loci along chromosome 1. We are interested in detecting the presence of a gene that is significantly related to log-survival time Y.

We compare the results from applying the Bonferroni, CPB and ART procedures; LRT is not applicable since p > n. The three methods yield very different p-values. The smallest Bonferroni adjusted p-value is 40.8%, suggesting that no gene is significantly related to Y. The CPB and ART p-values are 3.2% and 17.2%, respectively, from 1000 bootstrap samples. Figure 6 shows how these p-values are calculated. Thus the CPB method suggests the presence of a significant genetic effect, whereas ART does not.

Figure 6. Gene expression example. Left panel: histogram of $\sqrt{n}(\hat\theta_n^* - \hat\theta_n)$ showing that the two-sided CPB p-value is 3.2%. Right panel: histogram of $A_n$ showing that the two-sided ART p-value is 17.2%.

4.4 HIV drug resistance example

Our second example uses data from the HIV Drug Resistance Database (2014), an important public resource for understanding how HIV-1 mutation patterns cause resistance to antiretroviral drugs (Rhee et al., 2003). We will compare our results with those of Lockhart et al. (2014), who applied their covariance test to data on the susceptibility (a measure of drug resistance) of the nucleoside reverse transcriptase inhibitor lamivudine (3TC). We code susceptibility on a log-scale (Y), and each predictor Xj is taken as indicating the presence/absence of a mutation at a given sequence position. The viral sequence positions are indexed by j. Excluding missing data and rare mutations resulted in data on p = 103 positions and a total of 1266 isolates.

We randomly split the data 50 times into a training set of size n = 126 and a test set of size 1140. For each split, we carry out 20 steps of forward stepwise ART and standard forward stepwise regression using the training data, and calculate the corresponding prediction error (including all previously selected variables) using the test data. The left panel of Figure 7 shows the training data p-values (mean ± SD) for the newly entered predictor at each step, over the 50 random splits, and the right panel shows the corresponding prediction errors (mean ± SD). Forward stepwise ART detects one very highly significant mutation, but no more, as confirmed by the test set error plot, and this result is roughly consistent with the findings of Lockhart et al. (2014). Standard forward stepwise regression picks out at least 10 mutations, but there is no improvement in test set error after the first predictor enters the model; moreover, the test error almost exactly coincides with ART.

Figure 7. HIV drug resistance example. Left panel: training set p-values (mean ± SD) over 50 random splits of the data for forward stepwise ART (solid line), standard forward stepwise regression (dash-dot line) and the 0.05 alpha level (dotted). Right panel: test set error for the corresponding models (including all previously selected variables); the two lines are almost indistinguishable.

5 Discussion

In this paper we have developed an adaptive resampling test (ART) for detecting the existence of a significant predictor, Xk0, from among predictors X1, . . . , Xp. The procedure is designed to adjust to the non-regular limiting behavior of the estimated marginal regression coefficient θ^n of the selected predictor. This is done by using a thresholded version of the bootstrap that adapts to the non-regularity: if there is at least one significant predictor, it reduces to a centered percentile bootstrap, otherwise it mimics the local (non-uniform) asymptotic behavior of θ^n. We have shown that in simulation studies, ART performs favorably compared with standard methods such as Bonferroni, but also compared with more sophisticated methods such as Higher Criticism. The advantage of ART may stem from it being designed to take into account correlations between predictors, while also avoiding distributional assumptions (the nonparametric bootstrap steps in ART are essentially distribution free). We have restricted attention to linear models, but our approach has much wider applicability (e.g., generalized linear models, quantile regression, and censored time-to-event outcomes), and these will be studied in future papers.

Although our simulation results suggest that ART is useful and remarkably stable in "large p, small n" settings, the asymptotic theory that we have used to calibrate ART relies on assuming a fixed p, with n tending to infinity. In view of the conservative nature of the Bonferroni procedure in high-dimensional settings, there is a pressing need for more powerful tests in this area. In future work it would be of interest to develop the asymptotic theory of ART for the case of p growing with n, although this would be very challenging. As far as we know, formal testing procedures that provably control FWER and adjust to non-regularity under diverging p are not yet available, except for Higher Criticism in the case that the predictors are nearly uncorrelated, as established by Ingster et al. (2010) and Arias-Castro et al. (2011). In the only other instance we know of, under the strong assumption that $X_1, \ldots, X_p, Y$ are iid N(0, 1), results of Cai and Jiang (2012) can be used to find the weak limit of $\hat\rho_n = \max_{k=1,\ldots,p} |\widehat{\mathrm{Corr}}(X_k, Y)|$ and thus devise an asymptotically correct calibration: if $p = p_n \to \infty$ at a sub-exponential rate, $\log(p)/n \to 0$, then $\hat\rho_n \xrightarrow{p} 0$ and $n\hat\rho_n^2 - 2\log p + \log\log p \xrightarrow{d} F$, where $F(y) = e^{-e^{-y/2}/\sqrt{2\pi}}$. In the super-exponential case, $\log(p)/n \to \infty$, we have $\hat\rho_n \xrightarrow{p} 1$ and there is a similar weak limit.

Another interesting direction for future work would be to study the forward stepwise version of ART discussed in Section 3. Modifications to ART when applied stepwise in this way would be needed to adjust for the implicit dependence among the new outcomes. By repeating such a procedure until no more significant predictors are detected, the aim would be to correctly identify all active predictors.

Acknowledgments

Research supported by NIH Grant R01GM095722-01 and NSF Grant DMS-1307838.

Appendix: Proofs

Proof of Theorem 1

For k = 1, . . . , p, let $(\hat\alpha_k, \hat\theta_k) = \operatorname*{arg\,min}_{(\alpha,\theta)} \mathbb{P}_n(Y - \alpha - \theta X_k)^2$. Then $\hat k_n = \operatorname*{arg\,min}_{k=1,\ldots,p} \mathbb{P}_n(Y - \hat\alpha_k - \hat\theta_k X_k)^2$ and $(\hat\alpha_n, \hat\theta_n) = (\hat\alpha_{\hat k_n}, \hat\theta_{\hat k_n})$. It is easy to verify that $\hat\alpha_k = \mathbb{P}_n(Y - \hat\theta_k X_k)$,

$$\begin{aligned}
\sqrt{n}\,\hat\theta_k &= \frac{\sqrt{n}\,\widehat{\mathrm{Cov}}(X_k, Y)}{\widehat{\mathrm{Var}}(X_k)}
= \frac{\sqrt{n}\,\widehat{\mathrm{Cov}}(X_k, X^T)\beta_n + \mathbb{G}_n[\epsilon(X_k - \mathbb{P}_n X_k)]}{\widehat{\mathrm{Var}}(X_k)}\\[1ex]
&= \frac{(\mathbb{G}_n X_k X^T - \mathbb{P}_n X_k\,\mathbb{G}_n X^T - \mathbb{G}_n X_k\, P_n X^T)\beta_n}{\widehat{\mathrm{Var}}(X_k)}
+ \frac{\mathbb{G}_n[\epsilon(X_k - P_n X_k)] - \mathbb{P}_n\epsilon\,\mathbb{G}_n X_k + \sqrt{n}\,\mathrm{Cov}(X_k, X^T)\beta_n}{\widehat{\mathrm{Var}}(X_k)},
\end{aligned}\tag{6}$$

where Pn is the distribution of (Y, X), and the mean residual squared error

$$\hat R_k \equiv \mathbb{P}_n[Y - \hat\alpha_k - \hat\theta_k X_k]^2 = \widehat{\mathrm{Var}}(Y) - \widehat{\mathrm{Var}}(X_k)\,\hat\theta_k^2. \tag{7}$$

The result then follows immediately from the following two lemmas. The first lemma verifies the oracle property for marginal regression under the assumption that there is at least one active predictor; the proof is included for completeness. The second lemma gives the (non-regular) asymptotic behavior of θ^n when there are no active predictors.

Lemma 1

If all conditions in Theorem 1 hold and $\beta_0 \neq 0$, then $\hat k_n \xrightarrow{a.s.} k_0$ and $\sqrt{n}(\hat\theta_n - \theta_n) \xrightarrow{d} Z_{k_0}(\beta_0)/V_{k_0}$, where $Z_{k_0}$ is defined in Theorem 1.

Proof

Denote $\hat R \equiv (\hat R_1, \ldots, \hat R_p)^T$. When $\beta_0 \neq 0$, $\mathrm{Var}(X^T\beta_0) > 0$. By the SLLN

$$\frac{\widehat{\mathrm{Var}}(Y) - \hat R}{\mathrm{Var}(X^T\beta_0)} \xrightarrow{a.s.} \big(\mathrm{Corr}^2(X_1, X^T\beta_0), \ldots, \mathrm{Corr}^2(X_p, X^T\beta_0)\big)^T.$$

Since $\hat k_n = \operatorname*{arg\,max}_{k=1,\ldots,p}[\widehat{\mathrm{Var}}(Y) - \hat R_k]/\mathrm{Var}(X^T\beta_0)$ and $\mathrm{Corr}^2(X_k, X^T\beta_0)$ is maximized at $k = k_0$, it follows immediately that $\hat k_n \xrightarrow{a.s.} k_0$.

Next, denote $\hat X = X_{\hat k_n}$ and $X_n = X_{k_n}$. Since $\mathbb{P}_n[Y - \mathbb{P}_n Y - \hat\theta_n(\hat X - \mathbb{P}_n \hat X)]\hat X = 0$ and $Y = \alpha_0 + X^T\beta_n + \epsilon$, we have

$$\begin{aligned}
\sqrt{n}(\hat\theta_n - \theta_n)\,\widehat{\mathrm{Var}}(\hat X)
&= \sqrt{n}\,\widehat{\mathrm{Cov}}(\hat X, X^T)\beta_n + \sqrt{n}\,\mathbb{P}_n\big(\epsilon(\hat X - \mathbb{P}_n \hat X)\big)
- \sqrt{n}\,\widehat{\mathrm{Var}}(\hat X)\,\frac{\mathrm{Cov}(X_n, X)^T\beta_n + \mathrm{Cov}(X_n, \epsilon)}{\mathrm{Var}(X_n)}\\
&= \sqrt{n}\,\widehat{\mathrm{Cov}}(X_{k_0}, X^T)\beta_n + \sqrt{n}\,\mathbb{P}_n\big(\epsilon(X_{k_0} - \mathbb{P}_n X_{k_0})\big)
- \sqrt{n}\,\widehat{\mathrm{Var}}(X_{k_0})\,\frac{\mathrm{Cov}(X_{k_0}, X)^T\beta_n + \mathrm{Cov}(X_{k_0}, \epsilon)}{\mathrm{Var}(X_{k_0})} + o_{P_n}(1)\\
&= \mathbb{G}_n\Big[\Big(\epsilon + (X - \mathbb{P}_n X)^T\beta_0 - \frac{\mathrm{Cov}(X_{k_0}, X)^T\beta_0}{\mathrm{Var}(X_{k_0})}(X_{k_0} - \mathbb{P}_n X_{k_0})\Big)(X_{k_0} - \mathbb{P}_n X_{k_0})\Big] + o_{P_n}(1),
\end{aligned}$$

where the second equality uses $\hat k_n \xrightarrow{a.s.} k_0$ and $k_n \to k_0$ as $n \to \infty$, and the third equality follows from the LLN and $\mathrm{Cov}(\epsilon, X_{k_0}) = 0$. Similarly, $\widehat{\mathrm{Var}}(\hat X) \xrightarrow{P_n} V_{k_0} = \mathrm{Var}(X_{k_0})$. The proof is completed using Slutsky's lemma and the CLT.

Lemma 2

If all conditions in Theorem 1 hold and $\beta_0 = 0$, then $\sqrt{n}(\hat\theta_n - \theta_n) \xrightarrow{d} \dfrac{Z_K(0)}{V_K} + \Big(\dfrac{C_K}{V_K} - \dfrac{C_{\bar k(b_0)}}{V_{\bar k(b_0)}}\Big)^T b_0$.

Proof

Since $(Z_1(0), \ldots, Z_p(0))^T$ is a normal random vector and $|\mathrm{Corr}(X_j, X_k)| < 1$ for $j \neq k$, it is easy to see that

$$\frac{(Z_j(0) + C_j^T b_0)^2}{V_j} \neq \frac{(Z_k(0) + C_k^T b_0)^2}{V_k} \quad \text{for any } j \neq k \quad \text{a.s.} \tag{8}$$

So K is unique a.s.

Denote $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_p)^T$. Note that when $\beta_0 = 0$, $\sqrt{n}\,\beta_n = b_0$. By the CLT and Slutsky's lemma, we see from (6) that

$$\sqrt{n}\,\hat\theta \xrightarrow{d} \left(\frac{Z_1(0) + C_1^T b_0}{V_1}, \ldots, \frac{Z_p(0) + C_p^T b_0}{V_p}\right)^T.$$

From (7), we have

$$n\big[\widehat{\mathrm{Var}}(Y) - \hat R\big] = (\sqrt{n}\,\hat\theta) \odot (\sqrt{n}\,\hat\theta) \odot \big(\widehat{\mathrm{Var}}(X_1), \ldots, \widehat{\mathrm{Var}}(X_p)\big)^T,$$

where ☉ denotes the elementwise (Hadamard) product, so, by the continuous mapping theorem and Slutsky's lemma,

$$\begin{pmatrix}\sqrt{n}\,\hat\theta \\ n[\widehat{\mathrm{Var}}(Y) - \hat R]\end{pmatrix} \xrightarrow{d} \begin{pmatrix}\left(\dfrac{Z_1(0) + C_1^T b_0}{V_1}, \ldots, \dfrac{Z_p(0) + C_p^T b_0}{V_p}\right)^T \\[2.5ex] \left(\dfrac{(Z_1(0) + C_1^T b_0)^2}{V_1}, \ldots, \dfrac{(Z_p(0) + C_p^T b_0)^2}{V_p}\right)^T\end{pmatrix}.$$

Define $h(t) = (\mathbb{1}_{\arg\max_k t_k = 1}, \ldots, \mathbb{1}_{\arg\max_k t_k = p})^T$, where $t = (t_1, \ldots, t_p)^T \in \mathbb{R}^p$. Note that h is continuous at t if $\arg\max_k t_k$ is unique. Thus, using (8) and since $\sqrt{n}\,\hat\theta_n = \sqrt{n}\,\hat\theta^T h(n[\widehat{\mathrm{Var}}(Y) - \hat R])$, the result follows by applying the continuous mapping theorem to the above display.

Lemma 3

Let Z be a p-dimensional random vector and $f\colon \mathbb{R}^{2p} \to \mathbb{R}^p$ a function such that $f(z, \cdot)$ is continuous for every $z \in \mathbb{R}^p$, and $f(Z, b)_j \neq f(Z, b)_k$ a.s. for all $j \neq k$ and $b \in \mathbb{R}^p$. Then $K(b) \equiv \arg\max_{k=1,\ldots,p} f(Z, b)_k$ is unique a.s. Also, if $b_l \to b_0$, then $K(b_l) = K(b_0)$ for l sufficiently large a.s.

The proof is omitted. An immediate consequence of this lemma is the continuity of the limiting distribution in Theorem 1 as a function of $b_0$; this is seen by setting $f(z_1, \ldots, z_p, b)_k = (z_k + C_k^T b)^2/V_k$ for k = 1, . . . , p, and using (8).

Proof of Theorem 2

The notation $\hat\theta_n^*$ and $\hat k_n^*$ means that $\hat\theta_n$ and $\hat k_n$ are based on n iid observations taken from $\mathbb{P}_n$. The bootstrapped process $V_n^*(b)$ in the statement of the theorem is defined by re-expressing (4), along with $\bar k(b)$ and $K_n(b)$, in terms of $\mathbb{P}_n$ and $P_n$ operating on functions of (X, Y), then replacing $\mathbb{P}_n$ by $\mathbb{P}_n^*$ and $P_n$ by $\mathbb{P}_n$ throughout. In the case of $Z_{n,k}$ in which $\epsilon$ is not observed, we also replace $\epsilon$ by $\hat\epsilon_n = \hat\epsilon_n(X, Y) \equiv Y - \hat\alpha_n - \hat\theta_n \hat X$, resulting in

$$Z_{n,k}^* = \mathbb{G}_n^*[\hat\epsilon_n(X_k - \mathbb{P}_n X_k)] = \mathbb{G}_n^*[\hat\epsilon_n X_k] - [\mathbb{G}_n^*\hat\epsilon_n][\mathbb{P}_n X_k] \tag{9}$$

where $\mathbb{G}_n^* = \sqrt{n}(\mathbb{P}_n^* - \mathbb{P}_n)$ is the bootstrapped empirical process. As is conventional in empirical process theory, $P_n$, $\mathbb{P}_n$ and $\mathbb{P}_n^*$ are assumed to operate only on functions that are defined on (X, Y), explaining why $\mathbb{P}_n X_k$ can be separated in the above display.

Let $E_M$ denote expectation conditional on the data, and let $P_M$ be the corresponding probability measure. We will show that $\mathbb{1}_{|T_n^*| > \lambda_n \text{ or } |T_n| > \lambda_n} \xrightarrow{P_M} \mathbb{1}_{\beta_0 \neq 0}$ and $\mathbb{1}_{|T_n^*| \le \lambda_n}\mathbb{1}_{|T_n| \le \lambda_n} \xrightarrow{P_M} \mathbb{1}_{\beta_0 = 0}$ conditionally (on the data) in probability. This together with Lemmas 4 and 5 below implies the result.

For k = 1, . . . , p, the bootstrapped marginal regression coefficient $\hat\theta_k^*$ satisfies

$$\begin{aligned}
\sqrt{n}\,\hat\theta_k^* &= \frac{\sqrt{n}\,\big[\mathbb{P}_n^* X_k Y - (\mathbb{P}_n^* X_k)(\mathbb{P}_n^* Y)\big]}{\mathbb{P}_n^* X_k^2 - (\mathbb{P}_n^* X_k)^2}
= \frac{\mathbb{G}_n^* X_k Y - \mathbb{G}_n^* X_k\,\mathbb{P}_n Y - (\mathbb{P}_n^* X_k)(\mathbb{G}_n^* Y) + \sqrt{n}\,\big[\mathbb{P}_n X_k Y - (\mathbb{P}_n X_k)(\mathbb{P}_n Y)\big]}{\mathbb{P}_n^* X_k^2 - (\mathbb{P}_n^* X_k)^2}\\[1ex]
&= \frac{\mathbb{G}_n^* X_k Y - \mathbb{G}_n^* X_k\,\mathbb{P}_n Y - (\mathbb{P}_n^* X_k)(\mathbb{G}_n^* Y) + \sqrt{n}\,\hat\theta_k\,\big[\mathbb{P}_n X_k^2 - (\mathbb{P}_n X_k)^2\big]}{\mathbb{P}_n^* X_k^2 - (\mathbb{P}_n^* X_k)^2}.
\end{aligned}\tag{10}$$

When $\beta_0 = 0$, by Lemma 2 and the condition that $\lambda_n \to \infty$ as $n \to \infty$, we have $P_M(|T_n^*| > \lambda_n) \to 0$ in probability. When $\beta_0 \neq 0$, it is easy to verify that $|\theta_n| \to |C_{k_0}^T\beta_0|/V_{k_0}$, which is positive under the condition that $k_0$ is unique. Thus

$$P_M(|T_n^*| \le \lambda_n) = P_M\big(|(\hat\theta_n^* - \hat\theta_n) + (\hat\theta_n - \theta_n) + \theta_n| \le \lambda_n s_n^*\big) \le P_M\big(|\theta_n| \le \lambda_n s_n^* + |\hat\theta_n^* - \hat\theta_n| + |\hat\theta_n - \theta_n|\big)$$

tends to zero in probability when $\beta_0 \neq 0$, where the convergence follows from Lemma 1, Lemma 4 (below) and the condition that $\lambda_n = o(\sqrt{n})$. Hence

$$E_M\big|\mathbb{1}_{|T_n^*| \le \lambda_n} - \mathbb{1}_{\beta_0 = 0}\big| = E_M\big|\mathbb{1}_{|T_n^*| > \lambda_n} - \mathbb{1}_{\beta_0 \neq 0}\big| = P_M(|T_n^*| > \lambda_n,\ \beta_0 = 0) + P_M(|T_n^*| \le \lambda_n,\ \beta_0 \neq 0) = P_M(|T_n^*| > \lambda_n)\,\mathbb{1}_{\beta_0 = 0} + P_M(|T_n^*| \le \lambda_n)\,\mathbb{1}_{\beta_0 \neq 0}$$

tends to zero in probability. This implies that $\mathbb{1}_{|T_n^*| > \lambda_n} \xrightarrow{P_M} \mathbb{1}_{\beta_0 \neq 0}$ and $\mathbb{1}_{|T_n^*| \le \lambda_n} \xrightarrow{P_M} \mathbb{1}_{\beta_0 = 0}$ conditionally in probability. Since $\mathbb{1}_{|T_n| \le \lambda_n}$ converges to $\mathbb{1}_{\beta_0 = 0}$ in probability, the result follows from Slutsky's lemma.

Lemma 4

If the conditions in Theorem 1 hold and $\beta_0 \neq 0$, then $\hat k_n^* \xrightarrow{P_M} k_0$ conditionally (on the data) a.s. and $\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \xrightarrow{d} Z_{k_0}(\beta_0)/V_{k_0}$ conditionally (on the data) in probability.

Proof

It follows from (10), the SLLN and Slutsky's lemma that, when $\beta_0 \neq 0$,

$$\widehat{\mathrm{Var}}^*(X_k)\,\hat\theta_k^* = n^{-1/2}\big[\mathbb{G}_n^* X_k Y - \mathbb{G}_n^* X_k\,\mathbb{P}_n Y - (\mathbb{P}_n^* X_k)(\mathbb{G}_n^* Y)\big] + \hat\theta_k\big[\mathbb{P}_n X_k^2 - (\mathbb{P}_n X_k)^2\big] \xrightarrow{P_M} C_k^T\beta_0$$

and $\hat\theta_k^* \xrightarrow{P_M} C_k^T\beta_0/V_k$ a.s. for k = 1, . . . , p. Denote the bootstrap mean squared error

$$\hat R_k^* \equiv \mathbb{P}_n^*\big[Y - \hat\alpha_k^* - \hat\theta_k^* X_k\big]^2 = \widehat{\mathrm{Var}}^*(Y) - (\hat\theta_k^*)^2\,\widehat{\mathrm{Var}}^*(X_k),$$

where $\widehat{\mathrm{Var}}^*(Y) = \mathbb{P}_n^* Y^2 - (\mathbb{P}_n^* Y)^2$ and $\widehat{\mathrm{Var}}^*(X_k) = \mathbb{P}_n^* X_k^2 - (\mathbb{P}_n^* X_k)^2$. Then we can write

$$\hat k_n^* = \operatorname*{arg\,max}_{k=1,\ldots,p} \frac{\widehat{\mathrm{Var}}^*(Y) - \hat R_k^*}{\mathrm{Var}(X^T\beta_0)} = \operatorname*{arg\,max}_{k=1,\ldots,p} \frac{(\hat\theta_k^*)^2\,\widehat{\mathrm{Var}}^*(X_k)}{\mathrm{Var}(X^T\beta_0)}$$

since the denominator plays no role. By Slutsky's lemma

$$\frac{(\hat\theta_k^*)^2\,\widehat{\mathrm{Var}}^*(X_k)}{\mathrm{Var}(X^T\beta_0)} \xrightarrow{P_M} \mathrm{Corr}^2(X_k, X^T\beta_0)$$

a.s. for k = 1, . . . , p, so we obtain

$$\begin{aligned}
P_M(\hat k_n^* \neq k_0) &= P_M\left(\bigcup_{k:\,k\neq k_0}\left\{\frac{(\hat\theta_{k_0}^*)^2\,\widehat{\mathrm{Var}}^*(X_{k_0})}{\mathrm{Var}(X^T\beta_0)} \le \frac{(\hat\theta_k^*)^2\,\widehat{\mathrm{Var}}^*(X_k)}{\mathrm{Var}(X^T\beta_0)}\right\}\right)\\
&\le \sum_{k:\,k\neq k_0} P_M\left(\frac{(\hat\theta_{k_0}^*)^2\,\widehat{\mathrm{Var}}^*(X_{k_0})}{\mathrm{Var}(X^T\beta_0)} \le \frac{(\hat\theta_k^*)^2\,\widehat{\mathrm{Var}}^*(X_k)}{\mathrm{Var}(X^T\beta_0)}\right) \to 0 \quad \text{a.s.},
\end{aligned}$$

where the convergence follows from the condition that $k_0$ is unique when $\beta_0 \neq 0$.

Recall that $\hat\epsilon_n \equiv Y - \hat\alpha_n - \hat\theta_n \hat X$, where $\hat X = X_{\hat k_n}$. Note that $\mathbb{P}_n\hat\epsilon_n = 0$. By the definition of $\hat\theta_n^*$, we have

$$\begin{aligned}
\sqrt{n}(\hat\theta_n^* - \hat\theta_n)\big[\mathbb{P}_n^* X_{\hat k_n^*}^2 - (\mathbb{P}_n^* X_{\hat k_n^*})^2\big]
&= \sqrt{n}\big[\mathbb{P}_n^* X_{\hat k_n^*} Y - (\mathbb{P}_n^* X_{\hat k_n^*})(\mathbb{P}_n^* Y) - \hat\theta_n\big(\mathbb{P}_n^* X_{\hat k_n^*}^2 - (\mathbb{P}_n^* X_{\hat k_n^*})^2\big)\big]\\
&= \sqrt{n}\big(\mathbb{P}_n^* X_{\hat k_n^*}\hat\epsilon_n - \mathbb{P}_n^* X_{\hat k_n^*}\,\mathbb{P}_n^*\hat\epsilon_n\big)
+ \sqrt{n}\,\hat\theta_n\big[(\mathbb{P}_n^* X_{\hat k_n^*})^2 - \mathbb{P}_n^* X_{\hat k_n^*}^2 + \mathbb{P}_n^* X_{\hat k_n^*}\hat X - (\mathbb{P}_n^* X_{\hat k_n^*})(\mathbb{P}_n^*\hat X)\big]\\
&= \mathbb{G}_n^*\hat\epsilon_n\big(X_{\hat k_n^*} - \mathbb{P}_n X_{\hat k_n^*}\big)
- \mathbb{G}_n^* X_{\hat k_n^*}\,(\mathbb{P}_n^* - \mathbb{P}_n)\hat\epsilon_n
- \mathbb{G}_n^*\hat\epsilon_n\,(\mathbb{P}_n^* - \mathbb{P}_n)X_{\hat k_n^*}\\
&\qquad + \sqrt{n}\,\hat\theta_n\big[(\mathbb{P}_n^* X_{\hat k_n^*})^2 - \mathbb{P}_n^* X_{\hat k_n^*}^2 + \mathbb{P}_n^* X_{\hat k_n^*}\hat X - (\mathbb{P}_n^* X_{\hat k_n^*})(\mathbb{P}_n^*\hat X)\big].
\end{aligned}\tag{11}$$

The last term in (11) is $o_{P_M}(1)$ a.s. because the first and last terms within the square bracket cancel asymptotically, similarly for the second and third terms, due to $\hat k_n^* \xrightarrow{P_M} k_0$ and $\hat k_n \to k_0$ a.s. We next show that the first term in (11) converges in distribution to $Z_{k_0}(\beta_0)$ conditionally (on the data) in probability. By Lemma 1, it is easy to verify that $\hat\theta_n \xrightarrow{P_n} \theta_0 \equiv C_{k_0}^T\beta_0/V_{k_0}$ and $\hat\alpha_n \xrightarrow{P_n} \alpha_0 + EX^T\beta_0 - \theta_0 EX_{k_0}$. Denote $\tilde\epsilon = \epsilon + (X - EX)^T\beta_0 - \theta_0(X_{k_0} - EX_{k_0})$. Then the first term can be decomposed as

$$\mathbb{G}_n^*\hat\epsilon_n\big[(X_{\hat k_n^*} - \mathbb{P}_n X_{\hat k_n^*}) - (X_{k_0} - \mathbb{P}_n X_{k_0})\big] + \mathbb{G}_n^*\big[(\hat\epsilon_n - \tilde\epsilon)(X_{k_0} - \mathbb{P}_n X_{k_0})\big] + \mathbb{G}_n^*\big[\tilde\epsilon(X_{k_0} - \mathbb{P}_n X_{k_0})\big]. \tag{12}$$

The first term in (12) is $o_{P_M}(1)$ a.s. since $\hat k_n^* \xrightarrow{P_M} k_0$. The second term in (12) can be written as

$$\big[(\alpha_0 + EX^T\beta_0 - \theta_0 EX_{k_0}) - \hat\alpha_n\big]\,\mathbb{G}_n^*(X_{k_0} - \mathbb{P}_n X_{k_0}) + (\mathbb{P}_n^* - \mathbb{P}_n)\big[(X_{k_0} - \mathbb{P}_n X_{k_0})X^T b_0\big] + (\theta_0 - \hat\theta_n)\,\mathbb{G}_n^*\big[X_{k_0}(X_{k_0} - \mathbb{P}_n X_{k_0})\big] - \hat\theta_n\,\mathbb{G}_n^*\big[(\hat X - X_{k_0})(X_{k_0} - \mathbb{P}_n X_{k_0})\big]$$

which is $o_{P_M}(1)$ in probability by bootstrap consistency of the sample mean [see, e.g., Theorem 23.4 of van der Vaart (1998)], and the fact that $\hat X = X_{k_0}$ for n sufficiently large a.s. Bootstrap consistency of the sample mean also gives that the third term in (12) converges in distribution to $Z_{k_0}(\beta_0)$ conditionally (on the data) in probability.

Similarly, the second and third terms in (11) and $\mathbb{P}_n^* X_{\hat k_n^*}^2 - (\mathbb{P}_n^* X_{\hat k_n^*})^2 - \mathrm{Var}(X_{k_0})$ can be shown to be $o_{P_M}(1)$ in probability. The result then follows from Slutsky's lemma.

Lemma 5

If all conditions in Theorem 1 hold and $\beta_0 = 0$, then $V_n^*(b_0)$ converges to the same limiting distribution as $\sqrt{n}(\hat\theta_n - \theta_n)$ conditionally (on the data) in probability.

Proof

Define $Z_n$, $M_n(b)$ and $M'(b)$ to be p-vectors with kth components given by $Z_{n,k} = \mathbb{G}_n[\epsilon(X_k - P_n X_k)]$,

$$\frac{[\widehat{\mathrm{Cov}}(X_k, X^T b) + Z_{n,k}]^2}{\widehat{\mathrm{Var}}(X_k)} \quad\text{and}\quad \frac{[\mathrm{Cov}(X_k, X^T b)]^2}{\mathrm{Var}(X_k)},$$

respectively. Let Wn(b) be a p × p matrix with the (j, k)-th component given by

$$\frac{\widehat{\mathrm{Cov}}(X_k, X^T b) + Z_{n,k}}{\widehat{\mathrm{Var}}(X_k)} - \frac{\mathrm{Cov}(X_j, X^T b)}{\mathrm{Var}(X_j)}.$$

Also, let Dn(b) and D′(b) be p-vectors of zeros, apart from a 1 in the entry that maximizes Mn(b) and M′(b), respectively. Then

$$V_n(b) = D'(b)^T W_n(b) D_n(b).$$

Similarly, define M(b), W(b) and D(b) (without indexing by n) to be processes of the same form as Mn(b), Wn(b) and Dn(b), except with Zn,k replaced by Zk(0), and the sample variances/covariances replaced by their population versions.

Referring to the notation in (4), it is clear that when β0 = 0,

$$\sqrt{n}(\hat\theta_n - \theta_n) = V_n(b_0) = D'(b_0)^T W_n(b_0) D_n(b_0) \xrightarrow{d} D(b_0)^T W(b_0) D(b_0).$$

Moreover, the second equality in the above display also holds for the bootstrap version. Writing the bootstrapped version of Zn,k in (9) as

$$Z_{n,k}^* = \mathbb{G}_n^*[\epsilon(X_k - P_n X_k)] + \mathbb{G}_n^*[(\hat\epsilon_n - \epsilon)(X_k - P_n X_k)] + \big[(P_n - \mathbb{P}_n)X_k\big]\,\mathbb{G}_n^*\hat\epsilon_n,$$

and using arguments similar to those in the proof of Lemma 4 for handling (12), we have $Z_n^* \xrightarrow{d} (Z_1(0), \ldots, Z_p(0))^T$ conditionally (on the data) in probability. As a result, $(\hat D_n(b_0), W_n^*(b_0), M_n^*(b_0)) \xrightarrow{d} (D(b_0), W(b_0), M(b_0))$ conditionally (on the data) in probability, where $\hat D_n(b)$ is the sample version of $D'(b)$, and $W_n^*(b)$ and $M_n^*(b)$ are the bootstrap versions of $W_n(b)$ and $M_n(b)$, respectively. Finally, using similar arguments to those at the end of the proof of Lemma 2, along with the continuous mapping theorem, we conclude that

$$V_n^*(b_0) = \hat D_n(b_0)^T W_n^*(b_0) D_n^*(b_0) \xrightarrow{d} D(b_0)^T W(b_0) D(b_0)$$

conditionally (on the data) in probability.

References

  1. Andrews D. Inconsistency of the Bootstrap when a Parameter is on the Boundary of the Parameter Space. Econometrica. 2000;68(2):399–405.
  2. Arias-Castro E, Candès EJ, Plan Y. Global Testing Under Sparse Alternatives: ANOVA, Multiple Comparisons and the Higher Criticism. Ann. Statist. 2011;39:2533–2556.
  3. Belloni A, Chernozhukov V, Hansen C. Inference on Treatment Effects After Selection Amongst High-Dimensional Controls. Review of Economic Studies. 2014;81(2):608–650.
  4. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Ser. B. 1995;57:289–300.
  5. Berk R, Brown LD, Buja A, Zhang K, Zhao L. Valid Post-Selection Inference. Annals of Statistics. 2013;41:802–837.
  6. Breiman L. The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error. Journal of the American Statistical Association. 1992;87:738–754.
  7. Bühlmann P. Statistical Significance in High-dimensional Linear Models. Bernoulli. 2013;19:1212–1242.
  8. Cai TT, Jiang T. Phase Transition in Limiting Distributions of Coherence of High-dimensional Random Matrices. Journal of Multivariate Analysis. 2012;107:24–39.
  9. Cancer Genome Atlas Research Network. Comprehensive Genomic Characterization Defines Human Glioblastoma Genes and Core Pathways. Nature. 2008;455:1061–1068. doi: 10.1038/nature07385.
  10. Chatterjee A, Lahiri SN. Bootstrapping Lasso Estimators. Journal of the American Statistical Association. 2011;106(494):608–625.
  11. Cheng X. Robust Confidence Intervals in Nonlinear Regression under Weak Identification. Department of Economics, University of Pennsylvania; 2008. Unpublished manuscript. Version posted in 2015: http://www.sas.upenn.edu/xucheng/papers/Cheng mixed id 19.pdf.
  12. Cheng X. Robust Inference in Nonlinear Models with Mixed Identification Strength. Journal of Econometrics. 2015, to appear.
  13. Davies RB. Hypothesis Testing when a Nuisance Parameter Is Present Only under the Alternative. Biometrika. 1977;64(2):247–254.
  14. Donoho D, Jin J. Higher Criticism for Detecting Sparse Heterogeneous Mixtures. Annals of Statistics. 2004;32(3):962–994.
  15. Donoho D, Jin J. Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects. Statistical Science. 2015;30(1):1–25.
  16. Dudoit S, Shaffer JP, Boldrick JC. Multiple Hypothesis Testing in Microarray Experiments. Statistical Science. 2003;18:71–103.
  17. Dudoit S, van der Laan MJ. Multiple Testing Procedures with Applications to Genomics. Springer; New York: 2008.
  18. Efron B. Large-scale Simultaneous Hypothesis Testing: the Choice of a Null Hypothesis. Journal of the American Statistical Association. 2006;99:96–104.
  19. Efron B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press; 2010.
  20. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability; 1993.
  21. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  22. Fan J, Li R. Statistical Challenges with High Dimensionality: Feature Selection in Knowledge Discovery. In: Sanz-Sole M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. III. European Mathematical Society; Zurich: 2006. pp. 595–622.
  23. Fan J, Lv J. Sure Independence Screening for Ultra-high Dimensional Feature Space. Journal of the Royal Statistical Society, Ser. B. 2008;70:849–911, with discussion.
  24. Genovese C, Jin J, Wasserman L, Yao Z. A Comparison of the Lasso and Marginal Regression. Journal of Machine Learning Research. 2012;13:2107–2143.
  25. HIV Drug Resistance Database. Genotype-Phenotype Datasets, Stanford University. 2014. http://hivdb.stanford.edu/pages/genopheno.dataset.html.
  26. Huang J, Ma S, Zhang C-H. Adaptive Lasso for High-dimensional Regression Models. Statistica Sinica. 2008;18:1603–1618.
  27. Ingster YI, Tsybakov AB, Verzelen N. Detection Boundary in Sparse Regression. Electron. J. Statist. 2010;4:1476–1526.
  28. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 6th Edition. Prentice Hall; New Jersey: 2007.
  29. Laber E, Murphy SA. Adaptive Confidence Intervals for the Test Error in Classification (with discussion). Journal of the American Statistical Association. 2011;106(495):904–913.
  30. Laber E, Murphy SA. Adaptive Inference after Model Selection. 2013. Under review.
  31. Laber E, Lizotte D, Qian M, Murphy SA. Dynamic Treatment Regimes: Technical Challenges and Applications. Electronic Journal of Statistics. 2014;8:1225–1272.
  32. Leeb H, Pötscher BM. Can One Estimate the Conditional Distribution of Post-model-selection Estimators? Annals of Statistics. 2006;34(5):2554–2591.
  33. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A Significance Test for the Lasso. Annals of Statistics. 2014;42(2):413–468.
  34. McCloskey A. Bonferroni-based Size-correction for Nonstandard Testing Problems. 2012. Working paper, http://www.econ.brown.edu/fac/adam mccloskey/Research files/McCloskey BBCV.pdf.
  35. Meinshausen N, Meier L, Bühlmann P. P-values for High-dimensional Regression. Journal of the American Statistical Association. 2009;104:1671–1681.
  36. Ning Y, Liu H. A General Theory of Hypothesis Tests and Confidence Regions for Sparse High Dimensional Models. 2015. http://arxiv.org/abs/1412.8765.
  37. Rhee S-Y, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW. Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database. Nucleic Acids Res. 2003;31(1):298–303.
  38. Samworth R. A Note on Methods of Restoring Consistency to the Bootstrap. Biometrika. 2003;90:985–990.
  39. van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.
  40. Zhang C-H, Zhang S. Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models. J. R. Stat. Soc. B. 2014;76:217–242.
  41. Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B. 2005;67(2):301–320.
