Abstract
This paper investigates marginal screening for detecting the presence of significant predictors in high-dimensional regression. Screening large numbers of predictors is a challenging problem due to the non-standard limiting behavior of post-model-selected estimators. There is a common misconception that the oracle property for such estimators is a panacea, but the oracle property only holds away from the null hypothesis of interest in marginal screening. To address this difficulty, we propose an adaptive resampling test (ART). Our approach provides an alternative to the popular (yet conservative) Bonferroni method of controlling family-wise error rates. ART is adaptive in the sense that thresholding is used to decide whether the centered percentile bootstrap applies; otherwise, the test adapts to the non-standard asymptotics in the tightest way possible. The performance of the approach is evaluated using a simulation study, and the method is applied to gene expression data and HIV drug resistance data.
Keywords: Bootstrap, Family-wise error rate, Marginal regression, Non-regular asymptotics, Screening covariates
1 Introduction
The problem of selecting significant predictors is a central aspect of scientific discovery, and has become increasingly important in an era in which massive data sets are readily available (Fan and Li, 2006). Much of the modern statistical literature in this area focuses on consistency of variable selection in high-dimensional settings based on machine learning and data mining techniques (e.g., Fan and Li 2001; Zou and Hastie 2005; Huang et al. 2008; Fan and Lv 2008; Genovese et al. 2012). A major gap in this literature, however, has been the scarcity of formal hypothesis testing procedures that take variable selection into account; the oracle property enjoyed by many variable selection methods in the presence of high dimensionality cannot be applied directly for testing whether a post-model-selected variable is significant. In bioinformatics, for example, variable selection techniques based on penalization (such as the lasso, SCAD, etc.) are routinely used to produce lists of differentially-expressed genes that are most related to disease risk, but few methods for obtaining valid p-values have been developed.
A more traditional approach to the selection of significant predictors is multiple testing to control either the family-wise error rate (FWER) or the false-discovery rate (Benjamini and Hochberg 1995; Dudoit et al. 2003; Efron 2006; Dudoit and van der Laan 2008; Efron 2010). Procedures that control FWER (e.g., Bonferroni, or Holm's procedure) are often criticized as being too conservative (in the sense of having low power). False-discovery rate methods, on the other hand, although having greater power, incur the cost of an inflated FWER. Our aim in the present paper is to introduce a more powerful single test that can be used as an alternative screening procedure to detect the presence of some significant predictor while rigorously controlling FWER.
The proposed procedure uses marginal linear regression to select the predictor (from among covariates X1, . . . , Xp) that has maximal sample correlation with a scalar outcome Y (as in marginal screening or correlation learning, Genovese et al. 2012). The test is based on the estimated marginal regression coefficient of the selected predictor. If there is a unique predictor, say Xk0, maximally correlated with the outcome, then the selection procedure consistently estimates k0, and the estimated coefficient is asymptotically normal; if all predictors are uncorrelated with the outcome, then the selected predictor does not converge (in probability) and the estimated coefficient has a non-normal limiting distribution. In particular, the limiting distribution is discontinuous (at zero) as a function of the regression coefficient of Xk0 (where k0 is not identifiable), and this “non-regularity” causes non-uniform convergence.
Breiman (1992) drew early attention to the issue of invalid post-model-selection inference, calling it the “quiet scandal” of Statistics; even earlier references are mentioned in Berk et al. (2013). Samworth (2003) gave a detailed account of the inaccuracy of bootstrap methods applied to super-efficient estimators. Leeb and Pötscher (2006) (and other papers by the same authors) established that non-uniform limiting behavior of post-model-selected estimators is at the root of the problem, and that estimates of asymptotic null distributions in such settings can give a misleading picture of finite-sample performance. In particular, calibrating a test based on such a post-model-selected estimator in a way that does not adapt to the implicit model selection will be extremely inaccurate. This type of non-regularity occurs in various other settings as well, e.g., when a nuisance parameter is only defined under an alternative hypothesis (Davies, 1977), and when the parameter of interest under the null hypothesis is on the boundary of the parameter space (Andrews, 2000). McCloskey (2012) surveyed non-standard testing problems in econometrics, and introduced some Bonferroni-based size-correction methods designed to improve power. As far as we know, however, there is not yet a resolution of these issues for marginal screening.
In this paper we introduce an adaptive resampling test (ART) for marginal screening that adapts to the small-sample behavior of the estimated marginal regression coefficient of the selected predictor in terms of a local model. Under local alternatives, we find an explicit representation of the asymptotic distribution of this estimator and construct a suitable bootstrap estimator of this distribution that is consistent, thus circumventing the non-regularity mentioned above. Under non-local alternatives, we show that the critical values obtained in this way agree asymptotically with those used by the oracle (who is given knowledge of k0), so ART can be expected to provide good power as well.
Several new approaches to post-model selection inference for linear regression have been proposed in recent years. Meinshausen et al. (2009) introduced a random sample splitting procedure in the high-dimensional setting to obtain (conservative) Bonferroni-adjusted p-values following variable selection. Chatterjee and Lahiri (2011) developed a modified bootstrap method that provides an asymptotically valid confidence region for the regression parameters based on the lasso estimator; this method depends on the presence of at least one active predictor, so it is not applicable to marginal screening (under the null hypothesis there is no active predictor).
More relevant to marginal screening, the covariance test recently introduced by Lockhart et al. (2014) uses a forward stepwise lasso procedure to test for active predictors entering a sparse linear model under the assumption of normal errors. Also in the sparse linear model setting with normal errors, but further assuming that the predictors are nearly uncorrelated, Ingster et al. (2010) and Arias-Castro et al. (2011) have studied the detection boundary and optimality properties of general classes of multiple testing procedures (including Bonferroni and Higher Criticism). Berk et al. (2013) developed a valid method of post-model-selection inference that is feasible for up to about p = 20 predictors, also assuming normal errors. In various sparse high-dimensional settings, Belloni et al. (2014), Bühlmann (2013), Zhang and Zhang (2014) and Ning and Liu (2015) have established asymptotically valid confidence intervals for a preconceived regression parameter after variable selection on the remaining predictors, but this does not apply to marginal screening (where no regression parameter is singled out a priori).
This paper is organized as follows. We formulate the problem and discuss the issue of non-regularity in Section 2. In Section 3, we develop the ART procedure and establish the consistency of the underlying bootstrap. Simulation studies and applications to gene expression data and HIV drug resistance data are presented in Section 4. Concluding discussion appears in Section 5, and proofs are collected in the Appendix.
2 Marginal regression and non-regularity
Consider a scalar outcome Y and a p-dimensional vector of covariates X = (X1, . . . , Xp)T such that the marginal variance of each covariate is finite and non-zero. Marginal regression consists in using separate linear models to predict Y from each Xk. Let k0 be the label of a covariate that maximizes the absolute correlation with Y:

k0 ∈ arg max_{k=1,...,p} |Corr(Xk, Y)|,

and let α0 + θ0Xk0 be the best linear predictor based on Xk0, i.e.,

(α0, θ0) = arg min_{(a,t)∈ℝ²} E(Y − a − tXk0)².   (1)
We are interested in testing whether at least one of the covariates is correlated with Y, for which it suffices to check whether Xk0 and Y are correlated. This is equivalent to testing

H0: θ0 = 0 versus H1: θ0 ≠ 0.
Given an iid sample of size n, let α̂n, θ̂n, and k̂n be the least squares estimates of α0, θ0, and k0, respectively:

k̂n ∈ arg max_{k=1,...,p} |Ĉorr(Xk, Y)|,   (α̂n, θ̂n) = arg min_{(a,t)∈ℝ²} ℙn(Y − a − tXk̂n)²,

where ℙn is the empirical distribution, and the hats indicate sample versions. It is natural to base the test on θ̂n, but calibration is problematic because the distribution of √n(θ̂n − θ0) does not converge uniformly with respect to θ0, as mentioned in the Introduction. The non-uniformity occurs in the neighborhood of θ0 = 0. Specifically, there exists a bounded continuous function h such that fn(θ0) ≡ Eh(√n(θ̂n − θ0)) does not converge uniformly in any neighborhood of θ0 = 0, despite converging pointwise. To see this, first note that under mild conditions, when θ0 = 0,

√n θ̂n ⇝ U ≡ ZK/VK,   K ∈ arg max_{k=1,...,p} |Zk|/√Vk,

where Vk = Var(Xk), and (Z1, . . . , Zp)T is a mean-zero normal random vector with covariance matrix depending on parameters of the full linear model (this is a special case of Theorem 1 below). From the form of the distribution of U, we can choose h so that f∞(θ0) ≡ Eh(U) is discontinuous at θ0 = 0 (this is the non-regularity mentioned in the Introduction). If fn were to converge uniformly to f∞ on some compact neighborhood of zero, we would have a contradiction because each fn is continuous, and the uniform limit of a sequence of continuous functions on a compact interval is continuous.
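To make the non-regularity concrete, the following minimal sketch simulates the sampling distribution of √n θ̂n under the global null; all parameter values and helper names here are illustrative choices of ours, not part of the formal development:

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrt_n_theta_hat(n, p, theta1=0.0, rho=0.5):
    """One draw of sqrt(n) * theta_hat from marginal screening.

    X has an exchangeable correlation structure (shared-factor
    construction); theta1 is the slope on X1, zero under the null.
    """
    f = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * f + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    Y = theta1 * X[:, 0] + rng.standard_normal(n)
    # select the covariate with maximal absolute sample correlation
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    corr = Xc.T @ Yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(Yc))
    k = int(np.argmax(np.abs(corr)))
    theta_hat = np.cov(X[:, k], Y)[0, 1] / np.var(X[:, k], ddof=1)
    return np.sqrt(n) * theta_hat

# Under theta0 = 0 the draws are far from normal (bimodal), because the
# selection step always picks the largest of many mean-zero correlations.
null_draws = [sqrt_n_theta_hat(100, 50) for _ in range(2000)]
print(np.percentile(null_draws, [2.5, 50, 97.5]))
```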
To address this problem, in the next section we develop a formal test procedure (ART) inspired by work of Cheng (2008, 2015) concerning robust confidence intervals for non-linear regression parameters in the presence of weak-identifiability. Other variations of this approach have been used by Laber and Murphy (2011) to construct a confidence interval for the classification error, by Laber et al. (2014) in a sequential decision making problem, and by Laber and Murphy (2013) to provide robust confidence intervals for adaptive lasso. As already noted, the distribution of √n(θ̂n − θ0) does not converge uniformly in any neighborhood of θ0 = 0, so its small sample behavior can be very far from normal when the true parameter is close to zero. Therefore an understanding of the asymptotic behavior of θ̂n under local alternatives plays a crucial role in devising a suitable test, or more generally in providing robust confidence intervals for θ0.
3 Adaptive resampling test
In this section, we develop the proposed ART procedure for detecting the presence of a significant predictor. The idea is to adapt to the inherent non-regular behavior of the post-model-selected estimator in a way that accurately captures its asymptotic behavior in n−1/2-neighborhoods of the null hypothesis.
We frame the problem in terms of the general local linear model

Y = α0 + XTβn + ε,   (2)

where α0 ∈ ℝ, β0 ∈ ℝp, the noise ε has mean 0, finite variance, and is uncorrelated with X, and βn = β0 + n−1/2b0, where b0 ∈ ℝp is the local parameter. The distributions of ε and X are assumed to be fixed, so only the distribution of Y depends on n (although we suppress n in the notation for Y). The relevant hypotheses are now

H0: θn = 0 versus H1: θn ≠ 0,

where θn = Cov(Xkn, Y)/Var(Xkn) and kn is the label of a component of X that maximizes absolute correlation with Y.
Our first result gives the asymptotic distribution of √n(θ̂n − θn). To state the result, we need the notation

k̄(β) ∈ arg max_{k=1,...,p} |Corr(Xk, XTβ)|

for any β ∈ ℝp with β ≠ 0. Note that kn = k̄(βn) under the local model. If k0 ≡ k̄(β0) is unique (so β0 ≠ 0), then kn → k0, and θn is asymptotically bounded away from zero (a non-local alternative). On the other hand, if β0 = 0 and k̄(b0) is unique, then kn = k̄(b0); also θn is of order n−1/2 and represents a local alternative. Finally, if β0 = b0 = 0, then kn is not well-defined and the null hypothesis θn = 0 holds. We need the uniqueness of the most active predictor k0 (away from the null hypothesis), but this seems to be a very mild condition because the likelihood that two or more predictors have exactly the same maximal correlation with Y seems remote in practice. Moreover, as we will see in the simulation study, non-uniqueness of the maximally correlated predictor does not adversely affect power.
Theorem 1
Suppose that k0 = k̄(β0) is unique when β0 ≠ 0, and k̄(b0) is unique when β0 = 0 and b0 ≠ 0. Then, under the local model (2),
√n(θ̂n − θn) ⇝ (ZK(β0) + 1β0=0 Cov(XK, XTb0))/VK − 1β0=0, b0≠0 Cov(Xk̄(b0), XTb0)/Vk̄(b0),

where K = k0 when β0 ≠ 0, K ∈ arg max_k |Zk(0) + Cov(Xk, XTb0)|/√Vk when β0 = 0, and (Z1(β), . . . , Zp(β))T is a mean-zero normal random vector with covariance matrix Σ(β) given by that of the random vector with components

(Xk − EXk)(ε + (X − EX)Tβ − θk(β)(Xk − EXk)),   θk(β) ≡ Cov(Xk, XTβ)/Vk,

for k = 1, . . . , p, and Σ(β0) is assumed to exist.
The non-regularity at β0 = 0 is explained by the dependence of the limiting distribution on the (non-identifiable) local parameter b0. The limiting distribution is nevertheless continuous as a function of b0 into the space of distribution functions (this is a simple consequence of Lemma 3 in the Appendix), and the convergence is uniform over compact subsets of ℝp, unlike the limiting behavior discussed in the previous section, so finite-sample accuracy should be less of an issue when designing a screening test using this result. On the other hand, naive resampling methods that do not take into account the local asymptotic behavior will fail to provide consistent estimates of the distribution of √n(θ̂n − θn), as discussed in the Introduction for the non-local case.
To get around this problem, we decompose √n(θ̂n − θn) in a way that isolates the possibility that β0 ≠ 0 by comparing |Tn| to some threshold λn (to be specified later), where Tn ≡ θ̂n/sn is the post-model-selected t-statistic and sn is the standard error of the slope estimator when regressing Y on Xk̂n. Specifically,
| (3) |
where Gn ≡ √n(ℙn − Pn) is the empirical process, and Pn is the distribution of (X, Y). It is clear that the nonparametric bootstrap is consistent for the first term in (3) if λn/√n → 0 and λn → ∞, since it is easily shown that P(|Tn| > λn) → 1β0≠0. The second term is more problematic though because k̂n does not converge in probability to k0 when β0 = 0. Denote the term in the square brackets by Wn(b), indexed by b ∈ ℝp. Note that √n(θ̂n − θn) = Wn(b0) when this term is active (under β0 = 0), and kn = K̄(b0), where
and
so
| (4) |
All parts of Wn(b) are now seen to be smooth functions of Gn and Pn, so it is reasonable to expect that a consistent bootstrap can be constructed by replacing Gn by its nonparametric bootstrap version G*n, and replacing Pn by ℙn. In such a construction, the event indicated in the second term of (3) is naturally replaced by the event that |T*n| ≤ λn and |Tn| ≤ λn.
Here and throughout the paper, a superscript * is used to indicate the nonparametric bootstrap (sometimes called “bootstrapping in pairs” in regression settings, to distinguish it from the residual bootstrap). The above arguments lead to our main result showing that √n(θ̂n − θn) can indeed be consistently bootstrapped under the general local model. The precise definition of the bootstrapped process is given at the start of the proof.
Theorem 2
Suppose all assumptions in Theorem 1 hold, and the tuning parameter λn satisfies λn/√n → 0 and λn → ∞ as n → ∞. Then, under the local model (2), the bootstrapped process defined in (9) converges to the limiting distribution of √n(θ̂n − θn) conditionally (on the data) in probability.
ART procedure
ART provides a bootstrap calibration for the test statistic √n θ̂n based on a special case of the above theorem. Under H0 we have the simplification θn = 0 and b0 = 0. For some nominal level γ, let cl and cu be the lower and upper γ/2 quantiles, respectively, of the corresponding bootstrapped process in Theorem 2 (with b0 = 0).
If √n θ̂n falls outside the interval [cl, cu], then we reject H0 and conclude that there is at least one significant predictor.
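The following sketch illustrates this calibration in the b0 = 0 case, combining the thresholding of Tn with the two bootstrap regimes in (3). The function name and implementation details are ours, and the non-regular regime mimics the null limit by selecting over recentered marginal covariances; it is a simplification under the stated assumptions, not the authors' reference code:

```python
import numpy as np

def art_pvalue(X, Y, lam, B=1000, seed=1):
    """Sketch of the ART calibration with b0 = 0 (two-sided p-value)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    covs = np.cov(X.T, Y)                      # (p+1) x (p+1)
    cxy, vx = covs[:p, p], np.diag(covs)[:p]
    k = int(np.argmax(np.abs(cxy / np.sqrt(vx))))
    theta = cxy[k] / vx[k]
    # t-statistic of the selected slope (textbook OLS standard error)
    resid = (Y - Y.mean()) - theta * (X[:, k] - X[:, k].mean())
    se = np.sqrt(resid.var(ddof=2) / (n * vx[k]))
    Tn = theta / se

    stats = np.empty(B)
    for b in range(B):
        i = rng.integers(n, size=n)            # bootstrap in pairs
        covb = np.cov(X[i].T, Y[i])
        cxyb, vxb = covb[:p, p], np.diag(covb)[:p]
        if abs(Tn) > lam:
            # regular regime: centered percentile bootstrap
            kb = int(np.argmax(np.abs(cxyb / np.sqrt(vxb))))
            stats[b] = np.sqrt(n) * (cxyb[kb] / vxb[kb] - theta)
        else:
            # non-regular regime: mimic the b0 = 0 limit by selecting
            # over recentered bootstrap covariances
            Zb = np.sqrt(n) * (cxyb - cxy)
            kb = int(np.argmax(np.abs(Zb) / np.sqrt(vxb)))
            stats[b] = Zb[kb] / vxb[kb]
    obs = np.sqrt(n) * theta
    return min(1.0, 2 * min((stats >= obs).mean(), (stats <= obs).mean()))
```

With λn = 0 this reduces to the centered percentile bootstrap, and for very large λn the CPB regime is never used; these two extremes correspond to the behavior seen in the simulations of Section 4.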
Before applying ART, it is advisable to standardize all the variables Xk and Y (by sample mean and standard deviation), which has the advantage of making the procedure scale invariant (θ̂n is then the maximal sample correlation); our results naturally extend, but we develop the theory only for the unstandardized variables to keep the presentation simple.
Robust confidence intervals
The above theorem also allows the construction of a robust confidence interval for θn by treating b0 as unknown, and then finding the widest bootstrap quantiles over all b0. Here by “robust” we mean asymptotically valid uniformly over b0. For testing purposes, however, this approach would be too conservative and also computationally intensive (a grid search over b0 ∈ ℝp would be needed); for this reason, in ART we set b0 = 0 under the null, so the critical values can be readily computed from the corresponding bootstrap distribution. In contrast, Laber and Murphy (2013) propose using almost sure bounds over their local parameter to find robust confidence intervals for adaptive lasso; this involves less computation than distributional bounds, but is still computationally intensive, and it produces more conservative confidence intervals than the distributional approach.
Choice of the tuning parameter λn
The above theorem requires that λn/√n → 0 and λn → ∞ as n → ∞. Under this condition, the thresholding provides a consistent pre-test (for θn = 0) with asymptotically negligible type I error rate: limn→∞ P(|Tn| > λn | θn = 0) = 0. On the other hand, if λn increases too quickly, the pre-test will be conservative. One simple choice would be to set λn = √(a log n), for some constant a > 0, but it is also desirable that λn increase with p; see Section 5 for discussion about the null limiting behavior of Tn as both p and n → ∞. To that end, note that by Theorem 1, in the special case that ε and X are independent, under θn = 0 (or b0 = 0 and β0 = 0) we have |Tn| ⇝ maxk |Z̃k|, where Z̃k ≡ Zk(0)/(σ√Vk), σ² = Var(ε), and (Z̃1, . . . , Z̃p)T is a vector of standard normal random variables. Thus, for any fixed λ > 0,

limn→∞ P(|Tn| > λ | θn = 0) = P(maxk |Z̃k| > λ).

Hence the pre-test type I error rate can be asymptotically controlled below level γ, without sacrificing consistency, by choosing

λn = max{√(a log n), q1−γ},   (5)

where q1−γ denotes the (1 − γ)-quantile of maxk |Z̃k|.
In the simulation study below we describe a way of specifying the constant a via the double bootstrap, and this is used whenever we refer to ART in the sequel.
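Assuming standardized predictors (so that the correlation matrix of the Z̃k can be estimated by that of the Xk), the p-dependent part of (5) can be approximated by simulation. This sketch, with hypothetical names, is one way to do it:

```python
import numpy as np

def threshold_lambda(X, n, a=4.0, gamma=0.05, ndraws=5000, seed=2):
    """Sketch of the threshold in (5): the larger of sqrt(a log n) and
    the (1 - gamma)-quantile of max_k |Z~_k|, simulated from the
    estimated correlation matrix of the (standardized) predictors."""
    rng = np.random.default_rng(seed)
    R = np.corrcoef(X.T)                          # estimate of Corr(X)
    p = R.shape[0]
    L = np.linalg.cholesky(R + 1e-8 * np.eye(p))  # jitter for stability
    Z = rng.standard_normal((ndraws, p)) @ L.T    # rows ~ N(0, R)
    q = np.quantile(np.abs(Z).max(axis=1), 1 - gamma)
    return max(np.sqrt(a * np.log(n)), q)
```

In the simulations the constant a is instead calibrated by the double bootstrap, as described next.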
Forward stepwise ART
If we find a significant predictor using ART, it would be reasonable to continue applying the procedure in a forward stepwise fashion until no more significant predictors are detected. That is, in successive stages the residual is treated as a new outcome variable and marginal regression carried out on the remaining predictors. Although it would be challenging to extend our theoretical results to this procedure, we find that in real data applications it performs well, and in a similar way to the covariance test of Lockhart et al. (2014), as we discuss in the HIV drug resistance example considered in the next section.
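A sketch of this stepwise loop, reusing the art_pvalue and threshold_lambda sketches above (and, like them, only illustrative):

```python
import numpy as np

def forward_stepwise_art(X, Y, alpha=0.05, max_steps=20):
    """Sketch of forward stepwise ART: rerun the test on the residual
    after regressing Y on the predictors selected so far (reuses the
    art_pvalue and threshold_lambda sketches above)."""
    n, p = X.shape
    active = []
    resid = Y - Y.mean()
    for _ in range(max_steps):
        rest = [k for k in range(p) if k not in active]
        lam = threshold_lambda(X[:, rest], n)
        if art_pvalue(X[:, rest], resid, lam) > alpha:
            break  # no further significant predictor detected
        # enter the remaining predictor maximally correlated with resid
        corr = np.corrcoef(X[:, rest].T, resid)[-1, :-1]
        active.append(rest[int(np.argmax(np.abs(corr)))])
        # refit OLS on all selected predictors; update the residual
        Z = np.column_stack([np.ones(n), X[:, active]])
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        resid = Y - Z @ beta
    return active
```

As noted above, the p-values at later steps ignore the dependence induced by using residuals as outcomes, so this is a heuristic rather than a procedure covered by Theorem 2.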
4 Numerical studies
In this section, we study the performance of the proposed ART procedure using simulated data, and give illustrations of the approach in two real data examples.
4.1 Finite sample simulations
We compare the performance of ART with four procedures that are commonly used for detecting the presence of a significant predictor:
Likelihood ratio test (LRT)
This test is based on assuming a full linear model involving all of the covariates, and is applicable when n > p. Under the null hypothesis, all the regression coefficients are zero. The reduction in the residual sum of squares is compared to the residual sum of squares for the full model using an F-ratio [see, e.g. Section 7.4 of Johnson and Wichern (2007)]. When the full linear model holds, it can be seen that both null and alternative hypotheses are identical to those used in ART.
Multiple testing with Bonferroni correction
As in ART, marginal linear models are used to predict Y from each Xk. A t-test with Bonferroni correction is then carried out to detect whether each regression coefficient is non-zero. The intersection of the p null hypotheses coincides with the null used in ART.
Centered percentile bootstrap (CPB)
This procedure is similar to ART, except that the quantiles of the centered bootstrap statistic √n(θ̂*n − θ̂n) are used directly as critical values for the test statistic √n θ̂n, without any thresholding; see Efron and Tibshirani (1993).
Higher Criticism (HC)
This is a test originally proposed by John Tukey for determining the overall significance of a collection of independent p-values. We apply the statistic developed by Donoho and Jin (2004, 2015), which is expected to perform well if the predictors are nearly uncorrelated.
We consider three examples for the data generating model: i) Y = ε, ii) Y = X1/4 + ε, and iii) Y = XTβ + ε, where β1 = . . . = β5 = 0.15, β6 = . . . = β10 = −0.1, and βk = 0 for k = 11, . . . , p. In the first example, there is no active predictor, in the second there is a single active predictor, and in the third there are 10 active predictors and the maximally correlated predictor is not unique. The covariate vector X is distributed as p-dimensional normal with each component Xk ~ N(0, 1), an exchangeable correlation structure Corr(Xj, Xk) = ρ for j ≠ k, where ρ takes values 0, 0.5 and 0.8, and the noise ε ~ N(0, 1) is independent of X.
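For reference, the three data generating models can be simulated as follows; the shared-factor construction of the exchangeable correlation is our implementation choice:

```python
import numpy as np

def simulate(model, n, p, rho, rng):
    """One data set from models i)-iii) of this section."""
    f = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * f + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    beta = np.zeros(p)
    if model == 2:                      # ii) Y = X1/4 + eps
        beta[0] = 0.25
    elif model == 3:                    # iii) ten active predictors
        beta[:5], beta[5:10] = 0.15, -0.1
    return X, X @ beta + rng.standard_normal(n)

X, Y = simulate(3, 100, 50, 0.8, np.random.default_rng(3))
```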
We consider two sample sizes (n = 100 and 200), and five values of the dimension (p = 10, 50, 100, 150 and 200). A nominal 5% significance level is used throughout. The bootstrap sample size is taken as 1,000. To specify the threshold λn in ART, the double bootstrap is implemented by generating 1,000 bootstrap estimates θ̂*n, then choosing λn so that 5% of the ARTs (based on 1,000 nested bootstrap samples) with test statistic √n(θ̂*n − θ̂n) reject.
Empirical rejection rates based on 1,000 Monte Carlo replications are reported in Figures 1–3. For model i), the figures provide type I error rates, which should be compared with the 5% nominal rate; for models ii) and iii), the figures provide the power of each test. The ART procedure has good control of the type I error rate throughout (compared to all the other methods), while consistently maintaining relatively high power. Comparing the results of models ii) and iii), non-uniqueness of the maximally correlated predictor has no adverse effect on the power of ART.
Figure 1.
Empirical rejection rates based on 1,000 samples generated from models i), ii) and iii) as the dimension ranges from p = 10 to p = 200, for n = 100 (top row) and n = 200 (bottom row), and ρ = 0.8.
Figure 3.
Empirical rejection rates as in Figure 1 except for independent predictors.
Bonferroni is highly conservative when ρ = 0.5 and 0.8, see the left panels of Figures 1 and 2. The CPB method is highly anti-conservative, with empirical type I error rates exceeding 15% for both sample sizes (and thus out of range for most of the panels on the left). The LRT effectively controls the type I error rate at around the nominal 5% level when it is applicable, but it has very low power compared with all the other methods, except under model iii) in the “classical case” of small numbers of predictors that are not highly correlated, see the right panels of Figures 2 and 3. Higher Criticism fails to control type I error except when the predictors are independent (Figure 3), in which case it is slightly anti-conservative and has excellent power under model iii), but very poor under model ii). That is, HC performs well (under zero correlation) when there are multiple active predictors, but not in the sparse case of only one active predictor. Except in the case of independent predictors, when Bonferroni is slightly better, ART outperforms all the competing procedures when both type I error and power are taken into account, and the improvement increases with the correlation between predictors.
Figure 2.
Empirical rejection rates as in Figure 1 except with lower correlation between predictors: ρ = 0.5.
4.2 Asymptotic power
In this section, we carry out a simulation study to assess the asymptotic power of ART compared with that of the Bonferroni procedure. The computational expense of implementing ART is high because of the double bootstrap, so our full simulation study of the previous section is only feasible for small sample sizes. Nevertheless, we are able to assess asymptotic power by making use of our results on the local model in Section 3.
Consider the local model Y = (n−1/2b0)X1 + ε, where b0 ≥ 0 is a scalar. Here X and ε are generated in the same way as in Section 4.1, but now we only consider ρ = 0.5. The local parameter thus takes the special form b0 = (b0, 0, . . . , 0)T, and we allow b0 to vary over a grid in [0, 5], in increments of 0.5. We set β0 = 0 and b0 = (b0, 0, . . . , 0)T, and make use of the given covariance structure of X and the explicit form of the limiting distribution in Theorem 1 to generate draws from the asymptotic distribution of √n θ̂n. Specifically, we carry out the following steps:
1. For each value of b0 on the grid, take 5,000 draws from the limiting distribution of √n(θ̂n − θn) given in Theorem 1 (this distribution only depends on b0 and the given distribution of (X, Y)), then add b0 to obtain draws from the limiting distribution of √n θ̂n. Based on these draws, we can obtain the (approximate) rejection rate of the test statistic √n θ̂n for any given rejection region (a one-line calculation; see the sketch following these steps). In particular, the asymptotic rejection rate of ART (for any given b0 on the grid) can be calculated by referring to the rejection rate corresponding to the particular critical values cl and cu generated by ART.
2. To assess the asymptotic power of ART at each given b0, we generate 10 independent large samples (with n = 5,000) from the local model, find cl and cu for each sample, and display in a boxplot the corresponding asymptotic rejection rates (using the results of step 1).
3. For comparison, we also plot the asymptotic power of the Bonferroni procedure, which is approximated using 1,000 samples each of size n = 5,000.
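The rejection-rate computation in step 1 is elementary once the draws are in hand; for concreteness (hypothetical helper name):

```python
import numpy as np

def rejection_rate(draws, cl, cu):
    """Fraction of draws from the limiting distribution of
    sqrt(n) * theta_hat that fall outside the ART interval [cl, cu]."""
    draws = np.asarray(draws)
    return float(np.mean((draws < cl) | (draws > cu)))
```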
The results are presented in Figure 4 for p = 10 and 50. The main source of variation within each boxplot is due to randomness over the 10 independent samples drawn from the local model, rather than bootstrap randomness (in view of bootstrap consistency and the large sample size n = 5,000). The median of each boxplot provides a suitable reference point to compare with the asymptotic power of Bonferroni (indicated by the circle). Note that ART provides accurate control of asymptotic type I error, and, as expected, Bonferroni is slightly conservative. In terms of median power, ART always outperforms Bonferroni, and can provide an additional 25% power (e.g., at b0 = 3 for p = 10, and at b0 = 3.5 for p = 50).
Figure 4.
Asymptotic type I error and power of ART (box plots) compared with Bonferroni (circles) as a function of the local parameter b0, for p = 10 and 50, ρ = 0.5, calculated using steps 1–3 in Section 4.2.
The cost of implementing the double bootstrap part of ART makes it prohibitive to extend the results in Figure 4 to larger p, but if we fix λn, then it becomes practical to run the simulations for p = 1000. Figure 5 shows how the asymptotic power of ART compares with Bonferroni as the constant a used to specify λn takes values 2, 4, 5 and 8 (the corresponding λn are 4.3, 6.1, 6.8 and 8.6). Note that as a increases (going from one panel to the next), ART becomes more stable and provides more accurate type I error control, but the overall power decreases. At small values of a, ART behaves like the CPB, which is anti-conservative (as we have already seen in the previous section), whereas at larger values the influence of CPB is diluted. For the CPB (which corresponds to setting λn = 0), the plot (not shown) appears very similar to that for a = 2; also, for a > 8 the plots appear very similar to a = 8. The best choice of a, therefore, is a trade-off between type I error control and power; comparing with Figure 4, ART with double bootstrapping appears to achieve a satisfactory balance in this regard. Also note that, even at the largest value a = 8, ART can provide an additional 20% power over Bonferroni, and thus outperform Bonferroni by a considerable margin in high-dimensional settings as well, at least when there is a high degree of correlation among the components of X.
Figure 5.
Asymptotic type I error and power of ART compared with Bonferroni for p = 1,000 and ρ = 0.5, where ART is implemented using a fixed threshold λn specified by a = 2, 4, 5, 8, and each box plot is based on 20 independent replications with n = 10,000.
4.3 Gene expression example
We consider gene expression profiles from the tumors of n = 156 patients diagnosed with a common type of adult brain cancer (glioblastoma), collected as part of the Cancer Genome Atlas pilot project (TCGA, 2008). Our analysis is based on log gene expression levels X at p = 181 loci along chromosome 1. We are interested in detecting the presence of a gene that is significantly related to log-survival time Y.
We compare the results from applying the Bonferroni, CPB and ART procedures; LRT is not applicable since p > n. The three methods yield very different p-values. The smallest Bonferroni adjusted p-value is 40.8%, suggesting that no gene is significantly related to Y. The CPB and ART p-values are 3.2% and 17.2%, respectively, from 1000 bootstrap samples. Figure 6 shows how these p-values are calculated. Thus the CPB method suggests the presence of a significant genetic effect, whereas ART does not.
Figure 6.
Gene expression example. Left panel: histogram of the centered bootstrap statistic, showing that the two-sided CPB p-value is 3.2%. Right panel: histogram of the ART bootstrap statistic, showing that the two-sided ART p-value is 17.2%.
4.4 HIV drug resistance example
Our second example uses data from the HIV Drug Resistance Database (2014), an important public resource for understanding how HIV-1 mutation patterns cause resistance to antiretroviral drugs (Rhee et al., 2003). We will compare our results with those of Lockhart et al. (2014), who applied their covariance test to data on the susceptibility (a measure of drug resistance) of the nucleotide reverse transcriptase inhibitor lamivudine (3TC). We code susceptibility on a log scale (Y), and each predictor Xj indicates the presence/absence of a mutation at viral sequence position j. Excluding missing data and rare mutations resulted in data on p = 103 positions and a total of 1,266 isolates.
We randomly split the data 50 times into a training set of size n = 126 and a test set of size 1,140. For each split, we carry out 20 steps of forward stepwise ART and standard forward stepwise regression using the training data, and calculate the corresponding prediction error (including all previously selected variables) using the test data. The left panel of Figure 7 shows the training data p-values (mean ± SD) for the newly entered predictor at each step, over the 50 random splits, and the right panel shows the corresponding prediction errors (mean ± SD). Forward stepwise ART detects one very highly significant mutation, but no more, as confirmed by the test set error plot, and this result is roughly consistent with the findings of Lockhart et al. (2014). Standard forward stepwise regression picks out at least 10 mutations, but there is no improvement in test set error after the first predictor enters the model; moreover, its test error almost exactly coincides with that of forward stepwise ART.
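One random split of the kind used here can be sketched as follows; for simplicity we record the test error after each entering predictor via plain marginal screening on residuals, omitting the ART p-value stopping rule (helper names are ours):

```python
import numpy as np

def one_split(X, Y, n_train=126, steps=20, seed=4):
    """One random split: screen on the training half, track test MSE
    as each predictor enters (stopping rule omitted for brevity)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    tr, te = idx[:n_train], idx[n_train:]
    active, errors = [], []
    resid = Y[tr] - Y[tr].mean()
    for _ in range(steps):
        rest = [k for k in range(X.shape[1]) if k not in active]
        corr = np.corrcoef(X[tr][:, rest].T, resid)[-1, :-1]
        active.append(rest[int(np.argmax(np.abs(corr)))])
        Ztr = np.column_stack([np.ones(len(tr)), X[tr][:, active]])
        Zte = np.column_stack([np.ones(len(te)), X[te][:, active]])
        beta, *_ = np.linalg.lstsq(Ztr, Y[tr], rcond=None)
        resid = Y[tr] - Ztr @ beta
        errors.append(float(np.mean((Y[te] - Zte @ beta) ** 2)))
    return active, errors
```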
Figure 7.
HIV drug resistance example. Left panel: training set p-values (mean ± SD) over 50 random splits of the data for forward stepwise ART (solid line), standard forward stepwise regression (dash-dot line) and the 0.05 alpha level (dotted). Right panel: test set error for the corresponding models (including all previously selected variables); the two lines are almost indistinguishable.
5 Discussion
In this paper we have developed an adaptive resampling test (ART) for detecting the existence of a significant predictor, Xk0, from among predictors X1, . . . , Xp. The procedure is designed to adjust to the non-regular limiting behavior of the estimated marginal regression coefficient of the selected predictor. This is done by using a thresholded version of the bootstrap that adapts to the non-regularity: if there is at least one significant predictor, it reduces to a centered percentile bootstrap, otherwise it mimics the local (non-uniform) asymptotic behavior of . We have shown that in simulation studies, ART performs favorably compared with standard methods such as Bonferroni, but also compared with more sophisticated methods such as Higher Criticism. The advantage of ART may stem from it being designed to take into account correlations between predictors, while also avoiding distributional assumptions (the nonparametric bootstrap steps in ART are essentially distribution free). We have restricted attention to linear models, but our approach has much wider applicability (e.g., generalized linear models, quantile regression, and censored time-to-event outcomes), and these will be studied in future papers.
Although our simulation results suggest that ART is useful and remarkably stable in “large p, small n” settings, the asymptotic theory that we have used to calibrate ART relies on assuming a fixed p, with n tending to infinity. In view of the conservative nature of the Bonferroni procedure in high-dimensional settings, there is a pressing need for more powerful tests in this area. In future work it would be of interest to develop the asymptotic theory of ART for the case of p growing with n, although this would be very challenging. As far as we know, formal testing procedures that provably control FWER and adjust to non-regularity under diverging p are not yet available, except for Higher Criticism in the case that the predictors are nearly uncorrelated, as established by Ingster et al. (2010) and Arias-Castro et al. (2011). In the only other instance we know of, under the strong assumption that X1, . . . , Xp, Y are iid N(0, 1), results of Cai and Jiang (2012) can be used to find the weak limit of the maximal sample correlation (suitably centered and scaled in terms of log p and n) and thus devise an asymptotically correct calibration when p = pn → ∞ at a sub-exponential rate, log(p)/n → 0; in the super-exponential case, log(p)/n → ∞, a similar weak limit holds under a different normalization.
Another interesting direction for future work would be to study the forward stepwise version of ART discussed in Section 3. Modifications to ART when applied stepwise in this way would be needed to adjust for the implicit dependence among the new outcomes. By repeating such a procedure until no more significant predictors are detected, the aim would be to correctly identify all active predictors.
Acknowledgments
Research supported by NIH Grant R01GM095722-01 and NSF Grant DMS-1307838.
Appendix: Proofs
Proof of Theorem 1
For k = 1, . . . , p, let . Then and . It is easy to verify that ,
| (6) |
where Pn is the distribution of (Y, X), and the mean residual squared error
| (7) |
The result then follows immediately from the following two lemmas. The first lemma verifies the oracle property for marginal regression under the assumption that there is at least one active predictor; the proof is included for completeness. The second lemma gives the (non-regular) asymptotic behavior of √n(θ̂n − θn) when there are no active predictors.
Lemma 1
If all conditions in Theorem 1 hold and β0 ≠ 0, then k̂n → k0 a.s. and √n(θ̂n − θn) ⇝ Zk0(β0)/Vk0, where Zk0 is defined in Theorem 1.
Proof
Denote R̂ ≡ (R̂1, . . . ,R̂p)T. When β0 ≠ 0, Var(XT β0) > 0. By the SLLN
Since R̂k → Corr2(Xk, XTβ0) a.s. and Corr2(Xk, XTβ0) is maximized at k = k0, it follows immediately that k̂n → k0 a.s.
Next, denote X̂ = Xk̂n and Xn = Xkn. Since and Y = α0 + XTβn + ε, we have
where the second equality uses and kn → k0 as n → ∞, and the third equality follows from the LLN and Cov(ε, Xk0) = 0. Similarly, . The proof is completed using Slutsky's lemma and the CLT.
Lemma 2
If all conditions in Theorem 1 hold and β0 = 0, then √n(θ̂n − θn) converges in distribution to the β0 = 0 limit given in Theorem 1.
Proof
Since (Z1(0), . . . , Zp(0))T is a normal random vector and |Corr(Xj, Xk)| < 1 for j ≠ k, it is easy to see that
| (8) |
So K is unique a.s.
Denote . Note that when β0 = 0, . By the CLT and Slutsky's lemma, we see from (6) that
From (7), we have
where ☉ denotes the elementwise (Hadamard) product, so, by the continuous mapping theorem and Slutsky's lemma,
Define h(t) = (1{arg maxk tk = 1}, . . . , 1{arg maxk tk = p})T, where t = (t1, . . . , tp)T ∈ ℝp. Note that h is continuous at t if arg maxk tk is unique. Thus, using (8), the result follows by applying the continuous mapping theorem to the above display.
Lemma 3
Let Z be a p-dimensional random vector and f: ℝp × ℝp → ℝp a function such that f(z, ·) is continuous for every z ∈ ℝp, and f(Z, b)j ≠ f(Z, b)k a.s. for all j ≠ k and b ∈ ℝp. Then K(b) ≡ arg maxk=1,...,p f(Z, b)k is unique a.s. Also, if bl → b0, then K(bl) = K(b0) for l sufficiently large, a.s.
The proof is omitted. An immediate consequence of this lemma is the continuity of the limiting distribution in Theorem 1 as a function of b0; this is seen by setting f(Z, b)k = |Zk(0) + Cov(Xk, XTb)|/√Vk for k = 1, . . . , p, and using (8).
Proof of Theorem 2
The notation θ̂*n and k̂*n means that θ̂n and k̂n are based on n iid observations taken from ℙn. The bootstrapped process in the statement of the theorem is defined by re-expressing (4), along with K̄(b) and Wn(b), in terms of Pn and Gn operating on functions of (X, Y), then replacing Pn by ℙn and Gn by G*n throughout. Since ε is not observed, we also replace it by the estimated residual, resulting in
| (9) |
where G*n ≡ √n(ℙ*n − ℙn) is the bootstrapped empirical process. As is conventional in empirical process theory, G*n, Gn and Pn are assumed to operate only on functions that are defined on (X, Y), explaining why the indicator terms can be separated in the above display.
Let EM denote expectation conditional on the data, and let PM be the corresponding probability measure. We will show that and conditionally (on the data) in probability. This together with Lemmas 4 and 5 below implies the result.
For k = 1, . . . , p, the bootstrapped marginal regression coefficient satisfies
| (10) |
When β0 = 0, by Lemma 2 and the condition that λn → ∞ as n → ∞, we have in probability. When β0 ≠ 0, it is easy to verify that , which is positive under the condition that k0 is unique. Thus
tends to zero in probability when β0 ≠ 0, where the convergence follows from Lemma 1, Lemma 4 (below) and the condition that λn/√n → 0. Hence
tends to zero in probability. This implies that and conditionally in probability. Since 1|Tn|≤λn converges to 1β0=0 in probability, the result follows from Slutsky's lemma.
Lemma 4
If the conditions in Theorem 1 hold and β0 ≠ 0, then k̂*n → k0 conditionally (on the data) a.s., and √n(θ̂*n − θ̂n) converges in distribution to the limit in Lemma 1 conditionally (on the data) in probability.
Proof
It follows from (10), the SLLN and Slutsky's lemma that, when β0 ≠ 0,
and a.s. for k = 1, . . . , p. Denote the bootstrap mean squared error
where and . Then we can write
since the denominator plays no role. By Slutsky's lemma
a.s. for k = 1, . . . , p, so we obtain
where the convergence follows from the condition that k0 is unique when β0 ≠ 0.
Recall that , where X̂ ≡ Xk̂n. Note that . By the definition of , we have
| (11) |
The last term in (11) is opM (1) a.s. because the first and last terms within the square bracket cancel asymptotically, similarly for the second and third terms, due to and k̂n → k0 a.s. We next show that the first term in (11) converges in distribution to Zk0 (β0) conditionally (on the data) in probability. By Lemma 1, it is easy to verify that and . Denote . Then the first term can be decomposed as
| (12) |
The first term in (12) is oPM (1) a.s. since . The second term in (12) can be written as
which is oPM (1) in probability by bootstrap consistency of the sample mean [see, e.g., Theorem 23.4 of van der Vaart (1998)], and the fact that X̂ = Xk0 for n sufficiently large a.s. Bootstrap consistency of the sample mean also gives that the third term in (12) converges in distribution to Zk0(β0) conditionally (on the data) in probability.
Similarly, the second and third terms in (11) can be shown to be oPM(1) in probability. The result then follows from Slutsky's lemma.
Lemma 5
If all conditions in Theorem 1 hold and β0 = 0, then the bootstrapped process in (9) converges to the same limiting distribution as √n(θ̂n − θn), conditionally (on the data) in probability.
Proof
Define , and M′(b) to be p-vectors with kth components given by ,
respectively. Let be a p × p matrix with the (j, k)-th component given by
Also, let D(b) and D′(b) be p-vectors of zeros, apart from a 1 in the entry that maximizes M(b) and M′(b), respectively. Then
Similarly, define , and (without indexing by n) to be processes of the same form as , and , except with replaced by Zk(0), and the sample variances/covariances replaced by their population versions.
Referring to the notation in (4), it is clear that when β0 = 0,
Moreover, the second equality in the above display also holds for the bootstrap version. Writing the bootstrapped version of Wn(b0) in (9) as
and using arguments similar to those in the proof of Lemma 4 for handling (12), we have conditionally (on the data) in probability. As a result, conditionally (on the data) in probability, where is the sample version of D′(b), and and are the bootstrap versions of and , respectively. Finally, using similar arguments to those at the end of the proof of Lemma 2, along with the continuous mapping theorem, we conclude that
conditionally (on the data) in probability.
References
- Andrews D. Inconsistency of the Bootstrap when a Parameter is on the Boundary of the Parameter Space. Econometrica. 2000;68(2):399–405.
- Arias-Castro E, Candès EJ, Plan Y. Global Testing Under Sparse Alternatives: ANOVA, Multiple Comparisons and the Higher Criticism. Annals of Statistics. 2011;39:2533–2556.
- Belloni A, Chernozhukov V, Hansen C. Inference on Treatment Effects After Selection Amongst High-Dimensional Controls. Review of Economic Studies. 2014;81(2):608–650.
- Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Ser. B. 1995;57:289–300.
- Berk R, Brown LD, Buja A, Zhang K, Zhao L. Valid Post-Selection Inference. Annals of Statistics. 2013;41:802–837.
- Breiman L. The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error. Journal of the American Statistical Association. 1992;87:738–754.
- Bühlmann P. Statistical Significance in High-dimensional Linear Models. Bernoulli. 2013;19:1212–1242.
- Cai TT, Jiang T. Phase Transition in Limiting Distributions of Coherence of High-dimensional Random Matrices. Journal of Multivariate Analysis. 2012;107:24–39.
- Cancer Genome Atlas Research Network. Comprehensive Genomic Characterization Defines Human Glioblastoma Genes and Core Pathways. Nature. 2008;455:1061–1068.
- Chatterjee A, Lahiri SN. Bootstrapping Lasso Estimators. Journal of the American Statistical Association. 2011;106(494):608–625.
- Cheng X. Robust Confidence Intervals in Nonlinear Regression under Weak Identification. Department of Economics, University of Pennsylvania; 2008. Unpublished manuscript. Version posted in 2015: http://www.sas.upenn.edu/xucheng/papers/Cheng mixed id 19.pdf.
- Cheng X. Robust Inference in Nonlinear Models with Mixed Identification Strength. Journal of Econometrics. 2015, to appear.
- Davies RB. Hypothesis Testing when a Nuisance Parameter Is Present Only under the Alternative. Biometrika. 1977;64(2):247–254.
- Donoho D, Jin J. Higher Criticism for Detecting Sparse Heterogeneous Mixtures. Annals of Statistics. 2004;32(3):962–994.
- Donoho D, Jin J. Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects. Statistical Science. 2015;30(1):1–25.
- Dudoit S, Shaffer JP, Boldrick JC. Multiple Hypothesis Testing in Microarray Experiments. Statistical Science. 2003;18:71–103.
- Dudoit S, van der Laan MJ. Multiple Testing Procedures with Applications to Genomics. Springer; New York: 2008.
- Efron B. Large-scale Simultaneous Hypothesis Testing: the Choice of a Null Hypothesis. Journal of the American Statistical Association. 2006;99:96–104.
- Efron B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press; 2010.
- Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability; 1993.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. Statistical Challenges with High Dimensionality: Feature Selection in Knowledge Discovery. In: Sanz-Sole M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. III. European Mathematical Society; Zurich: 2006. pp. 595–622.
- Fan J, Lv J. Sure Independence Screening for Ultra-high Dimensional Feature Space (with discussion). Journal of the Royal Statistical Society, Ser. B. 2008;70:849–911.
- Genovese C, Jin J, Wasserman L, Yao Z. A Comparison of the Lasso and Marginal Regression. Journal of Machine Learning Research. 2012;13:2107–2143.
- HIV Drug Resistance Database. Genotype-Phenotype Datasets, Stanford University. 2014. http://hivdb.stanford.edu/pages/genopheno.dataset.html.
- Huang J, Ma S, Zhang C-H. Adaptive Lasso for High-dimensional Regression Models. Statistica Sinica. 2008;18:1603–1618.
- Ingster YI, Tsybakov AB, Verzelen N. Detection Boundary in Sparse Regression. Electronic Journal of Statistics. 2010;4:1476–1526.
- Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 6th Edition. Prentice Hall; New Jersey: 2007.
- Laber E, Murphy SA. Adaptive Confidence Intervals for the Test Error in Classification (with discussion). Journal of the American Statistical Association. 2011;106(495):904–913.
- Laber E, Murphy SA. Adaptive Inference after Model Selection. 2013. Under review.
- Laber E, Lizotte D, Qian M, Murphy SA. Dynamic Treatment Regimes: Technical Challenges and Applications. Electronic Journal of Statistics. 2014;8:1225–1272.
- Leeb H, Pötscher BM. Can One Estimate the Conditional Distribution of Post-model-selection Estimators? Annals of Statistics. 2006;34(5):2554–2591.
- Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A Significance Test for the Lasso. Annals of Statistics. 2014;42(2):413–468.
- McCloskey A. Bonferroni-based Size-correction for Nonstandard Testing Problems. 2012. Working Paper, http://www.econ.brown.edu/fac/adam mccloskey/Research files/McCloskey BBCV.pdf.
- Meinshausen N, Meier L, Bühlmann P. P-values for High-dimensional Regression. Journal of the American Statistical Association. 2009;104:1671–1681.
- Ning Y, Liu H. A General Theory of Hypothesis Tests and Confidence Regions for Sparse High Dimensional Models. 2015. http://arxiv.org/abs/1412.8765.
- Rhee S-Y, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW. Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database. Nucleic Acids Research. 2003;31(1):298–303.
- Samworth R. A Note on Methods of Restoring Consistency to the Bootstrap. Biometrika. 2003;90:985–990.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.
- Zhang C-H, Zhang S. Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models. Journal of the Royal Statistical Society, Ser. B. 2014;76:217–242.
- Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Ser. B. 2005;67(2):301–320.