Abstract
This paper investigates marginal screening for detecting the presence of significant predictors in high-dimensional regression. Screening large numbers of predictors is a challenging problem due to the non-standard limiting behavior of post-model-selected estimators. There is a common misconception that the oracle property for such estimators is a panacea, but the oracle property only holds away from the null hypothesis of interest in marginal screening. To address this difficulty, we propose an adaptive resampling test (ART). Our approach provides an alternative to the popular (yet conservative) Bonferroni method of controlling family-wise error rates. ART is adaptive in the sense that thresholding is used to decide whether the centered percentile bootstrap applies; otherwise, the test adapts to the non-standard asymptotics in the tightest way possible. The performance of the approach is evaluated using a simulation study, and the method is applied to gene expression data and HIV drug resistance data.
Keywords: Bootstrap, Family-wise error rate, Marginal regression, Non-regular asymptotics, Screening covariates
1 Introduction
The problem of selecting significant predictors is a central aspect of scientific discovery, and has become increasingly important in an era in which massive data sets are readily available (Fan and Li, 2006). Much of the modern statistical literature in this area focuses on consistency of variable selection in high-dimensional settings based on machine learning and data mining techniques (e.g., Fan and Li 2001; Zou and Hastie 2005; Huang et al. 2008; Fan and Lv 2008; Genovese et al. 2012). A major gap in this literature, however, has been the scarcity of formal hypothesis testing procedures that take variable selection into account; the oracle property enjoyed by many variable selection methods in the presence of high dimensionality cannot be applied directly for testing whether a post-model-selected variable is significant. In bioinformatics, for example, variable selection techniques based on penalization (such as the lasso, SCAD, etc.) are routinely used to produce lists of differentially-expressed genes that are most related to disease risk, but few methods for obtaining valid p-values have been developed.
A more traditional approach to the selection of significant predictors is multiple testing to control either the family-wise error rate (FWER) or the false-discovery rate (Benjamini and Hochberg 1995; Dudoit et al. 2003; Efron 2006; Dudoit and van der Laan 2008; Efron 2010). Procedures that control FWER (e.g., Bonferroni, or Holm's procedure) are often criticized as being too conservative (in the sense of having low power). False-discovery rate methods, on the other hand, although having greater power, incur the cost of an inflated FWER. Our aim in the present paper is to introduce a more powerful single test that can be used as an alternative screening procedure to detect the presence of some significant predictor while rigorously controlling FWER.
The proposed procedure uses marginal linear regression to select the predictor (from among covariates X1, . . . , Xp) that has maximal sample correlation with a scalar outcome Y (as in marginal screening or correlation learning, Genovese et al. 2012). The test is based on the estimated marginal regression coefficient of the selected predictor. If there is a unique predictor, say Xk0, maximally correlated with the outcome, then the selection procedure consistently estimates k0, and the estimated coefficient is asymptotically normal; if all predictors are uncorrelated with the outcome, then the selected predictor does not converge (in probability) and the estimated coefficient has a non-normal limiting distribution. In particular, the limiting distribution is discontinuous (at zero) as a function of the regression coefficient of Xk0 (where k0 is not identifiable), and this “non-regularity” causes non-uniform convergence.
Breiman (1992) drew early attention to the issue of invalid post-model-selection inference, calling it the “quiet scandal” of Statistics; even earlier references are mentioned in Berk et al. (2013). Samworth (2003) gave a detailed account of the inaccuracy of bootstrap methods applied to super-efficient estimators. Leeb and Pötscher (2006) (and other papers by the same authors) established that non-uniform limiting behavior of post-model-selected estimators is at the root of the problem, and that estimates of asymptotic null distributions in such settings can give a misleading picture of finite-sample performance. In particular, calibrating a test based on such a post-model-selected estimator in a way that does not adapt to the implicit model selection will be extremely inaccurate. This type of non-regularity occurs in various other settings as well, e.g., when a nuisance parameter is only defined under an alternative hypothesis (Davies, 1977), and when the parameter of interest under the null hypothesis is on the boundary of the parameter space (Andrews, 2000). McCloskey (2012) surveyed non-standard testing problems in econometrics, and introduced some Bonferroni-based size-correction methods designed to improve power. As far as we know, however, there is not yet a resolution of these issues for marginal screening.
In this paper we introduce an adaptive resampling test (ART) for marginal screening that adapts to the small-sample behavior of the estimated marginal regression coefficient of the selected predictor in terms of a local model. Under local alternatives, we find an explicit representation of the asymptotic distribution of this estimator and construct a suitable bootstrap estimator of this distribution that is consistent, thus circumventing the non-regularity mentioned above. Under non-local alternatives, we show that the critical values obtained in this way agree asymptotically with those used by the oracle (who is given knowledge of k0), so ART can be expected to provide good power as well.
Several new approaches to post-model selection inference for linear regression have been proposed in recent years. Meinshausen et al. (2009) introduced a random sample splitting procedure in the high-dimensional setting to obtain (conservative) Bonferroni-adjusted p-values following variable selection. Chatterjee and Lahiri (2011) developed a modified bootstrap method that provides an asymptotically valid confidence region for the regression parameters based on the lasso estimator; this method depends on the presence of at least one active predictor, so it is not applicable to marginal screening (under the null hypothesis there is no active predictor).
More relevant to marginal screening, the covariance test recently introduced by Lockhart et al. (2014) uses a forward stepwise lasso procedure to test for active predictors entering a sparse linear model under the assumption of normal errors. Also in the sparse linear model setting with normal errors, but further assuming that the predictors are nearly uncorrelated, Ingster et al. (2010) and Arias-Castro et al. (2011) have studied the detection boundary and optimality properties of general classes of multiple testing procedures (including Bonferroni and Higher Criticism). Berk et al. (2013) developed a valid method of post-model-selection inference that is feasible for up to about p = 20 predictors, also assuming normal errors. In various sparse high-dimensional settings, Belloni et al. (2014), Bühlmann (2013), Zhang and Zhang (2014) and Ning and Liu (2015) have established asymptotically valid confidence intervals for a preconceived regression parameter after variable selection on the remaining predictors, but this does not apply to marginal screening (where no regression parameter is singled out a priori).
This paper is organized as follows. We formulate the problem and discuss the issue of non-regularity in Section 2. In Section 3, we develop the ART procedure and establish the consistency of the underlying bootstrap. Simulation studies and applications to gene expression data and HIV drug resistance data are presented in Section 4. Concluding discussion appears in Section 5, and proofs are collected in the Appendix.
2 Marginal regression and non-regularity
Consider a scalar outcome Y and a p-dimensional vector of covariates X = (X1, . . . , Xp)T such that the marginal variance of each covariate is finite and non-zero. Marginal regression consists in using separate linear models to predict Y from each Xk. Let k0 be the label of a covariate that maximizes the absolute correlation with Y:

k0 ∈ arg max_{k=1,...,p} |Corr(Xk, Y)|,

and let α0 + θ0Xk0 be the best linear predictor based on Xk0, i.e.,

(α0, θ0) = arg min_{(a,t)∈ℝ²} E(Y − a − tXk0)².   (1)
We are interested in testing whether at least one of the covariates is correlated with Y, for which it suffices to check whether Xk0 and Y are correlated. This is equivalent to testing

H0: θ0 = 0 versus H1: θ0 ≠ 0.
Given an iid sample of size n, let α̂n, θ̂n, and k̂n be the least squares estimates of α0, θ0, and k0, respectively:

k̂n ∈ arg max_{k=1,...,p} |Ĉorr(Xk, Y)|,   (α̂n, θ̂n) = arg min_{(a,t)∈ℝ²} ℙn(Y − a − tXk̂n)²,

where ℙn is the empirical distribution, and the hats indicate sample versions. It is natural to base the test on θ̂n, but calibration is problematic because the distribution of √n(θ̂n − θ0) does not converge uniformly with respect to θ0, as mentioned in the Introduction. The non-uniformity occurs in the neighborhood of θ0 = 0. Specifically, there exists a bounded continuous function h such that fn(θ0) ≡ Eh(√n(θ̂n − θ0)) does not converge uniformly in any neighborhood of θ0 = 0, despite converging pointwise. To see this, first note that under mild conditions, when θ0 = 0,

√n θ̂n ⇝ U ≡ ZK/VK,   K ∈ arg max_{k=1,...,p} |Zk|/√Vk,

where Vk = Var(Xk), and (Z1, . . . , Zp)T is a mean-zero normal random vector with covariance matrix depending on parameters of the full linear model (this is a special case of Theorem 1 below). From the form of the distribution of U, we can choose h so that f∞(θ0) ≡ Eh(U) is discontinuous at θ0 = 0 (this is the non-regularity mentioned in the Introduction). If fn were to converge uniformly to f∞ on some compact neighborhood of zero, we would have a contradiction because each fn is continuous, and the uniform limit of a sequence of continuous functions on a compact interval is continuous.
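To make the non-regularity concrete, the following minimal sketch simulates the sampling distribution of √n θ̂n under the global null; all parameter values and helper names here are illustrative choices of ours, not part of the formal development:

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrt_n_theta_hat(n, p, theta1=0.0, rho=0.5):
    """One draw of sqrt(n) * theta_hat from marginal screening.

    X has an exchangeable correlation structure (shared-factor
    construction); theta1 is the slope on X1, zero under the null.
    """
    f = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * f + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    Y = theta1 * X[:, 0] + rng.standard_normal(n)
    # select the covariate with maximal absolute sample correlation
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    corr = Xc.T @ Yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(Yc))
    k = int(np.argmax(np.abs(corr)))
    theta_hat = np.cov(X[:, k], Y)[0, 1] / np.var(X[:, k], ddof=1)
    return np.sqrt(n) * theta_hat

# Under theta0 = 0 the draws are far from normal (bimodal), because the
# selection step always picks the largest of many mean-zero correlations.
null_draws = [sqrt_n_theta_hat(100, 50) for _ in range(2000)]
print(np.percentile(null_draws, [2.5, 50, 97.5]))
```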
To address this problem, in the next section we develop a formal test procedure (ART) inspired by work of Cheng (2008, 2015) concerning robust confidence intervals for non-linear regression parameters in the presence of weak-identifiability. Other variations of this approach have been used by Laber and Murphy (2011) to construct a confidence interval for the classification error, by Laber et al. (2014) in a sequential decision making problem, and by Laber and Murphy (2013) to provide robust confidence intervals for adaptive lasso. As already noted, the distribution of √n(θ̂n − θ0) does not converge uniformly in any neighborhood of θ0 = 0, so its small sample behavior can be very far from normal when the true parameter is close to zero. Therefore an understanding of the asymptotic behavior of θ̂n under local alternatives plays a crucial role in devising a suitable test, or more generally in providing robust confidence intervals for θ0.
3 Adaptive resampling test
In this section, we develop the proposed ART procedure for detecting the presence of a significant predictor. The idea is to adapt to the inherent non-regular behavior of the post-model-selected estimator in a way that accurately captures its asymptotic behavior in n−1/2-neighborhoods of the null hypothesis.
We frame the problem in terms of the general local linear model

Y = α0 + XTβn + ε,   (2)

where α0 ∈ ℝ, β0 ∈ ℝp, the noise ε has mean 0, finite variance, and is uncorrelated with X, and βn = β0 + n−1/2b0, where b0 ∈ ℝp is the local parameter. The distributions of ε and X are assumed to be fixed, so only the distribution of Y depends on n (although we suppress n in the notation for Y). The relevant hypotheses are now

H0: θn = 0 versus H1: θn ≠ 0,

where θn = Cov(Xkn, Y)/Var(Xkn) and kn is the label of a component of X that maximizes absolute correlation with Y.
Our first result gives the asymptotic distribution of √n(θ̂n − θn). To state the result, we need the notation

k̄(β) ∈ arg max_{k=1,...,p} |Corr(Xk, XTβ)|

for any β ∈ ℝp with β ≠ 0. Note that kn = k̄(βn) under the local model. If k0 ≡ k̄(β0) is unique (so β0 ≠ 0), then kn → k0, and θn is asymptotically bounded away from zero (a non-local alternative). On the other hand, if β0 = 0 and k̄(b0) is unique, then kn = k̄(b0); also θn is of order n−1/2 and represents a local alternative. Finally, if β0 = b0 = 0, then kn is not well-defined and the null hypothesis θn = 0 holds. We need the uniqueness of the most active predictor k0 (away from the null hypothesis), but this seems to be a very mild condition because the likelihood that two or more predictors have exactly the same maximal correlation with Y seems remote in practice. Moreover, as we will see in the simulation study, non-uniqueness of the maximally correlated predictor does not adversely affect power.
Theorem 1
Suppose that k0 = k̄(β0) is unique when β0 ≠ 0, and k̄(b0) is unique when β0 = 0 and b0 ≠ 0. Then, under the local model (2),
√n(θ̂n − θn) ⇝ (ZK(β0) + 1β0=0 Cov(XK, XTb0))/VK − 1β0=0, b0≠0 Cov(Xk̄(b0), XTb0)/Vk̄(b0),

where K = k0 when β0 ≠ 0, K ∈ arg max_k |Zk(0) + Cov(Xk, XTb0)|/√Vk when β0 = 0, and (Z1(β), . . . , Zp(β))T is a mean-zero normal random vector with covariance matrix Σ(β) given by that of the random vector with components

(Xk − EXk)(ε + (X − EX)Tβ − θk(β)(Xk − EXk)),   θk(β) ≡ Cov(Xk, XTβ)/Vk,

for k = 1, . . . , p, and Σ(β0) is assumed to exist.
The non-regularity at β0 = 0 is explained by the dependence of the limiting distribution on the (non-identifiable) local parameter b0. The limiting distribution is nevertheless continuous as a function of b0 into the space of distribution functions (this is a simple consequence of Lemma 3 in the Appendix), and the convergence is uniform over compact subsets of ℝp, unlike the limiting behavior discussed in the previous section, so finite-sample accuracy should be less of an issue when designing a screening test using this result. On the other hand, naive resampling methods that do not take into account the local asymptotic behavior will fail to provide consistent estimates of the distribution of √n(θ̂n − θn), as discussed in the Introduction for the non-local case.
To get around this problem, we decompose √n(θ̂n − θn) in a way that isolates the possibility that β0 ≠ 0 by comparing |Tn| to some threshold λn (to be specified later), where Tn ≡ θ̂n/sn is the post-model-selected t-statistic and sn is the standard error of the slope estimator when regressing Y on Xk̂n. Specifically,
| (3) |
where Gn ≡ √n(ℙn − Pn) is the empirical process, and Pn is the distribution of (X, Y). It is clear that the nonparametric bootstrap is consistent for the first term in (3) if λn/√n → 0 and λn → ∞, since it is easily shown that P(|Tn| > λn) → 1β0≠0. The second term is more problematic though because k̂n does not converge in probability to k0 when β0 = 0. Denote the term in the square brackets by Wn(b), indexed by b ∈ ℝp. Note that √n(θ̂n − θn) = Wn(b0) when this term is active (under β0 = 0), and kn = K̄(b0), where
and
so
| (4) |
All parts of Wn(b) are now seen to be smooth functions of Gn and Pn, so it is reasonable to expect that a consistent bootstrap can be constructed by replacing Gn by its nonparametric bootstrap version G*n, and replacing Pn by ℙn. In such a construction, the event indicated in the second term of (3) is naturally replaced by the event that |T*n| ≤ λn and |Tn| ≤ λn.
Here and throughout the paper, a superscript * is used to indicate the nonparametric bootstrap (sometimes called “bootstrapping in pairs” in regression settings, to distinguish it from the residual bootstrap). The above arguments lead to our main result showing that √n(θ̂n − θn) can indeed be consistently bootstrapped under the general local model. The precise definition of the bootstrapped process is given at the start of the proof.
Theorem 2
Suppose all assumptions in Theorem 1 hold, and the tuning parameter λn satisfies λn/√n → 0 and λn → ∞ as n → ∞. Then, under the local model (2), the bootstrapped process defined in (9) converges to the limiting distribution of √n(θ̂n − θn) conditionally (on the data) in probability.
ART procedure
ART provides a bootstrap calibration for the test statistic √n θ̂n based on a special case of the above theorem. Under H0 we have the simplification θn = 0 and b0 = 0. For some nominal level γ, let cl and cu be the lower and upper γ/2 quantiles, respectively, of the corresponding bootstrapped process in Theorem 2 (with b0 = 0).
If √n θ̂n falls outside the interval [cl, cu], then we reject H0 and conclude that there is at least one significant predictor.
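The following sketch illustrates this calibration in the b0 = 0 case, combining the thresholding of Tn with the two bootstrap regimes in (3). The function name and implementation details are ours, and the non-regular regime mimics the null limit by selecting over recentered marginal covariances; it is a simplification under the stated assumptions, not the authors' reference code:

```python
import numpy as np

def art_pvalue(X, Y, lam, B=1000, seed=1):
    """Sketch of the ART calibration with b0 = 0 (two-sided p-value)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    covs = np.cov(X.T, Y)                      # (p+1) x (p+1)
    cxy, vx = covs[:p, p], np.diag(covs)[:p]
    k = int(np.argmax(np.abs(cxy / np.sqrt(vx))))
    theta = cxy[k] / vx[k]
    # t-statistic of the selected slope (textbook OLS standard error)
    resid = (Y - Y.mean()) - theta * (X[:, k] - X[:, k].mean())
    se = np.sqrt(resid.var(ddof=2) / (n * vx[k]))
    Tn = theta / se

    stats = np.empty(B)
    for b in range(B):
        i = rng.integers(n, size=n)            # bootstrap in pairs
        covb = np.cov(X[i].T, Y[i])
        cxyb, vxb = covb[:p, p], np.diag(covb)[:p]
        if abs(Tn) > lam:
            # regular regime: centered percentile bootstrap
            kb = int(np.argmax(np.abs(cxyb / np.sqrt(vxb))))
            stats[b] = np.sqrt(n) * (cxyb[kb] / vxb[kb] - theta)
        else:
            # non-regular regime: mimic the b0 = 0 limit by selecting
            # over recentered bootstrap covariances
            Zb = np.sqrt(n) * (cxyb - cxy)
            kb = int(np.argmax(np.abs(Zb) / np.sqrt(vxb)))
            stats[b] = Zb[kb] / vxb[kb]
    obs = np.sqrt(n) * theta
    return min(1.0, 2 * min((stats >= obs).mean(), (stats <= obs).mean()))
```

With λn = 0 this reduces to the centered percentile bootstrap, and for very large λn the CPB regime is never used; these two extremes correspond to the behavior seen in the simulations of Section 4.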
Before applying ART, it is advisable to standardize all the variables Xk and Y (by sample mean and standard deviation), which has the advantage of making the procedure scale invariant (θ̂n is then the maximal sample correlation); our results naturally extend, but we develop the theory only for the unstandardized variables to keep the presentation simple.
Robust confidence intervals
The above theorem also allows the construction of a robust confidence interval for θn by treating b0 as unknown, and then finding the widest bootstrap quantiles over all b0. Here by “robust” we mean asymptotically valid uniformly over b0. For testing purposes, however, this approach would be too conservative and also computationally intensive (a grid search over b0 ∈ ℝp would be needed); for this reason, in ART we set b0 = 0 under the null, so the critical values can be readily computed from the corresponding bootstrap distribution. In contrast, Laber and Murphy (2013) propose using almost sure bounds over their local parameter to find robust confidence intervals for adaptive lasso; this involves less computation than distributional bounds, but is still computationally intensive, and it produces more conservative confidence intervals than the distributional approach.
Choice of the tuning parameter λn
The above theorem requires that λn/√n → 0 and λn → ∞ as n → ∞. Under this condition, the thresholding provides a consistent pre-test (for θn = 0) with asymptotically negligible type I error rate: limn→∞ P(|Tn| > λn | θn = 0) = 0. On the other hand, if λn increases too quickly, the pre-test will be conservative. One simple choice would be to set λn = √(a log n), for some constant a > 0, but it is also desirable that λn increase with p; see Section 5 for discussion about the null limiting behavior of Tn as both p and n → ∞. To that end, note that by Theorem 1, in the special case that ε and X are independent, under θn = 0 (or b0 = 0 and β0 = 0) we have |Tn| ⇝ maxk |Z̃k|, where Z̃k ≡ Zk(0)/(σ√Vk), σ² = Var(ε), and (Z̃1, . . . , Z̃p)T is a vector of standard normal random variables. Thus, for any fixed λ > 0,

limn→∞ P(|Tn| > λ | θn = 0) = P(maxk |Z̃k| > λ).

Hence the pre-test type I error rate can be asymptotically controlled below level γ, without sacrificing consistency, by choosing

λn = max{√(a log n), q1−γ},   (5)

where q1−γ denotes the (1 − γ)-quantile of maxk |Z̃k|.
In the simulation study below we describe a way of specifying the constant a via the double bootstrap, and this is used whenever we refer to ART in the sequel.
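Assuming standardized predictors (so that the correlation matrix of the Z̃k can be estimated by that of the Xk), the p-dependent part of (5) can be approximated by simulation. This sketch, with hypothetical names, is one way to do it:

```python
import numpy as np

def threshold_lambda(X, n, a=4.0, gamma=0.05, ndraws=5000, seed=2):
    """Sketch of the threshold in (5): the larger of sqrt(a log n) and
    the (1 - gamma)-quantile of max_k |Z~_k|, simulated from the
    estimated correlation matrix of the (standardized) predictors."""
    rng = np.random.default_rng(seed)
    R = np.corrcoef(X.T)                          # estimate of Corr(X)
    p = R.shape[0]
    L = np.linalg.cholesky(R + 1e-8 * np.eye(p))  # jitter for stability
    Z = rng.standard_normal((ndraws, p)) @ L.T    # rows ~ N(0, R)
    q = np.quantile(np.abs(Z).max(axis=1), 1 - gamma)
    return max(np.sqrt(a * np.log(n)), q)
```

In the simulations the constant a is instead calibrated by the double bootstrap, as described next.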
Forward stepwise ART
If we find a significant predictor using ART, it would be reasonable to continue applying the procedure in a forward stepwise fashion until no more significant predictors are detected. That is, in successive stages the residual is treated as a new outcome variable and marginal regression carried out on the remaining predictors. Although it would be challenging to extend our theoretical results to this procedure, we find that in real data applications it performs well, and in a similar way to the covariance test of Lockhart et al. (2014), as we discuss in the HIV drug resistance example considered in the next section.
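A sketch of this stepwise loop, reusing the art_pvalue and threshold_lambda sketches above (and, like them, only illustrative):

```python
import numpy as np

def forward_stepwise_art(X, Y, alpha=0.05, max_steps=20):
    """Sketch of forward stepwise ART: rerun the test on the residual
    after regressing Y on the predictors selected so far (reuses the
    art_pvalue and threshold_lambda sketches above)."""
    n, p = X.shape
    active = []
    resid = Y - Y.mean()
    for _ in range(max_steps):
        rest = [k for k in range(p) if k not in active]
        lam = threshold_lambda(X[:, rest], n)
        if art_pvalue(X[:, rest], resid, lam) > alpha:
            break  # no further significant predictor detected
        # enter the remaining predictor maximally correlated with resid
        corr = np.corrcoef(X[:, rest].T, resid)[-1, :-1]
        active.append(rest[int(np.argmax(np.abs(corr)))])
        # refit OLS on all selected predictors; update the residual
        Z = np.column_stack([np.ones(n), X[:, active]])
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        resid = Y - Z @ beta
    return active
```

As noted above, the p-values at later steps ignore the dependence induced by using residuals as outcomes, so this is a heuristic rather than a procedure covered by Theorem 2.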
4 Numerical studies
In this section, we study the performance of the proposed ART procedure using simulated data, and give illustrations of the approach in two real data examples.
4.1 Finite sample simulations
We compare the performance of ART with four procedures that are commonly used for detecting the presence of a significant predictor:
Likelihood ratio test (LRT)
This test is based on assuming a full linear model involving all of the covariates, and is applicable when n > p. Under the null hypothesis, all the regression coefficients are zero. The reduction in the residual sum of squares is compared to the residual sum of squares for the full model using an F-ratio [see, e.g. Section 7.4 of Johnson and Wichern (2007)]. When the full linear model holds, it can be seen that both null and alternative hypotheses are identical to those used in ART.
Multiple testing with Bonferroni correction
As in ART, marginal linear models are used to predict Y from each Xk. A t-test with Bonferroni correction is then carried out to detect whether each regression coefficient is non-zero. The intersection of the p null hypotheses coincides with the null used in ART.
Centered percentile bootstrap (CPB)
This procedure is similar to ART, except that the quantiles of the centered bootstrap statistic √n(θ̂*n − θ̂n) are used directly as critical values for the test statistic √n θ̂n, without any thresholding; see Efron and Tibshirani (1993).
Higher Criticism (HC)
This is a test originally proposed by John Tukey for determining the overall significance of a collection of independent p-values. We apply the statistic developed by Donoho and Jin (2004, 2015), which is expected to perform well if the predictors are nearly uncorrelated.
We consider three examples for the data generating model: i) Y = ε, ii) Y = X1/4 + ε, and iii) Y = XTβ + ε, where β1 = . . . = β5 = 0.15, β6 = . . . = β10 = −0.1, and βk = 0 for k = 11, . . . , p. In the first example, there is no active predictor, in the second there is a single active predictor, and in the third there are 10 active predictors and the maximally correlated predictor is not unique. The covariate vector X is distributed as p-dimensional normal with each component Xk ~ N(0, 1), an exchangeable correlation structure Corr(Xj, Xk) = ρ for j ≠ k, where ρ takes values 0, 0.5 and 0.8, and the noise ε ~ N(0, 1) is independent of X.
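For reference, the three data generating models can be simulated as follows; the shared-factor construction of the exchangeable correlation is our implementation choice:

```python
import numpy as np

def simulate(model, n, p, rho, rng):
    """One data set from models i)-iii) of this section."""
    f = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * f + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    beta = np.zeros(p)
    if model == 2:                      # ii) Y = X1/4 + eps
        beta[0] = 0.25
    elif model == 3:                    # iii) ten active predictors
        beta[:5], beta[5:10] = 0.15, -0.1
    return X, X @ beta + rng.standard_normal(n)

X, Y = simulate(3, 100, 50, 0.8, np.random.default_rng(3))
```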
We consider two sample sizes (n = 100 and 200), and five values of the dimension (p = 10, 50, 100, 150 and 200). A nominal 5% significance level is used throughout. The bootstrap sample size is taken as 1,000. To specify the threshold λn in ART, the double bootstrap is implemented by generating 1,000 bootstrap estimates θ̂*n, then choosing λn so that 5% of the ARTs (based on 1,000 nested bootstrap samples) with test statistic √n(θ̂*n − θ̂n) reject.
Empirical rejection rates based on 1,000 Monte Carlo replications are reported in Figures 1–3. For model i), the figures provide type I error rates, which should be compared with the 5% nominal rate; for models ii) and iii), the figures provide the power of each test. The ART procedure has good control of the type I error rate throughout (compared to all the other methods), while consistently maintaining relatively high power. Comparing the results of models ii) and iii), non-uniqueness of the maximally correlated predictor has no adverse effect on the power of ART.
Figure 1.
Empirical rejection rates based on 1,000 samples generated from models i), ii) and iii) as the dimension ranges from p = 10 to p = 200, for n = 100 (top row) and n = 200 (bottom row), and ρ = 0.8.
Figure 3.
Empirical rejection rates as in Figure 1 except for independent predictors.
Bonferroni is highly conservative when ρ = 0.5 and 0.8, see the left panels of Figures 1 and 2. The CPB method is highly anti-conservative, with empirical type I error rates exceeding 15% for both sample sizes (and thus out of range for most of the panels on the left). The LRT effectively controls the type I error rate at around the nominal 5% level when it is applicable, but it has very low power compared with all the other methods, except under model iii) in the “classical case” of small numbers of predictors that are not highly correlated, see the right panels of Figures 2 and 3. Higher Criticism fails to control type I error except when the predictors are independent (Figure 3), in which case it is slightly anti-conservative and has excellent power under model iii), but very poor under model ii). That is, HC performs well (under zero correlation) when there are multiple active predictors, but not in the sparse case of only one active predictor. Except in the case of independent predictors, when Bonferroni is slightly better, ART outperforms all the competing procedures when both type I error and power are taken into account, and the improvement increases with the correlation between predictors.
Figure 2.
Empirical rejection rates as in Figure 1 except with lower correlation between predictors: ρ = 0.5.
4.2 Asymptotic power
In this section, we carry out a simulation study to assess the asymptotic power of ART compared with that of the Bonferroni procedure. The computational expense of implementing ART is high because of the double bootstrap, so our full simulation study of the previous section is only feasible for small sample sizes. Nevertheless, we are able to assess asymptotic power by making use of our results on the local model in Section 3.
Consider the local model Y = (n−1/2b0)X1 + ε, where b0 ≥ 0 is a scalar. Here X and ε are generated in the same way as in Section 4.1, but now we only consider ρ = 0.5. The local parameter thus takes the special form b0 = (b0, 0, . . . , 0)T, and we allow b0 to vary over a grid in [0, 5], in increments of 0.5. We set β0 = 0 and b0 = (b0, 0, . . . , 0)T, and make use of the given covariance structure of X and the explicit form of the limiting distribution in Theorem 1 to generate draws from the asymptotic distribution of √n θ̂n. Specifically, we carry out the following steps:
1. For each value of b0 on the grid, take 5,000 draws from the limiting distribution of √n(θ̂n − θn) given in Theorem 1 (this distribution only depends on b0 and the given distribution of (X, Y)), then add b0 to obtain draws from the limiting distribution of √n θ̂n. Based on these draws, we can obtain the (approximate) rejection rate of the test statistic √n θ̂n for any given rejection region (a one-line calculation; see the sketch following these steps). In particular, the asymptotic rejection rate of ART (for any given b0 on the grid) can be calculated by referring to the rejection rate corresponding to the particular critical values cl and cu generated by ART.
2. To assess the asymptotic power of ART at each given b0, we generate 10 independent large samples (with n = 5,000) from the local model, find cl and cu for each sample, and display in a boxplot the corresponding asymptotic rejection rates (using the results of step 1).
3. For comparison, we also plot the asymptotic power of the Bonferroni procedure, which is approximated using 1,000 samples each of size n = 5,000.
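The rejection-rate computation in step 1 is elementary once the draws are in hand; for concreteness (hypothetical helper name):

```python
import numpy as np

def rejection_rate(draws, cl, cu):
    """Fraction of draws from the limiting distribution of
    sqrt(n) * theta_hat that fall outside the ART interval [cl, cu]."""
    draws = np.asarray(draws)
    return float(np.mean((draws < cl) | (draws > cu)))
```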
The results are presented in Figure 4 for p = 10 and 50. The main source of variation within each boxplot is due to randomness over the 10 independent samples drawn from the local model, rather than bootstrap randomness (in view of bootstrap consistency and the large sample size n = 5,000). The median of each boxplot provides a suitable reference point to compare with the asymptotic power of Bonferroni (indicated by the circle). Note that ART provides accurate control of asymptotic type I error, and, as expected, Bonferroni is slightly conservative. In terms of median power, ART always outperforms Bonferroni, and can provide an additional 25% power (e.g., at b0 = 3 for p = 10, and at b0 = 3.5 for p = 50).
Figure 4.
Asymptotic type I error and power of ART (box plots) compared with Bonferroni (circles) as a function of the local parameter b0, for p = 10 and 50, ρ = 0.5, calculated using steps 1–3 in Section 4.2.
The cost of implementing the double bootstrap part of ART makes it prohibitive to extend the results in Figure 4 to larger p, but if we fix λn, then it becomes practical to run the simulations for p = 1000. Figure 5 shows how the asymptotic power of ART compares with Bonferroni as the constant a used to specify λn takes values 2, 4, 5 and 8 (the corresponding λn are 4.3, 6.1, 6.8 and 8.6). Note that as a increases (going from one panel to the next), ART becomes more stable and provides more accurate type I error control, but the overall power decreases. At small values of a, ART behaves like the CPB, which is anti-conservative (as we have already seen in the previous section), whereas at larger values the influence of CPB is diluted. For the CPB (which corresponds to setting λn = 0), the plot (not shown) appears very similar to that for a = 2; also, for a > 8 the plots appear very similar to a = 8. The best choice of a, therefore, is a trade-off between type I error control and power; comparing with Figure 4, ART with double bootstrapping appears to achieve a satisfactory balance in this regard. Also note that, even at the largest value a = 8, ART can provide an additional 20% power over Bonferroni, and thus outperform Bonferroni by a considerable margin in high-dimensional settings as well, at least when there is a high degree of correlation among the components of X.
Figure 5.
Asymptotic type I error and power of ART compared with Bonferroni for p = 1,000 and ρ = 0.5, where ART is implemented using a fixed threshold λn specified by a = 2, 4, 5, 8, and each box plot is based on 20 independent replications with n = 10,000.
4.3 Gene expression example
We consider gene expression profiles from the tumors of n = 156 patients diagnosed with a common type of adult brain cancer (glioblastoma), collected as part of the Cancer Genome Atlas pilot project (TCGA, 2008). Our analysis is based on log gene expression levels X at p = 181 loci along chromosome 1. We are interested in detecting the presence of a gene that is significantly related to log-survival time Y.
We compare the results from applying the Bonferroni, CPB and ART procedures; LRT is not applicable since p > n. The three methods yield very different p-values. The smallest Bonferroni adjusted p-value is 40.8%, suggesting that no gene is significantly related to Y. The CPB and ART p-values are 3.2% and 17.2%, respectively, from 1000 bootstrap samples. Figure 6 shows how these p-values are calculated. Thus the CPB method suggests the presence of a significant genetic effect, whereas ART does not.
Figure 6.
Gene expression example. Left panel: histogram of the centered bootstrap statistic, showing that the two-sided CPB p-value is 3.2%. Right panel: histogram of the ART bootstrap statistic, showing that the two-sided ART p-value is 17.2%.
4.4 HIV drug resistance example
Our second example uses data from the HIV Drug Resistance Database (2014), an important public resource for understanding how HIV-1 mutation patterns cause resistance to antiretroviral drugs (Rhee et al., 2003). We will compare our results with those of Lockhart et al. (2014), who applied their covariance test to data on the susceptibility (a measure of drug resistance) of the nucleotide reverse transcriptase inhibitor lamivudine (3TC). We code susceptibility on a log scale (Y), and each predictor Xj indicates the presence/absence of a mutation at viral sequence position j. Excluding missing data and rare mutations resulted in data on p = 103 positions and a total of 1,266 isolates.
We randomly split the data 50 times into a training set of size n = 126 and a test set of size 1,140. For each split, we carry out 20 steps of forward stepwise ART and standard forward stepwise regression using the training data, and calculate the corresponding prediction error (including all previously selected variables) using the test data. The left panel of Figure 7 shows the training data p-values (mean ± SD) for the newly entered predictor at each step, over the 50 random splits, and the right panel shows the corresponding prediction errors (mean ± SD). Forward stepwise ART detects one very highly significant mutation, but no more, as confirmed by the test set error plot, and this result is roughly consistent with the findings of Lockhart et al. (2014). Standard forward stepwise regression picks out at least 10 mutations, but there is no improvement in test set error after the first predictor enters the model; moreover, its test error almost exactly coincides with that of forward stepwise ART.
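One random split of the kind used here can be sketched as follows; for simplicity we record the test error after each entering predictor via plain marginal screening on residuals, omitting the ART p-value stopping rule (helper names are ours):

```python
import numpy as np

def one_split(X, Y, n_train=126, steps=20, seed=4):
    """One random split: screen on the training half, track test MSE
    as each predictor enters (stopping rule omitted for brevity)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    tr, te = idx[:n_train], idx[n_train:]
    active, errors = [], []
    resid = Y[tr] - Y[tr].mean()
    for _ in range(steps):
        rest = [k for k in range(X.shape[1]) if k not in active]
        corr = np.corrcoef(X[tr][:, rest].T, resid)[-1, :-1]
        active.append(rest[int(np.argmax(np.abs(corr)))])
        Ztr = np.column_stack([np.ones(len(tr)), X[tr][:, active]])
        Zte = np.column_stack([np.ones(len(te)), X[te][:, active]])
        beta, *_ = np.linalg.lstsq(Ztr, Y[tr], rcond=None)
        resid = Y[tr] - Ztr @ beta
        errors.append(float(np.mean((Y[te] - Zte @ beta) ** 2)))
    return active, errors
```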
Figure 7.
HIV drug resistance example. Left panel: training set p-values (mean ± SD) over 50 random splits of the data for forward stepwise ART (solid line), standard forward stepwise regression (dash-dot line) and the 0.05 alpha level (dotted). Right panel: test set error for the corresponding models (including all previously selected variables); the two lines are almost indistinguishable.
5 Discussion
In this paper we have developed an adaptive resampling test (ART) for detecting the existence of a significant predictor, Xk0, from among predictors X1, . . . , Xp. The procedure is designed to adjust to the non-regular limiting behavior of the estimated marginal regression coefficient of the selected predictor. This is done by using a thresholded version of the bootstrap that adapts to the non-regularity: if there is at least one significant predictor, it reduces to a centered percentile bootstrap, otherwise it mimics the local (non-uniform) asymptotic behavior of . We have shown that in simulation studies, ART performs favorably compared with standard methods such as Bonferroni, but also compared with more sophisticated methods such as Higher Criticism. The advantage of ART may stem from it being designed to take into account correlations between predictors, while also avoiding distributional assumptions (the nonparametric bootstrap steps in ART are essentially distribution free). We have restricted attention to linear models, but our approach has much wider applicability (e.g., generalized linear models, quantile regression, and censored time-to-event outcomes), and these will be studied in future papers.
Although our simulation results suggest that ART is useful and remarkably stable in “large p, small n” settings, the asymptotic theory that we have used to calibrate ART relies on assuming a fixed p, with n tending to infinity. In view of the conservative nature of the Bonferroni procedure in high-dimensional settings, there is a pressing need for more powerful tests in this area. In future work it would be of interest to develop the asymptotic theory of ART for the case of p growing with n, although this would be very challenging. As far as we know, formal testing procedures that provably control FWER and adjust to non-regularity under diverging p are not yet available, except for Higher Criticism in the case that the predictors are nearly uncorrelated, as established by Ingster et al. (2010) and Arias-Castro et al. (2011). In the only other instance we know of, under the strong assumption that X1, . . . , Xp, Y are iid N(0, 1), results of Cai and Jiang (2012) can be used to find the weak limit of the maximal sample correlation (suitably centered and scaled in terms of log p and n) and thus devise an asymptotically correct calibration when p = pn → ∞ at a sub-exponential rate, log(p)/n → 0; in the super-exponential case, log(p)/n → ∞, a similar weak limit holds under a different normalization.
Another interesting direction for future work would be to study the forward stepwise version of ART discussed in Section 3. Modifications to ART when applied stepwise in this way would be needed to adjust for the implicit dependence among the new outcomes. By repeating such a procedure until no more significant predictors are detected, the aim would be to correctly identify all active predictors.
Acknowledgments
Research supported by NIH Grant R01GM095722-01 and NSF Grant DMS-1307838.
Appendix: Proofs
Proof of Theorem 1
For k = 1, . . . , p, let . Then and . It is easy to verify that ,
| (6) |
where Pn is the distribution of (Y, X), and the mean residual squared error
| (7) |
The result then follows immediately from the following two lemmas. The first lemma verifies the oracle property for marginal regression under the assumption that there is at least one active predictor; the proof is included for completeness. The second lemma gives the (non-regular) asymptotic behavior of √n(θ̂n − θn) when there are no active predictors.
Lemma 1
If all conditions in Theorem 1 hold and β0 ≠ 0, then k̂n → k0 a.s. and √n(θ̂n − θn) ⇝ Zk0(β0)/Vk0, where Zk0 is defined in Theorem 1.
Proof
Denote R̂ ≡ (R̂1, . . . ,R̂p)T. When β0 ≠ 0, Var(XT β0) > 0. By the SLLN
Since R̂k → Corr2(Xk, XTβ0) a.s. and Corr2(Xk, XTβ0) is maximized at k = k0, it follows immediately that k̂n → k0 a.s.
Next, denote X̂ = Xk̂n and Xn = Xkn. Since and Y = α0 + XTβn + ε, we have
where the second equality uses and kn → k0 as n → ∞, and the third equality follows from the LLN and Cov(ε, Xk0) = 0. Similarly, . The proof is completed using Slutsky's lemma and the CLT.
Lemma 2
If all conditions in Theorem 1 hold and β0 = 0, then √n(θ̂n − θn) converges in distribution to the β0 = 0 limit given in Theorem 1.
Proof
Since (Z1(0), . . . , Zp(0))T is a normal random vector and |Corr(Xj, Xk)| < 1 for j ≠ k, it is easy to see that
| (8) |
So K is unique a.s.
Denote . Note that when β0 = 0, . By the CLT and Slutsky's lemma, we see from (6) that
From (7), we have
where ☉ denotes the elementwise (Hadamard) product, so, by the continuous mapping theorem and Slutsky's lemma,
Define h(t) = (1{arg maxk tk = 1}, . . . , 1{arg maxk tk = p})T, where t = (t1, . . . , tp)T ∈ ℝp. Note that h is continuous at t if arg maxk tk is unique. Thus, using (8), the result follows by applying the continuous mapping theorem to the above display.
Lemma 3
Let Z be a p-dimensional random vector and f: ℝp × ℝp → ℝp a function such that f(z, ·) is continuous for every z ∈ ℝp, and f(Z, b)j ≠ f(Z, b)k a.s. for all j ≠ k and b ∈ ℝp. Then K(b) ≡ arg maxk=1,...,p f(Z, b)k is unique a.s. Also, if bl → b0, then K(bl) = K(b0) for l sufficiently large, a.s.
The proof is omitted. An immediate consequence of this lemma is the continuity of the limiting distribution in Theorem 1 as a function of b0; this is seen by setting f(Z, b)k = |Zk(0) + Cov(Xk, XTb)|/√Vk for k = 1, . . . , p, and using (8).
Proof of Theorem 2
The notation θ̂*n and k̂*n means that θ̂n and k̂n are based on n iid observations taken from ℙn. The bootstrapped process in the statement of the theorem is defined by re-expressing (4), along with K̄(b) and Wn(b), in terms of Pn and Gn operating on functions of (X, Y), then replacing Pn by ℙn and Gn by G*n throughout. Since ε is not observed, we also replace it by the estimated residual, resulting in
| (9) |
where G*n ≡ √n(ℙ*n − ℙn) is the bootstrapped empirical process. As is conventional in empirical process theory, G*n, Gn and Pn are assumed to operate only on functions that are defined on (X, Y), explaining why the indicator terms can be separated in the above display.
Let EM denote expectation conditional on the data, and let PM be the corresponding probability measure. We will show that and conditionally (on the data) in probability. This together with Lemmas 4 and 5 below implies the result.
For k = 1, . . . , p, the bootstrapped marginal regression coefficient satisfies
| (10) |
When β0 = 0, by Lemma 2 and the condition that λn → ∞ as n → ∞, we have in probability. When β0 ≠ 0, it is easy to verify that , which is positive under the condition that k0 is unique. Thus
tends to zero in probability when β0 ≠ 0, where the convergence follows from Lemma 1, Lemma 4 (below) and the condition that λn/√n → 0. Hence
tends to zero in probability. This implies that and conditionally in probability. Since 1|Tn|≤λn converges to 1β0=0 in probability, the result follows from Slutsky's lemma.
Lemma 4
If the conditions in Theorem 1 hold and β0 ≠ 0, then k̂*n → k0 conditionally (on the data) a.s., and √n(θ̂*n − θ̂n) converges in distribution to the limit in Lemma 1 conditionally (on the data) in probability.
Proof
It follows from (10), the SLLN and Slutsky's lemma that, when β0 ≠ 0,
and a.s. for k = 1, . . . , p. Denote the bootstrap mean squared error
where and . Then we can write
since the denominator plays no role. By Slutsky's lemma
a.s. for k = 1, . . . , p, so we obtain
where the convergence follows from the condition that k0 is unique when β0 ≠ 0.
Recall that , where X̂ ≡ Xk̂n. Note that . By the definition of , we have
| (11) |
The last term in (11) is opM (1) a.s. because the first and last terms within the square bracket cancel asymptotically, similarly for the second and third terms, due to and k̂n → k0 a.s. We next show that the first term in (11) converges in distribution to Zk0 (β0) conditionally (on the data) in probability. By Lemma 1, it is easy to verify that and . Denote . Then the first term can be decomposed as
| (12) |
The first term in (12) is oPM (1) a.s. since . The second term in (12) can be written as
which is oPM (1) in probability by bootstrap consistency of the sample mean [see, e.g., Theorem 23.4 of van der Vaart (1998)], and the fact that X̂ = Xk0 for n sufficiently large a.s. Bootstrap consistency of the sample mean also gives that the third term in (12) converges in distribution to Zk0(β0) conditionally (on the data) in probability.
Similarly, the second and third terms in (11) can be shown to be oPM(1) in probability. The result then follows from Slutsky's lemma.
Lemma 5
If all conditions in Theorem 1 hold and β0 = 0, then the bootstrapped process in (9) converges to the same limiting distribution as √n(θ̂n − θn), conditionally (on the data) in probability.
Proof
Define , and M′(b) to be p-vectors with kth components given by ,
respectively. Let be a p × p matrix with the (j, k)-th component given by
Also, let D(b) and D′(b) be p-vectors of zeros, apart from a 1 in the entry that maximizes M(b) and M′(b), respectively. Then
Similarly, define , and (without indexing by n) to be processes of the same form as , and , except with replaced by Zk(0), and the sample variances/covariances replaced by their population versions.
Referring to the notation in (4), it is clear that when β0 = 0,
Moreover, the second equality in the above display also holds for the bootstrap version. Writing the bootstrapped version of Wn(b0) in (9) as
and using arguments similar to those in the proof of Lemma 4 for handling (12), we have conditionally (on the data) in probability. As a result, conditionally (on the data) in probability, where is the sample version of D′(b), and and are the bootstrap versions of and , respectively. Finally, using similar arguments to those at the end of the proof of Lemma 2, along with the continuous mapping theorem, we conclude that
conditionally (on the data) in probability.
References
- Andrews D. Inconsistency of the Bootstrap when a Parameter is on the Boundary of the Parameter Space. Econometrica. 2000;68(2):399–405.
- Arias-Castro E, Candès EJ, Plan Y. Global Testing Under Sparse Alternatives: ANOVA, Multiple Comparisons and the Higher Criticism. Annals of Statistics. 2011;39:2533–2556.
- Belloni A, Chernozhukov V, Hansen C. Inference on Treatment Effects After Selection Amongst High-Dimensional Controls. Review of Economic Studies. 2014;81(2):608–650.
- Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Ser. B. 1995;57:289–300.
- Berk R, Brown LD, Buja A, Zhang K, Zhao L. Valid Post-Selection Inference. Annals of Statistics. 2013;41:802–837.
- Breiman L. The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error. Journal of the American Statistical Association. 1992;87:738–754.
- Bühlmann P. Statistical Significance in High-dimensional Linear Models. Bernoulli. 2013;19:1212–1242.
- Cai TT, Jiang T. Phase Transition in Limiting Distributions of Coherence of High-dimensional Random Matrices. Journal of Multivariate Analysis. 2012;107:24–39.
- Cancer Genome Atlas Research Network. Comprehensive Genomic Characterization Defines Human Glioblastoma Genes and Core Pathways. Nature. 2008;455:1061–1068.
- Chatterjee A, Lahiri SN. Bootstrapping Lasso Estimators. Journal of the American Statistical Association. 2011;106(494):608–625.
- Cheng X. Robust Confidence Intervals in Nonlinear Regression under Weak Identification. Department of Economics, University of Pennsylvania; 2008. Unpublished manuscript. Version posted in 2015: http://www.sas.upenn.edu/xucheng/papers/Cheng mixed id 19.pdf.
- Cheng X. Robust Inference in Nonlinear Models with Mixed Identification Strength. Journal of Econometrics. 2015, to appear.
- Davies RB. Hypothesis Testing when a Nuisance Parameter Is Present Only under the Alternative. Biometrika. 1977;64(2):247–254.
- Donoho D, Jin J. Higher Criticism for Detecting Sparse Heterogeneous Mixtures. Annals of Statistics. 2004;32(3):962–994.
- Donoho D, Jin J. Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects. Statistical Science. 2015;30(1):1–25.
- Dudoit S, Shaffer JP, Boldrick JC. Multiple Hypothesis Testing in Microarray Experiments. Statistical Science. 2003;18:71–103.
- Dudoit S, van der Laan MJ. Multiple Testing Procedures with Applications to Genomics. Springer; New York: 2008.
- Efron B. Large-scale Simultaneous Hypothesis Testing: the Choice of a Null Hypothesis. Journal of the American Statistical Association. 2006;99:96–104.
- Efron B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press; 2010.
- Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability; 1993.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. Statistical Challenges with High Dimensionality: Feature Selection in Knowledge Discovery. In: Sanz-Sole M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. III. European Mathematical Society; Zurich: 2006. pp. 595–622.
- Fan J, Lv J. Sure Independence Screening for Ultra-high Dimensional Feature Space (with discussion). Journal of the Royal Statistical Society, Ser. B. 2008;70:849–911.
- Genovese C, Jin J, Wasserman L, Yao Z. A Comparison of the Lasso and Marginal Regression. Journal of Machine Learning Research. 2012;13:2107–2143.
- HIV Drug Resistance Database. Genotype-Phenotype Datasets, Stanford University. 2014. http://hivdb.stanford.edu/pages/genopheno.dataset.html.
- Huang J, Ma S, Zhang C-H. Adaptive Lasso for High-dimensional Regression Models. Statistica Sinica. 2008;18:1603–1618.
- Ingster YI, Tsybakov AB, Verzelen N. Detection Boundary in Sparse Regression. Electronic Journal of Statistics. 2010;4:1476–1526.
- Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 6th Edition. Prentice Hall; New Jersey: 2007.
- Laber E, Murphy SA. Adaptive Confidence Intervals for the Test Error in Classification (with discussion). Journal of the American Statistical Association. 2011;106(495):904–913.
- Laber E, Murphy SA. Adaptive Inference after Model Selection. 2013. Under review.
- Laber E, Lizotte D, Qian M, Murphy SA. Dynamic Treatment Regimes: Technical Challenges and Applications. Electronic Journal of Statistics. 2014;8:1225–1272.
- Leeb H, Pötscher BM. Can One Estimate the Conditional Distribution of Post-model-selection Estimators? Annals of Statistics. 2006;34(5):2554–2591.
- Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A Significance Test for the Lasso. Annals of Statistics. 2014;42(2):413–468.
- McCloskey A. Bonferroni-based Size-correction for Nonstandard Testing Problems. 2012. Working Paper, http://www.econ.brown.edu/fac/adam mccloskey/Research files/McCloskey BBCV.pdf.
- Meinshausen N, Meier L, Bühlmann P. P-values for High-dimensional Regression. Journal of the American Statistical Association. 2009;104:1671–1681.
- Ning Y, Liu H. A General Theory of Hypothesis Tests and Confidence Regions for Sparse High Dimensional Models. 2015. http://arxiv.org/abs/1412.8765.
- Rhee S-Y, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW. Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database. Nucleic Acids Research. 2003;31(1):298–303.
- Samworth R. A Note on Methods of Restoring Consistency to the Bootstrap. Biometrika. 2003;90:985–990.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.
- Zhang C-H, Zhang S. Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models. Journal of the Royal Statistical Society, Ser. B. 2014;76:217–242.
- Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Ser. B. 2005;67(2):301–320.