Summary
Resampling-based multiple testing methods that control the Familywise Error Rate in the strong sense are presented. It is shown that no assumptions whatsoever on the data-generating process are required to obtain a reasonably powerful and flexible class of multiple testing procedures. Improvements are obtained with mild assumptions. The methods are applicable to gene expression data in particular, but more generally to any multivariate, multiple group data that may be character or numeric. The role of the disputed “subset pivotality” condition is clarified.
Keywords: Bootstrap, Exchangeability, Permutation, Resampling, Subset pivotality
1 Introduction
With the recent “-omics” revolution, there is great interest in high-dimensional multiple testing, where the number of variables far exceeds the sample size. Gene expression is a prototype application, but the applications are much broader. “Resampling” is a general term that encompasses bootstrap, permutation, and parametric simulation-based analyses; “resampling-based multiple testing” refers to the use of such methods in multiple testing applications. Resampling methods have become popular for “-omics” because they (a) require fewer assumptions (e.g. normality) about the data-generating process, thereby yielding procedures that are more robust, (b) utilize data-based distributional characteristics, e.g. discreteness and correlation structure, to make tests more powerful, and (c) scale up reasonably well to high-dimensional settings, particularly with modern computing.
Much has been made of the “subset pivotality condition” coined by Westfall and Young (1993) for resampling-based multiple tests. It has been portrayed in the literature as too stringent; for example, Romano, Shaikh, and Wolf (2008) state
“the … condition of subset pivotality … assumed in … Westfall and Young (1993) … is quite restrictive.”
Our purposes are (a) to clarify the role of subset pivotality, (b) to show that it is hardly restrictive, and (c) to clarify what is actually needed for validity of multiple testing. We will show that resampling-based multiple testing procedures can be valid, powerful, and control the familywise error rate (FWE) in the strong sense (Hochberg and Tamhane, 1987, define strong control of the FWE), with no assumptions whatsoever on the data-generating process, yet where subset pivotality holds nevertheless. Slightly more power is available if one is willing to make one simple assumption about the data-generating process, an assumption distinct from, yet which is often confused with, subset pivotality.
Section 2 clarifies the role of the subset pivotality condition in multiple testing. Section 3 dispenses with assumptions altogether, notes that subset pivotality holds, and shows that valid, powerful, flexible FWE-controlling procedures are available. Section 4 shows how tests from Section 3 can be made more powerful, using a simple assumption that is not directly related to subset pivotality. Concluding remarks are given in Section 5.
2 The Role of Subset Pivotality in Closed Testing
Before describing the subset pivotality condition and its purpose, it is necessary to discuss the issues of choice of test statistic and computations. The reason is that the subset pivotality condition is only needed to simplify computations for resampling-based closed testing procedures.
2.1 Resampling-based multiple testing and closure
The closure principle of Marcus, Peritz, and Gabriel (1976) provides a unifying theory for hypothesis testing to control the FWE in the strong sense. (Hereafter “FWE control” is always assumed to mean “in the strong sense.”) Denoting the null hypotheses Hi, i = 1, …, m, a closed testing procedure (CTP) requires that all intersection hypotheses be tested. Define intersection hypotheses HI = ∩i∈I Hi, for I ⊆ S := {1, …, m}. Let ℋ = {HI | I ⊆ S} denote the set of distinct intersection hypotheses. Then a CTP is one in which H ∈ ℋ is rejected iff H′ is rejected for every H′ ∈ ℋ such that H′ ⊆ H. When the tests of the HI are α-level (not multiplicity adjusted) tests, the CTP controls the FWE at level α.
For the purposes of this paper, a “resampling-based multiple testing procedure” is defined as one where each intersection HI is tested using a resampling-based test. Applications of resampling not using closure exist, but the subset pivotality condition is best understood in the context of closure.
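To make the closure principle concrete, the following minimal sketch enumerates all nonempty intersections and applies the closure rule for small m. The function p_intersection and the marginal p-values are hypothetical illustrations, not part of the paper; the Bonferroni intersection test shown happens to recover Holm's (1979) procedure.

```python
from itertools import combinations

# A brute-force closed testing procedure for small m. The function
# p_intersection(I) is assumed to be supplied by the user and to return a
# valid (unadjusted) p-value for the intersection hypothesis H_I.
def closed_test(m, p_intersection, alpha=0.05):
    hypotheses = range(m)
    # p-values for all 2^m - 1 nonempty intersections
    p = {I: p_intersection(I)
         for k in range(1, m + 1)
         for I in combinations(hypotheses, k)}
    # Closure: H_i is rejected iff every H_I with i in I is rejected
    return [all(p[I] <= alpha for I in p if i in I) for i in hypotheses]

# Toy illustration with a Bonferroni intersection test built from
# hypothetical marginal p-values (this recovers Holm's procedure):
marginal_p = [0.001, 0.020, 0.400]
bonferroni = lambda I: min(1.0, len(I) * min(marginal_p[i] for i in I))
print(closed_test(3, bonferroni))  # [True, True, False] at alpha = 0.05
```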
2.2 Choice of test statistic
Closed testing is very flexible in that any test statistic may be used to test the intersections HI. There are many choices, including F- and related tests, Fisher combination tests, O'Brien-type tests, Simes-type tests, and weighted variants of all these tests. The choice of test statistic should primarily be based on power considerations. Once a powerful test statistic is chosen, resampling can be used to ensure that the test is robust to violations of distributional and/or dependence assumptions. An example appears in Dmitrienko, Offen and Westfall (2003), who show how to bootstrap the Simes test parametrically in a closed testing framework to accommodate correlation structure.
2.3 Computational issues: the MaxT test and subset pivotality condition
While power is the main concern in choosing a test statistic, expediency becomes important when m is large. There are O(2^m) intersection hypotheses HI, and if m is large, it is computationally impossible to test every single HI. However, the computational burden can be eased dramatically if one is willing to
A: test each hypothesis HI using a “Max” statistic maxi∈I Ti, possibly sacrificing power, and
B: assume a model that implies “subset pivotality”, which states that the distributions of maxi∈I Ti | HI and maxi∈I Ti | H{1,…,m} are identical, for all I ⊆ {1, …, m}.
If A and B are adopted, one need only test m hypotheses corresponding to the ordered ti rather than all 2^m intersections; further, resampling can be done simultaneously under a global null H{1,…,m}, rather than separately for each intersection. Note that “Max” subsumes “Min”, where the test statistic is −Ti; the Min P test is commonly used (Westfall and Young, 1993).
To illustrate, suppose the observed test statistics are t1 ≥ … ≥ tm, corresponding to hypotheses H1, …, Hm (ordered in this way without loss of generality), and that larger ti suggest alternative hypotheses. Suppose a p-value for testing HI using the statistic maxi∈I Ti is available:
pI = P(maxi∈I Ti ≥ maxi∈I ti | HI),
and HI is rejected at unadjusted level α if pI ≤ α. Applying closure, A, and B, we have the following algorithm for rejecting H1, H2, … in sequence, using what we call the “Main Algorithm.”
Main Algorithm:
1. By closure, reject H1 if P(maxi∈I Ti ≥ maxi∈I ti | HI) ≤ α for all I ⊇ {1}.
But if I ⊇ {1}, then maxi∈I ti = t1, hence the rule is: reject H1 if P(maxi∈I Ti ≥ t1 | HI) ≤ α for all I ⊇ {1}.
Using subset pivotality (B), the rule becomes: reject H1 if P(maxi∈I Ti ≥ t1 | H{1,…,m}) ≤ α for all I ⊇ {1}.
Use of the “Max” statistic (A) implies P(maxi∈I Ti ≥ t1 | H{1,…,m}) ≤ P(maxi∈{1,…,m} Ti ≥ t1 | H{1,…,m}) for all I ⊇ {1}.
Hence, by subset pivotality and by use of the “Max” statistic, the rule by which we reject H1 simplifies to this: reject H1 if
P(maxi∈{1,…,m} Ti ≥ t1 | H{1,…,m}) ≤ α.
2. Again by closure and subset pivotality, reject H2 if P(maxi∈I Ti ≥ maxi∈I ti | H{1,…,m}) ≤ α for all I ⊇ {2}.
If I ⊇ {1}, then maxi∈I ti = t1; else maxi∈I ti = t2. Partitioning the set {I : I ⊇ {2}} into the two sets {I : I ⊇ {1, 2}} and {I : I ⊇ {2}, 1 ∉ I},
we require
P(maxi∈I Ti ≥ t1 | H{1,…,m}) ≤ α for all I ⊇ {1, 2},
and
P(maxi∈I Ti ≥ t2 | H{1,…,m}) ≤ α for all I ⊇ {2} with 1 ∉ I.
Since we are using the “Max” statistic, these conditions are equivalent to the following rejection rule: reject H2 if
P(maxi∈{1,…,m} Ti ≥ t1 | H{1,…,m}) ≤ α
and
P(maxi∈{2,…,m} Ti ≥ t2 | H{1,…,m}) ≤ α.
j. Continuing in this fashion, the rule at step j is: reject Hj if
P(maxi∈{1,…,m} Ti ≥ t1 | H{1,…,m}) ≤ α
and P(maxi∈{2,…,m} Ti ≥ t2 | H{1,…,m}) ≤ α
and … and P(maxi∈{j,…,m} Ti ≥ tj | H{1,…,m}) ≤ α.
One need not use resampling at all to apply the method. It is only called a “resampling-based” procedure if one uses resampling to obtain the probabilities P(maxi∈{j,…,m} Ti ≥ tj | H{1,…,m}).
At step j the rule is equivalently stated in terms of p-values for the composite hypotheses as
p{i,…,m} = P(maxk∈{i,…,m} Tk ≥ ti | H{1,…,m}) ≤ α for all i ≤ j;
hence the rule reduces to: reject Hj if p̃j ≤ α, where
p̃j = maxi≤j p{i,…,m}
is called the “adjusted p-value” (Westfall and Young, 1993).
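A minimal sketch of this step-down computation follows, assuming a matrix of statistics resampled under the global null is already available; how that matrix is generated (bootstrap, permutation, or parametric simulation) depends on the application, and the names here are illustrative, not the paper's.

```python
import numpy as np

# Step-down "MaxT" adjusted p-values via the Main Algorithm. The matrix
# T_null (B x m) of statistics resampled under the global null H_{1,...,m}
# is assumed to be available.
def maxT_adjusted_pvalues(t_obs, T_null):
    m = len(t_obs)
    order = np.argsort(t_obs)[::-1]          # indices with t_(1) >= ... >= t_(m)
    t_sorted, T_sorted = t_obs[order], T_null[:, order]
    p_adj = np.empty(m)
    running_max = 0.0
    for j in range(m):
        # p_{j,...,m} = P(max_{i in {j,...,m}} T_i >= t_(j) | H_{1,...,m})
        p_j = np.mean(T_sorted[:, j:].max(axis=1) >= t_sorted[j])
        running_max = max(running_max, p_j)  # adjusted p = max_{i<=j} p_{i,...,m}
        p_adj[j] = running_max
    out = np.empty(m)
    out[order] = p_adj                       # restore original variable order
    return out
```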
In addition to subset pivotality and use of “Max” statistics, the additional assumption that there are no “logical constraints” among hypotheses (see Westfall and Tobias, 2007) is needed to assert that the algorithm is identical to closed testing using Max statistics. If there are logical constraints, power can be improved by restricting attention only to admissible subsets I, but in this case the computational shortcuts disappear, and one is back in the position of having to evaluate the tests for all intersections. The algorithm above can still be used, though, despite not being closed, as it provides a conservative procedure relative to the full closure.
3 Dispensing with Assumptions Altogether
While the subset pivotality condition is easily satisfied in many cases, including the general multivariate regression model with location-shift multivariate (possibly nonnormal) distributions (Westfall and Young, 1993, p. 123), researchers have questioned the assumption. In this section, we show how one can dispense with assumptions altogether, yet subset pivotality remains valid, and a powerful and flexible class of multiple testing procedures is obtained.
Let us clarify. By “dispensing with assumptions altogether,” we specifically mean “dispensing with all assumptions about the data-generating mechanism.” We will not assume the distributions lie in the location-shift family, or in any other family. We won't even assume that the data necessarily arise from a random process. This framework is similar to the simple falsification approach to non-multiplicity-corrected hypothesis testing. For example, to test H0 : μ = 0 using a t-test, the i.i.d. N(0, σ²) assumptions together constitute the null hypothesis. No assumption is needed otherwise if one only wants to control the Type I error rate. Rejection of H0 in this setting does not necessarily imply μ ≠ 0; rather, it implies rejection of independence, identical distributions, normality, or μ = 0 (or, by the central limit theorem, in large samples it approximately implies rejection of independence, identical distributions, or μ = 0). The conclusion μ ≠ 0 follows only if one is willing to accept all the other assumptions. The “other assumptions” are what we avoid altogether: everything will be embedded within the null hypotheses. We do this by using permutation tests, which allow for elegant theory, but a similar theory might be developed using other tests.
This approach of embedding all assumptions into the null hypothesis may not be ideal, from a research standpoint, because rejection of the null hypothesis may give the researcher little information as to why the hypothesis was rejected. Nevertheless, a major contribution of this paper is the development of this point of view in the multiple testing arena. We show, despite the very minimal setup, that a reasonably powerful and flexible class of multiple testing procedures is obtained.
We adopt the following framework for the data structure, hypotheses, and test statistics. The first element of the framework is that we have multivariate G-sample data. Such data abound in biometrical research, from adverse events data in clinical trials, to animal carcinogenicity data, to gene expression data. The second element is that the hypotheses of interest are that the treatments have no effect on the data, i.e., that the data are exchangeable, and the third concerns the form of the test statistics.
In fairness, the elements of the “framework” are restrictive; for example, if the researcher is only interested in location shift effects, then he or she is not interested in permutation tests. However, as stated above, the elements of the “framework” are somewhat different from the probabilistic assumptions that are usually made about data-generating processes. Specific elements of the framework are as follows:
- The data are multivariate m-dimensional data vectors from G groups, y11, …, y1n1, …, yG1, …, yGnG. Each ygj vector is comprised of m elements, which may be character, numeric, or mixed; in particular, missing values are allowed. The data need not be generated by any random mechanism. For I ⊆ {1, …, m}, let yIgj denote the subvector of ygj whose components are the elements of I.
- The hypotheses of interest are
Hi : the variable-i data Y{i}11, …, Y{i}1n1, …, Y{i}G1, …, Y{i}GnG are exchangeable.
Implicit in the hypothesis is an assumption of randomness; the hypothesis is equivalent to the statement that the observed data values y{i}11, …, y{i}1n1, …, y{i}G1, …, y{i}GnG are realizations from an exchangeable random process (Pesarin, 2001, p. 5).
- Hi is tested using a real-valued test statistic that is a function only of the data in variable i:
Ti = Ti(Y{i}11, …, Y{i}GnG).
Without loss of generality, larger values of Ti suggest non-exchangeability.
With these elements, closed testing can be performed to control the FWE, entirely free of probabilistic assumptions about the data-generating process, other than what is embedded in the null hypotheses. Permutation testing is used. Let n = ∑g ng, and suppose the data values Y{i} = (Y{i}11, …, Y{i}GnG)
are observed to be y{i} = (y{i}11, …, y{i}GnG); the permutation test conditions on the observed collection of values, irrespective of specific group labels (g = 1, …, G) and replicate labels (j = 1, …, ng). Specifically, letting ℬ(y1, …, yn) denote the permutation orbit of y = (y1, …, yn),
ℬ(y1, …, yn) = {(yπ(1), …, yπ(n)) : π is a permutation of (1, …, n)},
the conditioning set is Y{i} ∈ ℬ(y{i}).
Given Hi and Y{i} ∈ ℬ(y{i}), all n! permutations of the data vector are equally likely. Let Ti* denote the statistic computed from a randomly selected permutation among the n!, and define cα,i by
cα,i = min{t : P(Ti* ≥ t | Y{i} ∈ ℬ(y{i})) ≤ α}
if such a t exists, and cα,i = ∞ otherwise. The rejection rule is:
reject Hi if ti ≥ cα,i.
Equivalently, since the permutation-based p-value is pi = P(Ti* ≥ ti | Y{i} ∈ ℬ(y{i})), the rejection rule is: reject Hi if pi ≤ α.
Conditional on Y{i} ∈ ℬ(y{i}), the rejection rule has Type I error rate ≤ α, with strict inequality likely due to the discreteness of the distribution of Ti*. Since the Type I error rate is ≤ α conditional on Y{i} ∈ ℬ(y{i}), it is also ≤ α unconditionally.
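The marginal permutation test is easy to approximate by Monte Carlo; the following is a minimal sketch for G = 2 groups. The statistic mean_diff and the (count + 1)/(B + 1) convention are illustrative choices on our part, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo version of the marginal permutation test for one variable.
# y_i holds the n values of variable i (first n1 from group 1; G = 2 here
# for simplicity), and stat is a user-chosen statistic T_i, with larger
# values suggesting non-exchangeability.
def perm_pvalue_marginal(y_i, n1, stat, B=10000):
    t_obs = stat(y_i, n1)
    count = 0
    for _ in range(B):
        y_perm = rng.permutation(y_i)        # uniform draw from the orbit B(y)
        count += stat(y_perm, n1) >= t_obs
    # (count + 1) / (B + 1) is a common convention that counts the identity
    # permutation and keeps the Monte Carlo test valid
    return (count + 1) / (B + 1)

# Illustrative statistic: difference in group means
mean_diff = lambda y, n1: y[:n1].mean() - y[n1:].mean()
```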
To test all hypotheses using closure entails testing subsets HI = ∩i∈I Hi. Note that HI implies marginal exchangeability of Y{i}, i ∈ I, but not of Y^I. A clarifying example is as follows: let Y11, …, Y1n be i.i.d. bivariate normal with mean vector zero and identity covariance matrix, and let Y21, …, Y2n be i.i.d. bivariate normal with mean vector zero and covariance matrix having unit variances and off-diagonal correlation ρ ≠ 0.
The component distributions are exchangeable but the joint distribution is not, and this is an example of a probability model in the set H{1,2}.
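The counterexample can be checked by simulation; in this sketch (n and ρ are illustrative values of ours) each variable is marginally standard normal in both groups, while the within-vector correlation differs between groups.

```python
import numpy as np

# Simulation of the counterexample: both variables are N(0,1) in both
# groups, so each variable is marginally exchangeable across the combined
# sample, but the within-vector correlation differs between groups, so the
# joint two-group sample is not exchangeable.
rng = np.random.default_rng(1)
n, rho = 100000, 0.8
group1 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=n)
group2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
print(np.corrcoef(group1.T)[0, 1])  # approximately 0.0
print(np.corrcoef(group2.T)[0, 1])  # approximately 0.8
```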
Since the marginal models are specified but the joint models are not, the p-values P(maxi∈I Ti ≥ maxi∈I ti | HI) needed for testing the intersections are not easily determined. However, a Bonferroni-like inequality can be used: let the critical value for testing HI at level α be
cα,I = min{c : ∑i∈I P(Ti* ≥ c | Y{i} ∈ ℬ(y{i})) ≤ α}
if such a c exists; define cα,I = ∞ otherwise. Then the rule “reject HI if maxi∈I ti ≥ cα,I” provides an α-level test for HI, since
P(maxi∈I Ti ≥ cα,I | HI) ≤ ∑i∈I P(Ti ≥ cα,I | Hi) ≤ α.
The p-value for this test is
p̃I = ∑i∈I P(Ti* ≥ maxj∈I tj | Y{i} ∈ ℬ(y{i}));
note that p̃I ≤ α if and only if maxi∈I ti ≥ cα,I.
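Computing p̃I is straightforward once the marginal permutation tail probabilities are available; a sketch follows, with the helper tail_prob assumed to be supplied by marginal permutation calculations such as those above.

```python
# Discrete Bonferroni p-value for the intersection hypothesis H_I.
# tail_prob(i, c) is assumed to return the marginal permutation tail
# probability P(T_i* >= c | Y{i} in B(y{i})).
def discrete_bonferroni_pvalue(I, t_obs, tail_prob):
    c = max(t_obs[i] for i in I)             # the observed max_{j in I} t_j
    # Bonferroni sum of marginal tail probabilities at the common cutoff;
    # discreteness typically makes many terms zero or very small
    return min(1.0, sum(tail_prob(i, c) for i in I))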
FWE-controlling tests for the Hi follow from closure and the “Main Algorithm” of Section 2.3, with identical shortcuts resulting from use of “Max” tests and the resulting monotonicity of p-values. As in Section 2.3, suppose the observed test statistics are t1 ≥ ⋯ ≥ tm, corresponding to hypotheses H1, …, Hm (again ordered in this way without loss of generality). The main algorithm becomes, in this case,
- Reject H1 if p̃{1,…,m} ≤ α.
- Reject H2 if p̃{1,…,m} ≤ α
and if p̃{2,…,m} ≤ α.
j. Continuing in this fashion, the rule is: reject Hj if
p̃{1,…,m} ≤ α
and p̃{2,…,m} ≤ α
and … and p̃{j,…,m} ≤ α.
As before, the adjusted p-values are maxi≤j p̃{i,…,m}; a sketch of this step-down computation follows.
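```python
import numpy as np

# Step-down discrete Bonferroni adjusted p-values, combining the Main
# Algorithm with the intersection p-values above. tail_prob(i, c) is again
# assumed to return P(T_i* >= c | Y{i} in B(y{i})).
def discrete_bonferroni_adjusted(t_obs, tail_prob):
    m = len(t_obs)
    order = np.argsort(t_obs)[::-1]          # t_(1) >= ... >= t_(m)
    p_adj = np.empty(m)
    running_max = 0.0
    for j, idx in enumerate(order):
        # p~_{j,...,m} = min(1, sum over the variables ranked j,...,m of
        # their marginal tail probabilities at the cutoff t_(j))
        p_j = min(1.0, sum(tail_prob(i, t_obs[idx]) for i in order[j:]))
        running_max = max(running_max, p_j)  # enforce step-down monotonicity
        p_adj[idx] = running_max
    return p_adj
```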
This method is the “discrete Bonferroni method” described by Westfall and Wolfinger (1997), which has been hard-coded in PROC MULTTEST of SAS/STAT since 1996. What is unique about the above presentation is the generalization to arbitrary test statistics (Westfall and Wolfinger's results arise when −ti is a marginal p-value, either exact or approximate). Use of p-values results in balance, where no particular hypotheses are favored. However, the more general framework allows deliberate weighting to favor certain hypotheses a priori. For example, if the supports of the permutation distributions of the Ti are completely disjoint, the algorithm reduces to the “a priori ordered” testing procedure described in, e.g., Maurer et al. (1995) and Kropf et al. (2004).
What is also unique about the above presentation is the assertion that absolutely no assumptions are needed concerning the data-generating process, not even independence, and certainly not subset pivotality. Curiously, even though subset pivotality is considered “restrictive,” subset pivotality holds in this general case where no assumptions are made: the joint distribution of the test statistics {Ti : i ∈ I} is the same under both HI and HS, following specifically from element 3 of the framework described above in this section. Thus, subset pivotality is hardly restrictive, since no assumptions, other than what is embedded in the null hypotheses, are made. The only problem is that HI is not easily characterized, so that p-values based on the joint distributions cannot be easily calculated without further assumptions. Hence the assumption that is questionable is not subset pivotality, but the assumption needed to calculate p-values under HI. That assumption will be given in Section 4. For now, we use the conservative Bonferroni-based approximation to the p-values based on the joint distributions, as shown above.
Example Consider the adverse event data set provided by Westfall et al. (1999, p. 243). There are G = 2 groups, control and treatment, with ng = 80 patients in each group, and m = 28 adverse event indicator variables per patient. Null hypotheses are that the adverse event indicator data are exchangeable in the combined treatment/control sample, tested using Fisher exact upper-tailed p-values, with smaller p-values indicating more adverse events in the treatment group. Unadjusted and adjusted p-values for the five most significant adverse events (AEs) are shown in Table 1.
Especially noteworthy is the adjusted p-value for the most extreme adverse event: since 0.0025 ≈ 3 × 0.0008, the effective number of tests is 3, not 28. The discrete Bonferroni method is thus vastly superior to the ordinary Bonferroni method, for which the adjusted p-value would have been 28 × 0.0008 = 0.0224. Again, this benefit comes at no expense of extra assumptions, since no assumptions are made concerning the data generating process.
Westfall and Soper (2001) note that this method automatically adjusts for selection effects concerning the observed variables. Here, the observed adverse events are self-reported, thus there is concern about a possible selection effect concerning the particular 28 adverse events that were reported. This is not an issue though when one realizes that in the collection of possible adverse events that could be reported, the total reports are 0 for all but the 28 in this study. Specifically, suppose there are 100 possible reportable AEs, 72 of which produce no reports. The permutation distributions for those 72 events with no reports place 100% of their mass on the p-value 1.0, hence the discrete Bonferroni analysis of all 100 AEs is identical to the analysis of the 28 where reports are received. However, the ordinary Bonferroni correction would change from 28 × 0.0008 = 0.0224 to 100 × 0.0008 = 0.080.
4 Improving Power Using Joint Distributions
The method described in Section 3 does not utilize joint distribution information, and therefore can be improved. To utilize this information, a joint exchangeability assumption is needed, as noted by Westfall (2003) and Calian et al. (2008). Consider the same setup as in the previous section, with one assumption, rather than none, about the data-generating process. This is the assumption that people often confuse, erroneously, with subset pivotality:
Assumption C If for I, J ⊆ {1, …, m} the distribution of Y^I = (Y^I_11, …, Y^I_GnG) is exchangeable in its n components, and the distribution of Y^J is exchangeable in its n components, then the distribution of Y^(I∪J) is exchangeable in its n components.
In particular, the assumption implies that ∩i∈I Hi = HI : {the distribution of Y^I is exchangeable}. Like all assumptions, this one is questionable; a simple counter-example with two-group bivariate normal data is given in Section 3. However, several points can be made concerning the palatability of the assumption. First, the class of allowable models satisfying Assumption C is substantially more general than the multivariate location-shift class of models, which is very commonly used and which often additionally assumes normality. Second, it is perhaps not unrealistic to assume that if there is no treatment effect on each of variables 1 and 2 individually, then the joint distribution of (Y{1}, Y{2}) is exchangeable. Third, even if the assumption is not realistic, its failure might imply excess Type I errors; but if the treatment really does affect the response, then the researcher might take comfort in conclusions of statistical significance, while acknowledging that, due to assumption failure, the effect might be on correlation structure rather than on specific variables. Fourth, we reiterate that no assumption of independence is needed.
The benefit of Assumption C is that the exact p-value for HI can be calculated: the conditional p-value
pI = P(maxi∈I Ti* ≥ maxi∈I ti | Y^I ∈ ℬ(y^I))
is free of the HI-distribution of the data; pI is specifically equal to the proportion of the n! permutations yielding
maxi∈I Ti* ≥ maxi∈I ti.
Note also that pI ≤ p̃I, where p̃I is defined in the previous section; hence incorporating dependence information can provide greater power.
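For very small n, pI can be computed exactly by enumerating all n! permutations; a minimal sketch follows, permuting whole observation vectors, which preserves the dependence among variables. The function names and the marginal statistic stat are ours, used only for illustration.

```python
from itertools import permutations
import numpy as np

# Exact p_I by complete enumeration, feasible only for very small n.
# Whole rows (observation vectors) of the n x m matrix Y are permuted,
# which preserves the dependence among the variables; stat(column, n1) is
# a user-supplied marginal statistic as before.
def exact_joint_pvalue(Y, I, n1, stat):
    n = Y.shape[0]
    t_max = max(stat(Y[:, i], n1) for i in I)     # observed max_{i in I} t_i
    count = total = 0
    for perm in permutations(range(n)):           # all n! permutations
        Yp = Y[list(perm)]
        count += max(stat(Yp[:, i], n1) for i in I) >= t_max
        total += 1
    return count / total
```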
See Puri and Sen (1971) and Pesarin (2001) for further details on multivariate permutation tests.
As in Section 3, subset pivotality and use of the “Max” statistic imply that the main algorithm can be used directly. In this case it becomes
- Reject H1 if p{1,…,m} ≤ α.
- Reject H2 if p{1,…,m} ≤ α
and if p{2,…,m} ≤ α.
j. Continuing in this fashion, the rule is: reject Hj if
p{1,…,m} ≤ α
and p{2,…,m} ≤ α
and … and p{j,…,m} ≤ α.
As before, the adjusted p-values are p̃j = maxi≤j p{i,…,m}.
5 Concluding Remarks
Remark 1 Often, complete enumeration of the n! permutations is infeasible, so the p-values are instead approximated by randomly sampling permutations (a code sketch follows the enumerated steps):
1. Generate a resampled data set y*11, …, y*GnG, a without-replacement sample from the observed vectors {y11, …, y1n1, …, yG1, …, yGnG}.
2. Compute the statistics Ti* from the y*gj.
3. Check whether maxi∈I Ti* ≥ maxi∈I ti.
4. Repeat 1.–3. a large number B (preferably in the millions) of times. The exact permutation p-value pI is then approximated (within binomial simulation error) by the proportion of the B samples in which maxi∈I Ti* ≥ maxi∈I ti.
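A minimal sketch of steps 1–4, under the same assumed helper stat as in the enumeration sketch of Section 4; the names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo approximation of p_I following steps 1-4 of Remark 1.
def mc_joint_pvalue(Y, I, n1, stat, B=100000):
    n = Y.shape[0]
    t_max = max(stat(Y[:, i], n1) for i in I)
    count = 0
    for _ in range(B):
        Yp = Y[rng.permutation(n)]                 # step 1: permute whole vectors
        T_max = max(stat(Yp[:, i], n1) for i in I) # step 2: recompute statistics
        count += T_max >= t_max                    # step 3: check the event
    return count / B                               # step 4: proportion over B
```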
This method has been hard-coded in PROC MULTTEST of SAS/STAT since the inception of the PROC in 1992.
Consider the example in Section 3. The adjusted p-values using the joint distributions are shown in Table 2, all calculated using 5,000,000 draws from the multivariate permutation distribution, along with the discrete Bonferroni adjustments from Section 3 for comparison.
The specific benefit of making Assumption C is the ability to incorporate correlation information; this benefit is seen in the smaller adjusted p-values. However, the bulk of the benefit is already delivered by the discrete Bonferroni method, at the expense of no additional assumptions whatsoever; we reiterate that subset pivotality holds in that case, as described in Section 3.
Remark 2 The “joint distribution” method is exact, in the sense that all composite hypotheses in the closure are tested using exact, distribution-free permutation tests. The method also incorporates all correlation information between variables. It is perhaps surprising that exact, correlation-incorporating tests are possible, even when the multivariate dimension m is much greater than the sample size n.
Remark 3 Assumption C is equivalent to that assumed by Korn et al. (2004) for control of the False Discovery Proportion. Although the Korn et al. procedure is necessarily more computationally challenging when allowing for some false discoveries, it has a structure similar to the algorithm in Section 4, and it reduces to that algorithm when allowing no false discoveries.
To conclude, subset pivotality is an assumption made strictly for computational convenience. It need not be restrictive: as shown in Section 3, subset pivotality holds in the most minimal setup. The assumption that people seem to object to is not subset pivotality, but the assumption that marginal exchangeability implies joint exchangeability, Assumption C. However, we noted (a) that Assumption C might not be objectionable, and (b) that, as the analysis of the adverse events data shows, even Assumption C is not crucial. If it is objectionable, one can revert to the discrete Bonferroni method, which makes no assumptions. In cases with moderate dependence structure and highly discrete permutation distributions, the discrete Bonferroni method can provide a more important benefit than that obtained by incorporating joint distribution information, for which the potentially objectionable assumption is needed.
Table 1. Unadjusted and discrete Bonferroni adjusted p-values for the five most significant adverse events.
p-value | AE1 | AE8 | AE6 | AE5 | AE10
---|---|---|---|---|---
Unadjusted | 0.0008 | 0.0293 | 0.0601 | 0.2213 | 0.2484
Discrete Bonferroni adjusted | 0.0025 | 0.1587 | 0.3321 | 1.0000 | 1.0000
Table 2. Adjusted p-values using the joint permutation distribution (5,000,000 draws), with the discrete Bonferroni adjustments of Section 3 for comparison.
p-value | AE1 | AE8 | AE6 | AE5 | AE10
---|---|---|---|---|---
Unadjusted | 0.0008 | 0.0293 | 0.0601 | 0.2213 | 0.2484
Discrete Bonferroni adjusted | 0.0025 | 0.1587 | 0.3321 | 1.0000 | 1.0000
Joint distribution adjusted | 0.0021 | 0.1335 | 0.2605 | 0.6278 | 0.9275
Acknowledgements
This research was supported in part by the Intramural Research Program of the National Institutes of Health and the National Institute of Child Health and Human Development.
Conflict of Interests Statement
The authors have declared no conflict of interest.
References
- Calian V, Li D, Hsu JC. Partitioning to uncover conditions for permutation tests to control multiple testing error rates. Biometrical Journal. 2008;50:756–766. doi: 10.1002/bimj.200710471 (this issue).
- Dmitrienko A, Offen W, Westfall PH. Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine. 2003;22:2387–2400. doi: 10.1002/sim.1526.
- Hochberg Y, Tamhane AC. Multiple Comparison Procedures. Wiley; New York: 1987.
- Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 1979;6:65–70.
- Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries: application to high-dimensional genomic data. Journal of Statistical Planning and Inference. 2004;124:379–398.
- Kropf S, Läuter J, Eszlinger M, Krohn K, Paschke R. Nonparametric multiple test procedures with data-driven order of hypotheses and with weighted hypotheses. Journal of Statistical Planning and Inference. 2004;125:31–47.
- Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976;63:655–660.
- Maurer W, Hothorn LA, Lehmacher W. Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses. In: Vollmar J, editor. Biometrie in der chem.-pharm. Industrie. Vol. 6. Fischer-Verlag; Stuttgart: 1995. pp. 2–18.
- Pesarin F. Multivariate Permutation Tests: With Applications in Biostatistics. Wiley; Chichester: 2001.
- Puri ML, Sen PK. Nonparametric Methods in Multivariate Analysis. Wiley; New York: 1971.
- Romano JP, Shaikh AM, Wolf M. Formalized data snooping based on generalized error rates. Econometric Theory. 2008;24:404–447.
- Westfall PH. Comment on “Resampling-based Multiple Testing for Microarray Data Analysis,” by Y. Ge, S. Dudoit and T. P. Speed. Test. 2003;12:60–65.
- Westfall PH, Soper KA. Using priors to improve multiple animal carcinogenicity tests. Journal of the American Statistical Association. 2001;96:827–834.
- Westfall PH, Tobias RD. Multiple testing of general contrasts: truncated closure and the extended Shaffer-Royen method. Journal of the American Statistical Association. 2007;102:487–494.
- Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple Comparisons and Multiple Tests Using SAS®. SAS Institute Inc.; Cary, NC: 1999.
- Westfall PH, Wolfinger RD. Multiple tests with discrete distributions. The American Statistician. 1997;51:3–8.
- Westfall PH, Young SS. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley; New York: 1993.