Abstract
We take the perspective of a researcher planning a randomized trial of a new treatment, where it is suspected that certain subpopulations may benefit more than others. These subpopulations could be defined by a risk factor or biomarker measured at baseline. We focus on situations where the overall population is partitioned into two, predefined subpopulations. When the true average treatment effect for the overall population is positive, it logically follows that it must be positive for at least one subpopulation. Our goal is to construct multiple testing procedures that maximize power for simultaneously rejecting the overall population null hypothesis and at least one subpopulation null hypothesis. We show that uniformly most powerful tests exist for this problem, in the case where outcomes are normally distributed. We construct new multiple testing procedures that, to the best of our knowledge, are the first to have this property. These procedures have the advantage of not requiring any sacrifice for detecting a treatment effect in the overall population, compared to the uniformly most powerful test of the overall population null hypothesis.
Keywords: Familywise Type I error, Optimization, Personalized medicine
1. Introduction
Planning a randomized trial of an experimental treatment can be challenging when it is suspected that certain populations may benefit more than others. For example, a metastudy by Kirsch et al. (2008) of certain antidepressants suggests that there may be a clinically meaningful benefit only for those with severe depression at baseline.
We focus on the case of two predefined subpopulations that partition the overall study population. For a given population, the mean treatment effect is defined as the difference between the mean outcome were everyone assigned to the treatment and the mean outcome were everyone assigned to the control. We develop procedures to simultaneously test the null hypotheses of no positive mean treatment effect for subpopulation one (H01), for subpopulation two (H02), and for the overall study population (H0*). These hypotheses are defined formally in Section 3.3 below. For each of these null hypotheses, the alternative hypothesis is that there is a positive mean treatment effect for the corresponding population.
In some cases, there is a subpopulation for which a larger treatment benefit is conjectured. We call this the favored subpopulation, and refer to the other as the complementary subpopulation. Since it is usually not known with certainty before the trial that the treatment will benefit the favored subpopulation, planning a hypothesis test for it is important. Also, in trials where the overall population null hypothesis is rejected, it is of clinical importance to determine if the treatment benefits the complementary subpopulation, since there was more a priori uncertainty about the treatment effect for this group. Therefore, planning a hypothesis test for this subpopulation is also valuable. This motivates our interest in testing both subpopulation null hypotheses.
Our goal is to maximize power for simultaneously rejecting the overall population null hypothesis and at least one subpopulation null hypothesis. We give new multiple testing procedures that maximize this power, uniformly over all possible alternatives as defined in Section 3.6, for the case where outcomes are normally distributed. This optimality result may be of interest, since according to Romano et al. (2011), “there are very few results on optimality in the multiple testing literature.” Our procedures require no sacrifice in detecting treatment effects for the overall population; that is, their probability of rejecting H0* equals that of the uniformly most powerful test of H0*, for any data generating distribution.
Our new procedures have a property called consonance. According to Bittman et al. (2009), “A testing method is consonant when the rejection of an intersection hypothesis implies the rejection of at least one of its component hypotheses.” Consonance was introduced by Gabriel (1969), and subsequent work on consonant procedures includes Hommel (1986), Romano and Wolf (2005), Bittman et al. (2009), Brannath and Bretz (2010), and Romano et al. (2011). Consonance is desirable since whenever an intersection of null hypotheses is false, it follows logically that at least one of the corresponding individual null hypotheses must be false as well. A non-consonant procedure may reject an intersection of null hypotheses without rejecting any of the corresponding individual null hypotheses. In our context, a non-consonant procedure would sometimes make claims that logically imply the treatment is superior to control in at least one of the two subpopulations, without indicating which one. To the best of our knowledge, our procedures are the first multiple testing procedures for our problem that are consonant.
Multiple testing procedures for the family of null hypotheses H0*, H01, H02 can be constructed using the methods of, e.g., Holm (1979), Bergmann and Hommel (1988), Maurer et al. (1995), Song and Chi (2007), Rosenbaum (2008), and Alosh and Huque (2009). However, none of these procedures is uniformly most powerful for simultaneously rejecting the overall population null hypothesis and at least one subpopulation null hypothesis, as defined in Section 3.6. We compare the power of our uniformly most powerful procedures to this prior work, in a simulation study in Section 6.
2. Example: randomized trial of treatment for metastatic breast cancer
Before giving formal details, we illustrate our hypothesis testing problem in the context of a randomized trial of trastuzumab for treating women with metastasized breast cancer (Slamon et al., 2001). As described in the abstract of Slamon et al. (2001), two types of patients were enrolled in the trial: those who had previously received an anthracycline as adjuvant therapy and those who had not. We call these types of patients subpopulation one and subpopulation two, respectively. This distinction between subpopulations is important since the concomitant treatments received by each during the trial were different. All subjects were randomly assigned to trastuzumab or placebo.
The results of Slamon et al. (2001) are based on survival data obtained 31 months after the last participant was enrolled; the range of follow-up was 30 to 51 months. We consider the outcome of survival at 30 months, and test the null hypotheses H0*, H01, H02, that survival probability at 30 months is no greater under assignment to trastuzumab than under placebo, for the overall population, for subpopulation one, and for subpopulation two, respectively.
In Table 1, we give the percent surviving 30 months, for the overall population and for each subpopulation, by study arm, based on data in Figure 2 of Slamon et al. (2001). The values of the z-statistics for the overall population, subpopulation one, and subpopulation two, are Z* = 1.89, Z1 = 1.48, and Z2 = 1.14, respectively (rounded to two decimal places). The estimated correlation between Z* and Z1 is ρ1 = 0.79, and the estimated correlation between Z* and Z2 is ρ2 = 0.62. Formulas for these z-statistics and correlations are given in Section 3.5; in the above calculations, we substituted sample variances into these formulas.
Table 1.
Percent surviving 30 months, for the overall population and each subpopulation, by study arm. In each cell, the corresponding number of participants surviving 30 months divided by the number enrolled, is given in parentheses.
| | Overall Population | Subpopulation 1 | Subpopulation 2 |
|---|---|---|---|
| Treatment Arm | 41% (96/235) | 43% (62/143) | 37% (34/92) |
| Control Arm | 32% (76/234) | 35% (48/138) | 29% (28/96) |
We next introduce and demonstrate our new testing procedure. Let MUMP denote the following multiple testing procedure for {H01, H02, H0*}:
Define S to be subpopulation 1 if Z1 – (3/4)ρ1 ≥ Z2 – (3/4)ρ2, and subpopulation 2 otherwise. If Z* > Φ−1 (0.95), reject H0* and H0S.
The procedure MUMP strongly controls the familywise Type I error rate (defined in Section 3.4) at level 0.05. It is uniformly most powerful for simultaneously rejecting H0* and at least one subpopulation null hypothesis, as defined in Section 3.6.
We apply MUMP to the above data example. Since Z1 – (3/4)ρ1 = 0.89 and Z2 – (3/4)ρ2 = 0.67, the subpopulation S in MUMP equals 1. Since Z* = 1.89, which exceeds Φ−1(0.95), our method MUMP rejects H0* and H01. Existing multiple testing procedures based on the related work in the last paragraph of Section 1 (and which are defined in Section 6.1) either reject only H0* or reject no null hypothesis. This is a concrete example where MUMP can be advantageous. In other situations, different methods may do better. This example is for demonstration only, since the choice of analysis method must in general be prespecified.
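As a minimal illustration, the following Python sketch (ours, not part of the original paper) applies the MUMP decision rule to the z-statistics and estimated correlations reported above.

```python
from scipy.stats import norm

def m_ump(z1, z2, z_star, rho1, rho2, alpha=0.05):
    """Apply MUMP: if Z* exceeds the one-sided critical value, reject H0* together
    with the null hypothesis of the selected subpopulation S."""
    if z_star <= norm.ppf(1 - alpha):
        return set()                      # no null hypothesis is rejected
    s = 1 if z1 - 0.75 * rho1 >= z2 - 0.75 * rho2 else 2
    return {"H0*", "H0%d" % s}

# Values from the trastuzumab example above:
print(m_ump(z1=1.48, z2=1.14, z_star=1.89, rho1=0.79, rho2=0.62))  # {'H0*', 'H01'}
```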
3. Multiple testing problem
3.1. Randomized trial description
We consider two-armed randomized trials. We assume the overall population is partitioned into two subpopulations, defined in terms of pre-randomization variables. For each subpopulation s ∈ {1, 2}, let ps > 0 denote the fraction of the overall population in subpopulation s.
Let a = 1 denote the treatment arm and a = 0 denote the control arm. Let nsa denote the number of participants in subpopulation s ∈ {1, 2} who are assigned to study arm a ∈ {0, 1}. Denote the total sample size by n. We assume the fraction of trial participants in each subpopulation s, (ns0+ns1)/n, equals the corresponding population proportion ps. We also assume the proportion of participants from each subpopulation that are assigned to the treatment arm is 1/2; this can be approximately ensured if block randomization is used for each subpopulation.
3.2. Data collected on each participant
For each participant i, denote his/her subpopulation by Si ∈ {1, 2}, study arm assignment by Ai ∈ {0, 1}, and outcome by Yi. We assume for each participant i, conditioned on Si = s and Ai = a, that his/her outcome Yi is a random draw from an unknown distribution Qsa and that this draw is independent of the data of all other participants. Let μ(Qsa) denote the mean and σ2(Qsa) denote the variance of the outcome distribution Qsa for subpopulation s ∈ {1, 2} and study arm a ∈ {0, 1}. We represent (Q10, Q11, Q20, Q21) by Q. Let 𝒬 denote the class of all distributions Q that satisfy the moment condition in Section 3.8 below.
3.3. Hypotheses tested
Define the average (or mean) treatment effect for subpopulation one, subpopulation two, and the overall population, respectively, as Δ1 = μ(Q11) − μ(Q10), Δ2 = μ(Q21) − μ(Q20), and Δ* = p1Δ1 + p2Δ2.
The three null hypotheses to be tested, which correspond to no positive mean treatment effect in each subpopulation and in the overall population, are:
H0s: Δs ≤ 0, for each subpopulation s ∈ {1, 2},   (3.1)

H0*: Δ* ≤ 0.   (3.2)
We call these elementary null hypotheses. The corresponding alternative hypotheses are the complements of each of these null hypotheses. The key relationship among the above null hypotheses is
H01 ∩ H02 ⊆ H0*;   (3.3)

that is, if there is no positive mean treatment effect in either subpopulation, then there is none in the overall population.
The set of distinct intersections of the three elementary null hypotheses is {H01 ∩ H02, H01 ∩ H0*, H02 ∩ H0*}.
The intersection H01 ∩ H02 ∩ H0* is not in the above list since it equals H01 ∩ H02.
We focus on the null hypotheses {H01, H02, H0*} rather than the simpler set {H01, H02, H01 ∩ H02}. The reason is that the primary hypothesis in many randomized trials concerns the average effect of treatment versus control in the overall population, as represented, e.g., in H0*. If H0* is false, the clinical implication is that giving everyone in the overall population the treatment, rather than giving everyone the control, improves the average outcome. In contrast, if H01 ∩ H02 is false, the aforementioned clinical implication does not necessarily hold.
A multiple testing procedure M is defined as a deterministic map from the data generated in the randomized trial to a subset of {H01, H02, H0*} representing the null hypotheses that are rejected. In order that probabilities such as the familywise Type I error rate of M are well-defined, we assume each M satisfies a measurability condition defined in Section A of the Supplementary Material.
Consider the multiple testing procedure MSTD, defined to be the standard, one-sided z-test at level α for H0*, i.e., the test that pools all participants and rejects H0* if the standardized difference between the sample means in the treatment arm and control arm exceeds Φ−1(1 – α). It is uniformly most powerful for H0* when outcomes under treatment and control are normally distributed with known variances; this follows directly from Proposition 15.2 of van der Vaart (1998).
We say a multiple testing procedure M′ dominates a procedure M if, with probability 1, M′ rejects every null hypothesis in {H01, H02, H0*} that M rejects (and possibly additional null hypotheses).
3.4. Definition of strong control of asymptotic, familywise Type I error rate
We require our testing procedures to strongly control the familywise Type I error rate, also called the studywide Type I error rate, as defined by Hochberg and Tamhane (1987). Regulators such as the U.S. Food and Drug Administration and the European Medicines Agency generally require studywide Type I error control for confirmatory randomized trials (FDA and EMEA, 1998).
For a given multiple testing procedure M, class of distributions 𝒬, and sample size n, define the worst-case, familywise Type I error rate to be

supQ∈𝒬 PQ,n(M rejects at least one null hypothesis in {H01, H02, H0*} that is true under Q),   (3.4)
where PQ,n is the probability distribution resulting from outcome data being generated according to Q, at sample size n. A multiple testing procedure strongly controls the familywise Type I error rate at level α over 𝒬 if, for all sample sizes n, (3.4) is at most α. We say a multiple testing procedure strongly controls the asymptotic, familywise Type I error rate at level α over 𝒬 if the lim sup as n → ∞ of (3.4) is at most α. For concreteness, we focus throughout on α = 0.05.
3.5. Subpopulation-specific and overall population z-statistics
For subpopulation one, subpopulation two, and the overall population, respectively, define the following z-statistics:

Zs = (Ȳs1 − Ȳs0) / {σ²(Qs1)/ns1 + σ²(Qs0)/ns0}^(1/2), for each s ∈ {1, 2},   (3.5)

Z* = {p1(Ȳ11 − Ȳ10) + p2(Ȳ21 − Ȳ20)} / {Σs∈{1,2} ps²[σ²(Qs1)/ns1 + σ²(Qs0)/ns0]}^(1/2),   (3.6)

where for each s ∈ {1, 2} and a ∈ {0, 1}, we define Ȳsa to be the sample mean of the outcomes of the participants in subpopulation s assigned to study arm a. For each j ∈ {*, 1, 2}, it follows that Zj has variance 1. For each s ∈ {1, 2}, denote the correlation between Zs and Z* by ρs. The following properties are straightforward to verify:

Z* = ρ1Z1 + ρ2Z2  and  ρ1² + ρ2² = 1.   (3.7)
We assume throughout that the variances σ2(Qsa) are known. However, we also give a result for the case where variances are estimated, in Section 4.
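As a concrete check, the following Python sketch (ours) recomputes these quantities from the Table 1 counts, substituting sample variances as in Section 2. It follows the reconstruction of (3.5)–(3.7) given above, so its output may differ slightly from the values reported in Section 2, for example because the two arms are not perfectly balanced within each subpopulation.

```python
from math import sqrt

def z_and_se(p_t, n_t, p_c, n_c):
    """One-sided z-statistic and standard error for a difference in proportions,
    using sample variances p*(1 - p)."""
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return (p_t - p_c) / se, se

# Counts from Table 1 (survivors / enrolled), by subpopulation and by arm.
z1, se1 = z_and_se(62 / 143, 143, 48 / 138, 138)             # subpopulation one
z2, se2 = z_and_se(34 / 92, 92, 28 / 96, 96)                 # subpopulation two
z_star, se_star = z_and_se(96 / 235, 235, 76 / 234, 234)     # overall population (pooled)

# Correlations between Z* and each Z_s, with p_s taken to be the observed
# subpopulation fractions; rho1**2 + rho2**2 should be close to 1.
p1, p2 = 281 / 469, 188 / 469
rho1, rho2 = p1 * se1 / se_star, p2 * se2 / se_star

print([round(v, 2) for v in (z1, z2, z_star, rho1, rho2)])
# roughly [1.48, 1.14, 1.89, 0.78, 0.62]
```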
3.6. First optimality criterion
Let 𝒬N denote the class of distributions Q ∈ 𝒬 in which each outcome distribution Qsa is normally distributed. Let ℳ denote the class of all multiple testing procedures for {H01, H02, H0*} that strongly control the familywise Type I error rate at level 0.05 over 𝒬N. This includes, but is not limited to, procedures based on the closure principle of Marcus et al. (1976) and procedures based on partitioning as in Finner and Strassburger (2002).

We say a multiple testing procedure M ∈ ℳ is uniformly most powerful for simultaneously rejecting H0* and at least one subpopulation null hypothesis if, for all Q ∈ 𝒬N for which H0* is false and all n > 0, it satisfies

PQ,n(M rejects H0* and at least one of H01, H02) = supM′∈ℳ PQ,n(M′ rejects H0* and at least one of H01, H02).   (3.8)

For conciseness, we say a multiple testing procedure is uniformly most powerful for (3.8) to mean that for all Q ∈ 𝒬N for which H0* is false and all n > 0, it achieves the supremum in (3.8).
Consider the following properties:
A. Whenever the null hypothesis H0* for the overall population is rejected, at least one subpopulation null hypothesis is rejected, with probability 1.
B. The probability of rejecting the null hypothesis H0* is at least that of MSTD, i.e., the standard, one-sided z-test of H0*, at level 0.05, for any distribution.
C. Strong control of the familywise Type I error rate at level 0.05 over 𝒬N.
It follows from Theorem 3.2.1 of Lehmann and Romano (2005) that having properties B and C is equivalent to the following: rejecting H0* if and only if Z* > Φ−1 (0.95), with probability 1.
In Section C.4 of the Supplementary Material, we prove the following:
Theorem 3.1
Consider any multiple testing procedure M with property C. M is uniformly most powerful for (3.8) if and only if it has properties A and B.
3.7. Second optimality criterion
Consider the case where H0* and H01 are false, but H02 is true. It would be desirable to reject precisely {H0*, H01}. A limitation of the criterion (3.8) is that it gives the same credit for rejecting {H0*, H01} as it does for rejecting {H0*, H02}; that is, there is no preference in this criterion for rejecting H0* and only the false subpopulation null hypothesis. Even though (3.8) gives credit for rejecting H0* and a true null hypothesis, the probability of this occurring cannot exceed 0.05 due to the familywise Type I error constraint. For the procedures MUMP and MUMP+ (given in Section 4) that are uniformly most powerful for (3.8), this probability is at most 0.04 in all scenarios of our simulation study, described in Section 6.
It would be desirable to incorporate into an optimization criterion that when H0* and exactly one subpopulation null hypothesis H0s are false, credit is only given when precisely {H0*, H0s} is rejected. This is reflected in the following:
PQ,n(M rejects H0*, at least one of H01, H02, and no null hypothesis that is true under Q).   (3.9)

It would be ideal to have a uniformly most powerful test for (3.9), which we define to be a procedure that, for all Q ∈ 𝒬N for which H0* is false and all n > 0, attains the supremum of (3.9) over all procedures in ℳ.
In Section E of the Supplementary Material we consider each subpopulation proportion p1 ∈ {0.05, 0.10, 0.15, … , 0.95}, and show in each case that no uniformly most powerful test exists for (3.9). Because of this, we instead focus on constructing procedures that optimize (3.9) at specific alternatives defined next.
Let Δmin > 0 denote the minimum, clinically meaningful difference between the mean under treatment and the mean under control. For each s ∈ {1, 2}, we define a distribution Q(s) where the average treatment effect is Δmin for subpopulation s and is zero for the complementary subpopulation. Specifically, let each outcome distribution be normally distributed with all variances equal to a common value σ2; for each j ∈ {1, 2}, set the mean treatment effect for subpopulation j to Δj = Δmin·1[j = s], where 1[j = s] is the indicator taking value 1 if j = s and 0 otherwise. Precisely the null hypotheses {H0*, H0s} are false under Q(s).
Define n* to be the sample size n at which the standard, one-sided z-test of H0* at level 0.05 has 80% power when Δs = Δmin in both subpopulations s ∈ {1, 2}. It is straightforward to show (see Section B.1 of the Supplementary Material) that

n* = 4σ²{Φ−1(0.95) + Φ−1(0.80)}²/Δmin².   (3.10)
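As a quick numerical check of (3.10) as reconstructed above, the short Python sketch below (ours; the symbols Δmin and σ are the paper's, the function name is not) computes n* for an illustrative effect size.

```python
from scipy.stats import norm

def n_star(delta_min, sigma, alpha=0.05, power=0.80):
    """Total sample size at which the one-sided level-alpha z-test of H0* has the
    stated power when the overall mean treatment effect is delta_min, with 1:1
    randomization and a common outcome variance sigma**2."""
    return 4 * sigma**2 * (norm.ppf(1 - alpha) + norm.ppf(power))**2 / delta_min**2

# Example: an effect of half a standard deviation requires roughly 99 participants.
print(round(n_star(delta_min=0.5, sigma=1.0)))
```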
We aim to construct a multiple testing procedure that maximizes the sum of (3.9) over Q(1) and Q(2) at n = n*, i.e., that achieves the supremum over ℳ of

Σs∈{1,2} PQ(s),n*(M rejects precisely the null hypotheses {H0*, H0s}).   (3.11)
The above display is our second optimality criterion. We show in Section E of the Supplementary Material that the solution to the above optimization problem does not depend on Δmin or σ2. We construct procedures that approximately optimize (3.11) in Section 5.
3.8. Assumptions on the outcome distributions
No parametric model assumptions are made on the outcome distributions Qsa. Instead, we make a weaker assumption, given next. For fixed C > 0, let 𝒬 denote the class of distributions Q whose components Qsa each satisfy

E|Y − μ(Qsa)|³/{σ2(Qsa)}^(3/2) ≤ C, for Y distributed according to Qsa.   (3.12)

This condition, combined with the multivariate central limit theorem of Götze (1991), implies that the joint distribution of the subpopulation-specific z-statistics converges uniformly to a multivariate normal distribution. Such convergence is generally required to ensure that even the standard, one-sided z-test strongly controls the asymptotic, familywise Type I error rate as defined in Section 3.4. Condition (3.12) is satisfied by any Q whose components Qsa are normally distributed, for C > 2. Also, for fixed K > 0 and τ > 0, condition (3.12) is satisfied by any Q such that each Qsa has support in [−K, K] and variance at least τ, for C > (2K)³/τ^(3/2).
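The constants quoted above can be verified directly, assuming (3.12) is the standardized third absolute moment bound as reconstructed; the snippet below is ours and only checks the arithmetic.

```python
import math

# For a normal distribution, E|Y - mu|^3 / sigma^3 = 2*sqrt(2/pi) ~= 1.60 < 2,
# so any C > 2 suffices, as stated in Section 3.8.
print(2 * math.sqrt(2 / math.pi))

# For a distribution supported on [-K, K] with variance at least tau:
# |Y - mu| <= 2K, hence E|Y - mu|^3 / sigma^3 <= (2K)**3 / tau**1.5.
K, tau = 1.0, 0.1
print((2 * K)**3 / tau**1.5)   # any C above this value suffices
```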
4. Uniformly most powerful tests for criterion (3.8)
In Section C of the Supplementary Material, we prove:
Theorem 4.1
MUMP, which is defined in Section 2, has the following properties:
(i) It is uniformly most powerful for (3.8).
(ii) It satisfies properties A, B, and C from Section 3.6. Furthermore, it strongly controls the asymptotic, familywise Type I error rate at level 0.05 over 𝒬.
Theorem 4.1 holds for any p1 : 0 < p1 < 1; that is, regardless of the fraction p1 of the overall population in subpopulation one, MUMP is uniformly most powerful for (3.8), and has properties A, B, and C. Part (ii) also holds if we replace Z*, Z1, Z2, ρ1, ρ2 by corresponding quantities in which the variances σ2(Qsa) are estimated by sample variances rather than assumed known, under the additional assumption that (3.12) holds with the exponent 3 replaced by 4; this is proved in Section C.3 of the Supplementary Material.
Part (i) of Theorem 4.1 follows from part (ii) and Theorem 3.1, as shown in Section C.5 of the Supplementary Material. Properties A and B in part (ii) follow directly from the definition of MUMP. The main challenge in proving Theorem 4.1 is showing MUMP has property C.
In the special case that ρ1 = ρ2, the procedure MUMP reduces to the simpler procedure that, when Z* > Φ−1(0.95), rejects the overall population null hypothesis and the null hypothesis for the subpopulation with larger z-statistic. We denote this simpler procedure by M0. This special case occurs, for example, when p1 = p2 = 1/2 and the variances σ2(Qsa) are all equal. However, when the subpopulations have different sizes or when these variances differ, M0 can fail to strongly control the familywise Type I error rate, unlike MUMP.
We now describe the intuition for how MUMP reduces the worst-case, familywise Type I error rate, compared to M0. We consider distributions where outcomes are normally distributed, and focus on the scenarios where the familywise Type I error rate of M0 can exceed 0.05. This cannot happen if H0* is true, since M0 makes a Type I error only if Z* > Φ−1(0.95), and under H0* this happens with probability at most 0.05. A familywise Type I error also cannot occur if both H01, H02 are false, since then H0* is false as well, making a Type I error impossible.
The remaining case is the class of distributions for which H0* is false and exactly one subpopulation null hypothesis, call it H01 without loss of generality, is false as well. Then M0 only makes a Type I error when it rejects H02, which occurs if both Z* > Φ−1(0.95) and Z2 − Z1 > 0. Direct computation shows this occurs with probability exceeding 0.05 only when the correlation between Z* and Z2 − Z1, which has the same sign as ρ2 − ρ1, is positive. The procedure MUMP raises the threshold for rejecting H02 in such cases, and therefore has a lower Type I error probability than M0. The tradeoff is that MUMP has a higher Type I error probability than M0 for distributions in this class with ρ2 − ρ1 < 0. However, since the Type I error probability of M0 exceeds 0.05 only in the former case, this tradeoff reduces the worst-case Type I error probability over this class. We further explain this tradeoff, and how we selected the constant 3/4 in the procedure MUMP, in Section C.6 of the Supplementary Material.
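To make this tradeoff concrete, here is a small Monte Carlo sketch (ours, not the authors'; it relies on the representation Z* = ρ1Z1 + ρ2Z2 reconstructed in Section 3.5) that estimates the probability that M0 and MUMP wrongly reject the true null hypothesis H02 when only subpopulation one benefits and subpopulation two carries the larger correlation (ρ2 > ρ1).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def prob_reject_h02(select_sub2, mu1, rho1, n_sim=10**6):
    """Monte Carlo probability of rejecting the true H02 when Delta_1 > 0 and
    Delta_2 = 0. `select_sub2(z1, z2, rho1, rho2)` returns True where the
    procedure would point to subpopulation two."""
    rho2 = np.sqrt(1 - rho1**2)
    z1 = rng.normal(mu1, 1.0, n_sim)      # subpopulation one: positive effect
    z2 = rng.normal(0.0, 1.0, n_sim)      # subpopulation two: boundary of H02
    z_star = rho1 * z1 + rho2 * z2
    return np.mean((z_star > norm.ppf(0.95)) & select_sub2(z1, z2, rho1, rho2))

m0 = lambda z1, z2, r1, r2: z2 > z1                              # larger z-statistic wins
mump = lambda z1, z2, r1, r2: z2 - 0.75 * r2 > z1 - 0.75 * r1    # MUMP's tilted comparison

# With rho2 > rho1, MUMP's rejection event for H02 is a subset of M0's,
# so its estimated error probability should never be the larger of the two.
for mu1 in (0.5, 1.0, 1.5, 2.0):
    print(mu1, prob_reject_h02(m0, mu1, rho1=0.4), prob_reject_h02(mump, mu1, rho1=0.4))
```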
We now show how to augment MUMP to allow simultaneous rejection of all three null hypotheses H0*, H01, H02 in some cases, while still having all of the properties in Theorem 4.1. We consider each ρ1 ∈ (0, 1) separately, and use a threshold function a(ρ1) defined in Section D of the Supplementary Material. The augmented procedure, denoted MUMP+, proceeds exactly as MUMP and, in certain cases determined by comparing the z-statistics to the threshold a(ρ1), additionally rejects the remaining subpopulation null hypothesis.
The set of values {a(ρ1) : ρ1 ∈ (0, 1)} ranges between 1.92 and 2.19, with the minimum occurring at ρ1 = 1/√2, i.e., where ρ1 = ρ2. Since MUMP+ dominates MUMP, it follows that MUMP+ is also uniformly most powerful for (3.8). This shows that there is not a unique procedure that is uniformly most powerful for (3.8).
5. Procedures maximizing the second criterion (3.11) at certain alternatives
For each p1 ∈ {0.05, 0.10, 0.15, … , 0.95}, we constructed a procedure, denoted M(3.11), that approximately maximizes (3.11) over all procedures in ℳ. It is depicted in Figures 1b and 1d for the cases of p1 = 1/2 and p1 = 3/4, respectively. In each case, the value of (3.11) at M = M(3.11) is within 0.001 of the supremum of (3.11) over all procedures in ℳ. This is shown in Section E of the Supplementary Material, where details are given for how M(3.11) is constructed. For comparison, the procedure MUMP is depicted in Figures 1a and 1c for the cases of p1 = 1/2 and p1 = 3/4, respectively. The procedure M(3.11) has properties A and C, but not B.
Figure 1.
Rejection regions for MUMP (left column) and M(3.11) (right column) at p1 = 1/2 (top row) and p1 = 3/4 (bottom row). For comparability, the line corresponding to the boundary of the rejection region of MSTD at level 0.05 is given in all plots. For MUMP, this line coincides with the boundary of the region where at least H0* is rejected, which follows from MUMP having properties B and C.
M(3.11) was constructed by a new application of the method of Rosenblum et al. (In Press). It consists of first partitioning the (Z1, Z2) plane into tiny rectangles r. For each r, an optimization algorithm determines the null hypotheses to be rejected when (Z1, Z2) ∈ r. This was done in a way that maximizes (3.11) under the constraint that the familywise Type I error rate is strongly controlled at level 0.05 over 𝒬N. For the case of p1 = 1/2, the resulting multiple testing procedure is depicted in Figure 1b, where each rectangle r is colored according to the set of null hypotheses to be rejected when (Z1, Z2) ∈ r. This general method, though it produces well-defined multiple testing procedures, does not provide a simple analytic expression for the boundaries of the rejection regions, in contrast to MUMP and MUMP+. Also, this general method cannot be used to prove results such as Theorems 3.1 and 4.1. Full details are in Section E of the Supplementary Material.
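To convey the structure of such a discretized construction, here is a heavily simplified, hypothetical Python sketch (ours; it is not the algorithm of Section E of the Supplementary Material). It chooses, by linear programming, per-rectangle probabilities of rejecting {H0*, H01} or {H0*, H02} to maximize a discretized version of (3.11), but it only imposes the 0.05 error constraint at three null configurations, whereas the actual construction must control the familywise Type I error over the full composite null space.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import linprog

p1 = 0.5
rho1, rho2 = np.sqrt(p1), np.sqrt(1 - p1)        # correlations under equal variances
drift = norm.ppf(0.95) + norm.ppf(0.80)          # noncentrality sqrt(n*)*Delta_min/(2*sigma)
edges = np.linspace(-4.0, 6.0, 41)               # coarse rectangles; the paper's are tiny

def rect_probs(mu1, mu2):
    """P{(Z1, Z2) falls in each rectangle} for independent Z1 ~ N(mu1, 1), Z2 ~ N(mu2, 1)."""
    return np.outer(np.diff(norm.cdf(edges, loc=mu1)),
                    np.diff(norm.cdf(edges, loc=mu2))).ravel()

m = (len(edges) - 1) ** 2
# Decision variables per rectangle r: x1[r] = P(reject {H0*, H01}), x2[r] = P(reject {H0*, H02}).
alt1 = rect_probs(drift * rho1, 0.0)             # z-statistic means under Q(1)
alt2 = rect_probs(0.0, drift * rho2)             # z-statistic means under Q(2)
objective = -np.concatenate([alt1, alt2])        # linprog minimizes, so negate the criterion

rows, rhs = [], []
for mu1, mu2 in [(0.0, 0.0), (drift * rho1, 0.0), (0.0, drift * rho2)]:
    q = rect_probs(mu1, mu2)
    err1 = q if mu1 == 0.0 else np.zeros(m)      # rejecting H01 is an error iff Delta_1 <= 0
    err2 = q if mu2 == 0.0 else np.zeros(m)      # rejecting H02 is an error iff Delta_2 <= 0
    rows.append(np.concatenate([err1, err2]))
    rhs.append(0.05)
# The two rejection probabilities within a rectangle cannot sum to more than one.
rows.extend(np.concatenate([row, row]) for row in np.eye(m))
rhs.extend([1.0] * m)

res = linprog(objective, A_ub=np.vstack(rows), b_ub=rhs,
              bounds=[(0.0, 1.0)] * (2 * m), method="highs")
print("value of (3.11) achieved by this toy construction:", -res.fun)
```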
6. Simulations to assess power and Type I error
6.1. Multiple testing procedures to be compared
The following multiple testing procedure, denoted MR, was given in the case of p1 = 1/2 by Rosenbaum (2008, Section 2): if Z* > Φ−1(0.95), reject H0*; in that case, additionally reject H0s for each s ∈ {1, 2} with Zs > Φ−1(0.95).
Rosenbaum (2008) shows this procedure strongly controls the familywise Type I error rate at level 0.05 over . A straightforward extension of that proof shows the result holds for any p1 ∈ (0, 1). By construction, MR has property B. However, it does not have property A. That is, the procedure may reject the overall population null hypothesis without rejecting any subpopulation null hypothesis. This occurs if Z* > Φ−1(0.95), but neither Z1 nor Z2 exceeds Φ−1(0.95), as was the case in the data example in Section 2. We show in Section B.1 of the Supplementary Material that MR dominates a procedure based on the fixed sequence method of Maurer et al. (1995). We therefore only consider the former in the power comparison below.
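The following Python sketch encodes MR as reconstructed from the description above (it is our rendering, not taken from Rosenbaum (2008)), and shows its behavior on the Section 2 example, where H0* is rejected but neither subpopulation null hypothesis is.

```python
from scipy.stats import norm

def m_r(z1, z2, z_star, alpha=0.05):
    """M_R as described above: H0* is tested first at level alpha; only if it is
    rejected are H01 and H02 each tested at level alpha."""
    crit = norm.ppf(1 - alpha)
    rejected = set()
    if z_star > crit:
        rejected.add("H0*")
        if z1 > crit:
            rejected.add("H01")
        if z2 > crit:
            rejected.add("H02")
    return rejected

# Section 2 example: only H0* is rejected, illustrating that M_R lacks property A.
print(m_r(z1=1.48, z2=1.14, z_star=1.89))  # {'H0*'}
```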
Bergmann and Hommel (1988) give an improvement to the Holm step-down procedure for hypotheses that are logically related, as is the case in our context. We define their procedure, denoted MBH, in Section B.1 of the Supplementary Material, where we show it has neither property A nor B.
Song and Chi (2007) and Alosh and Huque (2009) designed multiple testing procedures involving the overall population and a single, prespecified subpopulation s*. Denote the original procedure of Song and Chi (2007) by MSC,s*. To tailor this procedure to our setting, we augment it to additionally allow rejection for the complementary subpopulation, without any loss in power for H0* or for H0s*. We define this augmented procedure, denoted MSC+,s*, in Section B.1 of the Supplementary Material, where we show that in general it has neither property A nor B. Since the procedure of Song and Chi (2007) has similar performance to a procedure of Alosh and Huque (2009), we only consider the former below.
6.2. Power comparison
We compare MUMP+, MUMP, MR, M(3.11), MBH, MSC,1, MSC+,1, MSC+,2. A wide variety of distributions Q are considered, with full results given in Section B.2 of the Supplementary Material. Here, we focus on four representative cases. In each case, we set Qsa to be normally distributed with all variances σ2(Qsa) equal to a common value σ2.
We consider two scenarios. In the first, we set Δ1 = Δ2 = Δmin > 0, for Δmin defined in Section 3.7. In the second, we set Δ1 = Δmin and Δ2 = 0. In each scenario, we consider p1 ∈ {1/2, 3/4}. We set the sample size n = n*, as defined in (3.10).
We say a multiple testing procedure rejects at least a given set of null hypotheses if it rejects all of the null hypotheses in that set and possibly more. For each procedure and distribution, we ran 10^6 Monte Carlo iterations and recorded the empirical probabilities of rejecting different subsets of null hypotheses in Table 2.
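For instance, the following Python sketch (ours) estimates the MUMP entries of Table 2 under scenario 1 with p1 = 1/2 by simulating the z-statistics directly; it assumes the representation Z* = ρ1Z1 + ρ2Z2 with ρs = √ps, which follows under equal variances from the formulas in Section 3.5 as reconstructed.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def mump_power(p1, effect1, effect2, n_sim=10**6, alpha=0.05):
    """Monte Carlo rejection probabilities of MUMP at the sample size n* of (3.10),
    where effect_s = Delta_s / Delta_min for subpopulation s and variances are equal."""
    p2 = 1 - p1
    drift = norm.ppf(1 - alpha) + norm.ppf(0.80)            # sqrt(n*)*Delta_min/(2*sigma)
    rho1, rho2 = np.sqrt(p1), np.sqrt(p2)
    z1 = rng.normal(rho1 * drift * effect1, 1.0, n_sim)
    z2 = rng.normal(rho2 * drift * effect2, 1.0, n_sim)
    z_star = rho1 * z1 + rho2 * z2
    reject_overall = z_star > norm.ppf(1 - alpha)
    pick1 = z1 - 0.75 * rho1 >= z2 - 0.75 * rho2
    return {"H0*": reject_overall.mean(),                   # equals "H0*+sub" for MUMP
            "H0* + H01": (reject_overall & pick1).mean(),
            "H0* + H02": (reject_overall & ~pick1).mean()}

# Scenario 1 (both subpopulations benefit equally), p1 = 1/2: estimates should be
# close to the MUMP row of Table 2, i.e., about 0.80, 0.40, and 0.40.
print(mump_power(p1=0.5, effect1=1.0, effect2=1.0))
```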
Table 2.
Power Comparison. Each cell reports the probability (as a percent) that the procedure in that row rejects at least the set of null hypotheses corresponding to that column. The column heading “H0*+sub” means H0* and at least one of H01, H02; “all” means all three null hypotheses; “H0*, H01, not H02” means H01, H0* are rejected but H02 is not rejected.
Scenario 1: Both subpopulations benefit equally from treatment, p1 = 1/2

| Procedure | H0* | H0*+sub | H0*+H01 | H0*+H02 | all |
|---|---|---|---|---|---|
| MUMP+ | 80 | 80 | 49 | 49 | 19 |
| MUMP | 80 | 80 | 40 | 40 | 0 |
| MR | 80 | 74 | 52 | 52 | 30 |
| M(3.11) | 79 | 79 | 40 | 40 | 0 |
| MBH | 66 | 65 | 48 | 48 | 30 |
| MSC,1 | 79 | 52 | 52 | 0 | 0 |
| MSC+,1 | 79 | 74 | 52 | 52 | 30 |
| MSC+,2 | 79 | 74 | 52 | 52 | 30 |

Scenario 1: Both subpopulations benefit equally from treatment, p1 = 3/4

| Procedure | H0* | H0*+sub | H0*+H01 | H0*+H02 | all |
|---|---|---|---|---|---|
| MUMP+ | 80 | 80 | 62 | 28 | 10 |
| MUMP | 80 | 80 | 56 | 23 | 0 |
| MR | 80 | 75 | 67 | 32 | 24 |
| M(3.11) | 79 | 79 | 63 | 15 | 0 |
| MBH | 66 | 66 | 60 | 29 | 24 |
| MSC,1 | 79 | 67 | 67 | 0 | 0 |
| MSC+,1 | 79 | 75 | 67 | 32 | 24 |
| MSC+,2 | 79 | 75 | 66 | 32 | 24 |

Scenario 2: Only subpopulation one benefits from treatment, p1 = 1/2

| Procedure | H0* | H0*+sub | H0*+H01 | H0*, H01, not H02 |
|---|---|---|---|---|
| MUMP+ | 34 | 34 | 31 | 30 |
| MUMP | 34 | 34 | 31 | 31 |
| MR | 34 | 32 | 30 | 27 |
| M(3.11) | 34 | 34 | 31 | 31 |
| MBH | 22 | 22 | 20 | 18 |
| MSC,1 | 34 | 29 | 29 | 29 |
| MSC+,1 | 34 | 31 | 29 | 27 |
| MSC+,2 | 33 | 30 | 28 | 26 |

Scenario 2: Only subpopulation one benefits from treatment, p1 = 3/4

| Procedure | H0* | H0*+sub | H0*+H01 | H0*, H01, not H02 |
|---|---|---|---|---|
| MUMP+ | 59 | 59 | 55 | 55 |
| MUMP | 59 | 59 | 55 | 55 |
| MR | 59 | 56 | 55 | 52 |
| M(3.11) | 58 | 58 | 56 | 56 |
| MBH | 44 | 44 | 43 | 40 |
| MSC,1 | 58 | 55 | 55 | 55 |
| MSC+,1 | 58 | 56 | 55 | 51 |
| MSC+,2 | 57 | 55 | 54 | 50 |
The power to reject H0* and at least one subpopulation null hypothesis, which is the quantity in the criterion (3.8), is given in the column “H0*+sub” of Table 2. In scenario 1, since all null hypotheses are false, this power equals (3.9); therefore, no extra column is devoted to (3.9). In scenario 2, precisely {H0*, H01} are false; the power to reject precisely this set, which equals (3.9), is given in the column “H0*, H01, not H02.”
In all cases, MUMP+ has the maximum power for rejecting H0* and at least one subpopulation null hypothesis. Also, in all cases, MUMP+ has the maximum power for rejecting at least the overall population null hypothesis H0*.
The procedure MR has the same power as MUMP+ to reject H0*, and has 5-6% less power to simultaneously reject H0* and at least one of H01, H02 in scenario 1. In scenario 2, MR has 3% less power than MUMP+ to reject precisely the false null hypotheses. However, MR is similar to or improves on the power of MUMP+ in almost all the other cases, with a large improvement (up to 14%) in scenario 1 in the power to reject all three null hypotheses.
The distribution in scenario 2 is identical to Q(1) defined in Section 3.7; this distribution is used in the optimization criterion (3.11), which gives credit for rejecting precisely {H0*, H01} under Q(1). In scenario 2, M(3.11) has the greatest power to reject precisely {H0*, H01}. The procedures MUMP and MUMP+ come in at a close second, having power to reject precisely {H0*, H01} within 1% of M(3.11) in scenario 2. Furthermore, the value of (3.9) for MUMP and MUMP+ was always within 1% of this value for M(3.11) in all simulations in Section B.2 of the Supplementary Material. We also show there that for each of MUMP and MUMP+, the value of the criterion (3.11) is within 2.2% of the optimal value of (3.11) over all , for each p1 ∈ {0.05, 0.10, 0.15, … , 0.95}.
Each value in Table 2 for M(3.11) is within 1% of the corresponding value for MUMP in all cases except scenario 1 with p1 = 3/4. In that case, M(3.11) has 7% more power than MUMP to reject at least {H0*, H01}, while it has 8% less power than MUMP to reject at least {H0*, H02}. This power tradeoff between the subpopulation null hypotheses is apparent in the form of the rejection regions for MUMP and M(3.11) in Figures 1c and 1d; MUMP devotes a smaller region to rejecting {H0*, H01} and a larger region to rejecting {H0*, H02} than M(3.11) does.
The augmented version MSC+,1 of the procedure MSC,1 of Song and Chi (2007) has substantially more power (up to 52% more) than MSC,1 to reject H02 in scenario 1. This is not surprising, since MSC,1 was designed for testing only {H0*, H01}, rather than {H0*, H01, H02} as considered here.
The above scenarios capture the main features of the extensive simulations in Section B.2 of the Supplementary Material. The one exception is where the mean treatment effect is negative for one subpopulation and positive for the other. This is called a qualitative interaction, as opposed to a quantitative interaction. Since these treatment effects partially cancel out for the overall population, all the procedures above have relatively low power for rejecting H0*. The procedure MBH has the most power to reject at least one of H01, H02, but this power is not very large. All the procedures compared in our simulations are for situations where quantitative, rather than qualitative, interactions are expected.
6.3. Familywise Type I error rate of MUMP+
In the special case where each Qsa is a normal distribution, the familywise Type I error rate of MUMP+ is at most 0.05, at any sample size. In general, the familywise Type I error guarantee in Theorem 4.1 is asymptotic, as the sample size goes to infinity. We did extensive simulations based on skewed and heavy-tailed distributions in 𝒬, with sample sizes from 50 to 500, as described in Section B.3 of the Supplementary Material. As a benchmark for how challenging each distribution is, we computed the Type I error of the standard, one-sided z-test for H0* under c(Q), where c(Q) is the distribution resulting from centering each component distribution of Q to have mean 0. For each distribution we simulated from, the familywise Type I error rate of MUMP+ under Q was never more than the Type I error of the standard, one-sided z-test under c(Q).
The familywise Type I error rate of M(3.11) is at most 0.05, at any sample size, for any Q ∈ 𝒬N. Because of the computational complexity of evaluating the familywise Type I error rate of M(3.11) under non-normal distributions (as explained in Section E of the Supplementary Material), we did not conduct additional simulations for this procedure.
7. Discussion
The power comparisons in Section 6 provide information that may be useful in selecting a multiple testing procedure. If it is strongly desired to guarantee rejecting at least one subpopulation null hypothesis whenever H0* is rejected, MUMP+ could be useful. The procedure MR, though lacking this property, has quite favorable overall performance in the power comparison in Section 6; in particular, it has substantially greater power than MUMP+ for simultaneously rejecting all three null hypotheses. This can be an important consideration in practice. It is an open problem to simultaneously optimize a weighted combination of (i) power for rejecting all three null hypotheses, and (ii) power for rejecting H0*, at least one subpopulation null hypothesis, and no true null hypothesis, at specific alternatives.
For each of the procedures MUMP and MUMP+, which are uniformly most powerful for the criterion (3.8), the value of the alternative criterion (3.11) is within 2.2% of the optimal value, for each p1 we considered. This shows that these uniformly most powerful procedures for the criterion (3.8) have performance relatively close to optimal for the alternative criterion (3.11), at each value of p1 we considered.
Highlights.
▪ We examine situations where the population is partitioned into two subpopulations.
▪ We aim to reject null hypotheses for the overall population and ≥ 1 subpopulation.
▪ We construct new, uniformly most powerful testing procedures for this problem.
▪ It is proved that all such uniformly most powerful procedures are consonant.
Footnotes
Supplementary Material In Section A, we give the measurability condition from Section 3.3. Simulations comparing power and familywise Type I error rates are given in Section B. We prove Theorems 3.1 and 4.1 in Section C. Section D gives the algorithm for a(ρ1) used in MUMP+. The construction of M(3.11) is described in Section E. The relationship between property A and consonance is described in Section F.
References
- Alosh M, Huque MF. A flexible strategy for testing subgroups and overall population. Statistics in Medicine. 2009;28:3–23. doi:10.1002/sim.3461.
- Bergmann B, Hommel G. Improvements of general multiple test procedures for redundant systems of hypotheses. In: Bauer P, Hommel G, Sonnemann E, editors. Multiple Hypothesenprüfung – Multiple Hypotheses Testing. Springer; Berlin: 1988. pp. 100–115.
- Bittman RM, Romano JP, Vallarino C, Wolf M. Optimal testing of multiple hypotheses with common effect direction. Biometrika. 2009;96:399–410.
- Brannath W, Bretz F. Shortcuts for locally consonant closed test procedures. Journal of the American Statistical Association. 2010;105:660–669.
- FDA, EMEA. E9 statistical principles for clinical trials. U.S. Food and Drug Administration: CDER/CBER; European Medicines Agency: CPMP/ICH/363/96; 1998.
- Finner H, Strassburger K. The partitioning principle: a powerful tool in multiple decision theory. Annals of Statistics. 2002;30:1194–1213.
- Gabriel KR. Simultaneous test procedures – some theory of multiple comparisons. The Annals of Mathematical Statistics. 1969;40:224–250.
- Götze F. On the rate of convergence in the multivariate CLT. Annals of Probability. 1991;19:724–739.
- Hochberg Y, Tamhane AC. Multiple Comparison Procedures. Wiley Interscience; New York: 1987.
- Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 1979;6:65–70.
- Hommel G. Multiple test procedures for arbitrary dependence structures. Metrika. 1986;33:321–336.
- Hommel G, Bernhard G. Bonferroni procedures for logically related hypotheses. Journal of Statistical Planning and Inference. 1999;82:119–128.
- Kirsch I, Deacon BJ, Huedo-Medina TB, Scoboria A, Moore TJ, Johnson BT. Initial severity and antidepressant benefits: a meta-analysis of data submitted to the Food and Drug Administration. PLoS Medicine. 2008;5:e45. doi:10.1371/journal.pmed.0050045.
- Lehmann EL, Romano JP. Testing Statistical Hypotheses. 3rd ed. Springer; New York: 2005.
- Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976;63:655–660.
- Maurer W, Hothorn LA, Lehmacher W. Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses. In: Vollman J, editor. Biometrie in der chemisch-pharmazeutischen Industrie. Fischer Verlag; Stuttgart: 1995.
- Romano JP, Shaikh A, Wolf M. Consonance and the closure method in multiple testing. The International Journal of Biostatistics. 2011;7.
- Romano JP, Wolf M. Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association. 2005;100:94–108.
- Rosenbaum PR. Testing hypotheses in order. Biometrika. 2008;95:248–252.
- Rosenblum M, van der Laan MJ. Optimizing randomized trial designs to distinguish which subpopulations benefit from treatment. Biometrika. 2011;98:845–860. doi:10.1093/biomet/asr055.
- Rosenblum M, Liu H, Yen EH. Optimal tests of treatment effects for the overall population and two subpopulations in randomized trials, using sparse linear programming. Journal of the American Statistical Association. In press. doi:10.1080/01621459.2013.879063. URL: http://dx.doi.org/10.1080/01621459.2013.879063.
- Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, Fleming T, Eiermann W, Wolter J, Pegram M, Baselga J, Norton L. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. New England Journal of Medicine. 2001;344:783–792. doi:10.1056/NEJM200103153441101.
- Song Y, Chi GYH. A method for testing a prespecified subgroup in clinical trials. Statistics in Medicine. 2007;26:3535–3549. doi:10.1002/sim.2825.
- Sonnemann E, Finner H. Vollständigkeitssätze für multiple Testprobleme. In: Bauer P, Hommel G, Sonnemann E, editors. Multiple Hypothesenprüfung. Springer; Berlin: 1988. pp. 121–135.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.