Abstract
Background:
Many studies, such as a study of comparative effectiveness, entail a comparison of the beneficial and adverse effects of multiple (K > 2) competing therapies. Often the analysis consists of a comparison of the K groups using an omnibus (T²-like) test for any difference among the groups followed by pairwise comparisons with adjustments for multiple tests.
Methods:
We evaluate the properties of an analysis strategy in which each group is compared to the average of the others in hopes of establishing the overall superiority (or harm) of at least one of the therapies. Testing of one-versus-others can be accomplished for virtually any model using simple tests and the type I error probability α can be controlled by conducting such tests under the closed testing principle. Testing using linear models, the family of generalized linear models (GLMs) and Cox proportional hazards are described with examples.
Results:
Since each tested hypothesis compares one treatment to the average of the others, the K-level null hypothesis in the tree of closed testing is equivalent to any of the (K − 1)-level hypotheses, thus reducing the number of tests required. This applies to linear, generalized linear and Cox PH models. While the Bonferroni, Holm, and Hommel procedures preserve the desired level α, all are conservative relative to closed one-versus-others testing, and closed testing in general provides greater power.
Conclusions:
Testing each of multiple treatments versus the average of the others is readily and efficiently conducted under the closed testing principle and may be especially useful in the assessment of studies of comparative effectiveness.
ClinicalTrials.gov Identifier for the ALLHAT study: NCT00000542
Keywords: Comparative Effectiveness, One-versus-Others Combined, Closed Testing
Background
The American Recovery and Reinvestment Act (ARRA) of 2009 reinvigorated the interest of the National Institutes of Health and other agencies (e.g. PCORI) in comparative effectiveness research aimed at identifying the most effective (and safest) therapy among the many that may be available to treat a specific condition [1]. Many such studies involve the comparison of multiple K > 2 groups. The objective is to then determine which treatment, if any, is “best”.
For example, the NIDDK-funded Glycemia Reduction Approaches in Diabetes, A Comparative Effectiveness Study (GRADE) is designed to compare the effectiveness of four classes of drugs commonly used to treat early type 2 diabetes of less than 10 years' duration [2]. All 5047 participants had previously been treated with metformin alone and were randomly assigned to receive a second drug from among one of four commonly used drug classes: either glimepiride (a sulfonylurea), sitagliptin (a DPP-4 inhibitor), liraglutide (a GLP-1 agonist) or glargine (a basal insulin). All medications have been approved for use by the FDA and each is administered in accordance with the FDA-approved drug labeling. There is no placebo or “comparator” group. The cohort of 5047 participants is being followed until the summer of 2021 with follow-up ranging from 4 to 7.5 years, depending on the time of entry.
The primary analyses will compare the long-term differences between treatment groups in metabolic status as measured by the HbA1c levels over time. Traditionally this would be conducted using a T²-like omnibus test for any difference in any direction among the 4 groups. To illustrate, let μj denote the mean within the jth group, j = 1, . . . , K. A T²-like test on K − 1 df would provide a test of the joint null hypothesis
$$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_K \tag{1}$$
against the global alternative
$$H_1\colon \mu_i \neq \mu_j \ \text{ for some } 1 \le i < j \le K \tag{2}$$
Alternately, all K(K − 1)/2 pairwise comparisons could be conducted between the pairs of groups. To protect against inflation of the type I error probability, the set of pairwise tests would be conducted with an appropriate adjustment for multiple tests. The simplest adjustment would be an improved Bonferroni procedure such as that of Holm or Hommel [3], among others. Alternately, closed sequential testing might be employed [4].
However, neither the omnibus test nor the pairwise tests will address whether one treatment is superior to the other treatments combined for a given outcome. Herein we propose a different approach that compares each group in turn to the other treatments combined, such as group 1 versus groups 2, 3 and 4 combined, then group 2 versus groups 1, 3 and 4 combined, etc. The comparison of group 1 versus the others would compare μ1 versus (μ2 + μ3 + μ4)/3; and likewise for group 2 versus others, etc. Each would constitute a simple contrast among the 4 groups that is interpretable in the context of the GRADE study.
Herein we first show that this one-versus-others approach for the comparison of group means is conveniently conducted using the closed testing principle, followed by examples, simulations and numerical computations to show that it may have favorable properties in some settings. This is followed by a description of its implementation using the family of generalized linear regression models (GLMs) that includes logistic and Poisson regression models, among others, and using Cox Proportional Hazards models for event time data. We then present a detailed analysis of the past-completed “Antihypertensive and Lipid-lowering Treatment to Prevent Heart Attack Trial” (ALLHAT) [7].
Methods
One-Versus-Others Closed Testing
Consider a study that compares the mean μi of each of K = 3 treatments versus the others using a testing procedure that preserves the type I error probability at the desired level α. The null one-versus-others elemental hypotheses of interest are
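$$H_{01}\colon \mu_1 = \frac{\mu_2 + \mu_3}{2}, \qquad H_{02}\colon \mu_2 = \frac{\mu_1 + \mu_3}{2}, \qquad H_{03}\colon \mu_3 = \frac{\mu_1 + \mu_2}{2} \tag{3}$$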
To account for the three tests, the p-values could be adjusted for multiplicity using the Bonferroni or Holm (or other) adjustment. Alternatively, the closed testing procedure [4] provides strong control of type I error probability. To reject a simple hypothesis (e.g., H01) at level α, one needs to reject at level α all intersection hypotheses that include that particular simple hypothesis. More specifically, to reject H01 at level α, one needs to reject H01, H01 ∩ H02, H01 ∩ H03 and H01 ∩ H02 ∩ H03 at level α. A similar testing tree would apply to the tests of H02 and H03. Note that the K = 3 level intersection hypothesis is equivalent to the joint null hypothesis of equality in the three groups H0,123: μ1 = μ2 = μ3.
However, in this instance, the testing tree can be further simplified since any level 2 intersection hypothesis, such as H01 ∩ H02, is equivalent to the joint null hypothesis H0,123. For example, H01 ∩ H02 specifies that
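$$\mu_1 = \frac{\mu_2 + \mu_3}{2} \quad\text{and}\quad \mu_2 = \frac{\mu_1 + \mu_3}{2}, \quad\text{which together yield}\quad \mu_1 = \mu_2 = \mu_3 \tag{4}$$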
Therefore, with three treatments, if the 2 df test of the joint null hypothesis H0,123 is rejected at level α, then one can proceed to test each of the elementary hypotheses H01, H02 and H03 at level α without the need to specifically test the intersection hypotheses H01 ∩H02, H01 ∩H03 and H02 ∩ H03.
Similarly, for K = 4 treatments, there are four individual hypotheses:
$$H_{0i}\colon \mu_i = \frac{1}{3}\sum_{j \neq i} \mu_j, \qquad i = 1, 2, 3, 4 \tag{5}$$
To reject H01 at level α using the closed testing principle, one needs to reject at level α all intersection hypotheses that include H01, namely
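$$\begin{aligned}
\text{Level 1:}\quad & H_{01} \\
\text{Level 2:}\quad & H_{01} \cap H_{02},\ \ H_{01} \cap H_{03},\ \ H_{01} \cap H_{04} \\
\text{Level 3:}\quad & H_{01} \cap H_{02} \cap H_{03},\ \ H_{01} \cap H_{02} \cap H_{04},\ \ H_{01} \cap H_{03} \cap H_{04} \\
\text{Level 4:}\quad & H_{01} \cap H_{02} \cap H_{03} \cap H_{04}
\end{aligned} \tag{6}$$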
where the Level 4 intersection hypothesis is the joint null hypothesis
$$H_{0,1234}\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 \tag{7}$$
Note, however, that H01 specifies that
$$\mu_1 = \frac{\mu_2 + \mu_3 + \mu_4}{3} \tag{8}$$
or equivalently
$$\mu_1 = \frac{\mu_1 + \mu_2 + \mu_3 + \mu_4}{4} = \bar{\mu} \tag{9}$$
Thus, the elemental hypotheses H02, H03, and H04 also specify that μ2, μ3 and μ4 each equal the overall mean $\bar{\mu} = (\mu_1 + \mu_2 + \mu_3 + \mu_4)/4$, so that each of the Level 3 intersection hypotheses is also equivalent to H0,1234. For example, H01 ∩ H02 ∩ H03 specifies that $\mu_1 = \mu_2 = \mu_3 = \bar{\mu}$, which can only occur if $\mu_4 = \bar{\mu}$, which yields H0,1234. The Level 2 intersection hypotheses, in contrast, specify linear functions of the means, such as H01 ∩ H02, which specifies that μ1 = μ2 = (μ3 + μ4)/2.
To summarize, with four groups it follows that rejecting H01 at level α requires rejecting H01, H01 ∩H02, H01 ∩H03, H01 ∩H04, and H01 ∩H02 ∩H03 each at level α. Note that the test of any level 3 hypothesis involving H01 suffices to test all other level 3 hypotheses involving H01, as well as the Level K = 4 intersection hypothesis. Also note that this equivalence applies for other values of K where the Level K and level K − 1 intersection hypotheses can be tested simultaneously using a test of any one of these hypotheses.
Simple Test Statistics
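To describe a family of simple test statistics, we assume that the vector of estimates $\hat{\boldsymbol{\mu}} = (\hat{\mu}_1, \ldots, \hat{\mu}_K)^T$ is asymptotically normally distributed with mean $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^T$ and covariance matrix $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2/n_1, \ldots, \sigma_K^2/n_K)$, $\sigma_j^2$ being the variance of the observations in the jth group of size nj. These statistics could be a set of sample means or proportions or log(hazard rates), etc. Further, the design could be balanced (equal sample sizes) or not.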
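Then the joint null hypothesis (1) can be tested against the global alternative (2) with a T²-like test on K − 1 df using a (K − 1) × K contrast matrix C whose rows each sum to zero, such as the one-versus-others form
$$C = \begin{pmatrix} 1 & -q & -q & \cdots & -q \\ -q & 1 & -q & \cdots & -q \\ \vdots & & \ddots & & \vdots \\ -q & \cdots & -q & 1 & -q \end{pmatrix} \tag{10}$$
where q = 1/(K − 1), which satisfies CJ = 0 for a K-element vector of ones J. Given the vector of estimates $\hat{\boldsymbol{\mu}}$ and a consistent estimate $\hat{\boldsymbol{\Sigma}}$ of the covariance matrix, the test statistic
$$X^2 = (C\hat{\boldsymbol{\mu}})^T \left(C \hat{\boldsymbol{\Sigma}} C^T\right)^{-1} (C\hat{\boldsymbol{\mu}}) \tag{11}$$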
is asymptotically distributed as chi-square on K − 1 df.
Moreover, any intersection hypothesis can also be tested using an appropriate contrast matrix. Let I = {i1, . . . , imI} denote a subset of mI of the K treatments. The hypothesis H0I = ∩i∈I H0i can be tested using the contrast matrix CI with rows given by ci with elements cii = 1 and cij = −1/(K − 1) for j ≠ i where i ∈ I. This yields a T²-like test on mI df (for mI ≤ K − 1). For example, consider a test of the intersection hypothesis H01 ∩ H02 that equals the hypothesis μ1 = μ2 = (μ3 + μ4)/2. This would be tested using a contrast matrix
$$C_{\{1,2\}} = \begin{pmatrix} 1 & -1/3 & -1/3 & -1/3 \\ -1/3 & 1 & -1/3 & -1/3 \end{pmatrix} \tag{12}$$
on 2 df. Then each individual hypothesis H0i (i = 1, . . . , K) can be tested using the single-row contrast ci. Again, the K level and all (K − 1) level tests are numerically equivalent.
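The contrast rows described above can be generated mechanically. The following Python sketch is illustrative only; `contrast_matrix` is a name chosen here and exact fractions are used just to make the structure visible. It also shows that the K one-versus-others rows sum to the zero vector, so any K − 1 of them span the same contrasts as all K, which is one way to see why the K level and (K − 1) level tests coincide.

```python
from fractions import Fraction

def contrast_matrix(subset, K):
    """Rows c_i for i in subset (1-based): c_ii = 1 and c_ij = -1/(K - 1) for j != i."""
    q = Fraction(1, K - 1)
    return [[Fraction(1) if j == i else -q for j in range(1, K + 1)]
            for i in subset]

# Contrast matrix for the intersection hypothesis H01 and H02 with K = 4 groups
for row in contrast_matrix([1, 2], K=4):
    print([str(x) for x in row])

# All K rows: every column sums to zero, so the rows are linearly dependent
full = contrast_matrix([1, 2, 3, 4], K=4)
print([str(sum(full[r][c] for r in range(4))) for c in range(4)])
```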
For the ith group, the elemental hypothesis H0i in (5) can also be tested using a large sample Z-test
$$Z_i = \frac{\hat{\mu}_i - \frac{1}{K-1}\sum_{j \neq i} \hat{\mu}_j}{SE} \tag{13}$$
where SE is the root of the variance of the numerator, this being
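$$\hat{V}\Big[\hat{\mu}_i - \frac{1}{K-1}\sum_{j \neq i}\hat{\mu}_j\Big] = \frac{\hat{\sigma}_i^2}{n_i} + \frac{1}{(K-1)^2}\sum_{j \neq i}\frac{\hat{\sigma}_j^2}{n_j} \tag{14}$$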
The statistic Zi can then be used to conduct a one- or a two-sided test at level α.
Results
Analysis of Means
McMillan-Price et al. [5] evaluated weight loss in 129 overweight or obese young adults (BMI≥25) who were randomized to one of four diets: Diet 1: high carbohydrate and high glycemic index; Diet 2: high carbohydrate and low glycemic index; Diet 3: high protein and high glycemic index; and Diet 4: high protein and low glycemic index. The mean reductions and standard errors were
Diet | 1 | 2 | 3 | 4 |
---|---|---|---|---|
n | 32 | 32 | 32 | 33 |
Mean | 4.2 | 5.5 | 6.2 | 4.8 |
SE | 0.6 | 0.5 | 0.4 | 0.7 |
This yields the following closed set of tests of hypotheses comparing one group versus the others where ∗ designates a non-significant test such that further testing for the groups involved is not conducted
(15) |
The 3 df test is significant at the 0.05 level so that the 2-level tests can then be conducted. All those involving group 3 are significant at the 0.05 level whereas tests involving H01, H02 and H04 fail to reach significance and thus their elemental hypotheses are not tested. The test of group 3 versus the others is then significant at the 0.05 level comparing the mean in group 3 of 6.2 versus the average of 4.83 in groups 1, 2 and 4. Note that a simple test of H03 with a Holm (or Bonferroni) adjustment for 4 tests would also have reached significance but with a larger adjusted p = 0.040.
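The elemental one-versus-others Z-tests for this example can be recomputed from the tabulated means and standard errors. The sketch below is a minimal illustration, not the authors' SAS program: it assumes independent groups, treats the tabulated SE as the standard error of each group mean, and uses a normal reference distribution; the function name `one_vs_others_z` is hypothetical.

```python
from math import erfc, sqrt

means = [4.2, 5.5, 6.2, 4.8]   # mean reductions, Diets 1-4 (from the table above)
ses = [0.6, 0.5, 0.4, 0.7]     # standard errors of those means

def one_vs_others_z(i, est, se):
    """Z-test of group i (0-based) versus the unweighted average of the others."""
    K = len(est)
    others = [j for j in range(K) if j != i]
    diff = est[i] - sum(est[j] for j in others) / (K - 1)
    var = se[i] ** 2 + sum(se[j] ** 2 for j in others) / (K - 1) ** 2
    z = diff / sqrt(var)
    return diff, z, erfc(abs(z) / sqrt(2))     # two-sided normal p-value

results = [one_vs_others_z(i, means, ses) for i in range(len(means))]
for i, (diff, z, p) in enumerate(results, start=1):
    print(f"Diet {i}: difference = {diff:.2f}, Z = {z:.2f}, unadjusted p = {p:.4f}")

# The smallest of the four p-values is multiplied by 4 under Holm (or Bonferroni)
print(f"Diet 3, adjusted for 4 tests: p = {4 * results[2][2]:.3f}")
```

Under these assumptions the adjusted value for Diet 3 agrees with the p = 0.040 quoted above.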
Analysis of Proportions
Treiman et al. [6] reported the results of a randomized clinical trial that compared four treatments (phenytoin, lorazepam, phenobarbital and diazepam & phenytoin) for convulsive status epilepticus. The proportions in the four groups who were successfully treated among the patients with overt status epilepticus were $\hat{\mathbf{p}}$ = (0.436, 0.649, 0.582, 0.558) and the number of participants in each group was n = (101, 97, 91, 95), with the estimated covariance of $\hat{\mathbf{p}}$ being
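$$\hat{\boldsymbol{\Sigma}} = \mathrm{diag}\left\{\frac{\hat{p}_j(1 - \hat{p}_j)}{n_j}\right\}, \qquad j = 1, \ldots, 4 \tag{16}$$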
This yields the closed set of tests of hypotheses as in (15). The K-level Wald T²-test as in (11) yields p = 0.0199 for the 3 df test and the 2 df tests have p-values
(17) |
Since all intersection tests involving H01 are significant at the 0.05 level, the elementary hypothesis H01 can also be tested and is significant at p = 0.0052 on 1 df. Thus, phenytoin (success proportion 0.436) is significantly less effective than the average of the other three treatments.
Note that the nominal p-value for the test of H02 (lorazepam versus others, not shown) is 0.029, but H02 cannot be rejected under closed testing because the p-value for the test of the intersection hypothesis H02 ∩ H04 is 0.067. The Holm procedure would also reject H01 (adjusted p-value = 0.021) and fail to reject H02 (adjusted p-value = 0.087).
The Supplemental Material contains the SAS program that performed the above calculations.
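As a complement to the SAS program, the Wald tests in this example can be sketched in plain Python. Everything below is an illustration under stated assumptions: the groups are taken as independent with binomial variances p(1 − p)/n, the helper names (`chi2_sf`, `solve`, `wald_test`) are invented here, and the chi-square tail probability is coded in closed form for 1, 2 and 3 df only.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1, 2, 3 only."""
    if df == 1:
        return erfc(sqrt(x / 2))
    if df == 2:
        return exp(-x / 2)
    if df == 3:
        return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    raise ValueError("this sketch handles df = 1, 2 or 3 only")

def solve(M, b):
    """Solve M y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[k]] for k, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    y = [0.0] * n
    for r in reversed(range(n)):
        y[r] = (A[r][n] - sum(A[r][c] * y[c] for c in range(r + 1, n))) / A[r][r]
    return y

def wald_test(subset, est, var):
    """T2-like Wald test of the intersection of H0i for i in subset (1-based)."""
    K = len(est)
    q = 1 / (K - 1)
    C = [[1.0 if j == i - 1 else -q for j in range(K)] for i in subset]
    d = [sum(c * e for c, e in zip(row, est)) for row in C]            # C * estimates
    M = [[sum(a * b * v for a, b, v in zip(r1, r2, var)) for r2 in C]  # C Sigma C'
         for r1 in C]                                                  # (diagonal Sigma)
    x2 = sum(di * yi for di, yi in zip(d, solve(M, d)))
    return x2, chi2_sf(x2, len(subset))

p_hat = [0.436, 0.649, 0.582, 0.558]
n = [101, 97, 91, 95]
var = [p * (1 - p) / m for p, m in zip(p_hat, n)]   # assumed binomial variances

x2_joint, p_joint = wald_test([1, 2, 3], p_hat, var)  # any three rows give the 3 df test
x2_24, p_24 = wald_test([2, 4], p_hat, var)           # H02 and H04 intersection, 2 df
x2_1, p_1 = wald_test([1], p_hat, var)                # H01, 1 df
print(f"joint: {p_joint:.4f}  H02 and H04: {p_24:.3f}  H01: {p_1:.4f}")
```

With the rounded proportions quoted above, these values should be close to the reported p-values for the 3 df test, the H02 ∩ H04 test and the H01 test; small differences can arise from rounding of the inputs.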
Simulations
Consider the case of three groups of n = 200 each with Xi ~ N (μi, σi2), σi = 5 (i = 1,2,3). For specified mean values {μ1, μ2, μ3}, we then performed 10,000 simulations to compare the rejection probabilities of the tests of the elemental hypotheses in (3) under closed testing compared to no adjustment for multiple tests, and adjustment using the Bonferroni, Holm or Hommel procedures. We employed the F-test on 2 df of the joint 3-group null hypothesis as in (1), and the t-test for the elemental hypotheses in (3). The simulation employed two scenarios with specified values for the {μi}. The probabilities of rejection of the elemental hypotheses of each group versus the others are presented in Table 1.
Table 1:
Probabilities of rejection of the elemental hypotheses of equality of each of three groups versus the others for specific mean values μ1, μ2 and μ3 and standard deviation of 5 without adjustment for three tests of each group versus the others, and using the Bonferroni, Holm, Hommel and closed testing procedures; 10,000 simulations with n=200 per group.
Test | μ1 | μ2 | μ3 | P(reject): μ1 vs. μ23 | P(reject): μ2 vs. μ13 | P(reject): μ3 vs. μ12 |
---|---|---|---|---|---|---|
Unadjusted | 0 | 0.25 | 0.5 | 0.517 | 0.052 | 0.501 |
Bonferroni | 0 | 0.25 | 0.5 | 0.338 | 0.020 | 0.326 |
Holm | 0 | 0.25 | 0.5 | 0.365 | 0.030 | 0.343 |
Hommel | 0 | 0.25 | 0.5 | 0.375 | 0.033 | 0.351 |
Closed testing | 0 | 0.25 | 0.5 | 0.436 | 0.049 | 0.421 |
Unadjusted | 0 | 0.50 | 0.5 | 0.706 | 0.221 | 0.250 |
Bonferroni | 0 | 0.50 | 0.5 | 0.533 | 0.121 | 0.139 |
Holm | 0 | 0.50 | 0.5 | 0.547 | 0.151 | 0.168 |
Hommel | 0 | 0.50 | 0.5 | 0.552 | 0.159 | 0.175 |
Closed testing | 0 | 0.50 | 0.5 | 0.583 | 0.207 | 0.234 |
Let μij = (μi + μj)/2 for (ij) = (12,13,23). The first 5 rows present results for the scenario with ordered means (0, 0.25, 0.5), which yields μ1 − μ23 = −0.375, μ2 − μ13 = 0 and μ3 − μ12 = 0.375. Under this scenario, the hypothesis μ2 = μ13 that group 2 equals the others combined is true (i.e. a null hypothesis) even though μ1 ≠ μ2 ≠ μ3. In this case the Bonferroni, Holm and Hommel procedures all protect the type I error probability α at the 0.05 level, but are unduly conservative with rejection probabilities ≤ 0.033 relative to the closed testing procedure with rejection probability of 0.049. For the other comparisons under this scenario, μ1 − μ23 = −0.375 and μ3 − μ12 = 0.375, the closed testing approach provides rejection probability (power) somewhat greater than the other adjusted procedures.
The second 5 rows present the scenario with means (0, 0.5, 0.5), in which case μ1 − μ23 = −0.5 whereas μ2 −μ13 = μ3 −μ12 = 0.25. Again the closed testing procedure is more powerful than the other methods except the unadjusted test owing to its inflated type I error probability.
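The structure of such a simulation can be sketched in Python as follows. This is a simplified illustration and not the authors' program: the function name `simulate`, the seed and the default number of replicates are choices made here, and, to keep the sketch free of external libraries, the F and t reference distributions are replaced by their large-sample chi-square and normal counterparts. Its output depends on these choices and is not meant to reproduce the entries of Table 1.

```python
import random
from math import erfc, exp, sqrt
from statistics import fmean, variance

def simulate(mu, sigma=5.0, n=200, reps=500, alpha=0.05, seed=2020):
    """Rejection rates of H0i: mu_i = average of the other two means (K = 3 only)."""
    assert len(mu) == 3, "this sketch hard-codes the 2 df joint test for K = 3"
    rng = random.Random(seed)
    methods = ("unadjusted", "bonferroni", "holm", "closed")
    counts = {m: [0, 0, 0] for m in methods}
    for _ in range(reps):
        xbar, s2 = [], []
        for m in mu:
            sample = [rng.gauss(m, sigma) for _ in range(n)]
            xbar.append(fmean(sample))
            s2.append(variance(sample))
        pooled = fmean(s2)                        # pooled variance (equal group sizes)
        grand = fmean(xbar)
        F = n * sum((x - grand) ** 2 for x in xbar) / 2 / pooled
        p_joint = exp(-F)                         # P(chi-square on 2 df > 2F), large-sample stand-in
        se = sqrt(pooled / n * 1.5)               # SE of xbar_i minus the mean of the other two
        # xbar_i minus the average of the other two equals 1.5 * (xbar_i - grand mean)
        p = [erfc(abs(1.5 * (x - grand) / se) / sqrt(2)) for x in xbar]
        order = sorted(range(3), key=lambda k: p[k])
        holm, running = [0.0] * 3, 0.0
        for rank, k in enumerate(order):          # Holm step-down adjusted p-values
            running = max(running, min(1.0, (3 - rank) * p[k]))
            holm[k] = running
        for k in range(3):
            counts["unadjusted"][k] += p[k] <= alpha
            counts["bonferroni"][k] += 3 * p[k] <= alpha
            counts["holm"][k] += holm[k] <= alpha
            counts["closed"][k] += p_joint <= alpha and p[k] <= alpha
    return {m: [c / reps for c in counts[m]] for m in methods}

rates = simulate((0.0, 0.25, 0.5))
for method, r in rates.items():
    print(f"{method:11s}", [round(v, 3) for v in r])
```

The closed testing rule for three groups is the last line of the inner loop: an elemental hypothesis is rejected only when both the 2 df joint test and its own test are significant at level α.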
Figures 1 and 2 present contour plots comparing the probability of rejecting H01 : μ1 = (μ2+μ3)/2 using the closed testing procedure versus using a t-test with Bonferroni adjustment. These calculations employed numerical integration of the corresponding power functions, rather than simulations. See the Supplemental Material.
Figure 1:
Ratio between the power of rejecting H01 : μ1 = (μ2 +μ3)/2 using the closed testing procedure versus using a t-test with Bonferroni adjustment where μ1 = 0 and μ2 and μ3 range over (−1, 1).
Figure 2:
Ratio between the power of rejecting H01 : μ1 = (μ2 +μ3)/2 using the closed testing procedure versus using a t-test with Bonferroni adjustment where μ1 = 1 and μ2 and μ3 range over (−1, 1).
Figure 1 employs μ1 = 0 and values for μ2 and μ3 ranging from −1 to 1 whereas Figure 2 does so for μ1 = 1. Each plot shows contours for the ratio of the rejection probability using closed testing versus that of a Bonferroni adjustment. As seen from both figures, the proposed closed testing procedure for testing one group versus the mean of all others provides more power (ratio > 1) than the Bonferroni adjustment over all values of μ2 and μ3 considered.
For μ1 = 0 (Figure 1), the ratio of the powers increases as the parameters μ2 and μ3 approach the null μ1 = μ2 = μ3 = 0 (i.e., the center of Figure 1), most likely due to the conservativeness of the Bonferroni adjustment. Conversely, for μ2 and μ3 further from the null values (i.e., the SW and NE regions in Figure 1), both tests have power approaching one, and the ratio decreases.
The same pattern is observed for μ1 = 1 (Figure 2), where values for μ2 and μ3 approaching the upper right corner approach the null whereas values in the lower left region have substantial departure from the null. Again the ratio of the closed testing to Bonferroni power is largest around the null parameters μ2 = μ3 = 1 (i.e., the NE region of Figure 2), and decreases as μ2 and μ3 depart from the null values (e.g., the SW region).
Regression Models
We now extend the above results for tests of means to tests of the coefficients for group comparisons in a regression model, starting with the family of Generalized Linear Models that includes the linear, logistic and Poisson models as special cases, followed by the Cox Proportional Hazards model.
Generalized Linear Models.
Consider an analysis using a member of the family of generalized linear models. Let μx denote E(Y |x) where g(μx) = xT β for covariate vector x. For a linear model, μx is the conditional mean with the identity link function. For a logistic model, μx is the probability of an outcome and g(·) is the logit so that log[μx/(1 − μx)] = xT β. For a Poisson model, μx is the rate of an event and g(·) is the log so that log(μx) = xT β. Herein we describe application to logistic regression.
Again consider the case of K = 3 groups with expectations μi in group i = 1,2,3 parameterized as
$$g(\mu_i) = g(\mu_j) + \beta_{i:j}, \qquad 1 \le i \ne j \le 3 \tag{18}$$
where βi:j = g(μi) − g(μj) compares group i versus group j, with group j as the reference, and where βj:i = −βi:j. In logistic regression βi:j is the log odds ratio comparing group i versus j, where βi:j < 0 provides evidence that individuals in group i have lower risk than those in group j (1 ≤ i < j ≤ K). The elementary null hypothesis for the test of group 1 versus the others then specifies that
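$$H_{01}\colon g(\mu_1) = \frac{g(\mu_2) + g(\mu_3)}{2}, \quad\text{or equivalently}\quad \beta_{2:1} + \beta_{3:1} = 0 \tag{19}$$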
The other elemental hypotheses are
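$$H_{02}\colon g(\mu_2) = \frac{g(\mu_1) + g(\mu_3)}{2}\ \ (\beta_{1:2} + \beta_{3:2} = 0), \qquad H_{03}\colon g(\mu_3) = \frac{g(\mu_1) + g(\mu_2)}{2}\ \ (\beta_{1:3} + \beta_{2:3} = 0) \tag{20}$$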
Expressed in terms of the g(μi), the Level 2 hypothesis H01 ∩ H02 specifies that 2g(μ3) − g(μ1) − g(μ2) = 0 that is the elemental hypothesis H03. Thus, H01 ∩ H02 implies the Level 3 hypothesis H01 ∩ H02 ∩ H03 and vice versa. As was the case of an analysis of means, it also follows that the K-level and all (K − 1)-level hypotheses are equal and a single test of any one will suffice. Thus, to reject H01, for example, accounting for the multiple tests, the closed testing procedure with three groups requires rejecting H01 and H01 ∩ H02 at level α.
For a logistic model with 4 groups, the class variable group would provide three coefficient estimates, one for each non-reference group versus the reference group. The analysis would then entail tests of contrasts among these coefficients for each of the elemental hypotheses.
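As an illustration of what such contrasts look like, suppose (an assumption made here, since the coding of the class variable is not specified at this point) that group 1 is the reference level, so that the fitted coefficients estimate β2:1, β3:1 and β4:1. Writing each elemental hypothesis as g(μi) equal to the average of the other three g(μj) and subtracting g(μ1) throughout gives the following 1 df contrasts:

```latex
\begin{aligned}
H_{01}&\colon\ \beta_{2:1} + \beta_{3:1} + \beta_{4:1} = 0 \\
H_{02}&\colon\ 3\beta_{2:1} - \beta_{3:1} - \beta_{4:1} = 0 \\
H_{03}&\colon\ 3\beta_{3:1} - \beta_{2:1} - \beta_{4:1} = 0 \\
H_{04}&\colon\ 3\beta_{4:1} - \beta_{2:1} - \beta_{3:1} = 0
\end{aligned}
```

With a different reference level the same hypotheses hold, with the coefficients relabeled accordingly.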
The Supplemental Material provides the SAS code that could be used to fit a logistic model for an analysis with three groups, either balanced or unbalanced, using 1 df contrasts. It also describes contrasts for the analysis of four groups.
The Supplemental Material also includes a simulation assessment of the properties of this approach and confirms that the advantages noted in the above simulation also apply to analyses using a logistic regression model.
Cox Proportional Hazards Models.
Consider the case of K = 3 groups with hazard functions λi(t) over time in group i = 1,2,3. The elementary null hypothesis for the test of group 1 versus the others then specifies that the hazard function in the first group equals the average of the hazards in the other groups, or
$$H_{01}\colon \lambda_1(t) = \frac{\lambda_2(t) + \lambda_3(t)}{2} \tag{21}$$
Under the assumption of proportional hazards among groups, let βi:j denote the log hazard ratio (HRi:j) comparing group i versus group j, with group j as the reference, and where βi:j < 0 provides evidence that individuals in group i have lower risk than those in group j (1 ≤ i < j ≤ K). Then the elementary null hypotheses become
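$$H_{01}\colon e^{\beta_{2:1}} + e^{\beta_{3:1}} = 2 \tag{22}$$
$$H_{02}\colon e^{\beta_{1:2}} + e^{\beta_{3:2}} = 2 \tag{23}$$
$$H_{03}\colon e^{\beta_{1:3}} + e^{\beta_{2:3}} = 2 \tag{24}$$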
The individual coefficients are related such that
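$$e^{\beta_{1:2}}\, e^{\beta_{2:3}}\, e^{\beta_{3:1}} = \frac{\lambda_1(t)}{\lambda_2(t)} \cdot \frac{\lambda_2(t)}{\lambda_3(t)} \cdot \frac{\lambda_3(t)}{\lambda_1(t)} = 1, \quad\text{i.e.}\quad \beta_{1:2} + \beta_{2:3} + \beta_{3:1} = 0 \tag{25}$$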
the denominator of one term canceling with the numerator of the next. To evaluate H01 ∩H02, setting β1:2 = −β2:1 in (22) yields
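$$e^{-\beta_{1:2}} + e^{\beta_{3:1}} = 2, \quad\text{so that}\quad \beta_{3:1} = \log\left(2 - e^{-\beta_{1:2}}\right) \tag{26}$$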
while setting β2:3 = −β3:2 in (23) yields
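$$e^{\beta_{1:2}} + e^{-\beta_{2:3}} = 2, \quad\text{so that}\quad \beta_{2:3} = -\log\left(2 - e^{\beta_{1:2}}\right) \tag{27}$$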
Replacing the expressions above for β3:1 and β2:3 in (25) yields
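$$\beta_{1:2} - \log\left(2 - e^{\beta_{1:2}}\right) + \log\left(2 - e^{-\beta_{1:2}}\right) = 0 \tag{28}$$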
The left hand side of (28) is an increasing function in β1:2, and therefore it has a unique root given by β1:2 = 0. Replacing β1:2 with 0 in (22) and (23) then yields β1:2 = β2:3 = β3:1 = 0 that equals the joint null H01 ∩ H02 ∩ H03. Again H01 ∩ H02 implies H01 ∩ H02 ∩ H03.
The test of the intersection hypothesis for the closed testing procedure is provided by the 2 df test of β2:1 = β3:1 = 0. If significant at level α then tests of the elementary hypotheses can also be conducted at level α.
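To test the null hypothesis H01 in (22), estimates for β2:1 and β3:1, denoted by $\hat{\beta}_{2:1}$ and $\hat{\beta}_{3:1}$, along with their variance-covariance matrix, denoted by $\hat{\boldsymbol{\Sigma}}$, can be obtained by fitting a Cox PH model with group as a class variable with 3 levels and level 1 as the reference group. Then from (22) the test statistic is
$$X^2 = \frac{\left(e^{\hat{\beta}_{2:1}} + e^{\hat{\beta}_{3:1}} - 2\right)^2}{\hat{V}\left[e^{\hat{\beta}_{2:1}} + e^{\hat{\beta}_{3:1}}\right]} \tag{29}$$
where the variance in the denominator is obtained using the delta method, and the statistic is asymptotically distributed as chi-square on 1 df.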
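The null hypotheses in (23) and (24) can similarly be tested by noting that
$$(\hat{\beta}_{1:2}, \hat{\beta}_{3:2})^T = A\,(\hat{\beta}_{2:1}, \hat{\beta}_{3:1})^T \tag{30}$$
where
$$A = \begin{pmatrix} -1 & 0 \\ -1 & 1 \end{pmatrix} \tag{31}$$
so that, with $\hat{\boldsymbol{\Sigma}}$ the estimated covariance matrix of $(\hat{\beta}_{2:1}, \hat{\beta}_{3:1})$,
$$\widehat{\mathrm{Cov}}(\hat{\beta}_{1:2}, \hat{\beta}_{3:2}) = A \hat{\boldsymbol{\Sigma}} A^T \tag{32}$$
Then the test of H02 is constructed as in (29) using $\hat{\beta}_{1:2}$, $\hat{\beta}_{3:2}$ and $A\hat{\boldsymbol{\Sigma}}A^T$, and that of H03 using $\hat{\beta}_{1:3}$, $\hat{\beta}_{2:3}$ and their covariance matrix obtained in the same way.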
The Supplemental Material includes a SAS program and an R/latex program to conduct one-versus-others analyses for three groups. The supplement also provides a generalization to an analysis of four groups. Simulations also show that for time-to-event outcomes, as for the comparisons of means and proportions, the closed testing procedure is preferable to the other adjustments for multiple tests.
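The delta-method calculation for three groups can also be sketched in Python. This is a minimal illustration rather than the supplement's SAS or R code: the function names are invented here, and the coefficient estimates and covariance matrix are hypothetical numbers standing in for the output of a fitted Cox model with group 1 as the reference.

```python
from math import erfc, exp, sqrt

def one_vs_others_cox(b, S):
    """Test exp(b[0]) + exp(b[1]) = 2, where b holds the log hazard ratios of the
    two other groups versus the group of interest and S is their 2x2 covariance."""
    e0, e1 = exp(b[0]), exp(b[1])
    # delta method: the gradient of exp(b0) + exp(b1) is (exp(b0), exp(b1))
    var = e0 ** 2 * S[0][0] + e1 ** 2 * S[1][1] + 2 * e0 * e1 * S[0][1]
    x2 = (e0 + e1 - 2) ** 2 / var
    p = erfc(sqrt(x2 / 2))                 # chi-square upper tail on 1 df
    hr_others = (e0 + e1) / 2              # average hazard of the others relative to the group
    return hr_others, x2, p

def rereference(b, S):
    """Map (b2:1, b3:1) and covariance S to (b1:2, b3:2) and covariance A S A'."""
    A = [[-1.0, 0.0], [-1.0, 1.0]]
    b2 = [sum(A[r][k] * b[k] for k in range(2)) for r in range(2)]
    AS = [[sum(A[r][k] * S[k][c] for k in range(2)) for c in range(2)] for r in range(2)]
    S2 = [[sum(AS[r][k] * A[c][k] for k in range(2)) for c in range(2)] for r in range(2)]
    return b2, S2

b_ref1 = [0.10, 0.25]                          # hypothetical estimates of b2:1 and b3:1
S_ref1 = [[0.0004, 0.0002], [0.0002, 0.0005]]  # hypothetical covariance matrix

print("others vs group 1:", one_vs_others_cox(b_ref1, S_ref1))
b_ref2, S_ref2 = rereference(b_ref1, S_ref1)
print("others vs group 2:", one_vs_others_cox(b_ref2, S_ref2))
```

The first value returned is the hazard ratio of the other groups combined versus the given group, which is the direction in which the results are reported in the ALLHAT example that follows.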
Example: ALLHAT
The Antihypertensive and Lipid-lowering Treatment to Prevent Heart Attack Trial (ALLHAT) [7] compared the risks of cardiovascular outcomes in 33,357 patients who were randomly assigned to receive the diuretic Chlorthalidone (n =15,255) versus the ACE inhibitor Lisinopril (n =9,054) versus the calcium channel blocker Amlodipine (n =9,048). The main report [7] presents analyses comparing Lisinopril versus Chlorthalidone and Amlodipine versus Chlorthalidone for 12 major cardiovascular clinical outcomes, one designated as primary. As is common for comparison of two groups (Lisinopril and Amlodipine) versus a common control group (Chlorthalidone), the control group sample size was increased.
Herein we consider the three pairwise comparisons among the three groups using a Holm adjustment for multiple tests, and closed testing of each group versus the others combined starting with the 2-df test of the joint null hypothesis. Note that the statistical analysis comparing one group versus the others is based on combinations of coefficients comparing 2 groups at a time and thus is not affected by an imbalance among groups as in this example.
Of the 12 variables analyzed, 5 had at least one multiplicity-corrected test (pairwise Holm adjusted or closed one-versus-others) that met the criteria for significance at the 0.05 level two-sided. For each of these 5 outcomes, Table 2 presents the number of subjects (cases) who experienced each type of event within each treatment group and the corresponding event rate per 1000 patient-years. The rates of angina, combined cardiovascular disease and stroke were highest in the Lisinopril group, and those of congestive and combined heart failure were highest with Amlodipine, while the rates of angina, combined cardiovascular disease, congestive heart failure and combined heart failure were lowest with Chlorthalidone.
Table 2:
Number of subjects (cases) with each of the selected types of events and the corresponding rate per 1000 patient-years among subjects receiving the diuretic Chlorthalidone (C) versus the ACE inhibitor Lisinopril (L) versus the calcium channel blocker Amlodipine (A) in ALLHAT.
Outcome | Amlodipine (n = 9,048): Cases | Amlodipine: Rate | Lisinopril (n = 9,054): Cases | Lisinopril: Rate | Chlorthalidone (n = 15,255): Cases | Chlorthalidone: Rate |
---|---|---|---|---|---|---|
Angina | 950 | 23.6 | 1019 | 25.8 | 1567 | 23.2 |
Combined CVD | 2432 | 65.4 | 2514 | 69.2 | 3941 | 62.8 |
Congestive Heart Failure | 706 | 16.9 | 612 | 14.8 | 870 | 12.3 |
Combined Heart Failure | 578 | 13.7 | 471 | 11.3 | 724 | 10.2 |
Stroke | 377 | 8.9 | 457 | 10.9 | 675 | 9.5 |
Table 3 then presents the hazard ratios and p-values for these 5 outcomes. For each type of event the three pairwise tests are presented with Holm-adjusted p-values as well as the three one-versus-others comparisons using closed testing. Note that owing to the construction in (21) the models assess the hazard ratio of the other combined groups versus a given group rather than that of the one group versus the others.
Table 3:
Comparisons of the risk of cardiovascular outcomes among subjects receiving the diuretic Chlorthalidone (C) versus the ACE inhibitor Lisinopril (L) versus the calcium channel blocker Amlodipine (A). Hazard ratio and Wald-test p-value shown. For pairwise comparisons the Holm-adjusted p-value is shown. If the ordered p-values are p1 < p2 < p3 then the corrected values are 3p1, max(3p1, 2p2) and max(3p1, 2p2, p3). For closed testing the 2 df p-value of the joint null hypothesis is presented, and if ≤ 0.05 then the one-group-versus-others hypotheses are tested at the 0.05 level.
Outcome | Pairwise A vs C: HR | Pairwise A vs C: P | Pairwise L vs C: HR | Pairwise L vs C: P | Pairwise A vs L: HR | Pairwise A vs L: P | Closed 2 df: P | Closed Others vs A: HR | Closed Others vs A: P | Closed Others vs L: HR | Closed Others vs L: P | Closed Others vs C: HR | Closed Others vs C: P |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Angina |  |  | 1.11 | 0.0292 |  |  | 0.0304 |  |  | 0.91 | 0.0076 |  |  |
Combined CVD |  |  | 1.10 | 0.0005 |  |  | 0.0008 |  |  | 0.93 | 0.0009 | 1.07 | 0.0018 |
Congestive Heart Failure | 1.38 | <.0001 | 1.20 | 0.0009 | 1.14 | 0.0159 | <.0001 | 0.80 | <.0001 |  |  | 1.29 | <.0001 |
Combined Heart Failure | 1.35 | <.0001 |  |  | 1.22 | 0.0030 | <.0001 | 0.78 | <.0001 |  |  | 1.23 | 0.0002 |
Stroke |  |  | 1.15 | 0.0416 | 0.81 | 0.0090 | 0.0084 | 1.15 | 0.0298 | 0.84 | 0.0084 |  |  |
For angina, the pairwise test of Lisinopril versus Chlorthalidone was significant with an adjusted p = 0.0292. Further, the 2-df test of the joint null hypothesis was significant (p = 0.0304), and in closed testing the other groups combined had a 9% lower risk of angina than Lisinopril (HR = 0.91, p = 0.0076).
For combined cardiovascular disease, the pairwise test shows a 10% greater risk for Lisinopril versus Chlorthalidone (p = 0.0005) whereas the closed testing shows that the other groups combined have 7% lower risk (HR = 0.93, p = 0.0009) than Lisinopril, and further that the other groups have a 7% greater risk than Chlorthalidone (HR = 1.07, p = 0.0018). A similar pattern is observed for congestive heart failure and combined heart failure in which one or more pairwise comparison showed higher risk for either Amlodipine or Lisinopril versus Chlorthalidone but the closed testing showed that the other groups combined had a significantly higher risk than Chlorthalidone.
For stroke, the pairwise comparisons showed that Lisinopril had a significantly higher risk of stroke than did Chlorthalidone (HR = 1.15) and that Amlodipine had a significantly lower risk than Lisinopril (HR = 0.81). Alternately, closed testing showed that the others combined had higher risk than Amlodipine (HR = 1.15), and that the others combined had significantly lower risk than Lisinopril (HR = 0.84).
The results of the pairwise and closed testing versus the others are largely concordant. However, the closed one-versus-others results provide a less complicated and more global conclusion for some outcomes, such as where Chlorthalidone has significantly lower risks than the other two treatments combined for combined cardiovascular disease, congestive heart failure and combined heart failure.
In addition, we also computed Holm-corrected p-values for the comparison of each group versus the others (data not shown). By construction the Holm p-values were at least as large as the closed testing p-values since, for example, the Holm procedure penalizes the smallest of the three p-values by multiplying it by 3, and so on.
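The two kinds of adjusted p-values compared here can be sketched in Python. The sketch is illustrative: the function names are chosen here, the input p-values are hypothetical rather than taken from ALLHAT, and the closed testing adjustment shown is the three-group case in which each elemental hypothesis sits below the single 2 df joint test.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])
    adjusted, running = [0.0] * m, 0.0
    for rank, k in enumerate(order):
        # the multiplier shrinks from m to 1; the running maximum keeps the order monotone
        running = max(running, min(1.0, (m - rank) * pvals[k]))
        adjusted[k] = running
    return adjusted

def closed_adjust_k3(p_joint, pvals):
    """Closed testing with three groups: H0i is rejected at level alpha only if
    both the joint 2 df test and its own 1 df test have p <= alpha."""
    return [max(p_joint, p) for p in pvals]

p_elemental = [0.010, 0.200, 0.030]   # hypothetical unadjusted one-versus-others p-values
p_joint = 0.015                       # hypothetical 2 df joint p-value

print("Holm:  ", holm_adjust(p_elemental))
print("Closed:", closed_adjust_k3(p_joint, p_elemental))
```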
Discussion
The comparison of multiple alternative therapies for the treatment of a given condition is the hallmark of a comparative effectiveness assessment. While it might be useful in such situations to compare each of the K alternative treatments against each other, as in pairwise testing, it often could suffice from a public health perspective and cost-efficiency to assess whether any one treatment is better than all others combined. Herein we describe such analyses and show that testing each therapy versus the others combined can be conducted efficiently through the closed-testing principle.
The traditional pairwise testing approach consists of K(K − 1)/2 pairwise tests, while the proposed closed testing procedure for one versus the others requires $2^{K-1} - K + 1$ tests, counting all tests of intersection and elementary hypotheses. The closed testing approach yields fewer tests than pairwise testing for K = 3 or K = 4, and more tests for K ≥ 5. However, all individual tests in the closed testing procedure are conducted at level α, while marginally adjusting for multiplicity with the Holm procedure requires testing at a much smaller significance level (e.g., α/[K(K − 1)/2] for the most significant pairwise test).
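The two counts can be tabulated directly from the formulas quoted above. This short Python sketch is only a check of that arithmetic; the function names are chosen here.

```python
def n_pairwise(K):
    """Number of pairwise comparisons among K groups."""
    return K * (K - 1) // 2

def n_closed(K):
    """Count quoted above for closed one-versus-others testing."""
    return 2 ** (K - 1) - K + 1

for K in range(3, 7):
    print(f"K = {K}: pairwise = {n_pairwise(K)}, closed = {n_closed(K)}")
```

For K = 4 the closed testing count is 5, which matches the five hypotheses listed earlier as required to reject H01 with four groups.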
Note that the test of each group versus the others employs the total sample size N that will afford greater power than the sample size of 2(N/K) that would be employed in each of the pairwise tests. This is true for the three examples presented herein.
Closed one-versus-others testing herein is described in terms of test statistics that are based on combinations of effects or coefficients from pairwise 1:1 group comparisons. This approach can then be applied to any study regardless of variation of the sample sizes among groups. However, in the case of a balanced design with equal sample sizes (or approximately so), another approach might be to conduct the tests using the pooled comparator groups.
For example, consider the analysis of means in a balanced 3 group study. Then, the elemental hypothesis for group 1 becomes H0,1:23: μ1 = μ23 where μ23 is the mean of groups 2 and 3 combined in aggregate that in turn equals the unweighted average of the group means in a balanced study, such as
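$$\mu_{23} = \frac{n_2\mu_2 + n_3\mu_3}{n_2 + n_3} = \frac{\mu_2 + \mu_3}{2} \quad\text{when } n_2 = n_3 \tag{33}$$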
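H0,1:23 could be tested using a simple t-test based on $\hat{\mu}_1 - \hat{\mu}_{23}$. However, in a 4 or more group study we would then need to test the intersection hypotheses such as
$$H_{0,1:234} \cap H_{0,2:134}\colon\ \mu_1 = \mu_{234} \ \text{ and } \ \mu_2 = \mu_{134} \tag{34}$$
A T²-like test of this intersection hypothesis would then require an estimate of the covariance between $\hat{\mu}_1 - \hat{\mu}_{234}$ and $\hat{\mu}_2 - \hat{\mu}_{134}$, as would higher order intersection hypotheses.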
For three balanced groups, this approach also would provide a simplification of the Cox PH Model analysis. In that case the elemental hypothesis for group 1 versus the others can be tested using the Wald test of the coefficient $\hat{\beta}_{1:23}$ in a model with a binary covariate for group 1 versus the other two. Likewise the other coefficients $\hat{\beta}_{2:13}$ and $\hat{\beta}_{3:12}$ can be tested. Again, in a 4 or more group study such an approach would require the covariance between terms such as $\hat{\beta}_{1:234}$ and $\hat{\beta}_{2:134}$, which may be obtained using a sandwich estimator.
However, the pooled comparator approach should not be employed with a logistic regression model because the logistic model hypotheses (19) and (20) were specified on the log odds scale. Comparing group 1 versus a pooled comparator group that combines the other two groups would therefore not provide a valid test of (19). Note that (19) specifies that g(μ1) = [g(μ2) + g(μ3)]/2, whereas an analysis comparing group 1 to the pooled groups 2 and 3 tests the hypothesis g(μ1) = g[(μ2 + μ3)/2], which differs from (19) because the logit link g is nonlinear. In fact, the same applies to any model based on a member of the exponential family with a link function other than the identity function.
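The distinction between averaging on the log odds scale and pooling on the probability scale can be checked numerically. The following is a minimal sketch with hypothetical event probabilities for groups 2 and 3; g is taken to be the logit link, and the variable names are illustrative only.

```python
import math


def logit(p: float) -> float:
    """Logit link g(p) = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def expit(x: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical event probabilities in the two comparator groups.
mu2, mu3 = 0.10, 0.40

# Null value of g(mu1) under a hypothesis of the form (19):
# the average of the group log odds.
average_of_logits = (logit(mu2) + logit(mu3)) / 2.0

# Null value of g(mu1) implied by a balanced pooled comparator:
# the log odds of the averaged probability.
logit_of_average = logit((mu2 + mu3) / 2.0)

print(average_of_logits, logit_of_average)
print(expit(average_of_logits), expit(logit_of_average))
```

The two null values for group 1 differ, on the log odds scale and on the probability scale, so a test against the pooled comparator addresses a different hypothesis than (19). With the identity link the two expressions coincide.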
On the other hand, a Cox proportional hazards model in a balanced study with a pooled comparator group is valid since the analysis is conducted on the hazard scale. For example, with three groups (21) becomes λ1(t) = λ23(t) = [λ2(t) + λ3(t)]/2.
A further natural question is whether the closed testing of one versus the others could be combined with pairwise testing. For example, suppose the hypothesis H01 comparing group 1 versus the others is rejected using the closed testing procedure described herein. The question is whether the corresponding pairwise comparisons still require adjustment for multiplicity. Let H0,ij denote the null hypothesis μi = μj, 1 ≤ i < j ≤ K. Using K = 4 to illustrate the ideas, the individual hypotheses are then given by H0i and H0,ij with 1 ≤ i < j ≤ K. Under the closed testing procedure, rejecting (say) H0,12 requires rejecting all intersection hypotheses that include it. It is easy to show that H0,12 ∩ H01 and H0,12 ∩ H02 are each equivalent to H01 ∩ H02. However, H0,12 ∩ H03 is equivalent to μ1 = μ2 = (3μ3 − μ4)/2, which was not tested under the closed testing procedure for H01. So the answer seems to be that a further adjustment is required, or additional intersection hypotheses have to be tested (again, all at level α).
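The equivalences among intersection hypotheses for K = 4 can be verified with exact linear algebra. In the sketch below each hypothesis is written as a set of linear constraints c · μ = 0, and two sets of constraints are treated as equivalent when their rows span the same space. The helper names and the labels used for the pairwise hypothesis μ1 = μ2 are hypothetical and are introduced only for this illustration.

```python
from fractions import Fraction

K = 4


def one_vs_others(i, k=K):
    """Row c with c . mu = 0 encoding mu_i = average of the other k - 1 means."""
    return [Fraction(k - 1) if j == i else Fraction(-1) for j in range(1, k + 1)]


def pairwise(i, j, k=K):
    """Row c with c . mu = 0 encoding mu_i = mu_j."""
    return [Fraction(1) if m == i else Fraction(-1) if m == j else Fraction(0)
            for m in range(1, k + 1)]


def rank(rows):
    """Rank of a list of rows by exact Gauss-Jordan elimination."""
    rows = [list(r) for r in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][col] != 0:
                factor = rows[r][col] / rows[rk][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


def equivalent(a, b):
    """True when the two constraint sets have the same row space."""
    return rank(a) == rank(b) == rank(a + b)


h01_and_h02 = [one_vs_others(1), one_vs_others(2)]
h12_and_h01 = [pairwise(1, 2), one_vs_others(1)]
h12_and_h02 = [pairwise(1, 2), one_vs_others(2)]
h12_and_h03 = [pairwise(1, 2), one_vs_others(3)]
# mu1 = mu2 = (3 mu3 - mu4)/2, written as mu1 - mu2 = 0 and 2 mu1 - 3 mu3 + mu4 = 0.
stated_form = [pairwise(1, 2), [Fraction(2), Fraction(0), Fraction(-3), Fraction(1)]]

print(equivalent(h12_and_h01, h01_and_h02))
print(equivalent(h12_and_h02, h01_and_h02))
print(equivalent(h12_and_h03, h01_and_h02))
print(equivalent(h12_and_h03, stated_form))
```

Under this check the intersections of the pairwise hypothesis with H01 or with H02 coincide with H01 ∩ H02, whereas its intersection with H03 does not and instead matches μ1 = μ2 = (3μ3 − μ4)/2.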
Acknowledgment
The data from The Antihypertensive and Lipid-lowering Treatment to Prevent Heart Attack Trial (ALLHAT) study were provided by the National Heart, Lung and Blood Institute’s Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC).
Funding
This work was partially supported by grant U01-DK-098246 from the National Institute of Diabetes, Digestive and Kidney Diseases (NIDDK), NIH for the Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness (GRADE) Study.
The second author (I.B.) was also supported by the Samuel W. Greenhouse Biostatistics Research Enhancement Award.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
- [1] Lauer MS, Collins FS. Using science to improve the nation’s health system: NIH’s commitment to comparative effectiveness research. JAMA 2010; 303(21): 2182–2183.
- [2] Nathan DM, Buse JB, Kahn SE, Krause-Steinrauf H, Larkin ME, Staten M, Wexler D, Lachin JM. Rationale and design of the Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness Study (GRADE). Diabetes Care 2013; 36(8): 2254–2261.
- [3] Bretz F, Hothorn T, Westfall P. Multiple Comparisons Using R. CRC Press; 2011.
- [4] Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655–660.
- [5] McMillan-Price J, Petocz P, Atkinson F, O’Neill K, Samman S, Steinbeck K, Caterson I, Brand-Miller J. Comparison of 4 diets of varying glycemic load on weight loss and cardiovascular risk reduction in overweight and obese young adults. Archives of Internal Medicine 2006; 166: 1466–1475.
- [6] Treiman DM, Meyers PD, Walton NY, Collins JF, Colling C, Rowan AJ, Handforth A, Faught E, Calabrese VP, Uthman BM, Ramsay RE, Mamdani MB, Yagnik P, Jones JC, Barry E, Boggs JG, Kanner AM, for the Veterans Affairs Status Epilepticus Cooperative Study Group. A comparison of four treatments for generalized convulsive status epilepticus. The New England Journal of Medicine 1998; 339: 792–798.
- [7] ALLHAT Officers and Coordinators for the ALLHAT Collaborative Research Group. Major outcomes in high-risk hypertensive patients randomized to angiotensin-converting enzyme inhibitor or calcium channel blocker vs diuretic: the Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT). JAMA 2002; 288: 2981–2997.