Author manuscript; available in PMC: 2010 Feb 23.
Published in final edited form as: Biometrics. 2005 Jun;61(2):532–539. doi: 10.1111/j.1541-0420.2005.00322.x

Adjusting O'Brien's Test to Control Type I Error for the Generalized Nonparametric Behrens–Fisher Problem

Peng Huang 1,*, Barbara C Tilley 1, Robert F Woolson 1, Stuart Lipsitz 1
PMCID: PMC2827210  NIHMSID: NIHMS170925  PMID: 16011701

Summary

O'Brien (1984, Biometrics 40, 1079–1087) introduced a simple nonparametric test procedure for testing whether multiple outcomes in one treatment group have consistently larger values than outcomes in the other treatment group. We first explore the theoretical properties of O'Brien's test. We then extend it to the general nonparametric Behrens–Fisher hypothesis problem when no assumption is made regarding the shape of the distributions. We provide conditions when O'Brien's test controls its error probability asymptotically and when it fails. We also provide adjusted tests when the conditions do not hold. Throughout this article, we do not assume that all outcomes are continuous. Simulations are performed to compare the adjusted tests to O'Brien's test. The difference is also illustrated using data from a Parkinson's disease clinical trial.

Keywords: Bonferroni, Global statistical test, Multivariate test, Rank-sum-type test, Rank test

1. Introduction

Parkinson's disease is one of the most common adult-onset neurodegenerative disorders. In recent years there has been an intensive search for neuroprotective therapies that can slow, stop, or reverse the degenerative process. A multicenter controlled clinical trial of Coenzyme Q10 in early Parkinson's disease (QE2 trial), organized by the University of California, San Diego, in conjunction with the Parkinson Study Group, was designed to determine whether Coenzyme Q10 could slow the functional decline in Parkinson's disease (Shults et al., 2002). Multiple outcomes were collected to measure disability. These included the mental (mentation), motor, and activities of daily living (ADL) subscales of the Unified Parkinson's Disease Rating Scale (UPDRS), and the Schwab and England ADL (SEADL) score. The changes in these outcomes from baseline to the last visit at 16 months were used to compare the treatments.

Various multivariate tests have been proposed to compare two groups with multivariate outcomes. To list a few, there are the global statistical tests given by O'Brien (1984), Tang, Gnecco, and Geller (1989), Tang, Geller, and Pocock (1993), Tang and Lin (1997), Tang and Geller (1997), Lefkopoulou, Moore, and Ryan (1989), Lefkopoulou and Ryan (1993), and Pocock, Geller, and Tsiatis (1987), and nonparametric multivariate methods by Puri and Sen (1985). Most of these tests are derived under the null hypothesis that the outcome distributions from the two comparison groups are identical. Such a condition of identical distribution functions ensures that the proposed test is distribution-free under the null hypothesis. This assumption is imposed for mathematical convenience because it allows the formulation of an exact significance level (α) critical region for the test. However, this assumption is not appropriate in the QE2 study. For example, a test of equal variance in mental score between the placebo group and the treatment group gives a p value of 0.002 (see details in Section 4). Ignoring unequal variance when using conventional tests such as Hotelling's T² test or a multivariate Wilcoxon test can result in biased inference.

Pratt (1964) and Van der Vaart (1961) have studied how type I errors of Mann–Whitney–Wilcoxon and the normal scores tests are affected by the different distribution shapes or variances of the two treatment groups. Miller (1986) discussed how type I error of a t-test is affected by the unequal variance. A general nonparametric problem of comparing two groups without the assumption for the shapes of their distributions is called a nonparametric Behrens–Fisher problem that has been studied as early as 1963 by Potthoff. Fligner and Policello (1981) and Fligner and Rust (1982) provided nonparametric tests to compare medians. Recent work includes Troendle's (2002) numerical likelihood ratio test, Brunner, Munzel, and Puri's test (1999), and Munzel and Tamhane's test (2002) for a univariate outcome, and Brunner, Munzel, and Puri's (2002) test for multivariate outcomes. For multivariate outcomes, Brunner et al. (2002) proposed Wald-type and ANOVA-type tests for the general nonparametric Behrens–Fisher hypothesis problem with null hypothesis of the form

$$H_0: p_v = P(X_{iv} < Y_{jv}) + \tfrac{1}{2} P(X_{iv} = Y_{jv}) = \tfrac{1}{2}, \quad v = 1, \ldots, k, \tag{1}$$

where Xiv and Yjv are the vth outcome from the ith subject in group 1 and the jth subject in group 2, respectively (v = 1, …, k). Parameter pv was called relative treatment effect for the vth outcome by Brunner et al. (2002).

In Parkinson's disease clinical trials, the goal is often to test whether one treatment is more effective than the other treatment on multiple outcomes. The null hypothesis is that the two treatments are equally effective. The alternative is that one treatment is preferred over the other treatment. Similar to Hotelling's T² test, the Wald-type and ANOVA-type tests proposed by Brunner et al. (2002) assess whether two treatment groups differ. Their null hypothesis can be rejected if a treatment greatly improves some outcomes and also greatly worsens some other outcomes at the same time. O'Brien (1984) proposed a simple rank-sum-type test to assess whether outcome measures from one group are consistently larger than outcome measures from the other group. Hence, O'Brien's test is appropriate to use under such a setting.

Adjusting for covariates in a nonparametric Behrens–Fisher problem is challenging, especially when covariates are continuous. When all covariates are categorical (or ordinal) with finite number of possible values, there are at most a finite number of covariate value combinations. If we introduce several dependent variables, one for each combination of the covariate values, then the original hypothesis testing problem can be reformulated as a multivariate hypothesis testing problem without any covariate. The split-plot factorial designs considered by Brunner et al. (1999) are one example of such a setting. Because O'Brien's test uses rank-sums, the multivariate problem is reduced to a univariate problem. When sample size is large, the correlation among rank-sums becomes small. Another advantage of O'Brien's test is that it is relatively easy to extend to cases with both continuous and categorical covariates and repeated measures. O'Brien (1984) also showed that the rank-sum-type test is robust when the sample size is smaller than the number of outcomes and when the distribution is skewed or there are outliers. Sankoh et al. (1999) also evaluated the performance of O'Brien's test under various covariance structures through simulation.

O'Brien's rank-sum-type test is being widely used in clinical research including studies in neurology, HIV, cancer, health services, psychiatry, and autoimmune disease. For example, it was used in a randomized clinical trial in dermatology (Kaufman et al., 1998); a randomized trial in multiple sclerosis (Li, Zhao, and Paty, 2001); an observational study comparing women with and without perimenstrual asthma (Shames et al., 1998); and for the secondary analyses of data from a series of rheumatoid arthritis clinical trials (Tilley et al., 2000). Irrespective of its numerous applications in medical research, the properties of O'Brien's rank-sum-type test have been investigated primarily through simulations. The theoretical justification for the test has not been established.

The major goal of this article is twofold: We first derive the theoretical properties of O'Brien's rank-sum-type test. This provides the necessary foundation to use O'Brien's rank-sum-type test and to understand its limitations. In Section 2, we demonstrate that the rank-sum-type test is neither distribution-free nor asymptotically distribution-free for testing the general Behrens–Fisher hypothesis problem (1). Simulations in Section 3 show that for both large and small samples, the actual significance level of O'Brien's rank-sum-type test can exceed the nominal level when the means are the same but the variances from both samples differ. We then provide an adjustment of O'Brien's test so that its use can be extended to the general Behrens–Fisher hypothesis problem. Although O'Brien (1984) considered only continuous distributions, the results presented in this article do not require all outcomes to be continuous. Section 2 gives the asymptotic properties of O'Brien's test. We provide conditions when O'Brien's test controls the type I error probability asymptotically and when it fails. Based on a consistent estimate for the variance of O'Brien's test, we propose a modified test that controls the significance level when the conditions do not hold. The new test reduces to O'Brien's test when the conditions hold. Section 3 compares the type I errors of O'Brien's test and our modified test numerically through simulation. In Section 4, we illustrate the difference of these tests using data from a Parkinson's disease clinical trial.

2. Notation and Asymptotic Distribution

Consider a randomized clinical trial, with m independent subjects randomized to treatment 1 (say, the placebo), and n independent subjects randomized to treatment 2 (say, the new treatment). Suppose there are k different outcomes of interest. Outcomes are coded such that larger (or smaller) values are preferred for all outcomes. Let Xi = (Xi1, …, Xik) be the multiple outcomes from subject i in treatment group 1 (i = 1, …, m), and let Yj = (Yj1, …, Yjk) be the multiple outcomes from subject j in treatment group 2 (j = 1, …, n). Suppose the Xi's are independent and identically distributed, with joint cumulative distribution function F(t1, …, tk) = P(Xi1 ≤ t1, …, Xik ≤ tk), and the Yj's are independent and identically distributed, with joint cumulative distribution function G(t1, …, tk) = P(Yj1 ≤ t1, …, Yjk ≤ tk). Denote θv = P(Xiv < Yjv) − P(Xiv > Yjv) for v = 1, …, k, and the middistribution functions $F_u^o(t) = P(X_u < t) + \tfrac{1}{2} P(X_u = t)$ and $G_u^o(t) = P(Y_u < t) + \tfrac{1}{2} P(Y_u = t)$ for u = 1, …, k. Throughout the article, we impose some regularity conditions on the outcomes to rule out degenerate distributions and redundant parameters: $\mathrm{Var}[G_v^o(X_v)] > 0$ and $\mathrm{Var}[F_v^o(Y_v)] > 0$ for all v = 1, …, k. An equivalent form of null hypothesis (1) is

$$H_0: \theta_1 = \cdots = \theta_k = 0. \tag{2}$$

Note that, when all outcomes are continuous, (2) reduces to the simpler hypothesis form $H_0: P(X_{iv} < Y_{jv}) = \tfrac{1}{2}$, v = 1, …, k.
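The two parameterizations in (1) and (2) can be compared on data by counting pairs. The following is a minimal sketch for a single outcome; the function names and the small ordinal samples are hypothetical and chosen only to show that tied observations are handled and that θ = 2p − 1.

```python
def relative_effect(x, y):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) from two univariate samples."""
    less = sum(1 for xi in x for yj in y if xi < yj)
    ties = sum(1 for xi in x for yj in y if xi == yj)
    return (less + 0.5 * ties) / (len(x) * len(y))


def theta(x, y):
    """Estimate theta = P(X < Y) - P(X > Y) from two univariate samples."""
    less = sum(1 for xi in x for yj in y if xi < yj)
    greater = sum(1 for xi in x for yj in y if xi > yj)
    return (less - greater) / (len(x) * len(y))


# Hypothetical ordinal scores with ties: same center, different spread.
x = [-2, -2, 0, 2, 2]
y = [-1, 0, 0, 0, 1]
p_hat = relative_effect(x, y)
theta_hat = theta(x, y)
```

In this illustration the two samples differ clearly in dispersion, yet both estimates sit at their null values, which is the situation the Behrens–Fisher formulation is meant to cover.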

Let N = m + n be the total number of observations in the sample. For the vth outcome (v = 1, …, k), we rank the observations from all N subjects X1v, …, Xmv, and Y1v, …, Ynv, regardless of treatment. Let Rx,iv = midrank(Xiv), and Ry,jv = midrank(Yjv). The midrank of an observation is defined as either the regular rank when the observation is not tied or the average rank among the tied observations (Lehmann, 1975). The total rank from the ith individual in treatment group 1 is defined as the sum of the ranks over all k outcomes: $R_{xi} = \sum_{v=1}^{k} R_{x,iv}$. Similarly, the total rank from the jth individual in treatment group 2 is defined as $R_{yj} = \sum_{v=1}^{k} R_{y,jv}$. O'Brien's (1984) rank-sum-type test ψ1 is defined as the regular univariate two-sample t-test with pooled standard deviation for the two rank-sum samples: Rx1, Rx2, …, Rxm and Ry1, Ry2, …, Ryn. In particular, O'Brien's test statistic can be written as

$$T_1 = \frac{\bar{R}_y - \bar{R}_x}{\hat{\sigma} \sqrt{\frac{1}{m} + \frac{1}{n}}}, \tag{3}$$

or, if there is concern about possible unequal variances of the ranks, Welch-modified two-sample t-test statistic can be used

$$T_2 = \frac{\bar{R}_y - \bar{R}_x}{\sqrt{\frac{\hat{\sigma}_x^2}{m} + \frac{\hat{\sigma}_y^2}{n}}}, \tag{4}$$

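where $\bar{R}_x = \sum_{i=1}^{m} R_{xi}/m$, $\bar{R}_y = \sum_{j=1}^{n} R_{yj}/n$, $\hat{\sigma}_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (R_{xi} - \bar{R}_x)^2$, $\hat{\sigma}_y^2 = \frac{1}{n-1} \sum_{j=1}^{n} (R_{yj} - \bar{R}_y)^2$, and $\hat{\sigma}^2 = \frac{(m-1)\hat{\sigma}_x^2 + (n-1)\hat{\sigma}_y^2}{m+n-2}$. O'Brien's test ψℓ rejects H0 at significance level α whenever $T_\ell > t_{df,\alpha}$ or $T_\ell < -t_{df,\alpha}$ (ℓ = 1, 2) for the two different one-sided alternatives, respectively; or it rejects H0 whenever $|T_\ell| > t_{df,\alpha/2}$ for a two-sided alternative, where $t_{df,\alpha}$ is the 100(1 − α)th percentile of the t distribution with df degrees of freedom. Here, df = N − 2 when ℓ = 1 and $df = [\zeta^2/(m-1) + (1-\zeta)^2/(n-1)]^{-1}$ when ℓ = 2, with $\zeta = (\hat{\sigma}_x^2/m) / (\hat{\sigma}_x^2/m + \hat{\sigma}_y^2/n)$. Theorem 1 gives the asymptotic distribution of statistics T1 and T2 under the null hypothesis (2).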
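The construction of T1 and T2 can be sketched in a few lines: rank each outcome over the pooled sample, sum the midranks per subject, and apply the pooled and Welch forms of the two-sample t statistic to the rank sums. The function names and the toy data below are hypothetical, and the sketch stops at the statistics themselves (no critical values or degrees of freedom).

```python
import math


def midranks(values):
    """Midranks: tied observations share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def obrien_statistics(x, y):
    """x: m rows, y: n rows, each row holding k outcomes. Returns (T1, T2)."""
    m, n, k = len(x), len(y), len(x[0])
    rank_sum = [0.0] * (m + n)
    for v in range(k):
        pooled = [row[v] for row in x] + [row[v] for row in y]
        for idx, r in enumerate(midranks(pooled)):
            rank_sum[idx] += r
    rx, ry = rank_sum[:m], rank_sum[m:]
    mean_x, mean_y = sum(rx) / m, sum(ry) / n
    var_x = sum((r - mean_x) ** 2 for r in rx) / (m - 1)
    var_y = sum((r - mean_y) ** 2 for r in ry) / (n - 1)
    pooled_var = ((m - 1) * var_x + (n - 1) * var_y) / (m + n - 2)
    t1 = (mean_y - mean_x) / math.sqrt(pooled_var * (1 / m + 1 / n))
    t2 = (mean_y - mean_x) / math.sqrt(var_x / m + var_y / n)
    return t1, t2


# Hypothetical data: 3 subjects per group, k = 2 outcomes.
x = [[1, 2], [2, 1], [3, 3]]
y = [[4, 5], [5, 4], [6, 6]]
t1, t2 = obrien_statistics(x, y)
```

With equal group sizes the two denominators coincide algebraically, so T1 and T2 agree here; they separate only when m ≠ n.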

Theorem 1

Suppose m/n → λ as N = (m + n) → ∞ for some finite constant 0 < λ < +∞. Then, under the null hypothesis (2), both statistics T1 and T2 defined by (3) and (4) converge in distribution to a normal distribution with mean 0 and variances h1 and h2, respectively, as N → ∞, where

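$$\begin{aligned}
h_1 &= \frac{\sum_{u=1}^{k} \sum_{v=1}^{k} (1+\lambda)^2 (a_{uv} + b_{uv}\lambda)}{\sum_{u=1}^{k} \sum_{v=1}^{k} \left[ e_{uv}\lambda^3 + (b_{uv} + 2f_{uv})\lambda^2 + (a_{uv} + 2q_{uv})\lambda + p_{uv} \right]}, \\
h_2 &= \frac{\sum_{u=1}^{k} \sum_{v=1}^{k} (1+\lambda)^2 (a_{uv} + b_{uv}\lambda)}{\sum_{u=1}^{k} \sum_{v=1}^{k} \left[ b_{uv}\lambda^3 + (e_{uv} + 2q_{uv})\lambda^2 + (p_{uv} + 2f_{uv})\lambda + a_{uv} \right]}, \\
a_{uv} &= \mathrm{cov}(G_u^o(X_u), G_v^o(X_v)), \quad b_{uv} = \mathrm{cov}(F_u^o(Y_u), F_v^o(Y_v)), \quad e_{uv} = \mathrm{cov}(F_u^o(X_u), F_v^o(X_v)), \\
f_{uv} &= \mathrm{cov}(F_u^o(X_u), G_v^o(X_v)), \quad p_{uv} = \mathrm{cov}(G_u^o(Y_u), G_v^o(Y_v)), \quad \text{and} \quad q_{uv} = \mathrm{cov}(G_u^o(Y_u), F_v^o(Y_v)).
\end{aligned} \tag{5}$$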

Proof of Theorem 1 is given in the Appendix. Simple algebra shows that h1 = h2 = 1 when F = G. This establishes the asymptotic validity of O'Brien's rank-sum-type test for the null hypothesis of type H0 : F = G. In general, we have h1 ≠ 1 and h2 ≠ 1 when F ≠ G.
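The "simple algebra" behind h1 = h2 = 1 can be spelled out as follows. This is a sketch under the stated assumption F = G; the shorthand c_uv for the common covariance is introduced here and does not appear in the original.

```latex
% Assumption: F = G, so F_u^o = G_u^o and X, Y share one distribution.
% Then all six covariances in (5) coincide; call the common value c_{uv}.
\begin{align*}
a_{uv} = b_{uv} = e_{uv} = f_{uv} = p_{uv} = q_{uv} &= c_{uv}, \\
\text{numerator of } h_1, h_2: \quad
  \sum_{u,v} (1+\lambda)^2 (c_{uv} + c_{uv}\lambda)
  &= (1+\lambda)^3 \sum_{u,v} c_{uv}, \\
\text{denominator of } h_1, h_2: \quad
  \sum_{u,v} c_{uv}\,(\lambda^3 + 3\lambda^2 + 3\lambda + 1)
  &= (1+\lambda)^3 \sum_{u,v} c_{uv}, \\
\Rightarrow \quad h_1 = h_2 &= 1
  \quad \text{provided } \sum_{u,v} c_{uv} > 0 .
\end{align*}
```

Both denominators in (5) collapse to the same cubic in λ because, once the six covariances agree, the coefficient pattern is 1, 3, 3, 1 in either ordering.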

To extend the use of O'Brien's rank-sum-type test for the general Behrens–Fisher null hypothesis problem (1), we first note that the null hypothesis (2) need not imply F = G. There are families of distributions satisfying (2) but with quite different underlying distributions, e.g., equal medians but unequal dispersion. If all outcomes have arbitrary nondegenerate distributions that are symmetric around zero, i.e., P(Xv ≤ −t) = P(Xv ≥ t) and P(Yv ≤ −t) = P(Yv ≥ t) (v = 1, …, k), then the parameters θv = P(Xv < Yv) − P(Xv > Yv) = 0 for all v = 1, …, k. However, F and G can still be quite different in their shapes. Hence, O'Brien's rank-sum-type test ψ1 or ψ2 is neither distribution-free nor asymptotically distribution-free under the null hypothesis of the general Behrens–Fisher problem (1). The following simple example shows how h1 and h2 depend on the underlying distribution.

Example

Suppose Xi1, …, Xik are independent, identically distributed with Uniform(−1, 1) distribution, and the Yj1, …, Yjk are independent but not identically distributed, Yju ~ Uniform(−ru, ru) (u = 1, …, k). Let m = n. Note that E(Xiu) = E(Yju) = θu = 0, but Var(Xiu) = 1/3 and $\mathrm{Var}(Y_{ju}) = r_u^2/3$. Then, the h1 and h2 in Theorem 1 are

$$h_1 = h_2 = \frac{\sum_{u=1}^{k} 4\left(r_u^2 + \frac{1}{r_u^2}\right)}{\sum_{u=1}^{k} (1 + r_u^2)\left(1 + \frac{1}{r_u}\right)^2}.$$

Thus, for uniform distributions, h1 ≥ 1 and h2 ≥ 1 for all ru > 0 (u = 1, …, k), and h1 = h2 = 1 if and only if r1 =…= rk = 1 which is the same as F = G. O'Brien's (1984) simulations considered only a special case of F = G; under this condition, h1 and h2 are reduced to 1. Hence, his simulations demonstrated a proper control of the type I probabilities. Our simulation in Section 3 shows that significance levels of ψ1 and ψ2 may not be preserved both for small and large sample size when distributions from the treatment groups have different shapes.

It is seen that h1 and h2 are functions of the dependence of the k outcomes. Since the t distribution converges asymptotically to a standard normal distribution when N → ∞, the type I error of O'Brien's test ψ1 converges to $\Phi(-z_\alpha/\sqrt{h_1})$ for one-sided alternatives, and converges to $2\Phi(-z_{\alpha/2}/\sqrt{h_1})$ for two-sided alternatives, where Φ(·) is the cumulative distribution function of the standard normal distribution, and zα is defined by Φ(zα) = 1 − α. Similar results hold for test ψ2. Theorem 1 shows that, for any 0 < α < 0.5, O'Brien's test ψ1 (or ψ2) controls its type I error asymptotically if and only if h1 = 1 (or h2 = 1). Thus O'Brien's test inflates the type I error asymptotically when h1 > 1 (or h2 > 1), and is too conservative when h1 < 1 (or h2 < 1).
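The limiting type I error formulas above are easy to evaluate numerically. The sketch below uses only Python's standard library; the function name is hypothetical, and the values of h passed to it are illustrative inputs rather than quantities estimated in this article.

```python
from statistics import NormalDist


def asymptotic_type1_error(h, alpha=0.05, two_sided=True):
    """Limiting rejection rate of the unadjusted test when T ~ N(0, h)."""
    std_normal = NormalDist()
    if two_sided:
        z = std_normal.inv_cdf(1 - alpha / 2)
        return 2 * std_normal.cdf(-z / h ** 0.5)
    z = std_normal.inv_cdf(1 - alpha)
    return std_normal.cdf(-z / h ** 0.5)


# Illustrative variance factors: exact control, inflation, conservatism.
levels = {h: asymptotic_type1_error(h) for h in (1.0, 1.5, 0.5)}
```

For example, a variance factor of h = 1.5 pushes the two-sided rate at nominal α = 0.05 to roughly 0.11, while h = 0.5 drives it well below 0.05.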

Based on Theorem 1, a direct adjustment to O'Brien's tests ψ1 and ψ2 is to modify their test statistics in (3) and (4) by using some consistent estimates of h1 and h2 in the denominators, i.e.,

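$$T_{1a} = \frac{\bar{R}_y - \bar{R}_x}{\hat{\sigma} \sqrt{\hat{h}_1 \left(\frac{1}{m} + \frac{1}{n}\right)}}, \qquad T_{2a} = \frac{\bar{R}_y - \bar{R}_x}{\sqrt{\hat{h}_2 \left(\frac{\hat{\sigma}_x^2}{m} + \frac{\hat{\sigma}_y^2}{n}\right)}}, \tag{6}$$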

respectively, where $\hat{h}_\ell$ is a consistent estimate of $h_\ell$ under general distribution functions F and G; and let the adjusted statistic $T_{\ell a}$ take the same degrees of freedom as that of $T_\ell$ (ℓ = 1, 2). The resulting adjusted tests are denoted as ψ1a and ψ2a. Consistent estimates of h1 and h2 can be obtained by using empirical estimates of F and G. Define the midranks Ry(xiu) = the midrank of xiu among {xiu, y1u, …, ynu}; Rx0(xiu) = the midrank of xiu among {x1u, …, xmu}; Rx(yju) = the midrank of yju among {yju, x1u, …, xmu}; and Ry0(yju) = the midrank of yju among {y1u, …, ynu}. Let A1 and A2 be two m × k matrices with (i, u) elements $\{2R_y(x_{iu}) - 2 - n + n\hat{\theta}_u\}$ and {2Rx0(xiu) − 1 − m}, respectively; and let B1 and B2 be two n × k matrices with (j, u) elements $\{2R_x(y_{ju}) - 2 - m - m\hat{\theta}_u\}$ and {2Ry0(yju) − 1 − n}, respectively, where $\hat{\theta}_u = \sum_{i=1}^{m} \sum_{j=1}^{n} \{I[x_{iu} < y_{ju}] - I[x_{iu} > y_{ju}]\}/(mn)$, which is an unbiased estimate of θu. The indicator I[E] is defined by I[E] = 1 if event E is true, and I[E] = 0 otherwise. We define

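$$\hat{h}_1 = \frac{N^2}{mn} \times \frac{J^T (A_1^T A_1 + B_1^T B_1) J}{J^T \{(A_1 + A_2)^T (A_1 + A_2) + (B_1 + B_2)^T (B_1 + B_2)\} J}, \qquad \hat{h}_2 = \frac{N^2 J^T (A_1^T A_1 + B_1^T B_1) J}{J^T \{n^2 (A_1 + A_2)^T (A_1 + A_2) + m^2 (B_1 + B_2)^T (B_1 + B_2)\} J}, \tag{7}$$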

where J is a vector of 1's. The asymptotic distribution of T1a and T2a can be established through the following theorem.

Theorem 2

Under the conditions of Theorem 1, random variables T1a and T2a with h^1 and h^2 given by (7) converge in distribution to a standard normal distribution as (m/n) → λ and (m + n) → ∞.

Proof of Theorem 2 is given in the Appendix. The adjusted test statistics T1a and T2a will have the same asymptotic distributions as T1 and T2, respectively, if the assumption F = G is true. They deviate from one another when the shapes of F and G differ.
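The estimates in (7) reduce to sums of squared row sums, because a quadratic form such as JᵀMᵀMJ equals the squared length of the vector MJ. The following is a minimal sketch of that computation, not the authors' implementation; the helper names and the toy data are hypothetical. It uses the fact that each midrank in the definitions of A1, A2, B1, and B2 is a count of smaller observations plus half the ties, shifted by a constant.

```python
def below(value, sample):
    """Count of sample points below `value`, with ties counted as one half."""
    return sum(1.0 if s < value else 0.5 if s == value else 0.0 for s in sample)


def h_estimates(x, y):
    """x: m rows, y: n rows, each row holding k outcomes. Returns (h1_hat, h2_hat)."""
    m, n, k = len(x), len(y), len(x[0])
    a1 = [0.0] * m   # row sums of A1, i.e. the vector A1 J
    a12 = [0.0] * m  # row sums of A1 + A2
    b1 = [0.0] * n   # row sums of B1
    b12 = [0.0] * n  # row sums of B1 + B2
    for u in range(k):
        xs = [row[u] for row in x]
        ys = [row[u] for row in y]
        # theta_hat_u: mean over all pairs of I[x < y] - I[x > y]
        theta_u = sum((xi < yj) - (xi > yj) for xi in xs for yj in ys) / (m * n)
        for i, xi in enumerate(xs):
            # R_y(x) = 1 + below(x, ys) and R_x0(x) = 0.5 + below(x, xs)
            e1 = 2 * below(xi, ys) - n + n * theta_u
            e2 = 2 * below(xi, xs) - m
            a1[i] += e1
            a12[i] += e1 + e2
        for j, yj in enumerate(ys):
            # R_x(y) = 1 + below(y, xs) and R_y0(y) = 0.5 + below(y, ys)
            e1 = 2 * below(yj, xs) - m - m * theta_u
            e2 = 2 * below(yj, ys) - n
            b1[j] += e1
            b12[j] += e1 + e2

    def sq(rows):
        return sum(r * r for r in rows)  # J^T M^T M J = ||M J||^2

    num = sq(a1) + sq(b1)
    total = m + n
    h1 = (total ** 2 / (m * n)) * num / (sq(a12) + sq(b12))
    h2 = total ** 2 * num / (n ** 2 * sq(a12) + m ** 2 * sq(b12))
    return h1, h2


# Hypothetical data: k = 2 outcomes, same center, group 1 more dispersed.
x = [[-2, -2], [-2, -2], [2, 2], [2, 2]]
y = [[-1, -1], [0, 0], [0, 0], [1, 1]]
h1_hat, h2_hat = h_estimates(x, y)
```

Given T1 and T2, the adjusted statistics in (6) then follow by dividing each by the square root of the corresponding estimate. When the two samples are identical, both estimates equal 1, which mirrors the F = G case.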

3. Simulations

In this section, we explore how type I errors are affected when F and G have different shapes. Because a Parkinson's disease trial is used as our example, and many Parkinson's disease clinical outcomes are ordinal variables with five different levels (“normal,” “mild,” “moderate,” “severe,” and “most serious”), we generated ordinal data with five levels: −2, −1, 0, 1, and 2. Data are simulated under the null hypothesis θ1 = … = θk = 0 but with F ≠ G. In particular, we generate samples from F and G with zero means but different variances. We evaluate the type I error rate of O'Brien's tests ψ1, ψ2 as well as our modified tests ψ1a, ψ2a. Simulations presented in Tables 1–3 quantify the type I errors of ψ1, ψ2, ψ1a, and ψ2a when h1 ≠ 1 and h2 ≠ 1. Recall that m is the sample size in treatment group 1, i.e., Xi = (Xi1, …, Xik), (i = 1, …, m), and n is the sample size in treatment group 2, i.e., Yj = (Yj1, …, Yjk), (j = 1, …, n). For chosen k = 2 and 10, the outcomes (Xi1, …, Xik) are generated according to the following formula: $X_{iu} = -2 I[X^{*}_{iu} < r_{11}] - I[r_{11} \le X^{*}_{iu} < r_{12}] + I[r_{13} \le X^{*}_{iu} < r_{14}] + 2 I[X^{*}_{iu} \ge r_{14}]$, where $X^{*}_{iu} = \rho^{1/2} X^{**}_{i1} + (1 - \rho)^{1/2} X^{**}_{iu}$ (u > 1), and $X^{**}_{i1}, \ldots, X^{**}_{ik}$ are iid Uniform(−1, 1). (Yj1, …, Yjk) are generated similarly, but with r1 = (r11, r12, r13, r14) replaced by r2 = (r21, r22, r23, r24). For illustration purposes, r1 = (−0.1, 0, 0, 0.1) and r2 = (−0.9, −0.8, 0.8, 0.9) in all tables. Type I error rates are presented only for nominal level α = 0.05. Similar results were seen for nominal level α = 0.01. Three types of rejection rules are shown in each table: the two-sided test, where H0 is rejected for large observed absolute values of the test statistic; one-sided test 1, where H0 is rejected for large observed values of the test statistic; and one-sided test 2, where H0 is rejected for small observed values.
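The data-generating scheme just described can be sketched as follows. The function name and the seed are hypothetical, and one step is an assumption: the text gives the mixing formula only for u > 1, so the sketch uses the first uniform variable unchanged as the latent variable for u = 1.

```python
import random


def ordinal_sample(size, k, rho, cuts, rng):
    """Draw `size` subjects with k correlated five-level ordinal outcomes."""
    r1, r2, r3, r4 = cuts
    rows = []
    for _ in range(size):
        base = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        # Assumption: the first latent variable is used as is; the source
        # states the mixing formula only for u > 1.
        latent = [base[0]] + [
            rho ** 0.5 * base[0] + (1.0 - rho) ** 0.5 * base[u]
            for u in range(1, k)
        ]
        # Cut the latent value into the levels -2, -1, 0, 1, 2.
        rows.append([
            -2 * (z < r1) - (r1 <= z < r2) + (r3 <= z < r4) + 2 * (z >= r4)
            for z in latent
        ])
    return rows


rng = random.Random(2005)  # arbitrary seed, for reproducibility only
x = ordinal_sample(20, 2, 0.9, (-0.1, 0.0, 0.0, 0.1), rng)    # group 1, cut points r1
y = ordinal_sample(20, 2, 0.9, (-0.9, -0.8, 0.8, 0.9), rng)   # group 2, cut points r2
```

With cut points r1 the middle level 0 is never produced, so group 1 concentrates on the outer levels, whereas cut points r2 put most of group 2's mass at 0. Both sets of cut points are symmetric about zero, which keeps the means equal while the variances differ.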

Table 1.

Observed rejection rate (type I error) under the null hypothesis (2) with m = n. ψ1 and ψ2 are O'Brien's tests assuming equal and unequal variance, respectively. ψ1a and ψ2a are adjusted tests for ψ1 and ψ2, respectively (2,000 simulations).

α = 0.05. In each row, the columns ψ1, ψ2, ψ1a, ψ2a appear three times: first for the two-sided test, then for one-sided test 1, then for one-sided test 2.
ρ k m = n ψ1 ψ2 ψ1a ψ2a ψ1 ψ2 ψ1a ψ2a ψ1 ψ2 ψ1a ψ2a
0.0 2 20 0.114 0.110 0.059 0.058 0.090 0.088 0.057 0.054 0.089 0.086 0.055 0.054
0.0 2 200 0.108 0.107 0.053 0.053 0.091 0.091 0.052 0.052 0.090 0.089 0.050 0.050
0.0 10 20 0.117 0.109 0.056 0.053 0.116 0.113 0.065 0.061 0.083 0.080 0.046 0.045
0.0 10 200 0.108 0.107 0.050 0.050 0.086 0.086 0.047 0.047 0.091 0.091 0.051 0.051
0.9 2 20 0.104 0.096 0.047 0.044 0.094 0.091 0.057 0.055 0.085 0.082 0.046 0.045
0.9 2 200 0.099 0.098 0.046 0.046 0.090 0.090 0.046 0.046 0.077 0.077 0.042 0.042
0.9 10 20 0.106 0.100 0.056 0.053 0.090 0.087 0.054 0.053 0.085 0.080 0.046 0.043
0.9 10 200 0.116 0.115 0.053 0.053 0.102 0.101 0.055 0.054 0.088 0.088 0.049 0.049

Table 3.

Observed rejection rate (type I error) under the null hypothesis (2) with m = 2n. ψ1 and ψ2 are O'Brien's tests assuming equal and unequal variance, respectively. ψ1a and ψ2a are adjusted tests for ψ1 and ψ2, respectively (2,000 simulations).

α = 0.05, m = 2n. In each row, the columns ψ1, ψ2, ψ1a, ψ2a appear three times: first for the two-sided test, then for one-sided test 1, then for one-sided test 2.
ρ k n ψ1 ψ2 ψ1a ψ2a ψ1 ψ2 ψ1a ψ2a ψ1 ψ2 ψ1a ψ2a
0.0 2 20 0.050 0.121 0.046 0.045 0.044 0.091 0.043 0.043 0.052 0.102 0.050 0.050
0.0 2 200 0.061 0.148 0.055 0.055 0.056 0.109 0.052 0.052 0.063 0.113 0.059 0.060
0.0 10 20 0.055 0.139 0.050 0.050 0.043 0.105 0.041 0.040 0.060 0.110 0.056 0.056
0.0 10 200 0.058 0.133 0.055 0.055 0.053 0.096 0.051 0.051 0.053 0.110 0.051 0.051
0.9 2 20 0.048 0.129 0.050 0.050 0.047 0.094 0.048 0.048 0.053 0.107 0.055 0.055
0.9 2 200 0.050 0.130 0.049 0.049 0.043 0.102 0.043 0.044 0.052 0.108 0.051 0.051
0.9 10 20 0.058 0.155 0.063 0.062 0.056 0.111 0.056 0.056 0.063 0.123 0.066 0.067
0.9 10 200 0.047 0.133 0.049 0.049 0.046 0.090 0.048 0.048 0.045 0.114 0.047 0.047

Because sample sizes from different treatment groups may not be equal in medical applications, we consider three cases in our simulation: m = n (Table 1); n = 2m (Table 2); and m = 2n (Table 3). Miller (1986) discussed how a t-test's type I error rate is affected by unequal variances: although it can tolerate large disparities in the variances (viz., ratios of 4 and up) without showing major ill effects on α, it can be seriously affected when the population with much larger variance has much smaller sample size. Similar results are seen in our simulation. Because Var(Xiv) > Var(Yjv), ψ1 is more seriously affected when m < n than when m > n. When m = n, Table 1 shows that tests ψ1 and ψ2 can inflate their type I errors by about 100% (to roughly 0.10). When n = 2m or m = 2n, either ψ1 or ψ2 is more seriously affected: its type I error can reach roughly four times the nominal level (about 0.20, a 300% inflation). In all cases, our adjusted tests ψ1a and ψ2a have their type I errors close to the target nominal level α = 0.05.

Table 2.

Observed rejection rate (type I error) under the null hypothesis (2) with n = 2m. ψ1 and ψ2 are O'Brien's tests assuming equal and unequal variance, respectively. ψ1a and ψ2a are adjusted tests for ψ1 and ψ2, respectively (2,000 simulations).

α = 0.05, n = 2m. In each row, the columns ψ1, ψ2, ψ1a, ψ2a appear three times: first for the two-sided test, then for one-sided test 1, then for one-sided test 2.
ρ k m ψ1 ψ2 ψ1a ψ2a ψ1 ψ2 ψ1a ψ2a ψ1 ψ2 ψ1a ψ2a
0.0 2 20 0.192 0.091 0.067 0.056 0.141 0.085 0.067 0.062 0.133 0.073 0.051 0.046
0.0 2 200 0.179 0.090 0.053 0.053 0.132 0.079 0.056 0.055 0.130 0.077 0.049 0.049
0.0 10 20 0.193 0.097 0.071 0.060 0.126 0.079 0.060 0.056 0.138 0.085 0.062 0.056
0.0 10 200 0.185 0.076 0.046 0.044 0.132 0.074 0.047 0.046 0.129 0.076 0.045 0.044
0.9 2 20 0.184 0.075 0.056 0.048 0.118 0.063 0.048 0.041 0.144 0.078 0.061 0.055
0.9 2 200 0.194 0.098 0.064 0.062 0.127 0.075 0.054 0.053 0.149 0.091 0.060 0.060
0.9 10 20 0.202 0.087 0.067 0.055 0.132 0.075 0.059 0.052 0.152 0.081 0.065 0.058
0.9 10 200 0.195 0.081 0.051 0.051 0.138 0.078 0.054 0.053 0.130 0.072 0.045 0.044

4. An Example

To illustrate the difference between O'Brien's test and our adjusted test, we use data from the multicenter controlled clinical trial of Coenzyme Q10 in early Parkinson's disease (QE2 trial). The trial was conducted in 1999–2001 to determine whether Coenzyme Q10 could slow the functional decline in Parkinson's disease (Shults et al., 2002). There were 16, 21, 20, and 23 patients randomized to placebo or Coenzyme Q10 at dosages of 300, 600, or 1200 mg/day, respectively. Patients were evaluated at the screening, baseline, and 1-, 4-, 8-, 12-, and 16-month visits. Subjects were followed for up to 16 months or until disability requiring treatment with levodopa had developed. Outcome measures for the treatment efficacy comparison include the mental (mentation), motor, and activities of daily living (ADL) subscales of the Unified Parkinson's Disease Rating Scale (UPDRS), and the Schwab and England ADL (SEADL) score. The primary outcome was the change in the total score (the sum of mental, motor, and ADL) on the UPDRS from baseline to the last visit at 16 months. Last observation carried forward was used by the trial investigators for missing data. The primary analysis was a test for a trend between dosage and the mean change in the UPDRS score. A p value of 0.09 (two-sided) was reported by the investigators (Shults et al., 2002).

As a secondary analysis, the trial investigators conducted a series of univariate tests, one for each single outcome. The goal was to assess whether Coenzyme Q10, at any dose, is more effective than placebo with respect to changes in the mental, motor, and ADL subscales of the UPDRS, and the SEADL, from the baseline visit to the last visit at 16 months. While we performed both O'Brien's test and our adjusted test to contrast the placebo group with each of the three dose groups separately, for illustration purposes we here combine all three Coenzyme Q10 groups into a single Coenzyme Q10 group. (We note parenthetically that the results of each pairwise comparison also provide evidence of differing p values when comparing O'Brien's test to the adjusted methods. These results are available from the authors.) With this simplification to a single combined Coenzyme Q10 group, the goal was to assess whether this combined group would perform better than the placebo group. Five patients had missing observations at the final 16-month visit. Their 12-month visit measures were carried forward. While the last-observation-carried-forward approach is less than optimal, we used it in our example so that our results would be comparable to the previously reported trial results.

Figure 1 gives the density plot for all four outcomes. The variances in the two groups were not the same. For example, the placebo group had larger variance (=2.267) in the change of mental score compared to the variance (=0.729) in the treatment group. A test for equal variance gave a p value of 0.002. Since smaller values indicated better function for mental, motor, and ADL, while larger values of SEADL indicated better function, we reverse coded the SEADL by multiplying by (−1) so that smaller values were preferred for all outcomes. All four outcomes were correlated. Table 4 gives the correlations and the corresponding p values among the four outcomes (combining all 80 patients). To compare the treatment versus placebo in these four outcomes, the p values from ψ1 and ψ2 were 0.0368 and 0.0493, respectively, with $\hat{h}_1 = 1.4717$ and $\hat{h}_2 = 1.4404$, and the p values from the adjusted tests, ψ1a and ψ2a, were 0.0839 and 0.1014, respectively, which are more than twice the p values from tests ψ1 and ψ2. Hence, if the significance level were set to 0.05, O'Brien's test would reject the null hypothesis while the adjusted test would not.
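The relation between the unadjusted and adjusted p values can be checked approximately from the reported numbers alone, since each adjusted statistic in (6) is the unadjusted one divided by the square root of the estimated variance factor. The sketch below is a rough check under an assumption: it uses the standard normal distribution in place of the t reference distributions used in the article, so it reproduces the reported values only approximately. The function name is hypothetical.

```python
from statistics import NormalDist


def adjusted_two_sided_p(p_unadjusted, h_hat):
    """Rescale a two-sided p value by an estimated variance factor h_hat.

    Assumption: normal reference distribution instead of the t distribution.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - p_unadjusted / 2)   # |statistic| implied by p
    return 2 * (1 - std_normal.cdf(z / h_hat ** 0.5))


# Reported inputs: p values of psi_1 and psi_2 with their estimated factors.
p1_adjusted = adjusted_two_sided_p(0.0368, 1.4717)
p2_adjusted = adjusted_two_sided_p(0.0493, 1.4404)
```

Both results land close to the reported adjusted p values of 0.0839 and 0.1014, and both cross the 0.05 threshold, which is the qualitative change described above.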

Figure 1.

Densities of the outcome changes from the baseline to the last visit (N = 80 patients) in the QE2 trial. The solid line indicates the treatment group and the dashed line indicates the placebo group.

Table 4.

Correlations (p values) among motor, mental, ADL, and SEADL for the change from the baseline visit to the last visit at 16 months, combining all 80 patients. Missing data were handled by last observation carried forward. SEADL is reverse coded by multiplying by (−1).

Mental ADL SEADL
Motor 0.0797 (0.4823) 0.3840 (0.0004) 0.4715 (<0.0001)
Mental 0.3625 (0.0010) 0.1562 (0.1693)
ADL 0.4931 (<0.0001)

5. Discussion

O'Brien's rank-sum-type test provides a simple method to compare two groups with multiple outcomes. It is useful and appropriate to use when the rejection of the null hypothesis requires improvement in outcomes. When applying O'Brien's test to compare treatments, we need to specify clearly the definition of “no difference” between the two treatments. If it is specified that, under the null, the two distributions are identical, then O'Brien's test provides a simple valid test. Under this situation, ψ1 = ψ1a and ψ2 = ψ2a asymptotically, and all of them control type I errors asymptotically. We suggest the use of ψ1 due to its simplicity. If the interest is to test whether the new treatment increases the outcome measures without assuming an identical covariance matrix or other features of the joint distribution that are not of interest to the clinicians, then this is a Behrens–Fisher problem and the adjusted tests ψ1a or ψ2a are recommended. Although ψ1a and ψ2a give similar results in our simulation, our experience suggests that ψ1a gives slightly better results compared to ψ2a.

The attractiveness of O'Brien's rank-sum-type test is its simplicity. Its statistical properties allow us to extend its use to more general settings. For example, we can use the asymptotic normality of mean rank-sums (such as $\bar{R}_x$ and $\bar{R}_y$ when there are only two groups) and methods similar to those used in the construction of the Kruskal–Wallis test to construct a test for multiple group (or dose-level) comparison. Depending on the question of interest, the test can be constructed based on a linear combination or a quadratic function of these mean rank-sums. When there are covariates or repeated measures of interest, conventional univariate response models for longitudinal data can be applied to the rank-sums with adjusted covariance matrices. Suppose yijk is the jth repeated measure of the kth outcome from the ith subject with covariate vector xij (i = 1, …, N; j = 1, …, T; k = 1, …, K). The treatment assignment is considered as a covariate. Let Rijk be the rank of yijk among all observations from the kth outcome {yijk, i = 1, …, N; j = 1, …, T}. Compute rank-sums $R_{ij} = \sum_{k=1}^{K} R_{ijk}$. For each j (j = 1, …, T), construct rank scores {aj(Rij), i = 1, …, N} using some nondecreasing function ϕj: aj(i) = ϕj(i/(N + 1)). The test statistics can be constructed using a linear combination of rank statistics $T_j = \sum_{i=1}^{N} (x_{ij} - \bar{x}_j) a_j(R_{ij})$, j = 1, …, T. This form of test statistic has been used by many authors, for example, Hájek and Šidák (1967), Puri and Sen (1969, 1971, 1985), and Hettmansperger (1984). As discussed in this article, adjustment for the dependency among all Rij's is needed to provide valid inference from the model. We are currently investigating extensions of the current methods to this problem. In particular, adjusted test statistics will be derived in which quantities similar to h1 and h2 must be estimated and applied to the covariance matrix.

For missing observations, Domhof, Brunner, and Osgood (2002) considered rank procedures for univariate outcomes. Their procedures can be easily extended to cases when there are discrete covariates with a finite number of levels. Moreover, the use of a regression model with the rank-sums allows adjustment for both continuous and categorical covariates. Multiple imputation or data augmentation methods (Schafer, 1997) applied to the ranks of missing data may also be considered.

Acknowledgements

We want to thank Drs Peter C. O'Brien, Nancy Geller, and Karl Kieburtz for helpful discussions and comments, and the QE2 steering committee for providing the data. We thank the editor, the associate editor, and three anonymous referees for many helpful suggestions that led to a much improved manuscript. This work is partially supported by two NIH/NINDS grants, R21 NS43569 and U01 NS43127.

Appendix

Proof of Theorem 1

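It is seen that $\bar{R}_y - \bar{R}_x = N^{1/2} J^T W$, where $W = (W_1, \ldots, W_k)^T$ and

\[
W_v = N^{1/2} \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl\{ I[x_{iv} < y_{jv}] - I[x_{iv} > y_{jv}] \bigr\} \big/ (2mn).
\]

Since $W$ is a U-statistic, it converges in distribution to a normal distribution with mean zero and variance $\Sigma \equiv \operatorname{Var}[W]$ when $\operatorname{Var}[G_v^o(X_v)] > 0$ and $\operatorname{Var}[F_v^o(Y_v)] > 0$ for all $v = 1, \ldots, k$, and $n/m + m/n = O(1)$ as $N \to \infty$. We first compute (details are available from the first author) $E[\hat{\sigma}_x^2]$, $E[\hat{\sigma}_y^2]$, and $E[\hat{\sigma}^2]$, and then obtain the expressions

\[
\frac{mn\, J^T \Sigma J}{E[\hat{\sigma}^2]} = h_1 + O\!\left(\frac{1}{N}\right)
\quad\text{and}\quad
\frac{N\, J^T \Sigma J}{E\!\left[\hat{\sigma}_x^2/m + \hat{\sigma}_y^2/n\right]} = h_2 + O\!\left(\frac{1}{N}\right).
\]

By Slutsky's theorem, it suffices to show that $\hat{\sigma}^2/(mn)$ and $(\hat{\sigma}_x^2/m + \hat{\sigma}_y^2/n)/N$ converge in probability to $J^T \Sigma J / h_1$ and $J^T \Sigma J / h_2$, respectively, as $N \to \infty$.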
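The identity that opens this proof, linking the difference in mean rank-sums to pairwise comparisons between the two samples, can be checked numerically. The sketch below assumes that ranks are midranks in the combined sample for each outcome and that J is the k-vector of ones; the function names and the simulated data are hypothetical and serve only to illustrate the algebra, including the case with ties.

```python
import math
import random


def midranks(values):
    """Midranks of a flat list: tied values share the average of their ranks."""
    return [
        1 + sum(v < w for v in values) + 0.5 * (sum(v == w for v in values) - 1)
        for w in values
    ]


def both_sides(x, y):
    """x has m rows and y has n rows; each row holds the k outcomes of one subject.

    Returns (Rbar_y - Rbar_x, sqrt(N) * J^T W) so the two can be compared.
    """
    m, n, k = len(x), len(y), len(x[0])
    N = m + n
    rank_sum_x = [0.0] * m
    rank_sum_y = [0.0] * n
    W = []
    for v in range(k):
        xv = [row[v] for row in x]
        yv = [row[v] for row in y]
        pooled = midranks(xv + yv)  # ranks of outcome v in the combined sample
        for i in range(m):
            rank_sum_x[i] += pooled[i]
        for j in range(n):
            rank_sum_y[j] += pooled[m + j]
        # Sum over all (i, j) pairs of I[x < y] - I[x > y]; ties contribute zero.
        signs = sum((a < b) - (a > b) for a in xv for b in yv)
        W.append(math.sqrt(N) * signs / (2 * m * n))
    lhs = sum(rank_sum_y) / n - sum(rank_sum_x) / m
    rhs = math.sqrt(N) * sum(W)  # J taken as the k-vector of ones
    return lhs, rhs


random.seed(0)
# Small integer outcomes so that ties occur; m = 7, n = 5, k = 3.
x = [[random.randint(0, 4) for _ in range(3)] for _ in range(7)]
y = [[random.randint(0, 4) for _ in range(3)] for _ in range(5)]
lhs, rhs = both_sides(x, y)
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point rounding, for any sample sizes and any pattern of ties.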

For convenience in notation, we denote $R_i = R_{xi}$ if $1 \le i \le m$, and $R_i = R_{y,i-m}$ if $m + 1 \le i \le N$. Denote $x_{iv} = y_{i-m,v}$ for $m + 1 \le i \le N$, $v = 1, 2, \ldots, k$. For any $l_1, l_2, l_3, l_4 \in \{1, 2, \ldots, N\}$,

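\begin{align*}
\operatorname{cov}(R_{l_1} R_{l_2},\, R_{l_3} R_{l_4})
= {} & \sum_{i_1 \ne l_1} \sum_{i_2 \ne l_2} \sum_{i_3 \ne l_3} \sum_{i_4 \ne l_4}
\sum_{u_1=1}^{k} \sum_{u_2=1}^{k} \sum_{u_3=1}^{k} \sum_{u_4=1}^{k} \\
& \operatorname{cov}\Bigl[
\Bigl( I[x_{i_1 u_1} < x_{l_1 u_1}] + \tfrac{1}{2} I[x_{i_1 u_1} = x_{l_1 u_1}] + \tfrac{1}{N-1} \Bigr)
\Bigl( I[x_{i_2 u_2} < x_{l_2 u_2}] + \tfrac{1}{2} I[x_{i_2 u_2} = x_{l_2 u_2}] + \tfrac{1}{N-1} \Bigr), \\
& \phantom{\operatorname{cov}\Bigl[}
\Bigl( I[x_{i_3 u_3} < x_{l_3 u_3}] + \tfrac{1}{2} I[x_{i_3 u_3} = x_{l_3 u_3}] + \tfrac{1}{N-1} \Bigr)
\Bigl( I[x_{i_4 u_4} < x_{l_4 u_4}] + \tfrac{1}{2} I[x_{i_4 u_4} = x_{l_4 u_4}] + \tfrac{1}{N-1} \Bigr)
\Bigr] \\
= {} & O(N^3). \tag{A.1}
\end{align*}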

This is because all summands in (A.1) are uniformly bounded by one, and the summand is zero whenever $\{i_1, i_2\} \cap \{i_3, i_4\}$ is an empty set. The sums over $i_1, i_2, i_3, i_4$ in (A.1) run from 1 to $N$, excluding $i_1 = l_1$, $i_2 = l_2$, $i_3 = l_3$, and $i_4 = l_4$, respectively. Rewrite $(N - 2)\hat{\sigma}^2 = \sum_{i=1}^{N} R_i^2 - m \bar{R}_x^2 - n \bar{R}_y^2$. Applying (A.1), we have

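\begin{align*}
\operatorname{var}[R_i^2] &= O(N^3), \\
\operatorname{var}[\bar{R}_x^2] &= \frac{1}{m^4} \sum_{l_1=1}^{m} \sum_{l_2=1}^{m} \sum_{l_3=1}^{m} \sum_{l_4=1}^{m} \operatorname{cov}(R_{l_1} R_{l_2},\, R_{l_3} R_{l_4}) = O(N^3), \\
\operatorname{var}[\bar{R}_y^2] &= \frac{1}{n^4} \sum_{l_1=m+1}^{N} \sum_{l_2=m+1}^{N} \sum_{l_3=m+1}^{N} \sum_{l_4=m+1}^{N} \operatorname{cov}(R_{l_1} R_{l_2},\, R_{l_3} R_{l_4}) = O(N^3).
\end{align*}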

Thus

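\[
\operatorname{var}[(N - 2)\hat{\sigma}^2]
\le (2N)^2 \max\bigl\{ \operatorname{var}[R_1^2], \ldots, \operatorname{var}[R_N^2],\ \operatorname{var}[\bar{R}_x^2],\ \operatorname{var}[\bar{R}_y^2] \bigr\}
= (2N)^2\, O(N^3)
\;\Longrightarrow\;
\operatorname{var}\!\left[\frac{\hat{\sigma}^2}{mn}\right] = O\!\left(\frac{1}{N}\right).
\]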

For any constant ϵ > 0, applying Chebyshev's inequality, we have

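\[
P\!\left( \left| \frac{\hat{\sigma}^2}{mn} - \frac{J^T \Sigma J}{h_1} \right| > \epsilon \right)
\le \frac{1}{\epsilon^2}\, E\!\left[ \frac{\hat{\sigma}^2}{mn} - \frac{J^T \Sigma J}{h_1} \right]^2
= \frac{1}{\epsilon^2} \left[ \operatorname{var}\!\left( \frac{\hat{\sigma}^2}{mn} \right) + O\!\left( \frac{1}{N^2} \right) \right]
\to 0 \quad \text{as } N \to \infty.
\]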

Hence $\hat{\sigma}^2/(mn)$ converges in probability to $J^T \Sigma J / h_1$ as $N \to \infty$. Similarly, we can show that $(\hat{\sigma}_x^2/m + \hat{\sigma}_y^2/n)/N$ converges in probability to $J^T \Sigma J / h_2$ as $N \to \infty$.
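The equality used inside the Chebyshev bound is the usual split of a mean squared error into variance plus squared bias. It is written out below with the shorthand $Z_N = \hat{\sigma}^2/(mn)$ and $c = J^T \Sigma J / h_1$, both introduced here only for illustration, and under the assumption (implicit in the proof) that $J^T \Sigma J$ stays bounded and $h_1$ stays bounded away from zero.

```latex
\begin{align*}
E\bigl[(Z_N - c)^2\bigr] &= \operatorname{var}(Z_N) + \bigl(E[Z_N] - c\bigr)^2, \\
E[Z_N] &= \frac{J^T \Sigma J}{h_1 + O(1/N)} = c + O(1/N), \\
E\bigl[(Z_N - c)^2\bigr] &= O(1/N) + O(1/N^2) \longrightarrow 0
\quad \text{as } N \to \infty .
\end{align*}
```

The second line rearranges the expression for $mn\,J^T \Sigma J / E[\hat{\sigma}^2]$ obtained earlier in the proof, so the squared bias is of order $1/N^2$ and the variance term of order $1/N$ dominates.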

Proof of Theorem 2

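It suffices to show that $\hat{h}_\ell \to h_\ell$ ($\ell = 1, 2$) in probability as $(m/n) \to \lambda$ and $(m + n) \to \infty$. The empirical estimates of $G_u^o(t)$ and $F_u^o(t)$ are

\[
\hat{G}_u^o(t) = \frac{1}{n} \sum_{j=1}^{n} \Bigl( I[y_{ju} < t] + \tfrac{1}{2} I[y_{ju} = t] \Bigr)
\quad\text{and}\quad
\hat{F}_u^o(t) = \frac{1}{m} \sum_{i=1}^{m} \Bigl( I[x_{iu} < t] + \tfrac{1}{2} I[x_{iu} = t] \Bigr).
\]

Note that $\hat{F}_u^o(x_{iu}) = [R_x^0(x_{iu}) - \tfrac{1}{2}]/m$; $\hat{G}_u^o(y_{iu}) = [R_y^0(y_{iu}) - \tfrac{1}{2}]/n$; $\hat{F}_u^o(y_{iu}) = [R_x(y_{iu}) - 1]/m$; $\hat{G}_u^o(x_{iu}) = [R_y(x_{iu}) - 1]/n$. The proof can be completed by using the consistent estimates of matrices $(\hat{a}_{uv}) = A_1^T A_1/(4mn^2)$; $(\hat{b}_{uv}) = B_1^T B_1/(4m^2 n)$, $(\hat{e}_{uv}) = A_2^T A_2/(4m^3)$, $(\hat{f}_{uv}) = A_2^T A_1/(4m^2 n)$, $(\hat{p}_{uv}) = B_2^T B_2/(4n^2)$, $(\hat{q}_{uv}) = B_1^T B_2/(4mn^2)$, and the Continuous Mapping Theorem 5.1 of Billingsley (1968, p. 30). Details are available from the first author.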
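The rank expressions for the empirical distribution functions in this proof can be illustrated numerically for a single outcome. The sketch below rests on two assumptions about notation that this appendix does not spell out: the within-sample rank is the midrank of an observation among its own sample, and the cross-sample rank of a y value is its midrank in the x sample augmented by that one value. Function names and data are hypothetical.

```python
def midranks(values):
    """Midranks of a flat list: tied values share the average of their ranks."""
    return [
        1 + sum(v < w for v in values) + 0.5 * (sum(v == w for v in values) - 1)
        for w in values
    ]


def normalized_ecdf(sample, t):
    """Average of the left- and right-continuous empirical distribution functions at t."""
    return sum((s < t) + 0.5 * (s == t) for s in sample) / len(sample)


def max_identity_gap(xs, ys):
    """Largest absolute gap between the direct and the rank-based forms of F-hat."""
    m = len(xs)
    within = midranks(xs)  # assumed within-sample rank of each x_i among the x's
    gap = 0.0
    for i, xi in enumerate(xs):
        gap = max(gap, abs(normalized_ecdf(xs, xi) - (within[i] - 0.5) / m))
    for yj in ys:
        rank_in_x = midranks(xs + [yj])[-1]  # assumed rank of y_j among the x's plus itself
        gap = max(gap, abs(normalized_ecdf(xs, yj) - (rank_in_x - 1) / m))
    return gap


xs = [1, 2, 2, 3, 5]  # ties within the x sample
ys = [0, 2, 4, 5]     # includes values tied with x observations
print(max_identity_gap(xs, ys))
```

The same check with the roles of the two samples swapped covers the two expressions for the estimate of $G_u^o$.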

References

  1. Billingsley P. Convergence of Probability Measures. Wiley; New York: 1968.
  2. Brunner E, Munzel U, Puri ML. Rank-score tests in factorial designs with repeated measures. Journal of Multivariate Analysis. 1999;70:286–317.
  3. Brunner E, Munzel U, Puri ML. The multivariate nonparametric Behrens-Fisher problem. Journal of Statistical Planning and Inference. 2002;108:37–53.
  4. Domhof S, Brunner E, Osgood DW. Rank procedures for repeated measures with missing values. Sociological Methods & Research. 2002;30:367–393.
  5. Fligner MA, Policello GE II. Robust rank procedures for the Behrens-Fisher problem. Journal of the American Statistical Association. 1981;76:162–168.
  6. Fligner MA, Rust SW. A modification of Mood's median test for the generalized Behrens-Fisher problem. Biometrika. 1982;69:221–226.
  7. Hájek J, Šidák Z. Theory of Rank Tests. Academic Press; New York: 1967.
  8. Hettmansperger TP. Statistical Inference Based on Ranks. Wiley; New York: 1984.
  9. Kaufman KD, Olsen EA, Whiting D, Savin R, De Villez R, Bergfeld W. Finasteride in the treatment of men with androgenetic alopecia. Journal of the American Academy of Dermatology. 1998;39:578–589. doi: 10.1016/s0190-9622(98)70007-6.
  10. Lefkopoulou M, Ryan L. Global tests for multiple binary outcomes. Biometrics. 1993;49:975–988.
  11. Lefkopoulou M, Moore D, Ryan L. The analysis of multiple correlated binary outcomes: Application to rodent teratology experiments. Journal of the American Statistical Association. 1989;84:810–815.
  12. Lehmann EL. Nonparametrics: Statistical Methods Based on Ranks. Holden Day; New York: 1975.
  13. Li DK, Zhao GJ, Paty DW. Randomized controlled trial of interferon-beta-1a in secondary progressive MS: MRI results. Neurology. 2001;56:1505–1513. doi: 10.1212/wnl.56.11.1505.
  14. Miller RG. Beyond ANOVA, Basics of Applied Statistics. Wiley; New York: 1986.
  15. Munzel U, Tamhane AC. Nonparametric multiple comparisons in repeated measures designs for data with ties. Biometrical Journal. 2002;44:762–779.
  16. O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40:1079–1087.
  17. Pocock SJ, Geller NL, Tsiatis AA. The analysis of multiple endpoints in clinical trials. Biometrics. 1987;43:487–498.
  18. Potthoff RF. Use of the Wilcoxon statistic for a generalized Behrens-Fisher problem. Annals of Mathematical Statistics. 1963;34:1596–1599.
  19. Pratt JW. Robustness of some procedures for the two-sample location problem. Journal of the American Statistical Association. 1964;59:665–680.
  20. Puri ML, Sen PK. A class of rank order tests for a general linear hypothesis. Annals of Mathematical Statistics. 1969;40:1325–1343.
  21. Puri ML, Sen PK. Nonparametric Methods in Multivariate Analysis. Wiley; New York: 1971.
  22. Puri ML, Sen PK. Nonparametric Methods in General Linear Models. Wiley; New York, Chichester: 1985.
  23. Sankoh A, Huque M, Russell H, D'Agostino R. Global two-group multiple endpoint adjustment methods applied to clinical trials. Drug Information Journal. 1999;33:119–140.
  24. Schafer JL. Analysis of Incomplete Multivariate Data. Chapman & Hall; London, New York: 1997.
  25. Shames RS, Heilbron DC, Janson SL, Kishiyama JL, Au DS, Adelman DC. Clinical differences among women with and without self-reported perimenstrual asthma. Annals of Allergy, Asthma & Immunology. 1998;81:65–72. doi: 10.1016/S1081-1206(10)63111-0.
  26. Shults CW, Oakes D, Kieburtz K, et al. Effects of coenzyme Q10 in early Parkinson disease: Evidence of slowing of the functional decline. Archives of Neurology. 2002;59:1541–1550. doi: 10.1001/archneur.59.10.1541.
  27. Tang D, Geller N. Closed testing procedures for group sequential clinical trials with multiple endpoints. Biometrics. 1999;55:1188–1192. doi: 10.1111/j.0006-341x.1999.01188.x.
  28. Tang D, Lin S. An approximate likelihood ratio test for comparing several treatments to a control. Journal of the American Statistical Association. 1997;92:1155–1162.
  29. Tang D, Gnecco C, Geller N. An approximate likelihood ratio test for a normal mean vector with non-negative components with application to clinical trials. Biometrika. 1989;76:577–583.
  30. Tang D, Geller N, Pocock S. On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics. 1993;49:23–30.
  31. Tilley BC, Pillemer SR, Heyse SP, et al. Global test for comparing multiple outcomes in rheumatoid arthritis trials. Arthritis and Rheumatism. 2000;42:1879–1888. doi: 10.1002/1529-0131(199909)42:9<1879::AID-ANR12>3.0.CO;2-1.
  32. Troendle JF. A likelihood ratio test for the nonparametric Behrens-Fisher problem. Biometrical Journal. 2002;44:813–824.
  33. Van der Vaart HR. On the robustness of Wilcoxon's two-sample test. In: de Jonge H, editor. Quantitative Methods in Pharmacology. Wiley Interscience; New York: 1961. pp. 140–158.
