Summary
For comparing the distributions of two samples with multiple endpoints, O’Brien (1984) proposed rank-sum-type test statistics. Huang et al. (2005) extended these statistics to the general nonparametric Behrens-Fisher hypothesis problem and obtained improved test statistics by replacing the ad hoc variance with the asymptotic variance of the rank-sum statistics. In this paper we generalize the work of O’Brien (1984) and Huang et al. (2005) and propose a weighted rank-sum statistic. We show that the weighted rank-sum statistic is asymptotically normally distributed, permitting the computation of power, p-values and confidence intervals. We further demonstrate via simulation that the weighted rank-sum statistic controls the type I error rate and, under certain alternatives, is more powerful than the statistics of O’Brien (1984) and Huang et al. (2005).
Keywords: Asymptotic normality, Behrens-Fisher problem, Case-Control, Clinical trials, Multiple endpoints, Rank-sum statistics, Weights
1. Introduction
Comparison of two or more samples with multiple endpoints is a common statistical problem in biomedical research. As an example, O’Brien (1984) described a randomized clinical trial of two therapies for the treatment of diabetes to investigate whether the experimental therapy yields better nerve function as measured by 34 electromyographic variables. Huang et al. (2005) gave another example of a clinical trial of Coenzyme Q10 in treating early Parkinson’s disease to slow the functional decline of the disease, as indexed by a number of outcome measures, including mentation, motor and average daily living scales. Other examples can be found in Pocock, Geller and Tsiatis (1987), Shames et al. (1998), Tilley et al. (1999), and Li, Zhao and Paty (2001), to name a few.
Hotelling’s T² and the Bonferroni procedure are two popular approaches for comparing two multivariate samples. Hotelling’s T² is a global test statistic and makes no distinction between variables in their direction of change. The Bonferroni procedure allocates the Type I error among the variables and then tests the null hypothesis for each variable individually. Noting the drawbacks of these two methods, O’Brien (1984) proposed a nonparametric procedure, a rank-sum-type test, which is based on the rank of each individual variable among the combined observations from the two samples. Under the null hypothesis that the two multivariate samples have the same distribution, O’Brien’s (1984) rank-sum test statistic is asymptotically distribution-free and follows a standard normal distribution. Huang et al. (2005) noticed that under the more general null hypothesis of the Behrens-Fisher problem (see, e.g., Troendle, 2002), O’Brien’s (1984) test statistics are no longer distribution-free and can substantially inflate the Type I error rate when used for testing the general Behrens-Fisher hypothesis. Subsequently Huang et al. (2005) provided a modification of O’Brien’s (1984) test by adjusting for the variances of the rank sums.
Generalizing O’Brien’s (1984) rank-sum test and the modified test of Huang et al. (2005), we propose a weighted rank-sum statistic for testing the general nonparametric Behrens-Fisher hypothesis. The weights can be chosen to be constants emphasizing the importance of the individual variables, or they can be chosen to minimize the variance of the weighted rank-sum statistic. Under mild conditions, the weighted rank-sum statistic is asymptotically normally distributed, thus permitting the computation of power, p-values and confidence intervals. Simulation studies demonstrate that the weighted rank-sum statistic controls the type I error rate and is more powerful than the statistics of O’Brien (1984) and Huang et al. (2005) for certain alternatives.
2. Weighted-Rank-Sum Statistics for the Behrens-Fisher Problem
Suppose our interest is to compare the distribution of two p-dimensional variables, X = (X1, ···, Xp)’, and Y = (Y1, ···, Yp)’, representing the outcomes of p endpoints from subjects in, say, the standard therapy arm and the experimental therapy arm in a clinical trial, or the controls and cases in a case-control study, respectively. We assume that X and Y follow distributions F and G, with marginal distributions Fa and Ga of Xa and Ya respectively, where a = 1, ···, p. Following Huang et al. (2005), we define
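$$\theta_a = \Pr(X_a < Y_a) - \Pr(X_a > Y_a), \quad a = 1, \ldots, p, \tag{1}$$
and consider testing the null hypothesis
$$H_0: \theta_1 = \cdots = \theta_p = 0. \tag{2}$$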
This is a nonparametric version of the Behrens-Fisher problem. The null space under H0, {(F, G): θ1 = ··· = θp = 0}, is larger than the usual null space under the hypothesis F = G. In a clinical trial setting θa can be viewed as a measure of the marginal treatment efficacy (corresponding to the ath endpoint) of the experimental therapy relative to the standard therapy, assuming that larger outcomes indicate better treatment results. Thus a larger positive value of θa indicates better treatment results with respect to the ath outcome variable for the experimental therapy than for the standard therapy.
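As a minimal illustration, the quantity θa = Pr(Xa < Ya) − Pr(Xa > Ya) can be estimated for each endpoint by averaging over all (X, Y) pairs. The sketch below is not taken from the paper; the function name `theta_hat` and the small data set are hypothetical and serve only to show the definition at work.

```python
def theta_hat(x, y):
    """Empirical estimate of Pr(X < Y) - Pr(X > Y) for a single endpoint.

    Each (x_i, y_j) pair contributes +1 if x_i < y_j, -1 if x_i > y_j,
    and 0 for a tie; the total is averaged over all m * n pairs.
    """
    total = 0
    for xi in x:
        for yj in y:
            if xi < yj:
                total += 1
            elif xi > yj:
                total -= 1
    return total / (len(x) * len(y))


# Hypothetical data: 3 subjects in the X-sample, 4 in the Y-sample, p = 2 endpoints.
x_sample = [[1.2, 3.4], [2.0, 2.9], [0.7, 3.1]]
y_sample = [[1.9, 3.0], [2.4, 3.6], [1.5, 2.8], [2.2, 3.3]]

p = len(x_sample[0])
thetas = [
    theta_hat([row[a] for row in x_sample], [row[a] for row in y_sample])
    for a in range(p)
]
print(thetas)  # one estimate per endpoint
```

An estimate near zero for every endpoint is what H0 describes; note that H0 does not require the two distributions to be identical.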
2.1 Rank-sum type test statistics
Let xi = (xi1, ···, xip)’, i = 1, ···, m, be the outcomes for the ith subject from the X-sample and yj = (yj1, ···, yjp)’, j = 1, ···, n, be the outcomes of the jth subject from the Y-sample, and write N = m + n. For the ath outcome variable, a = 1, ···, p, we combine the two samples and rank the N observations x1a, ···, xma, y1a, ···, yna, and denote by Rxia and Ryja the midranks of xia and yja, respectively. Summing the ranks of the p variables then gives the rank sum Sxi = Rxi1 + ··· + Rxip for each subject, i = 1, ···, m, from the X-sample, and Syj = Ryj1 + ··· + Ryjp for each subject, j = 1, ···, n, from the Y-sample. O’Brien (1984) suggested reducing the problem of comparing two multivariate distributions to one of comparing the rank sums between {Sxi, i = 1, ···, m} and {Syj, j = 1, ···, n} using the usual two-sample t-tests. This yields the t-test statistics
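$$T_1 = \frac{\bar S_y - \bar S_x}{\hat\sigma \sqrt{1/m + 1/n}}, \qquad T_2 = \frac{\bar S_y - \bar S_x}{\sqrt{\hat\sigma_x^2/m + \hat\sigma_y^2/n}}, \tag{3}$$
analogous to the usual two-sample t-test with equal variances and unequal variances, respectively, where $\bar S_x = m^{-1}\sum_{i=1}^m S_{xi}$, $\bar S_y = n^{-1}\sum_{j=1}^n S_{yj}$, $\hat\sigma_x^2 = (m-1)^{-1}\sum_{i=1}^m (S_{xi} - \bar S_x)^2$, $\hat\sigma_y^2 = (n-1)^{-1}\sum_{j=1}^n (S_{yj} - \bar S_y)^2$, and $\hat\sigma^2 = \{(m-1)\hat\sigma_x^2 + (n-1)\hat\sigma_y^2\}/(N-2)$.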
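The reduction described above (midrank each endpoint in the pooled sample, sum the ranks per subject, then apply a two-sample t-test to the rank sums) can be sketched as follows. This is an illustrative sketch rather than the authors’ code: the helper names `midranks`, `rank_sums` and `obrien_t` are hypothetical, and the sign convention (Y-sample mean minus X-sample mean) is an assumption that does not affect a two-sided test.

```python
import math


def midranks(values):
    """1-based midranks: tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sums(x, y):
    """Per-subject sums of pooled-sample midranks over the p endpoints."""
    m, n, p = len(x), len(y), len(x[0])
    sx, sy = [0.0] * m, [0.0] * n
    for a in range(p):
        r = midranks([row[a] for row in x] + [row[a] for row in y])
        for i in range(m):
            sx[i] += r[i]
        for j in range(n):
            sy[j] += r[m + j]
    return sx, sy


def obrien_t(x, y):
    """Pooled-variance and unequal-variance t-statistics on the rank sums."""
    sx, sy = rank_sums(x, y)
    m, n = len(sx), len(sy)
    mean_x, mean_y = sum(sx) / m, sum(sy) / n
    ss_x = sum((s - mean_x) ** 2 for s in sx)
    ss_y = sum((s - mean_y) ** 2 for s in sy)
    pooled = (ss_x + ss_y) / (m + n - 2)
    t_pooled = (mean_y - mean_x) / math.sqrt(pooled * (1 / m + 1 / n))
    t_welch = (mean_y - mean_x) / math.sqrt(ss_x / (m - 1) / m + ss_y / (n - 1) / n)
    return t_pooled, t_welch


# Hypothetical data with p = 2 endpoints.
x = [[1, 10], [2, 20], [3, 30]]
y = [[4, 40], [5, 50], [6, 60]]
t1, t2 = obrien_t(x, y)
print(t1, t2)
```

With equal sample sizes the two statistics coincide, which is why the equal-size rows of a comparison between them look nearly identical.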
Huang et al. (2005) noticed that under the more restricted null hypothesis F = G, both T1 and T2 asymptotically follow the standard normal distribution. However, when F ≠ G these two statistics remain asymptotically normally distributed, but with nonunit variances. When used to test the Behrens-Fisher hypothesis H0, these test statistics can substantially inflate the Type I error rate, as demonstrated in Huang et al. (2005). To make O’Brien’s (1984) test suitable for testing the null hypothesis H0, Huang et al. (2005) derived the asymptotic variances of the two statistics and suggested using the following two modified test statistics for H0:
(4) |
where and are consistent estimates of
and
respectively, with , , , , , and , .
Huang et al. (2005) further showed that, under the null hypothesis H0, the two test statistics asymptotically follow the standard normal distribution and thus the Type I error rate can be controlled at significance level α by rejecting H0 if the magnitude of the test statistic exceeds the critical value Φ⁻¹(1 − α/2), where Φ is the standard normal distribution function.
2.2 Weighted rank-sum statistics
O’Brien’s (1984) rank-sum test and the modified version of Huang et al. (2005) give equal weight to the rank of each individual variable. In many situations unequal weights are desirable so that different emphasis can be assigned to different variables. Moreover, statistical optimization requires that the weights be proportional to the reciprocals of the variances of the variables to be combined when the variables are mutually independent. Taking these arguments into consideration, we propose using weighted rank-sum statistics. Let wa ≥ 0, a = 1, ···, p, be a constant or a random variable. The weighted rank for the ith subject in the X-sample is defined as the weighted sum w1Rxi1 + ··· + wpRxip, i = 1, ···, m, and the weighted rank for the jth subject in the Y-sample is w1Ryj1 + ··· + wpRyjp, j = 1, ···, n. Setting wa = 1 for every a = 1, ···, p leads to the test statistics considered by O’Brien (1984) or Huang et al. (2005). Moreover, if only a subset of the p variables is of interest, we can set the weights to one for variables in the subset and to zero for variables not in the subset.
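A short sketch of the weighted rank for each subject, i.e. the sum over endpoints of weight times midrank. The function name `weighted_rank_sums` and the rank values are hypothetical; the three weight vectors illustrate the equal-weight, subset, and emphasis cases described above.

```python
def weighted_rank_sums(rank_matrix, weights):
    """Row-wise weighted sum of ranks: sum over a of w_a * R_{ia}."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return [sum(w * r for w, r in zip(weights, row)) for row in rank_matrix]


# Hypothetical midranks for 3 subjects (rows) and p = 3 endpoints (columns).
ranks_x = [
    [1.0, 2.5, 4.0],
    [3.0, 2.5, 1.0],
    [5.0, 6.0, 2.0],
]

equal = weighted_rank_sums(ranks_x, [1, 1, 1])           # unweighted rank sums
subset = weighted_rank_sums(ranks_x, [1, 0, 1])          # endpoint 2 excluded
emphasis = weighted_rank_sums(ranks_x, [2.0, 1.0, 0.5])  # endpoint 1 emphasized
print(equal, subset, emphasis)
```

Keeping the weights as a plain vector separate from the rank matrix means the same ranks can be reused while different weight choices are compared.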
Using the weighted rank sum, we propose the following two test statistics:
(5) |
where , , , and and and are consistent estimates (e.g. empirical estimates) of
and
respectively, with cab, dab, eab, fab, ξab and ηab as for Eq. (4).
We have the following results.
Theorem 1
Under the null hypothesis H0, Tw1 and Tw2 both converge in distribution to the standard normal distribution as min{m, n} → ∞ and 0 < λ = lim{m/n} < ∞.
Therefore H0 is rejected if |Tw1| or |Tw2| is larger than Φ⁻¹(1 − α/2); both tests asymptotically maintain the Type I error rate at the nominal level α.
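The rejection rule can be applied to any statistic that is asymptotically standard normal under H0. The sketch below uses Python’s standard-library `statistics.NormalDist`; the function name `two_sided_test` and the value 2.3 are hypothetical.

```python
from statistics import NormalDist


def two_sided_test(t_stat, alpha=0.05):
    """Critical value, two-sided p-value and decision for a N(0, 1) statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    critical = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))
    return critical, p_value, abs(t_stat) > critical


# Hypothetical observed value of the test statistic.
critical, p_value, reject = two_sided_test(2.3)
print(critical, p_value, reject)
```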
Proof of Theorem 1
Denote the indicator function of a set by I{·}. Let Λ = (λab), with and let w = (w1, ···, wp)’.
where
Note that U is a p-dimensional vector with each element being a U-statistic. It follows from standard asymptotic theory on U-statistics that, under the null hypothesis H0, U converges to a p-variate normal distribution with mean 0 = (0, ···, 0)’ and variance-covariance matrix Δ = Cov(U) as min{m, n} → ∞ and 0 < m/n < ∞. Therefore asymptotically follows a standard normal distribution under the null hypothesis H0. Hence, it suffices to show that and are consistent estimates of , which can be derived following the proof of Theorem 1 and Theorem 2 in Huang et al. (2005).
3. Selection of Weights
The weights w can be chosen to meet practical needs, for example, to exclude some variables by setting their weights to zero. In other situations the weights can be determined so that the variance of the weighted rank-sum is minimized at certain parameter values in the null or alternative hypothesis space. Here we search for the w that minimizes the variance V(w) of under the null hypothesis. To this end, we first give the following definitions. For any a ∈ {1, ···, p}, define Ry(xia) to be the midrank of xia among {xia, y1a, ···, yna}, Rx(xia) the midrank of xia among {x1a, ···, xma}, Rx(yja) the midrank of yja among {x1a, ···, xma, yja}, and Ry(yja) the midrank of yja among {y1a, ···, yna}. Let Ik be the identity matrix of order k, Jk be the column vector of order k whose elements are 1, and define . Then following Huang et al. (2005), we can obtain consistent estimates of and as
and
respectively, where P = (pia)m×p with , Q = (qia)m×p with qia = 2Rx(xia) − 1 − m, U = (uja)n×p with , and V = (vja)n×p with vja = 2Ry(yja) − 1 − n, where i = 1, ···, m, a = 1, ···, p, j = 1, ···, n, and .
Hence, for Tw1 and Tw2, we have the estimated variances of ,
and
where Mx = (Rxia) and My = (Ryja), the rank matrix for the X-sample and Y-sample, respectively.
The optimal weights w1 (or w2) are those that minimize the variance of , and they can be estimated by
The weights and their estimates can be computed only numerically, since there are no closed forms. Furthermore numerical results show that . This is understandable since, as pointed out earlier, both and are consistent estimates of .
4. Simulation Study and Real Data Example
4.1 Simulation studies
In this section, we conduct a simulation study to evaluate the type I error rate and power of the proposed tests, Tw1 and Tw2, in comparison with those of O’Brien (1984), T1 and T2, and Huang et al. (2005), Th1 and Th2. To this end, we consider Xi = (Xi1, Xi2)’, i = 1, ···, m, random samples from a bivariate normal distribution with mean (0, 0.5)’ and variance-covariance matrix , and Yj = (Yj1, Yj2)’, j = 1, ···, n, random samples from a bivariate normal distribution with mean (0, 0.5)’ and variance-covariance matrix . Clearly the null hypothesis holds with these two distributions, i.e., for any i and j, Pr(Xi1 < Yj1) − Pr(Xi1 > Yj1) = Pr(Xi2 < Yj2) − Pr(Xi2 > Yj2) = 0. We generate 10,000 replicates for each pair of m and n selected from {50, 100, 200}. For each replicate the optimal weights are estimated from the simulated data using the method described in Section 3. The simulated Type I error rate is the proportion of replicates in which the null hypothesis H0 is rejected at the nominal significance level of 0.05 (two-sided).
The simulated power is obtained similarly under the same settings except that the mean vector of the X-samples is set to (0,-0.5)’, and the variance-covariance matrices are set to for the X-sample and for the Y-sample.
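The structure of such a simulation can be sketched as below. This is only an outline under stated assumptions: the standard deviations and correlations in `rejection_rate` are hypothetical placeholders (the matrices used in the study are not reproduced here), the number of replicates is reduced to 500, and the statistic is the unequal-variance t-statistic on unweighted rank sums rather than the proposed Tw1 or Tw2, whose variance estimates are not implemented in this sketch.

```python
import math
import random
from statistics import NormalDist


def midranks(values):
    """1-based midranks: tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def draw(rng, size, mean, sds, rho):
    """Bivariate normal draws with given means, standard deviations, correlation."""
    out = []
    for _ in range(size):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        second = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        out.append((mean[0] + sds[0] * z1, mean[1] + sds[1] * second))
    return out


def welch_rank_sum_stat(x, y):
    """Unequal-variance t-statistic computed on unweighted rank sums."""
    m, n = len(x), len(y)
    sums = [0.0] * (m + n)
    for a in range(2):
        for k, r in enumerate(midranks([row[a] for row in x + y])):
            sums[k] += r
    sx, sy = sums[:m], sums[m:]
    mean_x, mean_y = sum(sx) / m, sum(sy) / n
    var_x = sum((s - mean_x) ** 2 for s in sx) / (m - 1)
    var_y = sum((s - mean_y) ** 2 for s in sy) / (n - 1)
    return (mean_y - mean_x) / math.sqrt(var_x / m + var_y / n)


def rejection_rate(m, n, mean_x, mean_y, replicates=500, alpha=0.05, seed=1):
    """Proportion of replicates in which the two-sided test rejects."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(replicates):
        x = draw(rng, m, mean_x, (1.0, 1.0), 0.5)  # assumed covariance structure
        y = draw(rng, n, mean_y, (2.0, 2.0), 0.2)  # assumed covariance structure
        if abs(welch_rank_sum_stat(x, y)) > critical:
            rejections += 1
    return rejections / replicates


type_one = rejection_rate(50, 50, (0, 0.5), (0, 0.5))  # equal means: H0 holds
power = rejection_rate(50, 50, (0, -0.5), (0, 0.5))    # shifted second endpoint
print(type_one, power)
```

Fixing the seed makes a run reproducible; a full study would use many more replicates and loop over the (m, n) pairs.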
Table 1 presents the simulation results for the type I error rate and power. The results indicate that both the proposed tests and those of Huang et al. (2005) effectively maintain the nominal type I error rate, with minor discrepancies possibly due to simulation variability. In comparison, O’Brien’s (1984) tests produce inconsistent type I error rates, mostly inflated above the nominal significance level of 0.05. For example, with m = 200, n = 100, the empirical type I error rates of O’Brien’s tests are 0.101 and 0.059, while those of the tests of Huang et al. (2005) are 0.051 and 0.050, and those of the two proposed tests are 0.054 and 0.053, respectively. The power of the proposed tests is substantially higher than that of the tests of O’Brien (1984) and Huang et al. (2005). For example, with m = 100, n = 50, the power values are, respectively, 0.394 and 0.351 for O’Brien’s (1984) tests, 0.360 and 0.355 for the tests of Huang et al. (2005), and 0.567 and 0.563 for the proposed tests, more than 15 percentage points higher than the other tests. It is worth noting that even when O’Brien’s (1984) T1 test produces a type I error rate below the nominal level (m = 50, n = 100 and m = 100, n = 200), the proposed tests still achieve considerably higher power than the other tests.
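Table 1. Simulated type I error rate and power at the nominal two-sided significance level of 0.05 (10,000 replicates).
m | n | T1 | T2 | Th1 | Th2 | Tw1 | Tw2 |
---|---|---|---|---|---|---|---|
Type I error rate | | | | | | | |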
50 | 50 | 0.064 | 0.063 | 0.052 | 0.051 | 0.056 | 0.055 |
100 | 100 | 0.066 | 0.066 | 0.052 | 0.051 | 0.052 | 0.052 |
200 | 200 | 0.061 | 0.061 | 0.048 | 0.048 | 0.049 | 0.049 |
50 | 100 | 0.031 | 0.069 | 0.051 | 0.051 | 0.053 | 0.053 |
100 | 200 | 0.029 | 0.067 | 0.049 | 0.050 | 0.050 | 0.051 |
100 | 50 | 0.105 | 0.061 | 0.056 | 0.054 | 0.060 | 0.059 |
200 | 100 | 0.101 | 0.059 | 0.051 | 0.050 | 0.054 | 0.053 |
Power | | | | | | | |
50 | 50 | 0.311 | 0.311 | 0.316 | 0.315 | 0.512 | 0.511 |
100 | 100 | 0.534 | 0.533 | 0.535 | 0.535 | 0.734 | 0.734 |
200 | 200 | 0.825 | 0.825 | 0.825 | 0.825 | 0.933 | 0.933 |
50 | 100 | 0.375 | 0.431 | 0.436 | 0.433 | 0.637 | 0.635 |
100 | 200 | 0.664 | 0.716 | 0.717 | 0.716 | 0.866 | 0.865 |
100 | 50 | 0.394 | 0.351 | 0.360 | 0.355 | 0.567 | 0.563 |
200 | 100 | 0.641 | 0.591 | 0.596 | 0.593 | 0.790 | 0.788 |
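The size of the simulation variability in the type I error rows of Table 1 can be gauged from the binomial standard error of a proportion based on 10,000 replicates. The sketch below is a back-of-the-envelope check, not part of the original analysis; the function names are hypothetical. At a true rate of 0.05 the standard error is about 0.0022, so an approximate 95% band is roughly 0.046 to 0.054.

```python
import math


def mc_standard_error(rate, replicates):
    """Binomial standard error of a simulated rejection proportion."""
    return math.sqrt(rate * (1 - rate) / replicates)


nominal, replicates = 0.05, 10_000
se = mc_standard_error(nominal, replicates)
low, high = nominal - 1.96 * se, nominal + 1.96 * se


def within_band(observed):
    """True if an observed rate lies inside the approximate 95% band."""
    return low <= observed <= high


print(round(se, 5), round(low, 4), round(high, 4))
print(within_band(0.052), within_band(0.101))
```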
4.2 An example
Investigating the role of certain growth hormones was one major objective of The Growth and Maturation in Children with Autism or Autistic Spectrum Disorder (ASD) Study (the Autism/ASD Study), a case-control study conducted by the Eunice Kennedy Shriver National Institute of Child Health and Human Development in 2002-2005; see Mills et al. (2007) for details of subject enrollment and data collection. The study enrolled eighty-one subjects, 75 boys and 6 girls, diagnosed as having autism/ASD, and eighty age-matched controls (59 boys and 21 girls). Blood samples were assayed for insulin-like growth factors (IGF-1, IGF-2), insulin-like growth factor binding protein (IGFBP-3), and growth hormone binding protein (GHBP), as well as for dehydroepiandrosterone (DHEA) and DHEA-sulphate (DHEAS).
To illustrate the proposed methods in comparison with other approaches, we exclude from the analysis the data from the girls, due to their small sample size, and from four boys in the case group who did not provide blood samples, thus yielding 71 cases and 59 controls in the analysis. We confine our attention to five hormones: insulin-like growth factor 1 (IGF-1), insulin-like growth factor 2 (IGF-2), IGF binding protein (IGFBP-3), growth hormone binding protein (GHBP), and dehydroepiandrosterone (DHEA). DHEA-sulphate (DHEAS) was not included in the analysis since its levels were undetectable in more than half of the subjects (Mills et al., 2007). The question is whether the levels of any of these growth-related hormones differ between cases and controls.
We applied the proposed tests and the tests of O’Brien (1984) and Huang et al. (2005) to the five growth-related hormone levels in cases and controls. The P-values were 2.86 × 10⁻⁶ and 1.75 × 10⁻⁶ for O’Brien’s tests and 6.28 × 10⁻⁷ and 6.30 × 10⁻⁷ for the two tests of Huang et al. (2005). In contrast, the two proposed tests give P-values of 1.93 × 10⁻⁷ and 3.33 × 10⁻⁷, respectively. For this example, the proposed method gives stronger evidence against H0 than O’Brien’s method, but only slightly stronger than the tests of Huang et al. (2005). This is partly because the differences between cases and controls all fall in the same direction, that is, for each hormone, its level among cases is higher than among controls.
5. Discussion
For testing the Behrens-Fisher hypothesis, we proposed a weighted rank-sum test statistic that effectively maintains the type I error rate and, in the simulation settings considered, possesses higher power than the tests of O’Brien (1984) and Huang et al. (2005). The optimal weights do not have closed forms and need to be estimated using available data.
All the tests discussed are nonparametric in nature and are “global” in the sense that they summarize the multi-dimensional data into one-dimensional statistics. Under the most restricted null hypothesis that the two multivariate distributions are identical these tests are asymptotically equivalent. However, under the less restrictive Behrens-Fisher hypotheses, they perform differently. It would be interesting to see how these test statistics behave under other hypotheses.
The proposed test gains its power by accumulating evidence across comparisons on the individual outcomes, but may still have low power in situations where the differences in the outcomes between the two samples, as measured by the θa’s, exist but fall in different directions (some differences are positive and some are negative or zero). In this regard, more robust tests, such as the one described in Yu et al. (2005), could serve as plausible alternatives.
Acknowledgements
We would like to thank Dr. B.J. Stone for help. The authors are supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (AL, KFY), and the National Cancer Institute (QL, KY), National Institutes of Health. The opinions expressed in the article are not necessarily those of the National Institutes of Health. Q Li is also supported in part by the Knowledge Innovation Program of the Chinese Academy of Sciences, No. 30465W0 and 30475V0. The authors thank the referee, an Associate Editor and the Editor for helpful comments and suggestions.
References
- Huang P, Tilley BC, Woolson RF, Lipsitz S. Adjusting O’Brien’s test to control type I error for the generalized nonparametric Behrens-Fisher problem. Biometrics. 2005;61:532–539. doi: 10.1111/j.1541-0420.2005.00322.x.
- Li DK, Zhao GJ, Paty DW. Randomized controlled trial of interferon-beta-1a in secondary progressive MS: MRI results. Neurology. 2001;56:1505–1513. doi: 10.1212/wnl.56.11.1505.
- Mills JL, Hediger ML, Molloy CA, Chrousos GP, Manning-Courtney P, Yu KF, Brasington M, England LJ. Elevated levels of growth-related hormones in autism and autism spectrum disorder. Clinical Endocrinology. 2007;67:230–237. doi: 10.1111/j.1365-2265.2007.02868.x.
- O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40:1079–1087.
- Pocock SJ, Geller NL, Tsiatis AA. The analysis of multiple endpoints in clinical trials. Biometrics. 1987;43:487–498.
- Shames RS, Heilbron DC, Janson SL, Kishiyama JL, Au DS, Adelman DC. Clinical differences among women with and without self-reported perimenstrual asthma. Annals of Allergy Asthma Immunology. 1998;81:65–72. doi: 10.1016/S1081-1206(10)63111-0.
- Tilley BC, Pillemer SR, Heyse SP, Li S, Clegg DO, Alarcón GS. Global statistical tests for comparing multiple outcomes in rheumatoid arthritis trials. Arthritis and Rheumatism. 1999;42:1879–1888. doi: 10.1002/1529-0131(199909)42:9<1879::AID-ANR12>3.0.CO;2-1.
- Troendle JF. A likelihood ratio test for the nonparametric Behrens-Fisher problem. Biometrical Journal. 2002;44:813–824.
- Yu K, Gu C, Xiong CJ, An P, Province MA. Global transmission/disequilibrium tests for haplotypes reconstructed from multiple genes. Genetic Epidemiology. 2005;29:323–335. doi: 10.1002/gepi.20102.