Author manuscript; available in PMC: 2018 Aug 1.
Published in final edited form as: Stat Methods Med Res. 2016 Dec 29;27(8):2384–2400. doi: 10.1177/0962280216680524

An optimal Wilcoxon–Mann–Whitney test of mortality and a continuous outcome

Roland A Matsouaka 1,2, Aneesh B Singhal 3, Rebecca A Betensky 4,5
PMCID: PMC5393279  NIHMSID: NIHMS847056  PMID: 27920364

Abstract

We consider a two-group randomized clinical trial, where mortality affects the assessment of a follow-up continuous outcome. Using the worst-rank composite endpoint, we develop a weighted Wilcoxon–Mann–Whitney test statistic to analyze the data. We determine the optimal weights for the Wilcoxon–Mann–Whitney test statistic that maximize its power. We derive a formula for its power and demonstrate its accuracy in simulations. Finally, we apply the method to data from an acute ischemic stroke clinical trial of normobaric oxygen therapy.

Keywords: Missing data, survivor bias, multiple endpoints, weighted Wilcoxon–Mann–Whitney test, censored-by-death, composite endpoints

1 Introduction

In many randomized clinical trials, the difference between treatment groups is evaluated using measurements of an outcome of interest after a pre-specified follow-up time. However, for some participants, follow-up measurements may be missing if a disease-related event, such as death (or withdrawal due to worsening disease condition), has occurred prior to the end of follow-up time. Our motivating example is a clinical trial of acute ischemic stroke conducted at Massachusetts General Hospital in Boston, MA. In this trial, patients who had acute ischemic stroke were randomized to either normobaric oxygen (NBO) therapy or room air and assessed serially to monitor their functional ability. Among other measures, patients’ neurological recovery was assessed and quantified using the NIH Stroke Scale (NIHSS) score, a function rating scale used to quantify neurological deficit due to stroke.1,2 However, investigators were confronted with early deaths, which precluded measurements of NIHSS scores for some participants at the end of the three-month follow-up period. Any analysis of the data that includes solely the subjects who survived would be biased and give spurious results.3

One approach to handle this issue is to combine the primary endpoint and mortality into a single composite endpoint: the worst-rank composite endpoint. It is calculated by considering death as the worst outcome on the same scale as the measured outcome and analyzed using ranks of these combined outcomes.4–6 Unlike traditional analyses of composite endpoints that treat all of the component endpoints equally and focus on each study participant’s first occurring event, worst-rank composite endpoints incorporate a hierarchical ranking of these individual outcomes based on their clinical importance, frequency of occurrence or severity. Moreover, in contrast to the typical “time-to-event” analyses, worst-rank composite endpoints allow us to combine individual outcomes from multiple clinical domains, while accounting for their heterogeneity. Such outcomes could include clinical events (e.g., death), continuous variables, or other clinical measurements (e.g., biomarker or quality-of-life measures).7

Ranking individual outcomes that characterize various aspects of patients’ disease experience based on a prespecified hierarchy of various components suggests the existence of an implicit weighting scheme. In fact, several authors have suggested the use of a priori determined utility (or sometimes severity) weights to reflect the relative importance of the components of composite outcomes and add another layer of discrimination beyond hierarchical ordering alone.8,9 Such weighting may be based on subjective criteria or elicitation of experts. However, deriving such a priori weights and finding a consensus about them have proven to be difficult.10–14

Building upon our previous work on this topic,6 and assuming there is a pre-specified hierarchy of various components of a composite outcome, we introduce an optimal approach that not only acknowledges such a hierarchy, but also estimates the weights so as to maximize the power to detect globally any treatment effect when present.

The use of multivariate tests to compare treatment effects from multivariate outcomes has gained interest in clinical trials of multifaceted complex diseases, where the clinical course of the disease is manifested in complex ways through a host of clinical outcomes. A global test statistic for composite endpoints that accounts for the complexity of the disease, rather than evaluating individual components, provides a comprehensive method to evaluate more effectively and more efficiently the efficacy of a treatment.15,16 Tests such as O’Brien’s test,17 Wei and Johnson’s test,18 Finkelstein and Schoenfeld’s test,19 and Moyé et al.’s test20,21 are rank-based tests developed using U-statistics. Some of these tests of combined endpoints are weighted tests where the optimal weights are determined by maximizing the power of the test statistic under a particular alternative hypothesis: this is the framework we will focus on in this paper.

In this paper, we use the given hierarchy of outcomes to construct a worst-rank composite endpoint such that death (or a missing continuous outcome due to worsening of the disease condition) is considered a worse outcome than any observed primary endpoint measurement. Furthermore, two subjects who died are ranked with respect to their survival times.4–6 In Section 2, we give the rationale for the weighted Wilcoxon–Mann–Whitney (WMW) test statistic for such a worst-rank composite endpoint. We then derive data-based optimal weights that maximize the power of the weighted WMW test statistic along with its analytical power formula. We demonstrate that the optimal-weighted WMW test statistic has greater power than the ordinary WMW test statistic. We illustrate the accuracy of our results through simulation studies (Section 3). Finally, we apply the procedures to the clinical trial of the NBO therapy for acute ischemic stroke patients (Section 4).

2 Weighted WMW

2.1 Notations

In this section, we present the ordinary WMW test for the worst-rank composite outcome and its analytical power formula that we previously derived.6 Then, we motivate its extension to a weighted WMW test through a decomposition of the WMW U-statistic.

Consider a randomized clinical trial in which m and n subjects are assigned, respectively, to the control treatment (group 1) and the active treatment (group 2) and then followed for time period T. For subject j in group i, Xij denotes the value of the continuous endpoint at the end of the follow-up time, tij denotes the time to death or disease-related withdrawal (for simplicity, we will refer to both as death), δij = I(tij ≤ T) indicates early death (i.e., before T), and pi = E(δij) = P(tij ≤ T) the probability of early death for subjects in group i.

If the subject died before T, X is unknown. Thus, following the assumed hierarchy of outcomes, this subject is assigned a worst-rank score equal to η + tij, which is a function of his or her survival time, where η = min(X) − 1 − T.

Without loss of generality, we assume larger values of X correspond to better health outcome. For each subject, the worst-rank composite endpoint is thus

$$\tilde{X}_{ij} = (1-\delta_{ij})\,X_{ij} + \delta_{ij}\,(\eta + t_{ij}), \quad i = 1, 2 \ \text{and}\ j = 1, \ldots, N \tag{1}$$
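The construction in equation (1) can be sketched in a few lines of Python. This is only an illustration: the function name `worst_rank_scores` and the convention of passing `float("inf")` as the death time of a survivor are assumptions made here, and the data are pooled over both groups so that min(X) is taken over all observed outcomes.

```python
def worst_rank_scores(x, t, T):
    """Worst-rank composite scores following equation (1).

    x: outcome values (ignored, and may be None, for subjects who died by T)
    t: death times (float("inf") for subjects alive at T; a convention assumed here)
    T: length of follow-up
    """
    observed = [xi for xi, ti in zip(x, t) if ti > T]
    eta = min(observed) - 1 - T          # eta = min(X) - 1 - T
    # Deaths receive eta + t, which is always below min(X) - 1;
    # among deaths, an earlier death gets a lower (worse) score.
    return [eta + ti if ti <= T else xi for xi, ti in zip(x, t)]


scores = worst_rank_scores(
    x=[2.0, None, 0.5, None],
    t=[float("inf"), 1.0, float("inf"), 2.5],
    T=3.0,
)
print(scores)
```

Every subject who died is scored below every survivor, and the two deaths are ordered by their survival times, which is exactly the hierarchy the composite endpoint is meant to encode.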

Let Fi and Gi be, respectively, the cumulative conditional distributions of the informative event times and observed non-fatal outcome for patients in group i, i.e. Fi(v) = P(tij ≤ v | 0 < tij ≤ T) and Gi(x) = P(Xij ≤ x | tij > T). The distribution of X̃i is given by

$$\tilde{G}_i(x) = p_i F_i(x-\eta)\, I(x<\zeta) + (1-p_i)\, G_i(x)\, I(x \ge \zeta), \quad \zeta = \min(X) - 1 \tag{2}$$

We would like to test the null hypothesis that the two treatments are equivalent with respect to both survival and the non-fatal outcome

$$H_0: G_1(x) = G_2(x) \ \text{and}\ F_1(t) = F_2(t) \ \text{for all } x \text{ and } t \tag{3}$$

against the uni-directional alternative hypothesis that the active treatment is at least as effective as the control treatment for both mortality and the non-fatal outcome and is not harmful for either, i.e.

$$H_1: G_1(x) \ge G_2(x) \ \text{and}\ F_1(t) \ge F_2(t), \ \text{for some } x \text{ and/or } t \tag{4}$$

with both G1(x) = G2(x) and F1(t) = F2(t) not occurring simultaneously for all x and t.

2.2 Ordinary WMW test

We will now define the ordinary WMW test using the framework of the worst-rank composite endpoint X̃ of the previous section. The ordinary WMW U-statistic is defined by

$$U = (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} I(\tilde{X}_{1k} < \tilde{X}_{2l}) \tag{5}$$

Using equation (1), we note that I(X̃1k < X̃2l) is equal to

$$\delta_{1k}\delta_{2l}\, I(t_{1k} < t_{2l}) + \delta_{1k}(1-\delta_{2l}) + (1-\delta_{1k})(1-\delta_{2l})\, I(X_{1k} < X_{2l}) \tag{6}$$

Therefore

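$$\begin{aligned}
\mu_1 &= E(U) = \pi_{U1} \\
\sigma_1^2 &= \mathrm{Var}(U) = (mn)^{-1}\left[\pi_{U1}(1-\pi_{U1}) + (m-1)(\pi_{U2}-\pi_{U1}^2) + (n-1)(\pi_{U3}-\pi_{U1}^2)\right]
\end{aligned} \tag{7}$$

where

$$\begin{aligned}
q_i &= 1 - p_i, \\
\pi_{t1} &= P(t_{1k} < t_{2l} \mid t_{1k} \le T,\ t_{2l} \le T) \\
\pi_{t2} &= P(t_{1k} < t_{2l},\ t_{1k'} < t_{2l} \mid t_{1k} \le T,\ t_{1k'} \le T,\ t_{2l} \le T) \\
\pi_{t3} &= P(t_{1k} < t_{2l},\ t_{1k} < t_{2l'} \mid t_{1k} \le T,\ t_{2l} \le T,\ t_{2l'} \le T) \\
\pi_{x1} &= P(X_{1k} < X_{2l}), \quad \pi_{x2} = P(X_{1k} < X_{2l},\ X_{1k'} < X_{2l}) \\
\pi_{x3} &= P(X_{1k} < X_{2l},\ X_{1k} < X_{2l'}) \\
\pi_{U1} &= p_1 p_2 \pi_{t1} + p_1 q_2 + q_1 q_2 \pi_{x1} \\
\pi_{U2} &= p_1^2 q_2 + p_1^2 p_2 \pi_{t2} + 2 p_1 q_1 q_2 \pi_{x1} + q_1^2 q_2 \pi_{x2} \\
\pi_{U3} &= p_1 q_2^2 + p_1 p_2^2 \pi_{t3} + 2 p_1 p_2 q_2 \pi_{t1} + q_1 q_2^2 \pi_{x3}
\end{aligned}$$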

(see the proof in Appendix 1).
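As a numerical check of equation (7), the following Python sketch evaluates πU1, πU2, πU3 and the resulting mean and variance of U from given input probabilities. The function name `wmw_moments` and its argument layout are illustrative choices made here, not part of the original derivation.

```python
def wmw_moments(m, n, p1, p2, pi_t, pi_x):
    """Mean and variance of the ordinary WMW U-statistic, equation (7).

    pi_t = (pi_t1, pi_t2, pi_t3) and pi_x = (pi_x1, pi_x2, pi_x3).
    """
    q1, q2 = 1 - p1, 1 - p2
    pt1, pt2, pt3 = pi_t
    px1, px2, px3 = pi_x
    pu1 = p1 * p2 * pt1 + p1 * q2 + q1 * q2 * px1
    pu2 = p1**2 * q2 + p1**2 * p2 * pt2 + 2 * p1 * q1 * q2 * px1 + q1**2 * q2 * px2
    pu3 = p1 * q2**2 + p1 * p2**2 * pt3 + 2 * p1 * p2 * q2 * pt1 + q1 * q2**2 * px3
    var = (pu1 * (1 - pu1)
           + (m - 1) * (pu2 - pu1**2)
           + (n - 1) * (pu3 - pu1**2)) / (m * n)
    return pu1, var


# Null configuration: p1 = p2, pi_t1 = pi_x1 = 1/2, the other probabilities 1/3.
mean0, var0 = wmw_moments(50, 50, 0.3, 0.3, (0.5, 1 / 3, 1 / 3), (0.5, 1 / 3, 1 / 3))
print(mean0, var0, (50 + 50 + 1) / (12 * 50 * 50))
```

Plugging in the null values reproduces E0(U) = 1/2 and Var0(U) = (n + m + 1)/(12mn), which is a quick way to confirm that the three πU expressions have been entered correctly.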

Under the null hypothesis (H0) of no difference between the two treatment groups, μ0 = E0(U) = 1/2 and σ0² = Var0(U) = (n + m + 1)/(12mn). The distribution of the ordinary WMW test statistic

$$Z = \frac{U - E_0(U)}{\sqrt{\mathrm{Var}_0(U)}} \tag{8}$$

converges to the standard normal distribution N(0, 1) as m and n tend to infinity, and m/n → ρ, 0 < ρ < 1.

The power of this WMW test is given by

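$$\Phi\!\left(\frac{\sigma_0}{\sigma_1} z_{\alpha/2} + \frac{\mu_1-\mu_0}{\sigma_1}\right) + \Phi\!\left(\frac{\sigma_0}{\sigma_1} z_{\alpha/2} - \frac{\mu_1-\mu_0}{\sigma_1}\right) \approx \Phi\!\left(\frac{\sigma_0}{\sigma_1} z_{\alpha/2} + \frac{|\mu_1-\mu_0|}{\sigma_1}\right) \tag{9}$$

where μ1 = E(U) and σ1² = Var(U) are computed under the alternative hypothesis (H1) (see the proof in Matsouaka and Betensky).6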
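Formula (9) is straightforward to evaluate numerically. The sketch below uses only the Python standard library and assumes that z_{α/2} denotes the lower α/2 quantile of the standard normal distribution (a negative number), which is the reading under which the two-term expression reduces to α under the null. The function name `wmw_power` is hypothetical.

```python
from statistics import NormalDist


def wmw_power(mu0, sigma0, mu1, sigma1, alpha=0.05):
    """Two-sided power from formula (9).

    Returns (two_term, one_term): the sum of the two Phi terms and the
    single-term approximation. sigma0 and sigma1 are standard deviations.
    Assumption: z is the lower alpha/2 quantile (negative).
    """
    nd = NormalDist()
    z = nd.inv_cdf(alpha / 2)
    ratio = sigma0 / sigma1
    shift = (mu1 - mu0) / sigma1
    two_term = nd.cdf(ratio * z + shift) + nd.cdf(ratio * z - shift)
    one_term = nd.cdf(ratio * z + abs(shift))
    return two_term, one_term


# Under the null the two-term expression equals alpha.
size, _ = wmw_power(0.5, 0.058, 0.5, 0.058)
# A shifted mean gives power above alpha; the two versions nearly agree.
full, approx = wmw_power(0.5, 0.058, 0.65, 0.055)
print(size, full, approx)
```

The single-term version drops the rejection probability in the wrong tail, so it is always slightly smaller than the two-term sum and the gap vanishes as |μ1 − μ0|/σ1 grows.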

2.3 Weighted WMW test

To motivate our weighted test, we now write the WMW U-statistic applied to the worst-rank scores (5) as a sum of three dependent WMW U-statistics. Then, we demonstrate that to optimally compare two treatment groups using worst-rank scores, we need to use a weighted statistic that takes into account the dependence that exists among the three statistics.

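Assume there exist weights w = (w1, w2), w1 + w2 = 1, such that equation (1) becomes

$$\tilde{X}_{ij} = w_1\,\delta_{ij}\,(\eta + t_{ij}) + w_2\,(1-\delta_{ij})\,X_{ij}, \quad i = 1, 2 \ \text{and}\ j = 1, \ldots, N \tag{10}$$

The U-statistic (5) then becomes Uw = w1²Ut + w1w2Utx + w2²Ux, where Ut, Utx and Ux are defined by

$$\begin{aligned}
U_t &= (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} \delta_{1k}\delta_{2l}\, I(t_{1k} < t_{2l}) \\
U_{tx} &= (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} \delta_{1k}(1-\delta_{2l}) \\
U_x &= (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} (1-\delta_{1k})(1-\delta_{2l})\, I(X_{1k} < X_{2l})
\end{aligned} \tag{11}$$

Using vector notation, we can write Uw as Uw = c′U where we define U′ = (Ut, Utx, Ux) and c′ = (c1, c2, c3) = (w1², w1w2, w2²). Notice that c1 + 2c2 + c3 = (w1 + w2)² = 1.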
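The three components in equation (11) and the weighted combination Uw = c′U can be computed directly from the data. The following Python sketch is a plain double loop over all pairs; the function names and the data layout (parallel lists of outcomes, death indicators and death times) are assumptions made for illustration.

```python
def u_components(x1, d1, t1, x2, d2, t2):
    """Return (Ut, Utx, Ux) from equation (11).

    d = 1 if the subject died before T, else 0.
    t is used only when d = 1; x is used only when d = 0.
    """
    m, n = len(d1), len(d2)
    ut = utx = ux = 0
    for k in range(m):
        for l in range(n):
            if d1[k] and d2[l]:            # both died: compare survival times
                ut += t1[k] < t2[l]
            elif d1[k] and not d2[l]:      # control died, treated survived
                utx += 1
            elif not d1[k] and not d2[l]:  # both survived: compare outcomes
                ux += x1[k] < x2[l]
    return ut / (m * n), utx / (m * n), ux / (m * n)


def weighted_u(components, w1):
    """Uw = w1^2 Ut + w1 w2 Utx + w2^2 Ux with w2 = 1 - w1."""
    w2 = 1 - w1
    c = (w1**2, w1 * w2, w2**2)
    return sum(ci * ui for ci, ui in zip(c, components))


comp = u_components(
    x1=[None, 0.2, 1.5], d1=[1, 0, 0], t1=[1.0, None, None],
    x2=[None, 1.0],      d2=[1, 0],    t2=[2.0, None],
)
print(comp, sum(comp), weighted_u(comp, 0.5))
```

The sum Ut + Utx + Ux is the ordinary U-statistic (5), so comparing it with a direct computation on the worst-rank scores is a convenient consistency check.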

Using the results in Appendix 2, we have

$$\mu_{1w} = E(U_w) = c'\,(p_1 p_2 \pi_{t1},\ p_1 q_2,\ q_1 q_2 \pi_{x1})', \qquad \sigma_{1w}^2 = \mathrm{Var}(U_w) = c'\Sigma c$$

where Σ = Var(U) is a 3 × 3 matrix given in Appendix 2.

Under the null hypothesis

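$$\mu_{0w} = E_0(U_w) = \tfrac{1}{2}\, c'\,(p^2,\ 2pq,\ q^2)' = \tfrac{1}{2}\left[w_1^2 p^2 + 2 w_1 w_2 pq + w_2^2 q^2\right] = \tfrac{1}{2}\left[w_1 p + w_2 q\right]^2, \qquad \sigma_{0w}^2 = \mathrm{Var}_0(U_w) = c'\Sigma_0 c$$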

with Σ0 = Var0(U) a 3 × 3 matrix given in Appendix 2.

2.3.1 Pre-specified weights

When there are pre-specified weights, usually determined as to reflect the relative importance or the severity of component outcomes, they can be used to calculate the weighted WMW test statistic

$$Z_w = \frac{U_w - E_0(U_w)}{\sqrt{\mathrm{Var}_0(U_w)}} \tag{12}$$

Zw converges to the standard normal distribution N(0, 1) as m and n tend to infinity, and m/n → ρ, 0 < ρ < 1.

The corresponding power is given by

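$$\Phi\!\left(\frac{\sigma_{0w}}{\sigma_{1w}} z_{\alpha/2} + \frac{\mu_{1w}-\mu_{0w}}{\sigma_{1w}}\right) + \Phi\!\left(\frac{\sigma_{0w}}{\sigma_{1w}} z_{\alpha/2} - \frac{\mu_{1w}-\mu_{0w}}{\sigma_{1w}}\right) \approx \Phi\!\left(\frac{\sigma_{0w}}{\sigma_{1w}} z_{\alpha/2} + \frac{|\mu_{1w}-\mu_{0w}|}{\sigma_{1w}}\right) \tag{13}$$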

For instance, after surveying a panel of clinical investigators, Bakal et al.9 used pre-specified weights in a study that used a composite endpoint of death, cardiogenic shock (Shock), congestive heart failure (CHF), and recurrent myocardial infarction (RE-MI). The weights were 1 for death, 0.5 for Shock, 0.3 for hospitalization for CHF, and 0.2 for RE-MI, i.e., in this context, w = ½(1, 0.5, 0.3, 0.2). In another example,22 the composite outcome consisted of events weighted according to their severity: RE-MI (weight w1 = 0.415), CHF that required the use of open-label angiotensin-converting enzyme (ACE) inhibitors (weight w2 = 0.17), and hospitalization to treat CHF (weight w3 = 0.415).

Although the use of pre-specified weights provides a more nuanced approach to the importance of individual endpoints of a composite outcome, recognizes the potential underlying differences that exist among them, and facilitates the interpretation of results compared with traditional composite endpoints, the selection of appropriate weights is not straightforward since it is inherently subjective.22–24 However, when such utility (or severity) weights exist, failing to use them to highlight the clinical importance of the component outcomes of a composite endpoint implies that we assume equal weights, which is sometimes even worse.23–25

We note that when the weights w1 and w2 are equal, i.e., c1 = c2 = c3 = w1², the test statistic Zw coincides with the (ordinary) WMW test statistic Z given in equation (8). Indeed, in that case, c′U = w1²[Ut + Utx + Ux] = w1²U with U given by equation (5). Thus, c′E0(U) = w1²E0(U) and Var0(c′U) = w1⁴Var0(U), which implies that Z = Zw.

2.3.2 Optimal weights

Now we want to estimate the optimal weights w for the weighted WMW test statistic

$$Z_c = \frac{c'\,(U - E_0(U))}{\sqrt{\mathrm{Var}_0(c'U)}} = \frac{c'\,(U - E_0(U))}{\sqrt{c'\,\mathrm{Var}_0(U)\,c}} \tag{14}$$

with U′ = (Ut, Utx, Ux) and c′ = (c1, c2, c3) = (w1², w1w2, w2²). Optimal weights c1, c2, and c3 for the test statistic Zw are those that maximize its power.

We will use the power formula of Zc to derive its optimal weights. Then, we introduce the optimal-weighted WMW test statistic Zopt and highlight some of its properties and characteristics.

From the definition of U, we show in Appendix 2 that

$$E(U) = (E(U_t),\ E(U_{tx}),\ E(U_x))' = (\pi_{t1} p_1 p_2,\ p_1 q_2,\ \pi_{x1} q_1 q_2)' \tag{15}$$

and Var(U) = Σ, where Σ = (mn)⁻¹(Σij), 1 ≤ i, j ≤ 3, is a 3 × 3 matrix.

Under the null hypothesis of no difference between the two groups, with respect to both survival and nonfatal outcome, we have p1 = p2 = p, q1 = q2 = q = 1 − p, πt1 = πx1 = 1/2, and πt2 = πx2 = πt3 = πx3 = 1/3. Thus

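$$E_0(U) = \tfrac{1}{2}\,(p^2,\ 2pq,\ q^2)' \quad \text{and} \quad \mathrm{Var}_0(U) = \Sigma_0 \tag{16}$$

where Σ0 = (mn)⁻¹(Σ0ij), 1 ≤ i, j ≤ 3, is a symmetric matrix with

$$\begin{aligned}
\Sigma_{011} &= \frac{p^2}{12} A(p), &
\Sigma_{012} = \Sigma_{021} &= -\frac{p^2 q^2}{4}(n+m-1), &
\Sigma_{013} = \Sigma_{031} &= \frac{p^2 q}{2}\big((n-1)q - mp\big), \\
\Sigma_{022} &= \frac{q^2}{12} A(q), &
\Sigma_{023} = \Sigma_{032} &= \frac{p q^2}{2}\big((m-1)p - nq\big), &
\Sigma_{033} &= pq\,(nq^2 + mp^2 + pq), \\
A(x) &= 6 + 4(n+m-2)x - 3(n+m-1)x^2
\end{aligned}$$

In this listing, the indices 1, 2, and 3 of Σ0ij refer to Ut, Ux, and Utx, respectively (for example, Σ033 = mn Var0(Utx)), so rows and columns 2 and 3 must be interchanged to match the ordering U′ = (Ut, Utx, Ux).

Moreover, since Var0(Uw) = Var0(c′U) = c′Σ0c ≥ 0 by definition, the matrix Σ0 is positive semi-definite.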

The power formula for the weighted WMW, similar to equation (9), is

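$$\Phi\!\left(\frac{\sigma_{0w}}{\sigma_{1w}} z_{\alpha/2} + \frac{\mu_{1w}-\mu_{0w}}{\sigma_{1w}}\right) + \Phi\!\left(\frac{\sigma_{0w}}{\sigma_{1w}} z_{\alpha/2} - \frac{\mu_{1w}-\mu_{0w}}{\sigma_{1w}}\right) \approx \Phi\!\left[\frac{\sigma_{0w}}{\sigma_{1w}}\left(z_{\alpha/2} + \frac{|\mu_{1w}-\mu_{0w}|}{\sigma_{0w}}\right)\right] \tag{17}$$

where μ1w = c′E(U), μ0w = c′E0(U), σ1w² = c′Σc, and σ0w² = c′Σ0c.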

Under the assumptions that

  1. n/m converges to a constant ρ (0 < ρ < 1),

  2. both √N{F1(t) − F2(t)} and √N{G1(x) − G2(x)} are bounded, i.e. σ0w/σ1w converges to 1 as N = m + n → ∞,

a weight-vector c maximizes the power (17) if and only if it maximizes |μ1w − μ0w|/σ0w.

We prove in Appendix 3 that the optimal-weight vector copt is given by

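$$c_{opt} = \frac{\Sigma_0^{-1}\mu}{b'\,\Sigma_0^{-1}\mu} \tag{18}$$

for b′ = (1, 2, 1) and μ = E(U) − E0(U) = (πt1p1p2 − ½p², p1q2 − pq, πx1q1q2 − ½q²)′. Therefore, from equation (14), the corresponding optimal test statistic Zw (denoted here Zopt) is then given by

$$Z_{opt} = \frac{c_{opt}'\,(U - E_0(U))}{\sqrt{c_{opt}'\,\Sigma_0\, c_{opt}}} = \frac{\mu'\,\Sigma_0^{-1}\,(U - E_0(U))}{\sqrt{\mu'\,\Sigma_0^{-1}\,\mu}} \tag{19}$$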
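Equations (18) and (19) can be evaluated with a small amount of linear algebra. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the null covariance matrix is assembled from the closed-form entries of Σ0 with rows and columns arranged in the order (Ut, Utx, Ux) so that it matches U′ and b′ = (1, 2, 1), and the 3 × 3 system is solved by elementary Gaussian elimination to keep the example free of external libraries.

```python
from math import sqrt


def sigma0(m, n, p):
    """Null covariance matrix of (Ut, Utx, Ux), ordered to match that vector."""
    q = 1 - p
    A = lambda x: 6 + 4 * (n + m - 2) * x - 3 * (n + m - 1) * x**2
    v_t = p**2 * A(p) / 12                                # Var of Ut
    v_x = q**2 * A(q) / 12                                # Var of Ux
    v_tx = p * q * (n * q**2 + m * p**2 + p * q)          # Var of Utx
    c_t_tx = p**2 * q / 2 * ((n - 1) * q - m * p)         # Cov(Ut, Utx)
    c_t_x = -(p**2) * q**2 / 4 * (n + m - 1)              # Cov(Ut, Ux)
    c_tx_x = p * q**2 / 2 * ((m - 1) * p - n * q)         # Cov(Utx, Ux)
    rows = [[v_t, c_t_tx, c_t_x],
            [c_t_tx, v_tx, c_tx_x],
            [c_t_x, c_tx_x, v_x]]
    return [[e / (m * n) for e in row] for row in rows]


def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [bi] for row, bi in zip(a, b)]
    for i in range(3):
        piv = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(3):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [x - f * y for x, y in zip(M[r], M[i])]
    return [M[i][3] / M[i][i] for i in range(3)]


def optimal_weights(mu, S0):
    """c_opt from equation (18), normalised so that c1 + 2*c2 + c3 = 1."""
    v = solve3(S0, mu)                       # Sigma0^{-1} mu
    denom = v[0] + 2 * v[1] + v[2]           # b' Sigma0^{-1} mu
    return [vi / denom for vi in v]


def z_opt(u, e0, mu, S0):
    """Z_opt from the right-hand side of equation (19)."""
    v = solve3(S0, mu)
    num = sum(vi * (ui - ei) for vi, ui, ei in zip(v, u, e0))
    return num / sqrt(sum(vi * mi for vi, mi in zip(v, mu)))


S0 = sigma0(50, 50, 0.3)
mu = [0.01, 0.03, 0.02]                      # illustrative values only
e0 = [0.3**2 / 2, 0.3 * 0.7, 0.7**2 / 2]     # E0(U) for p = 0.3
c = optimal_weights(mu, S0)
z = z_opt([e + d for e, d in zip(e0, mu)], e0, mu, S0)
print(c, z)
```

Because Ut + Utx + Ux is the ordinary U-statistic, the entries of this matrix must sum to (n + m + 1)/(12mn); that identity is a useful guard against transcription errors in the entries.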

2.3.3 Remarks

  1. The test statistic Zopt given by equation (19) encompasses the contributions of the effects of treatment on both mortality (via Ut) and the non-fatal outcome (via Ux) as well as the corresponding proportions of deaths and survivors in both treatment groups (via Utx) and their relative importance and magnitude, where each component is weighted accordingly through copt.

  2. As demonstrated, the ordinary WMW test statistic is a special case of a weighted WMW test statistic (corresponding to a weighted WMW test statistic with equal weights). This implies that both the ordinary and the optimal-weighted WMW test statistics belong to the same family of weighted WMW tests.

  3. Note that the optimal weight vector copt ∝ Σ0⁻¹μ depends on unknown population parameters πt1, πx1, p1, p2, and p, which must be estimated in practice (since they are not available from the observed sample data). A good estimation method of these unknown parameters is needed to calculate the test statistic Zopt given by equation (19):
    1. When the distributions of the primary endpoint, X, and the survival time, t, are known approximately, we can estimate analytically the probabilities πt1 and πx1, p1, p2 (as we have done in Appendix 4 for our simulation studies) and calculate an estimate of the probability p under the null hypothesis (H0) as p̂ = (m p̂1 + n p̂2)/(m + n) (pooled sample proportion).
      In general, the distributions of both the primary endpoint and the survival time are not known. Optimal weights are estimated using either data from a pilot study (or from previous studies, when available) or the data at hand.
    2. If we have data from prior studies, we can leverage them to estimate these parameters. Using Bayesian methods, we can elicit expert opinions to define prior distributions associated with Σ0 and μ that best reflect the characteristics of the disease under study and determine posterior distributions to provide a more accurate assessment of the optimal weights.26 Alternatively, if the data are structured such that we have multiple strata available (e.g., different enrollment periods or different clinical centers for patients), we can use an adaptive weighting scheme to estimate Σ0 and μ.27,28
    3. In the absence of data from prior studies, we recommend a bootstrap approach to estimate the weights. We generate B bootstrap samples (e.g., B = 500, 1000, or 2000) and, for each bootstrap sample, estimate the corresponding optimal weight vector copt. We then average the B estimates and, using these average weights, compute the test statistic Zopt on the original sample and test the null hypothesis.
    4. With the data at hand, we can also use a K-fold cross-validation. In that regard, we divide the data into K subsets of roughly equal size and estimate the weights copt,k and the test statistic Zopt,k exactly K times. At the k-th time, k = 1,…, K, we use the k-th subset as validation data to calculate the weights copt,k and combine the remaining K − 1 subsets as training data to estimate the test statistic Zopt,k using the weights defined at the validation stage. Then, we estimate the test statistic Zopt by averaging over all the K test statistics Zopt,k, k = 1,…,K, and run the hypothesis test.
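The bootstrap averaging step described in the remarks above can be sketched as follows. This is a minimal illustration with several assumptions: the function name `bootstrap_average_weights` is hypothetical, resampling is done with replacement within each treatment arm (the text does not specify the resampling scheme), and `estimate_weights` stands for any user-supplied routine that returns an estimate of (c1, c2, c3) from one pair of resampled arms.

```python
import random


def bootstrap_average_weights(group1, group2, estimate_weights, B=1000, seed=0):
    """Average the estimated weight vector over B bootstrap samples.

    group1, group2: lists of per-subject records (any structure).
    estimate_weights: callable(sample1, sample2) -> (c1, c2, c3); supplied by
        the user, e.g. a plug-in estimate of c_opt from equation (18).
    Assumption: subjects are resampled with replacement within each arm.
    """
    rng = random.Random(seed)
    total = [0.0, 0.0, 0.0]
    for _ in range(B):
        s1 = rng.choices(group1, k=len(group1))
        s2 = rng.choices(group2, k=len(group2))
        c = estimate_weights(s1, s2)
        total = [a + b for a, b in zip(total, c)]
    return [a / B for a in total]


# Toy check with a constant estimator: the average must equal the constant.
avg = bootstrap_average_weights([1, 2, 3], [4, 5],
                                lambda a, b: (0.25, 0.25, 0.25), B=200)
print(avg)
```

The averaged weights are then held fixed and the test statistic is computed once on the original sample, so the resampling affects only the weights and not the data entering the final test.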

3 Simulation studies

We conducted simulation studies to assess the performance of the weighted test statistic. We generated data sets to follow the pattern seen in stroke trials, where the outcome of interest (patient’s improvement on the NIHSS score over a three-month period) may be missing for some patients due to death. We simulated death times under a proportional hazards model with t1k ~ Exp(λ1), t2l ~ Exp(λ2), such that q2 = exp(−λ2T) and HR = λ1/λ2 with T = 3 months, HR = 1.0, 1.2, 1.4, 1.6, 2.0, 2.4, 3.0 and q2 = 0.6, 0.8. For the non-fatal outcome, X1k ~ N(0, 1), X2l ~ N(√2Δx, 1), k = 1,…,m; l = 1,…,n with Δx = (μx2 − μx1)/(σx1√2) = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6. The conditional probabilities, πtγ and πxγ, γ = 1, 2, 3, are given in Appendix 4. We computed power for the weighted WMW test for n = m = 50 patients, using the analytical power formula (17) and a two-sided α = 0.05. In addition, we estimated power empirically by averaging over 10,000 simulated data sets.
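One simulated data set under this design could be generated along the following lines. The sketch makes several assumptions that should be checked against the appendix: the treated-arm mean is taken to be √2·Δx, death times and the non-fatal outcome are generated independently, and the function names are hypothetical.

```python
import math
import random


def simulate_arm(size, rate, mean_x, T, rng):
    """Return a list of (died, death_time, x) tuples for one arm.

    died is True when the exponential death time falls within follow-up T;
    x is generated only for survivors (None otherwise).
    """
    arm = []
    for _ in range(size):
        t = rng.expovariate(rate)
        died = t <= T
        x = None if died else rng.gauss(mean_x, 1.0)
        arm.append((died, t if died else None, x))
    return arm


def simulate_trial(n_per_arm, hr, q2, delta_x, T=3.0, seed=0):
    """Simulate control and treated arms for given HR, q2 and delta_x."""
    rng = random.Random(seed)
    lam2 = -math.log(q2) / T        # from q2 = exp(-lam2 * T)
    lam1 = hr * lam2                # HR = lam1 / lam2
    control = simulate_arm(n_per_arm, lam1, 0.0, T, rng)
    # Assumption: treated-arm mean is sqrt(2) * delta_x with unit variance.
    treated = simulate_arm(n_per_arm, lam2, math.sqrt(2) * delta_x, T, rng)
    return control, treated


control, treated = simulate_trial(50, hr=1.4, q2=0.6, delta_x=0.3)
print(sum(d for d, _, _ in control), sum(d for d, _, _ in treated))
```

Repeating this over many seeds and recording how often the test rejects gives the empirical power that is compared with the analytical formula.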

The results, given in Table 1, illustrate the accuracy of the analytical power formula (17). They indicate also that the weighted WMW test statistic is more powerful than the ordinary WMW test for the worst-rank score composite outcome. The largest differences are seen in two different scenarios:

  1. The standardized difference in the non-fatal outcome Δx is small (Δx < 0.3) and the difference in mortality is moderate or high (HR ≥ 1.2)

  2. The difference in mortality is small (HR < 1.2) and the standard difference in the non-fatal outcome Δx is moderate or high (Δx ≥ 0.3).

Table 1.

Power comparisons for a continuous outcome under proportional hazards for time to death.

| Δx | q2 = 60%: HR 1.0 | HR 1.2 | HR 1.4 | HR 1.6 | HR 2.0 | HR 2.4 | HR 3.0 | q2 = 80%: HR 1.0 | HR 1.2 | HR 1.4 | HR 1.6 | HR 2.0 | HR 2.4 | HR 3.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Analytical power for the weighted WMW test | | | | | | | | | | | | | | |
| 0.0 | 0.05ᵃ | 0.11 | 0.24 | 0.41 | 0.73 | 0.90 | 0.98 | 0.05ᵃ | 0.08 | 0.15 | 0.24 | 0.45 | 0.68 | 0.87 |
| 0.1 | 0.08 | 0.12 | 0.25 | 0.42 | 0.73 | 0.90 | 0.98 | 0.09 | 0.12 | 0.18 | 0.28 | 0.51 | 0.70 | 0.88 |
| 0.2 | 0.15 | 0.19 | 0.30 | 0.46 | 0.75 | 0.91 | 0.98 | 0.21 | 0.24 | 0.30 | 0.38 | 0.58 | 0.75 | 0.90 |
| 0.3 | 0.27 | 0.30 | 0.40 | 0.53 | 0.78 | 0.92 | 0.98 | 0.39 | 0.41 | 0.46 | 0.53 | 0.69 | 0.82 | 0.93 |
| 0.4 | 0.41 | 0.44 | 0.51 | 0.61 | 0.82 | 0.93 | 0.98 | 0.59 | 0.61 | 0.64 | 0.69 | 0.79 | 0.88 | 0.95 |
| 0.5 | 0.55 | 0.57 | 0.62 | 0.70 | 0.86 | 0.94 | 0.99 | 0.76 | 0.77 | 0.79 | 0.81 | 0.87 | 0.92 | 0.97 |
| 0.6 | 0.68 | 0.68 | 0.72 | 0.77 | 0.89 | 0.95 | 0.99 | 0.88 | 0.88 | 0.89 | 0.90 | 0.93 | 0.96 | 0.98 |
| (b) Empirical power for the weighted WMW test | | | | | | | | | | | | | | |
| 0.0 | 0.05ᵃ | 0.10 | 0.23 | 0.40 | 0.72 | 0.91 | 0.99 | 0.05ᵃ | 0.08 | 0.15 | 0.24 | 0.45 | 0.67 | 0.87 |
| 0.1 | 0.08 | 0.12 | 0.24 | 0.41 | 0.73 | 0.90 | 0.99 | 0.09 | 0.12 | 0.18 | 0.28 | 0.51 | 0.70 | 0.89 |
| 0.2 | 0.15 | 0.19 | 0.29 | 0.47 | 0.75 | 0.91 | 0.99 | 0.21 | 0.24 | 0.30 | 0.38 | 0.58 | 0.76 | 0.91 |
| 0.3 | 0.26 | 0.30 | 0.40 | 0.53 | 0.78 | 0.92 | 0.99 | 0.39 | 0.41 | 0.46 | 0.54 | 0.69 | 0.83 | 0.94 |
| 0.4 | 0.39 | 0.43 | 0.51 | 0.63 | 0.81 | 0.93 | 0.99 | 0.59 | 0.61 | 0.65 | 0.71 | 0.81 | 0.89 | 0.96 |
| 0.5 | 0.54 | 0.56 | 0.63 | 0.71 | 0.87 | 0.94 | 0.99 | 0.76 | 0.78 | 0.81 | 0.83 | 0.90 | 0.94 | 0.98 |
| 0.6 | 0.67 | 0.68 | 0.73 | 0.79 | 0.89 | 0.96 | 0.99 | 0.89 | 0.89 | 0.91 | 0.92 | 0.95 | 0.97 | 0.99 |
| (c) Empirical power for the ordinary WMW test in worst-rank scores | | | | | | | | | | | | | | |
| 0.0 | 0.05 | 0.09 | 0.17 | 0.31 | 0.62 | 0.84 | 0.98 | 0.05 | 0.06 | 0.09 | 0.13 | 0.29 | 0.48 | 0.74 |
| 0.1 | 0.06 | 0.12 | 0.22 | 0.38 | 0.67 | 0.87 | 0.98 | 0.08 | 0.11 | 0.17 | 0.24 | 0.42 | 0.61 | 0.82 |
| 0.2 | 0.07 | 0.16 | 0.30 | 0.44 | 0.74 | 0.90 | 0.99 | 0.14 | 0.21 | 0.29 | 0.37 | 0.56 | 0.72 | 0.89 |
| 0.3 | 0.12 | 0.22 | 0.37 | 0.53 | 0.78 | 0.93 | 0.99 | 0.26 | 0.33 | 0.43 | 0.53 | 0.70 | 0.83 | 0.94 |
| 0.4 | 0.16 | 0.29 | 0.44 | 0.59 | 0.82 | 0.94 | 0.99 | 0.40 | 0.50 | 0.59 | 0.66 | 0.81 | 0.89 | 0.96 |
| 0.5 | 0.22 | 0.36 | 0.52 | 0.66 | 0.86 | 0.96 | 0.99 | 0.57 | 0.66 | 0.73 | 0.79 | 0.89 | 0.95 | 0.98 |
| 0.6 | 0.30 | 0.44 | 0.59 | 0.70 | 0.88 | 0.97 | 0.99 | 0.71 | 0.78 | 0.84 | 0.88 | 0.94 | 0.97 | 0.99 |

ᵃ The weights are equal and fixed to 1.

We assumed the treatment is better either on both mortality and non-fatal outcome or on one outcome and not different from the control on the other outcome. We used exponential distributions for the survival times, normal distributions for the non-fatal outcome, and the same number of subjects in each group (m = n = 50). Δx: standardized mean difference on the non-fatal outcome of interest; HR: hazard ratio; q2: survival probability (proportion of patients alive) at three months in the treatment group. (a) Estimated using formula (17); (b) and (c) Proportion of simulated data sets for which |Zopt| > 1.96 and |Z| > 1.96, respectively.

Overall, these results mean that whenever the effect on the primary outcome is small, the larger difference in mortality is diluted when assessing the overall difference through the ordinary WMW, where mortality and the non-fatal outcome are weighted equally. Likewise, if the difference in mortality is small, but the difference in the non-fatal outcome is moderate or high, the ordinary WMW test on the composite outcome has less power than the weighted WMW.

4 Application to a stroke clinical trial

A clinical trial of NBO therapy was conducted at Massachusetts General Hospital for patients who had an acute ischemic stroke.1,2 In this trial, 85 patients were randomly assigned to either NBO therapy (43 patients) or to room air (control) for 8 h and assessed serially with clinical function scores. The primary efficacy and safety endpoints were, respectively, the mean change in NIHSS from baseline to 4 h (during therapy) and 24 hours (after therapy).1 For illustration purposes, we focused on the secondary endpoint and examined the mean change in NIHSS scores from baseline to three months or at discharge.

Twenty-four of the 85 patients died, 17 of whom were in the NBO group. Fifty-three patients (with 31 in the control group) were discharged prior to the three-month follow-up period. Subjects with missing three-month NIHSS scores were included in the estimation of the log rank test, but excluded in the assessment of the change in NIHSS scores. The log rank test of survival was significant (χ2 = 6 with 1 d.f., p = 0.016), indicating that the active treatment had an unfavorable effect on mortality. The ordinary WMW test applied to the survivors was not significant (W = 572.5, p = 0.27). Using the untied worst-rank composite endpoint of death times and NIHSS scores, we found a significant result with the ordinary WMW test (W = 1112.5, p = 0.01).

Finally, we applied the proposed method, estimating the weights and the test statistic Zw using B = 2000 bootstrap samples, following the bootstrap approach described in Remark 3 of Section 2.3.3. The estimated weight vector c′, the variance-covariance matrix Σ0 for U under the null, the mean difference μ, and the probability p were, respectively,

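$$c' = (0.45,\ 0.16,\ 0.24), \quad \Sigma_0 = \begin{pmatrix} 0.59 & 0.50 & 0.90 \\ 0.50 & 4.77 & 1.27 \\ 0.90 & 1.27 & 5.16 \end{pmatrix}, \quad \mu' = (0.016,\ 0.098,\ 0.073), \quad \text{and} \quad p = 0.283.$$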

This corresponds to w1 = 0.61 and w2 = 0.39, which means mortality was weighted more heavily (61 % of the weight) than NIHSS score, in addition to ranking death worse than any measure of the continuous outcome (NIHSS score). The optimally weighted WMW test statistic Zopt was equal to 3.42 with a corresponding p value of 6.2 × 10⁻⁴. This result is stronger than that from the ordinary WMW test as it captures the significant difference in mortality between the two treatment groups and demonstrates the efficiency of our test statistic.

5 Discussion

In this paper, we have generalized the notion of the WMW test for a worst-rank composite outcome by deriving the optimally weighted WMW test. Against the null hypothesis of no difference on both mortality and continuous endpoint, we have focused on the alternative hypothesis that “the active treatment has a preponderance of positive effects on the multiple outcomes considered, while not being harmful for any.”29 We have motivated the worst-rank composite outcome in the context of the clinical trial of a non-mortality primary outcome where the assessment of the primary outcome of interest at a pre-specified time-point may be precluded by death, any other debilitating event, or worsening of the disease condition. The corresponding composite outcome takes into account all patients enrolled in the trial, including those who had terminal events before the end of follow-up.

When there exists a hierarchy of the constituent endpoints of a composite outcome, the method we have presented in this paper enables different components of the WMW test statistic to be weighted differentially. Using weights allows for an additional level of discrimination between the component outcomes beyond ranks alone. While the worst-rank score mechanism pertains to how the different component outcomes of the composite endpoint are aggregated, assigning weights strengthens (or lessens) the influence these prioritized component outcomes exert in the overall composite. We considered weights obtained or elicited from expert judgments (utility weights) or determined in a way that the corresponding WMW test statistic has maximum power. Based on a U-statistic approach, we first provided the test statistic and the power of the weighted WMW test when utility (or severity) weights, determined a priori, are available. We also demonstrated that the ordinary (unweighted) WMW test on the worst-rank score outcome is a special case of the weighted WMW test, i.e. when the weights are all equal. Then, we derived the optimal weights such that the power of the corresponding weighted WMW test statistic is maximal. Finally, we conducted simulation studies to evaluate the accuracy of our power formula and confirmed, in the process, that the weighted WMW test is more powerful than the ordinary WMW test.

We applied the proposed method to the data from a clinical trial of NBO therapy for patients with acute ischemic stroke. Patients’ improvement was assessed using the National Institutes of Health Stroke Scale (NIHSS) scores. The results indicated a statistically significant difference between NBO therapy and room air, using either the proposed method or the ordinary WMW test on the worst-rank composite outcome of death and change in NIHSS, which we could not detect using the ordinary WMW on the survivors alone.

The difference between NBO therapy and room air was driven by the difference in mortality since there was a disproportionate number of NBO-treated patients who died. It is actually for this reason that the trial was stopped by the Data and Safety Monitoring Board (DSMB) after 85 patients out of the projected 240 were enrolled. The stark imbalance between the two treatment groups, although not attributed to the treatment, made it untenable to continue the trial.1,30

The end result of the NBO trial is one of the dreaded scenarios in the (traditional) analysis of composite endpoints. That the active treatment must be better than the control for one or both of the constituent outcomes (mortality and non-fatal outcome) and not worse for either of them, as suggested by our alternative hypothesis H1 (stated in equation (4)), was clearly not the case for the NBO trial. While the active treatment was equivalent to the control treatment in change in NIHSS, the data showed also that NBO therapy increased mortality. Ideally, components of a composite endpoint should have similar clinical importance, frequency, and treatment effect. However, this is rarely the case as outcomes of different levels of severity are usually combined. To facilitate the interpretation of such results, several authors have suggested running complementary analyses on components of the composite outcome.31–38

When the impact of the active treatment on mortality is of greater clinical importance than its effect on the primary outcome of interest, the weighted WMW test statistic we have presented can be included into a set of testing procedures that ensure that the treatment is not inferior on both mortality and the outcome of interest and that it is superior on at least one of these endpoints. In the context of ischemic stroke, the clinical investigators desired a treatment that would have a positive impact on mortality while also improving survivors’ functional outcomes. Testing procedures that incorporate contributions of each individual component of the composite while penalizing for any disadvantage in the active treatment when the treatment operates in opposite directions on the components of the composite outcome have been discussed.39–42 For the analysis of the NBO clinical trial, we propose two different stepwise procedures to analyze data using this weighted test: (1) two individual non-inferiority tests on mortality and non-fatal outcome followed (if non-inferiority is established) by a global test using the optimal-weighted WMW test on the worst-rank composite endpoint; or (2) a global test using the optimal-weighted WMW test on the worst-rank composite endpoint, and then (if the global test is significant) two individual non-inferiority tests followed by individual superiority tests on mortality and non-fatal outcome. In either scenario, the overall type I error is preserved.39,40,43,44

The method presented in this paper can be applied or extended to many other settings of composite endpoints beyond the realm of death-censored observations. The rationale, advantages (and limitations), and recommendations for using composite outcomes (based on clinical information, expert knowledge or practical matters) abound in the literature.14,35,45 One can also accommodate ties as well as noninformative censoring in the definition of the WMW U-statistic. In particular, when non-informative censoring is present (and, without loss of generality, assuming there are no ties), survival times can be assessed using Gehan’s U-statistic, which is an extension of the WMW U-statistic to right censored data.46 In this case, I(t1k < t2l) will be equal to 1 if subject l in group 2 lived longer than subject k in group 1 and 0 if it is uncertain which subject lived longer.
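The censoring-aware indicator described above can be sketched as a small Python function. The name `gehan_indicator` and the argument layout are assumptions for illustration; the rule encoded is that subject l in group 2 is known to have lived longer than subject k in group 1 only when subject k's death was actually observed and occurred before subject l's last recorded time, whether that time is a death or a censoring time.

```python
def gehan_indicator(t1, event1, t2):
    """Indicator replacing I(t1k < t2l) under right censoring.

    t1: recorded time for the group-1 subject
    event1: True if t1 is an observed death, False if it is a censoring time
    t2: recorded time (death or censoring) for the group-2 subject
    Returns 1 only when the group-2 subject is known to have lived longer.
    """
    return 1 if (event1 and t1 < t2) else 0


print(gehan_indicator(1.0, True, 2.0))   # death at 1.0, other subject seen at 2.0
print(gehan_indicator(1.0, False, 2.0))  # censored at 1.0: ordering is uncertain
print(gehan_indicator(2.0, True, 1.0))   # group-2 subject not known to outlive
```

The censoring status of the group-2 subject does not enter the rule, because being observed beyond t1 is enough to establish the ordering once the group-1 death is known.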

Our method can be applied in many disease areas in which different outcomes are clinically related and represent the manifestation of the same underlying condition. Clinical trials of unstable angina and non-ST segment elevation myocardial infarction are examples of such an application.47,48 The method can also be applied in clinical trials where the overall effect of treatment on a disease depends on a hierarchy of meaningful yet heterogeneous outcomes of differing importance, magnitude, and impact. For instance, in clinical trials of asthma or of benign prostatic hyperplasia (BPH), several outcomes are necessary to capture the multifaceted manifestations of the disease. For patients with asthma, four outcomes (forced expiratory volume in 1 second (FEV1), peak expiratory flow (PEF) rate, symptom score, and additional rescue medication use) are necessary to measure the different manifestations of the disease.49 Due to the subjective nature of BPH symptoms, in addition to the BPH symptom score index, measures to assess disease progression include: prostate specific antigen (PSA), urinary cytology, post-void residual volume (PVR), urine flow rate, cystoscopy, urodynamic pressure-flow study, and ultrasound of the kidney or the prostate.

Our method does not immediately apply to the case where the treatment effect is assessed by stratifying for a confounding variable (baseline scores, baseline disease severity, age,…) pre-specified in the study design.50,51 For the NBO trial, had the investigators anticipated the imbalance between subjects on some baseline variables (e.g., large infarcts, advanced age, co-morbidities, and most importantly, withdrawal of care based on pre-expressed wishes or family preference), they could have stratified the study population with respect to these variables.1,30 The test statistic we have proposed does not adjust for such baseline covariates, because the appropriately weighted WMW test for this case must take into account the stratum-specific characteristics in addition to the specificities of the worst-ranking procedure; this is a topic for future investigation.

A strong case may be made for preferring analysis of covariance to the analysis of change from baseline score used in this paper.52 In reality, the issues are more nuanced, and the approach to use depends closely on the nature of the data as well as the clinical question of interest.53–58 For the difference in NIHSS scores (from baseline to three months), the fundamental question of interest was “on average, how much did NBO-treated patients change over the three-month period compared to patients assigned to room air?” The change-from-baseline-score paradigm assumes that the same measure is used before and after the treatment and that these two measures are highly correlated.59,60 In the stroke literature, it is well documented that change from baseline in NIHSS satisfies this assumption, since baseline NIHSS is a strong predictor of outcome after stroke.61,62 Moreover, it has been shown that change in the NIHSS score is a useful tool to measure treatment effect in acute stroke trials (see for instance the papers by Bruno et al.63 and by Parsons et al.64). This justifies the choice of improvement (or change) in NIHSS score as the outcome of interest in this paper.

We have assumed throughout this paper that mortality is worse than any impact ischemic stroke may have on patients. Our assumption stems from the common view that ranks death as worse than any quality-of-life outcome; such a view is advocated in several medical fields.7,8,65–70 However, some people (patients, their family members or caregivers) may argue otherwise and affirm that there are levels of stroke that are worse than death. For instance, in a study of the effects of thrombolytic therapy in reducing damage from a myocardial infarction, the hierarchy of the quality of component outcomes was “stroke resulting in a vegetative state, death, serious morbidity requiring major assistance, serious morbidity but capable of self-care, excess spontaneous hemorrhage (≥ 3 blood transfusions), and 1–2 transfusions”.10 There are a number of papers in the causal inference literature that offer an alternative approach based on Rosenbaum’s proposal of using different “placements of death”.71 However, as Rubin72 pointed out, this elegant idea “may be difficult to convey to consumers” and we have not pursued this avenue here.

Finally, the null hypothesis (3) for the WMW test stipulates that the treatment does not change the outcome distribution, which means that the treatment has no effect on any patient. However, some studies may require a weaker version of the null hypothesis, i.e. that the treatment does not affect the average group response.73,74 In such a case, the WMW test is not asymptotically valid for the weaker null hypothesis.75,76 As an alternative, one can use the Brunner and Munzel test,77 where the marginal distribution functions of the two treatment groups are not assumed to be equal and may have different shapes, even under the null hypothesis. In this paper, we have chosen the WMW test because it is simple, widely used, efficient, and robust against parametric distributional assumptions. The use of a weighted Brunner-Munzel test for analysis of the worst-rank composite outcome of death and a quality-of-life measure (such as the NIHSS score) warrants further investigation and is beyond the scope of this paper.

Acknowledgments

The content of this paper is solely the responsibility of the authors and does not necessarily represent the official view of the National Institutes of Health.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by grants P50-NS051343, R01-CA075971, T32 NS048005, and UL1TR001117 awarded by the National Institutes of Health. This work was also supported by grants 1R01HL118336-01 (PI: Anastasios Tsiatis) and R01-NS051412 awarded by the National Institutes of Health.

Appendix 1. Mean and variance of the U-statistic

Consider the untied worst-rank adjusted values for subjects in the control and active treatment groups, $\tilde X_{1k} = (1-\delta_{1k})X_{1k} + \delta_{1k}(\eta + t_{1k})$, for k = 1,…, m and $\tilde X_{2l} = (1-\delta_{2l})X_{2l} + \delta_{2l}(\eta + t_{2l})$, for l = 1,…, n.

Define the WMW U-statistic

$$U = (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n} U_{kl}, \quad \text{where } U_{kl} = I(\tilde X_{1k} < \tilde X_{2l}).$$

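Since $U_{kl} = 1$ if $\{t_{1k} < t_{2l} \text{ and } \delta_{1k}\delta_{2l} = 1\}$, $\{\delta_{1k} = 1 \text{ and } \delta_{2l} = 0\}$, or $\{X_{1k} < X_{2l} \text{ and } \delta_{1k} = \delta_{2l} = 0\}$, we have $U_{kl} = I(t_{1k} < t_{2l}, \delta_{1k}\delta_{2l} = 1) + I(\delta_{1k} = 1, \delta_{2l} = 0) + I(X_{1k} < X_{2l}, \delta_{1k} = \delta_{2l} = 0)$.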
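The three-case decomposition of the pairwise indicator can be illustrated with a short sketch. The tuple layout (death indicator, death time, non-fatal outcome), the function names, the follow-up time T = 2, and the shift η = min(X) − 1 − T (chosen so that every death ranks below every survivor) are assumptions made for this example.

```python
def u_kl(s1, s2):
    """Pairwise indicator from the three-case decomposition.

    Each subject is a hypothetical tuple (delta, t, x): delta = 1 if the
    subject died before T, t is the death time (ignored for survivors) and
    x is the non-fatal outcome (ignored for deaths).
    """
    d1, t1, x1 = s1
    d2, t2, x2 = s2
    if d1 == 1 and d2 == 1:      # both died: earlier death ranks worse
        return int(t1 < t2)
    if d1 == 1 and d2 == 0:      # group-1 died, group-2 survived
        return 1
    if d1 == 0 and d2 == 0:      # both survived: compare non-fatal outcomes
        return int(x1 < x2)
    return 0                     # group-1 survived, group-2 died


def wmw_u(group1, group2, score=u_kl):
    """U = (mn)^{-1} * sum of the pairwise indicators."""
    total = sum(score(a, b) for a in group1 for b in group2)
    return total / (len(group1) * len(group2))


def worst_rank(s, eta):
    """Worst-rank composite value: eta + t for deaths, x for survivors."""
    d, t, x = s
    return eta + t if d == 1 else x


T = 2.0  # assumed follow-up time
group1 = [(1, 0.5, 0.0), (0, 0.0, 3.0), (0, 0.0, 7.0)]
group2 = [(1, 1.0, 0.0), (0, 0.0, 5.0)]

# Assumed shift placing every death below every observed survivor outcome.
eta = min(x for d, _, x in group1 + group2 if d == 0) - 1.0 - T

u_cases = wmw_u(group1, group2)
u_composite = wmw_u(
    group1, group2,
    score=lambda a, b: int(worst_rank(a, eta) < worst_rank(b, eta)),
)
print(u_cases, u_composite)
```

Both routes give the same value on these data, which is the point of the decomposition: comparing the composite values is equivalent to applying the three cases pair by pair.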

Therefore

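$$\begin{aligned}
E(U) = E(U_{kl}) &= P(t_{1k}<t_{2l}\mid \delta_{1k}\delta_{2l}=1)\,P(\delta_{1k}\delta_{2l}=1) + P(\delta_{1k}=1,\delta_{2l}=0) + P(X_{1k}<X_{2l})\,P(\delta_{1k}=\delta_{2l}=0)\\
&= p_1p_2\,P(t_{1k}<t_{2l}\mid \delta_{1k}=\delta_{2l}=1) + p_1q_2 + q_1q_2\,P(X_{1k}<X_{2l})\\
&= p_1p_2\pi_{t1} + p_1q_2 + q_1q_2\pi_{x1} = \pi_{U1}
\end{aligned} \tag{20}$$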

where q1 = 1 − p1, q2 = 1 − p2, πt1 = P(t1k < t2l | δ1k = δ2l = 1), and πx1 = P(X1k < X2l).

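$$\begin{aligned}
\mathrm{Var}(U) &= (mn)^{-2}\left[\sum_{k=1}^{m}\sum_{l=1}^{n}\mathrm{Var}(U_{kl}) + \sum_{k=1}^{m}\sum_{l=1}^{n}\sum_{k'=1}^{m}\sum_{l'=1}^{n}\mathrm{Cov}(U_{kl},U_{k'l'})\right], \quad \text{with } k\neq k' \text{ or } l\neq l' \text{ or both}\\
&= (mn)^{-1}\left[\mathrm{Var}(U_{kl}) + (m-1)\,\mathrm{Cov}(U_{kl},U_{k'l}) + (n-1)\,\mathrm{Cov}(U_{kl},U_{kl'})\right]
\end{aligned}$$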

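Note that $\mathrm{Cov}(U_{kl}, U_{k'l'}) = E(U_{kl}U_{k'l'}) - E(U_{kl})E(U_{k'l'}) = 0$, $\mathrm{Cov}(U_{kl}, U_{k'l}) = E(U_{kl}U_{k'l}) - E(U_{kl})E(U_{k'l})$, and $\mathrm{Cov}(U_{kl}, U_{kl'}) = E(U_{kl}U_{kl'}) - E(U_{kl})E(U_{kl'})$, for $k \neq k'$, $l \neq l'$. In addition, because $U_{kl} = I(\tilde X_{1k} < \tilde X_{2l})$ follows a Bernoulli distribution with probability $\pi_{U1}$, we derive the variance $\mathrm{Var}(U_{kl}) = E(U_{kl})[1 - E(U_{kl})] = \pi_{U1}(1 - \pi_{U1})$.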

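$$\begin{aligned}
E(U_{kl}U_{k'l}) &= P(U_{kl}U_{k'l}=1)\\
&= P(\delta_{1k}\delta_{1k'}=1,\delta_{2l}=0) + P(t_{1k}<t_{2l},\,t_{1k'}<t_{2l}\mid \delta_{1k}\delta_{1k'}\delta_{2l}=1)\,P(\delta_{1k}\delta_{1k'}\delta_{2l}=1)\\
&\quad + P(X_{1k'}<X_{2l})\,P(\delta_{1k}=1,\delta_{1k'}=\delta_{2l}=0) + P(X_{1k}<X_{2l})\,P(\delta_{1k}=0,\delta_{1k'}=1,\delta_{2l}=0)\\
&\quad + P(X_{1k}<X_{2l},\,X_{1k'}<X_{2l})\,P(\delta_{1k}=\delta_{1k'}=\delta_{2l}=0)\\
&= p_1^2q_2 + p_1^2p_2\pi_{t2} + 2p_1q_1q_2\pi_{x1} + q_1^2q_2\pi_{x2}
\end{aligned}$$

$$\begin{aligned}
E(U_{kl}U_{kl'}) &= P(U_{kl}U_{kl'}=1)\\
&= P(\delta_{1k}=1,\delta_{2l}=\delta_{2l'}=0) + P(t_{1k}<t_{2l},\,t_{1k}<t_{2l'}\mid \delta_{1k}\delta_{2l}\delta_{2l'}=1)\,P(\delta_{1k}\delta_{2l}\delta_{2l'}=1)\\
&\quad + P(t_{1k}<t_{2l}\mid \delta_{1k}\delta_{2l}=1,\delta_{2l'}=0)\,P(\delta_{1k}\delta_{2l}=1,\delta_{2l'}=0)\\
&\quad + P(t_{1k}<t_{2l'}\mid \delta_{1k}=1,\delta_{2l}=0,\delta_{2l'}=1)\,P(\delta_{1k}=1,\delta_{2l}=0,\delta_{2l'}=1)\\
&\quad + P(X_{1k}<X_{2l},\,X_{1k}<X_{2l'})\,P(\delta_{1k}=\delta_{2l}=\delta_{2l'}=0)\\
&= p_1q_2^2 + p_1p_2^2\pi_{t3} + 2p_1p_2q_2\pi_{t1} + q_1q_2^2\pi_{x3}
\end{aligned}$$

with

$$\begin{aligned}
\pi_{t2} &= P(t_{1k}<t_{2l},\,t_{1k'}<t_{2l}\mid \delta_{1k}=\delta_{1k'}=\delta_{2l}=1), & \pi_{x2} &= P(X_{1k}<X_{2l},\,X_{1k'}<X_{2l}),\\
\pi_{t3} &= P(t_{1k}<t_{2l},\,t_{1k}<t_{2l'}\mid \delta_{1k}=\delta_{2l}=\delta_{2l'}=1), & \pi_{x3} &= P(X_{1k}<X_{2l},\,X_{1k}<X_{2l'}).
\end{aligned}$$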

In summary

$$\mathrm{Var}(U) = (mn)^{-1}\left[\pi_{U1}(1-\pi_{U1}) + (m-1)(\pi_{U2}-\pi_{U1}^2) + (n-1)(\pi_{U3}-\pi_{U1}^2)\right] \tag{21}$$

where $\pi_{U2} = p_1^2q_2 + p_1^2p_2\pi_{t2} + 2p_1q_1q_2\pi_{x1} + q_1^2q_2\pi_{x2}$ and $\pi_{U3} = p_1q_2^2 + p_1p_2^2\pi_{t3} + 2p_1p_2q_2\pi_{t1} + q_1q_2^2\pi_{x3}$.

Under the null hypothesis of no difference between the two groups, with respect to survival and non-fatal outcome, we have F1 = F2 = F, G1 = G2 = G, and p1 = p2 = p, q1 = q2 = q. This implies

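$$\begin{aligned}
\pi_{t1} &= P(t_{1k}<t_{2l}\mid t_{1k}\le T,\,t_{2l}\le T) = \frac{1}{p^2}\int_0^T F(t)\,dF(t) = \frac{1}{2p^2}\left[F(T)^2 - F(0)^2\right] = \frac12\\
\pi_{t2} &= P(t_{1k}<t_{2l},\,t_{1k'}<t_{2l}\mid t_{1k}\le T,\,t_{1k'}\le T,\,t_{2l}\le T) = \frac{1}{p^3}\int_0^T F(t)^2\,dF(t) = \frac{1}{3p^3}\left[F(T)^3 - F(0)^3\right] = \frac13\\
\pi_{t3} &= P(t_{1k}<t_{2l},\,t_{1k}<t_{2l'}\mid t_{1k}\le T,\,t_{2l}\le T,\,t_{2l'}\le T) = \frac{1}{p^3}\int_0^T \left[F(T)-F(t)\right]^2 dF(t) = \frac{1}{3p^3}\left[F(T)-F(0)\right]^3 = \frac13\\
\pi_{x1} &= P(X_{1k}<X_{2l}) = \int_{-\infty}^{\infty} G(x)\,dG(x) = \frac12\left[G(x)^2\right]_{-\infty}^{\infty} = \frac12\\
\pi_{x2} &= P(X_{1k}<X_{2l},\,X_{1k'}<X_{2l}) = \int_{-\infty}^{\infty} G(x)^2\,dG(x) = \frac13\left[G(x)^3\right]_{-\infty}^{\infty} = \frac13\\
\pi_{x3} &= P(X_{1k}<X_{2l},\,X_{1k}<X_{2l'}) = \int_{-\infty}^{\infty} \left[1-G(x)\right]^2 dG(x) = -\frac13\left\{\left[1-G(x)\right]^3\right\}_{-\infty}^{\infty} = \frac13
\end{aligned}$$

Here $F(0) = 0$ and $F(T) = p$.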

Therefore

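$$\begin{aligned}
\pi_{U1} &= p_1p_2\pi_{t1} + p_1q_2 + q_1q_2\pi_{x1} = \tfrac12p^2 + pq + \tfrac12q^2 = \tfrac12(p+q)^2 = \tfrac12\\
\pi_{U2} &= p_1^2q_2 + p_1^2p_2\pi_{t2} + 2p_1q_1q_2\pi_{x1} + q_1^2q_2\pi_{x2} = p^2q + \tfrac13p^3 + pq^2 + \tfrac13q^3 = \tfrac13(p+q)^3 = \tfrac13\\
\pi_{U3} &= p_1q_2^2 + p_1p_2^2\pi_{t3} + 2p_1p_2q_2\pi_{t1} + q_1q_2^2\pi_{x3} = pq^2 + \tfrac13p^3 + p^2q + \tfrac13q^3 = \tfrac13(p+q)^3 = \tfrac13.
\end{aligned}$$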

The mean and variance become

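$$\begin{aligned}
\mu_0 &= E_0(U) = \pi_{U1} = \frac12\\
\sigma_0^2 &= \mathrm{Var}_0(U) = (mn)^{-1}\left[\pi_{U1}(1-\pi_{U1}) + (m-1)(\pi_{U2}-\pi_{U1}^2) + (n-1)(\pi_{U3}-\pi_{U1}^2)\right]\\
&= (mn)^{-1}\left[\frac12\left(1-\frac12\right) + (m-1)\left(\frac13-\left(\frac12\right)^2\right) + (n-1)\left(\frac13-\left(\frac12\right)^2\right)\right]\\
&= (mn)^{-1}\left[\frac14 + \frac{1}{12}(m-1) + \frac{1}{12}(n-1)\right] = \frac{m+n+1}{12mn}
\end{aligned}$$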
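The reduction of equation (21) to (m + n + 1)/(12mn) can be checked numerically. The sketch below is illustrative only; the function names are assumptions. The term of πU3 in which all three subjects died is coded as p1·p2²·πt3, which is the form that sums to (p + q)³/3 under the null.

```python
def pi_u(p1, p2, pt1, pt2, pt3, px1, px2, px3):
    """Return (pi_U1, pi_U2, pi_U3) from the component probabilities."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    u1 = p1 * p2 * pt1 + p1 * q2 + q1 * q2 * px1
    u2 = p1**2 * q2 + p1**2 * p2 * pt2 + 2 * p1 * q1 * q2 * px1 + q1**2 * q2 * px2
    u3 = p1 * q2**2 + p1 * p2**2 * pt3 + 2 * p1 * p2 * q2 * pt1 + q1 * q2**2 * px3
    return u1, u2, u3


def var_u(m, n, p1, p2, pt1, pt2, pt3, px1, px2, px3):
    """Variance of U as in equation (21)."""
    u1, u2, u3 = pi_u(p1, p2, pt1, pt2, pt3, px1, px2, px3)
    return (u1 * (1 - u1) + (m - 1) * (u2 - u1**2) + (n - 1) * (u3 - u1**2)) / (m * n)


# Null configuration: equal mortality, pairwise probabilities 1/2 and 1/3.
m, n, p = 20, 30, 0.3
null_args = (p, p, 0.5, 1 / 3, 1 / 3, 0.5, 1 / 3, 1 / 3)
null_mean = pi_u(*null_args)[0]
null_var = var_u(m, n, *null_args)
closed_form = (m + n + 1) / (12 * m * n)
print(null_mean, null_var, closed_form)
```

The null variance does not depend on the common mortality p, so any value of p between 0 and 1 should reproduce the closed form up to rounding error.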

Appendix 2. Mean and variance of the weighted U-statistic

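Consider the weights w = (w1, w2) and define the vector $c = (c_1, c_2, c_3)' = (w_1^2, w_1w_2, w_2^2)'$. Let $\tilde X_{1k} = w_1\delta_{1k}(\eta + t_{1k}) + w_2(1-\delta_{1k})X_{1k}$, for k = 1,…, m and $\tilde X_{2l} = w_1\delta_{2l}(\eta + t_{2l}) + w_2(1-\delta_{2l})X_{2l}$, for l = 1,…, n.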

We define the weighted WMW U-statistic by $c'U$, where $U' = (U_t, U_{tx}, U_x)$ and

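$$\begin{aligned}
U_t &= (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n}\delta_{1k}\delta_{2l}\,I(t_{1k}<t_{2l})\\
U_{tx} &= (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n}\delta_{1k}(1-\delta_{2l})\\
U_x &= (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n}(1-\delta_{1k})(1-\delta_{2l})\,I(X_{1k}<X_{2l})
\end{aligned} \tag{22}$$

The expectation of U is

$$\begin{aligned}
E(U) &= \big(P(\delta_{1k}=1)P(\delta_{2l}=1)P(t_{1k}<t_{2l}\mid\delta_{1k}=\delta_{2l}=1),\; P(\delta_{1k}=1)P(\delta_{2l}=0),\; P(\delta_{1k}=0)P(\delta_{2l}=0)P(X_{1k}<X_{2l})\big)'\\
&= \big(p_1p_2\,P(t_{1k}<t_{2l}\mid\delta_{1k}=\delta_{2l}=1),\; p_1q_2,\; q_1q_2\,P(X_{1k}<X_{2l})\big)'\\
&= (p_1p_2\pi_{t1},\; p_1q_2,\; q_1q_2\pi_{x1})'
\end{aligned} \tag{23}$$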

where q1 = 1 − p1, q2 = 1 − p2, πt1 = P(t1k < t2l | δ1k = δ2l = 1) and πx1 = P(X1k < X2l). The covariance matrix is Var(U) = Σ, where $\Sigma = (mn)^{-1}(\Sigma_{ij})_{1\le i,j\le 3}$ is a 3 × 3 matrix such that

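$$\begin{aligned}
\Sigma_{11} &= mn\,E\left[(U_t - p_1p_2\pi_{t1})^2\right]\\
&= p_1p_2\left[\pi_{t1}(1-\pi_{t1}) + p_1(m-1)(\pi_{t2}-\pi_{t1}^2) + p_2(n-1)(\pi_{t3}-\pi_{t1}^2) + \pi_{t1}^2\big(mp_1q_2 + (n-1)p_2q_1 + q_1\big)\right]\\
\Sigma_{12} = \Sigma_{21} &= mn\,E\left[(U_t - p_1p_2\pi_{t1})(U_{tx} - p_1q_2)\right] = \pi_{t1}\,p_1p_2q_2\left[(n-1)q_1 - mp_1\right]\\
\Sigma_{13} = \Sigma_{31} &= mn\,E\left[(U_t - p_1p_2\pi_{t1})(U_x - q_1q_2\pi_{x1})\right] = -\pi_{t1}\pi_{x1}(m+n-1)\,p_1q_1p_2q_2\\
\Sigma_{22} &= mn\,E\left[(U_{tx} - p_1q_2)^2\right] = p_1q_2\left[mp_1p_2 + (n-1)q_1q_2 + q_1\right]\\
\Sigma_{23} = \Sigma_{32} &= mn\,E\left[(U_{tx} - p_1q_2)(U_x - q_1q_2\pi_{x1})\right] = \pi_{x1}\,p_1q_1q_2\left[(m-1)p_2 - nq_2\right]\\
\Sigma_{33} &= mn\,E\left[(U_x - q_1q_2\pi_{x1})^2\right]\\
&= q_1q_2\left[\pi_{x1}(1-\pi_{x1}) + q_1(m-1)(\pi_{x2}-\pi_{x1}^2) + q_2(n-1)(\pi_{x3}-\pi_{x1}^2) + \pi_{x1}^2\big(mq_1p_2 + (n-1)q_2p_1 + p_1\big)\right]
\end{aligned}$$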

Therefore

$$\mathrm{Var}(c'U) = c'\Sigma c$$

Under the null hypothesis of no difference between the two groups, with respect to both survival and non-fatal outcome, we have p1 = p2 = p, q1 = q2 = q = 1 − p, πt1 = πx1 = 1/2, and πt2 = πx2 = πt3 = πx3 = 1/3. Thus

$$E_0(U) = \frac12\,(p^2,\; 2pq,\; q^2)' \quad\text{and}\quad \mathrm{Var}_0(U) = \Sigma_0 \tag{24}$$

where $\Sigma_0 = (mn)^{-1}(\Sigma_{0ij})_{1\le i,j\le 3}$ is a symmetric matrix with

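$$\begin{aligned}
\Sigma_{011} &= \frac{p^2}{12}A(p), & \Sigma_{012} = \Sigma_{021} &= \frac{p^2q}{2}\big((n-1)q - mp\big), & \Sigma_{013} = \Sigma_{031} &= -\frac{p^2q^2}{4}(n+m-1),\\
\Sigma_{022} &= pq\,(nq^2 + mp^2 + pq), & \Sigma_{023} = \Sigma_{032} &= \frac{pq^2}{2}\big((m-1)p - nq\big), & \Sigma_{033} &= \frac{q^2}{12}A(q),
\end{aligned}$$

$$A(x) = 6 + 4(n+m-2)x - 3(n+m-1)x^2.$$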

Moreover, since $\mathrm{Var}_0(c'U) = c'\Sigma_0c \ge 0$ by definition, the matrix Σ0 is positive semi-definite. In practice, p is estimated by the pooled sample proportion $\hat p = (mp_1 + np_2)/(m+n)$ and both E0(U) and Var0(U) are calculated accordingly.
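As an illustrative sketch, the null mean vector and covariance matrix can be assembled directly from the closed-form entries. The function names are assumptions, and the signs of the off-diagonal entries are taken from a direct covariance calculation for the three component statistics. A convenient sanity check is that the choice c = (1, 1, 1) recovers the unweighted statistic, so the sum of all entries of Σ0 should equal (m + n + 1)/(12mn).

```python
def a_poly(x, m, n):
    """A(x) = 6 + 4(n + m - 2)x - 3(n + m - 1)x^2."""
    return 6 + 4 * (n + m - 2) * x - 3 * (n + m - 1) * x**2


def null_moments(m, n, p):
    """Return (E0(U), Sigma0) for the vector U = (Ut, Utx, Ux) under the null."""
    q = 1.0 - p
    mean = [p * p / 2, p * q, q * q / 2]
    s11 = p * p / 12 * a_poly(p, m, n)
    s12 = p * p * q / 2 * ((n - 1) * q - m * p)
    s13 = -(p * p * q * q / 4) * (n + m - 1)
    s22 = p * q * (n * q * q + m * p * p + p * q)
    s23 = p * q * q / 2 * ((m - 1) * p - n * q)
    s33 = q * q / 12 * a_poly(q, m, n)
    scale = 1.0 / (m * n)  # Sigma0 = (mn)^{-1} (Sigma0_ij)
    cov = [
        [scale * s11, scale * s12, scale * s13],
        [scale * s12, scale * s22, scale * s23],
        [scale * s13, scale * s23, scale * s33],
    ]
    return mean, cov


m, n, p = 20, 30, 0.3
mean0, sigma0 = null_moments(m, n, p)

# With c = (1, 1, 1), c' Sigma0 c is the sum of all entries of Sigma0.
unweighted_var = sum(sum(row) for row in sigma0)
print(sum(mean0), unweighted_var, (m + n + 1) / (12 * m * n))
```

In practice p would be replaced by the pooled estimate described above; the fixed value 0.3 is used here only to keep the example self-contained.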

Appendix 3. Optimal weights

From equation (17), we have

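$$\mu_{1w} - \mu_{0w} = c_1\left(\pi_{t1}p_1p_2 - \tfrac12p^2\right) + c_2\,(p_1q_2 - pq) + c_3\left(\pi_{x1}q_1q_2 - \tfrac12q^2\right) \equiv c'\mu$$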

where $c = (c_1, c_2, c_3)'$, $c_1 + 2c_2 + c_3 = 1$, $\mu = \left(\pi_{t1}p_1p_2 - \tfrac12p^2,\; p_1q_2 - pq,\; \pi_{x1}q_1q_2 - \tfrac12q^2\right)'$, and p is estimated by $\hat p = (mp_1 + np_2)/(m+n)$.

We assume that det(Σ0) > 0, i.e. Σ0 is positive-definite. Maximizing $|\mu_{1w} - \mu_{0w}|/\sigma_{0w}$ with respect to c, subject to $c_1 + 2c_2 + c_3 = 1$, corresponds to maximizing the Lagrange function

$$O(c,\lambda) = |c'\mu|\,(c'\Sigma_0c)^{-\frac12} - \lambda\,(c'b - 1)$$

with respect to the vector c and λ where λ is the Lagrange multiplier and b′ = (1, 2, 1)

Let $K(c) = \mathrm{sign}(c'\mu)\left[(c'\Sigma_0c)^{-\frac32}\right]$. We have

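$$\frac{\partial}{\partial c}O(c,\lambda) = K(c)\left[(c'\Sigma_0c)\,\mu - (\Sigma_0c)(c'\mu)\right] - \lambda b = 0 \tag{25}$$

$$\frac{\partial}{\partial\lambda}O(c,\lambda) = -(c'b - 1) = 0 \tag{26}$$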

From equations (25) and (26), we have

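$$0 = c'\left\{K(c)\left[(c'\Sigma_0c)\,\mu - (\Sigma_0c)(c'\mu)\right] - \lambda b\right\} = K(c)\left[(c'\Sigma_0c)\,c'\mu - (c'\Sigma_0c)(c'\mu)\right] - \lambda c'b = -\lambda$$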

because both (c′Σ0c) and (c′μ) are scalars and c′b = c1+2c2+c3 = 1.

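Then equation (25) implies $(c'\Sigma_0c)\,\mu = (\Sigma_0c)(c'\mu)$, i.e. $\mu = \dfrac{(\Sigma_0c)(c'\mu)}{(c'\Sigma_0c)} = \Sigma_0\,\dfrac{(c'\mu)}{(c'\Sigma_0c)}\,c$. Since we assume that the matrix $\Sigma_0^{-1}$ exists, this implies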

$$\Sigma_0^{-1}\mu = \frac{(c'\mu)}{(c'\Sigma_0c)}\,c \tag{27}$$

and thus, $b'\Sigma_0^{-1}\mu = \dfrac{(c'\mu)}{(c'\Sigma_0c)}\,b'c = \dfrac{(c'\mu)}{(c'\Sigma_0c)}$.

Replacing $\dfrac{(c'\mu)}{(c'\Sigma_0c)}$ by $b'\Sigma_0^{-1}\mu$ in equation (27) yields $\Sigma_0^{-1}\mu = (b'\Sigma_0^{-1}\mu)\,c$. Therefore, the optimal weight-vector is

$$c_{opt} = \frac{\Sigma_0^{-1}\mu}{b'\Sigma_0^{-1}\mu} \tag{28}$$

as long as $b'\Sigma_0^{-1}\mu \neq 0$. In addition

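$$\begin{aligned}
\frac{\partial^2}{\partial c\,\partial c'}\big[O(c)\big]_{c=c_{opt}} &= \mathrm{sign}(c'\mu)\,(c'\Sigma_0c)^{-\frac32}\left[2\mu\,(c'\Sigma_0) - (\Sigma_0c)\,\mu' - \Sigma_0\,(c'\mu)\right]_{c=c_{opt}}\\
&\quad - 3\,\mathrm{sign}(c'\mu)\,(c'\Sigma_0c)^{-\frac52}\left[(c'\Sigma_0c)\,\mu - (\Sigma_0c)(c'\mu)\right](\Sigma_0c)'\,\Big|_{c=c_{opt}}\\
&= (b'\Sigma_0^{-1}\mu)^2\,(\mu'\Sigma_0^{-1}\mu)^{-\frac32}\left[\mu\mu' - (\mu'\Sigma_0^{-1}\mu)\,\Sigma_0\right]
\end{aligned}$$

The second term vanishes at $c = c_{opt}$ because its bracket is the gradient condition (25) with λ = 0, and the last line uses $\Sigma_0c_{opt} = \mu/(b'\Sigma_0^{-1}\mu)$ together with $c_{opt}'\Sigma_0c_{opt} = \mu'\Sigma_0^{-1}\mu/(b'\Sigma_0^{-1}\mu)^2$.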

Since Σ0 is positive definite, we can show that the border-preserving principal minors of order k > 2 have sign $(-1)^k$. Therefore, $c = c_{opt} = \dfrac{\Sigma_0^{-1}\mu}{b'\Sigma_0^{-1}\mu}$ maximizes O(c).

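Let us define two vectors, $d_1 = (1,1,0)'$ and $d_2 = b - d_1 = (0,1,1)'$. To calculate w1 and w2, we just need to consider the relationships $c = (w_1^2, w_1w_2, w_2^2)'$ and $w_1 + w_2 = 1$. We have $d_1'c = w_1^2 + w_1(1-w_1) = w_1$. Therefore, using the result given in equation (28), we can deduce $w_1 = d_1'c = \dfrac{d_1'\Sigma_0^{-1}\mu}{b'\Sigma_0^{-1}\mu}$ and $w_2 = 1 - d_1'c = \dfrac{(b-d_1)'\Sigma_0^{-1}\mu}{b'\Sigma_0^{-1}\mu} = \dfrac{d_2'\Sigma_0^{-1}\mu}{b'\Sigma_0^{-1}\mu}$.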
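The computation of the optimal weights from equation (28) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the small Gaussian-elimination solver, the function names, and the numerical values of Σ0 and μ in the usage lines are all assumptions chosen so that the result is easy to verify by hand.

```python
def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def optimal_weights(sigma0, mu):
    """Return (c_opt, w1, w2) with c_opt = Sigma0^{-1} mu / (b' Sigma0^{-1} mu)."""
    b = [1.0, 2.0, 1.0]
    x = solve(sigma0, mu)                      # x = Sigma0^{-1} mu
    denom = sum(bi * xi for bi, xi in zip(b, x))
    if denom == 0:
        raise ValueError("b' Sigma0^{-1} mu is zero; equation (28) is undefined")
    c_opt = [xi / denom for xi in x]
    w1 = c_opt[0] + c_opt[1]                   # d1' c with d1 = (1, 1, 0)
    w2 = c_opt[1] + c_opt[2]                   # d2' c with d2 = (0, 1, 1)
    return c_opt, w1, w2


# Hypothetical inputs, chosen so that Sigma0^{-1} mu = (1, 1, 1).
sigma0 = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
mu = [3.0, 3.0, 1.0]
c_opt, w1, w2 = optimal_weights(sigma0, mu)
print(c_opt, w1, w2)
```

Because d1 + d2 = b and b′c_opt = 1, the two weights returned by this sketch always sum to one.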

Appendix 4. Conditional probabilities

D.1. Exponential distribution

Suppose that the death times t1, t2 follow exponential distributions with hazards λ1, λ2, respectively, and denote $\theta = \lambda_1/\lambda_2$, $q_1 = q_2^{\theta}$, and $q_2 = e^{-T\lambda_2}$. Given that $P(\delta_{1k}=1) = p_1$ and $P(\delta_{2l}=1) = p_2$, we have

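$$\begin{aligned}
\pi_{t1} &= P(t_{1k}<t_{2l}\mid\delta_{1k}=\delta_{2l}=1) = (p_1p_2)^{-1}\int_0^T\left(1-e^{-\lambda_1u}\right)\lambda_2e^{-\lambda_2u}\,du\\
&= \frac{1}{(1-q_2^{\theta})}\left[1 - \frac{1-q_2^{(1+\theta)}}{(1+\theta)(1-q_2)}\right]\\
\pi_{t2} &= P(t_{1k}<t_{2l},\,t_{1k'}<t_{2l}\mid\delta_{1k}=\delta_{1k'}=\delta_{2l}=1) = p_1^{-2}p_2^{-1}\int_0^T\left(1-e^{-\lambda_1u}\right)^2\lambda_2e^{-\lambda_2u}\,du\\
&= (1-q_2^{\theta})^{-2}\left\{1 + \frac{1}{(1-q_2)}\left[\frac{1-q_2^{(1+2\theta)}}{1+2\theta} - \frac{2\left(1-q_2^{(1+\theta)}\right)}{1+\theta}\right]\right\}\\
\pi_{t3} &= P(t_{1k}<t_{2l},\,t_{1k}<t_{2l'}\mid\delta_{1k}=\delta_{2l}=\delta_{2l'}=1) = p_1^{-1}p_2^{-2}\int_0^T\left(e^{-\lambda_2T}-e^{-\lambda_2u}\right)^2\lambda_1e^{-\lambda_1u}\,du\\
&= \left(\frac{q_2}{1-q_2}\right)^2\left[1 + \frac{\theta\left(1-q_2^{(2+\theta)}\right)}{(2+\theta)(1-q_2^{\theta})\,q_2^2} - \frac{2\theta\left(1-q_2^{(1+\theta)}\right)}{(1+\theta)(1-q_2^{\theta})\,q_2}\right]
\end{aligned}$$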
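The closed-form expressions for the exponential case can be cross-checked against direct numerical integration of the defining integrals. The sketch below is illustrative only; the function names, the composite Simpson rule, and the hazard and follow-up values are assumptions made for this example.

```python
import math


def exp_pis(lam1, lam2, T):
    """Closed forms for (pi_t1, pi_t2, pi_t3) under exponential death times."""
    theta = lam1 / lam2
    q2 = math.exp(-T * lam2)
    q1 = q2**theta
    pt1 = (1 / (1 - q1)) * (1 - (1 - q2 ** (1 + theta)) / ((1 + theta) * (1 - q2)))
    pt2 = (1 - q1) ** -2 * (
        1 + (1 / (1 - q2)) * (
            (1 - q2 ** (1 + 2 * theta)) / (1 + 2 * theta)
            - 2 * (1 - q2 ** (1 + theta)) / (1 + theta)
        )
    )
    pt3 = (q2 / (1 - q2)) ** 2 * (
        1
        + theta * (1 - q2 ** (2 + theta)) / ((2 + theta) * (1 - q1) * q2**2)
        - 2 * theta * (1 - q2 ** (1 + theta)) / ((1 + theta) * (1 - q1) * q2)
    )
    return pt1, pt2, pt3


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3


def exp_pis_numeric(lam1, lam2, T):
    """The same three probabilities from the defining integrals."""
    p1 = 1 - math.exp(-lam1 * T)
    p2 = 1 - math.exp(-lam2 * T)
    q2 = math.exp(-lam2 * T)
    f1 = lambda u: (1 - math.exp(-lam1 * u)) * lam2 * math.exp(-lam2 * u)
    f2 = lambda u: (1 - math.exp(-lam1 * u)) ** 2 * lam2 * math.exp(-lam2 * u)
    f3 = lambda u: (math.exp(-lam2 * u) - q2) ** 2 * lam1 * math.exp(-lam1 * u)
    return (
        simpson(f1, 0.0, T) / (p1 * p2),
        simpson(f2, 0.0, T) / (p1**2 * p2),
        simpson(f3, 0.0, T) / (p1 * p2**2),
    )


closed = exp_pis(0.2, 0.1, 3.0)        # hypothetical hazards and follow-up time
numeric = exp_pis_numeric(0.2, 0.1, 3.0)
equal_hazards = exp_pis(0.1, 0.1, 3.0)
print(closed, numeric, equal_hazards)
```

With equal hazards (θ = 1) the closed forms should return 1/2, 1/3 and 1/3, in agreement with the null values derived in Appendix 1.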

D.2. Normal distribution

Suppose that the non-fatal outcomes X1, X2 follow normal distributions $N(\mu_{x1}, \sigma_{x1}^2)$ and $N(\mu_{x2}, \sigma_{x2}^2)$, respectively.

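Consider $\Delta_x = \dfrac{\mu_{x2}-\mu_{x1}}{\sqrt{\sigma_{x1}^2+\sigma_{x2}^2}}$, $\rho_{xj} = \dfrac{\sigma_{xj}^2}{\sigma_{x1}^2+\sigma_{x2}^2}$, and $Z_{kl} = \dfrac{X_{1k}-X_{2l}-(\mu_{x1}-\mu_{x2})}{\sqrt{\sigma_{x1}^2+\sigma_{x2}^2}}$.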

We can show that

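$$\begin{aligned}
\pi_{x1} &= P(X_{1k}<X_{2l}) = \Phi(\Delta_x)\\
\pi_{x2} &= P(X_{1k}<X_{2l},\,X_{1k'}<X_{2l}) = P(Z_{kl}<\Delta_x,\,Z_{k'l}<\Delta_x)\\
\pi_{x3} &= P(X_{1k}<X_{2l},\,X_{1k}<X_{2l'}) = P(Z_{kl}<\Delta_x,\,Z_{kl'}<\Delta_x)
\end{aligned}$$

where

$$\begin{pmatrix} Z_{kl}\\ Z_{k'l}\end{pmatrix} \sim N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho_{x2}\\ \rho_{x2} & 1\end{pmatrix}\right) \quad\text{and}\quad \begin{pmatrix} Z_{kl}\\ Z_{kl'}\end{pmatrix} \sim N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho_{x1}\\ \rho_{x1} & 1\end{pmatrix}\right).$$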
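For the normal case, the three probabilities can be evaluated without a bivariate normal routine by conditioning on the shared observation: πx2 = E[G1(X2)²] and πx3 = E[(1 − G2(X1))²], where G1 and G2 denote the two normal distribution functions. The sketch below uses this one-dimensional integral representation in place of the bivariate normal form; the function names, the Simpson rule, and the integration range of ±8 standard deviations are assumptions made for this example.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3


def normal_pis(mu1, s1, mu2, s2):
    """Return (pi_x1, pi_x2, pi_x3) for X1 ~ N(mu1, s1^2), X2 ~ N(mu2, s2^2)."""
    delta = (mu2 - mu1) / math.sqrt(s1 * s1 + s2 * s2)
    px1 = Phi(delta)
    # pi_x2: two independent group-1 outcomes both below the same X2 = mu2 + s2 * z.
    px2 = simpson(lambda z: phi(z) * Phi((mu2 + s2 * z - mu1) / s1) ** 2, -8.0, 8.0)
    # pi_x3: the same X1 = mu1 + s1 * z below two independent group-2 outcomes.
    px3 = simpson(lambda z: phi(z) * (1 - Phi((mu1 + s1 * z - mu2) / s2)) ** 2, -8.0, 8.0)
    return px1, px2, px3


null_case = normal_pis(0.0, 1.0, 0.0, 1.0)   # identical distributions
shifted = normal_pis(0.0, 1.0, 1.0, 2.0)     # hypothetical alternative
print(null_case, shifted)
```

When the two distributions are identical the function should return values close to 1/2, 1/3 and 1/3, matching the null values in Appendix 1.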

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

  • 1.Singhal AB. Normobaric oxygen therapy in acute ischemic stroke trial. ClinicalTrials.gov Database. http://clinicaltrials.gov/ct2/show/NCT00414726 (accessed 7 November 2016).
  • 2.Singhal AB. A review of oxygen therapy in ischemic stroke. Neurol Res. 2007;29:173–183. doi: 10.1179/016164107X181815.
  • 3.Little RJ, Rubin DB. Statistical analysis with missing data. Hoboken, New Jersey: Wiley; 2002.
  • 4.Lachin J. Worst-rank score analysis with informatively missing observations in clinical trials. Control Clin Trials. 1999;20:408–422. doi: 10.1016/s0197-2456(99)00022-7.
  • 5.McMahon R, Harrell F., Jr Power calculation for clinical trials when the outcome is a composite ranking of survival and a nonfatal outcome. Control Clin Trials. 2000;21:305–312. doi: 10.1016/s0197-2456(00)00052-0.
  • 6.Matsouaka RA, Betensky RA. Power and sample size calculations for the Wilcoxon-Mann-Whitney test in the presence of death-censored observations. Stat Med. 2015;34:406–431. doi: 10.1002/sim.6355.
  • 7.Felker GM, Maisel AS. A global rank end point for clinical trials in acute heart failure. Circulation. 2010;3:643–646. doi: 10.1161/CIRCHEARTFAILURE.109.926030.
  • 8.Follmann D, Wittes J, Cutler JA. The use of subjective rankings in clinical trials with an application to cardiovascular disease. Stat Med. 1992;11:427–437. doi: 10.1002/sim.4780110402.
  • 9.Bakal JA, Westerhout CM, Armstrong PW. Impact of weighted composite compared to traditional composite endpoints for the design of randomized controlled trials. Stat Methods Med Res. 2012;24:980–988. doi: 10.1177/0962280211436004.
  • 10.Hallstrom A, Litwin P, Douglas Weaver W. A method of assigning scores to the components of a composite outcome: an example from the MITI trial. Control Clin Trials. 1992;13:148–155. doi: 10.1016/0197-2456(92)90020-z.
  • 11.Neaton J, Gray G, Zuckerman B, et al. Key issues in end point selection for heart failure trials: composite end points. J Cardiac Fail. 2005;11:567–575. doi: 10.1016/j.cardfail.2005.08.350.
  • 12.Califf R, DeMets D. Principles from clinical trials relevant to clinical practice: part I. Circulation. 2002;106:1015. doi: 10.1161/01.cir.0000023260.78078.bb.
  • 13.Braunwald E, Cannon C, McCabe C. An approach to evaluating thrombolytic therapy in acute myocardial infarction. The ‘unsatisfactory outcome’ end point. Circulation. 1992;86:683. doi: 10.1161/01.cir.86.2.683.
  • 14.Moyé L. Multiple analyses in clinical trials: fundamentals for investigators. New York City, New York: Springer Verlag; 2003.
  • 15.Huang P, Tilley BC, Woolson RF, et al. Adjusting O’Brien’s test to control type I error for the generalized nonparametric Behrens-Fisher problem. Biometrics. 2005;61:532–539. doi: 10.1111/j.1541-0420.2005.00322.x.
  • 16.Häberle L, Pfahlberg A, Gefeller O. Assessment of multiple ordinal endpoints. Biometrical J. 2009;51:217–226. doi: 10.1002/bimj.200810502.
  • 17.O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40:1079–1087.
  • 18.Wei L, Johnson W. Combining dependent tests with incomplete repeated measurements. Biometrika. 1985;72:359.
  • 19.Finkelstein D, Schoenfeld D. Combining mortality and longitudinal measures in clinical trials. Stat Med. 1999;18:1341–1354. doi: 10.1002/(sici)1097-0258(19990615)18:11<1341::aid-sim129>3.0.co;2-7.
  • 20.Moyé L, Davis B, Hawkins C. Analysis of a clinical trial involving a combined mortality and adherence dependent interval censored endpoint. Stat Med. 1992;11:1705–1717. doi: 10.1002/sim.4780111305.
  • 21.Moyé LA, Lai D, Jing K, et al. Combining censored and uncensored data in a U-statistic: design and sample size implications for cell therapy research. Int J Biostat. 2011;7:1–29. doi: 10.2202/1557-4679.1286.
  • 22.Sampson UK, Metcalfe C, Pfeffer MA, et al. Composite outcomes: weighting component events according to severity assisted interpretation but reduced statistical power. J Clin Epidemiol. 2010;63:1156–1158. doi: 10.1016/j.jclinepi.2010.01.019.
  • 23.Ahmad Y, Nijjer S, Cook CM, et al. A new method of applying randomised control study data to the individual patient: a novel quantitative patient-centred approach to interpreting composite end points. Int J Cardiol. 2015;195:216–224. doi: 10.1016/j.ijcard.2015.05.109.
  • 24.Wilson RF, Berger AK. Are all end points created equal? The case for weighting. J Am Coll Cardiol. 2011;57:546–548. doi: 10.1016/j.jacc.2010.10.014.
  • 25.Armstrong PW, Westerhout CM, Van de Werf F, et al. Refining clinical trial composite outcomes: An application to the assessment of the safety and efficacy of a new thrombolytic-3 (ASSENT-3) trial. Am Heart J. 2011;161:848–854. doi: 10.1016/j.ahj.2010.12.026.
  • 26.Minas G, Rigat F, Nichols TE, et al. A hybrid procedure for detecting global treatment effects in multivariate clinical trials: theory and applications to fMRI studies. Stat Med. 2012;31:253–268. doi: 10.1002/sim.4395.
  • 27.Fisher LD. Self-designing clinical trials. Stat Med. 1998;17:1551–1562. doi: 10.1002/(sici)1097-0258(19980730)17:14<1551::aid-sim868>3.0.co;2-e.
  • 28.Ramchandani R, Schoenfeld DA, Finkelstein DM. Global rank tests for multiple, possibly censored, outcomes. Biometrics. 2016;72:s1–s10. doi: 10.1111/biom.12475.
  • 29.Lachin JM, Bebu I. Application of the Wei-Lachin multivariate one-directional test to multiple event-time outcomes. Clin Trials. 2015;12:627–633. doi: 10.1177/1740774515601027.
  • 30.Samson K. News from the AAN annual meeting: why a trial of normobaric oxygen in acute ischemic stroke was halted early. Neurol Today. 2013;13:34–35.
  • 31.Freemantle N, Calvert M, Wood J, et al. Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA. 2003;289:2554. doi: 10.1001/jama.289.19.2554.
  • 32.Cordoba G, Schwartz L, Woloshin S, et al. Definition, reporting, and interpretation of composite outcomes in clinical trials: systematic review. Br Med J. 2010;341:c3920. doi: 10.1136/bmj.c3920.
  • 33.Tomlinson G, Detsky AS. Composite end points in randomized trials: there is no free lunch. JAMA. 2010;303:267–268. doi: 10.1001/jama.2009.2017.
  • 34.Ferreira-Gonzalez I, Permanyer-Miralda G, Busse J, et al. Composite outcomes can distort the nature and magnitude of treatment benefits in clinical trials. Ann Intern Med. 2009;150:566. doi: 10.7326/0003-4819-150-8-200904210-00016.
  • 35.Ferreira-Gonzalez I, Permanyer-Miralda G, Busse JW, et al. Methodologic discussions for using and interpreting composite endpoints are limited, but still identify major concerns. J Clin Epidemiol. 2007;60:651–657. doi: 10.1016/j.jclinepi.2006.10.020.
  • 36.Ferreira-Gonzalez I, Permanyer-Miralda G, Domingo-Salvany A, et al. Problems with use of composite end points in cardiovascular trials: systematic review of randomised controlled trials. BMJ. 2007;334:786. doi: 10.1136/bmj.39136.682083.AE.
  • 37.Lubsen J, Just H, Hjalmarsson A, et al. Effect of pimobendan on exercise capacity in patients with heart failure: main results from the Pimobendan in Congestive Heart Failure (PICO) trial. Heart. 1996;76:223. doi: 10.1136/hrt.76.3.223.
  • 38.Lubsen J, Kirwan BA. Combined endpoints: can we use them? Stat Med. 2002;21:2959–2970. doi: 10.1002/sim.1300.
  • 39.Huque MF, Alosh M, Bhore R. Addressing multiplicity issues of a composite endpoint and its components in clinical trials. J Biopharm Stat. 2011;21:610–634. doi: 10.1080/10543406.2011.551327.
  • 40.Mascha EJ, Turan A. Joint hypothesis testing and gatekeeping procedures for studies with multiple endpoints. Anesth Anal. 2012;114:1304–1317. doi: 10.1213/ANE.0b013e3182504435.
  • 41.Dmitrienko A, D’Agostino RB, Huque MF. Key multiplicity issues in clinical drug development. Stat Med. 2013;32:1079–1111. doi: 10.1002/sim.5642.
  • 42.Sankoh AJ, Li H, D’Agostino RB. Use of composite endpoints in clinical trials. Stat Med. 2014;33:4709–4714. doi: 10.1002/sim.6205.
  • 43.Logan B, Tamhane A. Superiority inferences on individual endpoints following noninferiority testing in clinical trials. Biometrical J. 2008;50:693–703. doi: 10.1002/bimj.200710447.
  • 44.Röhmel J, Gerlinger C, Benda N, et al. On testing simultaneously non-inferiority in two multiple primary endpoints and superiority in at least one of them. Biometrical J. 2006;48:916–933. doi: 10.1002/bimj.200510289.
  • 45.Gómez G, Lagakos SW. Statistical considerations when using a composite endpoint for comparing treatment groups. Stat Med. 2013;32:719–738. doi: 10.1002/sim.5547.
  • 46.Gehan EA. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika. 1965;52:203–223.
  • 47.Braunwald E, Antman EM, Beasley JW, et al. ACC/AHA 2002 guideline update for the management of patients with unstable angina and non-ST-segment elevation myocardial infarction - summary article: a report of the American College of Cardiology/American Heart Association Task Force on practice guidelines (committee on the management of patients with unstable angina). J Am Coll Cardiol. 2002;40:1366–1374. doi: 10.1016/s0735-1097(02)02336-7.
  • 48.Grech E, Ramsdale D. Acute coronary syndrome: unstable angina and non-ST segment elevation myocardial infarction. BMJ. 2003;326:1259. doi: 10.1136/bmj.326.7401.1259.
  • 49.National Asthma Education and Prevention Program (National Heart, Lung, and Blood Institute) Third Expert Panel on the Management of Asthma. Expert panel report 3: guidelines for the diagnosis and management of asthma. NIH Publication: US Department of Health and Human Services, National Institutes of Health, National Heart, Lung, and Blood Institute; 2007.
  • 50.Van Elteren P. On the combination of independent two-sample tests of Wilcoxon. Bull Int Stat Inst. 1960;37:351–361.
  • 51.Zhao Y. Sample size estimation for the van Elteren test - a stratified Wilcoxon-Mann-Whitney test. Stat Med. 2006;25:2675–2687. doi: 10.1002/sim.2441.
  • 52.Senn S. Change from baseline and analysis of covariance revisited. Stat Med. 2006;25:4334–4344. doi: 10.1002/sim.2682.
  • 53.Fitzmaurice G. A conundrum in the analysis of change. Nutrition. 2001;17:360–361. doi: 10.1016/s0899-9007(00)00593-1.
  • 54.van Breukelen GJ. ANCOVA versus change from baseline in nonrandomized studies: the difference. Multivariate Behav Res. 2013;48:895–922. doi: 10.1080/00273171.2013.831743.
  • 55.Shahar E, Shahar DJ. Causal diagrams and change variables. J Eval Clin Pract. 2012;18:143–148. doi: 10.1111/j.1365-2753.2010.01540.x.
  • 56.Pearl J. Lord’s paradox revisited (oh lord! kumbaya!). Technical report. Citeseer; 2014.
  • 57.Oakes JM, Feldman HA. Statistical power for nonequivalent pretest-posttest designs: the impact of change-score versus ANCOVA models. Eval Rev. 2001;25:3–28. doi: 10.1177/0193841X0102500101.
  • 58.Willett JB. Questions and answers in the measurement of change. Rev Res Edu. 1988;15:345–422.
  • 59.Bonate PL. Analysis of pretest-posttest designs. Boca Raton, Florida: CRC Press; 2000.
  • 60.Campbell DT, Kenny DA. A primer on regression artifacts. New York City, New York: Guilford Publications; 1999.
  • 61.Young FB, Weir CJ, Lees KR, et al. Comparison of the National Institutes of Health Stroke Scale with disability outcome measures in acute stroke trials. Stroke. 2005;36:2187–2192. doi: 10.1161/01.STR.0000181089.41324.70.
  • 62.Adams H, Jr, Davis P, Leira E, et al. Baseline NIH Stroke Scale score strongly predicts outcome after stroke: a report of the Trial of Org 10172 in Acute Stroke Treatment (TOAST). Neurology. 1999;53:126. doi: 10.1212/wnl.53.1.126.
  • 63.Bruno A, Saha C, Williams LS. Using change in the National Institutes of Health Stroke Scale to measure treatment effect in acute stroke trials. Stroke. 2006;37:920–921. doi: 10.1161/01.STR.0000202679.88377.e4.
  • 64.Parsons M, Spratt N, Bivard A, et al. A randomized trial of tenecteplase versus alteplase for acute ischemic stroke. N Engl J Med. 2012;366:1099–1107. doi: 10.1056/NEJMoa1109842.
  • 65.Brittain E, Palensky J, Blood J, et al. Blinded subjective rankings as a method of assessing treatment effect: a large sample example from the Systolic Hypertension in the Elderly Program (SHEP). Stat Med. 1997;16:681–693. doi: 10.1002/(sici)1097-0258(19970330)16:6<681::aid-sim487>3.0.co;2-h.
  • 66.Felker G, Anstrom K, Rogers J. A global ranking approach to end points in trials of mechanical circulatory support devices. J Cardiac Fail. 2008;14:368–372. doi: 10.1016/j.cardfail.2008.01.009.
  • 67.Allen LA, Hernandez AF, O’Connor CM, et al. End points for clinical trials in acute heart failure syndromes. J Am Coll Cardiol. 2009;53:2248–2258. doi: 10.1016/j.jacc.2008.12.079.
  • 68.Sun H, Davison BA, Cotter G, et al. Evaluating treatment efficacy by multiple endpoints in phase II acute heart failure clinical trials: analyzing data using a global method. Circulation. 2012;5:742–749. doi: 10.1161/CIRCHEARTFAILURE.112.969154.
  • 69.Subherwal S, Anstrom KJ, Jones WS, et al. Use of alternative methodologies for evaluation of composite end points in trials of therapies for critical limb ischemia. Am Heart J. 2012;164:277. doi: 10.1016/j.ahj.2012.07.002.
  • 70.Berry JD, Miller R, Moore DH, et al. The combined assessment of function and survival (CAFS): a new endpoint for ALS clinical trials. Amyotroph Lateral Scler Frontotemp Degen. 2013;14:162–168. doi: 10.3109/21678421.2012.762930.
  • 71.Rosenbaum PR. Comment: the place of death in the quality of life. Stat Sci. 2006;21:313–316.
  • 72.Rubin DB. Rejoinder: causal inference through potential outcomes and principal stratification: application to studies with “censoring” due to death. Stat Sci. 2006;21:319–321.
  • 73.Fay MP, Proschan MA. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Stat Surv. 2010;4:1. doi: 10.1214/09-SS051.
  • 74.Gail MH, Mark SD, Carroll RJ, et al. On design considerations and randomization-based inference for community intervention trials. Stat Med. 1996;15:1069–1092. doi: 10.1002/(SICI)1097-0258(19960615)15:11<1069::AID-SIM220>3.0.CO;2-Q.
  • 75.Pratt JW. Robustness of some procedures for the two-sample location problem. J Am Stat Assoc. 1964;59:650–665.
  • 76.Chung E, Romano JP. Asymptotically valid and exact permutation tests based on two-sample U-statistics. J Stat Plann Infer. 2016;168:97–105.
  • 77.Brunner E, Munzel U. The nonparametric Behrens-Fisher problem: asymptotic theory and a small-sample approximation. Biometrical J. 2000;42:17–25.
