Published in final edited form as: Stat Med. 2014 Nov 13;34(3):406–431. doi: 10.1002/sim.6355

Power and Sample Size Calculations for the Wilcoxon–Mann–Whitney Test in the Presence of Missing Observations due to Death

Roland A. Matsouaka, Rebecca A. Betensky

Abstract

We consider a clinical trial of a potentially lethal disease in which patients are randomly assigned to two treatment groups and are followed for a fixed period of time; a continuous endpoint is measured at the end of follow-up. For some patients, however, death (or severe disease progression) may preclude measurement of the endpoint. A statistical analysis that includes only patients with endpoint measurements may be biased. An alternative analysis includes all randomized patients, with rank scores assigned to the patients who are available for the endpoint measurement based on the magnitude of their responses and with “worst-rank” scores assigned to those patients whose death precluded the measurement of the continuous endpoint. The worst-rank scores are worse than all observed rank scores. The treatment effect is then evaluated using the Wilcoxon–Mann–Whitney (WMW) test. In this paper, we derive closed-form formulae for the power and sample size of the WMW test when missing measurements of the continuous endpoints due to death are replaced by worst-rank scores. We distinguish two approaches for assigning the worst-rank scores. In the tied worst-rank approach, all deaths are weighted equally and the worst-rank scores are set to a single value that is worse than all measured responses. In the untied worst-rank approach, the worst-rank scores further rank patients according to their time of death, so that an earlier death is considered worse than a later death, which in turn is worse than all measured responses. In addition, we propose four methods for implementation of the sample size formulae for a trial with expected early death. We conduct Monte Carlo simulation studies to evaluate the accuracy of our power and sample size formulae and to compare the four sample size estimation methods.

Keywords: Composite outcome, Informative missingness, Intention-to-treat, Worst-rank scores, Wilcoxon Rank Sum test

1. Introduction

In many randomized clinical trials, treatment benefit is evaluated through a post-randomization outcome measured at a pre-specified follow-up time (or times). For some patients, however, the post-randomization measurements may be missing due to a disease-related terminal event that occurs before the end of the follow-up period. Such missing observations are informatively missing, since they are related to the status of the patient's underlying disease. This informative missing outcome conundrum is sometimes referred to in the literature as a truncation-by-death or censoring-by-death problem, and is handled differently from traditional missing data mechanisms [1, 2, 3]. Any statistical analysis based solely on the subset of completely observed measurements may provide a spurious estimate of the treatment effect, because the subjects who survived may not be comparable between the treatment groups. For example, in a study of oxygen therapy in acute ischemic stroke (AIS) [4, 5] conducted at Massachusetts General Hospital, patients with imaging-confirmed AIS were randomized to treatment with normobaric oxygen therapy (NBO) or room air for 8 hours and assessed serially, during the follow-up period of 3 months, using the NIH stroke scale (NIHSS), a rating scale used to quantify neurological deficit due to stroke, and MRI imaging. Unfortunately, in the case of stroke, irreversible cell death begins within minutes after the stroke; the final outcome depends largely on where the stroke occurs and on its severity. The extent of penumbral tissue diminishes rapidly with time, causing serious disability or death for some patients before they could be assessed at the follow-up times [6]. As both death from stroke and a poor final NIHSS score are indicative of disease worsening, patients with missing follow-up NIHSS scores could not be excluded from the analysis.

One solution to this problem of informative missingness is to include all randomized patients in the analysis, with rank scores assigned to the patients who are available for the post-randomization continuous measurement based on the magnitude of their responses and with “worst-rank” scores assigned to those patients whose death precluded the final measurement of the continuous response [7]. The worst-rank scores are worse than all observed rank scores. This is a composite outcome that combines assessment of the continuous response with death. We distinguish two approaches for assigning the worst-rank scores. In the untied worst-rank approach, the worst-rank scores further rank patients according to their time of death, so that an earlier death is considered worse than a later death, which is worse than all measured responses. In the tied worst-rank approach, all deaths are weighted equally and the worst-rank scores are set to a single value that is worse than all measured responses. The treatment effect is then evaluated using the Wilcoxon–Mann–Whitney (WMW) test.

The use of worst-rank composite endpoints is prevalent across disease areas and has become well-accepted and favored in many settings. For example, a 2013 paper advocated for the “Combined Assessment of Function and Survival (CAFS)” endpoint for amyotrophic lateral sclerosis (ALS) clinical trials [8]. The authors of this paper concluded that “CAFS is a robust statistical tool for ALS clinical trials and appropriately accounts for and weights mortality in the analysis of function.” A 2010 paper advocated for a global rank score that combines clinical events (e.g., death and hospitalization) with continuous biomarker measurements in acute heart failure studies [9]. These authors proposed a hierarchy of the individual endpoints that is incorporated into a composite rank endpoint. There are several other examples from the literature in the settings of HIV-AIDS trials [13, 14, 15], cardiovascular and heart failure trials [16, 17, 11, 9, 18, 19, 20], critical limb ischemia trials [21], and orthopedic trials [22].

The idea of assigning scores to informatively missing observations was first introduced by Gould [23] and was used by Richie et al. [24, 25] to handle cases of patients’ informative withdrawal. Gould suggested using a rank-based test for the composite outcome scores. To avoid dealing with the multiple ties introduced by Gould’s approach, Senn [26] proposed a modified version by assigning patients with missing observations ranks that depend on their times of withdrawal. Lachin is among those who extended the idea to settings in which disease-related withdrawal is due to death [7]. Lachin also explored properties of the WMW test applied to worst-rank score composite endpoints and demonstrated the unbiasedness of this approach for some restricted alternative hypotheses. Finally, McMahon and Harrell [27] derived power and sample size calculations for the untied worst-rank score composite outcome of mortality and a non-fatal outcome measure.

To evaluate the power and sample size of the WMW test, it is necessary to specify the alternative to the null hypothesis and to calculate certain probabilities based on the alternative that are used in the mean and variance formulae. Generally this is a difficult task, especially when the underlying distributions are unknown [28]. Lehmann [29] and Noether [30] proposed some simplifications under the location shift alternative, when the location shift is small. Rosner and Glynn [31] considered a probit-shift alternative, which has some advantages over the shift alternative, and provided formulae for the necessary probabilities. Wang et al. [32] suggested using data from a pilot study to estimate the necessary probabilities.

To our knowledge, the paper by McMahon and Harrell [27] is the only paper that has attempted to establish the power and sample size formulae for the WMW test for a worst-rank score composite outcome. While the paper has the merit of addressing this important problem, the power formula as well as the sample size estimation approach proposed by McMahon and Harrell have some limitations. First, the variance estimator for the WMW U-statistic under the alternative hypothesis contains an error, which leads to an incorrect power estimation. Second, the sample size estimation and power calculation methods proposed in the paper rely heavily on the conservative assumption of no treatment effect on mortality (i.e., the “mortality neutral” assumption). This assumption is not tenable in practice. When there is an actual treatment effect on mortality, estimating the sample size under the mortality neutral assumption will unnecessarily and drastically inflate the sample size. In addition, even when there is not a significant difference in mortality, there may still be a moderate effect on mortality, which should be accounted for in the overall estimation of the treatment effect [33]. Finally, the paper does not provide power and sample size formulae for the tied worst-rank composite outcome, where missing observations are all set to a fixed (arbitrary) value. Our paper extends the results of McMahon and Harrell [27] as it considers both the tied and untied worst rank composite outcomes, provides an accurate variance estimator for use in power calculations, and estimates sample sizes without making the mortality neutral assumption.

In Section 2 we present the notation and hypotheses that are used throughout the paper. Next, in Section 3, we derive the analytical power formulae for the WMW test for both tied and untied worst-rank score outcomes. We also present a Monte Carlo simulation study that evaluates the accuracy of the proposed analytical formulae. In Section 4, we derive the sample size formulae for the WMW test for both tied and untied worst-rank composite outcomes. In addition, we present four different methods that have been used in the literature to estimate sample size for the traditional WMW test and extend them for estimation of sample size in the context of worst-rank composite outcomes. We report the results of Monte Carlo simulation studies to assess the validity of our sample size formulae and to evaluate and compare the accuracy of the proposed methods.

2. Notation and hypotheses

Suppose that N subjects are randomly assigned to either the control (i = 1) or the active treatment group (i = 2) and followed for a pre-specified period of time T. We denote by Xij the primary outcome of interest for subject j in group i, which is the continuous response measured at time T. Without loss of generality, we assume that larger values of X correspond to an improved medical condition. We denote the survival time for subject j in group i by tij, with the associated event indicator δij = I(tij ≤ T) of an event occurrence (prior to T). Note that Xij is missing if δij = 1. We denote by pi = E(δij) = P(tij ≤ T) the probability of the event (prior to T) in group i.

Under the untied worst-rank score composite outcome, any subject j in group i with missing measurement Xij is assigned a value η + tij, with η = min(X) − 1 − T. Thus, for each subject, the untied worst-rank adjusted value is given by (see Lachin [7])

\[
\tilde{X}_{ij} = \delta_{ij}(\eta + t_{ij}) + (1 - \delta_{ij})X_{ij}, \qquad i = 1, 2 \;\text{and}\; j = 1, \ldots, N. \tag{1}
\]

By assigning these values, we ensure that (1) patients who have had an event prior to T are ranked appropriately by their survival times and (2) patients with observed follow-up measurements are ranked above all those who had an event, based on their observed measurements.
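For concreteness, the assignment in (1) takes only a few lines; the sketch below is our illustration (not code from the paper) and assumes the measured outcomes are stored with NaN for the subjects whose death precluded measurement.

```python
import numpy as np

def untied_worst_rank_scores(x, t, delta, T):
    """Untied worst-rank scores per equation (1): subjects who die before T
    receive eta + t (so they rank below every observed outcome, ordered by
    death time); survivors keep their measured outcome x.
    Assumes x, t, delta are equal-length arrays, with x = NaN where delta == 1."""
    eta = np.nanmin(x) - 1 - T          # eta = min(X) - 1 - T
    return np.where(delta == 1, eta + t, x)
```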

Let Fi and Gi denote the conditional cumulative distributions of the event time and observed continuous response measure, respectively, for patients in group i, i.e., Fi(υ) = P(tij ≤ υ | 0 < tij ≤ T) and Gi(x) = P(Xij ≤ x | tij > T). The distribution of X̃ij is given by

\[
\tilde{H}_i(x) = p_i F_i(x - \eta)\, I(x < \zeta) + (1 - p_i) G_i(x)\, I(x \ge \zeta), \qquad \zeta = \min(X) - 1.
\]

We would like to test the null hypothesis that the two groups do not differ with respect to both survival times and the non-fatal outcome

\[
H_0: G_1(x) = G_2(x) \;\text{and}\; F_1(t) = F_2(t)
\]

versus the uni-directional alternative hypothesis

\[
H_1: G_1(x) \ge G_2(x) \;\text{and}\; F_1(t) \ge F_2(t), \tag{2}
\]

with both G1(x) = G2(x) and F1(t) = F2(t) not occurring simultaneously.

In the case of the tied worst-rank assignment, all the subjects with missing measurements are assigned a fixed score ζ = min(X) − 1. For each subject in the study, the tied worst-rank adjusted value is

\[
\tilde{X}^{*}_{ij} = \delta_{ij}\,\zeta + (1 - \delta_{ij})X_{ij}. \tag{3}
\]

The cumulative distribution of X̃*ij is given by

\[
\tilde{H}^{*}_i(x) = p_i\, I(x = \zeta) + (1 - p_i) G_i(x)\, I(x \ge \zeta).
\]

The null and alternative hypotheses are defined by

\[
H_0: \tilde{H}^{*}_1(x) = \tilde{H}^{*}_2(x) \quad \text{versus} \quad H_1: \tilde{H}^{*}_1(x) \ge \tilde{H}^{*}_2(x), \tag{4}
\]
i.e., H0: G1(x) = G2(x) and p1 = p2 versus H1: G1(x) ≥ G2(x) and p1 ≥ p2,

where G1(x) = G2(x) and p1 = p2 do not both hold at the same time.

Throughout this paper, α denotes the type I error, β the type II error, and zp and Φ denote, respectively, the pth percentile and the cumulative distribution function of the standard normal distribution.

Note that we focus on the restricted alternative hypotheses (2) and (4), which are uni-directional in the sense that the active treatment effect is favorable on both the primary outcome of interest and mortality. In the case where higher values of the primary outcome of interest are indicative of a better health outcome, this means that subjects in the active treatment group tend to have higher values of the non-fatal outcome or longer survival times, while the distribution of the other outcome is either better in the active treatment group or (at least) equivalent in the two groups. These restricted alternatives are also the ones most commonly considered in the literature [7, 27]. The cases in which the treatment has a favorable effect on survival, but a less favorable effect on the continuous response measure, or vice versa, do not provide a clear signal of a treatment benefit. Furthermore, regulatory agencies and investigators are reluctant to support a trial that does not emphasize mortality. The uni-directional alternative hypotheses considered here allow us to avoid these ambiguous cases.

3. Power Calculation

In Section 3.1 we focus on the untied worst-rank test: we derive an analytical formula for its power in Section 3.1.1, discuss it in Section 3.1.2, and evaluate the test, as well as the analytical approximation of its power, in simulation studies in Section 3.1.3. Sections 3.2.1 and 3.2.2 contain the analogous derivations and simulations for the tied worst-rank test.

3.1. Untied Worst-Rank Scores

3.1.1. Power Formula

Let X̃1k and X̃2l denote the worst-rank scores for the kth subject in group 1 and the lth subject in group 2, respectively; that is, X̃1k = (1 − δ1k)X1k + δ1k(η + t1k) and X̃2l = (1 − δ2l)X2l + δ2l(η + t2l), for k = 1, …, m and l = 1, …, n, where N = m + n. We define the WMW U-statistic by

\[
U = (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} I(\tilde{X}_{1k} < \tilde{X}_{2l}).
\]

Since X̃1k < X̃2l when {t1k < t2l and (δ1k = δ2l = 1)}, {δ1k = 1 and δ2l = 0}, or {X1k < X2l and (δ1k = δ2l = 0)}, we have I(X̃1k < X̃2l) = I(t1k < t2l, δ1k = δ2l = 1) + I(δ1k = 1, δ2l = 0) + I(X1k < X2l, δ1k = δ2l = 0). Therefore,

\[
U = (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} \left[ I(t_{1k} < t_{2l}, \delta_{1k} = \delta_{2l} = 1) + I(\delta_{1k} = 1, \delta_{2l} = 0) + I(X_{1k} < X_{2l}, \delta_{1k} = \delta_{2l} = 0) \right]. \tag{5}
\]

This decomposition of I(X̃1k < X̃2l) means that a patient assigned to the active treatment group (group 2) has a better outcome (and hence a better score) than a patient assigned to the control treatment group (group 1) if:

  1. both patients died before time T, but the patient in the active treatment group lived longer;

  2. the patient in the control group died before time T, while the patient in the active treatment group survived and had the primary outcome of interest measured;

  3. both patients survived until time T and had their primary outcomes of interest measured; however, the patient in the active treatment group had a higher measure of the primary outcome of interest.
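To make the pairwise definition concrete, here is a minimal sketch (our illustration, not code from the paper) that computes U both directly from the worst-rank scores and via the three-term decomposition in (5); the two agree by construction.

```python
import numpy as np

def wmw_u(score1, score2):
    """U of equation (5): fraction of (group 1, group 2) pairs whose
    group-1 worst-rank score is strictly smaller."""
    return (np.asarray(score1)[:, None] < np.asarray(score2)[None, :]).mean()

def wmw_u_decomposed(t1, x1, d1, t2, x2, d2):
    """Same U via the decomposition: both die and the control patient dies
    first; the control patient dies while the treated one survives; or both
    survive and x1 < x2.  NaN outcomes for deaths are masked by the deltas."""
    d1 = np.asarray(d1, bool)[:, None]; d2 = np.asarray(d2, bool)[None, :]
    t_term = (np.asarray(t1)[:, None] < np.asarray(t2)[None, :]) & d1 & d2
    m_term = d1 & ~d2
    x_term = (np.asarray(x1)[:, None] < np.asarray(x2)[None, :]) & ~d1 & ~d2
    return (t_term | m_term | x_term).mean()
```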

We show, in Appendix A, equations (A.1) and (A.2), that the mean and variance of U under the alternative hypothesis are:

\[
\begin{aligned}
\mu_1 = E(U) &= p_1 p_2 \pi_{t1} + p_1 q_2 + q_1 q_2 \pi_{x1} = \pi_{U1},\\
\sigma_1^2 = \operatorname{Var}(U) &= (nm)^{-1}\left[\pi_{U1}(1 - \pi_{U1}) + (m - 1)(\pi_{U2} - \pi_{U1}^2) + (n - 1)(\pi_{U3} - \pi_{U1}^2)\right],
\end{aligned} \tag{6}
\]

\[
\begin{aligned}
\pi_{U1} &= p_1 p_2 \pi_{t1} + p_1 q_2 + q_1 q_2 \pi_{x1},\\
\pi_{U2} &= p_1^2 q_2 + p_1^2 p_2 \pi_{t2} + 2 p_1 q_1 q_2 \pi_{x1} + q_1^2 q_2 \pi_{x2},\\
\pi_{U3} &= p_1 q_2^2 + p_1 p_2^2 \pi_{t3} + 2 p_1 p_2 q_2 \pi_{t1} + q_1 q_2^2 \pi_{x3},
\end{aligned} \tag{7}
\]

where pi = P(tij ≤ T), qi = 1 − pi, for i = 1, 2, and

\[
\begin{aligned}
\pi_{t1} &= P(t_{1k} < t_{2l} \mid \delta_{1k} = \delta_{2l} = 1);\\
\pi_{t2} &= P(t_{1k} < t_{2l},\, t_{1k'} < t_{2l} \mid \delta_{1k} = \delta_{1k'} = \delta_{2l} = 1); \qquad
\pi_{t3} = P(t_{1k} < t_{2l},\, t_{1k} < t_{2l'} \mid \delta_{1k} = \delta_{2l} = \delta_{2l'} = 1);\\
\pi_{x1} &= P(X_{1k} < X_{2l}); \qquad
\pi_{x2} = P(X_{1k} < X_{2l},\, X_{1k'} < X_{2l}); \qquad
\pi_{x3} = P(X_{1k} < X_{2l},\, X_{1k} < X_{2l'}).
\end{aligned}
\]

Under the null hypothesis of no difference between the two treatment groups, we have p1 = p2, q1 = q2, πt1 = πx1 = 1/2, and πt2 = πx2 = πt3 = πx3 = 1/3, which imply that πU1 = 1/2 and πU2 = πU3 = 1/3. Hence,

\[
\mu_0 = E_0(U) = \frac{1}{2} \qquad \text{and} \qquad \sigma_0^2 = \operatorname{Var}_0(U) = \frac{n + m + 1}{12\, mn}. \tag{8}
\]

To evaluate the power of the WMW for the untied worst-rank score composite outcome, we use the asymptotic distribution of the WMW test statistic

\[
Z = \frac{U - E_0(U)}{\sqrt{\operatorname{Var}_0(U)}}, \tag{9}
\]

which converges to the standard normal distribution N(0, 1) as m and n tend to infinity, assuming that n/N is bounded away from 0.

Practical guidelines for how large m and n should be in order for the normal approximation in (9) to be valid for a traditional Wilcoxon test are provided in the literature. Siegel and Castellan [34] recommended using the approximation for either n = 3 or 4 and m > 12, or for n > 4 and m > 10. Bellera et al. [35] recommended use of a graphical display to assess the normal approximation.

The power of the WMW test statistic (9) is given by

\[
\Phi\!\left(\frac{\sigma_0}{\sigma_1}\, z_{\alpha/2} + \frac{\mu_1 - \mu_0}{\sigma_1}\right) + \Phi\!\left(\frac{\sigma_0}{\sigma_1}\, z_{\alpha/2} - \frac{\mu_1 - \mu_0}{\sigma_1}\right) \approx \Phi\!\left(\frac{\sigma_0}{\sigma_1}\, z_{\alpha/2} + \frac{|\mu_1 - \mu_0|}{\sigma_1}\right). \tag{10}
\]

To calculate the conditional probabilities πtγ and πxγ, γ = 1, 2, 3, we need to specify the distributions of the survival times, t, and the non-fatal continuous response, X, or make some additional assumptions. For instance, we might assume that Xik ~ N(μxi, σxi²) and tik ~ Exp(λi).
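Under these working assumptions, the probabilities and the power formula (10) are straightforward to evaluate numerically. The following sketch is our illustration (the paper's closed-form expressions are in Appendix B; here the πtγ and πxγ are simply estimated by Monte Carlo):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2014)

def trunc_exp(lam, T, size):
    """Death times from Exp(lam) conditioned on t <= T, via the inverse CDF."""
    u = rng.uniform(size=size)
    return -np.log1p(-u * (1.0 - np.exp(-lam * T))) / lam

def pi_probs(lam1, lam2, mu1, mu2, sd1, sd2, T, B=500_000):
    """Monte Carlo estimates of pi_t1..pi_t3 and pi_x1..pi_x3 of (7)."""
    t1, t1p = trunc_exp(lam1, T, B), trunc_exp(lam1, T, B)
    t2, t2p = trunc_exp(lam2, T, B), trunc_exp(lam2, T, B)
    x1, x1p = rng.normal(mu1, sd1, size=(2, B))
    x2, x2p = rng.normal(mu2, sd2, size=(2, B))
    return {
        't1': np.mean(t1 < t2),
        't2': np.mean((t1 < t2) & (t1p < t2)),
        't3': np.mean((t1 < t2) & (t1 < t2p)),
        'x1': np.mean(x1 < x2),
        'x2': np.mean((x1 < x2) & (x1p < x2)),
        'x3': np.mean((x1 < x2) & (x1 < x2p)),
    }

def untied_power(m, n, p1, p2, pi, alpha=0.05):
    """Analytical power (10) of the untied worst-rank WMW test, via (6)-(8)."""
    q1, q2 = 1 - p1, 1 - p2
    piU1 = p1*p2*pi['t1'] + p1*q2 + q1*q2*pi['x1']
    piU2 = p1**2*q2 + p1**2*p2*pi['t2'] + 2*p1*q1*q2*pi['x1'] + q1**2*q2*pi['x2']
    piU3 = p1*q2**2 + p1*p2**2*pi['t3'] + 2*p1*p2*q2*pi['t1'] + q1*q2**2*pi['x3']
    s1 = np.sqrt((piU1*(1 - piU1) + (m - 1)*(piU2 - piU1**2)
                  + (n - 1)*(piU3 - piU1**2)) / (m*n))
    s0 = np.sqrt((m + n + 1) / (12*m*n))
    za, d = norm.ppf(alpha/2), piU1 - 0.5      # z_{alpha/2} is negative
    return norm.cdf(s0/s1*za + d/s1) + norm.cdf(s0/s1*za - d/s1)

# e.g., T = 3, q2 = 0.8, HR = 2: lam2 = -np.log(0.8)/3, lam1 = 2*lam2,
# p1 = 1 - np.exp(-lam1*3), p2 = 0.2; untied_power(50, 50, p1, p2,
# pi_probs(lam1, lam2, 0.0, 2**0.5*0.3, 1, 1, 3)) should fall close to the
# corresponding entry of Part (1) of Table 1.
```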

3.1.2. Remarks

  1. At this point, it is worth mentioning that the U–statistic considered here is exactly the same as the one used by McMahon and Harrell [27]. In their paper, U is defined as
    \[
    U = (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} \left[ I(t_{1k} < t_{2l})\, I(\delta_{1k} = 1 \;\text{or}\; \delta_{2l} = 1) + I(X_{1k} < X_{2l}, \delta_{1k} = \delta_{2l} = 0) \right], \tag{11}
    \]
    which is equivalent to (5).
    Indeed, notice that
    \[
    I(\delta_{1k} = 1 \;\text{or}\; \delta_{2l} = 1) = I(\delta_{1k} = 1 \;\text{and}\; \delta_{2l} = 1) + I(\delta_{1k} = 0 \;\text{and}\; \delta_{2l} = 1) + I(\delta_{1k} = 1 \;\text{and}\; \delta_{2l} = 0).
    \]
    Since patients who survived have longer survival time than those who died, we have
    \[
    \begin{aligned}
    I(t_{1k} < t_{2l})\, I(\delta_{1k} = 0 \;\text{and}\; \delta_{2l} = 1) &= 0 \times I(\delta_{1k} = 0 \;\text{and}\; \delta_{2l} = 1) = 0;\\
    I(t_{1k} < t_{2l})\, I(\delta_{1k} = 1 \;\text{and}\; \delta_{2l} = 0) &= 1 \times I(\delta_{1k} = 1 \;\text{and}\; \delta_{2l} = 0) = I(\delta_{1k} = 1 \;\text{and}\; \delta_{2l} = 0),
    \end{aligned} \tag{12}
    \]
    where δ1k = 1 indicates death, while δ1k = 0 indicates survival and measurement of the primary outcome of interest. Thus,
    \[
    I(t_{1k} < t_{2l})\, I(\delta_{1k} = 1 \;\text{or}\; \delta_{2l} = 1) = I(t_{1k} < t_{2l})\, I(\delta_{1k} = \delta_{2l} = 1) + I(\delta_{1k} = 1, \delta_{2l} = 0), \tag{13}
    \]
    and not I(t1k < t2l)[1 − I(δ1k = 0, δ2l = 0)] as they suggested.
    Plugging (13) into (11), the U-statistic becomes
    \[
    U = (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} \left[ I(t_{1k} < t_{2l}, \delta_{1k} = \delta_{2l} = 1) + I(\delta_{1k} = 1, \delta_{2l} = 0) + I(X_{1k} < X_{2l}, \delta_{1k} = \delta_{2l} = 0) \right],
    \]
    which is exactly the U-statistic we have defined in (5).

    Now, using the key result (13), we can show that the probability P(t1k < t2l, (δ1k = 1 or δ2l = 1)) is not equal to P(t1k < t2l | δ1k = 1 or δ2l = 1)[1 − P(δ1k = δ2l = 0)], but is instead equal to P(t1k < t2l | δ1k = δ2l = 1)P(δ1k = δ2l = 1) + P(δ1k = 1, δ2l = 0). Therefore, the expected value E(U) obtained in McMahon and Harrell's paper (denoted here by πU1*),
    \[
    \pi_{U1}^{*} = [1 - P(\delta_{1k} = \delta_{2l} = 0)]\, \pi_{t1}^{*} + P(\delta_{1k} = \delta_{2l} = 0)\, \pi_{x1} = (1 - q_1 q_2)\pi_{t1}^{*} + q_1 q_2 \pi_{x1},
    \]
    is incorrect, where πt1* = P(t1k < t2l | δ1k = 1 or δ2l = 1).

    In addition, their results for πU2 and πU3 (denoted here by πU2* and πU3*),
    \[
    \begin{aligned}
    \pi_{U2}^{*} &= [1 - P(\delta_{1k} = \delta_{1k'} = \delta_{2l} = 0)]\, \pi_{t2}^{*} + P(\delta_{1k} = \delta_{1k'} = \delta_{2l} = 0)\, \pi_{x2} = (1 - q_1^2 q_2)\pi_{t2}^{*} + q_1^2 q_2 \pi_{x2},\\
    \pi_{U3}^{*} &= [1 - P(\delta_{1k} = \delta_{2l} = \delta_{2l'} = 0)]\, \pi_{t3}^{*} + P(\delta_{1k} = \delta_{2l} = \delta_{2l'} = 0)\, \pi_{x3} = (1 - q_1 q_2^2)\pi_{t3}^{*} + q_1 q_2^2 \pi_{x3},
    \end{aligned}
    \]
    are both also incorrect, where they defined πt2* = P(t1k < t2l, t1k′ < t2l | at least one of δ1k, δ1k′, δ2l is equal to 1) and πt3* = P(t1k < t2l, t1k < t2l′ | at least one of δ1k, δ2l, δ2l′ is equal to 1). Indeed, unlike πU1*, πU2*, and πU3*, the probabilities πU1, πU2, and πU3 each have an additional term that depends only on p1 and q2. Furthermore, πU2 also depends on πx1, and πU3 depends on πt1. In fact, we have used arguments similar to (12) in Appendix A to show that πU2* and πU3* are not only different from πU2 and πU3, but also incorrect.
  2. For any given values of q1 and q2 in (7), there exist pairs (πt1, πx1) other than (0.5, 0.5) for which the Mann–Whitney worst-rank test statistic has an expected value πU1 = (1 − q1)(1 − q2)πt1 + q1q2πx1 + (1 − q1)q2 equal to 0.5. This occurs, for instance, when the effects of treatment on the components of the worst-rank composite outcome are in opposite directions and lack consistency, such as with a treatment that increases mortality while significantly improving the primary outcome of interest. This is the very reason we focused on a particular alternative hypothesis, the uni-directional alternative, in which the active treatment fares better with respect to both the (non-fatal) primary outcome of interest and the incidence of death. In the scenario where higher values of the non-fatal outcome reflect a better health condition, this means that subjects in the active treatment group tend to have higher values of the non-fatal outcome and/or longer survival times, while not having worse values for either.

    We are not interested here in testing the null hypothesis of no difference on both mortality and the primary outcome of interest against the complete complement of the null hypothesis (the omnibus alternative hypothesis), but rather against the (restrictive) uni-directional alternative hypothesis defined by (2). In this context, under the latter alternative hypothesis, we expect the pairs (πt1, πx1) to belong to the set [0.5, 1] × [0.5, 1] − {(0.5, 0.5)}.

  3. One may object that, in practice, it is not possible to know in advance whether a specific drug would satisfy such a requirement, and that we are doomed to produce misleading results when the treatment is inconsistent, in the sense that the overall treatment effect on the composite outcome might appear beneficial while its harmful effect on one of the components of the composite outcome goes unnoticed. We have recently revised and re-submitted a paper to Statistics in Medicine [36] that gives practical guidelines and highlights two stepwise procedures that deal with this issue. These procedures ensure that a treatment is declared superior only when it has a beneficial effect on either the primary outcome of interest or mortality and does not fare worse on either of them.

3.1.3. Simulation Study

We conducted simulation studies to assess the accuracy of the power formula (10), using a two-sided α = 0.05. In each of 10,000 simulated trials, we generated exponentially distributed survival times and normally distributed primary outcomes of interest. We estimated the power as a function of the hazard ratio (HR), the standardized difference in the primary outcome of interest Δx = (μx2 − μx1)/(2^{1/2}σx1), and the survival probability in the active treatment group, q2 = 1 − p2, for n = m = 50. Specifically, we generated ti ~ Exp(λi), i = 1, 2, such that q2 = exp(−λ2T) = 0.6, 0.8, with T = 3, HR = λ1/λ2 = 1.0, 1.2, 1.4, 1.6, 2.0, 2.4, 3.0, X1 ~ N(0, 1), and X2 ~ N(2^{1/2}Δx, 1), for Δx = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6. Calculations of the conditional probabilities πtγ and πxγ for γ = 1, 2, 3 are given in Appendix B. The analytical power values based on (10) are listed in Part (1) of Table 1. For each data set, we computed the WMW test statistic, Z, given in equation (9). We also calculated the empirical power as the proportion of simulated data sets for which |Z| > 1.96. For the sake of comparison, we also calculated the empirical power for the WMW test using the measured primary outcomes of interest of the survivors, i.e., reducing the sample to include survivors only. These results are presented in Part (2) of Table 1.

Table 1.

Power for the Wilcoxon–Mann–Whitney test for untied worst-rank scores.

        q2 = 60%                                  q2 = 80%                                  q1 = q2 = 100%
Δx  HR: 1.0  1.2  1.4  1.6  2.0  2.4  3.0     HR: 1.0  1.2  1.4  1.6  2.0  2.4  3.0
(1) Analytical power
0.0 0.05 0.08 0.17 0.30 0.63 0.87 0.99 0.05 0.06 0.09 0.14 0.28 0.47 0.73 0.05
0.1 0.06 0.11 0.22 0.37 0.69 0.90 0.99 0.07 0.11 0.17 0.24 0.42 0.60 0.83 0.08
0.2 0.08 0.16 0.28 0.44 0.75 0.93 1.00 0.14 0.20 0.28 0.37 0.56 0.73 0.90 0.16
0.3 0.11 0.21 0.36 0.52 0.80 0.95 1.00 0.25 0.34 0.43 0.52 0.70 0.83 0.94 0.31
0.4 0.16 0.28 0.43 0.59 0.84 0.96 1.00 0.40 0.49 0.58 0.67 0.81 0.90 0.97 0.49
0.5 0.22 0.36 0.51 0.66 0.88 0.97 1.00 0.56 0.65 0.72 0.79 0.89 0.95 0.99 0.68
0.6 0.29 0.43 0.58 0.72 0.91 0.98 1.00 0.71 0.78 0.83 0.88 0.94 0.98 1.00 0.83

(2) Survivors-only power
0.0 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
0.1 0.08 0.08 0.07 0.07 0.07 0.06 0.06 0.10 0.10 0.10 0.10 0.10 0.10 0.09 0.08
0.2 0.18 0.18 0.17 0.16 0.14 0.13 0.11 0.25 0.25 0.25 0.25 0.25 0.23 0.23 0.16
0.3 0.34 0.33 0.31 0.30 0.27 0.24 0.21 0.50 0.48 0.48 0.48 0.47 0.46 0.45 0.31
0.4 0.54 0.52 0.50 0.48 0.43 0.38 0.31 0.74 0.73 0.72 0.71 0.71 0.70 0.68 0.49
0.5 0.74 0.71 0.69 0.66 0.59 0.55 0.45 0.90 0.89 0.89 0.89 0.88 0.87 0.86 0.68
0.6 0.87 0.86 0.84 0.81 0.76 0.70 0.60 0.97 0.97 0.97 0.97 0.97 0.96 0.96 0.83

We used exponential distributions for the survival times, normal distributions for the non-fatal outcomes, with 50 subjects in each group.

Δx: standardized difference in the primary outcome of interest; HR: hazard ratio; q1 (resp. q2): survival probability in the control (resp. active) treatment group.

(1) Analytic power formula (10); (2) Empirical power obtained by reducing the sample to include survivors only.

First, the simulations indicate that, for exponentially distributed survival times and normally distributed non-fatal outcomes, there is virtually no difference between the empirical power and the analytical power calculated using formula (10). The difference in power, for any value of the hazard ratio or of the standardized difference in the primary outcome of interest, is less than 0.01. Thus, as long as the survival probability and outcome distributions are well approximated, our analytical power formula will be accurate in the context of the untied worst-rank score outcome.

Secondly, the power of the WMW test based on the composite outcome increases considerably as the hazard ratio increases, for any given difference in the primary outcome of interest. This is due to its use of the actual death times, which contribute treatment information for large hazard ratios; this effect is larger for smaller survival probabilities (i.e., 60% versus 80%), as it translates into more early deaths. This also provides a compelling argument against the “mortality neutral” assumption in designing a study.

Thirdly, the (ordinary) WMW test in the case of no deaths has more power compared with the untied worst-rank WMW test with deaths when there is a null or moderate effect of treatment on survival (HR < 1.4) and the difference in the primary outcome of interest is moderate or high (Δx > 0.3). However, when the hazard ratio of death is high (HR ≥ 2), or when it is moderate (1.4 ≤ HR < 2) and the difference in the primary outcome of interest is low (Δx ≤ 0.2), the WMW test on the untied worst-rank composite outcome is more powerful.

Fourthly, when subjects with missing observations due to death are excluded from the analysis, the power to detect a treatment difference decreases as the hazard ratio increases, with lower power when the overall probability of survival is 60% versus 80%, due to the smaller remaining sample size in the former case. In addition, this power is extremely low when the standardized difference in the primary outcome of interest is low or moderate and the hazard ratio of death is moderate or high. This demonstrates that the results are unreliable when informatively missing observations due to death are excluded from the analyses.

Although inefficient (and inappropriate in case of informatively missing observations), the “survivor-only” analysis is a natural approach that has been used in the literature [37] and may be done in practice. The simulation results highlight the loss of power associated with this approach.

3.2. Tied Worst-Rank Scores

3.2.1. Power Formula

Under the tied worst-rank score assignment procedure, we set X̃*1k = (1 − δ1k)X1k + δ1kζ and X̃*2l = (1 − δ2l)X2l + δ2lζ, for k = 1, …, m, l = 1, …, n, and ζ = min(X) − 1.

We define the WMW U-statistic as

\[
\tilde{U} = (mn)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{n} \left[ I(\tilde{X}^{*}_{1k} < \tilde{X}^{*}_{2l}) + \tfrac{1}{2}\, I(\tilde{X}^{*}_{1k} = \tilde{X}^{*}_{2l}) \right]. \tag{14}
\]

Note that X̃*1k < X̃*2l when {X1k < X2l, δ1k = δ2l = 0} or (δ1k = 1, δ2l = 0), whereas X̃*1k = X̃*2l when (δ1k = δ2l = 1). Thus, I(X̃*1k < X̃*2l) + ½I(X̃*1k = X̃*2l) = I(X1k < X2l, δ1k = δ2l = 0) + I(δ1k = 1, δ2l = 0) + ½I(δ1k = δ2l = 1). This means that a patient assigned to the active treatment group (group 2) has a better score compared with a patient assigned to the control treatment group (group 1) if:

  1. both patients survive until time T, but the patient in the active treatment group has a higher measure of the primary outcome of interest;

  2. the patient in the control group dies before time T, while the patient in the active treatment group survives.

In Appendix A, equations (A.3) and (A.4), we show that

\[
\begin{aligned}
\mu_1 = E(\tilde{U}) &= q_1 q_2 \pi_{x1} + p_1 q_2 + \tfrac{1}{2} p_1 p_2 = \pi_{\tilde{U}1},\\
\sigma_1^2 = \operatorname{Var}(\tilde{U}) &= (nm)^{-1}\Big[\pi_{\tilde{U}1}(1 - \pi_{\tilde{U}1}) + (m - 1)\Big(\pi_{\tilde{U}2} - \pi_{\tilde{U}1}^2 - \frac{p_1^2 p_2}{12}\Big) + (n - 1)\Big(\pi_{\tilde{U}3} - \pi_{\tilde{U}1}^2 - \frac{p_1 p_2^2}{12}\Big) - \frac{p_1 p_2}{4}\Big],
\end{aligned} \tag{15}
\]

where πxγ, γ = 1, 2, 3, are defined as in (7), and

\[
\begin{aligned}
\pi_{\tilde{U}1} &= q_1 q_2 \pi_{x1} + p_1 q_2 + \tfrac{1}{2} p_1 p_2,\\
\pi_{\tilde{U}2} &= \pi_{x2}\, q_1^2 q_2 + 2 \pi_{x1}\, p_1 q_1 q_2 + p_1^2 q_2 + \tfrac{1}{3} p_1^2 p_2,\\
\pi_{\tilde{U}3} &= \pi_{x3}\, q_1 q_2^2 + p_1 q_2^2 + p_1 p_2 q_2 + \tfrac{1}{3} p_1 p_2^2.
\end{aligned}
\]

The additional term p1p2[3 + (m − 1)p1 + (n − 1)p2]/12 in the variance formula represents the correction for ties.

Under the null hypothesis of no difference between the two treatment groups, p1 = p2, q1 = q2, πx1 = 1/2, and πx2 = πx3 = 1/3, which implies πŨ1 = 1/2 and πŨ2 = πŨ3 = 1/3. Therefore,

\[
\mu_0 = E_0(\tilde{U}) = \frac{1}{2}, \qquad \sigma_0^2 = \operatorname{Var}_0(\tilde{U}) = \frac{1}{12\, mn}\left\{(n + m + 1) - p^2\left[3 + (m + n - 2)p\right]\right\}. \tag{16}
\]

In practice, p is estimated by the pooled sample proportion p̂ = (mp̂1 + np̂2)/(m + n), and Var0(Ũ) is calculated accordingly.

As in the case of untied worst-rank scores, the distribution of the WMW test statistic,

\[
\tilde{Z} = \frac{\tilde{U} - E_0(\tilde{U})}{\sqrt{\operatorname{Var}_0(\tilde{U})}},
\]

converges to a standard normal distribution as m and n tend to infinity.

The power of the WMW test is given by

\[
\Phi\!\left(\frac{\sigma_0}{\sigma_1}\, z_{\alpha/2} + \frac{\mu_1 - \mu_0}{\sigma_1}\right) + \Phi\!\left(\frac{\sigma_0}{\sigma_1}\, z_{\alpha/2} - \frac{\mu_1 - \mu_0}{\sigma_1}\right) \approx \Phi\!\left(\frac{\sigma_0}{\sigma_1}\, z_{\alpha/2} + \frac{|\mu_1 - \mu_0|}{\sigma_1}\right). \tag{17}
\]
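The tied counterpart of the earlier sketch replaces the moments (6)-(8) with (15)-(16); again this is our illustration, with the πxγ supplied, for example, by the Monte Carlo sketch of Section 3.1.1:

```python
import numpy as np
from scipy.stats import norm

def tied_power(m, n, p1, p2, pi, alpha=0.05):
    """Analytical power (17) of the tied worst-rank WMW test, via (15)-(16).
    pi holds pi_x1, pi_x2, pi_x3 under keys 'x1', 'x2', 'x3'."""
    q1, q2 = 1 - p1, 1 - p2
    piU1 = q1*q2*pi['x1'] + p1*q2 + p1*p2/2
    piU2 = pi['x2']*q1**2*q2 + 2*pi['x1']*p1*q1*q2 + p1**2*q2 + p1**2*p2/3
    piU3 = pi['x3']*q1*q2**2 + p1*q2**2 + p1*p2*q2 + p1*p2**2/3
    var1 = (piU1*(1 - piU1)
            + (m - 1)*(piU2 - piU1**2 - p1**2*p2/12)
            + (n - 1)*(piU3 - piU1**2 - p1*p2**2/12)
            - p1*p2/4) / (m*n)
    p = (m*p1 + n*p2) / (m + n)                  # pooled death probability
    var0 = ((m + n + 1) - p**2*(3 + (m + n - 2)*p)) / (12*m*n)
    za = norm.ppf(alpha/2)                       # z_{alpha/2} < 0
    s0, s1, d = np.sqrt(var0), np.sqrt(var1), piU1 - 0.5
    return norm.cdf(s0/s1*za + d/s1) + norm.cdf(s0/s1*za - d/s1)
```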

3.2.2. Simulation Study

We conducted simulation studies to assess the validity of the power formula (17) for the tied worst-rank test, based on exponentially distributed death times and normally distributed non-fatal outcomes, with a two-sided α = 0.05. We considered the same parameters as in Section 3.1 and generated a total of 10,000 trials. The results are presented in Table 2.

Table 2.

Power for the Wilcoxon–Mann–Whitney test for tied worst-rank scores.

        q2 = 60%                                  q2 = 80%                                  q1 = q2 = 100%
Δx  HR: 1.0  1.2  1.4  1.6  2.0  2.4  3.0     HR: 1.0  1.2  1.4  1.6  2.0  2.4  3.0
(1) Analytical power
0.0 0.05 0.08 0.17 0.30 0.61 0.84 0.97 0.05 0.06 0.09 0.14 0.28 0.47 0.73 0.05
0.1 0.06 0.12 0.23 0.37 0.67 0.87 0.98 0.07 0.11 0.17 0.24 0.42 0.60 0.82 0.08
0.2 0.08 0.17 0.29 0.45 0.73 0.90 0.99 0.14 0.21 0.28 0.37 0.56 0.73 0.89 0.16
0.3 0.12 0.23 0.37 0.53 0.79 0.93 0.99 0.26 0.34 0.43 0.52 0.70 0.83 0.94 0.31
0.4 0.17 0.30 0.45 0.60 0.83 0.94 0.99 0.40 0.50 0.59 0.67 0.81 0.90 0.97 0.49
0.5 0.23 0.37 0.53 0.67 0.87 0.96 0.99 0.57 0.65 0.73 0.79 0.89 0.95 0.99 0.68
0.6 0.31 0.45 0.60 0.73 0.90 0.97 1.00 0.71 0.78 0.84 0.88 0.94 0.98 0.99 0.83

We used exponential distributions for the survival times, normal distributions for the non-fatal outcomes, with 50 subjects in each group.

Δx: standardized difference in the primary outcome of interest; HR: hazard ratio; q1 (resp. q2): survival probability in the control (resp. active) treatment group.

(1) Analytical power formula (17);

As in the case of the untied worst-rank score test, there is virtually no difference between the empirical power and the analytical power calculated using formula (17). We also note that the power of the tied worst-rank test is nearly identical to that of the untied test for these simulation scenarios. There are instances where the actual time to early death (before the end of follow-up) may not be very important; likewise, in the case of drop-out due to worsening of the disease condition, the time of drop-out may be hard to assess for practical reasons. The simulation results we have obtained are reassuring, since they demonstrate that, in these cases, one does not lose much by using tied worst-rank scores, lumping all informatively missing observations together to share the same tied rank value, in lieu of untied worst-rank scores. The untied worst-rank test has higher power than the tied worst-rank test when there is substantial early death. This would not be expected in most clinical trials; if it were, a survival analysis would likely be the primary endpoint, rather than a continuous response. This is depicted in Figure 1, for n = m = 50, Δx = 0.1, and survival probabilities in the active treatment group of q2 = 30%, 40%, 60%, and 80%.

Figure 1. Power comparison for the Wilcoxon–Mann–Whitney test under tied (solid line) and untied (dashed line) worst-rank scores, where survival times follow an exponential distribution and the non-fatal outcome follows a normal distribution. The survival probabilities in the active treatment group are: (a) 30%; (b) 40%; (c) 60%; (d) 80%. There are 50 subjects in each group, with Δx = 0.1.

3.2.3. Comments

The result that there is only a minor efficiency advantage in using the untied worst-rank composite outcome instead of the tied worst-rank composite outcome is extremely important. Up to now, it was argued in the literature, without proof, that by using the untied worst-rank composite outcome in lieu of the tied worst-rank composite outcome one may gain some efficiency (see [7]). Our simulations contradict this assertion. Even though assigning a fixed score to all missing observations under the tied worst-rank score imputation procedure (as opposed to the untied worst-rank score imputation) results in a lower sum of ranks in the active treatment group when there is truly a treatment effect on mortality, this goes along with lower variances of the WMW test statistic under the null and alternative hypotheses (due to the multiple ties induced by the imputation procedure). As a result, the powers under the untied and tied imputation procedures are similar.

Usually, the decision to use either tied or untied worst-rank scores is not based merely on the desire to increase power. In practice, it depends also on the nature of the disease and the willingness of the investigators to weight all deaths equally, be it, for example, two days after treatment initiation or ten months later. Therefore, the choice of the imputation method should be made by eliciting and using expert opinions. Furthermore, the use of worst-rank composite outcomes has been advocated in many settings, including clinical trials where the treatment effects on a disease are assessed through a host of meaningful, yet of different importance, magnitude, and impact, heterogeneous clinical events (hospitalization, death, …) and clinical endpoints that provide a more comprehensive understanding of these effects (see [11, 9]). No matter what the situation, it is reassuring to know that a Wilcoxon–Mann–Whitney test based on tied worst-rank scores is almost as efficient as one based on untied worst-rank scores. To our knowledge, this surprisingly encouraging finding has not been established so far in the literature.

4. Sample Size Calculation

4.1. Sample Size formula

Let N = n + m be the total sample size and let s = n/N be the active treatment fraction. In the case of untied worst-rank scores, if we ignore the lower-order terms, equations (6)-(8) for the means and variances can be re-written so that:

\[
\mu_1 - \mu_0 = u, \qquad \sigma_0^2 \approx \frac{1}{12\, s(1 - s)N}, \qquad \sigma_1^2 \approx \frac{\upsilon_1}{s(1 - s)N},
\]

with

\[
u = \pi_{U1} - \tfrac{1}{2}, \qquad \upsilon_1 = (1 - s)(\pi_{U2} - \pi_{U1}^2) + s(\pi_{U3} - \pi_{U1}^2).
\]

Setting the power to 1 − β, from formula (10) we derive
\[
\frac{|\mu_1 - \mu_0|}{\sigma_0} = -\left(\frac{\sigma_1}{\sigma_0}\, z_{\beta} + z_{\alpha/2}\right) = \frac{\sigma_1}{\sigma_0}\, z_{1-\beta} + z_{1-\alpha/2},
\]
which implies that
\[
N = \left(\frac{z_{1-\alpha/2} + \sqrt{12\,\upsilon_1}\; z_{1-\beta}}{u\,\sqrt{12\, s(1 - s)}}\right)^{2}. \tag{18}
\]

For the tied worst-rank test, if we also ignore the lower order terms, equations (15) and (16) yield

\[
\mu_1 - \mu_0 = \tilde{u}, \qquad \sigma_0^2 \approx \frac{\tilde{\upsilon}_0}{12\, s(1 - s)N}, \qquad \sigma_1^2 \approx \frac{\tilde{\upsilon}_1}{s(1 - s)N},
\]

where ũ = πŨ1 − 1/2, υ̃0 = 1 − p³, and υ̃1 = (1 − s)(πŨ2 − πŨ1²) + s(πŨ3 − πŨ1²) − p1p2[(1 − s)p1 + sp2]/12, implying that

\[
N = \left(\frac{\sqrt{\tilde{\upsilon}_0}\; z_{1-\alpha/2} + \sqrt{12\,\tilde{\upsilon}_1}\; z_{1-\beta}}{\tilde{u}\,\sqrt{12\, s(1 - s)}}\right)^{2}. \tag{19}
\]
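In code, (18) and (19) amount to the following sketch (ours; the inputs are the πU quantities, and, for the tied test, the πŨ quantities together with the death probabilities):

```python
import numpy as np
from scipy.stats import norm

def N_untied(piU1, piU2, piU3, s=0.5, alpha=0.05, power=0.80):
    """Total sample size N from formula (18)."""
    u = piU1 - 0.5
    v1 = (1 - s)*(piU2 - piU1**2) + s*(piU3 - piU1**2)
    za, zb = norm.ppf(1 - alpha/2), norm.ppf(power)
    return int(np.ceil(((za + np.sqrt(12*v1)*zb) / (u*np.sqrt(12*s*(1 - s))))**2))

def N_tied(piU1, piU2, piU3, p1, p2, s=0.5, alpha=0.05, power=0.80):
    """Total sample size N from formula (19); arguments are the pi-tilde
    quantities of (15) plus the group death probabilities p1, p2."""
    u = piU1 - 0.5
    p = (1 - s)*p1 + s*p2                        # pooled death probability
    v0 = 1 - p**3
    v1 = ((1 - s)*(piU2 - piU1**2) + s*(piU3 - piU1**2)
          - p1*p2*((1 - s)*p1 + s*p2)/12)
    za, zb = norm.ppf(1 - alpha/2), norm.ppf(power)
    return int(np.ceil(((np.sqrt(v0)*za + np.sqrt(12*v1)*zb)
                        / (u*np.sqrt(12*s*(1 - s))))**2))
```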

4.2. Sample Size Estimation

When the distributions of the non-fatal outcome, X, and the survival time, t, are known approximately, it is possible to estimate the probabilities πtγ and πxγ, γ = 1, 2, 3, and thus πUγ (resp. πŨγ), γ = 1, 2, 3, and thereby derive estimates of u and υ1 (resp. ũ, υ̃0, and υ̃1), as

\[
\begin{aligned}
\hat{u} &= \hat{\pi}_{U1} - 1/2, \qquad \hat{\upsilon}_1 = (1 - s)(\hat{\pi}_{U2} - \hat{\pi}_{U1}^2) + s(\hat{\pi}_{U3} - \hat{\pi}_{U1}^2),\\
\hat{\tilde{u}} &= \hat{\pi}_{\tilde{U}1} - 1/2, \qquad \hat{\tilde{\upsilon}}_0 = 1 - \hat{p}^3, \qquad \hat{\tilde{\upsilon}}_1 = (1 - s)(\hat{\pi}_{\tilde{U}2} - \hat{\pi}_{\tilde{U}1}^2) + s(\hat{\pi}_{\tilde{U}3} - \hat{\pi}_{\tilde{U}1}^2) - \hat{p}_1 \hat{p}_2\left[(1 - s)\hat{p}_1 + s\hat{p}_2\right]/12;
\end{aligned}
\]

and evaluate the desired sample size.

In this section, we consider four methods for estimating πtγ and πxγ, γ = 1, 2, 3, based on either the availability of data from previous studies or assumptions that can be made about the underlying distributions of the death times and continuous outcome of interest.

Method A: When data from a previous pilot study are available, Wang et al. [32] recommended estimating πtγ and πxγ, γ = 1, 2, 3, with their sample counterparts:

\[
\hat{\pi}_{t1} = (m_1 n_1)^{-1} \sum_{k=1}^{m_1} \sum_{l=1}^{n_1} I(t_{1k} < t_{2l} \mid t_{1k} \le T,\, t_{2l} \le T), \tag{20}
\]

\[
\begin{aligned}
\hat{\pi}_{x1} &= (m_2 n_2)^{-1} \sum_{k=1}^{m_2} \sum_{l=1}^{n_2} I(X_{1k} < X_{2l}),\\
\hat{\pi}_{t2} &= [m_1 n_1 (m_1 - 1)]^{-1} \sum_{k \ne k'} \sum_{l=1}^{n_1} I(t_{1k} < t_{2l},\, t_{1k'} < t_{2l} \mid t_{1k} \le T,\, t_{1k'} \le T,\, t_{2l} \le T),\\
\hat{\pi}_{x2} &= [m_2 n_2 (m_2 - 1)]^{-1} \sum_{k \ne k'} \sum_{l=1}^{n_2} I(X_{1k} < X_{2l},\, X_{1k'} < X_{2l}),\\
\hat{\pi}_{t3} &= [m_1 n_1 (n_1 - 1)]^{-1} \sum_{k=1}^{m_1} \sum_{l \ne l'} I(t_{1k} < t_{2l},\, t_{1k} < t_{2l'} \mid t_{1k} \le T,\, t_{2l} \le T,\, t_{2l'} \le T),\\
\hat{\pi}_{x3} &= [m_2 n_2 (n_2 - 1)]^{-1} \sum_{k=1}^{m_2} \sum_{l \ne l'} I(X_{1k} < X_{2l},\, X_{1k} < X_{2l'}),
\end{aligned} \tag{21}
\]
where n1 (resp. n2) is the number of deaths (resp. survivors) in the active treatment group, and m1 (resp. m2) is the number of deaths (resp. survivors) in the control group.

The probabilities πU1, πU2, and πU3 (resp. πŨ1, πŨ2, πŨ3, and p) can be estimated as:

\[
\hat{p}_1 = \frac{m_1}{m}; \quad \hat{p}_2 = \frac{n_1}{n}; \quad \hat{q}_1 = 1 - \hat{p}_1; \quad \hat{q}_2 = 1 - \hat{p}_2; \quad \hat{\pi}_{U1} = \hat{p}_1 \hat{p}_2 \hat{\pi}_{t1} + \hat{p}_1 \hat{q}_2 + \hat{q}_1 \hat{q}_2 \hat{\pi}_{x1}; \tag{22}
\]

\[
\begin{aligned}
\hat{\pi}_{U2} &= \hat{p}_1^2 \hat{q}_2 + \hat{p}_1^2 \hat{p}_2 \hat{\pi}_{t2} + 2 \hat{p}_1 \hat{q}_1 \hat{q}_2 \hat{\pi}_{x1} + \hat{q}_1^2 \hat{q}_2 \hat{\pi}_{x2}; \qquad
\hat{\pi}_{U3} = \hat{p}_1 \hat{q}_2^2 + \hat{p}_1 \hat{p}_2^2 \hat{\pi}_{t3} + 2 \hat{p}_1 \hat{p}_2 \hat{q}_2 \hat{\pi}_{t1} + \hat{q}_1 \hat{q}_2^2 \hat{\pi}_{x3};\\
\hat{\pi}_{\tilde{U}1} &= \hat{q}_1 \hat{q}_2 \hat{\pi}_{x1} + \hat{p}_1 \hat{q}_2 + \tfrac{1}{2} \hat{p}_1 \hat{p}_2; \qquad
\hat{\pi}_{\tilde{U}2} = \hat{\pi}_{x2} \hat{q}_1^2 \hat{q}_2 + 2 \hat{\pi}_{x1} \hat{p}_1 \hat{q}_1 \hat{q}_2 + \hat{p}_1^2 \hat{q}_2 + \tfrac{1}{3} \hat{p}_1^2 \hat{p}_2;\\
\hat{\pi}_{\tilde{U}3} &= \hat{\pi}_{x3} \hat{q}_1 \hat{q}_2^2 + \hat{p}_1 \hat{q}_2^2 + \hat{p}_1 \hat{p}_2 \hat{q}_2 + \tfrac{1}{3} \hat{p}_1 \hat{p}_2^2; \qquad
\hat{p} = \frac{m \hat{p}_1 + n \hat{p}_2}{m + n} = \frac{m_1 + n_1}{m + n}.
\end{aligned} \tag{23}
\]
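The double sums in (20)-(21) can be computed efficiently after forming the pairwise comparison matrix; a vectorized sketch (ours, for a generic pair of samples):

```python
import numpy as np

def pairwise_pis(a, b):
    """Sample counterparts (20)-(21) of pi_1 = P(a < b),
    pi_2 = P(a_k < b_l, a_k' < b_l), and pi_3 = P(a_k < b_l, a_k < b_l'),
    for independent samples a (group 1) and b (group 2)."""
    less = np.asarray(a)[:, None] < np.asarray(b)[None, :]   # m x n comparisons
    m, n = less.shape
    pi1 = less.mean()
    c = less.sum(axis=0)            # per b_l: number of a_k below it
    pi2 = (c*(c - 1)).sum() / (m*(m - 1)*n)
    d = less.sum(axis=1)            # per a_k: number of b_l above it
    pi3 = (d*(d - 1)).sum() / (n*(n - 1)*m)
    return pi1, pi2, pi3
```

Applied to the death times of those who died in each group, it returns (π̂t1, π̂t2, π̂t3); applied to the survivors' outcomes, it returns (π̂x1, π̂x2, π̂x3), which then feed (22)-(23).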

Method B: Noether [30] suggested assuming σ0 = σ1. The formulae (18) and (19) then become, respectively,

\[
N = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{u\,\sqrt{12\, s(1 - s)}}\right)^{2} \qquad \text{and} \qquad N = \tilde{\upsilon}_0 \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{\tilde{u}\,\sqrt{12\, s(1 - s)}}\right)^{2}. \tag{24}
\]

Hence, we need only to estimate πU1 (resp. πŨ1) in equation (22) (resp. (23)) via estimates of πt1 and πx1 from previous similar studies. When a pilot study exists, these estimates can simply be derived using (20) and (21).

Method C: One particular alternative hypothesis that has been extensively considered in the literature is the location shift alternative. In our setting, this translates into

\[
H_1: G_1(x - \Lambda_x) = G_2(x) \quad \text{and} \quad F_1(t - \Lambda_t) = F_2(t), \tag{25}
\]

for the untied worst-rank scores. For the tied worst-rank scores, it translates into

\[
H_1: G_1(x - \Lambda_x) = G_2(x) \quad \text{and} \quad p_1 \ge p_2. \tag{26}
\]

When the shift Λx is small, Lehmann [29] assumed σ0 = σ1 and used the approximation

\[
\hat{\pi}_{x1} = 1/2 + \Lambda_x\, g^{*}(0) \tag{27}
\]

to estimate πx1, where g*(0) is the density of G* evaluated at 0, and G* denotes the distribution of the difference of two independent random variables, each with distribution G. For example, g*(0) = 1/2 for the exponential distribution with mean one, and g*(0) = π^{−1/2}/2 for the standard normal distribution. When g* is unknown, as is often the case, one can use the normal approximation g*(0) = π^{−1/2}/(2σx), with σx being the standard deviation of the distribution of X.
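As an illustration (with values we pick only for this example), for a standard normal outcome with Λx = 0.5 and σx = 1, the approximation gives π̂x1 = 1/2 + 0.5 × π^{−1/2}/2 ≈ 0.641, close to the exact value P(X1 < X2) = Φ(0.5/√2) ≈ 0.638.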

Method D: In practice, the underlying distributions Gi, i = 1, 2, are almost always unknown, which makes it difficult to estimate the sample size under the alternative shift hypothesis. In addition, the meaning of Λx in the location shift alternative varies for each G1. For these reasons, Rosner and Glynn [31] proposed the use of the probit shift alternative.

Let X2c be defined as the counterfactual random variable obtained if each of the active treatment group subjects received the control treatment. Let HX1 = Φ^{−1}(G1) and HX2 = Φ^{−1}(G2). By definition, G1 = G2c, so that HX2c = HX1. We can then consider the class of probit shift alternatives for the primary outcome of interest given by

\[
H_0: H_{X_1} = H_{X_2} = H_{X_{2c}} \;\text{and}\; F_1(t) = F_2(t) \qquad \text{versus} \qquad H_1: H_{X_2} = H_{X_{2c}} - \nu_x \;\text{and}\; F_1(t) \ge F_2(t).
\]

Therefore, the null and alternative hypotheses become H0 : G1(x) = G2(x) and F1(t) = F2(t), versus

\[
H_1: \Phi\!\left[\Phi^{-1}\{G_1(x)\} - \nu_x\right] = G_2(x) \quad \text{and} \quad F_1(t) \ge F_2(t), \tag{28}
\]

for the untied worst-rank scores. In the case of the tied worst-rank scores, we have

\[
H_0: G_1(x) = G_2(x) \;\text{and}\; p_1 = p_2, \qquad \text{versus} \qquad H_1: \Phi\!\left[\Phi^{-1}\{G_1(x)\} - \nu_x\right] = G_2(x) \;\text{and}\; p_1 \ge p_2. \tag{29}
\]

For any random variable X1, HX1 has a standard normal distribution. Thus, under the alternative (28), HX1 ~ N(0, 1) and HX2 ~ N(−νx, 1). We can show (see Rosner and Glynn [31]) that

\[
\pi_{x1} = \Phi\!\left(\frac{\nu_x}{\sqrt{2}}\right) \quad \text{and} \quad \pi_{x2} = \pi_{x3} = P\!\left(Z < \frac{\nu_x}{\sqrt{2}},\; Z' < \frac{\nu_x}{\sqrt{2}}\right), \quad \text{for } (Z, Z') \sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & 1/2\\ 1/2 & 1\end{pmatrix}\right). \tag{30}
\]
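Equation (30) is easy to evaluate numerically; in the sketch below (ours; it requires a SciPy version providing multivariate_normal.cdf), the bivariate probability uses the correlation-1/2 normal of (30):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def probit_shift_pis(nu_x):
    """pi_x1 and pi_x2 = pi_x3 under the probit shift alternative, eq. (30)."""
    z = nu_x / np.sqrt(2)
    pi_x1 = norm.cdf(z)
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])
    pi_x23 = bvn.cdf([z, z])
    return pi_x1, pi_x23

# route (a) of the text: estimate pi_x1 from pilot data via (21), then invert:
# nu_x = np.sqrt(2) * norm.ppf(pi_x1_hat)
```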

The probit shift alternatives (28) and (29) have an advantage over the location shift alternatives (25) and (26) since we need simply to either (a) estimate πx1 from (21), then derive νx, πx2 and πx3 from (30) or (b) prespecify νx and derive πx1, πx2, and πx3 from (30). Moreover, according to Rosner and Glynn [38], the rationale of using a probit transformation to define a shift alternative in (28) and (29) is that

  1. any continuous distribution Gi has a corresponding underlying probit transformation HXi;

  2. it is natural to define location shifts between distributions on a normally distributed scale, which by definition is satisfied by the probit transformation;

  3. νx can be interpreted as an effect size on the probit scale, which is very useful for sample size and power calculations.

For the standard WMW test applied to a continuous outcome, methods A and B are used when data are available from previous pilot studies, while methods C and D are used when data from both groups are limited to summary statistics [39]. Method D can also be used for data not satisfying the location shift alternative or for an exploratory or a pilot study [31]. In general, the computations of πt2, πt3, πx2, and πx3 are more involved than those of πt1 and πx1. Methods B and C circumvent the evaluation of πt2, πt3, πx2, and πx3 and only necessitate estimation of the probabilities πt1 and πx1.

4.3. Simulation Study

To evaluate the performance of the four sample size methods presented in Section 4.2 and to assess the accuracy of the formulae (18) and (19), we conducted extensive simulation studies, similar to those conducted by Zhao [40]. First, we estimated, via these formulae, the sample sizes required to detect a specific difference in the primary outcome for the nominal power fixed at 80%. Then, we examined the accuracy of these sample size formulae by comparing their results to the (estimated) true sample sizes obtained by inverting the empirical power, as outlined in the algorithm below. Moreover, for each sample size obtained in the first step, we calculated the corresponding empirical power and evaluated its discrepancy from the target, nominal power.

We set the active treatment fraction to be s = 1/2 (i.e., m = n) and the two-sided significance level to be α = 0.05. We generated survival times from exponential (mean 1), Weibull (with shape parameter equal to 1.2), and log-logistic (with shape parameter 1) distributions. We fixed the follow-up time T at 3 months, the survival probability q2 in the active treatment group at 60% and 80%, and the hazard ratio (HR) for mortality (control vs. active treatment group) at 1.0, 1.5, and 3.0. For the log-logistic distribution, the odds ratio (OR) of surviving longer for subjects in the active treatment group was also fixed at 1.0, 1.5, and 3.0. We generated non-fatal outcomes such that

\[
X_{ij} = \mu_{x_i} + \varepsilon_x, \tag{31}
\]

with mean response μx1 = 0 (control group) and μx2 = 0.5 (active treatment group). We considered three distributions for εx: normal, lognormal (each with mean 0 and variance 1), and t-distribution with 3 degrees of freedom.

For each method, we estimated the relevant probabilities from among πt1, πt2, πt3, πx1, πx2, and πx3, as well as πU1, πU2, and πU3 (resp. πŨ1, πŨ2, and πŨ3), using the methods described in Section 4.2. More precisely, we generated a total of 1000 simulated data sets, each with a sample size of 10,000, and estimated these probabilities. We then used these estimates to calculate the sample sizes under both the untied (18) and tied (19) worst-rank missing data assignment. For method B, we used the same πt1 and πx1 from method A. For method C, we used πt1 obtained from method A. In addition, we used the normal approximation in (27) to obtain π̂x1 = 1/2 + Λx(2σx)^{−1}π^{−1/2}. We estimated the common standard deviation for non-fatal outcomes σx using the pooled standard deviation. We estimated both Λx = μx2 − μx1 and σx by their sample counterparts. Finally, for method D, we considered πt1, πt2, πt3, and πx1 calculated through method A. We estimated ν̂x = 2^{1/2}Φ^{−1}(π̂x1), and derived the probabilities πx2 and πx3 from (30).

To evaluate these estimated sample sizes relative to the true sample sizes, we used the following algorithm:

  1. Begin with a pre-specified sample size N.

  2. Set counter C = 0.

  3. Generate X̃ij as given in equation (1) (resp. equation (3)) for the untied (resp. tied) worst-rank scores, with tij ~ exponential, Weibull, or log-logistic and Xij as defined in equation (31).

  4. Increment C by 1 if |Z| > 1.96 (resp. |Z̃| > 1.96).

  5. Repeat steps (3) and (4) for R = 10,000 simulated datasets.

  6. Calculate τ = C/R. If τ = 60% (resp. τ = 80%), stop and choose N as the estimate of the true sample size. This corresponds to 60% (resp. 80%) power. If τ < 60% (resp. τ < 80%), increase N and return to step (2). If τ > 60% (resp. τ > 80%), decrease N and return to step (2).
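A sketch of this search (ours; gen_scores stands for a hypothetical user-supplied function that simulates the two groups' worst-rank scores for a trial of total size N):

```python
import numpy as np

rng = np.random.default_rng(0)

def wmw_z(s1, s2):
    """WMW statistic Z of (9); for the tied test, Z-tilde would instead use
    half-credit for ties and the tie-corrected null variance (16)."""
    m, n = len(s1), len(s2)
    U = (np.asarray(s1)[:, None] < np.asarray(s2)[None, :]).mean()
    return (U - 0.5) / np.sqrt((m + n + 1) / (12*m*n))

def empirical_power(N, gen_scores, R=10_000, z_crit=1.96):
    """Steps (2)-(5): fraction of R simulated trials with |Z| > 1.96."""
    count = 0
    for _ in range(R):
        s1, s2 = gen_scores(N, rng)
        count += abs(wmw_z(s1, s2)) > z_crit
    return count / R

def calibrate_N(gen_scores, target=0.80, N=50, step=10, R=10_000):
    """Step (6), coarsely: increase N until the empirical power reaches target."""
    while empirical_power(N, gen_scores, R) < target:
        N += step
    return N
```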

We also evaluated the discrepancy between the nominal power and the corresponding (empirically) estimated power based on the estimated sample size. The adequacy of power estimations under these methods is measured by the relative percentage error, which we define as

\[
\text{Relative percentage error} = \frac{\text{estimated power} - \text{nominal power}}{\text{estimated power}} \times 100\%.
\]

The estimated power is obtained by following the steps (1) to (6) of the algorithm above, with the exception that at step (6), instead of re-calibrating N, we simply calculate the estimated power, τ.

The results are given in Table 3 for the untied worst-rank test and in Table 4 for the tied worst-rank test. Results for the Weibull and log-logistic distributions are given in Appendix C.

Table 3.

Sample size estimation and corresponding relative percentage error for the Wilcoxon–Mann–Whitney test applied to untied worst-rank scores (exponential survival)

                     q2 = 60%                                        q2 = 80%
Distr.a   HR     A        B        C        D        Truth       A        B        C        D        Truth
normal 1.0 1071 (1.0) 1071 (1.0) 1033 (−1.4) 1071(1.0) 1070 327 (−1.4) 329 (−1.2) 317 (−2.2) 327 (−1.4) 330
1.5 210 (−0.3) 214 (−0.2) 211(−0.4) 209 (−0.3) 210 173 (−1.6) 176 (0.3) 172 (−0.3) 173 (−1.6) 174
3.0 44 (−0.4) 48 (3.9) 48 (3.9) 44 (−0.4) 44 60 (−0.9) 63 (1.8) 63 (1.8) 60 (−0.9) 60
t3 1.0 1619 (0.8) 1619 (0.8) 3024 (17.5) 1619 (0.8) 1616 497 (−0.2) 499 (−0.8) 937 (17.6) 496 (−0.6) 502
1.5 242 (0.0) 245 (0.3) 292 (7.7) 242 (0.0) 242 224 (0.1) 227 (0.2) 326 (13.1) 224 (0.1) 224
3.0 46 (−0.2) 49 (1.0) 51 (3.5) 46 (−0.2) 46 67 (−1.3) 70 (2.7) 81 (8.3) 67 (−1.3) 68
lognormal 1.0 814 (−0.6) 812 (0.0) 4721 (20.2) 812 (0.0) 812 260 (0.9) 261 (1.3) 1594 (20.0) 258 (0.8) 250
1.5 191 (−1.0) 194 (0.7) 321 (15.9) 191 (−1.0) 192 146 (0.4) 148 (0.2) 411(19.8) 145 (−1.2) 146
3.0 43 (−2.2) 47 (2.4) 52 (8.0) 43 (−2.2) 44 55 (−2.1) 59 (1.2) 86 (14.7) 54 (−2.1) 56
a Distribution of the primary outcome of interest; t3 denotes Student's t distribution with 3 degrees of freedom. The nominal power is equal to 80%.

Table 4.

Sample size estimation and corresponding relative percentage error for the Wilcoxon–Mann–Whitney test applied to tied worst-rank scores (exponential survival)

                     q2 = 60%                                        q2 = 80%
Distr.a   HR     A        B        C        D        Truth       A        B        C        D        Truth
normal 1.0 998 (0.7) 998 (0.7) 963 (−1.3) 998 (0.6) 996 324 (−0.7) 327 (−0.5) 315 (−3.7) 324 (−0.7) 322
1.5 205 (−0.3) 209 (0.7) 205 (−0.3) 205 (−0.3) 206 172 (0.2) 175 (0.8) 171 (−0.6) 172 (0.2) 172
3.0 45 (0.7) 48 (3.9) 48 (3.9) 45 (0.7) 44 59 (−0.9) 63 (1.9) 62 (1.9) 59 (−0.9) 60
t3 1.0 1515 (0.9) 1514 (1.8) 2828 (17.6) 1514 (0.7) 1512 493 (−0.6) 495 (−0.4) 930 (17.5) 493 (−0.6) 498
1.5 237 (0.5) 240 (0.4) 290 (8.8) 237 (0.5) 236 223 (0.4) 226 (−0.4) 327 (13.3) 222 (0.1) 222
3 46 (0.5) 49 (1.9) 51 (3.9) 46 (0.5) 44 67 (−0.5) 70 (2.2) 80 (8.0) 67 (−0.5) 68
lognormal 1.0 758 (0.5) 757 (0.5) 4425 (20.1) 756 (−1.2) 758 257 (0.9) 258 (1.4) 1578 (20) 256 (1.0) 254
1.5 187 (0.1) 189 (0.8) 320 (16.4) 186 (0.2) 186 145 (−0.23) 147 (0.5) 410 (19.9) 144 (−0.2) 146
3.0 44 (1.7) 46 (3.2) 52 (8.3) 43 (0.2) 42 55 (−2.1) 58 (1.3) 86 (14.9) 54 (−1.1) 56
a Distribution of the primary outcome of interest; t3 denotes Student's t distribution with 3 degrees of freedom. The nominal power is equal to 80%.

As seen in Tables 3 and 4 for the case of exponential survival, the sample sizes estimated by methods A, B, and D are fairly similar and close to the true sample sizes across all the distributions considered, for the nominal power of 80%. As expected, the sample sizes estimated by method C are close to the true sample size only when the non-fatal outcome is normally distributed. This method performs poorly when the distribution is Student's t with 3 degrees of freedom or lognormal, with estimates several times the true sample size. Methods A and D are suitable when one has data from a pilot study, as they then provide good estimates of the sample size. Method D has the advantage over method A of requiring only estimates of πt1 and/or πx1. Although method B makes one more assumption than method A, the results are quite similar, which indicates that the equal-variance assumption may be reasonable. Lastly, we note that the sample size requirements for the untied and tied worst-rank tests are quite similar, which is not surprising given the similarity in their power seen in Tables 1 and 2. Analogous results are obtained in Appendix Tables C.1, C.2, C.3, and C.4 for the Weibull and log-logistic distributions of survival times.

Table C.1.

Sample size estimation for the Wilcoxon–Mann–Whitney test for untied worst-rank scores (Weibull survival)

                     q2 = 60%                                        q2 = 80%
Distr.a   HR     A        B        C        D        Truth       A        B        C        D        Truth
Normal 1.0 1024 (−1.3) 1025 (−1.3) 982 (−3.8) 1024 (−1.3) 1026 326 (−1.0) 329 (−0.2) 316 (−3.3) 326(−1.0) 332
1.5 168 (0.2) 172 (1.2) 170 (1.1) 168 (0.2) 168 152 (0.5) 156 (0.9) 152 (0.5) 152 (0.5) 152
3.0 31 (−3.8) 35 (3.8) 35 (3.8) 31(−3.8) 32 44 (0.0) 47 (1.9) 47 (1.9) 44 (0.0) 44
t3 1.0 1604 (0.7) 1604 (0.7) 2995 (17.33) 1604(0.7) 1600 497 (−0.6) 499 (0.0) 938 (17.6) 497 (−0.6) 498
1.5 185 (−1.1) 189 (0.9) 217 (6.1) 185 (−1.1) 186 192 (−0.1) 195 (0.0) 268 (12.2) 191(−1.4) 194
3.0 32 (−0.6) 35 (2.9) 36 (5.1) 32 (−0.6) 32 47 (−1.6) 51 (−1.2) 56 (6.5) 47 (−2.7) 52
Lognormal 1.0 794 (−1.7) 792 (−1.4) 4586 (20) 792 (−1.3) 800 252 (−0.3) 253 (0.4) 1499 (20.0) 251 (−0.3) 252
1.5 155 (0.14) 158 (1.0) 239 (14.5) 154 (−0.4) 154 130 (−0.4) 132 (0.7) 334 (19.6) 129 (−0.7) 130
3.0 31 (−3.4) 35 (4.5) 37 (6.0) 31 (−3.4) 32 42 (0.7) 44 (3.2) 59 (12.2) 41 (−2.0) 42
a Distribution of the primary outcome of interest; t3 denotes Student's t distribution with 3 degrees of freedom. The nominal power is equal to 80%.

Table C.2.

Sample size estimation for the Wilcoxon–Mann–Whitney test for tied worst-rank scores (Weibull survival)

                     q2 = 60%                                        q2 = 80%
Distr.a   HR     A        B        C        D        Truth       A        B        C        D        Truth
Normal 1.0 956 (−1.0) 956 (−1.0) 917 (−4.0) 956 (−1.0) 958 324 (−0.9) 326 (−0.2) 314 (−2.1) 324 (−0.9) 326
1.5 166 (0.7) 169 (1.8) 167 (0.6) 166 (0.7) 164 152 (−0.2) 155 (1.0) 152 (−0.2) 152 (−0.2) 152
3.0 32 (0.6) 35 (2.7) 35 (2.7) 32 (0.6) 32 43 (−2.3) 47 (2.0) 46 (3.0) 43 (−2.3) 44
t3 1.0 1502 (−0.2) 1502 (−0.2) 2806 (17.4) 1502 (−0.2) 1502 494 (−0.4) 496 (−0.8) 931 (17.5) 493 (−0.9) 494
1.5 184 (0.2) 187 (0.3) 217 (7.0) 184 (0.2) 184 192 (−0.2) 196 (0.3) 275 (12.9) 193 (0.9) 194
3 33 (−1.5) 36 (5.1) 37 (5.3) 33 (−1.5) 34 47 (−2.0) 51 (2.3) 56 (7.2) 47 (−2.0) 48
Lognormal 1.0 742 (−1.0) 740 (−0.6) 4273 (20.0) 740 (−0.6) 740 250 (−1.5) 252 (0.4) 1435 (20.0) 249 (−0.8) 250
1.5 151 (0.9) 153 (1.0) 237 (15.5) 150 (−0.1) 150 129 (−0.5) 132 (1.3) 334 (19.6) 128 (0.3) 128
3.0 32 (0.9) 35 (4.1) 37 (7.3) 32 (0.9) 32 41 (−2.8) 44 (3.1) 59 (12.4) 42 (1.3) 43
a Distribution of the primary outcome of interest; t3 denotes Student's t distribution with 3 degrees of freedom. The nominal power is equal to 80%.

Table C.3.

Sample size estimation and corresponding relative percentage error for the Wilcoxon–Mann–Whitney test on untied worst-rank scores (log-logistic survival)

                     q2 = 60%                                        q2 = 80%
Distr.a   HR     A        B        C        D        Truth       A        B        C        D        Truth
Normal 1.0 1079 (−0.1) 1080 (0.7) 1038 (−0.9) 1079 (−0.1) 178 327 (−1.6) 330 (0.3) 318 (−2.9) 327 (−1.6) 330
1.5 272 (−0.3) 276 (0.7) 271 (−0.14) 272 (−0.3) 272 186 (−1.0) 190 (1.5) 185 (−0.9) 186 (−1.0) 186
3.0 71 (−1.1) 74 (0.7) 74 (0.7) 71 (−1.1) 72 75 (−1.3) 79 (1.4) 78 (1.6) 75 (−1.3) 74
t3 1.0 1617 (1.4) 1616 (1.0) 3021 (17.7) 1616 (1.0) 116 520 (2.2) 522 (1.5) 1011 (18.2) 519 (1.5) 518
1.5 317 (−1.0) 320 (−1.0) 398 (9.0) 317 (−1.0) 318 241 (−1.7) 244 (−0.6) 354 (13.2) 240 (−0.8) 246
3.0 76 (−0.1) 80 (2.2) 85 (4.6) 76 (−0.1) 78 86 (−1.1) 90 (1.5) 106 (8.2) 86 (−1.1) 88
Lognormal 1.0 791 (−0.7) 790 (−1.23) 4483 (20.0) 789 (−1.3) 792 253 (0.4) 254 (0.3) 1539 (20.0) 251 (0.0) 252
1.5 240 (−0.1) 243 (−0.4) 453 (17.5) 240 (−0.5) 244 154 (−0.7) 156 (0.2) 456 (19.9) 153 (−1.6) 154
3.0 69 (−1.2) 73 (1.5) 89 (9.5) 69 (−1.2) 72 69 (−1.1) 72 (1.6) 118 (16.2) 68 (−0.8) 70
a Distribution of the primary outcome of interest; t3 denotes Student's t distribution with 3 degrees of freedom. The nominal power is equal to 80%.

Table C.4.

Sample size estimation and corresponding relative percentage error for the Wilcoxon–Mann–Whitney test for tied worst-rank scores (log-logistic survival)

                     q2 = 60%                                        q2 = 80%
Distr.a   HR     A        B        C        D        Truth       A        B        C        D        Truth
Normal 1.0 1010 (1.1) 1010 (1.1) 971 (−1.2) 1010 (1.1) 1008 325 (−1.1) 328 (−0.9) 315 (−2.0) 325 (−1.1) 324
1.5 282 (0.9) 285 (0.8) 280 (0.0) 282 (0.9) 280 186 (−0.3) 190 (1.0) 184 (−1.3) 186 (−0.3) 188
3.0 82 (−1.0) 85 (1.3) 85 (1.3) 82 (−1.0) 84 78 (−0.1) 81 (2.2) 80 (1.1) 78 (−0.1) 78
t3 1.0 1519 (0.7) 1518 (1.5) 2841 (17.6) 1518 (1.5) 1516 515 (0.9) 517 (2.1) 1002 (18.1) 514 (1.6) 512
1.5 331 (−1.3) 334 (−0.6) 422 (9.1) 330 (−0.6) 332 242 (−0.6) 245 (−0.2) 357 (13.1) 242 (−0.6) 244
3.0 89 (−1.1) 92 (1.6) 100 (5.9) 89 (−1.1) 90 88 (−1.0) 92 (2.2) 110 (9.3) 88 (−1.0) 90
Lognormal 1.0 739 (−0.9) 738 (−1.2) 4178 (20.0) 737 (−1.2) 740 251 (0.0) 252 (0.9) 1531 (20) 249 (−0.4) 252
1.5 247 (0.4) 249 (0.3) 485 (17.8) 246 (0.1) 246 154 (−0.3) 156 (1.1) 460 (19.8) 153 (−0.8) 154
3.0 79 (−0.8) 82 (1.2) 104 (11.0) 79 (−0.8) 80 70 (−0.4) 73 (1.0) 122 (16.3) 70 (−0.4) 70
a Distribution of the primary outcome of interest; t3 denotes Student's t distribution with 3 degrees of freedom. The nominal power is equal to 80%.

5. Discussion

We have considered the use of the ordinary Wilcoxon–Mann–Whitney (WMW) test in the context of informatively missing observations, with the tied or untied worst-rank score imputation procedure for missing observations. This is important in the context of trials with primary endpoints that are measured at a pre-specified fixed time-point for highly lethal diseases. The impact of missing observations due to disease-related events is often overlooked [41, 37, 42]. We focused on testing the null hypothesis of no difference in both mortality and the primary outcome of interest against the restricted, uni-directional alternative hypothesis that the active treatment has a favorable effect on these two outcomes. In the context where higher values of the primary outcome of interest are indicative of better health outcome, the uni-directional alternative hypothesis specifies that subjects in the active treatment tend to have higher values of the primary outcome of interest or longer survival, while the distribution of the other outcome is either better in the active treatment group or at least equivalent in both treatment groups.

We have found that these worst-rank tests can be considerably more powerful than the alternative approach of analyzing the survivors only. In particular, this is the case as long as there is a large survival advantage to the treated group, or a moderate survival advantage coupled with a small to moderate advantage in the non-fatal outcome. This reflects the tradeoff between the added power of the worst-rank tests due to their use of all subjects and the added power of the survivors-only test due to their not diluting the non-fatal outcome effect with the survival effect. This also suggests that the “mortality neutral” assumption of McMahon and Harrell [27] in their power and sample size calculations is not advisable if a treatment effect on survival is plausible in that it would lead to a much larger sample size than necessary.

We also found that there is no real power advantage of the untied worst-rank test over the tied worst-rank test when the survival probability is at least 60%; given its simplicity, the tied worst-rank test may therefore be preferable. While there may be an advantage to the untied worst-rank test when the survival probability is lower, we did not consider this case, as it is unlikely to represent a realistic scenario for a trial of a non-fatal outcome, except perhaps in clinical trials of critically ill patients [43].

We have derived power formulae for the WMW test under both the tied and untied worst-rank procedures, using the standard normal approximation to the distribution of the WMW statistic. These formulae take the rank assignment procedures into account and evaluate the mean and variance of the WMW test statistic accordingly. Our simulation studies demonstrate that the formulae are highly accurate, improving upon existing formulae in the literature.
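To make the normal-approximation calculation concrete, a minimal sketch follows (Python, with SciPy assumed). The two-sided Wald-type form shown is one standard version of such an approximation, with the null and alternative moments of the U-statistic supplied from the closed-form expressions in Appendix A; the function name wmw_power is ours, for illustration only.

```python
from scipy.stats import norm

def wmw_power(mu0, sd0, mu1, sd1, alpha=0.05):
    """Approximate power of a two-sided WMW test via the normal
    approximation: reject H0 when |U - mu0| / sd0 exceeds the normal
    critical value; evaluate the rejection probability under the
    alternative, where U ~ N(mu1, sd1^2) approximately."""
    z = norm.ppf(1 - alpha / 2)
    upper = norm.cdf((mu1 - mu0 - z * sd0) / sd1)   # P(U > mu0 + z*sd0)
    lower = norm.cdf((mu0 - mu1 - z * sd0) / sd1)   # P(U < mu0 - z*sd0)
    return upper + lower
```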

We have also extended four sample size estimation methods commonly used in the literature to the setting of the worst-rank tests. Three of the methods require neither a normality assumption for the non-fatal outcome nor a location-shift alternative. In simulation studies, we found these three methods to be highly accurate in their estimation of the sample size. The choice of method depends on the availability of pilot data and on the nature of the primary outcome of interest.

The WMW tests on tied and untied worst-rank composite outcomes presented in this paper can be extended to more complex settings in which multiple components of the clinical course are considered in assigning worst-rank scores. Examples include the variations of worst-rank imputation proposed by Felker et al. [11, 9] and Moyé et al. [44, 45] and the longitudinally measured non-fatal outcomes introduced by Finkelstein and Schoenfeld [46].

Acknowledgments

Contract/grant sponsor: NIH; P50-NS051343; R01-CA075971; T32-NS048005

Appendix A

Mean and variance for U and Ũ

A.1 Untied Worst-Rank Scores

Consider the untied worst-rank adjusted values for subjects in the control and active treatment groups

$$\tilde X_{1k} = (1-\delta_{1k})X_{1k} + \delta_{1k}(\eta + t_{1k}) \quad\text{and}\quad \tilde X_{2l} = (1-\delta_{2l})X_{2l} + \delta_{2l}(\eta + t_{2l}), \quad\text{for } k = 1,\dots,m \text{ and } l = 1,\dots,n.$$

Define the WMW U-statistic by
$$U = (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n} U_{kl} = (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n} I(\tilde X_{1k} < \tilde X_{2l}).$$

Since $\tilde X_{1k} < \tilde X_{2l}$ occurs when $\{t_{1k} < t_{2l}$ and $\delta_{1k} = \delta_{2l} = 1\}$, $\{\delta_{1k} = 1$ and $\delta_{2l} = 0\}$, or $\{X_{1k} < X_{2l}$ and $\delta_{1k} = \delta_{2l} = 0\}$, we have
$$U_{kl} = I(\tilde X_{1k} < \tilde X_{2l}) = I(t_{1k} < t_{2l},\, \delta_{1k} = \delta_{2l} = 1) + I(\delta_{1k} = 1, \delta_{2l} = 0) + I(X_{1k} < X_{2l},\, \delta_{1k} = \delta_{2l} = 0).$$

\begin{align*}
E(U) = E(U_{kl}) &= P(t_{1k} < t_{2l} \mid \delta_{1k} = \delta_{2l} = 1)\,P(\delta_{1k} = \delta_{2l} = 1) + P(\delta_{1k} = 1, \delta_{2l} = 0) + P(X_{1k} < X_{2l})\,P(\delta_{1k} = \delta_{2l} = 0)\\
&= p_1 p_2\, P(t_{1k} < t_{2l} \mid \delta_{1k} = \delta_{2l} = 1) + p_1 q_2 + q_1 q_2\, P(X_{1k} < X_{2l})\\
&= p_1 p_2 \pi_{t1} + p_1 q_2 + q_1 q_2 \pi_{x1} = \pi_{U1} \tag{A.1}
\end{align*}

where $q_1 = 1 - p_1$, $q_2 = 1 - p_2$, $\pi_{t1} = P(t_{1k} < t_{2l} \mid \delta_{1k} = \delta_{2l} = 1)$, and $\pi_{x1} = P(X_{1k} < X_{2l})$.
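This decomposition, and the resulting expression (A.1), can be checked by direct simulation. The following is a minimal sketch (Python; all names and parameter values are ours, purely illustrative) that forms the untied worst-rank scores and computes the empirical U:

```python
import numpy as np

rng = np.random.default_rng(1)
m = n = 4000
T = 5.0

# Death times (here exponential, for illustration) and death indicators
# for deaths occurring before the end of follow-up T.
t1_full = rng.exponential(1 / 0.102, m)
t2_full = rng.exponential(1 / 0.071, n)
d1, d2 = (t1_full < T).astype(int), (t2_full < T).astype(int)

# Non-fatal outcomes (used only for survivors); higher values are better.
x1, x2 = rng.normal(0.0, 1.0, m), rng.normal(0.4, 1.0, n)

# Untied worst-rank scores: eta is chosen so that eta + t falls below
# every observed outcome, preserving "earlier death is worse".
eta = min(x1.min(), x2.min()) - T - 1.0
s1 = np.where(d1 == 1, eta + t1_full, x1)
s2 = np.where(d2 == 1, eta + t2_full, x2)

U = (s1[:, None] < s2[None, :]).mean()   # empirical E(U); compare with (A.1)
```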

\begin{align*}
\mathrm{Var}(U) &= (mn)^{-2}\Bigg[\sum_{k=1}^{m}\sum_{l=1}^{n}\mathrm{Var}(U_{kl}) + \sum_{k=1}^{m}\sum_{l=1}^{n}\sum_{k'=1}^{m}\sum_{l'=1}^{n}\mathrm{Cov}(U_{kl}, U_{k'l'})\Bigg], \quad\text{with } k \ne k' \text{ or } l \ne l' \text{ or both}\\
&= (mn)^{-1}\big[\mathrm{Var}(U_{kl}) + (m-1)\,\mathrm{Cov}(U_{kl}, U_{k'l}) + (n-1)\,\mathrm{Cov}(U_{kl}, U_{kl'})\big].
\end{align*}

Note that $\mathrm{Cov}(U_{kl}, U_{k'l'}) = E(U_{kl}U_{k'l'}) - E(U_{kl})E(U_{k'l'}) = 0$ for $k \ne k'$ and $l \ne l'$, while $\mathrm{Cov}(U_{kl}, U_{k'l}) = E(U_{kl}U_{k'l}) - E(U_{kl})E(U_{k'l})$ and $\mathrm{Cov}(U_{kl}, U_{kl'}) = E(U_{kl}U_{kl'}) - E(U_{kl})E(U_{kl'})$. In addition, because $U_{kl} = I(\tilde X_{1k} < \tilde X_{2l})$ follows a Bernoulli distribution with success probability $\pi_{U1}$, we have $\mathrm{Var}(U_{kl}) = E(U_{kl})[1 - E(U_{kl})] = \pi_{U1}(1 - \pi_{U1})$.

\begin{align*}
E(U_{kl}U_{k'l}) &= P(U_{kl}U_{k'l} = 1)\\
&= P(\delta_{1k} = \delta_{1k'} = 1, \delta_{2l} = 0) + P(t_{1k} < t_{2l}, t_{1k'} < t_{2l} \mid \delta_{1k} = \delta_{1k'} = \delta_{2l} = 1)\,P(\delta_{1k} = \delta_{1k'} = \delta_{2l} = 1)\\
&\quad + P(X_{1k'} < X_{2l})\,P(\delta_{1k} = 1, \delta_{1k'} = \delta_{2l} = 0) + P(X_{1k} < X_{2l})\,P(\delta_{1k} = 0, \delta_{1k'} = 1, \delta_{2l} = 0)\\
&\quad + P(X_{1k} < X_{2l}, X_{1k'} < X_{2l})\,P(\delta_{1k} = \delta_{1k'} = \delta_{2l} = 0)\\
&= p_1^2 q_2 + p_1^2 p_2 \pi_{t2} + 2 p_1 q_1 q_2 \pi_{x1} + q_1^2 q_2 \pi_{x2},\\[4pt]
E(U_{kl}U_{kl'}) &= P(U_{kl}U_{kl'} = 1)\\
&= P(\delta_{1k} = 1, \delta_{2l} = \delta_{2l'} = 0) + P(t_{1k} < t_{2l}, t_{1k} < t_{2l'} \mid \delta_{1k} = \delta_{2l} = \delta_{2l'} = 1)\,P(\delta_{1k} = \delta_{2l} = \delta_{2l'} = 1)\\
&\quad + P(t_{1k} < t_{2l} \mid \delta_{1k} = \delta_{2l} = 1, \delta_{2l'} = 0)\,P(\delta_{1k} = \delta_{2l} = 1, \delta_{2l'} = 0)\\
&\quad + P(t_{1k} < t_{2l'} \mid \delta_{1k} = \delta_{2l'} = 1, \delta_{2l} = 0)\,P(\delta_{1k} = \delta_{2l'} = 1, \delta_{2l} = 0)\\
&\quad + P(X_{1k} < X_{2l}, X_{1k} < X_{2l'})\,P(\delta_{1k} = \delta_{2l} = \delta_{2l'} = 0)\\
&= p_1 q_2^2 + p_1 p_2^2 \pi_{t3} + 2 p_1 p_2 q_2 \pi_{t1} + q_1 q_2^2 \pi_{x3}
\end{align*}

with

$$\pi_{t2} = P(t_{1k} < t_{2l}, t_{1k'} < t_{2l} \mid \delta_{1k} = \delta_{1k'} = \delta_{2l} = 1), \qquad \pi_{x2} = P(X_{1k} < X_{2l}, X_{1k'} < X_{2l}),$$
$$\pi_{t3} = P(t_{1k} < t_{2l}, t_{1k} < t_{2l'} \mid \delta_{1k} = \delta_{2l} = \delta_{2l'} = 1), \qquad \pi_{x3} = P(X_{1k} < X_{2l}, X_{1k} < X_{2l'}).$$

Therefore,

$$\mathrm{Var}(U) = (mn)^{-1}\big[\pi_{U1}(1 - \pi_{U1}) + (m-1)(\pi_{U2} - \pi_{U1}^2) + (n-1)(\pi_{U3} - \pi_{U1}^2)\big], \tag{A.2}$$

where $\pi_{U2} = p_1^2 q_2 + p_1^2 p_2 \pi_{t2} + 2 p_1 q_1 q_2 \pi_{x1} + q_1^2 q_2 \pi_{x2}$ and $\pi_{U3} = p_1 q_2^2 + p_1 p_2^2 \pi_{t3} + 2 p_1 p_2 q_2 \pi_{t1} + q_1 q_2^2 \pi_{x3}$.
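For implementation, the moments (A.1) and (A.2) reduce to simple arithmetic once the component probabilities are supplied (closed forms for them are derived in Appendix B for exponential death times and normal outcomes). A minimal sketch, with a function name of our choosing rather than code from the paper:

```python
def untied_moments(p1, p2, pi_t1, pi_t2, pi_t3, pi_x1, pi_x2, pi_x3, m, n):
    """Mean and variance of the WMW U-statistic under untied worst-rank
    scoring, per equations (A.1) and (A.2)."""
    q1, q2 = 1 - p1, 1 - p2
    pi_U1 = p1 * p2 * pi_t1 + p1 * q2 + q1 * q2 * pi_x1            # (A.1)
    pi_U2 = p1**2 * q2 + p1**2 * p2 * pi_t2 + 2 * p1 * q1 * q2 * pi_x1 + q1**2 * q2 * pi_x2
    pi_U3 = p1 * q2**2 + p1 * p2**2 * pi_t3 + 2 * p1 * p2 * q2 * pi_t1 + q1 * q2**2 * pi_x3
    var_U = (pi_U1 * (1 - pi_U1)
             + (m - 1) * (pi_U2 - pi_U1**2)
             + (n - 1) * (pi_U3 - pi_U1**2)) / (m * n)             # (A.2)
    return pi_U1, var_U
```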

A.2 Tied Worst-Rank Scores

Let $\tilde X_{1k} = (1 - \delta_{1k})X_{1k} + \delta_{1k}\eta$ and $\tilde X_{2l} = (1 - \delta_{2l})X_{2l} + \delta_{2l}\eta$, for $k = 1, \dots, m$ and $l = 1, \dots, n$, be the tied worst-rank adjusted values for subjects in the control and active treatment groups. Consider the WMW U-statistic defined by

$$\tilde U = (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n} \tilde U_{kl} = (mn)^{-1}\sum_{k=1}^{m}\sum_{l=1}^{n}\Big[I(\tilde X_{1k} < \tilde X_{2l}) + \tfrac{1}{2} I(\tilde X_{1k} = \tilde X_{2l})\Big].$$

Then $\tilde U_{kl} = I(X_{1k} < X_{2l})\,I(\delta_{1k} = \delta_{2l} = 0) + I(\delta_{1k} = 1, \delta_{2l} = 0) + \tfrac{1}{2} I(\delta_{1k} = \delta_{2l} = 1)$, which implies

\begin{align*}
E(\tilde U_{kl}) &= P(X_{1k} < X_{2l})\,P(\delta_{1k} = \delta_{2l} = 0) + P(\delta_{1k} = 1, \delta_{2l} = 0) + \tfrac{1}{2} P(\delta_{1k} = \delta_{2l} = 1)\\
&= q_1 q_2 \pi_{x1} + p_1 q_2 + \tfrac{1}{2} p_1 p_2 = \pi_{\tilde U 1} \tag{A.3}
\end{align*}
\begin{align*}
\mathrm{Var}(\tilde U) &= (mn)^{-2}\Bigg[\sum_{k=1}^{m}\sum_{l=1}^{n}\mathrm{Var}(\tilde U_{kl}) + \sum_{k=1}^{m}\sum_{l=1}^{n}\sum_{k'=1}^{m}\sum_{l'=1}^{n}\mathrm{Cov}(\tilde U_{kl}, \tilde U_{k'l'})\Bigg], \quad\text{for } k \ne k' \text{ or } l \ne l' \text{ or both}\\
&= (mn)^{-1}\big[\mathrm{Var}(\tilde U_{kl}) + (m-1)\,\mathrm{Cov}(\tilde U_{kl}, \tilde U_{k'l}) + (n-1)\,\mathrm{Cov}(\tilde U_{kl}, \tilde U_{kl'})\big].
\end{align*}
Since the three indicator events in $\tilde U_{kl}$ are mutually exclusive,
\begin{align*}
\mathrm{Var}(\tilde U_{kl}) &= \pi_{x1} q_1 q_2 (1 - \pi_{x1} q_1 q_2) + p_1 q_2 (1 - p_1 q_2) + \tfrac{1}{4} p_1 p_2 (1 - p_1 p_2)\\
&\quad - 2\pi_{x1} q_1 q_2\, p_1 q_2 - \pi_{x1} q_1 q_2\, p_1 p_2 - p_1 q_2\, p_1 p_2\\
&= \big(\pi_{x1} q_1 q_2 + p_1 q_2 + \tfrac{1}{2} p_1 p_2\big)\big(1 - \pi_{x1} q_1 q_2 - p_1 q_2 - \tfrac{1}{2} p_1 p_2\big) - \tfrac{1}{4} p_1 p_2\\
&= \pi_{\tilde U 1}(1 - \pi_{\tilde U 1}) - \tfrac{1}{4} p_1 p_2.
\end{align*}

Similarly, $\mathrm{Cov}(\tilde U_{kl}, \tilde U_{k'l}) = E(\tilde U_{kl}\tilde U_{k'l}) - E(\tilde U_{kl})E(\tilde U_{k'l})$ and $\mathrm{Cov}(\tilde U_{kl}, \tilde U_{kl'}) = E(\tilde U_{kl}\tilde U_{kl'}) - E(\tilde U_{kl})E(\tilde U_{kl'})$, for $k \ne k'$, $l \ne l'$, with

\begin{align*}
E(\tilde U_{kl}\tilde U_{k'l}) &= P(X_{1k} < X_{2l}, X_{1k'} < X_{2l})\,P(\delta_{1k} = \delta_{1k'} = \delta_{2l} = 0) + 2\,P(X_{1k} < X_{2l})\,P(\delta_{1k} = 0, \delta_{1k'} = 1, \delta_{2l} = 0)\\
&\quad + P(\delta_{1k} = \delta_{1k'} = 1, \delta_{2l} = 0) + \tfrac{1}{4} P(\delta_{1k} = \delta_{1k'} = \delta_{2l} = 1)\\
&= \pi_{x2} q_1^2 q_2 + 2 \pi_{x1} p_1 q_1 q_2 + p_1^2 q_2 + \tfrac{1}{4} p_1^2 p_2,\\[4pt]
E(\tilde U_{kl}\tilde U_{kl'}) &= P(X_{1k} < X_{2l}, X_{1k} < X_{2l'})\,P(\delta_{1k} = \delta_{2l} = \delta_{2l'} = 0) + P(\delta_{1k} = 1, \delta_{2l} = \delta_{2l'} = 0)\\
&\quad + \tfrac{1}{2} P(\delta_{1k} = \delta_{2l} = 1, \delta_{2l'} = 0) + \tfrac{1}{2} P(\delta_{1k} = \delta_{2l'} = 1, \delta_{2l} = 0) + \tfrac{1}{4} P(\delta_{1k} = \delta_{2l} = \delta_{2l'} = 1)\\
&= \pi_{x3} q_1 q_2^2 + p_1 q_2^2 + p_1 p_2 q_2 + \tfrac{1}{4} p_1 p_2^2.
\end{align*}

Therefore,

$$\mathrm{Var}(\tilde U) = (mn)^{-1}\Bigg[\pi_{\tilde U 1}(1 - \pi_{\tilde U 1}) + (m-1)\bigg(\pi_{\tilde U 2} - \pi_{\tilde U 1}^2 - \frac{p_1^2 p_2}{12}\bigg) + (n-1)\bigg(\pi_{\tilde U 3} - \pi_{\tilde U 1}^2 - \frac{p_1 p_2^2}{12}\bigg) - \frac{p_1 p_2}{4}\Bigg], \tag{A.4}$$

where $\pi_{\tilde U 2} = \pi_{x2} q_1^2 q_2 + 2 \pi_{x1} p_1 q_1 q_2 + p_1^2 q_2 + \tfrac{1}{3} p_1^2 p_2$ and $\pi_{\tilde U 3} = \pi_{x3} q_1 q_2^2 + p_1 q_2^2 + p_1 p_2 q_2 + \tfrac{1}{3} p_1 p_2^2$.
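The tied-score moments admit the same treatment. Again a minimal illustrative sketch, not code from the paper:

```python
def tied_moments(p1, p2, pi_x1, pi_x2, pi_x3, m, n):
    """Mean and variance of the WMW U-statistic under tied worst-rank
    scoring, per equations (A.3) and (A.4)."""
    q1, q2 = 1 - p1, 1 - p2
    pi_U1 = q1 * q2 * pi_x1 + p1 * q2 + 0.5 * p1 * p2              # (A.3)
    pi_U2 = pi_x2 * q1**2 * q2 + 2 * pi_x1 * p1 * q1 * q2 + p1**2 * q2 + p1**2 * p2 / 3
    pi_U3 = pi_x3 * q1 * q2**2 + p1 * q2**2 + p1 * p2 * q2 + p1 * p2**2 / 3
    var_U = (pi_U1 * (1 - pi_U1)
             + (m - 1) * (pi_U2 - pi_U1**2 - p1**2 * p2 / 12)
             + (n - 1) * (pi_U3 - pi_U1**2 - p1 * p2**2 / 12)
             - p1 * p2 / 4) / (m * n)                               # (A.4)
    return pi_U1, var_U
```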

Appendix B

Conditional probabilities

Suppose the death times follow exponential distributions, i.e., $t_i \sim \mathrm{Exp}(\lambda_i)$, $i = 1, 2$. Let $\theta = \lambda_1/\lambda_2$, $q_2 = e^{-\lambda_2 T}$, and $q_1 = q_2^{\theta}$. Since $P(\delta_{1k} = 1) = p_1$ and $P(\delta_{2l} = 1) = p_2$, we have

$$\pi_{t1} = P(t_{1k} < t_{2l} \mid \delta_{1k} = \delta_{2l} = 1) = (p_1 p_2)^{-1}\int_0^T \big(1 - e^{-\lambda_1 u}\big)\,\lambda_2 e^{-\lambda_2 u}\,du = \frac{1}{1 - q_2^{\theta}}\Bigg[1 - \frac{1 - q_2^{1+\theta}}{(1+\theta)(1 - q_2)}\Bigg], \tag{B.1}$$
$$\pi_{t2} = P(t_{1k} < t_{2l}, t_{1k'} < t_{2l} \mid \delta_{1k} = \delta_{1k'} = \delta_{2l} = 1) = (p_1^2 p_2)^{-1}\int_0^T \big(1 - e^{-\lambda_1 u}\big)^2\,\lambda_2 e^{-\lambda_2 u}\,du = \frac{1}{(1 - q_2^{\theta})^2}\Bigg\{1 + \frac{1}{1 - q_2}\Bigg[\frac{1 - q_2^{1+2\theta}}{1 + 2\theta} - \frac{2\big(1 - q_2^{1+\theta}\big)}{1 + \theta}\Bigg]\Bigg\}, \tag{B.2}$$
\begin{align}
\pi_{t3} &= P(t_{1k} < t_{2l}, t_{1k} < t_{2l'} \mid \delta_{1k} = \delta_{2l} = \delta_{2l'} = 1) = p_1^{-1} p_2^{-2}\int_0^T \big(e^{-\lambda_2 u} - e^{-\lambda_2 T}\big)^2\,\lambda_1 e^{-\lambda_1 u}\,du \tag{B.3}\\
&= \bigg(\frac{q_2}{1 - q_2}\bigg)^2\Bigg[1 + \frac{\theta\big(1 - q_2^{2+\theta}\big)}{(2+\theta)(1 - q_2^{\theta})\,q_2^2} - \frac{2\theta\big(1 - q_2^{1+\theta}\big)}{(1+\theta)(1 - q_2^{\theta})\,q_2}\Bigg]. \tag{B.4}
\end{align}
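The closed forms (B.1) through (B.4) are straightforward to code and to verify by quadrature. A minimal sketch (function name and parameter values are ours; SciPy assumed), with a numerical cross-check of (B.1):

```python
import numpy as np
from scipy.integrate import quad

def pi_t_exponential(lam1, lam2, T):
    """Closed forms (B.1)-(B.4) for exponential death times, with
    theta = lam1/lam2 and q2 = exp(-lam2*T)."""
    theta = lam1 / lam2
    q2 = np.exp(-lam2 * T)
    pi_t1 = (1 - (1 - q2**(1 + theta)) / ((1 + theta) * (1 - q2))) / (1 - q2**theta)
    pi_t2 = (1 + ((1 - q2**(1 + 2 * theta)) / (1 + 2 * theta)
                  - 2 * (1 - q2**(1 + theta)) / (1 + theta)) / (1 - q2)) / (1 - q2**theta)**2
    pi_t3 = (q2 / (1 - q2))**2 * (
        1 + theta * (1 - q2**(2 + theta)) / ((2 + theta) * (1 - q2**theta) * q2**2)
          - 2 * theta * (1 - q2**(1 + theta)) / ((1 + theta) * (1 - q2**theta) * q2))
    return pi_t1, pi_t2, pi_t3

# Quadrature cross-check of (B.1) at arbitrary parameter values:
lam1, lam2, T = 0.3, 0.2, 5.0
p1, p2 = 1 - np.exp(-lam1 * T), 1 - np.exp(-lam2 * T)
integral, _ = quad(lambda u: (1 - np.exp(-lam1 * u)) * lam2 * np.exp(-lam2 * u), 0, T)
assert np.isclose(integral / (p1 * p2), pi_t_exponential(lam1, lam2, T)[0])
```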

Now, suppose that the non-fatal outcomes $X_1$ and $X_2$ follow normal distributions $N(\mu_{x1}, \sigma_{x1}^2)$ and $N(\mu_{x2}, \sigma_{x2}^2)$, respectively. Consider
$$\Delta_x = \frac{\mu_{x2} - \mu_{x1}}{\sqrt{\sigma_{x1}^2 + \sigma_{x2}^2}}, \qquad \rho_{xj} = \frac{\sigma_{xj}^2}{\sigma_{x1}^2 + \sigma_{x2}^2}, \qquad Z_{kl} = \frac{X_{1k} - X_{2l} - (\mu_{x1} - \mu_{x2})}{\sqrt{\sigma_{x1}^2 + \sigma_{x2}^2}}.$$
We can show that

\begin{align*}
\pi_{x1} &= P(X_{1k} < X_{2l}) = \Phi(\Delta_x),\\
\pi_{x2} &= P(X_{1k} < X_{2l}, X_{1k'} < X_{2l}) = P(Z_{kl} < \Delta_x, Z_{k'l} < \Delta_x), & (Z_{kl}, Z_{k'l}) &\sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho_{x2}\\ \rho_{x2} & 1\end{pmatrix}\right),\\
\pi_{x3} &= P(X_{1k} < X_{2l}, X_{1k} < X_{2l'}) = P(Z_{kl} < \Delta_x, Z_{kl'} < \Delta_x), & (Z_{kl}, Z_{kl'}) &\sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho_{x1}\\ \rho_{x1} & 1\end{pmatrix}\right).
\end{align*}
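These quantities are univariate and bivariate normal probabilities that standard software evaluates directly; the sketch below uses SciPy's multivariate normal CDF. The closing lines illustrate, with made-up values, how the pieces combine with the helper sketches given earlier (pi_t_exponential, untied_moments, wmw_power) to approximate power; none of this is code from the paper.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def pi_x_normal(mu_x1, mu_x2, sd_x1, sd_x2):
    """pi_x1, pi_x2, pi_x3 for normal non-fatal outcomes, via the
    bivariate-normal representation above."""
    s2 = sd_x1**2 + sd_x2**2
    delta = (mu_x2 - mu_x1) / np.sqrt(s2)
    rho1, rho2 = sd_x1**2 / s2, sd_x2**2 / s2
    pi_x1 = norm.cdf(delta)
    pi_x2 = multivariate_normal([0, 0], [[1, rho2], [rho2, 1]]).cdf([delta, delta])
    pi_x3 = multivariate_normal([0, 0], [[1, rho1], [rho1, 1]]).cdf([delta, delta])
    return pi_x1, pi_x2, pi_x3

# Illustrative combination with the earlier sketches (all values made up):
m = n = 150
pt1, pt2, pt3 = pi_t_exponential(0.102, 0.071, 5.0)   # gives p1 ~ 0.40, p2 ~ 0.30
px1, px2, px3 = pi_x_normal(0.0, 0.4, 1.0, 1.0)
mu1, v1 = untied_moments(0.40, 0.30, pt1, pt2, pt3, px1, px2, px3, m, n)
# Under H0 (identical arms): pi_t1 = pi_x1 = 1/2 and all joint terms 1/3;
# using the pooled death probability for both arms is a common heuristic.
mu0, v0 = untied_moments(0.35, 0.35, 0.5, 1/3, 1/3, 0.5, 1/3, 1/3, m, n)
print(wmw_power(mu0, v0**0.5, mu1, v1**0.5))          # approximate power
```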

Appendix C

Simulation results for Weibull and log-logistic survival times

References

1. Zhang JL, Rubin DB. Estimation of causal effects via principal stratification when some outcomes are truncated by death. Journal of Educational and Behavioral Statistics. 2003;28(4):353–368.
2. Rubin DB. Causal inference through potential outcomes and principal stratification: application to studies with “censoring” due to death. Statistical Science. 2006:299–309.
3. Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58(1):21–29. doi:10.1111/j.0006-341x.2002.00021.x.
4. Singhal A. Normobaric Oxygen Therapy in Acute Ischemic Stroke Trial. ClinicalTrials.gov Database. 2006 Dec. URL http://clinicaltrials.gov/ct2/show/NCT00414726.
5. Singhal A, Benner T, Roccatagliata L, Koroshetz W, Schaefer P, Lo E, Buonanno F, Gonzalez R, Sorensen A. A pilot study of normobaric oxygen therapy in acute ischemic stroke. Stroke. 2005;36(4):797. doi:10.1161/01.STR.0000158914.66827.2e.
6. Kumar S, Selim M, Caplan L. Medical complications after stroke. The Lancet Neurology. 2010;9(1):105–118. doi:10.1016/S1474-4422(09)70266-2.
7. Lachin J. Worst-rank score analysis with informatively missing observations in clinical trials. Controlled Clinical Trials. 1999;20(5):408–422. doi:10.1016/s0197-2456(99)00022-7.
8. Berry JD, Miller R, Moore DH, Cudkowicz ME, Van Den Berg LH, Kerr DA, Dong Y, Ingersoll EW, Archibald D. The combined assessment of function and survival (CAFS): a new endpoint for ALS clinical trials. Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration. 2013;14(3):162–168. doi:10.3109/21678421.2012.762930.
9. Felker G, Maisel A. A global rank end point for clinical trials in acute heart failure. Circulation: Heart Failure. 2010;3(5):643–646. doi:10.1161/CIRCHEARTFAILURE.109.926030.
10. Zannad F, Garcia AA, Anker SD, Armstrong PW, Calvo G, Cleland JG, Cohn JN, Dickstein K, Domanski MJ, Ekman I, et al. Clinical outcome endpoints in heart failure trials: a European Society of Cardiology Heart Failure Association consensus document. European Journal of Heart Failure. 2013;15(10):1082–1094. doi:10.1093/eurjhf/hft095.
11. Felker G, Anstrom K, Rogers J. A global ranking approach to end points in trials of mechanical circulatory support devices. Journal of Cardiac Failure. 2008;14(5):368–372. doi:10.1016/j.cardfail.2008.01.009.
12. Subherwal S, Ohman EM, Mahaffey KW, Rao SV, Alexander JH, Wang TY, Alexander KP, Hasselblad V, Roe MT. Incorporation of bleeding as an element of the composite end point in clinical trials of antithrombotic therapies in patients with non-ST-segment elevation acute coronary syndrome: validity, pitfalls, and future approaches. American Heart Journal. 2013;165(5):644–654. doi:10.1016/j.ahj.2012.11.012.
13. Neaton JD, Wentworth DN, Rhame F, Hogan C, Abrams DI, Deyton L. Considerations in choice of a clinical endpoint for AIDS clinical trials. Statistics in Medicine. 1994;13(19–20):2107–2125. doi:10.1002/sim.4780131919.
14. Lisa AB, James SH. Rule-based ranking schemes for antiretroviral trials. Statistics in Medicine. 1997;16:1175–1191. doi:10.1002/(sici)1097-0258(19970530)16:10<1175::aid-sim522>3.0.co;2-g.
15. Joshua Chen Y, Gould AL, Nessly ML. Treatment comparisons for a partially categorical outcome applied to a biomarker with assay limit. Statistics in Medicine. 2005;24(2):211–228. doi:10.1002/sim.1833.
16. Follmann D, Wittes J, Cutler JA. The use of subjective rankings in clinical trials with an application to cardiovascular disease. Statistics in Medicine. 1992;11(4):427–437. doi:10.1002/sim.4780110402.
17. Brittain E, Palensky J, Blood J, Wittes J. Blinded subjective rankings as a method of assessing treatment effect: a large sample example from the Systolic Hypertension in the Elderly Program (SHEP). Statistics in Medicine. 1997;16(6):681–693. doi:10.1002/(sici)1097-0258(19970330)16:6<681::aid-sim487>3.0.co;2-h.
18. Allen LA, Hernandez AF, O’Connor CM, Felker GM. End points for clinical trials in acute heart failure syndromes. Journal of the American College of Cardiology. 2009;53(24):2248–2258. doi:10.1016/j.jacc.2008.12.079.
19. Allen LA, Spertus JA, et al. End points for comparative effectiveness research in heart failure. Heart Failure Clinics. 2013;9(1):15–28. doi:10.1016/j.hfc.2012.09.002.
20. Sun H, Davison BA, Cotter G, Pencina MJ, Koch GG. Evaluating treatment efficacy by multiple end points in phase II acute heart failure clinical trials analyzing data using a global method. Circulation: Heart Failure. 2012;5(6):742–749. doi:10.1161/CIRCHEARTFAILURE.112.969154.
21. Subherwal S, Anstrom KJ, Jones WS, Felker MG, Misra S, Conte MS, Hiatt WR, Patel MR. Use of alternative methodologies for evaluation of composite end points in trials of therapies for critical limb ischemia. American Heart Journal. 2012;164(3):277–284. doi:10.1016/j.ahj.2012.07.002.
22. DeCoster T, Willis M, Marsh J, Williams T, Nepola J, Dirschl D, Hurwitz S. Rank order analysis of tibial plafond fractures: does injury or reduction predict outcome? Foot & Ankle International. 1999;20(1):44–49. doi:10.1177/107110079902000110.
23. Gould A. A new approach to the analysis of clinical drug trials with withdrawals. Biometrics. 1980;36(4):721–727.
24. Ritchie J, Cerqueira M, Maynard C, Davis K, Kennedy J. Ventricular function and infarct size: the Western Washington intravenous streptokinase in myocardial infarction trial. Journal of the American College of Cardiology. 1988;11(4):689. doi:10.1016/0735-1097(88)90197-0.
25. Ritchie J, Davis K, Williams D, Caldwell J, Kennedy J. Global and regional left ventricular function and tomographic radionuclide perfusion: the Western Washington Intracoronary Streptokinase in Myocardial Infarction Trial. Circulation. 1984;70(5):867. doi:10.1161/01.cir.70.5.867.
26. Senn S. Statistical Issues in Drug Development. Wiley; 1997.
27. McMahon R, Harrell F Jr. Power calculation for clinical trials when the outcome is a composite ranking of survival and a nonfatal outcome. Controlled Clinical Trials. 2000;21(4):305–312. doi:10.1016/s0197-2456(00)00052-0.
28. Shieh G, Jan S, Randles R. On power and sample size determinations for the Wilcoxon–Mann–Whitney test. Nonparametric Statistics. 2006;18(1):33–43.
29. Lehmann E, D’Abrera H. Nonparametrics: Statistical Methods Based on Ranks. Vol. 204. Holden-Day: San Francisco; 1975.
30. Noether G. Sample size determination for some common nonparametric tests. Journal of the American Statistical Association. 1987;82(398):645–647.
31. Rosner B, Glynn R. Power and sample size estimation for the Wilcoxon rank sum test with application to comparisons of C statistics from alternative prediction models. Biometrics. 2009;65(1):188–197. doi:10.1111/j.1541-0420.2008.01062.x.
32. Wang H, Chen B, Chow S. Sample size determination based on rank tests in clinical trials. Journal of Biopharmaceutical Statistics. 2003;13(4):735–751. doi:10.1081/BIP-120024206.
33. Greene T, Joffe M, Hu B, Li L, Boucher K. The balanced survivor average causal effect. The International Journal of Biostatistics. 2013;9(2):291–306. doi:10.1515/ijb-2012-0013.
34. Siegel S, Castellan N Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill; 1988.
35. Bellera C, Julien M, Hanley J. Normal approximations to the distributions of the Wilcoxon statistics: accurate to what N? Graphical insights. Journal of Statistics Education. 2010;18(2).
36. Matsouaka R, Singhal A, Betensky R. The optimal Wilcoxon–Mann–Whitney test in the presence of death-censored observations. Submitted to Statistics in Medicine. 2014. doi:10.1002/sim.6355.
37. Shih W. Problems in dealing with missing data and informative censoring in clinical trials. Current Controlled Trials in Cardiovascular Medicine. 2002;3(1):4. doi:10.1186/1468-6708-3-4.
38. Rosner B, Glynn R. Power and sample size estimation for the clustered Wilcoxon test. Biometrics. 2010. doi:10.1111/j.1541-0420.2010.01488.x.
39. Rahardja D, Zhao Y, Qu Y. Sample size determinations for the Wilcoxon–Mann–Whitney test: a comprehensive review. Statistics in Biopharmaceutical Research. 2009;1(3):317–322.
40. Zhao Y. Sample size estimation for the van Elteren test—a stratified Wilcoxon–Mann–Whitney test. Statistics in Medicine. 2006;25(15):2675–2687. doi:10.1002/sim.2441.
41. Wood A, White I, Thompson S. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clinical Trials. 2004;1(4):368. doi:10.1191/1740774504cn032oa.
42. Armijo-Olivo S, Warren S, Magee D. Intention to treat analysis, compliance, drop-outs and how to deal with missing data in clinical research: a review. Physical Therapy Reviews. 2009;14(1):36–49.
43. Freeman BD, Danner RL, Banks SM, Natanson C. Safeguarding patients in clinical trials with high mortality rates. American Journal of Respiratory and Critical Care Medicine. 2001;164(2):190–192. doi:10.1164/ajrccm.164.2.2011028.
44. Moyé L, Davis B, Hawkins C. Analysis of a clinical trial involving a combined mortality and adherence dependent interval censored endpoint. Statistics in Medicine. 1992;11(13):1705–1717. doi:10.1002/sim.4780111305.
45. Moyé LA, Lai D, Jing K, Baraniuk MS, Kwak M, Penn MS, Wu CO. Combining censored and uncensored data in a U-statistic: design and sample size implications for cell therapy research. The International Journal of Biostatistics. 2011;7(1):1–29. doi:10.2202/1557-4679.1286.
46. Finkelstein D, Schoenfeld D. Combining mortality and longitudinal measures in clinical trials. Statistics in Medicine. 1999;18(11):1341–1354. doi:10.1002/(sici)1097-0258(19990615)18:11<1341::aid-sim129>3.0.co;2-7.
