Author manuscript; available in PMC: 2022 Nov 30.
Published in final edited form as: Stat Med. 2019 Dec 19;39(5):602–616. doi: 10.1002/sim.8431

Analysis of ordered composite endpoints

Dean Follmann 1, Michael P Fay 1, Toshimitsu Hamasaki 2, Scott Evans 2
PMCID: PMC9709888  NIHMSID: NIHMS1848713  PMID: 31858640

Abstract

Composite endpoints are frequently used in clinical trials, but simple approaches, such as the time to first event, do not reflect any ordering among the endpoints. However, some endpoints, such as mortality, are worse than others. A variety of procedures have been proposed to reflect the severity of the individual endpoints such as pairwise ranking approaches, the win ratio, and the desirability of outcome ranking. When patients have different lengths of follow-up, however, ranking can be difficult and proposed methods do not naturally lead to regression approaches and require specialized software. This paper defines an ordering score O to operationalize the patient ranking implied by hierarchical endpoints. We show how differential right censoring of follow-up corresponds to multiple interval censoring of the ordering score allowing standard software for survival models to be used to calculate the nonparametric maximum likelihood estimators (NPMLEs) of different measures. Additionally, if one assumes that the ordering score is transformable to an exponential random variable, a semiparametric regression is obtained, which is equivalent to the proportional hazards model subject to multiple interval censoring. Standard software can be used for estimation. We show that the NPMLE can be poorly behaved compared to the simple estimators in staggered entry trials. We also show that the semiparametric estimator can be more efficient than simple estimators and explore how standard Cox regression maneuvers can be used to assess model fit, allow for flexible generalizations, and assess interactions of covariates with treatment. We analyze a trial of short versus long-term antiplatelet therapy using our methods.

Keywords: composite endpoints, DOOR, pairwise regression, probabilistic index model, win ratio

1 |. INTRODUCTION

Composite endpoints are frequently used in clinical trials to provide a more comprehensive characterization of clinical outcomes than can be achieved by a single endpoint. In cardiovascular disease, the composite endpoint might include death, stroke, myocardial infarction (MI), major bleed, or revascularization. In trials of infections with drug resistant pathogens, death and kidney failure might be used while in trials of multiple sclerosis, death, relapse, or “substantial progression” might be used. Often, components of the endpoint are treated equally, for example, by using the time to occurrence of the first of any constituent endpoints.

In fact, the different components of the composite can be quite different in terms of clinical importance, eg, death is much worse than revascularization. Different researchers have noted the inadequacy of treating all components the same and have proposed different remedies such as subjective rankings,1-3 combining mortality with longitudinal measures,4 pairwise regression,5 generalized pairwise comparisons,6 the win ratio (WR),7 probabilistic index models,8 and the desirability of outcome ranking (DOOR).9,10 The WR, for example, estimates the ratio of the number of "wins" or treatment subjects with better outcomes than control subjects divided by the number of treatment "losses." Under no ties, the DOOR ranking estimates the probability that a person on treatment has a better outcome than a person on control, known as the Mann-Whitney (MW) parameter. The key idea of these approaches is that the analysis should respect a hierarchical ordering and not treat all outcomes equally. If all patients are followed for the same amount of time (τ), then ranking by the desirability of the outcome can proceed easily and methods of analysis for ordinal outcomes can be readily applied. If there is censoring, however, ranking becomes more difficult. Furthermore, methods are typically restricted to the two-group setting, and incorporation of covariates can be complicated.

In this paper, we show how an ordering score, O, can be defined to operationalize the ranking rule. We show how this device can be used to nonparametrically estimate different estimands such as the MW parameter and WR in the presence of censoring, provided the censoring is independent of the outcome. To allow covariates X to impact the ordering score, we assume the ordering score is transformable to an exponential random variable with parameter exp(Xβ). This exponential transformation assumption results in a semiparametric model, which is equivalent to a proportional hazards (PH) model with the traditional time to event T replaced by O. Right censoring of follow-up corresponds to multiple interval censoring of the ordering score. We show how software for survival analysis can be repurposed to provide nonparametric estimates of the distribution function of the ordering score in each group. We also show how to estimate β and the "baseline" distribution function for the regression setting by modifying standard software for Cox regression. Use of the ordering score O coupled with the PH assumption allows semiparametric/rank regression estimation of the different metrics described above. The exponential transformation assumption was used in the work of Follmann,5 but the development was for pairs of patients similar to the work of Thas et al,8 and there was no concept of an ordering score and no repurposing of standard survival software for estimation. Our development is similar to the work of Oakes,11 who focuses on the WR. A key difference is that we restrict an individual to a single ordering score while Oakes,11 using our notation, allows multiple ordering scores. We explore the different approaches by simulation and illustrate these methods by analyzing a trial of short versus long-term dual antiplatelet therapy (DAPT) in patients with drug eluting stents using a hierarchical endpoint involving death, stroke, MI, and major bleed.

2 |. METHOD

We demonstrate the approach by using a cardiovascular clinical trial. Assume patients are randomized to treatment Z = 1 or control Z = 0. The times to death, (first) stroke or MI, and (first) major bleed, (T_D, T_S, T_B), or censoring are recorded. Let C be the censoring time, ie, when follow-up stops, and (Δ_D, Δ_S, Δ_B) the indicators for the different outcomes. All patients are followed for at most τ = 1 year. A subject who has a stroke at time 0.5 and is censored at time 1 has (T_D, T_S, T_B) = (1, 0.5, 1) and (Δ_D, Δ_S, Δ_B) = (0, 1, 0).

We order the experience of patients using the following hierarchy.

  • Death is worst with earlier deaths worse than later deaths.

  • Stroke/MI is second worst with earlier events worse than later events.

  • Major bleed is third worst with earlier events worse than later events.

  • Being event free is best.

This ordering is consistent with a consensus ordering from a group of experts in cardiovascular disease.1 If all patients are followed for the same amount of time, we can easily estimate the probability that a person on treatment has a better outcome than a person on control. This can be formalized by creating an ordering score. For us, the ordering score is just a device for ranking patients or comparing pairs of patients, so its actual value is not necessarily meaningful. Let

O = T_D·Δ_D + (1 − Δ_D)Δ_S(τ + T_S) + (1 − Δ_D)(1 − Δ_S)Δ_B(2τ + T_B) + (1 − Δ_D)(1 − Δ_S)(1 − Δ_B)3τ. (1)

Note that subjects who are event free get the best ordering score, O, which is 3τ.
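The mapping in (1) is straightforward to compute. Below is a minimal Python sketch; the function name and argument layout are our own, not from the paper:

```python
# Sketch of the ordering score in Eq. (1) for the three-level hierarchy
# (death, stroke/MI, major bleed) with maximal follow-up tau.
def ordering_score(t_d, t_s, t_b, d_d, d_s, d_b, tau=1.0):
    """Map event (or follow-up) times t_* and 0/1 event indicators d_*
    to the ordering score O of Eq. (1); larger O is better."""
    if d_d:                 # death: worst level, earlier deaths worse
        return t_d
    if d_s:                 # nonfatal stroke/MI: offset by tau
        return tau + t_s
    if d_b:                 # major bleed only: offset by 2*tau
        return 2 * tau + t_b
    return 3 * tau          # event free: best possible score
```

For example, a stroke at time 0.5 with no death or bleed and τ = 1 maps to O = 1.5, while an event-free subject maps to O = 3.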

Using the ordering score, we can easily calculate different measures of the effect of treatment. For example, the proportion of times a treatment person is known to have a better outcome than a control person for our setting with a common censoring time is

P̂_τ(O_1 > O_0 | O_1 ≠ O_0) = [Σ_{i=1}^n Σ_{j=1}^n 1(O_i > O_j)·Z_i(1 − Z_j)] ∕ [Σ_{i=1}^n Σ_{j=1}^n Z_i(1 − Z_j)·1(O_i ≠ O_j)], (2)

where 1(·) is the indicator function, O_i is the score for subject i, and, with a slight abuse of notation, O_z, z = 0, 1, is a random score from group z. We use τ in P̂_τ to emphasize the dependence on the maximal length of follow-up. This is a nonparametric estimator that makes no assumptions about the form of the distributions of O_0 and O_1 (see the work of Follmann5 for a version that allows covariates and censoring). Pairs of patients who experience identical events or are censored at τ, ie, {O_i = O_j}, are not included in the estimator.

An attractive and popular approach to ranking in clinical trials is given by the WR.7 We can describe the “unmatched” WR approach using ordering scores. Here, Oi > Oj for (Zi, Zj)=(1,0) is a win for treatment and Oi < Oj for (Zi, Zj)=(1,0) is a loss for treatment. The WR estimator is simply the number of winners on treatment divided by the number of losers on treatment.

WR̂_τ = [Σ_{i=1}^n Σ_{j=1}^n 1(O_i > O_j)·Z_i(1 − Z_j)] ∕ [Σ_{i=1}^n Σ_{j=1}^n 1(O_i < O_j)·Z_i(1 − Z_j)]. (3)

As for P̂_τ(O_1 > O_0 | O_1 ≠ O_0), tied ordering scores are not tallied, and (2) and (3) are related as WR̂_τ = P̂_τ(O_1 > O_0 | O_1 ≠ O_0) ∕ P̂_τ(O_1 < O_0 | O_1 ≠ O_0), ie, the odds of a better outcome on treatment ignoring ties. A closed form variance estimate for the WR was developed in the work of Bebu and Lachin.12

A related estimand that counts such "tied" subjects is the MW parameter given by P_τ(O_1 > O_0) + (1/2)P_τ(O_1 = O_0),13 which was mentioned in the work of Evans and Follmann10 as a natural metric for the DOOR ranking approach. A form that allows for covariates with uncensored data is known as the probabilistic index.8 The MW estimator is given by

P̂_τ(O_1 > O_0) + (1/2)P̂_τ(O_1 = O_0) = [Σ_{i=1}^n Σ_{j=1}^n 1(O_i > O_j)·Z_i(1 − Z_j)] ∕ (n_0 n_1) + (1/2)[Σ_{i=1}^n Σ_{j=1}^n 1(O_i = O_j)·Z_i(1 − Z_j)] ∕ (n_0 n_1), (4)

where nz is the number of subjects in group z.

Another approach is the proportion in favor of treatment,6 which is given by

P̂_τ(O_1 > O_0) − P̂_τ(O_1 < O_0) = [Σ_{i=1}^n Σ_{j=1}^n 1(O_i > O_j)·Z_i(1 − Z_j)] ∕ (n_0 n_1) − [Σ_{i=1}^n Σ_{j=1}^n 1(O_i < O_j)·Z_i(1 − Z_j)] ∕ (n_0 n_1). (5)

This corresponds to Gehan’s test statistic for testing equality of the distributions of O0 and O1.14

For simplicity, we will focus on the WR and MW estimators, as (2) and (5) can be derived from them. The WR and MW parameters complement each other. The WR is akin to a relative risk estimator and reflects the relative advantage of treatment over control. The MW is akin to an absolute risk estimator and draws closer to 0.5 as fewer subjects have events. The above approaches were either developed without censoring in mind or dealt with censoring by comparing pairs of subjects over their common interval of follow-up, effectively determined by the member of the pair with less follow-up. In the next section, we develop a representation of the ranking that can be used to address censoring in a different way.
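With complete follow-up, all four estimators (2)-(5) are simple functions of the pairwise win/loss/tie counts. A minimal sketch (the helper name and output keys are our own):

```python
# Pairwise estimators (2)-(5) from ordering scores, assuming complete
# follow-up so that every treatment-control pair can be compared.
def pairwise_metrics(scores, z):
    """scores: ordering scores; z: 0/1 treatment indicators (same order)."""
    wins = losses = ties = 0
    for o_i, z_i in zip(scores, z):
        for o_j, z_j in zip(scores, z):
            if z_i == 1 and z_j == 0:          # treatment vs control pairs
                if o_i > o_j:
                    wins += 1
                elif o_i < o_j:
                    losses += 1
                else:
                    ties += 1
    pairs = wins + losses + ties               # = n1 * n0
    return {
        "p_win_given_untied": wins / (wins + losses),   # Eq. (2)
        "win_ratio": wins / losses,                     # Eq. (3)
        "mann_whitney": (wins + 0.5 * ties) / pairs,    # Eq. (4)
        "net_benefit": (wins - losses) / pairs,         # Eq. (5)
    }
```

Note the complementary relations visible in the code: the net benefit (5) equals 2·MW − 1, and the WR (3) is the odds version of (2).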

2.1 |. Censoring

In practice, censoring often occurs in clinical trials. This may be because patients enroll at different calendar times or because they drop out of the study. We will assume that censoring is governed by an independent mechanism. Early censoring after randomization, ie, before τ, corresponds to multiple interval censoring of the ordering score. As an illustration, consider our example with three levels and intended follow-up on the interval (0, τ = 1) for all. Suppose the first subject, say "Clare," is lost to follow-up at time 0.5; then, her ordering score may truly be in (0.5, 1], (1.5, 2], (2.5, 3), or [3, 3]. These correspond to second-half events of death, stroke/MI, or major bleed, respectively, or to no event at all over the first year. If Clare had a stroke at time 0.4, before being lost to follow-up at time 0.5, her ordering score may truly be in (0.5, 1] (as she might have died) or [1.4, 1.4] (as she might have survived). Figure 1 gives examples of how different clinical trajectories, subject to censoring, are mapped to an ordering score.
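The mapping from a censored trajectory to candidate intervals for O can be sketched as follows. This is a hypothetical helper for the three-level example with τ = 1 by default; it covers only the two cases discussed above (no events before censoring, or an observed stroke/MI):

```python
# Map a right-censored trajectory to the set of intervals that may
# contain the ordering score O (three-level hierarchy sketch).
def score_intervals(c, tau=1.0, events=None):
    """c: censoring time (< tau); events: observed pre-censoring events,
    e.g. {"stroke": 0.4}. Returns candidate intervals for O."""
    events = events or {}
    if "stroke" in events:
        t = events["stroke"]
        # Might still die in (c, tau); otherwise O = tau + stroke time.
        return [(c, tau), (tau + t, tau + t)]
    # No events observed: death, stroke/MI, or bleed could still occur in
    # (c, tau), or the subject could remain event free (O = 3 * tau).
    return [(c, tau), (tau + c, 2 * tau), (2 * tau + c, 3 * tau),
            (3 * tau, 3 * tau)]
```

For Clare, censored event free at 0.5, this returns the four candidate sets (0.5, 1), (1.5, 2), (2.5, 3), and [3, 3] described in the text.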

FIGURE 1.

FIGURE 1

Clinical trajectories for five subjects mapped to ordering scores. The symbols denote D-death, S-Stroke, B-Bleed. Subjects 4 and 5 have complete follow-up from (0, τ), while subjects 1 and 2 are censored halfway through at time τ∕2. Subject 3 dies at time 0.7 while subjects 1 and 2 are multiply interval censored. Clare’s experience, discussed in the text, corresponds to Trajectories 1 and 2. The data are also presented in Table A1. MI, myocardial infarction

Right censoring of the clinical experience trajectories thus corresponds to multiple interval censoring of the ordering scores, which can be exploited as suggested by Oakes.11 The estimation problem for the different preference metrics reduces to estimation of the distribution function of O when our data may be multiply interval censored. Turnbull derived the nonparametric maximum likelihood estimator (NPMLE) for this setting.15

In general, we define τ as the maximal censoring time and let L be the number of levels in the hierarchy. For our example, τ = 1 and L = 3. With estimates of F_z(·), the cumulative distribution function (cdf) for group z, we can estimate the WR as

P̂_τ(O_1 > O_0 | O_1 ≠ O_0) ∕ P̂_τ(O_0 > O_1 | O_1 ≠ O_0) = [∫_0^{Lτ} Ŝ_1(s)dF̂_0(s)] ∕ [∫_0^{Lτ} Ŝ_0(s)dF̂_1(s)], (6)

where O_z is a randomly drawn ordering score from group Z = z and Ŝ(s) = 1 − F̂(s). The nonparametric MW estimator is

P̂_τ(O_1 > O_0) + (1/2)P̂(O_1 = O_0) = ∫_0^{Lτ} Ŝ_1(s)dF̂_0(s) + (1/2)[1 − ∫_0^{Lτ} Ŝ_1(s)dF̂_0(s) − ∫_0^{Lτ} Ŝ_0(s)dF̂_1(s)], (7)

where we estimate P(O1 = O0) as 1 minus the probability of a win minus the probability of a loss.
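When the fitted distributions are discrete, as with Turnbull's NPMLE, the integrals in (6) and (7) are finite sums. A sketch, assuming each arm's fitted distribution of O is supplied as a list of point masses (helper names are our own):

```python
# Plug-in win/loss/tie probabilities behind Eqs. (6) and (7), assuming
# discrete fitted distributions for O in each arm.
def win_loss_tie(mass1, mass0):
    """mass_z: list of (support point, probability) for arm z's fitted O."""
    win = sum(p0 * sum(p1 for s1, p1 in mass1 if s1 > s0)
              for s0, p0 in mass0)            # integral of S1 dF0
    loss = sum(p1 * sum(p0 for s0, p0 in mass0 if s0 > s1)
               for s1, p1 in mass1)           # integral of S0 dF1
    return win, loss, 1.0 - win - loss        # remainder is P(tie)

def wr_mw(mass1, mass0):
    win, loss, tie = win_loss_tie(mass1, mass0)
    return win / loss, win + 0.5 * tie        # WR as in (6), MW as in (7)
```

With identical two-point distributions in both arms, for instance, the sketch returns WR = 1 and MW = 0.5, as it should under no treatment effect.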

While the ordering scores are conceived purely as a device for ranking and are not easily interpretable, values of O that are multiples of τ have special meaning. In our running example, F(1) is the proportion who die and F(2) − F(1) is the proportion who survive but have a nonfatal stroke/MI. Note that, with this construction, O can be either discrete or continuous on the interval [0, Lτ].

The standard approach to censoring in the hierarchical ranking literature is to restrict comparison to pairs of patients over the interval of common follow-up, see the work of Follmann5 or the work of Pocock et al.7 We will call this the simple approach. The simple WR and MW estimators are defined below for our running example. We first define wins and losses for death

W_Dij = I(T_Di > T_Dj)·Δ_Dj·Z_i(1 − Z_j)
L_Dij = I(T_Di < T_Dj)·Δ_Di·Z_i(1 − Z_j).

We next define for stroke/MI

W_Sij = I(T_Si > T_Sj)·Δ_Sj·Z_i(1 − Z_j)(1 − W_Dij)(1 − L_Dij)
L_Sij = I(T_Si < T_Sj)·Δ_Si·Z_i(1 − Z_j)(1 − W_Dij)(1 − L_Dij).

Finally, for bleeds,

W_Bij = I(T_Bi > T_Bj)·Δ_Bj·Z_i(1 − Z_j)(1 − W_Dij)(1 − L_Dij)(1 − W_Sij)(1 − L_Sij)
L_Bij = I(T_Bi < T_Bj)·Δ_Bi·Z_i(1 − Z_j)(1 − W_Dij)(1 − L_Dij)(1 − W_Sij)(1 − L_Sij).

We next define

W_ij = W_Dij + W_Sij + W_Bij
L_ij = L_Dij + L_Sij + L_Bij.

The simple estimate of the WR is

Σ_{ij} W_ij ∕ Σ_{ij} L_ij. (8)

The “simple” estimate of the MW probability is not all that simple. The MW parameter is

P_τ(O_1 > O_0) + (1/2)P_τ(O_1 = O_0).

We use the fact that

P_τ(O_1 > O_0) = P_τ(O_1 > O_0 | O_1 ≠ O_0)·P_τ(O_1 ≠ O_0),

and estimate

P̂_τ(O_1 > O_0 | O_1 ≠ O_0) = Σ_{ij} W_ij ∕ (Σ_{ij} W_ij + Σ_{ij} L_ij).

We thus obtain our “simple” MW probability estimate as

[Σ_{ij} W_ij ∕ (Σ_{ij} W_ij + Σ_{ij} L_ij)] × {∫_0^{Lτ} Ŝ_1(s)dF̂_0(s) + ∫_0^{Lτ} Ŝ_0(s)dF̂_1(s)} + (1/2){1 − ∫_0^{Lτ} Ŝ_1(s)dF̂_0(s) − ∫_0^{Lτ} Ŝ_0(s)dF̂_1(s)}. (9)

Note that applying (4) directly results in an estimator whose expectation depends on the censoring distribution before τ: more early censoring gives more ties, pushing the MW estimator (4) closer to 0.5. In the Appendix, we discuss a "PH" condition. One can show that, given this PH condition, the simple WR estimator (8) converges to P(O_1 > O_0) ∕ P(O_1 < O_0) under independent censoring of O_0 and O_1 for any τ. Relatedly, the simple MW estimator (9) converges to P_τ(O_1 > O_0) + (1/2)P_τ(O_1 = O_0) under these same conditions. Note that the MW parameter does depend on τ, as subjects who have no events over [0, τ) are counted as ties and more ties occur with smaller τ.
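The win/loss tallies above amount to a double loop over treatment-control pairs, moving down the hierarchy until one member's event is observed first; undecided pairs are dropped. A sketch of the tallies behind (8), with a data layout of our own choosing:

```python
# "Simple" pairwise win/loss tallies: each treatment-control pair is
# compared over the hierarchy death > stroke/MI > bleed, using only
# events observed within the pair's common follow-up.
def simple_win_loss(subjects):
    """subjects: dicts with t_d, t_s, t_b (event-or-censoring times),
    d_d, d_s, d_b (0/1 event indicators), and z (treatment arm)."""
    wins = losses = 0
    trt = [s for s in subjects if s["z"] == 1]
    ctl = [s for s in subjects if s["z"] == 0]
    for i in trt:
        for j in ctl:
            for t, d in (("t_d", "d_d"), ("t_s", "d_s"), ("t_b", "d_b")):
                if i[t] > j[t] and j[d]:   # j's event observed first: win
                    wins += 1
                    break
                if i[t] < j[t] and i[d]:   # i's event observed first: loss
                    losses += 1
                    break
            # pairs undecided at all three levels are dropped (ties)
    return wins, losses
```

The break after a decision at a level reproduces the (1 − W)(1 − L) factors in the definitions above: lower levels are consulted only when higher levels are inconclusive.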

2.2 |. Regression

While censoring can be dealt with directly as described above to obtain nonparametric estimates of different preference metrics such as WR and MW, to incorporate covariates, we need structure. Let X be a vector of covariates, which implicitly includes the treatment indicator Z; we write X(Z) for emphasis. We will assume that the ordering score O over the interval [0, Lτ) is drawn from some distribution F_X(·) with support on [0, ∞) such that, for some unknown transformation Λ_0(·), Λ_0(O) follows an exponential distribution with parameter exp(Xβ) over the interval [0, Λ_0(Lτ)). We restrict to this interval because the behavior of O beyond Lτ is unknowable and unnecessary. The ordering score for a subject with no events equals Lτ, which occurs with probability 1 − F_X(Lτ). It follows that the conditional distribution of Λ_0(O), given Λ_0(O) < Λ_0(Lτ), is the same as that of a conditional exponential. Note that the exponential transformation assumption implies

P_τ(O ≤ o) = P{Λ_0(O) ≤ Λ_0(o)} = 1 − exp{−Λ_0(o)·exp(Xβ)}. (10)

However, the RHS is just the cdf for a random variable that follows the PH model, where Λ_0(t) = ∫_0^t λ_0(s)ds is called the integrated baseline hazard. A key feature here, uncommon in typical right-censored survival analysis, is that O is subject to multiple interval censoring. Nonetheless, we can estimate both Λ_0(·) and β using PH regression with discontinuous risk intervals (ie, multiple interval censoring). Estimation using readily available software can be accomplished using data that have been segregated by events and stratified. One can also use the coxph package in R with multiple discontinuous risk intervals (see the Appendix for details of both approaches).

Using (10), we can estimate the advantage of treatment over control for a subject with covariate vector X using the different metrics such as MW or WR. We simply estimate F_z(u|X) from (10) by replacing Λ_0(·) and β with their estimates. We then replace F̂_z(·) in (6) or (7) with the estimated F̂_z(u|X) to obtain semiparametric regression estimates of MW and WR.

With a regression formulation, interactions with treatment can be specified. Thus, one can see if the MW or WR is roughly constant over X or varies. One can look to identify a “qualitative interaction” where for some Xs treatment is preferred to control and where for other Xs control is preferred to treatment.

While the usual interpretation of exp{(X_i − X_j)β} as the hazard ratio for subject i compared to subject j is awkward for a nonsurvival endpoint such as O, we can directly interpret β. As shown in the work of Follmann5 and the Appendix,

exp{(X_j − X_i)β} (11)

is the odds that person i has a better outcome than person j over the calendar time follow-up period [0, τ). Equivalently, (11) is the WR for people with X_i compared to people with X_j.

Note that if O is simply the time to death, exp(−β) is the odds that a person on treatment has a longer life than a person on control. This complements the usual interpretation of exp(β) as the treatment over control hazard ratio. The change in sign of β for the two interpretations reflects that a beneficial treatment that increases the odds of a longer life will reduce the instantaneous risk of death.
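This interpretation is easy to check numerically. A Monte Carlo sketch of our own (not from the paper), with exponential survival in both arms and hazard ratio exp(β) = 1/2, so that the odds of a treated person living longer should be near exp(−β) = 2:

```python
# Monte Carlo check: for O = time to death, exponential in both arms with
# treatment/control hazard ratio exp(beta), exp(-beta) is the odds that a
# treated person outlives a control person.
import math
import random

random.seed(12345)
beta = math.log(0.5)                               # hazard ratio 1/2
n = 100_000
wins = sum(random.expovariate(math.exp(beta)) > random.expovariate(1.0)
           for _ in range(n))
p_longer = wins / n                                # P(O1 > O0) = 2/3 here
odds_longer = p_longer / (1 - p_longer)            # near exp(-beta) = 2
```

Here P(O_1 > O_0) = 1 ∕ (1 + exp(β)) = 2/3 analytically, so the simulated odds should settle near 2.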

The PH assumption is chosen for convenience and interpretation. It is convenient because standard Cox software can be used to estimate β. It is interpretable because we can readily calculate common metrics to evaluate treatment and interpret β in a meaningful way. As always, one can question whether (10) fits the data and how the assumptions might be relaxed. Using Cox regression methods allows one to leverage a wide range of methods to examine and weaken this assumption. For example, one might wonder if (10) applies to the constituents of the composite. Referring to our example, are the odds of death in all patients, the odds of stroke/MI in survivors, and the odds of major bleed in stroke/MI-free survivors all equal to exp(β)? By the mapping to ordering scores, we can recast this as asking whether β is the same for O in [0, 1) as it is in [1, 2) and in [2, 3). This is a familiar question in traditional survival analysis, akin to whether the effect β varies with time. We can formally test this assumption by allowing different βs in the different ordering score intervals. Or we could just use the different βs for the different intervals and thus estimate the different metrics such as the WR and MW parameters based on (10).

In our development, we assumed that within each level, longer times were associated with better outcomes. What if our ordering score has two levels such as time to death (earlier bad) and time to cure (earlier good)? For example, a trial to compare antibiotics for tuberculosis might focus on shorter time to cures, but require a principled way to deal with deaths.16 Assuming a maximum follow-up of τ, we can define an ordering score for this setting as

O = T_D·Δ_D + (1 − Δ_D)Δ_C(2τ − T_C) + (1 − Δ_D)(1 − Δ_C)τ.

Thus, for survivors, a time to cure of τ = 1 is ranked worse than a time to cure of 0. Subjects who are censored at time t before dying or being cured are censored in the intervals (t, 1) and (1, 2 − t) and at the point [1, 1]. Subjects who are censored at time t but are cured at time tc < t are censored in the interval (t, 1) and at the point [2 − tc, 2 − tc]. A person who is censored at time 0.7 without death or cure would be censored at (0.7, 1) (as they might die), (1, 1.3) (as they might survive and be cured), and [1, 1] (as they might survive without cure), or more simply over the interval (0.7, 1.3). This type of setup was discussed in the work of Shaw and Fay,17 which dealt with censoring but only considered the two-group setting with L = 2, did not allow for covariates, and used different assumptions.
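The interval construction for this two-level score can be sketched as below (a hypothetical helper with τ = 1, as in the text; it collapses the no-event case to the single merged interval noted above):

```python
# Censoring intervals for the two-level death/cure ordering score,
# with tau = 1 assumed as in the text's example.
def death_cure_intervals(c, t_cure=None):
    """c: censoring time; t_cure: cure time observed before c, if any."""
    if t_cure is not None:
        # Might still die in (c, 1); otherwise O = 2 - t_cure is known.
        return [(c, 1.0), (2.0 - t_cure, 2.0 - t_cure)]
    # Might die in (c, 1), be cured later (O in (1, 2 - c)), or survive
    # uncured (O = 1); together this is the single interval (c, 2 - c).
    return [(c, 2.0 - c)]
```

For a subject censored at 0.7 with no events, this reproduces the interval (0.7, 1.3) from the text.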

3 |. SIMULATIONS

We performed a set of simulations to assess the different methods of estimation under different scenarios. As suggested by Oakes,11 we generate (T1, T2, T3) that follow the PH model using the approach of Hougaard.18 This approach allows for correlation among the failure times and different risks for different events. For the control group, we generate a frailty Z, which follows a positive stable distribution with parameter α, given by the Laplace transform E[exp(−sZ)] = exp(−s^α). Given Z, we generate T1, T2, T3 as exponentials with parameters Zλ_k for k = 1, 2, 3. Thus, the T_k are correlated. Nicely, the marginal distributions integrating over Z are Weibull with hazard λ_k^α·α·t^(α−1) for k = 1, 2, 3. Failure times for the treatment group are analogous, but we specify a constant θ so that, conditional on Z, T1, T2, T3 are exponentials with parameters Zλ_k∕θ for k = 1, 2, 3. It thus follows that the hazard ratio for treatment compared to control is θ^(−α) for each of the failure times, as is the hazard ratio for the ordering score on treatment compared to control. Informally, over the interval [0, τ), successive deaths remove the frailer individuals from control:treatment in a 1:θ^(−α) ratio. For ordering scores in the interval [τ, 2τ), successive strokes remove the relatively frailer individuals who survived past τ, again in the 1:θ^(−α) ratio. An analogous culling occurs for ordering scores in the interval [2τ, 3τ). We set α = 0.75 and θ = exp(log(2)∕0.75); thus, the marginal hazard ratio is 1/2 and the WR is 2. We set (λ_D, λ_S, λ_B) = (1∕3, 1∕2, 3∕4). The (simulated) correlation between any pair (T1, T2 or T2, T3 or T1, T3) of (uncensored) Ts is about 0.23. We also set τ = 0.5 and m = 200 subjects per group. For concreteness, one can think of T1, T2, T3 as the times to death, stroke/MI, and major bleed, respectively.
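One way to generate such frailties is Kanter's representation of the positive stable law with Laplace transform E[exp(−sZ)] = exp(−s^α). A sketch of the data-generating step (function names and layout are our own; the treatment rates are scaled so that the marginal treatment:control hazard ratio is θ^(−α) = 1/2, consistent with the stated WR of 2):

```python
# Hougaard-style correlated event times via a shared positive stable frailty.
import math
import random

def positive_stable(alpha, rng):
    # Kanter's sampler: E[exp(-s*Z)] = exp(-s**alpha) for 0 < alpha < 1.
    u = rng.uniform(0.0, math.pi)
    w = rng.expovariate(1.0)
    return (math.sin(alpha * u) / math.sin(u) ** (1.0 / alpha)
            * (math.sin((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def simulate_subject(alpha, lams, theta, treated, rng):
    """Conditional on the shared frailty Z, the three event times are
    exponential with rates Z*lam_k (control) or Z*lam_k/theta (treatment),
    inducing correlation across the event types."""
    z = positive_stable(alpha, rng)
    scale = 1.0 / theta if treated else 1.0
    return [rng.expovariate(z * lam * scale) for lam in lams]

rng = random.Random(0)
alpha, theta = 0.75, math.exp(math.log(2.0) / 0.75)
control = [simulate_subject(alpha, [1 / 3, 1 / 2, 3 / 4], theta, False, rng)
           for _ in range(5)]
```

Averaging exp(−Z) over many draws recovers the Laplace transform value exp(−1), which is a convenient check on the sampler.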

We considered five clinical trial scenarios:

  1. Simultaneous entry with death alone. All patients are followed over the interval [0, τ). Only death is used to order outcomes.

  2. Simultaneous entry with hierarchical ranking. All patients are followed up over the interval [0, τ). Death is the worst outcome, followed by stroke/MI and then by major bleed.

  3. Staggered entry with death alone. Enrollment is uniform over the period [0, τ). Only death is used to order outcomes.

  4. Staggered entry with hierarchical ordering I. Enrollment is uniform over the period [0, τ). Ordering is based on the above hierarchy.

  5. Staggered entry with hierarchical ordering II. Enrollment is uniform over the period [0, τ∕2). Ordering is based on the above hierarchy.

We ran 1000 simulated clinical trials for each scenario. We estimated the treatment effect using both the WR and MW metrics. We evaluated semiparametric estimators based on (10), NPMLE estimators based on (6) and (7), and simple estimators (8) and (9). Table 1 reports the results.

TABLE 1.

Simulated clinical trials to assess the performance of different estimates of the win ratio (WR) and the Mann-Whitney (MW) probability. Times to the three types of ordered events are, conditional on a positive stable frailty with parameter α, exponential with rates that depend on the event type k = 1, 2, 3 and on treatment. Each trial has 200 subjects per group with maximum follow-up τ = 0.5. PH is based on the exponential transformation/PH assumption, NPMLE is the nonparametric maximum likelihood estimator, and simple compares pairs of patients over their common follow-up time. Each cell provides the average and (sample variance) over 1000 simulated trials. Variances for the MW estimators have been multiplied by 100. The true WR is 2.0; the true MW parameter is 0.55 for scenarios 1 and 3 and 0.62 for scenarios 2, 4, and 5

WR estimators MW estimators
Scenario Enrollment Ranking PH NPMLE Simple PH NPMLE Simple
1 All at time 0 T 1 2.085 (0.297) 2.085 (0.299) 2.085 (0.299) 0.554 (0.036) 0.554 (0.036) 0.554 (0.036)
2 All at time 0 T1, T2, T3 2.017 (0.089) 2.020 (0.092) 2.020 (0.092) 0.618 (0.058) 0.619 (0.059) 0.618 (0.059)
3 Uniform (0, τ) T 1 2.198 (0.616) 2.264 (1.030) 2.230 (0.803) 0.554 (0.061) 0.554 (0.125) 0.554 (0.066)
4 Uniform (0, τ) T1, T2, T3 2.062 (0.181) 2.122 (0.386) 2.048 (0.196) 0.621 (0.113) 0.623 (0.266) 0.620 (0.123)
5 Uniform (0, τ/2) T1, T2, T3 2.027 (0.124) 2.059 (0.215) 2.024 (0.130) 0.620 (0.079) 0.621 (0.141) 0.619 (0.081)

We first discuss the WR estimators. On average, the WR estimators have similar means but are slightly larger than the true value of 2.00, with the largest (smallest) mean values in the scenarios with the largest (smallest) sample variances. The NPMLE and simple estimators are identical for simultaneous entry trials (scenarios 1 and 2), with variances similar to the variance of the semiparametric estimator. For staggered entry trials (scenarios 3-5), the semiparametric estimator is more efficient; for scenario 5, the variance ratios of semiparametric/comparator are 0.58 for the NPMLE and 0.95 for the simple estimator. For staggered entry studies, we see that the NPMLE has an appreciably larger variance than the simple estimator, with ratios of 0.386/0.196 = 1.97 and 0.215/0.130 = 1.65 for scenarios 4 and 5, respectively. This behavior of the NPMLE is well known in the standard survival setting, where the NPMLE corresponds to the odds from Efron's estimator of (in our notation) P(O_1 > O_0). Efron's estimator is known to be unstable in the tails.19 Indeed, the tail problem may be worse with hierarchical ranking, as the variance ratio NPMLE/simple is 1.030/0.803 = 1.28 using death alone (scenario 3) but 0.386/0.196 = 1.97 for the analogous hierarchical ordering (scenario 4).

We next consider the MW estimators; note that the MW parameter depends on the number of levels of the hierarchy and equals 0.55 for scenarios 1 and 3 and 0.62 for scenarios 2, 4, and 5. The relative behavior of the three MW estimators is similar to the relative behavior of the three WR estimators for all scenarios.

4 |. EXAMPLE

To illustrate our methods, we perform an analysis of long-term follow-up of the NIPPON trial.20 The study was designed to evaluate the noninferiority of short-term DAPT (6 months) compared to long-term DAPT (18 months) in patients with drug eluting stents. The primary endpoint was time to the first occurrence of all-cause mortality, MI, stroke, or major bleed by 18 months. The study was terminated early for slow recruitment and low event rates and was analyzed when the follow-up of 3773 patients for 18 months was complete. We analyze a dataset with extended follow-up here. Since the two arms are identical for the first 6 months, we discard patients with fewer than 6 months of follow-up, and time 0 is when DAPT is stopped in the short-term arm but continues in the long-term arm.

Due to the different clinical severity of the different component endpoints, we evaluate a hierarchical rating with the ordering death worse than stroke or MI worse than major bleed. Within each category, earlier events are worse than later events. The resultant ordering score is given by (1) with τ = 1251, which is the maximum follow-up in days.

Figure 2 provides the NPMLEs of the complement of the cdf of the ordering scores for the two groups. Long-term DAPT is always better than short-term DAPT, and we visually see a large impact of a few deaths near the end of follow-up. As we will see, these events have a relatively large impact on the NPMLE approach, as foreshadowed by the simulations of Table 1.

FIGURE 2.

FIGURE 2

Nonparametric maximum likelihood estimators (NPMLEs) of one minus the cumulative distribution function of the ordering score for the two arms of the NIPPON trial. Solid lines denote long-term DAPT, dashed lines denote short-term DAPT. Maximum follow-up is 1251 days. DAPT, dual antiplatelet therapy; MI, myocardial infarction

Table 2 provides a tally of endpoints for the two groups along with the log of the WR based on the PH model for each of the three types of endpoints as well as overall. We see that the largest benefit of long-term dual therapy is on death, with smaller advantages for stroke/MI and major bleed. The estimates are numerically different, with 0.405 overall and 0.698, 0.133, and 0.185 for death, stroke/MI, and major bleed, respectively. To test the PH assumption across all three levels of the hierarchy, we compared the likelihoods for the hierarchical total model and the model that allows a different β for each category. The likelihood ratio test for different ratios for different levels had a p-value of 0.22, showing no strong evidence against the use of a single parameter.

TABLE 2.

Events by arm and type along with category-specific estimates of the log win ratio, −β̂, obtained by fitting a Cox regression model to O with time (category) varying covariates

Treatment duration
Event Short Long −β̂ (se)
Death 27 14 0.698 (0.329)
Stroke or myocardial infarction (MI) (without death) 13 10
11 10 0.133 (0.437)
Major bleed (without death/stroke/MI) 17 14
14 12 0.185 (0.393)
Hierarchical total 52 36 0.405 (0.217)

Table 3 provides different estimates of the treatment effects along with bootstrap standard errors, Wald statistics, and permutation p-values. We see that the NPMLE is significant at the 0.05 level while the others are not quite significant. To explore the effect of influential points, we created 88 datasets that, in turn, deleted each of the subjects with an event. The semiparametric and simple leave-one-out estimates of the WR differed by at most 4% from the estimator based on all the data. The NPMLE WR estimates were much more sensitive, with the three largest having changes of 18%, 10%, and 8%, respectively. These three correspond to the last three deaths, which all occur in the short-term group. Their inclusion appreciably increases the estimated advantage of long-term DAPT. We note that the PH model was fit using coxph software as described in the Appendix. The standard error estimate of log(WR), ie, −β from the coxph software, was 0.217, quite close to the bootstrap value of 0.215. For comparison with the Wald p-value, we also calculated a permutation p-value. For this, we estimated the parameter of interest, say ω̂, and then repeatedly permuted the treatment labels and re-estimated ω, leading to 1000 permuted estimates ω̂_1^(p), …, ω̂_1000^(p). The permuted estimates provide a null reference distribution for ω̂.

TABLE 3.

Different estimates of hierarchical-based treatment effects of long-term DAPT in the NIPPON trial. Log and logit transformations are used for the WR and MW estimates, respectively, to improve behavior. The bootstrap approach used 100 bootstrap samples, while the permutation p-value is based on 1000 resamples

                               WR                        MW
Feature              PH      NPMLE   Simple    PH      NPMLE   Simple
Estimate             1.499   2.127   1.446     0.510   0.519   0.509
log(WR), logit(MW)   0.405   0.755   0.369     0.041   0.075   0.037
Bootstrap se         0.215   0.324   0.225     0.024   0.035   0.024
Wald p-value         0.060   0.020   0.100     0.088   0.030   0.116
Permutation p-value  0.072   0.012   0.112     0.065   0.011   0.109

Abbreviations: DAPT, dual antiplatelet therapy; MW, Mann-Whitney; NPMLE, nonparametric maximum likelihood estimator; PH, proportional hazards; WR, win ratio.

To illustrate the regression approach, we specified a semiparametric regression model using age, treatment, and their interaction. The interaction is nearly significant (p = 0.058) and qualitative in direction. Figure 3 graphs the WR as a function of age along with pointwise 95% confidence intervals based on the delta method. For 30-year-olds, the estimated WR is nearly 5, while for 90-year-olds, the WR is approximately 1/4, with a confidence interval that excludes unity.
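The delta-method interval here follows from var(β̂_Z + age·β̂_ZW) = v11 + age²·v22 + 2·age·v12, where v11, v22, v12 are entries of the estimated covariance matrix, and (following the convention that the log WR is minus the fitted coefficient) WR(age) = exp{−(β_Z + β_ZW·age)}. A Python sketch; the coefficient and covariance values used below are hypothetical placeholders, not the fitted NIPPON values.

```python
import math

def wr_with_ci(age, beta_z, beta_zw, v11, v22, v12, zcrit=1.96):
    """Point estimate and pointwise CI for WR(age) = exp(-(beta_z + beta_zw*age)).

    beta_z, beta_zw : treatment and treatment-by-age coefficients (hypothetical)
    v11, v22, v12   : their variances and covariance
    Returns (wr, lower, upper).
    """
    eta = beta_z + beta_zw * age                          # log "hazard" ratio at this age
    se = math.sqrt(v11 + age**2 * v22 + 2 * age * v12)    # delta method on eta
    wr = math.exp(-eta)
    return wr, math.exp(-(eta + zcrit * se)), math.exp(-(eta - zcrit * se))
```

Evaluating this on a grid of ages gives the curve and pointwise bands of Figure 3.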

FIGURE 3.

The win ratio for long-term dual antiplatelet therapy (DAPT) as a function of age under the proportional hazards (PH) model along with pointwise 95% confidence intervals

5 |. DISCUSSION

This paper demonstrates how to analyze hierarchical endpoints in clinical trials with censoring. We defined an ordering score to map a hierarchical ordering into a single number, subject to multiple interval censoring. This connection allowed standard software for survival analysis to be used to estimate nonparametric and semiparametric regression models for the effect of covariates on a hierarchical endpoint. We applied this formulation to the WR and MW, two commonly used metrics to measure the effect of treatment on ranked endpoints. Our work allows covariates other than a treatment indicator to be assessed.

This work is related to the works of Follmann5 and Oakes.11 Under the PH assumption, Follmann5 developed a pairwise regression approach to estimation of β; the current work uses a partial likelihood that is much less computationally demanding and permits use of standard software for estimation. Oakes11 developed nonparametric and PH-based approaches to estimate the WR, and our approach is very similar to his formulation. Oakes allows for multiple ordering scores per person, while our approach uses only the worst ordering score, as is done for most other methods. Our restriction to the worst score allows the use of standard software for estimation of parameters and standard errors, and may dampen the impact of influential subjects with multiple events. On the other hand, we note that Oakes identifies a class of models that result in the same β or log "hazard" ratio for multiple ordering scores per person and regardless of the duration of follow-up (in general, the value of β will vary with the duration of follow-up). If the additional ordering scores from each person follow this model, then including them can lead to greater efficiency.

The simulations and data analysis indicate that the NPMLE can be strongly influenced by staggered entry, ie, by a nonpoint mass censoring distribution. This result is similar to the known behavior of Efron's estimator of P(X > Y) for survival times X, Y in different groups:21 Efron's two-sample test for right censored data based on the NPMLE is quite variable. Future work might explore how to improve the behavior of the NPMLE, as was done with the Peto-Prentice approach to testing, which uses the combined NPMLE (ie, the NPMLE of both groups combined) and provides a more robust estimate of P(X > Y).22,23 In practice, the simple estimator that compares pairs of patients over periods of common follow-up has fewer problems with these issues. As might be expected, if the PH assumption holds, the semiparametric approach is more efficient than nonparametric methods. This assumption can be checked using standard methods. An appealing feature of the semiparametric approach is that covariates beyond the treatment indicator can be used; this allows for seamless evaluation of covariate effects or interactions of covariates with treatment.

In this paper, we have focused on the setting where all components of the composite endpoint are times to various ordered events. However, the method is much more general. Some examples follow. Suppose that a randomized trial focuses on 28-day mortality for an acute illness such as stroke or Ebola virus disease, but among survivors, there are gradations of disability based on a continuous outcome, say X. If survival is measured in all patients, and X is measured in all survivors, then our methods trivially apply as there is no "censoring." If some patients are lost to follow-up before day τ = 28, then there is multiple interval censoring. For example, if a patient is lost on day 14, then they are censored over the interval [14, 28] ∪ [29, 56], or simply [14, 56]. A more general version of this would be where X is measured serially during follow-up. We illustrate this idea for the simple case where a patient is censored at the end of day 14, after x14 is measured. First, one might simply do a "last observation carried forward" imputation so that, effectively, the subject is censored onto the set [14, 28] ∪ {28 + x14}. Second, one might fit a model to the serial Xs and then singly or multiply impute X28 based on the observed x14, say X̂28(x14), obtaining [14, 28] ∪ {28 + X̂28(x14)}.
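For the 28-day example just described, the mapping from an observed outcome to an exact ordering score or a censoring set can be sketched as follows. This is an illustrative Python sketch: the helper function and its scaling of X to the day scale [0, τ] are hypothetical conventions, not the paper's code.

```python
TAU = 28  # end of follow-up (days); X assumed scaled to [0, TAU]

def ordering_score_interval(death_day=None, x=None, lost_day=None):
    """Map one subject's outcome to an exact ordering score or a censoring interval.

    death_day : day of death, if the subject died before day TAU
    x         : continuous disability outcome in [0, TAU], measured in 28-day survivors
    lost_day  : day lost to follow-up, if censored early
    Returns ("exact", value) or ("interval", lo, hi).
    """
    if death_day is not None:
        return ("exact", death_day)              # deaths ranked by time of death
    if lost_day is not None:
        return ("interval", lost_day, 2 * TAU)   # eg, lost at day 14 -> [14, 56]
    return ("exact", TAU + x)                    # survivors ranked above all deaths
```

The "interval" branch corresponds to the multiple interval censoring [14, 28] ∪ [29, 56], written here as the single interval [14, 56] as in the text.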

Supplementary Material

R code (text file)
CSV File, simulated data

ACKNOWLEDGEMENT

This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).

APPENDIX

SOME TECHNICAL RESULTS

In general, β from the regression model depends on the length of follow-up τ. For the two sample setting, Oakes11 gave a "PH" condition under which the same WR is obtained for any length of follow-up. For our setting, this condition can be written as

\[
P_X(T_1 > t_1, \ldots, T_p > t_p) = P_0(T_1 > t_1, \ldots, T_p > t_p)^{\exp(X\beta)},
\]

where P_X(·) is the joint survivor function of the random variables T_1, …, T_p given X. Under this condition, the same β governs the distribution of O for all τ.

Additionally, under this assumption, we can show consistency of the simple estimators. Let O0 and O1 be the ordering scores for control and treatment, respectively, that follow the PH model. Thus, without loss of generality, take O0 and O1 to be exponential with rates λ0 and λ1, respectively. Suppose that C0 and C1 are independent censoring times with P(C_z > s) = G̅_z(s).

We only tally a win for treatment if we know O1 > O0, which happens with probability

\[
P(\mathrm{Win}) = P(O_0 < O_1,\ C_1 > O_0,\ C_0 > O_0)
= \int_0^\infty e^{-\lambda_1 s}\, \lambda_0 e^{-\lambda_0 s}\, \bar{G}_0(s)\, \bar{G}_1(s)\, ds
= \int_0^\infty \lambda_0 e^{-(\lambda_0 + \lambda_1) s}\, \bar{G}_0(s)\, \bar{G}_1(s)\, ds.
\]

Similarly,

\[
P(\mathrm{Loss}) = \int_0^\infty \lambda_1 e^{-(\lambda_0 + \lambda_1) s}\, \bar{G}_0(s)\, \bar{G}_1(s)\, ds.
\]

TABLE A1.

Data from Figure 1. ΔD, ΔS, ΔB are the indicators of death, stroke/myocardial infarction (MI), or bleed events. TD, TS, TB are the times to death, stroke or bleed, or censoring, whichever comes first. Z is the treatment indicator and W is age

Person ΔD ΔS ΔB TD TS TB Z W
1 0 0 0 0.5 0.5 0.5 1 61
2 0 1 0 0.5 0.4 0.5 0 46
3 1 0 0 0.7 0.7 0.7 1 73
4 0 0 1 1.0 1.0 0.3 0 29
5 0 1 1 1.0 0.8 0.6 1 63

It follows that the simple WR estimator converges to λ0/λ1. Similarly, it follows that

\[
\frac{W_{ij}}{W_{ij} + L_{ij}} \rightarrow \frac{\lambda_0}{\lambda_1 + \lambda_0} = P(O_1 > O_0),
\]

which is free of τ.
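Because the two displayed probabilities share the common factor involving G̅0 and G̅1, their ratio is λ0/λ1 whatever the censoring distributions. This can be checked with a small Monte Carlo sketch (illustrative Python, not the paper's supplementary R code), here using exponential censoring in both arms.

```python
import random

def simple_wr_mc(lam0, lam1, cens_rate, n=100_000, seed=7):
    """Monte Carlo check that the simple win/loss tally converges to lam0/lam1.

    O0 ~ Exp(lam0) (control), O1 ~ Exp(lam1) (treatment), with independent
    exponential censoring C0, C1 at rate cens_rate in each arm. A win (loss)
    for treatment is tallied only when the smaller score is uncensored.
    """
    rng = random.Random(seed)
    wins = losses = 0
    for _ in range(n):
        o0 = rng.expovariate(lam0); o1 = rng.expovariate(lam1)
        c0 = rng.expovariate(cens_rate); c1 = rng.expovariate(cens_rate)
        m = min(o0, o1)
        if m < min(c0, c1):              # comparison is decided before censoring
            if o0 < o1: wins += 1        # treatment wins: control's score is worse
            else:       losses += 1
    return wins / losses
```

With lam0 = 2 and lam1 = 1, the tally ratio should settle near 2 for any censoring rate.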

We next provide details on the interpretation of exp{(X_j − X_i)β} as the odds that person i has a better outcome than person j. Recall that if O_i and O_j are independent exponentials with rates λ_i and λ_j, then P(O_i > O_j) = λ_j/(λ_i + λ_j). As shown in the work of Follmann,5 one has the similar result conditionally,

\[
P_\tau\big[\, O_i > O_j \mid \min\{\Lambda_0(O_i), \Lambda_0(O_j)\} < \Lambda_0(\tau) \,\big]
= \frac{\exp(X_j \beta)}{\exp(X_j \beta) + \exp(X_i \beta)}. \tag{A1}
\]

It follows immediately that the odds of the event [O_i > O_j | min{Λ0(O_i), Λ0(O_j)} < Λ0(τ)] is given by

exp{(XjXi)β}. (A2)

Put another way, the odds that person i has a better outcome than person j over the calendar time followup period [0, τ) is given by (A2). Or equivalently, (A2) is the win ratio for people with Xi compared to people with Xj.
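As a small numerical illustration of (A1) and (A2), the conditional probability in (A1) converts algebraically to the odds in (A2); the covariate values below are hypothetical.

```python
import math

def win_odds(x_i, x_j, beta):
    """Odds that person i has the better outcome than person j.

    Computed from the conditional probability in (A1); algebraically this
    equals exp{(x_j - x_i) * beta}, the win ratio in (A2).
    """
    p = math.exp(x_j * beta) / (math.exp(x_j * beta) + math.exp(x_i * beta))
    return p / (1 - p)
```

For example, win_odds(0.0, 1.0, 0.405) returns exp(0.405) ≈ 1.50: the odds that the person with x = 0 outranks the person with x = 1.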

SOFTWARE IMPLEMENTATION

A simple way to program multiple interval censoring is to use the counting process formulation for subjects with discontinuous intervals of risk. This allows estimation of the baseline “hazard” as well as the covariate effects β. Discontinuous risk intervals are discussed in the work of Therneau and Grambsch24 and can be easily implemented using the coxph function in the survival R package with multiple start and stop intervals for each multiply interval censored subject. We will illustrate this approach using the data of Figure 1, which is represented in conventional form in Table A1.

Let START, STOP, IE, and Z denote the start and stop times of each ordering score interval, the indicator of an event, and the treatment group indicator, respectively. The data of Table A1 become

ID START STOP IE Z W
1 0.0 0.5 0 1 61
1 1.0 1.5 0 1 61
1 2.0 2.5 0 1 61
2 0.0 0.5 0 0 46
2 1.0 1.4 1 0 46
3 0.0 0.7 1 1 73
4 0.0 1.0 0 0 29
4 1.0 2.0 0 0 29
4 2.0 2.3 1 0 29
5 0.0 1.0 0 1 63
5 1.0 1.8 1 1 63

For such data, the NPMLE of the ordering score distribution (ignoring treatment) is given by

survfit(Surv(START,STOP,IE)~1).

The PH estimator is obtained from

coxph(Surv(START,STOP,IE)~Z).

The PH model with separate β parameters for each ordering score interval can be fit by creating a new dataset where IZ1 is Z times an indicator for follow-up during the period (0, τ), IZ2 is Z times an indicator for follow-up during the period (τ, 2τ), and so on. For the data of Table A1, we obtain the dataset below, to which we can fit the model

ID START STOP IE Z W IZ1 IZ2 IZ3
1 0.0 0.5 0 1 61 1 0 0
1 1.0 1.5 0 1 61 0 1 0
1 2.0 2.5 0 1 61 0 0 1
2 0.0 0.5 0 0 46 0 0 0
2 1.0 1.4 1 0 46 0 0 0
3 0.0 0.7 1 1 73 1 0 0
4 0.0 1.0 0 0 29 0 0 0
4 1.0 2.0 0 0 29 0 0 0
4 2.0 2.3 1 0 29 0 0 0
5 0.0 1.0 0 1 63 1 0 0
5 1.0 1.8 1 1 63 0 1 0
coxph(Surv(START,STOP,IE)~IZ1+IZ2+IZ3).

If other software besides R is preferred, that software can be used to estimate the survival function of O and/or to estimate the semiparametric model. The basic idea is to create a large dataset by "stacking" successive datasets composed of subjects who avoided earlier events. Thus, we first use all subjects for the top level of the hierarchy (death), with time to death as the ordering score. We then append the subjects who avoided death, using time to stroke plus τ as the ordering score, and so on. We call this a "stratified and segregated" dataset, and Table A2 is an example using the data of Table A1. In Table A2, note that the three strata are completely segregated by O; for example, all Os in stratum 2 lie in [1, 2). With stratified Cox regression, person 1, who remains event free, separately contributes to the partial likelihoods over the intervals [0, 1), [1, 2), and [2, 3). Without stratification, person 1 would be counted in the risk set three times for all O. A Cox model can be fit using standard Cox regression software with stratification. The corresponding R code for Table A2 is

coxph(Surv(O,IE)~Z+W+strata(S)).

In R, these two approaches to estimation, using data with discontinuous risk intervals with

coxph(Surv(START,STOP,IE)~Z+W)

or stratified and segregated data with

coxph(Surv(O,IE)~Z+W+strata(S)),

result in the same estimates and standard errors.

TABLE A2.

Construction of the segregated stratified dataset for ordering score regression. Each patient has a time to death, stroke, or bleed or is censored over the time interval [0, 1). This experience is mapped onto an ordering score scale O. The ordering score intervals [0, 1), [1, 2), [2, 3), and the point {3} correspond to the ordering death ≤ stroke ≤ bleed ≤ nothing, and three datasets or strata are created. Each successive dataset is composed of "survivors" of the previous dataset. Stratified Cox regression is applied to the "survival" data O, IE with covariates Z, W

ID O IE Z W Stratum
1 0.5 0 1 61 1
2 0.5 0 0 46 1
3 0.7 1 1 73 1
4 1.0 0 0 29 1
5 1.0 0 1 63 1
1 1.5 0 1 61 2
2 1.4 1 0 46 2
4 2.0 0 0 29 2
5 1.8 1 1 63 2
1 2.5 0 1 61 3
4 2.3 1 0 29 3

Footnotes

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

DATA AVAILABILITY STATEMENT

The NIPPON trial, analyzed in this paper, was not undertaken with the expectation of being publicly available and patients were not consented for this use. A single simulated dataset, from scenario 5 of the simulation section, is available in supplementary materials along with the computer code used to create Figure 2 and Table 3.

REFERENCES

1. Follmann D, Wittes J, Cutler JA. The use of subjective rankings in clinical trials with an application to cardiovascular disease. Statist Med. 1992;11(4):427–437.
2. Bjorling LE, Hodges JS. Rule-based ranking schemes for antiretroviral trials. Statist Med. 1997;16(10):1175–1191.
3. Brittain E, Palensky J, Blood J, Wittes J. Blinded subjective rankings as a method of assessing treatment effect: a large sample example from the systolic hypertension in the elderly program (SHEP). Statist Med. 1997;16(6):681–693.
4. Finkelstein DM, Schoenfeld DA. Combining mortality and longitudinal measures in clinical trials. Statist Med. 1999;18(11):1341–1354.
5. Follmann DA. Regression analysis based on pairwise ordering of patients' clinical histories. Statist Med. 2002;21(22):3353–3367.
6. Buyse M. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Statist Med. 2010;29(30):3245–3257.
7. Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J. 2012;33(2):176–182.
8. Thas O, Neve JD, Clement L, Ottoy J-P. Probabilistic index models. J R Stat Soc B Stat Methodol. 2012;74(4):623–671.
9. Evans SR, Rubin D, Follmann D, et al. Desirability of outcome ranking (DOOR) and response adjusted for duration of antibiotic risk (RADAR). Clin Infect Dis. 2015;61(5):800–806.
10. Evans SR, Follmann D. Using outcomes to analyze patients rather than patients to analyze outcomes: a step toward pragmatism in benefit:risk evaluation. Stat Biopharm Res. 2016;8(4):386–393.
11. Oakes D. On the win-ratio statistic in clinical trials with multiple types of event. Biometrika. 2016;103(3):742–745.
12. Bebu I, Lachin JM. Large sample inference for a win ratio analysis of a composite outcome based on prioritized components. Biostatistics. 2015;17(1):178–187.
13. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60.
14. Gehan EA. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika. 1965;52(1–2):203–224.
15. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc B Methodol. 1976;38(3):290–295.
16. Lee M, Lee J, Carroll MW, et al. Linezolid for treatment of chronic extensively drug-resistant tuberculosis. N Engl J Med. 2012;367(16):1508–1518.
17. Shaw PA, Fay MP. A rank test for bivariate time-to-event outcomes when one event is a surrogate. Statist Med. 2016;35(19):3413–3423.
18. Hougaard P. A class of multivariate failure time distributions. Biometrika. 1986;73(3):671–678.
19. Miller RG Jr. Survival Analysis. Vol. 66. Hoboken, NJ: John Wiley & Sons; 2011.
20. Nakamura M, Iijima R, Ako J, et al. Dual antiplatelet therapy for 6 versus 18 months after biodegradable polymer drug-eluting stent implantation. JACC Cardiovasc Interv. 2017;10(12):1189–1198.
21. Efron B. The two sample problem with censored data. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 4. Berkeley, CA: University of California, Berkeley; 1967:831–853.
22. Peto R, Peto J. Asymptotically efficient rank invariant test procedures. J R Stat Soc A Gen. 1972;135(2):185–198.
23. Prentice RL. Linear rank tests with right censored data. Biometrika. 1978;65(1):167–179.
24. Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Berlin, Germany: Springer Science & Business Media; 2013.
