Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America. 2025 Jun 13;81(5):e319–e329. doi: 10.1093/cid/ciaf314

Making Sense of Hierarchical Composite End Points in Randomized Clinical Trials—A Primer for Infectious Diseases Clinicians and Researchers

Sean W X Ong 1,2,3,4,✉,3, Robert K Mahar 5,6,7, Chris J Selman 8,9, Ruxandra Pinto 10,11, Joshua S Davis 12,13,14, Robert A Fowler 15,16, Steven Y C Tong 17,18,#, Nick Daneman 19,20,#
PMCID: PMC12728281  PMID: 40512968

Abstract

Hierarchical composite end points (HCEs), combining features of simple composite end points and conventional ordinal end points, are increasingly being used in infectious diseases (ID) research. However, many clinicians may be unfamiliar with these novel end points, including the variety of different target parameters that may be of interest and the methods that can be used to estimate them. In this review, we provide a conceptual overview of HCEs by defining them and providing examples from the ID literature. We explain different methods for analyzing HCEs, including (1) the Wilcoxon rank sum approach (often used in studies with a desirability of outcome ranking [DOOR] end point), (2) generalized pairwise comparisons (used to estimate a win ratio or win odds), (3) the proportional odds model (and the relevance of the proportional odds assumption), and (4) the probabilistic index model. This review will help ID clinicians and healthcare providers interpret current and future research using such end points.

Keywords: Hierarchical composite end points, trial design, outcomes, clinical trial, ordinal outcomes


Hierarchical composite end points (HCEs) are increasingly being used in infectious diseases research. In this review aimed at educating the clinical reader, we provide a conceptual overview of HCEs and explain the different methods for analyzing them.


Hierarchical composite end points (HCEs) are increasingly being used in clinical trials [1, 2]. Recent examples from the infectious diseases (ID) literature include desirability of outcome ranking (DOOR) and post hoc analyses estimating a win ratio [2–4]. However, clinicians may be unfamiliar with these novel end points, which are analyzed differently from more familiar binary or continuous end points. In addition, HCE nomenclature is often inconsistent, and related methodologic concepts are named differently depending on the field that uses them. As these end points gain popularity in clinical research, a clear framework of how to understand and interpret their analyses is important for ID clinicians.

In this review, we aim to (1) give a conceptual overview of HCEs, (2) explain different methods for analyzing HCEs, and (3) clarify the nomenclature used in the literature and propose a consistent terminology for common use. To aid the reader, we have provided a glossary in Table 1 that provides definitions of terms or concepts. We illustrate these concepts in a paired article, describing how we applied different methods of analyzing HCEs in post hoc analyses of 2 clinical trials, BALANCE and CAMERA2 [5, 6].

Table 1.

Glossary of Terminology

Term Explanation
End point/outcome types
 Binary end point/outcome An end point/outcome whereby patients can only have 1 of 2 potential states: either they have or do not have the outcome. The most common example is mortality—patients are either alive or dead at the end of the follow-up period.
 Composite end point/outcome An end point combining multiple different events of interest, whereby a patient experiencing any of the component events is considered as having the composite end point of interest.
 Continuous end point/outcome An end point for which the outcome is a continuous variable that can take any value within a specific range—for example, HIV viral load, or LDL cholesterol.
 DOOR A special case of an ordinal end point, whereby multiple separate clinical events are combined into a single outcome and ranked in order from best to worst outcome. Patients can experience >1 component outcome but are assigned the “worst” ranking based on the prespecified ranking order. Often used in infectious diseases clinical trials. Multiple consensus DOORs have been developed for various infectious syndromes.
 HCE An end point type combining features of both ordinal and composite end points; they are composite, as they combine multiple clinical events, but they are also ordinal, as these different events or component outcomes are ranked in order of importance (eg, death ranked above disease complication or treatment-related adverse events); synonyms in the literature include prioritized outcomes, ordinal composite outcomes, and ordered composite end points.
 Ordinal end point/outcome An end point/outcome for which a clinical event of interest is categorized into multiple categories, with a clear ranked order of the direction of superiority; in a classic ordinal outcome, this is typically in a single clinical domain (eg, pain, neurologic outcome, respiratory status), and categories are mutually exclusive.
 Time-to-event end point/outcome An end point/outcome that accounts for not just the occurrence of an event of interest but also the time to its first occurrence; the event can be binary (eg, death), composite (among several component outcomes, eg, major adverse cardiovascular events), or recurrent (eg, hospitalization or emergency department attendance).
Analytic methods or features
 Generalized pairwise comparisons An analytic method whereby all possible patient pair combinations are made between 2 treatment groups and each patient pair is evaluated to determine which patient has a superior outcome; this can be used for any type of outcome, including HCEs, where each pair is evaluated based on the prespecified hierarchy of outcomes.
 Mann–Whitney U test A nonparametric statistical test, often used to compare the distributions of 2 nonnormally distributed variables; statistically, it tests the null hypothesis that, for randomly selected values x and y from 2 treatment groups, the probability that x > y is equal to the probability that y > x (ie, that the distributions are identical), and it is the same as the Wilcoxon rank sum test.
 PI model A flexible semiparametric model that uses the generalized pairwise comparisons framework and can incorporate multiple different outcome types to estimate a covariate-adjusted PI.
 PO assumption The assumption in a PO model that the treatment exerts a consistent effect across all levels of the ordinal scale; there is debate in the statistical community as to how to test for this assumption (where statistical tests tend to have less power), as well as the degree to which it needs to hold.
 PO model A parametric model used to analyze ordinal outcomes, in which the distribution of the ordinal outcome is used to estimate an odds ratio representing the odds of going up ≥1 step in the ordinal scale; also known as the cumulative logistic model assuming PO or ordered/ordinal logistic regression.
 Proportional win fractions model A flexible semiparametric model used to analyze HCEs, which can include multiple covariates to estimate an adjusted win ratio.
 Wilcoxon rank sum test See Mann–Whitney U test.
Target parameters
 Better DOOR probability The commonly used target parameter of a DOOR analysis, ie, the probability that a randomly selected patient in the treatment group has a DOOR outcome superior to that of a randomly selected patient in the control (reference or comparator) group (see Concordance probability and PI).
 Concordance probability Often represented by the abbreviation c; the probability that a randomly chosen observation from the treatment group is greater than one from the control group (see Better DOOR probability and PI).
 Net treatment benefit A target parameter used in generalized pairwise comparisons analysis, which represents the difference between wins and losses for the experimental intervention; it is a positive or negative proportion (on a scale from −1.0 to 1.0), for which a net treatment benefit >0 is considered superior, and it can be interpreted as the net percentage of pairwise comparisons in which the treatment is superior to the control (wins minus losses).
 Number needed to treat The number of patients needed to be treated with the experimental intervention of interest for 1 patient to receive the benefit in terms of the outcome of interest; using a more familiar binary outcome of mortality it is the number needed to be treated to prevent 1 death, and using an HCE it is the number needed to be treated to achieve a superior global outcome in 1 patient, considering all component clinical events assessed in the hierarchy.
 PI The probability that any randomly selected patient in 1 of 2 groups has a superior outcome to a randomly selected patient in the other group, mathematically written as P(A > B), where P represents probability; A, a patient in group A; and B, a patient in group B (synonyms are better DOOR probability and concordance probability).
 Win odds A target parameter that can be used to summarize analysis of an HCE, which represents the ratio of the number of wins (and half the number of ties) vs the number of losses (and half the number of ties). This differs from the win ratio as it splits the ties (ie, patients with the same outcomes across all components of the hierarchy) across both groups, and it may be preferred in trials with a large proportion of tied outcomes. A win odds takes values between 0 and infinity, and a value >1.0 represents superiority of the experimental treatment group.
 Win ratio A target parameter that can be used to summarize analysis of an HCE, which represents the ratio of the number of wins to the number of losses for the experimental treatment group; a win ratio ranges from 0 to infinity, with values >1.0 representing superiority of the experimental treatment group.
Others
 RADAR A “tiebreaker” used in some DOOR analyses, whereby antibiotic duration is used to further distinguish and rank patients within the same level on the DOOR scale (patients receiving a shorter antibiotic duration are ranked as superior to those receiving a longer duration); this can also be seen as an HCE in which antibiotic duration is the second step in the hierarchy after the initial DOOR.

Abbreviations: DOOR, desirability of outcome ranking; HCE, hierarchical composite end point; HIV, human immunodeficiency virus; LDL, low-density lipoprotein; PI, probabilistic index; PO, proportional odds; RADAR, response adjusted for duration of antibiotic risk.

REVIEW

Conventional Composite and Ordinal Outcomes

HCEs combine features of composite outcomes and conventional ordinal outcomes (Figure 1). Composite outcomes combine multiple binary end points into a single outcome—eg, the POET trial combined death, unplanned cardiac surgery, embolic events, or bacteremia relapse as a composite outcome, such that a patient who had any of the component outcomes was considered as having an event [7]. This increases frequency of the outcome, increasing statistical power and reducing required sample size [8]. This may be especially useful in conditions where death is infrequent or a truncating event (competing risk) that prevents observation of other end points [9]. However, one problem with conventional composite outcomes is that component events are treated equally even if they have varying clinical importance [10]. Larger changes in less important components may thus obscure smaller changes in more important outcomes.

Figure 1.

Examples of ordinal end points, composite end points, and hierarchical composite end points. Italicized acronyms indicate example trials that use the stated example end point. Abbreviations: COVID-19, coronavirus disease 2019; WHO, World Health Organization.

Ordinal outcomes are a special case of categorical outcomes, where mutually exclusive categories are rank ordered in a clinically meaningful manner, though the difference between contiguous categories may not necessarily be equidistant [11]. A common example of an ordinal scale is the modified Rankin scale for poststroke functional recovery [12]. The World Health Organization has developed a core ordinal outcome for coronavirus disease 2019 (COVID-19) randomized clinical trials, the World Health Organization clinical progression scale, reflecting a range of disease states in COVID-19 from asymptomatic (best) to death (worst) [13]. Ordinal outcomes can also provide greater statistical efficiency, by permitting a spread of event outcomes, capturing more information than a binary outcome [14, 15]. From a clinician's perspective, ordinal outcomes are appealing as they reflect different gradations of severity encountered in clinical practice. However, they may be challenging to analyze and interpret [1].

HCEs Defined

HCEs combine the 2 concepts of composite and ordinal outcomes: they are composite as they combine multiple events, and they are also hierarchical as they establish a rank order or hierarchy of clinical importance across the different component events. While a conventional ordinal outcome reflects gradations of severity of the same clinical attribute (eg, pain, function, or respiratory status), an HCE may combine multiple different clinical attributes and, at the same time, establish a rank order among them. They can thus combine multiple different health states related to efficacy and/or safety into a single end point, which may be more reflective of clinical decision making. Different component ranks of a conventional ordinal outcome are mutually exclusive (as one cannot have 2 different levels of the same clinical attribute), whereas different components of an HCE may overlap—but the “worst” component is counted, based on the stated rank order. The rank order prioritizing different outcome components addresses the limitation of conventional composite end points where components are weighted equally. This is especially beneficial when the different components are of varying severity or frequency. Table 2 summarizes key differences between HCEs and traditional composite or ordinal end points.

Table 2.

Comparison Between Composite, Ordinal, and Hierarchical Composite End Points

Feature: Development and selection of end point
 Composite end point: Some degree of subjectivity in terms of component selection; few widely validated composite end points in infectious diseases.
 Ordinal end point: Less subjectivity in terms of end point construction, unless there is variation within how different levels in the ordinal scale are defined; few widely validated ordinal end points in infectious diseases.
 Hierarchical composite end point: Greatest degree of subjectivity present, since choices have to be made both in terms of component selection as well as order of prioritization; several consensus DOOR outcomes developed for different infectious syndromes; more flexible due to ability to combine multiple end point types (eg, binary, time-to-event, and continuous end points).

Feature: Clinical relevance
 Composite end point: May be less clinically meaningful, especially if outcome components are not of equal clinical importance; larger changes in less clinically important components may obscure smaller changes in more clinically important components.
 Ordinal end point: More clinically meaningful as different levels may reflect different gradations of severity encountered in clinical practice.
 Hierarchical composite end point: Reflects real-world clinical decision making by combining competing risks and trade-offs into a single end point, permitting a more comprehensive risk-benefit assessment and providing a “global” assessment of the patient's condition.

Feature: Statistical efficiency (power)
 Composite end point: More efficient than conventional binary end points as the event rate is increased but less efficient than ordinal end points.
 Ordinal end point: More efficient than composite end points by capturing a spread of event outcomes, thus capturing more information.
 Hierarchical composite end point: May be more or less efficient than composite or ordinal outcomes, depending on the selection of components, relative ranking, and whether treatments exert effects in different directions among different components.

Feature: Sample size determination
 Composite end point: Straightforward, easily conducted using simple online calculators with minimal data inputs (expected event rate, hypothesized effect size, α level, and power).
 Ordinal end point: More complicated than for composite end points but still relatively straightforward with available calculators and minimal data inputs (expected distribution across different levels, hypothesized effect size, α level, and power).
 Hierarchical composite end point: Dependent on the method of analysis and target parameter; most complicated, especially when considering mix of different outcome types; may require extensive simulations.

Feature: Commonly used analytic methods
 Composite end point: Calculation of risk difference, risk ratio, or OR (for binary end point); logistic regression model to estimate OR (for binary end point); log-rank test with Kaplan-Meier curve (for time-to-event end point); Cox proportional hazards model to estimate hazard ratio (for time-to-event end point).
 Ordinal end point: Dichotomization at cutoff point and analysis as binary end point (considered inappropriate and statistically inefficient); Mann–Whitney U tests to compare distributions; PO model to estimate common OR.
 Hierarchical composite end point: Mann–Whitney U tests to compare distributions and calculate PI; generalized pairwise comparisons to estimate win statistics; PO model to estimate common OR; PI model to estimate PI.

Feature: Interpretability of target parameters
 Composite end point: Target parameters are commonly used and familiar to clinicians: risk differences, risk ratios, or ORs (for binary composite end points) or hazard ratios (for time-to-event composite end points).
 Ordinal end point: Target parameter is often the OR from a PO model, which may be less interpretable to clinicians who are less familiar with this than with more common target parameters.
 Hierarchical composite end point: Target parameters (PI, win ratio, win odds) are less familiar to clinical audiences.

Abbreviations: DOOR, desirability of outcome ranking; OR, odds ratio; PI, probabilistic index; PO, proportional odds.

An example of an HCE is the DOOR outcome, which is increasingly used in ID clinical research [2, 3]. Multiple consensus DOOR outcomes have been developed for a variety of infectious disease syndromes [16–19]. For example, a consensus DOOR outcome for Staphylococcus aureus bacteremia combines death, treatment failure, infectious complications, and drug adverse events, and it has been used in the DOTS trial evaluating dalbavancin [20, 21].
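The rule that a patient with several component events is assigned the "worst" rank can be sketched in a few lines of Python. This is a minimal illustration only: the component names follow the S. aureus bacteremia example, but the 5-level scale, the ordering of components, and the function name `door_rank` are assumptions made here and do not reproduce the published consensus DOOR definition.

```python
# Hypothetical hierarchy, listed from worst to least severe component event.
# The ordering is an assumption for illustration, not the consensus DOOR.
HIERARCHY = ["death", "treatment_failure", "infectious_complication", "adverse_event"]


def door_rank(events):
    """Return a DOOR rank (1 = most desirable) for a set of observed events."""
    n = len(HIERARCHY)
    for position, component in enumerate(HIERARCHY):
        if component in events:
            # The worst observed component decides the rank:
            # death -> 5, treatment failure -> 4, and so on.
            return n + 1 - position
    return 1  # no component event observed


# A patient with both an adverse event and treatment failure is ranked
# by the worse of the two (treatment failure).
example_rank = door_rank({"adverse_event", "treatment_failure"})
print(example_rank)  # 4
```

Each patient ends up with exactly one rank even when component events overlap, which is what allows the ranks to be compared between treatment groups.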

DOOR is an example of an HCE in which, as in conventional composite outcomes, each component is binary (ie, yes/no for each constituent component). However, HCEs can also have greater flexibility by combining different types of component outcomes (eg, time to event, continuous, binary, or ordinal). Finkelstein and Schoenfeld [22] demonstrated this in post hoc analyses of human immunodeficiency virus clinical trials where they combined 3 component outcomes in an HCE: (1) death (time-to-event outcome), (2) opportunistic infection (time-to-event outcome), and (3) quality of life (ordinal outcome), ranked in order from most to least important.

An extension of DOOR incorporating antibiotic duration, DOOR/RADAR (response adjusted for duration of antibiotic risk), where receipt of shorter antibiotic duration (a continuous outcome) is considered superior to longer duration [3, 23], can also be conceptualized as belonging to this HCE framework: patients are first evaluated by the DOOR ordinal scale, followed by antibiotic duration as the second rung in the HCE. DOOR/RADAR has been proposed as an alternative to the more traditional noninferiority design to study antibiotic durations, sidestepping some of the pitfalls of noninferiority trials, such as bias in the setting of treatment nonadherence or difficulties with establishing a noninferiority margin [3]. The use of other “tiebreakers” together with DOOR, such as quality of life or functional outcome, is another example of an HCE [21, 24, 25].

Analysis of HCEs

Statistical analytic methods can be categorized into parametric or nonparametric approaches. Nonparametric approaches are based only on counting processes of the raw data, and a statistical model (with model parameters, hence the term parametric) is not created to summarize the data and estimate coefficients that correspond to the treatment effect. With nonparametric approaches, covariate adjustment is limited, so if one desires an adjusted estimate, a parametric model is usually required. For those new to these concepts, an analogous comparison is between nonparametric versus parametric approaches for a binary outcome (eg, death)—the nonparametric approach is a χ2 test to test the association between 2 variables, while the parametric approach is a logistic regression model where one can adjust for other variables (eg, age or sex) to derive an adjusted odds ratio (OR). In clinical trials, covariate adjustment is recommended when a variable is used to stratify randomization; adjusting for covariates that are predictive of the outcome can also improve power by providing more precise estimates (with narrower confidence intervals [CIs]) [26].

Nonparametric Approaches

Studies that use DOORs can be analyzed using a nonparametric approach, the Mann–Whitney U test (also known as the Wilcoxon rank sum test). The Mann–Whitney U test is commonly used in biomedical research to compare medians of nonnormally distributed continuous outcomes, by comparing the relative distributions of the range of values in 2 groups. The way this is used to analyze DOORs is similar: the distributions of ranks are compared across 2 groups, with the null hypothesis being that the distributions of both groups are the same. The primary effect estimated in this application is the probability that any randomly selected patient in one group has a superior outcome to a randomly selected patient in the other group—mathematically written as P(A>B), where P represents probability; A, the outcome of a patient in group A; and B, the outcome of a patient in group B.

This effect estimate has been variably termed the better DOOR probability, probabilistic index, or concordance probability c [2, 27, 28]. A probability of 50% represents no difference between groups, and a result where the entire CI is >50% represents superiority. For example, the SCOUT-CAP trial, which compared 5 versus 10 days of antibiotics for community-acquired pneumonia in young children, used an HCE that ranked the following in order of priority: clinical response, resolution of symptoms, antibiotic-associated adverse effects, and duration of antibiotics received. The 5-day group had a 69% probability (95% CI: 63%–75%) of a more desirable outcome (DOOR/RADAR) compared with the 10-day group, indicating superiority of the short-course treatment (since the lower bound of the CI exceeded 50%) [23]. The first 3 outcome components were similar between treatment groups, but inclusion of the antibiotic treatment duration as an additional tiebreaker resulted in the shorter treatment group being superior. This assumes that shorter durations are preferable when other clinical outcomes are similar.
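To make the probability P(A>B) concrete, the following Python sketch estimates it by comparing every patient in one group with every patient in the other. The data are invented, the function name `probabilistic_index` is hypothetical, and tied pairs are counted as half a win for each group, which is a common convention assumed here rather than something specified in the text.

```python
def probabilistic_index(group_a, group_b, higher_is_better=False):
    """Estimate P(A better than B) + 0.5 * P(A tied with B) over all cross-group pairs.

    Inputs are lists of DOOR ranks; by default rank 1 is the most desirable.
    """
    better = ties = 0
    for a in group_a:
        for b in group_b:
            if a == b:
                ties += 1
            elif (a > b) == higher_is_better:
                better += 1
    return (better + 0.5 * ties) / (len(group_a) * len(group_b))


# Invented DOOR ranks (1 = best) for two small groups.
short_course = [1, 1, 2, 2, 3]
long_course = [1, 2, 2, 3, 3]

pi_short = probabilistic_index(short_course, long_course)
print(pi_short)  # 0.64: 12 better pairs and 8 tied pairs out of 25
```

A value of 0.5 would indicate no difference between groups. A confidence interval is not computed in this sketch and would require an additional step, for example bootstrapping.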

More generally, HCEs can be analyzed using the framework of generalized pairwise comparisons [29]. In this method, all possible patient pair combinations are made between the 2 treatment groups, and each pair is evaluated based on the hierarchy of outcomes to determine which treatment group is favored within each pair (Figure 2). The hierarchical ordering is such that the most important outcome is considered first, and if that is tied (eg, both individuals in the pair did not have the outcome), then the second most important outcome is considered, and so forth. This results in 3 summary numbers: total number of wins, losses, and ties. Different treatment effect estimates (win ratio, win odds, net treatment benefit, number needed to treat, and probabilistic index) can be obtained from these 3 summary numbers using different formulas (Supplementary Appendix) [30–32].
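As a compact reference, the target parameters named above can be written in terms of the 3 summary numbers. The notation below is an assumption made here (W, L, and T for the numbers of wins, losses, and ties for the experimental group, and N = W + L + T for the total number of pairs); these are the commonly used definitions and may differ in notation or detail from the formulas in the Supplementary Appendix.

```latex
\begin{align*}
\text{Win ratio} &= \frac{W}{L} \\
\text{Win odds} &= \frac{W + \tfrac{1}{2}T}{L + \tfrac{1}{2}T} \\
\text{Net treatment benefit} &= \frac{W - L}{N} \\
\text{Probabilistic index} &= \frac{W + \tfrac{1}{2}T}{N} \\
\text{Number needed to treat} &= \frac{1}{\text{Net treatment benefit}}
\end{align*}
```

The win ratio ignores ties, whereas the win odds and probabilistic index split them equally between groups, which is why the two can diverge when many pairs are tied.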

Figure 2.

Example of a generalized pairwise comparisons analysis in a hypothetical clinical trial. In this hypothetical example clinical trial, 20 patients are randomized to 2 treatment groups—groups A and B. The simplified hierarchical composite end point depicted here consists of 2 outcomes in order of priority: (1) death and (2) complication. Each patient in one group is compared with each patient in the other (total 10 × 10 = 100 comparisons) and determined to be a win, loss, or tie depending on the outcomes. The total number of wins, losses, and ties are added up, which permits calculation of the win ratio, win odds, and net treatment benefit. Abbreviation: SOC, standard of care.
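The pairwise comparison procedure can be sketched in Python for a 2-level hierarchy of (1) death and (2) complication. The patient data below are invented and do not reproduce the numbers in Figure 2; the function names and the representation of each patient as a `(died, had_complication)` tuple are assumptions for illustration.

```python
from itertools import product


def compare(treated, control):
    """Return 'win', 'loss', or 'tie' for the treated patient.

    Each patient is a tuple of event indicators ordered by priority,
    here (died, had_complication). The first level that differs decides.
    """
    for t_event, c_event in zip(treated, control):
        if t_event != c_event:
            return "loss" if t_event else "win"
    return "tie"


def pairwise_summary(treatment_group, control_group):
    """Count wins, losses, and ties over all treatment-control pairs."""
    counts = {"win": 0, "loss": 0, "tie": 0}
    for t, c in product(treatment_group, control_group):
        counts[compare(t, c)] += 1
    return counts


# Invented data: 10 patients per group, giving 10 x 10 = 100 pairs.
treatment = [(False, False)] * 6 + [(False, True)] * 3 + [(True, False)] * 1
control = [(False, False)] * 4 + [(False, True)] * 3 + [(True, False)] * 3

counts = pairwise_summary(treatment, control)
wins, losses, ties = counts["win"], counts["loss"], counts["tie"]

win_ratio = wins / losses
win_odds = (wins + 0.5 * ties) / (losses + 0.5 * ties)
net_benefit = (wins - losses) / (wins + losses + ties)

print(counts)  # {'win': 45, 'loss': 19, 'tie': 36}
print(round(win_ratio, 2), round(win_odds, 2), round(net_benefit, 2))
```

With these invented data the win ratio (45/19) is noticeably larger than the win odds (63/37) because more than a third of the pairs are tied.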

Using a simple binary outcome, the win ratio is the simple inverse of an OR and can be interpreted in a similar way (Supplementary Figure 1). With the inclusion of multiple outcomes in an HCE, the interpretation extends to being the ratio of having an overall superior outcome (a “win”) in one group versus the other group. The MERINO trial compared piperacillin-tazobactam versus meropenem for patients with ceftriaxone-resistant Escherichia coli or Klebsiella pneumoniae bloodstream infection, and it demonstrated a higher 30-day mortality rate in the piperacillin-tazobactam arm, failing to demonstrate noninferiority of piperacillin-tazobactam [33]. However, there were concerns that much of this mortality rate was due to underlying comorbid conditions rather than being infection related.

To address this limitation, Hardy and colleagues [4] constructed a post hoc HCE also including microbiologic relapse and secondary infection and then applied the generalized pairwise comparisons method to estimate the win ratio and win odds of piperacillin-tazobactam versus meropenem, thus considering clinically important outcomes that may be more directly relevant to the underlying infection. The win ratio was 0.40 (95% CI: .22–.71) and the win odds was 0.79 (95% CI: .68–.92); that is, there were fewer wins in the piperacillin-tazobactam group, whether or not ties were considered. The upper bounds of the CIs of both estimates were <1.0, indicating the inferiority of piperacillin-tazobactam compared with meropenem, aligning with the conclusion of the primary analysis. The win ratio here can be interpreted as follows: excluding patients with equivalent outcomes, patients in the piperacillin-tazobactam group had a 60% lower chance of a more favorable outcome compared with those in the meropenem group. The win odds is interpreted as the odds of a patient in the piperacillin-tazobactam group having a more favorable outcome being 21% lower than for a patient in the meropenem group.

Parametric Approaches

Analogous parametric analytic approaches are also available for HCEs. Ordinal outcomes are commonly analyzed using a proportional odds (PO) model (also known as the cumulative logistic model assuming PO) [1]. Similar to how the logistic regression model can be conceptualized as a multivariable or model-based extension of the χ2 test, the PO model can be conceptualized as a multivariable extension of the Mann–Whitney U test [11, 28, 34]. The target parameter estimated using a PO model is a common (or proportional) OR. This common OR can be interpreted as the odds a patient in the treatment group has of going up ≥1 step on the ordinal scale (ie, having a more favorable outcome) compared with a patient in the control group.

As an example, the REMAP-CAP platform trial evaluated different treatment strategies for COVID-19 pneumonia and used an ordinal scale combining survival and organ support–free days, analyzed using a PO model. In the domain evaluating tocilizumab, the common OR for a favorable outcome was 1.64 (95% CI, 1.25–2.14), indicating the superiority of tocilizumab versus control, since the entirety of the CI was >1.0 [35]. This can be interpreted as tocilizumab-treated patients having higher odds of improved survival and/or more organ support–free days.

The PO model relies on the PO assumption that the treatment exerts the same effect across the distribution of the ordinal scale. This implies that the ORs for each binary split (ie, above and below each level) of the ordinal scale are the same, which may not be a realistic assumption in practice. There is debate as to the importance of this assumption and optimal solutions when it is violated [36–38], which is beyond the scope of this review. As long as the direction of effect is consistent across all levels of the ordinal scale (ie, there is a positive effect throughout the scale, even if the actual value of the OR at each binary split is different), the model is still interpretable—in this case, the common OR would be interpreted as a summary OR representing the odds of having a superior outcome in one group compared with the other (Figure 3A). In contrast, if there is a discordant effect at different ends of the scale (eg, the treatment exerts a beneficial effect on one end but a detrimental effect on the other end of the scale), the common OR is no longer clinically interpretable because it may contradict the treatment effect at different levels (Figure 3B).

Figure 3.

Illustration of the proportional odds model (PO), in settings with a consistent effect across the ordinal scale (A) and a discordant effect at different ends of the ordinal scale (B). Abbreviation: OR, odds ratio.
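One informal way to look at the PO assumption is to compute the OR at every binary split of the ordinal scale and check whether the values point in the same direction. The Python sketch below does this for invented category counts; the function name `split_odds_ratios` and both data sets are hypothetical, and the sketch is a descriptive check, not a formal test of the assumption or a fitted PO model.

```python
def split_odds_ratios(treatment_counts, control_counts):
    """Odds ratio of being in a better category at each binary split.

    Counts are ordered from the best category to the worst. Every split
    must leave a nonzero count on both sides in both groups.
    """
    ratios = []
    for cut in range(1, len(treatment_counts)):
        t_better, t_worse = sum(treatment_counts[:cut]), sum(treatment_counts[cut:])
        c_better, c_worse = sum(control_counts[:cut]), sum(control_counts[cut:])
        ratios.append((t_better / t_worse) / (c_better / c_worse))
    return ratios


control = [25, 30, 25, 20]                # invented 4-level scale, best to worst
consistent_treatment = [40, 30, 20, 10]   # benefit at every split
discordant_treatment = [40, 20, 10, 30]   # more best AND more worst outcomes

consistent = split_odds_ratios(consistent_treatment, control)
discordant = split_odds_ratios(discordant_treatment, control)

print([round(r, 2) for r in consistent])  # [2.0, 1.91, 2.25]
print([round(r, 2) for r in discordant])  # [2.0, 1.23, 0.58]
```

In the first data set the split-specific ORs differ slightly but all exceed 1, so a common OR remains a reasonable summary. In the second the OR crosses 1 at the last split, which corresponds to the discordant situation in which a single common OR may mislead.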

The PO model can also be used for HCEs. With a hierarchy of different outcome components, the number of “steps” on the ordinal scale increases: for example, if we had a 2-step hierarchy where the first step is a 4-point ordinal scale and the second step is antibiotic duration (ranging from 1 to 7 days), there would be a total of 4 × 7 = 28 possible ranks or steps on the hierarchical composite scale. It is less likely for the OR to be the same at each of these steps, but as long as the effect is consistent in direction across the entire combined scale, the common OR still has a clinically valid interpretation.
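The 4 × 7 = 28 combined scale in this example can be constructed explicitly. The sketch below assumes that level 1 of the ordinal scale is the best outcome and that a shorter duration is better within a level; the function name `combined_rank` is hypothetical.

```python
N_LEVELS = 4      # points on the first-step ordinal scale (1 = best, assumed)
N_DURATIONS = 7   # antibiotic duration in whole days, 1 to 7


def combined_rank(ordinal_level, duration_days):
    """Map (ordinal level, duration) to a single rank from 1 (best) to 28 (worst)."""
    if not 1 <= ordinal_level <= N_LEVELS:
        raise ValueError("ordinal_level must be between 1 and 4")
    if not 1 <= duration_days <= N_DURATIONS:
        raise ValueError("duration_days must be between 1 and 7")
    # The ordinal level dominates; duration only separates patients within a level.
    return (ordinal_level - 1) * N_DURATIONS + duration_days


all_ranks = sorted(
    combined_rank(level, days)
    for level in range(1, N_LEVELS + 1)
    for days in range(1, N_DURATIONS + 1)
)
print(all_ranks[0], all_ranks[-1], len(set(all_ranks)))  # 1 28 28
```

Because the first step dominates, a patient on the best ordinal level with 7 days of antibiotics (rank 7) is still ranked above a patient on the second level with 1 day (rank 8).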

Alternative modeling approaches have also been proposed for HCEs. Mao and Wang [39] developed a semiparametric model, the proportional win fractions model, that permits multivariate adjustment of generalized pairwise comparison analyses, providing an adjusted win ratio. Thas and colleagues [40] developed the probabilistic index model, which allows modeling of a variety of combinations of outcomes (continuous, ordinal, binary, and time-to-event outcomes), providing a summary adjusted probabilistic index [41]. This has also been applied to calculate adjusted win odds for HCEs using the generalized pairwise comparisons framework [42].

These alternative modeling approaches have not been commonly used in the ID literature. A previous review identified 12 RCTs using DOOR analyses as primary, secondary, or post hoc analyses, but all analyzed only DOOR outcomes using the nonparametric Mann–Whitney U approach [2]. This may be in part due to the confusing nomenclature for these concepts, with different terms being used despite their being related.

Clarification of Terminology and the Relationship Between Different Methods

One reason behind this confusion is the lack of clarity behind how to distinguish outcomes, target parameters, and analytic methods. An outcome is the metric used to measure the effect of treatment on the patient (at the individual level), the target parameter is the summary measure that quantifies the treatment effect of clinical interest, and the analytic method is the statistical method used to estimate the target parameter by comparing outcomes in one group versus those in the other group. Returning to an example of a simple binary outcome, the outcome is the 90-day mortality status, the target parameter could be the OR, and the analytic method could be a logistic regression model.

A DOOR is strictly an outcome, not an analytic method or target parameter. In contrast, a win ratio is a target parameter and not an outcome or analytic method. Thus, “DOOR analysis” or “win ratio analysis” are technically inaccurate and misleading terms. Confusingly, the relationships among outcome, target parameter, and analytic method are not one-to-one relationships; most outcomes can be summarized using multiple target parameters, each of which can be analyzed in different ways (nonparametric and parametric) (Figure 4).

Figure 4.

Relationships among choice of end points, target parameters, and analytic methods. Arrows between end points and target parameters represent the possible target parameters that can be calculated or estimated from each end point, and arrows between analytic methods and target parameters represent the output target parameters from each analytic method. Arrows between target parameters show that the different parameters can be estimated or approximated from each other.

The different target parameters are related and can be calculated (or estimated) from each other (see the Supplementary Appendix). The different analytic methods described in this review will often give similar results if the constituent components of the HCE are the same. Ultimately, the target parameter of interest should be discussed and chosen in the planning stage of a trial—what makes sense for the given context, how it is going to be interpreted to represent better or worse outcomes for patients, and how it is going to be used to aid decision making in patient care. We encourage investigators to use clear and consistent terminology and follow research best practices, including prespecifying primary and secondary outcomes along with a statistical analysis plan that outlines the estimand and target parameter of interest [43]. This could also include details of sensitivity analyses or a “backup” plan when there are significant departures from key model assumptions (eg, PO).
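As a sketch of how the target parameters relate, the standard identities can be written in terms of pairwise comparison counts. The notation here is an assumption for illustration and may differ from that used in the Supplementary Appendix: W, L, and T denote the numbers of treatment-versus-control pairs that are wins, losses, and ties, and N = W + L + T.

```latex
\begin{align*}
\text{Probabilistic index:} \quad \mathrm{PI} &= \frac{W + \tfrac{1}{2}T}{N} \\
\text{Win odds:} \quad \mathrm{WO} &= \frac{W + \tfrac{1}{2}T}{L + \tfrac{1}{2}T} = \frac{\mathrm{PI}}{1 - \mathrm{PI}} \\
\text{Net benefit:} \quad \mathrm{NB} &= \frac{W - L}{N} = 2\,\mathrm{PI} - 1 \\
\text{Win ratio:} \quad \mathrm{WR} &= \frac{W}{L}
\end{align*}
```

When there are no ties (T = 0), the win ratio and win odds coincide. With many ties, the win ratio can be substantially further from 1 than the win odds, because it discards the tied pairs.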

DISCUSSION

In this review, we have explained what an HCE is, outlined common methods used to analyze HCEs, and clarified use of terminology present in the literature. The use of these end points is gaining popularity, and we hope that this review can help clinical readers become familiar with how to appraise research studies using these end points.

From the researcher's perspective, when to use these end points is beyond the scope of this review and is a question that requires further study. One clinical rationale supporting the use of HCEs is that they can capture a range of clinical outcomes that may be important to patients [20, 44, 45]. This can allow a more nuanced assessment of benefits versus harms of any intervention, as well as competing risks and trade-offs, as opposed to making a conclusion based on a singular clinical component (eg, mortality). This may be especially relevant in ID trials, particularly those evaluating antimicrobial use, where there are inherent competing tensions: as clinicians we want to treat patients to optimize their outcomes, but as antimicrobial stewards we also want to minimize collateral harms or the development of antimicrobial resistance.

We have also not covered how to construct these end points. Consensus DOORs have been developed for a few infectious syndromes, but for most clinical questions new HCEs will have to be constructed. This process has a degree of subjectivity, and results may be sensitive to the choice of components and how the hierarchy is defined. Ideally, development of end points should involve a consultative process with multiple stakeholders, a consensus-generating mechanism, and validation across different settings [2, 45, 46].

From a statistical perspective, a stated advantage of HCEs is that they may be more statistically efficient than simple binary outcomes (ie, may have more power and hence require a smaller sample) [45]. However, this is not necessarily true, especially if treatments exert effects in different directions among different components of the hierarchy (with a resultant net effect near zero). More methodologic work is required to compare HCEs with more conventional end points, to identify the optimal clinical questions and trial settings where they should be used. Methods for determining sample size when the target parameter is a common OR are well described [47, 48], but determining sample sizes with other target parameters or analytic methods can be more complex, often requiring extensive simulation work [49, 50].
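For the common OR case, a minimal sketch of the closed-form approximation usually attributed to Whitehead [47] is shown below. It assumes proportional odds, 1:1 allocation, and a two-sided test; `mean_props` (the anticipated proportion of patients in each ordinal category, averaged over the two arms) and the function name are illustrative assumptions, and the result should be treated as a rough planning figure rather than a substitute for dedicated software or simulation.

```python
import math
from statistics import NormalDist


def whitehead_total_n(mean_props, odds_ratio, alpha=0.05, power=0.9):
    """Approximate total sample size (both arms combined, 1:1 allocation)
    for detecting a common odds ratio on an ordinal scale.

    mean_props: anticipated category proportions averaged over both arms;
                must sum to 1.
    """
    if abs(sum(mean_props) - 1.0) > 1e-9:
        raise ValueError("mean_props must sum to 1")
    if odds_ratio <= 0 or odds_ratio == 1:
        raise ValueError("odds_ratio must be positive and different from 1")
    norm = NormalDist()
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    log_or = math.log(odds_ratio)
    # More evenly spread categories make this term larger, reducing N.
    spread = 1 - sum(p ** 3 for p in mean_props)
    return math.ceil(12 * z_sum ** 2 / (log_or ** 2 * spread))


# Same hypothetical OR of 2 and 80% power, binary vs 4-category outcome.
n_binary = whitehead_total_n([0.5, 0.5], 2.0, power=0.8)
n_ordinal = whitehead_total_n([0.25, 0.25, 0.25, 0.25], 2.0, power=0.8)
print(n_binary, n_ordinal)
```

Under these assumptions the 4-category scale needs fewer patients than the dichotomized one, which illustrates the efficiency argument. That gain presumes a consistent effect across the scale; it does not hold when components move in opposite directions.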

Further research should be conducted to extend the applications of the different analytic methods described above to trial designs beyond simple 2-arm parallel group trials, such as cluster-randomized trials, multiarm trials, or adaptive platform trials using a bayesian analytic framework. Beyond this methodologic work, research should also be conducted as to the acceptability of these outcomes and the interpretability of their target parameters, which may be less familiar than conventional effect measures such as ORs or risk ratios. This research should involve all stakeholders in the clinical research ecosystem, including patients, clinicians, guideline writers, and regulators, to ensure adequate knowledge translation of results arising from trials using these outcomes.

Contributor Information

Sean W X Ong, Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada; Department of Infectious Diseases, University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia; Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada; Victorian Infectious Diseases Service, Royal Melbourne Hospital, Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia.

Robert K Mahar, Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Parkville, Victoria, Australia; Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia; Methods and Implementation Support for Clinical and Health Research Hub, University of Melbourne, Parkville, Victoria, Australia.

Chris J Selman, Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia; Department of Paediatrics, University of Melbourne, Parkville, Victoria, Australia.

Ruxandra Pinto, Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada; Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada.

Joshua S Davis, School of Medicine and Public Health, University of Newcastle, Newcastle, New South Wales, Australia; Department of Immunology and Infectious Diseases, John Hunter Hospital, Newcastle, New South Wales, Australia; Global and Tropical Health Division, Menzies School of Health and Research, Darwin, Northern Territory, Australia.

Robert A Fowler, Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada; Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada.

Steven Y C Tong, Department of Infectious Diseases, University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia; Victorian Infectious Diseases Service, Royal Melbourne Hospital, Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia.

Nick Daneman, Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada; Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada.

Supplementary Data

Supplementary materials are available at Clinical Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.

Notes

Acknowledgments. We thank Stephanie Sutjipto, MBBS and Pei Hua Lee, MBBS, who assisted with vetting the manuscript.

Financial support. This work was supported by funding to S. W. X. O., who conducted this research as part of his PhD studies, from the Melbourne Research Scholarship (University of Melbourne); the Emerging & Pandemic Infections Consortium and Connaught International Scholarship (University of Toronto); and the Queen Elizabeth II Graduate Scholarship in Science and Technology (Government of Ontario, Canada).

References

  • 1. Selman CJ, Lee KJ, Ferguson KN, Whitehead CL, Manley BJ, Mahar RK. Statistical analyses of ordinal outcomes in randomised controlled trials: a scoping review. Trials 2024; 25:241.
  • 2. Ong SWX, Petersiel N, Loewenthal MR, Daneman N, Tong SYC, Davis JS. Unlocking the DOOR-how to design, apply, analyse, and interpret desirability of outcome ranking endpoints in infectious diseases clinical trials. Clin Microbiol Infect 2023; 29:1024–30.
  • 3. Evans SR, Rubin D, Follmann D, et al. Desirability of outcome ranking (DOOR) and response adjusted for duration of antibiotic risk (RADAR). Clin Infect Dis 2015; 61:800–6.
  • 4. Hardy M, Harris PNA, Paterson DL, Chatfield MD, Mo Y; MERINO Trial Investigators. Win ratio analyses of piperacillin-tazobactam versus meropenem for ceftriaxone-nonsusceptible Escherichia coli or Klebsiella pneumoniae bloodstream infections: post hoc insights from the MERINO trial. Clin Infect Dis 2024; 78:1482–9.
  • 5. Daneman N, Rishu A, Pinto R, et al.; BALANCE Investigators, for the Canadian Critical Care Trials Group, the Association of Medical Microbiology and Infectious Disease Canada Clinical Research Network, the Australian and New Zealand Intensive Care Society Clinical Trials Group, and the Australasian Society for Infectious Diseases Clinical Research Network. Antibiotic treatment for 7 versus 14 days in patients with bloodstream infections. N Engl J Med 2024; 392:1482–9.
  • 6. Tong SYC, Lye DC, Yahav D, et al. Effect of vancomycin or daptomycin with vs without an antistaphylococcal β-lactam on mortality, bacteremia, relapse, or treatment failure in patients with MRSA bacteremia: a randomized clinical trial. JAMA 2020; 323:527–37.
  • 7. Iversen K, Ihlemann N, Gill SU, et al. Partial oral versus intravenous antibiotic treatment of endocarditis. N Engl J Med 2019; 380:415–24.
  • 8. Ferreira-Gonzalez I, Permanyer-Miralda G, Busse JW, et al. Methodologic discussions for using and interpreting composite endpoints are limited, but still identify major concerns. J Clin Epidemiol 2007; 60:651–7.
  • 9. Kahan BC, Morris TP, White IR, et al. Treatment estimands in clinical trials of patients hospitalised for COVID-19: ensuring trials ask the right questions. BMC Med 2020; 18:286.
  • 10. Palileo-Villanueva LM, Dans AL. Composite endpoints. J Clin Epidemiol 2020; 128:157–8.
  • 11. French B, Shotwell MS. Regression models for ordinal outcomes. JAMA 2022; 328:772–3.
  • 12. Banks JL, Marotta CA. Outcomes validity and reliability of the modified Rankin scale: implications for stroke clinical trials: a literature review and synthesis. Stroke 2007; 38:1091–6.
  • 13. WHO Working Group on the Clinical Characterisation and Management of COVID-19 infection. A minimal common outcome measure set for COVID-19 clinical research. Lancet Infect Dis 2020; 20:e192–7.
  • 14. Roozenbeek B, Lingsma HF, Perel P, et al. The added value of ordinal analysis in clinical trials: an example in traumatic brain injury. Crit Care 2011; 15:R127.
  • 15. Ceyisakar IE, van Leeuwen N, Dippel DWJ, Steyerberg EW, Lingsma HF. Ordinal outcome analysis improves the detection of between-hospital differences in outcome. BMC Med Res Methodol 2021; 21:4.
  • 16. Howard-Anderson J, Hamasaki T, Dai W, et al. Improving traditional registrational trial end points: development and application of a desirability of outcome ranking end point for complicated urinary tract infection clinical trials. Clin Infect Dis 2023; 76:e1157–65.
  • 17. Kinamon T, Gopinath R, Waack U, et al. Exploration of a potential desirability of outcome ranking endpoint for complicated intra-abdominal infections using nine registrational trials for antibacterial drugs. Clin Infect Dis 2023; 77:649–56.
  • 18. Howard-Anderson J, Hamasaki T, Dai W, et al. Moving beyond mortality: development and application of a desirability of outcome ranking (DOOR) endpoint for hospital-acquired bacterial pneumonia and ventilator-associated bacterial pneumonia. Clin Infect Dis 2024; 78:259–68.
  • 19. Johns BP, Dewar DC, Loewenthal MR, et al. A desirability of outcome ranking (DOOR) for periprosthetic joint infection—a Delphi analysis. J Bone Jt Infect 2022; 7:221–9.
  • 20. Doernberg SB, Tran TTT, Tong SYC, et al. Good studies evaluate the disease while great studies evaluate the patient: development and application of a desirability of outcome ranking endpoint for Staphylococcus aureus bloodstream infection. Clin Infect Dis 2019; 68:1691–8.
  • 21. Turner NA, Zaharoff S, King H, et al. Dalbavancin as an Option for Treatment of S. aureus bacteremia (DOTS): study protocol for a phase 2b, multicenter, randomized, open-label clinical trial. Trials 2022; 23:407.
  • 22. Finkelstein DM, Schoenfeld DA. Combining mortality and longitudinal measures in clinical trials. Stat Med 1999; 18:1341–54.
  • 23. Williams DJ, Creech CB, Walter EB, et al. Short- vs standard-course outpatient antibiotic therapy for community-acquired pneumonia in children: the SCOUT-CAP randomized clinical trial. JAMA Pediatr 2022; 176:253–61.
  • 24. Tong SYC, Mora J, Bowen AC, et al. The Staphylococcus aureus network adaptive platform trial protocol: new tools for an old foe. Clin Infect Dis 2022; 75:2027–34.
  • 25. Howard-Anderson J, Dai W, Yahav D, et al. A desirability of outcome ranking analysis of a randomized clinical trial comparing seven versus fourteen days of antibiotics for uncomplicated gram-negative bloodstream infection. Open Forum Infect Dis 2022; 9:ofac140.
  • 26. Adjusting for covariates in randomized clinical trials for drugs and biological products. Available at: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/adjusting-covariates-randomized-clinical-trials-drugs-and-biological-products. Accessed 13 March 2025.
  • 27. Acion L, Peterson JJ, Temple S, Arndt S. Probabilistic index: an intuitive non-parametric approach to measuring the size of treatment effects. Stat Med 2006; 25:591–602.
  • 28. Harrell F. If you like the Wilcoxon test you must like the proportional odds model. Statistical Thinking 2021. Available at: https://www.fharrell.com/post/wpo/. Accessed 13 March 2025.
  • 29. Buyse M. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Stat Med 2010; 29:3245–57.
  • 30. Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J 2012; 33:176–82.
  • 31. Dong G, Hoaglin DC, Qiu J, et al. The win ratio: on interpretation and handling of ties. Stat Biopharm Res 2020; 12:99–106.
  • 32. Dong G, Huang B, Verbeeck J, et al. Win statistics (win ratio, win odds, and net benefit) can complement one another to show the strength of the treatment effect on time-to-event outcomes. Pharm Stat 2023; 22:20–33.
  • 33. Harris PNA, Tambyah PA, Lye DC, et al. Effect of piperacillin-tazobactam vs meropenem on 30-day mortality for patients with E coli or Klebsiella pneumoniae bloodstream infection and ceftriaxone resistance: a randomized clinical trial. JAMA 2018; 320:984–94.
  • 34. McCullagh P. Regression models for ordinal data. J R Stat Soc Series B Stat Methodol 1980; 42:109–27.
  • 35. Gordon AC, Mouncey PR, Al-Beidh F, et al.; REMAP-CAP Investigators. Interleukin-6 receptor antagonists in critically ill patients with COVID-19. N Engl J Med 2021; 384:1491–502.
  • 36. Harrell F. Violation of proportional odds is not fatal. Statistical Thinking 2020. Available at: https://www.fharrell.com/post/po/. Accessed 13 March 2025.
  • 37. Liu A, He H, Tu XM, Tang W. On testing proportional odds assumptions for proportional odds models. Gen Psychiatr 2023; 36:e101048.
  • 38. Fullerton AS, Xu J. The proportional odds with partial proportionality constraints model for ordinal response variables. Soc Sci Res 2012; 41:182–98.
  • 39. Mao L, Wang T. A class of proportional win-fractions regression models for composite outcomes. Biometrics 2021; 77:1265–75.
  • 40. Thas O, Neve JD, Clement L, Ottoy JP. Probabilistic index models. J R Stat Soc Series B Stat Methodol 2012; 74:623–71.
  • 41. De Schryver M, De Neve J. A tutorial on probabilistic index models: regression models for the effect size P(Y1 < Y2). Psychol Methods 2019; 24:403–18.
  • 42. Song J, Verbeeck J, Huang B, et al. The win odds: statistical inference and regression. J Biopharm Stat 2023; 33:140–50.
  • 43. Clark TP, Kahan BC, Phillips A, White I, Carpenter JR. Estimands: bringing clarity and focus to research questions in clinical trials. BMJ Open 2022; 12:e052953.
  • 44. Evans SR, Follmann D. Using outcomes to analyze patients rather than patients to analyze outcomes: a step toward pragmatism in benefit:risk evaluation. Stat Biopharm Res 2016; 8:386–93.
  • 45. Gasparyan SB, Buenconsejo J, Kowalewski EK, et al. Design and analysis of studies based on hierarchical composite endpoints: insights from the DARE-19 trial. Ther Innov Regul Sci 2022; 56:785–94.
  • 46. Walker H, McLeman L, Meyran D, et al. Co-designing a novel ordinal endpoint for an adaptive platform trial, BANDICOOT, in pediatric hematopoietic stem cell transplant. Transplant Cell Ther 2025; 31:321.e1–e12.
  • 47. Whitehead J. Sample size calculations for ordered categorical data. Stat Med 1993; 12:2257–71.
  • 48. White IR, Marley-Zagar E, Morris TP, Parmar MKB, Royston P, Babiker AG. artcat: Sample-size calculation for an ordered categorical outcome. Stata J 2023; 23:3–23.
  • 49. Backer M, Sengar M, Mathews V, et al. Design of a clinical trial using generalized pairwise comparisons to test a less intensive treatment regimen. Clin Trials 2024; 21:180–8.
  • 50. Barnhart H, Lokhnygina Y, Matsouaka R, et al. Sample size and power calculations with win measures based on hierarchical endpoints. Stat Med 2025; 44:e70096.

Articles from Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America are provided here courtesy of Oxford University Press
