BMC Medical Research Methodology
. 2026 Feb 21;26:78. doi: 10.1186/s12874-026-02793-5

Statistical analysis of Likert-based ordinal scales: a guide for clinical trialists

Ahmed A Al-Jaishi 1,2,3, Meaghan S Cuerden 2, Bin Luo 1,2,3, Pavel S Roshanov 1,2, Amit X Garg 1,2,3
PMCID: PMC13063477  PMID: 41723363

Abstract

Background

Likert-based scales are a popular tool in clinical trials for assessing patient-reported outcomes. A key analytical decision involves whether to treat these data as binary, continuous, or ordinal. Each approach has implications for statistical power, bias, and interpretation of the results. In this report, we examine methods for evaluating Likert scales, with a particular focus on ordinal approaches, including win probability methods and proportional odds models.

Methods

We examined the use of proportional odds logistic regression, win probability estimation, dichotomisation with binary logistic regression, and linear regression for analysis of Likert scale-based outcomes. We applied these analytical approaches to patient-reported discomfort data from MyTEMP, a randomised trial comparing personalised cooler dialysate to standard-temperature dialysate in patients undergoing hemodialysis. We also conducted a simulation study to evaluate bias, coverage, and statistical power for each method under proportional and non-proportional odds scenarios across varying sample sizes and outcome distributions.

Results

In the MyTEMP trial, ordinal analyses showed patients receiving personalised cooler dialysate reported greater discomfort related to feeling cold than those receiving standard dialysate (win probability 64%, win difference 28%, win ratio 1.70; all p ≤ 0.001). The proportional odds model suggested an average twofold increase in the odds of greater discomfort for the intervention (odds ratio 2.25), though the model assumption was violated. A partial proportional odds model revealed stronger intervention effects at higher discomfort thresholds (e.g., nearly sixfold odds at the highest discomfort scores). Simulations demonstrated that ordinal methods (win probability and proportional odds models) generally had higher statistical power and lower bias than methods involving dichotomisation or treating ordinal data as continuous, particularly in the presence of skewed outcome distributions.

Conclusion

Our work demonstrated that analysing Likert-based outcomes using ordinal methods yields greater statistical power and more nuanced interpretations than dichotomisation or treating data as continuous.

Keywords: Win probability, Proportional odds model, Ordinal data, Likert scales

Introduction

Patient-reported outcomes are commonly used in clinical trials to provide insights into patients’ symptoms, functional status, and quality of life [1–4]. Our group and others are advancing several large, pragmatic, randomised controlled trials that collect patient-reported outcomes using Likert scale-based instruments [5–7]. These ordinal data have a specific order or rank between categories, but the intervals between categories are not necessarily equal (e.g., 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, and 5 = strongly agree) [8].

Choosing how to analyse such outcomes may not be straightforward. A key decision involves selecting a target estimand that aligns with the clinical question of interest. For example, investigators may be interested in detecting a shift in the outcome distribution, estimating the odds of being in a higher category, or identifying the proportion of patients who achieve a clinically meaningful threshold. Each estimand corresponds to a different analytical approach, which may treat the ordinal outcome as binary, continuous, or ordinal. These choices, in turn, affect statistical power, bias, interpretability, and the model’s robustness to assumptions [9–13]. Table 1 provides a non-comprehensive list and summary of techniques used to analyse Likert scale-based outcomes for two groups.

Table 1.

Summary of techniques for analysing Likert scale-based outcomes

Method Does it allow for the estimation of intervention effects? Does it allow for confounding adjustment? Ability to adjust for correlated outcomes? Description, target estimand, and considerations
Data exploration and visualisation No N/A N/A

- Outcome treated as ordinal or continuous.

- Estimand: not applicable.

- Summarises data (median, mode, interquartile range, frequency counts, percentages) and visualises data to identify basic patterns (e.g., histograms) and distribution.

- It is useful as a preliminary step and presents information in an easily understood manner, but it is not suitable for formal inference.

Non-parametric rank test (Wilcoxon/Mann–Whitney U) No (test only) No No

- Outcome treated as ordinal (or continuous for rank tests).

- Tests the null P(X > Y) + 0.5 × P(X = Y) = 0.5 (i.e., no stochastic dominance).

- Companion test for the probabilistic index (see next row).

- Appropriate when data do not meet the assumptions of parametric tests, but less powerful than parametric methods when those assumptions hold.

Probabilistic index / Win probability (effect size) Yes Yes Yes

- Outcome treated as ordinal.

- The target estimand is the win probability: P(X > Y) + 0.5 × P(X = Y), i.e., the probability that a randomly selected participant in one group has a better outcome than a randomly selected participant in the comparator group.

- Effect-size companion to Wilcoxon/Mann–Whitney U. Report alongside the Wilcoxon p-value.

- Reliance on distributional assumptions for estimating variance may vary depending on the statistical models or methods used [18, 19].

- It requires careful interpretation, especially when tied outcomes occur frequently.

Ordinal logistic regression / cumulative logit model Yes Yes Yes

- Outcome treated as ordinal.

- The target estimand is the common odds ratio of being at or above any given outcome category (except the lowest level) in one group versus another.

- Assumes that the odds of being at or above a given category (except the lowest level) in one group versus another are proportional across all thresholds of the ordinal outcome (i.e., the proportional odds [PO] assumption).

- Violating the PO assumption may require alternative models (e.g., partial PO-logistic regression) depending on the direction and strength of the PO violation. If the treatment effect varies in direction across cut-points, alternative target estimands need to be considered.

Linear models Yes Yes Yes

- Outcome treated as continuous.

- The target estimand is the difference in means between groups, though this may not always be appropriate for ordinal data.

- It is powerful for complex, multi-level designs but requires careful consideration of equidistant intervals, data structure, and model assumptions.

Binary outcome models Yes Yes Yes

- Outcome treated as binary.

- The target estimand is the odds ratio or relative risk for occurrence of a binary outcome derived from dichotomising the ordinal data, which compares the odds or probability of being above or below a threshold.

- Dichotomising an ordinal outcome leads to a loss of information, can produce different interpretations depending on the cut-point, and may reduce statistical power.

The rank test and probabilistic index are presented separately to distinguish hypothesis testing (p-value) from estimation (effect size). They refer to the same underlying framework

In this manuscript, we provide a structured guide for analysing Likert scale-based outcomes in clinical trials. We begin by reviewing candidate estimands for a superiority framework and examining how these estimands influence the choice of statistical method. We then describe common analytical approaches such as dichotomisation, linear modelling, win probability methods, and ordinal regression. Three ordinal methods are applied to data from a large pragmatic trial to demonstrate how inferential conclusions can differ depending on the chosen analysis. Finally, we present the results of a simulation study that systematically evaluates the operating characteristics of each method under varying assumptions about outcome distribution and treatment effect structure.

Target estimands for Likert-based outcomes

An estimand, as outlined in the ICH E9 (R1) addendum [14], is a precise description of the treatment effect a clinical trial aims to estimate, defined by the target population, treatment conditions, outcome variable, summary measure, and the strategy for handling intercurrent events; this definition ensures alignment between the trial’s objective, design, and analysis. Defining the target estimand is a key step in selecting the statistical approach [15]. For ordinal (Likert) outcomes, commonly considered target estimands include: (i) a proportional odds ratio capturing a shift in the outcome distribution; (ii) win-based quantities such as win probability, win odds, or win ratio; and (iii) threshold-based estimands (e.g., the risk difference for achieving a ≥ k-point improvement). These estimands can be used to quantify overall benefit or harm and may be expressed using patient-centred metrics such as win difference or number needed to treat (NNT).

The choice of estimand should precede and guide method selection. Estimands framed around mean or threshold differences may justify linear or binary models (see Treating ordinal data as continuous). By contrast, estimands concerned with ordinal shifts or pairwise comparisons necessitate (partial-) proportional odds (PO) models or rank-based methods (e.g., win probability). Defining the estimand upfront ensures that the analysis answers the correct clinical question, supports transparent interpretation, and aligns with regulatory and reporting standards [16, 17].

Clinical relevance of win ratio/odds for Likert-scale ordinal outcomes

For Likert-based outcomes, win ratio/odds can be clinically meaningful when “wins” reflect a prespecified, clinically important change and ties are handled transparently. Because Likert scales often yield many ties, dropping or under-weighting ties can exaggerate efficacy and produce statistically positive yet clinically trivial findings [18]. Pre-specifying the estimand and setting consistent assessment timepoint(s) also prevents differences in follow-up patterns from distorting effect measures [19].

Importantly, win-based procedures do not target a unique estimand by default: different pairing/aggregation choices (e.g., complete vs. stratified) target different causal quantities and, in heterogeneous populations, can reverse treatment recommendations unless the estimand and pairing strategy are prespecified [20]. For Likert outcomes, this implies: (i) defining the estimand in clinical terms (population, treatment conditions, outcome variable and assessment timepoint, summary measure, intercurrent-event strategy); (ii) aligning the pairing rule with the question (e.g., complete for population-level contrasts; stratified by baseline severity/site for “like-with-like”); and (iii) prespecifying the win rule (e.g., ≥ 2-point improvement) and tie rule (e.g., ties count ½). A sensitivity panel should probe robustness to pairing (complete vs. stratified vs. matched), win thresholds, and tie handling. Stable conclusions from sensitivity analyses would support clinical interpretability, whereas divergence should be reported and interpreted in light of which estimand best maps to patient-relevant benefit.
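To make the pairing, win-rule, and tie-rule choices concrete, here is a minimal Python sketch (the function name and toy scores are hypothetical, and the paper's own analyses were conducted in R) of an all-pairs win probability with a prespecified win margin and ties counted as half-wins:

```python
from itertools import product

def win_probability(treat, control, win_margin=1):
    """All-pairs win probability for an ordinal outcome where higher
    scores are 'better'. A pair is a win when the treated score exceeds
    the control score by at least win_margin; smaller differences in
    either direction are ties and count as half a win."""
    wins = ties = 0.0
    for x, y in product(treat, control):
        if x - y >= win_margin:
            wins += 1
        elif y - x >= win_margin:
            pass  # a loss contributes nothing
        else:
            ties += 1
    return (wins + 0.5 * ties) / (len(treat) * len(control))

treat, control = [3, 5, 5, 6, 7], [2, 3, 4, 4, 5]
print(win_probability(treat, control, win_margin=1))  # 0.82
print(win_probability(treat, control, win_margin=2))  # 0.74 (stricter win rule)
```

Re-running the estimate under several win margins and tie rules is one simple way to build the sensitivity panel described above.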

Analytical methods for Likert-based outcomes

In the sub-sections below, we discuss analytical methods for Likert scale-based outcomes and how the choice depends on the clinical question defined by the target estimand, data structure, and trial design. We focus on methods that allow for prognostic covariate adjustment, which can increase statistical power and precision of treatment effect estimates [21–24]. Prognostic covariate adjustment becomes especially important in trials that use restricted randomisation (e.g., covariate-constrained randomisation, minimisation) to achieve baseline balance on key covariates [21–24].

Dichotomising ordinal scales

A common practice is to dichotomise ordinal responses by selecting a cutoff point, a choice that could be arbitrary or clinically anchored (e.g., modified Rankin Scale 0–2 vs. 3–6 for post-stroke functional independence) [25–27]. While dichotomisation at a specific cut-point preserves the distinction between groups above and below that threshold, it loses the full ordinal information of the original scale. As a result, it can obscure nuanced differences between categories and may lead to a loss of information [28–32]. Additionally, the choice of cut-point may lack clinical justification, and sometimes, different cut-points lead to different conclusions [33]. Any decision to dichotomise an ordinal outcome should be prespecified in the protocol and justified by a clinically relevant threshold that addresses the research question of interest; post-hoc selection of cut-points after reviewing the outcome distribution should be avoided or labelled as exploratory [16]. Finally, assessing different “optimal” cut-points may increase the risk of a type I error (false positive result) if multiple tests are not taken into account [33, 34].

Despite these drawbacks, dichotomisation may be appropriate in some scenarios. For example, when response distributions are heavily skewed (e.g., responses cluster at the lowest or highest categories), collapsing the scale into a binary outcome (e.g., zero vs. above zero) may simplify interpretation and analysis without a significant loss in statistical power [35, 36].
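A small hypothetical example (Python; scores invented for illustration) of how the cut-point can drive, and even reverse, the conclusion from a dichotomised analysis:

```python
def prop_above(scores, cutoff):
    """Proportion of participants at or above a dichotomisation cut-point."""
    return sum(s >= cutoff for s in scores) / len(scores)

# Hypothetical 0-10 discomfort scores for two arms, constructed so that
# the apparent direction of effect flips with the cut-point.
arm_a = [3, 3, 3, 3, 0]
arm_b = [0, 0, 0, 7, 7]
print(prop_above(arm_a, 3), prop_above(arm_b, 3))  # 0.8 0.4 -> arm A looks worse
print(prop_above(arm_a, 7), prop_above(arm_b, 7))  # 0.0 0.4 -> arm B looks worse
```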

Treating ordinal data as continuous

The decision around whether to analyse ordinal outcomes using linear models requires consideration of the meaning, spacing of the outcome categories, and number of categories [10, 28, 29, 37–39]. If the labels between categories represent equal, well-defined intervals (e.g., an 11-point 0–10 pain numerical rating scale, where each increment denotes “a unit increase on the scale”), then the mean and mean difference may retain a clear interpretation. In contrast, for typical 4‑ or 5‑point Likert items (‘strongly disagree’ → ‘strongly agree’), the distances between categories are subjective and potentially uneven; analysing such data as continuous can produce estimates without an obvious clinical meaning.

Rank-based and non-parametric methods

The Mann-Whitney U test, Wilcoxon rank-sum test, and Kruskal-Wallis H test are commonly used non-parametric methods for comparing ordinal outcomes across groups [40, 41]. These tests are robust to outliers and require no assumptions about the distribution of the outcome variable [10, 40–44]. However, they typically have lower statistical power than parametric methods when the data meet parametric assumptions. Traditional non-parametric tests also do not easily accommodate covariate adjustment or hierarchical data structures, limiting their utility in more complex analyses [45].

A more flexible rank-based method is the win probability, which estimates the probability that a randomly selected individual from the intervention group will have a better outcome than a randomly selected individual from the control group (Eq. 1). The win probability is closely related to the Mann-Whitney U statistic and can be estimated by comparing all possible pairs between groups. Each comparison is scored as a win, loss, or tie (counted as a half‑win and half‑loss). The method can be extended to accommodate covariate adjustment and correlation structures, using regression-based approaches such as those proposed by Zou and colleagues [21, 45–47].

Let X and Y denote the outcome for a randomly selected individual from the intervention and control group, respectively. Then,

win probability = P(X > Y) + 0.5 × P(X = Y) (1)

where P(X > Y) is the probability that a randomly selected individual from the intervention group ‘X’ has a better (or more favorable) outcome than a randomly selected individual from the control group ‘Y’; and P(X = Y) is the probability that two individuals have the same outcome. A win probability of 1 (or 100%) means the intervention wins every comparison; a value of 0 (or 0%) means it loses every comparison; and a value of 0.5 (or 50%) indicates the probability of wins is equal between groups.

Zou et al. [45] demonstrated that the win probability can be estimated using linear regression models with or without a logistic transformation, allowing for covariate adjustment. While the win probability offers a valuable measure of the treatment effect, it serves as a more implicit between-group comparison because the probability is embedded within the definition of a “win,” and it summarises the comparison between two groups. Win probability can be transformed into a win difference or win ratio to provide a more explicit measure of effect size. For example, the win difference (calculated as 2 × win probability − 1) shifts the scale from [0, 1] to [−1, 1], centring the effect size around zero [48, 49]. This transformation allows for a direct interpretation of the magnitude and direction of the treatment effect, analogous to how mean differences are interpreted in continuous data analysis. A win difference of zero indicates the treatment has no effect, positive values favour the treatment, and negative values favour the control. Alternatively, the win ratio, calculated as win probability/(1 − win probability), provides another explicit measure by comparing the number of wins between groups [45, 49–51]. A win ratio of one indicates no treatment effect, values greater than one favour the treatment, and values less than one favour the control. This measure is analogous to relative risks, which are familiar in clinical research.
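The win difference and win ratio transformations are simple arithmetic; a minimal Python sketch (function name hypothetical):

```python
def win_transforms(wp):
    """Map a win probability onto the win difference and win ratio scales."""
    win_difference = 2 * wp - 1   # [0, 1] -> [-1, 1], centred at 0
    win_ratio = wp / (1 - wp)     # 1 means no treatment effect
    return win_difference, win_ratio

wd, wr = win_transforms(0.64)
print(round(wd, 2), round(wr, 2))  # 0.28 1.78
```

Note that directly transforming the point estimate of 0.64 gives a win ratio of about 1.78, slightly different from the 1.70 reported later for MyTEMP, where the win ratio was estimated within the regression framework rather than by transforming a rounded win probability.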

PO-logistic and partial PO-logistic

The PO-logistic model is a type of ordinal logistic regression used to estimate the odds of participants in the intervention group having an outcome at or above a particular category compared to participants in the control group [11, 13, 39, 52]. A key assumption of the PO-logistic model is that the relationship between predictors and the odds of achieving higher outcome categories (versus lower categories) is consistent across all thresholds of the ordinal outcome. This assumption is known as the PO assumption. When the PO assumption holds, the odds ratio from this model directly relates to non-parametric measures such as the Mann-Whitney U statistic and win probability [44, 45, 47].

The approximation for the win probability in terms of the odds ratio is shown in Eq. 2 [53].

win probability ≈ OR^0.66 / (1 + OR^0.66) (2)
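Assuming Eq. 2 is the commonly cited approximation win probability ≈ OR^0.66 / (1 + OR^0.66), a quick Python check (function name hypothetical) shows that the MyTEMP PO-logistic odds ratio of 2.25 reported below maps to roughly the 63% win probability obtained from the rank-based analysis:

```python
def wp_from_or(odds_ratio):
    """Approximate the win probability from a proportional-odds ratio
    via OR**0.66 / (1 + OR**0.66)."""
    t = odds_ratio ** 0.66
    return t / (1 + t)

print(round(wp_from_or(2.25), 2))  # 0.63
print(wp_from_or(1.0))             # 0.5 (no treatment effect)
```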

Formal global tests for assessing the PO assumption, such as the Brant or score test, can lack statistical power in small samples and flag trivial deviations in large samples. Consequently, they should be treated as preliminary screens and supplemented with visual diagnostics (e.g., plots of category-specific log odds, partial residual plots) and, when indicated, alternative models such as partial PO-logistic models [54–58]. These tools clarify whether the treatment effect (and any baseline prognostic covariates) remains constant across different thresholds of the ordinal outcome. When the PO assumption is violated, partial PO-logistic models relax the constraint by allowing selected coefficients to vary across outcome categories [39, 58–61].
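The category-specific log odds diagnostic can also be inspected numerically: under PO, the threshold-specific log odds ratios should be roughly constant. A stdlib-only Python sketch with invented data (function name hypothetical; in practice, model-based diagnostics such as the Brant test and the plots cited above should be preferred):

```python
import math

def threshold_log_odds_ratios(treat, control, categories):
    """For each cut-point k (all but the lowest category), the log odds
    ratio of scoring >= k in the treatment vs. control group. Under the
    PO assumption these should be roughly equal across thresholds."""
    def log_odds(scores, k):
        p = sum(s >= k for s in scores) / len(scores)
        return math.log(p / (1 - p))  # assumes 0 < p < 1 at each cut-point
    return {k: log_odds(treat, k) - log_odds(control, k)
            for k in categories[1:]}

treat = [1, 2, 2, 3, 3, 3]   # hypothetical 1-3 ordinal scores
control = [1, 1, 2, 2, 3]
lor = threshold_log_odds_ratios(treat, control, [1, 2, 3])
print({k: round(v, 2) for k, v in lor.items()})  # {2: 1.2, 3: 1.39}
```

Here the two thresholds give similar log odds ratios, consistent with PO; the MyTEMP data below show the kind of divergence across thresholds that signals a violation.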

The number needed to treat

Effect size measures such as the Number Needed to Treat (NNT) are valuable in clinical decision-making because they provide an interpretable estimate of an intervention’s impact on patient outcomes. One approach uses win difference, which can then be used to calculate the average number of patients who need to receive the treatment to observe one additional favourable outcome along the ordinal scale, who would otherwise not have had the outcome if they received the control (Eq. 3) [40, 62].

NNT = 1 / win difference (3)

When calculating the NNT using the generalised win difference derived from the win probability, it’s important to consider the design and directionality of the ordinal scale. Depending on whether higher or lower scores indicate better outcomes, positive or negative values can reflect either the NNT or the Number Needed to Harm (NNH).
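Eq. 3 in code, using the MyTEMP win difference reported below (function name hypothetical; the absolute value handles scales where a “win” is a worse outcome, in which case the result is read as an NNH):

```python
def nnt_from_win_difference(win_difference):
    """Eq. 3: NNT (or NNH when wins denote worse outcomes) as the
    reciprocal of the generalised win difference."""
    return 1 / abs(win_difference)

print(round(nnt_from_win_difference(0.28), 2))  # 3.57 -> treat ~4 patients
```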

Case study: MyTEMP PRO trial

We applied the statistical methods above to data from a single Likert-based question in the MyTEMP PRO (patient-reported outcomes), a sub-study within the MyTEMP trial. MyTEMP was a pragmatic, two-arm, parallel-group, registry-based, open-label, cluster-randomised, superiority trial conducted in 84 hemodialysis centres in Ontario, Canada [5, 63, 64]. The trial assessed whether a personalised cooler dialysate, implemented as a centre-wide policy, reduced the risk of cardiovascular-related death or hospital admission compared with standard temperature dialysate. In a PRO substudy, patients were surveyed in 10 of the 84 dialysis centres about their discomfort level related to feeling cold during hemodialysis treatment, using an 11-point numerical rating scale anchored at 0 = ‘no discomfort’ and 10 = ‘worst possible discomfort’; integers 1–9 denote progressively greater discomfort [5]. For illustrative purposes, we will ignore the effect of correlated outcome data within centres (i.e., clustering) below; however, similar considerations apply to correlated data. All analyses were conducted using R version 4.4.2 (R Core Team, 2024) [65].

Results from MyTEMP PRO

In the PRO substudy, 328 of 345 patients responded to the survey question of interest. A threshold of ≥ 7 on the Likert scale was considered ‘severe’ to align with previous validation work [5, 66]. In the intervention group (personalised cooler dialysate), 47% of patients reported being substantially bothered by feeling cold (score ≥ 7), compared to 29% in the control group (standard temperature dialysate). Fig. 1 illustrates the frequency and cumulative percentage of patients reporting discomfort in each Likert category for both groups. In the original MyTEMP PRO trial analysis (accounting for clustering), a GEE model with a log link estimated a relative risk of 1.63 (99% CI 1.06–2.52) for being ‘very bothered’ (score ≥ 7) by cold symptoms, while a cumulative logit GEE gave an odds ratio of 2.22 (99% CI 1.15–4.35) across the full 0–10 scale, both indicating greater discomfort with cooler dialysate [5].

Fig. 1.


A Frequency distribution and (B) Cumulative percentages for reporting of cold symptom severity. Patients in the cooler dialysate (intervention) consistently reported higher levels of discomfort than patients in the standard-temperature dialysate group (control)

The Mann-Whitney U test showed a significant difference in the outcome distribution between groups (p < 0.001). We converted the U statistic into a win probability measure, estimating a 63% (95% CI: 57%–70%, p < 0.001) probability that a randomly selected patient from the intervention group would report higher discomfort than a randomly selected patient from the control group. We also used the methodology from Zou et al. [45] to estimate a win probability of 64% (95% CI: 57%–70%, p < 0.001), corresponding to a win difference of 28% (95% CI: 14%–40%, p = 0.001) and a win ratio of 1.70 (95% CI: 1.33–2.33, p = 0.001), with confidence intervals and p-values derived using a logistic transformation (Table 2). Since the intervention resulted in worse outcomes compared to the control, we calculated a NNH of 3.57 (95% CI: 2.50–7.14, p = 0.001). This means that approximately four patients would need to receive the intervention for one additional patient to experience a worse symptom score who would otherwise not have been negatively affected.

Table 2.

Interpretation of results from the MyTEMP PRO study

Estimand Effect size Interpretation
Win probability estimated using Mann-Whitney U Test (Appendix 1)

U Statistic: 14,834

SEU = 801

n1 = 223

n2 = 105

The Mann-Whitney U statistic can be derived from the sum of ranks for each group, and it is related to the probability that a randomly selected individual from one group has a better outcome than a randomly selected individual from the other group. Specifically:

win probability = U / (n1 × n2) = 14,834 / (223 × 105) = 0.633

SE(win probability) = SEU / (n1 × n2) = 801 / 23,415 = 0.0342

Confidence interval: win probability ± Z × SE(win probability) = 0.633 ± 1.96 × 0.0342 = (0.57–0.70)

There is a 63% (95% CI: 57%–70%) probability that a randomly selected individual from the intervention arm reports a higher discomfort level than a randomly selected individual from the control arm.
Win Probability estimated by linear regression of win fractions (Appendix 1) 64% (95% CI: 57%–70%) There is approximately a 64% (95% CI: 57%–70%) probability that a randomly selected individual from the intervention arm reports a higher discomfort level than a randomly selected individual from the control arm.
Win difference estimated by linear regression of win fractions (Appendix 1) 28% (95% CI: 14%–40%) Patients in the intervention arm are 28 percentage points (95% CI: 14%–40%) more likely to report higher discomfort levels than those in the control arm.
Win ratio estimated by linear regression of win fractions (Appendix 1) 1.70 (95% CI: 1.33–2.33) A randomly selected individual in the intervention arm is 1.70 (95% CI: 1.33–2.33) times more likely to report a higher discomfort level than a randomly selected individual in the control arm.
Odds ratio estimated using PO-logistic model (Appendix 1) 2.25 (95% CI: 1.49–3.39) Patients in the intervention arm have 2.25 times higher odds (95% CI: 1.49–3.39) of reporting higher discomfort levels than those in the control group.
Odds ratio estimated using Partial PO-logistic model (Appendix 1) Cutoff value ≥ threshold: Odds ratio (95% CI) Odds ratios with 95% confidence intervals showing the odds of being at or above various discomfort thresholds between the intervention and control groups. The intervention group consistently had higher odds of greater discomfort across nearly all thresholds. For instance, at the threshold distinguishing ratings 2 + vs. below 2, the odds ratio was 2.24 (95% CI: 1.35–3.73), indicating more than twice the odds of higher discomfort in the intervention group compared to the control group.
0 1.85 (1.00–3.41)
1 1.63 (0.94–2.85)
2 2.24 (1.35–3.73)
3 2.06 (1.27–3.34)
4 2.45 (1.52–3.94)
5 2.10 (1.30–3.40)
6 2.19 (1.33–3.60)
7 2.00 (1.17–3.41)
8 5.80 (2.41–13.94)
9 4.64 (1.78–12.11)

Using the PO-logistic model, patients in the intervention group had 2.25 times (95% CI: 1.49–3.39, p < 0.001) higher odds of reporting a higher discomfort level than those in the control group. Visual diagnostics (Fig. 2) revealed non-parallel cumulative log odds [67]. As a result, we also fitted a partial PO-logistic model to compare discomfort levels.

Fig. 2.


The probability-probability (P-P) plot displays the cumulative probabilities of ordinal scores (0–10) for the control (x-axis) and intervention (y-axis) groups. All data points lie below the reference line (y = x), indicating that the control group consistently has higher cumulative probabilities across most scores. The varying vertical deviations (0.08 to 0.21) demonstrate inconsistent differences between groups, suggesting that the PO assumption is likely violated

Table 2 presents the ORs and corresponding 95% CIs from the partial PO-logistic model for being at or above each discomfort threshold when comparing the intervention to the control; this model is appropriate when the PO assumption is violated. Patients in the intervention group consistently reported greater discomfort than the control group at nearly every threshold. For example, at the threshold between reporting a discomfort rating of greater than 2 versus 2 or less, the OR was 2.24 (95% CI: 1.35–3.73), indicating that the intervention group had more than double the odds of being in a higher discomfort category than their control counterparts. At higher thresholds (e.g., > 8 vs. ≤ 8), the intervention group had nearly six times the odds of reporting greater discomfort (OR = 5.80, 95% CI: 2.41–13.94), underscoring a pronounced divergence at the higher end of the discomfort spectrum.

Putting the MyTEMP PRO results into perspective

Our analyses using a Mann-Whitney U test and the corresponding win probability provided an overall measure of the difference in the distribution of discomfort scores between the intervention and control groups, suggesting that individuals receiving personalised cooler dialysate reported more discomfort than those receiving standard temperature dialysate. The PO-logistic model provided a single summary odds ratio, which supported the conclusion that the intervention group, on average, had higher odds of experiencing greater discomfort. However, the violation of the PO assumption indicated that the relationship between group assignment and discomfort level was not consistent across all thresholds of the Likert scale. By using a partial PO-logistic model, we obtained a more nuanced understanding of these differences rather than relying on a single averaged effect. The partial PO-logistic model indicated that at lower thresholds, the difference between the intervention and control groups was moderate. In contrast, the odds of the intervention group reporting extreme discomfort were higher at higher thresholds of the ordinal outcome. That is, although the odds ratio hovered around twofold in the mid-range of the scale (e.g., at the 2 + thresholds), it increased dramatically, by about sixfold, at the higher thresholds (> 8).

Simulation study: aims, estimands and design

We conducted a simulation study to evaluate the performance of various non-parametric and parametric methods for analysing ordinal outcomes in an individual-level randomised controlled trial. Specifically, we assessed the methods under scenarios where the PO assumption does or does not hold. Our simulation study follows the guidelines for planning and reporting simulations as proposed by Morris et al. (2019), which involves defining aims, data-generating mechanisms, estimands, methods, and performance measures (the ‘ADEMP’ framework) [68].

Under the PO assumption

Individuals were assigned to the treatment or control group with equal probability (1:1 ratio) using simple randomisation. Ordinal outcome data were generated according to the following PO logistic regression model:

logit[P(Yi ≥ k | Xi)] = αk + β1Xi

where Yi is the ordinal outcome for individual i, Xi is the treatment indicator (1 for treatment, 0 for control), αk are the fixed cut-point-specific intercepts for category k, and β1 represents the treatment effect. The probabilities for the ordinal categories in the control group were specified to match two outcome distributions often seen in practice: (1) symmetrical outcome distribution of (0.20, 0.10, 0.05, 0.05, 0.05, 0.10, 0.05, 0.05, 0.05, 0.10, 0.20) and (2) skewed with heavy 0 outcome distribution (0.6, 0.10, 0.04, 0.04, 0.04, 0.04, 0.02, 0.02, 0.02, 0.02, 0.06). The number of ordinal categories, denoted k, was fixed at 11 to mirror the 0–10 numerical rating format of the Edmonton Symptom Assessment System–Revised (ESAS-r renal), the scale used in the MyTEMP trial [5, 66].
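The data-generating step can be sketched with inverse-CDF sampling from the cumulative-logit model (a Python sketch of the simulation logic under stated assumptions; the authors’ actual code is in R, and all names here are hypothetical):

```python
import math
import random

def expit(z):
    return 1 / (1 + math.exp(-z))

def sample_ordinal(control_probs, log_or, treated, rng):
    """Draw one ordinal outcome (0..K) from the PO model
    logit P(Y >= k | X) = alpha_k + beta1 * X, where the alpha_k are
    chosen to reproduce control_probs in the control arm."""
    k_max = len(control_probs) - 1
    ge_probs = []
    for k in range(1, k_max + 1):
        c = sum(control_probs[k:])          # P(Y >= k) in controls
        alpha_k = math.log(c / (1 - c))     # cut-point intercept
        shift = log_or if treated else 0.0  # beta1 * X
        ge_probs.append(expit(alpha_k + shift))
    # Inverse-CDF draw: Y is the largest k with u < P(Y >= k)
    u = rng.random()
    y = 0
    for k, p in enumerate(ge_probs, start=1):  # P(Y >= k) decreases in k
        if u < p:
            y = k
        else:
            break
    return y

rng = random.Random(2024)
probs = [0.6, 0.10, 0.04, 0.04, 0.04, 0.04, 0.02, 0.02, 0.02, 0.02, 0.06]
control = [sample_ordinal(probs, math.log(0.5), False, rng) for _ in range(2000)]
print(sum(y >= 1 for y in control) / len(control))  # close to 0.40 by construction
```

Setting treated=True with a log odds ratio of log(0.5) shifts the cumulative probabilities down at every cut-point, exactly the proportional-odds structure described above.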

Under the violation of the PO assumption

For scenarios in which the PO assumption did not hold, outcomes were generated from a cumulative-logit model that permits threshold-specific treatment effects:

logit[P(Yi > k | Xi)] = αk + (β1 + δk)Xi,  k = 0, 1, …, 9

where β1 denotes the treatment effect at the cut-point above category 0 of the ordinal outcome (δ0 = 0), the cut-point-specific effect is β1 + δk when k = 1,…,9, and δk induces departures from proportionality at each cumulative cut-point. We explored two magnitudes of departure:

Departure type δk specification (logit scale)
Weak PO violation …
Strong PO violation …

We varied β1 between −0.693 and 0 (i.e., an odds ratio ranging from 0.50 to 1.00) while applying each pattern of δk. Results for the strong violation are presented in the main text because they represent a “worst-case” scenario for model robustness; the corresponding findings for the weak PO violation are summarised in Appendix 2.

Simulation settings and data generation process

We conducted 1,000 simulations per scenario and assessed the performance of the statistical methods by varying the total sample size (N total = 300, 600, 1200, 2400), the odds ratio for the treatment effect (0.5 to 1.0, in increments of 0.10), and the distribution of outcome scores as described above.

Statistical methods evaluated

The following statistical methods were evaluated.

  1. PO-logistic regression.

  2. Zou et al.’s [45] method for calculating the win probability and variance estimates, with confidence intervals derived using a logistic (logit) transformation.

  3. Logistic regression with dichotomised outcome, using ≥ 7 as the threshold. This threshold was chosen because ESAS‑r (renal) validation work classifies scores 7–10 as ‘severe’, and the MyTEMP PRO analysis used the same cut‑point to denote clinically important discomfort [66].

  4. Linear regression treating ordinal outcomes as continuous.
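For intuition, the win probability point estimate in method 2 is the Mann-Whitney U statistic scaled to a probability, with ties counted as half a win. A minimal Python sketch (Zou et al.'s variance estimator and logit-transformed confidence interval are omitted):

```python
import numpy as np

def win_probability(y_trt, y_ctrl):
    """P(a randomly chosen treatment patient scores higher than a randomly
    chosen control patient), ties counted as half a win: the Mann-Whitney U
    statistic divided by n_trt * n_ctrl."""
    a = np.asarray(y_trt)[:, None]
    b = np.asarray(y_ctrl)[None, :]
    return (a > b).mean() + 0.5 * (a == b).mean()

wp = win_probability([3, 5, 7], [1, 5, 2])   # 7 wins + 1 tie out of 9 pairs
win_diff = 2 * wp - 1                        # win difference
win_odds = wp / (1 - wp)                     # win odds
```

In MyTEMP terms, a win probability of 0.64 yields a win difference of 2 × 0.64 − 1 = 0.28; note that the win odds WP/(1 − WP) differs from a win ratio computed by excluding tied pairs.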

Performance metrics

We evaluated the methods using absolute bias, empirical coverage of 95% confidence intervals (CIs), and statistical power for detecting a treatment effect. We used the odds ratio to approximate the true win probability as:

$$\mathrm{WP} = \frac{\mathrm{OR}^{0.66}}{1 + \mathrm{OR}^{0.66}} = \frac{e^{0.66\beta_1}}{1 + e^{0.66\beta_1}}$$

where β_1 is the data-generating log-odds ratio in the proportional-odds scenario. When treating the ordinal outcome as continuous, the treatment effect was computed as follows:

$$\Delta = \sum_{k=0}^{10} k \, \Pr(Y = k \mid X = 1) \;-\; \sum_{k=0}^{10} k \, \Pr(Y = k \mid X = 0)$$

where the first term is the expected value (mean) of Y in the intervention group and the second term is the expected value of Y in the control group. The full R code used to generate the data and assess performance metrics can be found on our GitHub page (https://github.com/PragmaticTrialServices/Statistical-Analysis-of-Likert-based-Ordinal-Scales).
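The true values against which bias was measured can be sketched as follows. This is an illustrative Python reconstruction, not the authors' R code; it assumes the Harrell-style approximation WP = OR^0.66 / (1 + OR^0.66) referenced in Appendix 2 and the expected-value formula above.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def true_win_probability(log_or):
    """Harrell-style approximation mapping a PO log-odds ratio to a win
    probability: WP = OR**0.66 / (1 + OR**0.66) (assumed form)."""
    return expit(0.66 * log_or)

def true_mean_difference(control_probs, log_or):
    """E[Y | treatment] - E[Y | control] implied by the PO data-generating model."""
    control_probs = np.asarray(control_probs, dtype=float)
    scores = np.arange(control_probs.size)        # 0, 1, ..., 10
    cum_ctrl = np.cumsum(control_probs)[:-1]
    alpha = np.log(cum_ctrl / (1 - cum_ctrl))
    cum_trt = expit(alpha + log_or)
    trt_probs = np.diff(np.concatenate(([0.0], cum_trt, [1.0])))
    return scores @ trt_probs - scores @ control_probs

symmetric = [0.20, 0.10, 0.05, 0.05, 0.05, 0.10, 0.05, 0.05, 0.05, 0.10, 0.20]
wp_true = true_win_probability(np.log(0.5))       # approximately 0.39
md_true = true_mean_difference(symmetric, np.log(0.5))
```

Bias for each method is then the mean of the simulation estimates minus the corresponding true value.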

Simulation results

Symmetrical outcome distribution under a satisfied proportional-odds assumption

Empirical performance was consistent across the full range of sample sizes (N = 300 to 2,400) and treatment odds ratios (0.50 to 1.00). The PO-logistic, win probability, binary-logistic, and linear regression analyses maintained good coverage within ±2 percentage points of the nominal 95% target (94% to 97%). Absolute bias was generally trivial (often within 5% of the true estimate) for all models (Appendix 3). Even with some instances of inflated bias (e.g., greater than 5% absolute relative bias), nominal coverage was preserved, indicating no loss of inferential validity in practice. Power estimates from the PO-logistic, win probability, and linear regression models closely tracked the closed-form values obtained using Whitehead’s (1993) PO formula (data not shown) [69]. However, the binary logistic regression model often yielded statistical power 10 to 15 percentage points lower than PO-logistic regression (Table 3).

Table 3.

Estimated statistical power comparing four analytic methods as a function of total sample size, true treatment effect (odds ratio for β1), and whether the proportional odds (PO) assumption was met

Total sample size (N) Odds ratio, exp(βk=0,...9) Symmetrical distribution Heavy-zero distribution
PO Logistic Win probability Binary logistic Linear Partial-PO Logistic** PO Logistic Win probability Binary logistic Linear Partial-PO Logistic**
PO Assumption was met
300 0.50 94% 94% 81% 94% 64% 83% 83% 39% 71% 47%
0.60 71% 71% 54% 71% 34% 57% 57% 25% 47% 26%
0.70 40% 40% 29% 40% 19% 31% 32% 15% 27% 14%
0.80 18% 19% 14% 19% 8% 15% 16% 10% 14% 8%
0.90 8% 8% 7% 8% 6% 7% 7% 6% 7% 5%
1.00 4% 4% 4% 5% 5% 4% 4% 5% 5% 7%
600 0.50 100% 100% 98% 100% 94% 98% 98% 66% 94% 82%
0.60 94% 94% 86% 94% 68% 87% 87% 45% 78% 53%
0.70 72% 72% 59% 73% 35% 60% 60% 26% 49% 28%
0.80 37% 37% 27% 38% 14% 29% 30% 14% 23% 12%
0.90 13% 13% 8% 12% 6% 9% 9% 7% 10% 6%
1.00 5% 5% 5% 5% 5% 5% 5% 5% 5% 5%
1200 0.50 100% 100% 100% 100% 100% 100% 100% 94% 100% 99%
0.60 100% 100% 99% 100% 96% 99% 99% 75% 97% 87%
0.70 94% 94% 84% 94% 67% 86% 86% 47% 78% 54%
0.80 62% 62% 48% 61% 26% 52% 52% 24% 41% 20%
0.90 20% 20% 14% 19% 7% 14% 14% 8% 14% 7%
1.00 6% 6% 6% 5% 4% 6% 6% 5% 5% 6%
2400 0.50 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
0.60 100% 100% 100% 100% 100% 100% 100% 96% 100% 100%
0.70 100% 100% 99% 100% 96% 99% 99% 74% 97% 87%
0.80 89% 89% 73% 89% 51% 78% 78% 42% 68% 40%
0.90 30% 30% 24% 30% 11% 25% 25% 12% 23% 10%
1.00 3% 3% 4% 3% 3% 4% 4% 6% 4% 4%
PO Assumption was violated
Total sample size (N) Odds ratio, exp(βk=0) PO Logistic Win probability Binary logistic Linear Partial-PO Logistic** PO Logistic Win probability Binary logistic Linear Partial-PO Logistic**
300 0.50 99% 99% 98% 99% 88% 87% 87% 61% 85% 58%
0.60 92% 92% 91% 94% 70% 64% 64% 49% 68% 38%
0.70 73% 73% 77% 76% 49% 40% 40% 36% 47% 24%
0.80 48% 48% 56% 52% 32% 20% 21% 26% 30% 17%
0.90 27% 28% 37% 30% 21% 11% 11% 18% 17% 13%
1.00 15% 15% 25% 15% 24% € 5% 5% 13% 9% 15% €
600 0.50 100% 100% 100% 100% 100% 99% 99% 89% 99% 92%
0.60 100% 100% 100% 100% 97% 91% 91% 78% 92% 74%
0.70 95% 95% 96% 96% 85% 72% 72% 63% 78% 51%
0.80 80% 80% 87% 84% 66% 40% 40% 46% 53% 33%
0.90 52% 52% 68% 56% 47% 17% 17% 31% 30% 22%
1.00 25% 25% 41% 26% 59% 7% 7% 24% 14% 31% €
1200 0.50 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
0.60 100% 100% 100% 100% 100% 100% 100% 98% 100% 98%
0.70 100% 100% 100% 100% 100% 93% 93% 92% 97% 86%
0.80 97% 97% 99% 98% 94% 67% 67% 77% 83% 66%
0.90 81% 81% 92% 84% 83% 29% 29% 55% 52% 46%
1.00 47% 47% 72% 48% 91% 9% 9% 42% 25% 73%
2400 0.50 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
0.60 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
0.70 100% 100% 100% 100% 100% 100% 100% 99% 100% 99%
0.80 100% 100% 100% 100% 100% 93% 93% 96% 98% 93%
0.90 98% 98% 100% 99% 99% 51% 51% 83% 80% 79%
1.00 79% 79% 96% 81% 100% 12% 12% 70% 45% 98%

PO = proportional odds; β_{k=0} = treatment effect for the first set of categories on the ordinal scale

€ More than 10% of the PO-logistic regression models failed to converge

**We reported the power based on the likelihood ratio test. The null hypothesis being tested was that all category-specific treatment effects are simultaneously equal to zero. (βk=1 = βk=2 = ... = βk=9 = 0) versus the alternative hypothesis that at least one of the category-specific treatment effects (βk) is not zero

Heavy zero outcome distribution under a satisfied proportional-odds assumption

Empirical performance for the various methods was more variable across the range of simulation scenarios when the outcome distribution was highly skewed. The PO-logistic, binary logistic, and linear regression models maintained 95% CI coverage between 94% and 96% across all scenarios. However, win probability coverage became increasingly anti-conservative as the sample size grew, especially at the strongest treatment effects (e.g., OR ≤ 0.70), whereas coverage remained close to 95% for moderate or weak effects (Appendix 3). The absolute bias for the PO-logistic, win-probability, and linear-regression models remained acceptably low across nearly all scenarios. Binary logistic-regression estimates, however, showed appreciable bias (> 10% of the true value) when N = 300 and the odds ratio was 0.80 or higher (Appendix 3).

The PO-logistic and win-probability models were consistently the most statistically powerful (Table 3), while linear regression and the binary-logistic test trailed behind at modest sample sizes (N = 300 to 600). For example, at N = 300 and OR = 0.60, power was 57% for the ordinal methods compared to 47% for linear regression and 25% for binary logistic regression. As the sample size increased (N = 1200 to 2400), linear regression nearly caught up to the ordinal methods for strong effects, while binary logistic regression achieved comparable power only at odds ratios of 0.50 to 0.60. For moderate effects (OR ≈ 0.70–0.80), binary logistic regression still lagged the ordinal approaches by roughly 25 to 35 percentage points, whereas linear regression trailed by no more than 10 percentage points. All four methods maintained a type-I error rate near the nominal 5% when the true odds ratio was 1.

Symmetrical outcome distribution under a violated proportional-odds assumption

We omitted coverage and bias estimates for models that impose a common odds ratio because the true treatment effect varies across cumulative cut-points. Instead, we report statistical power, contrasting four analyses that assume proportional odds (PO-logistic, win-probability, binary logistic with a ≥ 7 vs. 0–6 split, and linear regression) with the correctly specified partial-PO logistic model evaluated using its omnibus LRT. The text below presents the scenario in which PO violations strengthen the average effect (δ₁–₆ = 0.15; δ₇–₉ = 0.30); results for weaker departures (δ₁–₆ = 0.05; δ₇–₉ = 0.10) and for violations that attenuate the effect are summarised in Appendices 2 & 3.

For strong treatment effects (OR ≤ 0.60), statistical power exceeded 80% for nearly all methods; however, the omnibus partial-PO LRT trailed the misspecified ordinal tests by 12 to 22 percentage points. For moderate treatment effects (OR = 0.70 to 0.80), the partial-PO model consistently had lower power to detect a treatment signal than the methods assuming proportional odds, although this disparity diminished as the sample size increased. For weak treatment effects, the partial-PO model became more sensitive than the PO methods at larger sample sizes, although it continued to underperform relative to the binary logistic model. Finally, when the average odds ratio was 1, rejection rates were inflated for every PO method, reflecting heterogeneity in threshold-specific effects (Fig. 3).

Fig. 3.

Fig. 3

Statistical power under a symmetric outcome distribution with a violated proportional-odds assumption. Each panel depicts power curves (y-axis) against the true odds ratio for β_{k=0} for each sample size: n = 300, 600, 1200, and 2400. Curves correspond to four analytic approaches: PO-logistic (solid black), binary logistic (dashed dark grey), linear regression (dotted mid grey), and the omnibus partial PO likelihood ratio test (dash-dot, thick black). Note: The curve for win probability was excluded because it was identical to the curve for PO logistic

Heavy zero outcome distribution under a violated proportional-odds assumption

Like the preceding scenario, this analysis assesses statistical power by comparing models that incorrectly assume proportional odds (PO-logistic, win probability, linear regression, and binary logistic regression) with the partial PO-logistic model (Table 3). For small samples, methods that assumed proportional odds retained a clear advantage; the partial-PO likelihood-ratio test showed noticeably lower power unless the effect was very strong (i.e., OR ≤ 0.60). With larger samples, this deficit shrank: with moderate samples the two approaches performed similarly for stronger and moderately strong effects, and with large samples they were virtually indistinguishable for all but the weakest treatment effects (OR > 0.90). In fact, when the true effect of β_{k=0} approached the null value, the partial-PO model eventually surpassed the PO methods in power, reflecting its greater sensitivity to threshold-specific departures once sufficient information was available (Fig. 4).

Fig. 4.

Fig. 4

Statistical power under a heavy-zero outcome distribution with a violated proportional-odds assumption. Each panel depicts power curves (y-axis) against the true odds ratio for β_{k=0} for each sample size: n = 300, 600, 1200, and 2400. Curves correspond to four analytic approaches: PO-logistic (solid black), binary logistic (dashed dark grey), linear regression (dotted mid grey), and the omnibus partial PO likelihood ratio test (dash-dot, thick black). Note: The curve for win probability was excluded because it was identical to the curve for PO logistic

Conclusion

Choosing the optimal method for analysing ordinal Likert outcomes hinges on the estimand, the outcome distribution and whether the proportional-odds (PO) assumption is tenable. Rank-based modelling approaches (e.g., win probability) allow for covariate adjustment while retaining robustness to distributional quirks. PO-logistic and, where needed, partial PO-logistic models report odds ratios familiar to clinicians and can offer high power when their assumptions hold. Our simulations underscore a classic lesson: forcing a common slope in the presence of non-parallel effects can erode power, particularly for weaker treatment effects and heavily skewed outcomes. Pre-specifying diagnostic checks for the PO assumption and adopting more flexible models when violations are plausible will safeguard both power and interpretability.

Acknowledgements

AA was supported by Western University.

Abbreviations

CI

Confidence Interval

NNH

Number Needed to Harm

NNT

Number Needed to Treat

LRT

Likelihood Ratio Test

OR

Odds Ratio

PO

Proportional Odds

PRO

Patient-Reported Outcome

P-P plot

Probability-Probability Plot

RCT

Randomised Controlled Trial

Appendix 1

The R code used for the MyTEMP analysis is available on our GitHub page (https://github.com/PragmaticTrialServices/Statistical-Analysis-of-Likert-based-Ordinal-Scales).

Appendix 2

The full results are available in a CSV file on our GitHub page (https://github.com/PragmaticTrialServices/Statistical-Analysis-of-Likert-based-Ordinal-Scales) for the simulation under the setting with weak PO violation (δ_k = 0.05 for k = 1–6; δ_k = 0.10 for k = 7–9). The column descriptions for the file are included below:

  • N: The total sample size (i.e., 2 × n per arm per simulated trial).

  • sym_tag: “Symmetric” vs. “Heavy0” label for the marginal outcome distribution (controlled by sym_flag).

  • po_tag: “POMet” when the proportional-odds assumption holds; “POViolated” when it is violated.

  • eff_tag: “attenuatedeffects” vs. “strengthenedeffects”, indicating whether the non-PO deltas attenuated or strengthened the treatment effect.

  • beta1_OR: True odds ratio used in the PO model, i.e., exp(β1).

  • poregbias: Bias of the PO model’s log-OR estimate: mean(β̂1) − β1.

  • poregcov: Empirical coverage probability of the 95% CI for the PO log-OR (proportion of simulations whose CI contained the true value).

  • Poregp: Empirical power (or type-I error) of the PO Wald test: proportion of simulations with p < 0.05 for the treatment effect in the PO model.

  • Wpbias: Bias of the win probability estimate: mean(ŴP) − WP, where WP is Harrell’s approximation.

  • Wpcov: Coverage probability of the 95% CI for the win probability.

  • Wpp: Empirical power of the win probability test (p < 0.05).

  • bin7bias: Bias of the binary-logistic regression coefficient at the ≥ 7 threshold.

  • bin7cov: Coverage probability of the 95% CI for that binary-logistic coefficient.

  • bin7p: Empirical power of the binary logistic regression.

  • Linbias: Bias of the linear regression estimate: mean(Δ̂) − Δ, where Δ is the true mean difference.

  • Lincov: Coverage probability of the 95% CI for the linear-regression slope.

  • Linp: Empirical power of the linear regression test.

  • Pobiasz: Relative bias (%) for the PO estimate: 100 × (mean(β̂1) − β1)/β1.

  • Wpbiasz: Relative bias (%) for the win-probability estimate (analogous to above).

  • bin7biasz: Relative bias (%) for the binary logistic estimate at the ≥ 7 threshold.

  • Linbiasz: Relative bias (%) for the linear-regression estimate.

  • sumnonpoErrFlag: Total count of simulation replicates where the VGAM non-PO fit flagged an error (i.e., the model did not converge).

  • nonpoLRTpower: Among converged non-PO fits, the empirical power of the likelihood-ratio test comparing non-PO vs. parallel PO (proportion with LRT p < 0.05).

  • nonpoConvProp: Proportion of replicates where the VGAM non-PO model converged successfully.

Appendix 3

The full results are available in a CSV file on our GitHub page (https://github.com/PragmaticTrialServices/Statistical-Analysis-of-Likert-based-Ordinal-Scales) for the simulation under the setting with strong PO violation (δ_k = 0.15 for k = 1–6; δ_k = 0.30 for k = 7–9). The column descriptions are given in Appendix 2.

Authors’ contributions

- AAA: Writing (original draft).
- AAA, MC, and AXG: Conceptualisation, project administration, and study supervision.
- AAA, MC, and BL: Methodology and formal analysis.
- AAA, MC, BL, PR, and AXG: Interpretation of the results and writing (review and editing).

Funding

This project was funded by Western University and a grant from CIHR.

Data availability

The code used to generate and analyse the simulated data is available on our GitHub page ( [https://github.com/PragmaticTrialServices/Statistical-Analysis-of-Likert-based-Ordinal-Scales](https://github.com/PragmaticTrialServices/Statistical-Analysis-of-Likert-based-Ordinal-Scales) ).

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have approved the final version of the manuscript and provided their consent for publication.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Masterson Creber R, Spadaccio C, Dimagli A, Myers A, Taylor B, Fremes S. Patient-reported outcomes in cardiovascular trials. Can J Cardiol. 2021;37(9):1340–52. 10.1016/j.cjca.2021.04.006.
  • 2. Calvert M, Kyte D, Mercieca-Bebber R, et al. Guidelines for inclusion of patient-reported outcomes in clinical trial protocols: the SPIRIT-PRO extension. JAMA. 2018;319(5):483–94. 10.1001/jama.2017.21903.
  • 3. Johnston BC, Patrick DL, Devji T, et al. Chapter 18: Patient-reported outcomes. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA, editors. Cochrane Handbook for Systematic Reviews of Interventions, version 6.5. Cochrane; 2024. https://training.cochrane.org/handbook/current/chapter-18. Accessed September 25, 2024.
  • 4. Tong A, Oberbauer R, Bellini MI, et al. Patient-reported outcomes as endpoints in clinical trials of kidney transplantation interventions. Transpl Int. 2022;35:10134. 10.3389/ti.2022.10134.
  • 5. Garg AX, Al-Jaishi AA, Dixon SN, et al. Personalised cooler dialysate for patients receiving maintenance haemodialysis (MyTEMP): a pragmatic, cluster-randomised trial. Lancet. 2022;400(10364):1693–703. 10.1016/S0140-6736(22)01805-0.
  • 6. NIH Collaboratory. NIH Pragmatic Trials Collaboratory. Rethinking Clinical Trials. 2024. https://rethinkingclinicaltrials.org/about-nih-collaboratory/. Accessed September 27, 2024.
  • 7. Western University. Pragmatic Trials Training Program. https://www.schulich.uwo.ca/pragmatictrialstraining/index.html. Accessed September 27, 2024.
  • 8. de Winter JCF. Five-point Likert items: t test versus Mann-Whitney-Wilcoxon (addendum added October 2012). Pract Assess Res Eval. 2010;15(1). 10.7275/bj1p-ts64.
  • 9. Greenhalgh T. How to read a paper: Statistics for the non-statistician. I: Different types of data need different statistical tests. BMJ. https://www.bmj.com/content/315/7104/364.long. Accessed September 30, 2024.
  • 10. Agresti A. Ordinal probabilities, scores, and odds ratios. In: Analysis of ordinal categorical data. John Wiley & Sons, Ltd; 2010. pp. 9–43. 10.1002/9780470594001.ch2.
  • 11. Agresti A. Logistic regression models using cumulative logits. In: Analysis of ordinal categorical data. John Wiley & Sons, Ltd; 2010. pp. 44–87. 10.1002/9780470594001.ch3.
  • 12. Agresti A. Clustered ordinal responses: marginal models. In: Analysis of ordinal categorical data. John Wiley & Sons, Ltd; 2010. pp. 262–80. 10.1002/9780470594001.ch9.
  • 13. Agresti A. Clustered ordinal responses: random effects models. In: Analysis of ordinal categorical data. John Wiley & Sons, Ltd; 2010. pp. 281–314. 10.1002/9780470594001.ch10.
  • 14. International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH). Addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials E9(R1). ICH; 2019. https://www.ich.org/page/efficacy-guidelines.
  • 15. Kahan BC, Hindley J, Edwards M, Cro S, Morris TP. The estimands framework: a primer on the ICH E9(R1) addendum. BMJ. 2024;384:e076316. 10.1136/bmj-2023-076316.
  • 16. US Food and Drug Administration. Guidance for industry: E9 statistical principles for clinical trials. FDA; 1998. https://www.fda.gov/media/71336/download.
  • 17. Lynggaard H, Bell J, Lösch C, et al. Principles and recommendations for incorporating estimands into clinical study protocol templates. Trials. 2022;23:685. 10.1186/s13063-022-06515-2.
  • 18. Butler J, Stockbridge N, Packer M. Win ratio: a seductive but potentially misleading method for evaluating evidence from clinical trials. Circulation. 2024;149(20):1546–8. 10.1161/CIRCULATIONAHA.123.067786.
  • 19. Mao L. Defining estimand for the win ratio: separate the true effect from censoring. Clin Trials. 2024;21(5):584–94. 10.1177/17407745241259356.
  • 20. Even M, Josse J. Rethinking the win ratio: a causal framework for hierarchical outcome analysis. arXiv preprint; January 28, 2025. 10.48550/arXiv.2501.16933.
  • 21. Kahan BC, Jairath V, Doré CJ, Morris TP. The risks and rewards of covariate adjustment in randomized trials: an assessment of 12 outcomes from 8 studies. Trials. 2014;15(1):139. 10.1186/1745-6215-15-139.
  • 22. Gao Y, Liu Y, Matsouaka R. When does adjusting covariate under randomization help? A comparative study on current practices. BMC Med Res Methodol. 2024;24(1):250. 10.1186/s12874-024-02375-3.
  • 23. Li F, Lokhnygina Y, Murray DM, Heagerty PJ, DeLong ER. An evaluation of constrained randomization for the design and analysis of group-randomized trials. Stat Med. 2016;35(10):1565–79. 10.1002/sim.6813.
  • 24. Li F, Turner EL, Heagerty PJ, Murray DM, Vollmer WM, DeLong ER. An evaluation of constrained randomization for the design and analysis of group-randomized trials with binary outcomes. Stat Med. 2017;36(24):3791–806. 10.1002/sim.7410.
  • 25. Selman CJ, Lee KJ, Ferguson KN, Whitehead CL, Manley BJ, Mahar RK. Statistical analyses of ordinal outcomes in randomised controlled trials: a scoping review. Trials. 2024;25(1):241. 10.1186/s13063-024-08072-2.
  • 26. Maguire J, Attia J. Which version of the modified Rankin scale should we use for stroke trials? Neurology. 2018;91(21):947–8. 10.1212/WNL.0000000000006533.
  • 27. Ganesh A, Luengo-Fernandez R, Wharton RM, Rothwell PM, on behalf of the Oxford Vascular Study. Ordinal vs dichotomous analyses of modified Rankin Scale, 5-year outcome, and cost of stroke. Neurology. 2018;91(21):e1951–60. 10.1212/WNL.0000000000006554.
  • 28. MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychol Methods. 2002;7(1):19–40. 10.1037/1082-989x.7.1.19.
  • 29. Harrell FE. Ordinal logistic regression. In: Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer International Publishing; 2015. pp. 311–25. 10.1007/978-3-319-19425-7_13.
  • 30. Harrell FE. Case study in ordinal regression, data reduction, and penalization. In: Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer International Publishing; 2015. pp. 327–58. 10.1007/978-3-319-19425-7_14.
  • 31. Roozenbeek B, Lingsma HF, Perel P, et al. The added value of ordinal analysis in clinical trials: an example in traumatic brain injury. Crit Care. 2011;15(3):1–7. 10.1186/cc10240.
  • 32. Cohen J. The cost of dichotomization. Appl Psychol Meas. 1983;7(3):249–53. 10.1177/014662168300700301.
  • 33. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080. 10.1136/bmj.332.7549.1080.
  • 34. Bennette C, Vickers A. Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents. BMC Med Res Methodol. 2012;12:21. 10.1186/1471-2288-12-21.
  • 35. Strömberg U. Collapsing ordered outcome categories: a note of concern. Am J Epidemiol. 1996;144(4):421–4. 10.1093/oxfordjournals.aje.a008944.
  • 36. Sauzet O, Ofuya M, Peacock JL. Dichotomisation using a distributional approach when the outcome is skewed. BMC Med Res Methodol. 2015;15(1):40. 10.1186/s12874-015-0028-8.
  • 37. Carifio J, Perla R. Resolving the 50-year debate around using and misusing Likert scales. Med Educ. 2008;42(12):1150–2. 10.1111/j.1365-2923.2008.03172.x.
  • 38. Norman G. Likert scales, levels of measurement and the laws of statistics. Adv Health Sci Educ Theory Pract. 2010;15(5):625–32. 10.1007/s10459-010-9222-y.
  • 39. Harrell FE. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer International Publishing; 2015. 10.1007/978-3-319-19425-7.
  • 40. Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945;1(6):80–3. 10.2307/3001968.
  • 41. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60.
  • 42. Agresti A. Other ordinal logistic regression models. In: Analysis of ordinal categorical data. John Wiley & Sons, Ltd; 2010. pp. 88–117. 10.1002/9780470594001.ch4.
  • 43. Agresti A. Modeling ordinal association structure. In: Analysis of ordinal categorical data. John Wiley & Sons, Ltd; 2010. pp. 145–83. 10.1002/9780470594001.ch6.
  • 44. Rahlfs VW, Zimmermann H, Lees KR. Effect size measures and their relationships in stroke studies. Stroke. 2014;45(2):627–33. 10.1161/STROKEAHA.113.003151.
  • 45. Zou G, Zou L, Qiu SF. Parametric and nonparametric methods for confidence intervals and sample size planning for win probability in parallel-group randomized trials with Likert item and Likert scale data. Pharm Stat. 2023;22(3):418–39. 10.1002/pst.2280.
  • 46. Zou G. Confidence interval estimation for treatment effects in cluster randomization trials based on ranks. Stat Med. 2021;40(14):3227–50. 10.1002/sim.8918.
  • 47. Yu C. Nonparametric methods for analysis and sizing of cluster randomization trials with baseline measurements. Western University; 2023. https://ir.lib.uwo.ca/etd/9697. Accessed September 30, 2024.
  • 48. Buyse M. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Stat Med. 2010;29(30):3245–57. 10.1002/sim.3923.
  • 49. Song J, Verbeeck J, Huang B, et al. The win odds: statistical inference and regression. J Biopharm Stat. 2023;33(2):140–50. 10.1080/10543406.2022.2089156.
  • 50. Wang D, Pocock S. A win ratio approach to comparing continuous non-normal outcomes in clinical trials. Pharm Stat. 2016;15(3):238–45. 10.1002/pst.1743.
  • 51. Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J. 2012;33(2):176–82. 10.1093/eurheartj/ehr352.
  • 52. Derr B. Ordinal response modeling with the LOGISTIC procedure. SAS Institute Inc.; 2013. http://support.sas.com/resources/papers/proceedings13/446-2013.pdf. Accessed September 30, 2024.
  • 53. Harrell F. Equivalence of Wilcoxon statistic and proportional odds model. Statistical Thinking. April 6, 2022. https://www.fharrell.com/post/powilcoxon/. Accessed December 12, 2024.
  • 54. Holmgren EB. The P-P plot as a method for comparing treatment effects. J Am Stat Assoc. 1995;90(429):360–5. 10.2307/2291161.
  • 55. Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968;55(1):1–17. 10.2307/2334448.
  • 56. SAS Institute Inc. Plotting empirical (observed) logits for binary and ordinal response data. https://support.sas.com/kb/37/944.html. Accessed September 30, 2024.
  • 57. Brant R. Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics. 1990;46(4):1171–8. 10.2307/2532457.
  • 58. Harrell F. Assessing the proportional odds assumption and its impact. Statistical Thinking. March 9, 2022. https://www.fharrell.com/post/impactpo/#examining-the-po-assumption. Accessed October 29, 2024.
  • 59. SAS Institute Inc. 22954 - The PROC LOGISTIC proportional odds test and fitting a partial proportional odds model. https://support.sas.com/kb/22/954.html. Accessed October 24, 2024.
  • 60. Peterson B, Harrell FE Jr. Partial proportional odds models for ordinal response variables. J R Stat Soc Ser C Appl Stat. 1990;39(2):205–17. 10.2307/2347760.
  • 61. Harrell F. Borrowing information across outcomes. Statistical Thinking. April 30, 2024. https://www.fharrell.com/post/yborrow/. Accessed June 2, 2025.
  • 62. Optimising the Analysis of Stroke Trials Collaboration with the Writing Committee, Bath P, Hogg C, Tracy M, Pocock S. Calculation of numbers-needed-to-treat in parallel group trials assessing ordinal outcomes: case examples from acute stroke and stroke prevention. Int J Stroke. 2011;6(6):472–9. 10.1111/j.1747-4949.2011.00614.x.
  • 63. Al-Jaishi AA, McIntyre CW, Sontrop JM, et al. Major outcomes with personalized dialysate temperature (MyTEMP): rationale and design of a pragmatic, registry-based, cluster randomized controlled trial. Can J Kidney Health Dis. 2020;7:1–18. 10.1177/2054358119887988.
  • 64. Dixon S, Sontrop J, Al-Jaishi A, et al. MyTEMP: statistical analysis plan of a registry-based, cluster-randomized clinical trial. Can J Kidney Health Dis. 2021;8:205435812110411. 10.1177/20543581211041182.
  • 65. R Core Team. R: a language and environment for statistical computing. 2024. https://www.R-project.org/.
  • 66. Davison SN, Jhangri GS, Johnson JA. Longitudinal validation of a modified Edmonton symptom assessment system (ESAS) in haemodialysis patients. Nephrol Dial Transpl. 2006;21(11):3189–95. 10.1093/ndt/gfl380.
  • 67. Harrell F. Violation of proportional odds is not fatal. Statistical Thinking. September 20, 2020. https://www.fharrell.com/post/po/. Accessed September 30, 2024.
  • 68. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102. 10.1002/sim.8086.
  • 69. Whitehead J. Sample size calculations for ordered categorical data. Stat Med. 1993;12(24):2257–71. 10.1002/sim.4780122404.
