Abstract
Cluster randomized trials commonly employ multiple endpoints. When a single summary of treatment effects across endpoints is of primary interest, global methods represent a common analysis strategy. However, specification of the required joint distribution is non-trivial, particularly when the endpoints have different scales. We develop rank-based interval estimators for a global treatment effect referred to here as the “global win probability,” or the mean of multiple Wilcoxon Mann-Whitney probabilities, and interpreted as the probability that a treatment individual responds better than a control individual on average. Using endpoint-specific ranks among the combined sample and within each arm, each individual-level observation is converted to a “win fraction” which quantifies the proportion of wins experienced over every observation in the comparison arm. An individual’s multiple observations are then replaced with a single “global win fraction” by averaging win fractions across endpoints. A linear mixed model is applied directly to the global win fractions to obtain point, variance, and interval estimates adjusted for clustering. Simulation demonstrates our approach performs well concerning confidence interval coverage and type I error, and methods are easily implemented using standard software. A case study using public data is provided with corresponding R and SAS code.
Keywords: Cluster randomized trials, Global treatment effects, Nonparametric rank-sum test, Win ratio, Wilcoxon Mann-Whitney statistic, Linear mixed models
1. Introduction
Complex diseases with complex interventions demand complex trials. Cluster randomized trials, or cluster trials, can simplify the delivery of complex interventions and mitigate the risk of contamination by randomly allocating groups of individuals, or “clusters,” rather than individuals to intervention arms. However, their design and analysis are complicated by correlation among individuals within the same cluster, or “intracluster correlation.”1 Correlation structures are complicated even further when multiple endpoints are of primary interest. In addition to multiple sources of intracluster correlation, which may differ by endpoint, within-subject correlations between endpoints must also be captured. Parametric methods exist to model multiple endpoints, but are limited to endpoints on the same scale, e.g., all binary2 or all continuous with normality assumed.3
In this paper, we focus on the scenario where a single summary of the treatment effect across multiple endpoints of different scales, or a “global treatment effect,” is of primary interest. This scenario may arise when there is a lack of consensus on which single best endpoint to use, or when multiple endpoints are necessary to characterize disease burden at a single time point, risk-benefit trade-off, or even longitudinal disease course.4,5 Global tests represent a common analytic approach in this scenario, and assess overarching hypotheses about all endpoints simultaneously using a single test statistic constructed from their joint distribution. By testing endpoints jointly rather than separately, multiplicity is not a concern, and power is often greater than the multiple corresponding univariate tests.6
Motivated by a need to obtain a single probability statement regarding multiple disparate endpoints among few individually randomized subjects, O’Brien7 developed three 1-df global tests of a directed alternative: the ordinary least squares (OLS) test, the generalized least squares (GLS) test, and the nonparametric rank-sum test. All three tests construct the global treatment effect as the sum (or mean) of the endpoint-specific effects, but differ in their standardization and weighting schemes. Each of the tests is actually equivalent to: (1) a univariate test of the global treatment effect; (2) a combination of the univariate, endpoint-specific test statistics;8 and notably, (3) a two-sample t-test of a composite endpoint constructed as the within-subject mean of standardized responses.9 Thus, if an appropriate standardization can be identified and the corresponding composite is relevant, univariate tests may be applied directly, avoiding explicit specification of complex correlation structures.
The “parametric” OLS and GLS tests apply Z-standardization to each endpoint under the assumption that they are multivariate normal but this assumption is often untenable, particularly as multiple endpoints are commonly ordinal or differ in scale.10 O’Brien’s “nonparametric” rank-sum test instead replaces responses with their rank in the pooled sample.7 Ranks are then summed across endpoints within-subject, yielding composite “rank-sums.” For large sample sizes, correlation between rank-sums is sufficiently weak to permit application of the Central Limit Theorem. Thus, the nonparametric rank-sum test is constructed as a t-test for the mean difference in rank-sums, here the global treatment effect. Of the three tests, O’Brien promoted the rank-sum test for general use, as it suffers from little loss in efficiency when parametric assumptions are met and provides gains in power otherwise.7,11 While developed with one-sided alternatives in mind, the nonparametric rank-sum test is also applicable to two-sided alternatives.12,13
The nonparametric rank-sum test suffers from three main drawbacks. First, hypothesis tests and p-values, while informative in their own right, are no longer sufficient. Trials are encouraged to report point estimates and their uncertainty, preferably in the form of confidence intervals.14 Second, the mean difference in rank-sums is not an interpretable measure of group differences. As we demonstrate in what follows, it may be re-expressed in terms of “the win probability.” The win probability is a nonparametric treatment effect defined as the probability that a randomly selected treatment individual responds better than, or “wins over,” a randomly selected control individual. Some readers may recognize this quantity as the Wilcoxon Mann-Whitney statistic. We use this terminology to align with related “win statistics” currently gaining traction among trialists, though we do not adopt “prioritized evaluation.”4,15,16 Third, there are currently no feasible extensions to the cluster randomized setting. Zou & Zou demonstrated that the family of O’Brien-Wei-Lachin methods could be constructed using win fraction methods in individually randomized trials.17
Zou18 developed rank-based confidence interval estimators for a single win probability within cluster randomized trials by transforming individual-level responses into “win fractions.” Win fractions are proportions that summarize the wins experienced by an individual when compared to all others within the comparator arm. For example, a win fraction of 0.75 suggests that a (treatment) individual responded better than, or “won over,” 75% of those in the comparator (control) arm. Zou18 demonstrated that unbiased and consistent estimators of the win probability could be recovered from a linear mixed model for the win fractions with a random cluster intercept. Simulation studies demonstrated that resulting confidence intervals perform well, maintaining nominal coverage probabilities and type I error rates. Perhaps most importantly, these methods can be easily implemented using standard software and place win-based treatment effects within a regression framework, permitting easy adjustment for baseline responses for example.19
This paper presents an easily implemented, widely applicable, and interpretable solution to the otherwise complex analysis of cluster randomized trials with multiple endpoints of ordinal or different scales by building upon the ideas of O’Brien’s nonparametric rank-sum test7, win fraction methods for individually randomized trials,17,19–21 and the mixed model estimators of Zou18. Using ranks, a “global win fraction” is first constructed for each individual within the cluster trial as the within-subject mean of their endpoint-specific win fractions. A univariate linear mixed model is then applied directly to the global win fractions to obtain point and variance estimates for the global win probability adjusted for clustering effects. While hypothesis testing methods are provided, emphasis is placed on confidence interval estimation. The developed methods are simple yet flexible enough to handle multiple endpoints with differing properties such as type, scale, or priority, avoid explicit specification of complex correlation structures, can be implemented using existing software, yield meaningful treatment effects, and have the potential to increase power – a particular concern of cluster trials.
The rest of the paper is organized as follows. Section 2 defines notation, introduces the global win probability for multiple endpoints, and demonstrates translation to alternative win measures and the mean difference in rank-sums. In Section 3, simulation studies demonstrate the ability of the developed methods to maintain confidence interval coverage and the type I error rate while providing high power. In Section 4, a case study using publicly available data from the SHARE22 cluster trial exemplifies the application of the global win probability to two endpoints with different types and priority. Corresponding SAS and R code is provided in the Appendix to assist with implementation. The paper closes with a discussion, including a sketch of how sample size may be calculated.
2. Methods
2.1. Notation, context, and assumptions
Consider a parallel two-arm cluster randomized trial allocating $J_0$ and $J_1$ clusters to the control ($i = 0$) and treatment ($i = 1$) arms, respectively. Let $j = 1, \ldots, J_i$ index the clusters randomized to the $i$th intervention arm and $l = 1, \ldots, m_{ij}$ index the individuals within the $j$th cluster in the $i$th arm. All individuals receive the intervention allocated to their cluster. Then, $J = J_0 + J_1$ is the total number of clusters randomized by the trial, $N_i = \sum_{j=1}^{J_i} m_{ij}$ is the total number of individuals within the $i$th intervention arm, and $N = N_0 + N_1$ is the total number of individuals within the trial. Due to randomization, clusters are independent within and between arms.
Suppose $K$ endpoints are recorded for each individual. Let $Y_{ijlk}$ denote the $k$th response of the $l$th individual within the $j$th cluster in the $i$th arm. Assume all individuals are completely observed so that ($N \times K$) total responses are recorded, and that cluster sizes are non-informative, so that responses within a cluster are independent of its size. The $Y_{ijlk}$ are not independent within the $j$th cluster in the $i$th arm, but are assumed to be identically distributed according to non-degenerate distribution function $F_{ik}$. To accommodate discrete endpoints, $F_{ik}$ is defined as the “normalized distribution function,” $F_{ik}(y) = [F_{ik}^{-}(y) + F_{ik}^{+}(y)]/2$, where $F_{ik}^{-}$ and $F_{ik}^{+}$ are the left- and right-continuous distribution functions, respectively. Finally, assume endpoints are at least ordinal in nature so any two responses can be ordered, and without loss of generality, greater responses correspond to better health.
2.2. Global win probability, and related measures
Following Zou,18 we consider the individual-level win probability for a single endpoint, hereon referred to simply as the win probability, which compares individual-level responses between arms rather than cluster-level summaries. For the $k$th endpoint, let $Y_{1k}$ and $Y_{0k}$ denote the responses of a randomly selected treatment and control individual, respectively. The win probability takes the form,
$$\theta_k = P(Y_{1k} > Y_{0k}) + \tfrac{1}{2} P(Y_{1k} = Y_{0k}). \tag{1}$$
The win probability is rooted in the Wilcoxon rank-sum or Mann-Whitney U test statistics and has had over a dozen unique names including the area under the receiver operating characteristic curve, concordance or the c-index, the probabilistic index, and the common language effect size.20 It also shares relationships with commonly encountered effect measures. Of particular note, when the endpoint is normally distributed, the win probability $\theta_k$ is a one-to-one function of the standardized mean difference or Cohen’s effect size $d$, with $\theta_k = \Phi(d/\sqrt{2})$ where $\Phi$ is the standard normal distribution function. Cohen’s qualitative benchmarks are then easily transferable to the win probability, with $\theta_k = 0.50$, $0.56$, $0.64$, and $0.71$ corresponding to null, small, moderate, and large effects, respectively.23
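The mapping from Cohen's benchmarks to win probabilities can be checked numerically. The following is a minimal sketch, assuming normally distributed responses with equal variance in both arms; the function name `win_probability_from_cohens_d` is illustrative and not part of the paper's code.

```python
from math import sqrt
from statistics import NormalDist


def win_probability_from_cohens_d(d: float) -> float:
    """Win probability implied by a standardized mean difference d.

    Assumes both arms are normal with a common variance, so that the
    difference of two independent responses has standard deviation sqrt(2).
    """
    return NormalDist().cdf(d / sqrt(2))


# Cohen's conventional benchmarks for d (null, small, moderate, large).
benchmarks = {"null": 0.0, "small": 0.2, "moderate": 0.5, "large": 0.8}
win_probs = {
    label: round(win_probability_from_cohens_d(d), 2)
    for label, d in benchmarks.items()
}
print(win_probs)
```

The rounded values agree with the win probabilities of 0.50, 0.56, 0.64, and 0.71 used as effect sizes in the simulation study.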
When $K$ endpoints are of joint interest, we propose that the average of the corresponding win probabilities $\theta_1, \ldots, \theta_K$ serves as the global treatment effect. Formally, the “global win probability” is defined as,
$$\theta = \sum_{k=1}^{K} w_k \theta_k, \tag{2}$$
where $w_k$ represents the optional, fixed contribution weight of the $k$th endpoint, with $\sum_{k=1}^{K} w_k = 1$. Weights need not be specified to use our method, and in what follows, we focus primarily on equal contribution weights, $w_k = 1/K$ for all $k$, so $\theta = \frac{1}{K} \sum_{k=1}^{K} \theta_k$. The global win probability may be formally interpreted as the probability that a randomly selected individual from a treatment cluster responds no worse than a randomly selected individual from a control cluster with respect to the $K$ endpoints, on average.
2.2.1. Global win difference
Rather than the global win probability, one may consider the global win difference. For a single endpoint, the win difference has also been referred to as Somers’ $d$, the Mann-Whitney difference,24 and the proportion in favor of treatment.25 The win difference is commonly used for methodology development as it avoids explicit consideration of tie probabilities, $P(Y_{1k} = Y_{0k})$. For example, the global win difference was employed by Huang et al when developing improvements of the nonparametric rank-sum test,12,13 and by Lachin when developing multivariate distribution-free hypothesis testing methods.24
First, note that the win probability $\theta_k$ presented within Equation (1) is defined with respect to a treatment win, $Y_{1k} > Y_{0k}$. The probabilities of a treatment win, loss, and tie sum to unity,
$$P(Y_{1k} > Y_{0k}) + P(Y_{1k} < Y_{0k}) + P(Y_{1k} = Y_{0k}) = 1. \tag{3}$$
Since the win probability distributes ties equally between arms and a loss for treatment is a win for control, $P(Y_{1k} < Y_{0k}) = P(Y_{0k} > Y_{1k})$, it follows from Equation (3) that the win probability defined with respect to a control win, or the “control win probability,” is the complement of the treatment win probability, $1 - \theta_k$.
The win difference is defined as the difference between the treatment and control win probabilities,
$$\Delta_k = \theta_k - (1 - \theta_k) = 2\theta_k - 1, \tag{4}$$
or equivalently, $\Delta_k = P(Y_{1k} > Y_{0k}) - P(Y_{1k} < Y_{0k})$. $\Delta_k$ represents a generalization of the risk difference for binary endpoints to all endpoint types, and ranges between −1 and +1 with $\Delta_k = 0$ suggesting no treatment effect and $\Delta_k > 0$ treatment benefit relative to the control. With $K$ endpoints, the “global win difference” may be obtained in a similar fashion to Equation (2) as $\Delta = 2\theta - 1$, where $\theta$ is the global win probability. Thus, estimators provided in Sections 2.4 and 2.5 may be translated to the global win difference by applying the transformations $\hat{\Delta} = 2\hat{\theta} - 1$ and $\widehat{\mathrm{var}}(\hat{\Delta}) = 4\,\widehat{\mathrm{var}}(\hat{\theta})$.
2.2.2. Global win odds
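Agresti proposed the generalized odds ratio, commonly referred to as “Agresti’s $\alpha$,” as a treatment effect for ordinal endpoints.26 For the $k$th endpoint, $\alpha_k$ is equal to the ratio of the probability of a treatment win to a control win,
$$\alpha_k = \frac{P(Y_{1k} > Y_{0k})}{P(Y_{1k} < Y_{0k})},$$
and is equivalent to the odds ratio when the endpoint is binary. Pocock et al popularized the use of Agresti’s $\alpha$ as a measure of effect size for prioritized time-to-event composites within cardiovascular trials,27 referring to it as the “win ratio.” For a single continuous survival endpoint, $\alpha_k$ is equivalent to the inverse of the hazard ratio when the proportional hazards assumption holds. However, $\alpha_k$ excludes the probability of ties, which can be both informative and substantial for discrete endpoints.
Dong et al28 and Brunner et al29 instead advocate for the use of the “win odds,” which incorporates ties and is equal to the ratio of the treatment and control win probabilities. For the $k$th endpoint,
$$\mathrm{WO}_k = \frac{\theta_k}{1 - \theta_k},$$
or equivalently, $\mathrm{WO}_k = [P(Y_{1k} > Y_{0k}) + \tfrac{1}{2} P(Y_{1k} = Y_{0k})] / [P(Y_{1k} < Y_{0k}) + \tfrac{1}{2} P(Y_{1k} = Y_{0k})]$. When the endpoint is continuous, $P(Y_{1k} = Y_{0k}) = 0$ so that $\mathrm{WO}_k = \alpha_k$. With $K$ endpoints, we define the “global win odds” as $\mathrm{WO} = \theta / (1 - \theta)$ where the global win probability $\theta$ is defined per (2). $\mathrm{WO}$ ranges between 0 and $\infty$ with $\mathrm{WO} = 1$ indicating no treatment effect and $\mathrm{WO} > 1$ treatment benefit relative to the control. Estimators provided in Sections 2.4 and 2.5 may be translated to the global win odds through application of the $\delta$-method, with $\widehat{\mathrm{WO}} = \hat{\theta} / (1 - \hat{\theta})$ and $\widehat{\mathrm{var}}(\widehat{\mathrm{WO}}) = \widehat{\mathrm{var}}(\hat{\theta}) / (1 - \hat{\theta})^4$.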
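As an illustration of these translations, the sketch below converts an estimated global win probability and its variance to the win difference and win odds scales. It assumes the linear transformation for the win difference and a first-order delta-method variance for the win odds; the function names and the numerical inputs are hypothetical.

```python
def to_win_difference(theta_hat: float, var_theta: float) -> tuple[float, float]:
    """Win difference 2*theta - 1 and its variance (linear transformation)."""
    return 2.0 * theta_hat - 1.0, 4.0 * var_theta


def to_win_odds(theta_hat: float, var_theta: float) -> tuple[float, float]:
    """Win odds theta / (1 - theta) with a first-order delta-method variance.

    The derivative of theta / (1 - theta) is 1 / (1 - theta)^2, so the
    variance is scaled by 1 / (1 - theta)^4.
    """
    odds = theta_hat / (1.0 - theta_hat)
    return odds, var_theta / (1.0 - theta_hat) ** 4


# Hypothetical estimate: global win probability 0.60 with variance 0.0004.
wd, var_wd = to_win_difference(0.60, 0.0004)
wo, var_wo = to_win_odds(0.60, 0.0004)
print(wd, var_wd, wo, var_wo)
```

Because the win odds transformation is nonlinear, its delta-method variance is only a large-sample approximation and may behave poorly when the estimated win probability is close to 1.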
2.3. Global win fractions
Point and variance estimation for associated measures such as the Wilcoxon Mann-Whitney U-test statistic or win difference have traditionally relied on the construction and comparison of all pairs consisting of one treatment and one control response.25,30 However, the construction of all “pairwise comparisons” is computationally expensive, even for moderate sample sizes.31 We focus instead on novel estimators which transform each individual’s observed response into a proportion summarizing their wins and ties experienced, referred to as a “win fraction.”17–21 In Appendix A, we show that win fractions are asymptotically uncorrelated between arm and cluster, permitting the application of conventional methods, though not within cluster, requiring adjustment for intracluster correlation.
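Formally, the win fraction for the $k$th endpoint of the $l$th individual in the $j$th treatment cluster, with response $Y_{1jlk}$, is provided by,
$$W_{1jlk} = \frac{1}{N_0} \sum_{j'=1}^{J_0} \sum_{l'=1}^{m_{0j'}} H(Y_{1jlk} - Y_{0j'l'k}), \tag{5}$$
where $N_0$ is the total number of control individuals, the double sum runs over all individuals $l'$ in all control clusters $j'$, and $H(\cdot)$ is the Heaviside function taking the value +1 when $Y_{1jlk} > Y_{0j'l'k}$ or a “win” occurs, +0.5 when $Y_{1jlk} = Y_{0j'l'k}$ or a “tie” occurs, and +0 when $Y_{1jlk} < Y_{0j'l'k}$ or a “loss” occurs. In other words, win fractions are equal to the within-subject mean of the ($N_0$) pairwise comparisons involving $Y_{1jlk}$. Of particular note, $W_{1jlk}$ may also be expressed as $\hat{F}_{0k}(Y_{1jlk})$, where $\hat{F}_{0k}$ is the empirical cumulative distribution function (ECDF) of the $k$th endpoint in the control arm. That is, $W_{1jlk}$ is the percentile that treatment observation $Y_{1jlk}$ occupies among all control observations.18 Similarly, for the $l$th individual in the $j$th control cluster,
$$W_{0jlk} = \frac{1}{N_1} \sum_{j'=1}^{J_1} \sum_{l'=1}^{m_{1j'}} H(Y_{0jlk} - Y_{1j'l'k}), \tag{6}$$
or $W_{0jlk} = \hat{F}_{1k}(Y_{0jlk})$, the percentile that $Y_{0jlk}$ occupies among all $N_1$ treatment observations.
As a result of this relationship with the ECDF, win fractions may be conveniently expressed in terms of ranks. With ties, “ranks” refers to “midranks” or the average of the tied positions. Following Hoeffding,32 define two ranks for individual-level treatment response $Y_{1jlk}$: (1) the “overall rank” among all $N = N_0 + N_1$ responses in the trial,
$$R_{1jlk} = \frac{1}{2} + \sum_{i'=0}^{1} \sum_{j'=1}^{J_{i'}} \sum_{l'=1}^{m_{i'j'}} H(Y_{1jlk} - Y_{i'j'l'k}),$$
or equivalently, $R_{1jlk} = \tfrac{1}{2} + N \hat{F}_{k}(Y_{1jlk})$ with $\hat{F}_{k}$ the ECDF of the pooled sample, and (2) the “group-specific rank” among the $N_1$ responses in the treatment arm,
$$R^{(1)}_{1jlk} = \frac{1}{2} + \sum_{j'=1}^{J_1} \sum_{l'=1}^{m_{1j'}} H(Y_{1jlk} - Y_{1j'l'k}),$$
or equivalently, $R^{(1)}_{1jlk} = \tfrac{1}{2} + N_1 \hat{F}_{1k}(Y_{1jlk})$. These ranks can be defined analogously for control observations. It then follows from Equations (5) and (6) that the win fractions may be expressed generally as,
$$W_{ijlk} = \frac{R_{ijlk} - R^{(i)}_{ijlk}}{N_{1-i}}, \quad i = 0, 1, \tag{7}$$
where $N_{1-i}$ is the number of individuals in the comparator arm.
The rank-based form of the win fractions simplifies and speeds up calculation significantly, as only three sets of ranks need to be computed rather than all pairwise comparisons.31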
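The equivalence between the pairwise and rank-based forms of the win fraction can be verified on a small example. The sketch below is an illustration for a single endpoint and ignores the cluster structure, which does not affect the win fraction values themselves; the helper names and the toy data are hypothetical.

```python
def heaviside(diff: float) -> float:
    """1 for a win, 0.5 for a tie, 0 for a loss."""
    return 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)


def win_fractions_pairwise(own: list[float], other: list[float]) -> list[float]:
    """Mean of all pairwise comparisons against the comparator arm."""
    return [sum(heaviside(y - z) for z in other) / len(other) for y in own]


def midranks(values: list[float]) -> list[float]:
    """Ranks with ties assigned the average of the tied positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    return ranks


def win_fractions_ranked(treatment: list[float], control: list[float]):
    """(overall rank - group-specific rank) / comparator arm size."""
    n1, n0 = len(treatment), len(control)
    overall = midranks(treatment + control)
    trt_rank, ctl_rank = midranks(treatment), midranks(control)
    trt = [(overall[i] - trt_rank[i]) / n0 for i in range(n1)]
    ctl = [(overall[n1 + i] - ctl_rank[i]) / n1 for i in range(n0)]
    return trt, ctl


treatment = [3, 5, 4, 5]     # toy ordinal responses
control = [2, 3, 3, 5, 1]
trt_wf, ctl_wf = win_fractions_ranked(treatment, control)
print(trt_wf, ctl_wf)
```

In this toy example the mean treatment win fraction and one minus the mean control win fraction coincide, which is the property exploited by the estimators of Section 2.4.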
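To estimate the global win probability, a single “global win fraction” is constructed for each individual as the (weighted) mean of their $K$ endpoint-specific win fractions $W_{ijlk}$,
$$W_{ijl} = \sum_{k=1}^{K} w_k W_{ijlk}, \tag{8}$$
where $w_k$ is the contribution weight of the $k$th endpoint and the weights sum to one. With equal contribution weights, Equation (8) reduces to $W_{ijl} = \frac{1}{K} \sum_{k=1}^{K} W_{ijlk}$, or the simple within-subject mean of the win fractions.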
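A short sketch of this averaging step follows. It assumes the endpoint-specific win fractions have already been computed and that any supplied weights sum to one; the function name and the example values are hypothetical.

```python
def global_win_fractions(win_fractions_by_endpoint, weights=None):
    """Within-subject (weighted) mean of endpoint-specific win fractions.

    win_fractions_by_endpoint: list of K lists, one per endpoint, each
    holding one win fraction per individual in the same order.
    weights: optional contribution weights assumed to sum to one;
    equal weights 1/K are used when omitted.
    """
    k = len(win_fractions_by_endpoint)
    if weights is None:
        weights = [1.0 / k] * k
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("contribution weights are assumed to sum to one")
    n = len(win_fractions_by_endpoint[0])
    return [
        sum(w * wf[i] for w, wf in zip(weights, win_fractions_by_endpoint))
        for i in range(n)
    ]


# Hypothetical win fractions for four individuals on two endpoints.
endpoint_1 = [0.6, 0.9, 0.8, 0.9]
endpoint_2 = [0.5, 0.7, 1.0, 0.3]
equal = global_win_fractions([endpoint_1, endpoint_2])
weighted = global_win_fractions([endpoint_1, endpoint_2], weights=[0.75, 0.25])
print(equal, weighted)
```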
2.4. Mixed model estimators
Zou18 demonstrated that an unbiased estimator of a single win probability can be obtained as the mean treatment win fraction, , or since a win for control is a loss for treatment, one minus the mean control win fraction, . It was also proved that asymptotically, where . Two estimators of the and a total of three corresponding variance estimators, , were investigated.
Here, we construct a single transformed response for each individual within the cluster randomized trial, i.e., a global win fraction. In what follows, we obtain point and variance estimators of the global win probability by applying these univariate methods, the mixed model estimators specifically, directly to the global win fractions.
2.4.1. Application to global win fractions
An unbiased estimator of the global win probability and a consistent estimator of its variance can be obtained by applying the following linear mixed model to the global win fractions $W_{ijl}$,
$$W_{ijl} = \beta_0 + \beta_1 X_{ijl} + u_{ij} + \varepsilon_{ijl}, \tag{9}$$
where $X_{ijl}$ is an indicator equal to 1 if individual $l$ belongs to a treatment cluster and 0 if control, $u_{ij}$ represents the random intercept of the $j$th cluster, $\varepsilon_{ijl}$ represents the residual of the $l$th win fraction, and $u_{ij}$ and $\varepsilon_{ijl}$ are assumed to be independent.
From these definitions it follows that and an estimator of the global win probability can be recovered from the fitted model as . Since the two intervention arms are independent, and thus . An estimate of the intracluster correlation of the global win fractions can also be obtained as . Coefficient estimators take the form of feasible generalized least squares (FGLS) estimators, which are well-known to be unbiased and asymptotically normal.33 Regardless of alignment between (9) and the underlying data generation process, is always consistent for the mean difference or average treatment effect, and under equal allocation, the model-based variance is always asymptotically correct.34
2.4.2. Relationship with two-sample -statistics
When all clusters feature the same number of individuals so that for all and , and endpoints are equally weighted so , the mixed model estimator reduces to the simple mean of the global win fractions in the arm,
| (10) |
Expansion of Equation (10) according to the global win fraction definitions within Section 2.3 provides,
| (11) |
suggesting that is equivalent to the simple mean of clustered, two-sample -statistics when individuals are equally weighted. Comparing the form of the treatment win fraction in Equation (5) to Equation (11), it becomes clear that each treatment win fraction can be conceptualized as a two-sample U-statistic where . The same is true for control win fractions with .
Obuchowski35 developed the same estimator (11) for a single endpoint () by extending DeLong et al’s AUC estimators,36 or equivalently Sen’s structural component estimators,37 to the clustered setting. Brunner et al38 established the unbiasedness, consistency, and multivariate normality of multiple win probabilities with independent observations, and Rubarth et al39 established a similar result for clustered factorial designs. From these results, in addition to those of Zou,18 it follows for cluster randomized trials that the global win probability, equal to the mean of the multiple win probabilities, is asymptotically .
2.4.3. Equivalence to mean difference in rank-sums
Consider the simplified scenario from the previous subsection. From the rank-based form of the win fractions presented previously in Equation (7) and the fact that the sum of the group-specific ranks , with some algebra it follows that,
where is the rank-sum for the treatment individual and is the mean treatment rank-sum. Similarly for the control arm,
Thus, the mean difference in rank-sums is a linear transformation of the global win probability estimator with
and .
2.5. Interval estimators and hypothesis tests
A large-sample confidence interval for the global win probability is provided by,
where is the upper quantile of . For smaller samples, may be substituted with the corresponding Student’s critical value with df degrees of freedom, . Commonly, where is the total number of clusters, but there is no unique way of specifying the degrees of freedom of a mixed model.1 The corresponding test of the null hypothesis of no global treatment effect, , can be assessed using the test statistic,
which is distributed according to for large samples or approximately for small samples. Using results from Section 2.4.3, the nonparametric rank-sum test is analogous to a test of the global win probability with as,
A logit transformation may also be applied to improve behaviour for a small number of clusters or extreme values of . The lower and upper bounds of the large-sample logit interval, , are obtained respectively as,
where
The null hypothesis of no treatment effect, , can also be assessed using the test statistic,
which is distributed according to for large samples.
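The interval and test constructions of this section can be sketched as follows. This is an illustration under stated assumptions: only normal critical values are used (a Student's $t$ critical value would be substituted for a small number of clusters), and the logit interval is built with the usual delta-method standard error on the logit scale, which may differ in detail from the expressions in the original text. Function names and inputs are hypothetical.

```python
from math import exp, log
from statistics import NormalDist


def identity_interval(theta_hat, se, level=0.95):
    """Large-sample Wald interval on the probability scale."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return theta_hat - z * se, theta_hat + z * se


def logit_interval(theta_hat, se, level=0.95):
    """Wald interval on the logit scale, back-transformed to (0, 1)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    centre = log(theta_hat / (1 - theta_hat))
    se_logit = se / (theta_hat * (1 - theta_hat))  # delta method
    lower, upper = centre - z * se_logit, centre + z * se_logit
    return 1 / (1 + exp(-lower)), 1 / (1 + exp(-upper))


def wald_statistics(theta_hat, se):
    """Test statistics for a null win probability of 0.5 on both scales."""
    z_identity = (theta_hat - 0.5) / se
    z_logit = log(theta_hat / (1 - theta_hat)) * theta_hat * (1 - theta_hat) / se
    return z_identity, z_logit


# Hypothetical extreme estimate with few clusters.
ident = identity_interval(0.90, 0.06)
logit = logit_interval(0.90, 0.06)
print(ident, logit, wald_statistics(0.90, 0.06))
```

In this hypothetical example the identity interval extends beyond 1, while the logit interval remains within the unit interval, which illustrates why the transformation can help for extreme values.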
3. Simulation studies
3.1. Objectives and evaluation criteria
Our simulation study assesses the performance of the proposed interval estimators for the global win probability and their corresponding hypothesis tests of across a range of cluster trial designs. Performance metrics of interest are the coverage probability, balance of left and right tail error rates, type I error rates, and power. Emphasis is placed on the evaluation of interval estimators as their performance provides a combined summary of the quality of the proposed point and variance estimators.
Empirical coverage probability (ECP) is estimated as the proportion of confidence intervals containing the true global win probability , while left (right) tail error rates are estimated as the proportion of lower (upper) bounds greater than (less than) . The tail error ratio (TER), defined as the ratio of the left tail error rate to the right, is reported with a TER of 1 suggesting balance. Empirical type I error rates and power are estimated as the proportion of confidence intervals excluding the null value of 0.5, or the empirical rejection rate (ERR), when is null and non-null, respectively.
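These performance metrics can be computed from a collection of simulated confidence intervals as sketched below. The function name and the four example intervals are hypothetical and serve only to show the definitions.

```python
def summarize_intervals(intervals, true_theta, null_value=0.5):
    """Empirical coverage (%), tail error ratio, and rejection rate (%).

    intervals: list of (lower, upper) confidence bounds, one per replicate.
    """
    n = len(intervals)
    covered = sum(lo <= true_theta <= hi for lo, hi in intervals)
    left = sum(lo > true_theta for lo, hi in intervals)    # lower bound above truth
    right = sum(hi < true_theta for lo, hi in intervals)   # upper bound below truth
    rejected = sum(not (lo <= null_value <= hi) for lo, hi in intervals)
    ter = left / right if right > 0 else float("nan")
    return {"ECP": 100 * covered / n, "TER": ter, "ERR": 100 * rejected / n}


# Four hypothetical intervals with a true global win probability of 0.64.
example = [(0.52, 0.70), (0.45, 0.62), (0.66, 0.80), (0.40, 0.55)]
summary = summarize_intervals(example, true_theta=0.64)
print(summary)
```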
Reported ECPs are considered acceptable if they fall within approximately two standard errors of the specified nominal rate. Specifically, 95% confidence intervals are desired, and each scenario is replicated times so that the acceptable range of the empirical coverage probability is or 94.4% to 95.6%. The corresponding acceptable range for the empirical type I error rate is 0.044 to 0.056.
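The acceptable ranges can be reproduced with a binomial standard error. In the sketch below, the number of replications (5,000) is an assumption chosen because it reproduces the stated ranges; it is not taken from the text, and the function name is hypothetical.

```python
from math import sqrt


def acceptable_range(nominal, replications, n_se=2.0):
    """Nominal rate plus or minus n_se binomial standard errors."""
    se = sqrt(nominal * (1 - nominal) / replications)
    return nominal - n_se * se, nominal + n_se * se


# Assumed number of replications per scenario (not stated in the text above).
REPLICATIONS = 5000
coverage_range = acceptable_range(0.95, REPLICATIONS)
type1_range = acceptable_range(0.05, REPLICATIONS)
print([round(x, 3) for x in coverage_range], [round(x, 3) for x in type1_range])
```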
3.2. Scenarios and data generation
Our simulation study employed a factorial design, evaluating 1,536 scenarios in total. R code for this simulation study, and results for all scenarios, can be publicly accessed at https://github.com/statisticelle/sim-gwinp.
We consider both balanced and unbalanced allocation, so that the proportion of clusters allocated to treatment (versus control) is either or , and total clusters.
Sample size parameters were selected to correspond to the number of clusters randomized and number of individuals per cluster reported by several reviews of cluster trials. Among cluster trials in primary care, Eldridge et al reported a median of 34 clusters randomized with a median cluster size of 32 individuals (IQR = 9 to 82).40 Kahan et al reported a median of 25 clusters (IQR = 15 to 44) with a median cluster size of 34 individuals (IQR = 14 to 94).41 Ivers et al reported a median of 21 clusters (IQR = 12 to 52) with a median cluster size of 34 individuals (IQR = 13 to 89).42 Since analytic results in Appendix A suggest win fractions are uncorrelated between cluster and arm for large , we consider .
As discussed in Section 2.4.2, the mixed model point estimator is equivalent to the unweighted mean of the global win fractions when cluster sizes are equal or fixed. Thus, two scenarios are considered: (i) a reference case with fixed cluster sizes so that or 30 for all and ; and (ii) a realistic case with random cluster sizes, with an average of approximately or 30 individuals per cluster. Random cluster sizes were generated according to a truncated log-normal distribution, with a lower bound of 3 and an upper bound of . The mean of this distribution was selected to approximate , and variance chosen to approximate a coefficient of variation equal to 0.65, based on cluster trials of UK general practices.43
Our simulation study considers only endpoints. Since our methods are particularly advantageous for ordinal endpoints lacking meaningful units or endpoints with different scales,21 we let and to reflect commonly encountered 5- and 7-point Likert scales. , to reflect symmetric distributions, or 0.3 to reflect skewed distributions. Parameters within the treatment arm were then chosen so that , or to yield null, small, medium, and large effect sizes (based on relationships with Cohen’s effect size when endpoints are normal).23 Endpoint effects could be homogeneous such that or heterogeneous such that . Heterogeneous effects were set to and so that a “large” difference existed between the two effects. The true global win probability is .
Let represent the within-subject correlation matrix,
with off-diagonal entries equal to the pairwise correlation of the two endpoints on their original scale, i.e., . The pairwise correlation was varied from weak to strong, with .
Let represent the matrix of intracluster correlations,
with diagonal entries equal to the ICC of the endpoint, i.e., where , and off-diagonal entries equal to the cross-subject cross-endpoint within-cluster correlation, i.e., where and . Values of the intracluster correlation are generally small. A review of estimated intracluster correlations within primary care trials reported that 99% of ICCs were less than 0.10.44 Focusing on heterogeneous endpoint ICCs, the ICCs were set to values of , and .
The within-cluster correlation matrix is then provided by where is the identity matrix, is the matrix of ones, and denotes the Kronecker product. In other words, is a block diagonal matrix with repeated on the diagonal times and each off-diagonal entry equal to . For a cluster with three individuals, i.e., ,
The components of were assumed to be the same for both arms, and clusters are independent within- and between-arms. This correlation structure is commonly referred to as the “block exchangeable correlation model,” and has also been studied within the context of cluster trials with multiple continuous co-primary endpoints,45 longitudinal and cross-over designs,46 and stepped wedge designs.47
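The block exchangeable structure described above can be assembled with Kronecker products. The sketch below assumes the within-cluster correlation matrix takes the form of the identity matrix times the difference between the within-subject and intracluster matrices, plus the matrix of ones times the intracluster matrix, which yields the within-subject matrix on the diagonal blocks and the intracluster matrix elsewhere. All numerical values and function names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [
        [a[i][j] * b[p][q] for j in range(len(a[0])) for q in range(len(b[0]))]
        for i in range(len(a))
        for p in range(len(b))
    ]


def block_exchangeable(within_subject, intracluster, m):
    """Correlation matrix for a cluster of m individuals.

    Diagonal blocks equal the within-subject matrix; every off-diagonal
    block equals the intracluster matrix.
    """
    k = len(within_subject)
    identity = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
    ones = [[1.0] * m for _ in range(m)]
    diff = [
        [within_subject[r][c] - intracluster[r][c] for c in range(k)]
        for r in range(k)
    ]
    first, second = kron(identity, diff), kron(ones, intracluster)
    size = m * k
    return [[first[r][c] + second[r][c] for c in range(size)] for r in range(size)]


# Hypothetical values: endpoint correlation 0.5, ICCs 0.05 and 0.10,
# cross-subject cross-endpoint correlation 0.02, three individuals.
omega = [[1.0, 0.5], [0.5, 1.0]]
icc = [[0.05, 0.02], [0.02, 0.10]]
corr = block_exchangeable(omega, icc, m=3)
for row in corr:
    print([round(v, 2) for v in row])
```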
Individual-level bivariate ordinal responses were generated independently by arm and cluster using the “mean mapping algorithm”.48 In essence, realizations from a bivariate normal distribution with mean zero and intermediate correlation matrix are generated. Quantiles of the univariate standard normal distribution are then used to convert the normal variates into ordinal variates with the desired marginal distributions and correlation matrix . Root-finding algorithms are required to identify the intermediate correlation or entries of which yield the desired correlation upon discretization.
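The discretization step of this procedure can be sketched as follows. This illustration covers only the conversion of correlated normal variates into ordinal categories through standard normal quantiles. It takes the intermediate correlation as given rather than solving for it by root-finding, and it ignores clustering. The marginal probabilities, seed, and function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def thresholds(probs):
    """Standard normal cut points matching cumulative category probabilities."""
    cumulative, cuts = 0.0, []
    for p in probs[:-1]:
        cumulative += p
        cuts.append(NormalDist().inv_cdf(cumulative))
    return cuts


def to_ordinal(z, cuts):
    """Category 1..C determined by how many cut points z exceeds."""
    return 1 + sum(z > c for c in cuts)


def simulate_bivariate_ordinal(n, probs_1, probs_2, intermediate_corr, seed=1):
    """Discretize correlated standard normal pairs into two ordinal endpoints."""
    rng = random.Random(seed)
    cuts_1, cuts_2 = thresholds(probs_1), thresholds(probs_2)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        z2 = intermediate_corr * z1 + sqrt(1 - intermediate_corr**2) * noise
        pairs.append((to_ordinal(z1, cuts_1), to_ordinal(z2, cuts_2)))
    return pairs


# Hypothetical symmetric marginals for 5- and 7-point scales.
probs_5 = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]
probs_7 = [1 / 64, 6 / 64, 15 / 64, 20 / 64, 15 / 64, 6 / 64, 1 / 64]
sample = simulate_bivariate_ordinal(20000, probs_5, probs_7, intermediate_corr=0.5)
share_middle = sum(a == 3 for a, _ in sample) / len(sample)
print(round(share_middle, 3))
```

The correlation of the resulting ordinal variates is generally weaker than the intermediate correlation of the underlying normal variates, which is why the root-finding step described above is needed in the full algorithm.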
3.3. Results
Due to the large number of scenarios examined, numerical summaries are provided here for only the most extreme scenario. Tables 1 and 2 summarize performance for the identity and logit back-transformed interval estimators, respectively, under unequal allocation of clusters, skewed endpoint distributions, and large effect heterogeneity. Power was slightly greater for the least extreme scenario of equal allocation, symmetric endpoints, and no heterogeneity; otherwise, performance was similar to that of Tables 1 and 2. Empirical coverage probability (ECP) is also graphically summarized across all scenarios for the logit back-transformed interval estimator in Figure 1, suggesting coverage is similar regardless of allocation ratio, effect heterogeneity, and endpoint skew. Similar trends were observed for the identity back-transformed interval estimator.
Table 1.
Identity interval estimator: Simulation results for increasing numbers of clusters , endpoint correlation , and global win probabilities under unequal allocation of clusters (), skewed endpoint distributions (), and large heterogeneity in endpoint effects ().
| Clusters | Endpoint correlation | Global win probability | Fixed cluster sizes | | | | | | Random cluster sizes | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | ECP | TER | ERR | ECP | TER | ERR | ECP | TER | ERR | ECP | TER | ERR |
| 10 | 0.2 | 0.50 | 95.8 | 1.00 | 4.2 | 94.2 | 0.87 | 5.8 | 95.8 | 1.00 | 4.2 | 94.3 | 0.90 | 5.7 |
| | | 0.56 | 95.8 | 1.10 | 12.8 | 94.7 | 1.12 | 19.6 | 95.3 | 1.35 | 14.6 | 94.4 | 0.87 | 18.2 |
| | | 0.64 | 95.5 | 1.37 | 55.6 | 95.1 | 1.45 | 75.7 | 95.6 | 1.44 | 54.8 | 94.8 | 1.26 | 72.3 |
| | | 0.71 | 95.9 | 3.10 | 91.8 | 95.0 | 1.50 | 98.5 | 94.7 | 1.52 | 88.8 | 94.5 | 1.89 | 97.1 |
| | 0.5 | 0.50 | 96.6 | 0.89 | 3.4 | 95.2 | 0.75 | 4.8 | 95.6 | 0.91 | 4.4 | 94.3 | 0.87 | 5.7 |
| | | 0.56 | 96.1 | 1.29 | 11.3 | 95.3 | 1.24 | 18.5 | 95.6 | 1.10 | 12.8 | 94.7 | 1.12 | 18.5 |
| | | 0.64 | 96.1 | 1.71 | 50.5 | 94.8 | 1.36 | 71.8 | 95.6 | 2.00 | 50.0 | 94.9 | 1.43 | 69.9 |
| | | 0.71 | 96.3 | 2.80 | 88.0 | 95.2 | 1.82 | 97.6 | 95.0 | 2.27 | 86.2 | 94.5 | 1.75 | 96.4 |
| | 0.8 | 0.50 | 96.6 | 0.84 | 3.4 | 95.4 | 0.84 | 4.6 | 96.0 | 1.11 | 4.0 | 94.7 | 0.83 | 5.3 |
| | | 0.56 | 96.3 | 0.95 | 10.6 | 95.2 | 1.47 | 18.5 | 95.6 | 1.25 | 12.1 | 94.8 | 1.36 | 17.7 |
| | | 0.64 | 96.0 | 1.67 | 45.7 | 95.3 | 1.47 | 69.0 | 96.1 | 1.79 | 44.8 | 94.3 | 1.24 | 66.0 |
| | | 0.71 | 96.5 | 2.18 | 82.8 | 96.0 | 1.56 | 97.1 | 95.8 | 2.42 | 82.1 | 94.4 | 1.89 | 95.0 |
| 20 | 0.2 | 0.50 | 94.8 | 0.86 | 5.2 | 95.3 | 0.92 | 4.7 | 94.8 | 0.96 | 5.2 | 94.5 | 0.87 | 5.5 |
| | | 0.56 | 95.3 | 1.19 | 27.3 | 94.8 | 0.89 | 39.7 | 94.6 | 1.00 | 26.9 | 94.5 | 1.29 | 39.1 |
| | | 0.64 | 95.3 | 1.24 | 89.7 | 95.0 | 1.33 | 98.1 | 94.4 | 1.20 | 88.4 | 94.6 | 1.45 | 97.3 |
| | | 0.71 | 94.4 | 1.95 | 99.9 | 94.8 | 1.60 | 100 | 94.9 | 1.55 | 99.8 | 94.5 | 1.62 | 100 |
| | 0.5 | 0.50 | 94.7 | 0.89 | 5.3 | 95.0 | 0.92 | 5.0 | 94.0 | 1.00 | 6.0 | 94.5 | 0.77 | 5.5 |
| | | 0.56 | 94.9 | 1.13 | 24.9 | 94.9 | 1.43 | 38.4 | 94.7 | 0.86 | 23.8 | 93.3 | 1.16 | 35.9 |
| | | 0.64 | 95.4 | 1.47 | 85.6 | 94.8 | 1.32 | 97.2 | 95.3 | 2.00 | 85.0 | 94.3 | 1.38 | 96.0 |
| | | 0.71 | 95.7 | 1.87 | 99.7 | 94.3 | 1.85 | 100 | 94.8 | 1.89 | 99.5 | 94.7 | 1.65 | 100 |
| | 0.8 | 0.50 | 95.6 | 1.20 | 4.4 | 94.7 | 0.96 | 5.3 | 94.8 | 0.79 | 5.2 | 94.0 | 0.90 | 6.0 |
| | | 0.56 | 94.9 | 1.04 | 22.0 | 94.9 | 1.04 | 34.8 | 94.3 | 1.19 | 22.1 | 94.8 | 0.79 | 33.3 |
| | | 0.64 | 95.5 | 1.37 | 81.5 | 94.8 | 1.36 | 96.1 | 95.3 | 1.76 | 80.9 | 94.3 | 1.11 | 95.0 |
| | | 0.71 | 95.6 | 1.59 | 99.3 | 94.6 | 1.50 | 100 | 94.9 | 1.68 | 99.0 | 94.0 | 1.61 | 100 |
| 30 | 0.2 | 0.50 | 95.0 | 1.00 | 5.0 | 94.7 | 0.96 | 5.3 | 94.8 | 0.79 | 5.2 | 94.2 | 1.07 | 5.8 |
| | | 0.56 | 95.0 | 1.08 | 39.6 | 95.2 | 1.09 | 56.1 | 94.5 | 0.90 | 38.1 | 94.4 | 1.24 | 54.7 |
| | | 0.64 | 94.7 | 1.52 | 98.0 | 94.8 | 1.17 | 99.9 | 94.2 | 1.28 | 97.2 | 94.6 | 1.00 | 99.8 |
| | | 0.71 | 95.3 | 1.24 | 100 | 95.7 | 1.53 | 100 | 94.5 | 1.80 | 100 | 94.5 | 1.43 | 100 |
| | 0.5 | 0.50 | 95.3 | 1.00 | 4.7 | 95.1 | 0.75 | 4.9 | 94.7 | 1.17 | 5.3 | 94.7 | 0.86 | 5.3 |
| | | 0.56 | 95.2 | 1.14 | 35.4 | 94.9 | 1.04 | 52.8 | 94.8 | 0.93 | 34.8 | 94.3 | 1.00 | 49.5 |
| | | 0.64 | 94.9 | 1.43 | 96.7 | 95.3 | 1.24 | 99.7 | 94.3 | 1.48 | 95.9 | 94.5 | 1.33 | 99.7 |
| | | 0.71 | 95.7 | 1.87 | 100 | 95.6 | 1.44 | 100 | 94.7 | 1.52 | 100 | 94.4 | 1.55 | 100 |
| | 0.8 | 0.50 | 95.1 | 0.96 | 4.9 | 95.4 | 0.92 | 4.6 | 94.8 | 1.00 | 5.2 | 94.6 | 0.93 | 5.4 |
| | | 0.56 | 95.3 | 0.96 | 30.9 | 94.8 | 1.08 | 51.5 | 94.8 | 1.08 | 31.5 | 94.2 | 1.23 | 48.7 |
| | | 0.64 | 94.9 | 1.27 | 94.7 | 95.2 | 1.14 | 99.6 | 95.0 | 1.17 | 94.0 | 93.9 | 1.10 | 99.3 |
| | | 0.71 | 94.7 | 1.94 | 99.9 | 95.7 | 1.69 | 100 | 94.5 | 1.62 | 100 | 94.3 | 1.71 | 100 |
| 50 | 0.2 | 0.50 | 95.1 | 1.13 | 4.9 | 94.7 | 0.93 | 5.3 | 94.3 | 1.11 | 5.7 | 94.0 | 1.00 | 6.0 |
| | | 0.56 | 94.8 | 1.08 | 60.3 | 94.7 | 0.93 | 77.7 | 94.4 | 1.07 | 59.0 | 94.5 | 0.83 | 75.5 |
| | | 0.64 | 94.5 | 1.12 | 100 | 94.8 | 1.00 | 100 | 94.5 | 1.00 | 99.9 | 94.6 | 1.08 | 100 |
| | | 0.71 | 95.3 | 1.82 | 100 | 94.8 | 1.48 | 100 | 94.8 | 1.26 | 100 | 95.2 | 1.58 | 100 |
| | 0.5 | 0.50 | 95.2 | 0.96 | 4.8 | 95.5 | 1.25 | 4.5 | 94.7 | 0.71 | 5.3 | 94.6 | 1.04 | 5.4 |
| | | 0.56 | 95.7 | 1.10 | 54.9 | 95.1 | 1.18 | 76.6 | 94.5 | 1.35 | 54.3 | 94.2 | 0.87 | 71.8 |
| | | 0.64 | 95.0 | 1.27 | 99.9 | 95.2 | 1.40 | 100 | 94.7 | 1.30 | 99.9 | 94.7 | 1.21 | 100 |
| | | 0.71 | 94.3 | 1.67 | 100 | 94.8 | 1.26 | 100 | 95.0 | 1.38 | 100 | 94.9 | 1.13 | 100 |
| | 0.8 | 0.50 | 95.3 | 0.96 | 4.7 | 94.6 | 0.80 | 5.4 | 94.9 | 1.13 | 5.1 | 93.6 | 0.85 | 6.4 |
| | | 0.56 | 94.9 | 1.32 | 50.7 | 94.9 | 0.96 | 73.4 | 94.8 | 0.96 | 50.3 | 94.3 | 0.90 | 68.3 |
| | | 0.64 | 95.3 | 1.47 | 99.6 | 95.1 | 1.27 | 100 | 94.8 | 1.26 | 99.6 | 94.3 | 1.48 | 100 |
| | | 0.71 | 95.4 | 1.88 | 100 | 95.1 | 1.72 | 100 | 95.0 | 1.78 | 100 | 94.8 | 1.65 | 100 |
Table 2.
Logit back-transformed interval estimator; Simulation results for increasing numbers of clusters , endpoint correlation , and global win probabilities under unequal allocation of clusters (), skewed endpoint distributions (), and large heterogeneity in endpoint effects ().
| Clusters | Endpoint correlation | Global win probability | Fixed cluster sizes | | | | | | Random cluster sizes | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | ECP | TER | ERR | ECP | TER | ERR | ECP | TER | ERR | ECP | TER | ERR |
| 10 | 0.2 | 0.50 | 96.4 | 1.12 | 3.6 | 94.6 | 0.80 | 5.4 | 96.4 | 1.00 | 3.6 | 94.8 | 0.86 | 5.2 |
| | | 0.56 | 96.6 | 0.84 | 11.2 | 95.2 | 0.92 | 18.2 | 96.1 | 1.05 | 12.8 | 94.9 | 0.70 | 16.5 |
| | | 0.64 | 96.2 | 0.65 | 51.5 | 95.6 | 0.91 | 72.9 | 96.4 | 0.80 | 50.7 | 95.3 | 0.88 | 68.9 |
| | | 0.71 | 96.6 | 1.12 | 88.7 | 95.6 | 0.83 | 97.8 | 95.4 | 0.70 | 85.1 | 95.2 | 1.09 | 96.2 |
| | 0.5 | 0.50 | 97.0 | 1.00 | 3.0 | 95.5 | 0.73 | 4.5 | 96.4 | 0.80 | 3.6 | 94.8 | 0.86 | 5.2 |
| | | 0.56 | 96.7 | 1.06 | 9.9 | 95.6 | 1.10 | 17.1 | 96.3 | 0.90 | 10.7 | 95.1 | 0.92 | 17.0 |
| | | 0.64 | 96.8 | 0.88 | 45.6 | 95.5 | 0.88 | 68.2 | 96.6 | 1.12 | 44.5 | 95.4 | 0.92 | 66.8 |
| | | 0.71 | 97.3 | 1.08 | 84.4 | 95.7 | 1.05 | 96.9 | 96.6 | 0.89 | 81.9 | 95.1 | 0.88 | 95.4 |
| | 0.8 | 0.50 | 97.3 | 0.80 | 2.7 | 95.8 | 0.75 | 4.2 | 97.0 | 1.00 | 3.0 | 95.3 | 0.81 | 4.7 |
| | | 0.56 | 96.9 | 0.68 | 8.4 | 95.7 | 1.26 | 16.9 | 96.4 | 0.89 | 9.9 | 95.4 | 1.09 | 16.1 |
| | | 0.64 | 97.0 | 0.71 | 40.5 | 96.0 | 0.90 | 65.4 | 97.1 | 0.75 | 39.7 | 94.9 | 0.89 | 62.1 |
| | | 0.71 | 97.3 | 0.59 | 78.1 | 96.6 | 0.70 | 95.8 | 96.8 | 0.60 | 76.7 | 95.1 | 0.78 | 93.6 |
| 20 | 0.2 | 0.50 | 95.4 | 0.84 | 4.6 | 95.6 | 0.91 | 4.4 | 95.1 | 1.00 | 4.9 | 94.7 | 0.83 | 5.3 |
| | | 0.56 | 95.7 | 0.95 | 25.9 | 94.9 | 0.79 | 38.7 | 95.0 | 0.82 | 25.7 | 94.8 | 1.13 | 37.6 |
| | | 0.64 | 95.8 | 0.68 | 88.7 | 95.5 | 0.96 | 97.9 | 94.7 | 0.86 | 87.4 | 94.8 | 1.08 | 96.9 |
| | | 0.71 | 95.1 | 1.00 | 99.9 | 95.2 | 0.92 | 100 | 95.5 | 0.84 | 99.7 | 94.9 | 1.13 | 100 |
| | 0.5 | 0.50 | 95.2 | 0.92 | 4.8 | 95.2 | 0.88 | 4.8 | 94.6 | 1.00 | 5.4 | 94.7 | 0.77 | 5.3 |
| | | 0.56 | 95.3 | 0.96 | 23.2 | 95.3 | 1.24 | 36.9 | 95.1 | 0.75 | 22.6 | 93.6 | 1.03 | 34.9 |
| | | 0.64 | 96.0 | 1.00 | 84.1 | 94.9 | 0.96 | 97.0 | 95.8 | 1.21 | 83.1 | 94.7 | 1.04 | 95.5 |
| | | 0.71 | 96.3 | 0.95 | 99.5 | 94.7 | 1.12 | 100 | 95.4 | 1.00 | 99.4 | 95.1 | 0.96 | 100 |
| | 0.8 | 0.50 | 96.0 | 1.22 | 4.0 | 95.0 | 1.00 | 5.0 | 95.2 | 0.78 | 4.8 | 94.3 | 0.97 | 5.7 |
| | | 0.56 | 95.5 | 0.80 | 20.6 | 95.1 | 0.96 | 33.8 | 94.9 | 0.96 | 20.7 | 95.0 | 0.72 | 32.1 |
| | | 0.64 | 96.2 | 0.76 | 79.9 | 95.2 | 0.96 | 95.6 | 96.1 | 1.05 | 79.2 | 94.7 | 0.83 | 94.4 |
| | | 0.71 | 95.9 | 0.71 | 99.2 | 95.0 | 0.88 | 100 | 95.6 | 0.83 | 98.8 | 94.4 | 0.87 | 100 |
| 30 | 0.2 | 0.50 | 95.2 | 1.00 | 4.8 | 94.9 | 0.93 | 5.1 | 95.0 | 0.79 | 5.0 | 94.3 | 1.07 | 5.7 |
| | | 0.56 | 95.3 | 0.88 | 38.7 | 95.4 | 1.00 | 55.4 | 94.8 | 0.79 | 37.0 | 94.7 | 1.08 | 54.0 |
| | | 0.64 | 95.1 | 1.13 | 97.8 | 95.0 | 0.88 | 99.8 | 94.5 | 0.90 | 97.1 | 94.7 | 0.77 | 99.8 |
| | | 0.71 | 95.7 | 0.72 | 100 | 95.9 | 1.11 | 100 | 95.0 | 1.08 | 100 | 94.9 | 0.89 | 100 |
| | 0.5 | 0.50 | 95.5 | 1.05 | 4.5 | 95.3 | 0.78 | 4.7 | 95.1 | 1.13 | 4.9 | 94.8 | 0.86 | 5.2 |
| | | 0.56 | 95.5 | 1.00 | 34.2 | 95.1 | 0.96 | 52.1 | 95.0 | 0.79 | 33.4 | 94.4 | 0.90 | 48.8 |
| | | 0.64 | 95.1 | 0.96 | 96.4 | 95.6 | 0.91 | 99.7 | 94.7 | 1.04 | 95.5 | 94.8 | 1.00 | 99.6 |
| | | 0.71 | 96.1 | 0.95 | 100 | 95.6 | 1.00 | 100 | 95.2 | 0.78 | 100 | 94.8 | 1.08 | 100 |
| | 0.8 | 0.50 | 95.3 | 0.96 | 4.7 | 95.6 | 0.96 | 4.4 | 95.2 | 0.92 | 4.8 | 94.7 | 0.96 | 5.3 |
| | | 0.56 | 95.6 | 0.83 | 30.0 | 95.0 | 1.00 | 50.7 | 95.2 | 0.92 | 30.5 | 94.3 | 1.07 | 47.5 |
| | | 0.64 | 95.3 | 0.96 | 93.9 | 95.5 | 0.88 | 99.6 | 95.0 | 0.85 | 93.6 | 94.1 | 0.90 | 99.3 |
| | | 0.71 | 95.3 | 1.04 | 99.9 | 96.1 | 1.05 | 100 | 95.1 | 0.96 | 100 | 94.7 | 1.12 | 100 |
| 50 | 0.2 | 0.50 | 95.3 | 1.14 | 4.7 | 94.9 | 0.96 | 5.1 | 94.4 | 1.15 | 5.6 | 94.1 | 0.97 | 5.9 |
| | | 0.56 | 95.0 | 0.96 | 59.9 | 94.7 | 0.89 | 77.4 | 94.6 | 1.00 | 58.3 | 94.7 | 0.71 | 75.2 |
| | | 0.64 | 94.7 | 0.96 | 100 | 94.9 | 0.79 | 100 | 94.7 | 0.77 | 99.9 | 94.6 | 0.93 | 100 |
| | | 0.71 | 95.6 | 1.25 | 100 | 95.2 | 1.09 | 100 | 94.9 | 0.76 | 100 | 95.4 | 1.24 | 100 |
| | 0.5 | 0.50 | 95.4 | 1.00 | 4.6 | 95.5 | 1.20 | 4.5 | 94.9 | 0.70 | 5.1 | 94.7 | 1.00 | 5.3 |
| | | 0.56 | 95.8 | 0.95 | 54.2 | 95.2 | 1.09 | 76.4 | 94.7 | 1.21 | 53.5 | 94.3 | 0.81 | 71.5 |
| | | 0.64 | 95.1 | 0.88 | 99.8 | 95.2 | 1.18 | 100 | 95.0 | 0.92 | 99.9 | 94.9 | 0.96 | 100 |
| | | 0.71 | 95.0 | 1.00 | 100 | 95.1 | 0.88 | 100 | 95.2 | 0.85 | 100 | 94.8 | 0.79 | 100 |
| | 0.8 | 0.50 | 95.6 | 0.91 | 4.4 | 94.7 | 0.77 | 5.3 | 95.2 | 1.09 | 4.8 | 93.8 | 0.85 | 6.2 |
| | | 0.56 | 95.3 | 1.09 | 49.7 | 95.0 | 0.92 | 73.0 | 95.1 | 0.88 | 49.6 | 94.3 | 0.84 | 67.9 |
| | | 0.64 | 95.5 | 1.00 | 99.6 | 95.3 | 0.96 | 100 | 95.1 | 1.04 | 99.5 | 94.8 | 1.17 | 100 |
| | | 0.71 | 95.7 | 1.05 | 100 | 95.3 | 1.24 | 100 | 95.5 | 1.00 | 100 | 94.9 | 1.13 | 100 |
ECP: Empirical Coverage Probability, TER: Tail Error Ratio, ERR: Empirical Rejection Rate.
Figure 1.

Visual summary of empirical coverage probability (ECP) of logit back-transformed interval estimator across all 1,536 simulation scenarios and values of global win probability . Grey band represents the acceptable range of ECP from 94.4% to 95.6%. The black, smoothed curves were fit via GAM. Jitter applied on x-axis for presentation.
Empirical coverage probability (ECP) generally fell within the acceptable range of 94.4% to 95.6% and the type I error rate within 4.4% to 5.6%. The only exceptions were scenarios featuring the fewest clusters with the smallest cluster size (), which resulted in conservative coverage (> 95%) and type I error rates (< 5%). Desired coverage and type I error rates were consistently obtained for increasing numbers of clusters and larger cluster sizes . This is likely a result of the large-sample nature of our methodology (see Appendix A); the closely related DeLong variance estimator36 is likewise known to be conservative for small sample sizes.49 Performance was also negatively impacted by random cluster sizes, as should be expected.33,43
The tail error ratio (TER) was almost always greater than 1 for the identity interval estimator, suggesting that the left tail error rate has a tendency to exceed the right tail error rate. Discrepancies between tail errors appeared to increase as the global win probability increased or the number of clusters or cluster size decreased. When , use of the logit back-transformed interval estimator improved balance of the tail errors considerably while producing desirable ECPs and ERRs. When , logit back-transformation also improved TERs, but appeared to exacerbate the conservative nature of our method when , improving as increased.
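The three performance measures discussed above can be computed from a set of simulated intervals as in the following minimal R sketch. The function name `summarize_intervals` and the toy limits are hypothetical, and the definitions are assumptions made for illustration: a "left" tail error is taken to mean that the true value falls below the lower limit, TER is taken as the left tail error rate divided by the right tail error rate, and ERR as the percentage of intervals excluding the null value of 0.5.

```r
# Summarize simulated confidence intervals for a known true win probability.
# Assumed definitions (not taken verbatim from the paper):
#   ECP = % of intervals containing the truth
#   TER = left tail error rate / right tail error rate
#   ERR = % of intervals excluding the null value
summarize_intervals <- function(lower, upper, truth, null = 0.5) {
  miss_left  <- mean(truth < lower)  # truth falls below the lower limit
  miss_right <- mean(truth > upper)  # truth falls above the upper limit
  c(ECP = 100 * mean(lower <= truth & truth <= upper),
    TER = miss_left / miss_right,
    ERR = 100 * mean(lower > null | upper < null))
}

# Hypothetical interval limits from four simulated trials
lower <- c(0.50, 0.58, 0.40, 0.45)
upper <- c(0.60, 0.70, 0.52, 0.55)
res <- summarize_intervals(lower, upper, truth = 0.56)
print(res)
```

When the true win probability equals 0.5, ERR is the empirical type I error rate and equals 100 minus ECP, which matches the pattern in the rows of Tables 1 and 2 with a global win probability of 0.50.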
Increasing the pairwise correlation of the endpoints appeared to decrease the power to detect a global treatment effect, as should be expected. However, our method appeared generally powerful. When , an empirical rejection rate (power) of approximately 70% to detect a “moderate” global effect was achieved with as few as total clusters, while 80% power to detect a “small” global effect () required approximately clusters.
4. Case study: SHARE
4.1. Motivation
The Sexual Health and Relationships: Safe, Happy and Responsible (SHARE) trial aimed to determine whether an experimental sexual education curriculum reduced unsafe sex among students when compared to the existing curriculum.22 Schools in Scotland were randomized as clusters to be trained according to the experimental SHARE curriculum or not. This case study uses a subset of the trial data originally accessed from the Harvard Dataverse, with a link provided in the Appendix alongside reproducible R and SAS code.
Two endpoints are considered here: (1) knowledge, an ordinal endpoint ranging from −8 to 8, or least sexual health knowledge to most; and (2) activity, a binary endpoint taking the value 1 if the student was sexually active during follow-up and 0 otherwise. Suppose investigators wish to estimate the global treatment effect, assigning weights of 0.7 to knowledge and 0.3 to activity, respectively. Traditional composite approaches may dichotomize knowledge to combine it with activity, potentially resulting in ceiling or floor effects, or ignore differences in priority.
4.2. Descriptives by endpoint
We consider and clusters assigned to conventional and experimental curriculum, respectively. A total of 4,823 students (N0 = 2,474 and N1 = 2,349; see Table 3) with a completely observed bivariate response were retained for analysis, resulting in an average of 193 students per cluster (). Table 3 provides a summary of the observed endpoints, their corresponding win fractions, and correlation. The win fraction ICCs of knowledge and activity were both 0.028, and the corresponding observed ICCs were 0.031 and 0.028.
Table 3.
Summary of observed knowledge and activity scores and win fractions for N0 = 2,474 and N1 = 2,349 individuals receiving conventional and experimental curriculum within SHARE.
| | Endpoint | Mean (SD), Conventional | Mean (SD), Experimental | Correlation with Knowledge | Correlation with Activity |
|---|---|---|---|---|---|
| Obs. scores | Knowledge | 4.10 (2.36) | 4.75 (2.28) | 1.00 | 0.08 |
| | Activity | 0.27 (0.44) | 0.27 (0.44) | 0.08 | 1.00 |
| Win fractions | Knowledge | 0.42 (0.28) | 0.58 (0.28) | 1.00 | −0.08 |
| | Activity | 0.50 (0.22) | 0.50 (0.22) | −0.08 | 1.00 |
4.3. Estimates and interpretation
The use of SAS PROC MIXED to estimate the two-level mixed model presented in Equation (9) provides with on degrees of freedom, and . The corresponding weighted global win probability estimate is with and a logit back-transformed 95% confidence interval of 0.517 to 0.587. The estimated ICC of the global win fractions was 0.037.
Formally, the estimated probability that a student receiving the experimental curriculum responds better than a student receiving the existing curriculum is 55.2% (95% CI: 51.7% to 58.7%), on average, with respect to the two weighted endpoints of sexual knowledge and activity. Since the null value of 0.5 is excluded from the 95% confidence interval, there is statistically significant evidence at the 5% level to reject the null hypothesis of no weighted global treatment effect. For reference, if equal weights are assigned to each endpoint, the estimated global win probability is 0.537 (95% CI: 0.506 to 0.568).
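The two interval constructions compared in this paper can be sketched in a few lines of R, following the formulas used in the code of Appendix B: the identity interval is the estimate plus or minus a t quantile times the standard error, while the logit interval is built on the log-odds scale with standard error SE/(θ(1 − θ)) and then back-transformed. The function names and the numerical inputs below are hypothetical and chosen only to show how the two intervals can differ near the boundary.

```r
# Identity (untransformed) interval: estimate +/- t * SE
identity_ci <- function(est, se, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  est + c(-1, 1) * t_crit * se
}

# Logit back-transformed interval, as in the Appendix B code
logit_ci <- function(est, se, df, level = 0.95) {
  t_crit   <- qt(1 - (1 - level) / 2, df)
  se_logit <- se / (est * (1 - est))        # delta-method SE on the logit scale
  limits   <- qlogis(est) + c(-1, 1) * t_crit * se_logit
  plogis(limits)                            # back-transform to the probability scale
}

# Hypothetical inputs: large effect, few clusters
ci_identity <- identity_ci(est = 0.9, se = 0.06, df = 8)
ci_logit    <- logit_ci(est = 0.9, se = 0.06, df = 8)
print(rbind(identity = ci_identity, logit = ci_logit))
```

With these inputs the identity interval extends above 1, whereas the logit interval stays inside (0, 1) and is asymmetric around the estimate.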
Alternatively, the estimated global win difference is to 0.171), suggesting that the probability of winning on treatment is approximately 10.4 percentage points greater than that on control. The estimated win odds are ; 95% CI: 1.076 to 1.410), suggesting that the treatment win probability is approximately 23% greater than the control win probability.
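The point estimates quoted above can be recovered from the reported global win probability of 0.552. The sketch below assumes the usual relationships between win measures when ties are split evenly, namely a win difference of 2θ − 1 and win odds of θ/(1 − θ); only point estimates are shown, and the variable names are illustrative.

```r
theta <- 0.552                       # reported weighted global win probability

win_difference <- 2 * theta - 1      # P(win) - P(loss), ties split evenly
win_odds       <- theta / (1 - theta)

print(round(c(win_difference = win_difference, win_odds = win_odds), 3))
```

These agree with the 10.4 percentage point difference and the roughly 23% relative increase described in the text.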
5. Discussion
In this paper, we address the challenges of analyzing cluster randomized trials with multiple endpoints of different scales by developing interval estimation and hypothesis testing methods for a nonparametric global treatment effect, referred to here as the global win probability. The global win probability directly quantifies the objective of most trials, which is to determine if patients receiving treatment have better overall health outcomes than those on control or the current standard of care. The global win probability only compares the ordering of responses, rather than their size or difference, and is applicable to any set of endpoints that can be ranked, including binary, ordinal, count, and continuous endpoints. Estimation is also robust to monotonic transformation, unlike mean-based methods, which may yield differing or even conflicting results pre- and post-transformation.
The developed methods are simple, as they bypass the need to consider complex correlation structures between-endpoint and within-cluster, and accessible, as they may be implemented using standard statistical software. To estimate the global win probability, a single rank-based global win fraction is constructed for each individual within the cluster trial as the within-subject mean of their endpoint-specific win fractions. Unlike alternative composite measures like the rank-sum, global win fractions are interpretable at the individual level as the average proportion of responses within the comparator arm exceeded by a given individual. Weights may also be incorporated within construction of the global win fractions to reflect differences in endpoint utility or priority, though they are not required. The mixed model estimation framework previously introduced by Zou18 for a single win probability is then easily applied to the univariate global win fractions to obtain point, variance, and interval estimators of the global win probability adjusted for intracluster correlation. Simulation suggests interval estimators respect nominal coverage and type I error rates across a range of cluster randomized designs.
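The construction of win fractions from ranks can be illustrated with a small base R sketch. The toy data and function names are hypothetical; the rank formula (overall midrank minus within-arm midrank, divided by the size of the other arm) mirrors the code of Appendix B, and the brute-force version, which counts a tie as half a win, is included only to check it.

```r
# Win fraction via ranks: (overall midrank - within-arm midrank) / size of other arm
win_fraction <- function(y, arm) {
  overall <- rank(y)                   # midranks in the combined sample
  within  <- ave(y, arm, FUN = rank)   # midranks within each arm
  n_other <- ifelse(arm == 0, sum(arm == 1), sum(arm == 0))
  (overall - within) / n_other
}

# Same quantity by brute force: proportion of wins over the comparator arm,
# counting a tie as half a win
win_fraction_direct <- function(y, arm) {
  sapply(seq_along(y), function(i) {
    other <- y[arm != arm[i]]
    mean((y[i] > other) + 0.5 * (y[i] == other))
  })
}

# Hypothetical toy data: 4 control (arm = 0) and 3 treatment (arm = 1) individuals
arm       <- c(0, 0, 0, 0, 1, 1, 1)
knowledge <- c(3, 5, 5, 7, 5, 8, 9)    # higher is better
active    <- c(1, 0, 1, 0, 0, 0, 1)    # 1 is worse, so its negative is ranked

w  <- c(0.7, 0.3)                      # endpoint weights, as in the Appendix B code
y1 <- win_fraction(knowledge, arm)
y2 <- win_fraction(-active, arm)
yg <- w[1] * y1 + w[2] * y2            # global win fraction per individual

# Without clustering, the treatment-arm mean estimates the global win probability
theta_hat <- mean(yg[arm == 1])
print(theta_hat)
```

In this unclustered toy example the treatment-arm and control-arm means of the global win fractions sum to one, which is why the arm coefficient of the mixed model can be converted to a win probability as (coefficient + 1)/2 in the Appendix B code.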
The flexible mixed model framework employed also permits the consideration of more complex models. For example, Zou et al recently detailed how win fraction regression methods could be used to analyze individually randomized pre-post designs in a fashion analogous to ANCOVA.21 A similar approach could be taken here by, e.g., regressing global win fractions at follow-up on global win fractions at baseline, potentially leading to additional boosts in power.50 Investigation into additional baseline adjustment strategies, e.g., for precision or stratification, represents an area of future work. Multivariate linear mixed models for the win fractions would also permit the construction of a -df test, use of more complex covariance structures, or assessment of global treatment effects over time.
Another advantage of using the mixed model framework is that the form of sample size estimators is relatively simple. The sample size for a corresponding individually randomized trial can be scaled by a design effect, i.e., . Sample size formulas for a single win probability within individually randomized trials were recently provided by Zou et al.19 Relationships between endpoint distributions or alternative measures and the win probability are well known, with a nice summary provided by Rahlfs and Zimmerman23, for example. Thus, parametric treatment effect estimates reported in the literature may be transformed into endpoint-specific win probabilities and averaged to obtain an approximate global win probability for sample size estimation. Standard errors may also be obtained; see Shu & Zou, for example.51 However, further investigation into the true relationship between the observed and win fraction ICCs is needed. Zou et al21 suggested use of the observed pre-post correlation for sample size. Using the largest, conservative, endpoint ICC estimate may be a reasonable strategy, for now.
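As a rough illustration of this planning strategy, the R sketch below converts standardized mean differences into endpoint-specific win probabilities, averages them, and inflates an individually randomized sample size by a design effect. Two assumptions are made here that are not taken from the paper: endpoints are treated as normal with equal variances, so that the win probability is Φ(d/√2) for a standardized mean difference d, and the design effect is taken to be the familiar 1 + (m − 1) × ICC for a common cluster size m. All numerical inputs are hypothetical.

```r
# Assumed relationship for normal endpoints with equal variances:
# win probability = pnorm(d / sqrt(2)) for standardized mean difference d
smd            <- c(0.2, 0.5, 0.8)           # hypothetical effects from the literature
theta_endpoint <- pnorm(smd / sqrt(2))       # endpoint-specific win probabilities
theta_global   <- mean(theta_endpoint)       # equally weighted global win probability

# Assumed design effect for equal cluster sizes m and intracluster correlation icc
design_effect <- function(m, icc) 1 + (m - 1) * icc

n_individual <- 400                          # hypothetical individually randomized total
de           <- design_effect(m = 50, icc = 0.03)
n_clustered  <- n_individual * de            # round up to whole clusters in practice

print(round(theta_endpoint, 2))
print(c(design_effect = de, n_clustered = n_clustered))
```

Under these assumptions, standardized mean differences of 0.2, 0.5, and 0.8 map to win probabilities of about 0.56, 0.64, and 0.71, the values examined in Tables 1 and 2.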
If desired, the global win probability can also be directly transformed into other popular, alternative effect measures encountered in medicine. This includes related win measures such as the win difference or win odds,25,29 or more traditional effects such as the risk difference or standardized mean difference.23 The global win probability is inspired by the nonparametric rank-sum test introduced by O’Brien7, and as demonstrated here, the mean difference in rank-sums is a linear transformation of global win probability estimators. Zou & Zou17 proposed methods for estimating the global win probability in individually randomized trials with missing data. Expansion of their methods to the current context represents another area of future work.
The inability of some composites to reflect differences in endpoint priority is a commonly cited concern.52 Endpoint priority may differ when, e.g., endpoint utility, importance, or severity differs. Hence, we have expanded our presentation of the global win probability to permit contribution weights for each endpoint. However, these weights are not necessary to apply our methods. Furthermore, while relationships with other win-based effects are discussed for a single endpoint, their conceptual papers25,27 evaluate multiple prioritized endpoints in a fundamentally different way using generalized pairwise comparisons (GPC). Explicit priority ordering of the endpoints must be specified to use GPC approaches, unlike our method. As pointed out by Rauch et al,52 estimates based on GPC may be challenging to interpret.
Acknowledgements
This work was partially supported by the National Institute of Allergy and Infectious Diseases (NIAID) at the National Institutes of Health (NIH) [T32 AI007358]. Most of this work was completed while Emma Davies Smith was affiliated with Western University. Thank you to Dr. Yun-Hee Choi of Western University for her helpful comments.
A. Asymptotic correlation of clustered win fractions
Perhaps surprisingly, win fractions are asymptotically uncorrelated between cluster and arm, but not within cluster due to intracluster correlation. Here, we provide a sketch proof examining the asymptotic covariance, and hence correlation, between win fractions in these three situations: within cluster, within arm between cluster, and between arm. In the following sketch it is important to remember that individuals within the same cluster are identically distributed and share an exchangeable correlation structure, and individuals in different clusters are independent as a result of cluster randomization.
For simplicity, we consider only a single endpoint so , but extension to the global win fraction is straightforward. Let represent the win fraction of the individual in the treatment cluster (). Modifying the notation of the control responses and win fractions, let represent the win fraction of the individual in the control cluster and the control response (). Finally, let represent the Heaviside scoring function.
Recall there are responses within the treatment cluster, and responses within the control cluster. There are a total of treatment responses and control responses. Each treatment response is compared to all control responses when constructing a win fraction. Similarly, each control response is compared to all treatment responses.
A.1. Within cluster
The covariance of two win fractions within the same cluster is defined as,
The first arguments of the and will be correlated for all terms due to intracluster correlation. The second arguments will be correlated only when they belong to the same control cluster . See the top panel of Figure 2 for a visualization of the corresponding covariance structure. Counting like terms, we have:
Same cluster, same response: will be correlated with respect to and perfectly correlated with respect to for terms of the form .
Same cluster, same cluster: will be correlated with respect to both and for terms of the form .
Same cluster, different cluster: will be correlated with respect to only for terms of the form .
Figure 2.

From top to bottom: covariance pattern of kernel terms (a) within the same arm and cluster; (b) within the same arm but different clusters; and (c) in different arms and clusters. Darkest shading: covariation in and terms; lighter: covariation in only; lightest: covariation in only; none: no covariation. Striping indicates perfect covariation in and/or .
Putting this together,
| (12) |
and because we have:
Because , we have . Thus, as the number of control individuals increases towards infinity (),
so the covariance of win fractions within the same arm is dependent only on their intracluster correlation for large samples.
A.2. Within arm, between cluster
The covariance of two win fractions within the same arm but different clusters is defined as,
The first arguments of the and will be uncorrelated for all terms. The second arguments will be correlated only when they belong to the same control cluster . See the middle panel of Figure 2 for a visualization of the corresponding covariance structure. Counting like terms, we have:
Different cluster, same response: will be perfectly correlated with respect to only for terms of the form .
Different cluster, same cluster: will be correlated with respect to only for terms of the form .
Different cluster, different cluster: will be uncorrelated for terms of the form .
Putting this together,
so as , and win fractions within the same arm but different clusters are asymptotically uncorrelated.
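The vanishing correlation can be checked numerically in a deliberately simplified setting. The R sketch below is an illustration under stated assumptions, not the setting of the proof: all responses are independent standard uniform draws (so every individual is effectively its own cluster and there is no treatment effect), and two treatment individuals are compared against the same N0 control responses. Under these assumptions a direct calculation gives a correlation of 1/(N0 + 2) between the two win fractions, which shrinks towards zero as N0 grows. The function name `sim_corr` is hypothetical.

```r
set.seed(2024)

# Correlation between the win fractions of two independent treatment individuals
# who are both compared against the same n0 control responses
sim_corr <- function(n0, reps = 20000) {
  x1 <- runif(reps)                            # first treatment response
  x2 <- runif(reps)                            # second, independent of the first
  z  <- matrix(runif(reps * n0), nrow = reps)  # shared controls, one row per replicate
  y1 <- rowMeans(z < x1)                       # win fraction of individual 1 (row i uses x1[i])
  y2 <- rowMeans(z < x2)                       # win fraction of individual 2
  cor(y1, y2)
}

r_small <- sim_corr(5)    # theory under these assumptions: 1/7
r_large <- sim_corr(50)   # theory under these assumptions: 1/52

print(c(n0_5 = r_small, n0_50 = r_large))
```

The shared comparator sample is the only source of dependence here, so the correlation is of order 1/N0, in line with the argument above.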
A.3. Between arm
The covariance of two win fractions within different arms is defined as,
See the bottom panel of Figure 2 for a visualization of the corresponding covariance structure.
Counting like terms, we have:
and once again, we have that as .
A.4. Variance of a treatment win fraction
For two win fractions within different clusters in the same arm, for example, recall the relationship:
In the previous sections, we have shown as . If is asymptotically finite and non-zero, it thus follows that as . Here, we explore the asymptotic nature of .
The variance of a single win fraction is defined as,
The first arguments of the will be perfectly correlated for all terms. The second arguments will be correlated only when they belong to the same control cluster . Counting like terms:
Same response, same response: will be perfectly correlated with respect to and for terms of the form .
Same response, same cluster: will be perfectly correlated in and correlated with for terms of the form .
Same response, different cluster: will be perfectly correlated in and uncorrelated in for () terms of the form .
Putting this together, we have:
As ,
so the variance of a treatment win fraction approaches a finite, non-zero function of the variance of the treatment response only. It follows that as . Similar arguments can be used to show that as and for two win fractions within the same cluster,
B. SHARE
Data from the SHARE trial can be downloaded from the Harvard Dataverse in tab-delimited format (share.tab) at https://dataverse.harvard.edu/dataverse/crt.
B.1. R code
    library(dplyr)  # Data manipulation
    library(nlme)   # Mixed models

    # IMPORT: SHARE data
    share_raw <- read.table('share.tab', sep = '\t', header = TRUE)

    # EXTRACT: identifiers and outcomes (Note: school = cluster)
    share <- share_raw %>%
      select(arm, school, idno, 'knowledge' = kscore, 'active' = debut) %>%
      na.omit()

    # EXTRACT: Number of students in each arm
    N0 <- nrow(share[share$arm == 0, ])  # Conventional
    N1 <- nrow(share[share$arm == 1, ])  # Experimental

    # CONSTRUCT: overall and group ranks for each endpoint
    # Note: increased knowledge = good, increased active = bad
    ranks <- share %>%
      mutate(R1 = rank(knowledge), R2 = rank(-active)) %>%
      group_by(arm) %>%
      mutate(G1 = rank(knowledge), G2 = rank(-active))

    # Endpoint weights
    w1 = 0.7; w2 = 0.3

    # CONSTRUCT: endpoint and global win fractions
    winf <- ranks %>%
      mutate(Y1 = ifelse(arm == 0, (R1 - G1) / N1, (R1 - G1) / N0),
             Y2 = ifelse(arm == 0, (R2 - G2) / N1, (R2 - G2) / N0),
             YG = w1 * Y1 + w2 * Y2)

    # FIT: linear mixed model for global win fractions
    modG <- lme(YG ~ arm, random = ~1 | school, data = winf)

    # EXTRACT: Fixed effect estimates, their variance, and df
    modG_fest <- fixef(modG)
    modG_fvar <- vcov(modG)
    modG_fdf <- modG$fixDF$X  # Note: Arm df = C-2 by default

    # EXTRACT: Random effect variance components
    modG_rvar <- matrix(as.numeric(VarCorr(modG)), ncol = 2)

    # CONSTRUCT: Global win probability point, variance, and ICC est
    estG <- (modG_fest[2] + 1) / 2
    seG <- sqrt(modG_fvar[2, 2])
    iccG <- modG_rvar[1, 1] / (modG_rvar[1, 1] + modG_rvar[2, 1])

    # CONSTRUCT: Global win probability interval estimates
    alpha <- 0.05; t <- qt(1 - alpha / 2, modG_fdf[2])
    untransform_ci <- estG + c(-1, 1) * t * seG
    logit_lu <- log(estG / (1 - estG)) + c(-1, 1) * t * seG / (estG * (1 - estG))
    logit_ci <- exp(logit_lu) / (1 + exp(logit_lu))

    # REPORT: Estimates
    report_point <- paste0('Est. global win probability = ', estG,
                           ' (Est. SE = ', seG, ', df = ', modG_fdf[2], ')')
    report_icc <- paste0('Est. global ICC = ', iccG)
    report_uci <- paste0('95% untransformed confidence interval = (',
                         untransform_ci[1], ', ', untransform_ci[2], ')')
    report_lci <- paste0('95% logit confidence interval = (',
                         logit_ci[1], ', ', logit_ci[2], ')')
    cat(paste(report_point, report_uci, report_lci, report_icc, sep = '\n'))
B.2. SAS code
    PROC IMPORT DATAFILE="share.tab" OUT=share DBMS=DLM REPLACE;
      DELIMITER='09'x;
    RUN;

    DATA share;
      SET share;
      knowledge = kscore;
      * Reverse code active since higher => worse;
      active = -1 * debut;
      * Exclude individuals with incomplete responses;
      IF (knowledge ne .) and (active ne .) THEN OUTPUT;
      KEEP arm school idno knowledge active;
    RUN;

    * Sort data by Arm (0 = Conventional, 1 = Experimental);
    PROC SORT DATA=share; BY arm; RUN;

    * Obtain number of individuals in each arm;
    PROC FREQ DATA=share NOPRINT;
      TABLE arm / OUT=SampleSize(DROP=percent);
    RUN;

    * Reverse Arm labels for win fraction denominator;
    DATA Denominator; SET SampleSize; arm = 1 - arm; RUN;

    * Sort for merge;
    PROC SORT DATA=Denominator; BY arm; RUN;

    * Overall (mid)ranks for each endpoint;
    PROC RANK DATA=share OUT=O_ranks TIES=mean;
      VAR knowledge active;
      RANKS O1 O2;
    RUN;

    * Group (mid)ranks for each endpoint;
    PROC RANK DATA=share OUT=G_ranks TIES=mean;
      BY arm;
      VAR knowledge active;
      RANKS G1 G2;
    RUN;

    * Calculate endpoint and global win fractions;
    DATA WinF;
      MERGE O_ranks G_ranks Denominator;
      BY arm;
      Y1 = (O1-G1)/COUNT;
      Y2 = (O2-G2)/COUNT;
      YG = 0.7*Y1 + 0.3*Y2;
    RUN;

    * Fit linear mixed model for global win fractions;
    PROC MIXED DATA=WinF NOITPRINT NOCLPRINT;
      CLASS arm school idno / REF=first;
      MODEL YG = arm / solution;
      RANDOM intercept / SUBJECT=school(arm);
      ODS OUTPUT SolutionF = FixEff(KEEP=Arm Estimate StdErr DF);
      ODS OUTPUT CovParms = CovParms;
    RUN;

    * Extract estimates;
    DATA VarInt VarRes;
      SET CovParms;
      IF CovParm='Intercept' THEN OUTPUT VarInt;
      IF CovParm='Residual' THEN OUTPUT VarRes;
      DROP CovParm Subject;
    RUN;

    DATA FixEff;
      SET FixEff;
      Beta1 = Estimate;
      IF Arm = 1 THEN OUTPUT;
    RUN;

    DATA GlobalEstimates;
      MERGE FixEff VarInt(RENAME=(Estimate=VarAlpha)) VarRes(RENAME=(Estimate=VarEps));
      EstG = (Beta1 + 1)/2;
      SeG = StdErr;
      DfG = DF;
      t = tinv(1-0.05/2, DF);
      * Untransformed 95% confidence interval;
      L1 = EstG - t * SeG;
      U1 = EstG + t * SeG;
      * Logit 95% confidence interval;
      L2 = log(EstG/(1-EstG)) - t * SeG / (EstG * (1-EstG));
      U2 = log(EstG/(1-EstG)) + t * SeG / (EstG * (1-EstG));
      L2 = exp(L2) / (1 + exp(L2));
      U2 = exp(U2) / (1 + exp(U2));
      * Intracluster correlation of global win fractions;
      IccG = VarAlpha / (VarAlpha + VarEps);
      KEEP EstG SeG DfG L1 U1 L2 U2 IccG;
    RUN;

    PROC PRINT DATA=GlobalEstimates; RUN;
References
- 1.Donner A and Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. London, England: Arnold, 2000. [Google Scholar]
- 2.Li D, Cao J and Zhang S. Power analysis for cluster randomized trials with multiple binary co-primary endpoints. Biometrics 2020; 76(4): 1064–1074. Doi: 10.1111/biom.13212. [DOI] [PubMed] [Google Scholar]
- 3.Turner R, Omar R and Thompson S. Modelling multivariate outcomes in hierarchical data, with application to cluster randomised trials. Biometrical Journal 2006; 48(3): 333–345. Doi: 10.1002/bimj.200310147. [DOI] [PubMed] [Google Scholar]
- 4.US Food and Drug Administration. Multiple Endpoints in Clinical Trials. https://www.fda.gov/media/162416/download, 2022.
- 5.O’Brien P and Geller N. Interpreting tests for efficacy in clinical trials with multiple endpoints. Controlled Clinical Trials 1997; 18(3): 222–227. Doi: 10.1016/S0197-2456(97)00049-4. [DOI] [PubMed] [Google Scholar]
- 6.Ristl R, Urach S, Rosenkranz G et al. Methods for the analysis of multiple endpoints in small populations: A review. Journal of Biopharmaceutical Statistics 2019; 29(1): 1–29. Doi: 10.1080/10543406.2018.1489402. [DOI] [PubMed] [Google Scholar]
- 7.O’Brien P Procedures for comparing samples with multiple endpoints. Biometrics 1984; 40(4): 1079–1087. Doi: 10.2307/2531158. [DOI] [PubMed] [Google Scholar]
- 8. Lachin JM. Applications of the Wei-Lachin multivariate one-sided test for multiple outcomes on possibly different scales. PloS One 2014; 9(10): e108784. Doi: 10.1371/journal.pone.0108784.
- 9. Logan BR and Tamhane AC. On O’Brien’s OLS and GLS tests for multiple endpoints. IMS Lecture Notes Monograph Series 2004; 47: 76–88. Doi: 10.1214/lnms/1196285627.
- 10. Offen W, Chuang-Stein C, Dmitrienko A et al. Multiple co-primary endpoints: medical and statistical solutions: a report from the multiple endpoints expert team of the Pharmaceutical Research and Manufacturers of America. Drug Information Journal 2007; 41(1): 31–46. Doi: 10.1177/009286150704100105.
- 11. Sankoh A, Huque M, Russell H et al. Global two-group multiple endpoint adjustment methods applied to clinical trials. Drug Information Journal 1999; 33: 119–140. Doi: 10.1177/009286159903300115.
- 12. Huang P, Tilley B, Woolson R et al. Adjusting O’Brien’s test to control type I error for the generalized nonparametric Behrens–Fisher problem. Biometrics 2005; 61(2): 531–539. Doi: 10.1111/j.1541-0420.2005.00322.x.
- 13. Huang P, Woolson R and O’Brien P. A rank-based sample size method for multiple outcomes in clinical trials. Statistics in Medicine 2008; 27(16): 3084–3104. Doi: 10.1002/sim.3182.
- 14. Schulz KF, Altman DG et al. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. Journal of Pharmacology and Pharmacotherapeutics 2010; 1(2): 100–107. Doi: 10.1371/journal.pmed.1000251.
- 15. Acion L, Peterson JJ, Temple S et al. Probabilistic index: an intuitive non-parametric approach to measuring the size of treatment effects. Statistics in Medicine 2006; 25(4): 591–602. Doi: 10.1002/sim.2256.
- 16. Dong G, Huang B, Verbeeck J et al. Win statistics (win ratio, win odds, and net benefit) can complement one another to show the strength of the treatment effect on time-to-event outcomes. Pharmaceutical Statistics 2023; 22(1): 20–33. Doi: 10.1002/pst.2251.
- 17. Zou G and Zou L. A nonparametric global win probability approach to the analysis and sizing of randomized controlled trials with multiple endpoints of different scales and missing data: Beyond O’Brien–Wei–Lachin. Statistics in Medicine 2024; 43: 5366–5379. Doi: 10.1002/sim.10247.
- 18. Zou G. Confidence interval estimation for treatment effects in cluster randomization trials based on ranks. Statistics in Medicine 2021; 40(14): 3227–3250. Doi: 10.1002/sim.8918.
- 19. Zou G, Smith EJ, Zou L et al. A rank-based approach to design and analysis of pretest-posttest randomized trials, with application to COVID-19 ordinal scale data. Contemporary Clinical Trials 2023; 126: 107085. Doi: 10.1016/j.cct.2023.107085.
- 20. Zou G, Zou L and Choi Y. Distribution-free approach to the design and analysis of randomized stroke trials with the modified Rankin scale. Stroke 2022; 53(10): 3025–3031. Doi: 10.1161/STROKEAHA.121.037744.
- 21. Zou G, Zou L and Qiu SF. Parametric and nonparametric methods for confidence intervals and sample size planning for win probability in parallel-group randomized trials with Likert item and Likert scale data. Pharmaceutical Statistics 2023; 22(3): 418–439. Doi: 10.1002/pst.2280.
- 22. Wight D, Raab G, Henderson M et al. Limits of teacher delivered sex education: interim behavioural outcomes from randomised trial. BMJ 2002; 324: 1430–1435. Doi: 10.1136/bmj.324.7351.1430.
- 23. Rahlfs V and Zimmermann H. Effect size measures and their benchmark values for quantifying benefit or risk of medicinal products. Biometrical Journal 2019; 61(4): 973–982. Doi: 10.1002/bimj.201800107.
- 24. Lachin JM. Some large-sample distribution-free estimators and tests for multivariate partially incomplete data from two populations. Statistics in Medicine 1992; 11(9): 1151–1170. Doi: 10.1002/sim.4780110903.
- 25. Buyse M. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Statistics in Medicine 2010; 29(30): 3245–3257. Doi: 10.1002/sim.3923.
- 26. Agresti A. Generalized odds ratios for ordinal data. Biometrics 1980; 36(1): 59–67. URL: jstor.org/stable/2530495.
- 27. Pocock S, Ariti C, Collier T et al. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European Heart Journal 2012; 33(2): 176–182. Doi: 10.1093/eurheartj/ehr352.
- 28. Dong G, Hoaglin D, Qiu J et al. The win ratio: On interpretation and handling of ties. Statistics in Biopharmaceutical Research 2020; 12(1): 99–106. Doi: 10.1080/19466315.2019.1575279.
- 29. Brunner E, Vandemeulebroecke M and Mütze T. Win odds: An adaptation of the win ratio to include ties. Statistics in Medicine 2021; 40(14): 3367–3384. Doi: 10.1002/sim.8967.
- 30. Mann H and Whitney D. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 1947; 18(1): 50–60. Doi: 10.1214/aoms/1177730491.
- 31. Sun X and Xu W. Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Processing Letters 2014; 21(11): 1389–1393. Doi: 10.1109/LSP.2014.2337313.
- 32. Hoeffding W. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 1948; 19(3): 293–325. Doi: 10.1214/aoms/1177730196.
- 33. van Breukelen GJ, Candel MJ and Berger MP. Relative efficiency of unequal versus equal cluster sizes in cluster randomized and multicentre trials. Statistics in Medicine 2007; 26(13): 2589–2603. Doi: 10.1002/sim.2740.
- 34. Wang B, Harhay M, Small D et al. On the mixed-model analysis of covariance in cluster-randomized trials. arXiv preprint 2021; arXiv:2112.00832. Forthcoming in Statistical Science.
- 35. Obuchowski NA. Nonparametric analysis of clustered ROC curve data. Biometrics 1997; 53(2): 567–578. Doi: 10.2307/2533958.
- 36. DeLong E, DeLong D and Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988; 44(3): 837–845. Doi: 10.2307/2531595.
- 37. Sen P. On some convergence properties of U-statistics. Calcutta Statistical Association Bulletin 1960; 10(37–38): 1–18. Doi: 10.1177/0008068319600101.
- 38. Brunner E, Munzel U and Puri M. The multivariate nonparametric Behrens–Fisher problem. Journal of Statistical Planning and Inference 2002; 108(1–2): 37–53. Doi: 10.1016/S0378-3758(02)00269-0.
- 39. Rubarth K, Sattler P, Zimmermann H et al. Estimation and testing of Wilcoxon–Mann–Whitney effects in factorial clustered data designs. Symmetry 2021; 14(2): 1–34. Doi: 10.3390/sym14020244.
- 40. Eldridge S, Ashby D, Feder G et al. Lessons for cluster randomized trials in the twenty-first century: a systematic review of trials in primary care. Clinical Trials 2004; 1: 80–90. Doi: 10.1191/1740774504cn006rr.
- 41. Kahan B, Forbes G, Ali Y et al. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials 2016; 17: 1–8. Doi: 10.1186/s13063-016-1571-2.
- 42. Ivers N, Taljaard M, Dixon S et al. Impact of CONSORT extension for cluster randomised trials on quality of reporting and study methodology: review of random sample of 300 trials, 2000–8. BMJ 2011; 343: d5886. Doi: 10.1136/bmj.d5886.
- 43. Eldridge S, Ashby D and Kerry S. Sample size for cluster randomized trials: effect of coefficient of variation of cluster size and analysis method. International Journal of Epidemiology 2006; 35(5): 1292–1300. Doi: 10.1093/ije/dyl129.
- 44. Adams G, Gulliford M, Ukoumunne O et al. Patterns of intra-cluster correlation from primary care research to inform study design and analysis. Journal of Clinical Epidemiology 2004; 57: 785–794. Doi: 10.1016/j.jclinepi.2003.12.013.
- 45. Yang S, Moerbeek M, Taljaard M et al. Power analysis for cluster randomized trials with continuous coprimary endpoints. Biometrics 2023; 79(2): 1293–1305. Doi: 10.1111/biom.13692.
- 46. Wang J, Cao J, Zhang S et al. A flexible sample size solution for longitudinal and crossover cluster randomized trials with continuous outcomes. Contemporary Clinical Trials 2021; 109: 106543. Doi: 10.1016/j.cct.2021.106543.
- 47. Li F, Turner E and Preisser J. Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics 2018; 74(4): 1450–1458. Doi: 10.1111/biom.12918.
- 48. Wicklin R. Simulating Data with SAS. Cary, North Carolina: SAS Institute, 2013.
- 49. Xu J. On the bias in the AUC variance estimate. Pattern Recognition Letters 2024; 178: 62–68. Doi: 10.1016/j.patrec.2023.12.012.
- 50. Yu C. Nonparametric Methods for Analysis and Sizing of Cluster Randomization Trials with Baseline Measurements. PhD Thesis, Western University, London, ON, CA, 2023. Electronic Thesis and Dissertation Repository. 9697. https://ir.lib.uwo.ca/etd/9697.
- 51. Shu D and Zou G. Revisiting sample size planning for receiver operating characteristic studies: A confidence interval approach with precision and assurance. Statistical Methods in Medical Research 2023; 32(4): 748–759. Doi: 10.1177/09622802231151210.
- 52. Rauch G, Jahn-Eimermacher A, Brannath W et al. Opportunities and challenges of combined effect measures based on prioritized outcomes. Statistics in Medicine 2014; 33(7): 1104–1120. Doi: 10.1002/sim.6010.
