Abstract
Background/Aims:
Multi-arm, multi-stage trials frequently include a standard-care arm to which all interventions are compared. This may increase costs and hinder comparisons among the experimental arms. Furthermore, the appropriate standard care may not be evident, particularly when there is large variation in standard practice. Thus, we aimed to develop an adaptive clinical trial that drops ineffective interventions following an interim analysis before selecting the best intervention at the final stage, without requiring a standard-care arm.
Methods:
We used Bayesian methods to develop a multi-arm, two-stage adaptive trial and evaluated two different methods for ranking interventions, the probability that each intervention is optimal ($P_{\text{best}}$) and the surface under the cumulative ranking curve (SUCRA), at both the interim and final analyses. The proposed trial design determines the maximum sample size for each intervention using the Average Length Criterion. The interim analysis takes place at approximately half the pre-specified maximum sample size and aims to drop interventions for futility if either $P_{\text{best}}$ or the SUCRA is below a pre-specified threshold. The final analysis compares all remaining interventions at the maximum sample size to conclude superiority based on either $P_{\text{best}}$ or the SUCRA. The two ranking methods were compared across 12 scenarios that vary the number of interventions and the assumed differences between the interventions. The thresholds for futility and superiority were chosen to control the type 1 error, and the predictive power and expected sample size were then evaluated across scenarios. We then designed a trial comparing three interventions that aim to reduce anxiety for children undergoing laceration repair in the emergency department, known as the Anxiolysis for Laceration Repair in Children (ALICE) trial.
Results:
As the number of interventions increases, the SUCRA results in a higher predictive power compared with $P_{\text{best}}$. Using $P_{\text{best}}$ results in a lower expected sample size when there is an effective intervention. Using the Average Length Criterion, the ALICE trial has a maximum sample size of 100 patients per arm. This sample size results in an 86% and 85% predictive power using $P_{\text{best}}$ and the SUCRA, respectively. Thus, we chose $P_{\text{best}}$ as the ranking method for the ALICE trial.
Conclusion:
Bayesian ranking methods can be used in multi-arm, multi-stage trials with no clear control intervention. When more interventions are included, the SUCRA results in a higher power than $P_{\text{best}}$. Future work should consider whether other ranking methods may also be relevant for clinical trial design.
Keywords: Bayesian adaptive trial, multi-arm multi-stage trial, Surface Under the Cumulative Ranking curve, clinical trial design, paediatric emergency department
Introduction
Novel interventions are frequently evaluated in two-arm clinical trials, which compare the intervention against either a placebo or standard-of-care control.1 However, head-to-head comparisons of different effective interventions are less frequent, particularly when multiple interventions are developed concurrently.2 Multi-arm studies, which compare a relatively large number of interventions, can be a more efficient method for performing these comparisons,3 particularly if they stop recruitment to less effective interventions early.4 This is a type of adaptive trial, where the number of interventions enrolling is based on study data.5,6
Typically, multi-arm adaptive trials perform all analyses in a pairwise fashion against a common control7,8 to determine whether each of the interventions is superior to the control. However, this framework of pairwise comparisons is neither economical nor ethical9–12 and restricts comparisons among the experimental interventions, which could be the real target of the proposed trial.13 This is especially crucial when there is no obvious consensus on a standard-of-care intervention, for example, if there is large variation in practice across institutions or novel interventions have been developed concurrently.14 In two-arm trials, the designation of a 'control' arm between two active comparators is not crucial, but in multi-arm trials, it is not obvious how the multiple arms should be compared.10
Thus, we aimed to design a multi-arm trial to identify the optimal intervention from a set of active comparators, equivalent to a phase III study. This trial is most relevant to settings where a range of interventions are used, with different interventions favoured by different sites, and there is limited evidence on which intervention offers superior performance. In this setting, we can assume that the investigators are comparing the efficacy of interventions for which safety is well established and hope to identify less effective interventions early. To achieve this, our proposed trial uses a two-stage design where low-ranked interventions are dropped at the first stage, and the remaining interventions are assessed to determine which, if any, is optimal. We implemented this design in a Bayesian framework and made decisions using the rank of each intervention.15 The Bayesian framework is well suited to this design as posterior ranks can be easily computed and it naturally fits within an adaptive design framework.16,17
This study compared two methods to rank the interventions. The first method calculates the probability that an intervention is better than all other interventions ($P_{\text{best}}$), that is, the probability that the intervention is associated with the most desirable mean value.18 However, $P_{\text{best}}$ is sensitive to the uncertainty in the estimates: if all the interventions have the same point estimate, it assigns the highest rank to the intervention with the most uncertainty.18 Thus, we also considered ranking the interventions using the Surface Under the Cumulative Ranking curve (SUCRA),15,19,20 which can mitigate the drawbacks of $P_{\text{best}}$.18 The SUCRA accounts for the complete ranking distribution of an intervention by averaging the cumulative rank probabilities and indicates the fraction of interventions that are less effective than that intervention.18 In settings where all interventions have previously demonstrated efficacy, ranking methods alone could be used to determine the most effective intervention.18 If novel interventions were included, it would be important to also include an assessment of safety, and rankings may not be suitable if interventions differ substantially in other characteristics, for example, cost and ease of accessibility.
To our knowledge, the proposed ranking methods have only been compared for network meta-analysis.18 Thus, our study is the first to compare different ranking methods for decision-making in multi-arm trial designs. We compared the ability of the SUCRA and $P_{\text{best}}$ to identify the optimal intervention based on the power and expected sample size, that is, the average number of patients randomized to each intervention, across a range of numbers of trial arms and minimally important clinical differences. To ensure a fair comparison of the two methods, we imposed the same frequentist type 1 error across the two ranking methods. We also implemented our novel design for a trial comparing three anxiolytic agents used in the paediatric emergency department to reduce distress in children undergoing laceration repair. The design for this trial used prior distributions extracted from the literature21–23 and previous studies from our study team.24 We identified the sample size for this trial using Bayesian methods25,26 and determined the study power and expected sample size.
Methods
Multi-arm multi-stage trials
Multi-arm multi-stage (MAMS) trials evaluate multiple interventions and use interim analyses to determine whether trial arms should be dropped or continue to the next stage.4,27 Frequentist MAMS designs are well established and use repeated statistical tests to determine whether interventions should be dropped.4,28,29 In contrast, Bayesian MAMS designs determine whether interventions should be discontinued or declared superior using predefined decision rules based on estimates from the posterior distribution of the parameters of interest.4 Some MAMS trials will continue until a conclusion has been reached,30 while others pre-specify a maximum sample size and number of stages.4
In our design, we considered two stages and applied different decision rules at the interim and final analyses. During the interim analysis, we decided which interventions were promising enough to proceed to the final analysis stage by determining whether their efficacy exceeded a given stopping boundary.31 The final analysis aimed to identify the optimal intervention among those that had not been stopped and took place when all participants had been recruited to those interventions. Thus, our proposed trial requires a pre-specified maximum recruitment level for each intervention and the boundaries to (1) stop an intervention at the interim analysis and (2) declare superiority at the final analysis. These key trial design components can then be chosen to minimize the sample size while ensuring that the trial meets its predefined target(s).31
Our trial is a two-stage, $K$-arm trial comparing $K$ interventions, with a maximum sample size of $N$ patients per arm and one interim analysis after approximately half the maximum sample size ($N/2$) has been assigned to each arm. The decisions at the interim and final analyses used rankings of the interventions conditional on the posterior distribution of the mean effectiveness. We denote the quantity computed to determine futility and superiority as $R_{k,m}^{(s)}$, where $k$ indexes the intervention, $m$ is the ranking method (either $P_{\text{best}}$ or the SUCRA) and $s$ is the analysis stage (either interim or final).
At the interim analysis, $R_{k,m}^{(\text{interim})}$ was compared to a ranking method-specific futility threshold $\epsilon_{f,m}$, with intervention $k$ declared futile if $R_{k,m}^{(\text{interim})} < \epsilon_{f,m}$. If all but one of the interventions meet this threshold, then the remaining intervention was declared superior at the interim analysis and the trial was terminated. Otherwise, we proceeded to the final stage with all the non-futile interventions, which may be all of them. At the final stage, we calculated $R_{k,m}^{(\text{final})}$ and compared it to a ranking method-specific superiority threshold $\epsilon_{s,m}$. Superiority was declared if $R_{k,m}^{(\text{final})} > \epsilon_{s,m}$. Note that $\epsilon_{s,m}$ should be sufficiently high that it is not mathematically possible for more than one intervention to satisfy this superiority condition. Figure 1 represents this design pictorially.
Figure 1.
A pictorial representation of the decision-making in the proposed two-stage trial design. The first and second scenarios proceed to the final analysis stage, where at least two interventions are evaluated at the maximum sample size. Superiority is declared at the interim analysis stage in the final scenario, so the trial is terminated early and no interventions continue to the maximum sample size.
Implementing our proposed design
Before presenting the methods used to select the maximum sample size $N$ and the thresholds $\epsilon_{f,m}$ and $\epsilon_{s,m}$, we introduce the model used for the data. Our example trial had a continuous outcome, and to allow for efficient computation, we modelled this outcome using a normal distribution with a normal-gamma conjugate prior for the mean and precision. Let $k$ denote the intervention that patient $i$ received and $Y_{ik}$ be the outcome for patient $i$, where $n_k$ is the number of patients recruited to intervention $k$. The outcome is modelled as
$$Y_{ik} \mid \theta_k, \tau_k \sim \mathrm{N}\left(\theta_k, \tau_k^{-1}\right), \quad i = 1, \ldots, n_k, \quad k = 1, \ldots, K \qquad (1)$$
where $\theta_k$ is the effectiveness of intervention $k$, and $\tau_k$ is the precision for intervention $k$. The priors for $\theta_k$ and $\tau_k$ are defined as follows:
$$\theta_k \mid \tau_k \sim \mathrm{N}\left(\mu_k, (n_{0k}\tau_k)^{-1}\right), \qquad \tau_k \sim \mathrm{Gamma}(\alpha_k, \beta_k) \qquad (2)$$
where $\mu_k$ is the prior mean for $\theta_k$, and $n_{0k}$ is the prior effective sample size.32 In an applied trial, $\mu_k$, $n_{0k}$, $\alpha_k$ and $\beta_k$ should be defined using data from previous studies. The closed-form definitions of these hyperparameters support the use of available data to define the prior distributions. Specifically, $\mu_k$ can be set equal to the mean of the outcome from the previous data, and $n_{0k}$ can be set to the number of patients used to estimate that mean. For the precision, $\alpha_k$ can be set to $n/2$, where $n$ is the number of patients used to estimate the precision, and $\beta_k$ can be set to $\alpha_k s^2$, where $s^2$ is the sample variance from a previous study. The use of conjugate priors also facilitates the simulation study, as the posterior distributions for $\theta_k$ and $\tau_k$, conditional on the data, can be determined analytically. Note, however, that our design can be extended to alternative likelihoods using conjugate or non-conjugate distributions.
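As an illustration, the following is a minimal sketch (our own code, not the authors') of the closed-form conjugate update and of sampling the posterior mean, using standard normal-gamma results; all function names are ours.

```python
import numpy as np

def normal_gamma_update(y, mu0, n0, alpha, beta):
    """Posterior hyperparameters for one arm under the conjugate model:
    y_i ~ N(theta, 1/tau), theta | tau ~ N(mu0, 1/(n0*tau)), tau ~ Gamma(alpha, beta)."""
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    mu_post = (n0 * mu0 + n * ybar) / (n0 + n)
    n0_post = n0 + n
    alpha_post = alpha + n / 2
    beta_post = (beta + 0.5 * np.sum((y - ybar) ** 2)
                 + n0 * n * (ybar - mu0) ** 2 / (2 * (n0 + n)))
    return mu_post, n0_post, alpha_post, beta_post

def sample_means(mu, n0, alpha, beta, size=10_000, rng=None):
    """Draw the arm mean theta by compositing tau ~ Gamma, then theta | tau ~ Normal."""
    rng = rng or np.random.default_rng()
    tau = rng.gamma(alpha, 1.0 / beta, size=size)  # numpy parameterizes by shape and scale
    return rng.normal(mu, 1.0 / np.sqrt(n0 * tau))
```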
Determining the maximum sample size
The proposed trial design requires specifying the maximum sample size for each trial arm. We proposed that the maximum sample size is derived using the Average Length Criterion (ALC), a Bayesian method for sample size determination.25 The ALC controls the average length of the posterior credible interval for the parameters of interest, typically the treatment effect but, in this study, the arm-specific means $\theta_k$.
To adapt the ALC to a multi-arm study, we computed the length of the longest posterior credible interval for $\theta_k$ across the $K$ interventions, denoted by $L_{\max}$. The maximum sample size was then chosen to limit $L_{\max}$, calculated by simulation. Using the design prior for $\theta_k$ and $\tau_k$, we generated data from the prior-predictive distribution of the outcomes across a range of sample sizes.25 The posterior distributions for $\theta_k$ conditional on these data were found by combining the data with an analysis prior. The analysis prior is not necessarily the same as the design prior, particularly as the analysis prior in trials is often non-informative.25 We then calculated $L_{\max}$ for a range of sample sizes and computed the average maximum length across all simulated data sets, separately for each sample size.
In this study, we selected the maximum sample size $N$ as the smallest sample size such that the longest prior credible interval for $\theta_k$ was reduced tenfold in the posterior. The length of the credible intervals specified under the prior distribution reflects the strength of evidence about the outcomes before we observe the data. Thus, the maximum sample size ensures that the estimates of the outcomes after observing the data will be tenfold more informative than before the data were observed. Once $N$ was selected, we had to choose the thresholds that control which interventions would be declared futile and superior.
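The ALC search can then be sketched as below, reusing the helpers above. This is an illustration under our assumptions, not the authors' implementation; in particular, we use an equal-tailed 95% interval as a convenient stand-in for the high-density interval.

```python
import numpy as np

def alc_avg_max_length(n_per_arm, design_priors, analysis_priors,
                       n_data=1_000, n_draws=5_000, seed=1):
    """Average, over prior-predictive data sets, of the longest 95% credible
    interval for the arm means at a given per-arm sample size."""
    rng = np.random.default_rng(seed)
    max_lengths = np.empty(n_data)
    for i in range(n_data):
        longest = 0.0
        for (mu0, n0, a, b), ap in zip(design_priors, analysis_priors):
            tau = rng.gamma(a, 1.0 / b)                       # design prior draw
            theta = rng.normal(mu0, 1.0 / np.sqrt(n0 * tau))
            y = rng.normal(theta, 1.0 / np.sqrt(tau), size=n_per_arm)
            post = normal_gamma_update(y, *ap)                # combine with analysis prior
            draws = sample_means(*post, size=n_draws, rng=rng)
            lo, hi = np.quantile(draws, [0.025, 0.975])
            longest = max(longest, hi - lo)
        max_lengths[i] = longest
    return max_lengths.mean()
```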
Determining the futility and superiority thresholds
We selected the futility and superiority thresholds, $\epsilon_{f,m}$ and $\epsilon_{s,m}$, respectively, to maintain a frequentist type 1 error of 0.025. In MAMS trials, the commonly used measures of type 1 error are the pairwise and familywise type 1 error rates. The pairwise error is the probability that the null hypothesis is incorrectly rejected for a specific intervention at the end of the study, while the familywise error is the probability that the null hypothesis of at least one intervention is incorrectly rejected in a multi-arm study.33 As our study does not focus on rejecting a null hypothesis but rather on declaring an intervention superior from a set of multiple interventions, we defined the type 1 error as the proportion of trials that declare any of the interventions superior when no differences between the interventions exist.
To evaluate the type 1 error, we accounted for the prior uncertainty in the parameters $\theta_k$ and $\tau_k$ for all $k$. Thus, the type 1 error was computed by first simulating the values for $\theta_k$ and $\tau_k$ from a shared design prior. The data were then simulated conditional on these values for $\theta_k$ and $\tau_k$. The hyperparameters of the posterior distributions were calculated, and the ranking of each intervention was then computed using simulations from each posterior distribution.
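For illustration, one null-scenario replicate could be sketched as follows, reusing the helpers above: each arm's precision and mean are drawn from the same design prior, so there are no systematic differences between arms, and the function returns posterior draws of the arm means for the ranking calculation. All names are our own.

```python
import numpy as np

def null_posterior_draws(prior, K, n_per_arm, n_draws=10_000, rng=None):
    """One null replicate: (theta_k, tau_k) drawn iid from a shared design
    prior, data simulated per arm, and posterior draws of each arm mean
    returned as the columns of an (n_draws, K) matrix."""
    rng = rng or np.random.default_rng()
    mu0, n0, alpha, beta = prior
    taus = rng.gamma(alpha, 1.0 / beta, size=K)          # per-arm precisions
    thetas = rng.normal(mu0, 1.0 / np.sqrt(n0 * taus))   # iid arm means: no systematic differences
    draws = np.empty((n_draws, K))
    for k in range(K):
        y = rng.normal(thetas[k], 1.0 / np.sqrt(taus[k]), size=n_per_arm)
        post = normal_gamma_update(y, mu0, n0, alpha, beta)
        draws[:, k] = sample_means(*post, size=n_draws, rng=rng)
    return draws
```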
Bayesian predictive power
Once $\epsilon_{f,m}$ and $\epsilon_{s,m}$ were specified, the trial design was evaluated using Bayesian predictive power. In the frequentist framework, power is the probability that an intervention will be declared superior if it is superior.34 The power calculation for our design accounted for uncertainty in $\theta_k$ and $\tau_k$ for all $k$. Thus, the values for $\theta_k$ and $\tau_k$ were simulated from their prior distributions. This prior simulation means that even when the prior mean for a given intervention was superior, the intervention may not be superior in a given simulation. Thus, we computed power by determining the simulation-specific superior intervention based on which of the simulated values for $\theta_k$ was the smallest. If the simulated trial identified that intervention as superior, then it was a successful trial. To compute the power, we have two possible ways of declaring an intervention superior:
1. At the interim analysis, all interventions except one meet the futility criterion, leaving a single non-futile intervention; or
2. At the final analysis, a single intervention meets the superiority criterion.
Predictive power can also be computed by evaluating the probability of detecting superiority for a specific intervention, but this power is bounded above by the prior probability that the intervention is optimal, whereas our proposed definition can reach 1.
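The two success paths can be captured in a small decision function, sketched below with our own names; lower outcomes are better throughout, and the ranking scores and thresholds are inputs.

```python
import numpy as np

def trial_decision(r_interim, eps_f, r_final=None, eps_s=None):
    """Return (winning arm index or None, stage at which the trial ends)."""
    r_interim = np.asarray(r_interim)
    active = np.flatnonzero(r_interim >= eps_f)   # arms not declared futile
    if active.size == 1:
        return int(active[0]), "interim"          # path 1: all others futile
    if active.size == 0 or r_final is None:
        return None, "interim"
    r_final = np.asarray(r_final)
    winners = active[r_final[active] >= eps_s]
    if winners.size == 1:
        return int(winners[0]), "final"           # path 2: unique superiority
    return None, "final"
```

Predictive power is then the proportion of simulated trials in which the returned winner matches the arm with the smallest simulated mean.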
Evaluating the ranking methods
We performed a simulation study to compare the two ranking methods for decision-making in our trial, reported using the aims, data-generating mechanisms, estimands, methods and performance measures (ADEMP) framework.35
Aims
We aimed to compare the use of $P_{\text{best}}$ or the SUCRA to make adaptive design decisions in our trial, based on predictive power and expected sample size.
Data-generating process
The data were simulated from the prior-predictive distribution of the normal-gamma conjugate model (see equations (1) and (2)). The parameters of the normal-gamma conjugate model were chosen to mimic the trial design for the Anxiolysis for Laceration Repair in Children (ALICE) trial (described below), with the precision hyperparameters and prior effective sample sizes shared across all interventions. We considered four different scenarios for the prior means: a null scenario with equal means and three alternative scenarios in which the prior means increase in equal increments of $\Delta \in \{0.5, 1.0, 1.5\}$ between successive interventions. We also considered four different values for $K$, the number of interventions in the trial: $K \in \{3, 5, 8, 12\}$. We used a maximum sample size for each intervention of 100, with an interim analysis at 50.
Estimands
Depending on the underlying assumptions for the prior means of the interventions, the key estimand is either the type 1 error, defined as the probability that a trial declares superiority of any intervention when the prior means are all equal, or the predictive power, defined as the probability that the intervention with the simulation-specific lowest mean outcome is declared superior. We also evaluated the expected sample size, defined as the expected number of individuals recruited to each intervention, which varies because interventions can be dropped at the interim analysis. Finally, we evaluated the proportion of trials that conclude superiority for a non-superior intervention.
Methods
We compared two ranking methods for drawing conclusions of superiority and futility during the trial. The rankings are computed using the posterior distribution of the mean for each intervention, $\theta_k$ for $k = 1, \ldots, K$. The first method was $P_{\text{best}}$, the probability that the intervention ranks first:
$$P_{\text{best},k} = \Pr\left(\theta_k < \theta_j \ \text{for all} \ j \neq k \mid \text{data}\right) \qquad (3)$$
estimated using 10,000 simulations from the posterior distributions. The second method was the SUCRA, which numerically summarizes the entire ranking distribution. To define the SUCRA, let $p_{k,j}$ be the probability that the posterior mean of intervention $k$ is the $j$-th smallest among all interventions, that is, $p_{k,1}$ is the probability that the posterior mean is the smallest and the intervention ranks first. Next, define the cumulative probability $F_{k,j} = \sum_{l=1}^{j} p_{k,l}$ for $j = 1, \ldots, K$ and $k = 1, \ldots, K$. Finally, the SUCRA of intervention $k$ is then computed as its average cumulative ranking:
$$\mathrm{SUCRA}_k = \frac{1}{K - 1} \sum_{j=1}^{K-1} F_{k,j} \qquad (4)$$
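Both metrics follow directly from a matrix of posterior draws (rows are posterior simulations, columns are interventions; lower means are better, as for the OSBD-R). The sketch below is our own implementation of equations (3) and (4):

```python
import numpy as np

def rank_probs(draws):
    """p[k, j]: probability that arm k has the (j+1)-th smallest posterior mean."""
    ranks = np.argsort(np.argsort(draws, axis=1), axis=1)  # per draw, 0 = smallest
    K = draws.shape[1]
    return np.stack([(ranks == j).mean(axis=0) for j in range(K)], axis=1)

def p_best(draws):
    """Equation (3): probability that each arm's mean is the smallest."""
    return rank_probs(draws)[:, 0]

def sucra(draws):
    """Equation (4): average cumulative rank probability over ranks 1, ..., K-1."""
    p = rank_probs(draws)
    K = p.shape[1]
    return np.cumsum(p, axis=1)[:, : K - 1].mean(axis=1)
```

For example, `p_best(null_posterior_draws((0.4, 2.55, 5.1, 12.73), K=3, n_per_arm=50))` would give interim $P_{\text{best}}$ values for one simulated null trial under the sketches above.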
To ensure comparability between the two ranking methods, the thresholds $\epsilon_{f,m}$ and $\epsilon_{s,m}$ were selected such that the type 1 error was fixed at 2.5%. We considered three potential values for $\epsilon_{s,m}$: 0.95, 0.975 and 0.99. These were chosen to ensure that only one intervention could be ranked superior. The value of $\epsilon_{f,m}$ was then calibrated, separately for each superiority threshold, to maintain the type 1 error. The final thresholds were chosen to yield the maximal power conditional on maintaining type 1 error control.
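The calibration itself can be sketched as a grid search. The helper `null_trial_declares_superiority(eps_f, eps_s)` below is hypothetical: it stands for one complete simulated null trial (interim and final stages) with a given ranking metric, returning True when any intervention is incorrectly declared superior.

```python
import numpy as np

def calibrate_futility(eps_s, null_trial_declares_superiority,
                       grid=np.linspace(0.0, 0.6, 61),
                       n_sims=10_000, target=0.025):
    """Largest futility threshold on the grid whose estimated type 1 error
    stays at or below the target, for a fixed superiority threshold."""
    chosen = None
    for eps_f in grid:  # dropping arms more aggressively raises the type 1 error
        t1e = np.mean([null_trial_declares_superiority(eps_f, eps_s)
                       for _ in range(n_sims)])
        if t1e <= target:
            chosen = float(eps_f)
    return chosen
```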
Performance measures
We used 10,000 simulated trials in each scenario, which guarantees that the estimated quantities have a 95% probability of being within 0.002 of the reported value. We estimated the type 1 error and predictive power as the proportion of simulations in which the relevant criteria were met. We report the chosen thresholds $\epsilon_{f,m}$ and $\epsilon_{s,m}$ for each scenario, the type 1 error or predictive power, and the Monte Carlo standard error for the predictive power. Finally, the expected sample size is calculated as the average total sample size across the simulated trials, divided by $K$, the number of trial interventions.
Designing the ALICE trial
The ALICE trial is a phase III, multi-centre, single-blinded, randomized, three-arm, adaptive trial that aims to compare three anxiolytic agents used for laceration repairs in the paediatric emergency department: intranasal midazolam (INM), inhaled nitrous oxide (N2O) and intranasal dexmedetomidine (IND). Children, particularly young children, are often distressed when undergoing laceration repair,36–38 which often requires physical restraint.37 Anxiolytic agents can reduce distress during the procedure, making it less technically challenging for the proceduralist and reducing negative experiences for children and caregivers. INM and N2O are widely used in the paediatric emergency department, but there is little consensus on which agent is most effective. As both agents have limitations,21–23 IND has been suggested as an effective alternative, but a head-to-head comparison of these three anxiolytics has not been performed. Crucially, due to variation in clinical practice, a standard of care cannot be selected.
The ALICE trial will enrol children between 1 and 13 years of age who present to the emergency department with a single laceration requiring simple interrupted sutures alone. The child or caregiver must also desire anxiolysis for the repair. The primary outcome is a weighted mean anxiolysis score, measured using the Observational Scale of Behavioral Distress – Revised (OSBD-R),39 which ranges from 0 (no distress) to 23.5 (maximal distress).
Determining the priors
The prior information for the ALICE trial was extracted from the published literature and from a study undertaken by our team.24 The priors for the mean of the OSBD-R score for INM (1.9) and N2O (0.4) were extracted from a trial randomizing 51 patients to each intervention.40 For IND, a dose-ranging study enrolling 21 patients indicated a mean OSBD-R score of 3.96.24 A pooled standard deviation for the OSBD-R score was estimated from 204 patients.40 We assumed that the individual-level precision was the same across all three interventions, that is, $\alpha_k$ and $\beta_k$ are the same for all $k$, and therefore this study represents the best possible evidence for the precision. The values for $\alpha_k$ and $\beta_k$ were computed using the sample size and pooled standard deviation. To reduce the impact of the prior on the results, we down-weighted all the prior sample sizes by a factor of 20, indicating that the information from 20 patients in the previous trials is equivalent to one patient in the ALICE trial. The final prior distributions for the ALICE trial are given in Table 1.
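The down-weighting arithmetic behind Table 1 can be reproduced directly; the final line shows the sample variance implied by the tabulated $\alpha_k$ and $\beta_k$, derived rather than taken from the source study.

```python
weight = 20                  # information from 20 prior patients = 1 ALICE patient
n0_n2o = 51 / weight         # N2O prior effective sample size -> 2.55
n0_inm = 51 / weight         # INM prior effective sample size -> 2.55
n0_ind = 21 / weight         # IND prior effective sample size -> 1.05
alpha = (204 / weight) / 2   # precision prior: alpha = n/2 -> 5.1
beta = 12.73                 # from Table 1, where beta = alpha * s^2
s2 = beta / alpha            # implied pooled variance, about 2.50 (SD about 1.58)
```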
Table 1.
The prior parameters used for the ALICE trial.
| Intervention | $\mu_k$ | $n_{0k}$ | $\alpha_k$ | $\beta_k$ |
|---|---|---|---|---|
| Inhaled nitrous oxide (N2O) | 0.4 | 2.55 | 5.1 | 12.73 |
| Intranasal midazolam (INM) | 1.9 | 2.55 | 5.1 | 12.73 |
| Intranasal dexmedetomidine (IND) | 3.96 | 1.05 | 5.1 | 12.73 |
N2O: nitrous oxide; INM: intranasal midazolam; IND: intranasal dexmedetomidine.
Sample size determination
The sample size for the ALICE trial was selected using the ALC, conditional on the priors in Table 1. We simulated 10,000 data sets, conditional on the design prior, for eight different sample sizes from 70 to 140, in increments of 10. For each data set, we obtained the 95% high-density posterior credible intervals, estimated using 10,000 simulations from the analytic posterior distributions. The longest interval length across the interventions was computed for each data set, and the average maximum length was then calculated for each sample size. The longest length of the 95% high-density prior credible interval was also computed, denoted by $L_0$. We then selected the smallest sample size such that the average maximum posterior length was at most $L_0/10$.
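Putting the pieces together, a hypothetical call to the `alc_avg_max_length` sketch above with the Table 1 priors scans the same grid of sample sizes; using the same priors for design and analysis is our assumption.

```python
priors = [(0.4, 2.55, 5.1, 12.73),    # N2O: (mu, n0, alpha, beta)
          (1.9, 2.55, 5.1, 12.73),    # INM
          (3.96, 1.05, 5.1, 12.73)]   # IND
for n in range(70, 150, 10):
    print(n, round(alc_avg_max_length(n, priors, priors), 2))
```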
Evaluating the ALICE trial design
We evaluated both ranking methods to determine the optimal design for the ALICE trial. The optimal thresholds $\epsilon_{f,m}$ and $\epsilon_{s,m}$ from the simulation study were used to evaluate the predictive power and expected sample size. The predictive power was evaluated based on 10,000 simulated studies.
Results
Evaluating the ranking methods
Tables 2 and 3 display the results of our simulation study of the two ranking methods. We report the results for the optimal superiority threshold, $\epsilon_{s,m} = 0.975$, with results for the other thresholds reported in the Supplemental Material. The alternative superiority thresholds resulted in a lower predictive power, and it was not possible to control the type 1 error below 2.5% in some of those scenarios. The futility threshold for $P_{\text{best}}$ was lower than for the SUCRA, as the SUCRA values were strictly larger than the corresponding $P_{\text{best}}$ values. The thresholds also increased as the number of interventions in the trial increased. Thus, the thresholds for dropping interventions became more lenient, and more interventions were dropped at the interim analysis as $K$ increased. This can also be seen in the expected sample size, which decreased as $K$ increased.
Table 2.
The type 1 error and expected sample size (ESS) under the null hypothesis, obtained for four different scenarios that vary the number of interventions in the adaptive trial.
| Number of interventions ($K$) | Ranking approach | Futility threshold ($\epsilon_{f,m}$) | Superiority threshold ($\epsilon_{s,m}$) | Type 1 error | ESS |
|---|---|---|---|---|---|
| 3 | SUCRA | 0.274 | 0.975 | 0.025 | 89 |
| | $P_{\text{best}}$ | 0.026 | 0.975 | 0.025 | 94 |
| 5 | SUCRA | 0.477 | 0.975 | 0.025 | 77 |
| | $P_{\text{best}}$ | 0.057 | 0.975 | 0.025 | 79 |
| 8 | SUCRA | 0.436 | 0.975 | 0.025 | 80 |
| | $P_{\text{best}}$ | 0.073 | 0.975 | 0.025 | 73 |
| 12 | SUCRA | 0.5 | 0.975 | 0.025 | 75 |
| | $P_{\text{best}}$ | 0.079 | 0.975 | 0.025 | 66 |
ESS: expected sample size; SUCRA: surface under the cumulative ranking curve.
Table 3.
The predictive power, expected sample size (ESS) and probability of incorrectly identifying a superior treatment under the alternative hypothesis, obtained from 12 simulated cases varying the number of interventions and the incremental difference between the prior means, $\Delta$.
| Number of interventions ($K$) | Increment between interventions ($\Delta$) | Ranking approach | Power | ESS | Probability of incorrect superiority |
|---|---|---|---|---|---|
| 3 | 0.5 | SUCRA | 0.78 | 83 | 0.002 |
| | | $P_{\text{best}}$ | 0.79 | 60 | 0.002 |
| | 1.0 | SUCRA | 0.81 | 83 | 0.001 |
| | | $P_{\text{best}}$ | 0.82 | 59 | 0.001 |
| | 1.5 | SUCRA | 0.84 | 83 | 0.001 |
| | | $P_{\text{best}}$ | 0.85 | 57 | 0.002 |
| 5 | 0.5 | SUCRA | 0.75 | 77 | 0.002 |
| | | $P_{\text{best}}$ | 0.76 | 56 | 0.005 |
| | 1.0 | SUCRA | 0.80 | 78 | 0.002 |
| | | $P_{\text{best}}$ | 0.81 | 55 | 0.003 |
| | 1.5 | SUCRA | 0.85 | 78 | 0.001 |
| | | $P_{\text{best}}$ | 0.85 | 54 | 0.004 |
| 8 | 0.5 | SUCRA | 0.78 | 77 | 0.006 |
| | | $P_{\text{best}}$ | 0.75 | 54 | 0.007 |
| | 1.0 | SUCRA | 0.84 | 77 | 0.003 |
| | | $P_{\text{best}}$ | 0.81 | 53 | 0.005 |
| | 1.5 | SUCRA | 0.88 | 77 | 0.003 |
| | | $P_{\text{best}}$ | 0.86 | 52 | 0.004 |
| 12 | 0.5 | SUCRA | 0.80 | 75 | 0.009 |
| | | $P_{\text{best}}$ | 0.75 | 53 | 0.009 |
| | 1.0 | SUCRA | 0.86 | 75 | 0.007 |
| | | $P_{\text{best}}$ | 0.82 | 52 | 0.005 |
| | 1.5 | SUCRA | 0.89 | 75 | 0.004 |
| | | $P_{\text{best}}$ | 0.86 | 51 | 0.004 |
ESS: expected sample size; SUCRA: surface under the cumulative ranking curve.
Ranking interventions using $P_{\text{best}}$ resulted in a smaller expected sample size when true differences existed and a larger expected sample size when no differences existed. For our optimal design, $P_{\text{best}}$ provides the highest power and a lower expected sample size for three and five interventions, but the SUCRA outperforms $P_{\text{best}}$ in terms of predictive power as the number of interventions increases, with a 5% increase for $K = 12$ and $\Delta = 0.5$. For $\Delta = 0.5$, $P_{\text{best}}$ outperforms the SUCRA in terms of predictive power until $K = 8$, albeit at a lower level, while for $K = 12$, the SUCRA clearly outperforms $P_{\text{best}}$. The proportion of trials that conclude superiority for a non-superior intervention was below 1% for all scenarios. Thus, this design is unlikely to produce incorrect superiority conclusions when a superior intervention is available.
The expected sample size does not change substantially as the difference between the outcomes increases. We believe this is because the precision of the estimates is the same across the different scenarios and the relatively small sample sizes ensure that interventions are retained in the trial.
The ALICE trial
Table 4 displays the average longest 95% high-density posterior credible interval length for the mean of the OSBD-R score. A maximum sample size of 100 was chosen for the ALICE trial, as the corresponding average longest length of 0.66 is close to the target of $L_0/10$. The interim analysis takes place after 50 patients have been enrolled for each intervention.
Table 4.
The range of sample sizes and the corresponding average longest 95% high-density posterior credible interval length for the mean of the OSBD-R score.
| $n$ | Average length | $n$ | Average length |
|---|---|---|---|
| 70 | 0.79 | 110 | 0.63 |
| 80 | 0.74 | 120 | 0.60 |
| 90 | 0.70 | 130 | 0.59 |
| 100 | 0.66 | 140 | 0.56 |
Based on $K = 3$, the superiority threshold $\epsilon_{s,m} = 0.975$ and the corresponding calibrated futility thresholds, the predictive power is 86% for $P_{\text{best}}$ and 85% for the SUCRA. We also evaluated the expected sample size for each ranking method, both when there is no difference in the means of the interventions and under the ALICE design priors. As $P_{\text{best}}$ results in a lower expected sample size when an effective intervention exists, we chose $P_{\text{best}}$ as the ranking method for the ALICE trial.
Discussion
This study evaluated two methods for ranking interventions to make decisions in a Bayesian, multi-arm, two-stage, adaptive trial across 12 scenarios. Broadly, this study showed that $P_{\text{best}}$ is more likely than the SUCRA to drop futile interventions at the interim analysis, resulting in a smaller expected sample size and a lower power for trials with a larger number of interventions. To our knowledge, this is the first evaluation of using the SUCRA to make decisions in an adaptive trial, as it has primarily been used in network meta-analyses,20,41 and it provides evidence that further exploration of these methods and other relevant ranking methods could be useful.
We then used our novel design for the ALICE trial, a randomized trial to determine the optimal anxiolytic agent among three interventions to reduce distress for children undergoing laceration repair. Due to substantial practice variation, it was not obvious which intervention should be considered the common comparator, necessitating the use of treatment rankings to draw trial conclusions. This design is useful for trials where a placebo or clear standard of care is not available and the interventions have previously been shown to be effective: for example, variation in clinical practice where head-to-head trials have not been done, novel interventions developed at the same time by different teams or companies, common off-label use of drugs (for example, in paediatrics, where trials are lacking) and the comparison of non-drug interventions such as different implementation methods. A key challenge of the proposed trial design is choosing the design and analysis priors. We extracted these from the previous literature but, in some examples, absolute outcome values may not be available, for example, if only relative treatment effects are reported, which would make this method challenging to implement. Another limitation of the proposed method was the use of conjugate distributions, which limited the models we could consider. For example, we could have considered a pooled precision across the different interventions, but this would have created an infeasible computational burden for our simulation study.
Furthermore, the simulation study could have been expanded to consider additional ranking methods, which is an important avenue for future research. In particular, using only ranking metrics, rather than the absolute effect of interventions, may violate consistency, as different metrics may provide different treatment hierarchies.18 Moreover, the efficacy that we evaluate using the ranking metric may not be clinically significant. A future extension of this design could consider the effect size and a minimum clinically important value in the superiority criteria. This has been suggested in network meta-analysis19 and could be extended to our trial design.
Conclusion
In multi-arm clinical trials with no obvious control, Bayesian methods for ranking interventions can be used to determine the optimal intervention from a set of effective alternatives. In trials with a small number of interventions, $P_{\text{best}}$, the probability that the treatment is superior, provides high predictive power, while for larger numbers of interventions, the SUCRA offers increased predictive power. Our results showed that both ranking metrics can provide valid, powerful trials with different operating characteristics. Thus, we suggest that investigators carefully consider their trial design and the appropriate ranking method before the trial.
Supplemental Material
Supplemental material, sj-pdf-1-ctj-10.1177_17407745241251812 for A comparison of alternative ranking methods in two-stage clinical trials with multiple interventions by Nam-Anh Tran, Abigail McGrory, Naveen Poonai and Anna Heath in Clinical Trials
Supplemental material, sj-pdf-2-ctj-10.1177_17407745241251812 for A comparison of alternative ranking methods in two-stage clinical trials with multiple interventions by Nam-Anh Tran, Abigail McGrory, Naveen Poonai and Anna Heath in Clinical Trials
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship and/or publication of this article.
ORCID iD: Nam-Anh Tran
https://orcid.org/0000-0001-9368-2597
Anna Heath
https://orcid.org/0000-0002-7263-4251
Supplemental material: Supplemental material for this article is available online.
References
- 1. Bratton DJ, Phillips PPJ, Parmar MKB. A multi-arm multi-stage clinical trial design for binary outcomes with application to tuberculosis. BMC Med Res Methodol 2013; 13(1): 139.
- 2. Schöttker B, Lühmann D, Boulkhemair D, et al. Indirect comparisons of therapeutic interventions. GMS Health Technol Assess 2009; 5: Doc09.
- 3. Parmar MKB, Carpenter J, Sydes MR. More multiarm randomised trials of superiority are needed. Lancet 2014; 384(9940): 283–284.
- 4. Bassi A, Berkhof J, De Jong D, et al. Bayesian adaptive decision-theoretic designs for multi-arm multi-stage clinical trials. Stat Methods Med Res 2021; 30(3): 717–730.
- 5. Chow S-C, Chang M, Pong A. Statistical consideration of adaptive methods in clinical development. J Biopharm Stat 2005; 15(4): 575–591.
- 6. Phillips PPJ, Gillespie SH, Boeree M, et al. Innovative trial designs are practical solutions for improving the treatment of tuberculosis. J Infect Dis 2012; 205(suppl. 2): S250–S257.
- 7. Lin J, Bunn V. Comparison of multi-arm multi-stage design and adaptive randomization in platform clinical trials. Contemp Clin Trials 2017; 54: 48–59.
- 8. Ghosh P, Liu L, Mehta C. Adaptive multiarm multistage clinical trials. Stat Med 2020; 39(8): 1084–1102.
- 9. Chang M. Introductory adaptive trial designs: a practical guide with R, vol. 75. Boca Raton, FL: CRC Press, 2015.
- 10. Streiner DL. Alternatives to placebo-controlled trials. Can J Neurol Sci 2007; 34(S1): S37–S41.
- 11. Cheah PY, Steinkamp N, Von Seidlein L, et al. The ethics of using placebo in randomised controlled trials: a case study of a plasmodium vivax antirelapse trial. BMC Med Ethics 2018; 19(1): 19.
- 12. Stang A, Hense H-W, Jöckel K-H, et al. Is it always unethical to use a placebo in a clinical trial? PLoS Med 2005; 2(3): e72.
- 13. Magaret A, Angus DC, Adhikari NKJ, et al. Design of a multi-arm randomized clinical trial with no control arm. Contemp Clin Trials 2016; 46: 12–17.
- 14. Evans SR. Clinical trial structures. J Exp Stroke Transl Med 2010; 3(1): 8–18.
- 15. Rücker G, Schwarzer G. Ranking treatments in frequentist network meta-analysis works without resampling methods. BMC Med Res Methodol 2015; 15(1): 58.
- 16. Sheiner LB. Learning versus confirming in clinical drug development. Clin Pharmacol Ther 1997; 61(3): 275–291.
- 17. Berry SM, Carlin BP, Lee JJ, et al. Bayesian adaptive methods for clinical trials. Boca Raton, FL: CRC Press, 2010.
- 18. Salanti G, Nikolakopoulou A, Efthimiou O, et al. Introducing the treatment hierarchy question in network meta-analysis. Am J Epidemiol 2022; 191(5): 930–938.
- 19. Mavridis D, Porcher R, Nikolakopoulou A, et al. Extensions of the probabilistic ranking metrics of competing treatments in network meta-analysis to reflect clinically important relative differences on many outcomes. Biom J 2020; 62(2): 375–385.
- 20. Salanti G, Ades AE, Ioannidis JPA. Graphical methods and numerical summaries for presenting results from multiple-treatment meta-analysis: an overview and tutorial. J Clin Epidemiol 2011; 64(2): 163–171.
- 21. Miller JL, Capino AC, Thomas A, et al. Sedation and analgesia using medications delivered via the extravascular route in children undergoing laceration repair. J Pediatr Pharmacol Ther 2018; 23(2): 72–83.
- 22. Conway A, Rolley J, Sutherland JR. Midazolam for sedation before procedures. Cochrane Database Syst Rev 2016; 2016(5): CD009491.
- 23. National Clinical Guideline Centre. Sedation in children and young people: sedation for diagnostic and therapeutic procedures in children and young people. London: Royal College of Physicians, 2010.
- 24. Poonai N, Sabhaney V, Ali S, et al. Optimal dose of intranasal dexmedetomidine for laceration repair in children: a phase II dose-ranging study. Ann Emerg Med 2023; 82(2): 179–190.
- 25. Joseph L, Bélisle P. Bayesian sample size determination for normal means and differences between normal means. J R Stat Soc Ser D 1997; 46(2): 209–226.
- 26. Joseph L, Bélisle P. Bayesian consensus-based sample size criteria for binomial proportions. Stat Med 2019; 38(23): 4566–4573.
- 27. Jaki T, Wason JMS. Multi-arm multi-stage trials can improve the efficiency of finding effective treatments for stroke: a case study. BMC Cardiovasc Disord 2018; 18(1): 215.
- 28. Magirr D, Jaki T, Whitehead J. A generalized Dunnett test for multi-arm multi-stage clinical studies with treatment selection. Biometrika 2012; 99(2): 494–501.
- 29. White IR, Choodari-Oskooei B, Sydes MR, et al. Combining factorial and multi-arm multi-stage platform designs to evaluate multiple interventions efficiently. Clin Trials 2022; 19(4): 432–441.
- 30. Cheng Y, Shen Y. Bayesian adaptive designs for clinical trials. Biometrika 2005; 92(3): 633–646.
- 31. Wason J, Stallard N, Bowden J, et al. A multi-stage drop-the-losers design for multi-arm clinical trials. Stat Methods Med Res 2017; 26(1): 508–524.
- 32. Morita S, Thall PF, Müller P. Determining the effective sample size of a parametric prior. Biometrics 2008; 64(2): 595–602.
- 33. Bratton DJ, Parmar MKB, Phillips PPJ, et al. Type I error rates of multi-arm multi-stage clinical trials: strong control and impact of intermediate outcomes. Trials 2016; 17(1): 309.
- 34. Jones SR, Carley S, Harrison M. An introduction to power and sample size estimation. Emerg Med J 2003; 20(5): 453–458.
- 35. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38(11): 2074–2102.
- 36. Hall JE, Patel DP, Thomas JW, et al. Certified child life specialists lessen emotional distress of children undergoing laceration repair in the emergency department. Pediatr Emerg Care 2018; 34(9): 603–606.
- 37. Kumar K, Ali S, Sabhaney V, et al. Anxiolysis for laceration repair in children: a survey of pediatric emergency providers in Canada. Can J Emerg Med 2022; 24(1): 75–83.
- 38. Gursky B, Kestler LP, Lewis M. Psychosocial intervention on procedure-related distress in children being treated for laceration repair. J Dev Behav Pediatr 2010; 31(3): 217–222.
- 39. Elliott CH, Jay SM, Woody P. An observation scale for measuring children's distress during medical procedures. In: Roberts MC, Koocher GP, Routh DK, et al. (eds) Readings in pediatric psychology. Boston, MA: Springer, 1993, pp. 259–267.
- 40. Luhmann JD, Kennedy RM, Porter FL, et al. A randomized clinical trial of continuous-flow nitrous oxide and midazolam for sedation of young children during laceration repair. Ann Emerg Med 2001; 37(1): 20–27.
- 41. Daly CH, Neupane B, Beyene J, et al. Empirical evaluation of SUCRA-based treatment ranks in network meta-analysis: quantifying robustness using Cohen's kappa. BMJ Open 2019; 9(9): e024625.