Abstract
Different combined outcome-data lags (follow-up durations plus data-collection lags) may affect the performance of adaptive clinical trial designs. We assessed the influence of different outcome-data lags (0–105 days) on the performance of various multi-stage, adaptive trial designs (2/4 arms, with/without a common control, fixed/response-adaptive randomisation) with undesirable binary outcomes, according to different inclusion rates (3.33/6.67/10 patients/day) and under scenarios with no, small, and large differences. Simulations were conducted under a Bayesian framework, with constant stopping thresholds for superiority/inferiority calibrated to keep type-1 error rates at approximately 5%. We assessed multiple performance metrics, including mean sample sizes, event counts/probabilities, probabilities of conclusiveness, root mean squared errors (RMSEs) of the estimated effect in the selected arms, and RMSEs between the analyses at the time of stopping and the final analyses including data from all randomised patients. Performance metrics generally deteriorated when the proportions of randomised patients with available data were smaller due to longer outcome-data lags or faster inclusion: mean sample sizes, event counts/probabilities, and RMSEs were larger, while the probabilities of conclusiveness were lower. Impairments in performance metrics with outcome-data lags ≤45 days were relatively small compared with those occurring with lags ≥60 days. For most metrics, the effects of different outcome-data lags and lower proportions of randomised patients with available data were larger than those of different design choices, e.g., the use of fixed versus response-adaptive randomisation. Increased outcome-data lag substantially affected the performance of adaptive trial designs. Trialists should consider the effects of outcome-data lags when planning adaptive trials.
Keywords: Adaptive trials, follow-up durations, data collection lag, trial design, simulation
1 |. Introduction
Most randomised clinical trials are conducted with no or few interim analyses, which usually employ very strict criteria for stopping early,1 e.g., O’Brien-Fleming monitoring boundaries that preserve most of the alpha for the final analysis (the final analysis thus employs more lenient criteria for declaring an intervention effective than earlier analyses).2 This approach is challenged by sample size calculations based on assumptions that often turn out to be optimistic or incorrect,3–6 which may leave trials ultimately unable to firmly confirm or reject clinically important effects and may cause them to run longer than required.1,4,7,8 Thus, adaptive trials with a higher number and frequency of interim analyses have received increased attention.1,9,10 Such designs may employ adaptive dropping of inferior arms, adaptive stopping of the full trial, and response-adaptive randomisation according to pre-specified adaptation rules.10,11
Regardless of design, whenever interim or adaptive analyses are conducted before inclusion, follow-up, and data collection for all patients have concluded, the proportion of randomised patients with available data at the time of analysis will be below 100%, unless overrunning is not allowed (i.e., unless inclusion is paused during data-collection lag periods and while awaiting analysis results, which is often infeasible in practice, especially when the number of possible analyses is large). Consequently, effect estimates used for adaptations may later change in magnitude or even direction once data from all randomised patients are analysed.12 Higher inclusion rates, longer follow-up durations, and longer data-collection lags aggravate this risk,13 especially with a relatively higher number of adaptive analyses. Similarly, the influence of longer outcome-data lags depends on the ratio of the follow-up duration to the inclusion period,13 with a ratio <0.25 previously suggested as a rule of thumb for when adaptive trial designs are useful.14 While these factors should be considered when selecting follow-up durations and planning data collection and verification procedures in adaptive trials, studies that could inform these decisions by assessing the influence of different outcome-data lags (durations of follow-up plus data-collection lags) in complex adaptive trials under different inclusion rates are lacking. Moreover, previous methodological studies have been criticised for ignoring these factors and assuming that outcome data are always instantaneously available following inclusion.13
Consequently, we undertook a simulation study assessing the influence of several different outcome-data lags on the performance of several adaptive trial designs under different inclusion rates. We hypothesised that longer outcome-data lags would affect the performance characteristics of adaptive trials to an extent that could sway trialists to prefer relatively shorter follow-up durations or less adaptation in such trial designs, where possible.15
2 |. Methods
We conducted this simulation study according to a publicly pre-registered protocol and statistical analysis plan15 adhering to recommendations for statistical simulation studies16 and considering previous studies assessing the performance of different adaptive trial designs.17–19
2.1 |. Statistical methods and simulation
We conducted all simulations in R v4.2.2 using our adaptr package20 (inceptdk.github.io/adaptr; see Appendix C). The adaptr package simulates adaptive trials with adaptive arm dropping, stopping, and/or response-adaptive randomisation using Bayesian statistical methods. The complete analysis code is included in Appendix C.
Using the adaptr package, we simulated trials in which patients were randomly allocated to the active treatment arms (described below) according to the current allocation ratios, immediately followed by simulation of random, undesirable binary outcomes (e.g., mortality) from Bernoulli distributions. Subsequently, the probability of experiencing the outcome in each trial arm was estimated using beta-binomial conjugate prior models20–22 with flat beta(α = 1, β = 1) priors, corresponding to all event probabilities being equally probable a priori and carrying information equivalent to two additional patients in each arm (one with the outcome and one without).21 Importantly, while outcomes were randomly generated immediately following randomisation, outcome data were only used for patients who had completed the relevant combined follow-up/data-collection lag period at the time of each analysis. At each analysis, comparisons used 5,000 independent posterior draws from each trial arm. We simulated 100,000 trials for each simulation configuration (described below); this number ensures sufficient precision for all performance metrics, corresponds to the United States Food and Drug Administration’s recommendations for the assessment of final, ‘real’ complex adaptive trials,23 and is likely larger than necessary given the number and spacing of the combined outcome-data lags considered.10,23
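To make the outcome model concrete, the following minimal base-R sketch (illustrative only; the arm labels and interim counts are hypothetical, and this is not the adaptr implementation) shows the flat beta(1, 1) priors, the conjugate beta posteriors, and the use of 5,000 independent posterior draws per arm to estimate the probability of each arm being the best (here, having the lowest event probability):

```r
# Minimal base-R sketch of the beta-binomial model described above
# (illustrative only; arm labels and event counts are hypothetical).
set.seed(42)

# Hypothetical interim data: events and patients with available outcome data per arm
events   <- c(A = 45, B = 38)
patients <- c(A = 180, B = 175)

# Flat beta(1, 1) priors yield conjugate beta posteriors for the event probabilities
alpha_post <- 1 + events
beta_post  <- 1 + (patients - events)

# 5,000 independent posterior draws per arm (undesirable outcome: lower is better)
draws <- sapply(seq_along(events), function(i) rbeta(5000, alpha_post[i], beta_post[i]))
colnames(draws) <- names(events)

# Probability of each arm being the best (lowest event probability) across draws
p_best <- prop.table(table(factor(names(events)[apply(draws, 1, which.min)],
                                  levels = names(events))))
print(round(p_best, 3))
```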
2.2 |. Adaptive trial designs assessed
We considered six different adaptive trial designs:
Two arms using fixed, equal randomisation (50%:50%)
Two arms using response-adaptive randomisation
Four arms all compared against each other (i.e., no common control arm) using fixed, equal randomisation (25%:25%:25%:25%)
Four arms compared against each other using response-adaptive randomisation
Four arms with three interventional arms compared pairwise against a common control arm using fixed, square-root-ratio-based randomisation10,24 (allocation probabilities in the ratio of the square root of the number of non-control arms to 1 between the common control arm and each interventional arm, i.e., 36.6% allocation to the control arm and 21.1% to each interventional arm, with allocation probabilities re-calculated using the same approach after arm dropping, e.g., 41.4% to the control arm and 29.3% to each interventional arm after dropping one interventional arm; see the allocation sketch after this list)
Four arms with three interventional arms and a common control group using square root ratio-based initial randomisation,10,24 with subsequent fixed control arm allocation (to 36.6% initially, with the control arm allocation probability re-calculated after arm dropping as described above) and response-adaptive randomisation in the interventional arms
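As referenced in the list above, the square-root-ratio-based allocation used in the last two designs can be sketched as follows (a minimal illustration under the definition given above; the function name is ours). It reproduces the allocation probabilities quoted above for three and, after dropping one arm, two interventional arms:

```r
# Minimal sketch of square-root-ratio-based allocation (function name is ours):
# the control arm receives sqrt(k) 'shares' and each of the k non-control arms
# receives 1 'share'; shares are normalised to allocation probabilities.
sqrt_ratio_allocation <- function(n_noncontrol) {
  shares <- setNames(c(sqrt(n_noncontrol), rep(1, n_noncontrol)),
                     c("control", paste0("intervention_", seq_len(n_noncontrol))))
  round(shares / sum(shares), 3)
}

sqrt_ratio_allocation(3)  # 0.366 to the control arm and 0.211 to each interventional arm
sqrt_ratio_allocation(2)  # 0.414 and 0.293 after dropping one interventional arm
```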
We considered three different, constant inclusion (recruitment) rates of 3.33, 6.67, and 10.0 patients per day; the assumed constant inclusion rates are justified by inclusion patterns in previous trials by our group.25–29 In these, inclusion rates have generally been reasonably constant after an initial period of site initiations, and the first adaptive analyses in this simulation study will occur after a burn-in period (described below) long enough to ignore the initial non-constant inclusion rate.
We considered eight different outcome-data lags, defined as the follow-up durations plus data-collection lags: 0 (outcome data immediately available), 15, 30, 45, 60, 75, 90, and 105 days, which covers the range of periods (including data-collection lags) mostly used in clinical trials conducted in our setting (critical care).30 The proportions of randomised patients with available outcome data at the time of each possible adaptive analysis according to the inclusion rate and outcome-data lag combinations are presented in Figure 1. Of note, while the same outcome-data lag periods can consist of different follow-up durations and data-collection lag combinations (e.g., an outcome-data lag of 30 days could equally consist of 15 days of follow-up plus 15 days of data-collection lag or 20 days of follow-up plus 10 days of data collection lag), any influence on adaptive trial performance is due to the length of the combined period.
Figure 1.
Proportion of randomised patients with data available at each adaptive analysis
Proportion of randomised patients with data available at the time of each possible adaptive analysis (marked with dotted lines) according to inclusion rates and outcome-data lag (follow-up durations plus data-collection lag). Horizontal axis: % of patients included out of the maximum possible 10,000; vertical axis: proportion (%) of randomised patients with outcome data available at the time of each analysis.
Finally, we considered three different clinical scenarios, i.e., three different sets of simulated event probabilities (including both beneficial and harmful effects compared to the control/standard-of-care arm), with the two-arm trials only using the first two event probabilities in each set:
No differences: 25% event probability in all arms
Small differences: arm A 25%, arm B 22.5%, arm C 27.5%, arm D 26.5%
Large differences: arm A 25%, arm B 20%, arm C 30%, arm D 28%
Across all scenarios, arm A was considered to represent the standard-of-care and was used as the common control in designs employing a common control arm. In total, this yielded 432 unique simulation configurations: combinations of 6 trial designs, 3 inclusion rates, 8 outcome-data lags (combined follow-up durations and data-collection lags), and 3 scenarios (6 × 3 × 8 × 3 = 432).
2.3 |. Adaptations
In all simulations, the first adaptive analysis was conducted after randomisation of 400 patients, followed by analyses after every 300 additional randomised patients up to a maximum of 10,000 randomised patients. Adaptive analyses were thus conducted according to the number of patients randomised (and not the number of patients with outcome data available or according to fixed time-points, as may also be done), and early planned adaptive analyses were skipped in the cases where no patients had outcome data available. Overrunning was allowed, i.e., inclusion was not paused while awaiting data collection and adaptive analyses to complete. In each simulation, a final analysis including outcome data from all randomised patients was conducted after stopping and used to calculate a specific performance metric (described below), but superiority could not be declared at this analysis.
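To illustrate how outcome-data lags and inclusion rates jointly determine data availability at the adaptive analyses (cf. Figure 1), the following simplified sketch (not the adaptr implementation) assumes an exactly constant inclusion rate and analyses conducted immediately once the scheduled number of patients has been randomised:

```r
# Simplified sketch: proportion of randomised patients with available outcome data
# at each adaptive analysis, assuming a constant inclusion rate and analyses
# conducted immediately when the scheduled number of patients has been randomised.
analysis_ns <- seq(400, 10000, by = 300)  # first analysis after 400 patients, then every 300

prop_with_data <- function(n_randomised, inclusion_rate, lag_days) {
  # Only patients randomised at least `lag_days` before the analysis have data available
  n_with_data <- pmax(0, n_randomised - inclusion_rate * lag_days)
  n_with_data / n_randomised
}

# Example: 10 patients/day and a 60-day outcome-data lag;
# 0.00 at the first analysis (400 randomised), increasing towards 0.94 at 10,000 randomised
round(prop_with_data(analysis_ns, inclusion_rate = 10, lag_days = 60), 2)
```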
2.3.1 |. Stopping rules
Based on comparisons of the posterior distributions, two-arm trials were stopped for superiority if the probability of one arm being the best exceeded a certain threshold. In the four-arm trials with all arms compared against each other, arms were dropped for inferiority if their probability of being the overall best was below a certain threshold, and trials were stopped for superiority when the probability of one arm being the overall best exceeded a certain threshold. In these cases, arm dropping was immediately followed by a new analysis; as the probabilities of each arm being the best must sum to 100%, dropping an arm and updating these probabilities may lead to another arm crossing the superiority threshold. In the four-arm trials with pairwise comparisons against a common control group, interventional arms were dropped for inferiority if their probability of being better than the control was below a certain threshold. If the probability of an interventional arm being better than the control exceeded a certain threshold, that arm was promoted to become the new control arm (with all subsequent pairwise comparisons made against the new control arm only), the previous control arm was dropped for inferiority, and all remaining interventional arms were immediately compared with the new control arm, as previously described.10,20 In these designs, trials were stopped for superiority only when a single arm remained after all other arms had been dropped for inferiority, regardless of whether the remaining arm was the original control or an interventional arm (i.e., a trial was not considered stopped for superiority if the original common control was dropped due to superiority of an interventional arm while other interventional arms remained for comparison against the new control arm).
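As a minimal illustration of these rules for the designs without a common control arm (the function name, thresholds, and probabilities below are hypothetical; the actual rules are implemented in adaptr), the probabilities of each arm being the overall best can be checked against the symmetric thresholds described below:

```r
# Sketch of the symmetric stopping/dropping logic for designs without a common
# control arm (illustrative only; thresholds and probabilities are hypothetical).
assess_arms <- function(p_best, superiority_threshold) {
  inferiority_threshold <- 1 - superiority_threshold  # symmetric thresholds
  list(
    stop_superiority = names(p_best)[p_best > superiority_threshold],
    drop_inferiority = names(p_best)[p_best < inferiority_threshold]
  )
}

# Hypothetical probabilities of each arm being the overall best at an adaptive analysis
p_best <- c(A = 0.004, B = 0.641, C = 0.313, D = 0.042)
assess_arms(p_best, superiority_threshold = 0.99)
# Arm A falls below the 1% inferiority threshold and is dropped; no arm crosses the
# 99% superiority threshold, so p_best is re-computed among the remaining arms
```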
All stopping thresholds were constant across all analyses within the same set of simulations and were symmetric, i.e., the probability thresholds for inferiority were defined as 100% minus the probability thresholds for superiority. To ensure fair comparisons between designs, we calibrated stopping rules to keep the overall Bayesian type-1 error rates at approximately 5% (as is usually required for ‘real’ clinical trials),23 defined as the probability of conclusiveness in the scenario with no differences between trial arms,10 corresponding to two-sided, family-wise type-1 error rates for the outcome across all compared arms. We used a sequential model-based Bayesian optimisation algorithm31 employing Gaussian process interpolation,32 initially based on fewer simulations than the final 100,000. If the probability of conclusiveness was ≤4% or ≥6% after initial calibration, we planned to repeat calibration with 100,000 simulations in each step until these criteria were met; this, however, proved unnecessary.
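The calibration itself used sequential model-based Bayesian optimisation with Gaussian process interpolation via adaptr; the following self-contained sketch is a deliberately simplified grid-search illustration of the calibration target only (a toy two-arm trial with hypothetical settings), not the method used in the study:

```r
# Toy illustration of calibrating a constant superiority threshold towards a ~5%
# type-1 error rate by grid search. NOT the study's method (which used sequential
# model-based Bayesian optimisation with Gaussian process interpolation).
set.seed(2023)

simulate_trial <- function(threshold, p_true = 0.25, n_per_look = 300, n_looks = 10) {
  events <- patients <- c(A = 0, B = 0)
  for (look in seq_len(n_looks)) {
    new_n <- c(A = n_per_look / 2, B = n_per_look / 2)
    events <- events + rbinom(2, new_n, p_true)   # no-difference scenario
    patients <- patients + new_n
    # Probability that arm A has the lower event probability, given beta(1, 1) priors
    draws_a <- rbeta(1000, 1 + events["A"], 1 + patients["A"] - events["A"])
    draws_b <- rbeta(1000, 1 + events["B"], 1 + patients["B"] - events["B"])
    p_a_best <- mean(draws_a < draws_b)
    if (max(p_a_best, 1 - p_a_best) > threshold) return(TRUE)  # stopped for superiority
  }
  FALSE  # inconclusive at the maximum sample size
}

# Estimated type-1 error: proportion of simulated trials stopped for superiority
type1_error <- function(threshold, n_sims = 2000) {
  mean(replicate(n_sims, simulate_trial(threshold)))
}

thresholds <- c(0.96, 0.97, 0.98, 0.99)
sapply(thresholds, type1_error)  # choose the threshold yielding ~5% conclusiveness
```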
2.3.2 |. Response-adaptive randomisation
Response-adaptive randomisation was based on the probabilities of each arm being the overall best regardless of the design (reflecting the aim of ultimately finding a single best arm),18 ‘softened’ by raising the raw probabilities to the power 0.7 and normalising, with minimum allocation ratios of 30% (two-arm trials) and 15% (four-arm trials) for all active arms.10
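A minimal sketch of this softening and restriction is shown below (the function name, the exact re-scaling used to honour the minimum allocations, and the example probabilities are ours; adaptr’s internal handling may differ in detail):

```r
# Minimal sketch of 'softened', restricted response-adaptive randomisation
# (function name and the re-scaling of non-restricted arms are ours).
soften_allocation <- function(p_best, power = 0.7, min_alloc = 0.15) {
  soft  <- p_best^power
  probs <- soft / sum(soft)                  # softened, normalised allocation probabilities
  low   <- probs < min_alloc                 # arms falling below the minimum allocation
  probs[low]  <- min_alloc                   # fix these at the minimum...
  probs[!low] <- probs[!low] * (1 - sum(probs[low])) / sum(probs[!low])  # ...and re-scale the rest
  probs
}

# Hypothetical probabilities of each of four arms being the overall best
p_best <- c(A = 0.05, B = 0.55, C = 0.30, D = 0.10)
round(soften_allocation(p_best), 3)  # arms A and D are held at the 15% minimum
```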
2.4 |. Performance metrics
For each configuration, the following performance metrics were calculated:10,17–20
Mean (expected) total (i.e., including all arms) sample sizes, event probabilities, and event counts across all simulated trials (calculated for all randomised patients, i.e., after concluding follow-up and data collection for all randomised patients); these were supplemented with standard deviations, medians, and interquartile ranges.
Probability of conclusiveness: the percentage of simulated trials stopped for superiority at an adaptive analysis (i.e., the last time-point at which superiority could be declared was the adaptive analysis conducted after randomisation of 10,000 patients, which only included those with available outcome data at that time); this corresponds to the type-1 error rate in the scenario with no differences and to the power (100% minus the type-2 error rate) in the scenarios with differences.10
Probability of selecting the best arm (i.e., stopping with a superiority decision for the best arm).
Root mean squared errors (RMSEs) of 1) the estimated event probabilities in the selected arm (median values of the posterior distributions at the last adaptive analysis, i.e., when trials were stopped) compared with the true event probabilities, and 2) the estimated event probabilities in the selected arms at the last adaptive analysis (i.e., after randomisation of a maximum of 10,000 patients and including only those with available outcome data at that time) compared with those from a final analysis including outcome data for all randomised patients (see the sketch at the end of this section).
Ideal design percentages (IDPs): a measure combining arm selection probabilities, power, and the consequences of selecting inferior arms (i.e., selecting an arm that is only slightly inferior to the best affects the IDP less than selecting a substantially inferior arm), defined and calculated as previously described, with n denoting the number of arms in the trial.10,17–20
Probabilities of selecting the best arm and IDPs were only calculated in scenarios with differences present. RMSEs of the estimated event probabilities in the selected arms and IDPs were calculated twice:10,20 first, for trials ending in superiority only with the superior arm selected and, second, assuming that the control or standard-of-care arm was selected in inconclusive trials (except four-arm trial simulations where this arm was dropped at an earlier analysis), as this corresponds to what may occur in clinical practice where the standard-of-care arm is most commonly used after an inconclusive trial.10
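As referenced above, the RMSEs are calculated across simulated trials as the square root of the mean squared difference between each estimate and its comparator; a minimal sketch with hypothetical values follows:

```r
# Minimal sketch of the RMSE calculations (all values are hypothetical):
rmse <- function(estimates, comparator) sqrt(mean((estimates - comparator)^2))

# 1) Estimated event probability in the selected arm at the last adaptive analysis
#    vs. the true event probability in that arm (here, 20%)
est_selected <- c(0.185, 0.210, 0.192, 0.225, 0.198)
rmse(est_selected, 0.20)

# 2) Estimate at the last adaptive analysis vs. the estimate from the final analysis
#    including outcome data from all randomised patients
est_final <- c(0.190, 0.205, 0.200, 0.218, 0.201)
rmse(est_selected, est_final)
```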
3 |. Results
Calibrated superiority stopping thresholds ranged from 98.09% to 99.75% and were largely similar according to outcome-data lags (Table S1, Appendix B). Raw numerical results for all performance metrics are presented in Tables S2–S11 (Appendix B).
3.1 |. Mean total sample sizes
Longer outcome-data lags substantially affected sample sizes in the scenarios with small or large differences (Figure 2), with the largest differences observed in the simulations with the fastest inclusion rates and in the four-arm trials without a common control group, for which differences of up to approximately 1,150 patients (small-differences scenario) and 1,440 patients (large-differences scenario) were observed. Mean total sample sizes were below the maximum of 10,000 patients in all simulation configurations, regardless of between-arm differences and outcome-data lags.
Figure 2.
Mean sample sizes
Mean sample sizes across all simulations for each design/inclusion pattern/scenario/outcome-data lag combination, with specific simulation settings outlined in the legend below the plot.
3.2 |. Mean total event counts and probabilities
Longer outcome-data lags substantially affected total event counts in the small- and large-differences scenarios (Figure 3), especially when combined with higher inclusion rates, with differences of up to approximately 350 additional events. Mean total event probabilities were similarly affected, although differences were relatively small in the two-arm designs (approximately 0.2%-points in the scenario with large differences and substantially less in the other scenarios) and smaller in the four-arm trials with a common control than in those without (approximately 0.2%-points versus 0.6%-points in the scenarios with large differences, respectively; Figure S1, Appendix A). Designs using response-adaptive randomisation generally led to lower mean total event probabilities, especially with shorter outcome-data lags.
Figure 3.
Mean total event counts
Mean total event counts across all simulations for each design/inclusion pattern/scenario/outcome-data lag combination, with specific simulation settings outlined in the legend below the plot.
3.3 |. Probability of conclusiveness and selecting the best arm
The probabilities of conclusiveness (Figure 4) in the scenarios without differences (corresponding to the type-1 error rates) ranged from 4.52% to 5.55% after calibration. In the scenarios with differences, the probabilities of conclusiveness (corresponding to power) were ≥99.3% in all scenarios with large differences. In the scenarios with small differences, we observed differences of up to 9.1%-points according to outcome-data lag, largest in the four-arm designs with fast inclusion rates. The probabilities of selecting the best arm showed a pattern similar to the probabilities of conclusiveness (Figure S2, Appendix A).
Figure 4.
Probability of conclusiveness
Probability of conclusiveness for each design/inclusion pattern/scenario/outcome-data lag combination, with specific simulation settings outlined in the legend below the plot.
The probability of conclusiveness may be interpreted as the type-1 error rate in scenarios without differences and as the power (100% – the type-2 error rate) in scenarios with differences.
3.4 |. Root mean squared errors
RMSEs of the estimated compared with the true event probabilities in the selected arm across simulations ending in superiority (Figure S3, Appendix A) ranged from approximately 1.4%-points to 2.8%-points in the large- and small-differences scenarios and from approximately 3.2%-points to 7.5%-points in the no-differences scenarios, with relatively large increases with longer outcome-data lags and the largest differences seen in the four-arm trials. A similar pattern emerged when the standard-of-care/control arm was selected in inconclusive trials, although RMSEs were smaller in magnitude, and RMSEs in the no-differences scenarios were smaller than in the scenarios with differences (Figure S4, Appendix A). RMSEs of the estimated event probabilities at the last adaptive analysis compared with the final analysis including all patients in trials stopped for superiority increased substantially with longer outcome-data lags, from 0%-points to up to 5.8%-points (no-differences scenarios), 2.0%-points (small-differences scenarios), and 2.2%-points (large-differences scenarios), with the largest increases seen with faster inclusion (Figure 5).
Figure 5.
Root mean squared errors between last adaptive analysis and final analysis (superiority only)
Root mean squared errors (RMSEs) of the estimated event probabilities in the selected arm for trials ending in superiority only at the last adaptive analysis (where the trial was stopped for superiority) compared with the final analysis (including outcome data for all randomised patients) across all simulations for each design/inclusion pattern/scenario/follow-up and data collection lag duration combination, with specific simulation settings outlined in the legend below the plot.
3.5 |. Ideal design percentages
IDPs decreased with longer outcome-data lags, both when calculated across trials ending in superiority only and when the standard-of-care/control arm was selected in inconclusive trials (Figures S5–S6, Appendix A). IDPs were >98.4% in all trials in the large-differences scenario regardless of calculation method; in the small-differences scenarios, all IDPs were >97.9% when restricted to trials ending in superiority. When the standard-of-care/control arm was selected in inconclusive trials, IDPs in the small-differences scenario decreased substantially with longer outcome-data lags, with values ranging from 59.5% to 69.7% in the two-arm trials, 85.5% to 88.9% in the four-arm trials without a common control group, and 82.8% to 89.2% in the four-arm trials with a common control group.
4 |. Discussion
We assessed the influence of different outcome-data lags, i.e., combined follow-up durations and data-collection lags, on the performance of several adaptive trial designs using statistical simulation under three different inclusion rates and under scenarios with no, small, or large differences present. As expected, we found that performance metrics deteriorated with diminishing proportions of randomised patients with outcome data available at the time of analysis due to longer outcome-data lags or faster inclusion. Deteriorations in performance metrics for outcome-data lags up to 45 days were generally relatively small compared with the substantial deteriorations occurring with lags of 60 days or more. This pattern was similar across inclusion rates but more pronounced with faster inclusion. Notably, important deteriorations with longer outcome-data lags occurred even though all ratios of outcome-data lag to maximum inclusion period in our study (all ≤0.105) were substantially below the previously suggested threshold of 0.25 for when adaptive trials become useful.14 The effects of different outcome-data lags and lower proportions of randomised patients with available data on most metrics were generally larger than the effects of using fixed versus response-adaptive randomisation.
Longer outcome-data lags affected performance metrics that may be prioritised for logistical reasons (expected sample sizes), for the benefit of patients external to the trial (probability of conclusiveness/selecting the best arm, IDPs), for the benefit of patients internal to the trial (total event counts/probabilities), and for obtaining accurate estimates (RMSEs).10
As longer outcome-data lags mean that fewer of the randomised patients have outcome data available at each analysis, total sample sizes increase, which in turn increases total event counts and leads to deteriorations in the other performance metrics, because adaptations are made on less data. Our results add to the existing literature assessing errors in treatment estimates in more conventional adaptive trials using group sequential designs,11 based on simulations33,34 and analytical work,35 by illustrating the influence of incomplete information on the accuracy of treatment effect estimates. While appropriate stopping rules control the overall type-1 error, treatment effects will generally be overestimated in trials stopped before reaching the maximum sample size, although the extent of this overestimation seems to vary from slight and typically unimportant33,35 to potentially important.34,35 Importantly, overestimation is typically smaller in large trials (comparable to those assessed in our simulations)34 and in trials stopped early according to constant stopping rules (i.e., Pocock monitoring boundaries in conventional group sequential designs, akin to our constant stopping rules) compared with those using stopping rules that are more conservative at earlier analyses (i.e., O’Brien-Fleming or Haybittle-Peto monitoring boundaries).35 As expected, RMSEs calculated for simulations stopping for superiority were highest in the scenarios without differences present, as all superiority decisions in these simulations were, by definition, type-1 errors.
On a more general note, comparing designs using fixed randomisation to the corresponding designs using response-adaptive randomisation showed that 1) mean total event counts and probabilities were lower with response-adaptive randomisation regardless of the number of arms or use of a common control, which is explained by allocation of more participants to better arms (i.e., arms with lower event probabilities), 2) expected sample sizes were only slightly larger with response-adaptive randomisation in some of the simulations for two-arm trials, and 3) response-adaptive randomisation led to slightly lower probabilities of conclusiveness in the two-arm trials, similar probabilities in the four-arm trials without a common control, and slightly higher probabilities in the four-arm trials with a common control arm. This suggests that response-adaptive randomisation, at least when restricted, is not detrimental to two-arm trials and may even be preferable for patients internal to the trials due to a potentially lower risk of undesirable outcomes. As such, previous discussions about whether response-adaptive randomisation is preferable or suboptimal from an ethical point of view in two-arm trials36–40 should not cause trialists to abandon the use of response-adaptive randomisation in such trials before submitting such designs to proper evaluation.
4.1 |. Strengths
This simulation study has multiple strengths. First, we conducted it according to a protocol and statistical analysis plan registered and made publicly available before the analyses were conducted,15 and we include all analysis code in Appendix C for full transparency and reproducibility. Second, we ran a large number of simulations for each configuration, ensuring adequate precision of the presented estimates. Third, the relatively broad range of realistic outcome-data lags combined with different inclusion rates yielded a relatively fine grid of proportions of randomised patients with outcome data available at each analysis. Further, the influence of outcome-data lags was assessed across multiple disparate trial designs, with sensible restrictions applied to the designs using response-adaptive randomisation, and under different assumed clinical scenarios. While the results are likely not generalisable to all settings, the combination of multiple outcome-data lags, trial designs, and clinical scenarios means that these results should have reasonable external validity in at least relatively similar settings. Finally, the calibration of all trial designs to ensure type-1 error rates of approximately 5% resembles actual clinical trial design and ensures comparability of results across simulation configurations.
4.2 |. Limitations
This study also has limitations. First, while a relatively broad range of outcome-data lags was assessed using various relevant trial designs, clinical scenarios, and inclusion rates, the results are not generalisable to all other settings, and preferences for specific designs and outcome-data lags may depend on other factors and practical/logistical considerations external to the trial design per se and, thus, not assessed here. We recommend that trialists considering different follow-up durations assess their influence in a formalised manner, using simulations such as those in this study, to guide their choices in actual trials. Second, the clinical scenarios and maximum allowed sample size (10,000 patients) were somewhat arbitrary but chosen to resemble what may realistically be chosen in large, pragmatic phase-3 or phase-4 trials conducted in the clinical setting of the clinical co-authors, i.e., critical care. Of note, the choice of similar maximum sample sizes in both two- and four-arm trials could be challenged, and our maximum sample sizes were substantially larger than in some other simulation studies of adaptive trial performance.17,18 Finally, while the study was conducted using a Bayesian framework, we used minimally informative priors in all analyses. Although beyond the scope of this study, the Bayesian framework enables the use of more informative priors conveying scepticism (to protect against early adaptations to chance through regularisation19) or incorporating previous results or actual a priori beliefs, and sceptical or informative priors may be considered in actual trial planning.
4.3 |. Conclusions
We found that performance metrics in adaptive trials deteriorated with diminishing proportions of randomised patients with data available at the time of analysis due to longer outcome-data lags, especially when this proportion was further decreased by faster inclusion rates. The effects of longer outcome-data lags on performance metrics were generally larger than those of design characteristics such as fixed versus response-adaptive randomisation. The relative impairment of performance metrics increased substantially with lags of 60 days or more; consequently, trialists should consider the effects of outcome-data lags when planning adaptive trials.
Supplementary Material
Acknowledgements
The work was partially supported by the Danish e-infrastructure Cooperation (DeiC) National HPC (DeiC-KU-S1-000114), with simulations performed on the UCloud (docs.cloud.sdu.dk) interactive high-performance computing system managed by the eScience Center at the University of Southern Denmark.
The Gaussian process model used for calibrating stopping thresholds is partially based on code from ‘Surrogates’32 by Robert B. Gramacy (with permission).
Funding information
This study was conducted as part of the Intensive Care Platform Trial (INCEPT, www.incept.dk) programme funded by a grant from Sygeforsikringen “danmark” (20200320) and supported by Grosserer Jakob Ehrenreich og Hustru Grete Ehrenreichs Fond and Dagmar Marshalls Fond. MOH is supported by grant numbers R00HL141678 and R01HL168202 from the United States National Institutes of Health.
Footnotes
Conflicts of interest statement
The authors have no conflicts of interest to declare.
Data availability statement
The complete R code (including random seeds) used to run all simulations and prepare result tables and figures is included in Appendix C. As all simulated data were generated using the enclosed code, data sharing is not relevant.
References
1. Granholm A, Alhazzani W, Derde LPG, et al. Randomised clinical trials in critical care: past, present and future. Intensive Care Med 2022; 48(2): 164–178.
2. Wassmer G and Brannath W. Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. 1st ed. Switzerland: Springer International Publishing, 2016.
3. Ridgeon EE, Bellomo R, Aberegg SK, et al. Effect sizes in ongoing randomized controlled critical care trials. Crit Care 2017; 21: 132.
4. Harhay MO, Wagner J, Ratcliffe SJ, et al. Outcomes and statistical power in adult critical care randomized trials. Am J Respir Crit Care Med 2014; 189(12): 1469–1478.
5. Cuthbertson BH and Scales DC. “Paying the Piper”: The Downstream Implications of Manipulating Sample Size Assumptions for Critical Care Randomized Control Trials. Crit Care Med 2020; 48(12): 1885–1886.
6. Abrams D, Montesi SB, Moore SKL, et al. Powering Bias and Clinically Important Treatment Effects in Randomized Trials of Critical Illness. Crit Care Med 2020; 48(12): 1710–1719.
7. Altman DG and Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311(7003): 485.
8. Altman DG. Statistics and Ethics in Medical Research III: How large a sample? Br Med J 1980; 281(6251): 1336–1338.
9. Harhay MO, Casey JD, Clement M, et al. Contemporary strategies to improve clinical trial design for critical care research: insights from the First Critical Care Clinical Trialists Workshop. Intensive Care Med 2020; 46(5): 930–942.
10. Granholm A, Kaas-Hansen BS, Lange T, et al. An overview of methodological considerations regarding adaptive stopping, arm dropping and randomisation in clinical trials. J Clin Epidemiol 2023; 153: 45–54.
11. Pallmann P, Bedding AW, Choodari-Oskooei B, et al. Adaptive designs in clinical trials: why use them, and how to run and report them. BMC Med 2018; 16(1): 29.
12. Baldi I, Azzolina D, Soriani N, et al. Overrunning in clinical trials: some thoughts from a methodological review. Trials 2020; 21: 668.
13. Wason JMS, Brocklehurst P and Yap C. When to keep it simple - adaptive designs are not always useful. BMC Med 2019; 17(1): 152.
14. Mukherjee A, Grayling MJ and Wason JMS. Adaptive Designs: Benefits and Cautions for Neurosurgery Trials. World Neurosurg 2022; 161: 316–322.
15. Granholm A, Lange T, Harhay MO, Perner A, Møller MH and Kaas-Hansen BS. Effects of duration of follow-up and lag in data collection on the performance of adaptive clinical trials - Protocol and statistical analysis plan for a simulation study. OSF Registries 2023. DOI: 10.17605/OSF.IO/4WMC9
16. Morris TP, White IR and Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38(11): 2074–2102.
17. Viele K, Broglio K, McGlothlin A and Saville BR. Comparison of methods for control allocation in multiple arm studies using response adaptive randomization. Clin Trials 2020; 17(1): 52–60.
18. Viele K, Saville BR, McGlothlin A and Broglio K. Comparison of response adaptive randomization features in multiarm clinical trials with control. Pharm Stat 2020; 19(5): 602–612.
19. Granholm A, Lange T, Harhay MO, Perner A, Møller MH and Kaas-Hansen BS. Effects of regularising, sceptical priors on the performance of adaptive clinical trials - Protocol and statistical analysis plan for a simulation study. OSF Registries 2022. DOI: 10.17605/OSF.IO/U56XM
20. Granholm A, Jensen AKG, Lange T and Kaas-Hansen BS. adaptr: an R package for simulating and comparing adaptive clinical trials. J Open Source Softw 2022; 7(72): 4284.
21. Ryan EG, Harrison EM, Pearse RM and Gates S. Perioperative haemodynamic therapy for major gastrointestinal surgery: the effect of a Bayesian approach to interpreting the findings of a randomised controlled trial. BMJ Open 2019; 9(3): e024256.
22. Lambert B. Chapter 9: Conjugate priors. In: Lambert B. A Student’s Guide to Bayesian Statistics. 1st ed. London: SAGE Publications Ltd., 2018: 237–257.
23. FDA. Adaptive Designs for Clinical Trials of Drugs and Biologics - Guidance for Industry. 2019. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/adaptive-design-clinical-trials-drugs-and-biologics-guidance-industry (accessed 1 May 2023).
24. Park JJH, Harari O, Dron L, Lester RT, Thorlund K and Mills EJ. An overview of platform trials with a checklist for clinical readers. J Clin Epidemiol 2020; 125: 1–8.
25. Krag M, Marker S, Perner A, et al. Pantoprazole in Patients at Risk for Gastrointestinal Bleeding in the ICU. N Engl J Med 2018; 379(23): 2199–2208.
26. Schjørring OL, Klitgaard TL, Perner A, et al. Lower or Higher Oxygenation Targets for Acute Hypoxemic Respiratory Failure. N Engl J Med 2021; 384(14): 1301–1311.
27. The COVID STEROID 2 Trial Group. Effect of 12 mg vs 6 mg of Dexamethasone on the Number of Days Alive Without Life Support in Adults With COVID-19 and Severe Hypoxemia. JAMA 2021; 326(18): 1–11.
28. Meyhoff TS, Hjortrup PB, Wetterslev J, et al. Restriction of Intravenous Fluid in ICU Patients with Septic Shock. N Engl J Med 2022; 386(26): 2459–2470.
29. Andersen-Ranberg NC, Poulsen LM, Perner A, et al. Haloperidol for the Treatment of Delirium in ICU Patients. N Engl J Med 2022; 387(26): 2425–2435.
30. Granholm A, Anthon CT, Kjær MN, et al. Patient-Important Outcomes Other Than Mortality in Contemporary ICU Trials: A Scoping Review. Crit Care Med 2022; 50(10): e759–e771.
31. Hutter F, Hoos HH and Leyton-Brown K. Sequential Model-Based Optimization for General Algorithm Configuration. In: Coello Coello CA (ed) Learning and Intelligent Optimization (5th International Conference, LION 5, Rome, Italy, January 17–21, 2011, Selected Papers). Berlin: Springer, 2011: 507–523.
32. Gramacy RB. Chapter 5: Gaussian Process Regression. In: Gramacy RB. Surrogates - Gaussian process modeling, design and optimization for the applied sciences. Boca Raton, Florida: Chapman Hall/CRC, 2020.
33. Wang H, Rosner GL and Goodman SN. Quantifying over-estimation in early stopped clinical trials and the “freezing effect” on subsequent research. Clin Trials 2016; 13(6): 621–631.
34. Liu S and Garrison SR. Overestimation of benefit when clinical trials stop early: a simulation study. Trials 2022; 23(1): 747.
35. Walter SD, Guyatt GH, Bassler D, Briel M, Ramsay T and Han HD. Randomised trials with provision for early stopping for benefit (or harm): The impact on the estimated treatment effect. Stat Med 2019; 38(14): 2524–2543.
36. Wathen JK and Thall PF. A Simulation Study of Outcome Adaptive Randomization in Multi-arm Clinical Trials. Clin Trials 2017; 14(5): 432–440.
37. Korn EL and Freidlin B. Outcome-adaptive randomization: Is it useful? J Clin Oncol 2011; 29(6): 771–776.
38. Thall PF, Fox P and Wathen J. Statistical controversies in clinical research: scientific and ethical problems with adaptive randomization in comparative clinical trials. Ann Oncol 2015; 26(8): 1621–1628.
39. Hey SP and Kimmelman J. Are outcome-adaptive allocation trials ethical? Clin Trials 2015; 12(2): 102–106.
40. Legocki LJ, Meurer WJ, Frederiksen S, et al. Clinical trialist perspectives on the ethics of adaptive clinical trials: a mixed-methods analysis. BMC Med Ethics 2015; 16: 27.