. 2019 Jun 1;21(10):1239–1249. doi: 10.1093/neuonc/noz097

To randomize, or not to randomize, that is the question: using data from prior clinical trials to guide future designs

Alyssa M Vanderbeek 1,2,2, Steffen Ventz 1,2,2, Rifaquat Rahman 3,4,5, Geoffrey Fell 1,2, Timothy F Cloughesy 6, Patrick Y Wen 4, Lorenzo Trippa 1,2,3, Brian M Alexander 1,3,4,3,
PMCID: PMC6784282  PMID: 31155679

Abstract

Background

Understanding the value of randomization is critical in designing clinical trials. Here, we introduce a simple and interpretable quantitative method to compare randomized designs versus single-arm designs using indication-specific parameters derived from the literature. We demonstrate the approach through application to phase II trials in newly diagnosed glioblastoma (ndGBM).

Methods

We abstracted data from prior ndGBM trials and derived relevant parameters to compare phase II randomized controlled trials (RCTs) and single-arm designs within a quantitative framework. Parameters included in our model were (i) the variability of the primary endpoint distributions across studies, (ii) potential for incorrectly specifying the single-arm trial’s benchmark, and (iii) the hypothesized effect size. Strengths and weaknesses of RCT and single-arm designs were quantified by various metrics, including power and false positive error rates.

Results

We applied our method to show that RCTs should be preferred to single-arm trials for evaluating overall survival in ndGBM patients based on parameters estimated from prior trials. More generally, for a given effect size, the utility of randomization compared with single-arm designs is highly dependent on (i) interstudy variability of the outcome distributions and (ii) potential errors in selecting standard of care efficacy estimates for single-arm studies.

Conclusions

A quantitative framework using historical data is useful in understanding the utility of randomization in designing prospective trials. For typical phase II ndGBM trials using overall survival as the primary endpoint, randomization should be preferred over single-arm designs.

Keywords: clinical trial design, glioblastoma, randomization


Key Points.

  1. We generated a quantitative framework using prior trial data to understand the value of randomization for a given indication.

  2. This framework shows the value of randomization in phase II GBM trials.

  3. Single-arm designs are viable for large effect sizes or low endpoint variability.

Importance of the Study.

In this study, we show the importance of randomization in ndGBM clinical trials using a quantitative framework and data from prior clinical trials. Due to significant intertrial variability of the outcome distributions and the propensity to misestimate benchmarks, single-arm designs are prone to false positive or negative results. The use of single-arm designs compared with historical benchmarks in phase II may be a key reason that the GBM clinical trials landscape is characterized by frequent phase III trial failures. Previously, the discussion/debate regarding the value of randomization in phase II trials in GBM has been based on qualitative arguments. Our quantitative framework provides a mechanism to include relevant data to guide the design of future trials. Furthermore, the approach that we describe can be used for clinical trial design in other disease indications.

The choice of endpoint and whether to randomize are critical components of clinical trial design. Historically in oncology trials, single-arm designs have been considered appropriate when using endpoints such as response rate, particularly for monotherapies.1,2 For endpoints with more potential for confounding and selection bias, randomization is often recommended and has been shown to increase the ability of phase II results to accurately predict phase III success.2,3 Still, the use of single-arm designs is prevalent, and the lack of randomization in the phase II setting has been suggested as a major driver of poor go/no-go decision making resulting in phase III failures.3–6

Additionally, the paradigm for therapeutic development is evolving. Traditional lines dividing clinical research and clinical practice have become blurred.7 The idea of using real-world data to replace or complement data from clinical trials has gained traction,8 and advances in trial design have led to the development of basket and platform trials.9–11 Either explicitly or implicitly, much of the discussion/debate regarding these innovations includes an assessment of the value of randomization. This assessment is often limited to general claims and characterizations, without an overall framework to judge the value of randomization in a disease- or indication-specific context.

Here we propose a simple and interpretable method, using historical data from prior clinical trials to estimate key indication-specific parameters, to quantify the value of randomization in phase II studies using a few key metrics. These include potential bias of treatment effect estimates, the risk of false positives, and misleading power summaries that overestimate the ability of non-randomized studies to detect treatment effects. Given our prior work showing a preponderance of non-randomized phase II trials and frequently negative phase III trials for glioblastoma (GBM),4 we use newly diagnosed GBM (ndGBM) as an example to demonstrate the utility of the methodology. Finally, we provide software (R code) to reproduce the proposed approach and results in ndGBM, and to enable investigators to apply the methodology in other disease indications.

Methods

We compared randomized and non-randomized phase II designs while holding total sample size constant, to avoid the objection that randomized trials are necessarily larger. The non-randomized single-arm trial allocates all patients to a single experimental arm, while the randomized design randomly divides the same overall patient sample equally between an experimental and a control arm. We conducted the comparison for both a binary endpoint (eg, response rate or overall survival [OS] at 12 mo) and survival outcomes.

Three Important Parameters in the Choice of a Randomized or Non-Randomized Design

We considered 3 key parameters that we thought would materially impact the relative operating characteristics of both designs in question (see also Table 1): (i) the variability of the endpoint distributions under the control treatment across studies, (ii) the estimation error for the control treatment in single-arm studies, and (iii) the treatment effect of the experimental therapy. In single-arm trials, investigators compare the outcomes of the experimental treatment to a pre-specified threshold. The estimation error (ii) is the difference between this pre-specified threshold in the single-arm study and the average response rate for the control treatment across trials, whereas the treatment effect (iii) is the increment in average response of the experimental treatment compared with the control.

Table 1.

Summary of statistical concepts used throughout this paper

Statistical Concepts Description and Relevance
Variability across studies of the primary endpoint distributions under the standard of care An index (for example, the standard deviation) that quantifies the extent to which the primary outcome distribution of the control arm varies across studies. In our study, primary outcome distributions are either response rates or survival functions.
Zero variability indicates that the primary outcome distributions for the control treatment are identical across studies. This facilitates the specification of a historical benchmark in single-arm studies.
Large variability indicates that the response rates of the control treatment vary significantly across trials. This makes selection of a benchmark in single-arm studies challenging.
Estimation error of the control treatment efficacy The difference between the benchmark selected by a single-arm study and the actual average response rate of the control treatment across trials.
Zero estimation error indicates that the investigators correctly selected the average outcome of the control arm across trials as benchmark for the single-arm study.
Negative (positive) estimation error indicates that the investigators underestimate (overestimate) the average outcome of the control.
Type I error rate and power The probability of a false positive result when the experimental treatment has no positive effect compared with the control, and the probability of a true positive result when the experimental treatment has a positive effect compared with the control.
Receiver operating characteristic (ROC) curve The ROC curve is created by plotting the false positive (type I error) rate against power while varying the evidence threshold used to classify an experimental treatment as effective or not.
An ROC curve close to 1 (0) across all type I error rates indicates uniformly high (low) probability of detecting effective treatments.
AUC The area under the ROC curve is a single index that summarizes the ROC curve.
An AUC value of 1 indicates perfect classification of experimental treatment into ineffective and effective therapies.
An AUC value of 0.5 indicates no better than chance ability to distinguish effective from ineffective therapies.
MSE of the treatment effect estimate The average squared difference between the estimated treatment effect and the true treatment effect.
A single index that summarizes how well the design recovers on average the unknown true treatment benefit of an experimental therapy.
Zero MSE values indicate that the treatment effect is estimated perfectly.
Large MSE indicates large uncertainty about the effectiveness of the experimental therapy.

Learning from Previous Trials in ndGBM

We then estimated the values of these parameters using information from previous trials. Based on a systematic literature review4 and a PubMed query (see Supplementary Figure 1), we identified 7 phase II randomized controlled trials (RCTs) and 5 single-arm trials in ndGBM with identical eligibility criteria (Table 2). All single-arm trials identified for ndGBM used the European Organisation for Research and Treatment of Cancer (EORTC) 22981/2698113 trial as historical benchmark.12 Using the reported results from these trials we first estimated the endpoint variability across trials and then estimated the error in selecting benchmarks using the reported study designs of each single-arm trial (see “Statistical Details”). Once the parameters were estimated and an illustrative effect size was selected, randomized and single-arm designs were compared. Since the estimates are based on limited data, we conducted sensitivity analyses on the estimated parameters.

Table 2.

Summary of designs and results of the RCTs and single-arm trials in ndGBM*

PubMed ID NCT ID Final Enrollment Primary Endpoint Sample Size Control Treatment Type I Error Expected Effect Size Statistical Test OS-12 Rate in Control Arm Median OS, mo, in Control Arm
RCTs 25910950 NCT00441142 Jun 2011 OS 106 TMZ+RT 0.1 0.15 increase in OS-15 Test of medians 0.56 (0.40–0.80) 15.9 (11.0–22.5)
26481741 Jun 2010 OS 99 TMZ+RT 0.05 0.1 decrease in OS-24 Log-rank test 0.60 (0.41–0.75) 13.2 (11.1–18.8)
22120301 2012 OS 34 TMZ+RT 0.05 Wilcoxon rank-sum 0.75 (#) 15.0 (#)
26843484† NCT00589875 2010 OS 182 TMZ+RT Log-rank test 0.67 (#) 13.5 (#)
29126203 NCT01062399 Sep 2013 PFS 171 TMZ+RT 0.15 HR 0.7 Log-rank test 0.78 (**,#) 16.5 (12.5–18.7)
28142059 NCT00190424 Oct 2008 OS-24 81 TMZ+RT 0.05 0.08 increase in OS-24 Log-rank test 0.81 (**,#) 18.0 (#)
21135282 NCT01013285 Nov 2008 OS 180 TMZ+RT 0.05 0.15 increase in OS-18 Weighted log-rank test 0.75 (**,#) 21.1 (18.9–25.2)
Single-arm trials 20564147 NCT00544817 July 2008 PFS 54 TMZ+RT Historical (15758009) HR 0.75
20615924 NCT00262730 Jan 2007 OS 97 TMZ+RT Historical (15758009) 0.1 HR 0.75 Test of medians
21531816 NCT00597402 Sep 2008 OS-16 75 TMZ+RT Historical (15758009) Cox Model
22706484 NCT00805961 Oct 2009 PFS 68 TMZ+RT Historical (15758009) 0.05 HR 0.7
25586468 NCT00458601 Nov 2009 PFS-5.5 65 TMZ+RT Historical (15758009) 0.05 0.2 increase in PFS-5.5 Exact binomial test
Historical Control Trial EORTC 22981/26981 15758009 NCT00006353 March 2002 OS 573 RT 0.05 HR 0.75 Log-rank test 0.61 (0.54–0.68) 14.6 (13.2–16.8)

*Obtained by our systematic literature review (see Supplementary Figure 1 for details). All study populations consisted of ndGBM patients, and enrolled patients with both methylated and unmethylated status of O6-methylguanine-DNA methyltransferase. The 5 single-arm trials used the results reported in the phase III trial EORTC 22981/26981 to specify the standard of care historical response rate. The outcomes (last 2 columns) of the control arm (SOC = temozolomide and radiation) in the RCTs are used in the random effects model described in the text to estimate the SOC’s average response rate and variability across trials. The random effects meta-analysis utilizes studies that completed accrual after January 1, 2005; EORTC 22981/26981 is not included.

**Estimated from digitized OS curve; #confidence interval not provided in publication; † trial with concurrent matched control arm.

Metrics of Comparison

We used the following metrics to compare single-arm and RCT designs (Table 1):

  • (a) the area under the curve (AUC) of the receiver operating characteristic (ROC) curve13;

  • (b) deviations of the type I error rate and power from targeted values; and

  • (c) the mean squared error (MSE) of the treatment effect estimate.

The AUC is an index between 0 and 1 that summarizes a design’s capability to distinguish between experimental therapies with and without positive effects compared with the control. An AUC of 1 indicates perfect classification of therapies into superior and ineffective therapies. Treatment effect estimates from phase II trials are often a key determinant in the design of subsequent confirmatory phase III studies; for instance, they are used to select appropriate phase III sample sizes. The MSE of the treatment effect estimate is an index that summarizes the design’s ability to generate an accurate estimate of the experimental treatment effect. Large MSEs indicate large variability or bias of treatment effect estimates. The Supplementary Material provides details on how we computed these metrics.
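The AUC computation can be illustrated with a short sketch (ours, in Python rather than the paper’s R software): simulate the design’s test statistic under “no effect” and under “effect present,” then apply the rank-based Mann–Whitney identity AUC = P(statistic under effect > statistic under no effect). The function names and the pooled z-test are our assumptions, not the paper’s exact implementation.

```python
import numpy as np

def auc(stat_null, stat_alt):
    """Mann-Whitney AUC estimate from two samples of test statistics;
    ties between the samples are counted as half a win."""
    stat_null = np.sort(np.asarray(stat_null, dtype=float))
    less = np.searchsorted(stat_null, stat_alt, side="left")   # null < alt
    leq = np.searchsorted(stat_null, stat_alt, side="right")   # null <= alt
    wins = less + 0.5 * (leq - less)
    return wins.sum() / (len(stat_null) * len(stat_alt))

def rct_z_stats(delta, n_per_arm=30, p0=0.70, sigma_soc=0.075,
                n_sim=50_000, seed=1):
    """Pooled z-statistics of a randomized OS-12 comparison; the control
    rate varies across simulated trials (parameters from Table 3)."""
    rng = np.random.default_rng(seed)
    p0_i = np.clip(rng.normal(p0, sigma_soc, n_sim), 0.01, 0.99)
    pc = rng.binomial(n_per_arm, p0_i) / n_per_arm
    pe = rng.binomial(n_per_arm, np.clip(p0_i + delta, 0.01, 0.99)) / n_per_arm
    pooled = (pc + pe) / 2
    se = np.sqrt(2 * np.clip(pooled * (1 - pooled), 1e-6, None) / n_per_arm)
    return (pe - pc) / se

# AUC of the 60-patient randomized OS-12 design for a 10% treatment effect
rct_auc = auc(rct_z_stats(0.0, seed=1), rct_z_stats(0.10, seed=2))
```

With these parameters the estimate lands in the vicinity of the reported 0.749 for the randomized OS-12 design; the exact value depends on the test details.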

Candidate Designs

We considered 3 candidate designs for a phase II ndGBM trial: (i) a single-arm trial with a binary endpoint (OS-12), (ii) a randomized trial with a binary endpoint (OS-12), and (iii) a randomized trial using a time-to-event endpoint, ie, OS. All designs used a target type I error rate of 0.1 and a total sample size of 60 patients, reflecting the average sample size of non-randomized studies in GBM.4 Designs with the binary OS-12 endpoint used a one-sided z-test for proportions, and the randomized design with OS used a log-rank test.14 The single-arm design using OS-12 was compared with both the randomized design using OS-12 (Comparison I in Figures 1 and 3) and with the randomized design using OS (Comparison II in Figures 1 and 3).

Fig. 1.

Fig. 1

Operating characteristics of single-arm trial and randomized (RCT) designs with total sample size of 60 patients. The figure illustrates the ROC curves, the corresponding AUC, and types I/II error rates (panels A and B), and the MSE of the treatment effect estimates (panels C and D). The first column (panels A and C) compares single-arm trial and RCT designs with binary OS-12 outcomes (Comparison I). The second column (panels B and D) compares the single-arm trial with OS-12 primary outcome to an RCT design with an OS endpoint. We used an average OS-12 rate of 0.7 and variability of the OS-12 rate across trials (standard deviation 0.075) for SOC, and used the estimation error of the control treatment efficacy (Table 1) of −0.09 in the single-arm trial (Table 2). The vertical and horizontal lines in panels (A) and (B) indicate type I error rates and power of the RCT and single-arm trial designs. In both cases the nominal type I error level is 10%. Panels (C) and (D) illustrate MSEs of the treatment effect estimates for a range of values of SOC variability in the outcome distribution (OS-12 or OS) across trials. We refer the reader to the Supplementary Material for details on the procedure to estimate HR from a single-arm trial.

Fig. 3.

Fig. 3

Difference in AUC between the RCT and single-arm trial over a range of values of the standard of care OS-12 rate variability across trials (x-axis) and treatment effect (y-axis). Blue (gray) shaded areas represent values of variability and effect size at which the RCT design has a higher (lower) AUC than the single-arm trial. Dotted horizontal and vertical lines indicate the estimated variability of the standard of care OS-12 rates across past trials (0.075) and an effect size of 10% improvement in OS-12 (panel A) and hazard ratio of 0.63 (panel B). Green arrows on the y-axis in panel B represent effect sizes used in the design of the RCTs in Table 2. The error in the estimation of the control treatment efficacy does not affect the AUC of the single-arm trial (see Supplementary Material for statistical details). Here, we assumed that in single-arm trials the estimation error of the average SOC efficacy was zero.

Statistical Details

Studies with binary endpoints

Let p0,i denote the control response rate in trial i, p0 the average rate across trials, and σSOC the standard deviation of the rates p0,i across trials. The response rate of the experimental arm in RCT study i is pE,i = p0,i + Δ, where Δ is the treatment effect. The single-arm trial compares pE,i to a pre-specified benchmark value p0,SAT, and β = p0,SAT − p0 is the estimation error.
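The model above can be simulated directly. A minimal Monte Carlo sketch (ours, in Python rather than the paper’s R software; parameter defaults taken from Tables 2–3, normal approximation of the test statistic assumed) estimates the single-arm design’s realized type I error rate and power:

```python
import numpy as np

Z_CRIT = 1.2816  # one-sided normal critical value for alpha = 0.10

def single_arm_rates(delta, p0=0.70, sigma_soc=0.075, beta=-0.09,
                     n=60, n_sim=100_000, seed=0):
    """Fraction of simulated single-arm trials declared positive when the
    true treatment effect is `delta`; delta = 0 gives the realized type I
    error rate, delta > 0 the realized power."""
    rng = np.random.default_rng(seed)
    p0_sat = p0 + beta                                   # benchmark with error beta
    # trial-specific control rate p_{0,i}, varying across trials
    p0_i = np.clip(rng.normal(p0, sigma_soc, n_sim), 0.01, 0.99)
    # observed response rate of the experimental arm under effect delta
    p_hat = rng.binomial(n, np.clip(p0_i + delta, 0.01, 0.99)) / n
    # one-sided z-test against the fixed benchmark p0_SAT
    z = (p_hat - p0_sat) / np.sqrt(p0_sat * (1 - p0_sat) / n)
    return float(np.mean(z > Z_CRIT))

# realized type I error with the estimated parameters (nominal level 0.10)
realized_type1 = single_arm_rates(0.0)
```

With the estimated parameters the realized type I error rate comes out near 0.55 rather than the nominal 0.10; setting beta = 0 and sigma_soc = 0 approximately recovers the nominal level.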

Parameter estimation

The parameters p0 and σSOC are estimated via a random-effect model15 from previous trials. For each trial, we extracted the reported endpoint estimate p0,i and the standard error SE(p^0,i) from publications or by digitizing survival curves.16 We then estimated σSOC and p0 using the DerSimonian–Laird estimator.15 We compared the estimate p0 from the random effect model with the thresholds used in single-arm studies to distinguish presence or absence of improved outcomes.
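A compact sketch of the DerSimonian–Laird estimator (ours, in Python rather than the paper’s R software). The OS-12 point estimates below are the control-arm rates from Table 2; the uniform 0.08 standard errors are hypothetical placeholders, since the abstracted standard errors are not reproduced here.

```python
import math

def dersimonian_laird(estimates, std_errs):
    """Return (pooled mean, between-trial SD) via the DerSimonian-Laird
    random-effects estimator."""
    k = len(estimates)
    w = [1.0 / se**2 for se in std_errs]                       # fixed-effect weights
    mu_fe = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - mu_fe)**2 for wi, yi in zip(w, estimates))  # Cochran's Q
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                         # between-trial variance
    w_re = [1.0 / (se**2 + tau2) for se in std_errs]           # random-effects weights
    mu_re = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    return mu_re, math.sqrt(tau2)

# Control-arm OS-12 rates of the RCTs in Table 2; standard errors are
# hypothetical placeholders for illustration only.
os12_rates = [0.56, 0.60, 0.75, 0.67, 0.78, 0.81, 0.75]
std_errs = [0.08] * 7
p0_bar, sigma_soc = dersimonian_laird(os12_rates, std_errs)
```

Even with these placeholder standard errors, the pooled mean lands near the reported 0.70, with a nonzero between-trial standard deviation.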

Results

Parameter Estimates

Table 3 summarizes the parameter estimates obtained through analysis of prior GBM trials that we used to compare phase II randomized and single-arm designs. The estimated average OS-12 rate of the standard of care (SOC) and its variability across trials were 0.70 (90% CI: 0.63 to 0.76) and 0.075, respectively. The reported OS-12 rate in EORTC 22981/2698112 (the reference study for all single-arm trials in ndGBM that we considered) was 0.61, which indicates potential underestimation of the SOC’s efficacy across the single-arm trials in Table 2 by approximately 9%. Table 2 shows a similar tendency toward underestimation of the SOC’s efficacy in single-arm trials for other endpoints (OS and progression-free survival [PFS]).

Table 3.

Parameter estimates derived from prior clinical trials in ndGBM

Average Response to Standard of Care Across Trials (90% CI) Standard Deviation of the Standard of Care Response Rate (or median) Across Trials Average Efficacy Threshold in Single-Arm Trials Estimation Error Treatment Effect
Estimation Random effect model Random effect model Reported standard of care outcomes in EORTC 22981/26981 Column (3)–column (1) Reported treatment effect in EORTC 22981/26981
OS-12 rate 0.70 (0.63, 0.76) 0.075 0.61 −0.09 0.10*
PFS-6 rate 0.64 (0.58, 0.68) 0.043 0.54 −0.10 0.18*
Median OS (mo) 16.7 (14.4, 19.1) 3.48 14.6 −2.18 0.63**
Median PFS (mo) 8.3 (7.1, 9.5) 1.36 6.9 −1.42 0.54**

Note. The estimated mean response rate for the standard of care (column 1) and the variability of the standard of care response rates across trials (column 2) are estimated using random effect meta-analysis. All single-arm trials use EORTC/NCIC CE.3 as historical control. For binary endpoints, we consider the average probability of OS at 12 months of treatment (OS-12) and average PFS at 6 months of treatment (PFS-6). For time-to-event endpoints, we consider the average SOC median (OS or PFS) across trials. The treatment effects that we use in our comparisons of single-arm and RCT designs match the treatment effect estimates from the EORTC/NCIC CE.3 study (column 5).

*Difference between (OS-12 and PFS-6) response rate of the experimental and control arm.

**Hazard ratio between the experimental and control arm.

Comparing Single-Arm and RCT Designs

We compared single-arm and randomized designs based on metrics (a)–(c), using the estimated parameters presented in Table 3 and an overall sample size of 60 patients. For the comparison based on OS-12, the treatment effect in our scenario matches a 10% improvement of OS at 12 months from enrollment (the reported treatment effect in EORTC 22981/2698112). This difference corresponds to an OS hazard ratio (HR) of 0.63.12
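As a quick consistency check (our arithmetic, not from the paper’s software, assuming proportional hazards), the experimental arm’s 12-month survival satisfies S_E(12) = S_C(12) ** HR, so an HR of 0.63 applied to the estimated SOC OS-12 rate of 0.70 does correspond to roughly a 10% absolute OS-12 gain:

```python
# Proportional-hazards relation S_E(t) = S_C(t) ** HR
hr = 0.63
s_control_12 = 0.70                       # estimated average SOC OS-12 rate (Table 3)
s_experimental_12 = s_control_12 ** hr    # implied experimental OS-12 rate
absolute_gain = s_experimental_12 - s_control_12
# absolute_gain is close to 0.10, matching the 10% OS-12 improvement
```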

Panels (A) and (B) of Figure 1 illustrate ROC curves and corresponding AUC for each design. The single-arm trial (black curve) performs slightly better on the AUC metric than the randomized design (red curve) when OS-12 is used in both designs (AUCs of 0.787 and 0.749, respectively). But the randomized design using OS (blue curve) performs significantly better than both OS-12 designs (AUC of 0.897).

The vertical and horizontal dotted lines in panels (A) and (B) of Figure 1 indicate the type I error rate (vertical line) and power (horizontal lines) of the single-arm trial. If the OS-12 rate of the SOC did not vary across trials and was correctly specified in the single-arm trial, then the single-arm trial would have 10% type I error and 66% power. Instead, the single-arm trial inflated the targeted type I error to 55% (panel A). Notably, if the average survival outcome of the SOC across trials was correctly specified, the variability of the SOC survival outcomes across ndGBM trials (0.075) alone inflates the type I error to 21% (Figure 2). The randomized design using OS-12 controls the type I error rate at 0.1 with low power of 34% (Figure 1A), and the RCT testing OS controls the type I error rate at 0.1 with 70% power (Figure 1B).
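These three numbers can be approximately reproduced with a normal approximation (our sketch, in Python rather than the paper’s R software): the single-arm estimate p̂ is roughly normal with mean p0 and variance σ²SOC + p0(1 − p0)/n, and it is compared against the rejection threshold implied by the (possibly misspecified) benchmark.

```python
import math

def single_arm_type1_approx(sigma_soc, beta, p0=0.70, n=60):
    """Normal-approximation type I error of the single-arm design: the
    benchmark is p0 + beta, and the realized null rate varies across
    trials with standard deviation sigma_soc (parameters from Table 3)."""
    z_crit = 1.2816                       # one-sided critical value, alpha = 0.10
    p0_sat = p0 + beta
    threshold = p0_sat + z_crit * math.sqrt(p0_sat * (1 - p0_sat) / n)
    sd = math.sqrt(sigma_soc**2 + p0 * (1 - p0) / n)
    # P(p_hat > threshold) with p_hat ~ N(p0, sd^2)
    return 0.5 * math.erfc((threshold - p0) / (sd * math.sqrt(2)))

rates = [single_arm_type1_approx(0.0, 0.0),      # nominal: ~0.10
         single_arm_type1_approx(0.075, 0.0),    # variability alone: ~0.21
         single_arm_type1_approx(0.075, -0.09)]  # plus benchmark error: ~0.54
```

The approximation recovers the 10%, 21%, and roughly 55% figures above; small residual gaps reflect approximation details, but the qualitative picture of a fivefold inflation is the same.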

Fig. 2.

Fig. 2

Types I and II error rates of the single-arm trial with overall sample size of 60 patients across a range of values for the estimation error of the control treatment efficacy (panel A) and variability in the standard of care OS-12 response rates across trials (panel B). The targeted type I error rate for the single-arm trial design is 0.1. The variability of the OS-12 response rates across trials is fixed at 0.075 in panel A. Solid (dotted) lines in panel B correspond to an OS-12 estimation error of the control treatment efficacy of −0.09 (no estimation error). The type I and type II error rates at the estimated level of OS-12 variability (0.075), with an estimation error of the control treatment efficacy of −0.09 and an improvement in OS-12 of 10%, are 0.56 (black point) and <0.1 (pink point), respectively.

Panels (C) and (D) of Figure 1 show the relationship between the variability of the SOC outcome across trials and the precision of the estimated treatment effect. In both comparisons (OS-12 and OS), the randomized design provides more accurate treatment effect estimates than the single-arm trial. In particular, with the estimated level of variability (0.075) for the binary OS-12 endpoint (panel C), the single-arm trial has higher MSE (0.016 vs 0.012). Similarly, using OS as primary endpoint (panel D), the single-arm design has higher MSE than the randomized design (0.086 vs 0.032). Additionally, for OS-12 endpoints, the randomized design’s MSE decreases to zero as the sample size increases. In contrast, even with a large sample size the single-arm trial would not accurately estimate the treatment effect. For example, Supplementary Figure 3 illustrates accuracy of single-arm trials and randomized designs with 150 patients (the average sample size of the RCTs in Table 2).

Figure 3 illustrates variations of the AUC over a range of values for the treatment effect and the variability of the SOC outcome distribution across trials. The blue shaded area in Figure 3 favors the randomized design, whereas the gray shaded area favors the single-arm design. For example, in the first comparison with OS-12 endpoints, the single-arm trial is superior to the randomized design only for combinations of SOC standard deviation and treatment effect below 0.09 and 0.4, respectively (panel A).

We can summarize the results of our analyses as follows:

  • (a) Based on the estimated parameters of Table 3, we hypothesize a tendency to underestimate the SOC efficacy in the ndGBM single-arm trials that we considered.

  • (b) Single-arm studies in ndGBM in our analyses are associated with inflated type I error rates and biased treatment effect estimates.

  • (c) Phase II randomized designs in ndGBM with a time-to-event endpoint appropriately control the type I error rate, and discriminate better between experimental therapies with and without positive effects compared with randomized designs with binary OS-12 endpoints or single-arm studies.

  • (d) Randomized designs in phase II ndGBM would tend to provide more accurate treatment effect estimates than single-arm trials.

Discussion

Randomization has been used for decades in clinical trials to limit selection bias and confounding,17 and more generally for rigorous testing.18 The utility of randomization as applied in phase II clinical trials has been debated extensively,1,3,19–22 and this discussion has been reinvigorated by the increasing attention on the use of “external” or “synthetic” control arms for clinical development. Recommendations for randomization have been proposed qualitatively based on endpoints and experimental regimens2,20 or based on evidence of prior decision-making failures for a given indication.3,4,20,22 The results of qualitative analyses are often summarized as a set of generic guidelines to aid in clinical trial design choices.2 But better choices may be made with a more quantitative and analytic approach, based on previous clinical trials, that translates into informed decisions by biostatisticians and clinicians in the design of future studies. Furthermore, indication-specific debates using data from past RCTs and single-arm studies might offer different scenarios and support opposite assertions,20,23 all of which may be logically consistent with a common overall framework. Identifying that general analytical framework could potentially inform or reduce the landscape of debate.

The use of randomization is indeed debated in neuro-oncology,23 where the clinical trials landscape is characterized by many single-arm trials using PFS or OS as endpoint.4 “Promising” results from such trials24–26 often do not translate into positive phase III trials, however, as seen in recent cases with cilengitide27 and rindopepimut.28 Using data from prior clinical trials for patients with ndGBM, we showed that randomization combined with time-to-event endpoints has clear benefit at a sample size commensurate with typical phase II studies. While single-arm trials performed slightly better than RCTs in terms of AUC when a binary endpoint was used, randomized designs were clearly superior to single-arm trials when OS was used to estimate the treatment effect. This is not a surprising result: a binomial proportion has been shown to be inefficient compared with utilizing the entire survival distribution.19 Randomized designs are superior to single-arm trials that compare against historical control threshold values, both because of the variability of the outcome distribution under SOC and because of the more powerful analyses that use time-to-event data. It is important to note that our analyses kept the overall sample size constant when comparing designs in order to allay concerns that randomization necessarily means larger sample sizes. There is, however, still the issue that patients may be less willing to enroll in trials with a control arm. Master protocols with a common control9,29 and possibly the incorporation of external control arms may help address this issue.

While our study compared randomized designs and the single-arm design for phase II studies, we did not compare randomized designs to single-arm trials with external controls, which use patient-level data from prior clinical trials or real-world datasets. Such novel single-arm designs may substantially improve operating characteristics. If major confounding factors, due to variations of patient populations across trials, can be appropriately accounted for in rigorous statistical analyses, externally controlled single-arm studies may provide inference on treatment effects as robust as randomized designs and would directly address the limitations in single-arm designs highlighted here. Their potential advantages depend on the availability of external patient-level data, the possibility of accurate adjustments that account for confounders, and the harmonization of measurements on patient characteristics and outcomes across trials. Similarly, we did not consider Bayesian randomized designs. While Bayesian randomized designs have been shown to improve efficiency over conventional RCTs in the multi-arm setting,9,30,31 efficiency gains in the two-arm setting have been shown to be modest.

Errors in estimating the benchmark value for the control treatment can significantly impact the performance of single-arm trials, while randomized designs are robust to such errors. In a systematic review of phase II studies in oncology, 70 of 134 studies required historical data to specify a null hypothesis value, but only 38 cited specific historical data sources and only 9 explicitly stated the value.32 Trials that failed to cite were 49% more likely to declare a positive result.32 Incorrectly specifying the null hypothesis can also reduce the possibility of finding a treatment effect when one exists.22 Even small amounts of bias in estimating the historical control rate for binary endpoints can significantly inflate false positive rates,6,20,22,33,34 an effect that is more pronounced when sample sizes are increased.6,20 In essence, larger sample sizes in single-arm studies provide more definitive results from an inaccurate comparison, an error that is not made in randomized designs where the control arm is included. Though the effects of this estimation error do not impact either design’s AUC metric, we see the impact in the MSE and false positive rates. Consistent with the previously referenced studies, we found that misspecification of the historical control significantly increases the error rates in therapeutic development decision making. For endpoints where the potential for biased estimates is high, single-arm trials using historical benchmarks should be replaced with randomized designs or more rigorous external controls.

Randomization intuitively has less value when the outcome under the SOC is easy to predict and the hypothesized effect size is large relative to that uncertainty. The paucity of randomized trials for parachute use35 has been cited as an extreme example. In such an example, the intervention (parachutes) has a marked “therapeutic benefit,” without which the outcome under the alternative (no parachute) is highly predictable. Both the large signal and low uncertainty in the comparison arm favor a non-randomized study. Conversely, investigations examining interventions with smaller expected effect sizes and where the outcome under the control is less predictable might favor randomization, as we showed in this study (Figure 4). Another notable example may be high-dose chemotherapy followed by autologous hematopoietic stem cell transplantation for breast cancer, where initial enthusiasm for the procedure based on non-randomized data was extinguished upon the publication of negative findings from RCTs.36 Our study builds upon prior work33,34,37,38 to codify this intuition in a quantitative framework that can be applied to other contexts to use available data to make more rational choices.

Fig. 4. Conceptual framework for randomization. Whether to randomize or not is dependent on uncertainty in the standard of care outcome distribution and the expected effect size. For large effect sizes and little outcome uncertainty under control conditions, randomization may not be preferred (parachute example). However, for more modest expected effect sizes and higher uncertainty levels, randomization may be preferred (ndGBM use case from this paper).
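The Fig. 4 intuition can be phrased probabilistically: if a single-arm benchmark is drawn with interstudy standard deviation tau around the true SOC value, the chance that benchmark error alone exceeds the hypothesized effect delta is a rough gauge of how risky forgoing randomization is. A minimal sketch, with tau and delta values invented purely for illustration:

```python
# Hypothetical sketch of the Fig. 4 axes: assume the single-arm
# benchmark error is roughly N(0, tau^2). If |error| frequently
# exceeds the hypothesized effect delta, a single-arm result is
# uninterpretable and randomization is warranted.
from scipy.stats import norm

def p_benchmark_error_exceeds_effect(tau, delta):
    """Two-sided probability that benchmark error swamps the effect."""
    return 2 * (1 - norm.cdf(delta / tau))

scenarios = [(0.02, 0.50, "parachute-like: huge effect, stable control"),
             (0.15, 0.10, "ndGBM-like: modest effect, variable control")]
for tau, delta, label in scenarios:
    p = p_benchmark_error_exceeds_effect(tau, delta)
    print(f"{label}: P(|error| > delta) ~ {p:.2f}")
```

In the first regime the benchmark error is negligible relative to the signal; in the second it can easily mimic or mask the entire hypothesized effect.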

Some of the elements of our framework have been discussed by others for binary endpoints. Pond et al evaluated effects of phase II designs for binary endpoints on subsequent phase III trials through simulations and found randomization valuable when there is high interstudy variability or a tendency to underestimate the control response.33 This study also showed that the proportion of true active agents in phase II had a relevant impact on phase III results,33 a factor pointed out by others as well.1 Sambucini used a Bayesian framework to show that RCTs performed better than standard single-arm trials for binary variables when there was considerable uncertainty in the control rate,37 and Thall and Simon showed that variability in historical control estimates for binary endpoints could lead to erroneous conclusions at the end of the trial.34 Taylor et al simulated designs using a binary endpoint and found that the RCT tended to perform better with substantial variations in historical controls and for larger sample sizes.38 Liu et al39 and Redman and Crowley40 argued against phase II RCTs, since these studies often tend to be underpowered. Other discussions on the value of randomization in phase II studies include work by Rubenstein et al,41 Wieand,42 and Booth et al.43

It is frequently stated that randomized trials require larger sample sizes than single-arm trials.20,22 However, such sample size comparisons frequently ignore the value of the information each design produces.20 In the present study, we held the overall sample size fixed to address this argument and to reflect the typical decision problem facing medical investigators: “Based on feasibility constraints, with enrollment of up to N patients, what are the major pros and cons of a single-arm versus a randomized design?” Even when resources are constrained, our framework may still suggest that dividing the available population into a control and an experimental arm through randomization has value, based on the factors we described. In such cases, enrolling all patients on a single-arm trial with the intent of “signal finding” only makes that signal harder to find or predisposes the trial to false positive conclusions. Although randomized designs with small sample sizes may be prone to high false negative rates or treatment effect estimates with wide confidence intervals, they may still be preferable to single-arm trials in their ability to prevent false positives.
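The fixed-N trade-off can be made concrete with a small Monte Carlo experiment. This is our own sketch with a binary endpoint, not the paper’s ndGBM survival analysis; the rates, bias, and N below are hypothetical.

```python
# Hypothetical sketch: with a fixed total enrollment N, compare
# (a) a single-arm trial of all N patients tested against a possibly
# misspecified historical benchmark with (b) a 1:1 RCT of N/2 per arm.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def run_trials(N, p_ctrl, p_trt, benchmark, reps=20000, alpha=0.05):
    """Return (single-arm, RCT) rejection rates over `reps` trials."""
    z = norm.ppf(1 - alpha)
    # Single-arm: all N patients receive the experimental therapy.
    x = rng.binomial(N, p_trt, reps) / N
    se = np.sqrt(benchmark * (1 - benchmark) / N)
    single = np.mean((x - benchmark) / se > z)
    # RCT: N/2 per arm, two-sample z-test on proportions.
    n = N // 2
    xt = rng.binomial(n, p_trt, reps) / n
    xc = rng.binomial(n, p_ctrl, reps) / n
    se2 = np.sqrt(xt * (1 - xt) / n + xc * (1 - xc) / n)
    rct = np.mean((xt - xc) / np.maximum(se2, 1e-9) > z)
    return single, rct

# Null scenario (no effect) with a benchmark biased 5 points low:
print("FPR (single, RCT):", run_trials(100, 0.30, 0.30, benchmark=0.25))
# Alternative scenario (true 15-point gain), same biased benchmark:
print("Power (single, RCT):", run_trials(100, 0.30, 0.45, benchmark=0.25))
```

The single-arm design looks more “powerful” only because its biased benchmark also inflates its false positive rate far above alpha; the RCT pays for its honesty with wider intervals but keeps the null scenario under control.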

Alternatives to fully randomized designs that address the effects of intertrial variability on bias have also been proposed. Thall and Simon proposed a method incorporating a randomization element that decreases as the amount of historical data increases or as intertrial variability decreases.34 Historical estimation error due to either changes in care or differences in known prognostic factors across studies could be controlled for using contemporary data and multivariate models for statistical adjustment. Korn et al used prior data from phase II cooperative group trials to develop more robust historical comparisons for stage IV melanoma by controlling for known prognostic factors.44 An approach similar to the one we developed here could be applied to understand the relative value of such innovations.
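One standard way to implement such an adjustment is model-based standardization: fit an outcome model on patient-level historical control data, then average its predictions over the covariate mix of the planned trial’s population. The sketch below uses synthetic data with a single invented prognostic factor (standardized age); it illustrates the general technique, not a method prescribed by this paper or by Korn et al.

```python
# Illustrative sketch of a covariate-adjusted historical benchmark
# (standardization / g-computation) on synthetic data. A naive
# benchmark (raw historical response rate) is biased when the planned
# trial enrolls a population with a different covariate mix.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Historical controls: standardized age is prognostic (older -> worse).
n_hist = 2000
age_hist = rng.normal(0.0, 1.0, n_hist)
p_hist = 1 / (1 + np.exp(-(-0.8 - 0.7 * age_hist)))
y_hist = rng.binomial(1, p_hist)

# Outcome model fit on the historical patient-level data:
model = LogisticRegression().fit(age_hist.reshape(-1, 1), y_hist)

# The planned trial enrolls a younger (healthier) population:
age_new = rng.normal(-0.5, 1.0, 500)

naive_benchmark = y_hist.mean()  # ignores the covariate shift
adjusted_benchmark = model.predict_proba(
    age_new.reshape(-1, 1))[:, 1].mean()  # standardized to new mix

print(f"naive: {naive_benchmark:.3f}  adjusted: {adjusted_benchmark:.3f}")
```

Here the adjusted benchmark is higher than the naive one, reflecting the healthier planned population; a single-arm trial using the naive value would be tilted toward a false positive.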

Conclusions

Understanding the possible context-specific benefit of randomization is important both for traditional therapeutic development and for future innovations in clinical evidence generation. Clinical trial design choices are often debated with qualitative arguments, and a solution suited to one context may not apply to another. Here, we showed that a quantitative framework with parameters estimated from prior clinical trials allows a more rational comparison. We applied this framework to phase II ndGBM trials to show that employing more randomization could lead to better designs. The framework can be applied to other diseases to aid clinical trial design choices, including the evaluation of future design innovations.

Funding

This work was supported by a Burroughs Wellcome Innovations in Regulatory Science Award.

Conflict of interest statement.

BMA reports employment with Foundation Medicine, Inc; personal fees from BMS, Precision Health Economics, Schlesinger Associates, and Abbvie; and research grants from Puma, Eli Lilly, and Celgene. PYW reports grants, personal fees, and non-financial support from Agios and Novartis; non-financial support from Angiochem, GlaxoSmithKline, ImmunoCellular Therapeutics, VBI Vaccines, and Karyopharm; personal fees and non-financial support from AstraZeneca, Genentech/Roche, and Vascular Biogenics; grants and non-financial support from Merck; non-financial support from Oncoceutics and Sanofi Aventis; and personal fees from Cavion, INSYS Therapeutics, Monteris, Novogen, Regeneron Pharmaceuticals, and Tocagen. TFC reports consulting fees from Roche/Genentech, VBL, Merck, BMS, Pfizer, Agios, Novogen, Boston Biomedical, MedQIA, Tocagen, Cortice Biosciences, Novocure, NewGen, Oxigene, Wellcome Trust, Sunovion Pharmaceuticals, Abbvie, Celgene, and Lilly, and reports equity in Notable Labs.

Authorship statement.

AMV, SV, RR, LT, and BMA made substantial contributions to the conception and design of the work. All authors contributed to the acquisition, analysis, and interpretation of data for the work, drafting the work and revising it critically for important intellectual content, gave final approval of the submitted version, and agree to be accountable for all aspects of the work.

Supplementary Material

noz097_suppl_Supplementary_Data
noz097_suppl_Supplementary_Material

References

1. Rubinstein L, Leblanc M, Smith MA. More randomization in phase II trials: necessary but not sufficient. J Natl Cancer Inst. 2011;103(14):1075–1077.
2. Seymour L, Ivy SP, Sargent D, et al. The design of phase II clinical trials testing cancer therapeutics: consensus recommendations from the clinical trial design task force of the national cancer institute investigational drug steering committee. Clin Cancer Res. 2010;16(6):1764–1769.
3. Sharma MR, Stadler WM, Ratain MJ. Randomized phase II trials: a long-term investment with promising returns. J Natl Cancer Inst. 2011;103(14):1093–1100.
4. Vanderbeek AM, Rahman R, Fell G, et al. The clinical trials landscape for glioblastoma: is it adequate to develop new treatments? Neuro Oncol. 2018;20(8):1034–1043.
5. Sharma MR, Karrison TG, Jin Y, et al. Resampling phase III data to assess phase II trial designs and endpoints. Clin Cancer Res. 2012;18(8):2309–2315.
6. Tang H, Foster NR, Grothey A, Ansell SM, Goldberg RM, Sargent DJ. Comparison of error rates in single-arm versus randomized phase II cancer clinical trials. J Clin Oncol. 2010;28(11):1936–1941.
7. Berry DA. The brave new world of clinical cancer research: adaptive biomarker-driven trials integrating clinical practice with clinical research. Mol Oncol. 2015;9(5):951–959.
8. Sherman RE, Anderson SA, Dal Pan GJ, et al. Real-world evidence—what is it and what can it tell us? N Engl J Med. 2016;375(23):2293–2297.
9. Alexander BM, Trippa L, Gaffey S, et al. Individualized screening trial of innovative glioblastoma therapy (INSIGhT): a Bayesian adaptive platform trial to develop precision medicines for patients with glioblastoma. JCO Precis Oncol. 2019;(3):1–13.
10. Ventz S, Alexander BM, Parmigiani G, Gelber RD, Trippa L. Designing clinical trials that accept new arms: an example in metastatic breast cancer. J Clin Oncol. 2017;35(27):3160–3168.
11. Woodcock J, LaVange LM. Master protocols to study multiple therapies, multiple diseases, or both. N Engl J Med. 2017;377(1):62–70.
12. Stupp R, Mason WP, van den Bent MJ, et al. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med. 2005;352(10):987–996.
13. Akobeng AK. Understanding diagnostic tests 3: receiver operating characteristic curves. Acta Paediatr. 2007;96(5):644–647.
14. Bland JM, Altman DG. The logrank test. BMJ. 2004;328(7447):1073.
15. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7(3):177–188.
16. Guyot P, Ades AE, Ouwens MJ, Welton NJ. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Med Res Methodol. 2012;12:9.
17. Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. Br Med J. 1948;2(4582):769–782.
18. Armitage P. Fisher, Bradford Hill, and randomization. Int J Epidemiol. 2003;32(6):925–928.
19. Rubinstein L, Crowley J, Ivy P, Leblanc M, Sargent D. Randomized phase II designs. Clin Cancer Res. 2009;15(6):1883–1890.
20. Gan HK, Grothey A, Pond GR, Moore MJ, Siu LL, Sargent D. Randomized phase II trials: inevitable or inadvisable? J Clin Oncol. 2010;28(15):2641–2647.
21. Ratain MJ, Sargent DJ. Optimising the design of phase II oncology trials: the importance of randomisation. Eur J Cancer. 2009;45(2):275–280.
22. Hunsberger S, Zhao Y, Simon R. A comparison of phase II study strategies. Clin Cancer Res. 2009;15(19):5950–5955.
23. Grossman SA, Schreck KC, Ballman K, Alexander B. Point/counterpoint: randomized versus single-arm phase II clinical trials for patients with newly diagnosed glioblastoma. Neuro Oncol. 2017;19(4):469–474.
24. Nabors LB, Mikkelsen T, Hegi ME, et al. A safety run-in and randomized phase 2 study of cilengitide combined with chemoradiation for newly diagnosed glioblastoma (NABTT 0306). Cancer. 2012;118(22):5601–5607.
25. Stupp R, Hegi ME, Neyns B, et al. Phase I/IIa study of cilengitide and temozolomide with concomitant radiotherapy followed by cilengitide and temozolomide maintenance therapy in patients with newly diagnosed glioblastoma. J Clin Oncol. 2010;28(16):2712–2718.
26. Schuster J, Lai RK, Recht LD, et al. A phase II, multicenter trial of rindopepimut (CDX-110) in newly diagnosed glioblastoma: the ACT III study. Neuro Oncol. 2015;17(6):854–861.
27. Stupp R, Hegi ME, Gorlia T, et al. Cilengitide combined with standard treatment for patients with newly diagnosed glioblastoma with methylated MGMT promoter (CENTRIC EORTC 26071-22072 study): a multicentre, randomised, open-label, phase 3 trial. Lancet Oncol. 2014;15(10):1100–1108.
28. Weller M, Butowski N, Tran DD, et al. Rindopepimut with temozolomide for patients with newly diagnosed, EGFRvIII-expressing glioblastoma (ACT IV): a randomised, double-blind, international phase 3 trial. Lancet Oncol. 2017;18(10):1373–1385.
29. Alexander BM, Ba S, Berger MS, et al. Adaptive global innovative learning environment for glioblastoma: GBM AGILE. Clin Cancer Res. 2017. doi:10.1158/1078-0432.CCR-17-0764.
30. Wason JM, Trippa L. A comparison of Bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials. Stat Med. 2014;33(13):2206–2221.
31. Trippa L, Lee EQ, Wen PY, et al. Bayesian adaptive randomized trial design for patients with recurrent glioblastoma. J Clin Oncol. 2012;30(26):3258–3263.
32. Vickers AJ, Ballen V, Scher HI. Setting the bar in phase II trials: the use of historical data for determining “go/no go” decision for definitive phase III testing. Clin Cancer Res. 2007;13(3):972–976.
33. Pond GR, Abbasi S. Quantitative evaluation of single-arm versus randomized phase II cancer clinical trials. Clin Trials. 2011;8(3):260–269.
34. Thall PF, Simon R. Incorporating historical control data in planning phase II clinical trials. Stat Med. 1990;9(3):215–228.
35. Smith GC, Pell JP. Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials. BMJ. 2003;327(7429):1459–1461.
36. Howard DH, Kenline C, Lazarus HM, et al. Abandonment of high-dose chemotherapy/hematopoietic cell transplants for breast cancer following negative trial results. Health Serv Res. 2011;46(6pt1):1762–1777.
37. Sambucini V. Comparison of single-arm vs. randomized phase II clinical trials: a Bayesian approach. J Biopharm Stat. 2015;25(3):474–489.
38. Taylor JM, Braun TM, Li Z. Comparing an experimental agent to a standard agent: relative merits of a one-arm or randomized two-arm phase II design. Clin Trials. 2006;3(4):335–348.
39. Liu PY, LeBlanc M, Desai M. False positive rates of randomized phase II designs. Control Clin Trials. 1999;20(4):343–352.
40. Redman M, Crowley J. Small randomized trials. J Thorac Oncol. 2007;2(1):1–2.
41. Rubinstein LV, Korn EL, Freidlin B, Hunsberger S, Ivy SP, Smith MA. Design issues of randomized phase II trials and a proposal for phase II screening trials. J Clin Oncol. 2005;23(28):7199–7206.
42. Wieand HS. Randomized phase II trials: what does randomization gain? J Clin Oncol. 2005;23(9):1794–1795.
43. Booth CM, Cescon DW, Wang L, Tannock IF, Krzyzanowska MK. Evolution of the randomized controlled trial in oncology over three decades. J Clin Oncol. 2008;26(33):5458–5464.
44. Korn EL, Liu PY, Lee SJ, et al. Meta-analysis of phase II cooperative group trials in metastatic stage IV melanoma to determine progression-free and overall survival benchmarks for future phase II trials. J Clin Oncol. 2008;26(4):527–534.


Articles from Neuro-Oncology are provided here courtesy of Society for Neuro-Oncology and Oxford University Press