A Simulation Study of Outcome Adaptive Randomization in Multi-arm Clinical Trials

J Kyle Wathen; Peter F Thall

doi:10.1177/1740774517692302

Author manuscript; available in PMC: 2018 Oct 1.

Published in final edited form as: Clin Trials. 2017 Feb 1;14(5):432–440. doi: 10.1177/1740774517692302

PMCID: PMC5634533 NIHMSID: NIHMS833409 PMID: 28982263

Abstract

Background

Randomizing patients among treatments with equal probabilities in clinical trials is the established method to obtain unbiased comparisons. In recent years, motivated by ethical considerations, many authors have proposed outcome adaptive randomization, wherein the randomization probabilities are unbalanced, based on interim data, to favor treatment arms having more favorable outcomes. While there has been substantial controversy regarding the merits and flaws of adaptive versus equal randomization, there has not yet been a systematic simulation study in the multi-arm setting.

Methods

A simulation study was conducted to evaluate four different Bayesian adaptive randomization methods and compare them to equal randomization in five-arm clinical trials. All adaptive randomization methods included an initial burn-in with equal randomization and some combination of other modifications to avoid extreme randomization probabilities. Trials either with or without a control arm were evaluated, using designs that may terminate arms early for futility and select one or more experimental treatments at the end. The designs were evaluated under a range of scenarios and sample sizes.

Results

For trials with a control arm and maximum sample size 250 or 500, several commonly used adaptive randomization methods have very low probabilities of correctly selecting a truly superior treatment. Of those studied, the only adaptive randomization method with desirable properties has a burn-in with equal randomization and thereafter randomization probabilities restricted to the interval .10 to .90. Compared to equal randomization, this method has a favorable sample size imbalance but lower probability of correctly selecting a superior treatment.

Conclusions

In multi-arm trials, compared to equal randomization, several commonly used adaptive randomization methods give much lower probabilities of selecting superior treatments. Aside from randomization method, conducting a multi-arm trial without a control arm may lead to very low probabilities of selecting any superior treatments if differences between the treatment success probabilities are small.

Keywords: Adaptive randomization, Bayesian design, play the winner, screening trial, simulation

1 Introduction

Outcome adaptive randomization (AR) has been proposed by many authors as an alternative to equal randomization (ER), for comparing treatments A and B. AR uses the interim outcome data to unbalance randomization probabilities in favor of the treatment arm, or arms, having currently higher empirical success rates. Proponents of AR consider it more ethical than ER for the patients enrolled in the trial because AR leads to sample sizes, N_A and N_B, on average unbalanced in favor of the truly superior treatment. AR was proposed by Thompson [1] for binary outcomes. He suggested that, assuming success probabilities π_A and π_B following beta priors, the next patient should receive treatment A with probability r_{A,n} = Pr(π_B < π_A | data_n) and B with probability r_{B,n} = 1 − r_{A,n}. Adaptive statistical criteria used to define AR probabilities similar to r_{A,n} and r_{B,n} sometimes are called “randomized play-the-winner” rules [2], [3]. Many different AR methods have been proposed ([4]–[7]), and clinical trials have been conducted using various AR methods ([8]–[10]).

Use of AR in clinical trials remains controversial. Critics argue that AR provides a small advantage in sample size imbalance in favor of the superior treatments, while introducing inferential problems that decrease benefit to future patients. Discussions of AR have been given by Chappell and Karrison [11], Korn and Freidlin [12], Yuan and Yin [13], Lee, Chen and Yin [14], Rosenberger, Sverdlov and Hu [15], Buyse [16], Lee [17], and Hey and Kimmelman [18]. Berry [19] has argued that the greatest advantages of AR over ER may be obtained in multi-arm trials. Thall, Fox, and Wathen [20] reported a simulation study, for two-arm trials, comparing several Bayesian AR methods to a group sequential design using ER. Their simulations showed that, compared to ER, AR methods often have a much lower probability of selecting a truly superior treatment arm, much larger estimation bias, produce distributions of N_A and N_B with much greater variability and skewness, and have a nontrivial probability of unbalancing N_A and N_B in favor of the inferior treatment. Thus, only reporting mean sample sizes from simulations may be very misleading. The particular way an AR method is defined, and other aspects of a trial design, can greatly affect overall design performance. Because there are numerous ways to design a randomized trial, and many different ways to define AR methods, statements about the comparative desirability of AR versus ER must be accompanied by detailed explanations of these design specifics.

In this paper, we report a simulation study examining four AR methods and ER in multi-arm clinical trials. A multi-arm trial design may or may not (1) include a control arm, (2) restrict the randomization to a control arm if it is included, (3) involve various rules for between-arm comparisons or stopping an arm early, (4) enrich the remaining arms with larger sample sizes when some arms are terminated early, (5) select one best or possibly several experimental treatments, and (6) include two or more than two stages, or monitor continuously. Thus, to obtain reasonable comparisons of randomization methods, the underlying designs must have qualitatively identical structures, decision rules, and maximum sample size. To obtain results that are useful to practitioners, we evaluate several relatively simple clinical trial designs and AR methods, for five-arm trials that either do or do not include a control arm. We consider Bayesian designs for trials with binary outcomes that use either ER or one of four specific AR methods.

The AR methods to be evaluated are defined in Section 2. The trial designs are given in Section 3, and the simulation study design is given in Section 4. Sections 5 and 6 present the simulation results for trials with and without a control arm, and we close with a discussion in Section 7.

2 Outcome Adaptive Randomization Methods

There are many ways to do AR ([2], [6], [7], [21], [22]). The Bayesian AR methods considered here are similar to those studied by Thompson [1], Thall and Wathen [23], and Thall, Fox, and Wathen [20] for two-arm trials, generalized to accommodate multi-arm trials. Index treatments by k = 1, ⋯, K, and intermediate sample sizes by n = 1, ⋯, N, for maximum overall sample size N. Denoting response probabilities of the K treatments by π₁, ⋯, π_K, the AR probabilities are defined in terms of the K posterior probabilities

$$r_{k,n} = \Pr\left(\pi_k = \max\{\pi_1, \dots, \pi_K\} \mid \text{data}_n\right), \quad k = 1, \dots, K, \tag{1}$$

which sum to 1. Thus, r_{1,n}, ⋯, r_{K,n} generalize the original definition [1] given for K = 2.
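As a rough illustration of definition (1), the posterior probabilities can be approximated by Monte Carlo sampling. The sketch below is not the authors' code: it assumes independent beta(0.20, 0.80) priors, as in the trial designs of Section 3, and the function name `prob_best`, the number of draws, and the interim counts are hypothetical.

```python
import random

PRIOR_A, PRIOR_B = 0.20, 0.80  # assumed beta prior, taken from Section 3


def prob_best(successes, failures, draws=20000, seed=0):
    """Monte Carlo estimate of Pr(pi_k is the largest | data) for each arm k."""
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        # One joint draw from the independent beta posteriors.
        sample = [rng.betavariate(PRIOR_A + s, PRIOR_B + f)
                  for s, f in zip(successes, failures)]
        wins[sample.index(max(sample))] += 1
    return [w / draws for w in wins]


# Hypothetical interim data: 10 patients per arm, arm 5 has 6 responses.
r = prob_best(successes=[2, 2, 2, 2, 6], failures=[8, 8, 8, 8, 4])
```

Each joint draw has exactly one largest component, so the five estimates sum to 1, matching the statement above.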

It is well known that using {r_{1,n}, ⋯, r_{K,n}} as AR probabilities often leads to undesirable treatment assignments due to “stickiness,” wherein an outcome-adaptive treatment assignment rule assigns a suboptimal treatment to an undesirably large number of patients [24]. With the above AR probabilities, if a truly inferior treatment arm happens to have a higher early success rate, it is likely to receive a larger proportion of patients thereafter, and consequently the trial design is not likely to identify a truly superior treatment. Various modifications of r_{k,n} have been proposed to mitigate stickiness. We consider AR methods that use different combinations of three such modifications. The first is a “burn-in” wherein, initially, a fixed number of patients are randomized equally among the arms, with AR applied subsequently. The second replaces r_{k,n} with

$$r_{k,n}^{(c)} = \frac{(r_{k,n})^{c}}{\sum_{j=1}^{K} (r_{j,n})^{c}} \tag{2}$$

for some c > 0, with c = .50 used very commonly. For c < 1, this shrinks each r_{k,n} toward 1/K (toward .50 in the two-arm case), so the AR method is more like ER, which corresponds to c = 0 and all $r_{k, n}^{(0)} \equiv 1 / K$. The third modification restricts e ≤ r_{k,n} ≤ 1 − e for small e > 0. If r_{k,n} < e then the AR probability for arm k is set equal to e, and if r_{k,n} > 1 − e the AR probability is set equal to 1 − e, with the K resulting AR probabilities normalized so that they sum to 1. A method using $r_{k, n}^{(c)}$ restricted to [e, 1 − e] will be denoted by AR(c, e).
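The two transformations just described can be sketched in a few lines. This is one literal reading of the text, not the authors' implementation: the function name `ar_probs` is hypothetical, the clipping is applied once and followed by a single renormalization (so the final values can fall slightly outside [e, 1 − e]), and arms closed for futility are not handled.

```python
def ar_probs(r, c=1.0, e=0.0):
    """AR(c, e) probabilities from posterior probabilities r (summing to 1)."""
    # Power transform of equation (2); c = 0 gives equal probabilities 1/K.
    powered = [x ** c for x in r]
    total = sum(powered)
    shrunk = [x / total for x in powered]
    # Restrict each probability to [e, 1 - e], then renormalize to sum to 1.
    clipped = [min(max(x, e), 1.0 - e) for x in shrunk]
    norm = sum(clipped)
    return [x / norm for x in clipped]


# Hypothetical posterior probabilities for a five-arm trial.
example = ar_probs([0.7, 0.1, 0.1, 0.05, 0.05], c=1.0, e=0.10)
```

With these hypothetical inputs, the two arms at 0.05 are raised to 0.10 before renormalization, so no arm is starved of patients.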

All designs include a burn-in with the first 50 patients randomized equally among the arms, with exactly 10 patients assigned to each arm. We first consider AR(1, 0), which randomizes patients to arm k with probability r_{k,n}, a K-arm generalization of Thompson [1], but imposing a burn-in. The second method, AR(0.5, 0), randomizes patients to arm k with probability $r_{k, n}^{(0.5)}$ given by (2). AR(0.5, 0) minimizes the expected number of non-responders [11]. The third method, AR(n/2N, 0), generalizes Thall and Wathen’s [23] two-arm trial method by applying (2) using c = n/2N, for current sample size n = 1, ⋯, N. The fourth method, AR(1, 0.10), uses r_{k,n} with the restriction 0.10 ≤ r_{k,n} ≤ 0.90. We thus evaluate AR(1, 0), AR(0.5, 0), AR(n/2N, 0), AR(1, 0.10), and ER.
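A small sketch may help make the burn-in and the sample-size-dependent exponent concrete. The text does not say how the burn-in achieves exactly 10 patients per arm, so the shuffled assignment list below is an assumption, and both function names are hypothetical.

```python
import random


def burn_in_schedule(k_arms=5, per_arm=10, seed=0):
    """Random order of arm labels with exactly per_arm patients on each arm."""
    rng = random.Random(seed)
    schedule = [arm for arm in range(k_arms) for _ in range(per_arm)]
    rng.shuffle(schedule)  # assumed mechanism: one permuted block of size 50
    return schedule


def tuning_exponent(n, n_max):
    """Exponent c = n / (2N) used by AR(n/2N, 0) at current sample size n."""
    return n / (2 * n_max)


schedule = burn_in_schedule()
c_after_burn_in = tuning_exponent(50, 250)   # 0.1
c_at_end = tuning_exponent(250, 250)         # 0.5
```

For N = 250 the exponent rises from 0.1 just after the burn-in to 0.5 at the last patient, so AR(n/2N, 0) starts close to ER and ends at AR(0.5, 0).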

3 Trial Designs

Each simulation case is determined by whether a control arm is included, the maximum sample size N=250 or N=500, decision rules, and randomization method. All cases are five-arm trials. When a control arm, C, is included, we index it by k = C and the four experimental arms by k = 1, 2, 3, 4. When C is not included, we index the five experimental treatments by k = 1, 2, 3, 4, 5. For all designs, we assume the response probabilities, {π_k}, are independent with beta(0.20, 0.80) priors. Each design requires one parameter, a_U, to define the treatment arm selection rule, determined via preliminary simulations under the null scenario where all fixed response probabilities equal 0.20.
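The calibration of a_U described above can be pictured as taking an upper quantile of a null-scenario simulation. The following is a minimal sketch under stated assumptions, not the authors' procedure: each simulated null trial is assumed to have been summarized by the largest end-of-trial posterior selection probability among arms that were not stopped (a summary introduced here for illustration), the target false positive probability is taken to be 0.05, and `calibrate_cutoff` is a hypothetical helper.

```python
def calibrate_cutoff(null_max_probs, alpha=0.05):
    """Cut-off exceeded in at most a fraction alpha of simulated null trials.

    null_max_probs holds one value per simulated null trial: the largest
    end-of-trial posterior probability among arms still open (assumed summary).
    """
    ordered = sorted(null_max_probs)
    allowed = int(alpha * len(ordered))  # null trials allowed above the cut-off
    return ordered[len(ordered) - allowed - 1]


# Hypothetical stand-in for simulation output: 100 evenly spaced values.
fake_null = [i / 100 for i in range(100)]
a_u = calibrate_cutoff(fake_null, alpha=0.05)
false_positive_rate = sum(v > a_u for v in fake_null) / len(fake_null)
```

Because a trial selects at least one arm exactly when its largest posterior probability exceeds the cut-off, controlling this exceedance rate controls the overall false positive probability in the sketch.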

When C is included, its response probability, π_C, is used as the comparator in the decision rules. These rules may stop randomization to an experimental arm E_k due to futility, or select an E_k as promising, based on the posterior of π_k − π_C. If no control arm is included, one possible approach is to use a fixed standard probability, p_C, for comparison. Unless p_C is completely arbitrary, this requires the assumption that there exists a standard treatment with response probability known to equal p_C, i.e. Pr(π_C = p_C) = 1. It also requires that the numerical value p_C, obtained in practice from previous trials or clinical experience, will remain a valid comparator during the trial. This implies there are no between-trial or trial-versus-historical effects. Because these are very unrealistic assumptions, we do not consider designs assuming a fixed standard. Thus, the designs without a control arm that we consider make decisions based on comparisons among the E_k’s.

3.1 Multi-Arm Trials With a Control Arm

For each experimental arm, E_k, k = 1, 2, 3, 4, after the initial burn-in, the following decision rules are applied continuously during the trial.

Futility

For each k = 1, 2, 3, 4, arm E_k is terminated early due to futility if

$$\Pr(\pi_k > \pi_C + 0.20 \mid \text{data}_n) < 0.01.$$

If all four experimental arms are terminated, the trial is stopped.

Enrichment

If an E_k is terminated early for futility, the remaining patients, up to N, are randomized among the remaining open arms.

Selection

If E_k is not terminated early, then at the end of the trial E_k is selected if

$$\Pr(\pi_k > \pi_C + 0.20 \mid \text{data}_n) > a_U. \tag{3}$$

The design thus allows more than one E_k to be selected. It is typical practice to require a new treatment to provide a minimal clinically significant improvement, here specified to be δ = 0.20. The futility rule decreases the number of patients randomized to an E_k that is very unlikely to achieve the targeted improvement over C, and thus enriches the sample sizes of arms having larger success probabilities. For each design, the numerical value of a_U is determined to ensure overall false positive probability 0.05 for the trial, with a false positive defined as selecting any E_k in the null case where all true p_k = .20. The numerical value of a_U depends on the randomization method, the value of N and the initial burn-in. Supplementary Table S1 gives the numerical value of the cut-off a_U used by each design’s selection rule in each case. An alternative to deriving a_U in this way is to set it equal to a fixed value, such as a_U = 0.95. We chose to determine a_U for each design to obtain the same overall false positive probability 0.05 in order to ensure fair comparisons among the randomization methods in terms of per-arm selection probabilities, stopping probabilities, and sample size distributions.
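The futility and selection rules for a design with a control arm both depend on the same posterior probability, which can be approximated by simulation. The sketch below assumes the beta(0.20, 0.80) priors stated earlier and uses the fixed value a_U = 0.95 mentioned above purely as a placeholder, since the calibrated cut-offs of Supplementary Table S1 are not reproduced here. The function names and patient counts are hypothetical.

```python
import random


def prob_exceeds_control(s_k, f_k, s_c, f_c, delta=0.20, draws=20000, seed=0):
    """Monte Carlo estimate of Pr(pi_k > pi_C + delta | data)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        pi_k = rng.betavariate(0.20 + s_k, 0.80 + f_k)
        pi_c = rng.betavariate(0.20 + s_c, 0.80 + f_c)
        hits += pi_k > pi_c + delta
    return hits / draws


def stop_for_futility(post_prob, cutoff=0.01):
    """Interim rule: close arm E_k when the posterior probability is below 0.01."""
    return post_prob < cutoff


def select_arm(post_prob, a_u=0.95):
    """Final rule (3); a_u = 0.95 is a placeholder, not a calibrated value."""
    return post_prob > a_u


# Hypothetical data: a weak arm (2/30) and a strong arm (25/30) against controls.
weak = prob_exceeds_control(s_k=2, f_k=28, s_c=12, f_c=18)
strong = prob_exceeds_control(s_k=25, f_k=5, s_c=5, f_c=25)
```

In this made-up example the weak arm would be closed for futility at the interim look, while the strong arm would satisfy the selection criterion at the end of the trial.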

3.2 Multi-Arm Trials Without a Control Arm

For trials without a control arm, the decision rules are as follows:

Futility

For each k = 1, 2, 3, 4, 5, accrual to E_k is terminated due to futility if

$$\Pr(\pi_k > \max\{\pi_r : r \neq k\} \mid \text{data}_n) < 0.01.$$

Enrichment

If an E_k is closed early for futility, the remaining patients, up to maximum sample size N, are randomized among the remaining open arms.

Selection

If E_k is not terminated early, at the end of the trial E_k is selected if

$$\Pr(\pi_k > \max\{\pi_r : r \neq k\} \mid \text{data}_n) > a_U. \tag{4}$$

At the end of the trial, the designs with a control arm may select more than one E_k, whereas the designs without a control arm may select at most one E_k. While one might question why at most one E_k may be selected in trials without a control arm, it is extremely unlikely that two different π_k’s both will satisfy the criterion (4) for any reasonably large a_U. Moreover, in the cases of no control arm there is no required improvement, such as the value δ = .20 that is used in the selection rule. If the selection criterion (4) were replaced by

$$\Pr(\pi_k > \max\{\pi_r : r \neq k\} + \delta \mid \text{data}_n) > a_U,$$

for δ = .15 or .20, our simulations show that, for N = 250 or 500 in a five-arm trial, this design would be extremely unlikely to correctly select any E_k in many scenarios where there actually are substantive differences among the p_k’s.
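To see why adding a required improvement δ to criterion (4) is so demanding, one can compare the two posterior probabilities on the same data. This is an illustrative sketch only, assuming the beta(0.20, 0.80) priors of Section 3; the function name `prob_beats_rest` and the end-of-trial counts (50 patients per arm with observed rates matching a staircase pattern) are hypothetical.

```python
import random


def prob_beats_rest(successes, failures, delta=0.0, draws=10000, seed=0):
    """Estimate Pr(pi_k > max of the other pi_r + delta | data) for each arm."""
    rng = random.Random(seed)
    k_arms = len(successes)
    hits = [0] * k_arms
    for _ in range(draws):
        pi = [rng.betavariate(0.20 + s, 0.80 + f)
              for s, f in zip(successes, failures)]
        for k in range(k_arms):
            rest = max(pi[j] for j in range(k_arms) if j != k)
            hits[k] += pi[k] > rest + delta
    return [h / draws for h in hits]


# Hypothetical final data: 50 patients per arm, observed rates 0.20 to 0.40.
succ = [10, 12, 15, 18, 20]
fail = [40, 38, 35, 32, 30]
without_margin = prob_beats_rest(succ, fail, delta=0.0)
with_margin = prob_beats_rest(succ, fail, delta=0.20)
```

Using the same seed for both calls means every posterior draw that satisfies the criterion with δ = 0.20 also satisfies it with δ = 0, so the second set of probabilities can never exceed the first.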

4 Simulation Study Design

Under the Bayesian formulation, the probabilities, π_C, π₁, ⋯, π₄ in the case with a control arm, or π₁, ⋯, π₅ in the case without a control arm, are random. We distinguish between these random quantities and corresponding assumed fixed probabilities, denoted using p_k in place of π_k, that are used to define scenarios and simulate data. In all simulation scenarios, we assumed fixed null response rate 0.20. We consider three scenarios. In the null scenario, all p_k = 0.20. Given fixed targeted improvement δ = 0.20, the least favorable configuration (LFC) has one experimental p_k = 0.20 + δ and all other p_k = 0.20. Thus, p_C = p₁ = … = p₃ = .20 and p₄ = 0.20 + δ = 0.40 if there is a control arm, and p₁ = … = p₄ = .20 with p₅ = 0.20 + δ = 0.40 if there is no control arm. The LFC is determined, in the case with a control arm, by assuming that (i) no experimental p_k is between p_C and p_C + δ and (ii) at least one experimental arm has p_k ≥ p_C + δ. The LFC is the configuration of p₁, ⋯, p_K values that minimizes the probability, under (i) and (ii), that at least one E_k for which p_k ≥ p_C + δ is selected. The name “least favorable configuration” is somewhat misleading, since the requirements (i) and (ii) are quite strong, and they ensure that it is relatively easy to identify the one E_k providing a δ improvement over p_C. This motivates the third, more realistic “staircase” scenario, for which the p_k’s are 0.20, 0.25, 0.30, 0.35, 0.40.
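The three scenarios can be written down as fixed vectors of response probabilities. The sketch below only restates the values given above; the function name `scenarios` and the dictionary layout are hypothetical conveniences. With a control arm, the first entry is read as p_C and the remaining four as p₁ to p₄; without one, the entries are p₁ to p₅.

```python
NULL_RATE = 0.20  # fixed null response rate
DELTA = 0.20      # fixed targeted improvement


def scenarios(k_arms=5):
    """Fixed response probabilities p_k for the null, LFC and staircase cases."""
    null = [NULL_RATE] * k_arms
    # LFC: one arm improved by DELTA, all others at the null rate.
    lfc = [NULL_RATE] * (k_arms - 1) + [NULL_RATE + DELTA]
    # Staircase: steps of 0.05 from the null rate up to NULL_RATE + DELTA.
    staircase = [round(NULL_RATE + 0.05 * i, 2) for i in range(k_arms)]
    return {"null": null, "lfc": lfc, "staircase": staircase}


cases = scenarios()
```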

5 Simulation Results for Trials with a Control Arm

In the tables, n̄(95% CI) denotes the mean and (2.5^th, 97.5^th) percentiles of each per-arm sample size distribution. Under the LFC, we denote the probability of correctly selecting the superior arm E₄ by PCS. In practice, an AR-based design with a large sample size imbalance favoring a superior arm is unlikely to be used if it has substantially lower PCS than ER. Table 2 shows that, under the LFC with p₄= 0.40 and N = 250, AR(1,0), $AR (\frac{1}{2}, 0)$ and $AR (\frac{n}{2 N}, 0)$ all have very low PCS, between 0.44 and 0.48, compared to AR(1, 0.1) and ER, which have PCS values 0.67 and 0.66. One reason for this large loss in PCS for AR(1,0), $AR (\frac{1}{2}, 0)$ and $AR (\frac{n}{2 N}, 0)$ is that each gets stuck randomizing patients to E₄ very early in the trial, resulting in a smaller n̄ for C. The AR methods have n̄ ranging from 23 to 35, with the widest 95% CI (11, 70) for $AR (\frac{1}{2}, 0)$ , compared to n̄ = 72 and 95% CI (37, 110) for ER. AR(1, 0.1) provides a favorable sample size imbalance, with n̄ = 127 for E₄ compared to n̄ = 70 with ER. To ensure false positive probability 0.05, the cut-off a_U in the selection rule (3) must be larger for AR(1,0), $AR (\frac{1}{2}, 0)$ and $AR (\frac{n}{2 N}, 0)$ , compared to AR(1, 0.1) or ER, resulting in much smaller PCS for the first three AR methods.

Table 2.

Simulation results for designs with a control arm in LFC scenario, p_C = p₁ = p₂ = p₃ = 0.20, p₄ = 0.40, for N=250. n̄ = mean per-arm sample size. Each η_m = Pr(N_C > N_k + m), the probability that the number of patients randomized to arm C is at least m larger than the number randomized to arm E_k. Values in the row E₁ – E₃ are per-arm.

| Method | Arm | Pr(Select) | Pr(Stop) | n̄ (95% CI) | η₁₀, η₂₀, η₃₀ |
|---|---|---|---|---|---|
| AR(1, 0) | C | – | – | 23 (10, 58) | – |
| | E₁–E₃ | 0.02 | 0.40 | 23 (10, 69) | 0.28, 0.15, 0.08 |
| | E₄ | 0.44 | 0.07 | 152 (11, 201) | 0.04, 0.03, 0.02 |
| | Total | | | 244 (140, 250) | |
| $AR(\frac{1}{2}, 0)$ | C | – | – | 34 (11, 70) | – |
| | E₁–E₃ | 0.02 | 0.56 | 29 (10, 70) | 0.42, 0.29, 0.18 |
| | E₄ | 0.46 | 0.07 | 123 (10, 177) | 0.05, 0.04, 0.02 |
| | Total | | | 243 (130, 250) | |
| $AR(\frac{n}{2N}, 0)$ | C | – | – | 31 (12, 63) | – |
| | E₁–E₃ | 0.02 | 0.52 | 27 (10, 68) | 0.39, 0.25, 0.13 |
| | E₄ | 0.48 | 0.07 | 132 (11, 179) | 0.04, 0.03, 0.02 |
| | Total | | | 243 (120, 250) | |
| AR(1, 0.1) | C | – | – | 35 (23, 60) | – |
| | E₁–E₃ | 0.03 | 0.58 | 27 (10, 66) | 0.44, 0.26, 0.11 |
| | E₄ | 0.67 | 0.07 | 127 (10, 177) | 0.05, 0.04, 0.02 |
| | Total | | | 243 (130, 250) | |
| ER | C | – | – | 72 (37, 110) | – |
| | E₁–E₃ | 0.02 | 0.78 | 34 (10, 71) | 0.73, 0.64, 0.56 |
| | E₄ | 0.66 | 0.08 | 70 (10, 109) | 0.23, 0.07, 0.03 |
| | Total | | | 243 (130, 250) | |


Thall and Wathen [23] and Thall, Fox and Wathen [20] showed that, in the two-arm case, there is a significant risk that AR(1,0) and $AR (\frac{1}{2}, 0)$ will get stuck randomizing more patients to the inferior treatment arm. To determine whether this holds in the multi-arm case, we calculate η_m = Pr(N_C > N_k + m) for m = 10, 20 or 30 for each method. When some E_k is superior, η_m is the probability that a method will randomize at least m more patients to the inferior control arm than to E_k. An AR procedure having η_m much larger than that obtained with ER is undesirable. Under the LFC with p₄ = 0.40, using AR(1, 0.1), on average, 127 patients are treated with E₄ compared to 70 using ER, so an additional 57 patients are treated with E₄ as a result of using AR(1, 0.1), which has η₁₀ = 0.05 compared to 0.23 for ER. The reason that ER has larger η₁₀ than AR(1, 0.1) is that, if E₄ is dropped and the trial continues, ER assigns more patients to C than AR(1, 0.1). Thus, results of the two-arm case cannot be extended to the multi-arm setting. In the LFC, AR(1, 0.1) achieves a very favorable patient imbalance in favor of E₄ compared to ER while maintaining PCS and reducing the likelihood of randomizing patients to inferior treatments.
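Given the per-arm sample sizes from a set of simulated trials, η_m is simply the fraction of trials in which the control arm received more than m patients beyond arm E_k. A minimal sketch follows; the function name `eta` and the four toy trials are hypothetical and are not results from the study.

```python
def eta(n_control, n_arm, m):
    """Fraction of simulated trials with N_C > N_k + m."""
    pairs = list(zip(n_control, n_arm))
    return sum(nc > nk + m for nc, nk in pairs) / len(pairs)


# Toy sample sizes from four hypothetical simulated trials.
n_control = [30, 72, 25, 90]
n_arm = [120, 60, 130, 55]
eta_10 = eta(n_control, n_arm, 10)
eta_20 = eta(n_control, n_arm, 20)
eta_30 = eta(n_control, n_arm, 30)
```

The sketch uses the strict inequality in the formula Pr(N_C > N_k + m); by construction the values cannot increase as m grows.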

In the staircase scenario, it is much more difficult to discriminate among the E_k’s. Table 3 summarizes simulations in this case for trials including C with N = 250. Compared to ER, AR(1, 0.1) has slightly smaller probabilities of selecting E₃ or E₄, which have p₃ = .35 and p₄ = .40. This is due to the fact that E₁, E₂, and E₃ remain in the trial longer because these treatments provide some improvement over C, limiting the number of patients treated with E₄, and reducing the probability that any AR method will select E₄. Still, AR(1, 0.1) assigns more patients to the better treatment arms, on average. Additionally, η₁₀, η₂₀ and η₃₀ each are smaller for AR(1, 0.1) compared to ER. Compared to AR(1, 0.1) or ER, the probabilities of selecting the best arms E₃ or E₄ are much smaller for AR(1,0), $AR (\frac{1}{2}, 0)$ and $AR (\frac{n}{2 N}, 0)$.

Table 3.

Simulation results for designs with a control arm in staircase scenario, (p_C, p₁, p₂, p₃, p₄) = (0.20,0.25,0.30,0.35,0.40) for N=250. n̄ = mean per-arm sample size. Each η_m = Pr(N_C > N_k + m), the probability that the number of patients randomized to arm C is at least m larger than the number randomized to arm E_k.

| Method | Arm | Pr(Select) | Pr(Stop) | n̄ (95% CI) | η₁₀, η₂₀, η₃₀ |
|---|---|---|---|---|---|
| AR(1, 0) | C | – | – | 20 (10, 51) | – |
| | E₁ | 0.05 | 0.24 | 27 (10, 74) | 0.17, 0.08, 0.04 |
| | E₂ | 0.11 | 0.16 | 40 (10, 113) | 0.12, 0.06, 0.03 |
| | E₃ | 0.22 | 0.10 | 62 (10, 158) | 0.08, 0.04, 0.02 |
| | E₄ | 0.40 | 0.06 | 101 (10, 185) | 0.05, 0.03, 0.02 |
| $AR(\frac{1}{2}, 0)$ | C | – | – | 28 (11, 60) | – |
| | E₁ | 0.06 | 0.33 | 32 (10, 73) | 0.25, 0.16, 0.08 |
| | E₂ | 0.14 | 0.22 | 44 (10, 93) | 0.17, 0.11, 0.06 |
| | E₃ | 0.26 | 0.13 | 60 (10, 121) | 0.10, 0.07, 0.04 |
| | E₄ | 0.45 | 0.06 | 84 (11, 149) | 0.05, 0.04, 0.03 |
| $AR(\frac{n}{2N}, 0)$ | C | – | – | 26 (11, 55) | – |
| | E₁ | 0.06 | 0.31 | 31 (10, 72) | 0.24, 0.13, 0.06 |
| | E₂ | 0.13 | 0.21 | 42 (10, 97) | 0.16, 0.09, 0.04 |
| | E₃ | 0.25 | 0.12 | 61 (10, 130) | 0.09, 0.06, 0.03 |
| | E₄ | 0.45 | 0.07 | 90 (10, 156) | 0.05, 0.04, 0.02 |
| AR(1, 0.1) | C | – | – | 33 (23, 54) | – |
| | E₁ | 0.10 | 0.36 | 31 (10, 70) | 0.29, 0.16, 0.05 |
| | E₂ | 0.21 | 0.22 | 41 (10, 99) | 0.19, 0.10, 0.04 |
| | E₃ | 0.38 | 0.13 | 58 (10, 131) | 0.11, 0.07, 0.03 |
| | E₄ | 0.61 | 0.06 | 86 (11, 151) | 0.06, 0.04, 0.02 |
| ER | C | – | – | 58 (40, 93) | – |
| | E₁ | 0.08 | 0.49 | 39 (10, 65) | 0.50, 0.38, 0.31 |
| | E₂ | 0.22 | 0.29 | 46 (10, 72) | 0.36, 0.24, 0.19 |
| | E₃ | 0.44 | 0.15 | 51 (10, 79) | 0.26, 0.14, 0.10 |
| | E₄ | 0.66 | 0.07 | 55 (11, 86) | 0.21, 0.07, 0.05 |


Tables 2 and 3 show that, for designs with a control arm and N = 250 patients, in the LFC or staircase scenarios, the highest probabilities of selecting the best arm are 0.66 or 0.67, obtained by AR(1, 0.1) or ER. A trial probably would not be conducted if there were only a 66% chance of selecting an E_k achieving the targeted improvement. In practice, one would either increase N, increase the false-positive rate, or both. Supplementary Tables S1, S2, S3 summarize simulations in the three scenarios for N = 500 with a control arm. Table S3 shows that N = 500 gives much larger probabilities of selecting superior E_k’s in the staircase scenario, with E₄ selected with probabilities 0.84 by AR(1, 0.1) and 0.86 by ER, while the other three AR methods have substantially inferior performance. Tables S2 and S3 show that, under the LFC, for N = 500 the probability of stopping superior arm E₄ is .08 to .09 for AR(1, 0.1). If desired, these Pr(Stop) values may be made smaller by reducing the futility stopping rule cut-off to a value smaller than .01, such as .005, but the price would be smaller per-arm sample sizes for E₄ and consequently lower Pr(Select) values.

Table 4 compares PCS = Pr(Select E₄) for N=250 and N=500 under the LFC when p₄ = 0.40. When N=500, AR(1, 0.1) and ER have PCS values 0.87 and 0.85. Compared to ER, although AR(1, 0.1) has a much more disperse sub-sample size distribution for E₄, on average AR(1, 0.1) randomizes many more patients to E₄. The PCS values 0.77, 0.67 and 0.53 for $AR (\frac{n}{2 N}, 0), AR (\frac{1}{2}, 0)$ , and AR(1,0) are much smaller. $AR (\frac{1}{2}, 0)$ would require N=500 patients to obtain the same PCS as AR(1, 0.1) and ER with only N=250. A trial utilizing AR(1, 0) would require more than double the sample size to obtain the same PCS as AR(1, 0.1) or ER. A general conclusion is that AR(1, 0.1) provides more patients with superior treatment while maintaining acceptable PCS, for N = 500 in a five-arm trial with a control.

Table 4.

Simulation results for designs with a control arm comparing N=250 and N = 500 in LFC scenario, p₁ = p₂ = p₃ = p_C = 0.20, p₄ = 0.40. Values of n̄= mean per-arm sample size and 95% CI are for E₄.

| Method | Quantity | N=250 | N=500 |
|---|---|---|---|
| AR(1, 0) | Pr(Select E₄) | 0.44 | 0.53 |
| | n̄ (95% CI) | 152 (11, 201) | 369 (11, 444) |
| $AR(\frac{1}{2}, 0)$ | Pr(Select E₄) | 0.46 | 0.67 |
| | n̄ (95% CI) | 123 (11, 177) | 319 (11, 413) |
| $AR(\frac{n}{2N}, 0)$ | Pr(Select E₄) | 0.48 | 0.77 |
| | n̄ (95% CI) | 132 (11, 179) | 321 (11, 406) |
| AR(1, 0.1) | Pr(Select E₄) | 0.67 | 0.87 |
| | n̄ (95% CI) | 127 (11, 177) | 313 (11, 403) |
| ER | Pr(Select E₄) | 0.66 | 0.85 |
| | n̄ (95% CI) | 70 (10, 109) | 175 (13, 238) |


6 Simulation Results for Trials without a Control Arm

Each design without a control arm was calibrated to have a 1% chance of selecting each treatment in the null scenario (Table 5). In the LFC scenario with p₅ = 0.40 and N=250, Table 6 shows that all methods provide PCS for E₅ ranging from 0.75 to 0.82, and all of the η_m values are relatively small for E₅. If the only cases considered were the null and the LFC, then it might seem that running a multi-arm trial including a control arm is foolish. However, the opposite is true. Table 7 shows that, in the staircase scenario, for N = 250 the probabilities of selecting the best treatments are extremely low, ranging from 0.19 to 0.26, compared to approximately 0.65 when a control arm is included. The main reason for this large drop is that, without a control arm, comparison among the E_k’s is extremely difficult if the differences between the p_k’s are small. Supplementary Table S6 shows that, in the staircase scenario, even if the overall maximum sample size is increased to N = 500, the selection probabilities for E₅ range from 0.33 to 0.39 for any randomization method, with selection probabilities at most 0.04 for any of E₁, ⋯, E₄.

Table 5.

Simulation results for designs without a control arm in the null scenario p₁ = ⋯ = p₅ = 0.20, for N=250. Each η_m = Pr(N_{E₁} > N_{E_k} + m), the probability that the number of patients randomized to arm E₁ is at least m larger than the number randomized to arm E_k. All values are per-arm.

| Method | Pr(Select) | Pr(Stop) | n̄ (95% CI) | η₁₀, η₂₀, η₃₀ |
|---|---|---|---|---|
| AR(1, 0) | 0.01 | 0.19 | 50 (10, 137) | 0.42, 0.35, 0.28 |
| Total | | | 250 (250, 250) | |
| $AR(\frac{1}{2}, 0)$ | 0.01 | 0.26 | 50 (10, 110) | 0.41, 0.33, 0.26 |
| Total | | | 249 (250, 250) | |
| $AR(\frac{n}{2N}, 0)$ | 0.01 | 0.25 | 50 (10, 118) | 0.41, 0.34, 0.27 |
| Total | | | 249 (250, 250) | |
| AR(1, 0.1) | 0.01 | 0.24 | 50 (10, 128) | 0.41, 0.34, 0.26 |
| Total | | | 250 (250, 250) | |
| ER | 0.01 | 0.32 | 50 (10, 97) | 0.32, 0.23, 0.2 |
| Total | | | 248 (250, 250) | |


Table 6.

Simulation results for designs with no control arm in the LFC scenario p₁ = p₂ = p₃ = p₄ = 0.20 and p₅ = 0.40, for N=250. n̄ = mean per-arm sample size. Each η_m = Pr(N_E₁ > N_{E_k} + m), the probability that the number of patients randomized to arm E₁ is at least m larger than the number randomized to arm E_k. Values in the row E₁ – E₄ are per-arm.

| Method | Arm | Pr(Select) | Pr(Stop) | n̄ (95% CI) | η₁₀, η₂₀, η₃₀ |
|---|---|---|---|---|---|
| AR(1, 0) | E₁–E₄ | | 0.58 | 24 (10, 72) | 0.27, 0.15, 0.08 |
| | E₅ | 0.78 | 0.02 | 141 (11, 199) | 0.03, 0.02, 0.02 |
| | Total | | | 236 (90, 250) | |
| $AR(\frac{1}{2}, 0)$ | E₁–E₄ | | 0.74 | 29 (10, 74) | 0.32, 0.21, 0.13 |
| | E₅ | 0.81 | 0.02 | 102 (14, 164) | 0.03, 0.02, 0.02 |
| | Total | | | 217 (80, 250) | |
| $AR(\frac{n}{2N}, 0)$ | E₁–E₄ | | 0.71 | 27 (10, 69) | 0.31, 0.19, 0.1 |
| | E₅ | 0.82 | 0.02 | 107 (13, 169) | 0.02, 0.02, 0.01 |
| | Total | | | 216 (70, 250) | |
| AR(1, 0.1) | E₁–E₄ | | 0.68 | 26 (10, 71) | 0.30, 0.17, 0.08 |
| | E₅ | 0.80 | 0.02 | 123 (17, 184) | 0.03, 0.02, 0.02 |
| | Total | | | 228 (80, 250) | |
| ER | E₁–E₄ | | 0.79 | 36 (10, 89) | 0.34, 0.26, 0.19 |
| | E₅ | 0.75 | 0.02 | 61 (13, 103) | 0.07, 0.02, 0.01 |
| | Total | | | 205 (70, 250) | |


Table 7.

Simulation results for designs with no control arm in the staircase scenario, (p₁, p₂, p₃, p₄, p₅) = (0.20,0.25,0.30,0.35,0.40), for N=250. n̄ = mean per-arm sample size. η_m = Pr(N_E₁ > N_{E_j} + m ), the probability that the number of patients randomized to E₁ is at least m larger than the number randomized to E_j.

| Method | Arm | Pr(Select) | Pr(Stop) | n̄ (95% CI) | η₁₀, η₂₀, η₃₀ |
|---|---|---|---|---|---|
| AR(1, 0) | E₁ | | 0.61 | 19 (10, 52) | – |
| | E₂ | | 0.44 | 27 (10, 81) | 0.17, 0.08, 0.03 |
| | E₃ | | 0.30 | 39 (10, 115) | 0.12, 0.06, 0.03 |
| | E₄ | 0.04 | 0.16 | 63 (10, 160) | 0.08, 0.04, 0.02 |
| | E₅ | 0.21 | 0.07 | 101 (10, 184) | 0.04, 0.02, 0.01 |
| $AR(\frac{1}{2}, 0)$ | E₁ | | 0.79 | 22 (10, 58) | – |
| | E₂ | | 0.60 | 31 (10, 75) | 0.2, 0.11, 0.06 |
| | E₃ | 0.01 | 0.39 | 44 (10, 100) | 0.13, 0.08, 0.04 |
| | E₄ | 0.03 | 0.21 | 62 (10, 130) | 0.08, 0.05, 0.03 |
| | E₅ | 0.21 | 0.07 | 87 (10, 152) | 0.04, 0.03, 0.02 |
| $AR(\frac{n}{2N}, 0)$ | E₁ | | 0.76 | 21 (10, 54) | – |
| | E₂ | | 0.58 | 29 (10, 75) | 0.19, 0.1, 0.05 |
| | E₃ | | 0.38 | 42 (10, 103) | 0.13, 0.07, 0.03 |
| | E₄ | 0.04 | 0.20 | 62 (10, 136) | 0.07, 0.04, 0.02 |
| | E₅ | 0.26 | 0.07 | 91 (10, 158) | 0.04, 0.02, 0.01 |
| AR(1, 0.1) | E₁ | | 0.75 | 21 (10, 52) | – |
| | E₂ | | 0.55 | 29 (10, 77) | 0.19, 0.1, 0.04 |
| | E₃ | | 0.34 | 41 (10, 113) | 0.12, 0.06, 0.02 |
| | E₄ | 0.04 | 0.18 | 62 (10, 149) | 0.08, 0.04, 0.02 |
| | E₅ | 0.23 | 0.07 | 94 (10, 172) | 0.04, 0.03, 0.01 |
| ER | E₁ | | 0.89 | 25 (10, 69) | – |
| | E₂ | | 0.71 | 36 (10, 82) | 0.21, 0.14, 0.09 |
| | E₃ | | 0.45 | 50 (10, 101) | 0.14, 0.08, 0.06 |
| | E₄ | 0.03 | 0.22 | 62 (10, 108) | 0.08, 0.05, 0.03 |
| | E₅ | 0.19 | 0.08 | 68 (10, 109) | 0.05, 0.03, 0.02 |


7 Discussion

A general conclusion is that, for multi-arm trials, AR(1, 0), $AR (\frac{1}{2}, 0)$ , and $AR (\frac{n}{2 N}, 0)$ should not be used. If one wishes to use some AR method in a multi-arm trial, if an initial burn-in is imposed, the superior performance of AR(1, 0.1) indicates that it is important to restrict the domain of possible AR probabilities by bounding them away from 0 and 1. Given the apparent popularity of AR(1,0) and AR(.50, 0), this is a very important result. While we have not examined other hybrid methods, such as AR(.50, .10) or AR(n/2N, .10), the simulations suggest that these may perform well compared to AR(1, .10) or ER. The numerical limit e cannot be arbitrary, since, for example, AR(.50, .20) would be close to ER in a five-arm trial. ER does the best job of selecting treatments having p_k’s that are superior but close to each other.

In practice, it is not unlikely that two or more p_k’s may be close to each other, so the staircase scenario may be closer to reality than the LFC. When the p_k’s are close to each other, it is very difficult to select any E_k if no C is included as a comparator. The simulations in the staircase scenario indicate that conducting a multi-arm trial without a control arm may be a waste of resources, for any randomization method, and it is best to include a control arm in a multi-arm selection trial.

Many elaborations and alternative cases are possible, including time-to-event or multivariate outcomes, accounting for covariates, and evaluating AR methods for multi-arm trials in the presence of drift. This latter issue is closely related to so-called platform designs [25], which allow experimental arms to enter a trial after it has started. These are important areas for future simulation study.


Table 1.

Simulation results for designs with a control arm in the null scenario with all p_k = 0.20, for N=250. n̄ = mean per-arm sample size. Each η_m = Pr(N_C > N_k + m), the probability that the number of patients randomized to arm C is at least m larger than the number randomized to arm E_k. Values in the row E₁ – E₄ are per-arm.

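| Method | Arm | Pr(Select) | Pr(Stop) | n̄ (95% CI) | η₁₀, η₂₀, η₃₀ |
|---|---|---|---|---|---|
| AR(1, 0) | C | – | – | 33 (10, 63) | – |
| | E₁–E₄ | 0.01 | 0.67 | 42 (10, 135) | 0.4, 0.27, 0.15 |
| | Total | | | 202 (70, 250) | |
| $AR(\frac{1}{2}, 0)$ | C | – | – | 40 (13, 69) | – |
| | E₁–E₄ | 0.01 | 0.74 | 40 (10, 109) | 0.44, 0.32, 0.19 |
| | Total | | | 200 (70, 250) | |
| $AR(\frac{n}{2N}, 0)$ | C | – | – | 38 (13, 65) | – |
| | E₁–E₄ | 0.01 | 0.73 | 40 (10, 116) | 0.43, 0.3, 0.18 |
| | Total | | | 199 (70, 250) | |
| AR(1, 0.1) | C | – | – | 38 (20, 63) | – |
| | E₁–E₄ | 0.01 | 0.74 | 41 (10, 124) | 0.44, 0.3, 0.16 |
| | Total | | | 200 (70, 250) | |
| ER | C | – | – | 58 (17, 98) | – |
| | E₁–E₄ | 0.01 | 0.81 | 36 (10, 82) | 0.6, 0.45, 0.33 |
| | Total | | | 200 (70, 250) | |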
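
The imbalance measure η_m in the caption of Table 1 can be estimated by Monte Carlo from simulated per-arm sample sizes. The following is a minimal sketch only, written against the formula Pr(N_C > N_k + m) as stated in the caption. It assumes plain equal randomization among five arms with no burn-in, futility stopping, or selection, so its output is not comparable to the values in the table. The function name `simulate_eta` and all of its parameters are hypothetical.

```python
import random


def simulate_eta(n_total=250, n_arms=5, margins=(10, 20, 30), n_sims=2000, seed=1):
    """Estimate eta_m = Pr(N_C > N_k + m) for one experimental arm E_k.

    Assumption (illustrative only): every patient is assigned to one of
    n_arms arms with equal probability, and no arm is ever dropped.
    """
    rng = random.Random(seed)
    hits = {m: 0 for m in margins}
    for _ in range(n_sims):
        n = [0] * n_arms  # index 0 plays the role of C, index 1 of E_k
        for _ in range(n_total):
            n[rng.randrange(n_arms)] += 1
        for m in margins:
            if n[0] > n[1] + m:
                hits[m] += 1
    # Convert counts of simulated trials into estimated probabilities
    return {m: hits[m] / n_sims for m in margins}


eta = simulate_eta()
print(eta)
```

Replacing the assignment step with an adaptive rule and an arm-dropping rule would be needed to approximate the designs studied here; those rules are not reproduced in this sketch.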


References

1. Thompson WR. On the likelihood that one unknown probability exceeds another in view of the evidence of the two samples. Biometrika. 1933;25:285–294.
2. Wei LJ, Durham S. The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association. 1978;73:840–843.
3. Zelen M. A new design for randomized clinical trials. New England Journal of Medicine. 1979;300:1242–1246. doi: 10.1056/NEJM197905313002203.
4. Cheung YK, Inoue LYT, Wathen JK, Thall PF. Continuous Bayesian adaptive randomization based on event times with covariates. Statistics in Medicine. 2006;25:55–70. doi: 10.1002/sim.2247.
5. Thall PF, Wathen JK. Covariate-adjusted adaptive randomization in a sarcoma trial with multi-stage treatments. Statistics in Medicine. 2005;24:1947–1964. doi: 10.1002/sim.2077.
6. Hu F, Rosenberger WF. The Theory of Response-Adaptive Randomization in Clinical Trials. Wiley Series in Probability and Statistics. Hoboken: Wiley; 2006.
7. Sverdlov O. Modern Adaptive Randomized Clinical Trials: Statistical and Practical Aspects. Boca Raton: CRC Press, Taylor & Francis; 2015.
8. Giles FJ, Kantarjian HM, Cortes JE, et al. Adaptive randomized study of idarubicin and cytarabine versus troxacitabine and cytarabine versus troxacitabine and idarubicin in untreated patients 50 years or older with adverse karyotype acute myeloid leukemia. Journal of Clinical Oncology. 2003;21:1722–1727. doi: 10.1200/JCO.2003.11.016.
9. Maki RG, Wathen JK, Hensley ML, et al. An adaptively randomized phase III study of gemcitabine and docetaxel versus gemcitabine alone in patients with metastatic soft tissue sarcomas. Journal of Clinical Oncology. 2007;25:2755–2763. doi: 10.1200/JCO.2006.10.4117.
10. Kim ES, Herbst RS, Wistuba II, et al. The BATTLE trial: personalizing therapy for lung cancer. Cancer Discovery. 2011;1:44–53. doi: 10.1158/2159-8274.CD-10-0010.
11. Chappell R, Karrison T. Letter to the editor. Statistics in Medicine. 2007;26:3046–3056.
12. Korn EL, Freidlin B. Outcome-adaptive randomization: Is it useful? Journal of Clinical Oncology. 2011;29:771–776. doi: 10.1200/JCO.2010.31.1423.
13. Yuan Y, Yin G. On the usefulness of outcome-adaptive randomization. Journal of Clinical Oncology. 2011;29:390–392. doi: 10.1200/JCO.2010.34.5330.
14. Lee JJ, Chen N, Yin G. Worth adapting? Revisiting the usefulness of outcome-adaptive randomization. Clinical Cancer Research. 2012;18:4498–4507. doi: 10.1158/1078-0432.CCR-11-2555.
15. Rosenberger WF, Sverdlov O, Hu F. Adaptive randomization for clinical trials. Journal of Biopharmaceutical Statistics. 2012;22(4):719–736. doi: 10.1080/10543406.2012.676535.
16. Buyse M. Commentary on Hey and Kimmelman. Clinical Trials. 2015;12:119–121. doi: 10.1177/1740774515568916.
17. Lee JJ. Commentary on Hey and Kimmelman. Clinical Trials. 2015;12:110–112. doi: 10.1177/1740774514568875.
18. Hey SP, Kimmelman J. Are outcome-adaptive allocation trials ethical? Clinical Trials. 2015;12:102–106. doi: 10.1177/1740774514563583.
19. Berry DA. Adaptive clinical trials: The promise and the caution. Journal of Clinical Oncology. 2011;29:606–609. doi: 10.1200/JCO.2010.32.2685.
20. Thall PF, Fox PS, Wathen JK. Statistical controversies in medical research: scientific and ethical problems with adaptive randomization in comparative clinical trials. Annals of Oncology. 2015;26:1621–1628. doi: 10.1093/annonc/mdv238.
21. Rosenberger WF, Lachin JM. The use of response-adaptive designs in clinical trials. Controlled Clinical Trials. 1993;14:471–484. doi: 10.1016/0197-2456(93)90028-c.
22. Karrison TG, Huo D, Chappell R. A group sequential, response-adaptive design for randomized clinical trials. Controlled Clinical Trials. 2003;24:506–522. doi: 10.1016/s0197-2456(03)00092-8.
23. Thall PF, Wathen JK. Practical Bayesian adaptive randomization in clinical trials. European Journal of Cancer. 2007;43:860–867. doi: 10.1016/j.ejca.2007.01.006.
24. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
25. Berry SM, Connor JT, Lewis RJ. The platform trial: An efficient strategy for evaluating multiple treatments. Journal of the American Medical Association. 2015;313:1619–1620. doi: 10.1001/jama.2015.2316.
