Author manuscript; available in PMC: 2024 Sep 3.
Published in final edited form as: J Biopharm Stat. 2023 Feb 3;33(5):575–585. doi: 10.1080/10543406.2023.2170399

Response adaptive randomization design for a two-stage study with binary response

Guogen Shan 1,*
PMCID: PMC10397367  NIHMSID: NIHMS1871988  PMID: 36735855

Abstract

Response adaptive randomization has the potential to assign more participants to the better treatments in a trial, to the benefit of participants. We propose optimal response adaptive randomization designs for a two-stage study with binary response, having the smallest expected sample size or the fewest expected number of failures. Equal randomization is used in the first stage, and data from the first stage are used to determine the adaptive sample size ratio in the second stage. In the proposed optimal designs, the type I error rate and the statistical power are calculated from the asymptotic normal distributions. Compared with the existing optimal randomized designs, the new designs that minimize the expected number of failures substantially reduce the number of failures.

Keywords: Binary response, Optimal two-stage design, Parallel design, Response adaptive randomization, Sample size

1. Introduction

For an early phase trial with binary outcome (e.g., response vs. non-response), Simon’s optimal two-stage designs are frequently used among the single-arm designs to evaluate the activity of a new treatment as compared to the historical response rate [1, 2]. When a single-arm clinical trial is not appropriate because the response rate from historical data is unreliable, a randomized clinical trial is recommended to compare new treatments with the standard care. As compared to a single-arm study, a randomized clinical trial generally requires more participants to test the same effect size, but it is able to reduce potential biases from a single-arm study [3, 4, 5].

Thall et al. [6] were among the first to develop an optimal two-arm two-stage design for binary response having the smallest expected sample size. The Z test statistic with the pooled variance estimate was used to determine the threshold values for the first stage and the two stages combined.

A two-stage design provides the flexibility to stop a trial early for futility after the first stage, which protects participants from being treated by an inferior treatment. If a trial proceeds to the second stage with sufficient treatment activity from the first stage, a final decision will be made at the end of the second stage to move the trial to the next phase or not. Thall et al. [6] found that the maximum possible sample size of a two-stage design could be slightly more than that of the conventional randomized design. However, the expected sample size savings are substantial.

To improve the effectiveness of clinical studies, a response adaptive randomization (RAR) approach could be utilized to assign more participants to better treatment(s) in the second stage based on the observed results from the first stage [7]. In a two-stage design, the first stage data are used for two purposes: (1) to decide on futility stopping; and (2) as a burn-in period to estimate the sample size ratio for the second stage. Because of the potential sample size ratio change in the second stage, RAR designs are more likely to have unbalanced sample sizes across the treatment groups as compared to trials with equal randomization. For that reason, the statistical power of a RAR design may be lower than that of a study with equal randomization given the total sample size. However, a RAR design can reduce the number of failures in a trial to benefit participants. In addition, the use of RAR in trials has the potential to increase recruitment as participants are informed that they have a higher chance to be treated by one of the best treatments [8].

In this article, we propose to develop new optimal randomized two-stage RAR designs by using the two optimal sample size allocation formulas for binary response [7]. Rosenberger et al. [9] derived the optimal allocation formula for a study to minimize the total number of failures based on a theorem from Melfi et al. [10]. The second optimal allocation rule is from the play-the-winner design given the total number of participants [11, 12]. This optimal allocation is derived from the asymptotic theory of Markov chains. In addition to these two optimal allocation rules, the Neyman allocation [7] may be used to minimize the total sample size when the sample size variances can be considered as fixed. Under this allocation rule, more participants are going to be assigned to an inferior treatment group when the sum of response rates from all groups is more than 100% [9]. Because such an allocation is ethically undesirable, we do not include the Neyman allocation rule in this article for comparison.

The rest of the article is organized as follows. In Section 2, we introduce the optimal RAR designs for a two-stage study with binary response. In Section 3, we compare the performance of the proposed RAR designs with the existing optimal two-stage design with regard to the expected sample size and the number of failures, and we use a clinical trial that was designed by using the existing optimal two-stage design to illustrate the application of the proposed optimal RAR designs. Lastly, we provide some comments and suggestions in Section 4.

2. Methods

In a randomized parallel trial with binary response, suppose pE and pC are the response rates for the treatment group and the control group, respectively. The proportion difference between the two groups, Δ = pE − pC, is the parameter of interest. We assume that a treatment with a high response rate is preferable. Then, the statistical hypothesis is presented as

$$H_0: \Delta \le 0 \quad \text{against} \quad H_a: \Delta > 0. \tag{1}$$

In this article, we are going to develop optimal two-stage designs with the possibility of futility stopping after the first stage due to ineffectiveness observed from the new treatment. Suppose $x_{Ei}$ and $x_{Ci}$ are the observed numbers of responders from the treatment group and the control group in the i-th stage, i = 1, 2. In the first stage, participants are equally randomized to one of the two treatments with the sample size of $n_1$ in each group. Then, the total sample size in the first stage is $N_1 = 2n_1$. The first stage response rate is estimated as $\hat{p}_{E1} = x_{E1}/n_1$ in the treatment group, and $\hat{p}_{C1} = x_{C1}/n_1$ in the control group. Thus, the proportion difference in the first stage is $\hat{\Delta}_1 = \hat{p}_{E1} - \hat{p}_{C1} = x_{E1}/n_1 - x_{C1}/n_1$. Following Thall et al. [6], we use a two-proportion Z test with the pooled variance estimate to compare the response rate between the treatment group and the control group in the first stage:

$$Z_1 = \frac{\hat{\Delta}_1}{\sqrt{2\hat{p}_1(1 - \hat{p}_1)/n_1}}, \tag{2}$$

where $\hat{p}_1 = (x_{E1} + x_{C1})/N_1$ is the overall response rate of the two groups from the first stage. If Z1 does not exceed the pre-specified threshold value r1 (that is, Z1 ≤ r1), the study is stopped for futility in the first stage. Otherwise, the trial proceeds to the second stage with an additional N2 participants. The sample size allocation in the second stage is determined by using the observed responses from the first stage.
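The first-stage statistic in Equation (2) and the futility check can be sketched in a few lines of Python. This is an illustration only, not the author's software: the function names are hypothetical, and returning 0 when the pooled variance is zero is an assumption, since the text does not say how that degenerate case is handled.

```python
import math


def stage1_z(x_e1: int, x_c1: int, n1: int) -> float:
    """Pooled-variance two-proportion Z statistic for stage 1 (Equation 2)."""
    p_e1 = x_e1 / n1
    p_c1 = x_c1 / n1
    p1 = (x_e1 + x_c1) / (2 * n1)      # overall response rate, N1 = 2 * n1
    var = 2 * p1 * (1 - p1) / n1       # pooled variance of the difference
    if var == 0:
        # Assumption: with no or all responders there is no evidence either way.
        return 0.0
    return (p_e1 - p_c1) / math.sqrt(var)


def continue_to_stage2(z1: float, r1: float) -> bool:
    """The trial continues only when Z1 exceeds the futility threshold r1."""
    return z1 > r1


# Example: 18/45 responders on treatment versus 9/45 on control.
z1 = stage1_z(18, 9, 45)
proceed = continue_to_stage2(z1, r1=0.51)
```

With these illustrative counts, the observed difference of 20 percentage points gives a Z1 of roughly 2.07, so the trial would proceed to the second stage.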

In a two-stage design with equal randomization, the proportion of the sample size for the treatment group remains a constant across the two stages. In this article, we propose to develop optimal RAR designs that allow the sample size ratio in the second stage to be adjusted according to the outcome from the first stage. Suppose ρ is the probability of assigning participants to the treatment group in the second stage. The following two RAR procedures are considered:

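$$\text{RAR1}: \quad \rho = \frac{\sqrt{p_{E1}}}{\sqrt{p_{E1}} + \sqrt{p_{C1}}}$$

and

$$\text{RAR2}: \quad \rho = \frac{q_{C1}}{q_{E1} + q_{C1}},$$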

where qG1 = 1 – pG1 is the non-response rate in the first stage for the group G=E or C. The RAR1 procedure is the optimal allocation formula to minimize the total number of failures [9, 10], while the RAR2 procedure is the optimal allocation derived from the play-the-winner design based on the asymptotic theory of Markov chains [11]. It can be seen that ρ is an increasing function of the estimated first stage response rate of the treatment group pE1 when pC1 is fixed in each RAR procedure. When a new investigated treatment has a higher response rate than the control, more patients will be assigned to the treatment group in the second stage to benefit more participants in a trial.
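The two allocation rules can be written as small functions. The sketch below assumes that RAR1 is the square-root allocation of Rosenberger et al. [9] and that RAR2 is the play-the-winner limit described above. The function names and the fallback of 0.5 when a denominator is zero are illustrative assumptions.

```python
import math


def rar1(p_e1: float, p_c1: float) -> float:
    """Assumed square-root rule: sqrt(pE1) / (sqrt(pE1) + sqrt(pC1))."""
    s_e, s_c = math.sqrt(p_e1), math.sqrt(p_c1)
    if s_e + s_c == 0:
        return 0.5  # assumption: equal allocation when no responses are seen
    return s_e / (s_e + s_c)


def rar2(p_e1: float, p_c1: float) -> float:
    """Play-the-winner rule: qC1 / (qE1 + qC1), with q = 1 - p."""
    q_e, q_c = 1 - p_e1, 1 - p_c1
    if q_e + q_c == 0:
        return 0.5  # assumption: equal allocation when everyone responds
    return q_c / (q_e + q_c)


# Both rules favor the treatment arm when its first-stage rate is higher.
rho1 = rar1(0.35, 0.20)
rho2 = rar2(0.35, 0.20)
```

Both functions return 0.5 when the two estimated rates are equal and move above 0.5 as the treatment rate increases with the control rate held fixed.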

The sample size ratio ρ depends on the outcome from the first stage. We divide the Z1 values into equal sub-ranges. The lower bound of Z1 is the first stage threshold value r1, and its upper bound value (say, r1u) should be large enough such that the probability above r1u in the standard normal distribution is very small (e.g., r1u = 6 in the simulation studies). Suppose we equally divide the range (r1, r1u) into K complementary sub-ranges:

$$R_k = \big(r_1 + (k-1)\tau,\; r_1 + k\tau\big),$$

where τ = (r1u − r1)/K, r1u = r1 + Kτ, and k = 1, 2, 3, …, K. The value of K is large (e.g., 1000). A large value of K is needed to improve the accuracy of the type I error rate and statistical power calculations.

For each sub-range Rk, the corresponding sample size ratio ρk is computed as a simple average of the estimated $\hat{\rho}$ based on all possible outcomes whose test statistics fall in Rk. These possible outcomes are identified through complete enumeration given the sample sizes of n1 in each group with the sample space size of (n1 + 1) × (n1 + 1). In the case with no data points in Rk, we will use ρk = ρk−1. Once the estimate of ρk is calculated, the second stage sample sizes for the treatment group and the control group are nE2 = ρkN2 and nC2 = (1 − ρk)N2, respectively. Then, the total sample sizes for the treatment group and the control group from both stages are: NE = n1 + nE2 and NC = n1 + nC2.
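The enumeration described above can be sketched as follows. The code loops over all (n1 + 1) × (n1 + 1) first-stage outcomes, bins each Z1 into its sub-range, and averages the allocation probability within each bin. The names are hypothetical. Using 0.5 when the very first sub-range is empty is an assumption, because the carry-forward rule ρk = ρk−1 has no predecessor there.

```python
import math


def play_the_winner(p_e1: float, p_c1: float) -> float:
    """RAR2 rule, used here only as an example allocation function."""
    q_e, q_c = 1 - p_e1, 1 - p_c1
    return q_c / (q_e + q_c) if q_e + q_c > 0 else 0.5


def rho_by_subrange(n1, r1, r1u, K, alloc):
    """Average allocation probability for each of the K sub-ranges of Z1."""
    tau = (r1u - r1) / K
    sums, counts = [0.0] * K, [0] * K
    for x_e in range(n1 + 1):
        for x_c in range(n1 + 1):
            p1 = (x_e + x_c) / (2 * n1)
            var = 2 * p1 * (1 - p1) / n1
            if var == 0:
                continue                      # Z1 undefined; outcome skipped
            z1 = (x_e - x_c) / n1 / math.sqrt(var)
            if not (r1 < z1 < r1u):
                continue                      # futility region or beyond r1u
            k = min(int((z1 - r1) / tau), K - 1)
            sums[k] += alloc(x_e / n1, x_c / n1)
            counts[k] += 1
    rho, previous = [], 0.5                   # assumption: 0.5 before any data
    for k in range(K):
        if counts[k] > 0:
            previous = sums[k] / counts[k]
        rho.append(previous)                  # empty bin: carry rho_{k-1}
    return rho


rho = rho_by_subrange(n1=45, r1=0.5, r1u=6.0, K=50, alloc=play_the_winner)
```

The second-stage group sizes then follow as nE2 = ρk N2 and nC2 = N2 − nE2. How fractional sizes are rounded is not stated in the text, so any rounding rule would be a further assumption.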

Let $\hat{p}_{G2} = x_{G2}/n_{G2}$ be the estimated response rate of group G in the second stage, G = E, C. Their proportion difference in the second stage may be estimated as $\hat{\Delta}_2 = \hat{p}_{E2} - \hat{p}_{C2}$. Let $\hat{p}_2 = (\hat{p}_{E2} n_{E2} + \hat{p}_{C2} n_{C2})/N_2$ be the overall response rate in the second stage. The test statistic for the second stage data is:

$$Z_2 = \frac{\hat{\Delta}_2}{\sqrt{\hat{p}_2(1 - \hat{p}_2)\left(\dfrac{1}{n_{E2}} + \dfrac{1}{n_{C2}}\right)}}.$$

When a trial proceeds to the second stage with additional N2 participants, the final statistic, Zf, is calculated as a linear combination of Z1 and Z2:

$$Z_f = \sqrt{w}\, Z_1 + \sqrt{1 - w}\, Z_2, \tag{3}$$

where w = N1/(N1 + N2) is the proportion of the total sample size that is enrolled in the first stage. The calculated Zf is then compared to the pre-specified threshold value r to determine whether the new treatment is effective or not.
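A minimal sketch of the second-stage statistic and the combination in Equation (3) is given below. The function names are hypothetical, and returning 0 when the pooled variance vanishes is an assumption.

```python
import math


def stage2_z(x_e2: int, n_e2: int, x_c2: int, n_c2: int) -> float:
    """Pooled-variance Z statistic for stage 2 with unequal group sizes."""
    p2 = (x_e2 + x_c2) / (n_e2 + n_c2)          # overall stage-2 response rate
    var = p2 * (1 - p2) * (1 / n_e2 + 1 / n_c2)
    if var == 0:
        return 0.0                               # assumption for degenerate data
    return (x_e2 / n_e2 - x_c2 / n_c2) / math.sqrt(var)


def final_z(z1: float, z2: float, n1_total: int, n2_total: int) -> float:
    """Weighted combination of the stage-wise statistics (Equation 3)."""
    w = n1_total / (n1_total + n2_total)
    return math.sqrt(w) * z1 + math.sqrt(1 - w) * z2


# Example: stage 2 with 100 treatment and 70 control participants.
z2 = stage2_z(40, 100, 20, 70)
zf = final_z(z1=1.2, z2=z2, n1_total=90, n2_total=170)
```

Because the weights are the square roots of w and 1 − w, their squares sum to one. This is why Zf keeps unit variance under the null hypothesis when Z1 and Z2 are treated as independent standard normal variables.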

2.1. Type I error rate

A two-stage design needs to determine sample sizes for each stage (N1 and N2), and threshold values for Z1 and Zf (r1 and r). The second stage sample size ratio ρ will be determined from the first stage outcome. With a one-sided hypothesis in Equation (1), the tail area of the hypothesis testing is

$$\Omega = \{(Z_1, Z_f) \mid Z_1 > r_1 \ \text{and} \ Z_f > r\}.$$

Under the null hypothesis, both Z1 and Zf follow the standard normal distribution asymptotically. Then, the type I error (TIE) rate is calculated as

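$$
\begin{aligned}
\mathrm{TIE} &= P\big((Z_1, Z_f) \in \Omega \mid H_0\big) = P\big(Z_1 > r_1,\ \sqrt{w}\, Z_1 + \sqrt{1-w}\, Z_2 > r \mid H_0\big) \\
&= \int_{r_1}^{+\infty} \phi(x)\, P\!\left(Z_2 > \frac{r - \sqrt{w}\, x}{\sqrt{1-w}} \,\middle|\, H_0\right) dx \\
&= \int_{r_1}^{+\infty} \phi(x) \left[1 - \Phi\!\left(\frac{r - x\sqrt{w}}{\sqrt{1-w}}\right)\right] dx,
\end{aligned}
\tag{4}
$$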

where ϕ and Φ are the probability density function and the cumulative distribution function of the standard normal distribution.
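Equation (4) is a one-dimensional integral and can be evaluated numerically. The sketch below uses a plain midpoint rule and only the Python standard library. The truncation point `upper` and the number of steps are illustrative choices, not values taken from the article.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def type1_error(r1: float, r: float, w: float,
                upper: float = 8.0, steps: int = 4000) -> float:
    """Midpoint-rule approximation of the type I error rate in Equation (4)."""
    h = (upper - r1) / steps
    total = 0.0
    for i in range(steps):
        x = r1 + (i + 0.5) * h                                  # Z1 = x
        tail = 1 - STD.cdf((r - math.sqrt(w) * x) / math.sqrt(1 - w))
        total += STD.pdf(x) * tail                              # P(Zf > r | Z1 = x)
    return total * h


# Example: N1 = 90, N2 = 170, r1 = 0.510, r = 1.520.
tie = type1_error(r1=0.510, r=1.520, w=90 / (90 + 170))
```

A simple sanity check is to remove the futility boundary by setting r1 to a very negative value. The result should then reduce to the single-stage tail probability 1 − Φ(r).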

2.2. Statistical power

In the statistical power calculation, we follow Thall et al. [13] to derive the equivalent threshold values from the normal distribution under the alternative hypothesis. Under the alternative hypothesis, Z1 follows a normal distribution

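$$N(\mu_{1a}, \sigma_{1a}^2) = N\!\left(\frac{p_E - p_C}{\sigma_{1,\text{pooled}}},\ \frac{\sigma_{1,\text{unpooled}}^2}{\sigma_{1,\text{pooled}}^2}\right),$$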

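where $\sigma_{1,\text{pooled}}^2 = \dfrac{p_E + p_C}{2}\left(1 - \dfrac{p_E + p_C}{2}\right)\left(\dfrac{1}{n_{C1}} + \dfrac{1}{n_{E1}}\right)$, $\sigma_{1,\text{unpooled}}^2 = \dfrac{p_C(1 - p_C)}{n_{C1}} + \dfrac{p_E(1 - p_E)}{n_{E1}}$, and $n_{E1} = n_{C1} = n_1 = N_1/2$. Similarly, Z2 follows a normal distribution under the alternative hypothesis as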

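$$N(\mu_{2a}, \sigma_{2a}^2) = N\!\left(\frac{p_E - p_C}{\sigma_{2,\text{pooled}}},\ \frac{\sigma_{2,\text{unpooled}}^2}{\sigma_{2,\text{pooled}}^2}\right),$$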

where σ2,pooled and σ2,unpooled are calculated by using nE2 and nC2 to replace nE1 and nC1 in σ1,pooled and σ1,unpooled.
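The mean and standard deviation of a stage-wise Z statistic under the alternative hypothesis follow directly from the pooled and unpooled variances defined above. The helper below is an illustrative sketch with a hypothetical name. It applies to either stage by passing that stage's group sizes.

```python
import math


def alt_params(p_e: float, p_c: float, n_e: float, n_c: float):
    """Return (mu_a, sigma_a) of the Z statistic under the alternative."""
    p_bar = (p_e + p_c) / 2
    var_pooled = p_bar * (1 - p_bar) * (1 / n_c + 1 / n_e)
    var_unpooled = p_c * (1 - p_c) / n_c + p_e * (1 - p_e) / n_e
    mu_a = (p_e - p_c) / math.sqrt(var_pooled)
    sigma_a = math.sqrt(var_unpooled / var_pooled)
    return mu_a, sigma_a


# Stage 1 with 45 participants per group, pE = 0.35 and pC = 0.20.
mu_1a, sigma_1a = alt_params(0.35, 0.20, 45, 45)
# Stage 2 with an unbalanced split of 170 participants.
mu_2a, sigma_2a = alt_params(0.35, 0.20, 97, 73)
```

When pE = pC and the groups are balanced, the function returns a mean of 0 and a standard deviation of 1, which recovers the standard normal distribution used under the null hypothesis.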

It should be noted that Z1 and Z2 may be correlated as the sample size ratio in Z2 is determined by the first stage outcome (Z1). A linear combination of Z1 and Z2 follows a normal distribution, and the distribution for Zf under the alternative hypothesis may be approximated by

$$N(\mu_{fa}, \sigma_{fa}^2) \approx N\!\left(\sqrt{w}\,\mu_{1a} + \sqrt{1-w}\,\mu_{2a},\ w\sigma_{1a}^2 + (1-w)\sigma_{2a}^2\right).$$

The distributions of Z1 and Zf under the alternative hypothesis are used in calculating the statistical power. When Z1 falls in one of the sub-ranges Rk, the associated threshold value under the alternative hypothesis is r1a(k):

$$P\big(Z_1 > r_1 + (k-1)\tau \mid H_a\big) = P\big(X > r_{1a}(k)\big),$$

where X follows the standard normal distribution, $r_{1a}(k) = [r_1 + (k-1)\tau - \mu_{1a}]/\sigma_{1a}$, and k = 1, …, K. For Zf with r as the threshold value in P(Zf > r|Ha), the associated threshold value is $r_a = (r - \mu_{fa})/\sigma_{fa}$. Then, the statistical power of a two-stage design may be approximated by

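$$
\begin{aligned}
\mathrm{Power} &= P\big((Z_1, Z_f) \in \Omega \mid H_a\big) = \sum_{k=1}^{K} P\big(Z_1 \in R_k,\ Z_f > r \mid H_a\big) \\
&\approx \sum_{k=1}^{K} \int_{r_{1a}(k)}^{r_{1a}(k+1)} \phi(x) \left[1 - \Phi\!\left(\frac{r_a - x\sqrt{w}}{\sqrt{1-w}}\right)\right] dx.
\end{aligned}
\tag{5}
$$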
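Equation (5) can be evaluated by summing a short numerical integral over each sub-range. In the sketch below, `rho` is a list holding one allocation probability per sub-range. Each entry must lie strictly between 0 and 1, and the length of the list plays the role of K. The function names, the midpoint rule, and the number of integration points `m` are illustrative assumptions rather than the author's implementation.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def alt_params(p_e, p_c, n_e, n_c):
    """(mu_a, sigma_a) of a stage-wise Z statistic under the alternative."""
    p_bar = (p_e + p_c) / 2
    var_pooled = p_bar * (1 - p_bar) * (1 / n_c + 1 / n_e)
    var_unpooled = p_c * (1 - p_c) / n_c + p_e * (1 - p_e) / n_e
    return (p_e - p_c) / math.sqrt(var_pooled), math.sqrt(var_unpooled / var_pooled)


def approx_power(p_e, p_c, n1_total, n2_total, r1, r, rho, r1u=6.0, m=8):
    """Approximate power from Equation (5) using a midpoint rule per sub-range."""
    K = len(rho)
    tau = (r1u - r1) / K
    w = n1_total / (n1_total + n2_total)
    mu1, s1 = alt_params(p_e, p_c, n1_total / 2, n1_total / 2)
    power = 0.0
    for k in range(K):                        # k = 0 corresponds to R_1
        n_e2 = rho[k] * n2_total
        n_c2 = n2_total - n_e2
        mu2, s2 = alt_params(p_e, p_c, n_e2, n_c2)
        mu_f = math.sqrt(w) * mu1 + math.sqrt(1 - w) * mu2
        s_f = math.sqrt(w * s1 ** 2 + (1 - w) * s2 ** 2)
        ra = (r - mu_f) / s_f                 # standardized final threshold
        lo = (r1 + k * tau - mu1) / s1        # r_1a(k)
        hi = (r1 + (k + 1) * tau - mu1) / s1  # r_1a(k + 1)
        h = (hi - lo) / m
        for i in range(m):
            x = lo + (i + 0.5) * h
            tail = 1 - STD.cdf((ra - math.sqrt(w) * x) / math.sqrt(1 - w))
            power += STD.pdf(x) * tail * h
    return power


# Equal allocation in every sub-range, with N1 = 86, N2 = 174 from Table 1.
power = approx_power(0.35, 0.20, 86, 174, r1=0.475, r=1.520, rho=[0.5] * 200)
```

Setting every entry of `rho` to 0.5 corresponds to equal randomization in the second stage, which is a convenient way to check the calculation before plugging in the sub-range averages of a RAR procedure.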

2.3. Optimal design search

Two objective functions are considered here for the optimal designs. The expected sample size (ESS) under the null hypothesis is frequently used as the criterion for an optimal design:

ESS=N1+N2×P(Z1>r1).

The second objective function is the expected number of non-responders (ENR), which is also known as the expected number of failures,

$$\mathrm{ENR} = (q_E + q_C) N_1 / 2 + \sum_{k=1}^{K} P(Z_1 \in R_k \mid H_a) \left[ q_E N_2 \rho_k + q_C N_2 (1 - \rho_k) \right].$$
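Both objective functions are straightforward to compute once the normal approximations are in place. The sketch below uses hypothetical function names and takes the list `rho` of sub-range allocation probabilities as an input, with its length playing the role of K. The upper bound `r1u = 6` follows the value quoted for the simulation studies.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def expected_sample_size(n1_total: int, n2_total: int, r1: float) -> float:
    """ESS under the null hypothesis: N1 + N2 * P(Z1 > r1)."""
    return n1_total + n2_total * (1 - STD.cdf(r1))


def expected_non_responders(p_e, p_c, n1_total, n2_total, r1, rho, r1u=6.0):
    """ENR under the alternative, summed over the sub-ranges of Z1."""
    q_e, q_c = 1 - p_e, 1 - p_c
    n1 = n1_total / 2
    p_bar = (p_e + p_c) / 2
    var_pooled = p_bar * (1 - p_bar) * 2 / n1
    var_unpooled = (p_c * (1 - p_c) + p_e * (1 - p_e)) / n1
    mu1 = (p_e - p_c) / math.sqrt(var_pooled)
    s1 = math.sqrt(var_unpooled / var_pooled)
    K = len(rho)
    tau = (r1u - r1) / K
    enr = (q_e + q_c) * n1_total / 2          # first-stage failures
    for k in range(K):
        lo = (r1 + k * tau - mu1) / s1
        hi = (r1 + (k + 1) * tau - mu1) / s1
        prob = STD.cdf(hi) - STD.cdf(lo)      # P(Z1 in R_k | Ha)
        enr += prob * n2_total * (q_e * rho[k] + q_c * (1 - rho[k]))
    return enr


# Thall design row of Table 1 (pC = 20%, pE = 35%), equal allocation.
ess = expected_sample_size(86, 174, 0.475)
enr = expected_non_responders(0.35, 0.20, 86, 174, 0.475, rho=[0.5] * 100)
```

For this configuration the two values should land close to the ESS of 141.2 and the ENR of 171.3 reported for the Thall design in Table 1. That agreement suggests the tabulated quantities are based on the same normal approximations.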

The two constraints in the design search are the nominal type I error rate and the statistical power:

$$\mathrm{TIE} \le \alpha, \qquad \mathrm{Power} \ge 1 - \beta,$$

where α and β are the pre-specified type I and II error rates, respectively. We conduct the design search with the first stage and the second stage sample sizes from the design by Thall et al. [6] as the initial values (N10 and N20). For each pair of sample sizes N1 and N2 (e.g., N1 ∈ [N10−15, N10+15], N2 ∈ [N20−15, N20+15]), we search over all possible designs with r1 from 0.23 to 0.75 by 0.005, and r from 1.50 to 1.75 by 0.005. Among the designs that meet both the TIE and the statistical power constraints, the one with the smallest ESS is the optimal ESS design, and the one with the smallest ENR is the optimal ENR design. We search for the optimal ESS and ENR designs for each RAR procedure, giving a total of 4 optimal designs (referred to as the RAR1-ESS design, the RAR2-ESS design, the RAR1-ENR design, and the RAR2-ENR design).
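The search itself is an exhaustive grid search. The skeleton below only illustrates its structure. The callables `tie_fn`, `power_fn`, and `objective_fn` are hypothetical placeholders for implementations of Equations (4) and (5) and of the ESS or ENR, and the toy functions in the usage lines exist only to show the calling convention.

```python
from itertools import product


def grid(start: float, stop: float, step: float):
    """Evenly spaced grid that includes both end points."""
    n = round((stop - start) / step)
    return [round(start + i * step, 3) for i in range(n + 1)]


def search(n1_values, n2_values, tie_fn, power_fn, objective_fn,
           alpha=0.05, beta=0.20):
    """Return (objective, N1, N2, r1, r) of the best feasible design, or None."""
    best = None
    r1_grid = grid(0.23, 0.75, 0.005)
    r_grid = grid(1.50, 1.75, 0.005)
    for n1, n2, r1, r in product(n1_values, n2_values, r1_grid, r_grid):
        if tie_fn(n1, n2, r1, r) > alpha:
            continue                          # violates the type I error bound
        if power_fn(n1, n2, r1, r) < 1 - beta:
            continue                          # violates the power requirement
        value = objective_fn(n1, n2, r1, r)
        if best is None or value < best[0]:
            best = (value, n1, n2, r1, r)
    return best


# Toy usage with placeholder functions (every design is treated as feasible).
best = search([10, 12], [20],
              tie_fn=lambda n1, n2, r1, r: 0.0,
              power_fn=lambda n1, n2, r1, r: 1.0,
              objective_fn=lambda n1, n2, r1, r: n1 + n2 + r1)
```

With the step size of 0.005, the two threshold grids contain 105 and 51 values, so each (N1, N2) pair requires 5,355 evaluations of the constraints.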

3. Results

We compare the performance of the proposed RAR two-stage designs with the existing optimal two-arm two-stage design by Thall et al. [6] (referred to as the Thall design).

3.1. Simulation study

Suppose a study is designed to achieve 80% or 90% power at the significance level of α = 0.05. The proportion difference between the treatment group and the control group is Δ = pE − pC = 15% or 20%. The response rate of the control group pC is from 20% to 70%. These configurations were studied by Thall et al. [6].

It should be noted that the Thall design was developed under the criteria with the smallest average of ESS under H0 and ESS under Ha. For that reason, the computed ESS in this article may not be exactly the same as their results, although the difference is often small. From the optimal Thall designs, the identified threshold value r1 for the first stage is near 0.3 and r for both stages is close to 1.6. For that reason, we search the r1 value from 0.23 to 0.75 with an increment of 0.005, and r from 1.50 to 1.75 by an increment of 0.005. These ranges and the parameter precision value can be easily adjusted in the software to provide finer designs.

We present the ESS and ENR comparison between the five designs for a study to achieve 80% statistical power in Table 1 when Δ = 15% and in Table 2 when Δ = 20%. The proposed optimal RAR designs with the smallest ESS often have an ESS similar to that of the Thall design, although the new designs have slightly larger sample sizes. Meanwhile, the proposed RAR1-ENR and RAR2-ENR designs can reduce the expected number of failures as compared to the Thall design with the average savings of 9.97% (with the range from 8.85% to 11.68%).

Table 1:

Optimal two-stage response adaptive randomization designs under the criteria of the expected sample size or the expected number of non-responders, as compared to the Thall design when Δ = pE − pC = 15%, α = 5%, and 80% power.

| Criteria | RAR | N1 | N2 | ESS | ENR | r1 | r |
|---|---|---|---|---|---|---|---|
| pC = 20%, pE = 35% | | | | | | | |
| The Thall design | | 86 | 174 | 141.2 | 171.3 | 0.475 | 1.520 |
| ESS | RAR1 | 90 | 170 | 141.9 | 170.2 | 0.510 | 1.520 |
| ESS | RAR2 | 96 | 170 | 142.9 | 172.6 | 0.595 | 1.505 |
| ENR | RAR1 | 132 | 88 | 164.5 | 155.3 | 0.335 | 1.630 |
| ENR | RAR2 | 140 | 80 | 170.1 | 155.6 | 0.315 | 1.635 |
| pC = 40%, pE = 55% | | | | | | | |
| The Thall design | | 114 | 210 | 177.3 | 155.0 | 0.520 | 1.520 |
| ESS | RAR1 | 112 | 210 | 178.3 | 153.0 | 0.480 | 1.530 |
| ESS | RAR2 | 118 | 206 | 179.8 | 153.0 | 0.525 | 1.525 |
| ENR | RAR1 | 160 | 116 | 205.2 | 140.9 | 0.280 | 1.630 |
| ENR | RAR2 | 170 | 108 | 206.4 | 141.0 | 0.420 | 1.625 |
| pC = 60%, pE = 75% | | | | | | | |
| The Thall design | | 98 | 190 | 155.6 | 85.0 | 0.515 | 1.515 |
| ESS | RAR1 | 102 | 184 | 156.8 | 83.3 | 0.530 | 1.520 |
| ESS | RAR2 | 106 | 184 | 158.6 | 83.4 | 0.565 | 1.515 |
| ENR | RAR1 | 136 | 108 | 177.9 | 76.4 | 0.285 | 1.625 |
| ENR | RAR2 | 140 | 106 | 179.7 | 76.4 | 0.320 | 1.625 |

Table 2:

Optimal two-stage response adaptive randomization designs under the criteria of the expected sample size or the expected number of non-responders, as compared to the Thall design when Δ = pE − pC = 20%, α = 5%, and 80% power.

| Criteria | RAR | N1 | N2 | ESS | ENR | r1 | r |
|---|---|---|---|---|---|---|---|
| pC = 30%, pE = 50% | | | | | | | |
| The Thall design | | 60 | 114 | 95.0 | 95.1 | 0.505 | 1.520 |
| ESS | RAR1 | 58 | 122 | 95.9 | 96.0 | 0.495 | 1.510 |
| ESS | RAR2 | 62 | 120 | 96.9 | 96.1 | 0.550 | 1.505 |
| ENR | RAR1 | 102 | 46 | 115.5 | 86.4 | 0.545 | 1.630 |
| ENR | RAR2 | 96 | 54 | 112.6 | 86.6 | 0.505 | 1.625 |
| pC = 50%, pE = 70% | | | | | | | |
| The Thall design | | 60 | 114 | 95.0 | 63.4 | 0.505 | 1.520 |
| ESS | RAR1 | 62 | 114 | 96.2 | 62.6 | 0.525 | 1.520 |
| ESS | RAR2 | 64 | 120 | 97.5 | 63.8 | 0.585 | 1.500 |
| ENR | RAR1 | 96 | 52 | 115.0 | 57.5 | 0.345 | 1.635 |
| ENR | RAR2 | 96 | 54 | 114.2 | 57.5 | 0.420 | 1.630 |
| pC = 70%, pE = 90% | | | | | | | |
| The Thall design | | 38 | 78 | 62.6 | 21.1 | 0.480 | 1.520 |
| ESS | RAR1 | 36 | 84 | 64.5 | 20.3 | 0.415 | 1.520 |
| ESS | RAR2 | 44 | 78 | 66.0 | 20.0 | 0.575 | 1.510 |
| ENR | RAR1 | 48 | 54 | 69.5 | 18.7 | 0.260 | 1.610 |
| ENR | RAR2 | 54 | 50 | 71.7 | 18.6 | 0.375 | 1.610 |

The Thall design was developed under the criterion of the smallest ESS. When that criterion is used, the identified optimal design often has a relatively small first stage sample size. For the studied configurations, the proportion of the first stage sample size in the Thall design ranges from 32.8% to 35.2% with the average proportion of 33.9%. A similar first stage sample size proportion is observed for the proposed RAR designs under the ESS criterion. However, for the proposed optimal RAR designs under the ENR criterion, that trend is reversed, with the average proportion of the first stage sample size being 59.5% (between 47.1% and 68.9%). The proposed RAR designs under the ENR criterion allocate slightly more participants to the first stage than to the second stage. With more participants in the first stage, more information is collected at this stage to make reliable statistical inference.

In Figure 1, we present the ESS and ENR comparison between the Thall design and the proposed RAR designs when the nominal statistical power is 90%. In general, the estimated ENR is a decreasing function of pC. The relationship between the estimated ESS and pC resembles a bell-shaped curve, with the maximum ESS occurring when pC and pE are close to 50%. The designs under the ESS criterion are often similar to each other with regard to the ESS and the ENR. As compared to the Thall design, the savings in the expected number of failures from the proposed RAR designs under the ENR criterion go up as pC gets smaller, and the savings are substantial when pC is more than 50%.

Figure 1:

The expected sample size and the expected number of non-responders using the proposed four RAR designs and the Thall design, as a function of the response rate of the control group pC, when Δ = 15% (top) and Δ = 20% (bottom) for a study to achieve 90% power at the significance level of 0.05.

We also compare the two proposed RAR designs under the ENR criterion. For the configurations in Tables 1 and 2 to achieve 80% power, the RAR2-ENR design has a negligible advantage over the RAR1-ENR design with regard to the expected number of failures. For the expected sample size, the RAR1 designs have smaller sample sizes than the RAR2 designs when pC is small (e.g., pC ≤ 30%), but that trend is reversed when pC is large and Δ = 20%. As the statistical power is increased to 90% (see Figure 1), their numbers of failures are very close to each other. The ESS of the RAR1-ENR design could be slightly less than that of the RAR2-ENR design when pC is small. When pC is large and Δ = 15% (top left figure), the RAR2-ENR design could require slightly fewer participants.

In Table 3, we present the simulated type I error rate and statistical power for the optimal designs with the smallest ESS from Table 1. It can be seen that the type I error rate and the statistical power are very close to the nominal levels. In a few cases, the actual statistical power could be slightly below the nominal level. In such cases, one may use the simulated statistical power instead of the approximate power in Equation (5). Similar findings are observed for the other designs in Tables 1 and 2. It should be noted that when the estimated ρ based on data from the first stage is near the boundary (e.g., 0 or 1), all participants in the second stage may be assigned to one group, leaving no participants in the other group. For such rare cases, we assign one participant to the other group in order to calculate the test statistics Z2 and Zf.
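A Monte Carlo check of this kind can be sketched with the standard library alone. The code below is a simplified illustration, not the simulation used for Table 3. It applies the play-the-winner rule directly to the observed first-stage rates instead of the sub-range averages ρk, rounds ρN2 to the nearest integer, and forces at least one participant into each group. The rounding rule, the seed, and the number of replicates are assumptions.

```python
import math
import random


def pooled_z(x_e, n_e, x_c, n_c):
    """Pooled-variance two-proportion Z statistic (0 if the variance is zero)."""
    p = (x_e + x_c) / (n_e + n_c)
    var = p * (1 - p) * (1 / n_e + 1 / n_c)
    if var == 0:
        return 0.0
    return (x_e / n_e - x_c / n_c) / math.sqrt(var)


def simulate_rejection_rate(p_e, p_c, n1_total, n2_total, r1, r,
                            n_sims=3000, seed=1):
    """Proportion of simulated trials with Z1 > r1 and Zf > r."""
    rng = random.Random(seed)
    n1 = n1_total // 2
    w = n1_total / (n1_total + n2_total)
    rejections = 0
    for _ in range(n_sims):
        x_e1 = sum(rng.random() < p_e for _ in range(n1))
        x_c1 = sum(rng.random() < p_c for _ in range(n1))
        z1 = pooled_z(x_e1, n1, x_c1, n1)
        if z1 <= r1:
            continue                                  # stopped for futility
        q_e, q_c = 1 - x_e1 / n1, 1 - x_c1 / n1
        rho = q_c / (q_e + q_c) if q_e + q_c > 0 else 0.5
        # Keep at least one participant in each second-stage group.
        n_e2 = min(max(round(rho * n2_total), 1), n2_total - 1)
        n_c2 = n2_total - n_e2
        x_e2 = sum(rng.random() < p_e for _ in range(n_e2))
        x_c2 = sum(rng.random() < p_c for _ in range(n_c2))
        z2 = pooled_z(x_e2, n_e2, x_c2, n_c2)
        if math.sqrt(w) * z1 + math.sqrt(1 - w) * z2 > r:
            rejections += 1
    return rejections / n_sims


# Null scenario (pE = pC = 0.2) for N1 = 96, N2 = 170, r1 = 0.595, r = 1.505.
type1 = simulate_rejection_rate(0.20, 0.20, 96, 170, 0.595, 1.505)
```

Calling the same function with pE = 0.35 gives a simulated power for that design. Both estimates carry Monte Carlo error, so a larger number of replicates is needed before comparing them with the nominal levels to the third decimal place.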

Table 3:

Simulated type I error rate and statistical power for the optimal designs under the ESS criteria in Table 1, at the nominal level of α = 5%, and 80% statistical power.

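| pC | pE | RAR | N1 | N2 | r1 | r | α | Power |
|---|---|---|---|---|---|---|---|---|
| 0.2 | 0.35 | RAR1 | 90 | 170 | 0.510 | 1.520 | 0.0515 | 0.7926 |
| 0.2 | 0.35 | RAR2 | 96 | 170 | 0.595 | 1.505 | 0.0487 | 0.8068 |
| 0.4 | 0.55 | RAR1 | 112 | 210 | 0.480 | 1.530 | 0.0519 | 0.8002 |
| 0.4 | 0.55 | RAR2 | 118 | 206 | 0.525 | 1.525 | 0.0512 | 0.8137 |
| 0.6 | 0.75 | RAR1 | 102 | 184 | 0.530 | 1.520 | 0.0502 | 0.7993 |
| 0.6 | 0.75 | RAR2 | 106 | 184 | 0.565 | 1.515 | 0.0512 | 0.8113 |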

3.2. Real trial example

The Thall design was used to design a randomized clinical trial in chemotherapy-naive patients with advanced non-small-cell lung cancer (NSCLC) to compare the activity of a cisplatin-etoposide regimen (CDDP-VP16) and a cisplatin/carboplatin-etoposide-vinorelbine combination, where the former is the standard care [14]. The therapeutic activity rate of the standard CDDP-VP16 regimen was estimated to be pC = 25%. A 15% higher activity rate was expected from the new experimental treatment, with pE = 40%.

That NSCLC trial was designed to achieve 80% power at the significance level of 0.05. The Thall two-stage design needs N1 = 104 in the first stage and N2 = 150 in the second stage. If the objective function was the ENR instead of the ESS, the proposed optimal RAR designs would have more participants in the first stage, with (N1, N2) = (144, 98) for the proposed RAR1-ENR design, and (N1, N2) = (140, 104) for the proposed RAR2-ENR design. The proposed RAR1-ENR design has a slightly smaller ENR (159.29) than the RAR2-ENR design (ENR = 159.62). The savings in the ENR from the proposed RAR1-ENR and RAR2-ENR designs are close to 10% as compared to the Thall design (ENR = 176.5).

4. Discussion

In optimal randomized two-stage designs to minimize the sample size, the identified design often has a smaller sample size in the first stage than the second stage [13, 15]. With fewer participants in the first stage, the expected sample size could be reduced. However, the variance of the test statistic for the first stage could be larger as compared to that for a study with more participants assigned to the first stage. We developed a Shiny app: https://AdaptiveDesigns.shinyapps.io/RAR2Stage/.

The asymptotic normal distribution of the Z test statistic is used to compute the TIE and the statistical power [13]. In a RAR design, the quantity $(r_a - x\sqrt{w})/\sqrt{1-w}$ in the statistical power formula in Equation (5) should be

$$\frac{r_a - x\sqrt{w} \times \dfrac{\sigma_{1a}}{\sigma_{fa}}}{\sqrt{1-w} \times \dfrac{\sigma_{2a}}{\sigma_{fa}}},$$

because nE2 and nC2 are often not the same. In the case with equal randomization, both $\sigma_{1a}/\sigma_{fa}$ and $\sigma_{2a}/\sigma_{fa}$ are equal to 1. We conducted a simulation study with pC from 0 to 80%, and Δ = pE − pC from 5% to 30%. The ratio of σ1a and σ2a is very close to 1 with the mean value of 1.027, and the ratio of σ1a and σfa is even closer to 1. For that reason, we use $(r_a - x\sqrt{w})/\sqrt{1-w}$ in the power calculation to approximate the statistical power. The simulation studies show the statistical power is very close to the nominal level in Table 3.

When the limiting distribution is utilized, the actual error rates could be under- or over-estimated. Thall et al. [13] calculated the exact threshold values by using binomial distributions, and found that the results based on normal approximation are close to those based on exact calculation. In the Thall design, participants from the second stage are equally randomized. In the proposed RAR designs, the sample size ratio between the two treatment groups changes with the outcome from the first stage, with K possible ratios depending on the Z1 value [16, 17, 18]. For each possible value of Z1, one has to generate the complete sample space. For that reason, the number of possible data sets is huge, and it is a challenge to conduct exact calculation for the proposed RAR designs without efficient integer algorithms [19, 20, 15, 21].

In this article, we use the same threshold value r regardless of the outcome from the first stage, which reduces the computational intensity. In the future, we will consider developing a RAR design that allows r to change based on the Z1 value [22, 23]. In addition, a monotonic relationship between the second stage sample size N2 and Z1 may be incorporated in the design search. It is an intuitive feature that fewer participants are needed to confirm the activity of a new treatment when a higher effect size is observed from the first stage. In a single-arm two-stage design, we found that this intuitive feature would improve the effectiveness of the study design, although the improvement is limited in some configurations.

In addition to the proposed RAR designs for a randomized two-stage design with binary outcome, covariate adaptive randomization (CAR) designs may be considered to balance the important covariates in a study. For example, in the trials for Alzheimer’s Disease (AD), the apolipoprotein E (ApoE) e4 genotype is found to be associated with early onset of AD [24, 25, 26, 27]. When the ApoE e4 genotype is included as a covariate in the adaptive randomization, the success rate of AD trials could be increased by reducing the heterogeneity in AD patients [28]. We consider developing CAR designs for a trial with binary outcome as future work.

Acknowledgments

The authors are very grateful to the Editor, Associate Editor, and two reviewers for their insightful comments that helped improve the manuscript, and their encouragement to develop more adaptive designs for clinical trials. Shan’s research is partially supported by the National Institutes of Health under Award Numbers R01AG070849 and R03CA248006.

References
