Author manuscript; available in PMC: 2015 Jul 20.
Published in final edited form as: Stat Med. 2014 Feb 27;33(16):2718–2735. doi: 10.1002/sim.6124

A New Approach to Designing Phase I-II Cancer Trials for Cytotoxic Chemotherapies

Jay Bartroff *, Tze Leung Lai †,, Balasubramanian Narasimhan †,§
PMCID: PMC4048734  NIHMSID: NIHMS570706  PMID: 24577750

Abstract

Recently there has been much work on early phase cancer designs that incorporate both toxicity and efficacy data, called Phase I-II designs because they combine elements of both phases. However, they do not explicitly address the Phase II hypothesis test of H0: p ≤ p0, where p is the probability of efficacy at the estimated maximum tolerated dose (MTD) η̂ from Phase I and p0 is the baseline efficacy rate. Standard practice for Phase II remains to treat p as a fixed, unknown parameter and to use Simon’s 2-stage design with all patients dosed at η̂. We propose a Phase I-II design that addresses the uncertainty in the estimate p = p(η̂) in H0 by using sequential generalized likelihood theory. Combining this with a Phase I design that incorporates efficacy data, the Phase I-II design provides a common framework that can be used all the way from the first dose of Phase I through the final accept/reject decision about H0 at the end of Phase II, utilizing both toxicity and efficacy data throughout. Efficient group sequential testing is used in Phase II that allows for early stopping to show treatment effect or futility. The proposed Phase I-II design thus removes the artificial barrier between Phase I and Phase II, and fulfills the objectives of searching for the MTD and testing if the treatment has an acceptable response rate to enter into a Phase III trial.

Keywords: cancer trials, generalized likelihood ratio, group sequential, isotonic regression, maximum tolerated dose, Phase I, Phase II

1. Introduction

In typical Phase I studies in the development of relatively benign drugs, the drug is initiated at low doses and subsequently escalated to show safety at a level where some positive response occurs, and healthy volunteers are used as study subjects. This paradigm does not work for diseases like cancer, for which a non-negligible probability of severe toxic reaction has to be accepted to give the patient some chance of a favorable response to the treatment. Therefore patients (rather than healthy volunteers) are used as study subjects, and it is widely accepted that some degree of toxicity must be tolerated to experience any substantial therapeutic effects. Hence, an acceptable proportion q of patients experiencing dose limiting toxicities (DLTs) is generally agreed on before the trial, which depends on the type and severity of the DLT; the dose resulting in this proportion is thus referred to as the maximum tolerated dose (MTD). In addition to the explicitly stated objective of determining the MTD, a Phase I cancer trial also has the implicit goal of safe treatment of the patients in the trial. However, the aims of treating patients in the trial and generating an efficient design to estimate the MTD for future patients often run counter to each other. Commonly used designs in Phase I cancer trials implicitly place their focus on the safety of the patients in the trial, beginning from a conservatively low starting dose and escalating cautiously.

In [1, 2], Bartroff and Lai have given a review of model-based methods to design Phase I cancer trials and proposed a general framework that incorporates both “individual” and “collective” ethics into the design of the trial. We have also developed a new design which minimizes a risk function composed of two terms, one representing the individual risk of the current dose and the other representing the collective risk. We have shown that it performs better than existing model-based designs in the accuracy of the MTD estimate at the end of the trial, in the toxicity and overdose rates of patients in the trial, and in loss functions reflecting the individual and collective ethics.

The MTD determined from a Phase I study is used in a subsequent Phase II study, in which “a cohort of patients is treated, and the outcomes are related to the prespecified target or bar. If the results meet or exceed the target, the treatment is declared worthy of further study; otherwise, further development is stopped. This has been referred to as the ‘go/no go’ decision” ([3], p. 927). The most widely used design for these single-arm Phase II trials is Simon’s two-stage design [4], which allows early stopping of the trial if the treatment has not shown a beneficial effect, as measured by a Bernoulli proportion. Simon considered the design that stops for futility (i.e., accepts the null hypothesis H0 in (1)) after n1 patients if the number of patients exhibiting positive treatment effect is r1 (≤ n1) or fewer, and otherwise treats an additional n2 patients and rejects the treatment (again, accepts H0) if and only if the number of patients exhibiting positive treatment effect is r (≤ n1 + n2) or fewer. Simon’s design requires that a null proportion p0, representing some “uninteresting” level of positive treatment effect, and an alternative p1 > p0 be specified. The null hypothesis is

$$H_0: p \le p_0, \tag{1}$$

where p denotes the probability of positive treatment effect. The type I and II error probabilities α = P_{p0}(Reject H0), β = P_{p1}(Accept H0), and the expected sample size E_{p0}N can be computed for any design of this form, which can be represented by the parameter vector (n1, n2, r1, r). Using computer search over these integer-valued parameters, Simon [4] tabulated the optimal designs in his Tables 1 and 2 for different values of (p0, p1). Simon’s design has been generalized by Jung et al. [5, 6], who also give a graphical method of selecting from among the admissible designs, Simon’s original procedure being one of them, and by Lu et al. [7] to allow for partial responses. Whether the new treatment is declared promising in a Phase II trial depends strongly on the prescribed p0 and p1. The sample size m of a typical Phase I trial and the maximum sample size M = n1 + n2 of a typical Phase II trial are relatively small, 20–30 for Phase I and no more than 60 for Phase II. Vickers et al. [3] conclude that uncertainty in the choice of p0 and p1 can increase the likelihood that (a) a treatment with no viable positive treatment effect proceeds to Phase III, or (b) a treatment with positive treatment effect is abandoned at Phase II.
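The quantities α, β and E_{p0}N above are sums of binomial probabilities, so they can be checked directly. The following is a minimal sketch, not Simon's search procedure: the function name `simon_two_stage` and its return convention are illustrative choices, and the design (n1, n2, r1, r) = (18, 25, 2, 7) with p0 = .1, p1 = .25 is the one listed for α = .05 in Table 1.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); zero when k < 0."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(j, n, p) for j in range(min(k, n) + 1))


def simon_two_stage(n1, n2, r1, r, p):
    """Return (P(reject H0), P(stop after stage 1), expected sample size)
    for a two-stage design evaluated at true response probability p."""
    early_stop = binom_cdf(r1, n1, p)  # r1 or fewer responses among n1
    reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        # Continue to stage 2; reject H0 when total responses exceed r.
        reject += binom_pmf(x1, n1, p) * (1.0 - binom_cdf(r - x1, n2, p))
    expected_n = n1 + (1.0 - early_stop) * n2
    return reject, early_stop, expected_n


# Type I error and expected sample size at p0, power at p1.
alpha, pet0, en0 = simon_two_stage(18, 25, 2, 7, 0.10)
power, _, _ = simon_two_stage(18, 25, 2, 7, 0.25)
print(f"alpha={alpha:.3f}  PET(p0)={pet0:.3f}  EN(p0)={en0:.1f}  power={power:.3f}")
```

Evaluating the same function on a grid of p values traces out the full operating characteristic curve of a given design.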

Table 1.

Operating characteristics, based on 100,000 simulations, of the traditional design described in Section 4.1, in which η = 250, η̂ denotes the final MTD estimate by either MLE, posterior mean (CRM), or EWOC, and α denotes the prescribed type I error probability of Simon’s Phase II test of p(η̂; ψ) ≤ p0 with design parameters n1, n2, r1 and r. The actual probability of rejecting H0 : p(η; ψ) ≤ p0 is denoted by P(−H0|η̂), with standard errors in parentheses, where η̂ = MLE, CRM, or EWOC is the final MTD estimate in Phase I.

| Method for η̂ | MLE | CRM | EWOC |
|---|---|---|---|
| min (η̂) | 140.0 | 141.2 | 141.0 |
| Q1 (η̂) | 226.3 | 246.9 | 229.1 |
| med (η̂) | 244.7 | 264.7 | 246.9 |
| Q3 (η̂) | 264.1 | 318.1 | 246.9 |
| max (η̂) | 425.0 | 391.6 | 362.7 |
| E(η̂) | 252.6 | 276.7 | 239.8 |
| RMSE (η̂) | 52.2 | 44.2 | 29.0 |

| α | n1/n2/r1/r | P(−H0\|MLE) | P(−H0\|CRM) | P(−H0\|EWOC) |
|---|---|---|---|---|
| .05 | 18/25/2/7 | .180 (.001) | .479 (.002) | .100 (.0009) |
| .04 | 18/30/2/8 | .176 (.001) | .476 (.002) | .094 (.0009) |
| .03 | 18/35/2/9 | .170 (.001) | .470 (.002) | .088 (.0009) |
| .02 | 22/44/3/11 | .167 (.001) | .464 (.002) | .083 (.0009) |
| .01 | 22/58/3/14 | .156 (.001) | .458 (.002) | .074 (.0008) |

Table 2.

Operating characteristics of the traditional (denoted Trad) and new (denoted New) designs described in Section 4.1. The toxicity parameter is fixed at (η, ρ) = (250, .1), and the six values of the efficacy parameter ψ are determined by p(xmax; ψ) = .9 and p(η; ψ) = .05, .1, .2, .3, .4, .5. All designs have Phase I sample size m = 24 and maximum Phase II sample size 43, for a maximum Phase I-II sample size of 67. Eff is the overall response rate for subjects in the study, OD is the overall overdose rate of subjects treated at doses above the true MTD, and RMSE(η̂rec) is the root-mean-square-error of the recommended dose.

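| p(η; ψ) | 5% Trad | 5% New | 10% Trad | 10% New | 20% Trad | 20% New | 30% Trad | 30% New | 40% Trad | 40% New | 50% Trad | 50% New |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p(η̂rec; ψ) | .101 | .054 | .150 | .102 | .233 | .202 | .319 | .296 | .409 | .392 | .499 | .486 |
| Eff | .096 | .061 | .140 | .104 | .219 | .200 | .310 | .293 | .405 | .381 | .498 | .474 |
| OD | .303 | .291 | .314 | .312 | .326 | .289 | .327 | .256 | .336 | .252 | .331 | .249 |
| RMSE(η̂rec) | 51.0 | 28.4 | 52.2 | 29.0 | 52.4 | 29.3 | 52.3 | 28.6 | 51.7 | 29.0 | 52.1 | 29.8 |
| P(rej. H0) | .090 | .051 | .180 | .180 | .479 | .645 | .776 | .923 | .939 | .989 | .987 | .999 |
| EN | 45.9 | 40.2 | 49.8 | 47.3 | 57.7 | 51.0 | 63.2 | 43.7 | 66.0 | 37.0 | 66.7 | 34.6 |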

1.1. An integrated approach to dose finding and testing for efficacy

In Sections 2 and 3 we address these issues concerning the design of early-phase single-arm cancer trials by developing a novel seamless Phase I-II trial design that uses efficient statistical methods for the design and analysis of the integrated trial, subject to ethical and sample size constraints. The data from the trial are toxicity and efficacy outcomes at various doses and consist of (xi, yi, zi), i = 1, …, N, where N is the Phase I-II total sample size, xi denotes the dose given to the ith subject, yi = 1 or 0 according to whether a DLT occurs or not, and zi = 1 or 0 according to whether the subject responds to the treatment. For cytotoxic treatments, both the dose-toxicity curve P(yi = 1|xi = x) and the dose-response curve P(zi = 1|xi = x) increase with the dose x, and therefore the MTD is the most efficacious dose subject to a prespecified probability q of severe toxic reaction. Whereas the objective of a traditional Phase I cancer trial is to estimate the MTD, denoted by η, from (xi, yi), i = 1, …, m, and that of the ensuing Phase II trial with maximum sample size M is to test if the response rate exceeds some prespecified level p0 when all patients in the trial are assigned dose η̂, which is the MTD estimate from the Phase I trial, our integrated design continues sequential estimation of η throughout the trial with total maximum sample size m + M and uses an efficient group sequential test of the null hypothesis that the response rate at η does not exceed p0. In Section 2 we consider commonly used logistic regression models for dose-toxicity and dose-response relationships to pinpoint the basic ideas. Section 3 removes the parametric assumptions and extends the methodology to dose-toxicity and dose-response relationships that are only assumed to be monotone. Simulation studies in Section 4 demonstrate the advantages of the integrated design, and Section 5 describes the underlying theory and implementation details.

1.2. Review of current methods using toxicity and efficacy/response data

Gooley et al. [8] suggested using efficacy and toxicity data together, and performed simulations to compare the operating characteristics of three ad-hoc designs. Thall and Russell [9] proposed a design combining binary toxicity data yi and trinomial response data zi = 0, 1, or 2 for no, moderate, or severe response, respectively, into a single trinomial variable

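$$w_i = \begin{cases} 0, & \text{if } z_i = 0 \text{ and } y_i = 0, \\ 1, & \text{if } z_i = 1 \text{ and } y_i = 0, \\ 2, & \text{if } z_i = 2 \text{ or } y_i = 1. \end{cases} \tag{2}$$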

Using a proportional odds regression model for wi on dose xi with a prior distribution on its unknown parameters, a Bayesian posterior calculation along the lines of O’Quigley, Pepe, and Fisher’s [10] continual reassessment method (CRM) is performed to calculate the acceptability of the available discrete dose levels and escalate or de-escalate the current dose level. For a similar setting, O’Quigley et al. [11] proposed a Phase I design for HIV trials in which binary efficacy zi and toxicity yi variables are combined into a single trinomial variable (2) in which we now set wi = 2 if yi = 1. A CRM-like calculation is used to treat the current patient at the posterior estimate of the dose maximizing the probability of simultaneous efficacy and non-toxicity.

For efficacy and toxicity measurements, Ivanova [12] proposed an up-and-down design which assigns doses in pairs on a discrete set of dose levels. Braun [13] proposed a bivariate version of CRM in which a bivariate joint distribution is chosen for (yi, zi), and the target dose is defined to be the one minimizing the expected Euclidean distance to pre-specified toxicity and efficacy rates, with respect to a chosen noninformative posterior distribution. In particular, the bivariate distribution of Arnold and Strauss [14], which gives Bernoulli conditional distributions of yi given zi, and vice-versa, was recommended. Thall and Cook [15] proposed a different method for combining efficacy and toxicity responses. First, marginal efficacy and toxicity curves are assumed, which are then combined using a Gaussian or Gumbel copula; this approach differs from Braun’s method, which specifies the conditional distributions rather than the marginals. Doses are then selected using “tradeoff contours” in the two-dimensional space of outcome probabilities on which the outcomes are equally desirable. Thall et al. [16] extend this method to allow for the inclusion of patient-specific covariates.

Even when the designs summarized above are called “Phase I-II” designs, it is because they incorporate efficacy (or tumor response) data. They do not address testing the efficacy hypothesis that is the purpose of typical Phase II cancer studies, for which the standard practice is to use Simon’s 2-stage design following the dose-finding portion. Moreover, this skirts the issue of uncertainty in the estimated MTD used in the null hypothesis in Phase II, as well as ignores toxicity outcomes that are available during Phase II which could help improve this estimate, especially since the Phase I sample size is usually small. The innovative Phase I-II design proposed herein aims at rectifying these issues, and hence provides a common framework that can be used all the way from Phase I through the final accept/reject decision about the null hypothesis on efficacy in the Phase II portion of the study, utilizing both toxicity and efficacy data for dose finding while performing efficient group sequential testing of the null hypothesis.

2. An integrated approach to designing early-phase cancer clinical trial designs

A widely-used model for the dose-toxicity curve is the logistic regression model

$$P(y_i = 1 \mid x_i = x) = F(x;\theta) := 1/\bigl(1 + e^{-(\theta_1 + \theta_2 x)}\bigr), \tag{3}$$

where θ = (θ1, θ2) and it is assumed that θ2 > 0 (i.e., probability of toxicity increases with dose), for which the MTD is given by η = [log(q/(1 − q)) − θ1]/θ2. Under (3), the estimate η̂ based on (xi, yi), i = 1, …, m, can be obtained by maximum likelihood, which is equivalent to logistic regression. Similarly, we can model the dose-response curve by

$$P(z_i = 1 \mid x_i = x) = p(x;\psi) := 1/\bigl(1 + e^{-(\psi_1 + \psi_2 x)}\bigr), \tag{4}$$

under which the probability p of the response in the null hypothesis H0: p ≤ p0 of the traditional Phase II cancer trial is actually p(η̂; ψ). The difference between η̂ and η is completely ignored in currently used designs, and the toxicity outcomes in the Phase II trial are also ignored. Combining the toxicity outcomes in Phase II with those in Phase I can improve the estimate of η, especially since the Phase I sample size is small. Changing the null hypothesis to

$$H_0: p(\eta;\psi) \le p_0 \tag{5}$$

not only takes into consideration the uncertainties in η̂ as an estimate of η but also leads to continual updating of η with toxicity outcomes in the Phase II trial if one uses a generalized likelihood ratio (GLR) test. Moreover, the GLR test also uses the Phase I efficacy outcomes zi, i = 1, …, m.
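To make the distinction between p(η̂; ψ) and p(η; ψ) concrete, the sketch below inverts model (3) to obtain the MTD and then evaluates model (4) at the true MTD and at an overestimate of it. The efficacy parameter ψ = (−3.895, .00679) is the value quoted in Section 4.1; the toxicity parameter `theta` and the 25-unit overestimate are illustrative assumptions, chosen only so that the MTD is near 250 for q = 1/3.

```python
import math


def logit(u):
    return math.log(u / (1.0 - u))


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def mtd(theta, q):
    """Dose at which the logistic toxicity curve (3) equals q."""
    theta1, theta2 = theta
    if theta2 <= 0:
        raise ValueError("theta2 must be positive for an increasing curve")
    return (logit(q) - theta1) / theta2


def response_prob(x, psi):
    """Dose-response curve (4)."""
    return expit(psi[0] + psi[1] * x)


q = 1.0 / 3.0
theta = (-4.1115, 0.013673)  # illustrative assumption, not from the paper
psi = (-3.895, 0.00679)      # efficacy parameter quoted in Section 4.1

eta = mtd(theta, q)
p_at_true_mtd = response_prob(eta, psi)
p_at_overestimate = response_prob(eta + 25.0, psi)  # hypothetical eta-hat
print(eta, p_at_true_mtd, p_at_overestimate)
```

Because the dose-response curve is increasing, any overestimate of η raises the response rate actually being tested, which is the mechanism behind the inflated rejection rates reported in Table 1.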

2.1. The first phase of the Phase I-II trial

The first phase of the new Phase I-II (or dose-finding) design involves only the dose-toxicity data, but not the responses zi. We can use traditional methods or recent advances in Phase I cancer trial designs to perform dose escalation; see Section 5.2 and the references therein for details. At the end of the Phase I trial, we compute the maximum likelihood or Bayes estimates θ̃, η̃, and ψ̃ of θ, η, and ψ. Let ℱ0 denote the Phase I data (x1, y1, z1), …, (xm, ym, zm).

2.2. The ensuing group sequential design to test efficacy and re-estimate η

After this initial group of m patients, the proposed design switches to a group sequential scheme, with specified group sizes m1, …, mK (e.g., m1 = … = mK gives constant group size sampling). The group sequential scheme updates the MTD estimate via MLE at the kth interim analysis with an additional batch of size mk of dose-toxicity data $(x_{\tau_{k-1}+1}, y_{\tau_{k-1}+1}), \ldots, (x_{\tau_k}, y_{\tau_k})$, where

$$\tau_k = m + \sum_{i=1}^{k} m_i, \quad k = 1, \ldots, K. \tag{6}$$

It also uses all the observed data (xi, yi, zi), 1 ≤ i ≤ τk, to perform a group sequential GLR test of H0: p(η; ψ) ≤ p0 at the kth interim analysis, where p(x; ψ) is defined by (4). Lai and Shih [17] have developed a methodology of nearly optimal group sequential tests, which use versatile and asymptotically efficient GLR test statistics and stopping boundaries. In conjunction with GLR statistics, maximum likelihood (rather than Bayes) estimates of η are used for sequential updating of the estimated MTD.
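As a small illustration of the cumulative sample sizes in (6), the sketch below uses the Phase I size m = 24 and the Phase II group sizes 10, 10, 10, 10, 3 reported for the simulations in Section 4.1. The variable names are illustrative.

```python
from itertools import accumulate

m = 24                            # Phase I sample size
group_sizes = [10, 10, 10, 10, 3]  # m_1, ..., m_K for Phase II

# tau_k = m + m_1 + ... + m_k, the number of patients seen by analysis k
tau = [m + s for s in accumulate(group_sizes)]

# Patients (1-indexed) who enter between analysis k-1 and analysis k
batches = [(start + 1, stop) for start, stop in zip([m] + tau[:-1], tau)]
print(tau)      # cumulative sample sizes at each interim analysis
print(batches)  # first and last patient index in each Phase II group
```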

To simplify the description, we begin by assuming that yi and zi are independent; this assumption will be removed in Section 2.3. Let ℓk(ψ) denote the log-likelihood function for ψ at the kth interim analysis, which because of the independence assumption only depends on the zi and not the yi:

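$$\ell_k(\psi) = \log\Bigl\{\prod_{i=1}^{\tau_k} p(x_i;\psi)^{z_i}\,[1 - p(x_i;\psi)]^{1-z_i}\Bigr\} = \sum_{i=1}^{\tau_k}\bigl\{z_i(\psi_1 + \psi_2 x_i) - \log\bigl(1 + e^{\psi_1 + \psi_2 x_i}\bigr)\bigr\}.$$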
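The step from the product form to the sum form of the log-likelihood uses only the logistic form (4). A short derivation for a single observation, writing p for p(x; ψ) and z for the response:

```latex
\begin{aligned}
\log\{p^{z}(1-p)^{1-z}\} &= z\log\frac{p}{1-p} + \log(1-p), \\
\log\frac{p(x;\psi)}{1-p(x;\psi)} &= \psi_1 + \psi_2 x, \qquad
\log\{1-p(x;\psi)\} = -\log\bigl(1 + e^{\psi_1+\psi_2 x}\bigr), \\
\text{so}\quad \log\{p^{z}(1-p)^{1-z}\} &= z(\psi_1+\psi_2 x) - \log\bigl(1 + e^{\psi_1+\psi_2 x}\bigr).
\end{aligned}
```

Summing over the observations available at the kth interim analysis gives the second expression.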

Let ψ̂k be an MLE maximizing this, θ̂k = (θ̂k,1, θ̂k,2) be an MLE of θ based on the data up to and including the kth interim analysis, η̂k = (logit(q) − θ̂k,1)/ θ̂k,2, and

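$$S_{k,j} = \{\psi : p(\hat\eta_k;\psi) = p_j\} \quad \text{for } 0 \le k \le K,\ j = 0, 1, \tag{7}$$

where η̂0 = η̃, p1 > p0 and H1: p(η; ψ) ≥ p1 is the alternative hypothesis. The choice of p1 will be discussed in Section 5.2.

As will be explained in Section 5.1, we can compute at the kth interim analysis the test statistics

$$\ell_{k,j} = \min_{\psi \in S_{k,j}} \bigl[\ell_k(\hat\psi_k) - \ell_k(\psi)\bigr], \quad j = 0, 1, \tag{8}$$

so that the group sequential test stops and rejects H0 at interim analysis k < K if

$$p(\hat\eta_k, \hat\psi_k) > p_0 \quad \text{and} \quad \ell_{k,0} \ge b, \tag{9}$$

and early stopping for futility (accepting H0) at analysis k < K can also occur if

$$p(\hat\eta_k, \hat\psi_k) < p_1 \quad \text{and} \quad \ell_{k,1} \ge \tilde b. \tag{10}$$

The test rejects H0 at the Kth analysis if

$$p(\hat\eta_K, \hat\psi_K) > p_0 \quad \text{and} \quad \ell_{K,0} \ge c. \tag{11}$$

The thresholds b, b̃, and c are chosen so that

$$\max_{\psi \in S_{0,0}} P_{\theta,\psi}(H_0 \text{ rejected} \mid \mathcal{F}_0) = \alpha \tag{12}$$

and the power

$$\min_{\psi \in S_{0,1}} P_{\theta,\psi}(H_0 \text{ rejected} \mid \mathcal{F}_0) \tag{13}$$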

is close to 1 − β, as in [17], [18], and [19]. Details and software for implementation are given in Section 5.2.
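One way to see how the statistics in (8) and the rules (9)–(11) fit together is the following sketch. It is not the implementation referred to in Section 5: it assumes independence of yi and zi, takes the current MTD estimate `eta_hat` as given, and uses a plain Newton iteration with step halving. On the constraint set in (7) the relation ψ1 + ψ2 η̂k = logit(pj) holds, so the constrained maximization reduces to a one-parameter logistic fit with an offset. The data, the function names, and the name `b_futility` for the threshold in (10) are illustrative assumptions; the threshold values 3, 3.5 and .7 are those quoted in Section 4.1.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def loglik(us, zs):
    # Sum form of the log-likelihood: z*u - log(1 + e^u), u = psi1 + psi2*x
    return sum(z * u - math.log1p(math.exp(u)) for u, z in zip(us, zs))


def newton(zs, offset, feats, iters=100):
    """Maximize the logistic log-likelihood whose linear predictor is
    offset + sum_j beta[j] * feats[i][j]. Handles 1 or 2 coefficients."""
    d = len(feats[0])
    beta = [0.0] * d

    def predict(b):
        return [offset + sum(bj * fj for bj, fj in zip(b, f)) for f in feats]

    cur = loglik(predict(beta), zs)
    for _ in range(iters):
        g = [0.0] * d
        h = [[0.0] * d for _ in range(d)]
        for f, z, u in zip(feats, zs, predict(beta)):
            p = expit(u)
            for a in range(d):
                g[a] += (z - p) * f[a]
                for c in range(d):
                    h[a][c] += p * (1.0 - p) * f[a] * f[c]
        if d == 1:
            step = [g[0] / h[0][0]]
        else:
            det = h[0][0] * h[1][1] - h[0][1] ** 2
            step = [(h[1][1] * g[0] - h[0][1] * g[1]) / det,
                    (h[0][0] * g[1] - h[0][1] * g[0]) / det]
        t, improved = 1.0, False
        while t > 1e-12 and not improved:  # step halving keeps ascent monotone
            trial = [b + t * s for b, s in zip(beta, step)]
            val = loglik(predict(trial), zs)
            if val >= cur:
                beta, cur, improved = trial, val, True
            t /= 2.0
        if not improved:
            break
    return beta, cur


def glr_statistics(xs, zs, eta_hat, p0, p1):
    """Return (p(eta_hat; psi_hat), l_{k,0}, l_{k,1}) as in (8)."""
    psi_hat, l_max = newton(zs, 0.0, [(1.0, x) for x in xs])
    p_hat = expit(psi_hat[0] + psi_hat[1] * eta_hat)
    stats = []
    for pj in (p0, p1):
        # On S_{k,j}: psi1 = logit(pj) - psi2 * eta_hat, so only psi2 is free.
        _, l_con = newton(zs, logit(pj), [(x - eta_hat,) for x in xs])
        stats.append(l_max - l_con)
    return p_hat, stats[0], stats[1]


def decide(k, K, p_hat, l0, l1, p0, p1, b, b_futility, c):
    """Apply (9)-(11) at analysis k of K."""
    if k < K:
        if p_hat > p0 and l0 >= b:
            return "reject H0"
        if p_hat < p1 and l1 >= b_futility:
            return "accept H0"
        return "continue"
    return "reject H0" if (p_hat > p0 and l0 >= c) else "accept H0"


# Hypothetical data: 5 patients at each of 6 doses.
xs = [d for d in (150.0, 200.0, 250.0, 300.0, 350.0, 400.0) for _ in range(5)]
zs = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
      0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat, l0, l1 = glr_statistics(xs, zs, eta_hat=250.0, p0=0.1, p1=0.25)
print(p_hat, l0, l1, decide(1, 5, p_hat, l0, l1, 0.1, 0.25, 3.0, 3.5, 0.7))
```

In an actual trial `eta_hat` would itself be re-estimated from the toxicity data at each analysis, which this sketch leaves out.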

2.3. Modeling the dependence between yi and zi

We can model the dependence between yi and zi by replacing the marginal model (4) by the following model for the conditional distribution of zi given yi:

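$$P(z_i = 1 \mid y_i = 0, x_i) = 1/\bigl(1 + e^{-(\psi_{10} + \psi_{20} x_i)}\bigr) \tag{14}$$
$$P(z_i = 1 \mid y_i = 1, x_i) = 1/\bigl(1 + e^{-(\psi_{11} + \psi_{21} x_i)}\bigr) \tag{15}$$

with parameters ψ0 = (ψ10, ψ20) and ψ1 = (ψ11, ψ21). Generalizing (5) to include (14)–(15), the null hypothesis is that the probability of efficacy at dose x = η is less than or equal to p0, i.e.,

$$H_0: \frac{1-q}{1 + e^{-(\psi_{10} + \psi_{20}\eta)}} + \frac{q}{1 + e^{-(\psi_{11} + \psi_{21}\eta)}} \le p_0, \tag{16}$$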

noting that F(η; θ) = q. This null hypothesis is an extension of that in Section 2.2 and can again be tested by sequential GLR theory.
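The left side of (16) is the law of total probability applied over the toxicity outcome at dose η, where toxicity occurs with probability q. A minimal sketch follows; the function name is illustrative, ψ = (−3.895, .00679) is the value quoted in Section 4.1, and the second parameter pair is a hypothetical choice used only to show the effect of dependence.

```python
import math


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def efficacy_at_mtd(eta, q, psi_no_tox, psi_tox):
    """Marginal response probability at the MTD under (14)-(15):
    (1 - q) * P(z=1 | y=0, eta) + q * P(z=1 | y=1, eta)."""
    p_given_no_tox = expit(psi_no_tox[0] + psi_no_tox[1] * eta)
    p_given_tox = expit(psi_tox[0] + psi_tox[1] * eta)
    return (1.0 - q) * p_given_no_tox + q * p_given_tox


eta, q = 250.0, 1.0 / 3.0
psi = (-3.895, 0.00679)

# When both conditional models coincide, (16) reduces to (5).
same = efficacy_at_mtd(eta, q, psi, psi)
# Hypothetical dependence: responders are more likely among patients with a DLT.
dependent = efficacy_at_mtd(eta, q, psi, (-3.0, 0.00679))
print(same, dependent)
```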

2.4. Modifications for discrete dose levels

In practice the dose levels in dose-finding studies of cancer drugs are usually chosen before the trial from a finite set

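$$\Lambda = \{\lambda_1, \ldots, \lambda_d\}, \quad \text{where } \lambda_1 < \lambda_2 < \cdots < \lambda_d, \tag{17}$$

unlike the continuous doses we have assumed so far. In this case the MTD has to be redefined as

$$\eta = \begin{cases} \max\{\lambda \in \Lambda : F(\lambda;\theta) \le q\}, & \text{if } F(\lambda_i;\theta) \le q \text{ for some } i, \\ \lambda_1, & \text{otherwise.} \end{cases} \tag{18}$$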

Putting this modified definition of η in (5) or (16), we can still apply the group sequential GLR test of Section 2.2 or 2.3, in which we also modify the definition of η̂k accordingly to be Λ-restricted. That is, η̂k is the smallest λj ∈ Λ maximizing the likelihood ℓk up through the kth interim analysis, and we set xi = η̂k for i = τk + 1, …, τk+1. Note that the group sequential GLR test is based on all the observed data (xi, yi, zi) up to the time of an interim analysis, irrespective of how the xi are chosen and therefore no additional modifications are needed.
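Definition (18) translates directly into a lookup over the dose grid. In the sketch below the dose levels and toxicity probabilities are hypothetical, and the function name is illustrative.

```python
def discrete_mtd(levels, tox_probs, q):
    """Largest dose level whose toxicity probability is at most q;
    falls back to the lowest level when no level qualifies, as in (18)."""
    admissible = [lam for lam, f in zip(levels, tox_probs) if f <= q]
    return max(admissible) if admissible else min(levels)


levels = [100, 150, 200, 250, 300]          # hypothetical dose grid
tox_probs = [0.05, 0.12, 0.25, 0.40, 0.55]  # hypothetical F(lambda; theta)

print(discrete_mtd(levels, tox_probs, q=1 / 3))
print(discrete_mtd(levels, [0.5, 0.6, 0.7, 0.8, 0.9], q=1 / 3))
```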

Since Λ is discrete, one can use more robust specification of the dose-toxicity and/or dose-response curve than the logistic regression models (3) and (4). Details are given in the next section. For samples of the size typically used in early-phase cancer trials, however, one usually does not have enough data to detect departures from these “working models.” In addition, the initial phase of a dose-finding study for cytotoxic chemotherapies is often very conservative, to avoid causing harm to patients before observing how the new treatment actually works in human subjects. This explains the popularity of the widely-used, although inefficient, 3+3 designs. A more efficient alternative is to use a 2-stage Phase I design in which a more cautious design is used for the first stage before switching to a parametric model-based design in the second stage; see [1, Section 4.2]. Once we have zoomed in on a range around the MTD that is narrow relative to the original dose range, the logistic model is actually quite robust because it can be viewed as a locally linear regression model around the MTD, adjusted with the logit link for Bernoulli outcomes. What this means is that one only needs to be concerned with the choice of the design levels xi to ensure such robustness in the locally logit-linear model. Thus, the GLR test statistic can be restricted only to those xi that are within a certain distance from η̂k at the kth interim analysis.

3. Extension to monotone dose-toxicity and dose-response relationships

In many dose-finding trials, the number of discrete dose levels (17) is relatively small. For this situation, in this section we develop an approach similar to the Bayesian models of Yin et al. [20] and Yin and Yuan [21] where the probabilities of toxicity and efficacy are order-restricted, but in a frequentist setting. Assume for now that yi and zi are independent; the general case will be covered below in Section 3.2. Because the number of dose levels is small, we also assume that all the levels have been used at least once during Phase I; if this does not hold then only the used dose levels are carried forward into Phase II. Instead of the parameterization by the toxicity and efficacy parameters θ and ψ, we parameterize by the toxicity and efficacy probabilities

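$$\phi_i = P(y = 1 \mid x = \lambda_i), \quad \pi_i = P(z = 1 \mid x = \lambda_i), \quad i = 1, \ldots, d. \tag{19}$$

The MTD (18) can then be written

$$\eta = \lambda_{i^*} \quad \text{where} \quad i^* = \begin{cases} \max\{i : \phi_i \le q\}, & \text{if } \phi_i \le q \text{ for some } i, \\ 1, & \text{otherwise,} \end{cases}$$

so that the Phase II null and alternative hypotheses can be expressed as

$$H_0: \pi_{i^*} \le p_0 \quad \text{vs.} \quad H_1: \pi_{i^*} \ge p_1.$$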

3.1. Order-restricted MLE and GLR statistics

Letting $x_t = \lambda_{i_t}$ denote the t-th dose, t = 1, …, τK with τk given by (6), and π = (π1, …, πd), the log-likelihood at the kth interim analysis of Phase II under the independence assumption is

$$\ell_k(\pi) = \log\Bigl\{\prod_{t=1}^{\tau_k} \pi_{i_t}^{z_t}\,(1 - \pi_{i_t})^{1 - z_t}\Bigr\}. \tag{20}$$

The order-restricted MLE π̂k = (π̂1,k, …, π̂d,k) maximizing (20) subject to π1 ≤ … ≤ πd is given by the formula

$$\hat\pi_{i,k} = \min_{j \ge i}\, \max_{j' \le i} \left(\frac{S_{j',k} + \cdots + S_{j,k}}{\nu_{j',k} + \cdots + \nu_{j,k}}\right), \quad i = 1, \ldots, d, \tag{21}$$

where $S_{i,k} = \sum_{t=1}^{\tau_k} z_t 1_{\{i_t = i\}}$ is the sum of the efficacy responses at level i and $\nu_{i,k} = \sum_{t=1}^{\tau_k} 1_{\{i_t = i\}}$ is the number of patients that have been dosed at level i up through the kth analysis ([22], p. 52). An analogous formula holds for the order-restricted MLE of the toxicity probabilities ϕ̂k = (ϕ̂1,k, …, ϕ̂d,k). These order-restricted MLEs can be computed by solving the minimization-maximization problem in (21) or, equivalently, by using the well known Pool Adjacent Violators Algorithm (PAVA); see [22, Section 2.4].
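The equivalence between the min-max formula (21) and PAVA can be checked numerically. The sketch below implements both for binomial counts; the response and patient counts are hypothetical, every level is assumed to have at least one patient (as the text assumes), and the function names are illustrative.

```python
def pava(successes, counts):
    """Order-restricted MLE of nondecreasing binomial proportions
    via the Pool Adjacent Violators Algorithm."""
    blocks = []  # each block: [pooled successes, pooled count, number of levels]
    for s, n in zip(successes, counts):
        blocks.append([s, n, 1])
        # Pool while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s2, n2, m2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
            blocks[-1][2] += m2
    out = []
    for s, n, m in blocks:
        out.extend([s / n] * m)
    return out


def min_max(successes, counts):
    """Direct evaluation of the min-max formula (21)."""
    d = len(counts)
    out = []
    for i in range(d):
        out.append(min(
            max(sum(successes[h:j + 1]) / sum(counts[h:j + 1])
                for h in range(i + 1))
            for j in range(i, d)))
    return out


S = [1, 3, 2, 4, 6]   # hypothetical responses per dose level
nu = [5, 6, 6, 5, 7]  # hypothetical patients per dose level
print(pava(S, nu))
print(min_max(S, nu))
```

Here the raw proportions at levels 2 and 3 (3/6 and 2/6) violate monotonicity, so both methods pool them to 5/12.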

The order-restricted MLE of the MTD at the kth interim analysis can be defined as

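$$\hat\eta_k = \lambda_{\hat{i}_k}, \quad \text{where} \quad \hat{i}_k = \begin{cases} \max\{i : \hat\phi_{i,k} \le q\}, & \text{if } \hat\phi_{i,k} \le q \text{ for some } i, \\ 1, & \text{otherwise.} \end{cases} \tag{22}$$

Let $\pi_k^j = (\pi_{1,k}^j, \ldots, \pi_{d,k}^j)$, j = 0, 1, be the constrained order-restricted MLE which maximizes (20) subject to the order-restriction π1 ≤ … ≤ πd and the additional constraint that

$$\pi_{\hat{i}_k} \le p_0 \ \text{ for } j = 0 \quad \text{and} \quad \pi_{\hat{i}_k} \ge p_1 \ \text{ for } j = 1, \tag{23}$$

which can be computed as follows. If $\hat\pi_{\hat{i}_k,k} \le p_0$, then $\pi_k^0 = \hat\pi_k$. Otherwise, $\hat\pi_{\hat{i}_k,k} > p_0$, so suppose that $\hat\pi_{\hat{i}_k - r - 1,k} \le p_0 < \hat\pi_{\hat{i}_k - r,k}$, in which case we set $\pi_{\hat{i}_k - r,k}^0 = \cdots = \pi_{\hat{i}_k,k}^0 = p_0$, and $\pi_{i,k}^0$ coincides with π̂i,k for all other i. In other words, when π̂k falls outside H0, $\pi_k^0$ is computed by setting the appropriate elements of π̂k to the boundary value p0, and $\pi_k^1$ is computed similarly.

The log-likelihood ratio statistics at the kth interim analysis for testing H0: π_{i*} ≤ p0 vs. H1: π_{i*} ≥ p1 are given by

$$\ell_{k,j} = \ell_k(\hat\pi_k) - \ell_k(\pi_k^j), \quad j = 0, 1, \tag{24}$$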
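The recipe for the null-constrained estimate can be sketched as follows for j = 0 only, since that is the case the text spells out. The code follows the stated rule literally: starting at the estimated MTD level and moving down, every entry of the order-restricted MLE above p0 is set to p0. The counts, the order-restricted estimate `pi_hat`, the 0-based index `i_hat`, and the function names are illustrative assumptions.

```python
import math


def null_constrained_mle(pi_hat, i_hat, p0):
    """Set entries i_hat, i_hat - 1, ... of pi_hat that exceed p0 to p0;
    leave all other entries unchanged (case j = 0 in the text)."""
    pi0 = list(pi_hat)
    i = i_hat
    while i >= 0 and pi_hat[i] > p0:
        pi0[i] = p0
        i -= 1
    return pi0


def level_loglik(pi, successes, counts):
    """Log-likelihood (20) grouped by dose level, with 0*log(0) taken as 0."""
    total = 0.0
    for p, s, n in zip(pi, successes, counts):
        if s > 0:
            total += s * math.log(p)
        if n - s > 0:
            total += (n - s) * math.log(1.0 - p)
    return total


S = [1, 3, 2, 4, 6]                         # hypothetical responses per level
nu = [5, 6, 6, 5, 7]                        # hypothetical patients per level
pi_hat = [0.2, 5 / 12, 5 / 12, 0.8, 6 / 7]  # order-restricted MLE for these counts
i_hat = 2                                   # 0-based index of the estimated MTD level

pi0 = null_constrained_mle(pi_hat, i_hat, p0=0.3)
l_k0 = level_loglik(pi_hat, S, nu) - level_loglik(pi0, S, nu)  # statistic (24), j = 0
print(pi0, l_k0)
```

When the estimate at the MTD level is already at or below p0, the function returns the unconstrained estimate and the statistic is zero.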

with ℓk(π) defined by (20), and the group sequential test stops and rejects H0 at interim analysis k < K if

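$$\hat\pi_{\hat{i}_k,k} > p_0 \quad \text{and} \quad \ell_{k,0} \ge b, \tag{25}$$

stops for futility if

$$\hat\pi_{\hat{i}_k,k} < p_1 \quad \text{and} \quad \ell_{k,1} \ge \tilde b, \tag{26}$$

and otherwise rejects H0 at the Kth analysis if

$$\hat\pi_{\hat{i}_K,K} > p_0 \quad \text{and} \quad \ell_{K,0} \ge c. \tag{27}$$

As in Section 2.2, the thresholds b, b̃, and c are chosen so that (12) holds and the power is close to 1 − β. Details are given in Section 5.2.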

3.2. Modeling the dependence between yi and zi

A flexible method for modeling the general case where the toxicity and efficacy observations may not be independent is to introduce d additional parameters in the form of the global cross ratios

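$$\rho_i = \frac{\Pi_i(0,0)\,\Pi_i(1,1)}{\Pi_i(1,0)\,\Pi_i(0,1)}, \quad i = 1, \ldots, d, \quad \text{where} \quad \Pi_i(y,z) = P(y_t = y, z_t = z \mid x_t = \lambda_i).$$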

Dale [23] proposed using the global cross ratio as a useful measurement of dependence in discrete ordered bivariate responses and they have been recently used by Yin et al. [20] in a Bayesian Phase I-II design. If the toxicity and efficacy responses are independent, then ρi = 1 for all i = 1, …, d. The complete joint distribution Πi(y, z) of the toxicity and efficacy responses can be recovered from the parameters πi, ϕi, ρi, i = 1, …, d through the following formulas:

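$$\Pi_i(1,1) = \begin{cases} \bigl(a_i - \sqrt{a_i^2 + b_i}\,\bigr) / [2(\rho_i - 1)], & \text{if } \rho_i \ne 1, \\ \pi_i \phi_i, & \text{if } \rho_i = 1, \end{cases}$$
$$\Pi_i(1,0) = \phi_i - \Pi_i(1,1), \quad \Pi_i(0,1) = \pi_i - \Pi_i(1,1), \quad \Pi_i(0,0) = 1 - \pi_i - \phi_i + \Pi_i(1,1),$$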

where ai = 1 + (πi + ϕi)(ρi − 1) and bi = −4ρi(ρi − 1)πiϕi. The log-likelihood at the kth interim analysis of Phase II for this general case is

$$\ell_k(\pi, \phi, \rho) = \log\Bigl\{\prod_{t=1}^{\tau_k} \Pi_{i_t}(y_t, z_t)\Bigr\} \quad \text{where } x_t = \lambda_{i_t}, \tag{28}$$

and the log-likelihood ratio statistics at the kth interim analysis for testing H0: π_{i*} ≤ p0 vs. H1: π_{i*} ≥ p1 are given by

$$\ell_{k,j} = \ell_k(\hat\pi_k, \hat\phi_k, \hat\rho_k) - \ell_k(\pi_k^j, \phi_k^j, \rho_k^j), \quad j = 0, 1, \tag{29}$$

with stopping rules as above in (25)–(27), where π̂k, ϕ̂k, ρ̂k are MLEs maximizing (28) subject to the order restrictions π1 ≤ … ≤ πd and ϕ1 ≤ … ≤ ϕd, and $\pi_k^j, \phi_k^j, \rho_k^j$ maximize (28) subject to these order restrictions plus the constraints (23).
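The recovery of the joint cell probabilities from the marginals and the global cross ratio can be sketched as below. The cells are indexed as (y, z) with y the toxicity indicator, following the definition of Πi(y, z), so the marginal toxicity probability ϕ is the sum of the (1, 1) and (1, 0) cells and the marginal efficacy probability π is the sum of the (1, 1) and (0, 1) cells. The numerical values and the function name are illustrative.

```python
import math


def joint_cells(pi, phi, rho):
    """Joint distribution of (y, z) at one dose level from the efficacy
    probability pi, toxicity probability phi and global cross ratio rho."""
    if abs(rho - 1.0) < 1e-12:
        p11 = pi * phi  # independence
    else:
        a = 1.0 + (pi + phi) * (rho - 1.0)
        b = -4.0 * rho * (rho - 1.0) * pi * phi
        p11 = (a - math.sqrt(a * a + b)) / (2.0 * (rho - 1.0))
    return {
        (1, 1): p11,
        (1, 0): phi - p11,            # toxicity without response
        (0, 1): pi - p11,             # response without toxicity
        (0, 0): 1.0 - pi - phi + p11,
    }


cells = joint_cells(pi=0.4, phi=0.3, rho=2.0)  # hypothetical values
cross_ratio = cells[(0, 0)] * cells[(1, 1)] / (cells[(1, 0)] * cells[(0, 1)])
print(cells, cross_ratio)
```

The log-likelihood (28) is then the sum of the logarithms of the cell probabilities picked out by each patient's observed (yt, zt) at that patient's dose level.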

4. Simulation studies

4.1. Operating characteristics of the traditional and proposed Phase I-II designs on a continuous dose space

To investigate the effect of uncertainty in the estimate η̂ on the operating characteristics of the Phase II hypothesis test that is used in current practice, we first simulated a Phase I design, which we take to be EWOC introduced by Babb et al. [24], followed by Simon’s optimal 2-stage design. EWOC is a popular dose-finding method originally proposed for continuous dose spaces, which we consider here. Motivated by a real trial for 5-fluorouracil to treat solid colon tumors described in Babb et al. [24], we let [xmin, xmax] = [140, 425] denote the known range of acceptable dose values and assume m = 24 patients are treated in Phase I. We parametrize the toxicity responses’ distribution F(x; ·) by η and ρ = F(xmin; θ) rather than θ = (θ1, θ2) and assume that (ρ, η) has the uniform distribution on [0, q] × [xmin, xmax] as its prior distribution; see [1, Section 2] for more details. Fixing η = 250, q = 1/3 and ρ = .1, Table 1 gives some operating characteristics of this Phase I-II design using Simon’s design for testing (1) with p0 = .1 and p1 = .25, with β = .2 and various values of α. These were evaluated from 100,000 simulations using the above values of η, xmin, and xmax, and under the efficacy parameter ψ = (−3.895, .00679) chosen so that p(η; ψ) = p0 = .1 and p(xmax; ψ) = .9.
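The reparametrization of the toxicity curve by (ρ, η) instead of θ follows from the two conditions F(xmin; θ) = ρ and F(η; θ) = q under model (3). A minimal sketch, using the values η = 250, q = 1/3, ρ = .1 and xmin = 140 stated above; the function names are illustrative.

```python
import math


def logit(u):
    return math.log(u / (1.0 - u))


def tox_prob(x, theta):
    """Logistic dose-toxicity curve (3)."""
    return 1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x)))


def theta_from_rho_eta(rho, eta, q, x_min):
    """Solve theta1 + theta2 * x_min = logit(rho) and
    theta1 + theta2 * eta = logit(q) for (theta1, theta2)."""
    theta2 = (logit(q) - logit(rho)) / (eta - x_min)
    theta1 = logit(rho) - theta2 * x_min
    return theta1, theta2


theta = theta_from_rho_eta(rho=0.1, eta=250.0, q=1.0 / 3.0, x_min=140.0)
print(theta, tox_prob(140.0, theta), tox_prob(250.0, theta))
```

The slope is positive whenever ρ < q and η > xmin, which is what the prior support [0, q] × [xmin, xmax] provides apart from its boundary.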

For several values of the parameters (n1, n2, r1, r) of Simon’s two-stage design [4, Table 2] of the Phase II trial, Table 1 compares the prescribed type I error probability α of Simon’s test with the actual probability of rejecting H0 : p(η; ψ) ≤ p0, denoted by P(−H0|·), for three choices of the MTD estimate η̂ that is used as the dose for the Phase II trial. The three types of estimation are the MLE, the final posterior mean of the Phase I trial (which is what the original version of the Bayesian CRM [10] would use), and the dose recommended by EWOC that is used in the Phase I design of this simulation study. Table 1 shows that the actual probability P(−H0|·) of falsely rejecting H0 is largely inflated over the prescribed value α of the type I error probability used for the Phase II trial, especially when the posterior mean is used for the MTD estimate η̂. The reason for this is the frequent over-estimation of η by η̂, as shown by the 5-number summary (minimum, first quartile Q1, median, third quartile Q3, and maximum) of the 100,000 simulated values of η̂ given in the table. Although under-estimation of η by η̂ also occurs, it is more often over-estimated, which causes rejection of H0 at rates higher than prescribed by the design parameters of Simon’s test. Also given in the table are the mean E(η̂) and the root-mean-square-error RMSE(η̂) = {E(η̂ − η)²}^{1/2} of the estimated MTD. We comment that here we have only considered the most basic versions of CRM [10] and EWOC [24], and many variants have been proposed since then (e.g., [25, 26]). It seems likely that the properties of η̂ could be improved using one of these variants of CRM or EWOC, but since our focus here is more on the interaction between Phase I and Phase II, we do not explore that option here.

Focusing on the traditional two-stage design with α = .05 in Table 1 (denoted here by Trad) and concentrating on MLE estimation for simplicity, we compare its operating characteristics with those of the new Phase I-II design described in Section 2 (denoted by New). In order to match the Trad design’s probability P(rej. H0) = .18 of falsely rejecting H0 : p(η; ψ) ≤ p0, where p0 = .1, at the parameter values determined by p(η; ψ) = .1, we choose critical values b = 3, b̃ = 3.5, and c = .7 in (9)–(11). Although a type I error probability of .18 is usually deemed too high, we can keep the probability of falsely rejecting H0 close to .05 if we use p0 = .05 instead, as shown in Table 2, which compares the operating characteristics of the Trad and New designs based on 10,000 simulations. The two designs both have Phase I sample size of m = 24 and maximum Phase II sample size of 43, and the New design achieves this through Phase II group sizes 10, 10, 10, 10, and 3. As in Table 1, η is fixed at 250 and ρ = .1, while ψ is specified by fixing p(xmax; ψ) = .9 and varying p(η; ψ) over the values .05, .1, .2, .3, .4, and .5. For each scenario, Table 2 gives P(rej. H0), in which p0 = .1, and the total expected sample size EN over the two phases. It shows that the New design has smaller P(rej. H0) than Trad for p(η; ψ) = .05 and larger P(rej. H0) for all values p(η; ψ) > .1, and uniformly smaller expected sample size, substantially so for parameter values p(η; ψ) > .3. In addition, Table 2 also gives the probability p(η̂rec; ψ) of efficacious response at the recommended dose η̂rec, which for Trad is the MTD estimate at the end of Phase I and for New is the final MLE at the end of Phase II, the overall response rate (denoted Eff) for subjects in the study, the overall overdose rate (denoted OD) of subjects treated at doses above the true MTD, and the root-mean-square-error RMSE(η̂rec) of the recommended dose. The RMSE of the recommended dose for New is substantially smaller than Trad throughout, which we attribute to its continued estimation of η during Phase II. The values p(η̂rec; ψ) and Eff are comparable to p(η; ψ) throughout for New, while the corresponding values for Trad are larger, and Trad has larger OD values than New.

Table 2 shows a dramatic improvement of the New design relative to the Trad design in terms of both power and average sample size. In order to discern how much of this improvement is due to the group sequential sampling used (relative to Simon’s 2-stage design) versus how much is due to the continued estimation of the MTD during Phase II that the proposed design allows, more simulation studies were performed whose results are in Tables 3 and 4. In addition, both of these simulation studies were performed under different parameter values than in Table 2 in order to see the proposed design’s performance over a broad range of scenarios.

Table 3.

Operating characteristics of the traditional (denoted Trad) and new (denoted New) designs described on page 9. The toxicity parameter is fixed at (η, ρ) = (350, .2), and the six values of the efficacy parameter ψ are determined by p(xmax; ψ) = .95 and p(η; ψ) = .4, .5, .6, .7, .8, .9. All designs have Phase I sample size m = 24 and maximum Phase II sample size 43, for a maximum Phase I-II sample size of 67. Eff is the overall response rate for subjects in the study, OD is the overall overdose rate of subjects treated at doses above the true MTD, and RMSE(η̂rec) is the root-mean-square-error of the recommended dose.

p(η; ψ)       40%           50%           60%           70%           80%           90%
              Trad   New    Trad   New    Trad   New    Trad   New    Trad   New    Trad   New
p(η̂rec; ψ)    .183   .179   .207   .219   .250   .270   .324   .361   .474   .511   .798   .812
Eff           .129   .123   .153   .151   .195   .195   .270   .280   .432   .439   .785   .788
OD            .111   .108   .109   .108   .111   .107   .109   .111   .109   .106   .102   .093
RMSE(η̂rec)    73.3   65.4   72.6   65.8   73.2   65.8   72.8   65.9   73.3   65.8   72.5   65.8
P(rej. H0)    .040   .043   .050   .050   .057   .062   .071   .088   .085   .134   .234   .590
EN            61.2   60.7   62.8   62.2   64.6   64.0   66.2   65.7   66.8   66.6   65.9   63.7

Table 4.

Operating characteristics of the traditional (denoted Trad) and new (denoted New) designs described on page 9. The toxicity parameter is fixed at (η,ρ) = (200, .15), and the six values of the efficacy parameter ψ are determined by p(xmax; ψ) = .99 and p(η; ψ) = .025, .05, .25, .45, .65, .85. All designs have Phase I sample size m = 24 and maximum Phase II sample size 30, for a maximum Phase I-II sample size of 54. Eff is the overall response rate for subjects in the study, OD is the overall overdose rate of subjects treated at doses above the true MTD, and RMSE(η̂rec) is the root-mean-square-error of the recommended dose.

p(η; ψ)       2.5%          5%            25%           45%           65%           85%
              Trad   New    Trad   New    Trad   New    Trad   New    Trad   New    Trad   New
p(η̂rec; ψ)    .011   .032   .017   .061   .082   .263   .197   .454   .383   .648   .706   .849
Eff           .008   .038   .011   .067   .061   .268   .178   .458   .377   .656   .706   .850
OD            .612   .611   .613   .615   .367   .564   .595   .569   .575   .578   .570   .570
RMSE(η̂rec)    31.3   25.8   31.7   27.6   30.1   34.7   30.8   35.4   30.9   33.7   30.7   35.3
P(rej. H0)    .007   .002   .010   .010   .209   .185   .636   .478   .940   .763   .999   .819
EN            33.9   34.1   34.8   36.8   42.7   52.8   49.4   54.0   53.2   53.9   53.9   53.9

In Table 3, the traditional Phase I-II design (denoted Trad) was implemented but, instead of using Simon’s two-stage design for Phase II, the same group sequential sampling scheme that New used in Table 2 with group sizes 10, 10, 10, 10, and 3 was used. The proposed design (denoted New) was also implemented using these group sizes and compared with Trad, so that the only difference between the two designs is that Trad does not update the estimate η̂ of the MTD during Phase II. To see the performance of the proposed design in a different scenario than Table 2, using the same dose range [xmin, xmax] = [140, 425] and prior structure as there, the true MTD η was taken to be 350 and the probability of toxicity ρ at dose xmin was taken to be .2. This scenario represents a much “flatter” dose-toxicity curve than in Table 2. In this set-up, the Phase II null hypothesis H0: p(η; ψ) ≤ p0 was tested with p0 = .5, and Table 3 contains the operating characteristics of these designs at six different values of the response parameter ψ determined by p(xmax; ψ) = .95 and p(η; ψ) = .4, .5, .6, .7, .8, and .9. Unlike the Trad design in Table 2, which does not achieve the overall type I error probability P(rej. H0) at p(η; ψ) = p0 equal to the prescribed value α = .05 because of the variance of the MTD estimate used in Phase II, here the Trad design uses the stopping rule (9)–(11) with the critical values b, b̃, and c chosen so that this quantity is equal to α for p0 = .5; they are b = 24.1, b̃ = 64.4, and c = 21.8. The New design uses the values b = 18.8, b̃ = 54.4, and c = 11.7, also chosen so that its type I error probability is α; these are slightly different from Trad’s critical values because New continues to update η̂ during Phase II. Table 3 contains the operating characteristics of these designs based on 10,000 Monte Carlo replications at each parameter value. As might be expected from designs using the same sampling scheme, Trad and New have very similar expected sample sizes. Sample sizes are in general larger in this scenario than in Table 2, which is also to be expected because the flatness of the dose-toxicity curve makes η difficult to estimate accurately. This difficulty is reflected in the power of both designs being low until p(η; ψ) reaches 90%, where the power of New is 59% but Trad, at 23.4%, is still severely underpowered. Note also that even though the flatness of the dose-toxicity curve makes the MTD difficult to estimate accurately, the chance of overdose is relatively low. Overall, New is slightly but consistently more efficient with smaller RMSE despite having slightly smaller average sample size, and New has higher power and response probabilities p(η̂rec; ψ) over the range of parameter values in the alternative. These results are consistent with the two designs using the same Phase II sampling scheme but New using continued estimation of the MTD throughout Phase II.

Table 4 considers another scenario, with a smaller Phase II sample size, in which both Trad and New use for Phase II a two-stage design with early stopping only for futility. For Trad this is Simon’s two-stage design and for New this is the stopping rule (9)–(11) with K = 2 groups and b fixed at ∞ so that only early stopping for futility can occur. In this scenario both Trad and New have maximum Phase II sample size 30 (compared with 43 in Tables 2 and 3) and Phase I sample size m = 24. To achieve this, Trad uses Simon’s [4, Table 1] design with r1 = 0, n1 = 9, r = 3, and n2 = 21 for α = .05 and β = .1 at p0 = .05 and p1 = .25. As in Tables 1 and 2, the Trad design using these parameters does not achieve the type I error probability at the prescribed value α = .05 because of the variance of the MTD estimate used in Phase II. Indeed, Table 4 shows its actual type I error probability to be .01 at p(η; ψ) = p0 = .05. Unlike Table 2, which shows inflation of the type I error probability, here the type I error probability is substantially smaller than the prescribed value α = .05. In order to make a meaningful comparison between designs we choose the parameters of the New design to match this smaller value of the type I error probability, for which we use b = ∞ (to allow early stopping only for futility), b̃ = 2.1, and c = 19.3 in (9)–(11) and Phase II group sizes 9 and 21, the same as the Simon design. The operating characteristics of these designs are given in Table 4, based on 10,000 Monte Carlo replications each, in yet another scenario with η = 200, ρ = .15, and six values of ψ determined by p(xmax; ψ) = .99 and p(η; ψ) = .025, .05, .25, .45, .65, and .85. The dose range and prior structure are the same as in Table 3. The response probabilities p(η̂rec; ψ) at New’s recommended dose stay much closer to the true values than at Trad’s recommended dose, likely due to New’s update of the MTD estimate during Phase II. The overall response rate of subjects in the study is also substantially higher in New than in Trad. The two designs have similar average sample sizes, reflective of their similar sampling schemes, and Trad has higher power in the alternative. The RMSEs of the two designs are small and relatively close, with Trad’s being slightly smaller. Note, however, that RMSE(η̂rec), being based on squared error, ignores the sign of η̂rec − η, and that the results on p(η̂rec; ψ) show that η̂rec tends to under-estimate η.
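
For readers who want to check the nominal properties of the Simon design quoted above, the following is a minimal sketch that computes the exact rejection probability, probability of early termination, and expected Phase II sample size for r1 = 0, n1 = 9, r = 3, n2 = 21. It assumes the idealized setting in which every Phase II patient has the same fixed response probability p, which is precisely the assumption that the simulations in Table 4 relax by accounting for the variability of the MTD estimate. The function names are illustrative and are not taken from the authors' software.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def simon_two_stage(p, r1, n1, r, n):
    """Exact operating characteristics of a two-stage design at a FIXED response rate p.

    Stage 1: treat n1 patients; stop for futility if responses <= r1.
    Stage 2: treat n - n1 more patients; reject H0 if total responses > r.
    Returns (P(reject H0), P(early termination), expected sample size).
    """
    n2 = n - n1
    pet = sum(binom_pmf(x1, n1, p) for x1 in range(r1 + 1))
    reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        # Second stage must contribute more than r - x1 responses.
        need = max(r - x1 + 1, 0)
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(need, n2 + 1))
        reject += binom_pmf(x1, n1, p) * tail
    expected_n = n1 + (1.0 - pet) * n2
    return reject, pet, expected_n


# Design quoted in the text: r1 = 0, n1 = 9, r = 3, n = n1 + n2 = 30.
alpha, pet0, en0 = simon_two_stage(0.05, r1=0, n1=9, r=3, n=30)
power, _, _ = simon_two_stage(0.25, r1=0, n1=9, r=3, n=30)
print(f"type I error at p0 = .05: {alpha:.4f}")
print(f"early-termination probability at p0: {pet0:.4f}")
print(f"expected Phase II sample size at p0: {en0:.1f}")
print(f"power at p1 = .25: {power:.4f}")
```

Comparing the nominal type I error printed here with the value .010 reported for Trad in Table 4 gives a sense of how much the uncertainty in the dose changes the operating characteristics.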

4.2. Performance of the traditional and new Phase I-II designs on discrete dose space under monotonicity constraints

To evaluate the performance of the Phase II method proposed in Section 3 for monotonic efficacy and toxicity models on a discrete dose space, we performed a similar study to the one in Section 4.1, assuming independence of the toxicity and efficacy responses for simplicity; we have performed additional simulations under dependent responses using the model described in Section 3.2 and the performance of the new method is similar. Again focusing on the Trad design in Table 1 and using isotonic MLE estimation (22) for both the Trad and new (denoted by New) designs, the estimated operating characteristics are compared in Table 5 based on 10,000 simulated trials, wherein the Phase I doses of the m = 24 patients are uniformly sampled from the dose set Λ = {140, 200, 250, 300, 350, 425}. In this setting, the Trad design with nominal level α = .05 for testing πî* ≤ p0 actually has type I error probability P(rej. H0) = .211 of falsely rejecting H0: πi* ≤ p0 = .1, and so in order to compare New and Trad in this setting we choose critical values b = .13, b̃ = 3.3, and c = .03 in (9)–(11) in order to approximately match this, giving P(rej. H0) = .201 for New at πi* = .1. In order to have the same maximum Phase II sample size M = 43 as Trad, again New uses group sequential sampling with group sizes 10, 10, 10, 10, and 3. In this discrete nonparametric setting, the unknown parameters are the true toxicity and efficacy probabilities ϕ and π given by (19), and in order to compare Trad and New in a setting similar to the one in Section 4.1, we consider values of ϕ and π given by the corresponding parametric models F(x; θ) and p(x; ψ) and parameter values given there: η = λi* is fixed at 250, ρ = ϕ1 = .1, p(xmax; ψ) = πd = .9, and p(η; ψ) = πi* = .05, .1, .2, .3, .4, and .5. The relative performance of Trad and New is very similar to that in the previous section: the new design has smaller P(rej. H0) than Trad for parameter values πi* ≤ .1 in the null hypothesis, larger P(rej. H0) for all values πi* > .1 in the alternative, and uniformly smaller expected sample size, substantially so when πi* is large or small relative to p0 = .1. The other operating characteristics given in the table are the same as in Table 2: the response rate πî* at the final recommended dose, the overall response rate (Eff) and overdose rate (OD) of patients in the study, and the RMSE of the final recommended dose λî*. The Eff rate of New is larger than that of Trad at all parameter values considered, which we attribute to the proposed design’s ability to vary the dose throughout Phase II, and hence to “correct”, to some measure, for a poorly chosen MTD estimate at the end of Phase I. The OD rates of the two designs are close, with New being sometimes smaller and sometimes larger. The RMSE of New is slightly larger than, but comparable to, that of Trad, which we attribute to its markedly smaller average sample size.

Table 5.

Operating characteristics of the traditional (denoted Trad) and new (denoted New) designs described in Section 4.2. The true toxicity and efficacy probabilities are determined by the same parameters as in Table 2: λi*= 250, ϕ1 = .1, πd = .9, and the six cases πi*= .05, .1, .2, .3, .4, and .5 as described in Section 4.2. Eff is the overall response rate for subjects in the study, OD is the overall overdose rate of subjects treated at doses above the true MTD, and RMSE(η̂rec) is the root-mean-square-error of the recommended dose.

πi*           5%            10%           20%           30%           40%           50%
              Trad   New    Trad   New    Trad   New    Trad   New    Trad   New    Trad   New
πî*           .072   .030   .116   .061   .194   .131   .274   .206   .357   .295   .449   .395
Eff           .185   .196   .225   .231   .286   .296   .350   .364   .416   .441   .492   .524
OD            .390   .376   .388   .363   .366   .361   .347   .370   .328   .390   .320   .406
RMSE(λî*)     56.5   60.1   57.4   60.9   56.7   59.1   56.9   59.2   56.5   58.3   57.2   57.9
P(rej. H0)    .117   .076   .211   .201   .410   .486   .615   .729   .805   .895   .931   .981
EN            56.5   38.7   49.4   40.7   54.8   41.7   59.8   40.2   63.6   37.7   65.9   35.6

5. Group sequential likelihood theory and implementation details

5.1. Theory of group sequential GLR tests

We first assume independence between yi and zi given xi as in Section 2.2. In this case, the likelihood function, based on a sample of size τk, is of the form L1,k(θ)L2,k(ψ), where

$$L_{1,k}(\theta)=\prod_{i=1}^{\tau_k}[F(x_i;\theta)]^{y_i}[1-F(x_i;\theta)]^{1-y_i},\qquad L_{2,k}(\psi)=\prod_{i=1}^{\tau_k}[p(x_i;\psi)]^{z_i}[1-p(x_i;\psi)]^{1-z_i}.$$

The GLR statistic for testing p(η; ψ) = pj, which is the boundary of Hj, is

$$\log\left[\frac{\sup_{\theta}L_{1,k}(\theta)\times\sup_{\psi}L_{2,k}(\psi)}{\sup_{(\theta,\psi):\,p(\eta;\psi)=p_j}L_{1,k}(\theta)L_{2,k}(\psi)}\right],\tag{30}$$

and the signed root likelihood ratio statistic is approximately normal under p(η; ψ) = pj; see [17, p. 513]. Note that p(η; ψ) = pj can be expressed as an equality constraint ψ1 + ηψ2 = logit(pj) on the linear function ψ1 + ηψ2 of ψ, and we can reparameterize ψ as (ψ1, ψ1 + ηψ2) and θ as (η, ρ). Therefore, standard asymptotic analysis of GLR statistics shows that under p(η; ψ) = pj, (30) has the same limiting distribution as

$$\log\left[\frac{\sup_{\psi}L_{2,k}(\psi)}{\sup_{\psi:\,p(\hat\eta_k;\psi)=p_j}L_{2,k}(\psi)}\right],\tag{31}$$

jointly over 1 ≤ k ≤ K; see [27, Section 9.3(iii)]. Because the xi are sequentially determined random variables (based on group sequential estimates of the MTD), we use the martingale central limit theorem ([28], p. 411) here instead of the traditional central limit theorem as in [17]. Note that (31) is the same as ℓk,j defined in (8). For the dependent case in Section 2.3, the likelihood function L2,k(ψ) involves both zi and yi in view of (14) and (15) but does not depend on η. A similar argument can be used to show that the GLR statistic at the kth interim analysis is still asymptotically equivalent to (31).
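
As a concrete illustration of the statistic (31) in the independent case, the sketch below evaluates it for a small made-up data set under the logistic form logit p(x; ψ) = ψ1 + ψ2x, using the equality constraint ψ1 + η̂ψ2 = logit(pj) to reduce the constrained maximization to a single free parameter. Several things here are assumptions for illustration only: the doses and responses are hypothetical, the doses are centered and rescaled for numerical stability, the optimizer is a simple damped Newton iteration chosen for self-containment rather than anything taken from the authors' software, and the positive-slope constraint of Section 5.2 is not enforced.

```python
import math


def loglik(psi1, psi2, xs, zs):
    """log L_{2,k}(psi) for binary responses zs at doses xs, logit p(x) = psi1 + psi2*x."""
    total = 0.0
    for x, z in zip(xs, zs):
        t = psi1 + psi2 * x
        total += z * t - math.log1p(math.exp(t))
    return total


def fit_unconstrained(xs, zs, iters=100, tol=1e-12):
    """Damped Newton-Raphson for the two-parameter logistic MLE."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for x, z in zip(xs, zs):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g1 += z - p
            g2 += (z - p) * x
            h11 += w
            h12 += w * x
            h22 += w * x * x
        det = h11 * h22 - h12 * h12
        d1 = (h22 * g1 - h12 * g2) / det
        d2 = (h11 * g2 - h12 * g1) / det
        step, base = 1.0, loglik(a, b, xs, zs)
        # Halve the step until the log-likelihood does not decrease.
        while loglik(a + step * d1, b + step * d2, xs, zs) < base and step > 1e-8:
            step /= 2.0
        a, b = a + step * d1, b + step * d2
        if abs(step * d1) + abs(step * d2) < tol:
            break
    return a, b


def fit_constrained(xs, zs, eta_hat, pj, iters=100, tol=1e-12):
    """MLE under psi1 + eta_hat*psi2 = logit(pj); only psi2 is free."""
    c = math.log(pj / (1.0 - pj))
    b = 0.0
    for _ in range(iters):
        g = h = 0.0
        for x, z in zip(xs, zs):
            u = x - eta_hat
            p = 1.0 / (1.0 + math.exp(-(c + b * u)))
            g += (z - p) * u
            h += p * (1.0 - p) * u * u
        d = g / h
        step, base = 1.0, loglik(c - eta_hat * b, b, xs, zs)
        while (loglik(c - eta_hat * (b + step * d), b + step * d, xs, zs) < base
               and step > 1e-8):
            step /= 2.0
        b += step * d
        if abs(step * d) < tol:
            break
    return c - eta_hat * b, b


def glr_statistic(xs, zs, eta_hat, pj):
    """Statistic (31): unconstrained minus constrained maximized log-likelihood."""
    a, b = fit_unconstrained(xs, zs)
    a0, b0 = fit_constrained(xs, zs, eta_hat, pj)
    return loglik(a, b, xs, zs) - loglik(a0, b0, xs, zs)


# Hypothetical data: two patients at each dose, doses rescaled as (dose - 250) / 100.
doses = [140, 200, 250, 300, 350, 425] * 2
xs = [(d - 250) / 100.0 for d in doses]
zs = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
eta_hat = 0.0  # MTD estimate of 250 on the rescaled dose axis (assumed)
stat = glr_statistic(xs, zs, eta_hat, pj=0.1)
print(f"GLR statistic for p(eta_hat; psi) = 0.1: {stat:.3f}")
```

A useful sanity check is that the statistic vanishes when pj is set to the fitted response probability at η̂, since the unconstrained MLE then already satisfies the constraint.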

The group sequential GLR test of H0 is much more flexible and efficient than Simon’s 2-stage likelihood ratio test [4] for Phase II cancer trials. As noted in the last paragraph of Section 1.1, Simon’s procedure actually tests p(η̂; ψ) ≤ p0 with all doses set at the MTD estimate η̂ from the Phase I toxicity data, while the proposed test considers the more natural H0: p(η; ψ) ≤ p0 and uses all the observed (xi, yi, zi) up to the time of interim analysis to test H0. Moreover, unlike Simon’s two-stage design which is actually a group sequential test with two groups and only allows futility stopping in the first stage, we use a more flexible group sequential design that allows early stopping for both efficacy and futility. In addition, the estimate of η of the Phase I-II trial uses data up to the end of the trial. The group sequential GLR test uses the alternative p1 implied by the maximum size τK (see Section 5.2) to derive the futility stopping criterion, namely stopping when there is enough evidence against H1: p(η; ψ) ≥ p1. Similarly, it stops early for efficacy if the GLR statistics show enough evidence against H0: p(η; ψ) ≤ p0.

The group sequential GLR test in Section 3 that considers discrete dose levels also involves a finite number of parameters satisfying certain monotonicity constraints. Therefore the theory of group sequential tests that we have applied to the logistic regression models in Section 2 can also be applied to Section 3 that imposes certain structure on the parameter space. Lai and Shih [17, Section 3] have established the asymptotic efficiency of these group sequential GLR tests in terms of the expected sample size and power function. Here we extend this theory in two ways. The first extension is from the i.i.d. model to the regression model, with sequentially determined regressors xi. The second extension is to replace the GLR statistics by more easily computable and interpretable approximations that have the same asymptotic distributions. As noted above, martingale theory used in conjunction with likelihood theory provides the key tools for such extensions.

5.2. Implementation details

The MLEs of θ and ψ involved in the design proposed in Section 2 should be computed under the assumption of positive slope, i.e., θ2 > 0 and ψ2 > 0. In practice this can be imposed by choosing a small value δ > 0 and computing the MLEs under the constraint θ2 ≥ δ and ψ2 ≥ δ. A related issue is that the MLEs of θ and ψ may not exist in the first few stages of Phase I (see p. 195 of [29]). In this case, their Bayes estimates from a Bayesian model-based design can be used instead.

The alternative p1 > p0 is that implied by the maximum sample size τK and the desired type I and II error probabilities α and β, respectively. That is, for the GLR test that has fixed sample size τK and rejects H0 if and only if

$$p(\hat\eta_K;\hat\psi_K)>p_0\quad\text{and}\quad\min_{\psi\in S_K^0}\left[\ell_K(\hat\psi_K)-\ell_K(\psi)\right]\ge C_\alpha,\tag{32}$$

let p1 > p0 be the alternative satisfying

$$\min_{\psi\in S_0^1}P_{\theta,\psi}\left[\text{(32) occurs}\mid\mathcal{F}_0\right]=1-\beta.\tag{33}$$

In (32), Cα is such that

$$\max_{\psi\in S_0^0}P_{\theta,\psi}\left[\text{(32) occurs}\mid\mathcal{F}_0\right]=\alpha\tag{34}$$

and the doses x1,…, xτK are chosen by some design. The computation of the left-hand sides of (33) and (34) will be described below.
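
To make the notion of an implied alternative concrete, here is a minimal sketch in a deliberately simplified setting: all τK patients are treated at one fixed dose with response probability p, and an exact one-sided binomial test stands in for the GLR test (32). Under these assumptions (34) reduces to choosing the rejection cutoff with level at most α, and (33) reduces to solving for the p1 at which that test has power 1 − β. The function names, the choice β = .1, and the use of bisection are illustrative assumptions, not part of the paper's procedure.

```python
from math import comb


def upper_tail(r, n, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(r, n + 1))


def implied_alternative(n, p0, alpha, beta):
    """Cutoff r with exact level <= alpha, and the p1 where the power equals 1 - beta."""
    # Smallest r such that P(X >= r | p0) <= alpha (the analogue of C_alpha in (34)).
    r = next(k for k in range(n + 2) if upper_tail(k, n, p0) <= alpha)
    lo, hi = p0, 1.0
    for _ in range(100):
        # Bisection is valid because the power P(X >= r | p) increases with p.
        mid = 0.5 * (lo + hi)
        if upper_tail(r, n, mid) < 1.0 - beta:
            lo = mid
        else:
            hi = mid
    return r, 0.5 * (lo + hi)


# Maximum Phase II size 43 and p0 = .1 as in Section 4.1; beta = .1 is assumed.
r, p1 = implied_alternative(n=43, p0=0.1, alpha=0.05, beta=0.1)
print(f"reject H0 when responses >= {r}; implied alternative p1 = {p1:.3f}")
```

In the actual design the doses vary across patients and the probabilities in (33) and (34) are approximated as described in the remainder of this section, so the numbers from this sketch are only indicative.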

The thresholds b, b̃, and c in (9)–(11) can be determined as follows. Let 0 < ε < 1/2 and first choose b̃ so that

$$\max_{\psi\in S_0^1}P_{\theta,\psi}\left[\text{(10) occurs for some }1\le k<K\mid\mathcal{F}_0\right]=\varepsilon\beta.\tag{35}$$

Then choose b so that

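$$\max_{\psi\in S_0^0}P_{\theta,\psi}\left[\text{(9) occurs for some }1\le k<K,\ p(\hat\eta_{\tau_{k'}},\hat\psi_{\tau_{k'}})\ge p_1\text{ and }\ell_{k',1}<\tilde b\text{ for all }k'<k\mid\mathcal{F}_0\right]=\varepsilon\alpha,\tag{36}$$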

and finally choose c so that

$$\max_{\psi\in S_0^0}P_{\theta,\psi}\left[\text{(11) occurs and (9), (10) do not occur for any }1\le k<K\mid\mathcal{F}_0\right]=(1-\varepsilon)\alpha.\tag{37}$$

The determination of b, b̃, and c in (35)–(37) follows that in [17] and aims at controlling the type I error probability (12) and keeping the power (13) close to 1 – β.

As in Section 3.4 of [17], we can use the joint asymptotic normality of the signed root likelihood ratio statistics to approximate the probabilities in (33)–(37). Because the GLR statistics are asymptotic pivots, the convergence in distribution holds uniformly over S_0^1 or S_0^0, and therefore the minimum (or maximum) over S_0^1 or S_0^0 in the left-hand sides of (33)–(37) poses no additional difficulty when we use the normal approximation. An alternative to the normal approximation is to use Monte Carlo simulation similar to that used in bootstrap tests. Bootstrap theory suggests that we can simulate from the estimated distribution under an assumed composite hypothesis since the GLR statistic is an approximate pivot under that hypothesis. Thus, the bootstrap test chooses the ψ ∈ S_0^j in (33)–(37) to be the MLE, based on the Phase I data ℱ0, of ψ under the constraint p(η̃; ψ) = pj. In the simulation studies in Section 4, we use 10,000 bootstrap simulations to estimate the probabilities in (33)–(37). The implementation of the group sequential order-restricted GLR test of H0: πi* ≤ p0 in Section 3 is similar, as we have explicit formulas (20) and (21). A software package to design the proposed Phase I-II trial has been developed using R and is available at the website http://med.stanford.edu/biostatistics/ClinicalTrialMethodology.html.
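
The Monte Carlo route to a critical value can be sketched as follows, again in a simplified stand-in setting rather than the full Phase I-II design: responses are simulated at a single fixed dose under the boundary value p = p0, a one-sided binomial GLR statistic is computed for each simulated trial, and the threshold is taken as an upper empirical quantile. The statistic, the function names, the seed, and the use of a strict inequality to handle ties in the discrete simulated values are all assumptions made for illustration; the authors' R package should be consulted for the actual procedure.

```python
import math
import random


def binom_glr(x, n, p0):
    """One-sided binomial GLR statistic: n * KL(p_hat, p0) if p_hat > p0, else 0."""
    p_hat = x / n
    if p_hat <= p0:
        return 0.0
    stat = x * math.log(p_hat / p0)
    if x < n:
        stat += (n - x) * math.log((1.0 - p_hat) / (1.0 - p0))
    return stat


def monte_carlo_threshold(n, p0, alpha, n_sim, rng):
    """Return (C, simulated stats) with at most alpha * n_sim simulated stats above C."""
    stats = []
    for _ in range(n_sim):
        # Simulate one trial of n Bernoulli(p0) responses under the null boundary.
        x = sum(rng.random() < p0 for _ in range(n))
        stats.append(binom_glr(x, n, p0))
    stats.sort()
    allowed = math.floor(alpha * n_sim)  # maximum number of simulated rejections
    return stats[n_sim - allowed - 1], stats


rng = random.Random(2014)  # fixed seed so the sketch is reproducible
c_alpha, sims = monte_carlo_threshold(n=43, p0=0.1, alpha=0.05, n_sim=10_000, rng=rng)
rate = sum(s > c_alpha for s in sims) / len(sims)
print(f"estimated threshold: {c_alpha:.3f}, simulated rejection rate: {rate:.4f}")
```

The same quantile idea applies to each of (33)–(37) once the simulated trials follow the sequential dose assignment and stopping rule of the design, with ψ set to the constrained MLE described above.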

6. Discussion

The simulation studies in Section 4, which are motivated by the trial in Babb et al. [24], show that the estimate η̂ at the end of the Phase I trial can substantially over- or under-estimate η and therefore have a significantly higher or lower response rate than p(η; ψ). Another situation in which the latter can occur is when using the 3+3 dose escalation scheme in Phase I, which tends to produce a sub-therapeutic dose η̂ at the end of Phase I. Continuing dose-finding in Phase II can add substantial information for estimating η, as Section 4 has shown.

Recognizing that the dose chosen at the end of the Phase I trial may not ensure safety, Bryant and Day [30] have extended Simon’s two-stage design for the Phase II trial to incorporate toxicity outcomes in the Phase II trial by stopping the trial after the first stage if either the observed response rate is inadequate or the number of observed toxicities is excessive, and by recommending the treatment at the end of the Phase II trial only if there are both a sufficient number of responses and an acceptably small number of toxicities. Note that the Bryant-Day design still uses η̂ determined from the Phase I data to be the dose throughout the Phase II trial. We have developed herein a novel methodology which continues dose finding to estimate the MTD in Phase II and which uses the toxicity outcomes throughout the trial in a natural way, while focusing on testing the efficacy hypothesis during the Phase II component of the Phase I-II design. The methodology enables the user to carry out the novel group sequential extensions, allowing early stopping not only for futility but also for efficacy, of Simon’s two-stage design that is widely used in Phase II cancer trials. These group sequential tests use efficient GLR statistics, which we have extended herein from the traditional logistic regression models in Section 2 to robust isotonic regression models in Section 3.

Bayesian designs have been proposed for Phase II trials, allowing early stopping for efficacy or futility, and rejecting (or accepting) the hypothesis p ≤ p0 if the posterior probability of p > p0 exceeds some threshold (or falls below another threshold), thereby extending the Bayesian approach from Phase I to Phase II trials; see Chapter 4 of [31]. Yin et al. [20] and Yuan and Yin [21] have developed Bayesian Phase I-II designs to incorporate the bivariate outcomes of toxicity and efficacy to determine the dose sequentially for the next cohort of patients in the trial. Their underlying philosophy is that “with a very limited sample size in the (traditional) phase I trial, the MTD might not be obtained in a reliable way,” and therefore they aim instead at finding “the optimal dosage of a drug which has the highest effectiveness as well as tolerable toxicity” [20, p. 777]. Two motivating trials that attempt to “speed up the drug discovery and reduce the total cost” are given in [32, p. 925 and Section 3] and [20].

The trials that motivate the Phase I-II design proposed herein are traditional Phase I and Phase II trials at cancer centers of most medical schools, such as the Norris Comprehensive Cancer Center at the University of Southern California and the Cancer Institute at Stanford University. The protocols usually have small sample sizes for Phase I, followed by Simon’s two-stage design for Phase II that uses the MTD estimated from the Phase I data. Simon’s design has been popular because it allows interim analysis for a go/no go decision while preserving the type I error probability and power at the effect size used to justify the sample size specified in the protocol. The reason why investigators with whom we have worked adhere to this design although they recognize difficulties with the relatively small sample sizes for both phases is that they can publish the trial results in medical journals that prefer frequentist testing. The Phase I-II design proposed herein is an attempt to enable the investigators to perform valid group sequential tests of efficacy while continuing estimation of the MTD during the entire course of the Phase I-II trial. Even though pharmaceutical companies do not need to publish the results of Phase II trials and can focus on dose finding that incorporates both toxicity and efficacy as in the Bayesian designs of [20] and [21], many industry-sponsored Phase II trials are still conducted at academic centers where this innovative Phase I-II design can allow investigators to carry out group sequential frequentist testing of efficacy at the MTD and update the MTD estimate during the entire course of the trial. While the present paper has established the basic methodology, much of the work for its adoption still lies ahead. This includes generating some experience in actual trials and their protocols, holding monthly forums and regular consulting sessions for clinical investigators at the U.S.C. Norris Cancer Center and the Stanford Cancer Institute, and developing user-friendly software based on this experience, which will facilitate its use by other academic centers.

Acknowledgments

Bartroff’s work was supported by NSF grants DMS-0907241 and DMS-1310127 and NIH grant GMS-068968. Lai’s work was supported by NSF grant DMS-1106535 and NIH grant 5P30CA124435. Narasimhan’s work was supported by NCI Cancer Center Support Grant 5P30CA124435.

References

  • 1. Bartroff J, Lai TL. Approximate dynamic programming and its applications to the design of phase I cancer trials. Statistical Science. 2010;25:245–257.
  • 2. Bartroff J, Lai TL. Incorporating individual and collective ethics into phase I cancer trial designs. Biometrics. 2011;67:596–603. doi: 10.1111/j.1541-0420.2010.01471.x.
  • 3. Vickers AJ, Ballen V, Scher HI. Setting the bar in phase III trials: The use of historical data for determining “go/no go” decision for definitive phase II trials. Clinical Cancer Research. 2007;13:972–976. doi: 10.1158/1078-0432.CCR-06-0909.
  • 4. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials. 1989;10:1–10. doi: 10.1016/0197-2456(89)90015-9.
  • 5. Jung S, Carey M, Kim K. Graphical search for two-stage designs for phase II clinical trials. Controlled Clinical Trials. 2001;22:367–372. doi: 10.1016/s0197-2456(01)00142-8.
  • 6. Jung S, Lee T, Kim K, George S. Admissible two-stage designs for phase II cancer clinical trials. Statistics in Medicine. 2004;23(4):561–569. doi: 10.1002/sim.1600.
  • 7. Lu Y, Jin H, Lamborn KR. A design of phase II cancer trials using total and complete response endpoints. Statistics in Medicine. 2005;24(20):3155–3170. doi: 10.1002/sim.2188.
  • 8. Gooley TA, Martin PJ, Fisher LD, Pettinger M. Simulation as a design tool for phase I/II clinical trials: An example from bone marrow transplantation. Controlled Clinical Trials. 1994;15(6):450–462. doi: 10.1016/0197-2456(94)90003-5.
  • 9. Thall PF, Russell KE. A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in phase I/II clinical trials. Biometrics. 1998;54:251–264.
  • 10. O’Quigley J, Pepe M, Fisher L. Continual reassessment method: A practical design for phase I clinical trials in cancer. Biometrics. 1990;46:33–48.
  • 11. O’Quigley J, Hughes MD, Fenton T. Dose-finding designs for HIV studies. Biometrics. 2001;57(4):1018–1029. doi: 10.1111/j.0006-341x.2001.01018.x.
  • 12. Ivanova A. A new dose-finding design for bivariate outcomes. Biometrics. 2003;59(4):1001–1007. doi: 10.1111/j.0006-341x.2003.00115.x.
  • 13. Braun T. The bivariate continual reassessment method: Extending the CRM to phase I trials of two competing outcomes. Controlled Clinical Trials. 2002;23:240–256. doi: 10.1016/s0197-2456(01)00205-7.
  • 14. Arnold BC, Strauss DJ. Bivariate distributions with conditionals in prescribed exponential families (Corr: V53 p700). Journal of the Royal Statistical Society, Series B: Methodological. 1991;53:365–375.
  • 15. Thall PF, Cook JD. Dose-finding based on efficacy-toxicity trade-offs. Biometrics. 2004;60(3):684–693. doi: 10.1111/j.0006-341X.2004.00218.x.
  • 16. Thall PF, Nguyen HQ, Estey EH. Patient-specific dose finding based on bivariate outcomes and covariates. Biometrics. 2008;64(4):1126–1136. doi: 10.1111/j.1541-0420.2008.01009.x.
  • 17. Lai TL, Shih MC. Power, sample size and adaptation considerations in the design of group sequential clinical trials. Biometrika. 2004;91:507–528.
  • 18. Bartroff J, Lai TL. Generalized likelihood ratio statistics and uncertainty adjustments in adaptive design of clinical trials. Sequential Analysis. 2008;27:254–276.
  • 19. Bartroff J, Lai TL. Efficient adaptive designs with mid-course sample size adjustment in clinical trials. Statistics in Medicine. 2008;27:1593–1611. doi: 10.1002/sim.3201.
  • 20. Yin G, Li Y, Ji Y. Bayesian dose-finding in phase I/II clinical trials using toxicity and efficacy odds ratios. Biometrics. 2006;62(3):777–787. doi: 10.1111/j.1541-0420.2006.00534.x.
  • 21. Yin G, Yuan Y. Bayesian model averaging continual reassessment method in phase I clinical trials. Journal of the American Statistical Association. 2009;104(487):954–968.
  • 22. Silvapulle MJ, Sen PK. Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. Wiley-Interscience; Hoboken, New Jersey: 2004.
  • 23. Dale J. Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics. 1986;42:909–917.
  • 24. Babb J, Rogatko A, Zacks S. Cancer phase I clinical trials: Efficient dose escalation with overdose control. Statistics in Medicine. 1998;17:1103–1120. doi: 10.1002/(sici)1097-0258(19980530)17:10<1103::aid-sim793>3.0.co;2-9.
  • 25. Goodman SN, Zahurak ML, Piantadosi S. Some practical improvements in the continual reassessment method for phase I studies. Statistics in Medicine. 1995;14:1149–1161. doi: 10.1002/sim.4780141102.
  • 26. Tighiouart M, Rogatko A. Dose finding with escalation with overdose control (EWOC) in cancer clinical trials. Statistical Science. 2010;25(2):217–226.
  • 27. Cox DR, Hinkley DV. Theoretical Statistics. Chapman and Hall; London: 1974.
  • 28. Durrett R. Probability: Theory and Examples. 3rd ed. Thomson; Belmont: 2005.
  • 29. Agresti A. Categorical Data Analysis. John Wiley & Sons; 2002.
  • 30. Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage phase II clinical trials. Biometrics. 1995;51:1372–1383.
  • 31. Berry SM, Carlin BP, Lee JJ, Muller P. Bayesian Adaptive Methods for Clinical Trials. CRC Press; Boca Raton, FL: 2010.
  • 32. Yuan Y, Yin G. Bayesian phase I/II adaptively randomized oncology trials with combined drugs. Annals of Applied Statistics. 2011;5(2A):924–942. doi: 10.1214/10-AOAS433.
