Abstract
In this paper, a parametric sequential test is proposed under the Weibull model. The proposed test is asymptotically normal with an independent increments structure. The sample size for the fixed sample test is derived for the purpose of group sequential trial design. In addition, a multi-stage group sequential procedure is given under the Weibull model by applying the Brownian motion property of the test statistic and the sequential conditional probability ratio test methodology.
Keywords: Brownian motion, Group sequential trial, Randomized clinical trial, Sample size, Time-to-event, Weibull distribution
1 Introduction
For ethical reasons, clinical trials are often monitored for early stopping if a sufficiently large treatment difference is observed during an interim analysis. Various group sequential monitoring methods have been developed in the past few decades, such as the procedures of Haybittle (1971), Pocock (1977), and O'Brien and Fleming (1979); the type I error spending function approach of Lan and DeMets (1983); the triangular test of Whitehead and Stratton (1983); the sequential conditional probability ratio test (SCPRT) of Xiong (1995); and many others. Comprehensive reviews of these methods are provided by Jennison and Turnbull (2000, and references therein).
In cancer clinical trials, a time-to-event outcome, such as overall survival or event-free survival, is often the primary endpoint for the study design, where the event could be disease progression, relapse, or death. The primary interest is to compare the survival distributions between treatment groups. The non-parametric log-rank test is the most popular test statistic used to design such a study (Collett, 2003). Its Brownian motion property makes it easy to monitor such trials using a group sequential procedure (Tsiatis, 1982; Sellke and Siegmund, 1983; Slud, 1984; Kim and Tsiatis, 1990).
For survival data, the exponential and Weibull distributions are the two most frequently used parametric models. Of the two, the Weibull distribution is usually more appropriate for describing time-to-event data than the exponential distribution because, in addition to the scale parameter, it includes a shape parameter that allows a decreasing or increasing hazard. In advanced-stage cancer studies, the survival rate usually drops dramatically toward the end of the study, and such characteristics of the survival distribution can be better approximated by a Weibull distribution. In general, a cancer survival trial under the Weibull model can also be designed under the proportional hazards model using the log-rank test. However, a parametric test derived under the Weibull model has better small-sample properties than the non-parametric log-rank test because the latter has to be general, and information from continuous quantities derived from a specific parametric model cannot be included for inference (Wu, 2013). The maximum sample size is often large for a phase III group sequential trial, but the available data can be limited in the early stages of interim monitoring; therefore, a group sequential design under the Weibull model may perform better in the early stages than one under a general proportional hazards model. Tsiatis et al. (1995) derived asymptotic sequential distributions for score and Wald tests in general parametric survival models, but the method has not been applied to group sequential trial design under the Weibull model. Recently, Jiang et al. (2012) proposed a simulation method for group sequential trial design under the Weibull model, but it is computationally intensive and relies on restrictive assumptions. Heo et al. (1998) and Wu (2013) proposed sample size formulas for a fixed sample test under the Weibull model. Lu et al. (2012) derived a sample size formula for a two-stage seamless adaptive design under the Weibull model. However, a general multi-stage group sequential design under the Weibull model is not available in the literature.
The rest of this paper is organized as follows. In Section 2, a parametric sequential test statistic is proposed under the Weibull model. The sample size for a fixed sample test is given in Section 3. A general multi-stage group sequential procedure is discussed in Section 4. In Section 5, the empirical type I error and power of the proposed parametric sequential test are compared with those of the well-known non-parametric log-rank test. An example is given in Section 6 to illustrate the proposed method. The final conclusion is presented in Section 7.
2 Sequential Test Statistics
A parametric sequential test statistic is discussed in this section to provide group sequential design for randomized two-arm survival trials under the Weibull model. Assume that time-to-event variable Tj of a subject from the jth group follows the Weibull distribution with a common shape parameter κ and scale parameter ρj, j = 1,2. That is, Tj has survival distribution function
and hazard function
The shape parameter κ indicates the degree of acceleration (κ > 1) or deceleration (κ < 1) of the hazard over time. In a cancer trial, the median survival time is an intuitive endpoint for clinicians. The median survival time of the jth group for the Weibull distribution can be calculated as . Therefore, the Weibull survival distribution can be expressed as
The one-sided hypotheses of a randomized two-arm trial defined by median survival times can be expressed as
For notational convenience, we convert the scale parameter ρj to a hazard parameter . Then the above hypotheses on median survival times are equivalent to the following:
where the hazards ratio δ = λ1/λ2 = Rκ, with R = m2/m1. Then the survival distribution is Sj(t) = e−λjtκ with hazard function hj(t) = κλjtκ−1, in which κ is taken as a known constant. This indicates that the testing problem also fits into a proportional hazards model, so the log-rank test is applicable for the intended testing.
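Concretely, since Sj(mj) = 1/2 and Sj(t) = e−λjtκ, the hazard parameter is λj = log(2)/mj^κ, and hence δ = λ1/λ2 = (m2/m1)^κ = R^κ.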
Now, suppose that during the accrual phase of the trial, nj subjects of the jth group are enrolled in the study, and let Tij and Cij denote, respectively, the event time and censoring time of the ith subject of the jth group, both measured from the time of study entry, Yij. We assume that the event time Tij is independent of the censoring time Cij and entry time Yij, and that {(Yij, Tij, Cij); i = 1,…, nj} are independent and identically distributed. When the data are examined at calendar time t ≤ τ, where τ is the study duration, we observe the time-to-event Xij(t) = Tij ∧ Cij ∧ (t − Yij)+ and failure indicator Δij(t) = I(Tij ≤ Cij ∧ (t − Yij)+), i = 1,…, nj. Based on the observed data {Xij(t), Δij(t), i = 1,…, nj, j = 1, 2}, the observed likelihood function at time t is proportional to (see, e.g., Cox and Oakes, 1984, Chapter 3)
where is the total number of events observed in the jth group by time t, and is the cumulative follow-up time by time t penalized by the Weibull shape parameter κ. The maximum likelihood estimate of λj(t) can be derived as
and its variance is approximately . Therefore, under the null hypothesis, the Wald statistic of the log-hazard ratio γ = log(δ) at calendar time t is given by (see Appendix 1)
(1)
and has approximately a standard normal distribution. To derive the group sequential design, let
(2)
then under the alternative γ = log(δ) > 0, the statistic U(t) is approximately normal with mean γV(t) and variance V(t) and has an independent increments structure, where . The above results can be derived from Tsiatis et al. (1995), who proved similar results for general parametric survival models. Since
(3)
where pj(t) = P(Δ1j(t) = 1) and π = n2/n1 is the treatment allocation ratio. Thus, is approximately a Brownian motion with drift parameter and information time I = D(t)/D(τ), where D(τ) is the value of D(t) at t = τ.
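As a concrete illustration, the following R (S-Plus-style) sketch computes the test statistic from the data observed by calendar time t. Because the displayed equations above are reproduced only by number, the sketch assumes the standard forms λ̂j(t) = dj(t)/Uj(t), with Uj(t) = Σi Xij(t)^κ, and var{γ̂(t)} ≈ 1/d1(t) + 1/d2(t) (see Appendix 1 and Wu, 2013); the function name is ours.

```r
## Hedged sketch: Wald statistic Z(t) under the Weibull model with known shape kappa.
## Assumes lambda_hat_j = d_j / U_j with U_j = sum(X_ij^kappa) and
## var(gamma_hat) approximately 1/d1 + 1/d2, as in Appendix 1.
weibull_wald <- function(x1, delta1, x2, delta2, kappa) {
  # x_j: follow-up times X_ij(t); delta_j: event indicators Delta_ij(t)
  d1 <- sum(delta1); d2 <- sum(delta2)         # observed numbers of events
  U1 <- sum(x1^kappa); U2 <- sum(x2^kappa)     # kappa-penalized cumulative follow-up
  gamma_hat <- log((d1 / U1) / (d2 / U2))      # estimated log-hazard ratio
  gamma_hat / sqrt(1 / d1 + 1 / d2)            # Wald statistic Z(t)
}
```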
3 Sample Size for Fixed Sample Test
The sample size for a fixed sample test is calculated for the analysis at the end of the study. Based on the test statistic Z(t) at t = τ, under the null hypothesis,
has an approximate standard normal distribution. To calculate the power, let pj(τ) be the probability of a subject from the jth group having an event during the study. Then under the alternative δ = λ1/λ2 > 1, Z(τ) is an approximately normal distribution with mean and unit variance. Therefore, given a significance level α, the power (1 − β) of the Z(τ) test under the alternative is given by
where Φ(·) is the standard normal distribution function and z1−α = Φ−1(1 −α). Thus, the sample size of the first group based on the Z(τ) test can be calculated as
(4)
where δ = Rκ (Wu, 2013). Therefore, the total sample size for the two groups is given by
(5)
To calculate the number of subjects required for the study, we need to calculate pj(τ), the probability of a subject in the jth group having an event during the study. Typically, we assume that subjects are accrued over an accrual period of length ta with an additional follow-up period of length tf. A subject enters the study at time u, where u is uniformly distributed on [0, ta], and no subject is lost to follow-up during the study. Then the probability of a subject having an event during the study under the Weibull model can be calculated by (Collett, 2003):
(6)
Therefore, given the design parameters δ (or κ), m1, m2, α, β, π, tf and ta, the number of subjects n required for the study can be calculated using formula (5).
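For illustration, a minimal R sketch of this calculation is given below. Because formulas (4)-(6) appear above only by number, the sketch assumes the forms pj(τ) = 1 − (1/ta)∫ from tf to ta + tf of Sj(t) dt and n1 = (z1−α + z1−β)²[log(δ)]⁻²{1/p1(τ) + 1/(π p2(τ))} with n = (1 + π)n1, as in Wu (2013); the function names are ours.

```r
## Hedged sketch of formulas (4)-(6): event probability and fixed-sample size
## under the Weibull model with uniform accrual over [0, ta] and follow-up tf.
event_prob <- function(median, kappa, ta, tf) {
  lambda <- log(2) / median^kappa                            # hazard parameter
  S <- function(t) exp(-lambda * t^kappa)                    # Weibull survival function
  1 - integrate(S, lower = tf, upper = ta + tf)$value / ta   # p_j(tau), tau = ta + tf
}

fixed_sample_size <- function(m1, m2, kappa, ta, tf,
                              alpha = 0.05, beta = 0.10, pi_ratio = 1) {
  delta <- (m2 / m1)^kappa                                   # hazard ratio under H1
  p1 <- event_prob(m1, kappa, ta, tf)
  p2 <- event_prob(m2, kappa, ta, tf)
  n1 <- (qnorm(1 - alpha) + qnorm(1 - beta))^2 / log(delta)^2 *
        (1 / p1 + 1 / (pi_ratio * p2))                       # group-1 size, assumed form of (4)
  ceiling((1 + pi_ratio) * n1)                               # total size, formula (5)
}

## kappa = 1, R = 1.5, ta = 5, tf = 2: about 236 in total (118 per group), cf. Table 1
fixed_sample_size(m1 = 1, m2 = 1.5, kappa = 1, ta = 5, tf = 2)
```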
In designing an actual trial, fixing the accrual time ta and then calculating the sample size is often impractical, because we may not be able to enroll the planned number of subjects within the given accrual duration. It is more practical to design the study starting from the accrual rate r and then calculate the required accrual time ta. This can be accomplished under the Weibull model assumption. First, the integration in the probability formula (6) can be approximated using Simpson's rule,
(7)
Then, combining the sample size formula based on (5) with equation (7), we can define a root function of the accrual time ta
(8)
Now the accrual time ta can be obtained by solving the root equation root(ta) = 0 numerically in Splus using the uniroot function. The total sample size required for the study is approximately n = [rta]+, where [x]+ denotes the smallest integer greater than or equal to x.
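A hedged R sketch of this root-finding step is shown below (uniroot is the function mentioned above). It assumes the Simpson approximation pj(τ) ≈ 1 − {Sj(tf) + 4Sj(0.5ta + tf) + Sj(ta + tf)}/6 for (7) and the root function root(ta) = r·ta − n(ta) for (8), since those displays are not reproduced above.

```r
## Hedged sketch of (7)-(8): given an accrual rate r, solve r * ta = n(ta) for ta,
## where n(ta) is the total sample size from formula (5) with p_j(tau) from Simpson's rule.
accrual_time <- function(r, m1, m2, kappa, tf,
                         alpha = 0.05, beta = 0.10, pi_ratio = 1) {
  p_simpson <- function(median, ta) {                        # Simpson approximation to (6)
    lambda <- log(2) / median^kappa
    S <- function(t) exp(-lambda * t^kappa)
    1 - (S(tf) + 4 * S(0.5 * ta + tf) + S(ta + tf)) / 6
  }
  n_total <- function(ta) {
    delta <- (m2 / m1)^kappa
    p1 <- p_simpson(m1, ta); p2 <- p_simpson(m2, ta)
    (1 + pi_ratio) * (qnorm(1 - alpha) + qnorm(1 - beta))^2 / log(delta)^2 *
      (1 / p1 + 1 / (pi_ratio * p2))
  }
  uniroot(function(ta) r * ta - n_total(ta), interval = c(0.1, 100))$root
}

## Section 6 setting (r = 20/year, m1 = 0.936, m2 = 1.436, kappa = 1.37, tf = 2): about 5.3 years
accrual_time(r = 20, m1 = 0.936, m2 = 1.436, kappa = 1.37, tf = 2)
```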
The sample sizes under each scenario for the fixed sample tests recorded in Table 1 were calculated from formula (5) for the parametric test Z(τ) and from a formula given by Collett (2003) for the log-rank test L(τ),
Table 1.
Sample size and simulated empirical type I error (α) and power (1 − β) based on 100,000 simulation runs for the Weibull distribution for fixed sample tests Z(τ) and L(τ) with a nominal type I error of 0.05 and powers of 80% and 90% (one-sided test).
Design with 90% power. The three column groups give n, α, and 1 − β for R = 1.5, 1.6, and 1.7, respectively.

| κ | Test | n | α | 1 − β | n | α | 1 − β | n | α | 1 − β |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | Z(τ) | 578 | 0.049 | 0.902 | 434 | 0.051 | 0.901 | 344 | 0.050 | 0.903 |
| 0.5 | L(τ) | 577 | 0.049 | 0.900 | 433 | 0.050 | 0.901 | 342 | 0.050 | 0.899 |
| 1 | Z(τ) | 118 | 0.052 | 0.903 | 89 | 0.050 | 0.902 | 71 | 0.051 | 0.907 |
| 1 | L(τ) | 118 | 0.052 | 0.900 | 89 | 0.051 | 0.899 | 70 | 0.051 | 0.897 |
| 2 | Z(τ) | 27 | 0.051 | 0.903 | 20 | 0.052 | 0.902 | 16 | 0.053 | 0.903 |
| 2 | L(τ) | 27 | 0.054 | 0.887 | 20 | 0.055 | 0.881 | 16 | 0.057 | 0.876 |

Design with 90% power. The three column groups give n, α, and 1 − β for R = 1.8, 1.9, and 2.0, respectively.

| κ | Test | n | α | 1 − β | n | α | 1 − β | n | α | 1 − β |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | Z(τ) | 283 | 0.049 | 0.902 | 239 | 0.050 | 0.903 | 207 | 0.050 | 0.903 |
| 0.5 | L(τ) | 281 | 0.050 | 0.899 | 237 | 0.050 | 0.900 | 205 | 0.049 | 0.900 |
| 1 | Z(τ) | 58 | 0.051 | 0.904 | 50 | 0.051 | 0.909 | 43 | 0.052 | 0.910 |
| 1 | L(τ) | 58 | 0.052 | 0.898 | 49 | 0.052 | 0.898 | 43 | 0.053 | 0.902 |
| 2 | Z(τ) | 13 | 0.054 | 0.903 | 11 | 0.055 | 0.906 | 10 | 0.054 | 0.916 |
| 2 | L(τ) | 13 | 0.058 | 0.871 | 11 | 0.059 | 0.867 | 10 | 0.061 | 0.876 |

Design with 80% power. The three column groups give n, α, and 1 − β for R = 1.5, 1.6, and 1.7, respectively.

| κ | Test | n | α | 1 − β | n | α | 1 − β | n | α | 1 − β |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | Z(τ) | 418 | 0.051 | 0.802 | 314 | 0.050 | 0.801 | 248 | 0.050 | 0.802 |
| 0.5 | L(τ) | 417 | 0.049 | 0.798 | 313 | 0.049 | 0.800 | 247 | 0.049 | 0.800 |
| 1 | Z(τ) | 85 | 0.052 | 0.802 | 64 | 0.051 | 0.805 | 51 | 0.052 | 0.804 |
| 1 | L(τ) | 85 | 0.052 | 0.796 | 64 | 0.053 | 0.798 | 51 | 0.053 | 0.796 |
| 2 | Z(τ) | 20 | 0.052 | 0.816 | 15 | 0.053 | 0.815 | 12 | 0.054 | 0.819 |
| 2 | L(τ) | 20 | 0.055 | 0.789 | 15 | 0.056 | 0.781 | 12 | 0.059 | 0.778 |

Design with 80% power. The three column groups give n, α, and 1 − β for R = 1.8, 1.9, and 2.0, respectively.

| κ | Test | n | α | 1 − β | n | α | 1 − β | n | α | 1 − β |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | Z(τ) | 204 | 0.050 | 0.805 | 173 | 0.051 | 0.806 | 149 | 0.049 | 0.804 |
| 0.5 | L(τ) | 203 | 0.051 | 0.800 | 172 | 0.051 | 0.803 | 148 | 0.051 | 0.799 |
| 1 | Z(τ) | 42 | 0.053 | 0.807 | 36 | 0.052 | 0.811 | 31 | 0.052 | 0.811 |
| 1 | L(τ) | 42 | 0.053 | 0.798 | 36 | 0.052 | 0.801 | 31 | 0.054 | 0.798 |
| 2 | Z(τ) | 10 | 0.054 | 0.827 | 8 | 0.055 | 0.810 | 7 | 0.056 | 0.816 |
| 2 | L(τ) | 10 | 0.061 | 0.778 | 8 | 0.062 | 0.748 | 7 | 0.063 | 0.746 |
n is the sample size per group (with equal allocation), and the sample size for the log-rank test was calculated from formula (9) given by Collett (2003).
(9)
where π1 = 1/(1 + π) and π2 = π/(1 + π) are the proportions of subjects assigned to treatment 1 and treatment 2, respectively, and P(τ) = π1p1(τ) + π2p2(τ) is the combined probability of failure in [0, τ] for subjects from the two groups. As shown in Table 1, the sample sizes for the parametric test and the log-rank test are very similar. In an unpublished manuscript on the log-rank test, Xiong (2014) obtained a precise analytical formula for E(τ), where E(τ) = μ̄(τ)²/v̄(τ) with μ̄(τ) = lim d→∞ μ(τ)/d and v̄(τ) = lim d→∞ V(τ)/d, μ(τ) and V(τ) being the mean and variance of the log-rank score statistic and d the total number of events in [0, τ]. The precise number of failures in the calendar time interval [0, τ] for the log-rank test should be d = (z1−α + z1−β)²/E(τ), and the precise sample size should be n = d/P(τ), where P(τ) is the combined probability of failure on [0, τ] defined above. E(τ) is a function of the survival distributions, the hazards ratio, the entry time distribution, the censoring distribution, and the allocation proportions of subjects in the two groups. By numerical computation using E(τ) as a criterion, we evaluated the existing sample size formula for the log-rank test and propose a new formula for it. The computation indicates that [log(δ)]²π1π2 ≈ E(τ) when the allocation is balanced (i.e., π1 and π2 are close to 0.5) and |log(δ)| ≤ 1; this verifies the accuracy of formula (9) over this range of parameters. The computation also indicates that [log(δ)]²π1π2p1(τ)p2(τ)/[P(τ)]² ≈ E(τ) for any 0 < π1 < 1 and |log(δ)| ≤ 2, which leads to
(10)
as a formula for the sample size calculation of the log-rank test that is more accurate than the formula in (9). We will give a numerical example in Section 5 to illustrate this feature. It is straightforward to check that equations (5) and (10) are mathematically equivalent, which implies that the formula in (5) not only works for the proposed parametric test for the Weibull distribution, but also works well for the log-rank test, especially when the assignment of subjects is unbalanced for the two groups.
In practice, attrition should also be considered at the design stage of a clinical trial. Patients who are lost to follow-up for various reasons during the study are censored in the survival analysis. If losses to follow-up are random and independent of the survival distribution, they form part of the censoring distribution and can be incorporated into the trial design. Sample sizes can also be adjusted for attrition due to patients' dropout or noncompliance with the study (see, e.g., Lachin and Foulkes, 1986).
4 Group Sequential Procedure
In this section, we will apply an SCPRT procedure (Xiong, 1995) to the test statistic Z(t). The SCPRT has two unique features: (1) the maximum sample size of the sequential test is not greater than the size of the reference fixed sample test; and (2) the probability of discordance, or the probability that the conclusion of the sequential test would be reversed if the experiment were not stopped according to the stopping rule but continued to the planned end, can be controlled to an arbitrarily small level (Xiong et al., 2007). Furthermore, the power function of the SCPRT is virtually the same as that of the fixed sample test (Xiong, 1995). The SCPRT boundaries derived in this paper have analytical solutions. All these features make the SCPRT attractive and simple to use.
Let {Bt: 0 < t ≤ 1} be the Brownian motion with Bt ∼ N(θt, t), and let B1 be the value of Bt at the final stage with full information t = 1. Then (Bt, B1) has a bivariate normal distribution with mean μ = (θt, θ) and variance matrix Σ = (σij)2×2 with σ11 = σ12 = σ21 = t and σ22 = 1. Therefore, by multivariate conditional distribution theory (e.g., Anderson, 1958), the conditional density f(Bt|B1) is the normal density of N(B1t, (1 − t)t). Let s0 = z1−α be the critical value of B1 for rejecting the null hypothesis in the fixed sample test. Then the conditional maximum likelihood ratio for the stochastic process on information time t is (Xiong, 1995; Xiong et al., 2003)
Taking the logarithm, the log-likelihood ratio can be simplified as
which has a positive sign if Bt > z1−αt and a negative sign if Bt < z1−αt. This equation leads to lower and upper boundaries for Btk as
(11)
for k = 1, …, K, where K is the total number of looks, and t1, t2,…, tK(= 1) are the information times of the interim looks and the final look. The quantity a in (11) is the boundary coefficient, and it is crucial to choose an appropriate a for the design such that the probability that the conclusion of the sequential test is reversed by the test at the planned end is small but not unnecessarily small. The larger a is, the smaller the discordance probability and the farther apart the upper and lower boundaries, which makes it harder for the sample path to reach a boundary and stop early and thus results in a larger expected sample size. Thus, an appropriate a can be determined by choosing an appropriate discordance probability (Xiong, 1995; Xiong et al., 2003).
Now we apply the SCPRT to the test statistic , which is a Brownian motion in information time I = D(t)/D(τ) on [0,1] with drift parameter , where D(t) is defined by (3) in Section 2. Suppose K looks are planned at calendar times tk, k = 1, …, K. Then, based on the SCPRT procedure presented above, the lower and upper boundaries for BIk = B(Ik) at the kth look are given by
(12)
for k = 1, …, K, where Ik = D(tk)/D(τ) is the information time at the kth look at calendar time tk. The nominal critical p-values for testing H0 are
(13)
The observed p-value at the kth look is
(14)
The stopping rule for monitoring the trial is executed by stopping the trial when, for the first time, PBIk ≥ Pak (accept H0 and stop for futility) or PBIk ≤ Pbk (reject H0 and stop for efficacy). Since Z(tk) and have the same asymptotic distribution under the null hypothesis, the observed p-value at the kth stage can be calculated from the test statistic Z(tk) by applying all observations up to stage k. As an illustration, the calculations of the operating characteristics of a multi-stage group sequential design are given in Appendix 2.
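Because the displays (11)-(14) appear above only by number, the following R sketch assumes the SCPRT boundary form of Xiong et al. (2003), ak = z1−α Ik − {2a Ik(1 − Ik)}^{1/2} and bk = z1−α Ik + {2a Ik(1 − Ik)}^{1/2}, with nominal critical p-values 1 − Φ(ak/√Ik) and 1 − Φ(bk/√Ik); this assumed form reproduces the boundaries reported in the example of Section 6.

```r
## Hedged sketch of the SCPRT boundaries (12) and nominal critical p-values (13),
## assuming the boundary form of Xiong et al. (2003):
##   lower_k = z * I_k - sqrt(2 * a * I_k * (1 - I_k)),  upper_k = z * I_k + sqrt(2 * a * I_k * (1 - I_k))
scprt_boundaries <- function(info, a, alpha = 0.05) {
  z <- qnorm(1 - alpha)
  half <- sqrt(2 * a * info * (1 - info))        # half-width; zero at the final look (info = 1)
  lower <- z * info - half
  upper <- z * info + half
  data.frame(info  = info,
             lower = lower, upper = upper,
             p_futility = 1 - pnorm(lower / sqrt(info)),   # accept H0 if observed p-value >= this
             p_efficacy = 1 - pnorm(upper / sqrt(info)))   # reject H0 if observed p-value <= this
}
```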
5 Simulation Studies
In this section, we conducted simulation studies to compare the power and type I error of the proposed parametric test statistic Z(t) and the non-parametric log-rank test L(t) under various scenarios. In the simulations, the survival distribution of the jth group was taken as Sj(t) = e−log(2)(t/mj)κ, which is the Weibull distribution with shape parameter κ and median survival time mj, j = 1, 2, where the shape parameter κ was taken as 0.5, 1, and 2 to reflect cases of decreasing, constant, and increasing hazard functions.
The null hypothesis was set to H0 : m1 = m2 (= 1), and the ratio of medians R = m2/m1 under the alternative was taken as 1.5 to 2.0. Furthermore, we assumed that subjects were recruited uniformly over the accrual period ta = 5 years and followed for tf = 2 years, and that no subject was lost to follow-up during the study period τ = ta + tf = 7. Therefore, a subject was censored at calendar time t if his/her event time was longer than t − u, where u is the time at which the subject entered the study. The sample sizes of the two groups were balanced, so that π = n2/n1 = 1; this setting reflects the facts that the total sample size n = n1 + n2 is minimized when π is close to 1 and that a study with equal group sizes is easier to plan and manage.
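As an illustration of the data-generation step, the following R sketch draws one simulated sample for a single group under this setting; it uses R's rweibull parametrization, in which the scale giving median m and shape κ is m/(log 2)^{1/κ}.

```r
## Hedged sketch of one simulated group: Weibull event times with median m and shape kappa,
## uniform staggered entry on [0, ta], and administrative censoring at calendar time tau.
simulate_group <- function(n, median, kappa, ta, tau) {
  scale <- median / log(2)^(1 / kappa)             # rweibull scale giving the target median
  entry <- runif(n, 0, ta)                         # uniform accrual times
  event_time <- rweibull(n, shape = kappa, scale = scale)
  followup <- tau - entry                          # administrative censoring time
  data.frame(x = pmin(event_time, followup),                 # observed time X_ij(tau)
             delta = as.numeric(event_time <= followup))     # event indicator Delta_ij(tau)
}
```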
In each design parameter configuration, 100,000 observed samples of censored event times were generated from the Weibull distribution to calculate the test statistics under the null or alternative hypothesis. The nominal significance level was set to 0.05 and power was set to 80% and 90%. Sample sizes, empirical type I error, and power for the fixed sample test were calculated at the end of the study, τ = 7. The empirical type I error and power for the two-stage SCPRT design were calculated at calendar time t1 = 4 and t2 = 7. The simulated empirical powers and type I errors in various scenarios for the fixed sample test and two-stage SCPRT test are summarized in Table 1 and Table 2, respectively.
Table 2.
Simulated empirical type I error and power of the two-stage SCPRT designs based on 100,000 simulation runs for sequential test statistics Z(t) and L(t) with a nominal type I error of 0.05 and power of 90% (one-sided test).
Design with 90% power. Columns give the type I error (α) and power (1 − β) at the first look (k = 1), at the second look (k = 2), and in total.

| R | κ | Statistic | α, k = 1 | α, k = 2 | α, total | 1 − β, k = 1 | 1 − β, k = 2 | 1 − β, total |
|---|---|---|---|---|---|---|---|---|
| 1.5 | 0.5 | Z(t), empirical | 0.005 | 0.045 | 0.050 | 0.382 | 0.519 | 0.901 |
| 1.5 | 0.5 | L(t), empirical | 0.005 | 0.044 | 0.049 | 0.385 | 0.514 | 0.899 |
| 1.5 | 0.5 | Nominal | 0.005 | 0.046 | 0.051 | 0.382 | 0.518 | 0.899 |
| 1.5 | 1 | Z(t), empirical | 0.005 | 0.047 | 0.052 | 0.325 | 0.577 | 0.902 |
| 1.5 | 1 | L(t), empirical | 0.005 | 0.047 | 0.053 | 0.324 | 0.575 | 0.899 |
| 1.5 | 1 | Nominal | 0.005 | 0.046 | 0.051 | 0.325 | 0.574 | 0.899 |
| 1.5 | 2 | Z(t), empirical | 0.005 | 0.047 | 0.052 | 0.323 | 0.579 | 0.902 |
| 1.5 | 2 | L(t), empirical | 0.006 | 0.049 | 0.055 | 0.293 | 0.593 | 0.886 |
| 1.5 | 2 | Nominal | 0.004 | 0.046 | 0.051 | 0.326 | 0.573 | 0.899 |
| 1.8 | 0.5 | Z(t), empirical | 0.005 | 0.045 | 0.050 | 0.378 | 0.524 | 0.902 |
| 1.8 | 0.5 | L(t), empirical | 0.005 | 0.046 | 0.051 | 0.377 | 0.522 | 0.899 |
| 1.8 | 0.5 | Nominal | 0.005 | 0.046 | 0.051 | 0.382 | 0.518 | 0.899 |
| 1.8 | 1 | Z(t), empirical | 0.005 | 0.047 | 0.052 | 0.315 | 0.588 | 0.903 |
| 1.8 | 1 | L(t), empirical | 0.005 | 0.048 | 0.053 | 0.315 | 0.582 | 0.897 |
| 1.8 | 1 | Nominal | 0.005 | 0.046 | 0.051 | 0.314 | 0.585 | 0.899 |
| 1.8 | 2 | Z(t), empirical | 0.005 | 0.049 | 0.054 | 0.286 | 0.614 | 0.900 |
| 1.8 | 2 | L(t), empirical | 0.006 | 0.053 | 0.060 | 0.236 | 0.633 | 0.869 |
| 1.8 | 2 | Nominal | 0.004 | 0.046 | 0.051 | 0.285 | 0.615 | 0.899 |
| 2.0 | 0.5 | Z(t), empirical | 0.005 | 0.046 | 0.051 | 0.378 | 0.525 | 0.903 |
| 2.0 | 0.5 | L(t), empirical | 0.005 | 0.045 | 0.050 | 0.375 | 0.524 | 0.899 |
| 2.0 | 0.5 | Nominal | 0.005 | 0.046 | 0.051 | 0.377 | 0.523 | 0.899 |
| 2.0 | 1 | Z(t), empirical | 0.005 | 0.048 | 0.052 | 0.309 | 0.599 | 0.908 |
| 2.0 | 1 | L(t), empirical | 0.006 | 0.048 | 0.054 | 0.312 | 0.589 | 0.901 |
| 2.0 | 1 | Nominal | 0.005 | 0.046 | 0.051 | 0.307 | 0.593 | 0.899 |
| 2.0 | 2 | Z(t), empirical | 0.006 | 0.050 | 0.055 | 0.275 | 0.637 | 0.913 |
| 2.0 | 2 | L(t), empirical | 0.007 | 0.056 | 0.063 | 0.214 | 0.660 | 0.874 |
| 2.0 | 2 | Nominal | 0.004 | 0.046 | 0.051 | 0.285 | 0.615 | 0.899 |
Sample sizes under each scenario for the fixed sample tests recorded in Table 1 were calculated by formula (5) for the parametric test Z(τ) and by formula (9) for the log-rank test L(τ); the latter formula was given by Collett (2003). As shown in Table 1, the sample sizes for the two tests were very similar, both assuming π = 1, which is a favorable condition as discussed in the last paragraph of Section 3. For π not close to 1, the sample sizes for the two tests could be different. For example, take the conditions of Table 1 with κ = 0.5 and R = 1.5 but let π = 7/13 (i.e., π1 = 0.65 and π2 = 0.35); then the total sample size n from formula (9) is 1249, whereas that from formula (5) or (10) is 1289. The latter is close to the precise sample size, 1287, from the log-rank test using E(τ) (see the last paragraph of Section 3). The simulation results in Table 1 for the fixed sample tests showed that both the log-rank test L(τ) and the parametric test Z(τ) had adequate empirical type I error and power for moderate to large sample sizes. However, the log-rank test L(τ) was liberal and underpowered when the sample size was small. The type I error and power of the parametric test Z(τ) were satisfactorily close to the nominal type I error of 0.05 and power of 90% or 80%, respectively, even when the sample size was small. The difference in performance between the two tests with a small sample size may be explained by the fact that L(τ) is non-parametric and includes only counting data, whereas Z(τ) is parametric and includes the counting data d1(τ) and d2(τ) as well as the continuous data U1(τ) and U2(τ). The results for the two-stage SCPRT design (Table 2) showed again that the empirical type I error and power of both tests were close to the nominal level at each stage for moderate to large sample sizes. However, the log-rank test L(t) was liberal and underpowered when the sample size was small. The parametric test Z(t) performed better, with adequate empirical type I error and power at each stage.
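The comparison of formulas (9) and (10) in the unbalanced case can be reproduced with the short R sketch below, which encodes the two formulas in the forms implied by the approximations to E(τ) stated in Section 3 (the displays themselves are not reproduced above).

```r
## Hedged sketch comparing the log-rank sample size formulas (9) and (10) for the
## unbalanced example: kappa = 0.5, R = 1.5, pi1 = 0.65, ta = 5, tf = 2, 90% power.
event_prob <- function(median, kappa, ta, tf) {              # as in the Section 3 sketch
  lambda <- log(2) / median^kappa
  1 - integrate(function(t) exp(-lambda * t^kappa), tf, ta + tf)$value / ta
}
compare_formulas <- function(m1, m2, kappa, ta, tf, pi1, alpha = 0.05, beta = 0.10) {
  pi2 <- 1 - pi1
  p1 <- event_prob(m1, kappa, ta, tf); p2 <- event_prob(m2, kappa, ta, tf)
  P  <- pi1 * p1 + pi2 * p2                                  # combined failure probability
  A  <- (qnorm(1 - alpha) + qnorm(1 - beta))^2 / (kappa * log(m2 / m1))^2  # log(delta) = kappa * log(R)
  c(formula_9  = A / (pi1 * pi2 * P),
    formula_10 = A * P / (pi1 * pi2 * p1 * p2))
}
## returns approximately 1249 and 1289, in agreement with the values quoted above
compare_formulas(m1 = 1, m2 = 1.5, kappa = 0.5, ta = 5, tf = 2, pi1 = 0.65)
```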
To study the Brownian motion property of the statistic U(t) in equation (2), the empirical correlation matrix of the increments of U(t) at times t = 3, 4, 5, 6, 7 with a sample size of n = 100 per group was also computed from 100,000 simulations. All of the correlations were close to the theoretical value of zero (see Table 3).
Table 3.
Simulated empirical correlation matrix of the statistic U(t) based on 100,000 simulation runs with sample size n = 100.
Rows and columns are indexed by calendar time tk = 3, 4, 5, 6, 7; entries are the correlations among the increments of U(t) (upper triangle shown).

κ = 0.5:

| tk | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|
| 3 | 1 | 0.0092 | 0.0061 | 0.0059 | 0.0056 |
| 4 |  | 1 | 0.0036 | 0.0066 | 0.0070 |
| 5 |  |  | 1 | 0.0088 | 0.0087 |
| 6 |  |  |  | 1 | 0.0034 |
| 7 |  |  |  |  | 1 |

κ = 1:

| tk | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|
| 3 | 1 | 0.0089 | 0.0068 | 0.0060 | 0.0057 |
| 4 |  | 1 | 0.0060 | 0.0079 | 0.0075 |
| 5 |  |  | 1 | 0.0056 | 0.0062 |
| 6 |  |  |  | 1 | 0.0049 |
| 7 |  |  |  |  | 1 |

κ = 2:

| tk | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|
| 3 | 1 | 0.0024 | -0.0002 | -0.0012 | -0.0003 |
| 4 |  | 1 | -0.0004 | -0.0004 | 0.0010 |
| 5 |  |  | 1 | 0.0018 | 0.0027 |
| 6 |  |  |  | 1 | 0.0019 |
| 7 |  |  |  |  | 1 |
6 An Example
Rhabdoid tumors are aggressive pediatric malignancies with a poor prognosis. Over the past 5 years, St. Jude Children's Research Hospital accrued 14 pediatric patients with recurrent or refractory non-CNS rhabdoid tumors treated with conventional chemotherapy. The median event-free survival is only about 1 year, where the event is defined as disease relapse or death. All 14 patients had events within about 3 years. The Weibull model was fitted in Splus to the data, resulting in an estimate (standard error) of the shape parameter κ = 1.37 (0.28) and a median event-free survival time of m1 = 0.936 years; this provides a more satisfactory model than the exponential model (Wu, 2013). Now, suppose that we would like to design a multi-center randomized two-arm trial to assess the effectiveness of the small molecule inhibitor alisertib (treatment 2) versus conventional chemotherapy (treatment 1) for this group of patients. Patients will be randomized with equal allocation to each treatment group, and hence π = n2/n1 = 1. The hypotheses of the planned study are H0 : m2 ≤ m1 vs. H1 : m2 > m1. The investigators would like to detect a half-year increase in the median event-free survival of the alisertib treatment group over that of the conventional chemotherapy group, with 90% power, 5% type I error, and 2 years of follow-up (tf = 2) after the last patient is enrolled in the study. Then for the alternative hypothesis, m2 = m1 + 0.5 = 1.436, δ = λ1/λ2 = (m2/m1)κ = 1.797, and γ = log(δ) = 0.586. Assume this multi-center trial has the capacity to enroll and treat 20 patients per year. Then, under the assumption of the Weibull model with uniform entry and no loss to follow-up, the required total accrual time is ta = 5.3 years, calculated by equation (8), in which p1(τ) and p2(τ) are found by (7) with τ = ta + tf. Then the study duration is τ = ta + tf = 7.3 years, and the total sample size is 106 patients (53 per group). Now assuming interim and final looks are planned at calendar times t1 = 4, t2 = 5, and t3 = τ = 7.3 years, the corresponding information time at each planned interim look tk can be calculated by Ik = D(tk)/D(τ), where with
and π = 1. By calculation, the corresponding information times are I1 = 0.511, I2 = 0.706, and I3 = 1. Assuming a maximum conditional probability of discordance of ρ = 0.02, the boundary coefficient is a = 2.593 for K = 3 (Xiong et al., 2003), and the maximum probability of discordance is ρmax = 0.0043. That is, under the most unfavorable setting of the mean parameter, i.e., when the underlying true drift θI of the Brownian motion B(I) points to the cutoff point z1−α of the test at the final stage, on average only 4.3 of every 1,000 such sequential tests would have the conclusion reached at early stopping (efficacy or futility) reversed if the sequential test were not stopped as prescribed but continued to the planned end. The lower and upper boundaries calculated from (12) are (a1, a2, a3) = (−0.298, 0.124, 1.645) and (b1, b2, b3) = (1.979, 2.199, 1.645), respectively. The corresponding nominal critical significance levels for acceptance and rejection are (0.6615, 0.4415, 0.050) and (0.0028, 0.0044, 0.050), respectively.
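Using the boundary sketch from Section 4, these design quantities can be reproduced as follows (output shown as comments, rounded).

```r
## Reproducing the example design with the scprt_boundaries sketch from Section 4:
scprt_boundaries(info = c(0.511, 0.706, 1), a = 2.593, alpha = 0.05)
## approximately:
##    info  lower upper p_futility p_efficacy
## 1 0.511 -0.298 1.979     0.6615     0.0028
## 2 0.706  0.124 2.199     0.4415     0.0044
## 3 1.000  1.645 1.645     0.0500     0.0500
```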
We performed 100,000 simulation runs under the Weibull distribution to evaluate the operating characteristics of the proposed group sequential design. The empirical (nominal) type I error and power of the sequential test are 0.0514 (0.0505) and 0.9034 (0.8995), respectively. The empirical (nominal) probabilities of stopping under the null and alternative hypotheses are 0.3446 (0.3412) and 0.2557 (0.2554) at the first look, and 0.5902 (0.5817) and 0.4739 (0.4690) by the second look. The details of the operating characteristics for the proposed group sequential design are shown in Table 4.
Table 4.
The empirical operating characteristics of the three-stage SCPRT design for the example were estimated based on 100,000 simulation runs under the Weibull distribution.
| Quantity at the kth look | k = 1 | k = 2 | k = 3 | Total |
|---|---|---|---|---|
| Type I error, nominal | 0.0028 | 0.0031 | 0.0446 | 0.0505 |
| Type I error, empirical | 0.0031 | 0.0029 | 0.0453 | 0.0514 |
| Power, nominal | 0.2494 | 0.2066 | 0.4435 | 0.8995 |
| Power, empirical | 0.2499 | 0.2106 | 0.4429 | 0.9034 |
| Probability of stopping under the null, nominal | 0.3412 | 0.2405 | 0.4184 | 1.0000 |
| Probability of stopping under the null, empirical | 0.3446 | 0.2456 | 0.4098 | 1.0000 |
| Probability of stopping under the alternative, nominal | 0.2554 | 0.2136 | 0.5311 | 1.0000 |
| Probability of stopping under the alternative, empirical | 0.2557 | 0.2182 | 0.5261 | 1.0000 |

Expected stopping time and expected sample size:

|  | ET(0) | ET(θa)* | EN(0) | EN(θa) |
|---|---|---|---|---|
| Nominal | 0.7625 | 0.8124 | 81 | 87 |
| Empirical | 0.7593 | 0.8108 | 81 | 86 |
θa = z1−α + z1−β is the drift parameter under the alternative hypothesis.
7 Conclusion
A parametric sequential test statistic under the Weibull model is proposed. Simulation results showed that the proposed parametric sequential test Z(t) has better small-sample properties than the log-rank test L(t) under the Weibull model. A multi-stage group sequential procedure is given based on the SCPRT proposed by Xiong (1995). The maximum sample size of the sequential test is the same as the sample size of the fixed sample test, and the group sequential boundaries have analytical solutions. Therefore, the proposed group sequential procedure is attractive and simple to use. The study can be monitored by pre-planned multi-stage interim analyses to stop early for either efficacy or futility of the new treatment. Finally, if the hypothesis of a randomized phase III trial is two-sided, one can replace z1−α by z1−α/2 in equation (5) or (8) to obtain the sample size and in equation (12) to obtain the SCPRT boundaries for a group sequential trial design with a two-sided hypothesis.
Acknowledgments
This work was supported in part by National Cancer Institute (NCI) support grant CA21765 and the American Lebanese Syrian Associated Charities (ALSAC).
Appendix 1: Derivation of the Sequential Test Statistic Z(t)
First, for notational convenience, we convert (λ1, λ2) to (γ, λ), where γ = log(λ1/λ2) is the log hazard ratio and λ = λ2. Then the log-likelihood at calendar time t for (γ, λ) is given by
By solving the following score equations:
the maximum likelihood estimates of γ and λ are
The observed Fisher information matrix is given by
and then the variance of γ̂ can be estimated by , which is the (1,1) entry in the inverse of the Fisher information matrix j−1(γ̂, λ̂, t). Therefore, the Wald test statistic of γ̂(t) is given by
Under the null hypothesis H0: γ = 0,
has an approximate standard normal distribution.
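Since the displayed equations of this appendix are reproduced only in outline, a brief sketch of the calculation is as follows. With λ1 = λe^γ and λ2 = λ, the log-likelihood (up to an additive constant) is ℓ(γ, λ) = d1(t)(γ + log λ) − λe^γU1(t) + d2(t) log λ − λU2(t). Maximization gives λ̂e^γ̂ = d1(t)/U1(t) and λ̂ = d2(t)/U2(t), that is, γ̂(t) = log{d1(t)U2(t)/(d2(t)U1(t))}. The entries of the observed information evaluated at the maximum are −∂²ℓ/∂γ² = d1(t), −∂²ℓ/∂γ∂λ = d1(t)/λ̂, and −∂²ℓ/∂λ² = {d1(t) + d2(t)}/λ̂², so the (1,1) entry of its inverse is 1/d1(t) + 1/d2(t); hence Z(t) = γ̂(t)/{1/d1(t) + 1/d2(t)}^{1/2}.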
Appendix 2: Computation for the sequential test normalized with information time
Assumption
Let B(t) ∼ N(θt,t) be a Gaussian process with the time variable t on interval [0,1] and drift θ. Let 0 < t1 < … < tK = 1 be the information times of the looks for a sequential test with K looks. Let ak < bk be lower and upper boundaries for B(t) at time tk for k = 1,…, K − 1, and aK = bK.
Function
Define as a function of s on interval (ak, bk) for k = 1,…, K − 1; this series of functions can be determined recursively as follows. Let for s on (a1, b1); for k = 2,…, K, for s in (ak, bk),
(15)
where ϕ(·) in (15) is the density function of the standard normal distribution.
Function ltk(·)
Define ltk(s) as a function of s on interval (−∞, ak)∪(bk, ∞), for k = 1,…, K; this series of functions can be determined using functions defined in (15) as follows. Let lt1(s) = 1 for s on (−∞, a1) ∪ (b1, ∞); for k = 2,…, K, for s in (−∞, ak) ∪ (bk, ∞), let
(16)
where ϕ(·) in (16) is the density function of the standard normal distribution.
Power Function
For testing H0: θ ≤ 0 vs. Ha: θ > 0, with functions ltk(·) by (16), the power function P(θ) or the probability of rejecting H0 under the true mean θ is
(17)
For the sequential test design, the significance level is α = P(0) and the power is 1 − β = P(θa), where θa is the value of θ under Ha.
Probability of Stopping
The probability of stopping at time tk is a function of θ as
(18)
with which the probability of stopping at tk is Ptk (0) for the null hypothesis and Ptk (θa) for the alternative hypothesis.
Expected Stopping Time
With the probability of stopping Ptk(θ) from (18), the expected stopping time ET(θ) is a function of θ as
(19)
with which the expected stopping time is ET(0) for the null hypothesis and ET(θa) for the alternative hypothesis.
Expected Sample Size
Suppose the maximum sample size for the sequential test is n. The expected sample size for the sequential test is a function of θ and can be obtained by
(20)
with an expected sample size of EN(0) for the null hypothesis and EN(θa) for the alternative hypothesis.
For SCPRT design
To test H0: θ ≤ 0 vs. Ha: θ > 0 with significance level α and power 1 − β by an SCPRT design, the cutoff value at the final stage tK = 1 is aK = bK = z1−α, the drift under the null hypothesis is θ0 = 0, and the drift under the alternative hypothesis is θa = z1−α + z1−β; these are the same as for the fixed sample test at the final stage with information time t = 1. Substituting θ0 and θa into equations (17), (18), (19), and (20), we can compute the type I error, power, probability of stopping at a given tk, and expected sample sizes under the null and alternative hypotheses for the SCPRT design.
For details of the derivation of these computational formulas, please refer to Xiong and Tan (1999, 2001) and Xiong et al. (2002).
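As a numerical illustration, the following R sketch computes these operating characteristics by a standard recursive grid integration for a Brownian motion B(t) ∼ N(θt, t) observed at the information times t1 < … < tK with boundaries (ak, bk) and aK = bK = z1−α. It is a generic approximation and not necessarily the authors' exact formulas (15)-(20).

```r
## Hedged numerical sketch of the Appendix 2 quantities: rejection probability, stopping
## probabilities, and expected stopping (information) time for an SCPRT-type design.
scprt_oc <- function(theta, times, lower, upper, ngrid = 2000) {
  K <- length(times)
  stop_upper <- stop_lower <- numeric(K)
  # Stage 1: boundary-crossing probabilities and continuation density on (a_1, b_1)
  stop_upper[1] <- 1 - pnorm(upper[1], mean = theta * times[1], sd = sqrt(times[1]))
  stop_lower[1] <- pnorm(lower[1], mean = theta * times[1], sd = sqrt(times[1]))
  grid <- seq(lower[1], upper[1], length.out = ngrid)
  dens <- dnorm(grid, mean = theta * times[1], sd = sqrt(times[1]))
  for (k in 2:K) {
    dt <- times[k] - times[k - 1]
    h  <- grid[2] - grid[1]
    # probability of crossing at look k, integrating over the continuation region at look k - 1
    stop_upper[k] <- sum(dens * (1 - pnorm(upper[k], mean = grid + theta * dt, sd = sqrt(dt)))) * h
    stop_lower[k] <- sum(dens * pnorm(lower[k], mean = grid + theta * dt, sd = sqrt(dt))) * h
    if (k < K) {  # update the continuation density on (a_k, b_k)
      new_grid <- seq(lower[k], upper[k], length.out = ngrid)
      dens <- sapply(new_grid, function(s)
        sum(dens * dnorm(s, mean = grid + theta * dt, sd = sqrt(dt))) * h)
      grid <- new_grid
    }
  }
  p_stop <- stop_upper + stop_lower                    # probability of stopping at each look
  list(reject = sum(stop_upper),                       # power (type I error when theta = 0)
       p_stop = p_stop,
       expected_time = sum(p_stop * times))            # expected stopping (information) time
}

## Example of Section 6, with information times and boundaries as reported there:
times <- c(0.511, 0.706, 1)
lower <- c(-0.298, 0.124, 1.645); upper <- c(1.979, 2.199, 1.645)
scprt_oc(theta = 0, times, lower, upper)$reject                          # about 0.05 (type I error)
scprt_oc(theta = qnorm(0.95) + qnorm(0.90), times, lower, upper)$reject  # about 0.90 (power)
```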
References
- Anderson TW. An introduction to multivariate statistical analysis. New York: Wiley; 1958.
- Collett D. Modeling survival data in medical research. 2nd ed. London: Chapman and Hall; 2003.
- Cox DR, Oakes DV. Analysis of Survival Data. London: Chapman and Hall; 1984.
- Haybittle JL. Repeated assessment of results in clinical trials of cancer treatment. British Journal of Radiology. 1971;44:793–797. doi: 10.1259/0007-1285-44-526-793.
- Heo M, Faith MS, Allison DB. Power and sample size for survival analysis under the Weibull distribution when the whole lifespan is of interest. Mechanisms of Ageing and Development. 1998;102:45–53. doi: 10.1016/s0047-6374(98)00010-4.
- Jennison C, Turnbull BW. Group sequential methods with applications to clinical trials. New York: Chapman and Hall; 2000.
- Jiang Z, Wang L, Li C, Xia J, Jia H. A practical simulation method to calculate sample size of group sequential trials for time-to-event under exponential and Weibull distribution. PLOS ONE. 2012;7:1–12. doi: 10.1371/journal.pone.0044013.
- Lachin JM, Foulkes MA. Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics. 1986;42:507–519.
- Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663.
- Lu Q, Tse SK, Chow SC, Lin M. Analysis of time-to-event data with nonuniform patient entry and loss to follow-up under a two-stage seamless adaptive design with Weibull distribution. Journal of Biopharmaceutical Statistics. 2012;22:773–784. doi: 10.1080/10543406.2012.678528.
- O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556.
- Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977;64:191–199.
- Schoenfeld DA. Sample-size formula for the proportional-hazards regression model. Biometrics. 1983;39:499–503.
- Schoenfeld DA, Richter JR. Nomograms for calculating the number of patients needed for a clinical trial with survival as an endpoint. Biometrics. 1982;38:163–170.
- Sellke T, Siegmund D. Sequential analysis of the proportional hazards model. Biometrika. 1983;79:315–326.
- Slud EV. Sequential linear rank tests for two-sample censored survival data. Annals of Statistics. 1984;12:551–571.
- Tsiatis AA. Repeated significance testing for a general class of statistics used in censored survival analysis. Journal of the American Statistical Association. 1982;77:855–861.
- Tsiatis AA, Boucher H, Kim K. Sequential methods for parametric survival models. Biometrika. 1995;70:165–173.
- Whitehead J, Stratton I. Group sequential clinical trials with triangular continuation regions. Biometrics. 1983;39:227–236.
- Wu J. Power and sample size for randomized phase III survival trials under the Weibull model. Journal of Biopharmaceutical Statistics. 2013; in press. doi: 10.1080/10543406.2014.919940.
- Xiong X. A class of sequential conditional probability ratio tests. Journal of the American Statistical Association. 1995;90:1463–1473.
- Xiong X. A precise approach for sequential test design on comparing survival distributions by log-rank test. Unpublished manuscript; 2014.
- Xiong X, Tan M, Boyett J. Sequential conditional probability ratio tests for normalized test statistic on information time. Biometrics. 2003;59:624–631. doi: 10.1111/1541-0420.00072.