Abstract
Mid-study design modifications are becoming increasingly accepted in confirmatory clinical trials, so long as appropriate methods are applied such that error rates are controlled. It is therefore unfortunate that the important case of time-to-event endpoints is not easily handled by the standard theory. We analyze current methods that allow design modifications to be based on the full interim data, i.e., not only the observed event times but also secondary endpoint and safety data from patients who are yet to have an event. We show that the final test statistic may ignore a substantial subset of the observed event times. An alternative test incorporating all event times is found, where a conservative assumption must be made in order to guarantee type I error control. We examine the power of this approach using the example of a clinical trial comparing two cancer therapies.
1 Introduction
There are often strong ethical and economic arguments for conducting interim analyses [1] of an ongoing clinical trial and for making changes to the design if warranted by the accumulating data. One may decide, for example, to increase the sample size on the basis of promising interim results. Or perhaps one might wish to drop a treatment from a multi-arm study on the basis of unsatisfactory safety data. Owing to the complexity of clinical drug development, it is not always possible to anticipate the need for such modifications, and therefore not all contingencies can be dealt with in the statistical design.
Unforeseen interim modifications complicate the frequentist statistical analysis of the trial considerably. Over recent decades many authors have investigated so-called “adaptive designs” in an effort to maintain the concept of type I error control [2–6]. While the theory behind these methods is now well understood if responses are observed immediately, subtle problems arise when responses are delayed, e.g., in survival trials.
[7] proposed adaptive survival tests that are constructed using the independent increments property of logrank test statistics [8–10]. However, as pointed out by [11], these methods only work if interim decision making is based solely on the interim logrank test statistics and any secondary endpoint data from patients who have already had an event. In other words, investigators must remain blind to the data from patients who are censored at the interim analysis. [12] argue that decisions regarding interim design modifications should be as substantiated as possible, and propose a test procedure that allows investigators to use the full interim data. This methodology, similar to that of [13], does not require any assumptions regarding the joint distribution of survival times and short-term secondary endpoints, unlike, e.g., the methods proposed by [14], [15, 16] and [17].
In this article we analyze the proposals of [13] and [12] and show that they are both based on weighted inverse-normal test statistics [18], with the common disadvantage that the final test statistic may ignore a substantial subset of the observed survival times. This is a serious limitation, as disregarding part of the observed data is generally considered inappropriate even if statistical error probabilities are controlled—see, for example, the discussion on overrunning in group sequential trials [17]. We quantify the potential inflation of the type I error rate if all observed data were used in these approaches. By adjusting the critical boundaries for the least favourable scenario, we derive an alternative testing procedure which allows both sample size reassessment and the use of all observed data.
The article is organized as follows. In Section 2 we review standard adaptive design theory and the recent methods of [13] and [12], as well as calculating the maximum type I error rate if the ignored data is naively reincorporated into the test statistic. In addition, we construct a full-data guaranteed level-α test. In Section 3 we illustrate the procedures in a clinical trial example and discuss the efficiency of the considered testing procedures. We present our conclusions in Section 4. R code to reproduce our results is provided in a supplementary file (S1 File).
2 Methods
2.1 Adaptive Designs
Comprehensive accounts of adaptive design methodology can be found in [6, 19]. For testing a null hypothesis, H0: θ = 0, against the one-sided alternative, Ha: θ > 0, the two-stage adaptive test statistic is of the form f1(p1) + f2(p2), where p1 is the p-value based on first-stage data, p2 is the p-value based on second-stage data, and f1 and f2 are prespecified monotonically decreasing functions. Consider the simplest case that no early rejection of the null hypothesis is possible at the end of the first stage. We will restrict attention to the weighted inverse-normal test statistic [18],
Z = w1 Φ−1(1 − p1) + w2 Φ−1(1 − p2),   (1)
where Φ denotes the standard normal distribution function and w1 and w2 are prespecified weights such that w1² + w2² = 1. If Z > Φ−1(1 − α), then H0 may be rejected at level α. The assumptions required to make this a valid level-α test are as follows [20].
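As a concrete sketch of the combination rule in Eq (1) (written in Python, although the paper's own supplementary code is in R), the statistic and rejection rule might be implemented as follows; the stage-wise p-values and equal weights in the example are purely illustrative:

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()  # standard normal distribution

def inverse_normal_z(p1, p2, w1, w2):
    """Weighted inverse-normal combination statistic of Eq (1)."""
    assert abs(w1 ** 2 + w2 ** 2 - 1.0) < 1e-9, "need w1^2 + w2^2 = 1"
    return w1 * N.inv_cdf(1 - p1) + w2 * N.inv_cdf(1 - p2)

def reject(p1, p2, w1, w2, alpha=0.025):
    """Reject H0 iff Z > Phi^{-1}(1 - alpha)."""
    return inverse_normal_z(p1, p2, w1, w2) > N.inv_cdf(1 - alpha)

# Illustrative stage-wise p-values with equal weights
w = 1 / sqrt(2)
z = inverse_normal_z(0.05, 0.04, w, w)   # ~2.40, exceeds 1.96
```

Note that neither p-value alone is below 0.025, yet the combined statistic rejects; this is the usual behaviour of inverse-normal combination tests.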
Assumption 1
Let X̃1 denote the data available at the interim analysis, where X̃1 ∈ ℝⁿ with distribution function depending on θ. In general, X̃1 will contain information not only concerning the primary endpoint, but also measurements on secondary endpoints and safety data. It is assumed that the first-stage p-value function p1: ℝⁿ → [0, 1] satisfies PH0{p1(X̃1) ≤ u} ≤ u for all u ∈ [0, 1].
Assumption 2
At the interim analysis, a second-stage design δ is chosen. The second-stage design is allowed to depend on the unblinded first-stage data without prespecifying an adaptation rule. Denote the second-stage data by Y, where Y ∈ ℝᵐ. It is assumed that the distribution function of Y is known for all possible second-stage designs, δ, and all first-stage outcomes, x̃1.
Assumption 3
The second-stage p-value function p2: ℝᵐ → [0, 1] satisfies PH0{p2(Y) ≤ u ∣ X̃1 = x̃1, δ} ≤ u for all u ∈ [0, 1], all first-stage outcomes x̃1 and all second-stage designs δ.
Immediate responses
The aforementioned assumptions are easy to justify when primary endpoint responses are observed more-or-less immediately. In this case X̃1 contains the responses of all patients recruited prior to the interim analysis. A second-stage design δ can subsequently be chosen, with the responses from a new cohort of patients contributing to Y.
Delayed responses and the independent increments assumption
An interim analysis may take place whilst some patients have entered the study but have yet to provide a data point on the primary outcome measure. Most approaches to this problem [7, 8, 10] attempt to take advantage of the well-known independent increments structure of score statistics in group sequential designs [21]. As pictured in Fig 1, X̃1 will generally include responses on short-term secondary endpoints and safety data from patients who are yet to provide a primary outcome measure, while Y consists of some delayed responses from patients recruited prior to the interim analysis, mixed together with responses from a new cohort of patients.
Fig 1. Schematic of a two-stage adaptive trial design with delayed response using the independent increments assumption.
Let S̃1 and Ĩ1 denote the score statistic and Fisher's information for θ, calculated from the primary endpoint responses in X̃1. Assuming suitable regularity conditions, the asymptotic null distribution of S̃1 is Gaussian with mean zero and variance Ĩ1 [22]. The independent increments assumption is that for all first-stage outcomes x̃1 and second-stage designs δ, the null distribution of Y is such that
S̃2 − S̃1 ∣ X̃1 = x̃1, δ ~ N(0, Ĩ2 − Ĩ1),   (2)
at least approximately, where S̃2 and Ĩ2 denote the score statistic and Fisher's information for θ, calculated from the primary endpoint responses in (X̃1, Y).
Unfortunately, Eq (2) is seldom realistic in an adaptive setting. [11] show that if the adaptive strategy at the interim analysis is dependent on short-term outcomes in X̃1 that are correlated with primary endpoint outcomes in Y, i.e., from the same patient, then a naive appeal to the independent increments assumption can lead to very large type I error inflation.
Delayed responses with patient-wise separation
An alternative approach, which we call “patient-wise separation”, redefines the first-stage p-value, p1: ℝᵖ → [0, 1], to be a function of X1, where X1 denotes all the data from patients recruited prior to the interim analysis at calendar time Tint, followed-up until a pre-fixed maximum calendar time Tmax. In this case p1 may not be observable at the time the second-stage design δ is chosen. This is not a problem, as long as no early rejection at the end of the first stage is foreseen. Any interim decisions, such as increasing the sample size, do not require knowledge of p1. It is assumed that Y consists of responses from a new cohort of patients, such that x̃1 could be formally replaced with x1 in the aforementioned adaptive design assumptions. We call this patient-wise separation because data from the same patient cannot contribute to both p1 and p2.
[23] and [24] apply this approach when a patient’s primary outcome can be measured after a fixed period of follow-up, e.g., 4 months. However, one must take additional care with a time-to-event endpoint, as one is typically not prepared to wait for all first-stage patients—those patients recruited prior to Tint—to have an event. Rather, p1 is defined as the p-value from a statistical test applied to the data from first-stage patients followed up until time Tend, for some Tend ≤ Tmax. In this case it is vital that Tend be fixed at the start of the trial, either explicitly or implicitly [12, 13]. Otherwise, if Tend were to depend on the adaptive strategy at the interim analysis, this would impact the distribution of p1 and could lead to type I error inflation.
The situation is represented pictorially in Fig 2. An unfortunate consequence of pre-fixing Tend is that this will not, in all likelihood, correspond to the end of follow-up for second-stage patients. All events from first-stage patients that occur after Tend make no contribution to the statistic Eq (1).
Fig 2. Schematic of a two-stage adaptive trial design with delayed response using patient-wise separation.
2.2 Adaptive Survival Studies
Consider a randomized clinical trial comparing survival times on an experimental treatment, E, with those on a control treatment, C. For simplicity, we will focus on the logrank statistic for testing the null hypothesis H0: θ = 0 against the one-sided alternative Ha: θ > 0, where θ is the log hazard ratio, assuming proportional hazards. Similar arguments could be applied to the Cox model. Let D1(t) and S1(t) denote the number of uncensored events and the usual logrank score statistic, respectively, based on the data from first-stage patients—those patients recruited prior to the interim analysis—followed up until calendar time t, t ∈ [0, Tmax]. Under the null hypothesis, assuming equal allocation and a large number of events, the variance of S1(t) is approximately equal to D1(t)/4 [25]. The first-stage p-value must be calculated at a prefixed time point Tend:
p1 = 1 − Φ(S1(Tend)/√{D1(Tend)/4}).   (3)
The number of events can be prefixed at d1, say, with Tend chosen implicitly
Tend := min{t : D1(t) = d1}.   (4)
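The computation behind Eqs (3) and (4) can be sketched as follows (a Python sketch; the paper's own code supplement is in R). The logrank implementation below assumes no tied event times and censors, at Tend, only those first-stage patients whose events fall after Tend; the event calendar times in the example are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def logrank_score(times_e, times_c, t):
    """Logrank score S and event count D at analysis time t. Subjects
    with event times beyond t are censored at t but stay in the risk
    sets (no ties assumed in this sketch). Under H0 with equal
    allocation, Var(S) is approximately D/4."""
    n_e, n_c = len(times_e), len(times_c)
    score, d = 0.0, 0
    for u, grp in sorted([(x, "E") for x in times_e] +
                         [(x, "C") for x in times_c]):
        if u > t:
            break                       # remaining subjects are censored at t
        n = n_e + n_c
        score += (1.0 if grp == "E" else 0.0) - n_e / n   # O - E at time u
        d += 1
        if grp == "E":
            n_e -= 1
        else:
            n_c -= 1
    return score, d

def first_stage_p1(times_e, times_c, d1):
    """p1 of Eq (3), evaluated at the implicit Tend of Eq (4): the
    calendar time of the d1-th event among first-stage patients."""
    t_end = sorted(times_e + times_c)[d1 - 1]             # Eq (4)
    s, d = logrank_score(times_e, times_c, t_end)
    return t_end, 1 - NormalDist().cdf(s / sqrt(d / 4))   # Eq (3)

# Hypothetical event calendar times (months) for eight first-stage patients
t_end, p1 = first_stage_p1([2, 5, 9, 13], [1, 3, 4, 8], d1=5)
```

Each term of the variance sum is at most 1/4 and close to 1/4 while the arms remain balanced, which is the origin of the D/4 approximation used throughout.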
Jenkins et al. method
[13] describe a “patient-wise separation” adaptive survival trial, with test statistic Eq (1), first-stage p-value Eq (3) and Tend defined as in Eq (4). While their focus is on subgroup selection, we will appropriate their method for the simpler situation of a single comparison, where at the interim analysis one has the possibility to adapt the pre-planned number of events from second-stage patients—i.e., those patients recruited post Tint. The weights in Eq (1) are pre-fixed in proportion to the pre-planned number of events to be contributed from each stage, i.e., w1 = √{d1/(d1 + d2)} and w2 = √{d2/(d1 + d2)}, where d1 + d2 is the total originally required number of events. The second-stage p-value corresponds to a logrank test based on second-stage patients, i.e.,
p2 = 1 − Φ(S2(T̃)/√{D2(T̃)/4}), where T̃ := min{t : D2(t) = d̃2}, with S2(t) and D2(t) defined analogously to S1(t) and D1(t), and where the number of second-stage events d̃2 is specified at the interim analysis.
Irle and Schäfer method
Instead of explicitly combining stage-wise p-values, [12] employ the closely related “conditional error” approach [3, 4, 26].
They begin by prespecifying a level-α test with decision function, φ, taking values in {0, 1} corresponding to non-rejection and rejection of H0, respectively. For a survival trial, this entails specifying the sample size, duration of follow-up, test statistic, recruitment rate, etc. Then, at some not necessarily prespecified timepoint, Tint, an interim analysis is performed. The timing of the interim analysis induces a partition of the trial data, (X1, X2), where X1 and X2 denote the data from patients recruited prior to Tint and post Tint, respectively, followed-up until time Tmax. For a standard log-rank test, the decision function is
φ = 1{S(Tend)/√(D(Tend)/4) > Φ−1(1 − α)},   (5)
where D(Tend) and S(Tend) denote the number of uncensored events and the usual logrank score statistic, respectively, based on data from all patients followed-up until time Tend := min{t : D(t) = d}, for some prespecified number of events d.
At the interim analysis, the general idea is to use the unblinded first-stage data to define a second-stage design, δ, without the need for a prespecified adaptation strategy. Again, the definition of δ includes factors such as sample size, follow-up period, recruitment rate, etc., in addition to a second-stage decision function ψ: ℝᵐ → {0, 1} based on second-stage data Y ∈ ℝᵐ. Ideally, one would like to choose ψ such that EH0(ψ ∣ X̃1 = x̃1) = EH0(φ ∣ X̃1 = x̃1), as this would ensure that
EH0(ψ) = EH0{EH0(ψ ∣ X̃1)} = EH0{EH0(φ ∣ X̃1)} = EH0(φ) = α,   (6)
i.e., the overall procedure controls the type I error rate at level α. Unfortunately, this approach is not directly applicable in a survival trial where X̃1 contains short-term data from first-stage patients surviving beyond Tint. This is because it is impossible to calculate EH0(φ ∣ X̃1 = x̃1) and EH0(ψ ∣ X̃1 = x̃1), owing to the unknown joint distribution of survival times and the secondary endpoints already observed at the interim analysis. One may, however, condition on X1 rather than on X̃1 and choose ψ such that EH0(ψ ∣ X1 = x1) = EH0(φ ∣ X1 = x1), thus ensuring type I error control following the same argument as Eq (6). For example, it is possible to extend patient follow-up and use the second-stage decision function
ψ = 1{S(T*)/√(D(T*)/4) > b*},   (7)
where T* := min{t : D(t) = d*}, d* ≥ d is chosen at the interim analysis, and b* is a cutoff value that must be determined. [12] show that, asymptotically,

EH0(φ ∣ X1 = x1) = 1 − Φ[{Φ−1(1 − α)√(D(Tend)/4) − S1(Tend)}/√{(D(Tend) − D1(Tend))/4}]

and

EH0(ψ ∣ X1 = x1) = 1 − Φ[{b*√(D(T*)/4) − S1(T*)}/√{(D(T*) − D1(T*))/4}].
In each case, calculation of the right-hand-side is facilitated by the asymptotic result that, assuming equal allocation under the null hypothesis, for t ∈ [0, Tmax],
S(t) − S1(t) ∣ X1 = x1 ~ N(0, {D(t) − D1(t)}/4).   (8)
One remaining subtlety is that EH0(ψ ∣ X1 = x1) can only be calculated at calendar time T*, where T* > Tint. Determination of b* must therefore be postponed until this later time.
Using result Eq (8), it is straightforward to show that ψ = 1 if and only if Z > Φ−1(1 − α), where Z is defined as in Eq (1) with p1 defined as in Eq (3), the second-stage p-value function defined as
p2 = 1 − Φ[{S(T*) − S1(T*)}/√{(D(T*) − D1(T*))/4}],   (9)
and the specific choice of weighting w1 = √{D1(Tend)/d}. Full details are provided in supplementary material (S2 File).
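The equivalence can be checked numerically. Under our reading of the procedure—decision functions as in Eqs (5) and (7), the conditional distribution of Eq (8), weights w1 = √{D1(Tend)/d} and p2 as in Eq (9)—the conditional-error decision ψ and the inverse-normal decision coincide exactly. All interim quantities below are hypothetical illustrations, not values from the trial example:

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()
alpha = 0.025
crit = N.inv_cdf(1 - alpha)

# Hypothetical quantities: d events originally planned, d* after the
# extension; D1/S1 are event counts and logrank scores from
# first-stage patients, S_Tstar the score from all patients at T*.
d, d_star = 248, 330
D1_Tend, S1_Tend = 149, 12.0
D1_Tstar, S1_Tstar = 194, 14.0
S_Tstar = 26.0

# Conditional-error route: choose b* so that the conditional error of
# psi (Eq (7)) matches that of phi (Eq (5)), using Eq (8).
c = (crit * sqrt(d / 4) - S1_Tend) / sqrt((d - D1_Tend) / 4)
b_star = (S1_Tstar + c * sqrt((d_star - D1_Tstar) / 4)) / sqrt(d_star / 4)
psi = S_Tstar / sqrt(d_star / 4) > b_star

# Inverse-normal route: Eq (1) with p1 from Eq (3), p2 from Eq (9),
# and the implied weighting w1 = sqrt(D1(Tend)/d).
w1 = sqrt(D1_Tend / d)
w2 = sqrt(1 - D1_Tend / d)
z1 = S1_Tend / sqrt(D1_Tend / 4)
z2 = (S_Tstar - S1_Tstar) / sqrt((d_star - D1_Tstar) / 4)
reject_z = w1 * z1 + w2 * z2 > crit
```

The two routes agree for any inputs, since both reduce to the same inequality after dividing through by √(d/4).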
Remark 1. The Irle and Schäfer method uses the same test statistic as the Jenkins et al. method, with a clever way of implicitly defining the weights and the end of first-stage follow-up, Tend. It has two potential advantages. Firstly, the timing of the interim analysis need not be prespecified—in theory, one is permitted to monitor the accumulating data and at any moment decide that design changes are necessary. Secondly, if no changes to the design are necessary, i.e., the trial completes as planned at calendar time Tend, then the original test Eq (5) is performed. In this special case, no events are ignored in the final test statistic.
Remark 2. From first glance at Eq (7), it may appear that the events from first-stage patients, occurring after Tend, always make a contribution to the final test statistic. However, this data is still effectively ignored. We have shown in the online supplement that the procedure is equivalent to a p-value combination approach where p1 depends only on data available at time Tend. In addition, the distribution of p2 is asymptotically independent of the data from first-stage patients: note that S(T*) − S1(T*) and S2(T*) are asymptotically equivalent [12]. The procedure therefore fits our description of a “patient-wise separation” design, and the picture is the same as in Fig 2. The first-stage patients have in effect been censored at Tend, despite having been followed-up for longer. This fact has important implications for the choice of d*. If one chooses d* based on conditional power arguments, one should be aware that the effective sample size has not increased by d* − d. Rather, it has increased by d* − d − {D1(T*) − D1(Tend)}, which could be very much smaller.
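Remark 2's point about the effective event count is easy to miss, so here is the arithmetic spelled out with hypothetical numbers (d = 248 is the planned total from the later example; the remaining figures are invented for illustration):

```python
# Extending from d = 248 to d* = 330 events looks like an 82-event
# gain. But if first-stage patients contribute a further 45 events
# between Tend and T* (here D1(T*) - D1(Tend) = 194 - 149), those
# events are effectively ignored by the patient-wise separation test.
d, d_star = 248, 330
d1_tend, d1_tstar = 149, 194

naive_gain = d_star - d                                # 82 events
effective_gain = d_star - d - (d1_tstar - d1_tend)     # only 37 events
```

Conditional power calculations based on the naive gain would therefore overstate the information actually added by the extension.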
Remark 3. A potential disadvantage of the Irle and Schäfer method compared to the Jenkins et al. method is that one is not permitted to adapt any aspect of the recruitment process prior to time Tend. Contrary to what is claimed in [12], it is not valid to extend the recruitment period (or speed up recruitment as in the example they give) to reach an increased number of events d* within the originally planned trial duration. This is because Tend is defined implicitly as Tend: = min{t : D(t) = d} under the assumptions of the original design. Therefore Tend is unobservable if the recruitment process is changed in response to the interim data. [27] discuss this issue further.
2.3 Hypothesis tests based on all available follow-up data
Suppose that the trial continues until calendar time T* ∈ (Tend, Tmax). Data from first-stage patients—those patients recruited prior to Tint—accumulating between times Tend and T* should be ignored. In this section we will investigate what happens, in a worst case scenario, if this illegitimate data is naively incorporated into the adaptive test statistic Eq (1). Specifically, we find the maximum type I error associated with the test statistic
Z* := w1 S1(T*)/√{D1(T*)/4} + w2 Φ−1(1 − p2).   (10)
Since T* depends on the interim data in a complicated way, the null distribution of Eq (10) is unknown. One can, however, consider properties of the stochastic process Z(t) := w1 S1(t)/√{D1(t)/4} + w2 Φ−1(1 − p2), t ∈ [Tend, Tmax], so that Z* = Z(T*).
In other words, we consider continuous monitoring of the logrank statistic based on first-stage patient data. The worst-case scenario assumption is that the responses on short-term secondary endpoints, available at the interim analysis, can be used to predict the exact calendar time the process Z(t) reaches its maximum. In this case, one could attempt to engineer the second stage design such that T* coincides with this timepoint, and the worst-case type I error rate is therefore
max α = PH0(Z(t) > Φ−1(1 − α) for some t ∈ [Tend, Tmax]).   (11)
Although the worst-case scenario assumption is clearly unrealistic, Eq (11) serves as an upper bound on the type I error rate. It can be found approximately via standard Brownian motion results. Let u := D1(t)/D1(Tmax) denote the information time at calendar time t, and let S1(u) denote the logrank score statistic based on first-stage patients, followed-up until information time u. It can be shown that B(u) := S1(u)/√{D1(Tmax)/4} behaves asymptotically like a Brownian motion with drift ξ := θ√{D1(Tmax)/4} [28]. We wish to calculate
max α = ∫ P(B(u) > √u {Φ−1(1 − α) − w2 z}/w1 for some u ∈ [u1, 1]) dΦ(z),   (12)
where u1 = D1(Tend)/D1(Tmax). While the integrand on the right-hand-side is difficult to evaluate exactly, it can be found to any required degree of accuracy by replacing the square root stopping boundary with a piecewise linear boundary [29].
The two parameters that govern the size of Eq (11) are w1 and u1. Larger values of w1 reflect an increased weighting of the first-stage data, which increases the potential inflation. In addition, a low value for u1 increases the window of opportunity for stopping on a random high. Table 1 shows that for a nominal α = 0.025 level test, the worst-case type I error can be as high as 0.155 when u1 = 0.1 and w1 = 0.9. As u1 → 0, the worst-case type I error rate tends to 1 for any value of w1 > 0 [30].
Table 1. Worst case type I error for various choices of weights and information fractions.
Nominal level α = 0.025 one-sided.
| w1 \ u1 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.052 | 0.047 | 0.044 | 0.041 | 0.039 | 0.037 | 0.035 | 0.033 | 0.030 |
| 0.2 | 0.067 | 0.059 | 0.054 | 0.050 | 0.046 | 0.043 | 0.039 | 0.036 | 0.032 |
| 0.3 | 0.081 | 0.070 | 0.062 | 0.057 | 0.052 | 0.047 | 0.043 | 0.039 | 0.034 |
| 0.4 | 0.094 | 0.080 | 0.071 | 0.063 | 0.057 | 0.052 | 0.046 | 0.041 | 0.036 |
| 0.5 | 0.106 | 0.089 | 0.078 | 0.069 | 0.062 | 0.056 | 0.050 | 0.044 | 0.037 |
| 0.6 | 0.119 | 0.098 | 0.085 | 0.075 | 0.067 | 0.059 | 0.053 | 0.046 | 0.038 |
| 0.7 | 0.131 | 0.107 | 0.092 | 0.081 | 0.072 | 0.063 | 0.055 | 0.048 | 0.040 |
| 0.8 | 0.143 | 0.116 | 0.100 | 0.087 | 0.076 | 0.067 | 0.058 | 0.050 | 0.041 |
| 0.9 | 0.155 | 0.125 | 0.106 | 0.092 | 0.081 | 0.070 | 0.061 | 0.052 | 0.042 |
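The worst-case probability of Eq (11) can also be checked by direct simulation (the paper evaluates Eq (12) via boundary-crossing approximations instead). The sketch below simulates the process w1·B(u)/√u + w2·Z2 under H0 on a discrete grid, which slightly underestimates the true supremum; the w1 and u1 values are one cell of Table 1:

```python
import random
from math import sqrt
from statistics import NormalDist

def worst_case_alpha(w1, u1, alpha=0.025, n_sim=10000, n_grid=100, seed=1):
    """Monte Carlo estimate of the worst-case type I error, Eq (11):
    P_H0{ w1*B(u)/sqrt(u) + w2*Z2 > Phi^{-1}(1-alpha) for some u in [u1, 1] },
    with B a standard Brownian motion and Z2 an independent N(0, 1)
    second-stage statistic."""
    rng = random.Random(seed)
    w2 = sqrt(1 - w1 ** 2)
    crit = NormalDist().inv_cdf(1 - alpha)
    du = (1 - u1) / n_grid
    hits = 0
    for _ in range(n_sim):
        z2 = rng.gauss(0, 1)
        b = rng.gauss(0, sqrt(u1))       # B(u1) ~ N(0, u1)
        u = u1
        sup = w1 * b / sqrt(u) + w2 * z2
        for _ in range(n_grid):
            u += du
            b += rng.gauss(0, sqrt(du))  # independent Brownian increment
            sup = max(sup, w1 * b / sqrt(u) + w2 * z2)
        hits += sup > crit
    return hits / n_sim

est = worst_case_alpha(w1=0.5, u1=0.5)
```

For w1 = 0.5 and u1 = 0.5, Table 1 reports 0.062; the discrete-grid simulation should land slightly below that, well above the nominal 0.025.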
A full-data guaranteed level-α test
If one is unprepared to give up the guarantee of type I error control, an alternative test can be found by increasing the cut-off value for Z* from Φ−1(1 − α) to k*, chosen such that the worst-case type I error rate, PH0(Z(t) > k* for some t ∈ [Tend, Tmax]), is equal to α.
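One way to calibrate k* is as the (1 − α) quantile of the worst-case supremum under H0. The sketch below uses the same Brownian-motion approximation on a discrete grid (which slightly underestimates the supremum, and hence k*); the w1 and u1 values are illustrative:

```python
import random
from math import sqrt

def k_star(w1, u1, alpha=0.025, n_sim=10000, n_grid=100, seed=2):
    """Adjusted cut-off for the full-data level-alpha test: the
    (1 - alpha) quantile of sup_{u in [u1, 1]} w1*B(u)/sqrt(u) + w2*Z2
    under H0, estimated by simulation."""
    rng = random.Random(seed)
    w2 = sqrt(1 - w1 ** 2)
    du = (1 - u1) / n_grid
    sups = []
    for _ in range(n_sim):
        z2 = rng.gauss(0, 1)
        b = rng.gauss(0, sqrt(u1))       # B(u1) ~ N(0, u1)
        u = u1
        s = w1 * b / sqrt(u) + w2 * z2
        for _ in range(n_grid):
            u += du
            b += rng.gauss(0, sqrt(du))
            s = max(s, w1 * b / sqrt(u) + w2 * z2)
        sups.append(s)
    sups.sort()
    return sups[int((1 - alpha) * n_sim)]

k = k_star(w1=0.5, u1=0.5)   # strictly above Phi^{-1}(0.975) = 1.96
```

Since the supremum dominates the value of the process at any single time point, k* necessarily exceeds Φ−1(1 − α), which is the source of the power loss discussed in Section 3.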
3 Results
3.1 Clinical trial example
The upper bound on the type I error rate varies substantially across w1 and u1. To give an indication of what can be expected in practice, consider a simplified version of the trial described in [12]. A randomized trial is set up to compare chemotherapy (C) with a combination of radiotherapy and chemotherapy (E). The anticipated median survival time on C is 14 months. If E were to increase the median survival time to 20 months then this would be considered a clinically relevant improvement. Assuming exponential survival times, this gives anticipated hazard rates λC = 0.050 and λE = 0.035, and a target log hazard ratio of θR = −log(λE/λC) ≈ 0.36. If the error rates for testing H0 : θ = 0 against Ha : θ = θR are α = 0.025 (one-sided) and β = 0.2, the required number of deaths (assuming equal allocation) is d = 4{Φ−1(1 − α) + Φ−1(1 − β)}²/θR² ≈ 248.
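The planning calculation can be reproduced with Schoenfeld's approximation for the number of events under 1:1 allocation (a sketch; exact rounding conventions may differ slightly from the trial's planning figure):

```python
from math import ceil, log
from statistics import NormalDist

N = NormalDist()

# Planning values from the example: exponential survival with median
# 14 months on C and 20 months on E, alpha = 0.025 one-sided, 80% power.
lam_c = log(2) / 14            # ~0.050 per month
lam_e = log(2) / 20            # ~0.035 per month
theta_r = -log(lam_e / lam_c)  # target log hazard ratio, ~0.36

# Schoenfeld-type approximation with equal allocation:
# d = 4 * (z_{1-alpha} + z_{1-beta})^2 / theta_r^2
d = ceil(4 * (N.inv_cdf(0.975) + N.inv_cdf(0.8)) ** 2 / theta_r ** 2)
```

With the unrounded hazard rates this gives a value within a couple of events of the 248 used in the example.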
The relationship between the required number of events and the sample size depends on the recruitment pattern, and we will consider two scenarios. In our “slow recruitment” scenario, patients are recruited uniformly at a rate of 8 per month for a maximum of 60 months with an interim analysis performed at 23 months. In our “fast recruitment” scenario, patients are recruited uniformly at a rate of 50 per month for a maximum of 18 months with an interim analysis after 8 months. In both cases, the only adaptation we allow at the interim analysis is to increase the number of events. Recruitment must continue as planned but the follow-up period may be extended. The maximum duration of the trial is restricted to 100 months in the first case and 30 months in the second case.
Fig 3 shows the expected number of events as a function of time for both scenarios assuming exponentially distributed survival times with hazards equal to the planned values.
Fig 3. Expected total number of events as a function of time based on exponential survival with hazard rates λC = 0.05 and λE = 0.035.
Slow recruitment: 8 patients per month for a maximum of 60 months. Fast recruitment: 50 patients per month for a maximum of 18 months. Vertical lines are at Tint, Tend and Tmax.
The maximum type I error inflation, determined via w1 and u1, will depend on the observed numbers of events from first- and second-stage patients at calendar times Tint and Tend. However, the expected pattern of events in Fig 3 provides some indication. In the slow recruitment scenario, we expect to recruit 179 patients by the time of the interim analysis. We also expect 149 of the first 248 events to come from patients recruited prior to the interim analysis. These numbers would give w1 = √(149/248), u1 = 149/179 and, according to Eq (12), max α = 0.044. For the fast recruitment scenario the respective quantities are w1 = √(169/248), u1 = 169/264 and max α = 0.060.
On the efficiency of the full-data level-α test
Consider the full-data guaranteed level-α test defined above. Recall that this test has the advantage of allowing interim decision making to be based on all available data whilst using a final test statistic that takes account of all observed event times. Unfortunately, this advantage is likely to be outweighed by the loss in power resulting from the increased cut-off value, as can be seen in Fig 4. The difference between the noncentrality parameters of Z(T*) and Z(Tend) is plotted against the time extension T* − Tend for various choices of θ. In the slow recruitment scenario the increase in the noncentrality parameter is outweighed by the increase in the cut-off value, even when the log-hazard ratio is as large as was expected in the planning phase. In the fast recruitment setting, it is possible for the increase in the noncentrality parameter to exceed the increase in the cut-off value when the trial is extended substantially. However, the trial would typically only need to be increased substantially if the true effect size were lower than planned. And in this case (θ ≤ 0.66θR) one can see that the increased cut-off value still dominates.
Fig 4. Difference between the noncentrality parameters of the adaptive test statistics Z(T*) and Z(Tend) as a function of the time extension T* − Tend ∈ [0, Tmax − Tend].
Horizontal lines are drawn at k* − Φ−1(0.975), where k* denotes the cut-off value of the full-data guaranteed level-α test, and Φ denotes the standard normal distribution function.
4 Discussion
Unblinded sample-size recalculation has been criticized for its lack of efficiency relative to classical group sequential designs [31, 32]. If the recalculation is made on the basis of an early estimate of treatment effect, the final sample size is likely to have high variability [33], and, in addition, the test decision is based on a non-sufficient statistic. [34] show how, for a given re-estimation rule, a classical group sequential design can be found with an essentially identical power function but lower expected sample size.
In response to these arguments [35] emphasize that “the real benefit of the adaptive approach arises through the ability to invest sample resources into the trial in stages”. An efficient group sequential trial, on the other hand, requires a large up-front sample size commitment and aggressive early stopping boundaries. From the point of view of the trial sponsor, the added flexibility may in some circumstances outweigh the loss of efficiency.
In this paper we have shown that when the primary endpoint is time-to-event, a fully unblinded sample-size recalculation—i.e., a decision based on all available efficacy and safety data—has additional drawbacks not considered in the aforementioned literature. Recently proposed methods [12, 13] share the common disadvantage that some patients’ event times are ignored in the final test statistic. This is usually deemed unacceptable by regulators. Furthermore, it is the long-term data of patients recruited prior to the interim analysis that is ignored, such that more emphasis is put on early events in the final decision making. This neglect becomes more serious, therefore, if the hazard rates differ substantially only at large survival times. Note, however, that a standard logrank test would already be inefficient in this scenario [36].
The relative benefit of the Irle and Schäfer method [12], in comparison with that of Jenkins et al. [13], is that the timing of the interim analysis need not be pre-specified and, in addition, the method is efficient if no design changes are necessary. On the other hand, the Irle and Schäfer method has the serious practical flaw that it is not permissible to change any aspect of the recruitment process in response to the interim data.
Confirmatory clinical trials with time-to-event endpoints appear to be one of the most important fields of application of adaptive methods [37]. It is therefore especially important that investigators considering an unblinded sample size re-estimation in this context are aware of the additional issues involved. We have shown that all considered procedures will require giving up an important statistical property—a situation summarized succinctly in Table 2.
Table 2. Trade-off involved in choosing between methods when extending the follow-up period of a survival trial.
Methods considered: (A), data is combined assuming independent stage-wise increments; (B), patient-wise separation with pre-fixed end of first-stage follow-up; (C), naive patient-wise separation without pre-fixed end of first-stage follow-up; and (D), patient-wise separation using the full-data guaranteed level-α test.
| Strict type I error control | All data available for interim decisions | All events included in test statistic | Relative power | |
|---|---|---|---|---|
| (A) Ind. Increments | √ | × | √ | √ |
| (B) Z(Tend) > z1−α | √ | √ | × | √ |
| (C) Z(T*) > z1−α | × | √ | √ | √ |
| (D) Z(T*) > k* | √ | √ | √ | × |
The relevance of these issues is highlighted by the recently published VALOR trial in acute myeloid leukaemia [38]. Treatment effect estimates from phase II data suggested that 375 events might be sufficient to confirm efficacy. However, there is always uncertainty surrounding such an estimate. A smaller effect size—corresponding to upwards of 500 required events—would still be clinically meaningful, but funding such a trial was beyond the resources of the study sponsor. The solution was to initiate the trial with the smaller sample size but plan an interim analysis, whereby promising results would trigger additional investment. In this case, the interim decision rules were pre-specified and, upon observing a promising hazard ratio after 173 events, the total required number of events was increased to 562. The final analysis was based on a weighted combination of log-rank statistics, corresponding to method (A) in Table 2. It is important to emphasize that the validity of this approach relies on the second-stage sample size being a function of the interim hazard ratio alone. Had other information—e.g., disease progressions—played a part in the interim decision making, then the type I error rate could have been compromised as described in this paper.
While statistical theory can be developed to control the type I error rate given certain model assumptions, there is always the potential for “operational bias” to enter an adaptive trial. FDA draft guidance [39] emphasizes the need to shield investigators as much as possible from knowledge of the adaptive changes. The very knowledge that sample size has been increased—implying a “promising” interim effect estimate—could lead to changes of behavior in terms of treating, managing, and evaluating study participants. As a minimum, the European Medicines Agency requires that the primary analysis “be stratified according to whether patients were randomized before or after the protocol amendment” [40]. Aside from the regulatory importance, it is also in the sponsor’s interest to minimize operational bias when trial outcomes will influence significant investment decisions [41]. For a further discussion on the regulatory and logistical challenges sponsors may face we refer to [6, 19].
We have focussed our attention on the type I error control and power of the various procedures. Estimation of the treatment effect size following an adaptive survival trial is also an important topic. Currently available methods can be found in [8], [42] and [43].
Supporting Information
S1 File. R code to reproduce the results of this article. (R)
S2 File. Details of the equivalence between the Irle and Schäfer procedure and a weighted inverse-normal combination test. (PDF)
Acknowledgments
DM was funded by the Medical Research Council (MR/J004979/1) and the Austrian Science Fund (FWF P23167). TJ was supported by the National Institute for Health Research (NIHR-CDF-2010-03-32). FK was supported by European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 602552 (IDEAL). MP was supported by the EU FP7 HEALTH.2013.4.2-3 project Asterix (Grant Number 603160). The views expressed in this publication are those of the authors and should not be attributed to any of the funding institutions, or organisations with which the authors are affiliated.
Data Availability
All relevant data are within the paper and code is available at http://dx.doi.org/10.6084/m9.figshare.1608320.
Funding Statement
DM was funded by the Medical Research Council (MR/J004979/1) and the Austrian Science Fund (FWF P23167). TJ was supported by the National Institute for Health Research (NIHR-CDF-2010-03-32). FK was supported by the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 602552 (IDEAL). MP was supported by the EU FP7 HEALTH.2013.4.2-3 project Asterix (Grant Number 603160). The views expressed in this publication are those of the authors and should not be attributed to any of the funding institutions or organisations with which the authors are affiliated. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Jiang Z, Wang L, Li C, Xia J, Jia H. A practical simulation method to calculate sample size of group sequential trials for time-to-event data under exponential and Weibull distribution. PLoS One. 2012;7(9):e44013. doi: 10.1371/journal.pone.0044013
- 2. Bauer P, Köhne K. Evaluation of experiments with adaptive interim analyses. Biometrics. 1994;50:1029–1041. Correction: Biometrics 1996;52:380. doi: 10.2307/2533441
- 3. Proschan MA, Hunsberger SA. Designed Extension of Studies Based on Conditional Power. Biometrics. 1995;51:1315–1324. doi: 10.2307/2533262
- 4. Müller HH, Schäfer H. Adaptive Group Sequential Designs for Clinical Trials: Combining the Advantages of Adaptive and of Classical Group Sequential Approaches. Biometrics. 2001;57:886–891. doi: 10.1111/j.0006-341X.2001.00886.x
- 5. Hommel G. Adaptive modifications of hypotheses after an interim analysis. Biometrical Journal. 2001;43(5):581–589.
- 6. Bauer P, Bretz F, Dragalin V, König F, Wassmer G. Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls. Statistics in Medicine. 2015 (Early View).
- 7. Schäfer H, Müller HH. Modification of the sample size and the schedule of interim analyses in survival trials based on data inspections. Statistics in Medicine. 2001;20(24):3741–3751. doi: 10.1002/sim.1136
- 8. Wassmer G. Planning and analyzing adaptive group sequential survival trials. Biometrical Journal. 2006;48(4):714–729. doi: 10.1002/bimj.200510190
- 9. Desseaux K, Porcher R. Flexible two-stage design with sample size reassessment for survival trials. Statistics in Medicine. 2007;26(27):5002–5013. doi: 10.1002/sim.2966
- 10. Jahn-Eimermacher A, Ingel K. Adaptive trial design: A general methodology for censored time to event data. Contemporary Clinical Trials. 2009;30(2):171–177. doi: 10.1016/j.cct.2008.12.002
- 11. Bauer P, Posch M. Letter to the editor. Statistics in Medicine. 2004;23:1333–1334.
- 12. Irle S, Schäfer H. Interim design modifications in time-to-event studies. Journal of the American Statistical Association. 2012;107:341–348. doi: 10.1080/01621459.2011.644141
- 13. Jenkins M, Stone A, Jennison C. An Adaptive Seamless Phase II/III Design for Oncology Trials with Subpopulation Selection Using Correlated Survival Endpoints. Pharmaceutical Statistics. 2011;10:347–356. doi: 10.1002/pst.472
- 14. Stallard N. A confirmatory seamless phase II/III clinical trial design incorporating short-term endpoint information. Statistics in Medicine. 2010;29(9):959–971.
- 15. Friede T, Parsons N, Stallard N, Todd S, Valdes Marquez E, Chataway J, et al. Designing a seamless phase II/III clinical trial using early outcomes for treatment selection: An application in multiple sclerosis. Statistics in Medicine. 2011;30(13):1528–1540.
- 16. Friede T, Parsons N, Stallard N. A conditional error function approach for subgroup selection in adaptive clinical trials. Statistics in Medicine. 2012;31(30):4309–4320. doi: 10.1002/sim.5541
- 17. Hampson LV, Jennison C. Group sequential tests for delayed responses (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013;75(1):3–54. doi: 10.1111/j.1467-9868.2012.01030.x
- 18. Lehmacher W, Wassmer G. Adaptive Sample Size Calculations in Group Sequential Trials. Biometrics. 1999;55(4):1286–1290. doi: 10.1111/j.0006-341X.1999.01286.x
- 19. Bretz F, Koenig F, Brannath W, Glimm E, Posch M. Adaptive Designs for Confirmatory Clinical Trials. Statistics in Medicine. 2009;28:1181–1217. doi: 10.1002/sim.3538
- 20. Brannath W, Gutjahr G, Bauer P. Probabilistic Foundation of Confirmatory Adaptive Designs. Journal of the American Statistical Association. 2012;107:824–832. doi: 10.1080/01621459.2012.682540
- 21. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman and Hall; 2000.
- 22. Cox DR, Hinkley DV. Theoretical Statistics. CRC Press; 1979.
- 23. Liu Q, Pledger GW. Phase 2 and 3 Combination Designs to Accelerate Drug Development. Journal of the American Statistical Association. 2005;100(470):493–502. doi: 10.1198/016214504000001790
- 24. Schmidli H, Bretz F, Racine-Poon A. Bayesian predictive power for interim adaptation in seamless phase II/III trials where the endpoint is survival up to some specified timepoint. Statistics in Medicine. 2007;26(27):4925–4938. doi: 10.1002/sim.2957
- 25. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Chichester: Wiley; 1997.
- 26. Posch M, Bauer P. Adaptive Two Stage Designs and the Conditional Error Function. Biometrical Journal. 1999;41:689–696.
- 27. Mehta C, Schäfer H, Daniel H, Irle S. Biomarker driven population enrichment for adaptive oncology trials with time to event endpoints. Statistics in Medicine. 2014;33(26):4515–4531. doi: 10.1002/sim.6272
- 28. Proschan MA, Lan KKG, Wittes JT. Statistical Monitoring of Clinical Trials. New York: Springer; 2006.
- 29. Wang L, Pötzelberger K. Boundary crossing probability for Brownian motion and general boundaries. Journal of Applied Probability. 1997;34:54–65. doi: 10.2307/3215174
- 30. Proschan MA, Follmann DA, Waclawiw MA. Effects of Assumption Violations on Type I Error Rate in Group Sequential Monitoring. Biometrics. 1992;48:1131–1143. doi: 10.2307/2532704
- 31. Tsiatis AA, Mehta C. On the Inefficiency of Adaptive Design for Monitoring Clinical Trials. Biometrika. 2003;90:367–378. doi: 10.1093/biomet/90.2.367
- 32. Jennison C, Turnbull BW. Adaptive and Nonadaptive Group Sequential Tests. Biometrika. 2006;93:1–21. doi: 10.1093/biomet/93.1.1
- 33. Bauer P, Koenig F. The reassessment of trial perspectives from interim data–a critical view. Statistics in Medicine. 2006;25(1):23–36. doi: 10.1002/sim.2180
- 34. Jennison C, Turnbull BW. Mid-course sample size modification in clinical trials based on the observed treatment effect. Statistics in Medicine. 2003;22(6):971–993. doi: 10.1002/sim.1457
- 35. Mehta CR, Pocock SJ. Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in Medicine. 2011;30(28):3267–3284. doi: 10.1002/sim.4102
- 36. Lagakos S, Schoenfeld D. Properties of proportional-hazards score tests under misspecified regression models. Biometrics. 1984:1037–1048. doi: 10.2307/2531154
- 37. Elsäßer A, Regnstrom J, Vetter T, Koenig F, Hemmings RJ, Greco M, et al. Adaptive clinical trial designs for European marketing authorization: a survey of scientific advice letters from the European Medicines Agency. Trials. 2014;15(1):383. doi: 10.1186/1745-6215-15-383
- 38. Ravandi F, Ritchie EK, Sayar H, Lancet JE, Craig MD, Vey N, et al. Vosaroxin plus cytarabine versus placebo plus cytarabine in patients with first relapsed or refractory acute myeloid leukaemia (VALOR): a randomised, controlled, double-blind, multinational, phase 3 study. The Lancet Oncology. 2015;16(9):1025–1036. doi: 10.1016/S1470-2045(15)00201-6
- 39. Food and Drug Administration. Draft Guidance for Industry: Adaptive Design Clinical Trials for Drugs and Biologics. 2010. Available from: http://www.fda.gov/downloads/Drugs/Guidances/ucm201790.pdf
- 40. European Medicines Agency Committee for Medicinal Products for Human Use. Reflection Paper on Methodological Issues in Confirmatory Clinical Trials Planned with an Adaptive Design. 2007. Doc. Ref. CHMP/EWP/2459/02. Available from: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003616.pdf
- 41. Cuffe RL, Lawrence D, Stone A, Vandemeulebroecke M. When is a seamless study desirable? Case studies from different pharmaceutical sponsors. Pharmaceutical Statistics. 2014;13(4):229–237. doi: 10.1002/pst.1622
- 42. Brannath W, Zuber E, Branson M, Bretz F, Gallo P, Posch M, et al. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Statistics in Medicine. 2009;28(10):1445–1463. doi: 10.1002/sim.3559
- 43. Carreras M, Gutjahr G, Brannath W. Adaptive seamless designs with interim treatment selection: a case study in oncology. Statistics in Medicine. 2015;34(8):1317–1333. doi: 10.1002/sim.6407