2015 Jan 20;25(1):170–189. doi: 10.1080/10543406.2013.840646

A Comparison of Methods for Treatment Selection in Seamless Phase II/III Clinical Trials Incorporating Information on Short-Term Endpoints

Cornelia Ursula Kunz a, Tim Friede b, Nicholas Parsons a, Susan Todd c, Nigel Stallard a,*
PMCID: PMC4339952  PMID: 24697322

Abstract

In an adaptive seamless phase II/III clinical trial, interim analysis data are used for treatment selection, enabling resources to be focused on comparison of the more effective treatment(s) with a control. In this paper, we compare two methods recently proposed to enable use of short-term endpoint data for decision-making at the interim analysis. The comparison focuses on the power and the probability of correctly identifying the most promising treatment. We show that the choice of method depends on how well short-term data predict the best treatment, which may be measured by the correlation between treatment effects on short- and long-term endpoints.

Key Words: Adaptive seamless design, Multi-arm multi-stage trial, Surrogate endpoints

1. INTRODUCTION

In recent years, adaptive designs in the various phases of drug development have gained popularity. Such designs use information from accumulating data in an ongoing trial to make decisions about the conduct of the rest of the study (Gallo et al., 2006). One particular form of adaptive design is the combined phase II/III adaptive seamless design. A trial of this type is conducted in two stages. During the first stage, the exploratory stage, patients are recruited to several experimental treatments and a control treatment. One or more interim analyses are then performed, at which treatments that appear ineffective are dropped. The main objective of this first stage is to identify the most promising treatments, so that recruitment of further patients can be restricted to only those treatments and the control. At the end of the second stage, the confirmatory stage, the selected treatment(s) is (are) compared to the control within a formal testing framework, again possibly involving a sequence of interim analyses, based on all data from the selected treatment(s) and the control. Several authors have developed methodology for conducting phase II/III studies that protects the overall type I error rate of the trial (see, e.g., Bauer and Kieser, 1999; Stallard and Todd, 2003; Kelly et al., 2005; Posch et al., 2005; Bretz et al., 2006; Koenig et al., 2008). Reviews of the different approaches are given by Chow et al. (2005), Friede and Stallard (2008), Bretz et al. (2009), and Stallard and Todd (2011).

In the pharmaceutical setting, adaptive designs continue to gain acceptance. Regulatory authorities have recently produced guidance documents on the topic (Food and Drug Administration (FDA), 2010; European Medicines Agency (EMEA) - Committee for Medicinal Products for Human Use (CHMP), 2007), giving further evidence that they anticipate more clinical trials will be designed using this framework. Indeed, there are a number of therapeutic areas where phase II/III seamless adaptive designs have already been implemented. Schmoll et al. (2010) describe a pharmaceutical trial in oncology that was designed using the methodology of Stallard and Todd (2003) and Todd and Stallard (2005). Barnes et al. (2010) discuss the use of a phase II/III design in chronic obstructive pulmonary disease. In other therapeutic areas adaptive designs have been proposed and promoted. Dragalin (2011) discusses the potential for the use of adaptive designs in all phases of development, including discussion of phase II/III trials, in central nervous system studies. Chataway et al. (2011) and Friede et al. (2011) propose a phase II/III seamless adaptive design for use in secondary progressive multiple sclerosis trials.

A recent area of research in the development of further methodology for phase II/III designs concerns the question of how to incorporate early endpoint data into the treatment selection part of such a trial. The desire to do this arises when the primary endpoint of interest for each patient is only available after a number of months or even years and yet there are more immediately measured endpoints available. This research builds on earlier work on incorporation of early endpoints in sequential clinical trials comparing a single experimental treatment with a control (Cook and Farewell, 1996; Marschner and Becker, 2001; Galbraith and Marschner, 2003; Sooriyarachchi et al., 2006; Whitehead et al., 2008). An example can be found in secondary progressive multiple sclerosis, where long-term changes in disability scales are the main goal, but early evidence of treatment effect may be observed as changes to lesions in the brain detected using magnetic resonance imaging.

Two alternative methods for incorporating early endpoint data in phase II/III clinical trials have been proposed by Stallard (2010) and Friede et al. (2011). The methods differ in the way in which the treatment to continue to the second stage is chosen. Treatment selection under the method described by Stallard (2010) makes use of short-term endpoint data combined with any available long-term data. In contrast, Friede et al. (2011) propose a method of treatment selection that uses only short-term endpoint data. Both approaches base the final inference on the long-term endpoint data only, though they differ in the way in which data from the two stages of the trial are combined. The aim of this paper is to compare the methods proposed in these two manuscripts. Since both methods have been shown to control the type I error rate, we will focus on comparison of the power of the two approaches in a range of realistic scenarios. This will inform researchers aiming to design a seamless phase II/III trial in which short-term endpoint data can be used for decision-making at an interim analysis.

The two methods under consideration are reviewed in detail in Section 2, where a common notation is also established. Sections 3 and 4 describe comparisons of the two approaches in the settings of fixed and random treatment effect models, respectively. The paper concludes with a discussion in Section 5.

2. NOTATION AND REVIEW OF METHODS

2.1. Setting and Notation

Consider a clinical trial conducted in two stages. In the first stage, patients are randomized to the control treatment T 0 or to one of k experimental treatments, Ti, i = 1, …, k. Suppose that data on the primary, long-term, endpoint are available for n 1 patients in each treatment group, and that in addition, short-term endpoint data are observed for N 1 patients in each treatment group, with N 1 ≥ n 1. In stage one, we therefore have N 1 − n 1 patients with short-term endpoint data only and n 1 patients with both short- and long-term endpoint data in each treatment group. Following an interim analysis, one experimental treatment, denoted by TI, is chosen to continue to the second stage along with the control treatment, with a further n 2 − N 1 patients recruited to each of these treatment groups, giving a total of n 2 patients per group in all. Two possible ways of making the treatment selection are described below.

Suppose that following the second stage, patients are followed up so that primary long-term endpoint data are available for the total of n 2 patients receiving each of treatments TI and T 0.

Denote by Inline graphic and Inline graphic, respectively, the short-term and long-term endpoint data from patient j in group i. When both endpoints are observed, that is for Inline graphic for i = 0, I and Inline graphic for other Inline graphic, the two endpoints for each patient are assumed to follow a bivariate normal distribution. When only the short-term endpoint is observed, that is for Inline graphic, Inline graphic, Inline graphic, Inline graphic is assumed to follow a normal distribution so that we have

graphic file with name lbps-25-170-e001.jpg (1)

where Inline graphic and Inline graphic denote the true means on the short- and long-term endpoints, respectively, in group i; Inline graphic and Inline graphic denote the true variances for the short- and long-term endpoints, respectively; and Inline graphic denotes the true correlation between the endpoints within each group.
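As a sketch only, the distributional assumptions described for model (1) can be written out as follows. The symbols X and Y for the short- and long-term endpoints follow the usage later in Section 2.3; the subscripted symbols for the means, variances, and correlation are notation assumed here for illustration and may differ from those in the original display.

```latex
% Assumed notation: X = short-term endpoint, Y = long-term endpoint,
% i = treatment group, j = patient within group.
\begin{align*}
\begin{pmatrix} X_{ij} \\ Y_{ij} \end{pmatrix}
  &\sim N_2\!\left(
     \begin{pmatrix} \mu_{X,i} \\ \mu_{Y,i} \end{pmatrix},
     \begin{pmatrix}
       \sigma_X^2 & \rho\,\sigma_X \sigma_Y \\
       \rho\,\sigma_X \sigma_Y & \sigma_Y^2
     \end{pmatrix}
   \right)
  && \text{(both endpoints observed)} \\
X_{ij} &\sim N\!\left(\mu_{X,i},\, \sigma_X^2\right)
  && \text{(short-term endpoint only)}
\end{align*}
```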

The variances Inline graphic and Inline graphic and the correlation Inline graphic will be assumed known and equal for all patients. In the calculation of selection probabilities below, the true variances and correlation will be used. In the simulations, estimates obtained from the data will be used in place of the true values, as suggested by Stallard (2010) and Friede et al. (2011).

Given the mean values, individual patients are assumed to be independent so that Inline graphic, Inline graphic, and Inline graphic for Inline graphic or Inline graphic.

A summary of the parameters in the fixed and random effects models is given in Table 1. The parameters of interest are the treatment effects relative to the control treatment on the long-term endpoint, that is Inline graphic, and we wish to test the null hypotheses denoted Inline graphic against the one-sided alternative hypotheses denoted by Inline graphic for treatment group Inline graphic.

Table 1. Summary of model parameters

Sample sizes
N1 Total number of patients per group with short-term data at interim analysis
n1 Number of patients per group with short-term and long-term data at interim analysis
N1 − n1 Number of patients per group with short-term data only at interim analysis
n2 Number of patients per group with short-term and long-term data at final analysis
Fixed or random effects model parameters
Inline graphic Long-term endpoint treatment mean for group i
Inline graphic Long-term endpoint variance
Inline graphic Short-term endpoint treatment mean for group i
Inline graphic Short-term endpoint variance
Inline graphic Correlation between long-term and short-term endpoints within each treatment group
Random effects model parameters
Inline graphic Mean long-term treatment mean for group i
Inline graphic Variance of long-term treatment mean
Inline graphic Mean short-term treatment mean for group i
Inline graphic Variance of short-term treatment mean
Inline graphic Correlation between long-term and short-term treatment means

Two methods for use of short-term endpoint data for treatment selection in a two-stage trial have been proposed (Friede et al., 2011; Stallard, 2010). These methods are described briefly below.

The aim of this paper is to compare these methods. This comparison will be based on model (1). We will consider two cases. In the first case, the fixed effects model, it is assumed that the true means Inline graphic and Inline graphic can be specified so that these can be taken to be constant. In the second case, the random effects model, the means will be taken to be random and to follow a bivariate normal distribution with

2.1. (2)

where Inline graphic and Inline graphic denote the true means, Inline graphic and Inline graphic denote the true variances, and Inline graphic denotes the true correlation between the means for the two endpoints for any given treatment. We assume that the random treatment means have the same variances and correlations, so we may drop the subscript and denote these by Inline graphic, Inline graphic, and Inline graphic, and are independent for different treatments, that is Inline graphic, Inline graphic, and Inline graphic for Inline graphic. The random effects model will allow us to model a situation in which we envisage that the treatments being evaluated are drawn at random from the distribution given by (2). In this case, the treatment means are considered to be unknown but correlated for the two endpoints with specified correlation and variance.

2.2. Method of Friede et al. (2011)

Friede et al. (2011) propose a method for selection of the treatment that will continue to the next stage based on the short-term endpoint only, selecting the experimental treatment with the largest observed sample mean at the interim analysis.

Let

graphic file with name lbps-25-170-e003.jpg (3)

denote the standardized test statistic for comparison of treatment i, Inline graphic to the control in terms of the short-term endpoint only on the basis of data available at the interim analysis. The experimental treatment group with the highest value of Inline graphic is then chosen to continue to the second stage along with the control while all other treatments are dropped.
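The selection rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a known common standard deviation and equal group sizes, so that each statistic is a difference in sample means divided by its standard error. The function names and the interim means are hypothetical.

```python
import math

def short_term_z(mean_trt, mean_ctrl, sigma_x, n_per_group):
    """Standardized difference of two sample means (assumes known common SD)."""
    return (mean_trt - mean_ctrl) / (sigma_x * math.sqrt(2.0 / n_per_group))

def select_by_short_term(short_means, sigma_x, n_per_group):
    """Return (index of selected treatment, list of z statistics).

    short_means[0] is the control sample mean; short_means[1:] are the
    sample means for the experimental treatments T1..Tk.
    """
    z = [short_term_z(m, short_means[0], sigma_x, n_per_group)
         for m in short_means[1:]]
    # Index of the largest statistic, shifted so treatments are numbered from 1.
    best = max(range(len(z)), key=lambda i: z[i]) + 1
    return best, z

# Hypothetical interim short-term sample means: control, then T1, T2, T3.
selected, z_stats = select_by_short_term([0.00, 0.45, 0.10, 0.20],
                                         sigma_x=1.0, n_per_group=50)
```

Because every statistic shares the same control mean and the same standard error, choosing the largest statistic is equivalent to choosing the experimental arm with the largest short-term sample mean.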

At the end of the trial, long-term endpoint data are available from all n 2 patients randomized to the selected treatment and the control. Thus, as the parameters of interest are the long-term endpoint means Inline graphic, only the long-term endpoint data will be used in the final analysis. In this method, the short-term data are used only for treatment selection, and the long-term data are used only for the final comparison of the selected treatment with the control. In order to control the type I error rate, the final analysis must allow for the treatment selection. Friede et al. propose using a combination test approach to combine all data from those patients with any data observed at the interim analysis with the data from new patients observed at the end of the second stage, with a Dunnett correction applied to the first stage test statistics.

In detail, let

2.2.

denote the standardized test statistic for comparison of group i to the control group based on the long-term endpoint data from the N 1 patients per group who have short-term endpoint data available at the interim analysis. Let Inline graphic denote a p-value based on Inline graphic.

Similarly, let

2.2.

denote the standardized test statistic for comparison of group I and the control group based on the additional long-term endpoint data observed at the end of the trial and let Inline graphic.

Note that the Inline graphic are based on some data not observed at the time of the interim analysis and that Inline graphic is independent of all Inline graphic and of any data available at the interim analysis.

To allow for the treatment selection at the first stage, in order to test a null hypothesis Inline graphic, where Inline graphic is some nonempty subset of Inline graphic and Inline graphic denotes the intersection hypothesis Inline graphic, the stage one p-value is obtained from a Dunnett test (Dunnett, 1955) using the test statistic Inline graphic in, for instance, equation (1) of Friede and Stallard (2008). This gives a stage one p-value for the test of Inline graphic, Inline graphic corrected for the multiple comparisons. If the selected treatment, I, is in Inline graphic, a stage two p-value for testing Inline graphic, Inline graphic is just that for testing the selected treatment, Inline graphic. If Inline graphic, Inline graphic is set to one to give a conservative test (Posch et al., 2005). The stage one and stage two p-values may then be combined, for example, using the weighted inverse normal combination function (Lehmacher and Wassmer, 1999)

2.2. (4)

for predefined weights w 1 and w 2 with Inline graphic, which may be used to test Inline graphic.
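The two ingredients described above, a Dunnett-adjusted stage one p-value and a weighted inverse normal combination, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: variances are taken as known and group sizes as equal, so the stage one statistics are equicorrelated with correlation 0.5, and the Dunnett probability is evaluated by simple numerical integration. The function names, the integration grid, and the clipping of p-values are choices made here.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def dunnett_p_value(z_max, m, steps=2000):
    """One-sided Dunnett p-value for the largest of m z-statistics.

    Assumes known variance and equal group sizes, so that
    P(max Z <= z) = E[Phi(sqrt(2) * z + U) ** m] with U ~ N(0, 1).
    The expectation is approximated by the trapezoidal rule on [-8, 8].
    """
    lo, hi = -8.0, 8.0
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps + 1):
        u = lo + s * h
        weight = 0.5 if s in (0, steps) else 1.0
        total += (weight * STD_NORMAL.pdf(u)
                  * STD_NORMAL.cdf(math.sqrt(2.0) * z_max + u) ** m)
    return 1.0 - total * h

def inverse_normal_combination(p1, p2, w1, w2, eps=1e-15):
    """Weighted inverse normal combination of two stage-wise p-values."""
    if abs(w1 ** 2 + w2 ** 2 - 1.0) > 1e-9:
        raise ValueError("weights must satisfy w1^2 + w2^2 = 1")
    # Clip so that a p-value of exactly 0 or 1 (e.g. a stage two p-value
    # set to one) does not break the inverse normal transform.
    q1 = min(max(1.0 - p1, eps), 1.0 - eps)
    q2 = min(max(1.0 - p2, eps), 1.0 - eps)
    z = w1 * STD_NORMAL.inv_cdf(q1) + w2 * STD_NORMAL.inv_cdf(q2)
    return 1.0 - STD_NORMAL.cdf(z)
```

With a single comparison (m = 1) the Dunnett p-value reduces to the ordinary one-sided normal p-value, which gives a quick sanity check on the integration.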

The construction of the p-values ensures that the stage two p-values are independent of any data available at the interim analysis, and hence of the treatment selection. The p-values obtained thus satisfy the weaker p-clud condition (Brannath et al., 2002), so that no further correction for the treatment selection is necessary and the combination test provides a test of Inline graphic that controls the type I error rate at the nominal level for any treatment selection method.

If the null hypothesis Hi is rejected only when all intersection hypotheses Inline graphic with Inline graphic are also rejected, the type I error rate for the family of hypotheses Hi, Inline graphic, is controlled in the strong sense (Marcus et al., 1976).

2.3. Method of Stallard (2010)

Stallard (2010) proposes basing treatment selection on the maximum likelihood estimate of the long-term treatment effects, Inline graphic, Inline graphic calculated at the interim analysis.

Let Inline graphic denote the standardized score statistic for Inline graphic obtained from all data available at the interim analysis. In the case that Inline graphic, this depends on the short-term data in addition to the long-term data. If Inline graphic, Inline graphic, and Inline graphic are unknown, Inline graphic may be estimated using the double regression method proposed by Engel and Walstra (1991) (see Stallard, 2010), in which results of regression of X on group membership for Inline graphic and of Y on X and group membership for Inline graphic are combined to give Inline graphic. For known Inline graphic, Inline graphic, and Inline graphic, Inline graphic is shown by Hampson and Jennison (2013) to be given by

graphic file with name lbps-25-170-e005.jpg (5)

where

2.3. (6)

and Inline graphic denotes the sample mean of the N 1 short-term endpoint observations from group i observed at the interim analysis.

The quantity Inline graphic given by expression (6) can be viewed as an effective sample size per group, corresponding to the number of long-term observations per group that would give the same amount of information on Inline graphic as that available from the n 1 long-term and N 1 short-term responses allowing for the correlation Inline graphic. If Inline graphic so that long-term and short-term responses for any given patient are independent, and the short-term observations give no information on Inline graphic, Inline graphic. If Inline graphic, so that short- and long-term responses are perfectly correlated, Inline graphic, so that the amount of information on Inline graphic is the same as if long-term data had been observed for all patients.
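The behaviour of the effective sample size can be illustrated with a short function. The formula used below is not copied from expression (6); it is derived here from the variance of the maximum likelihood estimate of a long-term mean when n 1 patients have both endpoints, N 1 have the short-term endpoint, and the variances and correlation are known. It reproduces the two limiting cases described above, but its algebraic form is an assumption and may differ from the original expression.

```python
def effective_sample_size(n1, N1, rho):
    """Long-term-equivalent sample size per group (assumed form, see lead-in).

    n1  : patients per group with both short- and long-term data
    N1  : patients per group with short-term data (N1 >= n1)
    rho : within-patient correlation between the two endpoints
    """
    if not 0 < n1 <= N1:
        raise ValueError("require 0 < n1 <= N1")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    # Var(estimate) = sigma_Y^2 * ((1 - rho^2) / n1 + rho^2 / N1),
    # so the effective sample size is the reciprocal of the bracketed term.
    return n1 * N1 / (N1 - rho ** 2 * (N1 - n1))

# With 20 of 100 patients per group having long-term data:
no_information = effective_sample_size(20, 100, 0.0)    # equals n1
full_information = effective_sample_size(20, 100, 1.0)  # equals N1
moderate = effective_sample_size(20, 100, 0.5)
```

Under this form the gain from the short-term data depends on the correlation only through its square, so a moderate correlation of 0.5 adds relatively little information in this example.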

In the method described by Stallard (2010), treatment selection is based on statistics Inline graphic with the treatment group with the highest value for Inline graphic being selected to continue to the second stage together with the control group. Note that, unlike the Friede et al. method, this method requires that at least some long-term endpoint data are available at the time of the interim analysis.

At the end of the trial, long-term endpoint data are available from all n 2 patients randomized to the selected treatment and the control, so that as with the Friede et al. method, only the long-term endpoint data will be used in the final analysis. However, the final analysis using the Stallard method combines the evidence from the two stages in a different way to that suggested by Friede et al.

Suppose that treatment TI is selected to continue with the control to the second stage. Let Inline graphic denote the standardized score test statistic for Inline graphic based on all data available at the end of the trial, that is,

2.3.

Stallard derives the joint distribution of Inline graphic, showing that this is similar to that for test statistics in a seamless phase II/III trial with the primary endpoint alone used at an interim analysis with Inline graphic patients per group. The joint distribution of Inline graphic and Inline graphic can thus be obtained, allowing a critical value c to be constructed so as to control the type I error rate if H 0 is rejected whenever Inline graphic.

3. COMPARISON OF METHODS: FIXED EFFECTS MODEL

We are interested in comparing the methods proposed by Stallard (2010) and Friede et al. (2011). We first consider the fixed effects model setting and explore the properties of the two methods for fixed treatment effects on the short- and long-term endpoints.

The methods will be compared in terms of the probability of selecting an effective treatment in Section 3.1 and of the resulting power of the final analysis in Section 3.2.

3.1. Selection Probability

Although we wish to focus on the probability of selecting the correct treatment, we can define this in two different ways. For given treatment means for the long-term endpoints, Inline graphic, we could consider either the probability of selecting any effective treatment, that is choosing I to be any i with Inline graphic, or the probability of selecting the most effective treatment, that is choosing I to be the i that maximizes Inline graphic. Throughout this paper, we will focus on the latter. Furthermore, we will, without loss of generality, generally consider scenarios in which T 1 has the best effect, and report the probability of selecting treatment T 1.

The probability of selecting treatment T 1 with the Friede et al. method based on equation (3) is equal to

3.1. (7)

while the probability of selecting treatment T 1 with the Stallard selection method based on equation (5) is equal to

3.1. (8)

These probabilities could be estimated via simulation. Alternatively, for Inline graphic and Inline graphic assumed known, they can be calculated exactly from the joint distributions of Inline graphic and Inline graphic, respectively. These distributions are given in the Appendix. Selection probabilities can thus be found using standard numerical routines for calculation of multivariate normal tail areas, for example using pmvnorm in R (Genz et al., 2012). Computer code to perform these calculations and the simulations described below in R can be obtained from the corresponding author.
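The simulation route mentioned above can be sketched for the short-term-only selection rule. This is a minimal Monte Carlo illustration in Python, not the authors' R code and not the exact multivariate normal calculation: it assumes a known short-term standard deviation, equal group sizes, and draws sample means directly from their sampling distributions. The function name, defaults, and scenario values are hypothetical.

```python
import math
import random

def simulate_selection_probability(short_effects, sigma_x, N1,
                                   n_sims=20000, seed=1):
    """Estimate P(T1 selected) when the arm with the largest short-term
    statistic is chosen.

    short_effects : true short-term mean differences from control, T1..Tk
    sigma_x       : known short-term standard deviation
    N1            : patients per group with short-term data at the interim
    """
    rng = random.Random(seed)
    se = sigma_x / math.sqrt(N1)
    hits = 0
    for _ in range(n_sims):
        # The control mean is common to all statistics and cancels when
        # arms are compared, so only the experimental arm means are drawn.
        arm_means = [rng.gauss(delta, se) for delta in short_effects]
        if max(range(len(arm_means)), key=lambda i: arm_means[i]) == 0:
            hits += 1
    return hits / n_sims

# Hypothetical scenarios with three experimental treatments.
p_null = simulate_selection_probability([0.0, 0.0, 0.0], 1.0, 50)
p_effect = simulate_selection_probability([0.3, 0.0, 0.0], 1.0, 50)
```

When all arms have the same short-term effect, each is equally likely to be selected, so the estimate should be close to one third; this provides a simple check on the simulation.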

For the Friede et al. method, the selection probability depends only on N 1 and the standardized short-term endpoint treatment effects, Inline graphic, while for the Stallard method, the selection probability depends on n 1, N 1, and the correlation between the endpoints, Inline graphic, through Inline graphic, and on the standardized long-term endpoint treatment effects, Inline graphic, but not on Inline graphic.

The upper panels (panels A1 and B1) of Fig. 1 show the probability of selecting treatment T 1 using the Friede et al. and Stallard selection methods when three experimental treatments are included in the first stage and Inline graphic. Panel A1 gives the selection probability with Inline graphic for different stage one sample sizes for a range of Inline graphic values. Panel B1 gives the selection probability with Inline graphic, Inline graphic, and Inline graphic for a range of Inline graphic values (with Inline graphic so that these are the standardized values), again for a range of Inline graphic values.

Figure 1. Probability of selecting treatment 1 (panels A1 and B1) and power (panels A2 and B2) for the Stallard (2010) and Friede et al. (2011) methods for different parameter settings under the fixed effects model.

As indicated above, the probability of selection with the Friede et al. method does not depend on Inline graphic. With Inline graphic, the probability of selection with the Stallard method is equal to that with the Friede et al. method when Inline graphic, when the most information is obtained from the Inline graphic observations per group for whom only short-term endpoint data are available and Inline graphic. For Inline graphic, the probability of selecting treatment T 1 is lower for the Stallard method than that for the Friede et al. method when Inline graphic and Inline graphic exceed 0, so that treatment T 1 is actually the most effective, with the difference between the two methods larger for larger Inline graphic. The selection probability for the Stallard method is smallest, and most different from that for the Friede et al. method, when Inline graphic and Inline graphic. The selection probability in this case is equal to that for a method that selects the best treatment solely on the basis of the long-term endpoint data from n 1 patients per group available at the interim analysis and so, unsurprisingly, decreases with decreasing n 1.

Since the selection probability for the Stallard method depends on Inline graphic and not on Inline graphic, whilst that for the Friede et al. method depends on Inline graphic and not on Inline graphic, panel B1 of Fig. 1 enables comparison of selection probabilities in settings with Inline graphic. Although when Inline graphic the probability of selecting treatment T 1 with the Friede et al. method is always at least as great as that for the Stallard method, it can be seen that this probability may be lower for the Friede et al. method when Inline graphic.

3.2. Power

As was the case with the probability of correct selection, we can define the power in different ways. In order to be consistent with the definition of selection probability, we define the power as the probability of rejecting the false null hypothesis corresponding to the most effective treatment, that is, of rejecting HI when I is the i that maximizes Inline graphic. This definition is closely related to the “individual power” defined as the probability of rejecting a particular false null hypothesis (Westfall et al., 2011). The difference is that in the case of the individual power the null hypothesis we are interested in is specified in advance. Note that other definitions for the power are possible, such as, for example, defining the power as the probability of rejecting any false null hypothesis. For a discussion of different power concepts in the context of multiple testing see Westfall et al. (2011).

Assuming as above, without loss of generality, that the treatment effect on the long-term endpoint, Inline graphic, is largest for Inline graphic, the power for the Friede et al. method is equal to

3.2. (9)

where Inline graphic is the combination function defined by (4) so that Inline graphic for all Inline graphic corresponds to rejection of H 1 in the Friede et al. method using the combination test and closed testing procedure as described above.

For the Stallard method, the power is equal to

3.2. (10)

where c is the critical value obtained to control the type I error rate using the method of Stallard (2010).

For the Stallard method, the power depends on Inline graphic, n 2, and the standardized long-term endpoint treatment effects, Inline graphic, but not on the short-term endpoint effects Inline graphic. For the Friede et al. method, as the selection is based on the short-term endpoint data and the final test on the long-term endpoint data, the power depends on Inline graphic in addition to Inline graphic and Inline graphic.

As (9) and (10) involve data from both stage one and stage two, analytic calculation of the power is less straightforward than that for the selection probabilities. The power values can most easily be estimated through simulation of data from the fixed effects model (1). This also allows the assumption of known Inline graphic and Inline graphic to be relaxed.

Simulated power values for the two methods are shown in the lower panels (panels A2 and B2) of Fig. 1 in the same settings as the selection probabilities shown in the upper panels and discussed above with Inline graphic. Estimated power values plotted are based on 10,000 simulations for each of the scenarios considered. For the larger effect sizes as shown in panel A2 and the upper curves in panel B2, the power is very similar to the selection probability shown in the upper two panels. In this case if treatment T 1 is selected in stage one, the combination of the larger stage two sample size and the large effect size means that it is very likely to be shown to be superior to the control. For smaller effect sizes, there is a larger chance of failing to demonstrate superiority even if the treatment T 1 is correctly selected, so that power values are smaller than the selection probabilities. In this case for extreme values of Inline graphic or for very small treatment effects the Friede et al. method may be less powerful than the Stallard method. It is also interesting to note that while the power for the Stallard method is the same for positive and negative values of Inline graphic of the same magnitude, for the Friede et al. method the power appears to be slightly lower for negative Inline graphic than for positive Inline graphic.

Figure 1 shows power values for Inline graphic. As the power cannot exceed the selection probability, we may note, as above, that the Stallard method will be more powerful than the Friede et al. method if Inline graphic is sufficiently large compared to Inline graphic.

4. COMPARISON OF METHODS: RANDOM EFFECTS MODEL

For the fixed effects model, the distributional forms and calculated values given above show that the probability of selecting treatment T 1 and the power to reject the null hypothesis for this treatment, H 1, are higher for the Friede et al. selection method than for the Stallard method when Inline graphic, but can be lower when Inline graphic. Unsurprisingly, given that the Friede et al. selection method relies solely on short-term endpoint observations, the performance of the method is good when the effects on the short-term endpoint are similar to (or larger than) those on the long-term endpoint, but may be poor when they are smaller or reversed. In order to capture the relationship between the treatment effects Inline graphic and Inline graphic, it is therefore interesting to consider the random effects model introduced above, in which the correlation between the treatment means is explicitly included in the statistical model.

4.1. Selection Probability

As with the fixed effects model, we will consider the probability of selecting treatment T 1. Since the mean effect for this treatment, Inline graphic, is now considered to be a random variable, however, treatment T 1 might not always be the most effective even if Inline graphic for Inline graphic. We will therefore focus on the probability of selecting treatment T 1 given that it is the most effective treatment, that is given that Inline graphic for all Inline graphic. This is given by

4.1. (11)

in the case in which selection is made using the Friede et al. method, and by

4.1. (12)

in the case when selection is made using the Stallard method. These probabilities may be evaluated using the joint distributions of Inline graphic and Inline graphic or of Inline graphic and Inline graphic given in the Appendix.
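A conditional probability of this kind can also be approximated by simulation. The Python sketch below covers only the short-term-only selection rule and is an illustration under stated assumptions, not the calculation from the Appendix: the true short- and long-term means for each experimental treatment are drawn from a bivariate normal distribution, simulated trials in which treatment 1 is not truly best on the long-term endpoint are discarded, and the short-term standard deviation is taken as known. All function and parameter names are hypothetical.

```python
import math
import random

def draw_treatment_means(rng, mean_x, mean_y, tau_x, tau_y, rho_mu):
    """Draw one (short-term, long-term) pair of true means from a
    bivariate normal with SDs tau_x, tau_y and correlation rho_mu."""
    u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    mu_x = mean_x + tau_x * u
    mu_y = mean_y + tau_y * (rho_mu * u + math.sqrt(1.0 - rho_mu ** 2) * v)
    return mu_x, mu_y

def conditional_selection_probability(means_x, means_y, tau_x, tau_y, rho_mu,
                                      sigma_x, N1, n_sims=20000, seed=1):
    """Estimate P(T1 selected | T1 has the largest true long-term mean)
    when selection uses the short-term sample means only."""
    rng = random.Random(seed)
    se = sigma_x / math.sqrt(N1)
    best_count = selected_count = 0
    for _ in range(n_sims):
        true_means = [draw_treatment_means(rng, mx, my, tau_x, tau_y, rho_mu)
                      for mx, my in zip(means_x, means_y)]
        long_term = [my for _, my in true_means]
        if long_term.index(max(long_term)) != 0:
            continue  # condition on T1 being truly the most effective
        best_count += 1
        observed = [rng.gauss(mx, se) for mx, _ in true_means]
        if observed.index(max(observed)) == 0:
            selected_count += 1
    if best_count == 0:
        raise ValueError("T1 was never the best treatment in the simulations")
    return selected_count / best_count

# Hypothetical setting: three exchangeable treatments, N1 = 200 per group.
zeros = [0.0, 0.0, 0.0]
p_uncorrelated = conditional_selection_probability(zeros, zeros, 1.0, 1.0, 0.0, 1.0, 200)
p_perfect = conditional_selection_probability(zeros, zeros, 1.0, 1.0, 1.0, 1.0, 200)
```

In this exchangeable setting, uncorrelated treatment means leave the short-term data uninformative about which treatment is truly best, so the estimate should be near one third, whereas perfectly correlated means give an estimate much closer to one.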

Figure 2 shows the probability of selecting treatment T 1 given that this is actually the most effective treatment when selection uses either the Friede et al. or the Stallard method. Selection probabilities are shown for a range of Inline graphic values in the setting in which Inline graphic. In panels A1 and A2, Inline graphic and Inline graphic for Inline graphic, so that on average the first treatment is effective on both endpoints and all others are not, and Inline graphic. The separate lines give the treatment selection for different sample sizes. In panels B1 and B2, Inline graphic, Inline graphic and Inline graphic, Inline graphic for Inline graphic and Inline graphic with different lines on the plot corresponding to different treatment effects. In panels C1 and C2, Inline graphic and Inline graphic for Inline graphic, Inline graphic and Inline graphic and the different lines correspond to different values of Inline graphic and Inline graphic. The right-hand column in the figure shows selection probabilities for Inline graphic, that is when there is perfect correlation between the means on the two endpoints, and the left-hand column those for Inline graphic.

Figure 2. Probability of selecting treatment 1 based on the methods by Stallard (2010) and by Friede et al. (2011) for different parameter settings under the random effects model (given that treatment 1 is the most effective).

The selection probabilities for Inline graphic shown in the right-hand column are generally similar to those for the fixed effects model with Inline graphic given in Fig. 1. The best treatment is more likely to be selected using the Friede et al. method than the Stallard method, with the two methods coinciding when Inline graphic. The main difference between the selection probabilities under the random effects model and the fixed effects model in this case is that under the random effects model there is very little effect of the average treatment effect, Inline graphic, in contrast to the results for the fixed effects model considered above. This is reasonable given that the figure shows the probability of selecting T 1 given that the actual treatment effect is largest for that treatment, that is given Inline graphic. An increase in Inline graphic for the Stallard method or in Inline graphic for the Friede et al. method does, however, reduce the probability of selecting treatment T 1, as the standardized average difference between the treatments on the long- or short-term endpoint, respectively, is reduced.

As the Friede et al. method uses only the short-term endpoint data for the selection, it is not surprising that it performs well when the means on the two endpoints are perfectly correlated, since the selection is based on a larger number of observations and a treatment performing well on the short-term endpoint is more likely to have a large long-term endpoint mean. The left-hand column shows selection probabilities for Inline graphic. The selection probabilities for the Stallard method do not depend on Inline graphic, so that these are exactly the same as those in the panels in the right-hand column. The Friede et al. method selects the correct treatment with lower probability than when the short-term and long-term treatment means are perfectly correlated; in this case the short-term endpoint means are less predictive of the treatment with the largest long-term responses. In this case, the Stallard method can lead to a higher probability of correctly selecting treatment T 1, particularly when Inline graphic is high. Smaller values of the correlation Inline graphic will result in worse performance of the Friede et al. method.

The latter point is illustrated more clearly in Fig. 3. This shows the probability under the random effects model of correctly selecting treatment T 1 given that this is the most effective for Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic for the Stallard method for a range of Inline graphic values and for the Friede et al. method for a range of Inline graphic values. Since the selection probabilities for the Stallard method do not depend on Inline graphic and for the Friede et al. method do not depend on Inline graphic, the two lines are shown on the same graph. Comparing the two lines, we see that the Stallard method always has a higher selection probability than the Friede et al. method if Inline graphic except when Inline graphic, when both probabilities are the same. The three horizontal lines represent selection probabilities for the Friede et al. method where Inline graphic is fixed to either 1 (short dash), 0.95 (dash dot), or 0.9 (long dash). Comparing these lines with those for the Stallard method, we observe that if Inline graphic the Friede et al. method is always better than the Stallard method regardless of Inline graphic (with the exception of Inline graphic, when the selection probabilities for the two methods are again equal). If Inline graphic, the Friede et al. method is better than the Stallard method only if Inline graphic is small, while if Inline graphic, the Stallard method is always better than the Friede et al. method regardless of Inline graphic.

Figure 3. Probability to select treatment 1 based on the methods by Stallard (2010) for a range of Inline graphic values and by Friede et al. (2011) for a range of Inline graphic values under the random effects model (given that treatment 1 is the most effective).

When Inline graphic, the probability of selecting treatment T 1 using the Friede et al. method can be low. The probability approaches zero as Inline graphic approaches −1 and the treatment effects on the long- and short-term endpoints consistently go in opposite directions.
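The dependence of short-term-only selection on the correlation between treatment means can be illustrated with a small Monte Carlo sketch. This is not the analytic calculation used in the paper: it assumes a simple bivariate normal model for the short- and long-term treatment means with common between-treatment standard deviation `tau` and correlation `rho`, and a normally distributed short-term sample mean per arm. The function name and all parameter names and default values are hypothetical choices made for illustration only.

```python
import numpy as np


def short_term_selection_prob(rho, k=3, n_short=20, tau=1.0, sigma=1.0,
                              n_sim=20000, seed=0):
    """Monte Carlo estimate of the probability that the arm with the largest
    long-term mean is the one selected on short-term sample means alone.

    Assumed (illustrative) model: the short- and long-term means of each arm
    are standard bivariate normal scaled by tau, with correlation rho.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_sim, k))
    z2 = rng.standard_normal((n_sim, k))
    mu_short = tau * z1
    # Long-term means with correlation rho to the short-term means
    mu_long = tau * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
    # Short-term sample mean per arm from n_short patients, variance known
    xbar_short = mu_short + rng.normal(0.0, sigma / np.sqrt(n_short),
                                       size=(n_sim, k))
    selected = xbar_short.argmax(axis=1)
    truly_best = mu_long.argmax(axis=1)
    return float(np.mean(selected == truly_best))


if __name__ == "__main__":
    for rho in (1.0, 0.9, 0.5, 0.0, -0.5, -1.0):
        print(rho, round(short_term_selection_prob(rho), 3))
```

Because the arms are exchangeable in this sketch, the probability that the selected arm is the truly best one equals the probability of selecting treatment 1 given that treatment 1 is best. With `rho = 0` the estimate should be close to 1/k, and it falls below that as `rho` becomes negative.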

4.2. Power

In a similar approach to that used for evaluation of the treatment selection probabilities, we consider the power defined to be the probability that treatment T 1 is selected at the interim analysis and found to be significantly superior to the control at the final analysis conditional on it actually being the best treatment, that is on Inline graphic. As in the fixed effects model, the power will again be estimated via simulation. In this case, data are simulated from the random effects model given by (1) and (2). In detail, for each simulation, treatment means Inline graphic are first simulated from (2) then, given these treatment mean values, data are simulated from (1).
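The two-step simulation described above (treatment means first, then data given those means) can be sketched as follows. This is a simplified illustration, not the simulation code used for the paper. It assumes a bivariate normal form for the treatment means, a control mean of zero, known within-group standard deviation, and selection on short-term sample means only. The final test is a one-sided z-test with a Bonferroni adjustment, used here as a crude stand-in for the type I error controlling final analysis of either method. All names and default values are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def simulate_conditional_power(k=3, rho=0.9, mean_effect=0.5, tau=0.5,
                               sigma=1.0, n_short=20, n_final=200,
                               alpha=0.025, n_sim=20000, seed=1):
    """Return (selection probability, power), both conditional on
    treatment 1 (index 0) having the largest long-term mean."""
    rng = np.random.default_rng(seed)

    # Step 1: draw treatment means (assumed bivariate normal, correlation rho)
    z1 = rng.standard_normal((n_sim, k))
    z2 = rng.standard_normal((n_sim, k))
    mu_short = mean_effect + tau * z1
    mu_long = mean_effect + tau * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)

    # Step 2: given the means, draw the data (summarised as sample means)
    xbar_short = mu_short + rng.normal(0.0, sigma / np.sqrt(n_short),
                                       size=(n_sim, k))
    selected = xbar_short.argmax(axis=1)  # short-term-only selection

    rows = np.arange(n_sim)
    ybar_sel = mu_long[rows, selected] + rng.normal(
        0.0, sigma / np.sqrt(n_final), size=n_sim)
    ybar_ctrl = rng.normal(0.0, sigma / np.sqrt(n_final), size=n_sim)

    # Stand-in final test: one-sided z-test at level alpha / k (Bonferroni)
    z_stat = (ybar_sel - ybar_ctrl) / (sigma * np.sqrt(2.0 / n_final))
    reject = z_stat > NormalDist().inv_cdf(1.0 - alpha / k)

    # Condition on treatment 1 truly being best on the long-term endpoint
    t1_best = mu_long.argmax(axis=1) == 0
    sel_t1 = selected == 0
    sel_prob = float(np.mean(sel_t1[t1_best]))
    power = float(np.mean((sel_t1 & reject)[t1_best]))
    return sel_prob, power


if __name__ == "__main__":
    print(simulate_conditional_power())
```

Conditioning is done by discarding simulation runs in which treatment 1 does not have the largest long-term mean, so the effective number of runs is roughly `n_sim / k`.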

Simulated power values are shown in Fig. 4 under the same scenarios as Fig. 2. As in the fixed effects setting, for reasonably large standardized effect sizes, the power is similar to the selection probability, but is slightly lower when the standardized effect is smaller, either because of a reduction in the effect size or an increase in the within-treatment variance.

Figure 4. Power for the Stallard (2010) and Friede et al. (2011) methods for different parameter settings under the random effects model (given that treatment 1 is the most effective).

5. DISCUSSION

There has been much recent interest in adaptive seamless phase II/III clinical trials in which randomization is initially between a number of experimental treatments and a control, with less effective treatments dropped from the study on the basis of results from an interim analysis. Building on methods using short-term information to supplement long-term information originally developed in the context of interim analyses for early stopping, two methods have been proposed for using short-term endpoint data in the treatment selection (Stallard, 2010; Friede et al., 2011). In this paper, we have compared these two methods. Our aim has been to provide a comparison that will enable choice of the most appropriate method when designing an adaptive seamless phase II/III design.

In the Friede et al. method, only the short-term endpoint data are used for the treatment selection. In contrast, the Stallard method uses a combination of short- and long-term endpoint data. The latter method can thus only be used when some long-term responses are available for inclusion in the interim analysis. In both methods, the final analysis is based on the long-term endpoint data alone from the selected treatment and control. This is in contrast to other group-sequential methods in which it is desired to draw inference on both endpoints, for example requiring both to be sufficiently promising (see, e.g., Jennison and Turnbull, 1993; Kimani et al., 2009) or with early and late observations of the same endpoint treated as longitudinal data (see, e.g., Spiessens et al., 2000; Lee et al., 1996).

Our comparison has considered scenarios in which the treatment means are taken to be fixed, with one treatment more effective than all others and the control, which are equally effective, and scenarios in which the treatment means are taken to be random but are correlated. A summary of the effects of the different model parameters on the selection probability based on the simulations reported above is given in Table 2. Our results indicate that under the fixed effects model, if the treatment effect on the short-term endpoint is as large as or larger than that on the long-term endpoint for the effective treatment, the Friede et al. method is more likely to lead to selection of the most effective treatment, and is correspondingly more powerful. If the effect on the short-term endpoint is less than that on the long-term endpoint, the Stallard method may be more likely to select the correct treatment and more powerful, particularly when the within-group correlation between the endpoints is high.

Table 2. Summary of the impact of model parameters on selection probabilities

Sample sizes
n1 Larger values reduce impact of short-term endpoint data.
N1 Larger numbers increase impact of short-term endpoint data.
n2 Larger values increase power but do not influence treatment selection.
Fixed or random effects model parameters
Inline graphic More disperse values increase differences between treatments making treatment selection easier.
Inline graphic Larger values increase variability making treatment selection harder.
Inline graphic More disperse values increase differences between treatments making treatment selection easier with Friede et al. method. No impact on Stallard method.
Inline graphic Larger values increase variability making treatment selection harder.
Inline graphic Larger values (of Inline graphic) increase influence of short-term endpoints in Stallard method. No impact in Friede et al. method.
Random effects model parameters
Inline graphic More disperse values increase differences between treatments making treatment selection easier.
Inline graphic Larger values make treatment means more disperse making treatment selection easier.
Inline graphic More disperse values increase differences between treatments making treatment selection easier with Friede et al. method. No impact on Stallard method.
Inline graphic Larger values make treatment means more disperse making treatment selection easier with Friede et al. method. No impact on Stallard method.
Inline graphic Larger values make treatment effects on two endpoints more closely related and improve treatment selection with Friede et al. method. No impact on Stallard method.

Under the random effects model, the effect of correlation between the treatment means on the two endpoints can be considered. This parameter gives an indication of the extent to which treatment effects on the long- and short-term endpoints go in the same direction. In this case, our results indicate that the Friede et al. method leads to a higher probability of selecting the best treatment and to higher power only when the correlation between the treatment means is sufficiently high. The threshold depends on the sample sizes and variances, but we have shown that even when the number of patients for whom long-term endpoint data are available at the interim analysis is small, under the scenarios we have considered, the Friede et al. method is less powerful unless the correlation between the means is relatively high: in the scenario we considered, above 0.9 when the within-group variance and between-group variance are both equal to 1.

In order to be able to choose between the different methods, some estimates of the model parameters, including the variances and correlations in (1) and (2) are required. In some cases, data from other trials will be available, particularly to give information on the parameters in (1). The correlation can vary considerably depending on the setting and endpoints chosen. Julious and Mullee (2008), for example, report a Inline graphic of 0.67 between the same endpoint measured at baseline and at the end of the trial, so that the correlation between an early and final measurement of this endpoint would presumably be higher than this, whereas Chataway et al. (2011) report a Inline graphic of 0.13 between two different endpoints, though it was still proposed to use the early endpoint for treatment selection. The parameters in (2) are harder to estimate since their estimation requires data from a number of different trials or treatments.

If detailed information on parameter values is unavailable, it may still be possible to make some guess of possible ranges for parameters, or to use the methods described above to conduct sensitivity analyses. We are currently working on approaches that use the data from the first stage of the trial to estimate the parameters of (1) and (2) and to decide between the different treatment selection strategies on the basis of these estimates.

Our comparison of the procedures has used a combination of analytic calculations based on multivariate normal distributions to calculate selection probabilities and simulations to estimate the power. The simulations can be time-consuming when an extensive search for an appropriate sample size is required, or when it is desirable to explore the tradeoff between patients in stages one and two of the trial. The power is bounded above by the selection probability and in many of the settings considered above, the two probabilities are quite similar. This is likely to be particularly true when the assumed effect size is relatively large and the sample size for the second stage is substantially larger than that for the first stage. For example, in the settings described above with three treatments compared to a control treatment on the basis of long-term data on 5 or 15 patients per group and short-term data on 100, 20, or 50 patients per group at the interim analysis with a final sample size of 200 per group, when the standardized effect size on both endpoints for the sole effective treatment is 0.5, we found that the estimated power was at least 97.5% of the selection probability. In such cases, an approximate sample size calculation could be based on the selection probability using the analytic calculations described. If necessary, this could be followed by a much more restricted set of simulations to confirm the power of the final design chosen.
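The bound of the power by the selection probability follows directly from the definition of power used in Section 4.2. The notation below is introduced here for illustration and is not taken from the paper: S denotes the event that treatment T 1 is selected at the interim analysis, R the event that it is found significantly superior to the control at the final analysis, and B the event that it is truly the best treatment.

```latex
\begin{align*}
\text{power} &= \Pr(S \cap R \mid B)
              = \Pr(S \mid B)\,\Pr(R \mid S, B)
              \le \Pr(S \mid B), \\[4pt]
\frac{\text{power}}{\Pr(S \mid B)} &= \Pr(R \mid S, B) \ge 0.975
  \quad \text{in the example settings quoted above.}
\end{align*}
```

Read this way, the reported ratio of at least 97.5% is the conditional probability of a significant final result once the best treatment has been selected, which is why the selection probability serves as a close approximation to the power in those settings.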

A. APPENDIX: DISTRIBUTIONS REQUIRED FOR CALCULATION OF TREATMENT SELECTION PROBABILITIES

A.1. Fixed Effects Model

Calculation of the probability of selecting treatment T 1 using the Friede et al. and Stallard methods under the fixed effects model requires the joint distribution of Inline graphic and Inline graphic, respectively. Detailed derivations of these are given in the online Supplementary Material, leading to

graphic file with name lbps-25-170-e013.jpg (A.1)

and

graphic file with name lbps-25-170-e014.jpg (A.2)

A.2. Random Effects Model

Treatment selection probabilities using the Friede et al. and Stallard methods under the random effects model may be evaluated using the joint distributions of Inline graphic and Inline graphic or of Inline graphic and Inline graphic, respectively.

The joint distribution of Inline graphic and Inline graphic is given by

graphic file with name lbps-25-170-e015.jpg (A.3)

where

graphic file with name lbps-25-170-u004.jpg

with

graphic file with name lbps-25-170-u005.jpg
graphic file with name lbps-25-170-u006.jpg

and

graphic file with name lbps-25-170-u007.jpg

The joint distribution of Inline graphic and Inline graphic is given by

graphic file with name lbps-25-170-e016.jpg (A.4)

where

graphic file with name lbps-25-170-u008.jpg

with

graphic file with name lbps-25-170-u009.jpg
graphic file with name lbps-25-170-u010.jpg

Inline graphic and Inline graphic given by (6).

Detailed derivations are again given in the online supplemental material.

ACKNOWLEDGMENTS

We are grateful to the Editor and two anonymous reviewers for their helpful comments on this paper.

SUPPLEMENTAL MATERIAL

Supplemental data for this article can be accessed on the publisher’s website.

FUNDING

The work was funded by UK Medical Research Council grant number G1001344.

REFERENCES

  1. Barnes P., Pocock S., Magnussen H., Iqbal A., Kramer B., Higgins M., Lawrence D. Integrating indacaterol dose selection in a clinical study in COPD using an adaptive seamless design. Pulmonary Pharmacology and Therapeutics. 2010;23:165. doi: 10.1016/j.pupt.2010.01.003.
  2. Bauer P., Kieser M. Combining different phases in the development of medical treatments within a single trial. Statistics in Medicine. 1999;18:1833. doi: 10.1002/(sici)1097-0258(19990730)18:14<1833::aid-sim221>3.0.co;2-3.
  3. Brannath W., Posch M., Bauer P. Recursive combination tests. Journal of the American Statistical Association. 2002;97:236.
  4. Bretz F., Koenig F., Brannath W., Glimm E., Posch M. Tutorial in biostatistics: Adaptive designs for confirmatory clinical trials. Statistics in Medicine. 2009;28:1181. doi: 10.1002/sim.3538.
  5. Bretz F., Schmidli H., König F., Racine A., Maurer W. Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: General concepts. Biometrical Journal. 2006;48:623. doi: 10.1002/bimj.200510232.
  6. Chataway J., Nicholas R., Todd S., Miller D., Parsons N., Valdés-Márquez E., Stallard N., Friede T. A novel adaptive design strategy increases the efficiency of clinical trials in secondary progressive multiple sclerosis. Multiple Sclerosis. 2011;17:81. doi: 10.1177/1352458510382129.
  7. Chow S.-C., Chang M., Pong A. Statistical consideration of adaptive methods in clinical development. Journal of Biopharmaceutical Statistics. 2005;15:575. doi: 10.1081/BIP-200062277.
  8. Cook R., Farewell V. Incorporating surrogate endpoints into group sequential trials. Biometrical Journal. 1996;38:119.
  9. Dragalin V. An introduction to adaptive designs and adaptation in CNS trials. European Neuropsychopharmacology. 2011;21(2):153. doi: 10.1016/j.euroneuro.2010.09.004.
  10. Dunnett C. W. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association. 1955;50:1096.
  11. Engel B., Walstra P. Increasing precision or reducing expense in regression experiments by using information from a concomitant variable. Biometrics. 1991;47(1):13.
  12. European Medicines Agency (EMEA) - Committee for Medicinal Products for Human Use (CHMP). CHMP reflection paper on methodological issues in confirmatory clinical trials planned with an adaptive design. 2007. http://www.emea.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003616.pdf (accessed July 13, 2012)
  13. Food and Drug Administration (FDA). Guidance for industry - adaptive design clinical trials for drugs and biologics. 2010. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm201790.pdf (accessed July 13, 2012)
  14. Friede T., Parsons N., Stallard N., Todd S., Valdés-Márquez E., Chataway J., Nicholas R. Designing a seamless phase II/III clinical trial using early outcomes for treatment selection: an application in multiple sclerosis. Statistics in Medicine. 2011;30:1528. doi: 10.1002/sim.4202.
  15. Friede T., Stallard N. A comparison of methods for adaptive treatment selection. Biometrical Journal. 2008;50:767. doi: 10.1002/bimj.200710453.
  16. Galbraith S., Marschner I. Interim analysis of continuous long-term endpoints in clinical trials with longitudinal outcomes. Statistics in Medicine. 2003;22:1787. doi: 10.1002/sim.1311.
  17. Gallo P., Chuang-Stein C., Dragalin V., Gaydos B., Krams M., Pinheiro J. Adaptive designs in clinical drug development - an executive summary of the pharma working group. Journal of Biopharmaceutical Statistics. 2006;16:275. doi: 10.1080/10543400600614742.
  18. Genz A., Bretz F., Miwa T., Mi X., Leisch F., Scheipl F., Bornkamp B., Hothorn T. Package ‘mvtnorm’. R package version 0.9-9992. 2012. http://CRAN.R-project.org
  19. Hampson L. V., Jennison C. Group sequential tests for delayed responses. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013;75:1.
  20. Jennison C., Turnbull B. Group sequential tests for bivariate response: Interim analyses of clinical trials with both efficacy and safety endpoints. Biometrics. 1993;49:741.
  21. Julious S. A., Mullee M. A. Issues with using baseline in last observation carried forward analysis. Pharmaceutical Statistics. 2008;7:142. doi: 10.1002/pst.311.
  22. Kelly P. J., Stallard N., Todd S. An adaptive group sequential design for phase II/III clinical trials that select a single treatment from several. Journal of Biopharmaceutical Statistics. 2005;15:641. doi: 10.1081/BIP-200062857.
  23. Kimani P. K., Stallard N., Hutton J. L. Dose selection in seamless phase II/III clinical trials based on efficacy and safety. Statistics in Medicine. 2009;28:917. doi: 10.1002/sim.3522.
  24. Koenig F., Brannath W., Bretz F., Posch M. Adaptive Dunnett tests for treatment selection. Statistics in Medicine. 2008;27:1612. doi: 10.1002/sim.3048.
  25. Lee S. J., Kim K., Tsiatis A. A. Repeated significance testing in longitudinal clinical trials. Biometrika. 1996;83:779.
  26. Lehmacher W., Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics. 1999;55(4):1286. doi: 10.1111/j.0006-341x.1999.01286.x.
  27. Marcus R., Peritz E., Gabriel K. R. On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976;63(3):655.
  28. Marschner I., Becker S. Interim monitoring of clinical trials based on long-term binary endpoints. Statistics in Medicine. 2001;20:177. doi: 10.1002/1097-0258(20010130)20:2<177::aid-sim653>3.0.co;2-k.
  29. Posch M., König F., Branson M., Brannath W., Dunger-Baldauf C., Bauer P. Testing and estimation in flexible group sequential designs with adaptive treatment selection. Statistics in Medicine. 2005;24:3697. doi: 10.1002/sim.2389.
  30. Schmoll H., Cunnighmam D., A., S., Krapetis C., Rougier P., Koski S., P., B., Mookerjee B., Robertson J., van Cutsem E. mFOLFOX6 + cediranib vs mFOLFOX6 + bevacizumab in previously untreated metastatic colorectal cancer (mcrc): A randomised, double-blind, phase II/III study (HORIZON III). Ann. Oncol. 2010;21(Supplement 8):vii189.
  31. Sooriyarachchi M., Whitehead J., Whitehead A., Bolland K. The sequential analysis of repeated binary responses: A score test for the case of three time points. Statistics in Medicine. 2006;25:2196. doi: 10.1002/sim.2339.
  32. Spiessens B., Lesaffre E., Verbeke G., Kim K., DeMets D. L. An overview of group sequential methods in longitudinal clinical trials. Statistical Methods in Medical Research. 2000;19:497. doi: 10.1177/096228020000900506.
  33. Stallard N. A confirmatory seamless phase II/III clinical trial design incorporating short-term endpoint information. Statistics in Medicine. 2010;29:959. doi: 10.1002/sim.3863.
  34. Stallard N., Todd S. Sequential designs for phase III clinical trials incorporating treatment selection. Statistics in Medicine. 2003;22:689. doi: 10.1002/sim.1362.
  35. Stallard N., Todd S. Seamless phase II/III designs. Statistical Methods in Medical Research. 2011;20:623. doi: 10.1177/0962280210379035.
  36. Todd S., Stallard N. A new clinical trial design combining phases II and III: Sequential designs with treatment selection and a change of endpoint. Drug Information Journal. 2005;39:109.
  37. Westfall P. H., Tobias R. D., Wolfinger R. D. Multiple Comparisons and Multiple Tests Using SAS. Cary, NC: SAS Institute Inc; 2011.
  38. Whitehead A., Sooriyarachchi M., Whitehead J., Bolland K. Incorporating intermediate binary responses into interim analyses of clinical trials: A comparison of four methods. Statistics in Medicine. 2008;27:1646. doi: 10.1002/sim.3046.

Articles from Journal of Biopharmaceutical Statistics are provided here courtesy of Taylor & Francis
