Abstract
Recruitment of patients to a clinical trial usually occurs over a period of time, resulting in the steady accumulation of data throughout the trial's duration. Yet, according to traditional statistical methods, the sample size of the trial should be determined in advance, and data collected on all subjects before analysis proceeds. For ethical and economic reasons, the technique of sequential testing has been developed to enable the examination of data at a series of interim analyses. The aim is to stop recruitment to the study as soon as there is sufficient evidence to reach a firm conclusion. In this paper we present the advantages and disadvantages of conducting interim analyses in phase III clinical trials, together with the key steps to enable the successful implementation of sequential methods in this setting. Examples are given of completed trials that have been carried out sequentially, and references to relevant literature and software are provided.
Keywords: clinical trials, error rates, monitoring, sequential trials
Introduction
In this, the first in a series of three papers dealing with the opportunities and dangers presented by interim analyses in clinical trials, we focus on phase III clinical studies. A phase III clinical trial is a large-scale study, typically comparing a promising experimental treatment with a control (placebo or active). Its purpose is to seek firm evidence to support a claim that the experimental treatment has clinical benefits. In this paper we show how sequential methodology can play an important role in such trials.
The traditional approach to conducting phase III clinical trials has been to calculate a single fixed sample size in advance of the study, determined by the specified significance level, power and treatment advantage to be detected. Data on all patients are then collected before any formal analyses are performed. While such a framework is logical when observations are available simultaneously, as in an agricultural field trial, it may be less suitable for medical studies, in which patients are recruited over months if not years, and data become available sequentially. Here, results from patients who enter the trial early on are available for analysis while later patients are still being enrolled. It is natural to be interested in such results, but the uncontrolled examination of data can lead to misleading and sometimes wholly inappropriate conclusions, an issue considered further in this article.
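As an illustration of such a fixed sample size calculation (a standard textbook formula, not one taken from the trials discussed in this paper), consider a comparison of two normal means with known common standard deviation σ, two-sided significance level α, power 1 − β and clinically relevant difference δ. The required number of patients per group is approximately

n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}},

where z_p denotes the pth quantile of the standard normal distribution. For example, with α = 0.05, 90% power and δ = 0.5σ, this gives roughly 2 × (1.96 + 1.28)² × 4 ≈ 84 patients per group.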
Some routine monitoring of trial progress, usually blinded to treatment allocation, is often undertaken as part of a phase III trial. This can range from simple checking of protocol compliance and the accurate completion of record forms, to monitoring adverse events in trials of serious conditions so that prompt action can be taken. Such monitoring may be undertaken in conjunction with a data and safety monitoring board (DSMB), established to review the information collected. It would therefore appear that assessment of interim treatment differences is a logical and worthwhile extension. However, the handling of treatment comparisons while a trial is still in progress poses problems in medical ethics, statistical analysis and practical organization [1]. In methodological terms, the approach presented in this paper is known as the frequentist approach and is the most widely used framework in clinical trials. An alternative school of thought, not discussed here, but mentioned for completeness, is the Bayesian approach as described by Spiegelhalter et al. [2].
Opportunities and dangers
The most appealing reason for monitoring trial data for treatment differences is that, ethically, it is desirable to terminate or change a trial when evidence has emerged that one treatment is clearly superior to the other. This is particularly important when life-threatening diseases are involved. Alternatively, the data may support the conclusion that the experimental treatment and the control do not differ by some predetermined clinically relevant magnitude, in which case it would be desirable, both ethically and economically, to stop the study and divert resources elsewhere. Finally, if information in a trial is accruing more slowly than expected, perhaps because of a low event rate, then extending recruitment until a large enough sample has been obtained may be appropriate.
Unfortunately, multiple analyses of accumulating data lead to problems in the interpretation of results. The main problem occurs when significance testing is undertaken at the various interim looks. Even if the treatments are really equally effective, the more often one analyses the accumulating data, the greater the chance of eventually, and wrongly, detecting a difference, thereby drawing incorrect conclusions from the trial. Armitage et al. [3] were the first to compute numerically the extent to which the type I error probability (the probability of incorrectly declaring the experimental treatment to be different from control) is increased over its nominal level if a standard hypothesis test is conducted at each of a series of interim looks. They studied the problem of testing a normal mean with known variance and set the significance level, or type I error probability, for the trial at 5%. If one interim analysis and one final analysis are performed, this error rises to 8%. If four interim analyses and a final analysis are undertaken, the figure is 14%. Similar figures can be anticipated for other response types. In order to make use of the advantages of monitoring the treatment difference, methodology is required to maintain the overall type I error rate at an acceptable level.
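The inflation of the type I error rate is easy to reproduce by simulation. The following sketch is an illustration constructed for this discussion rather than the calculation of Armitage et al. [3]; the number of looks, the group size and the number of simulated trials are arbitrary assumptions. It applies an unadjusted two-sided 5% z-test to accumulating normal data generated under the null hypothesis of no treatment effect:

```python
# A minimal Monte Carlo sketch of type I error inflation under repeated testing.
# Data are simulated under the null hypothesis (mean 0, known variance 1) and an
# unadjusted two-sided 5% z-test is applied at each of 5 equally spaced looks.
import numpy as np

rng = np.random.default_rng(1)
n_trials = 100_000        # number of simulated trials under the null
n_per_stage = 20          # observations added between successive looks (assumed)
n_looks = 5               # four interim analyses plus a final analysis
z_crit = 1.96             # fixed-sample two-sided 5% critical value

rejections = 0
for _ in range(n_trials):
    data = rng.standard_normal(n_per_stage * n_looks)
    for look in range(1, n_looks + 1):
        n = look * n_per_stage
        z = data[:n].mean() * np.sqrt(n)      # z-statistic at this look
        if abs(z) > z_crit:
            rejections += 1
            break                             # stop at the first 'significant' look

print(f"Overall type I error with {n_looks} looks: {rejections / n_trials:.3f}")
```

With five looks the simulated overall error rate comes out close to the 14% quoted above, against the nominal 5% that would apply to a single analysis.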
A second problem concerns the final analysis. When data are inspected at interim looks, the analysis appropriate for fixed sample size studies is no longer valid. Quantities such as P values, point estimates and confidence intervals are still well defined, but new methods of calculation are required. If a traditional analysis is performed at the end of a trial that stops because the experimental treatment is found better than control, the P value will be too small (too significant), the point estimate too large and the confidence interval too narrow.
To deal with the above problems, special techniques are required. These can be broadly termed sequential methods. In the following section a brief overview of this methodology and related issues is given.
Sequential methodology
In his 1999 paper [4], Whitehead lists the key ingredients required to conduct a trial sequentially (see Figure 1). The first two ingredients are common to both fixed sample size and sequential studies, but are worth emphasizing for completeness. The remaining two are solutions to the particular problems of error rates and analysis in the sequential setting. Any combination of choices for the four ingredients is permissible, but, largely for historical reasons, particular combinations preferred by authors in the field have been extensively developed, incorporated into software (see below) and used in practice. Each of the four ingredients will now be considered briefly in turn.
Figure 1.
Key ingredients for conducting a sequential trial.
Parameterization of the treatment difference
As with a fixed sample size study, the first stage in designing a phase III sequential clinical trial is to establish a primary measure of efficacy. The authority of any clinical trial will be greatly enhanced if a single primary response is specified in the protocol and is subsequently found to show significant benefit of the experimental treatment. The choice should depend upon such criteria as clinical relevance, ease of obtaining accurate measurements and familiarity to clinicians. An appropriate choice of the associated parameter measuring the treatment difference can then be made. This should depend upon such criteria as interpretability, for example whether a measurement based on a difference or a ratio is more familiar, and the precision of the resulting analysis. A wide variety of continuous and discrete data types can be dealt with. If, for example, the appropriate response is identified as survival time following treatment for cancer, then a suitable parameter of interest might be the log-hazard ratio. If the primary response is a continuous measure such as the reduction in blood pressure after 1 month of antihypertensive medication, then the difference in true (unknown) means is of interest. Finally, if we are considering a dichotomous variable, such as the occurrence (or not) of deep vein thrombosis following hip replacement, the log-odds ratio may be the parameter of interest.
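In symbols, writing E for the experimental arm and C for the control arm, these three parameterizations of the treatment difference θ are (standard definitions, given here for reference):

\theta = \log(\lambda_{E}/\lambda_{C}) \quad \text{(log-hazard ratio for a survival time response)},

\theta = \mu_{E} - \mu_{C} \quad \text{(difference in true means for a continuous response)},

\theta = \log\!\left\{\frac{p_{E}(1-p_{C})}{p_{C}(1-p_{E})}\right\} \quad \text{(log-odds ratio for a binary response)},

where λ denotes the hazard rate, μ the mean response and p the event probability on each arm. In every case θ = 0 corresponds to no difference between the treatments.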
Test statistics for use in interim analyses
A sequential test monitors a statistic summarizing the current difference between the experimental treatment and control at a series of times during the trial. If the absolute value of this statistic exceeds some specified critical value, the trial is stopped and the null hypothesis of no difference between treatments is rejected. The timing of the interim looks can be measured directly in terms of number of patients, or more flexibly in terms of information. It should be noted that the test statistic measuring treatment difference may increase or decrease between looks, while the statistic measuring information will always increase. Early work in this area prescribed designs whereby traditional test statistics, such as the t-statistic or the chi-squared statistic, were monitored after each patient's response was obtained. Examples can be found in the book by Armitage [5]. Later work by Pocock [6] and O'Brien & Fleming [7] allowed inspections after the responses from each group of k patients were obtained, where k was predefined. Since then, statisticians have developed more flexible ways of conducting sequential trials when considering the number and the timing of interim inspections. Whitehead [8] monitors a statistic measuring treatment difference, known in technical terms as the efficient score, and times the interim looks in terms of a second statistic, approximately proportional to the study sample size, known as the observed Fisher information. Jennison & Turnbull [9] use a direct estimate of the treatment difference itself as the test statistic of interest and record inspections in terms of a function of its standard error.
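As a concrete illustration of these two styles of monitoring, the sketch below computes, for a binary endpoint at a single interim look, the estimated log-odds ratio with its standard error (Woolf's formula), the corresponding information I = 1/SE², and an approximate score statistic obtained as θ̂ × I. The counts are invented and the code is an illustration assembled for this discussion; for small treatment differences the quantities shown approximate Whitehead's efficient score Z and Fisher's information V, but the packages discussed later compute the exact versions.

```python
# Illustrative interim summary for a binary endpoint (invented counts).
import math

s_e, n_e = 30, 100   # events and patients on the experimental arm (assumed)
s_c, n_c = 20, 100   # events and patients on the control arm (assumed)

# Cells of the 2 x 2 table
a, b = s_e, n_e - s_e
c, d = s_c, n_c - s_c

# Estimated log-odds ratio and its standard error (Woolf's formula)
theta_hat = math.log((a * d) / (b * c))
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Monitoring in the style of Jennison & Turnbull [9]: the estimate itself,
# with information defined as the reciprocal of its squared standard error.
info = 1 / se ** 2

# Monitoring in the style of Whitehead [8]: for small treatment differences the
# efficient score Z is approximately theta_hat multiplied by the information.
z_approx = theta_hat * info

print(f"log-odds ratio  = {theta_hat:.3f}  (SE {se:.3f})")
print(f"information I   = {info:.1f}")
print(f"approx. score Z = {z_approx:.2f}")
```

At each interim look the pair (Z, V), or equivalently (θ̂, I), would be compared with the chosen stopping boundary; as noted above, only the information-type quantity is guaranteed to increase from one look to the next.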
Stopping rules for sequential trials
As highlighted above, a sequential test compares the test statistic measuring treatment difference with appropriate critical values. These critical values form a stopping rule or boundary for the trial. At any stage in the trial, if the boundary is crossed, the study is stopped and an appropriate conclusion drawn. If the statistic stays within the test boundary then there is not enough evidence to come to a conclusion at present and a further interim look should be taken. It is possible to look after every patient or to have just one or two interim analyses. When interim analyses are performed after groups of patients, the study may be referred to as a ‘group sequential trial’. The advantage of looking after every patient is that a trial can be stopped as soon as an additional patient response results in the boundary being crossed. In contrast, performing just one or two looks reduces the opportunities for stopping and hence may delay it. However, the logistics of performing interim analyses after groups of subjects are far easier to manage. In practice, planning for between four and eight interim analyses appears sensible.
Once it was established that using traditional tests with the usual fixed sample size critical values inflates the type I error, designs that adjust for this had to be developed. It is the details of the derivation of the stopping rule that introduce much of the variety of sequential methodology. Key early work in the area includes the tests of Pocock [6] and O'Brien & Fleming [7]. A more flexible approach, referred to as the alpha-spending method, was proposed by Lan & DeMets [10] and extended by Kim & DeMets [11]. A collection of designs based on straight line boundaries, which builds on work that has steadily accumulated since the 1940s, is discussed by Whitehead [8], the best known and most widely implemented of these being the triangular test.
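To give a feel for the alpha-spending approach of Lan & DeMets [10], the sketch below tabulates the cumulative type I error allowed to be ‘spent’ by each analysis under two commonly used spending functions (an O'Brien & Fleming-type and a Pocock-type function). The choice of five equally spaced looks and an overall two-sided 5% error rate is an assumption made purely for illustration; converting the spent error into actual critical values requires the recursive numerical integration implemented in the software packages discussed later in this paper.

```python
# Cumulative type I error 'spent' by each look under two Lan-DeMets spending
# functions, for an overall two-sided alpha of 0.05.
from math import exp, log, sqrt
from statistics import NormalDist

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)     # fixed-sample critical value (1.96)

def spend_obf(t):
    # O'Brien & Fleming-type spending: very little alpha is used at early looks
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))

def spend_pocock(t):
    # Pocock-type spending: alpha is used more evenly across the looks
    return alpha * log(1 + (exp(1) - 1) * t)

looks = [0.2, 0.4, 0.6, 0.8, 1.0]           # information fractions of five looks (assumed)
print(" t    OBF-type   Pocock-type")
for t in looks:
    print(f"{t:.1f}   {spend_obf(t):.4f}     {spend_pocock(t):.4f}")
```

Both functions spend the full 0.05 by the final analysis (t = 1), but the O'Brien & Fleming-type function keeps the early boundaries far more stringent, so that early stopping occurs only for very large treatment differences.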
The important issues to focus upon are the desirable reasons for stopping or continuing a study. Reasons for stopping may include:
- The experimental treatment is obviously worse than the control
- The experimental treatment is already obviously better
- There is little chance of showing that the experimental treatment is better.
Reasons for continuing may include:
- A moderate advantage of the experimental treatment is likely and it is desired to estimate the magnitude carefully
- The event rate is low and more patients are needed to achieve power.
These will determine the type of stopping rule that is appropriate for the study under consideration. Stopping rules are now available for testing superiority, non-inferiority, equivalence and even safety aspects of clinical trials. As an example, consider a clinical trial conducted by the Medical Research Council Renal Cancer Collaborators between 1992 and 1997 [12]. Patients with metastatic renal carcinoma were randomly assigned to treatment with either the biological therapy, interferon-α, or the hormone therapy, oral medroxyprogesterone acetate (MPA). The use of interferon-α was experimental and this treatment is known to be both toxic and costly. Consequently, its benefits over MPA needed to be substantial to justify its wider use. A stopping rule was required to satisfy the following requirements:
- Early stopping if data showed a clear advantage of interferon-α over oral MPA
- Early stopping if data showed no worthwhile advantage of interferon-α (either interferon-α obviously worse or little difference between treatments).
This suggested the use of an asymmetric stopping rule. The design chosen was the triangular test [8], similar in appearance to the stopping rule in Figure 2. Interim analyses were planned every 6 months from the start of the trial.
Figure 2.
Statistics for Viagra.
The precise form of the stopping rule is defined, as is the sample size in a fixed sample size trial, by consideration of the significance level, power and desired treatment advantage, with reference to the primary endpoint. The primary endpoint in the MRC study was survival time and the treatment difference was measured by the log-hazard ratio. It was decided that if a difference in 2-year survival from 20% on MPA to 32% on interferon-α (log-hazard ratio −0.342) was present, then a significant treatment difference at the two-sided 5% significance level should be detected with 90% power.
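As a rough check of the quoted design value, note that under the proportional hazards assumption the survival probabilities on the two arms are related by S_E(t) = S_C(t)^{HR}, so the targeted hazard ratio and log-hazard ratio are approximately

HR = \frac{\log(0.32)}{\log(0.20)} \approx 0.71, \qquad \theta = \log(HR) \approx -0.35,

which is consistent with the value of −0.342 used in the design, the small difference presumably reflecting rounding in the original calculations.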
Analysis following a sequential trial
Once a sequential trial has stopped, an analysis will be performed. The interim analyses determine only whether stopping should take place; they do not provide a complete interpretation of the data. An appropriate final analysis must take account of the fact that a sequential design was used. Unfortunately, many trials that have been terminated at an interim analysis are finally reported with analyses that take no statistical account of the inspections made [13]. In a sequential trial, although the meaning and interpretation of data summaries such as significance levels, point estimates and confidence intervals remain as for fixed sample size trials, various methods of calculation have been proposed. These lead to slightly different results when applied to the same set of data. The user of a computer package such as those referenced below may accept the convention of the package and use the resulting analysis without being concerned about the details of calculation. Readers who wish to develop a deeper understanding of statistical analysis following a sequential trial are referred to Chapter 5 of Whitehead [8] and Chapter 8 of Jennison & Turnbull [14].
Sequential clinical trials in practice
Increasingly, sequential procedures are being implemented in modern clinical trials. Peace [15] presents case studies of several applications, some of which have formed part of New Drug Applications (NDAs) approved by the Food and Drug Administration (FDA). Additional examples can be found in the proceedings of two workshops, one on practical issues in data monitoring sponsored by the US National Institutes of Health held in 1992 (published in issues 5 and 6 of volume 12, 1993, of Statistics in Medicine) and the other on early stopping rules in cancer clinical trials held at Cambridge University in 1993 (published in issues 13 and 14 of volume 13, 1994, of Statistics in Medicine). The medical literature also demonstrates the widening use of sequential methods. Examples of such studies include trials of corticosteroids for Pneumocystis pneumonia in AIDS [16], of enoxaparin for the prevention of deep vein thrombosis resulting from hip replacement surgery [17] and of implanted defibrillators in coronary heart disease [18]. Two books dealing exclusively with the implementation of sequential methods in clinical trials are those by Whitehead [8] and Jennison & Turnbull [14]. In addition, there are three commercial software packages currently available. The package PEST [19] is based on straight line boundaries. The package EaSt [20] implements the boundary families of Wang & Tsiatis [21] and Pampallona & Tsiatis [22]. A recent addition to the package S-Plus is the S+SeqTrial module [23]. PEST and EaSt have both been developed over a number of years and are the leading packages in this field. Both packages allow construction of stopping rules for a variety of practical circumstances, and provide a valid final analysis. PEST also includes computation of appropriate test statistics at each interim analysis, together with some additional final analysis options. A good review of the capabilities of earlier versions is given by Emerson [24]. The S-Plus module is a more recent release and consequently has not yet been as extensively used. An example of the design and implementation of an actual sequential trial is given in Figure 2.
When planning any clinical trial sequentially, the implications of introducing a stopping rule need to be thought out carefully in advance of the study. In addition, all involved in the trial should be consulted with regard to the choice of a clinically relevant difference, specification of an appropriate power requirement, and the selection of a suitable stopping rule. As part of the protocol for the study the operation of any sequential procedure should be described clearly in the statistical section.
If a DSMB is appointed, one of its roles should be to scrutinize any proposed sequential stopping rule prior to the start of the study and to review the protocol in collaboration with the trial Steering Committee. The procedure for undertaking the interim analyses should also be finalized in advance of the trial start-up. The DSMB would then review the results of the interim analyses as they are reported. Membership of the DSMB and its relationship with the other parties in a clinical trial have been considered in the 1993 Statistics in Medicine volume referenced above and by Whitehead [25]. It is important that the interim results of an ongoing trial are not circulated widely, as this may have an undesirable effect on the future progress of the trial. Investigators' attitudes will clearly be affected by whether a treatment looks good or bad as the trial progresses. It is usual for the DSMB to be supplied with full information and, ideally, the only other individual to have knowledge of the treatment comparison would be the statistician who performs the actual analyses.
Decision making as part of a sequential trial (whether by a DSMB or another party involved in the trial) is both important and time sensitive. A decision taken to stop a study not only affects the current trial, but often affects future trials planned in the same therapeutic area. However, continuing a trial too long puts participants at unnecessary risk and delays the dissemination of important information. It is essential to make important scientific and ethical decisions with confidence. Wondering whether the data supporting interim analyses are accurate and up to date is unsettling and makes the decision process harder. It is therefore necessary for the statistician performing the interim analyses to have both timely and accurate data. Unfortunately, a trade-off exists: it takes time to ensure accuracy. Potential problems can be alleviated if data for interim analyses are reported separately from the other trial data, as part of a ‘fast-track’ system. A smaller volume of data can be validated more quickly. If timeliness and accuracy are not in balance, not only may real-time decisions be made on old data, but, more seriously, differential reporting may lead to inappropriate study conclusions.
Discussion
Sequential methodology in phase III clinical trials is not new, but it is the more recent theoretical developments, together with the availability of software, that have precipitated its wider use. The methodology is flexible in that it enables a stopping rule to be chosen from a number of alternatives, allowing the trial design to meet the study objectives. One important point is that a stopping rule should not govern the trial completely. If external circumstances change the appropriateness of the trial, or if assumptions made when choosing the design are suspected to be false, the rule can and should be overridden, although the reasons for doing so must be carefully documented.
Methodology for conducting a phase III clinical trial sequentially has been extensively developed, evaluated and documented. Error rates can be accurately preserved and valid inferences drawn. It is important that this fact is recognized and that individuals contemplating the use of interim analyses conduct them correctly. Neither the FDA nor the Medicines Control Agency (MCA) looks favourably on evidence from trials incorporating unplanned looks at the data. In the US, the Federal Register (1985) published regulations for NDAs which included the requirement that the analysis of a phase III trial ‘assess…the effects of any interim analyses performed’. The FDA guidelines were updated by publication of ‘E9 Statistical Principles for Clinical Trials’ in a later Federal Register (1998). Section 3 of this document discusses group sequential designs, and Section 4 covers trial conduct, including trial monitoring, interim analysis, early stopping, sample size adjustment and the role of an independent DSMB. With such acknowledgement from regulatory authorities, the future for sequential methodology within clinical trials is encouraging.
Acknowledgments
The authors are grateful to the two referees for their comments and suggestions.
References
- 1. Pocock SJ. Clinical trials: a practical approach. New York: Wiley; 1983.
- 2. Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials. J Roy Statist Soc Series A. 1994;157:394–399.
- 3. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J Roy Statist Soc Series A. 1969;132:235–244.
- 4. Whitehead J. A unified theory for sequential clinical trials. Statistics Med. 1999;18:2271–2286. doi: 10.1002/(sici)1097-0258(19990915/30)18:17/18<2271::aid-sim254>3.0.co;2-z.
- 5. Armitage P. Sequential medical trials. 2nd edn. Oxford, UK: Blackwell; 1975.
- 6. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977;64:191–199.
- 7. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556.
- 8. Whitehead J. The design and analysis of sequential clinical trials. 2nd edn. Chichester, UK: John Wiley & Sons Ltd; 1997.
- 9. Jennison C, Turnbull BW. Group sequential analysis incorporating covariate information. J Am Statist Assoc. 1997;92:1330–1341.
- 10. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663.
- 11. Kim K, DeMets DL. Design and analysis of group sequential tests based on the type I error spending rate function. Biometrika. 1987;74:149–154.
- 12. Medical Research Council Renal Cancer Collaborators. Interferon-α and survival in metastatic renal carcinoma: early results of a randomised controlled trial. Lancet. 1999;353:14–17.
- 13. Facey KM, Lewis JA. The management of interim analyses in drug development. Statistics Med. 1998;17:1801–1809. doi: 10.1002/(sici)1097-0258(19980815/30)17:15/16<1801::aid-sim981>3.0.co;2-l.
- 14. Jennison C, Turnbull BW. Group sequential methods with applications to clinical trials. Boca Raton, FL: Chapman & Hall/CRC; 2000.
- 15. Peace KE. Biopharmaceutical sequential statistical applications. New York: Marcel Dekker; 1992.
- 16. Montaner JSG, Lawson LM, Levitt N, et al. Corticosteroids prevent early deterioration in patients with moderately severe Pneumocystis carinii pneumonia and the acquired immunodeficiency syndrome (AIDS). Ann Intern Med. 1990;113:14–20. doi: 10.7326/0003-4819-113-1-14.
- 17. Whitehead J. Sequential designs for pharmaceutical clinical trials. Pharmaceut Med. 1992;6:179–191.
- 18. Moss AJ, Hall WJ, Cannom DS, et al. Improved survival with implanted defibrillator in patients with coronary disease at high risk of ventricular arrhythmia. N Engl J Med. 1996;335:1933–1940. doi: 10.1056/NEJM199612263352601.
- 19. MPS Research Unit. PEST 4: operating manual. Reading, UK: The University of Reading; 2000.
- 20. Cytel Software Corporation. EaSt: a software package for the design and interim monitoring of group-sequential clinical trials. Cambridge, MA: Cytel Software Corporation; 2000.
- 21. Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics. 1987;43:193–199.
- 22. Pampallona S, Tsiatis AA, Kim K. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favour of the null hypothesis. J Statistical Planning Inference. 1994;42:19–35.
- 23. MathSoft Inc. S-Plus. Seattle, WA: MathSoft Inc; 2000.
- 24. Emerson SS. Statistical packages for group sequential methods. Amer Statist. 1996;50:183–192.
- 25. Whitehead J. On being the statistician on a data and safety monitoring board. Statistics Med. 1999;18:3425–3434. doi: 10.1002/(sici)1097-0258(19991230)18:24<3425::aid-sim369>3.0.co;2-d.
- 26. Derry FA, Dinsmore WW, Fraser M, et al. Efficacy and safety of oral sildenafil (Viagra) in men with erectile dysfunction caused by spinal cord injury. Neurology. 1998;51:1629–1633. doi: 10.1212/wnl.51.6.1629.