Abstract
In clinical trials with a survival endpoint, it is common to observe an overlap between the Kaplan-Meier curves of the treatment and control groups during the early stage of the trial, indicating a potential delayed treatment effect. Zhang and Quan [1] derived formulas for the asymptotic power of the log-rank test in the presence of a delayed treatment effect and its accompanying sample size calculation. In this paper, we first reformulate the alternative hypothesis with the delayed treatment effect in a rescaled time domain, which yields a simplified sample size formula for the log-rank test in this context. We further propose an intersection-union test to examine the efficacy of a treatment with delayed effect, and show it to be more powerful than the log-rank test. Simulation studies are conducted to demonstrate the proposed methods.
Keywords: lagged treatment effect, log-rank test, power calculation
1. Introduction
Clinical trials with time-to-event endpoints are often designed under the assumption that the hazard ratio between the treatment and control groups remains constant. If the treatment has a delayed onset of effect, the proportional hazards assumption is violated and one often observes an overlap between the two Kaplan-Meier (KM) curves. The analytic issues associated with delayed treatment effects have been studied by many researchers (see [2, 3, 4, 5]).
When designing trials with a potential delayed treatment effect in mind, one often calculates the sample size assuming a constant hazard ratio and then inflates the final number to maintain statistical power, which can be wasteful. Zhang and Quan [1] studied the asymptotic distribution of the two-sample log-rank test statistic under a lagged treatment effect model specified as
(1) λ1(t) = λ0(t) for t ≤ t0, and λ1(t) = γλ0(t) for t > t0,
where λ1(t) and λ0(t) denote the underlying hazard functions in the treatment and control groups, respectively; t0 is the assumed change point until which the new treatment does not show an effect relative to the control; and γ denotes the treatment effect of interest, i.e., the hazard ratio after t0. Under the assumption that survival times in both groups are exponentially distributed, a new approximation to the non-centrality parameter of the asymptotic distribution was proposed:
(2)
where D denotes the expected total number of events from the two groups combined; D̃1 and D̃0 are the expected numbers of events occurring after t0 in the treatment and control groups, respectively; and π is the probability of assignment to the treatment group. Zhang and Quan [1] showed that the new non-centrality parameter (2) gives more accurate power than Schoenfeld’s formula [6], which is commonly used in practice.
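To make the setting concrete, the sketch below (in Python, assuming only numpy; all parameter values are illustrative and not taken from [1]) generates survival data under the lagged-effect model (1) with uniform accrual and administrative censoring, and evaluates the standardized log-rank statistic, the quantity whose non-null mean the non-centrality parameter in (2) approximates.

```python
import numpy as np

rng = np.random.default_rng(0)

def lagged_exponential(n, lam0, gamma, t0):
    """Survival times whose hazard is lam0 before t0 and gamma*lam0 afterwards,
    obtained by inverting the piecewise-linear cumulative hazard."""
    e = rng.exponential(size=n)                  # unit exponential variates
    early = e <= lam0 * t0
    return np.where(early, e / lam0, t0 + (e - lam0 * t0) / (gamma * lam0))

def logrank_z(time, event, group):
    """Standardized two-sample log-rank statistic (approximately N(0,1) under H0)."""
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e / np.sqrt(var)

# Illustrative trial: accrual uniform on [0, A], total duration L, lag t0,
# control hazard lam0, post-lag hazard ratio gamma (< 1 means benefit).
n, pi, A, L, t0, lam0, gamma = 400, 0.5, 12.0, 36.0, 6.0, 0.05, 0.6
group = rng.binomial(1, pi, size=n)
latent = np.where(group == 1,
                  lagged_exponential(n, lam0, gamma, t0),
                  rng.exponential(1 / lam0, size=n))
censor = L - rng.uniform(0.0, A, size=n)         # administrative censoring at study end
time = np.minimum(latent, censor)
event = (latent <= censor).astype(int)
print("log-rank z:", logrank_z(time, event, group))
```

Repeating the last block over many simulated trials gives the empirical power that formula-based calculations such as (2) aim to approximate.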
The assumption that the survival time on its original time scale follows an exponential distribution with a constant hazard rate may be too strong, but it can be relaxed by rescaling the survival time using the cumulative hazard function so that the transformed time follows a simple exponential distribution [7]. In this paper, we reformulate the alternative hypothesis that defines the delayed treatment effect after rescaling and derive a simplified formula for sample size calculation in Section 2. In Section 3, we further propose an intersection-union test to improve the overall power when a delayed treatment effect is hypothesized. Section 4 presents simulation studies and Section 5 concludes with a discussion.
2. An Alternative Formulation
We consider a randomized controlled trial that randomly allocates subjects to a treatment group with probability π and to a placebo control group with probability 1 − π. The primary endpoint of interest is the time to a pre-defined event, denoted by Tj, j = 0, 1, where throughout the paper the subscript j = 1 refers to the treatment group and j = 0 to the control group. The hazard function, cumulative hazard function, and survival function of group j are denoted by λj(t), Λj(t) and Sj(t), respectively. By observing that
(3) Sj(t) = exp{−Λj(t)}, and hence Λ0(T0) follows the unit exponential distribution,
we formulate the hypothesis on this rescaled time domain as follows:
(4) H0: λ̃1(s) = 1 for all s ≥ 0 versus H1: λ̃1(s) = 1 for s ≤ s0 and λ̃1(s) = γ ≠ 1 for s > s0,

where λ̃1(·) denotes the hazard function of the rescaled time Λ0(T1) and s0 = Λ0(t0) = −log S0(t0) is the change point on the rescaled scale.
Note that we specify the change point in terms of survival proportion, which is invariant to any time scale change.
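As an illustration of the rescaling, the sketch below (a minimal Python example; the Nelson-Aalen step, the variable names, and the pre-specified control survival proportion p0 are assumptions for illustration) maps observed times onto the scale s = Λ0(t) using the control-arm cumulative hazard and places the change point at s0 = −log p0.

```python
import numpy as np

def nelson_aalen(time, event):
    """Nelson-Aalen estimate of the cumulative hazard, returned as a step function
    (the distinct event times and the cumulative hazard reached at each of them)."""
    t_event = np.unique(time[event == 1])
    jumps, h = [], 0.0
    for t in t_event:
        at_risk = np.sum(time >= t)
        d = np.sum((time == t) & (event == 1))
        h += d / at_risk
        jumps.append(h)
    return t_event, np.array(jumps)

def rescale(times, step_t, step_h):
    """Evaluate the estimated cumulative hazard at arbitrary times: s = Lambda0_hat(t)."""
    idx = np.searchsorted(step_t, times, side="right") - 1
    return np.where(idx < 0, 0.0, step_h[np.clip(idx, 0, None)])

# Hypothetical usage:
# t_ctrl, e_ctrl : observed times and event indicators in the control arm
# t_all          : observed times in both arms
# p0             : pre-specified control survival proportion at the change point
# step_t, step_h = nelson_aalen(t_ctrl, e_ctrl)
# s_all = rescale(t_all, step_t, step_h)   # times on the rescaled domain
# s0 = -np.log(p0)                         # change point on the rescaled scale
```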
To derive the formula for power analysis, we assume that the entry time Z follows a uniform distribution over [0, A], that the total study duration is L, and that no dropout occurs before the end of the study. The results can easily be extended to accommodate more complicated recruitment plans and/or dropout distributions. Let N denote the total sample size, D̄j the expected number of events before the change point in group j, and D̃j the expected number of events after the change point. Following the derivations in Zhang and Quan [1], we have
(5)

Detailed derivations of D̄j and D̃j, j = 0, 1, are given in the appendix. Then it follows that the expected total number of events is
(6)
which clearly illustrates the relationship between the expected number of events and the total sample size. By plugging D̃1, D̃0 and D into (2), we have the non-centrality parameter as
(7)

Therefore, for a given type I error α and type II error β, the required total number of events is obtained by setting the non-centrality parameter in (7) equal to z1−α/2 + z1−β, the sum of the corresponding standard normal quantiles, and solving for D.
Note that the rescaled formulation of the hypothesis in (4) leads to this simplified sample size formula, which is easy to interpret and implement.
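For illustration, the sketch below implements a Schoenfeld-type event-count calculation in which the required number of events is inflated by the squared proportion of events expected to occur after the change point; this simple relation is an assumption made here for illustration and is not claimed to coincide with the exact expression in (7).

```python
import numpy as np
from scipy.stats import norm

def required_events(gamma, q, pi=0.5, alpha=0.05, beta=0.2):
    """Illustrative Schoenfeld-type event count under a delayed treatment effect.

    gamma : hazard ratio after the change point
    q     : anticipated proportion of all events occurring after the change point
    pi    : probability of allocation to the treatment group
    alpha : two-sided type I error; beta : type II error
    """
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    return z ** 2 / (pi * (1 - pi) * np.log(gamma) ** 2 * q ** 2)

# Example: post-lag hazard ratio 0.6 with 70% of events expected after the change point:
# required_events(0.6, 0.7) gives about 246 events, versus about 120 when q = 1 (no lag).
```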
3. Intersection-Union Test for the Lagged Treatment Effect
When a delayed treatment effect is expected, investigators often use a weighted log-rank test with weights based on prior knowledge. Examples of different weights include Gehan’s weight, the Tarone-Ware weight, the Fleming-Harrington weight, and the Peto-Prentice weight [8]. Schoenfeld [6] and Gill [9] showed that the optimal weighting depends on the underlying true hazard ratio function. Beyond the difficulty of choosing appropriate weights, the key problem is that such tests do not reflect the essence of the lagged-treatment-effect setup: the hazard ratio is not constant throughout the trial.
The early overlap of the KM curves between the two groups suggests that the treatment does not exhibit its full effect early on but also does no harm to the patients. A natural hypothesis for a desirable treatment would consist of two components, namely (i) the treatment is non-inferior to the control during the whole study period and (ii) the treatment is superior to the control after a pre-defined change point. We thus protect patients from being harmed by the treatment and also provide researchers with a proper conclusion, in the sense that the alternative hypothesis truly reflects a change point in the hazard ratio. Such a formulation fits readily into the framework of intersection-union tests proposed by Berger [10], originally developed for product quality control. Specifically, we formulate the overall null hypothesis H0 = H10 ∪ H20, where

H10: λ1(t) ≥ Δ10 λ0(t) for some t ∈ [0, τ],    H20: λ1(t) ≥ Δ20 λ0(t) for some t ∈ (t0, τ],

with τ denoting the end of the study and Δ10 > 1 ≥ Δ20 the non-inferiority and superiority margins, respectively. By De Morgan’s law, the overall alternative hypothesis is H1 = (H10 ∪ H20)c = H11 ∩ H21, where

H11: λ1(t) < Δ10 λ0(t) for all t ∈ [0, τ],    H21: λ1(t) < Δ20 λ0(t) for all t ∈ (t0, τ].
A non-inferiority log-rank test over the entire study duration [0, τ] can be used for H10 and a superiority log-rank test over [t0, τ], after the pre-specified change point, for H20. We conduct the tests using the non-inferiority/superiority log-rank test formulation of Chow et al. [11]. Under this construction, the overall null hypothesis is rejected only when both component hypotheses are rejected. Conducting each component test at level α maintains the overall type I error at level α [10].
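A minimal sketch of the decision rule is given below (in Python). It uses Cox partial-likelihood Wald tests as stand-ins for the non-inferiority and superiority log-rank tests of Chow et al. [11], and it assumes the lifelines package together with a data frame whose column names ('time', 'event', 'trt') are hypothetical; the treatment coefficient is the log hazard ratio of treatment versus control.

```python
import numpy as np
from scipy.stats import norm
from lifelines import CoxPHFitter

def iut_decision(df, t0, delta10, delta20, alpha=0.05):
    """Reject the overall null H0 only if BOTH one-sided component tests reject at level alpha.
    df has columns 'time', 'event', 'trt' (trt = 1 for treatment, 0 for control)."""
    # (i) non-inferiority over the whole study period: reject H10 if the hazard
    # ratio is significantly below the margin delta10 (> 1)
    full = CoxPHFitter().fit(df[["time", "event", "trt"]], "time", "event")
    z_ni = (full.params_["trt"] - np.log(delta10)) / full.standard_errors_["trt"]

    # (ii) superiority after the change point: restrict to subjects still at risk at t0,
    # measure time from t0, and reject H20 if the hazard ratio is significantly
    # below the margin delta20 (<= 1)
    post = df[df["time"] > t0].copy()
    post["time"] = post["time"] - t0
    late = CoxPHFitter().fit(post[["time", "event", "trt"]], "time", "event")
    z_sup = (late.params_["trt"] - np.log(delta20)) / late.standard_errors_["trt"]

    z_crit = norm.ppf(alpha)          # one-sided critical value (negative)
    return bool(z_ni < z_crit and z_sup < z_crit)
```

Restricting to subjects at risk at t0 and using residual time is one simple way to target the post-change-point hazard ratio; a left-truncation analysis would serve the same purpose.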
The proposed intersection-union test, as we show in the next section, is more powerful than the usual log-rank test. This is expected because the proposed procedure utilizes the prior knowledge of the change point t0 and takes into account the fact that the hazard ratio between the two groups is not constant throughout the study. Rejection of H0 by this method also allows researchers to claim that the new treatment does not adversely affect patients overall and shows a beneficial effect after the assumed change point.
Selection of the non-inferiority and superiority margins (Δ10, Δ20) is necessary before implementing the proposed intersection-union test. We need to determine how close in efficacy the new treatment must be to the control in order to declare a significant improvement. As discussed in [12], the ICH documents offer two guidelines for determining the corresponding margins (see [13]): (1) the non-inferiority margin should be chosen based on both statistical reasoning and clinical judgement, in a suitably conservative manner; (2) the margin cannot exceed the smallest effect size that the active drug would be reasonably expected to produce compared with placebo, based on past placebo-controlled trials under similar conditions, if such trials exist. Following these guidelines, we may adopt either the putative placebo approach (see [12]), in which the margin is set so that the new treatment retains at least a certain fraction of the superiority of the active control over the placebo (thereby avoiding biocreep), or the 95–95 approach based on meta-analytic methods applied to data from previous studies, as discussed in [14].
4. Simulation Studies
We performed numerical studies to evaluate the type I error and power of the proposed intersection-union test and the log-rank test. We considered total sample sizes of 40, 50, 80 and 100 with 1:1 allocation to the treatment and control groups. Survival times were generated from the Weibull distribution. Under the null hypothesis, both the shape and scale parameters were set to 3. Under the alternative hypothesis, we considered a change point (t0) at 2 and modified the scale parameter for the treatment group to 4.8 after the change point, which yielded a hazard ratio of 1.6. The non-inferiority margin Δ10 was set at various levels in {1.05, 1.10, 1.15, 1.20, 1.30, 1.40}, and the superiority margin Δ20 was chosen to be 0.90. We also conducted sensitivity analyses in which misspecified change points {1.5, 1.75, 1.875, 2.125, 2.25} were assumed. For each setting, 1,000 simulation replications were carried out, and the results are summarized in Tables 1 and 2.
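As a rough guide to how such data can be generated, the sketch below (Python; the exact Weibull parametrization used in our simulations may differ, so the post-change-point hazard ratio below is an assumption for illustration) draws control times from a Weibull distribution and gives the treatment arm the same hazard up to the change point and a proportionally reduced hazard afterwards, by inverting the piecewise cumulative hazard.

```python
import numpy as np

rng = np.random.default_rng(2024)

def weibull_cumhaz(t, shape, scale):
    """Cumulative hazard of a Weibull(shape, scale) distribution: (t/scale)**shape."""
    return (t / scale) ** shape

def draw_changepoint_weibull(n, shape, scale, t0, hr):
    """Survival times whose hazard equals the Weibull(shape, scale) hazard before t0
    and hr times that hazard afterwards (inversion of the piecewise cumulative hazard)."""
    e = rng.exponential(size=n)
    h0 = weibull_cumhaz(t0, shape, scale)
    # before t0: (T/scale)**shape = e ; after t0: h0 + hr*((T/scale)**shape - h0) = e
    total = np.where(e <= h0, e, h0 + (e - h0) / hr)
    return scale * total ** (1.0 / shape)

# Illustrative version of the simulation set-up: shape = scale = 3 for the control arm,
# change point t0 = 2, and a post-change-point hazard ratio favouring the treatment.
n_per_arm, shape, scale, t0, hr = 50, 3.0, 3.0, 2.0, 1 / 1.6
t_ctrl = scale * rng.exponential(size=n_per_arm) ** (1.0 / shape)   # plain Weibull draws
t_trt = draw_changepoint_weibull(n_per_arm, shape, scale, t0, hr)
# t_ctrl and t_trt can then be censored administratively and fed to the log-rank test
# and to the intersection-union procedure sketched in Section 3 to estimate power.
```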
Table 1. Empirical type I error rates (%) of the proposed intersection-union test at the nominal 5% level, for different total sample sizes N, non-inferiority margins Δ10, and assumed change points t0.

| N | Δ10 | t0 = 1.500 | t0 = 1.750 | t0 = 1.875 | t0 = 2.000 | t0 = 2.125 | t0 = 2.250 |
|---|-----|------------|------------|------------|------------|------------|------------|
| 40 | 1.05 | 3.77 | 3.82 | 4.00 | 3.75 | 3.73 | 3.60 |
| 40 | 1.10 | 3.86 | 4.06 | 4.29 | 4.06 | 4.11 | 3.96 |
| 40 | 1.15 | 3.91 | 4.18 | 4.48 | 4.27 | 4.31 | 4.26 |
| 40 | 1.20 | 3.94 | 4.27 | 4.62 | 4.45 | 4.56 | 4.54 |
| 40 | 1.30 | 3.72 | 4.08 | 4.48 | 4.45 | 4.55 | 4.46 |
| 40 | 1.40 | 3.68 | 4.06 | 4.32 | 4.42 | 4.46 | 4.50 |
| 100 | 1.05 | 3.71 | 3.91 | 4.16 | 4.10 | 4.05 | 3.87 |
| 100 | 1.10 | 3.77 | 4.06 | 4.35 | 4.26 | 4.38 | 4.21 |
| 100 | 1.15 | 3.67 | 3.97 | 4.49 | 4.48 | 4.47 | 4.46 |
| 100 | 1.20 | 3.75 | 3.98 | 4.56 | 4.51 | 4.37 | 4.56 |
| 100 | 1.30 | 3.96 | 4.30 | 4.73 | 4.59 | 4.72 | 4.77 |
| 100 | 1.40 | 3.96 | 4.30 | 4.75 | 4.64 | 4.76 | 4.83 |
Table 2. Empirical power (%) of the proposed intersection-union test for different total sample sizes N, non-inferiority margins Δ10, and assumed change points t0, together with the empirical power of the log-rank test (reported once per sample size, as it does not depend on Δ10 or the assumed change point).

| N | Δ10 | t0 = 1.500 | t0 = 1.750 | t0 = 1.875 | t0 = 2.000 | t0 = 2.125 | t0 = 2.250 | Log-rank |
|---|-----|------------|------------|------------|------------|------------|------------|----------|
| 40 | 1.05 | 83.8 | 84.8 | 85.4 | 83.5 | 83.6 | 83.1 | 83.3 |
| 40 | 1.10 | 87.1 | 86.4 | 86.3 | 83.1 | 85.1 | 84.1 | |
| 40 | 1.15 | 88.5 | 87.2 | 86.7 | 87.0 | 87.3 | 87.6 | |
| 40 | 1.20 | 88.6 | 88.9 | 90.5 | 88.8 | 87.3 | 86.9 | |
| 40 | 1.30 | 90.2 | 89.9 | 91.0 | 90.5 | 91.9 | 88.2 | |
| 40 | 1.40 | 90.8 | 92.3 | 91.4 | 90.7 | 90.4 | 90.6 | |
| 50 | 1.05 | 91.3 | 92.3 | 90.8 | 91.7 | 90.5 | 90.6 | 90.4 |
| 50 | 1.10 | 93.1 | 92.7 | 92.8 | 93.4 | 94.3 | 89.9 | |
| 50 | 1.15 | 93.7 | 94.8 | 94.5 | 93.8 | 94.3 | 92.5 | |
| 50 | 1.20 | 93.0 | 96.1 | 94.8 | 94.5 | 94.7 | 93.5 | |
| 50 | 1.30 | 95.5 | 95.0 | 95.7 | 95.1 | 93.4 | 94.2 | |
| 50 | 1.40 | 94.8 | 95.9 | 95.5 | 96.2 | 95.8 | 93.6 | |
| 80 | 1.05 | 99.4 | 99.4 | 99.8 | 99.4 | 99.1 | 98.8 | 98.5 |
| 80 | 1.10 | 99.5 | 99.2 | 99.2 | 99.4 | 99.3 | 98.8 | |
| 80 | 1.15 | 99.8 | 99.7 | 99.6 | 99.6 | 99.6 | 99.6 | |
| 80 | 1.20 | 99.8 | 99.6 | 99.8 | 99.7 | 99.7 | 99.3 | |
| 80 | 1.30 | 99.7 | 99.6 | 99.8 | 99.4 | 99.7 | 99.8 | |
| 80 | 1.40 | 99.8 | 99.8 | 99.6 | 99.5 | 99.7 | 99.6 | |
| 100 | 1.05 | 99.8 | 99.7 | 99.8 | 99.7 | 99.7 | 99.8 | 99.6 |
| 100 | 1.10 | 99.6 | 100.0 | 99.8 | 99.9 | 99.9 | 99.8 | |
| 100 | 1.15 | 99.8 | 100.0 | 100.0 | 100.0 | 99.9 | 99.9 | |
| 100 | 1.20 | 99.9 | 100.0 | 100.0 | 100.0 | 100.0 | 99.8 | |
| 100 | 1.30 | 100.0 | 99.9 | 99.8 | 100.0 | 100.0 | 100.0 | |
| 100 | 1.40 | 99.8 | 100.0 | 100.0 | 99.9 | 100.0 | 100.0 | |
Table 1 shows that the intersection-union test controls the type I error well at the nominal level of 0.05. Table 2 shows that the proposed test is more powerful than the log-rank test in all settings. The larger the value of Δ10, the higher the power of the intersection-union test, because Δ10 can be regarded as the tolerance level for the treatment group’s non-inferiority relative to the control group. Furthermore, across all simulation settings, the intersection-union approach is robust to the choice of t0 and performs better than the log-rank test for the various assumed change points.
5. Discussion
In this short note, we have proposed alternative methods for study design and hypothesis testing in clinical trials that anticipate a lagged treatment effect. Note that, in many cases, the hazard rate function for the subjects in the control group (as well as the treatment group) is not constant, so the survival time is not exponentially distributed. The rescaling of time via the cumulative hazard Λ0 allows us to transform the point process into a time-homogeneous Poisson process with a constant rate. Thus, we approach the change-point problem from the “number-of-events” perspective and provide an alternative formulation that allows trial designers to estimate the number of events needed to reach a certain level of power, given a preliminary estimate of the change point. This adds versatility to our formulation, which is applicable to a broad range of settings.
In addition, we propose a new intersection-union testing procedure that can tackle the lagged-treatment-effect phenomenon. The proposed intersection-union test first guarantees that the new treatment causes no extra harm relative to the existing treatment (control) and then tests for superiority of the new treatment after a predetermined change point t0. Such a construction provides a correct conclusion for the test and avoids blindly increasing the number of subjects so as to maintain a certain level of power. In contrast, an interim analysis such as that proposed in [15] may terminate the trial prematurely through acceptance of H0 at the interim look; even if the trial proceeds to the second stage, the re-estimated sample size can explode because the hazard ratio estimated at the interim look is inappropriately small. In conclusion, the intersection-union testing procedure improves power while enabling the user to correctly claim that the new treatment is beneficial after a certain time point t0, prior to which the new treatment performs comparably to the existing drug or treatment. Moreover, because the intersection-union testing procedure rejects the overall null H0 only when both individual nulls, H10 and H20, are rejected, the sample size required to attain a certain level of power can easily be obtained as
n = max(ni, ns),

where ni and ns denote, respectively, the sample size required for the non-inferiority test and that required for the superiority test.
Acknowledgments
Research supported in part by CUHK Direct Grant 4053086 and RGC ECS 24300514 for Sit, T., the NIH/NCI grants R21CA169739 for Liu, M. and R37GM047845 for Ying, Z.
6. Appendix: Derivation of the Expected Numbers of Events (5)
Following the derivations and arguments presented on pages 869–870 of [1], we can rewrite D̄i, Di, and thus D̃i, i = 0, 1, as follows:
It follows that
Contributor Information
Tony Sit, Email: tonysit@sta.cuhk.edu.hk.
Michael Shnaidman, Email: michael.shnaidman@pfizer.com.
Zhiliang Ying, Email: zying@stat.columbia.edu.
References
- 1. Zhang D, Quan H. Power and sample size calculation for log-rank test with a time lag in treatment effect. Statistics in Medicine. 2009;28:864–879. doi: 10.1002/sim.3501.
- 2. Halperin M, Rogot E, Gurian J, Ederer F. Sample sizes for medical trials with reference to long-term therapy. Journal of Chronic Diseases. 1968;21:13–24. doi: 10.1016/0021-9681(68)90082-9.
- 3. Lakatos E. Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics. 1988;44:229–241.
- 4. Zucker DM, Lakatos E. Weighted log rank type statistics for comparing survival curves when there is a time lag in the effectiveness of treatment. Biometrika. 1990;77:853–864.
- 5. Luo X, Turnbull BW, Cai H, Clark LC. Regression for censored survival curves with lag effects. Communications in Statistics - Theory and Methods. 1994;23:3417–3438.
- 6. Schoenfeld D. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika. 1981;68:316–319.
- 7. Griffith W. Representation of distributions having monotone or bathtub-shaped failure rates. IEEE Transactions on Reliability. 1982;31:95–96.
- 8. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. Springer; 2003.
- 9. Gill R. Nonparametric estimation based on censored observations of a Markov renewal process. Z Wahrsch Verw Gebiete. 1980;53:97–116.
- 10. Berger R. Multiparameter hypothesis testing and acceptance sampling. Technometrics. 1982;24:295–300.
- 11. Chow SC, Shao J, Wang H. Sample Size Calculations in Clinical Research. 2nd ed. Chapman and Hall/CRC; 2008.
- 12. D’Agostino R, Massaro J, Sullivan L. Non-inferiority trials: design concepts and issues – the encounters of academic consultants in statistics. Statistics in Medicine. 2003;22:169–186. doi: 10.1002/sim.1425.
- 13. Huitfeldt B, Danielson L, Ebbutt A, Schmidt K. Choice of control in clinical trials: issues and implications of ICH-E10. Drug Information Journal. 2000;35:1147–1156.
- 14. Rothmann M, Li N, Chen G. Design and analysis of non-inferiority mortality trials in oncology. Statistics in Medicine. 2003;22:239–264. doi: 10.1002/sim.1400.
- 15. Li G, Shin W, Wang Y. Two-stage adaptive design for clinical trials with survival data. Journal of Biopharmaceutical Statistics. 2005;15:707–718. doi: 10.1081/BIP-200062293.