Short abstract
A wide variety of statistical methods exist for clinical trials. Despite decades of clinical trial research, little is known about which methods are used in real‐world data analysis practice. This review describes existing practices in survival data analysis and identifies potential opportunities to improve them.
Keywords: Interpretability, Robustness, Survival data analysis, Test/estimation coherency, Treatment decision making
There are two major analytic tasks in comparative clinical trials with time‐to‐event outcomes. The first task is testing the equality of the treatment groups. Because statistical tests provide a dichotomous outcome, they naturally fit a binary decision, such as approval or nonapproval of a new drug. The second task is estimating the magnitude of the treatment effect. For clinicians and patients, the magnitude‐of‐effect estimate is weighed against the risk estimate when making treatment decisions. Together, the methods used to accomplish these two tasks can be described as the “test/estimation” approach for any given analysis. Many statistical methods can be used to perform these two tasks. For example, the combination of the log‐rank test and estimation of hazard ratio (HR) is one widely used test/estimation approach. Despite decades of clinical trials research, we know comparatively little about which test/estimation methods are used in everyday practice. We sought to describe existing analytic practices and identify potential opportunities to improve conventional test/estimation approaches.
Methods
We conducted a systematic review to determine how frequently different test/estimation methods are used in contemporary cancer randomized controlled trials (RCTs).
Data Sources and Searches
We searched for reports of phase III RCTs in which overall survival, progression‐free survival, or disease‐free survival was included as the primary or secondary endpoint. For inclusion, papers had to be published in one of seven journals: New England Journal of Medicine, Lancet, Lancet Oncology, JAMA, JAMA Oncology, Journal of Clinical Oncology, and Journal of the National Cancer Institute. The registration date on PubMed had to be between July 1, 2016, and June 30, 2017. After initial identification by the PubMed search, two authors (H.U. and M.H.) independently examined the papers and determined their eligibility. Any discrepancies were resolved by consensus. The eligibility criteria were that (A) data were from a phase III RCT, (B) comparative groups were randomized, and (C) the result was based on the primary analysis. Therefore, we excluded papers whose primary goal was to report secondary analyses, subgroup analyses, results of correlative studies, or meta‐analyses.
Classification of Papers
We classified papers by the test/estimation approach employed. Specifically, for each eligible paper, we identified the primary testing procedure used to compare the two groups and the summary measure used to estimate the magnitude of the treatment effect (e.g., HR, difference in t‐year event rate). Because the log‐rank test and the partial likelihood‐based tests via Cox's proportional hazards (PH) model (e.g., score test and Wald test) are asymptotically equivalent, we classified them into the same category as the log‐rank test (HR‐based test). Regarding the estimation methods, we only counted papers as reporting a specific between‐group treatment effect summary measure if a corresponding standard error or confidence interval was reported with the point estimate. For example, if a paper reported the median survival time only for each group, we did not count the paper as reporting difference or ratio of median survival time, although one can calculate a point estimate for the absolute difference or ratio from the two medians.
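The log‐rank statistic at the center of the HR‐based test category can be sketched in a few lines. This is a minimal illustration of the standard two‐sample log‐rank test, not code from any of the reviewed trials; the function name `logrank_test` and its argument layout are assumptions made for this example.

```python
from math import erfc, sqrt


def logrank_test(times, events, groups):
    """Two-sample log-rank test (illustrative sketch).

    times  : follow-up time for each patient
    events : 1 if the event was observed, 0 if censored
    groups : 0 for control, 1 for treatment
    Returns (chi-square statistic with 1 df, two-sided p-value).
    """
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    o_minus_e = 0.0  # observed minus expected events in group 1
    variance = 0.0   # hypergeometric variance summed over event times
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)              # at risk, both groups
        n1 = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group 1
        d = sum(1 for u, e, _ in data if u == t and e == 1)   # events at time t
        d1 = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = o_minus_e ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Toy data, invented for illustration only.
chi2, p = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [1, 0, 1, 0])
print(f"chi-square = {chi2:.3f}, p = {p:.3f}")
```

The statistic depends only on the ordering of event times and the numbers at risk, which is why it pairs naturally with the HR from Cox's model rather than with a time‐specific summary such as the t‐year event rate.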
Results
A total of 150 papers were identified by PubMed search. Of these, 101 (67%) satisfied the eligibility criteria (Fig. 1). Based on the employed test/estimation approaches, the 101 eligible papers were summarized in a cross table (Table 1).
Figure 1.
Disposition of papers. “Not primary manuscript” includes ad hoc subgroup analyses, reports of correlative studies (e.g., quality of life data), reports of analysis of prognostic factors, prediction modeling, results of long‐term follow‐up data, and so on.
Abbreviations: DFS, disease‐free survival; OS, overall survival; PFS, progression‐free survival.
Table 1.
Test/estimation methods used for reporting results of recent phase III cancer clinical trials with time‐to‐event outcomes
Estimation columns (HR through NR): summary measures used for reporting the magnitude of treatment effect with standard error or confidence interval.
Test: Procedures used for the primary comparison | HR | HR + RMST | HR + t‐Y | t‐Y | NR | Total |
---|---|---|---|---|---|---|
Log‐rank test (HR‐based test) | 91 | 1 | 2 | 1 | 2 | 97 (96%) |
Test based on a Weibull regression | 1 | 0 | 0 | 0 | 0 | 1 (1%) |
Test based on t‐Y | 0 | 0 | 1 | 2 | 0 | 3 (3%) |
Total | 92 (91%) | 1 (1%) | 3 (3%) | 3 (3%) | 2 (2%) | 101 |
Abbreviations: HR, hazard ratio; t‐Y, difference in t‐year event rate; RMST, difference in restricted mean survival time; NR, nothing is reported.
For estimating the treatment effect, the most commonly used summary measure was the HR, which was reported in 95% of papers. Only 16% of those papers addressed the PH assumption, even though the HR relies on the PH assumption for a valid inference [1, 2]. Only three papers that reported the HR also reported the difference in t‐year event rate, and only one reported the difference in restricted mean survival time (RMST) because the data suggested that the PH assumption was not valid. Three papers reported the difference in t‐year event rate only, and two papers did not report any between‐group difference measure with its standard error or confidence interval. No papers reported the difference or ratio of median survival times with a confidence interval, whereas many papers reported median survival time by group.
Regarding testing the equality of the treatment groups, the log‐rank test was used as the primary test in 97 studies. No tests, other than the log‐rank test, tests based on the t‐year event rate, and a test based on the Weibull regression model, were used in any of these 101 papers.
One desirable property of the analysis method is “test/estimation coherency,” which means that the test method and the estimation method always provide the same conclusion in terms of statistical significance: if the test for equality claims significance, then the confidence interval for the treatment effect excludes the null value. The log‐rank test combined with HR estimation is one example of a “coherent” test/estimation approach. Another coherent approach applies statistical testing to the t‐year event rate and uses the difference in t‐year event rate to estimate the magnitude of the treatment effect. In this investigation, we found that 97% of papers used a coherent approach. One paper used an incoherent test/estimation approach: its authors used the log‐rank test for statistical comparison and the difference in t‐year event rate for quantifying the treatment effect. This combination is incoherent because a significant log‐rank test result does not imply a significant difference in the t‐year event rate. Without test/estimation coherency, the reported results can confuse clinicians and patients who are making treatment decisions.
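Coherency for the t‐year event rate can be illustrated with a short sketch. It assumes Kaplan‐Meier estimates with Greenwood variances and a normal approximation, which is one common way to build such a test and interval; the helper names `km_at` and `t_year_comparison` and the toy data are hypothetical. Because the test and the confidence interval share the same estimate and standard error, they cannot disagree about significance.

```python
from math import erfc, sqrt


def km_at(times, events, t):
    """Kaplan-Meier survival estimate at time t and its Greenwood variance."""
    surv, greenwood = 1.0, 0.0
    for u in sorted({x for x, e in zip(times, events) if e == 1 and x <= t}):
        n = sum(1 for x in times if x >= u)                            # at risk at u
        d = sum(1 for x, e in zip(times, events) if x == u and e == 1)  # events at u
        surv *= 1 - d / n
        if n > d:
            greenwood += d / (n * (n - d))
    return surv, surv ** 2 * greenwood


def t_year_comparison(ctrl, trt, t, z_crit=1.959964):
    """Difference in t-year event rate (treatment minus control).

    ctrl and trt are (times, events) pairs. Returns the difference,
    a 95% confidence interval, and a two-sided p-value, all based on
    the same standard error (assumed to be greater than zero).
    """
    s0, v0 = km_at(ctrl[0], ctrl[1], t)
    s1, v1 = km_at(trt[0], trt[1], t)
    diff = (1 - s1) - (1 - s0)  # event rate = 1 - survival probability
    se = sqrt(v0 + v1)
    p_value = erfc(abs(diff / se) / sqrt(2))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Toy data, invented for illustration only.
ctrl = ([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1, 1, 0, 1, 0, 0])
trt = ([2, 4, 6, 8, 9, 10, 11, 12], [0, 1, 0, 0, 1, 0, 0, 0])
diff, (lo, hi), p = t_year_comparison(ctrl, trt, t=5)
print(f"difference = {diff:.3f}, 95% CI = ({lo:.3f}, {hi:.3f}), p = {p:.3f}")
```

Replacing the p‐value in this sketch with one from a log‐rank test would break that link, which is the incoherent combination described above.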
Another desirable property of the analysis method is “robustness and interpretability.” This means that (A) the summary measure to quantify the magnitude of treatment effect can be estimated accurately without relying on assumptions regarding the pattern of the difference between the survival time distributions from two groups and that (B) there exists a reference value from the control group that can be used to assess whether the magnitude of the estimated between‐group difference is clinically meaningful or not. Although our study found that HR was used in more than 95% of recent RCTs, the conventional HR estimate via Cox's regression model [1] demonstrates shortcomings with regard to robustness and interpretability, as previously discussed in recent publications [3, 4, 5, 6, 7, 8, 9]. For example, Horiguchi et al. [5] illustrated that HR estimates generated by Cox's procedure can vary by accrual pattern and follow‐up time when the PH assumption is violated, and these changes in HR estimates can be large enough to affect treatment decision making. Because the HR is the ratio of two hazard functions under the PH assumption, the reference from the control is the baseline hazard function over time. There is no absolute number to reference when attempting to assess if the observed HR confers a clinically meaningful benefit relative to the reported absolute risks of treatment [3, 5]. Suboptimal interpretability can confound treatment decision making. For example, suppose the study showed a 20% risk reduction in the 3‐year event rate. The treatment decision when the 3‐year event rate in the control group is 1% could be different from the decision when it is 50%. As such, a reference number from the control group is essential for informed decision making, no matter which treatment contrast measures may be used.
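The 20% example can be worked through numerically. The sketch below assumes that “20% risk reduction” means a relative reduction in the 3‐year event rate; the function name `absolute_effect` is hypothetical.

```python
def absolute_effect(control_rate, relative_reduction):
    """Return (treated event rate, absolute risk reduction).

    Assumes the relative reduction applies multiplicatively to the
    control-group event rate.
    """
    treated_rate = control_rate * (1 - relative_reduction)
    return treated_rate, control_rate - treated_rate


for control_rate in (0.01, 0.50):
    treated, arr = absolute_effect(control_rate, 0.20)
    print(
        f"control {control_rate:.1%} -> treated {treated:.1%}, "
        f"absolute reduction {arr:.1%}"
    )
```

Under this reading, the same relative reduction corresponds to an absolute reduction of 0.2 percentage points when the control rate is 1% and 10 percentage points when it is 50%, which is why the control‐group reference value matters.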
In this investigation, we found that only 3% of studies used an approach that had both desirable properties (test/estimation coherency, robustness and interpretability).
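The percentages quoted in this section can be traced back to Table 1. The sketch below encodes the table cells and shows one reading of the cell groupings that reproduces the reported figures; the grouping of cells into “coherent” and “both properties” is an assumption inferred from the text, not a classification stated cell by cell in the paper.

```python
# Cell counts from Table 1: rows are tests, columns are estimation measures.
table = {
    "log-rank": {"HR": 91, "HR+RMST": 1, "HR+tY": 2, "tY": 1, "NR": 2},
    "weibull": {"HR": 1, "HR+RMST": 0, "HR+tY": 0, "tY": 0, "NR": 0},
    "tY-test": {"HR": 0, "HR+RMST": 0, "HR+tY": 1, "tY": 2, "NR": 0},
}

total = sum(sum(row.values()) for row in table.values())

# Papers reporting the HR, alone or with another measure.
hr_reported = sum(
    row[col] for row in table.values() for col in ("HR", "HR+RMST", "HR+tY")
)

# Assumed grouping: the log-rank test paired only with the t-year difference
# is the incoherent cell, and papers reporting no measure are not counted
# as coherent.
incoherent = table["log-rank"]["tY"]
not_reported = sum(row["NR"] for row in table.values())
coherent = total - incoherent - not_reported

# Assumed grouping: a t-year-based test paired with a t-year difference
# has both desirable properties.
both_properties = table["tY-test"]["HR+tY"] + table["tY-test"]["tY"]

for label, count in [
    ("HR reported", hr_reported),
    ("coherent approach", coherent),
    ("both properties", both_properties),
]:
    print(f"{label}: {count}/{total} = {100 * count / total:.0f}%")
```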
Discussion
The log‐rank test was used in almost all trials. Three factors may explain its frequent use: there is little or no information about the pattern of difference at the design stage of a clinical trial; statistical theory shows that the log‐rank test offers the most power to detect a difference between two event‐time distributions as long as the pattern of that difference is PH [2]; and the log‐rank test remains valid even in non‐PH scenarios. However, because the pattern of difference is not PH in every RCT, the current practice of using the log‐rank test in almost all trials may be suboptimal.
We also found that nearly all studies reported the HR as a measure of treatment effect. Selecting the log‐rank test as the primary test for statistical significance may reinforce the decision to use the HR to estimate the magnitude of the treatment effect, despite the limitations of the HR. Indeed, the log‐rank test offers the highest power to identify a signal when the pattern of difference is PH. At the same time, it is important for investigators of clinical trials to provide a quantitative summary of the treatment effect that helps clinicians and patients make treatment decisions based on the risk‐benefit balance.
Besides the log‐rank/HR test/estimation method, a variety of test/estimation approaches that have both desirable properties (i.e., test/estimation coherency, and robustness and interpretability) are available. These include the difference or ratio of t‐year event rates, the difference or ratio of medians, and the difference or ratio of RMST [3, 4, 7, 8, 9, 10, 11, 12, 13, 14]. Good summaries of the pros and cons of these approaches can be found in recent publications by Uno et al. [4] and Chappell and Zhu [7]. Statistical analyses for these measures can be implemented with the R package surv2sampleComp, available from the Comprehensive R Archive Network (CRAN) Web site (https://cran.r-project.org/). We recommend that trial investigators consider alternative approaches as well as the routine log‐rank/HR approach, depending on the objectives of the RCT.
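To make the RMST concrete, the sketch below computes it as the area under the Kaplan‐Meier curve from 0 to a truncation time. This is an illustrative Python sketch, not the surv2sampleComp implementation; the function name `rmst`, the truncation time `tau`, and the toy data are assumptions. It returns a point estimate only, so a confidence interval would need a variance estimate that is not shown here, and `tau` should not exceed the range of follow‐up.

```python
def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve on [0, tau]."""
    area, surv, prev = 0.0, 1.0, 0.0
    for u in sorted({x for x, e in zip(times, events) if e == 1 and x <= tau}):
        area += surv * (u - prev)  # rectangle up to the next event time
        n = sum(1 for x in times if x >= u)                            # at risk at u
        d = sum(1 for x, e in zip(times, events) if x == u and e == 1)  # events at u
        surv *= 1 - d / n
        prev = u
    return area + surv * (tau - prev)  # final rectangle up to tau


# Toy data, invented for illustration only (times in years).
ctrl_times, ctrl_events = [1, 2, 3, 4, 5, 6], [1, 1, 1, 0, 1, 0]
trt_times, trt_events = [2, 3, 4, 5, 6, 6], [0, 1, 0, 1, 0, 0]

tau = 5
difference = rmst(trt_times, trt_events, tau) - rmst(ctrl_times, ctrl_events, tau)
print(f"RMST difference up to {tau} years: {difference:.2f} years")
```

The control‐group RMST serves as the reference value here: a gain in event‐free time can be read against the average event‐free time that control patients experienced over the same window.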
Disclosures
The authors indicated no financial relationships.
Acknowledgments
This work was supported by institutional funds of the Department of Biostatistics and Computational Biology, Dana‐Farber Cancer Institute (Dana Funds). These funds were made possible by a Dana Foundation donation.
References
- 1. Cox DR. Regression models and life‐tables. J Royal Stat Soc Ser B Stat Methodol 1972;34:187–220.
- 2. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. New York, NY: John Wiley & Sons, 1991.
- 3. Uno H, Claggett B, Tian L et al. Moving beyond the hazard ratio in quantifying the between‐group difference in survival analysis. J Clin Oncol 2014;32:2380–2385.
- 4. Uno H, Wittes J, Fu H et al. Alternatives to hazard ratios for comparing the efficacy or safety of therapies in noninferiority studies. Ann Intern Med 2015;163:127–134.
- 5. Horiguchi M, Hassett MJ, Uno H. How do the accrual pattern and follow‐up duration affect the hazard ratio estimate when the proportional hazards assumption is violated? The Oncologist 2019;24:867–871.
- 6. Péron J, Roy P, Ozenne B et al. The net chance of a longer survival as a patient‐oriented measure of treatment benefit in randomized clinical trials. JAMA Oncol 2016;2:901–905.
- 7. Chappell R, Zhu X. Describing differences in survival curves. JAMA Oncol 2016;2:906–907.
- 8. A'Hern RP. Restricted mean survival time: An obligatory end point for time‐to‐event analysis in cancer trials? J Clin Oncol 2016;34:3474–3476.
- 9. A'Hern RP. Cancer biology and survival analysis in cancer trials: Restricted mean survival time analysis versus hazard ratios. Clin Oncol (R Coll Radiol) 2018;30:e75–e80.
- 10. Trinquart L, Jacot J, Conner SC et al. Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. J Clin Oncol 2016;34:1813–1819.
- 11. Huang B, Kuan PF. Comparison of the restricted mean survival time with the hazard ratio in superiority trials with a time‐to‐event end point. Pharm Stat 2018;17:202–213.
- 12. Liang F, Zhang S, Wang Q et al. Treatment effects measured by restricted mean survival time in trials of immune checkpoint inhibitors for cancer. Ann Oncol 2018;29:1320–1324.
- 13. Royston P, Parmar MK. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Stat Med 2011;30:2409–2421.
- 14. Royston P, Parmar MK. Restricted mean survival time: An alternative to the hazard ratio for the design and analysis of randomized trials with a time‐to‐event outcome. BMC Med Res Methodol 2013;13:152.