Abstract
Calibration is a more important statistic than discrimination.
Given the extensive published literature on risk prediction models to aid biopsy decisions in prostate cancer (PCa) screening, we were pleased to see such a rigorous meta-analysis as the one by Louie et al. in the January 3 issue of Annals of Oncology [1]. The objective of this systematic review was to ‘assess the model's performance to predict PCa’ and the authors report summary area under the curve (AUC) statistics for the six risk prediction models identified. A reader untrained in statistics might interpret the model with the highest summary AUC, that is, the best discrimination, to be superior to the others.
We would argue that calibration is a more important statistic than discrimination. AUC gives us the probability that, for a randomly selected pair of individuals, one with the disease and one without, a model will give a higher score to the patient with the disease [2]. Calibration tells us whether the model correctly estimates the probability of disease for an individual; a model is said to be well calibrated if, for every 100 patients given a risk of x%, close to x actually have the disease [3]. If one took a model and divided risk by 100, e.g. a man with a 75% risk of PCa would be told that his risk is 0.75%, AUC would be unchanged, because the ranking of patients is preserved, yet every predicted risk would be too low by a factor of 100. We believe that it is more important for the individual patient to know that the risk given by the model is close to his true risk than to know how well the model distinguishes between patients.
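The rescaling argument can be checked with a small simulation. The Python sketch below is illustrative only: the cohort is simulated rather than drawn from any of the reviewed studies, and the helper names `auc` and `calibration_table` are our own assumptions, not functions from a cited method or library. It computes AUC directly from its pairwise definition and compares mean predicted risk with the observed event rate within risk groups.

```python
import random


def auc(scores, labels):
    """Probability that a randomly chosen diseased patient receives a higher
    score than a randomly chosen non-diseased patient (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_table(risks, labels, n_bins=5):
    """Return (mean predicted risk, observed event rate) for each of n_bins
    groups of patients ordered by predicted risk."""
    pairs = sorted(zip(risks, labels))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        start = b * size
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        mean_pred = sum(r for r, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, observed))
    return rows


# Simulated cohort (an assumption for illustration): each man's outcome is
# drawn from his true risk, so the unscaled model is calibrated by design.
rng = random.Random(0)
true_risk = [rng.uniform(0.05, 0.95) for _ in range(2000)]
outcome = [1 if rng.random() < p else 0 for p in true_risk]

# The "divide every risk by 100" model described in the text.
scaled_risk = [p / 100 for p in true_risk]

auc_original = auc(true_risk, outcome)
auc_scaled = auc(scaled_risk, outcome)
print(f"AUC original: {auc_original:.3f}, AUC after dividing by 100: {auc_scaled:.3f}")

for name, risks in [("original", true_risk), ("divided by 100", scaled_risk)]:
    print(name)
    for mean_pred, observed in calibration_table(risks, outcome):
        print(f"  predicted {mean_pred:.4f}  observed {observed:.4f}")
```

Under these assumptions the two AUC values agree, because dividing by 100 does not change the ordering of patients, while the calibration table for the rescaled model shows predicted risks far below the observed event rates in every group.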
The authors are not necessarily to be blamed for focusing on discrimination; it is more a general methodological problem of the included studies, as the authors correctly note: ‘calibration measures of the models were poorly reported’ [1]. Of the six included risk prediction models, three did not report calibration measures, two had good calibration and one model predicted risks that were higher than those observed [1]. We would like to see more future risk prediction papers showing calibration plots and then examining clinical utility, for instance, examining whether use of a model would allow some men to avoid a biopsy and whether this would lead to an undue number of aggressive cancers being missed. The statistical methods for evaluation of prediction models have been discussed elsewhere [3].
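As a rough illustration of the kind of clinical utility summary we have in mind, the following Python sketch counts, for a given risk threshold, how many biopsies would be avoided and how many high-grade cancers would be missed compared with biopsying every man. The function name `biopsy_tradeoff`, the 10% threshold and the simulated cohort are all hypothetical choices made for this example; none is taken from the reviewed models.

```python
import random


def biopsy_tradeoff(risks, high_grade, threshold):
    """Compare 'biopsy only if predicted risk >= threshold' with 'biopsy all'.

    Returns (biopsies avoided, high-grade cancers missed)."""
    avoided = sum(1 for r in risks if r < threshold)
    missed = sum(
        1 for r, y in zip(risks, high_grade) if r < threshold and y == 1
    )
    return avoided, missed


# Simulated cohort of 1000 men (an assumption for illustration): each man's
# high-grade cancer status is drawn from his predicted risk, i.e. the
# hypothetical model is assumed to be well calibrated.
rng = random.Random(1)
risks = [rng.uniform(0.0, 0.6) for _ in range(1000)]
high_grade = [1 if rng.random() < r else 0 for r in risks]

avoided, missed = biopsy_tradeoff(risks, high_grade, threshold=0.10)
print(f"Per 1000 men: {avoided} biopsies avoided, "
      f"{missed} high-grade cancers missed")
```

Reporting both counts side by side, across a range of plausible thresholds, lets a reader judge whether the biopsies saved justify the cancers missed. The counts are only meaningful if the predicted risks are calibrated, which is why calibration should be examined first.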
We have two further critiques of this paper. First, the authors chose models predicting any PCa for inclusion. Because of the low lethality among men with low-grade PCa, together with the questionable benefit of treating such men, the end point in risk prediction studies for PCa involving biomarkers should be high-grade PCa, not any PCa [2]. Second, the authors include prediction models that include prostate volume. The clinical utility of such models is limited, since the assessment of volume requires an invasive test.
Disclosure
AV is named on a patent application for a statistical method to detect PCa. The method has been commercialized by OPKO. AV receives royalties from sales of the test. All remaining authors have declared no conflicts of interest.
References
- 1. Louie KS, Seigneurin A, Cathcart P, Sasieni P. Do prostate cancer risk models improve the predictive accuracy of PSA screening? A meta-analysis. Ann Oncol 2015; 26(5): 848–864.
- 2. Vickers S. Markers for the early detection of prostate cancer: some principles for statistical reporting and interpretation. J Clin Oncol 2014; 32(36): 4033–4034.
- 3. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010; 21(1): 128–138.