Abstract
In evolving clinical environments, the accuracy of prediction models deteriorates over time. Guidance on the design of model updating policies is limited, and little is known about how different policies affect future model performance or how that impact varies across model types. We implemented a new data-driven updating strategy based on a nonparametric testing procedure and compared this strategy to two baseline approaches in which models are either never updated or fully refit annually. The test-based strategy generally recommended intermittent recalibration and delivered more highly calibrated predictions than either of the baseline strategies. The test-based strategy also highlighted differences in the updating requirements of logistic regression, L1-regularized logistic regression, random forest, and neural network models, both in the extent and the timing of updates. These findings underscore the potential of a data-driven maintenance approach, rather than a "one-size-fits-all" policy, to sustain more stable and accurate model performance over time.
Introduction
Clinical environments are continuously evolving through changes in patient case mix, outcome prevalence, and clinical practice. These changes can cause the performance characteristics of clinical prediction models to deteriorate over time1-8. As prediction models are increasingly incorporated into electronic health records to support decision-making by providers and patients, model updating strategies to sustain performance are becoming critical components of model implementations. Despite literature documenting performance drift5-9 and the availability of multiple updating methods1, 10-12, guidance on the design of model maintenance plans is limited and provides little insight into how differences between learning algorithms may impact updating requirements.
While a lack of model updating can harm the performance and utility of predictions, common model maintenance strategies, such as regularly scheduled model refitting, may be inefficient or even detrimental1-3, 13. The assumption that a new model is necessary neglects information gleaned from previous modeling efforts and can lead to overfit models that lack generalizability, especially when updating datasets are smaller than development cohorts1-3, 10. Such pre-defined updating plans also fail to account for variations in the response of different modeling methods to changes in clinical environments, which may impact the extent and form of shifts in model accuracy7-9. Alternatively, a range of recalibration methods are available that may correct deteriorations in performance by incorporating information from recent observations while retaining information from existing models and reducing the risk of overfitting1-3, 10. Recalibration may thus be more appropriate than refitting for models in clinical use when equivalent or improved performance can be achieved by the former.
Recently proposed methods seek to provide data-driven guidance on when to retain the current model, apply recalibration of varying degrees of complexity, or refit a model10,13. We described a nonparametric testing procedure to select between competing updating methods while minimizing overfitting and taking the updating sample size into consideration13. By default, the updating methods under consideration are retention of the existing model, intercept correction, linear logistic recalibration, flexible logistic recalibration, and model refitting. The testing procedure is designed to recommend the simplest updating method that does not compromise the accuracy achievable through more complex adjustments. In contrast with other tests10, our procedure is customizable and widely applicable to models for categorical outcomes regardless of the underlying learning algorithm.
In this study, we explore whether the long-term performance of clinical prediction models is improved through a data-driven approach to model maintenance. We compare three updating strategies: retention of the original model, predefined model refitting, and application of the recommendations of the nonparametric testing procedure. These updating strategies are applied to models predicting 30-day mortality after hospital admission in a national population of veterans, for which calibration drift, and variability in drift across modeling methods, have been documented over multiple years7. We assess differences in discrimination and calibration over time under each updating strategy, as well as whether and how the learning algorithm underlying the model impacts updating requirements and accuracy.
Methods
We developed models for 30-day mortality after hospital admission among patients admitted to Department of Veterans Affairs (VA) facilities nationwide using logistic regression (LR), L1-regularized logistic regression (L1), random forests (RF), and neural networks (NN). Model predictors and cohort eligibility criteria have been detailed in previous work7. Each model was developed using a common set of predictors and admissions occurring in 2006. Data from 2007-2013 were collected for both updating and validation. Previous work indicated that the LR and L1 models were most subject to calibration drift in this population, while the NN model did not experience significant calibration drift7. Updating was undertaken for all models on an annual basis at the end of 2007 through 2012, with updates based on admissions in the prior 12 months and applied to admissions in the following 12 months. Admissions occurring in 2013 served as validation data for updates at the end of 2012.
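As a rough illustration of this setup, the sketch below fits the four model families with scikit-learn on synthetic data standing in for the development-year admissions. The study's actual predictors, hyperparameter grids, and software are not specified here, so every setting shown is an assumption rather than the authors' configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for development-year admissions (not the VA cohort)
X_dev = rng.normal(size=(5000, 10))
y_dev = rng.binomial(1, 1 / (1 + np.exp(-(X_dev[:, 0] - 3))))  # roughly 5% event rate

models = {
    "LR": LogisticRegression(C=1e6, max_iter=1000),                      # large C ~ unpenalized
    "L1": LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5),  # 5-fold tuned L1 penalty
    "RF": RandomForestClassifier(n_estimators=500, min_samples_leaf=50), # illustrative settings
    "NN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),         # illustrative settings
}
for name, model in models.items():
    model.fit(X_dev, y_dev)

# Predicted 30-day mortality risk for a later year's admissions (synthetic here)
X_val = rng.normal(size=(5000, 10))
p_hat = {name: model.predict_proba(X_val)[:, 1] for name, model in models.items()}
```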
We implemented three competing strategies to update the LR, L1, RF, and NN models over time in yearly increments. As a baseline, we retained the original models developed on 2006 admissions and applied these models to all subsequent admissions through 2013. The second updating strategy called for annually refitting each model using all admissions that accrued over the prior 12 months. Hyperparameters for the L1, RF, and NN models were tuned annually using 5-fold cross-validation. Admissions in each year were assigned predicted probabilities based on the prior year's models. The third updating strategy selected the updating approach for each model based on a nonparametric testing procedure that we have developed for this purpose13. A simplified illustration of the testing procedure is presented in Figure 1. The procedure selects between retention of the current model, intercept correction, linear logistic recalibration, flexible logistic recalibration, or model refitting11, 12, 14. Users may specify additional updating approaches for consideration as desired. A two-stage bootstrapping framework accounts for both overfitting and sample size, while avoiding assumptions about the structure of the model's learning algorithm13. In the first bootstrapping stage, updated predictions on the out-of-sample observations are stored to construct a population of pooled holdout predictions. These pooled predictions are leveraged by the second bootstrapping stage to characterize the performance of each updating approach. Paired differences in a user-specified performance metric, in this case the Brier score, between the best-performing updating approach and simpler approaches are evaluated in the decision stage, with the procedure recommending the simplest update exhibiting no statistically significant difference in accuracy. We specified a Type I error rate of 0.05 for these comparisons.
Figure 1.
Simplified illustration of the nonparametric testing procedure used to select updating methods under the data-driven annual updating strategy. Hexagons indicate processes; canisters indicate datasets.
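The decision stage of this procedure can be sketched as follows. The code below is a simplified illustration, not the authors' implementation: it assumes the first bootstrapping stage has already produced pooled holdout predictions for each candidate updating method (a hypothetical dictionary `candidate_preds`, ordered from simplest to most complex) and uses paired bootstrap Brier scores to recommend the simplest method that is not significantly worse than the best-performing one.

```python
import numpy as np

def brier(y, p):
    """Mean squared difference between outcomes and predicted probabilities."""
    return np.mean((p - y) ** 2)

def recommend_update(y, candidate_preds, n_boot=1000, alpha=0.05, seed=0):
    """Recommend the simplest updating method whose paired Brier-score difference
    from the best-performing method is not significant at level alpha.
    `candidate_preds` maps method name -> pooled holdout predictions and is
    insertion-ordered from simplest (e.g., 'no change') to most complex ('refit')."""
    rng = np.random.default_rng(seed)
    names, n = list(candidate_preds), len(y)
    scores = {m: np.empty(n_boot) for m in names}
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # second-stage bootstrap resample
        for m in names:
            scores[m][b] = brier(y[idx], candidate_preds[m][idx])
    best = min(names, key=lambda m: scores[m].mean())
    for m in names:                          # walk from simplest to most complex
        diff = scores[m] - scores[best]      # paired differences vs. the best method
        if np.quantile(diff, alpha) <= 0:    # not significantly worse than the best
            return m
    return best
```

In use, `candidate_preds` might contain entries such as "no change", "intercept correction", "linear recalibration", "flexible recalibration", and "refit", mirroring the candidate methods listed above.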
Updating sequences were retained over multiple years as needed, allowing updates to build on any prior adjustments to the model. For example, a model based initially on Year 0 admissions was applied to Year 1 admissions. At the end of Year 1, the testing procedure recommended either continued use of the existing model, adjustment of the existing model through recalibration, or replacement of the model with a newly refit model. This updated version of the model was used to generate predictions for Year 2. Following Year 2, the testing procedure considered whether any additional updates to the model as adjusted after Year 1 were warranted, not whether to adjust the original Year 0 model. If additional updating was recommended by the test, those changes were applied in addition to the existing Year 1 adjustments. At any point, if the test recommended refitting the model, then all previous models and sequences of adjustments were replaced by a new model moving forward.
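To make this sequential logic concrete, the sketch below chains a linear logistic recalibration from one update cycle with an intercept correction from a later cycle, so that each year's predictions are generated from the cumulatively adjusted model. The helper functions and variable names (e.g., `p_orig_year1`) are illustrative assumptions, not the study's code.

```python
import numpy as np
import statsmodels.api as sm

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def fit_linear_recalibration(y, p):
    """Linear logistic recalibration: logit(Pr[y=1]) = a + b * logit(p)."""
    X = sm.add_constant(logit(p))
    res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    return lambda p_new: res.predict(sm.add_constant(logit(p_new)))

def fit_intercept_correction(y, p):
    """Intercept-only update: re-estimate a with the slope on logit(p) fixed at 1."""
    res = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Binomial(),
                 offset=logit(p)).fit()
    return lambda p_new: res.predict(np.ones((len(p_new), 1)), offset=logit(p_new))

# End of Year 1: the test recommends recalibrating the original model
recal_1 = fit_linear_recalibration(y_year1, p_orig_year1)
# Year 2 predictions come from the adjusted model
p_year2 = recal_1(p_orig_year2)
# End of Year 2: a further intercept correction is layered on the adjusted model
recal_2 = fit_intercept_correction(y_year2, recal_1(p_orig_year2))
# Year 3 predictions compose both adjustments on top of the original model
p_year3 = recal_2(recal_1(p_orig_year3))
```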
We conducted analyses to assess the influence of each updating strategy on the long-term performance of the LR, L1, RF, and NN models, as well as to compare the updating requirements of the different models. Models were assessed for both discrimination (area under the receiver operating characteristic curve, AUC) and calibration (calibration curves, observed-to-expected outcome ratio, Cox intercept and slope, and estimated calibration index)15-17. We evaluated overall performance of the four models under each updating strategy across the entire validation and updating period (2007-2013). We further measured performance over time under each strategy on a monthly basis. To identify any differences in updating requirements, we recorded the recommendations of our nonparametric testing procedure for each of the four modeling methods.
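A minimal sketch of these performance summaries is shown below, assuming arrays `y` of 0/1 outcomes and `p` of predicted probabilities. The Cox intercept and slope are taken from a logistic recalibration regression, and the ECI is approximated here with a simple quadratic logistic calibration curve; the cited implementations may differ in detail (e.g., spline basis and scaling).

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def performance_summary(y, p):
    auc = roc_auc_score(y, p)        # discrimination
    oe = y.mean() / p.mean()         # observed-to-expected outcome ratio
    # Cox recalibration regression: logit(Pr[y=1]) = a + b * logit(p)
    lp = logit(p)
    cox = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    intercept, slope = cox.params
    # ECI: average squared distance between predictions and a flexible calibration
    # curve (here a quadratic logistic curve in logit(p)); some implementations
    # rescale this quantity to a 0-100 range
    X_flex = sm.add_constant(np.column_stack([lp, lp ** 2]))
    flex = sm.GLM(y, X_flex, family=sm.families.Binomial()).fit()
    eci = np.mean((p - flex.predict(X_flex)) ** 2)
    return {"AUC": auc, "O:E": oe, "Cox intercept": intercept,
            "Cox slope": slope, "ECI": eci}
```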
This study was approved by the Institutional Review Board and the Research and Development committee of the Tennessee Valley Healthcare System VA.
Results
This study included data on 1,893,284 admissions to VA facilities nationwide. The initial 2006 models were developed on 235,548 admissions. The validation and updating set, consisting of admissions from 2007 through 2013, included 1,657,736 admissions with a mean of 236,819 per year. Fewer admissions were captured in 2013 because December admissions lacked sufficient follow-up time to ascertain the outcome. The overall 30-day mortality rate was 4.9%. This outcome rate was stable over time, with annual mortality varying between 4.7% and 5.0%.
Test-based recommendations for annual updates of the four models are noted in Table 1. The nonparametric testing procedure recommended model refitting each year for the NN model. The testing procedure recommended flexible logistic recalibration after the first year for the LR and L1 models, as well as linear logistic recalibration for the RF model. For the L1 model, the recalibration adjustments incorporated after the first year were maintained until the 5th year after model development, at which point an additional intercept correction was recommended. The RF model was also updated again after the 3rd year. Continued periodic updating across the study period was recommended for the LR model, with the testing procedure recommending some degree of recalibration every other year.
Table 1.
Annual updating recommendations for each modeling approach
| Update set | LR | L1 | NN | RF |
| --- | --- | --- | --- | --- |
| 2007 admissions | Flexible logistic recalibration | Flexible logistic recalibration | Refit | Linear logistic recalibration |
| 2008 admissions | No change | No change | Refit | No change |
| 2009 admissions | Intercept correction | No change | Refit | Linear logistic recalibration |
| 2010 admissions | No change | No change | Refit | No change |
| 2011 admissions | Intercept correction | Intercept correction | Refit | No change |
| 2012 admissions | No change | No change | Refit | No change |
Performance of the four models under each updating strategy over the entire validation and updating period (2007-2013) is reported in Table 2, and calibration curves are presented in Figure 2. In most cases, discrimination was unchanged by updating, the exception being the NN model, for which refitting (and the test-based strategy) increased the AUC from 0.77 to 0.80. Annually refitting the models improved calibration across the study period compared to the original models (p<0.05). However, for all models, predictions based on test recommendations exhibited improved calibration compared to predictions based on either the original model without updating or the annually refit models (p<0.05); the only exception was the NN model, for which the test-based strategy reduced to refitting. Differences in calibration across updating strategies were highlighted by the calibration curves and were most apparent in the lower-risk portion of the curves, where over 95% of observations occur. For the LR model, the calibration curves under the test-based updating strategy captured more of the ideal 45° calibration line than either the annually refit or original models. For the RF model, none of the updating strategies achieved good calibration across the full range of predicted probabilities, and the calibration curves of all three strategies followed similar patterns; however, both the refitting and test-based updating strategies moved the calibration curve closer to the ideal 45° calibration line in the risk range where most observations fell. In this densely populated risk range, although the magnitude of miscalibration of the L1 model was similar between the refitting and test-based updating strategies, the refitting approach erred toward underprediction while the test-based strategy erred toward overprediction.
Table 2.
Overall performance by modeling approach and annual updating strategy
| Model | Updating strategy | AUC | O:E | Cox intercept | Cox slope | ECI |
| --- | --- | --- | --- | --- | --- | --- |
| LR | No updating | 0.849 [0.847, 0.850] | 0.876 [0.871, 0.882] | -0.227 [-0.242, -0.214] | 0.970 [0.964, 0.976] | 0.029 [0.027, 0.032] |
| LR | Refitting | 0.850 [0.849, 0.851] | 0.981 [0.975, 0.987] | -0.052 [-0.068, -0.038] | 0.987 [0.982, 0.993] | 0.011 [0.010, 0.012] |
| LR | Test-based | 0.849 [0.847, 0.850] | 0.953 [0.947, 0.959] | -0.073 [-0.089, -0.059] | 0.994 [0.987, 1.000] | 0.004 [0.003, 0.005] |
| L1 | No updating | 0.846 [0.845, 0.847] | 0.815 [0.810, 0.821] | -0.221 [-0.237, -0.207] | 1.014 [1.008, 1.021] | 0.038 [0.036, 0.041] |
| L1 | Refitting | 0.846 [0.845, 0.848] | 0.936 [0.930, 0.942] | 0.005 [-0.011, 0.020] | 1.038 [1.032, 1.044] | 0.010 [0.009, 0.012] |
| L1 | Test-based | 0.846 [0.845, 0.847] | 0.937 [0.932, 0.942] | -0.081 [-0.097, -0.066] | 0.999 [0.993, 1.005] | 0.005 [0.005, 0.006] |
| RF | No updating | 0.837 [0.835, 0.838] | 0.842 [0.837, 0.848] | -0.031 [-0.059, -0.006] | 1.080 [1.068, 1.091] | 0.033 [0.031, 0.035] |
| RF | Refitting | 0.837 [0.836, 0.838] | 0.950 [0.943, 0.956] | 0.127 [0.096, 0.153] | 1.082 [1.070, 1.094] | 0.026 [0.024, 0.028] |
| RF | Test-based | 0.837 [0.836, 0.838] | 0.939 [0.933, 0.945] | -0.035 [-0.061, -0.010] | 1.017 [1.006, 1.028] | 0.019 [0.017, 0.021] |
| NN | No updating | 0.770 [0.768, 0.772] | 0.914 [0.908, 0.920] | -0.187 [-0.205, -0.171] | 0.965 [0.959, 0.971] | 0.018 [0.017, 0.020] |
| NN | Refitting | 0.800 [0.798, 0.802] | 0.991 [0.984, 0.997] | -0.104 [-0.122, -0.087] | 0.961 [0.955, 0.967] | 0.004 [0.003, 0.004] |
| NN | Test-based | 0.800 [0.798, 0.802] | 0.991 [0.984, 0.997] | -0.104 [-0.122, -0.087] | 0.961 [0.955, 0.967] | 0.004 [0.003, 0.004] |

AUC: area under the receiver operating characteristic curve; O:E: observed-to-expected outcome ratio; ECI: estimated calibration index. Interval estimates are shown in brackets.
Figure 2.
Overall calibration by modeling approach and annual updating strategy. Left panels display calibration curves across the range of predictions produced by each model; right panels zoom in on calibration curves for predicted probabilities below 30%, which includes over 95% of all observations.
Figure 3 displays monthly calibration of all four models under each updating strategy using the estimated calibration index (ECI). This stringent measure of calibration decreases toward 0 as calibration improves17,18. Without updating, calibration of the LR, L1, and RF models decayed over time. Both refitting and test-based updates improved calibration compared to the original model over the course of the 7 years following initial model development. The RF model was an exception to this pattern. In some time periods, such as 2009-2010, refitting the RF model each year did not improve performance compared to the original model without updating. Although calibration of the NN model was stable over time compared to the other models, annually refitting the NN model improved calibration and reduced month-to-month variability in performance. With the exception of the NN model, monthly ECIs under the test-based updating strategy were generally lower and less variable compared to ECIs under the refitting strategy. For those points at which the nonparametric testing procedure recommended updating, ECIs over the prior 12 months (i.e., performance among those admissions serving as the update set) did not reveal clear patterns that differentiated these timeframes from those for which the testing procedure did not recommend updating. Nevertheless, calibration improved immediately after these updates.
Figure 3.
Calibration over time by modeling approach and updating strategy (smaller values are better). Dotted vertical lines highlight points at which the testing procedure recommended recalibration.
Discussion
We evaluated the impact of three competing updating strategies on performance of models for 30-day mortality after hospital admission over 7 years following initial model development. In addition to common strategies of retaining the original model or routinely refitting the model, we included a new data-driven strategy based on a nonparametric testing procedure for selecting among competing updating methods. This testing procedure is applicable regardless of the learning algorithm underlying the model, allowing our study to compare updating requirements of parallel LR, L1, RF, and NN models.
Updating requirements varied across modeling methods, both in terms of the timing and extent of updates. One year after model development, the nonparametric testing procedure recommended updating all four models. These initial adjustments led to immediate improvements in calibration in the months following the update. Subsequent updating recommendations were varied and less frequent. The adjusted LR model was further recalibrated every other year with additional intercept corrections. The initial recalibration of the L1 model was retained until the intercept was further adjusted for the final two years of the study. The most substantial and frequent updating was recommended for the NN model, which exhibited the least calibration drift over time. The testing procedure recommended refitting each year on the basis of quite small improvements in the Brier score (~0.0001) relative to other updating approaches. Because the Brier score reflects both discrimination and calibration, the improvement in both dimensions of performance that resulted from refitting the NN model may have driven this recommendation. Refitting the other models impacted calibration but did not significantly improve discrimination.
Some form of updating was warranted for all models. Retaining the original model over the course of the study period resulted in inferior calibration compared to routine refitting and test-based updating. Calibration measures of the original NN model did not exhibit significant trends indicative of calibration drift over the course of the study7. Nevertheless, refitting this model each year, either as planned or as recommended by the testing procedure, still improved overall calibration, reduced month-to-month variability in calibration, and improved discrimination. Test-based updating of the other models improved upon the simple refitting strategy. Refitting corrected performance drift in the LR and L1 models; however, test-based updating recommendations resulted in lower ECIs (i.e., better calibration) and less month-to-month variability in performance compared to refitting. Refitting the RF model improved overall calibration compared to the original model, but still resulted in variable calibration over shorter periods and did not correct performance drift over time. On the other hand, the test-based strategy avoided performance drift of the RF model and periods of instability observed under the refitting strategy (e.g., 2012) despite no additional updates being recommended.
In some cases, differences in calibration metrics between updating strategies were small and may not be clinically meaningful in practice. Whether such improvements are clinically meaningful, in addition to being statistically significant, is an important consideration and an open question for model comparison and impact assessment work. Although small in magnitude, the improvements in calibration under the test-based strategy compared to the refitting strategy highlight how recalibration may be sufficient, or even superior, to the standard practice of undertaking more substantial change by refitting. In addition, the impact of recalibration is most likely to be realized when patients are scored near clinically relevant, user-defined cut-points; assessing clinically meaningful risk category reclassification therefore hinges on the proportion of patients near those cut-points whose classification changes as calibration degrades, as sketched below.
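One hedged way to operationalize that check is shown below. The 10% threshold, the neighborhood width, and the `p_before`/`p_after` prediction arrays are illustrative assumptions rather than quantities defined in this study.

```python
import numpy as np

def reclassified_fraction(p_before, p_after, cutoff=0.10):
    """Fraction of patients whose risk classification changes at `cutoff`
    when predictions shift (e.g., before vs. after recalibration)."""
    return np.mean((p_before >= cutoff) != (p_after >= cutoff))

def near_cutpoint_fraction(p, cutoff=0.10, width=0.02):
    """Fraction of patients scored within +/- `width` of the cut-point,
    where recalibration is most likely to change decisions."""
    return np.mean(np.abs(p - cutoff) <= width)
```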
These findings underscore a need for data-driven maintenance plans for clinical prediction models. A "one-size-fits-all" updating strategy will not suffice for all models. We cannot assume a new model built on recent data will be more generalizable to, and perform better in, the next cohort of patients than an existing model, even when large datasets, such as those in this study, are used for updating. Although regularly refitting the RF model improved calibration over the entire study period, the model built on 2008 admissions did not improve upon, and may have actually performed worse than, the original RF model when applied in 2009. Similarly, we should not assume refitting is superior to simpler updating through recalibration. The intermittent recalibrations recommended by the testing procedure led to better performance across the study period than routine refitting, both overall and on a month-to-month basis. Tailoring updating methods through data-driven updating strategies may therefore extend the accuracy and subsequent utility of prediction models beyond what might be achieved through simpler maintenance plans. We note, however, that these results may be sensitive to the volume of data available for updating, and further investigation regarding the impact of sample size is warranted.
Our results also highlight differences in the frequency with which models require updating. Despite being applied to the same data, and therefore being exposed to the same shifts in patient case mix and clinical environment, the LR, L1, RF, and NN models required updating at different time points. With the exception of the NN model, updating on an annual basis was not indicated, and annual refits did not provide additional benefits over less frequent updates. Thus, we may experience inefficiencies under model maintenance plans requiring updates on a pre-planned schedule. On the other hand, prescheduled updating plans may also fail to update models in a timely manner, allowing periods of performance drift to go unnoticed and uncorrected. The cost of interim periods of reduced model accuracy may be difficult to assess, as the prediction errors may impact patient outcomes, user confidence, and clinical efficiency. As health systems seek to implement clinical prediction models more broadly and begin managing many models, additional data-driven methods to determine when models require attention may be necessary and would complement maintenance strategies implementing test-based updating methods.
There are several limitations of the analyses presented here. We evaluated the three updating strategies in one clinical use case and population. Exploring how these updating strategies perform on models subject to different patterns of shifts in the clinical environment would provide more generalizable understanding. In this study, we limited the nonparametric testing procedure to consider five updating methods – retention of the existing model, intercept correction, linear logistic recalibration, flexible logistic recalibration, and model refitting. These updating methods are common and applicable across models; however, additional updating methods, some of which may be specific to certain learning algorithms, could easily be incorporated into the testing procedure13. The availability of additional updating methods may impact when and how the test-based strategy adjusted the models over time. Further, we did not explore the impact of sample size. The volume of data available for constructing updates could have important impacts on both the model refitting and test-based updating strategies. For small samples, overfitting becomes more of a concern for the refitting strategy, while overly conservative updates may be a concern for the test-based strategy. We also acknowledge that the test-based updating strategy may be computationally intensive. Leveraging advances in computational resources and refining the number of bootstrap iterations considered in the first bootstrapping stage may reduce any computational burden. Tailoring the number of bootstrap iterations may also allow users to match statistical significance to clinically relevant magnitudes of change. Finally, all three of the updating strategies considered here may be inappropriate in the presence of significant changes in clinical practice or record systems that may render existing prediction models invalid. Any updating strategy must be flexible, both in terms of timing and approach, in response to such situations.
Conclusion
We illustrated the use of a new data-driven updating strategy for clinical prediction models based on a variety of underlying modeling methods and compared this strategy to two baseline approaches in which models are either never updated or regularly refit on recent observations. The test-based updating strategy conservatively adjusted most models by recommending intermittent recalibration rather than repeated model refitting. Despite making limited adjustments to the models, the test-based updating strategy led to more highly calibrated predictions than either of the baseline strategies. The test-based approach also highlighted differences in the updating requirements of common biostatistical and machine learning models, both in terms of the extent and timing of updates. These results have important implications for the implementation of clinical prediction models and the design of model maintenance plans. As the volume, complexity, and variability of prediction models implemented in health systems grow, data-driven updating policies could support model developers and managers as they endeavor to provide more stable and accurate model performance. In this way, data-driven updating strategies, such as the test-based approach presented here, will become key components of automated surveillance procedures that promote the long-term performance and utility of prediction models underlying a variety of informatics applications for decision support and population management.
Acknowledgements
Financial support for this study was provided by grants from the National Library of Medicine (5T15LM007450); the Veterans Health Administration (VA HSR&D IIR 13-052); and the National Institutes of Health (BCHI-R01-130828).
References
1. Toll DB, Janssen KJ, Vergouwe Y, Moons KG. Validation, updating and impact of clinical prediction rules: a review. J Clin Epidemiol. 2008;61(11):1085-94. doi:10.1016/j.jclinepi.2008.04.008
2. Moons KG, Altman DG, Vergouwe Y, Royston P. Prognosis and prognostic research: application and impact of prognostic models in clinical practice. BMJ. 2009;338:b606. doi:10.1136/bmj.b606
3. Moons KG, Kengne AP, Grobbee DE, Royston P, Vergouwe Y, Altman DG, et al. Risk prediction models: II. External validation, model updating, and impact assessment. Heart. 2012;98(9):691-8. doi:10.1136/heartjnl-2011-301247
4. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer; 2009.
5. Hickey GL, Grant SW, Murphy GJ, Bhabra M, Pagano D, McAllister K, et al. Dynamic trends in cardiac surgery: why the logistic EuroSCORE is no longer suitable for contemporary cardiac surgery and implications for future risk models. Eur J Cardiothorac Surg. 2013;43(6):1146-52. doi:10.1093/ejcts/ezs584
6. Minne L, Eslami S, De Keizer N, De Jonge E, De Rooij SE, Abu-Hanna A. Effect of changes over time in the performance of a customized SAPS-II model on the quality of care assessment. Intensive Care Med. 2012;38(1):40-6. doi:10.1007/s00134-011-2390-2
7. Davis SE, Lasko TA, Chen G, Matheny ME. Calibration drift among regression and machine learning models for hospital mortality. Proceedings of the AMIA Annual Symposium. 2017.
8. Davis SE, Lasko TA, Chen G, Siew ED, Matheny ME. Calibration drift in regression and machine learning models for acute kidney injury. J Am Med Inform Assoc. 2017;24(6):1052-61. doi:10.1093/jamia/ocx030
9. Minne L, Eslami S, de Keizer N, de Jonge E, de Rooij SE, Abu-Hanna A. Statistical process control for monitoring standardized mortality ratios of a classification tree model. Methods Inf Med. 2012;51(4):353-8. doi:10.3414/ME11-02-0044
10. Vergouwe Y, Nieboer D, Oostenbrink R, Debray TP, Murray GD, Kattan MW, et al. A closed testing procedure to select an appropriate method for updating prediction models. Stat Med. 2017;36(28):4529-39. doi:10.1002/sim.7179
11. Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004;23(16):2567-86. doi:10.1002/sim.1844
12. Janssen KJ, Moons KG, Kalkman CJ, Grobbee DE, Vergouwe Y. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol. 2008;61(1):76-86. doi:10.1016/j.jclinepi.2007.04.018
13. Davis SE, Greevy RA, Fonnesbeck C, Lasko TA, Walsh CG, Matheny ME. A nonparametric updating method to correct clinical prediction model drift. J Am Med Inform Assoc. 2019 (Forthcoming). doi:10.1093/jamia/ocz127
14. Dalton JE. Flexible recalibration of binary clinical prediction models. Stat Med. 2013;32(2):282-9. doi:10.1002/sim.5544
15. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-38. doi:10.1097/EDE.0b013e3181c30fb2
16. Nattino G, Finazzi S, Bertolini G. A new calibration test and a reappraisal of the calibration belt for the assessment of prediction models based on dichotomous outcomes. Stat Med. 2014;33(14):2390-407. doi:10.1002/sim.6100
17. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167-76. doi:10.1016/j.jclinepi.2015.12.005
18. Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B. A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform. 2015;54:283-93. doi:10.1016/j.jbi.2014.12.016