Circulation: Cardiovascular Quality and Outcomes. 2021 Aug 3;14(8):e007858. doi: 10.1161/CIRCOUTCOMES.121.007858

External Validations of Cardiovascular Clinical Prediction Models: A Large-Scale Review of the Literature

Benjamin S Wessler 1,2, Jason Nelson 1, Jinny G Park 1, Hannah McGinnes 1, Gaurav Gulati 1,2, Riley Brazil 1, Ben Van Calster 3, David van Klaveren 1,4, Esmee Venema 6,7, Ewout Steyerberg 5,6, Jessica K Paulus 1, David M Kent 1
PMCID: PMC8366535  PMID: 34340529

Supplemental Digital Content is available in the text.

Keywords: calibration, cardiovascular disease, decision making, literature review

Abstract

Background:

There are many clinical prediction models (CPMs) available to inform treatment decisions for patients with cardiovascular disease. However, the extent to which they have been externally tested, and how well they generally perform, has not been broadly evaluated.

Methods:

A SCOPUS citation search was run on March 22, 2017, to identify external validations of cardiovascular CPMs in the Tufts Predictive Analytics and Comparative Effectiveness CPM Registry. We assessed the extent of external validation and the heterogeneity of model performance across databases, and explored factors associated with model performance, including a global assessment of the clinical relatedness between the derivation and validation data.

Results:

We identified 2030 external validations of 1382 CPMs. Eight hundred seven (58%) of the CPMs in the Registry have never been externally validated. On average, there were 1.5 validations per CPM (range, 0–94). The median external validation area under the receiver operating characteristic curve was 0.73 (25th–75th percentile [interquartile range (IQR)], 0.66–0.79), representing a median percent change in discrimination of −11.1% (IQR, −32.4% to +2.7%) compared with performance on derivation data. Of the validations reporting area under the receiver operating characteristic curve, 81% (n=1333) showed discrimination below that reported in the derivation dataset. Overall, 53% (n=983) of the validations reported some measure of CPM calibration. For CPMs evaluated more than once, there was typically a large range of performance. Of 1702 validations classified by relatedness, the median percent change in discrimination was −3.7% (IQR, −13.2 to 3.1) for closely related validations (n=123), −9.0% (IQR, −27.6 to 3.9) for related validations (n=862), and −17.2% (IQR, −42.3 to 0) for distantly related validations (n=717; P<0.001).

Conclusions:

Many published cardiovascular CPMs have never been externally validated, and for those that have, apparent performance during development is often overly optimistic. A single external validation appears insufficient to broadly understand the performance heterogeneity across different settings.


What Is Known

  • There has been a proliferation of clinical prediction models (CPMs) to help risk-stratify patients at risk for cardiovascular disease. Clinically beneficial CPMs will yield accurate predictions for new patients and improve decision-making and clinical outcomes.

What the Study Adds

  • Here, we describe the extent to which CPMs have been validated and how performance varies across settings.

  • We show that many CPMs have never been externally validated; for those that have, performance during model development is often overly optimistic, and isolated validations do not adequately capture CPM performance heterogeneity across different settings.

Clinical prediction models (CPMs) are widely available to inform decisions in cardiovascular medicine. Our own database, the Tufts Predictive Analytics and Comparative Effectiveness (PACE) CPM Registry,1 demonstrates continued growth of prediction models for patients with cardiovascular disease despite apparent substantial redundancy. The growth in the literature reflects the increasing ease with which these models can be developed, given the wide availability of both data and statistical software. Despite the publication of methodologic2 and reporting guidelines3 and a large set of potential performance metrics,4 much remains unknown about the broad performance of these models, including the extent to which they have been validated, how well they validate, and how performance varies from one setting to another.

Although there are various ways to assess the performance of a statistical model,4 clinically beneficial CPMs will yield accurate predictions on new cohorts (external validation)5 and improve decision-making and subsequent clinical outcomes. Despite the increasing number of CPMs in the literature, how models perform generally during external validation, and the determinants of that performance, are largely unknown. Current reporting recommendations reinforce the need for external validation,3 although recent analyses suggest that most CPMs either have not been externally validated6 or have only been validated on a single external cohort.7 CPM discriminatory performance cannot be assumed to be stable (ie, equivalent to model performance at derivation) when tested in new settings.8 Model calibration has been largely neglected, and unless calibration is known to be excellent, CPMs may lead to harm if they are used to inform decisions at certain risk thresholds.9,10

Here, we perform a field synopsis of external validation studies of cardiovascular CPMs reported in a prior systematic review.1 We aimed to describe the extent of external validation and the variation in model performance across databases, and to explore factors associated with worse model performance.

Methods

Cardiovascular CPMs

The cardiovascular CPMs that form the basis of this review are found within the Tufts PACE CPM Registry. This registry represents a field synopsis of prediction models for patients at risk for, and with known, cardiovascular disease. All data and materials for this analysis have been made publicly available and can be accessed at www.pacecpmregistry.org. The search strategy and inclusion criteria have been previously reported.1 Briefly, for inclusion in the Registry, an article must present the development of a cardiovascular CPM, the model must predict a binary clinical outcome, and the model must be presented in a way that allows prediction of outcome risk for a future patient. The search strategy for CPM identification was previously reported1 and is presented in Figure I in the Data Supplement. This analysis included cardiovascular CPMs published from 1990 through March 2015.

External Validation Search

A SCOPUS citation search of these cardiovascular CPMs was conducted on March 22, 2017. Citations were reviewed by 2 members of the study team to identify external validations of CPMs in the Registry. Discrepancies were reviewed by a third member of the research team. Consistent with prior work,6 external validations were defined as any report that claimed to study the CPM for the same outcome as originally reported, but in a nonoverlapping population.

Data Extraction

Information about each CPM/validation pair was extracted, including sample size, continent of study, number of events, and reporting of measures of discrimination and calibration. Assessment of CPM validation performance focused on the change in discrimination (area under the receiver operating characteristic curve [AUROC]) compared with the AUROC seen in the derivation population. We also documented whether validations included any assessment of CPM calibration. There are many methods to assess model calibration, and consensus on best practices has emerged only recently.4,11 Given this lack of consistency and interpretability in the literature, we report whether or not this dimension of performance was assessed during external validation. Calibration assessment included any comparison of observed versus expected outcomes, for example, a Hosmer-Lemeshow statistic or a calibration plot. For this study, we also included measures of calibration-in-the-large, where overall observed event rates are compared with predicted rates.
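For illustration only (the Registry analyses did not use this code), a minimal Python sketch of the 2 simplest calibration checks counted in this review, calibration-in-the-large and a Hosmer-Lemeshow-style comparison of observed versus expected events, assuming a vector of predicted risks p and binary outcomes y:

```python
import numpy as np

def calibration_in_the_large(y, p):
    """Observed event rate minus mean predicted risk (ideal value: 0)."""
    return y.mean() - p.mean()

def hosmer_lemeshow_chi2(y, p, groups=10):
    """Hosmer-Lemeshow chi-square: observed vs expected events by risk decile."""
    order = np.argsort(p)
    chi2 = 0.0
    for idx in np.array_split(order, groups):
        n_g = len(idx)
        observed = y[idx].sum()
        expected = p[idx].sum()
        p_bar = expected / n_g
        chi2 += (observed - expected) ** 2 / (n_g * p_bar * (1 - p_bar))
    return chi2

# Illustrative data: predicted risks from a hypothetical CPM, with outcomes
# simulated from deliberately inflated risks to show miscalibration
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.60, 1000)
y = rng.binomial(1, np.clip(p * 1.2, 0, 1))
print(calibration_in_the_large(y, p))   # positive: the CPM underestimates risk
print(hosmer_lemeshow_chi2(y, p))       # large values signal poor calibration
```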

CPM Performance

Consistent with prior work,12 changes in CPM discrimination from derivation to validation are described on a scale of 0% (no change in discrimination) to −100% (complete loss of discrimination) because this scale more intuitively reflects the true changes in discriminatory power.13 Positive changes represent improvements in discrimination. The percent change in discrimination is calculated as ([validation AUROC − 0.5] − [derivation AUROC − 0.5]) / (derivation AUROC − 0.5) × 100.
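As a worked illustration of this formula (a sketch, not the study code), the following computes the percent change for a hypothetical CPM derived at an AUROC of 0.77 and validated at 0.73:

```python
def percent_change_in_discrimination(derivation_auroc, validation_auroc):
    """Percent change on a scale where AUROC 0.5 (chance) marks a complete
    (-100%) loss of discrimination."""
    return (validation_auroc - derivation_auroc) / (derivation_auroc - 0.5) * 100

# A CPM derived at AUROC 0.77 that validates at AUROC 0.73 loses about
# 14.8% of its discriminatory power
print(percent_change_in_discrimination(0.77, 0.73))
```

Note that the reported median change of −11.1% was computed per CPM/validation pair, not from the median AUROCs.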

Population Relatedness

To explore potential explanations for decreased performance on validation data sets, we assessed the similarity between the derivation and validation populations by creating detailed relatedness rubrics for the 10 index conditions with the greatest number of CPMs (Table I in the Data Supplement). These rubrics were created by investigators with expertise in these clinical areas. Relatedness was assessed for each CPM/validation pair to divide validation databases into 3 categories: closely related, related, and distantly related. A fourth category, no match, was assigned to validations that were excluded from the analysis because they were not clinically appropriate matches (eg, a CPM validated on a population with a nonoverlapping index condition or outcome). Generally, the relatedness rubrics were based on 5 domains: (1) recruitment setting (eg, outpatient versus emergency room versus inpatient), (2) major inclusion/exclusion criteria, (3) intervention type (eg, percutaneous coronary intervention versus thrombolysis for acute myocardial infarction), (4) therapeutic era, and (5) follow-up time. Two clinicians reviewed these domains for each CPM/validation match and assigned a relatedness category. Nonrandom split-sample validations were labeled as closely related validations. Discrepancies were reviewed by the study team to arrive at a consensus.

Factors Associated With CPM External Validation

We identified a set of study-level factors to evaluate associations with whether or not a CPM was externally validated. These factors were identified based on observed methodologic and reporting patterns as well as prior literature.8 These factors included: index clinical condition, whether internal validation was performed, year of publication (divided here as before 2004, 2004–2009, 2009–2012, and after 2012), continent of origin, study design (eg, clinical trial versus medical record), sample size, number of events, number of predictors, prediction time horizon (<30 days, 30–365 days, >365 days), regression method (eg, logistic regression versus Cox regression), and reporting of discrimination or calibration. We analyzed unadjusted associations and used multivariable logistic regression to assess whether these variables were associated with CPM external validation.

Factors Associated With Poor Performance

A set of study-level factors defined a priori were evaluated for association with worse CPM performance (discrimination) during validation. These factors included: population relatedness (here, dichotomized as distantly related versus other), presence of overlapping authors, reporting in the same or a different article, CPM modeling method, CPM data source, validation data source, outcome rate difference between derivation and validation data (defined as > versus ≤40%), and CPM events per included variable (EPV). We used generalized estimating equations (GEE)14,15 with a robust covariance estimator to assess the multivariable association with the observed change in discrimination, taking into account the correlation between validations of the same CPM. Multiple imputation with 20 imputed data sets was used to account for missingness. These analyses estimated the absolute difference in the estimated percent change in the C statistic from derivation to validation populations, as calculated above. All statistical analyses were performed using SAS Enterprise Guide version 8.2 (SAS Institute, Inc, Cary, NC).
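As an illustrative analogue of this approach (the original analyses were performed in SAS, and the data and variable names below are hypothetical), a GEE with an exchangeable working correlation and robust standard errors can be fit in Python as follows:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated stand-in data: one row per CPM/validation pair (values illustrative)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "cpm_id": rng.integers(0, 40, n),            # cluster: validations of the same CPM
    "distantly_related": rng.integers(0, 2, n),  # 1 = distantly related validation cohort
})
df["pct_change_auc"] = -5 - 10 * df["distantly_related"] + rng.normal(0, 15, n)

# GEE with an exchangeable working correlation; statsmodels reports robust
# (sandwich) standard errors by default, accounting for clustering by CPM
model = smf.gee(
    "pct_change_auc ~ distantly_related",
    groups="cpm_id",
    data=df,
    family=sm.families.Gaussian(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
print(model.fit().summary())
```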

Results

Overview of Validations

The Registry includes 1382 CPMs for cardiovascular disease, and the citation search of these CPMs identified 54 086 citations that were screened (Figure 1). This screening yielded 14 615 abstracts, which in turn identified 6039 full-text articles for review. A total of 2030 external validations were extracted from 413 articles. Only 575 (42%) of the CPMs in the Registry have ever been validated (Table 1). On average, there were 1.5 validations per de novo CPM, with a very skewed distribution; the Logistic European System for Cardiac Operative Risk Evaluation16 alone has been externally validated 94×. For this analysis, we included 1846 validations of 556 CPMs after exclusion of 19 decision trees and 156 validations performed on unrelated samples (ie, populations with different index conditions or nonoverlapping outcomes). The median external validation sample size was 861 (25th–75th percentile [interquartile range (IQR)], 326–3306), and the median number of outcome events was 68 (IQR, 29–192; Table 2).

Figure 1.

Flowchart of external validation review process.

Table 1.

De Novo Models Summary

Table 2.

External Validations Summary

CPM Validation Discrimination

Overall, 91.3% (n=1685) of the external validations report AUROC. The median derivation AUROC was 0.77 (IQR, 0.73–0.82). The median external validation AUROC was 0.73 (IQR, 0.66–0.79), representing a median percent change in discrimination of −11.1% (IQR, −32.4% to +2.7%; Table 2). Of the validations with decreased performance (n=795), 25% (n=195) had <10% decrement in discrimination. Two percent (n=35) had >80% drop in discrimination; 19% (n=352) of model validations showed CPM discrimination at or above the performance reported in the derivation dataset.

CPM Calibration

In total, 53% (n=983) of the validations report some measure of CPM calibration. The Hosmer-Lemeshow test of goodness-of-fit was most commonly reported (30%, n=555), followed by calibration-in-the-large (26%, n=488) and calibration plots (22%, n=399; Table 2). Overall, there was no externally assessed calibration information available for 86% (n=1182) of the CPMs in the Registry.

Clinical Domains

The 10 conditions with the most CPM validations comprised 92% (1702/1846) of the total validations included in this analysis (Table 3). The condition with the largest number of validations was stroke (299 validations performed on 104 CPMs). There were a total of 286 validations of 87 CPMs for populations at risk for developing cardiovascular disease (population samples) and 286 validations of 52 CPMs for cardiac surgery. Only 5 index conditions had ≥50% of available CPMs externally validated (arrhythmias [81%], valve disease [62%], venous thromboembolism [53%], cardiac surgery [51%], and aortic diseases [50%]). There was an extreme range of CPM performance and a consistent loss of discriminatory performance during external validation (Figure 2, Table 3). These observations were apparent for all conditions that were studied (condition-specific waterfall analyses are shown in Figure II in the Data Supplement).

Table 3.

Conditions With the Most External Validations (Top 10)

Figure 2.

Waterfall plot depicting the percent change in the C statistic in related (related and closely related) validations (in blue) and distantly related validations (in orange). The plot comprises horizontal lines representing a total of 1701 validations that present a C statistic that can be compared with the development C statistic. Vertical lines show that the median decrement in discrimination was more pronounced in the distantly related models than in the related models.

Relatedness

Relatedness was assigned to each of the 1702 CPM/validation pairs for the top 10 index conditions. Of these, 123 (7%) of the validations were performed on closely related populations, 862 (51%) were performed on related populations, and 717 (42%) were performed on distantly related populations (Table 2). The median AUROC was 0.78 (IQR, 0.719–0.841) for closely related validations, 0.75 (IQR, 0.68–0.803) for related validations, and 0.70 (IQR, 0.64–0.77) for distantly related validations (P<0.001). Overall, the median percent change in discrimination was −3.7% (IQR, −13.2 to 3.1) for closely related validations, −9.0% (IQR, −27.6 to 3.9) for related validations, and −17.2% (IQR, −42.3 to 0) for distantly related validations (P<0.001).

Range of Performance for Individual CPMs

Table 4 shows the variation in performance across the 10 CPMs16,18–26 that were validated most frequently. Uniformly, there was a substantial range in performance of each CPM across datasets, from virtually useless to excellent. For example, discrimination for the Logistic European System for Cardiac Operative Risk Evaluation (validated 94×) ranged from 0.48 to 0.90 across different databases. None of these highly cited (and validated) CPMs had consistently good discrimination across validation databases.

Table 4.

Top 10 Most Validated CPMs

Predictors of External Validation

Study features associated with CPM external validation (yes/no) are shown in Table II in the Data Supplement. The index condition was strongly associated with subsequent external validation. Models that were internally validated and models that were published more recently were less likely to be externally validated. Sample size, number of predictors, and reporting of discrimination or calibration were positively associated with external validation. On multivariable analysis, these predictors remained associated with CPM external validation. Study design, prediction time horizon, and regression method did not appear to be associated with whether a model was externally validated.

Predictors of Poor Performance

Predictors of CPM validation performance are shown in Table 5. On univariate analysis, population relatedness was significantly associated with CPM discrimination in validations. When CPMs were tested on distantly related cohorts, the change in AUROC was −15.6% (95% CI, −22.0 to −9.1) compared with the reference (validations done on closely related cohorts). When evaluated in a multivariable model, population relatedness remained significantly associated with CPM discrimination in validations (−9.8% [95% CI, −18.8 to −0.8]). We also observed that validations demonstrated AUROCs that were 9.8% (95% CI, 5.4–14.2) higher when reported in the same article (with the same authors) as the de novo CPM report compared with validations reported in different articles with nonoverlapping authors. There was a trend toward higher AUROC (+7.3% [95% CI, −1.2 to 15.8], P=0.09) when validations were reported by overlapping authors in a subsequent publication (compared with reports by nonoverlapping authors).

Table 5.

Predictors of Worse Discrimination: Variable Distributions and GEE Model Results

Discussion

Our Tufts PACE CPM Registry documents the tremendous proliferation and redundancy of CPMs being developed and published. The review reported here underscores that this proliferation is occurring without adequate, or even minimal, external evaluation. Approximately 60% of published CPMs have never been externally validated, and approximately half of the CPMs that have been validated were validated only once. A small minority of models have been validated numerous times. The value of single validations is unclear because there is substantial performance heterogeneity, and good (or poor) performance on a single validation does not appear to reliably forecast performance on subsequent validations. No CPM showed consistently good discrimination across multiple validation databases. For example, the 10 most validated CPMs have each been validated >20×; all show substantial variation in discrimination across these validation studies, from virtually useless (ie, C statistic ≈0.5) to very good (C statistic ≈0.8 or higher). This demonstrates the difficulty of defining the quality of a model generically because performance depends greatly on the characteristics of the database on which a model is tested. These findings underscore recent calls for a fundamental paradigm shift in how models are assessed for validity and utility7 and for more robust stewardship of algorithms for health care.27

The majority of cardiovascular CPMs in our Registry have never been externally validated. This finding mirrors observations made in a previous assessment of primary prevention models8,28 and broadly suggests that cardiovascular clinicians should be skeptical about the accuracy of individual risk estimates. In our Registry, model-level predictors associated with subsequent external validation included the disease being studied, larger sample size, higher outcome rates, and whether discrimination or calibration was reported in the original presentation. Older CPMs were generally more likely to be externally validated, an observation that may reflect insufficient time for validation of more recently published CPMs. Given the extreme redundancy of CPMs and the relative scarcity of external validations, it seems reasonable to prioritize the study of existing cardiovascular CPMs (as opposed to developing new ones) and how these might be optimized for clinical use.

Although this review focuses on external validations, this emphasis does not imply that internal validation is unimportant. Internal and external validation provide different information. Internal validation is especially important when the sample size and the number of outcomes are relatively small for the complexity of the model-building procedure; in such cases, the reported apparent performance is likely over-optimistic. External validation provides information about the transportability of the model to other settings and across time, and about how robust predictions are to distributional shifts in the data. Combined internal-external validation procedures may represent best practice for broadly understanding CPM performance.29 Yet for those charged with deciding whether a given model is deployed in clinical practice, understanding how a model performs in the local setting may be most important. Our work suggests that this may be difficult to discern from the literature, especially if the target of inference is a setting other than where a model was validated.
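As a sketch of the internal-external idea under simulated data (not drawn from the Registry; all names and values here are illustrative), each cohort is held out in turn while the model is derived on the remaining cohorts:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Simulated multi-cohort data (illustrative): 5 cohorts, 3 predictors
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
cohort = rng.integers(0, 5, 1000)
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))

# Internal-external validation: derive on all cohorts but one, test on the
# held-out cohort, and cycle through every cohort
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=cohort):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
    print(f"held-out cohort {cohort[test_idx][0]}: AUROC = {auc:.2f}")
```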

It was common to observe substantial decrements in discrimination during validation. This finding is consistent with prior reports showing that CPM discriminatory ability at validation is highly variable and often worse than anticipated (when compared with performance on the derivation database).6,8 There are several potential reasons why model performance might decrease, including model invalidity (eg, due to over-fitting on the derivation population) and a change in case mix.5 Model invalidity might be expected to be more pronounced when models are evaluated in populations that are dissimilar to the derivation population. We found that models had a substantially larger decrease in discriminatory performance when tested on distantly related populations compared with either related or closely related populations. However, judging the relatedness of populations is laborious and requires substantial clinical expertise. Differences that may appear subtle can be very influential. For example, a CPM developed on patients in the emergency room might not be expected to have similar discriminatory performance if the validation cohort includes only patients admitted to the hospital since, as in the case of many acute cardiac syndromes, care30 and outcome predictors31 differ very early in the disease course. Similarly, changes in treatments received (eg, different acute coronary syndrome revascularization approaches32 or stent types33) or in outcome definitions34,35 likely impact model validation performance. If a model was derived on patients receiving lytic therapy and validated using data from a more contemporary percutaneous coronary intervention trial, it should not be surprising that model performance appears worse than expected. Other study-level characteristics we examined, apart from relatedness, did not appear to greatly influence model performance.

One of the most striking observations of this work is that isolated validations appear insufficient for understanding the performance of CPMs when tested in new populations. There was often an extreme range in performance for CPMs evaluated in multiple databases, an observation that calls into question the generalizability of any one validation result. These data challenge the current approach in which a model might be evaluated on a single external population and then declared a validated prediction model that is ready for use. Even when a model performs well using statistical criteria, it is unclear whether such a model improves decision-making when used on a closely related population. Further, good statistical performance on one external database does not guarantee good statistical performance in another setting, such as where a CPM is eventually used to support care. Our analysis provides no evidence that so-called validated CPMs that have been integrated into clinical practice guidelines36,37 should be accepted as trustworthy unless their performance is specifically known to be excellent in populations like those being treated. Although having a single CPM that is accepted by the clinical community and promoted in guidelines is appealing as a means of standardizing practice across a range of different settings, the degree of variation seen in our review suggests that this paradigm may result in substantial variation in performance across different settings and poor performance in some settings. Testing CPMs for improved decision-making and better clinical outcomes (eg, in a cluster-randomized trial38) is rarely performed before dissemination into practice. Novel paradigms that emphasize improving the accuracy of model performance in local populations, through continual recalibration and updating, are appealing and deserve further consideration.

There are several potential reasons why external validations of prediction models are so rare. First, model developers typically exhaust their data deriving (and sometimes internally validating) their model and may not have additional data sources. Second, informally, there appear to be much stronger academic incentives for the development of new models than for the validation of previously published models. Third, there is limited appreciation that it is informative to test and retest a validated model on new data to understand how robust predictions are to distributional shifts over time and across settings; this is supported indirectly by the observation that internally validated models appear to be less likely to be externally validated than other models. Finally, although methodologic and reporting standards for predictive modeling have been published and adopted,2,3 there remain few standards for how best to conduct and report validations of existing models.

Our review has several limitations. First, the review was limited by the information collected and presented in the original articles. We relied on changes in discrimination largely because CPM calibration is woefully underassessed. Only 62% of models in the CPM Registry have had calibration formally assessed in an external population; even among the models that were validated, only 48% report any calibration. Moreover, even when calibration is reported, it is usually reported in a form that is not clinically interpretable (eg, as a Hosmer-Lemeshow statistic4,13) or graphically (which can be summarized by the calibration slope [ideal: 1] and systematic under- or overestimation [intercept ideally 0]). Some less frequently used metrics, such as the integrated calibration index,39 may help compare performance across multiple validations. Decrements in calibration may be as serious as, or even more serious than, decrements in discrimination because miscalibrated models yield misinformation that may cause harmful decision-making.9 Ideally, we would be able to evaluate the net benefit of model use, which integrates discrimination, calibration, and relative utility to compare the value of prediction-based decision-making with the best one-size-fits-all strategies.4,40 Such evaluations would have required individual patient data because these approaches are so rarely used in the published literature. Similarly, we could not assess how much of the decrement in discrimination was due to differences in case mix rather than model invalidity, which would also have required evaluation of patient-level data.41 Finally, our systematic review does not include validations published after 2017, owing to the enormous scope of this literature, the lack of efficient search strategies, and the laborious nature of comprehensive data extraction and relatedness evaluation. We do not anticipate that the more recent literature would substantially change our findings. Maintenance and continual updating of this registry will require a semiautomated approach heavily reliant on natural language processing.42
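For reference, net benefit at a given risk threshold can be computed from predictions and outcomes alone; the sketch below (illustrative data, not Registry data) follows the decision curve analysis formulation of Vickers and Elkin40:

```python
import numpy as np

def net_benefit(y, p, threshold):
    """Net benefit of treating patients whose predicted risk exceeds the
    threshold, weighting false positives by the odds of that threshold."""
    treat = p >= threshold
    n = len(y)
    true_pos = np.sum(treat & (y == 1)) / n
    false_pos = np.sum(treat & (y == 0)) / n
    return true_pos - false_pos * threshold / (1 - threshold)

# Illustrative comparison at a 20% risk threshold: model-guided vs treat-all
rng = np.random.default_rng(2)
p = rng.uniform(0, 0.6, 1000)    # hypothetical predicted risks
y = rng.binomial(1, p)           # outcomes simulated to match the risks
print(net_benefit(y, p, 0.20))                  # model-based strategy
print(net_benefit(y, np.ones_like(p), 0.20))    # treat-all benchmark
```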

Conclusions

Many published cardiovascular CPMs have never been externally validated, and for those that have, it is common to see significant performance heterogeneity and marked decreases in discriminatory performance compared with the model development phase. Calibration has been widely underassessed, and single validations do not sufficiently capture CPM performance. Granular information about population relatedness is associated with CPM performance in external validations, and when CPMs are tested on distantly related populations, model performance is often substantially worse than expected. This review raises substantial concerns about the current approach to validating cardiovascular CPMs and underscores the need for a radical rethinking of how performance heterogeneity is explored and quantified (eg, through multiple validations across various practice settings) and of how models are evaluated for clinical use.

Acknowledgments

We wish to acknowledge the contributions of Vandan Patel for his work on the relatedness effort.

Sources of Funding

Research reported in this work was funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1606-35555). The views, statements, opinions presented in this work are solely the responsibility of the author(s) and do not necessarily represent the views of the PCORI, its Board of Governors, or Methodology Committee. Dr Wessler is supported by K23AG055667 from National Institutes of Health (NIH)–National Institute on Aging (NIA) and R03AG056447 from NIH-NIA.

Disclosures

None.

Supplemental Materials

Figures I and II

Tables I and II


Nonstandard Abbreviations and Acronyms

AUROC: area under the receiver operating characteristic curve
CPM: clinical prediction model
EPV: events per included variable
IQR: interquartile range
PACE: Predictive Analytics and Comparative Effectiveness Center


Contributor Information

Jason Nelson, Email: jnelson2@tuftsmedicalcenter.org.

Jinny G. Park, Email: jpark4@tuftsmedicalcenter.org.

Hannah McGinnes, Email: hlmcginnes@gmail.com.

Gaurav Gulati, Email: ggulati@tuftsmedicalcenter.org.

Riley Brazil, Email: Riley.Brazil@tufts.edu.

Ben Van Calster, Email: ben.vancalster@kuleuven.be.

David van Klaveren, Email: d.vanklaveren@erasmusmc.nl.

Esmee Venema, Email: e.venema@erasmusmc.nl.

Ewout Steyerberg, Email: e.steyerberg@erasmusmc.nl.

Jessica K. Paulus, Email: jess.paulus@gmail.com.

David M. Kent, Email: dkent1@tuftsmedicalcenter.org.

References

  • 1. Wessler BS, Paulus J, Lundquist CM, Ajlan M, Natto Z, Janes WA, Jethmalani N, Raman G, Lutz JS, Kent DM. Tufts PACE Clinical Predictive Model Registry: update 1990 through 2015. Diagn Progn Res. 2017;1:20. doi: 10.1186/s41512-017-0021-2
  • 2. Steyerberg EW, Moons KG, van der Windt DA, Hayden JA, Perel P, Schroter S, Riley RD, Hemingway H, Altman DG; PROGRESS Group. Prognosis Research Strategy (PROGRESS) 3: prognostic model research. PLoS Med. 2013;10:e1001381. doi: 10.1371/journal.pmed.1001381
  • 3. Collins GS, Reitsma JB, Altman DG, Moons KG; TRIPOD Group. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Circulation. 2015;131:211–219. doi: 10.1161/CIRCULATIONAHA.114.014508
  • 4. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–138. doi: 10.1097/EDE.0b013e3181c30fb2
  • 5. Vergouwe Y, Moons KG, Steyerberg EW. External validity of risk models: use of benchmark values to disentangle a case-mix effect from incorrect coefficients. Am J Epidemiol. 2010;172:971–980. doi: 10.1093/aje/kwq223
  • 6. Siontis GC, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015;68:25–34. doi: 10.1016/j.jclinepi.2014.09.007
  • 7. Adibi A, Sadatsafavi M, Ioannidis JPA. Validation and utility testing of clinical prediction models: time to change the approach. JAMA. 2020;324:235–236. doi: 10.1001/jama.2020.1230
  • 8. Damen JA, Hooft L, Schuit E, Debray TP, Collins GS, Tzoulaki I, Lassale CM, Siontis GC, Chiocchia V, Roberts C, et al. Prediction models for cardiovascular disease risk in the general population: systematic review. BMJ. 2016;353:i2416. doi: 10.1136/bmj.i2416
  • 9. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW; Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17:230. doi: 10.1186/s12916-019-1466-7
  • 10. Van Calster B, Vickers AJ. Calibration of risk prediction models. Med Decis Making. 2015;35:162–169. doi: 10.1177/0272989X14547233
  • 11. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167–176. doi: 10.1016/j.jclinepi.2015.12.005
  • 12. Wessler BS, Lundquist CM, Koethe B, Park JG, Brown K, Williamson T, Ajlan M, Natto Z, Lutz JS, Paulus JK, et al. Clinical prediction models for valvular heart disease. J Am Heart Assoc. 2019;8:e011972. doi: 10.1161/JAHA.119.011972
  • 13. Harrell FE. Regression Modeling Strategies. Springer International Publishing; 2015.
  • 14. Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986;42:121–130.
  • 15. Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44:1049–1060.
  • 16. Roques F, Michel P, Goldstone AR, Nashef SA. The logistic EuroSCORE. Eur Heart J. 2003;24:881–882. doi: 10.1016/s0195-668x(02)00799-6
  • 17. Steyerberg EW. Clinical Prediction Models. Springer New York; 2009.
  • 18. Nashef SA, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R. European system for cardiac operative risk evaluation (EuroSCORE). Eur J Cardiothorac Surg. 1999;16:9–13. doi: 10.1016/s1010-7940(99)00134-7
  • 19. Nashef SA, Roques F, Sharples LD, Nilsson J, Smith C, Goldstone AR, Lockowandt U. EuroSCORE II. Eur J Cardiothorac Surg. 2012;41:734–744, discussion 744. doi: 10.1093/ejcts/ezs043
  • 20. Granger CB, Goldberg RJ, Dabbous O, Pieper KS, Eagle KA, Cannon CP, Van De Werf F, Avezum A, Goodman SG, Flather MD, et al; Global Registry of Acute Coronary Events Investigators. Predictors of hospital mortality in the global registry of acute coronary events. Arch Intern Med. 2003;163:2345–2353. doi: 10.1001/archinte.163.19.2345
  • 21. O’Brien SM, Shahian DM, Filardo G, Ferraris VA, Haan CK, Rich JB, Normand SL, DeLong ER, Shewan CM, Dokholyan RS, et al; Society of Thoracic Surgeons Quality Measurement Task Force. The Society of Thoracic Surgeons 2008 cardiac surgery risk models: part 2–isolated valve surgery. Ann Thorac Surg. 2009;88(1 Suppl):S23–S42. doi: 10.1016/j.athoracsur.2009.05.056
  • 22. Lip GY, Nieuwlaat R, Pisters R, Lane DA, Crijns HJ. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest. 2010;137:263–272. doi: 10.1378/chest.09-1584
  • 23. Gage BF, Waterman AD, Shannon W, Boechler M, Rich MW, Radford MJ. Validation of clinical classification schemes for predicting stroke: results from the National Registry of Atrial Fibrillation. JAMA. 2001;285:2864–2870. doi: 10.1001/jama.285.22.2864
  • 24. Wilson PW, D’Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97:1837–1847. doi: 10.1161/01.cir.97.18.1837
  • 25. Hemphill JC 3rd, Bonovich DC, Besmertis L, Manley GT, Johnston SC. The ICH score: a simple, reliable grading scale for intracerebral hemorrhage. Stroke. 2001;32:891–897. doi: 10.1161/01.str.32.4.891
  • 26. Ranucci M, Castelvecchio S, Menicanti L, Frigiola A, Pelissero G. Risk of assessing mortality risk in elective cardiac operations: age, creatinine, ejection fraction, and the law of parsimony. Circulation. 2009;119:3053–3061. doi: 10.1161/CIRCULATIONAHA.108.842393
  • 27. Eaneff S, Obermeyer Z, Butte AJ. The case for algorithmic stewardship for artificial intelligence and machine learning technologies. JAMA. 2020;324:1397–1398. doi: 10.1001/jama.2020.9371
  • 28. Ban JW, Stevens R, Perera R. Predictors for independent external validation of cardiovascular risk clinical prediction rules: Cox proportional hazards regression analyses. Diagn Progn Res. 2018;2:3. doi: 10.1186/s41512-018-0025-6
  • 29. Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245–247. doi: 10.1016/j.jclinepi.2015.04.005
  • 30. Collins SP, Levy PD, Lindsell CJ, Pang PS, Storrow AB, Miller CD, Naftilan AJ, Thohan V, Abraham WT, Hiestand B, et al. The rationale for an acute heart failure syndromes clinical trials network. J Card Fail. 2009;15:467–474. doi: 10.1016/j.cardfail.2008.12.013
  • 31. Karam N, Bataille S, Marijon E, Tafflet M, Benamer H, Caussin C, Garot P, Juliard JM, Pires V, Boche T, et al; e-MUST Study Investigators. Incidence, mortality, and outcome-predictors of sudden cardiac arrest complicating myocardial infarction prior to hospital admission. Circ Cardiovasc Interv. 2019;12:e007081. doi: 10.1161/CIRCINTERVENTIONS.118.007081
  • 32. Mehta SR, Wood DA, Storey RF, Mehran R, Bainey KR, Nguyen H, Meeks B, Di Pasquale G, López-Sendón J, Faxon DP, et al; COMPLETE Trial Steering Committee and Investigators. Complete revascularization with multivessel PCI for myocardial infarction. N Engl J Med. 2019;381:1411–1421. doi: 10.1056/NEJMoa1907775
  • 33. Piccolo R, Bonaa KH, Efthimiou O, Varenne O, Baldo A, Urban P, Kaiser C, Remkes W, Räber L, de Belder A, et al; Coronary Stent Trialists’ Collaboration. Drug-eluting or bare-metal stents for percutaneous coronary intervention: a systematic review and individual patient data meta-analysis of randomised clinical trials. Lancet. 2019;393:2503–2510. doi: 10.1016/S0140-6736(19)30474-X
  • 34. Kip KE, Hollabaugh K, Marroquin OC, Williams DO. The problem with composite end points in cardiovascular studies: the story of major adverse cardiac events and percutaneous coronary intervention. J Am Coll Cardiol. 2008;51:701–707. doi: 10.1016/j.jacc.2007.10.034
  • 35. Mehran R, Rao SV, Bhatt DL, Gibson CM, Caixeta A, Eikelboom J, Kaul S, Wiviott SD, Menon V, Nikolsky E, et al. Standardized bleeding definitions for cardiovascular clinical trials: a consensus report from the Bleeding Academic Research Consortium. Circulation. 2011;123:2736–2747. doi: 10.1161/CIRCULATIONAHA.110.009449
  • 36. Goff DC Jr, Lloyd-Jones DM, Bennett G, Coady S, D’Agostino RB Sr, Gibbons R, Greenland P, Lackland DT, Levy D, O’Donnell CJ, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am Coll Cardiol. 2014;63(25 Pt B):2935–2959. doi: 10.1016/j.jacc.2013.11.005
  • 37. Yancy CW, Jessup M, Bozkurt B, Butler J, Casey DE Jr, Drazner MH, Fonarow GC, Geraci SA, Horwich T, Januzzi JL, et al. 2013 ACCF/AHA guideline for the management of heart failure: executive summary: a report of the American College of Cardiology Foundation/American Heart Association Task Force on practice guidelines. Circulation. 2013;128:1810–1852. doi: 10.1161/CIR.0b013e31829e8807
  • 38. Chew DP, Hyun K, Morton E, Horsfall M, Hillis GS, Chow CK, Quinn S, D’Souza M, Yan AT, Gale CP, et al. Objective risk assessment vs standard care for acute coronary syndromes: a randomized clinical trial. JAMA Cardiol. 2021;6:304–313. doi: 10.1001/jamacardio.2020.6314
  • 39. Austin PC, Steyerberg EW. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med. 2019;38:4051–4065. doi: 10.1002/sim.8281
  • 40. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26:565–574. doi: 10.1177/0272989X06295361
  • 41. van Klaveren D, Gönen M, Steyerberg EW, Vergouwe Y. A new concordance measure for risk prediction models in external validation settings. Stat Med. 2016;35:4136–4152. doi: 10.1002/sim.6997
  • 42. Marshall IJ, Wallace BC. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst Rev. 2019;8:163. doi: 10.1186/s13643-019-1074-9
