Introduction
The prediction of a patient’s risk for disease, death, or morbidity has become an ever more essential part of modern clinical care. Guidelines for clinical management are often dependent on risk estimates. As such, it is imperative that surgeons understand how to properly use and evaluate risk estimates and the models that underlie them. Making matters more complex, an expanding number of risk prediction models are now available to cardiothoracic surgeons.[1–4] Table 1 displays a few high-profile examples along with some key measures of model performance, which we will describe in the sections below. This review is intended to be a practical guide for clinicians without any specialized training in statistics. The examples and literature herein focus on models a cardiothoracic surgeon is most likely to encounter and the statistical procedures used to validate them. Our purpose is to provide guidance to cardiothoracic surgeons on the evaluation of risk models and their application in clinical decision-making. The use of risk models for quality assessment, public reporting, value-based purchasing, and cost-effectiveness is well covered elsewhere and beyond the scope of this manuscript.[5, 6]
Table 1.
Common models in cardiothoracic surgery
Model | Model development population | Validation Technique | Reported Predictive Accuracy^a (95% Confidence Interval) |
---|---|---|---|
STS CABG mortality[43] | Predominantly Caucasian males aged 55–74 undergoing CABG | Internal (split sample) | 81% (no CI reported) |
ACS NSQIP mortality[31] | Predominantly ASA Class 1 females (59.4% general surgery, 1.1% thoracic, 0.8% cardiac) | Internal (split sample) | 91% (no CI reported) |
Mayo model for lung cancer[20] | Pulmonary clinic - Predominantly smokers with lung nodules | External | 83% (81% to 85%) |
TREAT model for lung cancer[13] | Surgical clinic - Predominantly Caucasian smokers evaluated for surgery | External | 87% (83% to 92%) |
Tammemagi model for lung cancer screening[44] | Predominantly male Caucasian smokers in their 60s | Internal (split sample) | 80% (78% to 81%) |
STS, Society of Thoracic Surgeons
CABG, coronary artery bypass graft
CI, confidence interval
ACS, American College of Surgeons
NSQIP, National Surgical Quality Improvement Program
ASA, American Society of Anesthesiologists
SE, standard error
TREAT, Thoracic disease, Research, Epidemiology, Assessment and Treatment
^a Assessed by area under the ROC curve (AUC)
What is a prediction model?
A prediction model is a mathematical model that does at least one of the following: (1) classifies individuals into groups (e.g., disease or no disease, dead or alive, major morbidity or not), (2) ranks individuals based on their likelihood to be in a group, or (3) provides an estimate of the probability of being in a certain group. There are many different statistical approaches to prediction, each with its own advantages and disadvantages. Examples include logistic regression models, discriminant analysis, classification and regression trees, random forests, penalized regression, neural networks, and deep learning. We will not delve into these different approaches here. Rather, we will focus on how to evaluate the appropriateness of a given model, or algorithm, for use in a clinical setting.
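For readers who want a concrete picture of these three outputs, the short Python sketch below fits a logistic regression model to simulated data and produces a classification, a ranking, and an estimated probability. The variable names and simulated data are purely hypothetical and stand in for any set of clinical predictors; this is an illustration of the concept, not any of the published models discussed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: age and a lab value predicting a binary outcome
n = 500
age = rng.normal(65, 10, n)
lab = rng.normal(1.0, 0.3, n)
logit = -12 + 0.15 * age + 2.0 * lab            # assumed "true" relationship
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # observed outcome (0/1)

X = np.column_stack([age, lab])
model = LogisticRegression(max_iter=1000).fit(X, y)

new_patient = np.array([[72, 1.4]])              # hypothetical new patient
prob = model.predict_proba(new_patient)[0, 1]    # (3) estimated probability ("risk score")
label = model.predict(new_patient)[0]            # (1) classification at the default 0.5 cut-off
ranking = model.predict_proba(X)[:, 1].argsort() # (2) patients ordered by predicted risk

print(f"Predicted risk: {prob:.2f}, classified as: {label}")
```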
Prediction models can help surgeons, patients, and other providers make more informed decisions. For example, it might be necessary to identify high-risk patients to guide the choice between invasive treatment strategies, less aggressive interventions, or simple surveillance. These prediction models, when implemented into broad-scale practice, can effectively improve clinical outcomes and population health metrics.[7] Online risk calculators and visual nomograms are tools used to help clinicians apply risk models in clinical practice. However, before employing a risk model, the user should be able to (1) understand its performance and (2) judge its applicability to his or her patient population.
Critical Appraisal of Prediction Model Validity
We propose five key aspects to consider when evaluating a prediction model or algorithm (Table 2). Although many aspects of model development deserve scrutiny, some are beyond the scope of this paper and most are covered in the statistical literature.[8] For quick reference, Table 3 lists the general evaluative goals of a good model. The characteristics in this table are rough guides and should not be considered as absolute criteria. In the Discrimination and Calibration sections, we delve into this topic in more detail and refer the reader to a number of more technical reviews.[9–11] Note that ‘model validity’ and ‘model validation’ are not interchangeable terms. The term ‘model validity’ refers to the degree to which the conclusions drawn from a prediction are accurate and reliable in the intended population. In contrast, ‘model validation’ is one of the many important statistical aspects of assessing model validity, discussed further in limited technical detail in Section 4 below.
Table 2.
Five questions to ask before using a prediction model in your practice
1. How well does the model differentiate between the outcomes of interest (Discrimination)? |
2. Does the model yield risk estimates that are close to the actual (observed) risk (Calibration)? |
3. How precise is the model’s estimate of risk (Confidence Intervals)? |
4. How was the model validated (Validation)? |
5. Is my patient similar to the population in which the model was developed (Generalizability)? |
Table 3.
Performance characteristics and measures
Characteristic | Measure |
---|---|
Discrimination | AUC (often > 0.70) |
Calibration | Calibration plot – agreement of observed and expected risks; Brier score (often < 0.20) |
1. Discrimination
The discriminative ability of a prediction algorithm is defined as its ability to separate individuals into distinct groups, e.g., all diseased versus all non-diseased. A prediction model has good discrimination when the predicted probabilities for the different groups have very little overlap with each other. If instead the predicted probabilities tend to have a large amount of overlap, the model is said to have poor discriminative ability because the predictions alone are not indicative of membership in the diseased or non-diseased group. Note that high discrimination does not imply accurate risk estimation; a model that assigns risk estimates of 0.98 or less to all truly non-diseased patients and 0.99 or more to all truly diseased patients will have near-perfect discrimination between the groups. However, it will lack calibration (defined as accurate risk estimates). We discuss calibration later.
The most common measure of discrimination for dichotomous outcomes is the area under the receiver operating characteristic (ROC) curve, or AUC. While the AUC is not necessarily an ideal measure of model performance (see the Technical Details section below), it is widely used in practice to evaluate a model’s discriminative ability. Thus, it is important to understand what the AUC is measuring and what various values might indicate about the risk model. The AUC is sometimes referred to as the c-statistic or c-index, but there are more general versions of the c-statistic/c-index for complex data, which should not be confused with the AUC.[12] Commonly, the AUC is defined as the probability that the predicted risk is higher for a diseased patient than for a non-diseased one. The AUC will be large when the distributions of diseased and non-diseased risk scores have little overlap. A key point is that the AUC alone is not a sufficient measure of the quality or usefulness of a prediction model because the AUC does not consider the accuracy of the risk estimates themselves. The AUC only indicates how well the model separates the risk score estimates of the diseased and non-diseased groups.
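To make this definition concrete, the sketch below (using hypothetical simulated risk scores) estimates the AUC directly as the proportion of diseased/non-diseased pairs in which the diseased patient receives the higher predicted risk, and confirms it matches the value returned by `sklearn.metrics.roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical predicted risks: diseased patients tend to score higher
risk_diseased = rng.beta(5, 3, 200)
risk_nondiseased = rng.beta(2, 5, 800)

# AUC as a concordance probability: P(diseased risk > non-diseased risk),
# counting ties as one-half
pairs = risk_diseased[:, None] - risk_nondiseased[None, :]
auc_concordance = (pairs > 0).mean() + 0.5 * (pairs == 0).mean()

# The same quantity computed from the ROC curve
y_true = np.concatenate([np.ones(200), np.zeros(800)])
y_score = np.concatenate([risk_diseased, risk_nondiseased])
auc_roc = roc_auc_score(y_true, y_score)

print(f"Concordance estimate: {auc_concordance:.3f}, roc_auc_score: {auc_roc:.3f}")
```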
An AUC near 1 indicates perfect discrimination, an AUC near 0.8 indicates high discrimination, and an AUC near 0.5 indicates a complete lack of discrimination. As the left panel of Figure 1 illustrates, point A would be an ideal test with perfect discrimination. Line C shows an AUC of 0.5, meaning a complete lack of discrimination akin to tossing a fair coin. Line B is somewhere in the middle, with an AUC of approximately 0.75. The right panel of Figure 1 shows an example of an AUC from the TREAT model for predicting lung cancer in a population of individuals referred to thoracic surgery because of a lung nodule.[13] The blue curves are higher, reflecting the better discriminative ability of the TREAT model in comparison to the Mayo model (red lines). Note that the dotted lines represent the validation in a Veterans Administration (VA) population. When reporting a model’s performance, it is also important to remember that the observed AUC is only an estimate; the confidence interval for the AUC should always be provided for proper inference. A wide CI indicates inconclusiveness and is a warning not to over-interpret the observed accuracy. Confidence intervals are discussed in further detail in Section 3.
Figure 1.
Example of an AUC plot. Left, schematic of AUC; right, TREAT model for lung cancer. Reproduced from Deppen et al.[13]
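Because the observed AUC is only an estimate, a confidence interval is needed for interpretation. One common (though not the only) way to obtain one is the nonparametric bootstrap; the sketch below, again using hypothetical data, resamples patients with replacement and takes the 2.5th and 97.5th percentiles of the resampled AUCs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Hypothetical outcomes and predicted risks for 1,000 patients
y = rng.binomial(1, 0.2, 1000)
risk = np.clip(0.2 + 0.3 * (y - 0.2) + rng.normal(0, 0.15, 1000), 0.01, 0.99)

point_estimate = roc_auc_score(y, risk)

# Bootstrap: resample patients, recompute the AUC each time
boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))
    if y[idx].min() == y[idx].max():   # skip resamples containing only one class
        continue
    boot_aucs.append(roc_auc_score(y[idx], risk[idx]))

lower, upper = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUC {point_estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```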
Models used to inform critical clinical decisions (e.g., surgical resection of a lung nodule) should generally meet more stringent requirements. Published models like those in Table 1 tend to have AUCs in the 0.7 to 0.9 range. Here, the clinician must weigh the consequences of the decision in terms of possible harms (e.g., false-positive-driven intervention or false-negative-induced delay) within the context of the patient’s history. As the magnitude of potential harms from a clinical decision increases (as would be encountered in a decision for cardiac surgery in a high-risk patient), the accuracy of the model should increase as well. For decisions involving less potential for harm, such as the decision to start a statin drug, a model like the Framingham risk score, which stratifies patients based on coronary heart disease risk, can have a lower AUC and still be clinically useful in identifying patients for primary and secondary cardiovascular prevention.[14]
Models with low AUCs or wide AUC confidence intervals should be used with caution. Such models can lead to worse patient care, because models with low discrimination are less likely to correctly classify patients and may instead interfere with clinicians’ usual decision-making processes. Models with mid-range AUCs can be clinically useful if they perform better than the clinician’s classification based on a synthesis of all available information. For example, consider a model to predict persistent post-concussive symptoms. It has an AUC of 0.68, whereas the AUC for providers predicting the same outcome is only 0.55.[15] Assuming the 95% CIs are sufficiently narrow, it would be clear that “suboptimal” AUC values may in fact be better than the alternative (e.g., usual care). Therefore, it is important to judge model discrimination relative to an alternative approach to prediction.
A Note on Risk Stratification
Prediction models provide an estimate of the risk, often known as the ‘risk score’—a continuous variable ranging from 0 to 1. It is often helpful to sub-divide patients into risk groups (e.g., high, medium, low). Combining similar patients into groups has its advantages and disadvantages. One advantage of grouping is that it facilitates communication, decision-making, and the delivery of care. A disadvantage is that patients near the cut-off for one group may be more similar to those in the adjacent group than one would otherwise think.
The clinical utility or “performance” of risk stratification can be evaluated in terms of a familiar two-by-two table and summarized in terms of the true-positive (sensitivity) and false-positive (1-specificity) rates when there are only two categories. This methodology provides information to the user regarding the frequency of misclassification. Stakeholders must then make judgments about the model performance based not only on the frequency of misclassification, but also on the potential unintended consequences of misclassification (e.g., missing a cancer).
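As an illustration of how a chosen cut-off turns a continuous risk score into a two-by-two table, the sketch below (hypothetical data and a hypothetical threshold of 0.5) computes the sensitivity and specificity, i.e., the true-positive rate and one minus the false-positive rate, of the resulting classification.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(3)

# Hypothetical outcomes and predicted risks
y = rng.binomial(1, 0.3, 500)
risk = np.clip(0.3 + 0.35 * (y - 0.3) + rng.normal(0, 0.2, 500), 0, 1)

threshold = 0.5                              # assumed clinical cut-off for "high risk"
predicted_high_risk = (risk >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y, predicted_high_risk).ravel()
sensitivity = tp / (tp + fn)                 # true-positive rate
specificity = tn / (tn + fp)                 # 1 - false-positive rate

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}, "
      f"False positives: {fp}, False negatives: {fn}")
```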
See supplemental table S1 for some further technical notes on the discriminative ability of a predictive model.
2. Calibration
Calibration is a measurement of how accurate an estimate of the risk prediction is in the target population. It is often visually assessed with a calibration plot.[10] A calibration plot depicts observed versus expected (i.e., predicted) values across a range of predicted values (e.g., deciles of predicted risk). Ideally, the predicted values should be accompanied by their 95% confidence intervals (see Section 3: Confidence Intervals).
The left panel of Figure 2 shows the calibration plot for the Society of Thoracic Surgeons (STS) CABG mortality calculator, which appears well calibrated because the predicted risk for each data point corresponds closely to the observed risk. The line at 45 degrees denotes perfect calibration, where the predicted probability of an event exactly matches the observed frequency of the event in the observed data. In contrast, in the right panel of Figure 2 (from a fall-risk assessment tool for elderly people living in the community), the predicted risk consistently overestimates the observed risk across all observed probabilities.
Figure 2.
Examples of calibration plots. Left, STS CABG mortality calculator calibration plot; right, FRAT-up fall-risk assessment tool calibration plot. Reproduced with modifications from Shahian et al.[43] (left) and Cattelani et al.[45] (right)
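A calibration plot like those in Figure 2 can be produced in a few lines; the sketch below uses `sklearn.calibration.calibration_curve` on hypothetical data to plot the observed event rate against the mean predicted risk within deciles of predicted risk, with the 45-degree line for reference.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(4)

# Hypothetical predicted risks, with outcomes drawn from those same risks
predicted_risk = rng.uniform(0.01, 0.6, 2000)
observed_outcome = rng.binomial(1, predicted_risk)   # a well-calibrated scenario

# Observed event rate vs. mean predicted risk within deciles of predicted risk
obs_freq, mean_pred = calibration_curve(observed_outcome, predicted_risk,
                                        n_bins=10, strategy="quantile")

plt.plot(mean_pred, obs_freq, "o-", label="Model")
plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration (45-degree line)")
plt.xlabel("Predicted risk")
plt.ylabel("Observed event rate")
plt.legend()
plt.show()
```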
A calibration plot is far more useful than a single p-value from a test of goodness-of-fit, because the plot shows where the prediction model is over- or under-predicting across all levels of risk and the degree to which it is inaccurate. Nevertheless, a summary measure of calibration often reported in the clinical literature is the goodness-of-fit test (e.g., the Hosmer-Lemeshow test). This statistical test checks whether the predicted and observed event rates are similar across groups of predicted risk, typically deciles (e.g., <10%, 10–20%, etc.). A significant p-value on the Hosmer-Lemeshow test (i.e., p<0.05) implies that the model is not well calibrated to the observed data.
However, the converse is not true. A p-value ≥0.05 on the Hosmer-Lemeshow test does not necessarily mean that the model is well calibrated; it only implies the data are inconclusive regarding the question of model calibration. This unintuitive feature of goodness-of-fit tests is due to the asymmetry in statistical hypothesis testing whereby large p-values do not “prove” the null hypothesis. In this instance, large p-values should be interpreted as “no evidence that the model is, or is not, calibrated,” and care should be taken not to over-interpret such a result as “evidence the model is well calibrated.” To assess the degree of calibration, best practice is to thoughtfully examine a calibration plot and to carefully examine areas with large under- or over-estimation of the risk. This approach is more informative than simply reporting the Hosmer-Lemeshow goodness-of-fit test result.
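For completeness, a rough sketch of the Hosmer-Lemeshow calculation is shown below (hypothetical data; the degrees of freedom shown follow the usual convention for development data). It groups patients by decile of predicted risk, compares observed and expected event counts, and refers the resulting statistic to a chi-square distribution; as discussed above, a large p-value should not be read as proof of good calibration.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, risk, n_groups=10):
    """Hosmer-Lemeshow goodness-of-fit statistic over deciles of predicted risk."""
    order = np.argsort(risk)
    groups = np.array_split(order, n_groups)
    stat = 0.0
    for g in groups:
        observed = y[g].sum()                 # observed events in the group
        expected = risk[g].sum()              # expected events = sum of predicted risks
        n = len(g)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    p_value = chi2.sf(stat, df=n_groups - 2)  # df convention for development data
    return stat, p_value

# Hypothetical data
rng = np.random.default_rng(5)
risk = rng.uniform(0.05, 0.6, 1000)
y = rng.binomial(1, risk)

stat, p = hosmer_lemeshow(y, risk)
print(f"Hosmer-Lemeshow statistic: {stat:.1f}, p-value: {p:.2f}")
```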
Another commonly reported calibration check is the Brier score, which quantifies how well a model performs when estimating a dichotomous outcome. The Brier score contains components of both discrimination and calibration, but it is more commonly used as a measure of calibration (see the Technical Details section below for further discussion). As with the AUC, the 95% confidence interval of the Brier score should supplement interpretation of calibration across a range of predicted values. Suppose a model predicts that a patient has a 15% chance of dying after coronary artery bypass surgery. The actual outcome will be either death or survival, not a numeric probability; the Brier score is the average squared difference between each predicted probability and the observed outcome (coded 0 or 1). Brier scores range from 0 to 1, with lower scores corresponding to better calibration. A model with a Brier score of less than 0.2 is generally considered well calibrated, whereas a score of 1.0 means the prediction was completely inaccurate. Figure 3 is an example of a Brier score box plot that compares the calibration of the TREAT model for lung cancer with the Mayo model for lung cancer in both the development cohort (VUMC, Vanderbilt University Medical Center) and the validation cohort (VA, Veterans Affairs). The lower Brier score for the TREAT model demonstrates better calibration than the Mayo model for the surgical population.
Figure 3.
Example of Brier score box plot. The lower Brier score for the TREAT model demonstrates better calibration than the Mayo model for the surgical population in the development (VUMC) and validation populations (VA). Reproduced from Deppen et al.[46]
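A minimal sketch of the Brier score calculation on hypothetical data follows; it is simply the mean squared difference between each predicted probability and the observed 0/1 outcome, and `sklearn.metrics.brier_score_loss` returns the same value.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(6)

# Hypothetical predicted risks and observed outcomes (1 = event, 0 = no event)
predicted_risk = rng.uniform(0.01, 0.5, 1000)
observed = rng.binomial(1, predicted_risk)

# Brier score: mean squared difference between predicted probability and outcome
brier_manual = np.mean((predicted_risk - observed) ** 2)
brier_sklearn = brier_score_loss(observed, predicted_risk)

# For a single patient predicted at 15% risk who survives (outcome = 0),
# the contribution is (0.15 - 0)^2 = 0.0225; had the patient died, (0.15 - 1)^2 = 0.7225.
print(f"Brier score: {brier_manual:.3f} (sklearn: {brier_sklearn:.3f})")
```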
Recalibration:
Risk scores can be adjusted post hoc to adapt to a new population or to account for changing trends in a population. This adjustment is called “recalibration” and has been done by the STS.[16] To illustrate, recalibration may be needed to adjust a patient’s predicted risk of mortality when it is calculated using a risk model that was derived from historical data during a time period when the mortality rate was higher. This would take into consideration the global trend of declining perioperative CABG mortality rates. Finally, when a risk model is applied in a new population (other than the one in which it was developed), its calibration is never assured and should always be checked. Examining how the model’s predictive characteristics (calibration and discrimination) perform in a new population is a form of model validation and may also address the generalizability of the model, both of which are discussed further below.
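One simple and commonly described recalibration approach (not necessarily the method used by the STS) is logistic recalibration: an intercept and slope are refit to the logit of the original model’s predicted risks using data from the new population. The hypothetical sketch below assumes the new population has lower mortality than the historical development data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Hypothetical: predictions from an old model applied to a new population whose
# overall mortality is lower than in the historical development data
old_predicted_risk = rng.uniform(0.02, 0.4, 2000)
true_risk_new_pop = np.clip(old_predicted_risk * 0.6, 0.001, 0.999)  # assumed downward shift
observed = rng.binomial(1, true_risk_new_pop)

# Logistic recalibration: regress the observed outcome on the logit of the old predictions
logit_old = np.log(old_predicted_risk / (1 - old_predicted_risk)).reshape(-1, 1)
recal = LogisticRegression(max_iter=1000).fit(logit_old, observed)

def recalibrated_risk(p):
    """Map an old predicted risk to a recalibrated risk for the new population."""
    lp = np.log(p / (1 - p))
    return recal.predict_proba(np.array([[lp]]))[0, 1]

print(f"Old prediction 0.20 -> recalibrated {recalibrated_risk(0.20):.2f}")
```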
See supplemental table S2 for some further technical notes on the concept of calibration of a predictive model.
3. Confidence intervals
It is critical not to over-interpret a single point estimate of risk as the true risk. A model effectively does not distinguish between risk scores contained within its confidence interval. This general statistical principle, acknowledging uncertainty, is an important part of interpreting risk scores. The uncertainty of a model’s predicted outcome (or predicted probability of an outcome) can be represented with a confidence interval. Confidence intervals are typically set at the 95% level, meaning that if the study were repeated multiple times under the same conditions, the observed confidence intervals (different in each instance) would contain the true risk score 95% of the time. Generally speaking, a smaller sample size leads to a wider confidence interval, whereas a larger sample size leads to a narrower confidence interval. A smaller sample size with minimal variability in the underlying data-generating mechanism can also lead to narrow confidence intervals, but this is unusual to observe in practice. Thus, wider confidence intervals reflect less certainty in the estimated risk profile of a patient, and narrow confidence intervals imply greater certainty from either less variability in the underlying data or very large sample sizes. For example, an estimated risk score for lung cancer of 75% would be actionable if the 95% CI were 73% to 77%, but not actionable if the 95% CI were 35% to 95%. In the first case, the model is fairly precise about the magnitude of the risk, whereas in the second case it is largely inconclusive. Failure to acknowledge the precision of the risk score with confidence intervals can make the difference between acting on a reliable estimate and acting on an inconclusive one.
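As an illustration of where such intervals can come from, the sketch below (hypothetical data and model) refits a logistic regression on bootstrap resamples of the development data and takes percentiles of the resulting predictions for one new patient. This is only one of several possible approaches; many published calculators use analytic intervals instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)

# Hypothetical development data: two predictors and a binary outcome
n = 400
X = rng.normal(size=(n, 2))
p_true = 1 / (1 + np.exp(-(-1.5 + 1.2 * X[:, 0] + 0.8 * X[:, 1])))
y = rng.binomial(1, p_true)

new_patient = np.array([[1.0, 0.5]])   # hypothetical covariate values

# Bootstrap: resample patients, refit the model, re-predict the new patient's risk
boot_preds = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    if y[idx].min() == y[idx].max():   # skip resamples containing only one class
        continue
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boot_preds.append(m.predict_proba(new_patient)[0, 1])

point = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(new_patient)[0, 1]
lower, upper = np.percentile(boot_preds, [2.5, 97.5])
print(f"Predicted risk {point:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```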
The precision of a risk score is generally worse at the extremes (near 0 or near 1). This occurs, in part, because there are often fewer data points at the extremes and because very high and low risk scores are harder to estimate. The precision can also vary as a function of the underlying prevalence. For example, the TREAT model for estimating the risk of lung cancer in a person with an incidentally-detected nodule has wider confidence intervals at certain underlying prevalence ranges of cancer and much narrower confidence intervals at others, which reflects the distribution of risk across patients used in the original model development.[13]
Confidence intervals also provide critical information about how well prediction models perform at the extremes. Well-powered models, such as the STS cardiac models, integrate hundreds of thousands of patients across decades and likely perform well at the extremes in addition to the middle of the risk distribution. It is important to distinguish between the predictive accuracy of the model and the precision of its risk estimates. These are not the same thing. Predictive accuracy assesses the ability of the model to classify patients into two groups, whereas precision measures how close repeated estimates of the risk are to each other. A model with very imprecise risk estimates (wide CIs) can still have very high predictive accuracy (high AUC). This is one additional reason not to place too much credence in the risk estimates themselves, as they are often better used as a device to classify patients into groups (e.g., diseased or not). Unfortunately, not all prediction models are able to easily generate confidence intervals for the risk score. The TREAT model does provide them, allowing surgeons to judge how much to trust the risk estimate when making clinical decisions.[13]
See supplemental table S3 for some further technical notes on confidence intervals in the context of a predictive model.
4. Validation
Validation of a prediction model is an essential step in model evaluation. The idea is to see how the model performs in a similar yet independent population from the one in which the model was developed. The need for validation arises because the data cannot be used to both develop a model and check its validity. When the data are used twice in this fashion, the assessment of model performance is biased upwards; it is too optimistic. Proper validation leads to an unbiased estimate of model performance.
There are two types of validation—internal and external. Internal validation uses data from a single source, usually collected by the same research group at the same locations. There are several methods for internal validation—including cross-validation, hold-out or split-sample, bootstrap, and post-estimation validation—that are beyond the scope of this article.[17] One limitation of internal validation is that model performance will not deteriorate as a result of issues with the underlying data source (e.g., selection bias in recruitment, missing data, measurement errors). For instance, if the method of internal validation is splitting the sample randomly, then measurement errors and missing data resulting from data collection at one site will be randomly distributed across the samples and will not manifest as poor model performance.
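The sketch below illustrates two of these internal validation strategies on hypothetical data: a random split-sample (hold-out) evaluation and 5-fold cross-validation, both scored by AUC. It is a schematic only, not a substitute for the fuller treatments cited above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(9)

# Hypothetical development data: four predictors and a binary outcome
n = 1000
X = rng.normal(size=(n, 4))
p = 1 / (1 + np.exp(-(-2 + X @ np.array([0.8, 0.5, -0.4, 0.3]))))
y = rng.binomial(1, p)

# Split-sample (hold-out) internal validation
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
holdout_auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])

# 5-fold cross-validation: each fold is held out once while the rest train the model
cv_aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc")

print(f"Hold-out AUC: {holdout_auc:.2f}, cross-validated AUC: {cv_aucs.mean():.2f}")
```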
External validation refers to evaluating model performance in a representative population from a fully independent source (e.g., from another institution or geographic location by another group of investigators). Evaluating model performance in such a population may reveal decrements in model performance that can arise from ways in which the underlying data were collected in addition to how well the model estimates risk based on the underlying distribution of information. Internal validation schemes such as cross-validation, split-sample, and bootstrap validation attempt to mimic external validation and are important tools for assessing model viability when an external sample is not available. Ideally, both internal and external validation should be performed in order to provide the most conservative estimates of model performance.[9]
See supplemental table S4 for some further technical notes on validation of a predictive model.
5. Generalizability
Generalizability is the degree to which the sample characteristics accurately reflect the new target population. A valid, accurate, and precise prediction model may not be generalizable to a clinician’s patient population. One must evaluate how similar one’s patients are to the target population for which the model was developed and validated. A validated prediction model with excellent discrimination, calibration, and confidence intervals may still be unhelpful or potentially harmful if applied to a population markedly different from the development population. For example, the well-validated Mayo Clinic model for lung cancer has been shown to have limited generalizability to the typical population of patients that a thoracic surgeon sees in evaluation for suspicious lung nodules due to the higher prevalence of cancer that surgeons commonly encounter in their practice.[18–20]
See supplemental table S5 for some further technical notes on the generalizability of a predictive model.
Considerations for Risk Calculator Implementation and Use
Once one has determined that a particular risk calculator is valid and appropriate for clinical practice, other important features must be considered. These include: (1) the user-risk calculator interface, (2) communication of model results and risk information, and (3) the burden and accuracy of a risk calculator’s inputs. In addition, the implementation of risk calculators may or may not (4) change decision-makers’ behaviors or (5) improve patient-level outcomes. Table 4 summarizes questions to ask prior to model implementation.
Table 4.
Five questions to ask before implementing a model into clinical practice
1. Is the interface between the calculator and the user optimal (User interface)? |
2. Is risk information conveyed properly (e.g., a probability with a range) (Conveying risk)? |
3. Does the model require the least amount of information to make the best possible prediction, thereby mitigating the impact of its use on clinical workflow (Burden of use)? |
4. Is there evidence that using the model will change the user’s behavior (Behavior change)? |
5. Is there evidence that using the model will lead to better outcomes, including improved cost-effectiveness and patient satisfaction (Impact on outcomes)? |
1. User-risk calculator interface
The target user of a risk calculator must be well defined and the interface designed accordingly. For example, an interface appropriate for use by clinicians in an EMR setting is likely not appropriate for use by patients at home over the web. Patient usage often requires well-designed online support and necessitates consideration of numeracy and health literacy, with the elimination of medical jargon. Health numeracy, the degree to which individuals interpret and act on quantitative, graphical, and probabilistic health information, affects patient decision-making both in the presence and absence of risk calculators.[21] Personalized interpretation of model results through infographics is an example of a feature that a model dedicated to general public use should have.
2. Conveying risk information
The predicted risk for an outcome can be conveyed in many different ways. The point estimate (e.g., 23% probability of perioperative mortality) should be accompanied by the range of uncertainty. The most common way to convey the uncertainty of the point estimate to the clinician is with confidence intervals (e.g., 23% [95% CI, 15–31%] probability of perioperative mortality). Alternatively, the calculator may output a simple risk category (e.g., high vs. low risk of a particular outcome).
Ideally, a risk interpretation should accompany the estimate and include information on other outcomes that providers and patients use to arrive at clinical decisions, as decisions are not made in a vacuum. When obtaining informed consent from patients, it is important to discuss the benefits, risks, and alternatives (and their respective benefits and risks). Physicians should discuss the risks of surgery while acknowledging the harms and benefits of non-operative management. Use of risk calculators to selectively provide patients and clinicians limited information could potentially lead to poor decision-making if these additional factors, such as the uncertainty of the risk estimate, are ignored. Moreover, risk calculators do not remove the need to understand the patient’s desires, which are important to consider in guiding patient care.
The way in which risk information is conveyed can influence decision-making.[22] For example, patients appear to understand imprecision better when it is described qualitatively (i.e., simply stating the confidence in the probability, e.g., high, moderate, or low confidence in the estimated risk of the event) rather than quantitatively (e.g., “4% probability of early death after valve replacement, with a 95% confidence interval of 2% to 10% probability of early death”).[23] Graphical or visual aids are particularly helpful in interfacing with patients or families with lower numeracy.[24, 25] However, little research has been done on how conveying risk information to clinicians affects their decision-making.[26]
Risk calculators are developed for broad implementation and generalizability across a variety of institutions and practice settings. Actual outcomes, however, may vary by facility and individual surgeon. A calculator that predicts the expected mortality after a cardiothoracic operation may also be used to compare expected outcomes to observed outcomes. This generates an observed-to-expected (O/E) ratio, a metric commonly used in quality and performance assessment. For example, the observed or actual mortality for CABG at Hospital X is divided by the expected or calculator-predicted mortality based on the individual patient details. An O/E ratio of 1.5 means that Hospital X’s mortality is 50% higher than expected, whereas an O/E ratio of 0.50 means the mortality is 50% lower than expected. An O/E ratio of 1.0 means that Hospital X’s mortality is equal to what is expected. The many statistical details behind risk-adjusting patients include direct and indirect standardization and are beyond the scope of this paper.[27, 28] That said, one should always also look for the O/E confidence intervals if inference and action are the goal.
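A simple sketch of the O/E calculation with a bootstrap confidence interval is shown below. The data are hypothetical, and national quality programs use more sophisticated risk-adjustment and interval methods than this.

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical: calculator-predicted mortality risks and observed deaths at "Hospital X"
predicted_risk = rng.uniform(0.01, 0.10, 800)
observed_death = rng.binomial(1, np.clip(predicted_risk * 1.3, 0, 1))  # assumed 30% excess

def oe_ratio(obs, pred):
    return obs.sum() / pred.sum()            # observed deaths / expected deaths

point = oe_ratio(observed_death, predicted_risk)

# Bootstrap the patients to get an approximate confidence interval for O/E
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(predicted_risk), len(predicted_risk))
    boot.append(oe_ratio(observed_death[idx], predicted_risk[idx]))
lower, upper = np.percentile(boot, [2.5, 97.5])

print(f"O/E ratio: {point:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```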
3. Burden of use
Standard data entry checks should be employed in order to minimize data entry errors (e.g., requiring a patient age to fall between 20–120 years, or picking from a list). Integrating clinical models into electronic medical record (EMR) platforms is an excellent way to potentially mitigate the burden of data look-up and manual entry into models that are independent of the medical record and often separately web-based.[29] However, automatic population of such data can be challenging and may introduce errors, and the results must be validated against manual approaches. The STS provides an online risk calculator to its members for 7 individual or combined procedures as a benefit and service of its nonprofit educational mission at http://riskcalc.sts.org/stswebriskcalc/#/.
When developing a risk prediction model, more predictor variables may lead to better prediction, but the improvement can be incremental and come at the expense of user burden. The STS lung cancer resection risk model requires 20 variables to predict operative mortality and major morbidity.[30] The parallel calculator for cardiac surgery requires 16 to 36 variables depending on the procedure being performed and its corresponding predictive model.[16] Even the basic ACS NSQIP surgical risk calculator requires 21 input variables to predict 8 outcomes, including mortality.[31] This requirement for a large number of variables is magnified when multiple models are used to estimate the probabilities of multiple outcomes, as in the STS risk calculator for nine different outcomes after cardiac surgery.[32]
Ideally, the burden of use should be addressed in the development phase of a risk calculator by preferring a parsimonious model when the loss in prediction accuracy is negligible – i.e., only including in model development those clinical variables that are absolutely necessary, as determined by a priori selection.[33–35] Furthermore, most models require complete data and force the user to enter some value for each variable unless missing data fields can be imputed by the user interface, as in the TREAT model for lung cancer.[13] Although parsimony is desirable, the disadvantage is that less common but highly important risk factors might be excluded, which disadvantages providers who treat more complex patients.
4. Changing decision-makers’ behavior
More research is needed on how decision-makers use risk calculators and how incorporating a risk calculator may change subsequent decisions. Research to date has focused on the ways in which risk calculators change patients’ behavior or are misinterpreted by patients.[36–38] Few studies have examined the impact of risk calculators on clinicians’ behavior.[29] Most calculators provide the probability of an event as a continuous number, such as “the probability of death after CABG is 2.3%.” Because surgeons must make dichotomous decisions (operate or not), they often have subjective cut-points and integrate model outputs into their decision-making process. Use of predictive models helps clinicians decide whether or not to recommend an intervention to a patient, and which particular procedure may be the best choice. This reinforces the need for confidence intervals around predicted probabilities discussed previously. Clinicians’ decision-making would be affected differently by a risk calculator showing “the probability of death after CABG is 2.3% [95% CI, 1.9–3%]” as compared to “the probability of death after CABG is 2.3% [95% CI, 0.2–42%].” Here the width of the CI would have an understandably large impact on the decision, since uncertainty is so much greater in the second example.
5. Changing outcomes
Ideally, use of a prediction model will improve patient outcomes relative to “usual care.” For instance, compared to guideline-recommended care, the use of a risk prediction model may reduce the number of invasive diagnostic tests without compromising the ability to accurately stage lung cancer patients prior to treatment.[39, 40] If confirmed through trials or simulation, such a strategy would preserve oncologic outcomes while sparing some patients rare safety events and reducing overall costs of care.
Demonstrating a change in outcomes (e.g., benefit, value, patient satisfaction, etc.) is an important way to encourage implementation. One of the most important reasons for deviating from recommended care is a lack of high-level evidence in support of those guidelines leading to better outcomes.[41] Outcomes assessment should not only include the safety and utilization measures (morbidity, mortality, survival, length-of-stay) but also measures of efficacy, patient-reported outcomes, patient and provider satisfaction, decision-regret, healthcare utilization, and costs.
Unintended Consequences of Risk Calculators
Risk calculators, like surgery, comprise an intervention in and of themselves – an intervention intended to improve decision-making and clinical care. Like all interventions, risk calculators have benefits, risks, and alternatives (e.g., usual care) and should ideally be evaluated using a comparative framework such as a diagnostic trial. Potential harm from using risk calculators or models can take the form of under/over-diagnosis and/or under/over-treatment, if decision-making on the part of the patient, the clinician(s), or both is unduly influenced. Treatment decisions are complex, and multiple outcome domains must be considered, including treatment effectiveness, safety, function, pain, health-related quality-of-life, and long-term survival. Not all of these are quantifiable.
Furthermore, treatment decisions are often considered in the context of alternatives. Consider a patient with Stage I lung cancer who is pondering resection but has compromised lung function. A lobectomy mortality model suggests that the 30-day risk of death is 5% [95% CI, 3–7%], and the surgeon communicates to the patient that she is at “high risk.” The patient then chooses stereotactic body radiation therapy (SBRT), which has lower short-term risk, but subsequently develops a localized recurrence one year later. Surgical resection, though having an initial increased (but low) risk relative to SBRT, would ultimately have been a better choice for the patient. This example demonstrates how risk estimates must be comprehensive—i.e., not just short term risk but long-term likelihood of recurrence and survival.
Most patients (and surgeons) do not know how to interpret probabilities without further context.[23, 42] For example, while a 5% risk of death is over three times the average of ~1.6%, there is still a 95% chance of survival.[30] Framing it as a 5% probability of a bad outcome versus a 95% probability of a good outcome affects decision-making. In addition, communicating the range of the confidence interval is important, because owing to statistical uncertainty the true risk may be as low as 3% or as high as 7% (or even more extreme, in some circumstances). If the actual risk is 3%, then this may be considered an acceptable mortality risk. Furthermore, decision-making is not independent of the alternative treatment and the probability of other outcomes. Perhaps the patient did not know the risks of SBRT, or the relative benefits of surgery compared to SBRT were not conveyed to the patient. The relative impact of surgery and SBRT in terms of pain, function, and side effects may also not have been conveyed to the patient in terms they can effectively weigh for their circumstances.
Closing statement
Risk calculators and clinical prediction models can be important facilitators of clinical decision-making and may be particularly useful in shared patient-provider decision-making. Careful attention is needed when selecting an appropriate risk model and when implementing it in clinical practice.
Supplementary Material
Acknowledgments
The authors wish to acknowledge David Shahian, Jeff Jacobs, Vinay Badhwar, and Jonathan Nesbitt for also reviewing the manuscript and providing suggestions. Dr. Maiga is supported by the Office of Academic Affiliations, Department of Veterans Affairs (VA) National Quality Scholars Program.
References
- 1. Shahian DM and Edwards FH, The Society of Thoracic Surgeons 2008 cardiac surgery risk models: introduction. Ann Thorac Surg, 2009. 88(1 Suppl): p. S1.
- 2. Winkley Shroyer AL, et al., The Society of Thoracic Surgeons Adult Cardiac Surgery Database: The Driving Force for Improvement in Cardiac Surgery. Semin Thorac Cardiovasc Surg, 2015. 27(2): p. 144–51.
- 3. Edwards FH, Clark RE, and Schwartz M, Coronary artery bypass grafting: the Society of Thoracic Surgeons National Database experience. Ann Thorac Surg, 1994. 57(1): p. 12–9.
- 4. Shroyer AL, et al., The 1996 coronary artery bypass risk model: the Society of Thoracic Surgeons Adult Cardiac National Database. Ann Thorac Surg, 1999. 67(4): p. 1205–8.
- 5. Atashi A, et al., Models to predict length of stay in the Intensive Care Unit after coronary artery bypass grafting: a systematic review. J Cardiovasc Surg (Torino), 2018. 59(3): p. 471–482.
- 6. Barringhaus KG, et al., Impact of independent data adjudication on hospital-specific estimates of risk-adjusted mortality following percutaneous coronary interventions in Massachusetts. Circ Cardiovasc Qual Outcomes, 2011. 4(1): p. 92–8.
- 7. Moons KG, et al., Risk prediction models: II. External validation, model updating, and impact assessment. Heart, 2012. 98(9): p. 691–8.
- 8. Collins GS, et al., Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. The TRIPOD Group. Circulation, 2015. 131(2): p. 211–9.
- 9. Harrell F, Regression Modeling Strategies. 2015, Switzerland: Springer International Publishing.
- 10. Steyerberg EW, et al., Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 2010. 21(1): p. 128–38.
- 11. Iezzoni LI, Risk Adjustment for Measuring Health Care Outcomes. 2013: AUPHA.
- 12. Uno H, et al., On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med, 2011. 30(10): p. 1105–17.
- 13. Deppen SA, et al., Predicting Lung Cancer Prior to Surgical Resection in Patients with Lung Nodules. J Thorac Oncol, 2014. 9(10): p. 1477–84.
- 14. Wilson PW, et al., Prediction of coronary heart disease using risk factor categories. Circulation, 1998. 97(18): p. 1837–47.
- 15. Meurer WJ and Tolles J, Logistic Regression Diagnostics: Understanding How Well a Model Predicts Outcomes. JAMA, 2017. 317(10): p. 1068–1069.
- 16. Jin R, et al., Using Society of Thoracic Surgeons risk models for risk-adjusting cardiac surgery results. Ann Thorac Surg, 2010. 89(3): p. 677–82.
- 17. Steyerberg EW, Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2009, New York: Springer-Verlag.
- 18. Gould MK, et al., A clinical model to estimate the pretest probability of lung cancer in patients with solitary pulmonary nodules. Chest, 2007. 131(2): p. 383–8.
- 19. Isbell JM, et al., Existing general population models inaccurately predict lung cancer risk in patients referred for surgical evaluation. Ann Thorac Surg, 2011. 91(1): p. 227–33; discussion 233.
- 20. Swensen SJ, et al., The probability of malignancy in solitary pulmonary nodules. Application to small radiologically indeterminate nodules. Arch Intern Med, 1997. 157(8): p. 849–55.
- 21. Levy H, et al., Health numeracy: the importance of domain in assessing numeracy. Med Decis Making, 2014. 34(1): p. 107–15.
- 22. Zipkin DA, et al., Evidence-based risk communication: a systematic review. Ann Intern Med, 2014. 161(4): p. 270–80.
- 23. Bansback N, Harrison M, and Marra C, Does Introducing Imprecision around Probabilities for Benefit and Harm Influence the Way People Value Treatments? Med Decis Making, 2016. 36(4): p. 490–502.
- 24. Oudhoff JP and Timmermans DR, The effect of different graphical and numerical likelihood formats on perception of likelihood and choice. Med Decis Making, 2015. 35(4): p. 487–500.
- 25. Rubrichi S, et al., Graphical representation of life paths to better convey results of decision models to patients. Med Decis Making, 2015. 35(3): p. 398–402.
- 26. Durieux P, et al., A clinical decision support system for prevention of venous thromboembolism: effect on physician behavior. JAMA, 2000. 283(21): p. 2816–21.
- 27. O’Brien SM and Gauvreau K, Statistical issues in the analysis and interpretation of outcomes for congenital cardiac surgery. Cardiol Young, 2008. 18 Suppl 2: p. 145–51.
- 28. O’Hara LM, et al., Indirect Versus Direct Standardization Methods for Reporting Healthcare-Associated Infections: An Analysis of Central Line-Associated Bloodstream Infections in Maryland. Infect Control Hosp Epidemiol, 2017. 38(8): p. 989–992.
- 29. Feldstein DA, et al., Design and implementation of electronic health record integrated clinical prediction rules (iCPR): a randomized trial in diverse primary care settings. Implement Sci, 2017. 12(1): p. 37.
- 30. Fernandez FG, et al., The Society of Thoracic Surgeons Lung Cancer Resection Risk Model: Higher Quality Data and Superior Outcomes. Ann Thorac Surg, 2016. 102(2): p. 370–7.
- 31. Bilimoria KY, et al., Development and evaluation of the universal ACS NSQIP surgical risk calculator: a decision aid and informed consent tool for patients and surgeons. J Am Coll Surg, 2013. 217(5): p. 833–42 e1–3.
- 32. Deppen SA, et al., Cost-Effectiveness of Initial Diagnostic Strategies for Pulmonary Nodules Presenting to Thoracic Surgeons. Ann Thorac Surg, 2014. 98(4): p. 1214–1222.
- 33. Meguid RA, et al., Surgical Risk Preoperative Assessment System (SURPAS): I. Parsimonious, Clinically Meaningful Groups of Postoperative Complications by Factor Analysis. Ann Surg, 2016. 263(6): p. 1042–8.
- 34. Meguid RA, et al., Surgical Risk Preoperative Assessment System (SURPAS): II. Parsimonious Risk Models for Postoperative Adverse Outcomes Addressing Need for Laboratory Variables and Surgeon Specialty-specific Models. Ann Surg, 2016. 264(1): p. 10–22.
- 35. Meguid RA, et al., Surgical Risk Preoperative Assessment System (SURPAS): III. Accurate Preoperative Prediction of 8 Adverse Outcomes Using 8 Predictor Variables. Ann Surg, 2016. 264(1): p. 23–31.
- 36. Waldron CA, et al., The effect of different cardiovascular risk presentation formats on intentions, understanding and emotional affect: a randomised controlled trial using a web-based risk formatter (protocol). BMC Med Inform Decis Mak, 2010. 10: p. 41.
- 37. Bonner C, et al., I don’t believe it, but I’d better do something about it: patient experiences of online heart age risk calculators. J Med Internet Res, 2014. 16(5): p. e120.
- 38. Donelan K, et al., Consumer comprehension of surgeon performance data for coronary bypass procedures. Ann Thorac Surg, 2011. 91(5): p. 1400–5; discussion 1405–6.
- 39. Farjah F, et al., External validation of a prediction model for pathologic N2 among patients with a negative mediastinum by positron emission tomography. J Thorac Dis, 2015. 7(4): p. 576–84.
- 40. Farjah F, et al., Vascular endothelial growth factor C complements the ability of positron emission tomography to predict nodal disease in lung cancer. J Thorac Cardiovasc Surg, 2015. 150(4): p. 796–803 e1–2.
- 41. Brouwers MC, et al., A mixed methods approach to understand variation in lung cancer practice and the role of guidelines. Implement Sci, 2014. 9(1): p. 36.
- 42. Fanshawe TR, et al., Interactive visualisation for interpreting diagnostic test accuracy study results. BMJ Evid Based Med, 2018. 23(1): p. 13–16.
- 43. Shahian DM, et al., The Society of Thoracic Surgeons 2008 cardiac surgery risk models: part 1—coronary artery bypass grafting surgery. Ann Thorac Surg, 2009. 88(1 Suppl): p. S2–22.
- 44. Tammemagi MC, et al., Selection criteria for lung-cancer screening. N Engl J Med, 2013. 368(8): p. 728–36.
- 45. Cattelani L, et al., FRAT-up, a Web-based fall-risk assessment tool for elderly people living in the community. J Med Internet Res, 2015. 17(2): p. e41.
- 46. Deppen SA, et al., Lung cancer screening and smoking cessation: a teachable moment? J Natl Cancer Inst, 2014. 106(6): p. dju122.