Abstract
Background
Surgical repair of hip fracture carries substantial short-term risks of mortality and complications. The risk-reward calculus for most patients with hip fractures favors surgical repair. However, some patients have low prefracture functioning, frailty, and/or very high risk of postoperative mortality, making the choice between surgical and nonsurgical management more difficult. The importance of high-quality informed consent and shared decision-making for frail patients with hip fracture has recently been demonstrated. A tool to accurately estimate patient-specific risks of surgery could improve these processes.
Questions/purposes
With this study, we sought (1) to develop, validate, and estimate the overall accuracy (C-index) of risk prediction models for 30-day mortality and complications after hip fracture surgery; (2) to evaluate the accuracy (sensitivity, specificity, and false discovery rates) of risk prediction thresholds for identifying very high-risk patients; and (3) to implement the models in an accessible web calculator.
Methods
In this comparative study, preoperative demographics, comorbidities, and preoperatively known operative variables were extracted for all 82,168 patients aged 18 years and older undergoing surgery for hip fracture in the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) between 2011 and 2017. Eighty-two percent (66,994 of 82,168) of patients were at least 70 years old, 21% (17,007 of 82,168) were at least 90 years old, 70% (57,260 of 82,168) were female, and 79% (65,301 of 82,168) were White. A total of 5% (4260 of 82,168) of patients died within 30 days of surgery, and 8% (6786 of 82,168) experienced a major complication. The ACS-NSQIP database was chosen for its clinically abstracted and reliable data from more than 600 hospitals on important surgical outcomes, as well as rich characterization of preoperative demographic and clinical predictors for demographically diverse patients. Using all the preoperative variables in the ACS-NSQIP dataset, least absolute shrinkage and selection operator (LASSO) logistic regression, a type of machine learning that selects variables to optimize accuracy and parsimony, was used to develop and validate models to predict two primary outcomes: 30-day postoperative mortality and any 30-day major complications. Major complications were defined by the occurrence of ACS-NSQIP complications including: on a ventilator longer than 48 hours, intraoperative or postoperative unplanned intubation, septic shock, deep incisional surgical site infection (SSI), organ/space SSI, wound disruption, sepsis, intraoperative or postoperative myocardial infarction, intraoperative or postoperative cardiac arrest requiring cardiopulmonary resuscitation, acute renal failure needing dialysis, pulmonary embolism, stroke/cerebral vascular accident, and return to the operating room. 
Secondary outcomes were six clusters of complications recently developed and increasingly used for the development of surgical risk models, namely: (1) pulmonary complications, (2) infectious complications, (3) cardiac events, (4) renal complications, (5) venous thromboembolic events, and (6) neurological events. Tenfold cross-validation was used to assess overall model accuracy with C-indexes, a measure of how well models discriminate patients who experience an outcome from those who do not. The models' predicted risks of outcomes for each patient were then used to estimate the accuracy (sensitivity, specificity, and false discovery rates) of a wide range of predicted risk thresholds. We then implemented the prediction models into a web-accessible risk calculator.
Results
The 30-day mortality and major complication models had good to fair discrimination (C-indexes of 0.76 and 0.64, respectively) and good calibration throughout the range of predicted risk. Thresholds of predicted risk to identify patients at very high risk of 30-day mortality had high specificity but also high false discovery rates. For example, a 30-day mortality predicted risk threshold of 15% resulted in 97% specificity, meaning 97% of patients who lived longer than 30 days were below that risk threshold. However, this threshold had a false discovery rate of 78%, meaning 78% of patients above that threshold survived longer than 30 days and might have benefitted from surgery. The tool is available here: https://s-spire-clintools.shinyapps.io/hip_deploy/.
Conclusion
The models of mortality and complications we developed may be accurate enough for some uses, especially personalizing informed consent and shared decision-making with patient-specific risk estimates. However, the high false discovery rate suggests the models should not be used to restrict access to surgery for high-risk patients. Deciding which measures of accuracy to prioritize and what is “accurate enough” depends on the clinical question and use of the predictions. Discrimination and calibration are commonly used measures of overall model accuracy but may be poorly suited to certain clinical questions and applications. Clinically, overall accuracy may not be as important as knowing how accurate and useful specific values of predicted risk are for specific purposes.
Level of Evidence Level III, therapeutic study.
Introduction
Annually, more than 1.5 million hip fractures occur worldwide [8]. The 30-day postoperative mortality for surgical hip fracture repair is high, ranging on average from 5.0% to 13.3% [9, 10]. But those percentages are averages; the actual risk for an individual patient may be lower or higher than the average, depending on factors such as frailty, comorbidities, and prefracture functioning [6, 11]. The risk-reward calculus for most patients with hip fractures favors surgical repair [6, 11]. However, some patients have low prefracture functioning and/or very high risk of postoperative mortality, making the choice between surgical and nonsurgical management more difficult. A recent study of frail patients with hip fractures found that using a shared decision-making process, patients who chose nonoperative management had noninferior quality of life and treatment satisfaction compared with patients choosing surgical treatment [20]. The ability to preoperatively estimate patient-specific risk of mortality and major complications after hip fracture repair should inform these shared decision-making and personalized informed consent processes.
Although studies and meta-analyses have identified factors associated with mortality and major complications after surgical repair of hip fractures [1, 3, 15, 29], these studies were not designed to estimate the risks of experiencing adverse events for individual patients. Existing models and tools for predicting postoperative mortality and complications have several limitations; some come only from a single institution [4, 27], others have not been internally or externally validated [27], have low overall accuracy [4], have low accuracy for subgroups of patients [9], are limited to in-hospital mortality [28], or include laboratory values that are frequently missing or need additional testing [11, 25, 28]. Some of the most promising existing tools for predicting 30-day mortality and complications are the models reported by Pugely et al. [25]; in that study, the authors developed and internally validated reasonably accurate and well-calibrated models of 30-day mortality (C-index 0.70) and major complications (C-index 0.69). They also provided an interactive spreadsheet to facilitate use of the models (http://links.lww.com/BOT/A92). These models were derived from data on 4331 patients who underwent hip fracture repair who were identified in the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) database between 2005 and 2010. Although the Pugely et al. [25] models arguably represent the current best-in-class prediction tools in the context of hip fracture repair, they include inputs that may not be known preoperatively (such as operative duration or involvement of a resident), lab values, and patient demographics that may cause unintended exacerbation of health inequities [25, 32], and they are derived from data on procedures that occurred more than a decade ago, which may not reflect current surgical and clinical management.
In addition to improving the quality of personalized informed consent for all patients, perhaps one of the primary values of risk prediction in the context of hip fracture is to identify accurately those patients who are unlikely to survive 30 days after surgery [6, 11]. Patients who are very likely to die within 30 days may benefit from personalized information about the risk of surgery within a shared decision-making process and may wish to consider nonoperative management or palliative care rather than surgery [6, 20]. However, even a high level of short-term mortality risk might be tolerable in patients who have a chance to reap longer-term benefits. Therefore, the key challenge of prediction lies in identifying patients with very low probability of survival while also not restricting access to surgery for those who choose to take those risks. Although risk models cannot make these decisions, the accuracy metrics (such as specificity and false discovery rate) of various thresholds of predicted risk can be described and used within a shared decision-making process. To our knowledge, information on these tradeoffs is unavailable in any existing hip fracture prediction tool.
Therefore, the purposes of the current study were (1) to develop, validate, and estimate the overall accuracy of risk prediction models for 30-day mortality and complications after hip fracture surgery; (2) to evaluate the accuracy of risk prediction thresholds for identifying very high-risk patients; and (3) to implement the models in an accessible web calculator.
Patients and Methods
Study Design and Setting
In this comparative study using a large existing database, we extracted data on preoperative demographics, comorbidities, preoperatively known operative variables, and outcomes for all patients aged 18 years and older who underwent surgery for hip fracture in the ACS-NSQIP between 2011 and 2017. We chose the ACS-NSQIP database for its clinically abstracted and reliable data from more than 600 hospitals on important surgical outcomes, as well as its uniquely rich characterization of preoperative demographic and clinical predictors for demographically diverse patients [2, 12, 16, 17, 26].
Study Population
We used the 2011 to 2017 ACS-NSQIP Participant Use Files (PUF) to identify patients aged 18 years and older who underwent either: (1) surgical hip fracture fixation (Current Procedural Terminology [CPT]: 27235, 27236, 27244, 27245) or (2) THA (CPT 27130) or hemiarthroplasty (CPT 27125) with a diagnosis of hip fracture (ICD-9-Clinical Modification [CM] and ICD-10-CM: 820.x and S72.0xxx, S72.1xxx, respectively). Definitions of model inputs are described in the ACS-NSQIP PUF User Guides (https://www.facs.org/media/p1enl5ym/nsqip_puf_userguide_2017.pdf)
Participants and Baseline Data
A total of 82,168 patients met the study inclusion criteria (Table 1). The study group was 70% (57,260 of 82,168) female, 79% (65,301 of 82,168) White, and 82% (66,994 of 82,168) at least 70 years old (Table 1). The most common surgical procedures were intramedullary implant (43% [35,112 of 82,168]), followed by open reduction (32% [26,678 of 82,168]), plate/screw (13% [10,994 of 82,168]), hemiarthroplasty (8% [6816 of 82,168]), THA with a fracture diagnosis (3% [2282 of 82,168]), and percutaneous fixation (0.4% [286 of 82,168]). The 30-day postoperative mortality incidence was 5% (4260 of 82,168), and the major complication incidence was 8% (6786 of 82,168) (Table 2). Incidence for the six complication clusters ranged from less than 1% (591 of 82,168) for neurological complications to 6% (5288 of 82,168) for infectious complications (Table 2).
Table 1.
Characteristic | Total (n = 82,168) |
Age group in years | |
< 70 | 18 (15,174) |
70-74 | 9 (7471) |
75-79 | 12 (10,174) |
80-84 | 18 (14,541) |
85-89 | 22 (17,801) |
90+ | 21 (17,007) |
Female | 70 (57,260) |
Race/ethnicity | |
White | 79 (65,301) |
Black | 3 (2835) |
Other | 17 (14,032) |
Non-Hispanic | 95 (78,462) |
BMI category in kg/m2 | |
Underweight (< 18.5) | 7 (5845) |
Normal weight (18.5-24.9) | 44 (36,358) |
Overweight (25.0-29.9) | 30 (25,037) |
Obese class I (30.0-34.9) | 12 (9882) |
Obese class II (35.0-39.9) | 4 (3227) |
Obese class III (≥ 40.0) | 2 (1819) |
>10% loss body weight in last 6 months | 2 (1296) |
Diabetes mellitus with oral agents or insulin | |
Insulin | 8 (6447) |
No | 82 (67,269) |
Noninsulin | 10 (8452) |
Current smoker within 1 year | 12 (10,232) |
Functional health status before surgery | |
Independent | 79 (65,157) |
Partially dependent | 18 (14,472) |
Totally dependent | 3 (2539) |
Dyspnea | |
At rest | 1 (965) |
Moderate exertion | 6 (5099) |
No | 93 (76,104) |
History of severe COPD | 11 (9376) |
CHF in 30 days before surgery | 4 (2949) |
Hypertension treated with medication | 67 (54,930) |
Systemic sepsis | |
None | 89 (72,879) |
SIRS | 10 (8572) |
Sepsis | 0.8 (662) |
Septic shock | 0.06 (55) |
Steroid use for chronic condition | 6 (4609) |
Bleeding disorders | 16 (13,393) |
Preoperative dialysis | 2 (1628) |
Disseminated cancer | 3 (2303) |
Anesthetic type | |
General | 74 (60,997) |
Other | 26 (21,171) |
CPT | |
27125 Hemiarthroplasty | 8 (6816) |
27130 THA | 3 (2282) |
27235 Percutaneous fixation | 0.4 (286) |
27236 Open treatment (fixation or prosthetic) | 32 (26,678) |
27244 Plate/screw fixation (sliding compression screw) | 13 (10,994) |
27245 Intramedullary implant (cephalomedullary nail) | 43 (35,112) |
Data presented as % (n); COPD = chronic obstructive pulmonary disease; CHF = congestive heart failure; SIRS = systemic inflammatory response syndrome.
Table 2.
Complication outcomes | Specific complication | Total (n = 82,168) |
30-day mortality | | 5 (4260) |
Any major morbidity | | 8 (6786) |
Six complication groups | | |
Pulmonary complications | | 5 (4264) |
| On ventilator > 48 hours^a | < 1 (570) |
| Intraoperative or postoperative unplanned intubation^a | 1 (1093) |
| Pneumonia | 4 (3254) |
| Septic shock^a | < 1 (533) |
Infectious complications | | 6 (5288) |
| Superficial SSI | < 1 (554) |
| Deep incisional SSI^a | < 1 (247) |
| Organ/space SSI^a | < 1 (158) |
| Wound disruption^a | < 1 (64) |
| Urinary tract infection | 5 (3795) |
| Sepsis^a | 1 (1014) |
Cardiac complications | | 2 (1868) |
| Intraoperative or postoperative myocardial infarction^a | 2 (1323) |
| Intraoperative or postoperative cardiac arrest requiring cardiopulmonary resuscitation^a | < 1 (622) |
Renal complication | | < 1 (611) |
| Acute renal failure requiring dialysis^a | < 1 (282) |
| Progressive renal insufficiency | < 1 (335) |
Venous thromboembolic complication | | 2 (1472) |
| Vein thrombosis requiring therapy | 1 (952) |
| Pulmonary embolism^a | < 1 (619) |
Neurological complication | | < 1 (591) |
| Stroke/cerebral vascular accident^a | < 1 (591) |
Other complication | | |
| Return to operating room^a | 3 (2181) |
Data presented as % (n); SSI = surgical site infection.
^a Signifies inclusion in calculation of any major complication.
Outcomes and Candidate Predictors
The two primary outcomes were 30-day postoperative mortality and any 30-day major complication. Major complications, patterned after Pugely et al. [25], were defined by the occurrence of any of the following: on a ventilator more than 48 hours, intraoperative or postoperative unplanned intubation, septic shock, deep incisional surgical site infection (SSI), organ/space SSI, wound disruption, sepsis, intraoperative or postoperative myocardial infarction, intraoperative or postoperative cardiac arrest requiring cardiopulmonary resuscitation, acute renal failure needing dialysis, pulmonary embolism, stroke/cerebral vascular accident (CVA), and return to the operating room. We did not include ACS-NSQIP complications involving less serious events, including superficial wound infection, pneumonia, deep vein thrombosis, renal insufficiency, and urinary tract infection (UTI) (Table 2) [25]. We also excluded blood transfusions from our definition of major complications because they are not unexpected in this context. Although the events included as major complications (such as postoperative myocardial infarction, sepsis, or acute renal failure requiring dialysis) are not equal in terms of clinical significance, they are all clinically serious.
Secondary outcomes were the six clusters of complications that were recently developed and have been used increasingly in the development of surgical risk models [23], namely: (1) pulmonary complications (on a ventilator longer than 48 hours, intraoperative or postoperative unplanned intubation, pneumonia, and septic shock); (2) infectious complications (superficial SSI, deep incisional SSI, organ/space SSI, wound disruption, UTI, and sepsis); (3) cardiac complications (intraoperative or postoperative myocardial infarction, and intraoperative or postoperative cardiac arrest requiring cardiopulmonary resuscitation); (4) renal complications (acute renal failure requiring dialysis and progressive renal insufficiency); (5) venous thromboembolic events (vein thrombosis requiring therapy and pulmonary embolism); and (6) neurological events (stroke/CVA). Note that these complication clusters are homogeneous in terms of body systems but heterogeneous in terms of severity. We removed transfusions from the cardiac complication cluster because they differ from the other events and are not unexpected in this context.
We chose demographic and clinically important variables a priori as candidate inputs in four domains: (1) preoperative demographics: age, sex; (2) preoperative health and healthcare utilization: BMI, recent weight loss, smoking history, steroid use for chronic condition, dialysis use, previous operation within 30 days; (3) preoperative comorbidities: diabetes mellitus, dyspnea, chronic obstructive pulmonary disease, coronary artery disease, congestive heart failure, hypertension requiring medication, CVA/stroke with neurological deficit, preoperative sepsis, bleeding disorder, active cancer; and (4) operative variables known preoperatively: anesthetic type (general versus other), procedure type (hemiarthroplasty [CPT 27125], THA [CPT 27130], percutaneous fixation [CPT 27235], open reduction [CPT 27236], plate/screw [CPT 27244], intramedullary [CPT 27245]). None of the included variables contained missing data. Although included in previous models [25], operative duration and resident involvement were not considered because they may be unknown in the preoperative period when the predictions would be useful. Furthermore, duration of surgery may result from adverse outcomes rather than being a precursor. Lab values were also not considered due to concerns about nonrandom missingness [22]. Race and ethnicity were not considered as predictors in the models because of concerns about unintended bias and exacerbation of health inequities [18, 19, 25, 32].
Ethical Approval
The Stanford University Institutional Review Board approved this study (protocol ID 50172).
Statistical Analysis
We used a machine-learning approach, the least absolute shrinkage and selection operator (LASSO) logistic regression, which is designed to consider and select from a large number of candidate predictors to produce accurate models that require as few inputs as possible. LASSO regression is known to have many desirable properties for prediction problems with a large number of candidate covariates and performs variable selection by excluding unnecessary covariates from the model [31]. Minimizing the number of inputs is important in this context where time and effort are required to manually enter patient data into the calculator. We applied 10-fold cross-validation to estimate model coefficients and to select the tuning parameter λ for each fold, which sets the balance between model accuracy (defined by the C-index) and parsimony. Variables with nonzero coefficients in cross-validation at the specific λ were retained, and their coefficients were included in the prediction model. Tenfold cross-validation involves estimating the parameters with a randomly selected 90% of the data (training set) and testing the accuracy of the model on the remaining 10% of the data (test set), then repeating the process with the other nine folds. Final coefficients and C-indexes are obtained by averaging across the results.
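The fold structure described above can be sketched in a few lines. This is only an illustration of how 10 train/test splits might be formed, not the study's code (which used the glmnet package in R); the function name `k_fold_indices`, its parameters, and the seed are hypothetical, and the model-fitting step is deliberately left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; it only partitions row indices.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of rows to folds
    # Fold j takes every k-th shuffled index starting at position j,
    # so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test_idx = folds[j]
        # The training set is the union of the other k - 1 folds.
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_idx, test_idx


# Each split would be used to fit a model on train_idx and score it on test_idx.
for train_idx, test_idx in k_fold_indices(100, k=10):
    assert len(train_idx) == 90 and len(test_idx) == 10
```

With 10 folds, every patient appears in exactly one test set, so each accuracy estimate comes from data the corresponding model did not see during fitting.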
Regarding overall model accuracy, we assessed two important dimensions of predictive performance: discrimination and calibration [30]. Discrimination is the probability that a patient who experiences the outcome will have a higher predicted probability compared with a patient who does not experience the outcome. Discrimination is measured on the test (versus training) data with the area under the receiver operating characteristic curve (C-index) and associated 95% confidence interval (CI). Because the sample size is very large and the C-indexes are descriptive rather than hypothesis testing, we did not report p values. A C-index of 0.5 suggests the model’s discrimination is no better than random chance, and a C-index of 1.0 indicates the model can perfectly discriminate an occurrence versus nonoccurrence in any pair of individuals. We assessed calibration by plotting the observed versus expected rates in each tenth of the data and visually inspecting the plots.
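The pairwise definition of discrimination given above translates directly into code. The following is a minimal sketch, assuming binary outcomes coded 1 (event) and 0 (no event) and counting tied predictions as half-concordant; the function name `c_index` is hypothetical, and the all-pairs loop is meant for clarity rather than for a dataset of this study's size.

```python
def c_index(risks, outcomes):
    """Share of (event, non-event) pairs in which the event has the higher predicted risk.

    Ties in predicted risk count as 0.5 (an assumption of this sketch).
    """
    event_risks = [r for r, y in zip(risks, outcomes) if y == 1]
    nonevent_risks = [r for r, y in zip(risks, outcomes) if y == 0]
    if not event_risks or not nonevent_risks:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in event_risks:
        for n in nonevent_risks:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(event_risks) * len(nonevent_risks))


# Hypothetical predicted risks for four patients, two of whom had the outcome.
example = c_index([0.8, 0.3, 0.6, 0.2], [1, 1, 0, 0])
```

In the toy example, three of the four event/non-event pairs are ordered correctly, so the function returns 0.75.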
Our second purpose was to calculate three measures of accuracy (sensitivity, specificity, and false discovery rate) for various thresholds of predicted risk of 30-day mortality and major complications. For example, patients above some threshold of predicted risk of 30-day mortality (such as 30%) might be targeted for more intensive informed consent or shared decision-making to consider nonoperative management. Each candidate threshold of predicted risk will have its own balance of sensitivity (the proportion of people who experienced the outcome who are above the threshold), specificity (the proportion of people who did not experience the outcome who are below the threshold), and false discovery rate (the proportion of people above the threshold who did not experience the outcome). A lower threshold of predicted risk would maximize sensitivity by capturing a greater proportion of patients who will die, but it would also produce more false positives. A higher threshold of predicted risk would maximize specificity by capturing a greater proportion of patients who will not die, but it will also produce more false negatives (missing patients who will die). Deciding which measures of accuracy to prioritize and what is “accurate enough” depends on the clinical question and the use of the predictions. We performed data management and descriptive analysis in SAS 9.4 (SAS), and we performed logistic LASSO regressions using the “glmnet” package in R v.3.6.0 (R Core Team) [13].
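The three threshold-specific measures defined above can be computed from a simple tally of patients above and below a cutoff. The sketch below assumes outcomes coded 1 (event) and 0 (no event) and treats a predicted risk exactly equal to the threshold as flagged; that tie rule, the function name `threshold_metrics`, and the toy data are assumptions for illustration, not details taken from the study.

```python
def threshold_metrics(risks, outcomes, threshold):
    """Sensitivity, specificity, and false discovery rate at one risk threshold."""
    tp = fp = tn = fn = 0
    for risk, outcome in zip(risks, outcomes):
        flagged = risk >= threshold  # assumption: "at or above" counts as above
        if flagged and outcome == 1:
            tp += 1
        elif flagged and outcome == 0:
            fp += 1
        elif not flagged and outcome == 0:
            tn += 1
        else:
            fn += 1
    return {
        # events that were flagged / all events
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        # non-events that were not flagged / all non-events
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        # flagged patients without the event / all flagged patients
        "false_discovery_rate": fp / (tp + fp) if (tp + fp) else None,
    }


# Hypothetical predicted risks and outcomes for six patients.
toy_risks = [0.02, 0.05, 0.12, 0.20, 0.30, 0.40]
toy_outcomes = [0, 0, 0, 1, 0, 1]
metrics = threshold_metrics(toy_risks, toy_outcomes, threshold=0.15)
```

Note that specificity is conditioned on patients who did not experience the outcome, whereas the false discovery rate is conditioned on patients above the threshold, which is why a threshold can have both a high specificity and a high false discovery rate when the outcome is uncommon.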
Implementation of a Web-accessible Calculator
We used the “Shiny” package in R v.3.6.0 (R Core Team 2021) [7] to develop and implement the prediction models into web-accessible calculators.
Results
Risk Prediction Models for 30-day Mortality and Major Complications After Hip Fracture Surgery
For 30-day mortality, 15 predictors were selected (Table 3), and the model resulted in good discrimination with a C-index of 0.76 (95% CI 0.75 to 0.76) (Table 4). In the 30-day major complication model, 18 predictors were selected with a C-index of 0.64 (95% CI 0.64 to 0.65). The Hosmer-Lemeshow plots demonstrated that observed and predicted 30-day mortality (Fig. 1) and complication rates (Fig. 2) were similar, signifying good model calibration throughout the range of predicted risk. Importantly, the calibration was good in the highest risk deciles, with the models very modestly underestimating observed events.
Table 3.
Variable | Mortality | Major complications |
Intercept | -3.53 | -2.71 |
Female sex | -0.50 | -0.26 |
Age group in years | ||
75-79 vs younger than 70 | ||
80-84 vs younger than 70 | 0.34 | 0.05 |
85-89 vs younger than 70 | 0.57 | 0.04 |
90+ vs younger than 70 | 1.09 | 0.06 |
Preoperative health and comorbidities | ||
BMI category in kg/m2 | ||
Underweight (< 18.5) vs normal weight (18.5-24.9) | 0.31 | |
Overweight (25.0-29.9) vs normal weight | -0.14 | |
Obese class I (30.0-34.9) vs normal weight | -0.17 | |
Obese class II (35.0-39.9) vs normal weight | -0.01 | 0.29 |
>10% loss body weight (within 6 months) | 0.66 | 0.28 |
Diabetes mellitus with oral agents or insulin | ||
Insulin vs none | 0.05 | 0.23 |
Oral medication vs none | ||
Cigarette smoker (within 1 year) | -0.10 | |
Functional health status before surgery | ||
Partially dependent vs independent | 0.67 | 0.16 |
Totally dependent vs independent | 1.03 | 0.15 |
Dyspnea (within 30 days) | ||
Moderate exertion vs none | 0.22 | 0.34 |
At rest vs none | 0.53 | 0.28 |
History of severe COPD | 0.40 | 0.25 |
CHF (within 30 days) | 0.63 | 0.41 |
Hypertension | 0.16 | |
Systemic sepsis (within 48 hours) | ||
SIRS vs none | 0.39 | 0.36 |
Sepsis vs none | 0.54 | 1.09 |
Septic shock vs none | 1.39 | 1.41 |
Steroids | 0.03 | 0.13 |
Bleeding disorder | 0.20 | 0.21 |
Hemodialysis | 0.56 | 0.48 |
Active cancer | 1.14 | 0.28 |
Operative variables | ||
General anesthetic type | 0.03 | |
CPT | ||
Hemiarthroplasty vs intramedullary | 0.10 | |
THA vs intramedullary | -0.05 | 0.05 |
Percutaneous fixation vs intramedullary | ||
Open reduction vs intramedullary | 0.07 | |
Plate/screw vs intramedullary |
Only variables selected and estimated by LASSO are listed in the table. Note that some of the categories within the selected variables were not selected but are included for completeness. For each model, all the coefficients, including an offset term (intercept), are used in a formula to produce patient-level predicted risk. The coefficients are presented here so the reader can see which variables were selected, as well as the magnitude and direction of their influence on the calculation of risk. Methods do not exist to produce confidence intervals or p values for LASSO logistic regression coefficients. The goal is to produce a patient-level risk estimate, not to interpret the causal relationships between the variables and outcomes. It is important not to infer which or how variables are causally related to the outcomes. The purpose is prediction, not explanation or statistical hypothesis testing; COPD = chronic obstructive pulmonary disease; CHF = congestive heart failure; SIRS = systemic inflammatory response syndrome; CPT = Current Procedural Terminology.
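As a rough illustration of how coefficients and an intercept combine into a patient-level predicted risk, the sketch below applies the standard logistic formula, risk = 1 / (1 + exp(-(intercept + sum of applicable coefficients))), to a small subset of the mortality coefficients in Table 3. The dictionary keys and the function name `predicted_risk` are hypothetical, and because only a few coefficients are included, the output does not reproduce the published calculator and should not be read as a clinical estimate.

```python
import math

# Subset of Table 3 mortality coefficients, used for illustration only.
MORTALITY_INTERCEPT = -3.53
MORTALITY_COEFS = {
    "female": -0.50,
    "age_90_plus": 1.09,         # 90+ vs younger than 70
    "totally_dependent": 1.03,   # vs independent
    "chf_within_30_days": 0.63,
}


def predicted_risk(features, intercept=MORTALITY_INTERCEPT, coefs=MORTALITY_COEFS):
    """Convert indicator features into a probability with the logistic function.

    `features` maps a coefficient name to True/False; names that are not in
    `coefs` raise KeyError so that typos are not silently ignored.
    """
    linear_predictor = intercept + sum(
        coefs[name] for name, present in features.items() if present
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical patient: a woman aged 90 or older who is totally dependent.
risk = predicted_risk({"female": True, "age_90_plus": True, "totally_dependent": True})
```

Because the coefficients add on the log-odds scale, a positive coefficient raises the predicted risk and a negative one lowers it, but the size of the change in probability depends on the patient's other inputs.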
Table 4.
Outcome | C-index (95% CI) |
Mortality | 0.76 (0.75-0.76) |
Any major complication | 0.64 (0.64-0.65) |
Six clusters of complications | |
Pulmonary complication | 0.69 (0.68-0.70) |
Infectious complication | 0.59 (0.58-0.60) |
Cardiac complication | 0.68 (0.67-0.69) |
Renal complication | 0.69 (0.67-0.72) |
Venous thromboembolic | 0.59 (0.58-0.61) |
Neurological complication | 0.61 (0.59-0.63) |
The C-index, also known as the area under the receiver operating characteristic curve, is a measure of model discrimination, which is one dimension of the overall accuracy of the model through its range of predicted risk. It can be thought of as the probability that a randomly selected person who experienced the outcome will have a higher predicted risk than a randomly selected person who did not experience the outcome. A C-index of 0.50 indicates discrimination no better than chance. C-indexes nearing 0.70 are generally considered moderate to good. Measures of overall model discrimination over the entire range of predicted probability might not be as important as calibration, or specificity and false discovery rates of specific thresholds of predicted risk, depending on how the predictions are being used.
For the models predicting the complication clusters, which are homogeneous in terms of bodily systems but heterogeneous in terms of severity, discrimination was highest for pulmonary and renal complications (with C-indexes of 0.69 [95% CI 0.68 to 0.70] and 0.70 [95% CI 0.67 to 0.72], respectively) and lowest for venous thromboembolic and infectious complications (with C-indexes of 0.60 [95% CI 0.58 to 0.61] and 0.61 [95% CI 0.60 to 0.62], respectively). The LASSO coefficients for the models show that different variables and different numbers of variables are optimal for predicting the six complication clusters (Supplementary Table 1; http://links.lww.com/CORR/A841).
Accuracy of Prediction Threshold for Identifying Very High-risk Patients
We then calculated the sensitivity, specificity, and false discovery rates of different thresholds of predicted risk from the 30-day mortality model (Supplementary Table 2; http://links.lww.com/CORR/A842) and for the model predicting major complications (Supplementary Table 3; http://links.lww.com/CORR/A843). For the 30-day mortality model, predicted risk of the sample had a mean of 5%, a minimum of 1%, a 75th percentile of 6%, and a maximum of 83%. Using a 10% predicted risk threshold for 30-day mortality resulted in a specificity of 92% but a false discovery rate of 83%. That means that 92% of patients who did not die within 30 days had a predicted risk lower than 10% (true negatives). However, 83% of patients above the threshold were false positives, meaning they did not die within 30 days. A threshold of 15% produced a specificity of 97% and a false discovery rate of 78%. Even for the 23 patients with predicted risk more than 50%, that threshold produced a false discovery rate of 35%.
For the 30-day major complication model, predicted risk of the sample had a mean of 8.3%, a minimum of 4.9%, a 75th percentile of 9.2%, and a maximum of 60.3%. A predicted risk threshold of 10% produced a specificity of 82.9% and a false discovery rate of 85.7%. A threshold of 20% produced a specificity of 98.9% and a false discovery rate of 74.6%.
Implementing the Models in a Web-accessible Interface for Clinical Decision-making
A web-based calculator for the prediction models was developed and can be accessed at https://s-spire-clintools.shinyapps.io/hip_deploy/.
Discussion
Although the risk-reward calculus for most patients with hip fractures favors surgical repair [6, 11], the choice between surgical and nonsurgical management is more difficult for patients who are very frail and/or have very high risk of postoperative mortality. A recent study illustrated the importance and value of engaging frail patients with hip fractures and their families in a shared decision-making process regarding surgical and nonsurgical treatment options [20]. The ability to preoperatively estimate patient-specific risk of mortality and major complications after hip fracture operations should inform and improve these processes. However, currently available prediction models have several limitations we sought to address in this study. We developed and validated prediction models for 30-day mortality and major complications that have reasonable overall accuracy (discrimination and calibration), described the specificity and false discovery rates for specific thresholds of risk, and implemented the models into a web-accessible risk calculator. These models of mortality and complications may be accurate enough for personalizing informed consent and shared decision‐making with patient-specific risk estimates. However, even very high-risk estimates produced by these models should not be used to routinely deny patients access to surgery, as many patients with very high risk of 30-day mortality survive longer than predicted. Patients with very high surgical risk and their families should be engaged in a shared decision-making process to discuss surgical and nonsurgical treatment options.
Limitations
This study has several limitations. Most surgical risk calculators share the first limitation: The sample did not include patients who did not receive surgery. Therefore, the results only apply to patients who are possible candidates for surgery and cannot be used to compare risks of surgical and nonsurgical management. Second, the design and methods of this study preclude any causal conclusions regarding model inputs. The most important implication of this limitation is that the models should not be used to choose which procedure to perform. The models give risk estimates for patients who are receiving a specific procedure that is being selected for other clinical reasons. The models cannot be used to compare the risks of different procedures for a particular patient. Third, due to the low prevalence of our primary outcome, we used all of the data available to us to develop and internally validate these models. Tenfold cross-validation is a strong method to ensure the reported accuracy of the models is not overly optimistic for patients like those in the dataset (internal validation) but does not establish the accuracy of the models for patients from different times or settings (external validation). Caution is needed in using these models with other patient populations and settings unlike NSQIP hospitals until they can be validated in those settings. Fourth, prediction intervals for the model outputs would be valuable to give a sense of the uncertainty of the estimates. Unfortunately, methods to produce prediction intervals for LASSO regression models do not yet exist. Fifth, the NSQIP database does not include many variables that might inform treatment selection and outcomes, such as bone quality, fracture displacement, substance use disorders, social support, and perceived healing potential.
Sixth, we also developed and validated prediction models for six complication clusters that have recently been the target of surgical prediction modeling more generally [21]; however, it is important to note that these clusters group rare serious complications with more common, nonserious complications, making straightforward interpretation more difficult.
Discussion of Key Findings
Engaging patients and their caregivers in a high-quality shared decision-making process, including patient-specific estimates of surgical risk, can be influential in treatment selection and satisfaction [20]. The models developed and internally validated in this study can be used in conversations with patients and their families to provide patient-specific risks of surgical treatment. The 30-day hip fracture surgery mortality model developed in this study has better overall accuracy (C-index 0.76 versus 0.70) than the previous best-in-class model developed by Pugely et al. [25]. The 30-day major complication model was less accurate (C-index 0.64 versus 0.69) but addressed other limitations of the previous models. We intentionally did not include resident involvement and lab values as candidate inputs because of concerns that these inputs might not be known at the time the decision to operate was made. Also, many lab values have nonrandom missing data in NSQIP datasets, making their use in model development problematic [22]. Another popular option is the ACS surgical risk calculator (https://riskcalculator.facs.org/RiskCalculator/index.jsp), which allows the selection of the procedure code as an input but not the indication (such as hip fracture) for the procedure. Although it was not the purpose of this study to quantitatively compare risk estimates between models, we entered data for a few prototypical patients into our models and the ACS calculator to get a sense of how the results compare. The risk estimates for mortality and major complications were roughly similar for CPT codes that are inherently about fracture repair (like CPT 27236). However, for the arthroplasty CPT codes, which cover procedures performed for fractures as well as other indications, our mortality estimates were four to five times higher than the ACS model estimates.
Therefore, we believe that the models developed in this study are currently the best choice for producing patient-specific estimates of 30-day mortality and major complication risk, information that should be included in consent and shared decision‐making for patients and families considering surgical repair of hip fracture.
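To make the C-index values compared above more concrete, the following is a minimal sketch of how a C-index can be computed for a binary outcome such as 30-day mortality. It is not the code used in this study; the function name `c_index` and the toy data are hypothetical and chosen only for illustration. The C-index is the proportion of all (event, non-event) patient pairs in which the patient who experienced the event received the higher predicted risk, with ties counted as one-half.

```python
from itertools import product


def c_index(predicted_risk, outcome):
    """Concordance (C-index) for a binary outcome.

    Compares every patient with the event (outcome == 1) against every
    patient without it (outcome == 0). A pair is concordant when the
    patient with the event has the higher predicted risk; ties count 0.5.
    """
    events = [p for p, y in zip(predicted_risk, outcome) if y == 1]
    non_events = [p for p, y in zip(predicted_risk, outcome) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for event_risk, non_event_risk in product(events, non_events):
        if event_risk > non_event_risk:
            concordant += 1.0
        elif event_risk == non_event_risk:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical toy data: predicted 30-day mortality risk and observed outcome.
toy_risk = [0.02, 0.05, 0.10, 0.30, 0.40, 0.08]
toy_died = [0, 0, 1, 0, 1, 0]

print(c_index(toy_risk, toy_died))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect ranking of patients by risk.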
Unlike some other clinical contexts where it is important to discriminate throughout the range of risk, for hip fractures, perhaps the hardest decisions occur at the high end of the risk scale. The results of our analyses allow users to select different thresholds of predicted risk to optimize for sensitivity, specificity, and false discovery rate. We argue that maximizing specificity (keeping the true negative rate high) is more important than sensitivity. However, the false discovery rate seems most relevant for the question of who might be encouraged to consider forgoing surgery. By using a higher threshold of predicted risk for avoiding surgery, more patients who might benefit from surgery will receive it, but there is no threshold high enough to avoid restricting surgery from patients who might live long enough to benefit. To emphasize this point: We cannot recommend using any risk threshold from these models to restrict access to surgical repair of hip fracture. However, the thresholds could be used to trigger a more extensive consent and shared decision-making process and to offer information to surgeons, patients, and their families about the accuracy of risk estimates of 30-day mortality and complications for specific patients considering specific surgical procedures.
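The trade-off among thresholds described above can be illustrated with a short sketch. This is not the analysis code from this study; the function name `threshold_metrics`, the threshold values, and the toy data are hypothetical. For a given threshold of predicted risk, patients at or above the threshold are flagged as very high risk, and sensitivity, specificity, and the false discovery rate (the share of flagged patients who did not experience the outcome) are computed from the resulting two-by-two table.

```python
def threshold_metrics(predicted_risk, outcome, threshold):
    """Sensitivity, specificity, and false discovery rate at one threshold.

    Patients with predicted risk >= threshold are flagged as very high risk.
    A metric is returned as None when its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for risk, y in zip(predicted_risk, outcome):
        flagged = risk >= threshold
        if flagged and y == 1:
            tp += 1  # flagged, and the outcome occurred
        elif flagged and y == 0:
            fp += 1  # flagged, but the outcome did not occur
        elif y == 1:
            fn += 1  # not flagged, but the outcome occurred
        else:
            tn += 1  # not flagged, and the outcome did not occur

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "false_discovery_rate": ratio(fp, tp + fp),
    }


# Hypothetical toy data: predicted 30-day mortality risk and observed outcome.
toy_risk = [0.02, 0.05, 0.10, 0.30, 0.40, 0.08, 0.55, 0.60]
toy_died = [0, 0, 1, 0, 1, 0, 0, 1]

for cutoff in (0.30, 0.50):
    print(cutoff, threshold_metrics(toy_risk, toy_died, cutoff))
```

In this toy example, raising the threshold from 0.30 to 0.50 increases specificity and lowers sensitivity, while half of the flagged patients survive at either threshold. This is the kind of pattern that argues against using any single threshold to restrict access to surgery.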
These results highlight the importance of being more precise about which questions we expect predictive models to inform. Deciding which measures of accuracy to prioritize and what is accurate enough depends on the clinical question and the use of the predictions. Discrimination and calibration are commonly used measures of model accuracy but may be poorly suited to particular clinical questions. The models of mortality and complications we developed may be accurate enough for some uses (such as for personalized informed consent and shared decision‐making) but perhaps not for others, including restricting access to surgery for high-risk patients.
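Calibration, mentioned above alongside discrimination, asks whether predicted risks agree with observed event rates. One simple way to inspect it, sketched below under stated assumptions, is to sort patients by predicted risk, split them into equal-sized groups, and compare the mean predicted risk with the observed event rate in each group. The function name `calibration_table`, the number of groups, and the toy data are hypothetical; this is an illustration rather than the calibration method used in this study.

```python
def calibration_table(predicted_risk, outcome, n_bins=2):
    """Compare mean predicted risk with observed event rate by risk group.

    Patients are sorted by predicted risk and split into n_bins groups of
    (nearly) equal size. Each returned row is
    (mean predicted risk, observed event rate, group size).
    """
    pairs = sorted(zip(predicted_risk, outcome))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        group = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue  # can happen when there are more bins than patients
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_rate = sum(y for _, y in group) / len(group)
        rows.append((mean_predicted, observed_rate, len(group)))
    return rows


# Hypothetical toy data: predicted 30-day mortality risk and observed outcome.
toy_risk = [0.02, 0.05, 0.10, 0.30, 0.40, 0.08, 0.55, 0.60]
toy_died = [0, 0, 1, 0, 1, 0, 0, 1]

for mean_predicted, observed_rate, size in calibration_table(toy_risk, toy_died):
    print(f"n={size} predicted={mean_predicted:.4f} observed={observed_rate:.4f}")
```

A well-calibrated model shows predicted and observed values that are close within each group. A model can rank patients well (high C-index) and still be poorly calibrated, which is one reason the choice of accuracy measure depends on the clinical question.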
The inclusion of race in diagnostic algorithms or risk predictions has come under increased scrutiny and criticism [18, 19, 24, 32]. Some have argued that the genetic basis of race and the potential for more accurate race-specific predictions justifies the inclusion of race/ethnicity as model inputs [5]. More recently, others have pointed out that race is a poor proxy for genetic differences, and its use in prediction models, no matter how well intentioned, may further restrict healthcare access for already underserved groups [24, 32]. Thus, the decision to include race and ethnicity as potential inputs to prediction models must be informed by a careful consideration of the meaning of race variables, possible causal/biological mechanisms, and balancing the hopes of improved accuracy and precision medicine with the risks of unintentionally hurting (over- or undertreating) some people and groups. Although we did not consider race and ethnicity as inputs to the models, we are conducting separate analyses to examine how their inclusion might impact the results.
Conclusion
The importance of high-quality shared decision-making for patients, especially frail patients, with hip fracture has recently been demonstrated [20]. Clinical prediction models, such as those developed in this study, should be used to provide patient-specific risk information to patients and their families within consent and shared decision-making conversations in the context of hip fracture. However, the path from accurate statistical models to real improvement in patient outcomes is complex [14]. For model outputs to be useful, they must be viewed and understood by those involved in making treatment decisions and must provide them with new information. Next steps should include testing whether specific methods of providing model predictions, such as personalized informed consent, clinical decision support tools, and shared decision-making that includes patient-specific risk estimates, can improve decision-making quality and patient outcomes.
Footnotes
The institution of one or more of the authors (AHSH) has received, during the study period, funding from the Veterans Affairs (VA) Health Services Research & Development Service (IIR 13-051-3; RCS14-232) and support from the Stanford–Surgical Policy Improvement Research and Education Center (S-SPIRE).
Each author certifies that there are no funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article related to the author or any immediate family members.
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
Ethical approval for this study was obtained from the Stanford University Institutional Review Board (protocol ID 50172).
This work was performed at the VA Palo Alto Center for Innovation to Implementation (Ci2i) and the Stanford–Surgical Policy Improvement Research and Education Center (S-SPIRE).
The views expressed do not reflect those of the US Department of Veterans Affairs or other institutions.
Contributor Information
Amber W. Trickey, Email: atrickey@stanford.edu.
Hyrum S. Eddington, Email: hyrumedd@stanford.edu.
Carolyn D. Seib, Email: carolyn.seib@va.gov.
Robin N. Kamal, Email: rnkamal@stanford.edu.
Alfred C. Kuo, Email: alfred.kuo@va.gov.
Qian Ding, Email: qiandingut@gmail.com.
Nicholas J. Giori, Email: ngiori@stanford.edu.
References
- 1. Alvarez-Nebreda ML, Jimenez AB, Rodriguez P, Serra JA. Epidemiology of hip fracture in the elderly in Spain. Bone. 2008;42:278-285.
- 2. American College of Surgeons-National Surgical Quality Improvement Program. User Guide for the 2017 ACS NSQIP Participant Use Data File (PUF). Available at: https://www.facs.org/media/p1enl5ym/nsqip_puf_userguide_2017.pdf. Accessed May 24, 2022.
- 3. Aranguren-Ruiz MI, Acha-Arrieta MV, Casas-Fernandez de Tejerina JM, Arteaga-Mazuelas M, Jarne-Betran V, Arnaez-Solis R. Risk factors for mortality after surgery of osteoporotic hip fracture in patients over 65 years of age. Rev Esp Cir Orthop Traumatol. 2017;61:185-192.
- 4. Blay-Domínguez E, Lajara-Marco F, Bernáldez-Silvetti PF, et al. O-POSSUM score predicts morbidity and mortality in patients undergoing hip fracture surgery. Rev Esp Cir Orthop Traumatol. 2018;62:207-2105.
- 5. Burchard EG, Ziv E, Coyle N, et al. The importance of race and ethnic background in biomedical research and clinical practice. N Engl J Med. 2003;348:1170-1175.
- 6. Cannada L, Mears S, Quatman C. Clinical Faceoff: When should patients 65 years of age and older have surgery for hip fractures, and when is it a bad idea? Clin Orthop Relat Res. 2021;479:24-27.
- 7. Chang W, Cheng J, Allaire J, Xie Y, McPherson J. Shiny: web application framework for R. 2020. Available at: https://cran.r-project.org/web/packages/shiny/index.html. Accessed June 3, 2022.
- 8. Cheng SY, Levy AR, Lefaivre KA, Guy P, Kuramoto L, Sobolev B. Geographic trends in incidence of hip fractures: a comprehensive literature review. Osteoporos Int. 2011;22:2575-2586.
- 9. de Jong L, Mal Klem T, Kuijper TM, Roukema GR. Validation of the Nottingham Hip Fracture Score (NHFS) to predict 30-day mortality in patients with an intracapsular hip fracture. Rev Esp Cir Orthop Traumatol. 2019;105:485-489.
- 10. Dubljanin Raspopovic E, Markovic Denic L, Marinkovic J, et al. Early mortality after hip fracture: what matters? Psychogeriatrics. 2015;15:95-101.
- 11. Etscheidt J, McHugh M, Wu J, Cowen ME, Goulet J, Hake M. Validation of a prospective mortality prediction score for hip fracture patients. Eur J Orthop Surg Traumatol. 2021;31:525-532.
- 12. Fink AS, Campbell DA, Jr, Mentzer RM, Jr, et al. The National Surgical Quality Improvement Program in non-veterans administration hospitals: initial demonstration of feasibility. Ann Surg. 2002;236:344-353.
- 13. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1-22.
- 14. Harris AH. Path from predictive analytics to improved patient outcomes: a framework to guide use, implementation, and evaluation of accurate surgical predictive models. Ann Surg. 2017;265:461-463.
- 15. Hu F, Jiang C, Shen J, Tang P, Wang Y. Preoperative predictors for mortality following hip fracture surgery: a systematic review and meta-analysis. Injury. 2012;43:676-685.
- 16. Khuri SF, Daley J, Henderson W, et al. The Department of Veterans Affairs' NSQIP: the first national, validated, outcome-based, risk-adjusted, and peer-controlled program for the measurement and enhancement of the quality of surgical care. National VA Surgical Quality Improvement Program. Ann Surg. 1998;228:491-507.
- 17. Khuri SF, Henderson WG, Daley J, et al. Successful implementation of the Department of Veterans Affairs' National Surgical Quality Improvement Program in the private sector: the Patient Safety in Surgery study. Ann Surg. 2008;248:329-336.
- 18. Leopold SS. Editorial: Beware of studies claiming that social factors are independently associated with biological complications of surgery. Clin Orthop Relat Res. 2019;477:1967-1969.
- 19. Leopold SS, Beadling L, Calabro AM, et al. Editorial: The complexity of reporting race and ethnicity in orthopaedic research. Clin Orthop Relat Res. 2018;476:917-920.
- 20. Loggers SAI, Willems HC, Van Balen R, et al. Evaluation of quality of life after nonoperative or operative management of proximal femoral fractures in frail institutionalized patients: the FRAIL-HIP Study. JAMA Surg. 2022;157:424-434.
- 21. Meguid RA, Bronsert MR, Juarez-Colunga E, Hammermeister KE, Henderson WG. Surgical Risk Preoperative Assessment System (SURPAS): I. Parsimonious, clinically meaningful groups of postoperative complications by factor analysis. Ann Surg. 2016;263:1042-1048.
- 22. Meguid RA, Bronsert MR, Juarez-Colunga E, Hammermeister KE, Henderson WG. Surgical Risk Preoperative Assessment System (SURPAS): II. Parsimonious risk models for postoperative adverse outcomes addressing need for laboratory variables and surgeon specialty-specific models. Ann Surg. 2016;264:10-22.
- 23. Meguid RA, Bronsert MR, Juarez-Colunga E, Hammermeister KE, Henderson WG. Surgical Risk Preoperative Assessment System (SURPAS): III. Accurate preoperative prediction of 8 adverse outcomes using 8 predictor variables. Ann Surg. 2016;264:23-31.
- 24. Paulus JK, Kent DM. Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. NPJ Digit Med. 2020;3:99.
- 25. Pugely AJ, Martin CT, Gao Y, Klocke NF, Callaghan JJ, Marsh JL. A risk calculator for short-term morbidity and mortality after hip fracture surgery. J Orthop Trauma. 2014;28:63-69.
- 26. Raval MV, Pawlik TM. Practical guide to surgical data sets: National Surgical Quality Improvement Program (NSQIP) and Pediatric NSQIP. JAMA Surg. 2018;153:764-765.
- 27. Sanz-Reig J, Marin JS, Martinez JF, Beltran DO, Lopez JFM, Rico JAQ. Prognostic factors and predictive model for in-hospital mortality following hip fractures in the elderly. Chin J Traumatol. 2018;21:163-169.
- 28. Schuijt HJ, Bos J, Smeeing DPJ, Geraghty O, van der Velde D. Predictors of 30-day mortality in orthogeriatric fracture patients aged 85 years or above admitted from the emergency department. Eur J Trauma Emerg Surg. 2021;47:817-823.
- 29. Smith T, Pelpola K, Ball M, Ong A, Myint PK. Pre-operative indicators for mortality following hip fracture surgery: a systematic review and meta-analysis. Age Ageing. 2014;43:464-471.
- 30. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128-138.
- 31. Tibshirani R. Regression shrinkage and selection via the LASSO. J R Stat Soc Series B Stat Methodol. 1996;58:267-288.
- 32. Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight - reconsidering the use of race correction in clinical algorithms. N Engl J Med. 2020;383:874-882.