Deep Learning to Predict Mortality After Cardiothoracic Surgery Using Preoperative Chest Radiographs

Vineet K Raghu; Philicia Moonsamy; Thoralf M Sundt; Chin Siang Ong; Sanjana Singh; Alexander Cheng; Min Hou; Linda Denning; Thomas G Gleason; Aaron D Aguirre; Michael T Lu

doi:10.1016/j.athoracsur.2022.04.056

Author manuscript; available in PMC: 2024 Sep 4.

Published in final edited form as: Ann Thorac Surg. 2022 May 21;115(1):257–264. doi: 10.1016/j.athoracsur.2022.04.056

PMCID: PMC11373441 NIHMSID: NIHMS2019322 PMID: 35609650

Abstract

BACKGROUND

The Society of Thoracic Surgeons Predicted Risk of Mortality (STS-PROM) estimates mortality risk only for certain common procedures (eg, coronary artery bypass or valve surgery) and is cumbersome, requiring greater than 60 inputs. We hypothesized that deep learning can estimate postoperative mortality risk based on a preoperative chest radiograph for cardiac surgeries in which STS-PROM scores were available (STS index procedures) or unavailable (non–STS index procedures).

METHODS

We developed a deep learning model (CXR-CTSurgery) to predict postoperative mortality based on preoperative chest radiographs in 9283 patients at Massachusetts General Hospital (MGH) having cardiac surgery before April 8, 2014. CXR-CTSurgery was tested on 3615 different MGH patients and externally tested on 2840 patients from Brigham and Women’s Hospital (BWH) having surgery after April 8, 2014. Discrimination for mortality was compared with the STS-PROM using the C-statistic. Calibration was assessed using the observed-to-expected ratio (O/E ratio).

RESULTS

For STS index procedures, CXR-CTSurgery had a C-statistic similar to STS-PROM at MGH (CXR-CTSurgery: 0.83 vs STS-PROM: 0.88; P = .20) and BWH (0.74 vs 0.80; P = .14) testing cohorts. The CXR-CTSurgery C-statistic for non–STS index procedures was similar to STS index procedures in the MGH (0.87 vs 0.83) and BWH (0.73 vs 0.74) testing cohorts. For STS index procedures, CXR-CTSurgery had better calibration than the STS-PROM in the MGH (O/E ratio: 0.74 vs 0.52) and BWH (O/E ratio: 0.91 vs 0.73) testing cohorts.

CONCLUSIONS

CXR-CTSurgery predicts postoperative mortality based on a preoperative CXR with similar discrimination and better calibration than the STS-PROM. This may be useful when the STS-PROM cannot be calculated or for non–STS index procedures.

Nearly 300 000 adults undergo cardiothoracic surgery in the United States annually.¹ An informed decision whether to proceed with surgery requires an accurate risk-benefit assessment, most importantly the risk of death after surgery. To this end, The Society of Thoracic Surgeons (STS) developed and maintains a risk assessment tool (STS Adult Cardiac Surgery Risk Calculator), which calculates the Predicted Risk of Mortality (STS-PROM) and morbidity scores² for many common cardiac surgical procedures. The STS-PROM is a risk factor–based regression model³ that was developed using data from 90% to 95% of U.S. hospitals performing cardiac surgery, totaling over 6 million cumulative operations performed by 3017 surgeons.¹ STS-PROM scores are commonly used by surgeons and institutions to counsel patients on the risk of cardiac surgery and for quality assurance, to evaluate hospitals based on their risk-adjusted surgical outcomes.

However, the STS-PROM has significant limitations. First, it is currently only applicable to commonly performed procedures: isolated coronary artery bypass grafting (CABG), isolated aortic valve replacement, isolated mitral valve repair and replacement, and a combination of CABG and valve procedures (STS index procedures). The STS-PROM is not applicable to 20% to 35% of cardiac surgical procedures (non–STS index procedures, including aortic aneurysms, tricuspid valve operations, ventricular assist devices, arrhythmia operations, and transplantations) because regression models have not yet been created for them.¹ Second, the STS-PROM achieves high performance by using more than 60 input variables; a simpler calculator may be more convenient. In addition, the STS-PROM often overestimates risk.⁴ An alternative that can be applied with limited data and to non–STS index procedures could be impactful.

Our study explores a new way to predict postoperative risk, using a convolutional neural network (CNN) and preoperative chest radiograph (CXR) images. Preoperative CXR images are typically obtained before cardiac surgery in the United States and abroad. We previously showed that CNNs can extract information from the pixels of a CXR image to predict 12-year all-cause mortality,⁵ incident lung cancer,⁶ and biological age,⁷ indicating that the pixels of the CXR provide a window into the health of a patient. We hypothesized that a deep learning model (CXR-CTSurgery) can learn to predict operative mortality in patients undergoing both STS and non–STS index procedures, based on preoperative CXR images.

PATIENTS AND METHODS

PATIENT COHORTS.

The model was developed (Figure 1) in a dataset consisting of adult patients having cardiothoracic surgery at the Massachusetts General Hospital (MGH), as identified in our institutional STS adult cardiac surgery database. Data are collected in this database according to the STS database guidelines.⁸ Presence of risk factors and outcomes were assessed using the standardized definitions of the STS. Consecutive patients who had a preoperative posterior-anterior CXR within 30 days before surgery (n = 12 898) were included: 70.3% (n = 12 898 of 18 344) of MGH cardiac surgery patients (Figure 1). Individuals having surgery between July 2002 and April 2014 were used for model development (randomly partitioned into training and tuning datasets) (n = 9283), totaling approximately 70% of the available data for the study. When a patient had multiple operations during the study period, only the first procedure was included.

Figure 1. Datasets for (A) model development and (B) testing. (BWH, Brigham and Women’s Hospital; CXR, chest radiograph; MGH, Massachusetts General Hospital; PA, posterior-anterior; STS, The Society of Thoracic Surgeons.)

The final model was tested in 2 independent testing datasets: (1) the remaining 30% of MGH patients (n = 3615) not used for model development who had surgery between April 8, 2014, and September 30, 2018; and (2) patients undergoing surgery between April 8, 2014, and December 31, 2019, from a second hospital (Brigham and Women’s Hospital [BWH], n = 2840). We tested our final model in the most recent patients, to best reflect how the model might perform in a future implementation.⁹ Individuals who had an STS index procedure but did not have an STS-PROM calculated were excluded from testing datasets. No individual in the development dataset was included in the independent testing dataset. This retrospective study was approved by our institutional review board, with waiver of informed consent.
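The temporal split and the first-procedure rule described above can be sketched as follows. This is a minimal illustration, not the study code: the record layout and field names (`patient_id`, `surgery_date`) are hypothetical, and the sketch assumes that surgery on April 8, 2014 itself falls in the testing set.

```python
from datetime import date

# Cutoff date taken from the text; placing the cutoff day itself in testing is an assumption.
CUTOFF = date(2014, 4, 8)

def first_procedure_per_patient(records):
    """Keep only each patient's earliest operation in the study period."""
    earliest = {}
    for rec in records:
        pid = rec["patient_id"]
        if pid not in earliest or rec["surgery_date"] < earliest[pid]["surgery_date"]:
            earliest[pid] = rec
    return list(earliest.values())

def temporal_split(records, cutoff=CUTOFF):
    """Return (development, testing): surgery before the cutoff vs on or after it."""
    firsts = first_procedure_per_patient(records)
    development = [r for r in firsts if r["surgery_date"] < cutoff]
    testing = [r for r in firsts if r["surgery_date"] >= cutoff]
    return development, testing

# Hypothetical records, for illustration only.
records = [
    {"patient_id": "A", "surgery_date": date(2005, 3, 1)},
    {"patient_id": "A", "surgery_date": date(2015, 6, 9)},  # repeat operation, dropped
    {"patient_id": "B", "surgery_date": date(2014, 4, 8)},
    {"patient_id": "C", "surgery_date": date(2013, 12, 31)},
]
development, testing = temporal_split(records)
dev_ids = sorted(r["patient_id"] for r in development)
test_ids = sorted(r["patient_id"] for r in testing)
print(dev_ids, test_ids)
```

Because each patient contributes a single record before the split, no patient can appear in both the development and testing sets.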

RISK FACTORS, STS-PROM, AND MORTALITY.

Risk factor inputs to the STS-PROM were collected by dedicated data managers as part of the institutional STS adult cardiac surgery databases. All risk factor inputs were documented according to the STS-PROM calculator, and all STS-PROM inputs were used to calculate the score. The STS-PROM was calculated using the current version at the time of surgery (in the testing datasets, version 2.73 until July 1, 2014; v2.81 until July 1, 2017; and v2.9 for patients having surgery after July 1, 2017). The primary outcome was operative mortality,² defined as 30-day deaths and in-hospital mortality including after transfer to other acute care facilities (can be ≥30 days). This was ascertained through the electronic medical record, national death registry, and follow-up communication with patients and next of kin.⁸

CNN DEVELOPMENT.

A CNN is a type of artificial intelligence frequently used for tasks in computer vision. CNNs typically require tens of thousands of samples in order to achieve high performance. To mitigate this, we developed CXR-CTSurgery using a technique called transfer learning. In transfer learning, a network is first trained on a task in which data are plentiful (here, predicting 12-year mortality from CXRs using our previous CXR-Risk CNN),⁵ and then the network is “fine-tuned” on a second dataset to perform well on the intended task (predicting operative mortality after cardiac surgery). In this way, the model can retain high-level image features common across the 2 tasks. Full details of model development are available in the Supplemental Material.

The final CXR-CTSurgery model is an Inception V4 CNN,¹⁰ which accepts a CXR image as input and outputs a probability between 0 and 1 of operative mortality. Because different procedures have different baseline mortality rates, CNN predictions and the surgery being performed (encoded as: isolated CABG, isolated aortic valve replacement, isolated mitral valve repair, isolated mitral valve replacement, CABG + aortic valve replacement, CABG + mitral valve replacement, CABG + mitral valve repair, and non–STS index cases) were included as predictors in a logistic regression model trained on the tuning dataset to predict operative mortality. Non–STS index cases were used as the reference group. This choice is only internal to model building. We do not report any odds ratios, and the reference group does not affect the probability output by the model. These calibrated estimates were used for all downstream statistical analyses. Results are reported for the testing datasets only.
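The statement that the reference group does not affect the output probability can be checked with a short sketch. The coefficients below are hypothetical and are not the fitted values from the study; the sketch also assumes that the CNN output enters the regression as a raw probability and, for brevity, lists only two of the STS index procedure categories.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(cnn_prob, procedure, intercept, cnn_coef, proc_coefs):
    """Calibrated risk from the CNN output plus a procedure indicator.

    proc_coefs maps each non-reference procedure to its coefficient;
    the reference procedure is absent from the map and contributes 0.
    """
    logit = intercept + cnn_coef * cnn_prob + proc_coefs.get(procedure, 0.0)
    return sigmoid(logit)

def change_reference(intercept, proc_coefs, old_ref, new_ref):
    """Re-express the same model with a different reference group."""
    shift = proc_coefs[new_ref]
    new_coefs = {p: c - shift for p, c in proc_coefs.items() if p != new_ref}
    new_coefs[old_ref] = -shift
    return intercept + shift, new_coefs

# Hypothetical coefficients with non-STS index cases as the reference group.
intercept, cnn_coef = -4.0, 6.0
coefs_ref_non_sts = {"isolated_cabg": -0.8, "isolated_avr": -0.5}

# Same model, re-parameterized with isolated CABG as the reference group.
new_intercept, coefs_ref_cabg = change_reference(
    intercept, coefs_ref_non_sts, old_ref="non_sts_index", new_ref="isolated_cabg"
)

for proc in ("non_sts_index", "isolated_cabg", "isolated_avr"):
    p_a = predict(0.10, proc, intercept, cnn_coef, coefs_ref_non_sts)
    p_b = predict(0.10, proc, new_intercept, cnn_coef, coefs_ref_cabg)
    print(proc, round(p_a, 4), round(p_b, 4))
```

Changing the reference group shifts the intercept and the indicator coefficients by the same amount in opposite directions, so the logit, and therefore the predicted probability, is unchanged for every procedure.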

SALIENCY MAPS.

Saliency maps¹¹,¹² were used to localize the anatomical regions on the CXR image that contributed to the model output. These maps reflect the gradient of the output risk estimate with respect to each individual pixel of the input image. Heatmaps were superimposed on the original images for visualization.
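The idea of a gradient-based saliency map can be illustrated on a toy model. The sketch below replaces the CNN with a logistic model over 4 pixels, for which the gradient has a closed form; in a real network the same quantity would be obtained by backpropagation. The pixel values and weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def risk(pixels, weights, bias):
    """Toy stand-in for the CNN: a logistic model over flattened pixels."""
    return sigmoid(sum(w * x for w, x in zip(weights, pixels)) + bias)

def saliency(pixels, weights, bias):
    """Absolute gradient of the risk with respect to each pixel.

    For a logistic model, d(risk)/d(x_i) = risk * (1 - risk) * w_i.
    """
    p = risk(pixels, weights, bias)
    return [abs(p * (1.0 - p) * w) for w in weights]

def finite_difference(pixels, weights, bias, i, eps=1e-6):
    """Numerical check of one gradient entry."""
    bumped = list(pixels)
    bumped[i] += eps
    return (risk(bumped, weights, bias) - risk(pixels, weights, bias)) / eps

# Hypothetical 2 x 2 "image" (row-major order) and model weights.
pixels = [0.2, 0.9, 0.5, 0.1]
weights = [0.0, 2.0, -1.0, 0.5]
bias = -1.0

sal = saliency(pixels, weights, bias)
most_salient = max(range(len(sal)), key=sal.__getitem__)
print([round(s, 4) for s in sal], most_salient)
```

Pixels whose small changes move the risk estimate the most receive the largest saliency values; a pixel with zero weight receives zero saliency regardless of its intensity.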

STATISTICAL ANALYSIS.

Primary Analysis.

We used the area under the receiver-operating characteristic curve (AUC) to evaluate how well CXR-CTSurgery and STS-PROM discriminate operative mortality. Calibration was assessed using the observed-to-expected ratio (O/E ratio). Both the AUC and O/E ratio were computed using the independent and external test datasets only. STS-PROM results are only reported on patients undergoing STS index procedures. CXR-CTSurgery results are compared with the STS-PROM for STS index procedures only, and CXR-CTSurgery results on non–STS index patients are reported separately.
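As an illustration of the two metrics, the following sketch computes the AUC in its rank-based form and the O/E ratio as observed deaths divided by the sum of predicted risks. The study analysis was done in R; this Python version and its toy data are illustrative assumptions only.

```python
def auc(scores, labels):
    """Probability that a randomly chosen death is scored above a randomly
    chosen survivor, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def oe_ratio(scores, labels):
    """Observed deaths divided by expected deaths (sum of predicted risks)."""
    return sum(labels) / sum(scores)

# Hypothetical predicted risks and outcomes (1 = operative death).
scores = [0.1, 0.2, 0.2, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 0, 0, 1]

print(auc(scores, labels), round(oe_ratio(scores, labels), 3))
```

An O/E ratio below 1 means that more deaths were predicted than observed (risk overestimated); a ratio above 1 means that risk was underestimated.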

Secondary Analysis.

Assessment of discrimination and calibration was repeated in male and female subgroups, and using 30-day mortality. We also measured discrimination using the area under the precision-recall curve, as the primary outcome was rare. CXR-CTSurgery outputs a continuous probability of mortality between 0 and 1. To facilitate calculation of precision, recall, and F1 score, continuous risk probabilities were stratified into low-risk (<4%) vs intermediate-high–risk (≥4%) groups. The less than 4% risk threshold was chosen to define low risk, based on the transcatheter aortic valve replacement literature that defines low risk as STS-PROM less than 4%.¹³ Calibration was further assessed by calculating mortality rates within risk groups. All statistical analysis was done in R v4.0.4 (R Foundation for Statistical Computing).
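The dichotomization at 4% and the resulting binary test statistics can be sketched as follows. The risks and outcomes are hypothetical; only the 4% threshold comes from the text.

```python
THRESHOLD = 0.04  # low risk: < 4%; intermediate-high risk: >= 4% (from the text)

def classify(risk):
    """Assign a continuous risk to one of the two risk groups."""
    return "intermediate-high" if risk >= THRESHOLD else "low"

def precision_recall_f1(risks, deaths):
    """Treat the intermediate-high group as a positive prediction of death."""
    flagged = [classify(r) == "intermediate-high" for r in risks]
    tp = sum(1 for f, d in zip(flagged, deaths) if f and d)
    fp = sum(1 for f, d in zip(flagged, deaths) if f and not d)
    fn = sum(1 for f, d in zip(flagged, deaths) if not f and d)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical predicted risks and outcomes (1 = operative death).
risks = [0.01, 0.03, 0.04, 0.08, 0.02, 0.15]
deaths = [0, 1, 0, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(risks, deaths)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```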

RESULTS

DEMOGRAPHICS, RISK FACTORS, AND PROCEDURES.

Table 1 presents demographics and risk factors for the development, internal testing, and external testing datasets. Testing cohorts had fewer ever-smokers (MGH development: 69.8% vs MGH testing: 53.8%; P < .001) and fewer individuals on dialysis (MGH development: 2.6% vs 1.5%; P < .001) than the development cohort but had otherwise similar baseline risk factors. The MGH testing cohort had lower operative mortality than the development cohort (MGH testing: n = 63 [1.7%] of 3615 vs MGH development: n = 251 [2.7%] of 9283; P = .002); however, this difference was not observed in the BWH testing cohort (n = 67 of 2840 [2.4%]; P = .35). Both testing cohorts had lower mean STS-PROM estimates than the development dataset (MGH development mean: 3.2% ± 4.2% vs MGH testing mean: 2.0% ± 2.8%; P < .001; and BWH testing mean: 1.7% ± 2.3%; P < .001), suggesting that there were differences in the risk profiles of patients used for model development vs model testing (ie, dataset drift).

TABLE 1.

Clinical Characteristics of Development and Testing Cohorts

Variable	MGH Development (n = 9283)	MGH Testing (n = 3615)	BWH Testing (n = 2840)
Age, y	65.7 ± 13.8	64.5 ± 13.3	64.2 ± 12.3
Male	6310/9282 (68)	2584/3615 (71.5)	1935/2840 (68.1)
Race/ethnicity
Asian	184/8802 (2.1)	114/3489 (3.3)	52/2723 (1.9)
Black	158/8802 (1.8)	81/3489 (2.3)	92/2723 (3.4)
White	8460/8802 (96.1)	3179/3489 (91.1)	2579/2723 (94.7)
Hispanic	290/9076 (3.2)	160/3566 (4.5)	96/2675 (3.6)
Obesity	2946/9274 (31.8)	1189/3612 (32.9)	950/2840 (33.4)
Diabetes	2398/9281 (25.8)	940/3613 (26.0)	751/2840 (26.4)
Past MI	1547/5561 (27.8)	890/3592 (24.8)	524/2840 (18.5)
Hypertension	6903/9281 (74.4)	2638/3613 (73.0)	2104/2840 (74.1)
Dyslipidemia	4099/5562 (73.7)	2635/3593 (73.3)	2102/2839 (74.0)
Heart failure class 3 or 4	2611/4698 (55.6)	442/805 (54.9)	480/957 (50.2)
Chronic lung disease	1335/9271 (14.4)	406/3609 (11.2)	371/2839 (13.1)
Prior cardiac intervention	2730/9280 (29.4)	1104/3612 (30.6)	894/2840 (31.5)
On dialysis	153/5957 (2.6)	53/3594 (1.5)	34/2840 (1.2)
Chronic kidney disease
Moderate	3178/9069 (35.0)	887/3484 (25.5)	636/2722 (23.4)
Severe	388/9069 (4.3)	121/3484 (3.5)	82/2722 (3.0)
Ever-smokers	3088/4422 (69.8)	1830/3399 (53.8)	1415/2726 (51.9)
Creatinine, mg/dL	1.2 ± 0.6	1.1 ± 0.7	1.1 ± 0.7
STS-PROM	3.2 ± 4.2	2.0 ± 2.8	1.7 ± 2.3
Operative mortality	251/9283 (2.7)	63/3620 (1.7)	67/2840 (2.4)

Values are mean ± SD or n/n (%). BWH, Brigham and Women’s Hospital; MGH, Massachusetts General Hospital; MI, myocardial infarction; STS-PROM, The Society of Thoracic Surgeons Predicted Risk of Mortality.

Development and testing cohorts differed in the mix of surgical procedures performed (Supplemental Table 1). The development cohort had a higher percent of isolated CABG (MGH development: n = 4297 [46.3%] of 9283 vs MGH testing: n = 1162 [32.1%] of 3620) and lower percent of all other STS index procedures. Non–STS index procedures represented a greater percent of all procedures in the development set than in the MGH testing set (MGH development: n = 410 [44.2%] of 9283 vs MGH testing: n = 1319 [36.4%] of 3620; P < .001). Both testing cohorts had a similar distribution of procedures.

DISCRIMINATION AND CALIBRATION.

Discrimination for operative mortality was assessed using the AUC (Table 2). For STS index procedures, CXR-CTSurgery had a similar AUC as the STS-PROM in the MGH testing cohort (CXR-CTSurgery: 0.829 [95% CI: 0.72–0.94] vs STS-PROM: 0.884 [95% CI: 0.82–0.95]; P = .20). CXR-CTSurgery discrimination was similar in patients undergoing non–STS index procedures (AUC: 0.874 [95% CI: 0.83–0.92]). CXR-CTSurgery outperformed a baseline demographic model in STS index (AUC: 0.829 vs 0.606 [95% CI: 0.47–0.74]; P < .001) and non–STS index (AUC: 0.874 vs 0.725 [95% CI: 0.65–0.80]; P < .001) procedures.

TABLE 2.

Discrimination for Operative Mortality of a Baseline Demographic Model (Age, Sex, Race), CXR-CTSurgery, and the STS-PROM for Patients for Whom the STS-PROM Could and Could Not Be Calculated

	MGH Testing Cohort		BWH External Testing Cohort
Variable	Patients Without an STS-PROM (n = 1315)	Patients With an STS-PROM (n = 2300)	Patients Without an STS-PROM (n = 935)	Patients With an STS-PROM (n = 1905)
Operative mortality	39/1315 (3.0)	24/2300 (1.0)	43/935 (4.6)	24/1905 (1.3)
Baseline demographic model	0.712 (0.63–0.79)	0.589 (0.46–0.71)	0.546 (0.46–0.64)	0.612 (0.50–0.72)
CXR-CTSurgery	0.874 (0.83–0.92)	0.829 (0.72–0.94)	0.727 (0.64–0.81)	0.738 (0.64–0.83)
STS-PROM	NA	0.884 (0.82–0.95)	NA	0.803 (0.71–0.90)

Values are n/n (%) or area under the receiver-operating characteristic curve (95% CI). BWH, Brigham and Women’s Hospital; MGH, Massachusetts General Hospital; NA, not available in this population; STS-PROM, The Society of Thoracic Surgeons Predicted Risk of Mortality.

Both the STS-PROM and CXR-CTSurgery had lower AUC in the BWH testing cohort; however, the relative performance between CXR-CTSurgery and STS-PROM was similar (CXR-CTSurgery AUC: 0.738 [95% CI: 0.64–0.83] vs STS-PROM AUC: 0.803 [95% CI: 0.71–0.90]; P = .14). CXR-CTSurgery discrimination was again similar in patients undergoing non–STS index procedures (AUC: 0.727 [95% CI: 0.64–0.81]). Similar discrimination results were seen when stratifying by sex (Supplemental Tables 2 and 3), using 30-day mortality as the endpoint (Supplemental Tables 4 and 5), and using area under the precision-recall curve (Supplemental Table 6).

We calculated mortality rates by procedure for the development and testing sets and median model risk scores by procedure (Supplemental Table 7). Mortality rates by procedure were similar in the development and both testing cohorts.

Calibration of CXR-CTSurgery and the STS-PROM was assessed using the O/E ratio. For patients undergoing STS index procedures, CXR-CTSurgery had better calibration than the STS-PROM in the MGH (CXR-CTSurgery O/E ratio: 0.74 [95% CI: 0.5–1.0] vs STS-PROM O/E ratio: 0.52 [95% CI: 0.3–0.7]) and BWH (O/E ratio: 0.91 [95% CI: 0.5–1.3] vs 0.73 [95% CI: 0.5–1.0]) testing cohorts. For those undergoing non–STS index procedures, CXR-CTSurgery had good calibration in the MGH testing cohort (O/E ratio: 1.18 [95% CI: 0.8–1.5]) but slightly underestimated risk in the BWH cohort (O/E ratio: 1.84 [95% CI: 1.3–2.4]). Similar calibration results were seen using 30-day mortality (Supplemental Tables 8 and 9).
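To make the O/E ratios concrete, the sketch below back-calculates the number of deaths each model implicitly expected in the MGH testing cohort for STS index procedures, using the observed count from Table 2 (24 deaths among 2300 patients) and the reported ratios. Because the reported ratios are rounded, the results are approximate.

```python
def expected_deaths(observed, oe):
    """Expected deaths implied by an observed count and an O/E ratio."""
    return observed / oe

# MGH testing cohort, STS index procedures (Table 2).
observed, n = 24, 2300

implied = {}
for name, oe in (("CXR-CTSurgery", 0.74), ("STS-PROM", 0.52)):
    exp = expected_deaths(observed, oe)
    implied[name] = exp
    print(f"{name}: expected deaths ~ {exp:.1f}, mean predicted risk ~ {100 * exp / n:.2f}%")

print(f"Observed mortality: {100 * observed / n:.2f}%")
```

Both ratios are below 1, so both models predicted more deaths than occurred; the ratio closer to 1 corresponds to the smaller overestimate.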

We then assessed the calibration of CXR-CTSurgery after stratifying by sex (Supplemental Table 10). We found that CXR-CTSurgery retained good calibration in MGH STS index patients for both the female (CXR-CTSurgery O/E ratio: 0.69 [95% CI: 0.1–1.2] vs STS-PROM O/E ratio: 0.40 [95% CI: 0.1–0.7]) and male (O/E ratio: 0.76 [95% CI: 0.4–1.1] vs 0.57 [95% CI: 0.3–0.8]) subgroups. CXR-CTSurgery slightly underestimated risk in female (1.55 [95% CI: 1.0–2.2]), but had good calibration in male (0.87 [95% CI: 0.5–1.3]) non–STS index patients. Similar results were found in the BWH testing cohort.

DISCRETE RISK GROUPS AND RECLASSIFICATION ANALYSES.

CXR-CTSurgery and STS-PROM risk probabilities were stratified into low (<4%) and intermediate-high (≥4%) risk groups. CXR-CTSurgery risk groups had a graded association with mortality (Supplemental Tables 11 and 12), with individuals at estimated risk ≥4% having a higher mortality rate than those at <4% in the MGH (CXR-CTSurgery risk ≥4%: 12 [14.8%] of 81 vs <4%: 9 [0.4%] of 2219; P < .001) and BWH (CXR-CTSurgery ≥4%: 5 [6.8%] of 73 vs <4%: 19 [1.0%] of 1832; P < .001) testing cohorts. Non–STS index procedures showed a similar pattern in both the MGH (CXR-CTSurgery ≥4%: 24 [14.9%] of 161 vs <4%: 15 [1.3%] of 1154; P < .001) and BWH (CXR-CTSurgery ≥4%: 13 [12.9%] of 101 vs <4%: 30 [3.6%] of 834; P < .001) testing cohorts. Detailed calibration by ordinal risk groups (Supplemental Table 13) and binary test statistics (Supplemental Table 14) are in the Supplemental Material.
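The mortality rates quoted for each risk group follow directly from the reported counts, as the short check below shows. The counts are copied from the paragraph above; the group totals are also compared with the cohort sizes in Table 2.

```python
# (deaths, patients) at estimated risk >= 4% and < 4%, copied from the text.
groups = {
    "MGH STS index": ((12, 81), (9, 2219)),
    "BWH STS index": ((5, 73), (19, 1832)),
    "MGH non-STS index": ((24, 161), (15, 1154)),
    "BWH non-STS index": ((13, 101), (30, 834)),
}

def pct(deaths, patients):
    """Mortality rate as a percentage, rounded to 1 decimal place."""
    return round(100 * deaths / patients, 1)

rates = {name: (pct(*high), pct(*low)) for name, (high, low) in groups.items()}
totals = {name: high[1] + low[1] for name, (high, low) in groups.items()}

for name in groups:
    high_rate, low_rate = rates[name]
    print(f"{name}: >=4% group {high_rate}%, <4% group {low_rate}%, n = {totals[name]}")
```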

SALIENCY MAPS AND ASSOCIATION ANALYSIS.

Saliency maps were used to localize anatomical regions contributing to the CXR-CTSurgery risk estimate (Figure 2). The heart, aortic silhouette, mediastinum, lung, and other anatomy plausibly related to health and resilience were commonly highlighted. Of note, predictions were not based on the orientation markers or technologists’ initials in the corner of the image, which have been identified as potential confounders for other CXR artificial intelligence studies.¹⁴

Figure 2. Saliency maps localize anatomical regions contributing to the CXR-CTSurgery score. (AVR, aortic valve replacement; CABG, coronary artery bypass grafting; MV, mitral valve; STS-PROM, The Society of Thoracic Surgeons Predicted Risk of Mortality.)

We also performed an association analysis to determine what radiographic features and risk factors may be picked up by the CXR-CTSurgery estimate. We found that the CXR-CTSurgery score is associated with the STS-PROM, heart failure, chronic obstructive pulmonary disease, older age, lower estimated glomerular filtration rate, and other factors (Supplemental Figure 1). Association analysis with radiographic features shows that CXR-CTSurgery picked up on pleural effusion, pulmonary edema, cardiomegaly, and atelectasis from the CXR—similar to the saliency map analysis (Supplemental Figure 2).

COMMENT

Preoperative CXR images are routinely obtained before cardiac surgery. We found that a CNN (CXR-CTSurgery) can extract information from the preoperative CXR image to estimate risk of operative mortality. Discrimination was similar to the current clinical standard, the STS-PROM, for patients having STS index procedures in the MGH (STS-PROM AUC: 0.88 vs CXR-CTSurgery AUC: 0.83; P = .20) and BWH (AUC: 0.80 vs 0.74; P = .14) testing cohorts. For the important category of patients having non–STS index procedures, CXR-CTSurgery maintained high discrimination for postoperative mortality (AUC: 0.874) in the MGH testing cohort.

We trained the CXR-CTSurgery model using data from 2002 to 2014, then tested model performance in the most recent available data (2014–2019). The intent was to use the most recent data as the best estimate of future performance. This was important due to substantial dataset drift over time, with significant differences in STS-PROM profiles, overall mortality, and the mix of procedures performed. Nevertheless, in both datasets, CXR-CTSurgery had better calibration than STS-PROM (MGH CXR-CTSurgery O/E ratio, 0.74 vs STS-PROM O/E ratio, 0.52; BWH O/E ratio, 0.91 vs 0.73).

The STS-PROM requires manual input of over 60 clinical risk factors, which requires substantial effort and potentially introduces data entry errors. In contrast, CXR-CTSurgery requires only a single CXR image and the procedure being performed. Further, while STS-PROM scores are applicable to only the most common cardiothoracic procedures, CXR-CTSurgery may be applicable to any procedure. This is important, as over a third of the surgeries at the institutions studied here do not have STS-PROM scores (36.4% at MGH and 32.9% at BWH), and these patients had an operative mortality rate 3-fold higher than patients having STS index procedures (3.0% vs 0.9% at MGH; 4.6% vs 1.3% at BWH).

To our knowledge, this is the first report of a CNN to predict postoperative mortality from routine preoperative CXRs. We hypothesize that the appearance of the aorta, heart, lungs, and other anatomy on the CXR image carries information about aging, frailty, and resilience to surgery. Past risk scores (eg, STS-PROM, EuroSCORE [European System for Cardiac Operative Risk Evaluation])¹⁵ are based on clinical risk factors (eg, age, diabetes) instead of an image. Kilic and colleagues used the same risk factors in a machine learning algorithm called XGBoost¹⁶ to modestly improve discrimination and calibration over STS-PROM for postoperative mortality.¹⁷

Limitations of this study should be considered. This was a retrospective study of individuals having cardiothoracic surgery, and performance of the model may differ when tested prospectively. Persons evaluated for but ultimately not having surgery were excluded. The model was developed and validated using frontal posterior-anterior CXR images, which were available for 70% of patients having cardiac surgery at MGH. The posterior-anterior technique is usually reserved for patients who can stand upright; thus, our population may be lower risk than the general cardiac surgery population. CXR-CTSurgery was developed and tested in 2 Boston tertiary care clinics. Whether the model generalizes to other CXR techniques, emergency surgery, and other institutions is not known. Testing at other institutions will be an important step toward validation of CXR-CTSurgery. To this end, our model is publicly available as free software (https://github.com/vineet1992/CXRCT-Surgery). CXR-CTSurgery is an Inception V4 CNN and was developed using transfer learning; other deep learning architectures and training procedures may improve performance. The training cohort included patients regardless of whether all STS-PROM inputs were available. For testing, we excluded 35 patients who had STS index procedures but were missing inputs to the STS-PROM, to enable a direct comparison.

Deep learning models may encode bias against demographic groups. We showed that our model performed equally well in both male and female subgroups. Our cohort is 96% non-Hispanic White, so we had insufficient patients to test the model in racial or ethnic subgroups. Future work will validate the model in these and other minority groups.

Deep learning models are a black box,¹¹ in that it is difficult to assess how the model made its prediction, which may reduce trust and utility. To address this, we used association analyses to determine that the CXR-CTSurgery score is primarily associated with chronic obstructive pulmonary disease, heart failure, pulmonary edema, cardiomegaly, pleural effusion, and atelectasis.

The CXR-CTSurgery model may reduce manual effort, as it only requires a single preoperative CXR image instead of manual entry of over 60 risk factors. We see CXR-CTSurgery as a limited-data method to complement the STS-PROM in preoperative decision making. However, as CXR-CTSurgery does not identify underlying pathologies that drive risk, the physician must integrate their knowledge before decision making.

Though CXR-CTSurgery is not yet implemented into clinical workflows, the model can be easily distributed and run in the background of the electronic medical record with minimal user input. Most modern electronic medical record systems have a picture archiving and communication system that exists in parallel where CXR images are archived. Vendor and free open-source tools are available to integrate models like CXR-CTSurgery into these archiving systems, enabling automated analysis of preoperative CXR images.⁶,¹⁸

As this was a retrospective analysis, CXR-CTSurgery was compared with the STS-PROM version available at the time of surgery (versions 2.73–2.9) from 2017 to 2019; future prospective studies will validate CXR-CTSurgery against the latest STS-PROM version. CXR-CTSurgery was trained to estimate 30-day mortality risk; however, results are presented for operative mortality including 30-day mortality and all in-hospital deaths, as this was the endpoint used to develop the STS-PROM. Low-resolution (224 × 224) radiographs were used; higher resolution may improve performance. Manually inputting risk factors into the STS-PROM calculator pushes institutions to carefully monitor risk factors and their impact on outcomes and perioperative management; using a model based solely on CXR may result in a loss of vigilance toward presence of specific risk factors. Last, our development and testing cohorts were smaller than the Adult Cardiac Surgery Database used to develop the STS-PROM. Training the model on CXR images from multiple institutions and using a model that incorporates both the image and risk factors as inputs may further improve performance.

In summary, we developed a CNN to assess postoperative mortality risk after cardiac surgery from a preoperative CXR image with similar discrimination and better calibration than the STS-PROM. Model predictions may inform surgical decisions when the STS-PROM cannot be applied.

Supplementary Material

Supplemental Material

NIHMS2019322-supplement-Supplemental_Material.docx (507 KB, docx)

FUNDING SOURCES

Dr Lu is supported by American Heart Association grants 18UNPG34030172 and 810966. Dr Raghu is supported by National Institutes of Health grant T32HL076136. Dr Lu reported research funding as a co-investigator to Massachusetts General Hospital from Kowa Company Limited and Medimmune/AstraZeneca and receiving personal fees from PQBypass unrelated to this work. Dr Aguirre reported grants from the CRICO Risk Management Foundation during the conduct of the study. No other potential conflict of interest relevant to this article was reported. A graphics processing unit used for this research was donated to Dr Lu as an unrestricted gift through the Nvidia Corporation Academic Program.

Abbreviations and Acronyms

AUC: area under the receiver-operating characteristic curve
BWH: Brigham and Women’s Hospital
CABG: coronary artery bypass grafting
CNN: convolutional neural network
CXR: chest radiograph
MGH: Massachusetts General Hospital
O/E ratio: observed-to-expected ratio
STS-PROM: The Society of Thoracic Surgeons Predicted Risk of Mortality

Footnotes

DISCLOSURES

Dr Lu has common stock in Nvidia and AMD. Dr Raghu has common stock in Nvidia, Alphabet, and Apple. Dr Gleason serves on the medical advisory board for Abbott.

The Supplemental Material can be viewed in the online version of this article [https://doi.org/10.1016/j.athoracsur.2022.04.056] on http://www.annalsthoracicsurgery.org.

Presented at the 2020 Radiological Society of North America Annual Meeting, Virtual Meeting, Nov 29-Dec 5, 2020.

REFERENCES

1.D’Agostino RS, Jacobs JP, Badhwar V, et al. The Society of Thoracic Surgeons Adult Cardiac Surgery Database: 2018 Update on Outcomes and Quality. Ann Thorac Surg. 2018;105:15–23. [DOI] [PubMed] [Google Scholar]
2.Shahian DM, Jacobs JP, Badhwar V, et al. The Society of Thoracic Surgeons 2018 Adult Cardiac Surgery Risk Models: part 1-background, design considerations, and model development. Ann Thorac Surg. 2018;105:1411–1418. [DOI] [PubMed] [Google Scholar]
3.O’Brien SM, Feng L, He X, et al. The Society of Thoracic Surgeons 2018 Adult Cardiac Surgery Risk Models: part 2-statistical methods and results. Ann Thorac Surg. 2018;105:1419–1428. [DOI] [PubMed] [Google Scholar]
4.Ad N, Holmes SD, Patel J, Prithcard G, Shuman DJ, Halpin L. Comparison of EuroSCORE II, original EuroSCORE, and The Society of Thoracic Surgeons Risk Score in cardiac surgery patients. Ann Thorac Surg. 2016;102:573–579. [DOI] [PubMed] [Google Scholar]
5.Lu MT, Ivanov A, Mayrhofer T, Hosny A, Aerts HJWL, Hoffmann U. Deep learning to assess long-term mortality from chest radiographs. JAMA Netw Open. 2019;2:e197416. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Lu MT, Raghu VK, Mayrhofer T, Aerts HJWL, Hoffmann U. Deep learning using chest radiographs to identify high-risk smokers for lung cancer screening computed tomography: development and validation of a prediction model. Ann Intern Med. 2020;173:704–713. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Raghu VK, Weiss J, Hoffmann U, Aerts HJWL, Lu MT. Deep learning to estimate biological age from chest radiographs. J Am Coll Cardiol Img. 2021;14:2226–2236. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Shahian DM, Torchiana DF, Engelman DT, et al. Mandatory public reporting of cardiac surgery outcomes: the 2003–2014 Massachusetts experience. J Thorac Cardiovasc Surg. 2019;158:110–124. [DOI] [PubMed] [Google Scholar]
9.fast.ai. How (and why) to create a good validation set. 2022. Accessed March 10, 2022. https://www.fast.ai/2017/11/13/validation-sets/
10.Szegedy C, Vanhoucke V, Ioffe S, Shiens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2016: 2818–2826. [Google Scholar]
11.Yosinski J, Clune J, Nguyen A, Fuchs T, Lipson H. Understanding neural networks through deep visualization. Preprint. Published online June 22, 2015. arXiv 1506.06579. 10.48550/arxiv.1506.06579 [DOI] [Google Scholar]
12.Baehrens D, Schroeter T, Harmeling S, Kawanabe M, Hansen K, Müller KR. How to explain individual classification decisions. J Mach Learn Res. 2010;11:1803–1831. [Google Scholar]
13.Kumar A, Sato K, Narayanaswami J, et al. Current society of thoracic surgeons model reclassifies mortality risk in patients undergoing transcatheter aortic valve replacement. Circ Cardiovasc Interv. 2018;11:e006664. [DOI] [PubMed] [Google Scholar]
14.DeGrave AJ, Janizek JD, Lee SI. AI for radiographic COVID-19 detection selects shortcuts over signal. Preprint. Published online October 8, 2020. medRxiv 20193565. 10.1101/2020.09.13.20193565 [DOI] [Google Scholar]
15.Nashef SAM, Roques F, Sharples LD, et al. EuroSCORE II. Eur J Cardiothorac Surg. 2012;41:734–745. [DOI] [PubMed] [Google Scholar]
16.Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2016:785–794. [Google Scholar]
17.Kilic A, Goyal A, Miller JK, et al. Predictive utility of a machine learning algorithm in estimating mortality risk in cardiac surgery. Ann Thorac Surg. 2020;109:1811–1819. [DOI] [PubMed] [Google Scholar]
18.Sohn JH, Chillakuru YR, Lee S, et al. An open-source, vender agnostic hardware and software pipeline for integration of artificial intelligence in radiology workflow. J Digit Imaging. 2020;33:1041–1046. [DOI] [PMC free article] [PubMed] [Google Scholar]
