Abstract
Objective:
To develop an artificial intelligence–based pipeline to assess and triage patient-submitted postoperative wound images.
Background:
The rise of outpatient surgeries, remote monitoring, and patient-submitted wound images via online portals has contributed to a growing administrative workload on clinicians. Early identification of surgical site infection (SSI) is essential for reducing postoperative morbidity.
Methods:
Patients ≥18 years old who underwent surgery at 9 affiliated Mayo Clinic hospitals (2019-2022) and were captured by the National Surgical Quality Improvement Program (NSQIP) were included. Eligibility required submission of at least one image via the patient portal within 30 days after surgery. Images were independently screened in duplicate to determine the presence of an incision. SSI outcomes were obtained from NSQIP. The developed model consisted of 2 stages: incision detection and SSI detection in images with incisions. Four pretrained architectures were evaluated using 10-fold cross-validation, with upsampling and data augmentation to mitigate class imbalance. An end-to-end pipeline evaluation, an image quality assessment, and a sensitivity analysis stratified by race were also performed.
Results:
Among 6060 patients, the median age was 54 years (interquartile range: 40–65), 61.4% (n=3805) were female, and 92.5% (n=5731) identified as White. SSIs were confirmed in 6.2% (n=386) of cases. Vision Transformer outperformed all other architectures, achieving an incision detection accuracy of 0.94 (area under the curve=0.98) and an SSI detection accuracy of 0.73 (area under the curve=0.81). In addition, it demonstrated strong performance in assessing image quality. Sensitivity analysis revealed comparable performance across racial subgroups.
Conclusion:
This artificial intelligence pipeline demonstrates promising performance in automating wound image assessment and SSI detection, with the potential to reduce clinical workload and improve postoperative care.
Key Words: artificial intelligence, postoperative monitoring, surgical site infection
Surgical site infections (SSIs) are infections that occur at or near the surgical incision.1 Despite advances in infection control practices, SSIs remain a significant challenge, accounting for up to 20% of all hospital-acquired infections.2,3 They contribute to prolonged hospitalization, increased morbidity and mortality, and an estimated $3.3 billion in annual health care costs in the United States.2,4,5
Early identification of SSIs can reduce morbidity and improve patient outcomes.6 However, the increasing prevalence of outpatient surgeries7,8 and remote patient monitoring,9,10 combined with patients’ growing reliance on online portals for communication,11 has led to a surge in postoperative wound images submitted electronically. Reviewing and triaging these images poses a significant administrative workload on physicians, nurses, and advanced practice providers.12–14 The early diagnosis of SSIs, which can occur up to 30 days after surgery, has been shown to improve outcomes.15,16 Therefore, delays in manual review and triage of images submitted by patients may postpone timely diagnosis and intervention, potentially worsening patient outcomes.15,16
Leveraging artificial intelligence (AI) for image analysis offers the promise of reducing the administrative load on clinicians and potentially improving patient outcomes. Although automated image analysis has shown promise in other areas of medical imaging,17 its application to postoperative wound assessment remains relatively unexplored, underscoring the novelty and potential impact of this approach on patient outcomes and health care efficiency.14,18 Therefore, our aim was to develop an AI-based approach to assess and triage postoperative wound images, with the goal of automating SSI detection and streamlining postoperative monitoring.
METHODS
Study Design
This study was conducted in adherence to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines.19 Following institutional review board approval, adult patients (≥18 years old) who underwent any surgical procedure captured by the National Surgical Quality Improvement Program (NSQIP) between 2019 and 2022 at the 9 affiliated academic, tertiary, and regional hospitals across the Mayo Clinic system were retrospectively identified. This includes hospitals in Arizona, Wisconsin, Florida, and Minnesota. At the time of the study, NSQIP captured between 12% and 42% of all surgical procedures per its sampling protocol at these sites. Institutions participating in NSQIP identify postoperative patients through trained clinical reviewers who systematically assess medical records to collect data on preoperative factors, intraoperative details, and 30-day postoperative outcomes. Trauma, cardiac, and transplant procedures are not included in NSQIP data.
The cohort was then limited to patients who submitted at least one image of any type via the patient portal within 30 days after surgery. Patients who only submitted PDF documents or did not consent to the use of their data for retrospective research per Minnesota state law were excluded. By study design, there were no missing images; however, patients with missing characteristics were described in the tables. Patients were permitted to have multiple surgeries and submit multiple images. For patients who underwent reoperations within 30 days after the index operation, all procedures were considered part of a single event. However, subsequent operations beyond 30 days were included as separate events if they were selected for NSQIP sampling, and these procedures were treated as new patient entries. All images were treated independently.
Image Review
All identified images were independently and systematically reviewed in duplicate by two surgeons. Inter-rater reliability was assessed using the kappa statistic (κ), and review proceeded once a κ score >0.8 was attained.20 A structured extraction form in Excel was developed and piloted to evaluate both the presence of a surgical incision and the image quality when an incision was present. Data abstraction was performed in duplicate (H.M. and F.L.), with disagreements resolved through discussion or by consulting a third reviewer (C.T.).
Each image was categorized for the presence or absence of a surgical incision. For images with incisions, the overall quality was evaluated based on specific criteria: blurriness, covered incision, poor angle, suboptimal lighting (too dark or too bright), and suboptimal distance (too close or too far). A composite variable (overall image quality) was created to capture the presence of any of these issues.
Surgical Site Infection and Other Variables
For the purposes of this study, SSI outcomes were obtained from NSQIP and considered the gold standard as the data are independently and prospectively collected, thereby minimizing the risk of investigator-introduced bias in determining SSIs. NSQIP defines SSI according to the CDC criteria within a 30-day postoperative period.21 SSIs are categorized as superficial (involving skin or subcutaneous tissue), deep (involving deeper soft tissues), and organ space (involving internal organs or spaces).21 For this study, patients with superficial or deep SSIs were classified as having an SSI, while organ-space infections were excluded, as these involve infection of internal organs rather than the surgical incision.22 All other clinical variables were extracted from institutional NSQIP registries. Patient-level variables included age at time of surgery, sex, and race. Operative variables included the surgical specialty, whether the procedure was performed in the outpatient setting, and operative time. In addition, the interval from surgery to SSI diagnosis was abstracted for patients with confirmed infection.
Image Processing
To ensure consistency in model input while preserving spatial relationships, images of varying dimensions were first scaled to maintain their aspect ratio and then standardized to 224×224 pixels (3 color channels) using white padding, achieving uniform resolution without distortion.
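For illustration, a minimal preprocessing sketch of this step is shown below. It assumes PIL and torchvision; the interpolation method and padding placement are assumptions, as the authors' exact implementation is not published.

```python
from PIL import Image
from torchvision import transforms

TARGET = 224  # model input size in pixels

def resize_with_white_padding(img: Image.Image, target: int = TARGET) -> Image.Image:
    """Scale the longest side to `target` while keeping aspect ratio, then pad with white."""
    img = img.convert("RGB")
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = img.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new("RGB", (target, target), (255, 255, 255))  # white background
    canvas.paste(img, ((target - new_w) // 2, (target - new_h) // 2))  # center the scaled image
    return canvas

to_tensor = transforms.ToTensor()  # converts the padded image to a 3x224x224 float tensor in [0, 1]
```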
Model Training and Validation
The developed model consists of 2 stages (Supplemental Digital Content Figure 1, http://links.lww.com/SLA/F531). The first stage detects surgical incisions from all submitted images, and the second stage identifies SSIs from images confirmed to contain an incision. All analyses were performed at the image level. Four pretrained architectures were evaluated: Vision Transformer,23 which uses an attention-based method to identify important regions within an image; ResNet50,24 which employs residual “skip” connections to facilitate deep network training; MobileNetV4,25 which is optimized for resource-constrained environments with a lightweight design; and U-Net,26 which uses an encoder-decoder structure for precise image segmentation. See Supplemental Digital Content Additional Methods, http://links.lww.com/SLA/F531, for additional details on the utilized architectures.
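Pretrained backbones of this kind can be instantiated with the timm library referenced in the Statistical Analysis section. The sketch below is illustrative only; the specific checkpoint variants and head configuration are assumptions, not taken from the paper.

```python
import timm
import torch

NUM_CLASSES = 2  # binary task: incision vs no incision (or SSI vs no SSI)

# Hypothetical variant names; the paper does not specify which checkpoints were used.
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=NUM_CLASSES)
resnet = timm.create_model("resnet50", pretrained=True, num_classes=NUM_CLASSES)

x = torch.randn(1, 3, 224, 224)   # one preprocessed image
logits = vit(x)                    # shape: [1, NUM_CLASSES]
probs = logits.softmax(dim=-1)     # continuous probability per class, as reported in the Methods
```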
Model development and validation were performed using 10-fold cross-validation: the dataset was randomly divided into 10 partitions (folds), and in each iteration 9 folds (90% of the data) were used for training and the remaining fold (10%) for validation. This process was repeated 10 times, rotating the training and testing folds in each iteration. The models were fine-tuned using the Adam optimizer (learning rate of 1e-5 for Vision Transformer and 1e-4 for the other models) with cosine annealing for learning rate scheduling. Cross-entropy loss with label smoothing was used, and early stopping (patience of 5 iterations) was implemented to prevent overfitting. A batch size of 32 was used, with model parameters updated based on the average error across batches. This process was reset and repeated for each fold. See Supplemental Digital Content, Additional Methods, http://links.lww.com/SLA/F531 and Supplemental Digital Content, Table 1, http://links.lww.com/SLA/F531 for additional details on the hyperparameter search.
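A minimal PyTorch sketch of this training configuration follows. The maximum epoch count, label-smoothing factor, and scheduler horizon are assumptions not stated in the text; the optimizer, learning rates, loss, batch handling, and early-stopping patience mirror the description above.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_one_fold(model, train_loader, val_loader, is_vit, max_epochs=50, device="cuda"):
    model.to(device)
    optimizer = Adam(model.parameters(), lr=1e-5 if is_vit else 1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=max_epochs)   # cosine annealing schedule
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)         # smoothing factor assumed
    best_loss, patience, bad_epochs = float("inf"), 5, 0         # early stopping, patience of 5

    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:                      # batch size 32 set in the loader
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)               # average error across the batch
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(x.to(device)), y.to(device)).item() for x, y in val_loader
            ) / len(val_loader)
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                            # stop if no improvement
                break
    return model
```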
To address class imbalance during the training stage, upsampling was applied to the minority class to ensure a balanced outcome distribution and reduce bias toward the majority class. During the test/validation stage, downsampling of the majority class was applied to mitigate performance inflation bias and provide a more informative evaluation of the model's performance. In addition, data augmentation techniques (random flips and grayscaling) were utilized to enhance model generalizability.
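For illustration, minority-class upsampling and the stated augmentations could be implemented as below. This is a sketch assuming torch/torchvision utilities; the authors' exact resampling code and augmentation probabilities are not published.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

def balanced_loader(train_dataset, labels, batch_size=32):
    """Upsample the minority class by sampling each image inversely to its class frequency.

    labels: 1-D integer tensor of 0/1 outcomes for the training fold.
    """
    class_counts = torch.bincount(labels)
    sample_weights = 1.0 / class_counts[labels].float()   # rarer class -> larger sampling weight
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    return DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

# Augmentations named in the text: random flips and grayscaling (probabilities assumed)
train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])
```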
Validation was performed without recalibration to assess the model’s predictive ability against actual outcomes. In addition, an end-to-end pipeline (combining the incision and infection detection models) was evaluated on the natural data distribution to ensure no artificial biases were introduced. For each image, the model generated a continuous probability score between 0 and 1, with values closer to 1 indicating a higher likelihood of the predicted outcome. The mean and standard deviation (SD) of performance metrics were calculated across all iterations, and the area under the receiver operating characteristic (ROC) curve (AUC) was calculated to assess model discrimination.
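Conceptually, the end-to-end pipeline chains the two classifiers: an image is first screened for an incision and, only if one is detected, passed to the SSI model. The sketch below illustrates that flow; the 0.5 threshold, function names, and output format are illustrative assumptions.

```python
import torch

@torch.no_grad()
def end_to_end_predict(image_tensor, incision_model, ssi_model, threshold=0.5):
    """Stage 1: does the image show an incision? Stage 2: if so, does it show an SSI?

    image_tensor: a single preprocessed image of shape [1, 3, 224, 224].
    """
    incision_model.eval()
    ssi_model.eval()
    p_incision = incision_model(image_tensor).softmax(dim=-1)[0, 1].item()
    if p_incision < threshold:
        return {"incision": False, "p_incision": p_incision, "ssi": None}
    p_ssi = ssi_model(image_tensor).softmax(dim=-1)[0, 1].item()
    return {"incision": True, "p_incision": p_incision,
            "ssi": p_ssi >= threshold, "p_ssi": p_ssi}
```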
Attention Maps
Attention maps were generated to visually represent the regions of an image that the model focuses on when making predictions.27 This was performed for Vision Transformer, which processes images by dividing them into small patches and calculating how each patch relates to every other patch (attention). Attention weights from the final layer of the Vision Transformer model were used to generate these maps, highlighting the most relevant areas contributing to the model’s prediction and providing insight into the model’s decision-making process.
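One common way to extract such maps from a timm Vision Transformer is to hook the final block's attention and reshape the CLS-token attention over the image patches. The sketch below illustrates that general approach, not the authors' exact code; it assumes a recent timm version (which exposes the `fused_attn` flag), a 224×224 input with 16×16 patches, and a single CLS token.

```python
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2).eval()
model.blocks[-1].attn.fused_attn = False  # use the code path that materializes attention weights

attn_store = {}
def save_attn(_module, _inputs, output):
    attn_store["attn"] = output  # [batch, heads, tokens, tokens], post-softmax

hook = model.blocks[-1].attn.attn_drop.register_forward_hook(save_attn)

x = torch.randn(1, 3, 224, 224)           # a preprocessed wound image would go here
with torch.no_grad():
    _ = model(x)
hook.remove()

attn = attn_store["attn"].mean(dim=1)      # average over heads -> [1, tokens, tokens]
cls_to_patches = attn[0, 0, 1:]            # attention from the CLS token to the 196 patch tokens
attn_map = cls_to_patches.reshape(14, 14)  # 224 / 16 = 14 patches per side
attn_map = torch.nn.functional.interpolate(
    attn_map[None, None], size=(224, 224), mode="bilinear", align_corners=False
)[0, 0]                                    # upsample to overlay on the original image (eg, with Matplotlib)
```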
Gradient-weighted Class Activation Mapping (Grad-CAM) Visualization
Grad-CAM was used to visualize which areas of an image most influenced the model’s prediction.28 Grad-CAM works by using the gradients flowing into the final convolutional layers of CNN-based models (eg, ResNet50, MobileNet, U-Net) to generate a coarse heatmap that highlights important regions contributing to the classification output. These visualizations enhance model interpretability by identifying the features that drive predictions. We used the trained ResNet50 model to generate the Grad-CAM visualizations.
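The PyTorch Grad-CAM library mentioned in the Statistical Analysis section exposes this directly. A minimal sketch for a ResNet50 classifier follows; the target layer and positive-class index are typical choices assumed here rather than taken from the paper.

```python
import torch
import timm
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = timm.create_model("resnet50", pretrained=True, num_classes=2).eval()
target_layers = [model.layer4[-1]]          # last convolutional block of ResNet50

input_tensor = torch.randn(1, 3, 224, 224)  # a preprocessed wound image would go here
cam = GradCAM(model=model, target_layers=target_layers)

# Class index 1 assumed to be the positive label (incision present / SSI present)
grayscale_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(1)])
heatmap = grayscale_cam[0]                  # 224x224 array in [0, 1], overlaid on the image for review
```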
Image Quality Prediction Analysis
To enhance the model’s applicability and real-world utility, an additional analysis was conducted using the same model development and validation methods to predict image quality. The best-performing pretrained architecture was leveraged to detect images with blurriness, covered incisions, poor angle, suboptimal lighting, suboptimal distance, and overall image quality. This analysis was intended to support future implementation, in which automated feedback would prompt patients to submit higher-quality images. Therefore, the results of these models were reported independently and were not included in the incision detection, infection detection, and end-to-end models.
Sensitivity Analysis for Bias
To evaluate potential bias based on race, a sensitivity analysis was performed using the best-performing pretrained architecture. Patients were stratified into White and non-White subgroups based on the patient-reported race variable abstracted from NSQIP. The same model development, fine-tuning, and validation procedures were applied to each subgroup to assess whether the model’s predictive ability was equitable across groups.
Statistical Analysis
Performance metrics, including accuracy, precision, recall, and F1-score, were calculated for each model. Accuracy represents the proportion of correctly classified images (both true positives and true negatives) out of the total number of images, providing an overall measure of the model’s correctness. Precision, or positive predictive value, reflects the proportion of true positive results among all images the model identified as positive. Recall, also known as sensitivity, measures the proportion of actual positive cases that were correctly identified by the model, assessing its effectiveness in capturing relevant cases. The F1-score combines precision and recall as their harmonic mean, balancing the trade-off between these 2 measures.
Model discrimination was quantified using the AUC, where an AUC of 1 indicates perfect classification, and 0.5 suggests performance equivalent to random chance. During the 10-fold cross-validation, performance metrics (accuracy, precision, recall, F1-score, and AUC) were computed in each fold. The mean and SD for these scores were then calculated across all folds and reported as final performance metrics. A 2-tailed Student t test was used to compare continuous outcomes. P values <0.05 were considered statistically significant.
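A minimal sketch of the per-fold metric computation with scikit-learn is shown below; the variable names, decision threshold, and summarization helper are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def fold_metrics(y_true, y_prob, threshold=0.5):
    """y_true: 0/1 labels; y_prob: predicted probability of the positive class."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
    }

def summarize(per_fold):
    """Report mean and SD of each metric across the 10 folds."""
    return {k: (np.mean([m[k] for m in per_fold]), np.std([m[k] for m in per_fold]))
            for k in per_fold[0]}
```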
All model development was performed using PyTorch version 2.5.1+cu124, developed by Facebook’s AI Research lab (FAIR), with the timm (PyTorch Image Models) library. Performance metrics were calculated using the scikit-learn library in Python version 3.11.7. Attention maps were generated using the Matplotlib library in Python, and Grad-CAM visualizations were generated using the PyTorch Grad-CAM library in Python.
RESULTS
Cohort Description
After applying the inclusion and exclusion criteria, 6060 patients were identified, with 6199 operations and 20,895 photos submitted within 30 days after surgery (Supplemental Digital Content Figure 2, http://links.lww.com/SLA/F531). The median age was 54 years (IQR: 40–65), 61.4% (n=3805) of patients were female, and 92.5% (n=5731) identified as White (Table 1). The most common surgical specialties were orthopedics (23.9%, n=1484), general surgery (19.8%, n=1230), and plastic surgery (15.1%, n=939), with 56.8% (n=3524) of procedures performed on an outpatient basis. The median operating room time was 130 minutes (IQR: 79–224) (Table 1).
TABLE 1.
Cohort Characteristics
| Variables | Total (N=6199) | 
|---|---|
| Age at surgery, median (IQR) | 54.0 (40.0–65.0) | 
| Sex, n (%) | |
| Male | 2390 (38.6) | 
| Female | 3805 (61.4) | 
| Nonbinary | 4 (0.1) | 
| Race/ethnicity, n (%) | |
| Unknown/not reported/multiple | 177 (2.9) | 
| White | 5731 (92.5) | 
| Non-White | 291 (4.7) | 
| Functional status, n (%) | |
| Unknown | 235 (3.8) | 
| Independent | 5891 (95.0) | 
| Partially dependent or dependent | 73 (1.2) | 
| Current smoker within 1 y, n (%) | |
| No | 5646 (91.1) | 
| Yes | 553 (8.9) | 
| Diabetes mellitus with oral agents or insulin, n (%) | |
| Insulin | 250 (4.0) | 
| Noninsulin/oral | 395 (6.4) | 
| None | 5554 (89.6) | 
| Surgical specialty, n (%) | |
| General surgery | 1230 (19.8) | 
| Gynecology | 400 (6.5) | 
| Neurosurgery | 663 (10.7) | 
| Orthopedics | 1484 (23.9) | 
| ENT | 445 (7.2) | 
| Plastics | 939 (15.1) | 
| Thoracic | 231 (3.7) | 
| Urology | 450 (7.3) | 
| Vascular | 156 (2.5) | 
| Obstetrics | 201 (3.2) | 
| Inpatient/outpatient, n (%) | |
| Inpatient | 2675 (43.2) | 
| Outpatient | 3524 (56.8) | 
| Time in OR (min), median (IQR) | 130.0 (79.0–224.0) | 
| Surgical site infection, n (%) | |
| No | 5776 (93.2) | 
| Yes | 423 (6.8) | 
Cohort characteristics are described at the surgery level. Included patients may have more than one surgical intervention. Values represent median (interquartile range, IQR) for continuous data and counts (percentage) for categorical data.
Following manual review of all submitted photos, 13,825 photos from 4289 patients were found to contain a surgical incision. Among these, 1365 photos came from 372 patients diagnosed with an SSI (Supplemental Digital Content Figure 2, http://links.lww.com/SLA/F531).
Incision Detection
A total of 20,895 photos were used to develop the incision detection model, of which 13,825 contained a surgical incision and 7069 did not (Table 1 and Supplemental Digital Content Table 2, http://links.lww.com/SLA/F531). Vision Transformer achieved the highest performance, with an accuracy of 0.94±0.00, precision of 0.93±0.01, recall of 0.94±0.01, F1-score of 0.94±0.00, and an AUC of 0.98±0.00 (Table 2 and Fig. 1A). In comparison, MobileNet, ResNet50, and U-Net showed slightly lower accuracies of 0.90±0.01, 0.89±0.01, and 0.88±0.01, respectively. While MobileNetV4 and ResNet50 demonstrated higher precision, Vision Transformer outperformed in recall and F1-score, indicating a more balanced performance across different performance metrics (Table 2 and Fig. 1A).
TABLE 2.
Model Performance for Incision and SSI Prediction
| Model | Accuracy | Precision | Recall | F1-score | AUC | 
|---|---|---|---|---|---|
| Incision detection | |||||
| Vision Transformer | 0.94±0.00 | 0.93±0.01 | 0.94±0.01 | 0.94±0.00 | 0.98±0.00 | 
| MobileNet | 0.90±0.01 | 0.94±0.01 | 0.84±0.02 | 0.89±0.01 | 0.95±0.01 | 
| ResNet50 | 0.89±0.01 | 0.94±0.01 | 0.84±0.01 | 0.89±0.01 | 0.95±0.01 | 
| U-Net | 0.88±0.01 | 0.93±0.02 | 0.83±0.02 | 0.88±0.01 | 0.84±0.01 | 
| Surgical site infection detection | |||||
| Vision Transformer | 0.73±0.02 | 0.89±0.04 | 0.53±0.05 | 0.67±0.03 | 0.81±0.05 | 
| MobileNet | 0.71±0.03 | 0.92±0.04 | 0.47±0.05 | 0.62±0.05 | 0.77±0.04 | 
| ResNet50 | 0.72±0.02 | 0.88±0.03 | 0.50±0.05 | 0.64±0.03 | 0.73±0.02 | 
| U-Net | 0.75±0.01 | 0.88±0.03 | 0.57±0.05 | 0.70±0.03 | 0.57±0.03 | 
| End-to-end model | |||||
| Vision Transformer | 0.87±0.05 | 0.43±0.10 | 0.54±0.05 | 0.46±0.06 | 0.82±0.02 | 
| MobileNet | 0.71±0.02 | 0.40±0.04 | 0.49±0.06 | 0.44±0.03 | 0.76±0.01 | 
| ResNet50 | 0.70±0.02 | 0.37±0.06 | 0.52±0.06 | 0.43±0.04 | 0.69±0.01 | 
| U-Net | 0.73±0.01 | 0.34±0.07 | 0.59±0.05 | 0.42±0.03 | 0.55±0.01 | 
Prediction of incision, SSI, and end-to-end model pipeline. All estimates are presented as mean (SD).
FIGURE 1.

Receiver operating characteristic curves. Receiver operating characteristic (ROC) curves, with the area under the curve (AUC), calculated for each model architecture to illustrate predictive ability. AUC ≥0.7 indicates acceptable discrimination, while AUC ≥0.8 signifies excellent discrimination. Vision Transformer (blue), ResNet50 (green), MobileNetV4 (yellow), and U-Net (red).
Infection Detection
A total of 13,825 incision photos were used to develop the SSI model, including 1365 photos with an SSI and 12,460 without. Vision Transformer achieved an accuracy of 0.73±0.02, with high precision (0.89±0.04) but moderate recall (0.53±0.05) and an F1-score of 0.67±0.03, yielding an AUC of 0.81±0.05. MobileNet recorded an accuracy of 0.71±0.03 and the highest precision at 0.92±0.04, although its recall was lower (0.47±0.05), resulting in an F1-score of 0.62±0.05 and an AUC of 0.77±0.04. ResNet50 performed similarly with an accuracy of 0.72±0.02, precision of 0.88±0.03, recall of 0.50±0.05, and an F1-score of 0.64±0.03, with an AUC of 0.73±0.02. U-Net demonstrated the highest accuracy (0.75±0.01) and best recall (0.57±0.05) and F1-score (0.70±0.03), with an AUC of 0.84±0.03. While the models show comparable performance, Vision Transformer exhibits a more balanced detection capability (Table 2 and Fig. 1B).
End-to-End Model
The end-to-end analysis integrates both incision and infection detection. Vision Transformer outperformed the other models, achieving an accuracy of 0.87±0.05 and an AUC of 0.82±0.02, despite modest precision and F1-score (Table 2 and Fig. 1C). In contrast, MobileNet, ResNet50, and U-Net demonstrated lower accuracies of 0.71±0.02, 0.70±0.02, and 0.73±0.01, respectively, with comparable performance across precision, recall, and F1-score metrics. Therefore, Vision Transformer provides a more robust classification framework for automated surgical wound analysis. As a result, all additional analyses were conducted using Vision Transformer.
Attention Maps and Grad-CAM
To better understand model decision-making, attention maps were generated using Vision Transformer (Supplemental Digital Content Figure 3A, http://links.lww.com/SLA/F531) and Grad-CAM images were generated for ResNet50 (Supplemental Digital Content Figure 3B, http://links.lww.com/SLA/F531). The highest activation was consistently localized over the incision, indicating that the models effectively focus on the most relevant features for accurate detection.
Image Quality Prediction Analysis
A total of 13,825 photos with incisions were used to assess image quality. Vision Transformer achieved an accuracy of 0.82±0.01, precision of 0.81±0.02, recall of 0.85±0.05, F1-score of 0.82±0.01, and an AUC of 0.90±0.01 (Supplemental Digital Content Table 3, http://links.lww.com/SLA/F531 and Fig. 2). When evaluating specific quality attributes, the model performed best in identifying covered incisions, with slightly lower performance in detecting blurriness, poor angle, lighting, and distance issues. This suggests that the model is highly effective at assessing overall image quality and can be utilized for providing image quality feedback.
FIGURE 2.

Prediction of image quality. Receiver operating characteristic (ROC) curves, with the area under the curve (AUC), calculated for Vision Transformer to illustrate its ability to identify overall image quality (blue), blurry images (orange), images with covered incisions (green), poor angle (red), suboptimal lighting (too bright or too dark) (purple), and suboptimal distance (too far or too close) (brown).
Sensitivity Analysis for Bias
A sensitivity analysis stratified by race demonstrated comparable performance between White and non-White patients (Supplemental Digital Content Table 4, http://links.lww.com/SLA/F531 and Fig. 3). For incision detection, the model achieved similar accuracies (0.93±0.02 for White vs 0.94±0.02 for non-White) with nearly identical precision, recall, F1-scores, and AUC values. SSI detection metrics were also similar between the 2 groups, and the end-to-end model further confirmed this consistency, yielding identical performance across racial groups, suggesting minimal bias and strong generalizability.
FIGURE 3.

Sensitivity analysis for race. Receiver operating characteristic (ROC) curves, with the area under the curve (AUC), calculated for a sensitivity analysis of model performance across race using Vision Transformer. P values reflect 2-tailed Student t tests comparing White and non-White patient groups.
DISCUSSION
We developed an AI-based pipeline to assess and triage postoperative wound images, focusing on surgical incision and SSI detection. Among the evaluated architectures, Vision Transformer consistently outperformed alternative models for incision detection, achieving an accuracy of 0.94 and an AUC of 0.98. While its performance for SSI detection was more modest (accuracy of 0.73 and AUC of 0.81), the overall end-to-end model demonstrated promising results, supporting the feasibility of AI in automating incision and SSI detection. These findings are particularly relevant in an era of increased reliance on patient portals for communication, a growing volume of outpatient surgeries, and expanding remote patient monitoring programs, all of which contribute to the rising administrative burden faced by health care providers.12–14
Given its current performance, the model is best suited as a triage support tool intended to complement, rather than replace, clinical judgment. Although not the aim of this study, it is noteworthy that the model’s accuracy for SSI detection, based on image analysis alone, while not perfect, approaches the performance accuracy reported for clinicians.29–32 Importantly, SSIs often cannot be diagnosed based on a single image, even by expert surgeons. Clinical diagnosis typically incorporates multimodal data sources, including patient symptoms and physical examination. Thus, we anticipate that any model solely reliant on imaging will have intrinsic performance limitations, and that future enhancements incorporating multiple data sources, such as symptoms and patient risk factors, will be necessary to improve diagnostic accuracy and clinical utility.
Vision Transformer, MobileNet, ResNet50, and U-Net models were selected for incision and SSI detection to allow a comparison of architectures with diverse strengths. Vision Transformer outperformed the other models due to its ability to capture pixel dependencies and global contextual relationships in the images, which is crucial for distinguishing subtle patterns indicative of incisions and infections. Unlike conventional convolutional neural network-based models such as MobileNet and ResNet50, which rely on local receptive fields and hierarchical feature extraction, Vision Transformer processes entire image patches simultaneously, leading to better feature representation in complex medical imagery. MobileNet, designed for efficiency, demonstrated strong performance suitable for real-time applications but lacked the depth needed for intricate feature learning. ResNet50, while robust in deep feature extraction, showed limitations in handling spatially scattered or small-scale features. U-Net, primarily a segmentation model, struggled with classification tasks but provided useful localization insights. This comparative analysis demonstrated that Vision Transformer’s performance makes it a promising candidate for clinical deployment, balancing accuracy and interpretability in surgical image analysis.
This study builds on existing literature by illustrating the possibility of utilizing AI for wound monitoring.18,33 Herein, we focus on a pragmatic approach using real-world unstructured images submitted by patients without predefined criteria for submission. Our work leverages state-of-the-art attention-based architectures such as Vision Transformer, which effectively capture clinically relevant features, to detect surgical incisions, SSIs, and image quality. In addition, unlike prior work,18 we showed that, through iterative model development and refinement, we overcame racial biases and barriers, as demonstrated by comparable performance metrics across racial subgroups. With prospective external validation, this tool has the potential to be integrated into clinical workflows to provide immediate feedback to patients and support care teams, for example, by acknowledging the type of image received, identifying low-quality submissions to prompt the patient for additional photos, and flagging patients at risk of SSI who require expedited review by a provider. Moreover, the application of interpretability techniques will enhance the ability to understand precisely which image features influence model predictions and whether subclinical changes might precede clinically apparent infections.
The pragmatic approach of incorporating AI-based image analysis has advantages, but several limitations must be acknowledged. First, while incision detection performance was robust, the SSI detection component may require further refinement, possibly through a larger sample size and the integration of additional patient-reported or clinical variables. For example, the addition of patient-reported symptoms (eg, drainage or warmth) could enhance future model performance. Second, although this model was trained on a large, multi-institutional dataset encompassing a variety of surgical specialties, it was limited to NSQIP-sampled procedures, excluding trauma, cardiac, and transplant surgeries. Further validation in these patient populations and at other centers is needed. Third, neither external nor prospective validation was performed, both of which are essential to assess the model’s generalizability and performance in real-world clinical environments. Fourth, while reliance on NSQIP for SSI identification ensures a standardized framework for consistent identification of the outcome, prior studies have documented some limitations of NSQIP in capturing SSIs.34 Finally, although the model exhibited comparable performance across races, the non-White patient sample size was small and patient-reported race may not necessarily correlate with skin pigmentation. Further studies stratifying performance by skin tone are currently in process to evaluate the model’s performance across different skin pigmentation levels.
In conclusion, integrating AI into postoperative surgical incision monitoring is feasible and holds promise for optimizing patient outcomes and reducing the administrative burden on clinicians by streamlining clinical workflow. These findings can revolutionize clinical practice by integrating automated AI-based wound image analysis to streamline image triage, facilitate earlier SSI detection and improve response time. As health care systems increasingly rely on patient portals, outpatient surgeries, and remote monitoring programs, AI-driven solutions offer a pragmatic approach to reducing clinicians’ workload while enhancing patient care.
Supplementary Material
ACKNOWLEDGMENT
The authors acknowledge funding from the Dalio Foundation and the Simons Family Career Development Award in Surgical Innovation that supported this work.
DISCUSSANT
Dr. Paul C. Kuo (Tampa, FL)
I don’t have any conflicts to report. But, to paraphrase one of my trainees, I wish I did.
This is a really cool paper, and it rekindles my faith in impactful surgical innovation and research. And I’m super excited to discuss this.
The paper describes an AI pipeline that identifies surgical incisions and then determines the presence or absence of SSI. I’m not going to repeat the findings of the study, and what follows are my nerdy meanderings in no particular order.
One, the authors state that this technology would be primarily positioned as a workflow optimization tool rather than a diagnostic replacement. So, what are some implementation strategies or integration points within workflows? Like, how many days postop do you want? Staples or no staples?
And as you point out, the 73% accuracy for SSI detection, while promising, may not be sufficient for fully automated screening without human oversight.
Your study uses a traditional test and train approach but leaves out what is now a requirement of having a validation set in addition to test-and-train. And a lack of validation on an external dataset would seem to be a critical limitation.
Would the authors consider evaluation using a prospectively collected dataset or compare their model’s performance against human diagnosis in a blinded trial?
The next point is rhetorical, but it is extremely problematic for AI clinical studies. The authors mention the TRIPOD guidelines, but several key elements for reproducibility are missing, such as code availability, data sharing of either original or mirrored data, sufficient technical details to reproduce the models, and information on hyperparameter selection.
Without these elements, reproducibility and feature applicability of AI clinical studies are difficult, and the scientific community cannot build on these findings.
The paper provides limited information about specific image features that contribute to SSI detection. A comprehensive feature importance analysis would not only improve model interpretability but also provide important clinically relevant insights into early visual indicators of infection.
While attention maps demonstrate that the model focuses on the incision area, other interpretability approaches could be considered, such as quantitative feature importance analysis, ablation (which for basic scientists is an AI word for loss-of-function studies), counterfactual examples, and Grad-CAM or similar visualizations on convolutional neural network models.
With regard to hyperparameter tuning, the Supplementary methods, http://links.lww.com/SLA/F531 mention using grid search but really provide few specifics. And you used an Adam optimizer with cosine annealing but really don’t mention exploring alternative optimization strategies or loss functions.
These questions aside, I’m really, really interested in the last items because I wonder if SSIs could be detected before becoming apparent to the human eye, meaning subclinical visual patterns that precede obvious infection. The median time to SSI diagnosis in this study is 15 days, but could an AI system detect SSI days earlier, enabling intervention?
Smartphones can be software enabled to use an extended portion of the EM spectrum. So as a result, there’s a lot of information that’s potentially available that might even contribute to the diagnosis of SSI before a human can see it.
And parenthetically, if the AI detects it before you see it, what would you do about it? These days, what do we do? We open the incision, pop some staples. But if you knew in five days that your incision was going to pus out, well, what do you do?
Again, a really pioneering, innovative piece of work. Thank you very much.
Response From Hala Muaddi
Thank you very much for the thoughtful comments. I’m going to answer your last questions first before I jump back to the other questions.
Essentially, we hope that this is the future of AI for surgical site infections. The possibility that AI can identify subclinical features of an incision and report this earlier than the human eye can detect SSI is really the future we hope to see AI fulfill. However, we need prospective validation with statistical and methodological expertise to make sure that this will be conducted in a methodologically sound manner.
Regarding pairing the techniques with the EM spectrum, this is already utilized for chronic wounds to visualize infrared and heat radiation, and it has demonstrated significant success. We have not utilized those methods for surgical site infections, and there is strong potential for them to have the same impact, if not more.
Regarding what to do if AI detects SSI before a clinician can diagnose it, we don’t know. We usually diagnose SSIs, and then we act on it. We have never been in a situation where we were able to foresee that a patient will develop an SSI and how they would respond to preventative or earlier interventions like antibiotics or others that may be helpful. But truthfully, hopefully, we want to mitigate the complications of SSI in the future, through early identification and intervention.
With regards to the initial questions about accuracy and screening without human oversight, AI in the medical field is not at the point yet, especially for this project, to replace healthcare providers and their ability to diagnose SSI by synthesizing the clinical picture plus the information we have objectively and subjectively. In this project, at this stage, what we propose is for AI to be a supplementary tool that we can utilize to streamline our care and flag a concerning image that would expedite the review process by the healthcare provider so that the patient can receive an intervention earlier than waiting for a couple of days over a weekend or overnight until they receive an answer.
In terms of the lack of external validation and prospective validation, this is a very important point and limitation of this work. We hope to engage external collaborators who would be interested in this so that we can together, hopefully, define how this will shape and how this is going to progress in the future.
Yes, the code can be made available with some restrictions at this time, as we are in the process of commercializing this work to expedite the dissemination of this technology.
With regards to sharing the data, it’s difficult. All of these images have some form of patient identifying features that would make it a little bit more complex from a patient’s privacy aspect to be shared. But, again, I don’t think that this should be the hindrance of how and why we can collaborate with each other. There are methods to make these more anonymous, and hopefully we can share and collaborate in the future.
Dr. Charles Yeo (Philadelphia, PA)
I really enjoyed your talk. I must say I didn’t fully understand it, nor did I fully understand many of the questions that Dr. Kuo just asked.
But if I was training an AI to do this, can you help me understand, why wouldn’t it make sense to have some clinical input such as the patient has a temperature, there’s fluid dripping out of the wound, the wound is hot to the touch, or the wound is more painful today than it was three days ago?
I don’t think I heard any of those things in how you trained your AI. Does that make any sense? Or do clinical findings not matter—are they just worthless? I want you to educate me. Thank you.
Response From Hala Muaddi
You’re very welcome. And thank you very much for these questions. I am going to address your comments in two separate thought processes.
As a clinical epidemiologist, it was quite a jump to understand the AI world and AI technology. And interestingly, there’s quite a bit of overlap in clinical epidemiology terminology and AI. We just use different words to describe the same thing.
Second, from the clinical aspect of the decision-making, you are absolutely correct. And we actually have some data on patient-relevant factors. But we did not incorporate those in the model here because we wanted this to be objective, based on the incision itself and the pictures themselves. Whether incorporating those will improve the model performance or not, I suspect that it will, but this is something in which we will engage in the future.
Regarding the information about temperature, incision pain, and purulent discharge, as much as those are important factors, there is a plethora of literature showing that this information from patients or healthcare providers has very limited sensitivity and specificity for diagnosing a surgical-site infection. That said, I do not think that this information should be excluded from this project, and I do think that having some feedback from patients in an automated way, such as whether the incision is painful, red, or purulent, may hopefully improve the performance of this model.
And, again, the purpose of this model is not to only diagnose SSI, it’s primarily to help expedite the process for postoperative care for patients to receive a response in a more prompt fashion as needed, when needed.
Dr. Justin Dimick (Ann Arbor, MI):
I have almost the same question as Charlie Yeo, which scares me a little bit that we have convergent thinking on this. I also want to be educated as someone who’s not an expert.
When you sit down to design a study like this, to what extent are you asking the question you asked, which is, can computer vision or some computer vision-type model accurately identify a wound from a photo? Because I, as some might argue, am of human intelligence, and I often can’t do that, right?
So, the real question is, based on this photo that’s not very good, taken by a patient in the EMR, I need to examine this wound, right? I think this gets to a lot of the things that Dr. Yeo was saying.
That’s the question I have. It’s not necessarily a computer-vision question. It’s, like, what’s the threshold for seeing this in-person and probing it and feeling it and doing all those types of physical examination things that we just heard about?
So, to what extent do you think about the clinical context and the clinical decision-making behind the decisions we make?
And the second thing is you kind of talked about that area under the ROC curve, which is like an accuracy question. And from a design perspective, is it accuracy we care about or is it like ruling out an infection? And do you design the question around making sure we don’t miss an infection? Is this about making sure we get somebody in without missing?
Response From Hala Muaddi
Absolutely. Thank you very much for these important questions. The AI part, where we are now and where we will be in a couple of years, I do not think that this will replace the human factor of clinical decision making. All that I suspect we are going to achieve is to help triage those pictures that we’re getting in numerous numbers by patients, and then we ask for clarification photos: Can you take this in a better light? Can you take this further away? Can you take this closer so that we can see the incision?
There are ways that we can integrate this model into our practice to help us expedite the care for the patients, to simply provide an answer when they are hundreds of miles away rather than wait to see them in a week when they’re coming for a clinical visit.
With respect to ruling in surgical-site infections, this is what is reflected by the metrics. Discrimination, AUC, is not sufficient for telling us whether there is an infection. It’s just telling us that the model can distinguish which is an infection and which is not. Adding the other metrics—recall, precision and the F1 score—helps us understand more what we are ruling out and what we are ruling in.
Recall, for example, is equivalent to sensitivity, and a low recall tells us that we’re missing some true surgical-site infections. And precision is equivalent to positive predictive value. So, a low precision score tells us that we’re labeling some images as an infection when they are not truly a surgical-site infection.
Based on the numbers I presented, our model still needs further refinement, and I hope that with the prospective analysis and with the external validation we will continue to further fine-tune the model so that it can train and learn what features we’re looking for.
We tried to alleviate the concept of the “black box” by showing some of the attention maps, and I think to a certain limit, perhaps, it gives us a reassurance. However, as Dr. Kuo said, there are still other methodologies that we can use to help us understand what are the features that are being recognized by the AI system to specifically label an image as a surgical-site infection versus not. And I hope that we will be able to include this in further iterations and in the manuscript in the future.
Dr. Benedict Nwomeh (Columbus, OH)
This is fascinating. I have two quick questions. One is on the patient-related factor of skin characteristics.
So, did you stratify your data just by the patient-reported race or by skin tone, which I think is a more relevant metric here?
Response From Hala Muaddi
Absolutely.
Dr. Benedict Nwomeh (Columbus, OH)
I have one more question. The second question is about wound size.
I’m a pediatric surgeon, and I would make tiny incisions. And I was wondering whether this model performs differently based on wound size. Is there a minimum pixel density that the system can detect? Thank you.
Response From Hala Muaddi
Absolutely. Thank you for these questions. You are right. The data that I showed does not include the stratification based on skin color. I fully agree and acknowledge that race does not equal skin color at all. So now we are in the process of utilizing the Fitzpatrick scale to stratify the data so that we can perform this more relevantly, I would say, to different skin tones and skin colors.
As to the size, about 50% of the procedures that we included here were performed in an outpatient setting. And some were laparoscopic procedures and thoracoscopic procedures. We did not stratify based on this for our analysis. But I don’t suspect that it will change based on the size of the incision as the AI models look at pixels rather than big size, small size.
Dr. Melina Kibbe (Charlottesville, VA)
I want to, again, congratulate you on this work. I’m going to say something a little provocative in response to the comments of Dr. Dimick and Dr. Yeo.
The lens with which Dr. Yeo and Dr. Dimick just gave you those comments is the lens for which all of us in this room know, and it is our current paradigm for diagnosing an infection.
I will pose that AI is going to cause a paradigm shift in how medicine is practiced. And I think it’s going to happen in our lifetime. AI is going to challenge your premise, Dr. Yeo. In the future, patients are just going to send an image in, and “Dr. Google” will call a script into their pharmacy and take care of this. Physical exams may no longer be needed for much of what we do as AI may prove superior and more accurate than humans.
So, I really applaud your work. And I just felt compelled to make that provocative comment. Thank you.
Response From Hala Muaddi
Thank you. I really appreciate that very much.
Dr. Dana Telem (Ann Arbor, MI)
I love this study, and I love where this is going. And I think I want to dovetail on what Dr. Kibbe said because what I heard from Dr. Yeo and Dr. Dimick is believability, and I heard a question about implementation.
And how are you going to go about, have you thought about, the qualitative work and the work that you would need to do to shop this around to understand how to get the users in this room to believe this, to use this, to have this paradigm shift, and what it will take to get you from here to there?
Response From Hala Muaddi
Thank you for the question and the comment. I think the most important part of this is having a prospective analysis where we are going to monitor the patients and follow them with multiple images or pictures taken per day with the assistance of the patient themselves, so not us taking the pictures, but the patient themselves taking the picture over multiple days, until we do or do not flag a surgical-site infection.
And this must happen in a prospective way because, first, it’s hard to fully determine what features that AI is picking up on that might indicate that there will be a surgical-site infection in a couple of days or so. We don’t know whether AI can or cannot do that. We can only hypothesize and theorize that this will be the future. And I hope that answered your question.
Dr. Paul C. Kuo (Tampa, FL)
I want to just throw something out to follow up on what Dr. Kibbe said.
Thinking about it, the amount of data, the amount of storage capacity that’s available, the processing power of machines these days far exceeds that of the human brain.
Response From Hala Muaddi
Correct.
Dr. Paul C. Kuo (Tampa, FL)
Add on to that now, the development of algorithms that are, quote/unquote, learnable, or they learn, that’s almost like the last step. Add on to the fact, then, with robots that can physically manipulate the physical environment. Like, for example, when was the last time a human made an integrated circuit? So, I think that’s the future. Something to think about.
Response From Hala Muaddi
Thank you.
Footnotes
H.S. and C.T. contributed equally to the work.
Funded by the Dalio Philanthropies Artificial Intelligence/Machine Learning Enablement Award from the Center for Digital Health at Mayo Clinic and the Simons Family Career Development Award in Surgical Innovation from the Mayo Clinic.
The data in this study are not publicly available due to patient privacy and the institutional data use agreement. Access to the identified data may be considered upon reasonable request with the appropriate ethical institutional approvals. The model architecture and training code are not publicly available but could be shared with individual institutions at their request.
C.T. has no disclosures related to this work but does declare unrelated consulting/advisor relationships with apoQlar Medical and Intera Oncology. The remaining authors report no conflicts of interest.
Supplemental Digital Content is available for this article. Direct URL citations are provided in the HTML and PDF versions of this article on the journal's website, www.annalsofsurgery.com.
Contributor Information
Hala Muaddi, Email: hala.muaddi@gmail.com.
Ashok Choudhary, Email: choudhary.ashok@mayo.edu.
Frank Lee, Email: lee.frank@mayo.edu.
Stephanie S. Anderson, Email: anderson.stephanie@mayo.edu.
Elizabeth Habermann, Email: habermann.elizabeth@mayo.edu.
David Etzioni, Email: etzioni.david@gmail.com.
Sarah McLaughlin, Email: mclaughlin.sarah@mayo.edu.
Michael Kendrick, Email: kendrick.michael@mayo.edu.
Hojjat Salehinejad, Email: salehinejad.hojjat@mayo.edu.
Cornelius Thiels, Email: thiels.cornelius@mayo.edu.
REFERENCES
- 1. Dencker EE, Bonde A, Troelsen A, et al. Postoperative complications: an observational study of trends in the United States from 2012 to 2018. BMC Surg. 2021;21:393.
- 2. Ban KA, Minei JP, Laronga C, et al. Executive Summary of the American College of Surgeons/Surgical Infection Society Surgical Site Infection Guidelines-2016 Update. Surg Infect. 2017;18:379–382.
- 3. Awad SS. Adherence to surgical care improvement project measures and post-operative surgical site infections. Surg Infect. 2012;13:234–237.
- 4. Zimlichman E, Henderson D, Tamir O, et al. Health care-associated infections: a meta-analysis of costs and financial impact on the US health care system. JAMA Intern Med. 2013;173:2039–2046.
- 5. Merkow RP, Ju MH, Chung JW, et al. Underlying reasons associated with hospital readmission following surgery in the United States. JAMA. 2015;313:483–495.
- 6. Berrios-Torres SI, Umscheid CA, Bratzler DW, et al. Centers for Disease Control and Prevention Guideline for the Prevention of Surgical Site Infection, 2017. JAMA Surg. 2017;152:784–791.
- 7. Omling E, Jarnheimer A, Rose J, et al. Population-based incidence rate of inpatient and outpatient surgical procedures in a high-income country. Br J Surg. 2018;105:86–95.
- 8. Bicket MC, Chua KP, Lagisetty P, et al. Prevalence of Surgery Among Individuals in the United States. Ann Surg Open. 2024;5:e421.
- 9. Spaulding A, Loomis E, Brennan E, et al. Postsurgical Remote Patient Monitoring Outcomes and Perceptions: A Mixed-Methods Assessment. Mayo Clin Proc Innov Qual Outcomes. 2022;6:574–583.
- 10. Dawes AJ, Lin AY, Varghese C, et al. Mobile health technology for remote home monitoring after surgery: a meta-analysis. Br J Surg. 2021;108:1304–1314.
- 11. Nath B, Williams B, Jeffery MM, et al. Trends in Electronic Health Record Inbox Messaging During the COVID-19 Pandemic in an Ambulatory Practice Network in New England. JAMA Netw Open. 2021;4:e2131490.
- 12. Shenson JA, Cronin RM, Davis SE, et al. Rapid growth in surgeons’ use of secure messaging in a patient portal. Surg Endosc. 2016;30:1432–1440.
- 13. Gregory ME, Russo E, Singh H. Electronic Health Record Alert-Related Workload as a Predictor of Burnout in Primary Care Providers. Appl Clin Inform. 2017;8:686–697.
- 14. Rochon M, Jawarchan A, Fagan F, et al. Image-based digital post-discharge surveillance in England: measuring patient enrolment, engagement, clinician response times, surgical site infection, and carbon footprint. J Hosp Infect. 2023;133:15–22.
- 15. Martin D, Hubner M, Moulin E, et al. Timing, diagnosis, and treatment of surgical site infections after colonic surgery: prospective surveillance of 1263 patients. J Hosp Infect. 2018;100:393–399.
- 16. Gibson A, Tevis S, Kennedy G. Readmission after delayed diagnosis of surgical site infection: a focus on prevention using the American College of Surgeons National Surgical Quality Improvement Program. Am J Surg. 2014;207:832–839.
- 17. Elhage SA, Deerenberg EB, Ayuso SA, et al. Development and Validation of Image-Based Deep Learning Models to Predict Surgical Complexity and Complications in Abdominal Wall Reconstruction. JAMA Surg. 2021;156:933–940.
- 18. Rochon M, Tanner J, Jurkiewicz J, et al. Wound imaging software and digital platform to assist review of surgical wounds using patient smartphones: the development and evaluation of artificial intelligence (WISDOM AI study). PLoS One. 2024;19:e0315384.
- 19. Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. J Clin Epidemiol. 2015;68:134–143.
- 20. Polanin JR, Pigott TD, Espelage DL, et al. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10:330–342.
- 21. Borchardt RA, Tzizik D. Update on surgical site infections: the new CDC guidelines. JAAPA. 2018;31:52–54.
- 22. Christensen AMM, Dowler K, Doron S. Surgical site infection metrics: dissecting the differences between the National Health and Safety Network and the National Surgical Quality Improvement Program. Antimicrob Steward Healthc Epidemiol. 2021;1:e16.
- 23. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv. 2020;abs/2010.11929.
- 24. He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:770–778.
- 25. Qin D, Leichner C, Delakis M, et al. MobileNetV4: Universal Models for the Mobile Ecosystem. European Conference on Computer Vision (ECCV). 2024;XL:78–96.
- 26. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Cham: Springer International Publishing; 2015:234–241.
- 27. Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9:90–95.
- 28. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2020;128:336–359.
- 29. Lepelletier D, Ravaud P, Baron G, et al. Agreement among health care professionals in diagnosing case vignette-based surgical site infections. PLoS One. 2012;7:e35131.
- 30. Taylor G, McKenzie M, Kirkland T, et al. Effect of surgeon’s diagnosis on surgical wound infection rates. Am J Infect Control. 1990;18:295–299.
- 31. Rosenthal R, Weber WP, Marti WR, et al. Surveillance of surgical site infections by surgeons: biased underreporting or useful epidemiological data? J Hosp Infect. 2010;75:178–182.
- 32. Wilson J, Ramboer I, Suetens C, et al; Hospitals in Europe Link for Infection Control through Surveillance (HELICS). Inter-country comparison of rates of surgical site infection: opportunities and limitations. J Hosp Infect. 2007;65(suppl 2):165–170.
- 33. McLean KA, Sgro A, Brown LR, et al. Multimodal machine learning to predict surgical site infection with healthcare workload impact assessment. NPJ Digit Med. 2025;8:121.
- 34. Ali-Mucheru MN, Seville MT, Miller V, et al. Postoperative surgical site infections: understanding the discordance between surveillance systems. Ann Surg. 2020;271:94–99.
