The generative artificial intelligence (AI) tool ChatGPT has promising clinical applications. Since a study demonstrated that ChatGPT could pass the United States Medical Licensing Examination (USMLE) Step exams, clinicians have been eager to explore its use in practice.1,2 This study investigated the utility of ChatGPT (Version 4) as an adjunctive tool for breast surgery decision-making by posing a series of clinical questions and comparing its responses with three key surgical guidelines.
Accuracy statements
Three breast reconstruction guidelines published by the American Society of Plastic Surgeons (ASPS) were identified: Reduction Mammaplasty,3 Breast Reconstruction with Expanders & Implants,4 and Autologous Breast Reconstruction with DIEP or Pedicled TRAM Abdominal Flaps.5 The recommendations were converted into questions and posed to ChatGPT, and four reviewers graded each response as “Fully Concordant,” “Partially Concordant,” or “Nonconcordant.” In total, 32 questions were developed across the three guidelines. Cumulatively, ChatGPT’s responses were 31.3% fully concordant, 40.6% partially concordant, and 28.1% nonconcordant (Figure 1).
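The cumulative rates follow directly from the per-guideline counts reported in the guideline-specific results (fully/partially/nonconcordant: 5/2/3 for reduction mammaplasty, 5/7/5 for expanders and implants, and 0/4/1 for autologous reconstruction). A minimal Python sketch of that arithmetic is shown below; the names `COUNTS`, `percentages`, and `pool` are hypothetical, and half-up rounding to one decimal place is an assumption chosen because it reproduces the reported figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Response grades in the order used throughout: fully, partially, nonconcordant.
GRADES = ("fully", "partially", "non")

# Per-guideline counts as stated in the Results text.
# The dictionary name and keys are illustrative, not taken from the study.
COUNTS = {
    "Reduction mammaplasty": (5, 2, 3),
    "Expanders & implants": (5, 7, 5),
    "Autologous (DIEP/TRAM)": (0, 4, 1),
}


def percentages(counts):
    """Share of each grade in percent, rounded half-up to one decimal place.

    Half-up rounding is an assumption; Python's built-in round() uses
    round-half-to-even and would turn 31.25 into 31.2.
    """
    total = sum(counts)
    return {
        grade: float(
            (Decimal(100 * n) / Decimal(total)).quantize(
                Decimal("0.1"), rounding=ROUND_HALF_UP
            )
        )
        for grade, n in zip(GRADES, counts)
    }


def pool(count_table):
    """Sum the (fully, partially, non) counts across all guidelines."""
    return tuple(sum(column) for column in zip(*count_table.values()))


if __name__ == "__main__":
    cumulative = pool(COUNTS)
    print(sum(cumulative), percentages(cumulative))
```

Run as a script, this pools 32 responses and yields rates of 31.3%, 40.6%, and 28.1%; calling `percentages` on a single entry of `COUNTS` gives that guideline's own rates (for example, 50.0%, 20.0%, and 30.0% for reduction mammaplasty).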
Figure 1.

Cumulative Accuracy of ChatGPT Responses to Breast Reconstruction Surgical Guideline Prompts. Bar graph showing the concordance of ChatGPT responses to clinical questions for each surgical guideline and cumulatively across all three guidelines.
Reduction mammaplasty
Recommendations were provided for 10 topics related to reduction mammaplasty. ASPS graded each recommendation by its overall strength: strong, moderate, weak, or “option” (indicating that the available evidence is inconsistent). When asked a question corresponding to each of the 10 topics, ChatGPT generated five (50%) fully concordant, two (20%) partially concordant, and three (30%) nonconcordant responses. Its answers were fully concordant with all strong, weak, and option recommendations. For the six moderate recommendations, ChatGPT was fully concordant with one (17%), partially concordant with two (33%), and nonconcordant with three (50%) (Table 1).
Table 1.
Strength-specific accuracy of ChatGPT responses to breast reconstruction surgical guideline prompts.
| Recommendation Strength | Reduction Mammaplasty N (Column %) | Expanders & Implants N (Column %) | Autologous Breast Reconstruction N (Column %) | Total N (%) |
|---|---|---|---|---|
| Strong | | | | |
| Fully Concordant | 2 (100%) | 1 (33%) | - | 3 (60%) |
| Partially Concordant | 0 (0%) | 1 (33%) | - | 1 (20%) |
| Nonconcordant | 0 (0%) | 1 (33%) | - | 1 (20%) |
| Standard/Moderate | | | | |
| Fully Concordant | 1 (17%) | 1 (25%) | - | 2 (20%) |
| Partially Concordant | 2 (33%) | 1 (25%) | - | 3 (30%) |
| Nonconcordant | 3 (50%) | 2 (50%) | - | 5 (50%) |
| Weak | | | | |
| Fully Concordant | 1 (100%) | NA | NA | - |
| Partially Concordant | 0 (0%) | NA | NA | - |
| Nonconcordant | 0 (0%) | NA | NA | - |
| Option | | | | |
| Fully Concordant | 1 (100%) | 2 (22%) | 0 (0%) | 3 (20%) |
| Partially Concordant | 0 (0%) | 5 (56%) | 4 (80%) | 9 (60%) |
| Nonconcordant | 0 (0%) | 2 (22%) | 1 (20%) | 3 (20%) |
“-” indicates no responses were recorded in that category, or total calculation was not performed. “NA” indicates that recommendation strength was not available for those guidelines.
Breast reconstruction with expanders & implants
Recommendations were provided for 17 topics related to implant-based breast reconstruction. ASPS used a simplified grading scale of strong, standard, or option. Of ChatGPT’s responses to the 17 corresponding questions, five (29.4%) were fully concordant, seven (41.2%) were partially concordant, and five (29.4%) were nonconcordant. Responses to strong recommendations were evenly distributed: one (33%) fully concordant, one (33%) partially concordant, and one (33%) nonconcordant. Accuracy was lower for standard recommendations, with one (25%) fully concordant, one (25%) partially concordant, and two (50%) nonconcordant responses. ChatGPT aligned most closely with option recommendations: five (56%) responses were partially concordant, and the remaining four were split evenly between fully concordant (two, 22%) and nonconcordant (two, 22%).
Autologous breast reconstruction with DIEP or pedicled TRAM abdominal flaps
Recommendations were provided for five topics related to abdominal-based autologous breast reconstruction. All five had received the “option” grade, indicating inconsistent supporting evidence, which makes these the least definitive recommendations of the three guidelines. None of ChatGPT’s responses were fully concordant: four (80%) were partially concordant and one (20%) was nonconcordant. This diminished accuracy reflects the current variability within the field.
The accuracy of ChatGPT as a medical decision-making tool remains low and varies with the topic and the strength of the underlying evidence. AI continues to advance: ChatGPT-4 surpasses its predecessor, ChatGPT-3.5, by offering internet connectivity and source citations. Nevertheless, caution should be exercised when incorporating generative AI tools into medical practice. Further development is required to improve concordance between AI output and practice guidelines before such tools can be considered for implementation in surgical practice.
Footnotes
Declaration of Competing Interest
None.
References
1. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2(2). 10.1371/journal.pdig.0000198.
2. Jeha GM, Qiblawi S, Jairath N, et al. ChatGPT and generative artificial intelligence in Mohs surgery: a new frontier of innovation. J Invest Dermatol 2023;S0022-202X(23)02142-5. Published online June 3, 2023. 10.1016/j.jid.2023.05.018.
3. Perdikis G, Dillingham C, Boukovalas S, et al. American Society of Plastic Surgeons evidence-based clinical practice guideline revision: reduction mammaplasty. Plast Reconstr Surg 2022;149(3):392e–409e. 10.1097/PRS.0000000000008860.
4. Alderman A, Gutowski K, Ahuja A, Gray D. ASPS clinical practice guideline summary on breast reconstruction with expanders and implants. Plast Reconstr Surg 2014;134(4):648e–55e. 10.1097/PRS.0000000000000541.
5. Lee BT, Agarwal JP, Ascherman JA, et al. Evidence-based clinical practice guideline: autologous breast reconstruction with DIEP or pedicled TRAM abdominal flaps. Plast Reconstr Surg 2017;140(5):651e–64e. 10.1097/PRS.0000000000003768.
