Generative artificial intelligence fails to provide sufficiently accurate recommendations when compared to established breast reconstruction surgery guidelines

Michael P Saturno; Mateo Restrepo Mejia; Anya Wang; Daniel Kwon; Olachi Oleru; Nargiz Seyidova; Peter W Henderson

doi:10.1016/j.bjps.2023.09.030

Author manuscript; available in PMC: 2024 Mar 26.

Published in final edited form as: J Plast Reconstr Aesthet Surg. 2023 Sep 15;86:248–250. doi: 10.1016/j.bjps.2023.09.030

PMCID: PMC10965244 NIHMSID: NIHMS1978336 PMID: 37793197

The generative artificial intelligence (AI) tool ChatGPT has promising applications. Following a study demonstrating that ChatGPT could pass the USMLE Step Exams, clinicians are eager to explore its application in practice.¹,² This study investigated the utility of ChatGPT (Version 4) as an adjunctive tool for breast surgery decision-making by posing a series of clinical questions and comparing its responses to three key surgical guidelines.

Accuracy statements

Three surgical breast reconstruction guidelines created by the American Society of Plastic Surgeons (ASPS) were identified: Reduction Mammaplasty,³ Breast Reconstruction with Expanders & Implants,⁴ and Autologous Breast Reconstruction with DIEP or Pedicled TRAM Abdominal Flaps.⁵ Each guideline recommendation was converted into a question and posed to ChatGPT, and four reviewers graded each response as “Fully Concordant,” “Partially Concordant,” or “Nonconcordant.” In total, 32 questions were developed across the three guidelines. Cumulatively, ChatGPT’s responses were 31.3% fully concordant, 40.6% partially concordant, and 28.1% nonconcordant (Figure 1).

Figure 1. “Cumulative Accuracy of ChatGPT Responses to Breast Reconstruction Surgical Guideline Prompts” – A bar graph displaying the relative concordance of ChatGPT when answering clinical questions relating to each surgical guideline, as well as the cumulative concordance across all three guidelines.

Reduction mammaplasty

Recommendations were provided for 10 topics related to reduction mammaplasty. ASPS had graded each recommendation by its overall strength: strong, moderate, weak, or “option” (which indicates that the available evidence is inconsistent). When ChatGPT was asked a question corresponding to each of the 10 topics, it generated five (50%) fully concordant, two (20%) partially concordant, and three (30%) nonconcordant responses. ChatGPT’s answers were fully concordant with all strong, weak, and option recommendations. For the six moderate recommendations, ChatGPT was fully concordant with one (17%), partially concordant with two (33%), and nonconcordant with three (50%) (Table 1).

Table 1.

Strength-specific accuracy of ChatGPT responses to breast reconstruction surgical guideline prompts.

Recommendation Strength	Reduction Mammaplasty N (Column %)	Expanders & Implants N (Column %)	Autologous Breast Reconstruction N (Column %)	Total N (%)
Strong
Fully Concordant	2 (100%)	1 (33%)	-	3 (60%)
Partially Concordant	0 (0%)	1 (33%)	-	1 (20%)
Not Concordant	0 (0%)	1 (33%)	-	1 (20%)
Standard/Moderate
Fully Concordant	1 (17%)	1 (25%)	-	2 (20%)
Partially Concordant	2 (33%)	1 (25%)	-	3 (30%)
Not Concordant	3 (50%)	2 (50%)	-	5 (50%)
Weak
Fully Concordant	1 (100%)	NA	NA	-
Partially Concordant	0 (0%)	NA	NA	-
Not Concordant	0 (0%)	NA	NA	-
Option
Fully Concordant	1 (100%)	2 (22%)	0 (0%)	3 (20%)
Partially Concordant	0 (0%)	5 (56%)	4 (80%)	9 (60%)
Not Concordant	0 (0%)	2 (22%)	1 (20%)	3 (20%)

“-” indicates no responses were recorded in that category, or total calculation was not performed. “NA” indicates that recommendation strength was not available for those guidelines.

Breast reconstruction with expanders & implants

Recommendations were provided for 17 topics related to implant-based breast reconstruction. For this guideline, ASPS simplified the grading to strong, standard, or option. When ChatGPT was asked a question corresponding to each of the 17 topics, five (29.4%) responses were fully concordant, seven (41.2%) were partially concordant, and five (29.4%) were nonconcordant. For strong recommendations, ChatGPT’s responses were evenly distributed among fully concordant (33%), partially concordant (33%), and nonconcordant (33%). Less accurate responses were observed for standard recommendations, with one (25%) fully concordant, one (25%) partially concordant, and two (50%) nonconcordant. ChatGPT aligned best with option recommendations, with five (56%) being partially concordant and the remaining four split evenly between fully concordant (22%) and nonconcordant (22%).

Autologous breast reconstruction with DIEP or pedicled TRAM abdominal flaps

Recommendations were provided for five topics related to abdominal-based autologous breast reconstruction. All five topics had received the “option” grade, meaning that the autologous breast reconstruction recommendations are the least definitive of the three guidelines. Four (80%) of ChatGPT’s responses were partially concordant and one (20%) was nonconcordant. This diminished accuracy illustrates the current variability within the field.
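The per-guideline counts reported above can be pooled to check the cumulative concordance figures. The following is a minimal sketch, not part of the study's methods: the names `COUNTS`, `percent`, and `cumulative_concordance` are illustrative assumptions, and half-up rounding to one decimal place is assumed because it reproduces the reported 31.3% (binary floating-point rounding can turn 31.25 into 31.2).

```python
from decimal import Decimal, ROUND_HALF_UP

# Counts per guideline as reported in the text, in the order
# (fully concordant, partially concordant, nonconcordant).
COUNTS = {
    "Reduction Mammaplasty": (5, 2, 3),
    "Expanders & Implants": (5, 7, 5),
    "Autologous (DIEP or pedicled TRAM)": (0, 4, 1),
}
GRADES = ("Fully Concordant", "Partially Concordant", "Nonconcordant")


def percent(part, whole):
    """Return part/whole as a percentage, rounded half-up to one decimal."""
    value = Decimal(part) * 100 / Decimal(whole)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def cumulative_concordance(counts):
    """Pool counts across guidelines; return (total, {grade: percent})."""
    pooled = [sum(column) for column in zip(*counts.values())]
    total = sum(pooled)
    return total, {g: percent(n, total) for g, n in zip(GRADES, pooled)}


total, shares = cumulative_concordance(COUNTS)
print(total)  # 32
for grade, share in shares.items():
    print(f"{grade}: {share}%")
# Fully Concordant: 31.3%
# Partially Concordant: 40.6%
# Nonconcordant: 28.1%
```

Under these assumptions the pooled counts (10, 13, and 9 of 32 questions) match the cumulative percentages stated in the text.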

The accuracy of ChatGPT as a medical decision-making tool remains low and varies with the topic and the strength of the underlying evidence. AI continues to advance, with ChatGPT-4 surpassing its predecessor, ChatGPT-3.5, by offering internet connectivity and source citations. Nevertheless, caution should be exercised when incorporating generative AI tools into medical practice. Further development is required to improve the concordance between AI and practice guidelines before generative AI can be considered for implementation in surgical practice.

Footnotes

Declaration of Competing Interest

None.

References

1. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2(2). doi:10.1371/journal.pdig.0000198.
2. Jeha GM, Qiblawi S, Jairath N, et al. ChatGPT and generative artificial intelligence in Mohs surgery: a new frontier of innovation. Published online June 3, 2023. J Invest Dermatol 2023;S0022–202X(23)02142–5. doi:10.1016/j.jid.2023.05.018.
3. Perdikis G, Dillingham C, Boukovalas S, et al. American Society of Plastic Surgeons evidence-based clinical practice guideline revision: reduction mammaplasty. Plast Reconstr Surg 2022;149(3):392e–409e. doi:10.1097/PRS.0000000000008860.
4. Alderman A, Gutowski K, Ahuja A, Gray D. ASPS clinical practice guideline summary on breast reconstruction with expanders and implants. Plast Reconstr Surg 2014;134(4):648e–55e. doi:10.1097/PRS.0000000000000541.
5. Lee BT, Agarwal JP, Ascherman JA, et al. Evidence-based clinical practice guideline: autologous breast reconstruction with DIEP or pedicled TRAM abdominal flaps. Plast Reconstr Surg 2017;140(5):651e–64e. doi:10.1097/PRS.0000000000003768.
