Author manuscript; available in PMC 2025 Apr 1.
Published in final edited form as: Surgery. 2024 Jan 20;175(4):936–942. doi: 10.1016/j.surg.2023.12.014

Evaluating Capabilities of Large Language Models (LLMs): Performance of GPT-4 on Surgical Knowledge Assessments

Brendin R Beaulieu-Jones 1,2, Margaret T Berrigan 1, Sahaj Shah 3, Jayson S Marwaha 4, Shuo-Lun Lai 4, Gabriel A Brat 1,2
PMCID: PMC10947829  NIHMSID: NIHMS1961919  PMID: 38246839

Abstract

Background:

Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a general-purpose large language model (LLM) trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess stability of this performance on repeat query.

Methods:

We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education (SCORE) question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended (OE) and multiple choice (MC). ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat query.

Results:

A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of MC, and 47.9% and 66.1% of OE questions for SCORE and Data-B, respectively. For both OE and MC questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for incorrect responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in fact-based question (n=11, 25.0%); and accurate information with circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions.

Conclusion:

Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from two widely used question banks. ChatGPT performed better on MC than OE questions, prompting questions regarding its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat query. This finding warrants future consideration, including efforts to train LLMs to provide the safe and consistent responses required for clinical application. Given these observations, and despite near or above human-level performance on question banks, it remains unclear whether LLMs such as ChatGPT can safely assist clinicians in providing care.

Keywords: ChatGPT, artificial intelligence, language models, surgical education, surgery

Article Summary

We evaluate ChatGPT’s performance on surgical knowledge questions and assess the stability of its responses on repeat query. To our knowledge, this is the first study testing ChatGPT’s performance on surgical knowledge over multiple instances. Although overall accuracy is near or above human-level performance, we demonstrate inconsistency in ChatGPT’s responses on repeat query.

Background

Artificial intelligence (AI) models have the potential to dramatically affect the healthcare landscape by altering how we diagnose and treat disease. If employed effectively, these models could lead to increased efficiency, improved accuracy, and widely implemented, personalized patient care across the care continuum. Successful healthcare-related applications have been widely reported.1–10 Within surgery, machine learning approaches that include natural language processing, computer vision, and reinforcement learning have each shown potential to advance care.1,11–14 Still, despite the promise of AI to revolutionize healthcare, its use within the field is markedly limited compared to other industries. Concerns regarding the potential for catastrophic error and at least partial sacrifice of human empathy with the integration of AI into clinical practice have led to cautious adoption.7,11,15–18

ChatGPT is a publicly available large language model (LLM) trained by OpenAI that has potential for future application in many fields, including healthcare.19 Released in November 2022, ChatGPT received unprecedented attention20 for its notable performance across a range of medical and non-medical domains.21 ChatGPT has achieved human-level performance on several professional and academic benchmarks, including a simulated bar exam, the Graduate Record Examination (GRE), numerous Advanced Placement (AP) examinations, and the Advanced Sommelier knowledge assessment.19 An earlier version of ChatGPT was shown to perform at or near the passing threshold of 60% accuracy on the United States Medical Licensing Examination (USMLE).22,23 In addition, ChatGPT has demonstrated robust performance on knowledge assessments in family medicine,24 neurosurgery,25 hepatology,26 and a combination of all major medical specialties.27 Moreover, ChatGPT has shown promise as a clinical decision support tool in radiology,28 pathology,29 and orthodontics.30 ChatGPT has also performed valuable clinical tasks,31–34 such as writing patient clinic letters, composing inpatient discharge summaries, suggesting cancer screening, and conveying patient education.35 Lastly, several studies have highlighted the potential impact of ChatGPT on medical education and research, with roles ranging from supporting nursing education to advancing data analysis and streamlining the writing of scientific publications.32,36–38 The emergence of ChatGPT has reignited interest in exploring AI applications in healthcare; however, it has also provoked numerous concerns regarding bias, reliability, privacy, and governance.21,26,32,36–41

In the current study, we evaluate ChatGPT-4’s performance on two commonly used surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) curriculum and a second case-based question bank for general surgery residents and practicing surgeons, referred to here as Data-B and not otherwise identified due to copyright restrictions. SCORE is an educational resource and self-assessment used by many surgical residents in the United States throughout their training.42–45 Data-B is principally designed for graduating surgical residents and fully trained surgeons preparing for the American Board of Surgery Qualifying Exam. These assessments were selected because their content represents the knowledge expected of surgical residents and board-certified surgeons, respectively. As such, we expected that Data-B, while based on the same content area as SCORE, would include more higher-order management or multi-step reasoning questions, and that we might therefore observe differential ChatGPT performance. The performance of ChatGPT on each of these assessments may provide important insights regarding ChatGPT-4’s current capabilities. In addition to assessing performance, this study investigates reasons for ChatGPT errors and assesses its performance on repeat queries. This latter objective represents a significant contribution to our current understanding of LLMs and a critical topic when considering the development of safe and effective AI applications in healthcare.

Methods

Large Language Model

ChatGPT (OpenAI, San Francisco, CA) is a publicly available, subscription-based AI chatbot that first launched in November 2022. It was initially derived from the GPT-3 (Generative Pretrained Transformer) family of LLMs, which are pre-trained transformer models designed primarily to generate text via next-word prediction. To improve performance for ChatGPT, the initial GPT-3 models were further trained using a combination of supervised and reinforcement learning techniques.46 More specifically, ChatGPT was trained in part using Reinforcement Learning from Human Feedback (RLHF), a method in which a reward model is trained from human feedback. The reward model was created using a dataset of comparison data comprising two or more model responses ranked according to quality by human AI trainers. These data could then be used to fine-tune the model using Proximal Policy Optimization.47 Importantly, ChatGPT was trained on generalized, rather than domain-specific, data. It is possible to further train such a model on domain-specific details, although part of its appeal is its ability to perform highly across various fields.
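To make the reward-modeling step concrete, the following is a minimal, illustrative Python sketch of the pairwise ranking objective commonly used to train a reward model from ranked comparison data. It is not OpenAI’s implementation: the tiny linear “reward model” and random embeddings below are stand-ins for a full transformer and real human-ranked responses.

```python
# Illustrative sketch only: pairwise (Bradley-Terry style) reward-model training,
# as used in RLHF. A linear head over fixed-size "embeddings" stands in for a
# transformer scoring full responses.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # The human-preferred response should receive a higher reward than the rejected one.
    return -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()

# Toy comparison data: embeddings of a "better" and a "worse" response per prompt.
torch.manual_seed(0)
better, worse = torch.randn(8, 16), torch.randn(8, 16)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    loss = preference_loss(model(better), model(worse))
    opt.zero_grad()
    loss.backward()
    opt.step()
# In RLHF, the trained reward model would then guide policy fine-tuning with PPO.
```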

ChatGPT Plus is the latest development from OpenAI and employs GPT-4, the fourth iteration of the GPT family of LLMs.19 Details regarding the architecture and development of GPT-4 are not publicly available. It is generally accepted that GPT-4 was trained in a similar fashion to GPT-3, via RLHF and other supervised and reinforcement learning techniques. While specific technical details are unknown, OpenAI states in its technical report that one of its main goals in developing GPT-4 was to improve the LLM’s ability to understand and generate natural text, particularly in complex and nuanced scenarios. The report highlights improved performance of GPT-4 relative to GPT-3.5. For example, GPT-4 passed a simulated bar exam with a score in the 90th percentile of exam takers, whereas GPT-3.5 achieved a score in the 10th percentile. GPT-4 was officially released on March 13, 2023, and is currently available via the ChatGPT Plus paid subscription.

Input Sources

Input questions were derived from two commonly used surgical educational resources:

  1. The Surgical Council on Resident Education (SCORE): SCORE is a nonprofit organization established in 2004 by the principal organizations in surgical education in the United States, including the American Board of Surgery (ABS) and the Association for Surgical Education (ASE). SCORE maintains a curriculum for surgical trainees, which includes didactic educational content and more than 2400 multiple-choice (MC) questions for self-assessment. Access to the SCORE question bank was obtained through the research staff’s institutional access. SCORE was not part of the research team and did not participate in the study design, data collection, analysis, or interpretation of results. A total of 175 self-assessment questions were obtained from the SCORE question bank. Using existing functionality within the SCORE platform, study questions were randomly selected from all topics, except systems-based practice; surgical professionalism and interpersonal communication education; ethical issues in clinical surgery; biostatistics and evaluation of evidence; and quality improvement. While these topics are important in well-rounded clinical training and practice, they were excluded from this study given 1) the decision to focus specifically on scientific knowledge required to diagnose and manage surgical conditions in this exploratory study and 2) concerns regarding interpretation of the “correct” answer for questions where, with adequate explanation and justification, more than one answer choice may be considered correct. Future studies should include these question types as they do make up a significant component of surgical in-training and licensing exams and represent important skills required of practicing surgeons. Fellowship-level questions were not excluded; questions containing images were excluded from analysis. After exclusion, a total of 167 SCORE questions were included in the study analysis. Distribution of these questions by topic is shown in Supplemental Table 1.

  2. Data-B: Data-B is an educational resource for practicing surgeons and senior surgical trainees, which includes case-based, MC questions across a range of general surgical domains, including endocrine, vascular, abdomen, alimentary tract, breast, head and neck, oncology, perioperative care, surgical critical care, and skin/soft tissue. A total of 120 questions were randomly selected for inclusion in the study. Questions containing images were excluded from analysis. After exclusion, 119 questions were included.

Encoding

For input into ChatGPT, all selected questions were formatted two ways:

  1. Open-ended (OE) prompting: Constructed by removing all answer choices and translating the existing question into open-ended form. Examples of OE questions include: “What is the best initial treatment for this patient?”; “For a patient with this diagnosis and risk factor, what is the most appropriate operative approach?”; or “What is the most appropriate initial diagnostic test to determine the cause of this patient’s symptoms?”

  2. Multiple choice (MC) single answer without forced justification: Created by replicating the original SCORE or Data-B question verbatim. Examples of MC questions include: “After appropriate cardiac workup, which of the following surgeries should be performed?”; “Which of the following laboratory values would most strongly suggest the presence of ischemic bowel in this patient?”; or “Which of the following options is the best next step in treatment?”

The structure of open-ended prompts was deliberately varied to avoid systematic errors. For each entry, a new chat session was started in ChatGPT to avoid potential bias. All selected questions were entered twice, once in OE form and once in MC form. Presenting questions to ChatGPT in two formats provided insight regarding the capacity of a predictive LLM to generate accurate, domain-specific responses without the prompting cues available in an MC format.
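Although questions in this study were entered manually through the ChatGPT web interface, the protocol above (each question posed in a brand-new conversation, once open-ended and once as multiple choice) can be illustrated with a short sketch using the OpenAI Python SDK. The model name, question fields, and helper function below are illustrative assumptions, not the study’s actual tooling.

```python
# Illustrative sketch only: scripting the study's querying protocol with the
# OpenAI Python SDK. Each request is its own conversation, mirroring the
# "new chat per question" approach used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4") -> str:
    # One single-message request per question = a fresh chat with no prior context.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical question record (fields invented for demonstration).
question = {
    "stem": "A 60-year-old presents with ... (clinical vignette)",
    "oe": "What is the best initial treatment for this patient?",
    "mc": "Which of the following is the best next step in treatment?",
    "choices": ["A. Option one", "B. Option two", "C. Option three"],
}

oe_prompt = f"{question['stem']}\n\n{question['oe']}"
mc_prompt = f"{question['stem']}\n\n{question['mc']}\n" + "\n".join(question["choices"])

oe_answer = ask(oe_prompt)  # open-ended form
mc_answer = ask(mc_prompt)  # multiple-choice form
```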

Assessment

Outputs from ChatGPT-4 were assessed for accuracy, internal concordance, and insight by two surgical residents, as described by Kung et al. in their related work assessing the performance of ChatGPT on the USMLE.23 Response accuracy (i.e., correctness) was assessed against the solutions and explanations provided by SCORE and Data-B, respectively. Internal response concordance refers to the internal validity and consistency of ChatGPT’s output: specifically, whether the explanation affirms the answer and negates the remaining choices without contradiction. As defined by Kung et al., insights are instances within the model-generated response that are nondefinitional, nonobvious, unique, and valid. Insights demonstrate instances of higher-level performance. Unlike Kung et al., we did not calculate density of insight (defined as the number of insights/[number of answer choices + 1]), but instead took a binary approach, assessing the presence or absence of an insight in the model’s response.23

Each reviewer adjudicated 100 SCORE questions and 70 Data-B questions, with 30% and 28% overlap, respectively. For overlapping questions, residents were blinded to each other’s assessment. Interrater agreement was evaluated by computing the Cohen kappa (κ) statistic for each question type (Supplemental Table 2).
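As an illustration of the interrater agreement calculation described above, the following minimal sketch computes Cohen’s kappa over two hypothetical sets of adjudications using scikit-learn; the labels are invented for demonstration and do not reflect the study data.

```python
# Minimal sketch: Cohen's kappa between two raters' adjudications of the same
# overlapping questions. Labels are invented for demonstration.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["accurate", "inaccurate", "accurate", "indeterminate", "accurate"]
rater_2 = ["accurate", "inaccurate", "accurate", "accurate", "accurate"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.3f}")
```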

For the 167 SCORE questions included in the study, the median performance by all human SCORE users was 65%, as reported on the SCORE question bank dashboard. Reference data for Data-B are not available, preventing an exact comparison between ChatGPT and surgeon users. In addition, we reviewed all inaccurate ChatGPT responses to MC SCORE questions to determine and classify the reason for the incorrect output. The classification system for inaccurate responses was created by study personnel, and designations were made by consensus. Reasons for inaccurate responses included: inaccurate information in a complex question; inaccurate information in a fact-based question; accurate information with circumstantial discrepancy; inability to differentiate the relative importance of information; imprecise application of detailed information; and imprecise application of general knowledge. A description of each error type, as well as representative examples, is shown in Table 1.

Table 1:

Classification of Error Type: Description and Examples

Error Type Description and Theoretical Example(s) of Error
Imprecise application of detailed information Description: Answer selection was based on detailed clinical information, which was applied imprecisely or inaccurately to the clinical context
Example:
  • Recommend medical management rather than surgery as first-line treatment for a specific diagnosis, which is accurate unless symptoms are medically refractory, as is the case in the question

Imprecise application of general knowledge Description: Answer selection was based on general knowledge, which was either incompletely accurate or out of scope given the context of the question
Example:
  • Recommend against a secondary procedure in a child to avoid additional anesthesia and potential procedural complications

Inability to differentiate relative importance of information Description: Answer selection was based on accurate information but did not differentiate between more and less appropriate options
Example:
  • Select a laboratory finding which is present in most patients with a specific condition, when a more characteristic finding was intended

Accurate information; circumstantial discrepancy Description: Response is based on accurate information but is incorrect due to question interpretation or other circumstantial factors that are unlikely to reflect the competency of GPT
Example:
  • Select the cost-effective, first-line imaging, rather than the gold standard mechanism for diagnosis

Inaccurate information in fact-based question Description: Response is based on inaccurate information in the context of a single-part, fact-based question
Example:
  • Incorrectly identify the second most common site of pathology

Inaccurate information in complex question Description: Response is based on inaccurate information in the context of a complex clinical scenario or multi-part question
Example:
  • Inaccurate selection of most appropriate next step in patient with constellation of symptoms and description of imaging

To further assess the performance and reproducibility of GPT-4, all SCORE questions (MC format) whose responses were initially deemed inaccurate were re-queried in a new chat session. Each new ChatGPT response was then compared to the initial output to determine whether the response changed and, if it changed, whether the new response was accurate or remained inaccurate.
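The repeat-query comparison can be sketched as a simple tabulation. The following illustrative example, with invented records and the hypothetical `ask`-style helper sketched in the Encoding section, classifies each re-queried question as unchanged, changed but still inaccurate, or changed to accurate; comparing extracted answer letters is a simplification of the surgeon adjudication actually used.

```python
# Illustrative sketch only: tabulating outcomes of re-querying initially
# incorrect multiple-choice questions (compare with Table 4).
from collections import Counter

def classify_repeat(first_answer: str, second_answer: str, correct_answer: str) -> str:
    if second_answer == first_answer:
        return "no change"
    return ("inaccurate to accurate"
            if second_answer == correct_answer
            else "inaccurate to inaccurate")

# Toy records: (first wrong answer, answer on repeat query, correct answer)
records = [("B", "B", "C"), ("A", "C", "C"), ("D", "B", "C")]
outcomes = Counter(classify_repeat(*r) for r in records)
print(outcomes)  # one of each outcome for these toy records
```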

Institutional Review Board approval was not required for this research study.

Results

Accuracy of ChatGPT Responses

A total of 167 SCORE and 112 Data-B questions were presented to ChatGPT. The accuracy of ChatGPT responses for OE and MC SCORE and Data-B questions is presented in Figure 1. ChatGPT correctly answered 71.3% and 67.9% of MC SCORE and Data-B questions, respectively. The proportion of accurate responses for OE questions was lower than for MC, particularly for SCORE questions, which is largely due to an increase in responses that were deemed indeterminate by study adjudicators in the setting of the open-ended format.

Figure 1: Accuracy of ChatGPT Output for Open-Ended and Multiple-Choice Questions.


Surgical knowledge questions from SCORE and Data-B were presented to ChatGPT via two formats: open-ended (OE; left side) and multiple-choice (MC; right side). ChatGPT’s outputs were assessed for accuracy by surgeon evaluators. A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% (119/167) and 67.9% (76/112) of multiple choice SCORE and Data-B questions, respectively.

Internal Response Concordance of ChatGPT Responses

Internal concordance was adjudicated by review of the entire ChatGPT response (Table 2). Overall internal response concordance was very high: 85.6% and 100% for OE SCORE and Data-B questions, respectively, and 88.6% and 97.3% for MC SCORE and Data-B questions. Among OE SCORE questions, internal concordance was also assessed by accuracy subgroup. Concordance was 98.75% (79/80) for accurate responses. Internally discordant responses were more frequently observed for inaccurate or indeterminate responses (26.4%, 23/87).

Table 2:

Accuracy, Internal Concordance and Nonobvious Insights of ChatGPT Responses

Accuracy, N (%) Internal Concordance, N (%) Insights, N (%)
Open-Ended Format
  SCORE 80 (47.9%) 143 (85.6%) 111 (66.5%)
  Data-B 74 (66.1%) 112 (100%) 87 (77.7%)
  Combined 154 (55.2%) 255 (91.4%) 198 (71.0%)
Multiple Choice - Single Answer
  SCORE 119 (71.3%) 148 (88.6%) 106 (63.5%)
  Data-B 76 (67.9%) 109 (97.3%) 69 (61.6%)
  Combined 195 (69.9%) 257 (92.1%) 175 (62.7%)

SCORE: Surgical Council on Resident Education;

Data-B: refers to a second commonly used surgical knowledge assessment and question bank

Insights within ChatGPT Responses

For both OE and MC questions, approximately two-thirds of ChatGPT responses contained nonobvious insights (Table 2). Insights were more frequently observed for OE than MC questions (SCORE: 66.5% versus 63.5%; Data-B: 77.7% versus 61.6%).

Classification of Inaccurate ChatGPT Responses to MC SCORE Questions

Reasons for inaccurate responses are shown (Table 3). The most common reasons were: inaccurate information in a complex question (36.4%); inaccurate information in fact-based question (25.0%); and accurate information, circumstantial discrepancy (13.6%).

Table 3:

Classification of Inaccurate ChatGPT Responses for SCORE Questions (N=44)

Classification N (%)
Imprecise application of detailed information 3 (6.8)
Imprecise application of general knowledge 4 (9.1)
Inability to differentiate relative importance of information 4 (9.1)
Accurate information; circumstantial discrepancy 6 (13.6)
Inaccurate information in fact-based question 11 (25.0)
Inaccurate information in complex question 16 (36.4)
 
Total 44 (100%)

Outcome of Repeat Query for Initially Inaccurate ChatGPT Responses

For all inaccurate ChatGPT responses to MC SCORE questions, the exact MC SCORE question was re-presented to ChatGPT in a separate encounter, using a new chat. The accuracy of the response was assessed as previously described. In total, the answer selected by ChatGPT varied between iterations for 16 questions (36.4% of inaccurate questions). The response remained inaccurate for 10/16 (62.5%) questions and was accurate on the second encounter for 6/16 (37.5%) questions. No change in the selected MC answer was observed in nearly two-thirds of cases (n=28, 63.6%).

Interrater Reliability

Interrater agreement was assessed between independent reviewers (Supplemental Table 2). For SCORE, the Cohen kappa (κ) statistic was 0.720 for OE questions and 1.00 for MC questions. Similarly, for Data-B, the Cohen kappa (κ) statistic was 0.681 for OE questions and 1.00 for MC questions. Based on accepted guidelines, this represents substantial agreement (defined as 0.61–0.80) for OE questions and perfect agreement for MC questions.

Representative GPT input and output are presented in Supplemental Document 1.

Discussion

We assessed the performance of ChatGPT-4 on two surgical knowledge self-assessments. Consistent with prior findings in other domains, ChatGPT exhibited robust accuracy and internal concordance, near or above human-level performance. This study highlights the ability of ChatGPT to perform within a highly specific and sophisticated field without domain-specific training or fine-tuning. More importantly, the findings underscore some of the current limitations of AI, including variable performance on the same task and unpredictable gaps in the model’s capabilities. In addition, the similar performance of ChatGPT on SCORE and Data-B, which are considered tiered assessments with Data-B being more advanced, suggests a fundamental difference between how humans learn and understand and how LLMs are developed. Nonetheless, the near or above human-level performance of an LLM within the surgical domain, and the potential to enhance its performance with domain-specific training (e.g., on high-yield surgical literature), highlight its prospective value as a clinical reasoning and decision-making support tool. While human context and high-level conceptual models are needed for certain decisions and tasks within surgery, understanding the performance of LLMs will direct their future development such that AI tools are safely and complementarily integrated within healthcare, offloading avoidable cognitive demand.

Foremost, at a cohort level, ChatGPT demonstrated near or above human-level performance, with an accuracy of 71% and 68% on MC SCORE and Data-B questions, respectively. This is consistent with ChatGPT performance in other general and specific knowledge domains, including law, verbal reasoning, mathematics, and medicine.19,23 The current findings are consistent with a study by Hopkins et al., in which ChatGPT achieved near human-level performance on a subset of questions from the Congress of Neurological Surgeons (CNS) Self-Assessment Neurosurgery (SANS).25

In this study, ChatGPT performed better on MC questions than OE questions. While this is in part due to an increased proportion of responses to OE questions that were deemed indeterminate by study adjudicators, we must acknowledge the lower ChatGPT response accuracy on OE questions compared to MC questions. This has potential implications for the future application of LLMs, by both medical professionals and nonmedical users, to answer clinical questions. User queries are likely to be framed as OE, rather than MC, questions and so we must understand the current model’s limitations and optimize the ability of future versions to respond to OE questions with improved accuracy and reliability.

The current study utilizes two knowledge assessments that are generally accepted to be tiered in difficulty. SCORE is principally designed for residents throughout training, and Data-B is targeted to senior residents and board-certified attending general surgeons. This design provides additional insight into the performance of ChatGPT relative to humans; we anticipated that ChatGPT would perform better on SCORE than on Data-B. However, the near-equivalent performance of ChatGPT on SCORE and Data-B suggests that its ability to provide correct answers to a question prompt fundamentally differs from that of surgical trainees and surgeons. Learners progressively attain layers of context and comprehension to expand and apply their knowledge base: they first learn simple facts and then form networks of understanding linking these individual pieces of information and allowing for nuanced reasoning. Given the nature of its corpus of information and reinforcement-based training, predictive LLMs such as ChatGPT do not progress in a similar pattern. It is an informal observation that SCORE questions often require more precise delineation of similar answer choices (e.g., distal pancreatectomy with splenectomy versus distal pancreatectomy alone), whereas Data-B generally requires a broader, more comprehensive understanding to answer each question. The near-equivalent performance of ChatGPT on these two question sets suggests that a probabilistic algorithm like ChatGPT can function at a high level on both tasks, but it also highlights that the mental and conceptual models providers use to develop and leverage their expertise should not be attributed to these models. Mental models have acknowledged limitations, but they allow physicians to think broadly during clinical encounters where information may be limited or not obviously linked. Importantly, such differences may lead to errors by the LLM that experienced providers would easily avoid given the way we learn. Thus, it is still too early to assume that LLMs can safely and directly assist clinicians in providing care. Future research into how LLMs perform, with specific attention to endpoints beyond accuracy, is needed to direct further development and application of LLMs and related AI in healthcare and, specifically, in surgery.

Two additional findings warrant consideration. First, our analysis highlights the kinds of errors that ChatGPT makes on surgical knowledge questions. In 11 of 44 inaccurate responses (25%), the incorrect response was related to a straightforward, fact-based query (e.g., What is the second most common location of blunt thoracic vascular injury after the aorta?). Second, we observed inconsistencies in ChatGPT responses. When erroneous responses were re-presented to the LLM interface, the output varied for roughly one-third of queries; among those varied responses, nearly two-thirds were different but still incorrect (i.e., another incorrect multiple-choice option was selected). Current performance on fact-based queries and changes in response on repeat query highlight a significant limitation of current predictive LLMs. Future performance metrics should include a measure of consistency as well as initial capability. Fine-tuning to the target domain may improve model performance and consistency. These findings also underscore the importance of implementing AI tools such that they complement, rather than replace, human input in healthcare in order to avoid costly errors and optimize care.

To our knowledge, this is the first study evaluating the performance of ChatGPT on knowledge assessments over multiple instances. The ability of a general-purpose model like ChatGPT to correctly answer knowledge assessment questions at rates similar to, or above, average human performance highlights the incredible potential of such models; however, we demonstrated inconsistency in ChatGPT responses on repeat query, which is a major limitation of its capacity to be used in this domain. Given the inconsistency on repeat testing, additional value may be derived from domain-specific fine-tuning and from assessment of model performance in clinical encounters in addition to standardized knowledge exam questions. LLMs, including ChatGPT, lack the conceptual models that form the framework human clinicians use to diagnose and treat clinical conditions. This may be a major limitation of ChatGPT’s performance in a clinical setting, as highlighted in a recent blog post by an emergency medicine physician who tested ChatGPT’s diagnostic capacity in a subset of clinical encounters.48 Without these mental or conceptual models, correct responses to deterministic questions, like those found in the SCORE and Data-B question banks, do not necessarily imply that the existing model would be able to optimize clinical understanding and decision making.

The current study has notable limitations. First, a relatively small bank of questions was used, which may not accurately reflect the broader domain of surgical knowledge, although the list of topics included is diverse. Second, the assessment of accuracy and internal concordance for the open-ended responses may be biased, although we did find substantial interrater agreement. Third, and most importantly, our assessment of the consistency of ChatGPT responses on repeat query was limited to questions that ChatGPT initially answered incorrectly. Further study should assess whether the observed inconsistencies extend to all questions. Furthermore, it is possible that some of the questions and/or answers are available in some form online, potentially allowing the model to draw on existing responses. While the content is easily accessible online, the specific questions are less likely to be available, as neither SCORE nor Data-B is an open-source assessment. Finally, a metric of human performance on Data-B is not readily available, though median performance is likely equivalent to that on SCORE, given the reported data on both the American Board of Surgery In-Training and Qualifying Examinations.

Conclusion

Consistent with findings in other academic and professional fields,19,22–27 the current study demonstrates the ability of ChatGPT to perform well within the surgical domain. However, despite the demonstrated performance of ChatGPT on surgical training and licensing examinations, we observed inconsistency in ChatGPT responses upon repeat inquiry. This study represents the first analysis of repeat inquiry, which warrants future consideration, particularly with regard to the implementation of AI tools in high-stakes fields, including healthcare. While ChatGPT is able to perform as well as the average human user on standardized surgical knowledge question sets, it is unclear whether LLMs like ChatGPT are able to safely and reliably assist real-world clinical decision making.


Table 4:

Outcome of Repeat Query for 44 Initially Inaccurate Responses to SCORE Questions

Outcome N (%)
No change in answer/response 28 (63.6%)
Change in answer/response
 Inaccurate to inaccurate 10 (22.7%)
 Inaccurate to accurate 6 (13.6%)
Total 44 (100%)

Funding/Support:

Dr. Beaulieu-Jones is supported by National Library of Medicine/NIH grant [T15LM007092]

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Conflict of Interest/Disclosure: The authors have no relevant financial disclosures.

References

  • 1. Khalsa RK, Khashkhusha A, Zaidi S, Harky A, Bashir M. Artificial intelligence and cardiac surgery during COVID-19 era. J Card Surg 2021;36(5):1729–1733. doi: 10.1111/JOCS.15417
  • 2. Mehta N, Pandit A, Shukla S. Transforming healthcare with big data analytics and artificial intelligence: A systematic mapping study. J Biomed Inform 2019;100. doi: 10.1016/J.JBI.2019.103311
  • 3. Payrovnaziri SN, Chen Z, Rengifo-Moreno P, et al. Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J Am Med Inform Assoc 2020;27(7):1173–1185. doi: 10.1093/JAMIA/OCAA053
  • 4. Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc 2018;25(10):1419–1428. doi: 10.1093/JAMIA/OCY068
  • 5. Luh JY, Thompson RF, Lin S. Clinical Documentation and Patient Care Using Artificial Intelligence in Radiation Oncology. J Am Coll Radiol 2019;16(9 Pt B):1343–1346. doi: 10.1016/J.JACR.2019.05.044
  • 6. Johnson SP, Wormer BA, Silvestrini R, Perdikis G, Drolet BC. Reducing Opioid Prescribing After Ambulatory Plastic Surgery With an Opioid-Restrictive Pain Protocol. Ann Plast Surg 2020;84(6S Suppl 5):S431–S436. doi: 10.1097/SAP.0000000000002272
  • 7. Makhni EC, Makhni S, Ramkumar PN. Artificial Intelligence for the Orthopaedic Surgeon: An Overview of Potential Benefits, Limitations, and Clinical Applications. J Am Acad Orthop Surg 2021;29(6):235–243. doi: 10.5435/JAAOS-D-20-00846
  • 8. Hammouda N, Neyra JA. Can Artificial Intelligence Assist in Delivering Continuous Renal Replacement Therapy? Adv Chronic Kidney Dis 2022;29(5):439–449. doi: 10.1053/J.ACKD.2022.08.001
  • 9. McBee MP, Awan OA, Colucci AT, et al. Deep Learning in Radiology. Acad Radiol 2018;25(11):1472–1480. doi: 10.1016/J.ACRA.2018.02.018
  • 10. Rashidi HH, Tran NK, Betts EV, Howell LP, Green R. Artificial Intelligence and Machine Learning in Pathology: The Present Landscape of Supervised Methods. Acad Pathol 2019;6. doi: 10.1177/2374289519873088
  • 11. Hashimoto DA, Rosman G, Rus D, Meireles OR. Artificial Intelligence in Surgery: Promises and Perils. Ann Surg 2018;268(1):70–76. doi: 10.1097/SLA.0000000000002693
  • 12. Mumtaz H, Saqib M, Ansar F, et al. The future of Cardiothoracic surgery in Artificial intelligence. Ann Med Surg 2022;80. doi: 10.1016/J.AMSU.2022.104251
  • 13. Raffort J, Adam C, Carrier M, Lareyre F. Fundamentals in Artificial Intelligence for Vascular Surgeons. Ann Vasc Surg 2020;65:254–260. doi: 10.1016/J.AVSG.2019.11.037
  • 14. Stumpo V, Staartjes VE, Regli L, Serra C. Machine Learning in Pituitary Surgery. Acta Neurochir Suppl 2022;134:291–301. doi: 10.1007/978-3-030-85292-4_33
  • 15. Petch J, Di S, Nelson W. Opening the Black Box: The Promise and Limitations of Explainable Machine Learning in Cardiology. Can J Cardiol 2022;38(2):204–213. doi: 10.1016/J.CJCA.2021.09.004
  • 16. Jarrett D, Stride E, Vallis K, Gooding MJ. Applications and limitations of machine learning in radiation oncology. Br J Radiol 2019;92(1100). doi: 10.1259/BJR.20190001
  • 17. Cheng JY, Abel JT, Balis UGJ, McClintock DS, Pantanowitz L. Challenges in the Development, Deployment, and Regulation of Artificial Intelligence in Anatomic Pathology. Am J Pathol 2021;191(10):1684–1692. doi: 10.1016/J.AJPATH.2020.10.018
  • 18. Sarno L, Neola D, Carbone L, et al. Use of artificial intelligence in obstetrics: not quite ready for prime time. Am J Obstet Gynecol MFM 2023;5(2). doi: 10.1016/J.AJOGMF.2022.100792
  • 19. OpenAI. GPT-4 Technical Report. Published online March 15, 2023.
  • 20. Zhang C, Zhang C, Li C, Qiao Y. One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era. Preprint. Published online April 4, 2023. doi: 10.13140/RG.2.2.24789.70883
  • 21. Quick uptake of ChatGPT, and more - this week’s best science graphics. Nature. Published online February 28, 2023. doi: 10.1038/D41586-023-00603-2
  • 22. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023;9. doi: 10.2196/45312
  • 23. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2(2):e0000198. doi: 10.1371/JOURNAL.PDIG.0000198
  • 24. Morreel S, Mathysen D, Verhoeven V. Aye, AI! ChatGPT passes multiple-choice family medicine exam. Med Teach. Published online 2023. doi: 10.1080/0142159X.2023.2187684
  • 25. Hopkins BS, Nguyen VN, Dallas J, et al. ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. J Neurosurg. Published online March 1, 2023:1–8. doi: 10.3171/2023.2.JNS23419
  • 26. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. Published online March 22, 2023. doi: 10.3350/CMH.2023.0089
  • 27. Johnson D, Goodman R, Patrinely J, et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Res Sq. Published online 2023. doi: 10.21203/RS.3.RS-2566942/V1
  • 28. Ismail A, Ghorashi NS, Javan R. New Horizons: The Potential Role of OpenAI’s ChatGPT in Clinical Radiology. J Am Coll Radiol. Published online March 2023. doi: 10.1016/J.JACR.2023.02.025
  • 29. Sinha RK, Deb Roy A, Kumar N, Mondal H. Applicability of ChatGPT in Assisting to Solve Higher Order Problems in Pathology. Cureus 2023;15(2). doi: 10.7759/CUREUS.35237
  • 30. Strunga M, Urban R, Surovková J, Thurzo A. Artificial Intelligence Systems Assisting in the Assessment of the Course and Retention of Orthodontic Treatment. Healthcare (Basel) 2023;11(5). doi: 10.3390/HEALTHCARE11050683
  • 31. Ali SR, Dobbs TD, Hutchings HA, Whitaker IS. Using ChatGPT to write patient clinic letters. Lancet Digit Health 2023;5(4). doi: 10.1016/S2589-7500(23)00048-1
  • 32. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel) 2023;11(6). doi: 10.3390/HEALTHCARE11060887
  • 33. Rao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow. medRxiv. Published online February 26, 2023. doi: 10.1101/2023.02.21.23285886
  • 34. Haver HL, Ambinder EB, Bahl M, Oluyemi ET, Jeudy J, Yi PH. Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT. Radiology. Published online April 4, 2023:230424. doi: 10.1148/RADIOL.230424
  • 35. Hopkins AM, Logan JM, Kichenadasse G, Sorich MJ. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm-shift. JNCI Cancer Spectr 2023;7(2). doi: 10.1093/JNCICS/PKAD010
  • 36. Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023;15(2). doi: 10.7759/CUREUS.35179
  • 37. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. J Med Syst 2023;47(1). doi: 10.1007/S10916-023-01925-4
  • 38. Thomas SP. Grappling with the Implications of ChatGPT for Researchers, Clinicians, and Educators. Issues Ment Health Nurs 2023;44(3):141–142. doi: 10.1080/01612840.2023.2180982
  • 39. Vaishya R, Misra A, Vaish A. ChatGPT: Is this version good for healthcare and research? Diabetes Metab Syndr 2023;17(4):102744. doi: 10.1016/J.DSX.2023.102744
  • 40. Dahmen J, Kayaalp ME, Ollivier M, et al. Artificial intelligence bot ChatGPT in medical research: the potential game changer as a double-edged sword. Knee Surg Sports Traumatol Arthrosc 2023;31(4). doi: 10.1007/S00167-023-07355-6
  • 41. Will ChatGPT transform healthcare? Nat Med 2023;29(3). doi: 10.1038/S41591-023-02289-5
  • 42. American Board of Surgery. SCORE - Surgical Council on Resident Education. https://www.absurgery.org/default.jsp?scre_booklet.
  • 43. Bell RH. Surgical council on resident education: a new organization devoted to graduate surgical education. J Am Coll Surg 2007;204(3):341–346. doi: 10.1016/J.JAMCOLLSURG.2007.01.002
  • 44. Klingensmith ME, Malangoni MA. SCORE provides residents with web-based curriculum for developing key competencies. Bull Am Coll Surg 2013;98(10):10–15.
  • 45. Moalem J, Edhayan E, Darosa DA, et al. Incorporating the SCORE curriculum and web site into your residency. J Surg Educ 2011;68(4):294–297. doi: 10.1016/J.JSURG.2011.02.010
  • 46. Bavarian M, Jun H, Tezak N, et al. Efficient Training of Language Models to Fill in the Middle. Published online July 28, 2022.
  • 47. Gao L, Schulman J, Hilton J. Scaling Laws for Reward Model Overoptimization. Published online October 19, 2022.
  • 48. Tamayo-Sarver J. I’m an ER doctor: Here’s what I found when I asked ChatGPT to diagnose my patients. Inflect Health.
