Abstract
Background
Multimodal large language models (LLMs) can process and integrate both text and image data, offering promising applications in the medical field. This study aimed to evaluate the performance of representative multimodal LLMs in the 2023 Japanese Surgical Specialist Examination, with a focus on image-based questions across various surgical subspecialties.
Methods
A total of 98 examination questions, including 43 image-based questions, from the 2023 Japanese Surgical Specialist Examination were administered to three multimodal LLMs: GPT-4 Omni, Claude 3.5 Sonnet, and Gemini Pro 1.5. Each model’s performance was assessed under two conditions: with and without images. Statistical analysis was conducted using McNemar’s test to evaluate the significance of accuracy differences between the two conditions.
Results
Among the three LLMs, Claude 3.5 Sonnet achieved the highest overall accuracy at 84.69%, exceeding the passing threshold of 80%, which is consistent with the standard set by the Japan Surgical Society for board certification. GPT-4 Omni closely approached the threshold with an accuracy of 79.59%, while Gemini Pro 1.5 scored 61.22%. Claude 3.5 Sonnet demonstrated the highest accuracy in four of six subspecialties for image-based questions and was the only model to show a statistically significant improvement with image inclusion (76.74% with images vs. 62.79% without images, p = 0.041). By contrast, GPT-4 Omni and Gemini Pro 1.5 did not exhibit significant performance changes with image inclusion.
Conclusion
Claude 3.5 Sonnet outperformed the other models in most surgical subspecialties for image-based questions and was the only model to benefit significantly from image inclusion. These findings suggest that multimodal LLMs, particularly Claude 3.5 Sonnet, hold promise as diagnostic and educational support tools in surgical domains, and that variation in visual reasoning capabilities may account for model-level differences in image-based performance.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12909-025-07938-6.
Keywords: Artificial intelligence, Surgical specialist examination, Large language models, Multimodal models, Medical education
Introduction
In recent years, the potential applications of large language models (LLMs) in medicine have rapidly expanded, particularly in clinical support and medical education [1, 2]. The emergence of multimodal LLMs, which integrate and process multiple data formats, including text and images, represents a significant advancement [3]. Despite certain limitations, this new type of artificial intelligence (AI) shows promise as a tool for supporting diagnostic imaging in clinical practice and enhancing education for medical students and residents [4]. In medical fields that rely heavily on imaging, such as surgery and radiology, the use of multimodal LLMs is gaining attention for its potential to improve diagnostic accuracy and educational efficiency [5, 6].
While several attempts have been made to utilize AI to answer questions in surgical specialty examinations, most have been limited to text-based questions [7–9]. A recent study by Japanese radiology specialists showed that the performance of multimodal LLMs did not improve even when images were incorporated [10]. However, with the advent of multimodal AI, there is an urgent need for further research to evaluate AI’s ability to address problems that include images. A comprehensive investigation into how the presence or absence of images affects the correct response rate of AI models is essential, and more empirical data are needed to assess how performance changes under these conditions.
This study tested the performance of representative multimodal LLMs—Claude 3.5 Sonnet, GPT-4 Omni, and Gemini Pro 1.5—using questions from the 2023 Japanese Surgical Specialist Examination. These three models were selected because, at the time of the study, they represented the latest available versions of state-of-the-art multimodal LLMs (updated in the second half of 2024) that could process both Japanese text and medical images via the same API platform, and were widely recognized for their advanced text generation and image interpretation capabilities. Specifically, we compared the percentage of correct responses to questions with and without images to determine how the presence or absence of images contributes to AI’s diagnostic accuracy. This study evaluates the applicability of AI in a highly image-dependent surgical specialty examination and demonstrates the usefulness of LLMs as educational and diagnostic support tools. No protected health information (PHI) was used or included in this study.
Materials and methods
The present study analyzed exam items sourced from publicly accessible archives of the Japan Surgical Society. Since no human or animal data were involved, ethical review was deemed unnecessary. All data were anonymized and processed in compliance with applicable data protection standards.
Study design and data collection
This study assessed the performance of multimodal LLMs using questions from the 2023 Japanese Surgical Specialist Examination, a multiple-choice examination designed to evaluate the clinical and diagnostic knowledge of surgical specialists. In total, 98 questions were included, selected from all 100 questions in the 2023 Japanese Surgical Specialist Examination. Two questions were excluded because they were deemed inappropriate due to extremely low answer accuracy and were not used for scoring by the official board. Of the remaining items, 43 required interpretation of image-based information, as detailed in Table 1. These questions were distributed across six major surgical subspecialties: gastroenterological surgery (44 questions), cardiovascular surgery (15), respiratory surgery (10), pediatric surgery (10), breast and endocrine surgery (10), and emergency and anesthesiology (9). Thus, the selected dataset adequately represents all core competency domains defined by the examination. The examination data used in this study did not involve identifiable patient information and did not require ethical review under applicable guidelines. As stated on the official website of the Japan Surgical Society, candidates are generally required to score at least 80 out of 100 to pass the Surgical Board Examination. Therefore, we used this threshold to evaluate the model’s performance against real-world certification standards.
Table 1.
Distribution of questions by subspecialty in the board certification test of the Japan Surgical Society (2023)

| Surgical subspecialty | Number of questions |
|---|---|
| Gastroenterological Surgery | 44/98 (44.9%) |
| Cardiovascular Surgery | 15/98 (15.3%) |
| Respiratory Surgery | 10/98 (10.2%) |
| Pediatric Surgery | 10/98 (10.2%) |
| Breast & Endocrine Surgery | 10/98 (10.2%) |
| Emergency & Anesthesiology | 9/98 (9.2%) |
Values are presented as number of questions in the subspecialty/total number of questions (percentage)
Model selection
Three multimodal LLMs—Claude 3.5 Sonnet, GPT-4 Omni, and Gemini Pro 1.5—were selected for evaluation based on their ability to process both textual and visual information. These models represent advanced multimodal technology released or updated in 2024. Claude 3.5 Sonnet and GPT-4 Omni are recognized for their advanced text generation and image interpretation capabilities, while Gemini Pro 1.5, developed by Google DeepMind, is optimized for integrating text and images in professional and diagnostic applications. The selection of these models was intended to provide a comparative analysis of their diagnostic capabilities across surgical subspecialties.
Data processing
The overall data conversion and prompt structure were adapted from the methodology described in our previously published study [11]. Each question from the 2023 Japanese Surgical Specialist Examination was categorized by subspecialty and converted into a format suitable for model input. The questions were organized into two types: text-only questions and image-based questions. Text content was extracted and saved as plain text files, while associated images were saved in high-resolution PNG format to preserve visual clarity for diagnostic interpretation. To maintain consistency across inputs and ensure standardized model responses, the following prompts were used:
Text-only input prompt: “You are acting as a surgeon and must answer the following multiple-choice question. The question is provided in Japanese; please translate it into English before selecting your answer.”
Text with image input prompt: “You are acting as a surgeon and must answer the following multiple-choice question. The question is in Japanese, so translate it into English before responding. The related image files are provided separately.”
The questions were processed through the OpenRouter platform (https://openrouter.ai), which facilitated automated submission to the LLMs via an application programming interface (API). Using Python (version 3.11.7) and the OpenRouter API, each question file (text and/or image) was sequentially transferred to the LLMs for interpretation. The computational environment included the OpenAI Python client library (version 1.63.2), pandas (version 2.2.2), and statsmodels (version 0.14.2).
Three models were selected for evaluation: GPT-4 Omni (openai/gpt-4o-2024-08-06, released May 13, 2024, model card: https://platform.openai.com/docs/guides/gpt), Claude 3.5 Sonnet (anthropic/claude-3.5-sonnet, released June 21, 2024, updated October 2024, model card: https://docs.anthropic.com/claude/docs/claude-3-model-card), and Gemini Pro 1.5 (google/gemini-pro-1.5-exp, released February 8, 2024, model card: https://ai.google.dev/gemini-models), as referenced earlier in this study.
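For illustration, the submission step can be sketched as follows, using the OpenAI Python client pointed at the OpenRouter endpoint. This is a minimal sketch rather than the study’s actual script: the helper name `ask`, the file names, and the API-key placeholder are hypothetical, and sampling parameters are deliberately omitted so that OpenRouter’s defaults apply (see the paragraph below).

```python
# Minimal sketch of the automated submission step, assuming the OpenAI
# Python client (v1.x) pointed at OpenRouter's OpenAI-compatible API.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder, not a real key
)

# Prompt used for image-based items; text-only items used the
# corresponding text-only prompt quoted above.
IMAGE_PROMPT = (
    "You are acting as a surgeon and must answer the following "
    "multiple-choice question. The question is in Japanese, so translate "
    "it into English before responding. The related image files are "
    "provided separately."
)

def ask(model: str, question_text: str, image_path: str | None = None) -> str:
    """Send one exam item (text, optionally with an image) and return the raw reply."""
    content = [{"type": "text", "text": f"{IMAGE_PROMPT}\n\n{question_text}"}]
    if image_path is not None:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append(
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )
    # Temperature and top-p are intentionally not passed, so the
    # OpenRouter defaults apply (see text).
    resp = client.chat.completions.create(
        model=model,  # e.g., "anthropic/claude-3.5-sonnet"
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

# Hypothetical usage:
# answer = ask("anthropic/claude-3.5-sonnet",
#              open("q012.txt").read(), image_path="q012.png")
```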
The complete inference parameters used for each model, including temperature, top-p, max tokens, and stop sequences, are summarized in Supplemental Table 1. While a temperature setting of 0 is typically recommended for deterministic tasks [4, 12], recent studies have reported that setting temperature to 0 can lead to performance degradation in modern models [13]. Therefore, we maintained the default temperature for each model as specified by OpenRouter. Similarly, since top-p is rarely modified from its default value in practice, we also used the default top-p settings.
Specific API endpoints were chosen for each model to align with its multimodal capabilities. In most cases, the responses included justifications for the chosen answer or English translations of the question, deviating from the expected single-choice format (“a,” “b,” “c,” or “d”). To standardize the outputs, one of the authors, who also developed the processing scripts, manually corrected these responses so that all answers adhered to the required format.
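Although this standardization was performed manually in our workflow, the kind of transformation involved can be illustrated with a hypothetical automated first pass. The function name and regular expressions below are assumptions for demonstration only; replies that cannot be resolved return None and would fall back to human review.

```python
# Illustrative (hypothetical) normalization of a verbose model reply to a
# single choice letter. This is not the procedure used in the study,
# where correction was performed manually by one of the authors.
import re

def extract_choice(raw_reply: str) -> str | None:
    """Return the first clearly indicated choice letter (a-d), or None."""
    text = raw_reply.lower()
    # Prefer an explicit pattern such as "answer: c" or "the answer is (c)".
    m = re.search(r"answer(?:\s+is)?\s*[:\-]?\s*\(?([a-d])\)?\b", text)
    if m:
        return m.group(1)
    # Otherwise accept a reply that is just the letter, e.g. "c" or "(c).".
    m = re.fullmatch(r"\(?([a-d])\)?\.?", text.strip())
    return m.group(1) if m else None
```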
Evaluation and scoring
Each LLM’s responses were assessed using an exact-match criterion. Only answers that exactly corresponded to the correct choice were deemed correct; partially correct or incomplete answers were not awarded any credit. After receiving the outputs through the API, a human reviewer checked each response for consistency and formatting accuracy to ensure alignment with evaluation standards. For scoring, the responses were processed through a structured review pipeline to ensure accuracy and standardization across the various answer formats. This approach provided a robust mechanism for addressing discrepancies and ensured a clear, uniform assessment of model performance.
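For concreteness, the exact-match criterion can be expressed as in the short sketch below, assuming a pandas DataFrame with hypothetical column names "model_answer" and "correct_answer"; it illustrates the criterion rather than reproducing the pipeline code.

```python
# Minimal exact-match scoring sketch; column names are hypothetical.
import pandas as pd

def exact_match_accuracy(df: pd.DataFrame) -> float:
    """Fraction of answers that exactly match the official key."""
    matches = (
        df["model_answer"].str.strip().str.lower()
        == df["correct_answer"].str.strip().str.lower()
    )
    return float(matches.mean())

# Toy example: 3 of 4 answers match the key exactly.
toy = pd.DataFrame({
    "model_answer":   ["a", "c", "b", "d"],
    "correct_answer": ["a", "b", "b", "d"],
})
print(f"accuracy = {exact_match_accuracy(toy):.2%}")  # accuracy = 75.00%
```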
Statistical analysis
To assess whether the presence of images significantly impacted model accuracy, McNemar’s exact test was used to compare each model’s performance with and without images. Statistical significance was defined as p < 0.05. All statistical analyses were performed using Python (version 3.11.7) with the statsmodels library (version 0.14.2). Because each LLM responded to each question only once, the reported accuracy rates are based on single-pass outputs, and confidence intervals were not calculated.
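The paired comparison can be reproduced with the mcnemar function from the statsmodels library used in this study, as sketched below; the boolean arrays are toy data for illustration, not study results.

```python
# Sketch of the paired comparison using statsmodels' exact McNemar test.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_conditions(correct_with: np.ndarray, correct_without: np.ndarray):
    """Build the paired 2x2 table and run McNemar's exact test."""
    table = np.array([
        [np.sum(correct_with & correct_without),    # correct in both conditions
         np.sum(correct_with & ~correct_without)],  # correct only with images
        [np.sum(~correct_with & correct_without),   # correct only without images
         np.sum(~correct_with & ~correct_without)], # wrong in both conditions
    ])
    return mcnemar(table, exact=True)

# Toy paired outcomes for 8 image-based questions (True = answered correctly).
with_images    = np.array([True, True, True, False, True, False, True, True])
without_images = np.array([True, False, True, False, False, False, True, True])
result = compare_conditions(with_images, without_images)
print(f"p = {result.pvalue:.3f}")  # exact binomial test on the discordant pairs
```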
Results
Table 2 presents the accuracy rates of each LLM for both image-included and image-excluded conditions across all subspecialties. As depicted in Fig. 1, among the three representative multimodal LLMs—Claude 3.5 Sonnet, GPT-4 Omni, and Gemini Pro 1.5—only Claude 3.5 Sonnet exceeded the passing threshold of 80%, achieving a total accuracy of 84.69%. GPT-4 Omni closely approached the threshold with an accuracy of 79.59%, while Gemini Pro 1.5 fell well below it at 61.22%. Notably, the models also maintained relatively high accuracy rates without image data: Claude 3.5 Sonnet achieved 78.57%, GPT-4 Omni attained 77.55%, and Gemini Pro 1.5 remained at 61.22%, although none exceeded the passing threshold under these conditions.
Table 2.
Overall accuracy rates of multimodal large language models in the board certification test of the Japan Surgical Society (2023)

| Model | Average, Nᵃ (%) | With images, Nᵃ (%) | Without images, Nᵃ (%) |
|---|---|---|---|
| Claude 3.5 Sonnet | 80/98 (81.63) | 83/98 (84.69) | 77/98 (78.57) |
| GPT-4 Omni | 77/98 (78.57) | 78/98 (79.59) | 76/98 (77.55) |
| Gemini Pro 1.5 | 60/98 (61.22) | 60/98 (61.22) | 60/98 (61.22) |

ᵃ Number of correct answers/questions
Fig. 1.
Comparison of the performance of multimodal large language models in the board certification test of the Japan Surgical Society (2023). This bar chart depicts the accuracy rates of three multimodal large language models in responding to questions from the Japan Surgical Society board certification examination, under conditions with and without image inclusion
Among the six surgical subspecialties, Claude 3.5 Sonnet achieved the highest accuracy with images in four categories: gastroenterological surgery (81.82%), pediatric surgery (100.00%), breast and endocrine surgery (100.00%), and emergency and anesthesiology (88.89%). GPT-4 Omni ranked first in three categories with images: cardiovascular surgery (86.67%), respiratory surgery (80.00%), and pediatric surgery (100.00%), where it tied with Claude 3.5 Sonnet. Gemini Pro 1.5 did not achieve the top position in any category when images were included. These results, summarized in Fig. 2, highlight Claude 3.5 Sonnet’s robust performance across multiple specialties on image-based questions, while GPT-4 Omni remained competitive, achieving the highest accuracy in three categories, including one tie with Claude 3.5 Sonnet (Fig. 3). For image-based questions in gastroenterological surgery, Claude 3.5 Sonnet achieved an accuracy of 81.82%, compared with 68.18% for GPT-4 Omni and 50.00% for Gemini Pro 1.5. In cardiovascular surgery, the accuracies were 80.00% for Claude 3.5 Sonnet, 86.67% for GPT-4 Omni, and 66.67% for Gemini Pro 1.5. For respiratory surgery, the accuracies were 70.00% (Claude), 80.00% (GPT-4 Omni), and 50.00% (Gemini). In pediatric surgery, both Claude 3.5 Sonnet and GPT-4 Omni scored 100.00%, while Gemini Pro 1.5 scored 80.00%. For breast and endocrine surgery, the scores were 100.00% (Claude), 90.00% (GPT-4 Omni), and 90.00% (Gemini). Finally, in emergency and anesthesiology, the accuracies were 88.89% for Claude, 88.89% for GPT-4 Omni, and 66.67% for Gemini Pro 1.5.
Fig. 2.
Accuracy rates of multimodal large language models across different subspecialties in the board certification test of the Japan Surgical Society (2023). A Average accuracy rates. B Accuracy rates with images. C Accuracy rates without images across all subspecialties
Fig. 3.
Comparison of accuracy rates of multimodal large language models in answering image-based questions with and without images. This bar graph shows the accuracy rates of three multimodal large language models in solving image-based questions from the Japan Surgical Society board certification examination, both with and without image input
In the absence of images, Claude 3.5 Sonnet achieved the highest accuracy in two categories: gastroenterological surgery (77.27%) and pediatric surgery (90.00%). GPT-4 Omni ranked first in three categories without images: cardiovascular surgery (86.67%), respiratory surgery (100.00%), and breast and endocrine surgery (90.00%). In pediatric surgery (90.00%), GPT-4 Omni tied with Claude 3.5 Sonnet for the highest score. Gemini Pro 1.5 did not secure the top position in any category without image support. This pattern, illustrated in Fig. 2, highlights GPT-4 Omni’s consistent performance on non-image-based questions across multiple categories, while Claude 3.5 Sonnet demonstrated strength in specific areas without image support.
Figure 3 illustrates a visual comparison of the accuracy rates of all LLMs when answering image-based questions, both with and without the inclusion of images. Table 3 summarizes these rates, showing that among the evaluated models, only Claude 3.5 Sonnet showed a statistically significant improvement in accuracy when images were provided (76.74% with images vs. 62.79% without images, p = 0.041). By contrast, GPT-4 Omni (69.77% with images vs. 65.12% without images, p = 0.773) and Gemini Pro 1.5 (44.19% with images vs. 44.19% without images, p = 0.724) did not exhibit significant enhancements with image inclusion. These results indicate that the addition of image information positively impacted the performance of Claude 3.5 Sonnet exclusively among the evaluated models.
Table 3.
Accuracy rates on image-based questions with and without images

| Model | With images, Nᵃ (%) | Without images, Nᵃ (%) | P-valueᵇ |
|---|---|---|---|
| Claude 3.5 Sonnet | 33/43 (76.74) | 27/43 (62.79) | 0.041 |
| GPT-4 Omni | 30/43 (69.77) | 28/43 (65.12) | 0.773 |
| Gemini Pro 1.5 | 19/43 (44.19) | 19/43 (44.19) | 0.724 |

ᵃ Number of correct answers/questions

ᵇ McNemar’s exact tests were performed
Overall, the results indicate that although all three multimodal LLMs could complete the 2023 Japanese Surgical Specialist Examination, only Claude 3.5 Sonnet surpassed the passing score of 80 points, achieving the highest overall accuracy. It outperformed GPT-4 Omni, which closely approached the passing score, as well as Gemini Pro 1.5, and it was the only model to demonstrate a statistically significant improvement with the inclusion of image data.
Discussion
This study evaluated the performance of three multimodal LLMs—Claude 3.5 Sonnet, GPT-4 Omni, and Gemini Pro 1.5—on the 2023 Japanese Surgical Specialist Examination. We assessed the impact of image inclusion on diagnostic accuracy and found that only Claude 3.5 Sonnet exceeded the passing threshold overall, with accuracy varying across subspecialties for all models. Among the evaluated models, Claude 3.5 Sonnet alone showed a statistically significant improvement in accuracy with the inclusion of image data, underscoring its potential as a diagnostic support tool in image-dependent specialties. Although no questions were excluded for being overly simple or lacking diagnostic value, we acknowledge that such filtering, if applied, could have inflated model accuracy by eliminating items that challenge the models’ broader generalizability.
Multimodal LLMs can process and generate responses from diverse data inputs, including text, images, and even audio or video [14, 15]. Recent advancements have enhanced their ability to perform complex tasks such as image captioning and diagnostic interpretation [5]. These capabilities make them promising tools for fields such as radiology and surgical education, where data integration is critical. By synthesizing textual and visual data, LLMs can support complex clinical decision-making processes and hold potential for applications requiring accurate image-based assessments. However, studies evaluating the medical image interpretation capabilities of multimodal LLMs have often been critical, emphasizing the need for further improvement in this area [10, 16].
Previous studies have explored the use of LLMs in medical education, primarily focusing on text-based assessments. For instance, Chan et al. [7] demonstrated the feasibility of using LLMs in the Membership of the Royal College of Surgeons examination, while Oh et al. [8] evaluated GPT-4 in surgical training contexts. However, these studies lacked image-based evaluation. Our study adds to this growing body of literature by showing that a multimodal LLM, Claude 3.5 Sonnet, achieved high performance even in image-dependent assessments. This highlights the evolving role of multimodal AI in enhancing surgical education, not only through text comprehension but also through the interpretation of complex visual data. Earlier findings suggested that LLMs faced limitations in solving image-based problems; with the rapid advancements in the field, however, recent multimodal LLMs may demonstrate improvements in this area.
Our study found that Claude 3.5 Sonnet demonstrated a statistically significant improvement in accuracy when medical images were included. This result suggests that multimodal LLMs are approaching a level at which they can effectively evaluate medical images. Two factors may explain why our findings differ markedly from previous results. First, GPT-4 Vision, which has been widely studied, was released on September 25, 2023, making it older than Claude 3.5 Sonnet, which was launched on June 21, 2024, and updated to its latest version in October 2024. Given the rapid advancements in LLM technology, this one-year difference could have resulted in markedly different outcomes. Second, prior studies evaluating image-based questions primarily focused on radiology, which may involve a higher level of difficulty than the general surgery-related questions analyzed in our study. Further research is necessary to explore the performance of multimodal LLMs, including Claude 3.5 Sonnet, in high-difficulty domains such as radiology specialist-level tasks.
We further speculate that the performance gap between Claude 3.5 Sonnet and Gemini Pro 1.5 may reflect differences in visual pretraining strategies. While Claude 3.5 Sonnet has likely benefited from recent improvements in image-text alignment and exposure to medical image data, Gemini Pro 1.5 may lack sufficient optimization for image-intensive diagnostic tasks, particularly in Japanese-language contexts. Subspecialties such as pediatric and breast surgery, where Claude 3.5 Sonnet performed strongly, involve salient visual features—e.g., structural symmetry and lesion morphology—that may be better leveraged by Claude’s advanced vision-language integration.
Notably, model performance in this study varied across disease fields, likely reflecting differences in diagnostic complexity and the specific knowledge required in each specialty. Subspecialties with complex visual data requirements—such as emergency medicine and gastroenterology—appeared to benefit more from models with robust image interpretation capabilities. By contrast, underperformance was observed in fields where nuanced medical knowledge is critical but not sufficiently represented in training data. Optimizing model training to align more closely with the specialized demands of each medical field could enhance LLM performance and consistency in clinical applications. Although our analysis was conducted by surgical subspecialty rather than image modality, performance differences may reflect varying difficulty across modalities such as X-ray, CT, and endoscopy. Subspecialties like pediatric surgery and emergency surgery, where performance varied substantially across models, often rely on nuanced visual interpretation, which could pose greater challenges for less-optimized vision-language models. Because some subspecialties, such as pediatric surgery and breast/endocrine surgery, included only 10 questions each (with even fewer image-based questions), the reported accuracy rates in these domains should be interpreted with caution due to potential statistical instability.
Future advancements in LLM technology could broaden their role in healthcare. Improved multimodal capabilities may enable more robust applications in image-intensive specialties, enhancing diagnostic support [17]. Additionally, LLMs could streamline healthcare workflows by triaging patient concerns, generating automated responses, and assisting with routine administrative tasks, such as note-taking, data entry, scheduling, and communication management. These capabilities could reduce the burden on healthcare providers and improve the efficiency of limited healthcare resources. Furthermore, multimodal LLMs could function as virtual examiners in mock examinations, automatically evaluating responses, identifying common errors, and offering exemplar rationales. They could also be incorporated into resident self-assessment platforms to provide instant feedback and educational explanations, thereby supporting adaptive and individualized learning.
This study has several limitations. First, it was based on 98 multiple-choice questions from a single-year national board examination, representing all valid items scored in the 2023 Japanese Surgical Specialist Examination after excluding two flawed questions. This limited set may not fully capture the breadth of surgical knowledge. Second, the distribution of questions was skewed toward gastroenterological surgery, potentially favoring models strong in this field. Third, LLMs are frequently updated, and performance may vary with each version, affecting reproducibility. Fourth, the multiple-choice format differs from real-world decision-making, which requires nuanced reasoning and patient-specific considerations. Fifth, the focus on the Japanese examination limits generalizability to other countries, although prior studies have examined similar settings internationally [7–9]. Finally, manual formatting of model outputs, while limited to answer structure, may have introduced minor bias. Future studies should incorporate larger, more balanced, and context-rich datasets to address these limitations.
In conclusion, our findings suggest that multimodal LLMs, particularly Claude 3.5 Sonnet, hold significant potential for diagnostic and educational applications in surgical specialties. As these models evolve, they may further augment clinical support by addressing both image-based and text-based tasks while also contributing to healthcare administration. With continued refinements in LLM technology and the implementation of ethical safeguards, multimodal AI could play a transformative role in improving healthcare quality and accessibility. While this study focused on aggregate model performance, a detailed failure analysis at the question level is currently underway and will be reported separately to explore specific misinterpretation patterns (e.g., visual vs. conceptual errors).
Supplementary Information
Acknowledgements
We thank Angela Morben, DVM, ELS, from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript. During the preparation of this work, the authors used a large language model (ChatGPT V4.0) to revise the manuscript text for coherence and clarity. After using this service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Abbreviations
- AI
Artificial Intelligence
- API
Application Programming Interface
- GPT
Generative Pre-trained Transformer
- LLM
Large Language Model
Authors’ contributions
Y.M.: Conceptualization; Methodology; Visualization; Writing – Original Draft Preparation. T.N.: Conceptualization; Formal analysis; Methodology; Visualization; Writing – Original Draft. H.N.: Data Curation; Writing – Review & Editing. T.H.: Supervision; Writing – Review & Editing. M.I.: Supervision; Writing – Review & Editing.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
The present study analyzed exam items sourced from publicly accessible archives of the Japan Surgical Society. Since no human or animal data were involved, ethical review was deemed unnecessary. All data were anonymized and processed in compliance with applicable data protection standards.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Lucas HC, Upperman JS, Robinson JR. A systematic review of large language models and their implications in medical education. Med Educ. 2024. 10.1111/medu.15402.
- 2. Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann Intern Med. 2024;177(2):210–20. 10.7326/M23-2772.
- 3. Meng X, Yan X, Zhang K, et al. The application of large language models in medicine: a scoping review. iScience. 2024;27(5):109713. 10.1016/j.isci.2024.109713.
- 4. Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology. 2024;310(1):e232756. 10.1148/radiol.232756.
- 5. Nakaura T, Ito R, Ueda D, et al. The impact of large language models on radiology: a guide for radiologists on the latest innovations in AI. Jpn J Radiol. 2024;42(7):685–96. 10.1007/s11604-024-01552-0.
- 6. Rengers TA, Thiels CA, Salehinejad H. Academic surgery in the era of large language models: a review. JAMA Surg. 2024;159(4):445–50. 10.1001/jamasurg.2023.6496.
- 7. Chan J, Dong T, Angelini GD. The performance of large language models in intercollegiate membership of the Royal College of Surgeons examination. Ann R Coll Surg Engl. 2024;106(8):700–4. 10.1308/rcsann.2024.0023.
- 8. Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023;104(5):269–73. 10.4174/astr.2023.104.5.269.
- 9. Al Qurashi AA, Albalawi IAS, Halawani IR, et al. Can a machine ace the test? Assessing GPT-4.0’s precision in plastic surgery board examinations. Plast Reconstr Surg Glob Open. 2023;11(12):e5448. 10.1097/GOX.0000000000005448.
- 10. Hirano Y, Hanaoka S, Nakao T, et al. GPT-4 Turbo with vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination. Jpn J Radiol. 2024;42(8):918–26. 10.1007/s11604-024-01561-z.
- 11. Nakaura T, Yoshida N, Kobayashi N, et al. Performance of multimodal large language models in Japanese diagnostic radiology board examinations (2021–2023). Acad Radiol. 2024. 10.1016/j.acra.2024.10.035.
- 12. Li D, Gupta K, Bhaduri M, Sathiadoss P, Bhatnagar S, Chong J. Comparing GPT-3.5 and GPT-4 accuracy and drift in radiology Diagnosis Please cases. Radiology. 2024;310(1):e232411. 10.1148/radiol.232411.
- 13. Fatemi M, Rafiee B, Tang M, Talamadupula K. Concise reasoning via reinforcement learning. arXiv:2504.05185. 2025. 10.48550/arXiv.2504.05185.
- 14. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40. 10.1038/s41591-023-02448-8.
- 15. Meskó B. The impact of multimodal large language models on health care’s future. J Med Internet Res. 2023;25:e52865. 10.2196/52865.
- 16. Hayden N, Gilbert S, Poisson LM, Griffith B, Klochko C. Performance of GPT-4 with vision on text- and image-based ACR diagnostic radiology in-training examination questions. Radiology. 2024;312(3):e240153. 10.1148/radiol.240153.
- 17. Topol EJ. As artificial intelligence goes multimodal, medical applications multiply. Science. 2023;381(6663):adk6139. 10.1126/science.adk6139.