Abstract
Background
Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context.
Methods
A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen’s kappa.
Results
ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy for Single Best Answer–Recall of Knowledge (SBA-R), Single Best Answer–Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance notably decreased for questions with images, with some models leaving many unanswered items.
Conclusion
Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level for SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibit notable limitations in questions with images, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00701-025-06628-y.
Keywords: Artificial intelligence, Examination question, Neurosurgery, Performance evaluation
Introduction
In recent years, the development of artificial intelligence (AI) and large language models (LLMs) has played an important role in the field of neurosurgery, particularly in terms of rapid access to information and support for making clinical decisions [11]. These models have the potential to reduce surgeons' workload by speeding up the literature search, analysis of patient records, and surgical planning [16, 28]. In specialized fields such as neurosurgery, LLMs can contribute to the evaluation of anatomical and surgical information and support personalized learning processes [2]. Moreover, when case-based analyses are required, they can compile approach preferences, facilitate the process, and strengthen the coordination of the clinical processes [35]. These features render them potential tools for educational and professional development.
According to recent literature, LLMs such as ChatGPT are valuable tools for medical students or residents preparing for examinations. These studies have generally found the models' level of knowledge and response consistency satisfactory, particularly for multiple-choice questions; however, the models still need to improve in case discussion, interpretation, and language control [8, 22]. In addition, studies have shown that model performance can vary significantly across disciplines and languages [17, 21]. In this context, there is a need for studies evaluating the performance of LLMs, especially in complex and specific fields such as neurosurgery, and for providing multilingual comparative analyses. Understanding the performance differences across question types (Single Best Answer–Recall of Knowledge [SBA-R], Single Best Answer–Interpretative Application of Knowledge [SBA-I], True/False, and Questions with Images) and languages is critical for the reliable use of these technologies in medical education and clinical practice. Hence, understanding the potential role of LLMs in expert training and evaluation in the field of neurosurgery could shed light on the development of educational programs and the integration of these technologies into clinical decision support.
This study aimed to evaluate the performance of several current LLMs in the context of neurosurgery. We included both general-purpose models and a specialized model (AtlasGPT), which focuses on neurosurgical content, to better understand their potential in a field that demands high-level expertise [12]. Given that LLMs are constantly evolving and being updated, assessing their current performance and language-related limitations is crucial for both educational and clinical needs. Through this comparison, we hope to highlight the contribution of these models to neurosurgical training and assessment, ultimately providing future directions for the use of LLMs in medical education.
Methods
Study design
This study was designed as a descriptive research project to evaluate the performance of the LLMs in answering multiple-choice neurosurgical questions. The assessment was conducted using questions in two languages to provide a comprehensive evaluation of LLMs' capabilities in specialized medical knowledge.
Selection of models and questions
Six different LLMs were included in the study: five general-purpose models (ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, and ChatGPT-o1pro) and one domain-specific model (AtlasGPT). AtlasGPT was selected because of its focus on neurosurgical content and peer-reviewed status. The study utilized 599 multiple-choice questions after excluding one invalidated question. The English question set (n = 300) was sourced from the first 300 questions of "Neurosurgery Practice Questions and Answers" published by Thieme [30]. The Turkish questions (n = 299) were obtained from the Turkish Neurosurgical Society Board Certification Examinations conducted in 2018, 2019, and 2022.
Question classification and processing
Questions were systematically categorized into four types: SBA-R (n = 392) testing basic concepts and knowledge, true–false format questions (n = 87) consisting of multiple statements where models were required to identify which statements were correct or incorrect, SBA-I (n = 23) requiring clinical analysis, and questions with images (n = 97) containing radiological or anatomical illustrations. Categorization was performed independently by two neurosurgeons, and a third neurosurgeon was consulted for consensus in cases of disagreement. Representative examples of each question type are provided in the supplementary data (Supplementary Data).
Testing protocol and data collection
The evaluation was conducted between October 2024 and January 2025. Questions were presented to the models in their original language, without translation, to maintain consistency in the testing environment. This decision was methodologically deliberate because the translation of medical examination questions could potentially introduce artificial biases through the alteration of specialized terminology, contextual nuances, and question complexity. For questions with images, visual content was uploaded directly to models with image-processing capabilities. No time limit was set for responses. The responses were marked as incorrect when the models provided either no answers or irrelevant responses. Each question was presented individually, without additional prompts.
All responses were recorded in a standardized database, and invalid questions from the 2022 examination were excluded from the analysis to maintain data integrity. For the Turkish questions, the answer key published by the Turkish Neurosurgical Society was used as the reference standard. For English questions, the answer key from "Neurosurgery Practice Questions and Answers" published by Thieme served as the reference. Responses were considered correct if they specified the correct option or provided an accurate answer without explicitly stating the option letter.
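The grading rule above can be made concrete with a short Python sketch. The helper below is illustrative only: the function name, the regular expression, and the example item are ours, not the authors' actual scoring workflow, and it simply marks a response correct when it names the keyed option letter or reproduces the keyed answer text.

```python
import re

def grade_response(response: str, key_option: str, key_text: str) -> str:
    """Grade a single model response against the reference answer key.

    Returns "correct" when the response names the keyed option letter
    (e.g., "Answer: B") or reproduces the keyed answer text without the
    letter. Empty or missing output is flagged separately here, although
    the study scored such items as incorrect.
    """
    if response is None or not response.strip():
        return "response_failure"
    text = response.lower()
    # Keyed option letter stated explicitly, e.g. "Answer: B" or "Option B"
    if re.search(rf"\b(option|answer)\s*[:\-]?\s*{key_option.lower()}\b", text):
        return "correct"
    # Accurate answer given without the option letter
    if key_text.lower() in text:
        return "correct"
    return "incorrect"

# Hypothetical item: keyed answer "B" corresponding to "foramen of Monro"
print(grade_response("Answer: B (foramen of Monro)", "B", "foramen of Monro"))  # correct
print(grade_response("", "B", "foramen of Monro"))                              # response_failure
```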
Statistical analysis
All descriptive statistics are reported as frequencies (n) and percentages (%). The question canceled in 2022 was excluded from the analysis, as it was answered differently by all models. A proportion z-test was used to test differences between the models' correct answer rates. In addition, Cohen's kappa was used to examine whether the models responded consistently with one another; processing failures were excluded from the kappa analysis, which was conducted only on successfully generated responses. Kappa values were interpreted as 0 or below (no agreement), 0.01–0.20 (slight), 0.21–0.40 (fair), 0.41–0.60 (moderate), 0.61–0.80 (substantial), and 0.81–1.00 (almost perfect) [19]. The statistical significance level was set at p < 0.05, and all analyses were performed using SPSS version 21.0.
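For readers reproducing the analysis outside SPSS, a minimal Python sketch of the two tests is given below. The z-test uses the SBA-R correct counts from Table 1; the kappa example uses toy answer vectors, and statsmodels/scikit-learn are our substitutions for the authors' SPSS workflow.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from sklearn.metrics import cohen_kappa_score

# Two-proportion z-test: ChatGPT-o1pro vs. Gemini on the 392 SBA-R items
# (correct counts taken from Table 1)
correct = np.array([320, 194])   # correct answers per model
totals = np.array([392, 392])    # questions presented per model
z_stat, p_value = proportions_ztest(correct, totals)
print(f"z = {z_stat:.2f}, p = {p_value:.4g}")

# Cohen's kappa between two answer vectors (e.g., two models' chosen options);
# in the study, items with processing failures were excluded before this step
answers_a = ["A", "C", "B", "D", "B", "E"]
answers_b = ["A", "C", "D", "D", "B", "E"]
print(f"kappa = {cohen_kappa_score(answers_a, answers_b):.2f}")
```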
Results
Overall performance for whole questions
One question from the 2022 Turkish Board Exam had previously been invalidated by the examination authorities and was excluded from our study, leaving a total of 599 questions. The distribution of accuracy rates for responses provided by the LLMs across all questions is shown in Fig. 1. The proportion z-test revealed no significant difference between ChatGPT-o1pro and ChatGPT-4 (p = 0.06) or between ChatGPT-o1pro and AtlasGPT (p = 0.08). ChatGPT-o1pro demonstrated a significantly higher correct response rate than Gemini, Copilot, and ChatGPT-3.5 (p < 0.001), and ChatGPT-4 and AtlasGPT also surpassed these three models (p < 0.001).
Fig. 1.
Distribution of accuracy rates for responses provided by the LLMs across all questions. LLMs, large language models
Performance by question type
SBA-R questions
In total, 392 SBA-R questions were asked. The distribution of accuracy rates for responses provided by the LLMs for SBA-R questions is shown in Table 1. The accuracy rates of ChatGPT-o1pro, ChatGPT-4, and AtlasGPT were not significantly different but were significantly higher than those of Gemini, Copilot, and ChatGPT-3.5 (p < 0.001).
Table 1.
Distribution of accuracy rates for responses provided by the LLMs across different question types
| Question Type | Model | Correct (n) | Correct (%) | Incorrect (n) | Incorrect (%) | Response Failure (n) | Response Failure (%) |
|---|---|---|---|---|---|---|---|
| SBA-R | ChatGPT-o1pro | 320 | 81.6 | 71 | 18.1 | 1 | 0.3 |
| SBA-R | ChatGPT-4 | 306 | 78.1 | 84 | 21.4 | 2 | 0.5 |
| SBA-R | AtlasGPT | 304 | 77.6 | 85 | 21.7 | 3 | 0.8 |
| SBA-R | Gemini | 194 | 49.5 | 194 | 49.5 | 4 | 1.0 |
| SBA-R | Copilot | 198 | 50.5 | 192 | 49.0 | 2 | 0.5 |
| SBA-R | ChatGPT-3.5 | 200 | 51.0 | 191 | 48.7 | 1 | 0.3 |
| True/False | ChatGPT-o1pro | 71 | 81.6 | 16 | 18.4 | 0 | 0 |
| True/False | ChatGPT-4 | 66 | 75.9 | 21 | 24.1 | 0 | 0 |
| True/False | AtlasGPT | 70 | 80.5 | 15 | 17.2 | 2 | 2.3 |
| True/False | Gemini | 38 | 43.7 | 45 | 51.7 | 4 | 4.6 |
| True/False | Copilot | 30 | 34.5 | 57 | 65.5 | 0 | 0 |
| True/False | ChatGPT-3.5 | 40 | 46.0 | 47 | 54.0 | 0 | 0 |
| SBA-I | ChatGPT-o1pro | 16 | 69.6 | 7 | 30.4 | 0 | 0 |
| SBA-I | ChatGPT-4 | 15 | 65.2 | 7 | 30.4 | 1 | 4.3 |
| SBA-I | AtlasGPT | 15 | 65.2 | 8 | 34.8 | 0 | 0 |
| SBA-I | Gemini | 8 | 34.8 | 15 | 65.2 | 0 | 0 |
| SBA-I | Copilot | 8 | 34.8 | 15 | 65.2 | 0 | 0 |
| SBA-I | ChatGPT-3.5 | 7 | 30.4 | 16 | 69.6 | 0 | 0 |
| Questions with Images | ChatGPT-o1pro | 51 | 52.6 | 46 | 47.4 | 0 | 0 |
| Questions with Images | ChatGPT-4 | 44 | 45.4 | 50 | 51.5 | 3 | 3.1 |
| Questions with Images | Gemini | 11 | 11.3 | 18 | 18.6 | 68 | 70.1 |
| Questions with Images | Copilot | 12 | 12.4 | 55 | 56.7 | 30 | 30.9 |

LLMs, large language models; SBA-R, Single Best Answer with Recall of Knowledge; SBA-I, Single Best Answer with Interpretative Application of Knowledge
True–false questions
A total of 87 true–false items were analyzed. No significant differences were observed among the accuracy rates of ChatGPT-o1pro, AtlasGPT, and ChatGPT-4; however, these three models outperformed Gemini, Copilot, and ChatGPT-3.5 (p < 0.001) (Table 1).
SBA-I questions
There were 23 clinically interpretative questions. Statistical analysis showed higher correct response rates for ChatGPT-o1pro, ChatGPT-4, and AtlasGPT than for Gemini, Copilot, and ChatGPT-3.5 (p < 0.05) (Table 1).
Questions with images
Ninety-seven questions with images were administered. The difference in accuracy rates between ChatGPT-o1pro and ChatGPT-4 was not statistically significant (p = 0.157). Both ChatGPT-o1pro and ChatGPT-4 accuracies exceeded those of Gemini and Copilot (p < 0.001) (Table 1).
Questions with images vs. text-based questions
After excluding questions with images, 502 items remained. No statistically significant differences were observed among ChatGPT-o1pro, ChatGPT-4, and AtlasGPT (p > 0.05), and these three models had significantly higher correct response rates than Gemini, Copilot, and ChatGPT-3.5 (p < 0.001). For questions with images, ChatGPT-o1pro achieved 52.6% accuracy and ChatGPT-4 achieved 45.4%. These rates differed significantly from those of Gemini (12.4%) and Copilot (11.3%) (p < 0.001).
Consistency (kappa) analysis
Kappa analysis revealed distinct patterns in response consistency across different question types and languages. For overall performance, the top three models demonstrated substantial agreement: ChatGPT-o1pro (κ = 0.707), AtlasGPT (κ = 0.728), and ChatGPT-4 (κ = 0.658), while other models showed only fair agreement with values ranging from 0.349 to 0.394 (Fig. 2).
Fig. 2.
A heatmap graphic of consistency analysis (kappa and p-values) and response rates of LLMs across all questions and subcategories, including SBA-R, true–false, SBA-I, and Questions with Images. LLMs, large language models; SBA-R, Single Best Answer with Recall of Knowledge; SBA-I, Single Best Answer with Interpretative Application of Knowledge
The analysis by question type revealed a progressive decline in consistency as the questions became more complex. In SBA-R questions, the leading models maintained substantial agreement (κ > 0.72), while other models showed fair agreement (κ = 0.37–0.39). A similar pattern was observed in true–false format questions, where top models demonstrated substantial consistency (κ > 0.70), though lower-performing models showed notably reduced agreement (κ < 0.33).
Language-specific analysis revealed interesting patterns, particularly in non-image questions, where ChatGPT-o1pro showed higher consistency in Turkish (κ = 0.836) than in English (κ = 0.689). AtlasGPT maintained a more balanced performance across languages (Turkish κ = 0.722, English κ = 0.733).
SBA-I questions showed a marked decrease in consistency even among top performers, with kappa values decreasing to moderate levels (κ = 0.55–0.59), while other models displayed only slight agreement (κ < 0.15). The most significant decline was observed in questions with images, where even the best-performing model (ChatGPT-o1pro) achieved only fair agreement (κ = 0.40), and some models showed no meaningful agreement.
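For illustration, the following Python sketch assembles a small consistency heatmap in the spirit of Fig. 2, using only the per-model kappa values explicitly quoted above; combinations not reported in the text are left as NaN, and matplotlib is our choice rather than necessarily the authors' plotting tool.

```python
import numpy as np
import matplotlib.pyplot as plt

models = ["ChatGPT-o1pro", "AtlasGPT", "ChatGPT-4"]
categories = ["All questions", "Questions with images"]

# Kappa values quoted in the Results text; unreported cells left as NaN
kappa = np.array([
    [0.707, 0.40],     # ChatGPT-o1pro
    [0.728, np.nan],   # AtlasGPT (no image capability)
    [0.658, 0.331],    # ChatGPT-4
])

fig, ax = plt.subplots(figsize=(5, 3))
im = ax.imshow(kappa, cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(categories)
ax.set_yticks(range(len(models)))
ax.set_yticklabels(models)
# Annotate each reported kappa value inside its cell
for i in range(len(models)):
    for j in range(len(categories)):
        if not np.isnan(kappa[i, j]):
            ax.text(j, i, f"{kappa[i, j]:.2f}", ha="center", va="center", color="white")
fig.colorbar(im, ax=ax, label="Cohen's kappa")
fig.tight_layout()
plt.show()
```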
Response generation failures
For questions with images, Copilot failed to generate responses for 68 of the 97 items owing to technical limitations, whereas Gemini encountered processing errors in 30 cases. ChatGPT-4 failed to respond to three items, whereas ChatGPT-o1pro successfully processed all questions. For non-image-based items, AtlasGPT encountered processing errors in five of the 502 questions, whereas the other models had minimal response failures (0–3 questions).
Discussion
This study aimed to examine the performance of different LLMs on neurosurgery-specific multiple-choice questions by incorporating authentic examination questions from Turkish and English sources. Our findings, in light of the overall accuracy rates and the performance analysis by question type, showed that some of the tested models answered the questions largely at an expert level, whereas others were limited, especially on questions with a strong visual or clinical interpretation component. In particular, ChatGPT-o1pro and ChatGPT-4 stood out for their high accuracy rates, and AtlasGPT offered similarly satisfactory results. However, the relatively lower accuracy rates of Gemini, Copilot, and ChatGPT-3.5 indicate that there is still room for improvement, whether owing to differences in the models' training data or to the demands of highly specialized areas such as neurosurgery.
Testing LLMs in the field of neurosurgery is important for assessing the knowledge and decision-making competencies of these models in specialized disciplines [24]. In particular, the use of proficiency questions offers a valuable approach to measure the capacity of models to process in-depth medical knowledge and to understand how they can be positioned as potential tools in medical education [4]. Studies evaluating qualifying examinations related to neurosurgery are limited. Sahin et al. obtained a success rate of 78.77% in 2023 by presenting the questions of the last six Turkish neurosurgery proficiency examinations (excluding questions with images) to GPT-4, whereas residents answered 62% of the questions correctly. In a meta-analysis evaluating ChatGPT with neurosurgical competency questions, such as proficiency and board examinations, the success rate of ChatGPT ranged from 50.4% to 78.77% [6]. However, that analysis was heterogeneous, and the different ChatGPT models used were not considered. Recently, Ali et al. conducted a comprehensive evaluation using 500 mock neurosurgical board examination questions, in which ChatGPT-4 achieved 83.4% accuracy, significantly outperforming both ChatGPT-3.5 (73.4%) and human test-takers (73.7%), further supporting the potential of newer LLM versions in specialized medical knowledge assessment [3]. In our study, ChatGPT-o1pro's slightly higher performance compared with AtlasGPT and ChatGPT-4 is likely related to its recent version and advanced 'chain-of-thought' reasoning approach; however, usage limitations and version updates should also be considered [20, 23].
To better understand the strengths and limitations of chatbots and harness their full potential, it is essential to categorize the questions posed to them; the performance of these systems on theoretical inquiries is shaped by various factors, notably the underlying architecture and the quality and specificity of the training data. In our analysis of 392 SBA-R questions, ChatGPT-o1pro attained the highest correct response rate of 81.6%, followed closely by ChatGPT-4 (78.1%) and AtlasGPT (77.6%). In contrast, Gemini, Copilot, and ChatGPT-3.5 each achieved less than 52% correct answers, revealing a marked gap in accuracy compared with the top-performing models. True–false questions followed a similar trend among the chatbots. Studies emphasize that LLMs using the GPT-4 model, such as ChatGPT-4o and ChatGPT-o1pro, are more successful in general knowledge and licensing examinations in different medical disciplines [10, 13, 24]. In this context, AtlasGPT has great potential, as it performs comparably to powerful engines such as ChatGPT, is a domain-specific model, and provides answers based only on peer-reviewed sources. These features make AtlasGPT a promising option, particularly in professional and academic fields that require reliability and accuracy.
Evaluating the performance of chatbots on SBA-I questions is crucial for demonstrating the extent to which LLMs are proficient in constructing medical logic, analyzing symptoms, and integrating cases into scenarios that resemble real-life situations [29, 32]. In this study, ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated superior performance on SBA-I questions. ChatGPT-o1pro achieved an accuracy rate of 69.6% in this domain, whereas ChatGPT-4 and AtlasGPT were correct 65.2% of the time. In contrast, Gemini, Copilot, and ChatGPT-3.5 performed less well in analyzing clinical cases and remained in the range of 30–35% for these questions. ChatGPT-o1pro's superiority in this context may stem from its extensive language model architecture and diverse training data. AtlasGPT's advantage over other models in interpretive questions, in addition to its strong performance in knowledge-focused formats such as SBA-R and true–false items, indicates that its success is not merely attributable to peer-reviewed data; rather, it also underscores the model's robust ability to comprehend and integrate information, an essential feature for accurate and nuanced clinical reasoning. However, studies have shown that the success rate of chatbots decreases as question types become more difficult and as less precise questions demand greater reasoning ability and the integration of different clinical information [26, 31]. In our study, in line with the literature, the success of all models was lower for SBA-I questions.
Image interpretation in neurosurgery represents one of the most complex aspects of clinical practice, requiring sophisticated understanding of anatomical structures and pathological findings across various imaging modalities including magnetic resonance imaging, computed tomography, and surgical photographs [36, 37]. While specialized deep learning models have demonstrated remarkable success in analyzing specific pathologies with high accuracy rates often approaching expert-level performance, their implementation faces significant challenges including the need for extensive training data, substantial infrastructure, and typically narrow, specific applications [7, 18]. Given these implementation challenges, recent studies have also explored the potential of LLMs in medical education and assessment [33, 38]. Despite their current limitations in image processing capabilities, LLMs offer promising potential due to their accessibility and ability to integrate both visual and textual data in medical education and assessment contexts.
Our analysis of 97 questions with images revealed a significant performance gap across all models. AtlasGPT and ChatGPT-3.5 are not considered in this section because they do not have image-evaluation capabilities. While ChatGPT-o1pro achieved 81.1% accuracy for non-image questions, its performance declined substantially to 52.6% for questions with images. This disparity became even more pronounced in the other models, with Gemini and Copilot showing markedly lower accuracy rates of 12.4% and 11.3%, respectively. The high rate of non-responses likely had a marked effect, with Copilot being unable to provide any answer for 70.1% of questions with images and Gemini for 30.9%. The low kappa values for questions with images (ChatGPT-o1pro: 0.40; ChatGPT-4: 0.331) further indicate only fair to moderate consistency in the responses. This pattern of significantly reduced performance on questions with images compared with text-only questions has been consistently reported across multiple medical specialties, with studies showing that even the most advanced LLMs struggle with diagnostic image interpretation, often performing well below both clinical standards and their text-based capabilities [25, 27]. Nevertheless, while higher-performing models, such as ChatGPT-o1pro and ChatGPT-4, show promise in theoretical knowledge assessment, their current reliability in image interpretation falls well below clinical standards. These limitations suggest that LLMs are far from reliable for clinical decision-making regarding visual content. Therefore, we believe that they should only be used in supportive contexts such as ideation and brainstorming.
Given that LLMs are predominantly trained on English language content and medical literature [34], we included both Turkish Neurosurgery Board examination questions and English proficiency sample questions in our analysis. In Turkey, the Neurosurgery Board Examination can be taken either during residency or after its completion, whereas no formal equivalent examination exists for the English question set. Kappa analysis revealed varying patterns of response consistency across languages. Among the analyzed models, ChatGPT-o1pro exhibited higher consistency in Turkish language questions (κ = 0.836) than in English questions (κ = 0.689), suggesting potential adaptability to non-English medical content. AtlasGPT and ChatGPT-4 maintained a more balanced performance between languages, indicating robust cross-lingual capabilities. However, when presented with questions with images, all models showed significantly reduced consistency in both languages, highlighting the fact that visual interpretation challenges transcend language barriers. These findings cannot be directly interpreted as language-dependent performance differences, as we used different question sets for each language rather than identical questions translated into both languages. Although using translated questions might seem methodologically advantageous, we deliberately chose authentic examination questions in each language to maintain real-world validity and avoid potential translation artifacts that could bias the results. Medical terminology and conceptual nuances can be lost or altered during translation, potentially creating artificial advantages or disadvantages for the models. Our approach, although limited for direct language comparisons, provides insights into how these models perform on questions naturally formulated in the medical education context of each language. However, there are inconsistent findings in the literature on the performance of AI models in different languages, with some studies showing no significant improvement with translation [9], whereas others suggest that the models show a distinct advantage in English [14].
More notably, our analysis revealed a clear pattern of consistency across different question types. The highest agreement was observed in SBA-R and true–false questions (κ = 0.77–0.78), indicating substantial consistency in basic knowledge assessment. However, this consistency decreased in SBA-I questions (κ = 0.55–0.59), suggesting increased variability in responses when complex clinical reasoning was required. The lowest consistency was found in questions with images (ChatGPT-o1pro κ = 0.40, ChatGPT-4 κ = 0.331), with some models showing minimal to no agreement (Copilot κ = −0.04). This progressive decline in consistency from SBA-R to SBA-I to questions with images suggests that LLMs' reliability decreases as task complexity increases, particularly when visual interpretation is required. This is consistent with previous findings [6, 29].
While GPT-4's extensive training data undoubtedly contribute to its superiority, the inability to assess it against niche specialized medical models, such as AtlasGPT, could limit our understanding of its true potential within the medical field. This comparison raises interesting considerations regarding interdisciplinary approaches to medical education and clinical reasoning. Domain-specific models such as AtlasGPT, while offering highly focused and peer-reviewed content, may inherently view medical challenges primarily through a neurosurgical lens owing to their specialized training. By contrast, general-purpose models trained on diverse datasets may offer advantages in scenarios that require multidisciplinary perspectives, such as complex cases involving neurological, psychiatric, and systemic conditions. This broader knowledge base could potentially enable more comprehensive differential diagnoses and treatment considerations spanning multiple medical specialties. Such versatility may be particularly valuable in modern medical education, where an increasing emphasis is placed on holistic patient care and cross-specialty collaboration [1].
To the best of our knowledge, this is the first study to evaluate AtlasGPT and ChatGPT-o1pro using neurosurgical proficiency examinations and the first to assess popular chatbots with neurosurgery-specific questions in both Turkish and English. While achieving accuracy rates approaching 80% for theoretical knowledge and showing promising capabilities in multiple languages, these models represent complementary tools rather than primary educational resources. Specialized models, although potentially more consistent in narrowly focused clinical scenarios, may lack the broader perspective advantageous in multidisciplinary clinical contexts. Conversely, general-purpose models, despite extensive training datasets, notably struggle with detailed neurosurgical nuances. This presents a new challenge: accuracy may decrease within specific clinical domains, while larger datasets are necessary to address multidisciplinary limitations. To overcome this issue, employing models trained exclusively on medical datasets might be the most suitable approach. The significant decline in performance on SBA-I and questions with images highlights that LLMs, despite their knowledge base, cannot replicate the nuanced decision-making processes that clinicians develop through years of practical experience. A particularly critical limitation is the inability of these models to assess their own knowledge boundaries, unlike human experts, who can acknowledge uncertainty and seek additional information when needed. This human element, the ability to integrate varied information sources, recognize subtle patterns, and make nuanced decisions based on both knowledge and experience, cannot be replicated by current AI systems.
Given these inherent limitations, it remains essential to implement targeted error-mitigation strategies—such as systematic validation against expert-reviewed benchmarks, integration of clinician oversight, and continuous updating with validated clinical data—to safely leverage the strengths of both model types in neurosurgical education and clinical practice.
Future directions
The future development of LLMs in medical education should focus on several key areas. Priority should be given to developing specialized medical imaging datasets for training and establishing standardized evaluation protocols that account for model updates. Integration frameworks for curriculum development should balance the strengths of both general purpose and specialized medical LLMs, particularly in addressing cross-specialty reasoning capabilities. Implementation guidelines should clearly define appropriate use in medical education, emphasizing the complementary role of these tools, rather than their use as primary educational resources. Furthermore, as Lang et al. (2024) emphasized, incorporating patient- and center-specific data into LLM training while ensuring robust accuracy verification mechanisms could enhance their reliability in specialized medical education [15].
Limitations
This study had several limitations that warrant consideration when interpreting our findings. First, a key methodological constraint is the rapid evolution of LLM capabilities: our evaluation represents a temporal snapshot from October 2024 to January 2025, and these models undergo frequent updates that could significantly alter their performance. Thus, repeated evaluations may become necessary as new versions emerge or datasets expand. Second, statistical power was limited by the uneven distribution of question types (392 SBA-R versus 23 SBA-I) and the relatively small sample size for questions with images (n = 97). Additionally, the lack of continual learning capabilities in current LLMs means that these models cannot adapt based on feedback or new information, which potentially affects the reproducibility of the results. Furthermore, as Barker and Rutka (2023) emphasized, LLMs operate through pattern-matching rather than true understanding and can produce "hallucinations", plausible but incorrect responses that are difficult to detect. This limitation raises concerns regarding their reliability in real-world clinical scenarios, where complex decision-making extends beyond standardized test performance [5].
Conclusion
This study investigated the performance of six different LLMs on a specialized set of 599 neurosurgery questions in Turkish and English, focusing on SBA-R, SBA-I, True–False and questions with images. Although ChatGPT-o1Pro, ChatGPT-4, and AtlasGPT demonstrated higher overall accuracy rates, their performances decreased substantially for questions with images. These findings highlight the significant variability among models, suggesting that domain-specific training and advanced datasets can enhance the efficacy of LLMs in neurosurgical contexts. However, the current limitations in interpreting complex clinical scenarios and medical images underline the need for cautious integration in clinical practice. Future enhancements to these technologies may address these issues.
Supplementary Information
Below is the link to the electronic supplementary material.
Author contributions
Mahmut Çamlar, Abuzer Güngör, and Umut Tan Sevgi were responsible for the conceptualization and supervision of the study. Mahmut Çamlar, Abuzer Güngör, Umut Tan Sevgi, and Yücel Doğruel contributed to the study design and methodology. Data collection and interpretation were conducted by Gökberk Erol, Furkan Karakaş, and Yücel Doğruel. Statistical analysis was performed by Furkan Karakaş and Gökberk Erol. Manuscript writing and editing were carried out by all authors, with significant contributions from Mahmut Çamlar, Abuzer Güngör, and Umut Tan Sevgi.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Consent to participate
This study does not involve human participants. The manuscript reports a comparative analysis of chatbot systems, and thus informed consent was not applicable.
Conflict of interest
The authors declare no competing interests.
Declaration of generative AI and AI-assisted technologies in the writing process
The authors acknowledge the use of Anthropic's Claude AI for assistance in refining the manuscript's readability, enhancing sentence flow, and ensuring grammatical accuracy. All intellectual contributions, study design, data analysis, and interpretations remain the sole responsibility of the authors.
Human ethics and consent to participate
Not applicable.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Abbas A, Rehman MS, Rehman SS (2024) Comparing the performance of popular large language models on the national board of medical examiners sample questions. Cureus. 10.7759/cureus.55991
- 2. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, Aziz S, Damseh R, AlabedAlrazak S, Sheikh J (2023) Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ 9:e48291
- 3. Ali R, Tang OY, Connolly ID et al (2023) Performance of ChatGPT and GPT-4 on neurosurgery written board examinations. Neurosurgery. 10.1227/neu.0000000000002632
- 4. Bair H, Norden J (2023) Large language models and their implications on medical education. Acad Med 98(8):869–870
- 5. Barker FG, Rutka JT (2023) Editorial. Generative artificial intelligence, chatbots, and the journal of neurosurgery publishing group. J Neurosurg 139(3):901–903
- 6. Bongco EDA, Cua SKN, Hernandez MALU, Pascual JSG, Khu KJO (2024) The performance of ChatGPT versus neurosurgery residents in neurosurgical board examination-like questions: a systematic review and meta-analysis. Neurosurg Rev 47(1):892
- 7. Chen WM, Fu M, Zhang CJ et al (2022) Deep learning-based universal expert-level recognizing pathological images of hepatocellular carcinoma and beyond. Front Med. 10.3389/fmed.2022.853261
- 8. Choi W (2023) Assessment of the capacity of ChatGPT as a self-learning tool in medical pharmacology: a study using MCQs. BMC Med Educ 23(1):864
- 9. Fang C, Wu Y, Fu W et al (2023) How does ChatGPT-4 preform on non-English national medical licensing examination? An evaluation in Chinese language. PLoS Digit Health 2(12):e0000397
- 10. Gill GS, Tsai J, Moxam J, Sanghvi HA, Gupta S (2024) Comparison of Gemini advanced and ChatGPT 4.0's performances on the ophthalmology resident Ophthalmic Knowledge Assessment Program (OKAP) examination review question banks. Cureus 16(9):e69612
- 11. Guo E, Gupta M, Sinha S et al (2024) NeuroGPT-X: toward a clinic-ready large language model. J Neurosurg 140(4):1041–1053
- 12. Hopkins BS, Carter B, Lord J, Rutka JT, Cohen-Gadol AA (2024) Editorial. AtlasGPT: dawn of a new era in neurosurgery for intelligent care augmentation, operative planning, and performance. J Neurosurg 140(5):1211–1214
- 13. Jin HK, Lee HE, Kim E (2024) Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ 24(1):1013
- 14. Joh HC, Kim M-H, Ko JY, Kim JS, Jue MS (2024) Evaluating the performance of ChatGPT in dermatology specialty certificate examination-style questions: a comparative analysis between English and Korean language settings. Indian J Dermatol 69(4):338–341
- 15. Lang SP, Yoseph ET, Gonzalez-Suarez AD, Kim R, Fatemi P, Wagner K, Maldaner N, Stienen MN, Zygourakis CC (2024) Analyzing large language models' responses to common lumbar spine fusion surgery questions: a comparison between ChatGPT and Bard. Neurospine 21(2):633–641
- 16. Lee JY (2023) Can an artificial intelligence chatbot be the author of a scholarly article? J Educ Eval Health Prof 20:6
- 17. Li Z, Shi Y, Liu Z, Yang F, Liu N, Du M (2024) Quantifying multilingual performance of large language models across languages. arXiv preprint arXiv:2404.11553
- 18. Liu X, Faes L, Kale AU et al (2019) A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 1(6):e271–e297
- 19. McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 22(3):276–282
- 20. Mondillo G, Colosimo S, Perrotta A, Frattolillo V, Masino M (2025) Comparative evaluation of advanced AI reasoning models in pediatric clinical decision support: ChatGPT O1 vs. DeepSeek-R1. 10.1101/2025.01.27.25321169
- 21. Mugaanyi J, Cai L, Cheng S, Lu C, Huang J (2024) Evaluation of large language model performance and reliability for citations and references in scholarly writing: cross-disciplinary study. J Med Internet Res 26:e52935
- 22. Newton P, Xiromeriti M (2024) ChatGPT performance on multiple choice question examinations in higher education. A pragmatic scoping review. Assess Eval High Educ 49(6):781–798
- 23. Noda R, Yuasa C, Kitano F, Ichikawa D, Shibagaki Y (2025) Performance of o1 pro and GPT-4 in self-assessment questions for nephrology board renewal. 10.1101/2025.01.14.25320525
- 24. Patil A, Serrato P, Chisvo N, Arnaout O, See PA, Huang KT (2024) Large language models in neurosurgery: a systematic review and meta-analysis. Acta Neurochir 166(1):475
- 25. Polverini G, Gregorcic B (2024) Evaluating vision-capable chatbots in interpreting kinematics graphs: a comparative study of free and subscription-based models. Front Educ. 10.3389/feduc.2024.1452414
- 26. Poterucha T, Elias P, Haggerty CM (2023) The case records of ChatGPT: language models and complex clinical questions. arXiv preprint arXiv:2305.05609
- 27. Powers AY, McCandless MG, Taussky P, Vega RA, Shutran MS, Moses ZB (2024) Educational limitations of ChatGPT in neurosurgery board preparation. Cureus. 10.7759/cureus.58639
- 28. Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider CR, Forte AJ (2024) Clinical and surgical applications of large language models: a systematic review. J Clin Med 13(11):3041
- 29. Sahin MC, Sozer A, Kuzucu P, Turkmen T, Sahin MB, Sozer E, Tufek OY, Nernekli K, Emmez H, Celtikci E (2024) Beyond human in neurosurgical exams: ChatGPT's success in the Turkish neurosurgical society proficiency board exams. Comput Biol Med 169:107807
- 30. Shaya MR, Gragnaniello C, Nader R (eds) (2016) Neurosurgery practice questions and answers. Thieme, Stuttgart. 10.1055/b-0036-141861
- 31. Shiferaw MW, Zheng T, Winter A, Mike LA, Chan L-N (2024) Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions. BMC Med Inform Decis Mak 24(1):404
- 32. Strong E, DiGiammarino A, Weng Y, Kumar A, Hosamani P, Hom J, Chen JH (2023) Chatbot vs medical student performance on free-response clinical reasoning examinations. JAMA Intern Med 183(9):1028–1030
- 33. Van M, Verma P, Wu X (2024) On large visual language models for medical imaging analysis: an empirical study. IEEE, pp 172–176
- 34. Wang D, Zhang S (2024) Large language models in medical and healthcare fields: applications, advances, and challenges. Artif Intell Rev 57(11):299
- 35. Yu Y, Gomez-Cabello CA, Makarova S, Parte Y, Borna S, Haider SA, Genovese A, Prabha S, Forte AJ (2024) Using large language models to retrieve critical data from clinical processes and business rules. Bioengineering (Basel, Switzerland). 10.3390/bioengineering12010017
- 36. Zaragita N, Zhou S, Nugroho SW, Kaliaperumal C (2024) Breaking boundaries in neurosurgery through art and technology: a historical perspective. Brain Spine 4:102836
- 37. Zhou G, Zhu Y, Yin Y, Su M, Li M (2017) Association of wall shear stress with intracranial aneurysm rupture: systematic review and meta-analysis. Sci Rep 7(1):5331
- 38. Zhui L, Yhap N, Liping L, Zhengjie W, Zhonghao X, Xiaoshu Y, Hong C, Xuexiu L, Wei R (2024) Impact of large language models on medical education and teaching adaptations. JMIR Med Inform 12:e55933