Neurology: Education. 2025 Nov 20;4(4):e200260. doi: 10.1212/NE9.0000000000200260

Education Research: Can Large Language Models Match MS Specialist Training?

A Comparative Study of AI and Student Responses to Support Neurology Education

Hernan Inojosa 1, Ahmadreza Ramezanzadeh 1,2, Iva Gasparovic-Curtini 1, Isabella Wiest 2, Jakob Nikolas Kather 2, Stephen Gilbert 2, Tjalf Ziemssen 1
PMCID: PMC12636769  PMID: 41281362

Abstract

Background and Objectives

Artificial intelligence (AI), particularly large language models (LLMs), is increasingly explored for clinical decision support and medical education. While general LLM proficiency on broad medical examinations has been demonstrated, their application of domain-specific knowledge in neurology remains underexplored. This study addresses that gap using multiple sclerosis (MS) as an exemplar, evaluating how LLM information access strategies affect accuracy in a specialized postgraduate curriculum and exploring possible roles of LLMs in neurology education.

Methods

A comparative evaluation was conducted using 53 multiple-choice questions (MCQs) and 21 open-ended questions drawn from an MS curriculum used in a postgraduate MS program. As a reference, results from postgraduate students—primarily neurologists and neurology trainees—were used. Each question was answered by 3 LLMs: GPT-4o (general-purpose), MS RAG (retrieval-augmented, accessing MS literature), and Prof. Valmed (CE-certified, domain-specific, trained on medical data). All models operated in zero-shot mode without prior exposure to the items. Questions were stratified based on students' performance. Accuracy was compared using χ2 tests.

Results

Among LLMs, GPT-4o reached 81.1% accuracy, MS RAG 86.8%, and Prof. Valmed 91.3%, while the reference student cohort (n = 28) achieved a mean of 82% (SD 23%). Although overall differences were not statistically significant (χ2(2) = 2.165, p = 0.339, Cramer V = 0.119), performance varied by question type and difficulty. For MCQs with a single correct answer, domain-specific LLMs outperformed GPT-4o, although differences remained nonsignificant. By contrast, students showed stronger performance on single-wrong answer formats. Stratified by difficulty, students outperformed LLMs on “easy” questions while LLMs tended to achieve higher accuracy on “medium” and “hard” items. For open-ended questions, students reached 77.8% accuracy while GPT-4o, MS RAG, and Prof. Valmed scored 66.7%–85.0%.

Discussion

These findings indicate that while LLMs can perform at levels broadly comparable to postgraduate students, they may be particularly useful on more difficult tasks, where their consistency may complement human reasoning in a neurology subspecialty curriculum. While results should be interpreted cautiously given the limited sample size, this study illustrates possible implications of LLMs in neurology education—for example, as AI tutors for complex topics, as support for formative assessments, or as targeted review resources. Further research should assess integration into educational workflows and decision support.

Introduction

Artificial intelligence (AI), particularly large language models (LLMs), is increasingly explored for medical education and clinical decision support, enabling self-directed learning and context-specific feedback.1-3 These models have shown promise in replicating elements of expert reasoning and retrieving medical knowledge across neurologic diseases, including conditions such as multiple sclerosis (MS).4,5

LLMs such as ChatGPT (OpenAI, San Francisco, CA) have demonstrated high performance on standardized examinations such as the United States Medical Licensing Examination, with GPT-4 achieving accuracy rates up to 90%.6-13 However, performance varies depending on task complexity and the alignment between training data and the specific medical context, and less on specific prompting methods or the inclusion of media elements.3,14 In neurology, particularly in MS, effective knowledge application requires nuanced interpretation of evolving clinical guidelines, an area where domain-aligned models may offer improved support.

Recent advancements in specialized models trained or fine-tuned on curated medical literature underscore the potential of LLMs to provide precise and clinically relevant recommendations.15 Integrating retrieval-augmented generation (RAG) frameworks may provide insights into how access to up-to-date medical information can enhance the quality and applicability of responses.3 The recently developed model “Prof. Valmed” may highlight the value of domain-specific fine-tuning in improving response accuracy and relevance.16 Yet the role of such models in formal neurology and MS education remains relatively underexplored. Unlike other specialties with standardized board certifications, MS education lacks uniform assessment frameworks, complicating benchmarking. Postgraduate programs such as the MS Management Master at Dresden International University (DIU) and the Charcot MS Master (M.Sc.) represent efforts to standardize MS certification and offer curated curricula suitable for comparative evaluation.17,18

This study addresses a foundational gap in neurology education research: while LLM proficiency on medical examinations has been established, their performance in specialized clinical domains—such as MS—has not been systematically compared with that of expert learners. MS training involves application of evolving treatment guidelines and context-sensitive clinical reasoning, areas where LLMs may offer structured support if properly aligned.

By conducting a structured comparison of multiple LLM architectures against student performance across a standardized MS question set, we evaluate the knowledge application capabilities of these models. The models included a general-purpose LLM (GPT-4o), a RAG model with dynamic access to literature (MS RAG), and a domain-specific model trained on curated medical literature (Prof. Valmed). While educational outcomes were not directly assessed, our study aims to provide a basis for understanding how LLMs might support specialized training in MS and other complex neurologic conditions and inform future applications in neurology education. Such applications may include AI tutors that provide individualized (guideline-based) feedback on complex scenarios, tools for education aligned with postgraduate curricula, or supplementary resources to expand exposure to subspecialty content with evolving evidence.

Methods

We performed a cross-sectional study using a standardized set of questions designed to reflect key aspects of MS diagnosis, treatment, and management based on the MS Management Master's Program at DIU and the Charcot MS Master (M.Sc.). Model-generated answers were produced on December 31, 2024, for GPT-4o and the RAG model, and between April 28 and April 30, 2025, for Prof. Valmed. The study adhered to best practices in AI evaluation in health care, including compliance with ethical standards for research involving nonhuman datasets.

Large Language Models

The 3 distinct configurations of LLMs for evaluation included the following:

  1. GPT-4o: the latest GPT version from OpenAI at the time of evaluation and a well-known state-of-the-art model trained on diverse datasets. Although GPT-4o is not specifically fine-tuned for medical applications, its versatility allows it to perform well across various domains.19 In this study, GPT-4o was accessed through the official online platform of OpenAI under default settings.

  2. RAG model (MS RAG): a customized retrieval-augmented generation model built using GPT-4o through the GPT builder interface from OpenAI.20 The model was specifically tailored on approximately 516 MB of curated MS-specific material, including lecture scripts, medical guidelines, and curated teaching materials from the MS Management Master's Program at DIU. These materials were partially developed by the same teaching staff who also authored the standardized question set used in this study. MS RAG uses dynamic search and retrieval mechanisms to enhance the contextual relevance and specificity of its responses. The model was accessed through its publicly available instance.21

  3. Prof. Valmed: a specialized RAG-augmented knowledge retrieval system with a medical data architecture comprising approximately 2.5 million validated documents, including medical guidelines and content relevant to MS. Prof. Valmed was originally designed to assist health care professionals by generating clinically precise and guideline-concordant recommendations as a medical device with CE class IIb status. Prof. Valmed (version 2.0) was accessed through its proprietary interface.16 Prof. Valmed's training data are independent from the educational materials used in this study and were not influenced by the academic staff who developed the curriculum.

Dataset

The dataset comprised 53 multiple-choice questions sourced from the curricula of module 4 (“Disease-Modifying MS Treatment”) in both the MS Management Master's Program at the DIU and the Charcot MS Master. The questions were designed to provide a comprehensive evaluation of MS management, covering topics such as efficacy profiles, safety monitoring, sequencing strategies, and therapeutic indications. The question set included 2 formats:

  1. Single-correct answer questions: questions with 1 correct option among multiple distractors (standard multiple-choice format).

  2. Single-wrong answer questions: questions in which 1 incorrect option must be identified among otherwise correct statements (inverted multiple-choice format).

Questions were reviewed and approved by a panel of 3 independent MS specialists from the DIU for content validity, clarity, and guideline alignment. Distractors in the answer formats were specifically designed to reflect common misconceptions, thereby challenging the models to distinguish correct answers from plausible but incorrect options. Each question had a predetermined correct answer based on expert-validated clinical guidelines, serving as the gold standard for evaluating model accuracy.

Reference Group

To contextualize model performance, a group of 28 MS master's students nearing completion of a postgraduate MS management program served as a reference for human-level performance. The students, divided equally between German-speaking and international cohorts, were primarily neurologists and neurologists in training, with a few additional participants from related fields (e.g., biology or the pharmaceutical industry). They completed the question set in a virtual environment under real-life examination conditions.

Prompting Strategy

Responses were generated under uniform conditions to maintain comparability. The models were first tasked with selecting the correct option from predefined answer choices. This format included single-correct answer questions and single-wrong answer questions. To ensure uniformity, each question was presented in isolation, and the models generated responses based solely on their existing knowledge without previous exposure to the assessed questions, adhering to a zero-shot prompting approach in which models generate responses without additional in-context examples or domain-specific tuning beyond their training data.3,22 Owing to its current configuration, Prof. Valmed was not tested on 2 questions requiring image interpretation because its response generation is restricted to text-based inputs and outputs.
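The zero-shot presentation described above—each item shown in isolation, with no examples or conversation history—can be sketched as follows. This is a hypothetical illustration: the helper name, prompt wording, and the sample item are invented, not the study's actual material.

```python
# Hypothetical sketch of formatting one MCQ as a standalone zero-shot prompt;
# the item below is invented for illustration only.
def build_zero_shot_prompt(stem: str, options: dict[str, str]) -> str:
    """Format a single multiple-choice item with no in-context examples."""
    lines = [stem]
    lines += [f"{letter}) {text}" for letter, text in sorted(options.items())]
    lines.append("Answer with the letter of the single correct option only.")
    return "\n".join(lines)

prompt = build_zero_shot_prompt(
    "Which disease-modifying therapy requires first-dose cardiac monitoring?",
    {"A": "Fingolimod", "B": "Interferon beta-1a", "C": "Glatiramer acetate"},
)
print(prompt)
```

Each prompt would be sent in a fresh session, so no question could condition the model's answer to a later one.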

A subset of questions appropriate for free-text responses was presented in an open-ended format, where the question text was displayed without predefined answer options, requiring the model to generate a free-text response. This subset excluded questions that inherently relied on structured formats, such as those requiring the identification of a single incorrect statement or those designed to compare multiple predefined options. All other available questions were included. This dual-format strategy allowed a comprehensive evaluation of the models' ability to identify and generate correct answers across both structured and unstructured scenarios. Because open-ended responses were generated only by the models, student performance was not available for direct comparison in this format. Accuracy of open-ended responses was qualitatively assessed against previously predefined criteria23 to determine whether the response appropriately addressed the clinical question posed. Each response was categorized relative to the gold standard as accurate, inaccurate, or indeterminate. Indeterminate responses were those that were vague or generic, only partially addressed the question, or lacked sufficient detail to determine clinical soundness. Two clinicians (HI and TZ) reviewed and categorized each response independently, blinded to each other's answers.
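The three-way categorization of open-ended responses can be summarized computationally as below. This is a hedged sketch: the category labels follow the paper, but the function name and the example ratings are invented, and real per-response data would come from the two blinded raters.

```python
# Hedged sketch of tallying open-ended response categories into proportions,
# mirroring how the Table reports them; the example ratings are invented.
from collections import Counter

CATEGORIES = ("accurate", "inaccurate", "indeterminate")

def category_proportions(ratings: list[str]) -> dict[str, float]:
    """Return the percentage of responses falling into each category."""
    assert all(r in CATEGORIES for r in ratings)
    counts = Counter(ratings)
    n = len(ratings)
    return {c: round(100 * counts[c] / n, 1) for c in CATEGORIES}

# Invented ratings for 6 hypothetical open-ended responses from one model:
example = ["accurate", "accurate", "indeterminate",
           "accurate", "inaccurate", "accurate"]
print(category_proportions(example))
```

In practice, each rater would produce one such list per model, and disagreements would be reconciled before reporting.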

Standard Protocol Approvals, Registrations, and Participant Consents

This study did not constitute human subjects research and did not require institutional review board approval. The need for informed consent was waived accordingly, consistent with ethical research standards for studies using nonidentifiable data. Only anonymized, aggregate results were analyzed; no personal or sensitive information was collected, and no individual student performance data were assessed or reported.

Statistical Analysis

All data were summarized and presented in descriptive tables and/or visualized using bar graphs and line graphs. Quantitative results for accuracy and comparative performance metrics were displayed as mean values with SDs, medians with interquartile ranges (IQRs), or percentages, as appropriate. Questions were stratified by difficulty based on student performance: easy (>80% answered correctly), medium (50%–79%), and hard (<50%). Model performance was subsequently analyzed within these difficulty categories to evaluate whether accuracy differed across question types. Statistical comparisons between models for accuracy across question difficulties and question types were conducted using χ2 tests, with a significance threshold set at p < 0.05. Accuracy was defined as the proportion of correct answers compared with the gold standard. For open-ended questions, accuracy outcomes were reported as proportions for each response category (accurate, inaccurate, indeterminate). All statistical analyses were performed using IBM SPSS Statistics (version 30.0.0.0; IBM Corp., Armonk, NY), and visualizations were created with GraphPad Prism (version 5; GraphPad Software, San Diego, CA).
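The difficulty stratification and the χ2 comparison described above can be sketched as follows. This is an illustrative sketch, not the study's analysis code: it assumes SciPy is available, and the contingency counts are invented placeholders rather than the study's data.

```python
# Illustrative sketch of difficulty bucketing and a chi-square test of model
# accuracy; counts below are invented, not the study's data.
from scipy.stats import chi2_contingency

def difficulty(share_students_correct: float) -> str:
    """Bucket a question by the share of students answering it correctly:
    easy (>80%), medium (50%-79%), hard (<50%)."""
    if share_students_correct > 0.80:
        return "easy"
    if share_students_correct >= 0.50:
        return "medium"
    return "hard"

# Hypothetical 2x3 table of correct/incorrect counts for 3 models:
table = [
    [43, 46, 42],  # correct   (GPT-4o, MS RAG, Prof. Valmed)
    [10,  7, 11],  # incorrect
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.3f}, p = {p:.3f}")
```

Comparing proportions of correct answers across the three models in this way matches the 2-degrees-of-freedom tests reported in the Results.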

Data Availability

The datasets generated and analyzed during this study are available from the corresponding author on reasonable request. However, the multiple-choice and open-ended questions derived from the MS master's program curriculum are subject to academic use restrictions and cannot be publicly shared.

Results

Comparative Accuracy of Students and Models

A total of 53 questions were included in the study. The mean accuracy of the student cohort was 82.2% (SD 23%, median 92.8%, IQR 71%–100%, Figure 1). Among the models, GPT-4o achieved an accuracy of 81.1% and MS RAG achieved 86.8% while Prof. Valmed achieved 91.3%, calculated based on the questions it answered. No statistically significant difference was observed among the models (χ2(2) = 2.165, p = 0.339), with a small effect size (Cramer V = 0.119). Seven questions were left unanswered by Prof. Valmed (deferral to its safety mechanism) and were excluded from accuracy computation.

Figure 1. Overall Accuracy Across Groups.


Overall accuracy achieved by MS master's students (mean, SD) and 3 large language models (GPT-4o, MS RAG, and Prof. Valmed) in response to n = 53 multiple-choice questions. Prof. Valmed and MS RAG had a tendency toward higher accuracies, overall. However, no significant differences were observed in χ2 tests. MS = multiple sclerosis; RAG = retrieval-augmented generation.

Performance by Multiple-Choice Question Type

When stratified by question type, distinct performance patterns were observed among models. For single-correct answer questions (n = 38), Prof. Valmed and MS RAG outperformed GPT-4o, achieving accuracies of 97.1%, 86.8%, and 78.9%, respectively (Figure 2). However, differences among the models were not statistically significant (χ2(2) = 5.309, p = 0.070, Cramer V = 0.155). The student group performed comparably at 79.5% (SD 25.6%). For single-wrong answer questions (n = 15), the students achieved higher accuracy (89%, SD 12.9%) while GPT-4o and MS RAG achieved similar accuracies (86.6%) and Prof. Valmed achieved 75% overall. However, the group differences were not statistically significant (χ2(2) = 0.840, p = 0.657, Cramer V = 0.100).

Figure 2. Accuracy by Question Type.


Accuracy achieved by MS master's students and 3 LLMs (GPT-4o, MS RAG, and Prof. Valmed) according to question type, including single-correct answer (n = 38) and single-wrong answer (n = 15) questions. Prof. Valmed demonstrated superior performance on single-correct answer questions while students performed better across single-wrong answer types, followed by GPT-4o and MS RAG. However, no significant differences were observed in χ2 tests. LLM = large language model; MS = multiple sclerosis; RAG = retrieval-augmented generation.

Performance by Difficulty Level

Using the difficulty stratification based on student performance, the models demonstrated varying trends. For “easy” questions (students >80% correct, n = 33 questions), students outperformed the LLMs, with 96.3% (SD 5.1%) accuracy, while GPT-4o, MS RAG, and Prof. Valmed achieved 81.8%, 90.9%, and 90.3%, respectively (Figure 3). Differences among models were not statistically significant (χ2(2) = 1.563, p = 0.458; Cramer V = 0.127). On “medium” questions (students 50%–79% correct, n = 15 questions), the models achieved higher accuracies than the students (70.5%, SD 8.0%), with 86.7% (GPT-4o), 80.0% (MS RAG), and 90.9% (Prof. Valmed). Again, there was no significant difference among models (χ2(2) = 0.637, p = 0.727; Cramer V = 0.125). For “hard” questions (students <50% correct, n = 5 questions), student accuracy dropped to 24.3% (SD 10.8%), whereas GPT-4o, MS RAG, and Prof. Valmed achieved relatively better results (3/5, 4/5, and 4/4 correct answers, respectively).

Figure 3. Accuracy by Question Difficulty.


Accuracy achieved by MS master's students and 3 LLMs (GPT-4o, MS RAG, and Prof. Valmed) according to question difficulty. Questions were stratified based on student performance: easy (>80% answered correctly), medium (50%–79%), and hard (<50%). Students outperformed LLMs on easy questions while performance on medium-difficulty questions was comparable across all groups. For hard questions, MS RAG and Prof. Valmed maintained better accuracy than GPT-4o and the students, whose accuracy declined notably. However, no significant difference was observed in χ2 tests. LLM = large language model; MS = multiple sclerosis; RAG = retrieval-augmented generation.

Performance on Open-Ended Questions

In the subset of open-ended questions, GPT-4o produced accurate responses in 66.7% of cases, MS RAG in 76.2%, and Prof. Valmed in 85% (Table). Notably, while GPT-4o and MS RAG provided 14.3% and 4.8% inaccurate answers, respectively, Prof. Valmed did not give any clearly wrong answer but showed a higher rate of outputs lacking substantive insight into the question. These responses included statements such as, “I'm here to help with medical and health-related questions. It seems your query is outside my expertise. Please try asking a medical-related question” or “I'm sorry, I couldn't find any relevant information for your question. Please try rephrasing your question or providing more specific details to help me find better results.”

Table. Accuracy of Open-Ended Responses (N = 21 Questions)

                             GPT-4o   MS RAG   Prof. Valmed
Accurate answers, %            66.7     76.2     85
Indeterminate answers, %       19       14.3     15
Inaccurate answers, %          14.3      4.8      0

Abbreviations: MS = multiple sclerosis; RAG = retrieval-augmented generation.

“Accurate” answers fully addressed the clinical question based on predefined criteria and expert consensus. “Indeterminate” answers included safety-related, vague, or generic responses that did not provide sufficient information to judge clinical appropriateness. “Inaccurate” answers contained clinically incorrect or misleading information. All categorizations were performed independently by 2 MS-trained clinicians.

Discussion

This study provides novel insights into the performance of LLMs in the context of MS-specific knowledge application in educational settings. By comparing both structured and open-ended performance, the study helps to identify key tradeoffs between general-purpose and domain-specific architectures.

GPT-4o, MS RAG, and Prof. Valmed displayed relatively similar overall accuracy rates, with a tendency toward better results for the latter 2. This suggests that differences in training data and retrieval mechanisms may influence the performance of LLMs, particularly when applied to domain-specific educational tasks. While the overall performance was comparable to postgraduate students, the differences emerged by question type and difficulty level.

LLMs demonstrated greater consistency on medium and hard questions, whereas students performed best on easier items. This may suggest that LLMs provide consistent performance as question complexity increases. However, performance differed depending on question type: single-correct answer questions yielded the highest accuracy across all groups, with Prof. Valmed showing the numerically best performance, potentially because of its alignment with structured, guideline-based queries. Conversely, students outperformed all LLMs on single-wrong answer questions, suggesting human reasoning strengths in distinguishing between near-plausible distractors.

Despite the small number of questions and the resulting limited statistical power, Prof. Valmed showed numerically stronger performance on medium and hard questions, suggesting strengths in structured, guideline-based reasoning. MS RAG also performed relatively well on hard questions, suggesting the value of retrieval-augmented architectures in addressing less straightforward content. These findings are consistent with previous work showing improved accuracy and relevance through dynamic retrieval mechanisms.3,24 RAG models can determine when additional context is needed, a critical feature for managing questions with the nuances of real-world clinical decision making.24,25 Combining fine-tuned knowledge with selective retrieval may be essential for replicating expert-like behavior.

Although multiple-choice formats are efficient for structured evaluation, they may not fully capture the depth of reasoning and contextual navigation required in clinical practice.26,27 These formats primarily assess knowledge recall rather than the reasoning processes needed to handle unstructured or ambiguous scenarios. By contrast, information retrieval emphasizes identifying relevant content dynamically, reflecting the reasoning demands of real-world decision making.28 Retrieval performance varies depending on domain-specific fine-tuning, retrieval methods, and prompting strategies.29 In neurology, these are critical for ensuring that LLM-generated responses align with current medical knowledge and guidelines.

In open-ended tasks, Prof. Valmed achieved the highest accuracy, followed by MS RAG and GPT-4o. It is important to note that Prof. Valmed did not produce any inaccurate or clearly incorrect answers. Instead, in a few cases, it provided generic or limited responses, possibly because of built-in safeguards or confidence thresholds. We excluded these unanswered questions from the accuracy calculation because penalizing abstentions would counteract the model's designed safety mechanism. Nonetheless, this methodological decision may limit comparability. Prof. Valmed's behavior may reflect an intentional safeguard mechanism prioritizing safety and guideline fidelity over broad contextual reasoning for potentially ambiguous prompts.3,30 These differences emphasize the tradeoff between adaptability and cautiousness: general-purpose models may respond more flexibly but risk generating inaccurate content, whereas specialized models may prioritize reliability. This distinction is particularly relevant when considering LLM use in decision support.

Certain limitations must be acknowledged. The relatively small and fixed set of questions, although validated by MS specialists, may not capture the full complexity and variability of MS care. The limited number of items may have restricted the statistical power to detect subtle but potentially meaningful performance differences. Moreover, although multiple-choice and open-ended tasks allow standardized comparisons, they may oversimplify the complexity of clinical reasoning.25 The lack of an established MS-specific benchmark limits generalizability, and our inability to assess hallucination rates, explainability, or citation behavior further constrains interpretation. Leveraging structured and validated benchmarks could facilitate robust and reproducible evaluations.31 Finally, although most participants were neurologists or trainees, a small proportion came from related fields, introducing a minor potential confounder.

Although not evaluated in an instructional context, these results provide a basis for exploring how LLMs might be integrated into postgraduate neurology education. Beyond demonstrating accuracy comparable to postgraduate students, the models showed added usefulness on more difficult tasks, suggesting potential as complementary tools for training. Possible applications include AI tutors that provide individualized feedback on complex questions; retrieval-augmented systems that support formative assessments by aligning with evolving evidence; or supplementary review tools for learners struggling with specific reasoning tasks. These approaches could help address variability in training and ensure consistent access to evolving knowledge. It is important to note that these findings may also hold relevance for other subspecialties within neurology, where evolving guidelines and complex decision making pose similar educational challenges.

Future research should expand question sets and formats to encompass a broader spectrum of MS care. A benchmarking dataset for MS, combining multiple formats and difficulty levels, would support robust, reproducible model evaluations. Exploring the interpretability of outputs, frequency of hallucinations, and alignment with verifiable references is essential for evaluating safety and clinical utility.

Careful integration, with attention to safety, guideline adherence, and transparency, will be essential before LLMs can be considered meaningful additions to specialized neurology education. Recent innovative conceptual approaches in LLM development, such as whitelisting or reinforcement learning from human feedback, which aim to enhance transparency and controllability of outputs, underscore the growing emphasis on interpretability and safety in clinical applications.32-34 Incorporating such techniques into domain-specific models may further enhance reliability and trust. Moreover, analyzing the reasoning pathways behind LLM-generated answers—especially for complex items—may provide an additional pedagogical layer, enabling adaptive feedback and uncovering strengths and limitations in replicating expert-level clinical reasoning.

LLMs, particularly those with retrieval-augmented or domain-specific architectures, can apply MS-specific knowledge at a level comparable to postgraduate students in a specialized MS curriculum. Beyond overall accuracy, the models may show added usefulness on more difficult tasks, suggesting potential as complementary tools in neurology education. However, differences in adaptability, safety constraints, and response style highlight the importance of matching LLM architecture to its intended use case. Broader validation with real-world scenarios and standardized benchmark datasets will be essential to confirm educational and clinical utility and ensure safe integration into postgraduate training.

Acknowledgment

The authors thank the teaching staff of the MS Management Master's Program at Dresden International University (DIU) and the Charcot MS Master (M.Sc.) for providing the educational framework and expert-validated materials that served as the basis for this evaluation. The authors also acknowledge the support from validated medical information GmbH for granting independent access to the Prof. Valmed platform.

Glossary

AI = artificial intelligence
DIU = Dresden International University
IQR = interquartile range
LLM = large language model
MS = multiple sclerosis
RAG = retrieval-augmented generation

Footnotes

Editorial: What's With All the Hype? Applying Large Language Models in Neurology Education. Page e200267.

Author Contributions

H. Inojosa: drafting/revision of the manuscript for content, including medical writing for content; major role in the acquisition of data; study concept or design; analysis or interpretation of data. A. Ramezanzadeh: drafting/revision of the manuscript for content, including medical writing for content; analysis or interpretation of data. I. Gasparovic-Curtini: drafting/revision of the manuscript for content, including medical writing for content; major role in the acquisition of data. I. Wiest: study concept or design; analysis or interpretation of data. J.N. Kather: analysis or interpretation of data. S. Gilbert: drafting/revision of the manuscript for content, including medical writing for content; analysis or interpretation of data. T. Ziemssen: drafting/revision of the manuscript for content, including medical writing for content; major role in the acquisition of data; study concept or design; analysis or interpretation of data.

Study Funding

No targeted funding reported.

Disclosure

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. H. Inojosa reports speaker honoraria from Roche and financial support for research activities from Novartis, Teva, Biogen, and Alexion. J.N. Kather reports support from the German Federal Ministry of Health (DEEP LIVER ZMVI1-2520DAT111; SWAG 01KD2215B), the Max-Eder-Programme of the German Cancer Aid (grant 70113864), the German Federal Ministry of Education and Research (PEARL 01KD2104C; CAMINO 01EO2101; SWAG 01KD2215A; TRANSFORM LIVER 031 L0312A; TANGERINE 01KT2302), the German Academic Exchange Service (SECAI 57616814), the German Federal Joint Committee (Transplant.KI 01VSF21048), the European Union's Horizon Europe programme (ODELIA 101057091; GENIAL 101096312), and the NIHR Leeds Biomedical Research Centre (NIHR NIHR213331). S. Gilbert reports nonfinancial interest as Advisory Group member of the EY-coordinated “Study on Regulatory Governance and Innovation in the Field of Medical Devices” on behalf of DG SANTE of the European Commission; consulting relationships with Una Health GmbH, Lindus Health Ltd., Flo Ltd., Thymia Ltd., FORUM Institut für Management GmbH, High-Tech Gründerfonds Management GmbH, and Ada Health GmbH; share options in Ada Health GmbH; support from the German Federal Ministry of Health (DEEP LIVER ZMVI1-2520DAT111; SWAG 01KD2215B), the Max-Eder-Programme of the German Cancer Aid (grant 70113864), the German Federal Ministry of Education and Research (PEARL 01KD2104C; CAMINO 01EO2101; SWAG 01KD2215A; TRANSFORM LIVER 031 L0312A; TANGERINE 01KT2302), the German Academic Exchange Service (SECAI 57616814), the German Federal Joint Committee (Transplant.KI 01VSF21048), the European Union's Horizon Europe programme (ODELIA 101057091; GENIAL 101096312), and the National Institute for Health and Care Research (NIHR NIHR213331) Leeds Biomedical Research Centre. T. Ziemssen reports scientific advisory board and/or consulting for Biogen, Roche, Novartis, Celgene, and Merck; compensation for speakers' bureaus for Roche, Novartis, Merck, Sanofi, Celgene, and Biogen; and research support from Biogen, Novartis, Merck, and Sanofi. All other authors report no disclosures. Go to Neurology.org/NE for full disclosures.

References

1. Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259-265. doi: 10.1038/s41586-023-05881-4
2. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
3. Naveed H, Khan AU, Qiu S, et al. A comprehensive overview of large language models. arXiv. 2023. arXiv:2307.06435.
4. Reich DS, Lucchinetti CF, Calabresi PA. Multiple sclerosis. N Engl J Med. 2018;378(2):169-180. doi: 10.1056/NEJMra1401483
5. Inojosa H, Voigt I, Wenk J, et al. Integrating large language models in care, research, and education in multiple sclerosis management. Mult Scler. 2024;30(11-12):1392-1401. doi: 10.1177/13524585241277376
6. Vij O, Calver H, Myall N, Dey M, Kouranloo K. Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments. PLoS One. 2024;19(7):e0307372. doi: 10.1371/journal.pone.0307372
7. Maitland A, Fowkes R, Maitland S. Can ChatGPT pass the MRCP (UK) written examinations? Analysis of performance and errors using a clinical decision-reasoning framework. BMJ Open. 2024;14(3):e080558. doi: 10.1136/bmjopen-2023-080558
8. Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Med Educ. 2024;10:e50965. doi: 10.2196/50965
9. Ming S, Guo Q, Cheng W, Lei B. Influence of model evolution and system roles on ChatGPT's performance in Chinese medical licensing exams: comparative study. JMIR Med Educ. 2024;10:e52784. doi: 10.2196/52784
10. Arvidsson R, Gunnarsson R, Entezarjou A, Sundemo D, Wikberg C. ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study. BMJ Open. 2024;14(12):e086148. doi: 10.1136/bmjopen-2024-086148
11. Liu M, Okuhara T, Chang X, et al. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J Med Internet Res. 2024;26:e60807. doi: 10.2196/60807
12. Katz U, Cohen E, Shachar E, et al. GPT versus resident physicians: a benchmark based on official board scores. NEJM AI. 2024;1(5):AIdbp2300192. doi: 10.1056/AIdbp2300192
13. Singhal K, Tu T, Gottweis J, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025;31(3):943-950. doi: 10.1038/s41591-024-03423-7
14. Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg BS, Klang E. How GPT models perform on the United States medical licensing examination: a systematic review. Discov Appl Sci. 2024;6(10):500. doi: 10.1007/s42452-024-06194-5
15. Booth GJ, Hauert T, Mynes M, et al. Fine-tuning large language models to enhance programmatic assessment in graduate medical education. J Educ Perioper Med. 2024;26(3):E729. doi: 10.46374/VolXXVI_Issue3_Moore
16. Prof. Valmed: Validated Medical Information GmbH. Prof. Valmed medical co-pilot. Accessed April 28, 2025. profvalmed.com/our-tool/
17. Voigt I, Stadelmann C, Meuth SG, et al. Innovation in digital education: lessons learned from the Multiple Sclerosis Management Master's Program. Brain Sci. 2021;11(8):1110. doi: 10.3390/brainsci11081110
18. Voigt I, Stadelmann C, Meuth SG, et al. Neuer Masterstudiengang: Multiple Sklerose Management [New master's program: multiple sclerosis management]. Wien Med Wochenschr. 2022;172(15-16):383-391. doi: 10.1007/s10354-021-00900-3
19. Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv. 2023. arXiv:2303.08774.
20. OpenAI. ChatGPT Builder UI. 2025. Accessed March 8, 2025. help.openai.com/en/articles/8798878-building-and-publishing-a-gpt
21. OpenAI. ChatGPT Builder UI. MS RAG – Retrieval-Augmented GPT for MS. Accessed December 28, 2024. chatgpt.com/g/g-67793b57633081919e1ea5d9c84466e1-rag-ms-gpt
22. Shah K, Xu AY, Sharma Y, et al. Large language model prompting techniques for advancement in clinical medicine. J Clin Med. 2024;13(17):5101. doi: 10.3390/jcm13175101
23. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS Digit Health. 2023;2(2):e0000198. doi: 10.1371/journal.pdig.0000198
24. Labruna T, Campos JA, Azkune G. When to retrieve: teaching LLMs to utilize information retrieval effectively. arXiv. 2024. arXiv:2404.19705.
25. Izacard G, Grave E. Leveraging passage retrieval with generative models for open domain question answering. arXiv. 2020. arXiv:2007.01282.
26. Li W, Li L, Xiang T, Liu X, Deng W, Garcia N. Can multiple-choice questions really be useful in detecting the abilities of LLMs? arXiv. 2024. arXiv:2403.17752.
27. Wang H, Zhao S, Qiang Z, Xi N, Qin B, Liu T. Beyond the answers: reviewing the rationality of multiple choice question answering for the evaluation of large language models. arXiv. 2024. arXiv:2402.01349.
28. Zhang W, Liao J, Li N, Du K. Agentic information retrieval. arXiv. 2024. arXiv:2410.09713.
29. Zhu Y, Yuan H, Wang S, et al. Large language models for information retrieval: a survey. arXiv. 2023. arXiv:2308.07107.
30. Belkin M, Hsu D, Ma S, Mandal S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc Natl Acad Sci USA. 2019;116(32):15849-15854. doi: 10.1073/pnas.1903070116
31. Kwiatkowski T, Palomaki J, Redfield O, et al. Natural questions: a benchmark for question answering research. Trans Assoc Comput Linguist. 2019;7:453-466. doi: 10.1162/tacl_a_00276
32. Borys K, Schmitt YA, Nauta M, et al. Explainable AI in medical imaging: an overview for clinical practitioners - beyond saliency-based XAI approaches. Eur J Radiol. 2023;162:110786. doi: 10.1016/j.ejrad.2023.110786
33. Schwartz IS, Link KE, Daneshjou R, Cortés-Penfield N. Black box warning: large language models and the future of infectious diseases consultation. Clin Infect Dis. 2024;78(4):860-866. doi: 10.1093/cid/ciad633
34. Wang X, Zhang NX, He H, et al. Safety challenges of AI in medicine in the era of large language models. arXiv. 2024. arXiv:2409.18968.

Data Availability Statement

The datasets generated and analyzed during this study are available from the corresponding author on reasonable request. However, the multiple-choice and open-ended questions derived from the MS master's program curriculum are subject to academic use restrictions and cannot be publicly shared.


Articles from Neurology: Education are provided here courtesy of American Academy of Neurology
