Abstract
The integration of Large Language Models (LLMs) in medical education has gained significant attention, particularly for their ability to handle complex medical knowledge assessments. However, a comprehensive evaluation of their performance in anatomical education remains limited. This study evaluated the accuracy of current LLMs, compared with previous versions, in answering anatomical multiple-choice questions (MCQs), and assessed their reliability across different anatomical topics. We analyzed the performance of four LLMs (GPT-4o, Claude, Copilot, and Gemini) on 325 USMLE-style MCQs covering seven anatomical topics. Each model attempted the questions three times. Results were compared with the previous year’s GPT-3.5 performance and with random guessing. Statistical analysis included chi-square tests for performance differences. Current LLMs achieved an average accuracy of 76.8 ± 12.2%, significantly higher than GPT-3.5 (44.4 ± 8.5%) and random responses (19.4 ± 5.9%). GPT-4o demonstrated the highest accuracy (92.9 ± 2.5%), followed by Claude (76.7 ± 5.7%), Copilot (73.9 ± 11.9%), and Gemini (63.7 ± 6.5%). Performance varied significantly across anatomical topics: Head & Neck (79.5%) and Abdomen (78.7%) showed the highest accuracy rates, while Upper Limb questions showed the lowest performance (72.9%). Only 29.5% of questions were answered correctly by all LLMs, and 2.5% were never answered correctly. Statistical analysis confirmed significant differences between models and across topics (χ2 = 182.11–518.32, p < 0.001). Current LLMs show markedly improved performance in anatomical knowledge assessment compared with previous versions, with GPT-4o demonstrating superior accuracy and consistency. However, performance variations across anatomical topics and between models suggest the need for careful consideration in educational applications. These tools show promise as supplementary resources in medical education while highlighting the continued necessity of human expertise.
Keywords: Artificial intelligence, Medical education, Anatomy, Large language models, Assessment, ChatGPT
Subject terms: Preclinical research, Anatomy, Endocrine system, Gastrointestinal system, Musculoskeletal system, Nervous system, Oral anatomy, Urinary tract
Introduction
The introduction of AI-driven large language models (LLMs) has raised great interest in their use in medical education and assessment. Since ChatGPT’s first launch in November 2022, these models have excelled at complex tasks such as text processing and human-like language generation1.
The United States Medical Licensing Examination (USMLE) is a three-step examination for medical licensure in the United States. The USMLE Step 1, which focuses on pre-clinical sciences, is relevant for this study as it includes questions on gross anatomy, a fundamental discipline taught to first and second-year medical students2. Gross anatomy forms the foundation of medical education, covering the macroscopic structure of the human body and its organs3.
AI technologies are increasingly being integrated into medical education, offering innovative approaches to USMLE preparation for US medical students3,4. These applications range from adaptive learning platforms5 that adjust to students’ knowledge gaps to virtual patient simulations that enhance clinical reasoning skills tested on the USMLE6. Large Language Models can generate practice questions, provide immediate explanations for correct and incorrect answers, create customized study schedules, and simulate exam-like environments for assessment7,1. They can also synthesize complex information from multiple medical resources, helping students grasp difficult concepts across the basic sciences tested in Step 1, including gross anatomy8.
As LLMs become more sophisticated, it is important to understand how these educational technologies can be incorporated into existing frameworks without compromising the quality of medical education. Many studies have demonstrated the ability of AI to improve students’ learning. However, it is still best utilized alongside traditional teaching practices6,9.
To evaluate LLMs’ proficiency and reliability, many researchers are studying how these models manage sophisticated medical concepts and clinical reasoning in different examinations. These models have been performing remarkably well on a variety of medical licensing examinations worldwide, especially GPT-4. For instance, GPT-4 achieved 64.4%–100% accuracy across numerous medical licensing examinations, compared with 36%–77% for its predecessor, GPT-3.510. These conclusions have been substantiated by meta-analyses of LLM results across different versions and examination formats, in which GPT-4 achieved an average overall accuracy rate of 81%, significantly surpassing the 58% accuracy rate of GPT-3.511.
The integration of AI into medical education presents both opportunities and challenges. While these tools add value for learning and examination purposes, there is some skepticism about their dependability and scope for errors. Studies indicate that LLMs tend to “hallucinate,” which points to a generated piece of information that is false yet asserts confidence, demonstrating why caution is needed when deploying these LLMs12. The usefulness of AI when creating educational materials, especially multiple-choice questions and assessments, has been researched, and the results do not show a consensus in comparison to human content12,13.
Some researchers have found that LLMs’ performance varies significantly depending on the particular discipline and the nature of the assessment. For example, although ChatGPT proved useful as an engaging pedagogical tool for anatomy education, its ability to give detailed anatomical descriptions and to create acceptable images was limited14. Similarly, for some of these models, performance in certain clinical specialties is comparable to that of junior medical residents but not of experienced clinicians15,16.
It has been established that AI-generated content, when infused into curricula, can effectively reinforce classroom teaching, provided such content is carefully reviewed by qualified instructors17. Recent research has specifically focused on identifying which curriculum components best prepare students to evaluate AI outputs critically18. Chatbots’ ability to perform content generation and knowledge assessment is impressive; however, their limitations and possible risks must be considered.
The aim of this research was to evaluate the progress that ChatGPT has made over the last year and answer the following research questions:
- What is the performance accuracy of GPT-4o compared to the previous version of GPT-3.5 in answering USMLE-style MCQs across different anatomical topics?
- How does the performance of different LLMs (Claude, Copilot, and Gemini) differ across various anatomical topics compared to GPT-4o results?
- How do the different LLMs’ accuracy and reliability compare to each other?
Methods
Study design
This research evaluated the performance of the four most popular currently available large language models, GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Flash (Google), and Copilot (Microsoft), versus the previous version, ChatGPT-3.5, on their proficiency in different anatomical topics. 325 USMLE-style MCQs, each with five options and a single correct answer, were randomly chosen from the Gross Anatomy course’s examination database for medical students and validated by three independent experts in our previous research19. Questions containing images or tables were excluded. The selected questions encompassed various levels of complexity and were distributed across seven distinct topics/regions: Abdomen (50 MCQs), Back (25 MCQs), Head and Neck (50 MCQs), Lower Limb (50 MCQs), Pelvis (50 MCQs), Thorax (50 MCQs), and Upper Limb (50 MCQs), for a total of 325 questions. An example of an MCQ and an LLM response is shown in Fig. 1.
Fig. 1.
An example of a prompt, a multiple-choice question from the Head and Neck topic, and Claude’s response.
Data collection
Each selected chatbot was required to answer the full questionnaire during the testing phase. The proficiency of GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Copilot in responding to multiple-choice questions was assessed in January 2025. ChatGPT (GPT-3.5) responses were recorded in October 202314. Each chatbot was given the prompt “Generate the list of correct answers for the following MCQs:” followed by the MCQ set from each specific topic, one topic at a time. Data collection was then repeated three times, with no particular time period between attempts. The results of these three successive attempts by each chatbot were meticulously recorded in a Microsoft Excel spreadsheet (Microsoft®365) and evaluated for accuracy. A total of 4,875 answers from LLMs were analyzed.
To compare chatbots’ results with random guessing, three random sets of answers were generated for the same questionnaire utilizing the RAND() function in Microsoft Excel and analyzed.
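As an illustration, the random-guessing baseline can be reproduced in a few lines. This is a minimal sketch, not the authors’ Excel RAND() workflow: it assumes single-best-answer MCQs with five options, and the answer key used here is itself randomly generated (hypothetical), since any fixed key yields the same expected accuracy of about 20%.

```python
import random

def random_guess_accuracy(n_questions=325, n_options=5, n_sets=3, seed=0):
    """Simulate random guessing on single-best-answer MCQs.

    Each set picks one of five options per question uniformly at
    random; accuracy is the fraction matching the answer key.
    """
    rng = random.Random(seed)
    # Hypothetical answer key; any fixed key gives the same expected accuracy.
    key = [rng.randrange(n_options) for _ in range(n_questions)]
    accuracies = []
    for _ in range(n_sets):
        guesses = [rng.randrange(n_options) for _ in range(n_questions)]
        correct = sum(g == k for g, k in zip(guesses, key))
        accuracies.append(100 * correct / n_questions)
    return accuracies

print(random_guess_accuracy())
```

With 325 questions per set, individual runs land close to the theoretical 20% baseline, consistent with the random-response accuracy reported in the Results.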
Data analysis
The data from each of the three attempts were matched against the answer key and compared with results from previous attempts to determine the percentage of correct and repeated answers. A detailed item analysis was then performed across topics and questions for each LLM.
Basic data statistics were conducted using Statistica 13.5.0.17 (TIBCO® Statistica™), with the Pearson chi-squared test employed to compare performance between different topics and LLMs, using a significance threshold of p ≤ 0.05.
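For a pair of models, the Pearson chi-squared comparison on pooled correct/incorrect answer counts reduces to a 2 × 2 contingency table. The sketch below uses only the standard library and hypothetical counts (the raw per-answer data are not reproduced here); for a 2 × 2 table, df = 1 and the p-value has a closed form via the complementary error function.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 contingency table [[a, b], [c, d]]
    (no continuity correction), with the df = 1 p-value from the
    chi-square survival function."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p = math.erfc(math.sqrt(chi2 / 2))  # exact for df = 1
    return chi2, p

# Hypothetical counts: correct vs. incorrect answers pooled over
# three attempts (3 x 325 = 975 answers per model).
chi2, p = pearson_chi2_2x2(906, 69, 621, 354)  # a GPT-4o-like vs. Gemini-like pair
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

Note that the pairwise comparisons reported later in Table 1 have df = 3, which implies a different table layout than this simple 2 × 2 pooling.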
Results
According to our data, the four tested LLMs (GPT-4o, Claude, Copilot, and Gemini) accurately answered, on average, 76.8 ± 12.2% of the 325 MCQs from 7 topics in the Gross Anatomy course. This result was 27.7% above GPT-3.5’s results from a year earlier (44.4 ± 8.5%) and 3.7 times better than randomly generated responses (19.4 ± 5.9%) on the same questionnaire (Fig. 2).
Fig. 2.
Percentage of correct answers from different chatbots on 325 MCQs from the Gross Anatomy course. Y-axis: % of correct answers; X-axis: different LLMs’ results.
There was significant variation in correct responses among the current versions of LLMs. The best results were shown by GPT-4o (92.9 ± 2.5%), followed by Claude (76.7 ± 5.7%), Copilot (73.9 ± 11.9%), and Gemini (63.7 ± 6.5%).
In the box plot analysis of AI system performance, GPT-4o demonstrated quite consistent performance across topics, with scores tightly clustered between 88% and 95.3%, whereas Copilot showed the greatest variation, 56%–89.3%.
Chi-square analysis revealed that all LLMs showed statistically significant deviation from the expected uniform distribution of correct answers (χ2 = 182.11–518.32, p < 0.001). The null hypothesis of uniform performance across topics and models can therefore be rejected: there is a statistically significant relationship between LLM performance and both model type and anatomical topic/region.
After that, a detailed topic-wise evaluation of the results from all up-to-date LLMs (GPT-4o, Claude, Copilot, and Gemini) was performed and compared with ChatGPT-3.5’s performance from a year earlier (Fig. 3).
Fig. 3.
Heatmap of LLMs’ topic-wise performance in the Gross Anatomy course. Numbers are % of correct answers in the specific topic for each chatbot.
In all attempts, only 29.5% (96/325) of questions were answered correctly by all four LLMs (GPT-4o, Claude, Copilot, and Gemini). General item analysis revealed that Head & Neck and Abdomen were the two best categories, with average results of 79.5% and 78.7%, respectively. In contrast, the lowest results were recorded for Upper Limb questions (72.9%). Statistical analysis revealed statistically significant differences in performance between topics across all LLMs (all p-values < 0.001). The highest variation was calculated for the Upper Limb questions (χ2 = 243.88) and the lowest for the Back (χ2 = 109.25).
Only 2.5% (8/325) of the questions were never answered correctly by any LLM. Item analysis revealed that all of them were high-level critical-thinking questions, distributed roughly evenly (one to two each) among the different topics.
Comparative analysis of GPT-4o and GPT-3.5 performance (Open AI)
The results of three successive GPT-4o attempts to answer the 325 Gross Anatomy MCQs in January 2025 showed 92.9 ± 2.5% correct answers, 48.5 percentage points (χ2 = 270.67, p < 0.001) better than GPT-3.5’s performance in October 2023 (44.4 ± 8.5%). Interestingly, for both generations of ChatGPT, the results gradually increased with each consecutive attempt: 91.7%, 93.2%, and 94.8% correct for GPT-4o, and 42.8%, 43.1%, and 44% for GPT-3.5.
The agreement of GPT-4o’s answers with those from earlier attempts was 96.6%–98.2%, and the agreement among correct answers was 91.4%–93.2%, indicating very good consistency and reliability. The previous model, GPT-3.5, did not show such results a year earlier: agreement with previously generated answers was 56%–61.8%, of which only 31.7%–32.3% were correct, so its answers were largely unreliable.
Topic-wise analysis revealed the largest performance gaps for Thorax, Upper Limb, and Lower Limb, and the smallest for Back, Head & Neck, and Pelvis (Fig. 4).
Fig. 4.
Percentage of correct answers for GPT-4o and GPT-3.5 on 325 MCQs from the Gross Anatomy course. Y-axis: % of correct answers; X-axis: topics/regions.
GPT-4o’s best-performing topics were Pelvis (95.3 ± 0.2%), Upper Limb (94.7 ± 0.2%), and Thorax (94.0 ± 0.2%). GPT-3.5 demonstrated its best results in the following topics: Back (60.0 ± 0.4%), Head and Neck (50.0 ± 0.4%), and Pelvis (46.6 ± 0.4%).
91.1% (296/325) of questions were answered correctly across all three attempts by GPT-4o, a remarkable result compared with GPT-3.5’s performance a year earlier, when only 28.3% (92/325) of questions were consistently answered correctly.
GPT-4o failed to answer only 5.2% (17/325) of the MCQs in any of the three attempts, whereas GPT-3.5 was unable to answer 37.8% (123/325) of the questions.
Claude 3.5 Sonnet (Anthropic)
Across three attempts, Claude provided 76.7 ± 5.7% correct answers to the same questionnaire, 16.2 percentage points less (p < 0.001) than GPT-4o. The first attempt was the most successful, with 78.8% correct answers, followed by 76% and 75.5% in the second and third attempts, a dynamic opposite to that of the ChatGPT models. The agreement of Claude’s answers with previous attempts was 86.8%–89.2%, and the agreement among correct answers was 71.7%–73.5%, indicating relatively good consistency. Item analysis showed that Claude correctly answered 80.7%–86.7% of questions from the Lower Limb and Pelvis topics, while the two weakest topics were Upper Limb and Abdomen (69.3%–72%). Results for the remaining topics were in the mid-70s. Claude answered 70.5% (229/325) of the questions correctly across all attempts and failed to solve 17.2% (56/325) of the MCQs; these were comprehensive questions from different topics.
Copilot (Microsoft)
A disadvantage of working with Copilot is that it accepts only up to 4,000 characters per prompt, so only 15–25 MCQs could be submitted at a time. A major advantage, however, is that Copilot is integrated into Microsoft’s working environment (Windows, Office, web browser) and is always available. Copilot generated 73.9 ± 11.9% accurate answers to the 325 MCQs from the Gross Anatomy course, the third-best result: 19 percentage points (p < 0.001) below GPT-4o but only 2.8 points below Claude. Across attempts, it showed the same rising dynamic as ChatGPT: 65.5%, 72%, and 80.6%. The agreement of Copilot’s answers with earlier attempts was 74.8%–85.2%; the agreement among correct answers was 60.6%–69.8%. The high standard deviation (11.9%) indicated greater variability in its performance and, consequently, lower reliability. Copilot solved 59.1% (192/325) of MCQs across all three attempts but could not answer 16% (52/325) of the questions, mostly from the Thorax and Pelvis material. Item analysis revealed that Copilot performed well on Abdomen and Back questions (87.3%–89.3%), while its two lowest results were in the Pelvis and Thorax material (56%–64.8%).
Gemini 1.5 Flash (Google)
Among the current LLMs, Gemini finished last, with 63.7 ± 6.5% correct answers to the same set of questions. This result was 28.5% below GPT-4o’s performance but 19.3% above GPT-3.5’s; both differences were statistically significant (p < 0.001). The first two attempts showed almost identical results, 60.9% and 60% correct answers; the third was the most successful, at 71.4%. The agreement of Gemini’s answers with previous attempts was 62.8%–85.2%, and the agreement among correct answers was 50.8%–55.4%, with a moderate standard deviation of 6.5%.
Gemini answered 47.7% (155/325) of questions correctly across all attempts and failed to solve 17.8% (58/325) of the MCQs. Item performance analysis revealed that Gemini’s two best topics were Pelvis and Head & Neck (71.3%–72.6%), and its lowest result was on Upper Limb questions (56%).
Difference in LLMs performance
Due to the categorical nature of the data, we employed the Pearson chi-square test to compare the performance of the different AI-driven chatbots against each other (Table 1).
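The df = 3 reported in Table 1 is consistent with comparing, for each model pair, the distribution of questions over four outcome categories: answered correctly in 0, 1, 2, or 3 of the three attempts (a 2 × 4 table). The sketch below illustrates this under that assumption; the middle-category counts are hypothetical, and only the "never correct" and "always correct" counts echo figures reported in the text.

```python
import math

def chi2_stat(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2

def chi2_sf_df3(x):
    """Survival function (p-value) of the chi-square distribution,
    closed form valid only for df = 3."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Hypothetical per-model counts of questions answered correctly in
# exactly 0, 1, 2, and 3 of the three attempts (each row sums to 325).
model_a = [17, 5, 7, 296]    # a GPT-4o-like profile
model_b = [58, 45, 67, 155]  # a Gemini-like profile
chi2 = chi2_stat([model_a, model_b])
print(f"chi2 = {chi2:.2f}, df = 3, p = {chi2_sf_df3(chi2):.2e}")
```

The closed-form survival function can be checked against the familiar critical value: chi2_sf_df3(7.815) is approximately 0.05.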
Table 1.
Results of the Pearson chi-square test comparing the performance of Copilot, Claude, GPT-4o, and Gemini against each other.
| LLMs | Chi-square | df | P-value |
|---|---|---|---|
| GPT-4o vs. Claude | 46.29 | 3 | 3.54E-10* |
| GPT-4o vs. Copilot | 93.56 | 3 | 6.49E-20* |
| GPT-4o vs. Gemini | 150.53 | 3 | 1.52E-32* |
| GPT-4o vs. GPT-3.5 | 270.67 | 3 | 1.87E-58* |
| Claude vs. Copilot | 18.14 | 3 | 2.72E-04* |
| Claude vs. Gemini | 49.76 | 3 | 2.00E-11* |
| Claude vs. GPT-3.5 | 121.01 | 3 | 5.94E-26* |
| Copilot vs. Gemini | 17.6 | 3 | 2.08E-04* |
| Copilot vs. GPT-3.5 | 86.59 | 3 | 1.99E-19* |
| Gemini vs. GPT-3.5 | 41.85 | 3 | 3.83E-09* |
All p-values were extremely small (far below 0.05 and even 0.001), indicating that the performance differences between all model pairs are highly statistically significant. The smallest p-values are observed in comparisons involving GPT-4o. The relatively larger (but still very small) p-values are found for Copilot vs. Gemini and Claude vs. Copilot.
* Statistically significant difference.
These results quantify the statistical significance of the performance differences we observed, with all comparisons showing extremely strong evidence of real differences in performance distributions between the models.
Discussion
Principal findings
A thorough evaluation of our data demonstrates the dramatic progress achieved by contemporary LLMs in answering anatomical multiple-choice questions. Current LLMs achieve an average accuracy of 76.8 ± 12.2%, a dramatic increase over last year’s GPT-3.5 performance (44.4 ± 8.5%) and over random answers (19.4 ± 5.9%). This improvement reflects considerable strides in AI’s ability to understand and apply medical information.
Among all the models tested, GPT-4o stood out as the best performer, with a remarkable accuracy of 92.9 ± 2.5%, followed by Claude (76.7 ± 5.7%), Copilot (73.9 ± 11.9%), and Gemini (63.7 ± 6.5%). The ranking of the LLMs’ performance remained the same across test runs, although some models were more reliable than others. Most striking was GPT-4o’s consistency across anatomical topics, with accuracy ranging from 88% to 95.3%, whereas Copilot ranged from 56% to 89.3%.
Comparison to literature
Recent research supports and extends our findings in AI-assisted medical education. On National Board of Medical Examiners sample questions, GPT-4 managed a perfect score of 100%, far better than GPT-3.5 (82.21%), Claude (84.66%), and Bard (75.46%)20. In another extensive analysis, GPT-4 scored 83.3%, greatly superior to Claude (62%), Gemini (55.3%), and Bard (54.7%), and excelled in pattern recognition (85%) versus intervention planning (71%)21. Meta-analyses of medical licensing examinations have shown that GPT-4 achieves an overall accuracy rate of 81% (95% CI 78–84%), significantly outperforming GPT-3.5’s accuracy rate of 58% (95% CI 53–63%)11.
Regarding specific medical course performances, variable quality in anatomical responses has been documented, with accuracy rates ranging from extremely good to very poor quality13,14. ChatGPT showed its effectiveness in tackling reasoning questions across diverse physiology modules, achieving an impressive 74% correctness22. Neuroscience testing revealed topic-specific variations. The strongest performance was seen in Neurocytology, Embryology, and Diencephalon (75–83%), while Brainstem, Cerebellum, and Special senses showed lower results (49–54%). On average, GPT-4 led with 81.7% accuracy, followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%)23.
In clinical specialties, studies have shown 68% accuracy rates in diagnostic tasks, with performance decreasing when dealing with image-based scenarios16. In head and neck surgery, it responded correctly to 84.7% of closed-ended questions. It provided accurate diagnoses in 81.7% of clinical scenarios, with room for improvement in procedural details and bibliographic references24. Pathological diagnosis achieved an accuracy of 89.1%, achieving good results in infectious pneumonia and atelectasis; diffuse alveolar disease, however, was more difficult (66.7% accuracy)25. The progression in model capabilities is further evidenced by documented increases in performance from 37.2% for GPT-3.5 to 67.8% for GPT-4 in anesthesiology examinations26.
Studies of Japanese medical licensing examinations have documented GPT-4o achieving accuracy rates of 89.2%, with approximately a 10% accuracy gap between image and non-image questions6.
Studies of AI versus human-generated multiple-choice questions have found AI-generated questions to be easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but with similar discrimination indices27. Research focusing on curriculum components has shown that interactive case-based and pathology teaching are most helpful in evaluating AI outputs18.
Implications of findings
The remarkable development of LLMs has significant consequences for medical education. Given the high accuracy of GPT-4o (92.9%), there are possibilities for using it as a supplementary educational tool, especially for self-assessment and examination practice. However, the proportion of questions that remain unanswered or are answered incorrectly is considerable (2.5%–8%, depending on the LLM), underscoring the need for instructor supervision; this is consistent with recent studies highlighting the importance of balancing AI technologies with conventional instructional methods.
The varying performance across different topics highlights the importance of subject-specific validation before implementing these tools in educational settings. Performance has been shown to differ considerably across specialties and subjects10,11, implying a possible need for a more focused approach to training LLMs in particular medical subjects, as opposed to using a general-purpose model.
The real-life value of LLMs answering USMLE-style questions extends beyond demonstrating AI technological advancement. Currently, US medical students rely primarily on commercial question banks that provide static explanations after practice tests. While these resources are very valuable, they have inherent limitations in adaptability and personalization28. AI chatbots trained on medical content offer several unique advantages: they can provide dynamic explanations tailored to a student’s specific misunderstandings, reformulate explanations when initial clarifications are insufficient, and connect concepts across different medical subjects in real time, guiding students toward a deeper understanding of the material29,30. Training AI chatbots specifically for medical question-answering is valuable because it creates tools that complement existing resources while addressing their limitations, potentially creating new access to high-quality USMLE preparation technology31. However, as our study demonstrates, the variable performance across anatomical topics necessitates careful consideration and validation of LLM limitations to ensure students do not develop misconceptions from incorrect AI-generated explanations.
Further developments
The fast-changing world of AI in medical education opens multiple research possibilities. More studies are needed to evaluate the proficiency and reliability of LLMs as new versions are released. Studies using image-based questions and clinical scenarios are also necessary, as these areas are important in medical education.
The development of specialized medical-education LLMs that address the performance variations observed across different medical specialties would be a very interesting topic for research.
The creation of standardized guidelines for appropriate LLM use in medical education represents another extremely important area for future work. These guidelines should be informed by current implementations as well as future shifts in AI technology.
Strengths and limitations
The study benefits from key strengths, such as its comprehensive evaluation of different anatomical topics and its use of several currently available LLMs for benchmarking. The large question bank of 325 MCQs and the use of multiple attempts provide strong data for analysis, while the comparisons with historical data and random guessing provide context for interpreting the results.
Despite these advantages, this study has several limitations. The exclusion of questions with images and tables, while necessary for our study design, limits the generalizability of our results to the full scope of medical education. Our focus on MCQs, while providing clear metrics for comparison, does not address other important aspects of medical assessment, such as clinical reasoning and practical skills. The study was also limited to the specific versions of the LLMs available during the study period, and the rapid pace of AI development means that newer versions may show different performance characteristics.
Conclusions
AI-driven LLMs today perform significantly better on anatomical multiple-choice questions than they did a year ago, representing a new frontier in AI applications for medical education. The advancement was universal across all tested models, indicating that a real step forward has been achieved in the technology’s capability to understand and apply medical information.
In the analysis of different anatomical topics, LLMs’ performance revealed significant variations, with some topics being addressed more accurately than others. The differences were statistically significant irrespective of the model tested, suggesting that they stem from knowledge gaps in certain topics that affected AI performance. These results indicate that subject-specific tuning, attentive to each discipline’s specificity, is needed to improve LLMs’ reliability.
In the comparative analysis of different models, a clear superiority was demonstrated by GPT-4o, which consistently and most accurately answered MCQs in all anatomical topics compared to other models. Claude and Copilot also performed well but were inconsistent on some topics. Such a difference in the degree of reliability and accuracy of results shown by the models indicates the need for caution in selecting the model for particular educational purposes.
These results encourage the possible incorporation of LLMs into anatomy teaching while simultaneously warning against their overuse across different subjects. LLMs should act only as plausible supplements to conventional teaching methods, not replacements for them.
Acknowledgements
The authors thank Dr. Inna Shypilova and Dr. Larysa Sankova for their help reviewing the questions.
Author contributions
O.B. designed the research. O.B. and V.M. reviewed the questions and collected and analyzed the data. V.M. performed the statistical analysis. All authors were involved in interpreting the data, drafting the article, and critically revising it. All have approved the submitted and final versions.
Data availability
The data supporting this study’s findings are available on request from the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Abd-Alrazaq, A. et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med. Educ. 9, e48291. 10.2196/48291 (2023).
- 2. Kung, T. H. et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit. Health 2(2), e0000198. 10.1371/journal.pdig.0000198 (2023).
- 3. Peterson, C. A. & Tucker, R. P. Medical gross anatomy as a predictor of performance on the USMLE Step 1. Anat. Rec. B New Anat. 283(1), 5–8. 10.1002/ar.b.20054 (2005).
- 4. Boscardin, C. K. et al. ChatGPT and generative artificial intelligence for medical education: potential impact and opportunity. Acad. Med. 99(1), 22–27. 10.1097/ACM.0000000000005439 (2024).
- 5. Tan, L. et al. Artificial intelligence-enabled adaptive learning platforms: a review. Comput. Educ. Artif. Intell. 100429. 10.1016/j.caeai.2025.100429 (2025).
- 6. Cook, D. A. Creating virtual patients using large language models: scalable, global, and low cost. Med. Teach. 47(1), 40–42. 10.1080/0142159X.2024.2376879 (2025).
- 7. Sharma, S. et al. The role of large language models in personalized learning: a systematic review of educational impact. Discov. Sustain. 6, 243. 10.1007/s43621-025-01094-z (2025).
- 8. Meng, X. et al. The application of large language models in medicine: a scoping review. iScience 27(5), 109713. 10.1016/j.isci.2024.109713 (2024).
- 9. Wilson, R. N. et al. The effects of supplemental instruction derived from peer leaders on student outcomes in undergraduate human anatomy. Anat. Sci. Educ. 17(6), 1239–1250. 10.1002/ase.2464 (2024).
- 10. Jin, H. K., Lee, H. E. & Kim, E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med. Educ. 24(1), 1013. 10.1186/s12909-024-05944-8 (2024).
- 11. Liu, M. et al. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J. Med. Internet Res. 26, e60807. 10.2196/60807 (2024).
- 12. Han, Z. et al. An explorative assessment of ChatGPT as an aid in medical education: use it with caution. Med. Teach. 46(5), 657–664. 10.1080/0142159X.2023.2271159 (2024).
- 13. Mavrych, V., Ganguly, P. & Bolgova, O. Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in gross anatomy course: comparative analysis. Clin. Anat. 38(2), 200–210. 10.1002/ca.24244 (2025).
- 14. Totlis, T. et al. The potential role of ChatGPT and artificial intelligence in anatomy education: a conversation with ChatGPT. Surg. Radiol. Anat. 45(10), 1321–1329. 10.1007/s00276-023-03229-1 (2023).
- 15. Chen, A., Chen, D. O. & Tian, L. Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases. J. Am. Med. Inform. Assoc. 31(9), 2084–2088. 10.1093/jamia/ocad245 (2024).
- 16. Shemer, A. et al. Diagnostic capabilities of ChatGPT in ophthalmology. Graefes Arch. Clin. Exp. Ophthalmol. 262(7), 2345–2352. 10.1007/s00417-023-06363-z (2024).
- 17. Surapaneni, K. M. et al. Evaluating ChatGPT as a self-learning tool in medical biochemistry: a performance assessment in undergraduate medical university examination. Biochem. Mol. Biol. Educ. 52(2), 237–248. 10.1002/bmb.21808 (2024).
- 18. Waldock, W. J. et al. Which curriculum components do medical students find most helpful for evaluating AI outputs? BMC Med. Educ. 25(1), 195. 10.1186/s12909-025-06735-5 (2025).
- 19. Bolgova, O. et al. How well did ChatGPT perform in answering questions on different topics in gross anatomy? Eur. J. Med. Health Sci. 5(6), 94–100. 10.24018/ejmed.2023.5.6.1989 (2023).
- 20. Abbas, A., Rehman, M. S. & Rehman, S. S. Comparing the performance of popular large language models on the National Board of Medical Examiners sample questions. Cureus 16(3), e55991. 10.7759/cureus.55991 (2024).
- 21. Wei, B. Performance evaluation and implications of large language models in radiology board exams: prospective comparative analysis. JMIR Med. Educ. 11, e64284. 10.2196/64284 (2025).
- 22. Banerjee, A., Ahmad, A., Bhalla, P. & Goyal, K. Assessing the efficacy of ChatGPT in solving questions based on the core concepts in physiology. Cureus 15(8), e43314. 10.7759/cureus.43314 (2023).
- 23. Mavrych, V., Yaqinuddin, A. & Bolgova, O. ChatGPT, Claude, Copilot, and Gemini performance versus students in different topics of neuroscience. Adv. Physiol. Educ. 10.1152/advan.00093.2024 (2025).
- 24. Vaira, L. A. et al. Accuracy of ChatGPT-generated information on head and neck and oromaxillofacial surgery: a multicenter collaborative analysis. Otolaryngol. Head Neck Surg. 170(6), 1492–1503. 10.1002/ohn.489 (2024).
- 25. Du, W. et al. Large language models in pathology: a comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions. Ann. Diagn. Pathol. 73, 152392. 10.1016/j.anndiagpath.2024.152392 (2024).
- 26. Artsi, Y. et al. Large language models for generating medical examinations: systematic review. BMC Med. Educ. 24(1), 354. 10.1186/s12909-024-05239-y (2024).
- 27. Law, A. K. et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med. Educ. 25(1), 208. 10.1186/s12909-025-06796-6 (2025).
- 28. Wynter, L. et al. Medical students: what educational resources are they using? BMC Med. Educ. 19, 36. 10.1186/s12909-019-1462-9 (2019).
- 29. Ghorashi, N. et al. AI-powered chatbots in medical education: potential applications and implications. Cureus 15(8), e43271. 10.7759/cureus.43271 (2023).
- 30. Labadze, L., Grigolia, M. & Machaidze, L. Role of AI chatbots in education: systematic literature review. Int. J. Educ. Technol. High. Educ. 20, 56. 10.1186/s41239-023-00426-1 (2023).
- 31. Sriram, A., Ramachandran, K. & Krishnamoorthy, S. Artificial intelligence in medical education: transforming learning and practice. Cureus 17(3), e80852. 10.7759/cureus.80852 (2025).
Associated Data
Data Availability Statement
The data supporting this study’s findings are available on request from the corresponding author.