BMC Medical Education. 2025 Aug 23;25:1192. doi: 10.1186/s12909-025-07752-0

The performance of ChatGPT on medical image-based assessments and implications for medical education

Xiang Yang 1, Wei Chen 1,2
PMCID: PMC12374324  PMID: 40849473

Abstract

Background

Generative artificial intelligence (AI) tools such as ChatGPT (OpenAI) have garnered significant attention for their potential in fields such as medical education; however, the performance of large language and vision models on medical test items involving images remains underexplored, limiting their broader educational utility. This study aims to evaluate the performance of GPT-4 and GPT-4 Omni (GPT-4o), accessed via the ChatGPT platform, on image-based United States Medical Licensing Examination (USMLE) sample items and to explore their implications for medical education.

Methods

We identified all image-based questions from the USMLE Step 1 and Step 2 Clinical Knowledge sample item sets. Prompt engineering techniques were applied to generate responses from GPT-4 and GPT-4o. Each model was independently tested, with accuracy calculated based on the proportion of correct answers. In addition, we explored the application of these models in case-based teaching scenarios involving medical images.

Results

A total of 38 image-based questions spanning multiple medical disciplines—including dermatology, cardiology, and gastroenterology—were included in the analysis. GPT-4 achieved an accuracy of 73.4% (95% CI, 57.0% to 85.5%), while GPT-4o achieved a numerically higher accuracy of 89.5% (95% CI, 74.4% to 96.1%); the difference was not statistically significant (P = 0.137). The two models showed substantial disagreement in their classification of question complexity. In exploratory case-based teaching scenarios, GPT-4o was able to analyze and revise incorrect responses with logical reasoning. Moreover, it demonstrated potential to assist educators in designing structured lesson plans focused on core clinical knowledge areas, though human oversight remained essential.

Conclusion

This study demonstrates that GPT models can accurately answer image-based medical examination questions, with GPT-4o exhibiting numerically higher performance. Prompt engineering further enables their use in instructional planning. While these models hold promise for enhancing medical education, expert supervision remains critical to ensure the accuracy and reliability of AI-generated content.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12909-025-07752-0.

Keywords: Medical education, Medical examination, Large language models, ChatGPT

Introduction

Chat Generative Pre-trained Transformer (ChatGPT) is an advanced artificial intelligence (AI) large language model (LLM) developed by OpenAI, designed to assist with a variety of tasks ranging from natural language understanding to content generation. The latest iterations, GPT-4 and GPT-4 Omni (GPT-4o), offer advanced capabilities in natural language understanding, reasoning, and multimodal processing. These models have demonstrated improved accuracy in complex tasks and are increasingly explored for applications in domains such as medical education and assessment, where both content generation and clinical reasoning are essential [1–6].

ChatGPT has shown significant potential in clinical medicine and medical education, particularly in streamlining learning, supporting clinical decision-making, and enhancing examination preparation [7–12]. Its ability to assist with medical exam questions, such as those seen in standardized tests like the United States Medical Licensing Examination (USMLE), highlights its practical value [13–18]. Recent systematic reviews have highlighted the promising performance of LLMs in health care examinations. For example, Waldock et al. reported an overall accuracy of approximately 61% across medical assessments and 51% for the USMLE, with ChatGPT reaching 64% [19]. However, these evaluations largely focused on text-based questions, with limited assessment of the models' capabilities in image-integrated clinical reasoning, even though image-based content plays a crucial role in medical education.

Evaluating AI performance on USMLE-style questions, particularly those involving clinical images, serves multiple purposes. First, success on standardized medical examinations acts as a proxy for essential medical knowledge and clinical reasoning skills, which are critical for clinical decision support development. Second, image-based clinical reasoning reflects real-world diagnostic challenges in fields such as dermatology, radiology, and pathology. Third, insights into multimodal reasoning capabilities can inform the creation of educational tools, assessment systems, and simulation environments that better reflect the complexity of clinical practice. However, research specifically focused on the performance of ChatGPT in image-based medical assessments remains limited [13, 20], which constrains its full application in educational settings. Miao et al. evaluated the performance of GPT-4 Vision on kidney pathology examinations [21], while Yang et al. assessed its diagnostic support potential on USMLE-style imaging questions [22]. However, these studies focused primarily on isolated visual recognition without integrating complex clinical reasoning or educational applications.

Our study extends this field by examining the performance of ChatGPT on multimodal clinical questions involving both images and text, and by further exploring its potential to support teaching and curriculum development in medical education. Addressing this gap could help realize the potential of these models to reshape medical training and evaluation. Thus, the present study was designed to evaluate the performance of GPT-4 and GPT-4o on image-based USMLE sample items and to explore the implications for medical education.

Methods

Study design and data source

This study used sample items from the USMLE Step 1 and Step 2 Clinical Knowledge (CK) sample question sets provided by the National Board of Medical Examiners (NBME) and the Federation of State Medical Boards (FSMB) to evaluate ChatGPT [23]. The USMLE is a comprehensive, multi-step examination used to assess the knowledge and skills necessary for medical practice in the United States and is widely regarded as one of the most rigorous and representative medical licensing assessments globally. Step 1 primarily evaluates the understanding and application of foundational medical sciences, including anatomy, physiology, pathology, and pharmacology. Step 2 CK assesses the ability to apply clinical knowledge and skills in the diagnosis and management of patient care scenarios across various medical disciplines. These sample items are specifically designed to evaluate both foundational medical knowledge and clinical reasoning, making them an appropriate and effective tool for examining the capabilities of AI models such as GPT-4 and GPT-4o in responding to medical test items involving images. All 38 publicly available USMLE-style multiple-choice questions involving image-based content were included to ensure comprehensive coverage. In accordance with the Common Rule [24], institutional review board approval was not required because no human participants, animal subjects, or health care data were involved.

The accuracy of GPT-4 and GPT-4o was evaluated for every multiple-choice item. For each test, one item per session was copied and pasted into the chat box, followed by the prompt: “Provide only an answer choice without explanation”, as suggested in previous literature (Fig. 1) [4, 25]. Each test item from the USMLE sample set included both an image and accompanying textual information, in accordance with the standard examination format; the models' responses were thus based on an integrated interpretation of both modalities. The observed accuracy rate was calculated as the proportion of correct responses across all items, without repeated attempts. Additionally, each item was classified by complexity with reference to previous literature [26], with any disagreements resolved by the researchers to ensure uniformity. This classification was requested using the prompt: “In addition, please provide two additional categorization results for each problem. The first categorization is based on the disciplinary attribute of the problem, e.g., Neurology, Pediatrics, Sexual medicine, etc.; the second categorization follows the complexity (Recall, Interpretation, and Problem-solving).” This classification strategy was informed by prior studies that explored LLM performance across medical knowledge domains [26]. As a supplementary exploration, we examined how LLMs could be prompted to support case-based teaching built on image-based clinical questions. This approach was informed by the growing interest in integrating AI into clinical education and was aimed at evaluating the feasibility of generating structured instructional feedback. Detailed prompts and examples are provided in Supplementary Appendix 1. Model responses were collected in January 2025 via the ChatGPT web interface, using default system settings and no custom instructions: GPT-4 outputs were generated between January 2 and 5 using the GPT-4 model (version gpt-4-2023-12-20), and GPT-4o outputs between January 6 and 10 using the initial public release of GPT-4o.
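Although all responses in this study were collected manually through the ChatGPT web interface, the same single-attempt, answer-only procedure could in principle be scripted. The sketch below is a minimal illustration of such an approach, assuming API access to the vision-capable models; the model identifier, file names, and helper function are illustrative assumptions and were not part of the study workflow.

```python
# Illustrative sketch only: the study used the ChatGPT web interface, not the API.
# Model name, file paths, and the helper function below are placeholder assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_image_item(question_text: str, image_path: str, model: str = "gpt-4o") -> str:
    """Submit one image-based multiple-choice item with the answer-only prompt."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": question_text + "\nProvide only an answer choice without explanation."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Example call with placeholder inputs:
# answer = ask_image_item(open("item_01.txt").read(), "item_01.png", model="gpt-4o")
```

Scoring would then simply compare each returned answer choice against the official NBME answer key.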

Fig. 1.

Fig. 1

Overview of Image-Based USMLE Sample Test Questions and ChatGPT Performance. A Categorization of test questions by disciplinary attribute; B Counts of image-based questions in each discipline; C Consistency in complexity classification between models; and D Correctness comparison of responses by both models

The primary outcome was the accuracy of GPT-4 and GPT-4o in answering the multiple-choice items. Additional variables included the subject attribute categorization, the complexity categorization, and whether each test question came from Step 1 or Step 2 CK of the USMLE.

Statistical analysis

Data were analyzed between January 14 and January 25, 2025, using Stata 17 (StataCorp LLC, College Station, Texas) and GraphPad Prism (version 10.1.0; GraphPad Software LLC, Boston, Massachusetts). The chi-square test was used to compare the performance of GPT-4 and GPT-4o across all questions, and the Fisher exact test was applied when expected cell counts were below 5. A 95% confidence interval (CI) for each observed accuracy was computed using the exact binomial (Clopper-Pearson) method. Statistical significance was defined as a 2-tailed P value of less than 0.05.
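For reference, the interval estimation and between-model comparison described above can be reproduced with standard open-source libraries. The sketch below is a Python equivalent of the Stata procedures; the correct-answer counts are assumptions inferred from the reported percentages (approximately 28/38 and 34/38), not source data, so the printed values may differ slightly from those reported.

```python
# Minimal sketch of the reported analyses (Python equivalents of the Stata procedures);
# the correct-answer counts below are assumptions inferred from the reported percentages.
from scipy.stats import fisher_exact
from statsmodels.stats.proportion import proportion_confint

N_ITEMS = 38
correct = {"GPT-4": 28, "GPT-4o": 34}  # assumed counts, for illustration only

# Exact binomial (Clopper-Pearson) 95% CI for each observed accuracy
for model, k in correct.items():
    low, high = proportion_confint(k, N_ITEMS, alpha=0.05, method="beta")
    print(f"{model}: {k / N_ITEMS:.1%} (95% CI {low:.1%} to {high:.1%})")

# 2x2 table of correct vs. incorrect responses for the two models;
# Fisher's exact test is used because some expected cell counts fall below 5.
table = [
    [correct["GPT-4"], N_ITEMS - correct["GPT-4"]],
    [correct["GPT-4o"], N_ITEMS - correct["GPT-4o"]],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact test: P = {p_value:.3f}")
```

The Clopper-Pearson interval is derived from beta-distribution quantiles, which is why statsmodels refers to this method as "beta".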

Results

Figure 1 provides an overview of the 38 image-based sample test questions from the USMLE Step 1 and Step 2 CK exams included in this study. The questions spanned 18 distinct medical disciplines, with the largest category being dermatology (6 questions), followed by cardiology and gastroenterology (4 questions each), as shown in Fig. 1A and B. Using prompt engineering, we examined how each ChatGPT version classified question complexity. The classifications showed considerable disagreement, with only about half of the questions (48.7%, 19/38) classified identically by both models; problem-solving items were categorized most consistently (Fig. 1C).

The correctness of each item answered by GPT-4 and GPT-4o is presented in Fig. 1D, with details provided in Supplementary Appendix 1. Notably, the four items answered incorrectly by GPT-4o were also answered incorrectly by GPT-4. GPT-4 achieved an accuracy of 73.4% (95% CI, 57.0% to 85.5%), while GPT-4o achieved a numerically higher accuracy of 89.5% (95% CI, 74.4% to 96.1%); the difference was not statistically significant (P = 0.137) (Fig. 2A). In the subgroup analysis, GPT-4 made errors on recall-type questions, whereas GPT-4o did not. Both versions made errors on Step 1 and Step 2 CK questions, with a higher error rate on Step 2 CK items. These findings are presented graphically in Fig. 2.

Fig. 2.

Fig. 2

Performance Evaluation of GPT-4 and GPT-4 Omni (GPT-4o) on Image-Based USMLE Test Items. A Accuracy comparison between GPT-4 and GPT-4o; B Subgroup analysis according to recall-type errors and Step 1 and Step 2 clinical knowledge (CK). The numerical values represent the number of responses (blue for correct answers and purple for incorrect answers)

In case-based teaching scenarios, GPT-4o demonstrated the ability to analyze and reason through incorrect answers (Supplementary Appendix 1), although human validation and correction remain necessary. The model can also aid instructors in developing lesson plans that address specific clinical knowledge points and in generating relevant test questions, as shown in Supplementary Appendix 2 and in the illustrative sketch below.
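The exact prompts used for these teaching scenarios are those given in the supplementary appendices. Purely as a hypothetical illustration of how such a lesson-plan request might be templated (not the authors' wording), consider the following sketch, in which the placeholder fields and template text are assumptions.

```python
# Hypothetical prompt template for lesson-plan generation; the actual prompts used
# in this study are those in Supplementary Appendix 2, not this sketch.
LESSON_PLAN_PROMPT = (
    "You are assisting a medical educator. Based on the following image-based, "
    "USMLE-style item and its answer explanation, draft a 45-minute case-based "
    "lesson plan covering {topic}. Include: (1) learning objectives, (2) key image "
    "findings to highlight, (3) guided discussion questions, and (4) two new "
    "practice questions with answer keys.\n\n"
    "Item: {item_text}\nCorrect answer: {answer}\nExplanation: {explanation}"
)

# Example of filling the template with placeholder values:
# prompt = LESSON_PLAN_PROMPT.format(topic="acute coronary syndromes",
#                                    item_text="...", answer="C", explanation="...")
```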

Discussion

This study fills an existing gap by exploring the use of ChatGPT on medical image-based questions and in case-based teaching scenarios. Our findings demonstrate that ChatGPT can accurately answer image-based medical questions, with the GPT-4o version demonstrating numerically higher accuracy. Prompt engineering enables ChatGPT to assist in the design of lesson plans, indicating significant potential in medical education. Nevertheless, human verification and correction remain essential. These findings represent a step toward expanding the practical use of advanced ChatGPT versions, which have already demonstrated great potential in medical fields.

Previous studies have widely assessed the performance of ChatGPT on medical questions, predominantly using earlier versions such as GPT-3.5 and GPT-4, which yielded inconsistent accuracy levels [1, 3, 4, 13–15, 17, 26, 27]. However, these studies excluded image-based questions because of the limitations of earlier versions, restricting the potential of ChatGPT in real-world teaching contexts. In response, our study specifically targeted medical image-based questions to fill this gap. The accuracy of ChatGPT in our findings is consistent with previous studies [14–17], while the newer version, GPT-4o, exhibited greater accuracy (nearly 90%) [18]. Although GPT-4o demonstrated higher accuracy than GPT-4 across the image-based items, this difference did not reach statistical significance, likely because of the limited sample size. Nonetheless, the consistent directional trend aligns with recent findings on the evolving capabilities of multimodal LLMs and supports further investigation in larger-scale evaluations.

In addition to evaluating model performance, our findings highlight broader implications for the application of large vision-language models (LVLMs) in medical education—particularly in the areas of assessment design and instructional content development. LVLMs such as GPT-4o may assist in generating multiple-choice questions from image-based course materials, offering opportunities to streamline item development. Furthermore, our study demonstrates that GPT-4o possesses a notable capacity for logical reasoning and analytical processing when evaluating incorrect answers, suggesting its potential utility in developing explanations that reinforce clinical thinking frameworks. However, ensuring the clinical validity, cognitive appropriateness, and pedagogical alignment of AI-generated content remains a critical challenge, underscoring the need for expert oversight. While this study focused on interpreting images embedded within question stems, future work could explore more complex formats in which answer options themselves are visual (e.g., electrocardiogram tracings or radiographs). These tasks require fine-grained image discrimination and multimodal reasoning, areas in which current models still face limitations. As such, advancing the visual acuity and contextual understanding of LVLMs is essential to support their integration into high-stakes assessment environments.

Beyond assessment, LVLMs like ChatGPT show promise as instructional aids in case-based teaching. As demonstrated in exploratory prompts (Appendix 1 and 2), the model can support educators in structuring lessons around specific clinical concepts and generating adaptive instructional feedback [27, 28]. With its ability to present information logically and adjust content to learner needs, ChatGPT may enhance personalized learning and curricular efficiency [5, 28, 29]. Our findings, in conjunction with prior research [30, 31], support the growing interest in AI-assisted pedagogy. Nonetheless, limitations persist. While GPT-4o exhibits high coherence and insight in reasoning tasks, errors remain—particularly in nuanced clinical contexts involving medical images. These inaccuracies underscore the importance of human oversight and expert validation to ensure instructional reliability and clinical relevance [27, 32–35]. As such tools continue to evolve, their optimal use will likely depend on integration with faculty-led review and revision mechanisms, ensuring safety, accuracy, and pedagogical value in medical education.

This study has several limitations. First, all model responses were generated via the ChatGPT web interface under default settings rather than through the OpenAI Application Programming Interface (API). Although personalization was disabled and each question was submitted in a newly initiated session to minimize memory effects, API-based deployment would allow for greater control over system parameters and eliminate potential user-specific variability. Second, each question was evaluated only once per model. This approach, consistent with prior LLM evaluation studies, minimized contextual contamination from repeated prompts—particularly relevant in session-based environments—but precluded assessment of intra-model variability. Future studies should consider multi-sample testing under controlled API conditions to examine response stability and reproducibility. Third, the relatively small number of publicly available USMLE-style questions with image content (n = 38) limited the statistical power and generalizability of the findings. Expanding the question pool and including a broader range of visual modalities—such as radiographs and electrocardiograms—would enhance benchmarking rigor. Fourth, the study did not isolate the respective contributions of visual versus textual inputs to model performance. While examples in the Supplementary Appendix suggest engagement with image content, dedicated experimental designs are needed to disentangle multimodal reasoning pathways and assess their relative influence. Finally, the study focused exclusively on GPT-4 and GPT-4o, which were the most accessible and stable vision-capable models at the time of evaluation (late 2024 to early 2025). Comparative studies involving other LVLMs, such as Gemini, LLaVA, or DeepSeek, are warranted to explore model-specific strengths and inform future applications in medical education.
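As a concrete illustration of the multi-sample testing suggested above, repeated queries per item under controlled API conditions could be used to quantify response stability. The helper below is a hypothetical sketch that reuses the illustrative ask_image_item function from the Methods sketch; the repetition count is an arbitrary assumption and none of this reflects the study protocol.

```python
# Hypothetical sketch of the multi-sample stability testing proposed above.
# ask_image_item() is the illustrative helper from the Methods sketch, and the
# repetition count is an arbitrary assumption; neither reflects the study protocol.
from collections import Counter

def response_stability(question_text: str, image_path: str,
                       model: str = "gpt-4o", n_runs: int = 5) -> float:
    """Proportion of repeated runs that agree with the modal answer for one item."""
    answers = [ask_image_item(question_text, image_path, model=model)
               for _ in range(n_runs)]
    modal_answer, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / n_runs
```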

Conclusions

In conclusion, this study demonstrates that ChatGPT, particularly the GPT-4o version, can effectively and accurately complete medical test questions involving images, offering valuable insights into its potential in medical education. The results underscore the ability of ChatGPT models to analyze complex medical scenarios, assist in case-based teaching, and help design coherent lesson plans. However, despite its promising capabilities, the need for human verification and correction remains crucial, as errors in reasoning are still present. These findings contribute to the growing body of knowledge on the application of AI in medical education, paving the way for future advancements in the use of AI to support both teaching and learning in healthcare.

Supplementary Information

Supplementary Material 1. (260.4KB, docx)

Acknowledgements

None.

Clinical trial number

Not applicable.

Abbreviations

AI

Artificial Intelligence

API

Application Programming Interface

ChatGPT

Chat Generative Pre-trained Transformer

CK

Clinical Knowledge

GPT-4o

GPT-4 Omni

LLMs

Large language models

LVLMs

Large vision-language models

USMLE

United States Medical Licensing Examination

Authors’ contributions

All authors participated in design of the study. XY participated in data collection. All authors performed quantitative and qualitative data analysis. All authors wrote, reviewed, and revised the manuscript.

Funding

None.

Data availability

All data analyzed in the study are publicly available at: United States Medical Licensing Examination (https://www.usmle.org/).

Declarations

Ethics approval and consent to participate

This study was deemed exempt by the West China Hospital of Sichuan University Institutional Review Board. The need for consent to participate was waived by the same board, as all data were from publicly available sources. This study adhered to the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Zong H, et al. Large language models in worldwide medical exams: platform development and comprehensive analysis. J Med Internet Res. 2024;26:e66114.
2. Liu M, et al. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J Med Internet Res. 2024;26:e60807.
3. Herrmann-Werner A, et al. Assessing ChatGPT's mastery of Bloom's taxonomy using psychosomatic medicine exam questions: mixed-methods study. J Med Internet Res. 2024;26:e52113.
4. Li DJ, et al. Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan psychiatric licensing examination and in differential diagnosis with multi-center psychiatrists. Psychiatry Clin Neurosci. 2024;78(6):347–52.
5. Wu Z, Li S, Zhao X. The application of ChatGPT in medical education: prospects and challenges. Int J Surg. 2024;111(1):1652–3.
6. Samuel A, Soh M, Jung E. Enhancing reflective practice with ChatGPT: a new approach to assignment design. Med Teach. 2025:1–3.
7. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78–80.
8. Hunter RB, et al. Using ChatGPT to provide patient-specific answers to parental questions in the PICU. Pediatrics. 2024. 10.1542/peds.2024-066615.
9. Masters K, et al. Preparing for artificial general intelligence (AGI) in health professions education: AMEE guide no. 172. Med Teach. 2024;46(10):1258–71.
10. Liu J, et al. The diagnostic ability of GPT-3.5 and GPT-4.0 in surgery: comparative analysis. J Med Internet Res. 2024;26:e54985.
11. Jo E, et al. Assessing GPT-4's performance in delivering medical advice: comparative analysis with human experts. JMIR Med Educ. 2024;10:e51282.
12. Hoppe JM, et al. ChatGPT with GPT-4 outperforms emergency department physicians in diagnostic accuracy: retrospective analysis. J Med Internet Res. 2024;26:e56110.
13. Zhu L, et al. Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)'s ability to interpret radiological images. Int J Surg. 2024;110(7):4096–102.
14. Yaneva V, et al. Examining ChatGPT performance on USMLE sample items and implications for assessment. Acad Med. 2024;99(2):192–7.
15. Patel D, et al. Evaluating prompt engineering on GPT-3.5's performance in USMLE-style medical calculations and clinical scenarios generated by GPT-4. Sci Rep. 2024;14(1):17341.
16. Shieh A, et al. Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE Step 2 CK and clinical case reports. Sci Rep. 2024;14(1):9330.
17. Mihalache A, et al. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2024;46(3):366–72.
18. Bicknell BT, et al. Critical analysis of ChatGPT 4 Omni in USMLE disciplines, clinical clerkships, and clinical skills. JMIR Med Educ. 2024.
19. Waldock WJ, et al. The accuracy and capability of artificial intelligence solutions in health care examinations and certificates: systematic review and meta-analysis. J Med Internet Res. 2024;26:e56532.
20. Arruzza ES, Evangelista CM, Chau M. The performance of ChatGPT-4.0o in medical imaging evaluation: a cross-sectional study. J Educ Eval Health Prof. 2024;21:29.
21. Miao J, et al. Performance of GPT-4 Vision on kidney pathology exam questions. Am J Clin Pathol. 2024;162(3):220–6.
22. Yang Z, et al. Performance of multimodal GPT-4V on USMLE with image: potential for imaging diagnostic support with explanations. medRxiv. 2023:2023.10.26.23297629.
23. Federation of State Medical Boards; National Board of Medical Examiners. USMLE practice materials for Step 1 and Step 2 Clinical Knowledge. Available from: https://www.usmle.org/exam-resources. Cited 2025 Jan 20.
24. Masters K. Ethical use of artificial intelligence in health professions education: AMEE guide no. 158. Med Teach. 2023;45(6):574–84.
25. Masters K, et al. Twelve tips on creating and using custom GPTs to enhance health professions education. Med Teach. 2024;46(6):752–6.
26. Yudovich MS, et al. Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study. J Educ Eval Health Prof. 2024;21:17.
27. Luke W, et al. Is ChatGPT "ready" to be a learning tool for medical undergraduates and will it perform equally in different subjects? Comparative study of ChatGPT performance in tutorial and case-based learning questions in physiology and biochemistry. Med Teach. 2024;46(11):1441–7.
28. Gin BC, et al. Entrustment and EPAs for artificial intelligence (AI): a framework to safeguard the use of AI in health professions education. Acad Med. 2024. 10.1097/ACM.0000000000005930.
29. Wang S, et al. Medical education and artificial intelligence: web of science-based bibliometric analysis (2013–2022). JMIR Med Educ. 2024;10:e51411.
30. Boscardin CK, et al. ChatGPT and generative artificial intelligence for medical education: potential impact and opportunity. Acad Med. 2024;99(1):22–7.
31. Lee H. The rise of ChatGPT: exploring its potential in medical education. Anat Sci Educ. 2024;17(5):926–31.
32. Schuwirth LW, Van der Vleuten CP. Programmatic assessment: from assessment of learning to assessment for learning. Med Teach. 2011;33(6):478–85.
33. Cabral S, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern Med. 2024;184(5):581–3.
34. Preiksaitis C, Rose C. Opportunities, challenges, and future directions of generative artificial intelligence in medical education: scoping review. JMIR Med Educ. 2023;9:e48785.
35. Masters K. Medical teacher's first ChatGPT's referencing hallucinations: lessons for editors, reviewers, and teachers. Med Teach. 2023;45(7):673–5.
