BMC Medical Education
. 2026 Feb 7;26:399. doi: 10.1186/s12909-026-08781-z

Artificial intelligence in medical ethics education: a descriptive study of eight models in multiple choice question generation

John Obeid 1, Christopher Bobier 1,, Alex Gillham 2, Adam Omelianchuk 3, Daniel Hurst 4
PMCID: PMC12977604  PMID: 41654925

Abstract

Introduction

Integrating artificial intelligence (AI) into medical education presents both barriers and opportunities for medical educators. One of those opportunities is content creation for medical ethics assessment. Assessing the performance of AI in generating multiple choice questions (MCQs) in ethics that align with the United States Medical Licensing Examination (USMLE) is therefore important. The present study evaluates the performance of eight AI models—GPT-4 Turbo, GPT-3.5 Turbo 0125, o1 Mini, o1 Preview, GPT-4, Claude 3.5 Sonnet, Claude 3 Opus, and Gemini—with regard to the relevance, clarity, and accuracy of the ethics-based MCQs they generate.

Methods

Each of the eight models was tasked with generating two question-answer-explanation sets for each of 13 selected USMLE-aligned student learning outcomes related to medical ethics. Responses were rated by four independent experts in medical ethics on relevance, clarity, and accuracy. Performance metrics were assessed using mean, median, range, and total scores, with an overall percentage score computed for each model. Qualitative comments noting strengths and weaknesses were also collected.

Results

Claude 3.5 Sonnet had the highest overall performance (86.28%), accuracy (85.57%), and clarity (91.73%). o1 Mini followed closely with an overall score of 85.19%, while GPT-4 had the highest rating for relevance (88.65%) and a total score of 84.87%. GPT-4 Turbo and o1 Preview had the lowest overall score at 76.79%. Identified weaknesses included incorrect answer selections, answer sets lacking the best answer, and questions that were not ethics-based.

Conclusion

The findings indicate clear potential for AI to assist medical educators tasked with designing medical ethics MCQs for teaching and assessment. However, expert oversight is needed to ensure proper utilization.

Keywords: Ethics education, Artificial intelligence, Multiple choice question creation, Educational technology

Background

An integral component of the evaluation of student learning and comprehension in undergraduate medical education is the multiple-choice examination [1–3]. However, crafting high-quality multiple-choice questions (MCQs) is both time-consuming and challenging, as it requires ensuring that each question is appropriately structured, aligned with learning objectives, and free from ambiguity, cueing, or bias [4, 5]. Given the effort required for MCQ development, researchers have been exploring the potential of generative artificial intelligence (AI) to assist in question creation. However, findings on AI-generated MCQs have been mixed [6–8]. Rezigalla [9] reported that AI-generated questions were frequently rated as overly simplistic by subject matter experts, while Ngo et al. [10] found that one AI model produced 41 incorrect questions—including flawed answers and explanations—out of a total of 60. By contrast, Klang et al. [11] found that only 15% of 210 generated questions needed revision. The quality and accuracy of AI-generated MCQs for medical education thus remain a topic of ongoing research.

Medical ethics is a critical component of undergraduate medical education, with specific learning objectives outlined in both the United States Medical Licensing Examination (USMLE) Content Outlines and the Comprehensive Osteopathic Medical Licensing Examination (COMLEX-USA) Blueprint, the board examinations that medical students in the United States must pass in order to achieve medical licensure [12, 13]. However, generating high-quality MCQs for medical ethics presents unique challenges, as ethical reasoning is often perceived as more subjective and less matter-of-fact [2]. While considerable research has examined AI-generated MCQs in clinical and biomedical domains, much of this work has focused on medical knowledge rather than ethical knowledge or reasoning [14, 15]. Additionally, emerging evidence suggests that AI models do not perform as well on medical ethics MCQs as on medical knowledge MCQs [16]. Khan et al. [17] found that GPT-3.5 and GPT-4 performed below the average student level on USMLE-style ethics MCQs. Similarly, Balas et al. [18] found that GPT-4 struggled to interpret ethical nuances and correctly apply ethical principles. These studies suggest that AI models may have difficulty generating useful ethics MCQs for medical education. However, no research has specifically examined how AI models perform when tasked with generating USMLE-style medical ethics MCQs.

As AI continues to be integrated into medical education [19], it is essential to assess its capabilities and limitations in supporting MCQ development. Recognizing the potential benefits and challenges of using AI for ethics question generation, this study aims to compare the performance of eight models in generating USMLE-style medical ethics MCQs, including their corresponding answer choices and explanations.

Methods

The eight models evaluated here were selected for three reasons. First, some require subscriptions with variable access (such as GPT-4 Turbo), while others are free and widely available (such as o1 Mini). This is relevant for medical educators who may need to determine whether investing in a paid AI model yields substantial benefits in generated MCQ quality and alignment with educational standards. Second, some of the models are generative pre-trained transformer (GPT) models, which are trained on large text corpora and generate content from prompts; OpenAI labels these models with the acronym “GPT” in the name. Others are reasoning-based models trained with reinforcement learning: these models spend more time on complex problem solving and therefore specialize in nuanced or step-wise reasoning. The OpenAI models of this type have “o [number]” in the name. Our study therefore offers insight into both types of AI model vis-à-vis MCQ generation for medical educators. Third, the selection encompasses both commonly studied OpenAI models and models that have been less frequently examined in the literature, thereby ensuring a more representative sample.

While other prominent AI models such as Llama 3, DeepSeek, and Mistral exist, they were excluded due to limited public accessibility, licensing or deployment constraints, and/or release timing relative to the study period. By focusing on models that were both accessible and widely used, we ensured the study maintains contemporary relevance and practical applicability for medical educators.

Study design

This study employed a comparative research design to evaluate the performance of eight AI models—GPT-4 Turbo, GPT-3.5 Turbo 0125, o1 Mini, o1 Preview, GPT-4, Claude 3.5 Sonnet, Claude 3 Opus, and Gemini—in generating multiple-choice questions (MCQs) on medical ethics that align with USMLE learning outcomes [12].

The study was conducted over a four-month period (November 2024–March 2025). Two authors (JO and CB) selected 13 student learning objectives posted by the Federation of State Medical Boards and the National Board of Medical Examiners for preparation for the STEP 2 examination as the basis for the AI-generated MCQs [12] (Table 1). Each of the eight AI models was then tasked with generating two distinct MCQs per learning outcome, resulting in 26 questions per model and 208 questions overall. To ensure consistency and minimize variability in AI responses, an identical prompt was used across all eight models in the following format:

“You are a medical school professor. You have many years of teaching medical students and developing NBME-style exam questions for the curriculum. Write two questions on [USMLE Student Learning Objective]. Provide the NBME-style question, each answer choice and their corresponding letters, followed by a new line, then a thorough explanation of the answer choice:”

Table 1.

Themes supported by the USMLE STEP 2 examination learning objectives

No. Learning Outcome Description
1 Informed Consent for Research Understand ethical basis for obtaining informed consent in research: elements include disclosure, voluntariness, and patient comprehension.
2 Determination of Medical Decision-Making Capacity Assess a patient’s ability to understand, communicate, appreciate, and reason in order to participate in medical decision-making
3 The Duty to Report an Impaired Physician Recognize ethical responsibility to report colleagues with impairments affecting patient care.
4 Negligence/Malpractice Identify the elements of medical malpractice: duty, breach, causation, and harm.
5 Physician-Assisted Death/Euthanasia Be able to distinguish physician-assisted death and euthanasia. Understand the ethical status of physician-assisted death, including physician responsibilities.
6 Involuntary Admission Identify criteria for involuntary psychiatric hospitalization, including risk to self/others and inability to care for oneself.
7 Legal Requirements for Reporting Abuse or Neglect Recognize mandatory reporting laws for child abuse, elder abuse, and intimate partner violence, and appropriate steps for reporting.
8 Brain Death Diagnosing and pronouncing brain death.
9 Advance Directives Understand types of advance directives (living wills, durable power of attorney) and their role in guiding end-of-life care.
10 Negligence Differentiate negligence from malpractice and understand standards of care and duty to the patient.
11 Boundaries in the Physician-Patient Relationship Recognize inappropriate behaviors, conflicts of interest, and ethical guidelines for maintaining professional boundaries.
12 Organ Donation Understand ethical principles for organ donation, including consent, allocation, and brain death criteria.
13 Shared Medical Decision-Making and Medical Proxies Apply principles of shared decision-making, patient autonomy, and the role of medical proxies in cases where patients cannot decide for themselves.

A Python script was developed in Microsoft Visual Studio Code to automate the data collection process. The script systematically retrieved each prompt from the spreadsheet, executed Application Programming Interface (API) calls to the respective language models, and stored the generated responses in separate text files. To address potential errors resulting from prompt overload, a retry mechanism with a delay interval was implemented, allowing the script to reattempt failed API calls. The primary purpose of this retry mechanism is to handle errors and empty responses from API calls. Each function continuously attempts to retrieve a response until a valid answer is returned; if an API call fails due to network errors, rate limiting, or an empty response, the script pauses for a predefined delay (default 2 s) before retrying. Loops with “if/else” checks allow the script to retry API calls that return error messages, and once the LLM returns a response without an error message the loop exits and the script proceeds. This approach addresses failures directly within the automated data collection process and provides robustness against transient failures while preventing indefinite blocking of data collection. The automated methodology also ensures consistency in question formatting and phrasing, while minimizing biases and errors associated with manual input [20, 21]. Table 1 lists the themes supported by the USMLE STEP 2 examination learning objectives [12].
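As a minimal illustration of the retry-and-store workflow described above (not the authors' actual script), the sketch below assumes the OpenAI Python SDK (v1.x) and pandas; the spreadsheet name, column name, model identifier, and output file naming are hypothetical, and equivalent client calls would be needed for the Anthropic and Google models.

```python
import time
import pandas as pd
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are a medical school professor. You have many years of teaching medical "
    "students and developing NBME-style exam questions for the curriculum. "
    "Write two questions on {slo}. Provide the NBME-style question, each answer "
    "choice and their corresponding letters, followed by a new line, then a "
    "thorough explanation of the answer choice:"
)

def generate_with_retry(model: str, prompt: str, delay: float = 2.0) -> str:
    """Call the model and retry on errors or empty responses,
    pausing for a fixed delay (default 2 s) between attempts."""
    while True:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            text = response.choices[0].message.content
            if text:  # non-empty response: exit the retry loop
                return text
        except Exception as exc:  # network errors, rate limits, etc.
            print(f"API call failed ({exc}); retrying in {delay}s")
        time.sleep(delay)

# Illustrative driver: read SLOs from a spreadsheet and store each response as text.
slos = pd.read_excel("learning_objectives.xlsx")["Learning Outcome"]  # hypothetical file/column
for i, slo in enumerate(slos, start=1):
    output = generate_with_retry("gpt-4", PROMPT_TEMPLATE.format(slo=slo))
    with open(f"gpt-4_slo{i:02d}.txt", "w", encoding="utf-8") as fh:
        fh.write(output)
```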

Data collection and analysis

Each set of AI-generated responses was standardized to ensure consistency in formatting and presentation. Each standardized response set contained the model name and its 26 generated MCQs. The eight sets were then distributed to the reviewers, along with an Excel-based grading rubric and a detailed explanation of the Likert scale scoring criteria.

Four independent reviewers (CB, DH, AO, AG) then evaluated the responses using a structured assessment framework. Each response was assessed across three key domains—relevance, accuracy, and clarity—using a 5-point Likert scale (Table 2). Relevance assessed the degree to which the MCQ aligned with USMLE STEP 2 exam content, specifically whether it addressed a medical ethics topic that medical students are expected to know and may be tested on. Accuracy evaluated whether the MCQ was factually correct, evidence-based, and consistent with current medical guidelines and board examination standards. Clarity measured how well the MCQ was written, assessing whether it was unambiguous, easy to understand, and required no further revision for clarity. If the generated content was judged to be outside the scope of ethics, it received a 1 for relevance; if a model did not generate 26 outputs, it received 1s for each category for each missing output. Reviewers also had the option to provide qualitative feedback, noting specific strengths and weaknesses of each question.

Table 2.

The Likert scale (scores of 1–5) for relevance, accuracy, and clarity

Relevance

Score 1 (Not Relevant): The question is unrelated to board exam content or preparation and does not contribute meaningfully to the learning process.

Score 2 (Minimally Relevant): The question has limited applicability to board exams and may not significantly aid in preparation.

Score 3 (Moderately Relevant): The question has some value for board preparation but focuses on less commonly tested material or is tangentially related.

Score 4 (Relevant): The question is appropriate and helpful for board exam preparation but might be slightly less central or specific.

Score 5 (Highly Relevant): The question aligns closely with board exam content and emphasizes critical knowledge or skills necessary for success.

Accuracy

Score 1 (Not Accurate): The question is entirely incorrect, misleading, or contradicts current medical knowledge and board exam standards.

Score 2 (Minimally Accurate): The question contains significant inaccuracies or inconsistencies, making it unsuitable without major revisions.

Score 3 (Moderately Accurate): The question has some accurate elements but includes notable errors or outdated information that could mislead students.

Score 4 (Mostly Accurate): The question is accurate but may contain minor details that could benefit from clarification or slight adjustment.

Score 5 (Completely Accurate): The question is entirely correct, evidence-based, and aligns perfectly with current medical guidelines and board exam standards.

Clarity

Score 1 (Very Unclear): The question is extremely difficult to understand or interpret, requiring major revisions to be usable.

Score 2 (Unclear): The question is poorly written, with significant ambiguities or unclear phrasing that make it difficult to understand.

Score 3 (Moderately Clear): The question is somewhat understandable but contains some ambiguities or complex wording that might confuse students.

Score 4 (Clear): The question is well-written and understandable, though a minor refinement could enhance clarity.

Score 5 (Very Clear): The question is exceptionally well-written, unambiguous, and easy to understand, with no need for clarification.

To ensure consistency in evaluation, all reviewers met specific expertise criteria. Three reviewers (CB, DH, AO) are employed in undergraduate medical ethics education, while the fourth (AG) has extensive familiarity with USMLE medical ethics education. All reviewers hold a doctorate in philosophy specializing in ethics and are practicing medical or clinical ethicists. Additionally, a standardized review prompt was provided to ensure that ratings were based on USMLE board exam requirements rather than personal interpretations of the correct answer. To further promote consistency, assessments were informed by a shared working familiarity with standard medical ethics texts [21–24], and a pre-review standardization session was held.

Scores were summed to yield a total score for each category (maximum 520: 26 questions × 4 reviewers × a maximum rating of 5) and an overall percentage performance computed against the maximum total of 1,560 (3 categories × 520). Qualitative feedback was assessed for repeated themes. All analyses were descriptive; no inferential statistical tests were conducted, so observed differences between the AI models should be interpreted as descriptive only.
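Stated formally (a restatement of the computation above, where $s_{c,q,r}$ denotes the 1–5 rating assigned by reviewer $r$ to question $q$ in category $c$):

\[
\text{Category score (\%)} = \frac{\sum_{q=1}^{26}\sum_{r=1}^{4} s_{c,q,r}}{520} \times 100,
\qquad
\text{Overall score (\%)} = \frac{\sum_{c=1}^{3}\sum_{q=1}^{26}\sum_{r=1}^{4} s_{c,q,r}}{1560} \times 100.
\]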

Results

Relevance, clarity, and accuracy

Interrater reliability was assessed using the intraclass correlation coefficient (ICC) for the three coding categories. ICC values were 0.74 for relevance and 0.78 for accuracy, indicating good agreement among raters. The ICC for clarity was 0.67, reflecting moderate but still acceptable agreement. See Table 3 for results.
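The specific ICC form used is not stated above; as an illustrative sketch only, the snippet below shows how such coefficients could be computed from a long-format rating table using the pingouin Python library (the file name and column names are hypothetical).

```python
import pandas as pd
import pingouin as pg  # pip install pingouin

# Hypothetical long-format data: one row per (question, reviewer) pair,
# with the 1-5 Likert ratings stored in separate columns per category.
ratings = pd.read_csv("ratings_long.csv")  # columns: question_id, reviewer, relevance, accuracy, clarity

for category in ["relevance", "accuracy", "clarity"]:
    icc = pg.intraclass_corr(
        data=ratings,
        targets="question_id",   # the rated items (208 MCQs)
        raters="reviewer",       # the four reviewers
        ratings=category,        # the rating column for this category
    )
    # pingouin reports several ICC forms (ICC1, ICC2, ICC3 and their k-rater
    # versions); the form matching the study design would be selected here.
    print(category)
    print(icc[["Type", "ICC", "CI95%"]])
```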

Table 3.

Total percentage score for each category (a total of 520 possible) and overall performance (a total of 1560)

Model Overall Performance Relevance Clarity Accuracy
Claude 3.5 Sonnet 86.28% 81.53% 91.73% 85.57%
o1 Mini 85.19% 84.23% 88.46% 83.46%
GPT-4 84.87% 88.65% 90.76% 75.19%
Claude 3 Opus 84.42% 85.19% 89.03% 79.03%
Gemini 81.85% 85.76% 86.34% 73.46%
GPT-3.5 Turbo 0125 81.79% 85.96% 84.42% 75.00%
GPT-4 Turbo 76.79% 79.23% 80.38% 70.96%
o1 Preview 76.79% 81.92% 77.88% 70.57%

Strengths and weaknesses

Reviewers noted certain strengths. Across all eight AI models, reviewers found that the majority of generated MCQs were relevant, accurate, and clear. Reviewers agreed that, in general, the generated content was useful, pitched at an appropriate level for undergraduate medical ethics education, and aligned well with USMLE learning outcomes, though they also agreed that outputs still require subject matter expert oversight and some refinement before use.

Reviewers noted several weaknesses, however. All four reviewers noted that certain questions and outputs were not in the domain of ethics. For instance, Claude 3 Opus generated a question about optimal body temperature in a brain-dead donor, while o1 Mini generated a question on blood type compatibility between organ donor and recipient. GPT-4 Turbo generated a question about how a calcineurin inhibitor helps prevent organ rejection. Each of these questions assesses medical knowledge, not ethical knowledge. Similar considerations applied to questions generated on the topic of malpractice, which seemed to fall in the domain of legal knowledge. Upon review, each model generated 2–4 MCQs that fell outside the ethics domain, representing approximately 7–15% of the questions generated per model; these questions related to transplantation and malpractice.

All four reviewers noted that certain questions and outputs were relevant, accurate, and clear, but the answer was too obvious, rendering the MCQ inappropriate for use without revision. Either the answer was contained in the question stem or the generated answer set contained clearly false options (e.g., discharging a suicidal patient). GPT-4 generated a question on what a physician should tell a patient asking what an advance directive is and included the following possible answers: “Advance directives are unnecessary. You should trust your doctors to make the right decisions.” and “Advance directives are orders given in advance by doctors to limit treatment.” Claude 3.5 Sonnet generated a question on research ethics asking which element of informed consent is not required and gave the obvious answer, “Guarantee that the new drug will be effective.” Or, consider GPT-4 Turbo’s generated question on informed consent to participate in research, which asks which of the following is most important to ensure before obtaining written consent from a competent adult patient invited to participate in research:

“A. The patient’s family agrees with his decision. B. The patient understands the alternatives to participating in the study. C. The patient’s primary care provider agrees with his participation. D. The patient is compensated for participating. E. The patient has no other medical conditions.”

Reviewers noted that the answer choices allow the correct answer to be reached by easy elimination.

All four reviewers noted that some generated MCQs had an incorrect answer or that the correct (or most correct) answer was not present. For instance, it is widely agreed that soliciting organ donation is the purview of the organ procurement organization, not the deceased patient’s care team [21, 22, 25]. Yet o1 Mini generated a question about next steps for a brain-dead patient and posited the right answer to be for the emergency department physician to “request consent from the patient’s next of kin for organ donation.” Another example, seen across all eight models, concerns the questions and answers generated on physician-assisted death (PAD). The STEP exam is a national exam that does not permit state-level distinctions, and yet some models generated questions asking about the state-level legality of PAD.

Discussion

Ours is the first study to assess the use of AI to generate medical ethics board-style MCQs. The results of our study indicate a potentially useful role of AI in MCQ creation. In general, all eight AI models quickly produced ethics MCQs that aligned with USMLE learning outcomes and were accurate, relevant, and accessible. With proper editing and oversight, our results suggest that these AIs can be thoughtfully utilized in medical course preparation and assessment. This confirms previous research showing the potential usefulness of AI utilization in MCQ and exam construction in non-ethics domains of medical education [26]. Our study indicates that medical educators can thoughtfully utilize AI in ethics-based MCQ generation for students.

However, our study urges caution, much in line with previous research. Medical educators seeking to use AI to assist in MCQ creation must exercise proper caution and oversight [6–8, 14, 26, 27]. A common theme of research on the use of AI for MCQ creation is that expert oversight and review is critical to ensure the correctness and applicability of the generated content. As yet, medical educators cannot uncritically accept generated content; thus, AI is best seen as a tool that helps medical educators create material to work with—rather than having to design MCQs from scratch, an educator can use AI to create something workable, which can then be refined [28].

A benefit of our study is that it offers insight into different AI models. We expected the paid versions to outperform the free versions; however, in our descriptive data, every model performed well, scoring above 75% overall and above 70% in each category—medical educators who exercise caution and diligence can effectively use any of the available versions for ethics-based MCQ generation. It is also important to note that this interpretation is based on descriptive data without statistical testing. Nonetheless, one implication is that medical educators may not benefit substantially from purchasing a paid version for the purpose of assisting in MCQ generation. In addition, we did not find a notable difference in results between generative pre-trained models and reasoning-based models, although further research is needed to compare the models, since this finding rests on descriptive data without statistical modelling. Our results also show comparable performance between OpenAI models and the other AI models.

There are several limitations to consider. Our study focused on the creation of medical ethics MCQs based on 13 USMLE Student Learning Objectives (SLOs). While these SLOs cover key aspects of medical ethics, they do not encompass the full breadth of ethical topics that medical students are expected to master, and while there is significant overlap with COMLEX-USA ethics SLOs, the precise wording differs, which may result in different generated MCQs. Generalizability is therefore restricted to USMLE-aligned ethics questions. This relates to another limitation: generated output depends on the prompting, and we did not engage in prompt refinement. Alternative prompting and prompt refinement would generate different responses, thereby affecting reviewer scoring for each LLM and the generalizability of this kind of study. Accordingly, our assessment of model performance is not a precise estimate of each model’s optimum performance, which is why our study does not allow for meaningful model-to-model comparisons. Further research is needed.

Additionally, we instructed each AI model to generate only two questions per SLO. This limited sample size may not provide a sufficiently robust assessment of each model’s overall ability to create MCQs across different ethical domains. A larger dataset would be necessary to better evaluate consistency, variability, and alignment with expert-validated standards. Further research is needed to corroborate our findings. Another limitation is that our study focused on eight AI models, five of which are offered by a single company (OpenAI). As such, our results cannot be extended to AI models not assessed here or to more recently released models. Whether similar results could be obtained with other and more recent AI models is an area for future research. In addition, while interrater reliability was assessed between reviewers, future methodological work could explore potential variability in rater agreement across specific domains.

Although we minimized bias by using the Python script (similar prompt style each time, no potential errors in copying and pasting, blinding between the four reviewers), a potential bias is that each of the eight sets of outputs included the model name—reviewers knew the name of the AI model whose generated MCQs they were assessing. It is possible that this knowledge affected reviewer ratings.

Finally, one model—o1 Preview—failed to complete the task for SLO 13 (Shared Medical Decision-Making and Medical Proxies), stating that it was “unable to create USMLE questions.” It generated 24 rather than 26 MCQs. Though this is suspected to be a safeguard implemented by OpenAI, the scoring for this model was adjusted accordingly, with a score of 1 given across all three categories of accuracy, relevance, and clarity for the two missing questions. Since the total possible points per category remained consistent across the LLMs, the scoring for failed responses mimicked the scoring of responses with poor accuracy, relevance, and clarity. This highlights a limitation of the study, as it does not distinguish between outright failures and low-quality responses, and it resulted in o1 Preview receiving lower scores in each category and overall than it would have received had it generated the remaining two questions.

Conclusion

This study highlights the potential for AI to assist medical educators in generating ethics-based MCQ drafts. While these tools may streamline question development and provide useful starting points, careful oversight and expert validation remain essential, and AI should not replace expert-created or validated assessment content.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful feedback.

Abbreviations

AI

Artificial intelligence

MCQ

Multiple choice question

LLM

Large language model

USMLE

United States Medical Licensing Examination

COMLEX

Comprehensive Osteopathic Medical Licensing Examination

PAD

Physician-assisted death

API

Application Programming Interface

Authors’ contributions

JO and CB designed the study, participated in the research, and drafted the manuscript; AG, AO, and DH participated in the research and writing. All authors have read the completed manuscript in its present form and agree to its submission.

Funding

None.

Data availability

Data supporting the findings of this study are available from the corresponding author on request.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. French S, Dickerson A, Mulder RA. A review of the benefits and drawbacks of high-stakes final examinations in higher education. High Educ. 2024;88(3):893–918.
2. Epstein RM. Assessment in medical education. N Engl J Med. 2007;356(4):387–96.
3. Al-Wardy NM. Assessment methods in undergraduate medical education. Sultan Qaboos Univ Med J. 2010;10(2):203.
4. Frederiksen N. The real test bias: influences of testing on teaching and learning. Am Psychol. 1984;39(3):193.
5. Collins J. Education techniques for lifelong learning: writing multiple-choice questions for continuing medical education activities and self-assessment modules. Radiographics. 2006;26(2):543–51.
6. Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High rates of fabricated and inaccurate references in ChatGPT-generated medical content. Cureus. 2023;15(5):e39238. 10.7759/cureus.39238.
7. Vaishya R, Misra A, Vaish A. ChatGPT: is this version good for healthcare and research? Diabetes Metab Syndr. 2023;17(4):102744. 10.1016/j.dsx.2023.102744.
8. Kıyak YS, Soylu A, Coşkun Ö, Budakoğlu Iİ, Peker TV. Can ChatGPT generate acceptable case-based multiple-choice questions for medical school anatomy exams? A pilot study on item difficulty and discrimination. Clin Anat. 2025;38(4):505–10.
9. Rezigalla AA. AI in medical education: uses of AI in construction type A MCQs. BMC Med Educ. 2024;24:247. 10.1186/s12909-024-05250-3.
10. Ngo A, Gupta S, Perrine O, Reddy R, Ershadi S, Remick D. ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Acad Pathol. 2024;11(1):100099.
11. Klang E, Portugez S, Gross R, Brenner A, Gilboa M, Ortal T, Ron S, Robinzon V, Meiri H, Segal G. Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Med Educ. 2023;23.
12. Federation of State Medical Boards of the United States and NBME. USMLE Content Outline. 2024. Accessed February 2, 2025 at https://www.usmle.org/sites/default/files/2022-01/USMLE_Content_Outline_0.pdf
13. National Board of Osteopathic Medical Examiners. Professionalism in the Practice of Osteopathic Medicine. Accessed February 2, 2025 at https://www.nbome.org/assessments/comlex-usa/comlex-usa-blueprint/d1-competency-domains/test-specifications-professionalism-in-the-practice-of-osteopathic-medicine/
14. Gordon M, Daniel M, Ajiboye A, Uraiby H, Xu NY, Bartlett R, Hanson J, Haas M, Spadafore M, Grafton-Clarke C, Gasiea RY. A scoping review of artificial intelligence in medical education: BEME guide 84. Med Teach. 2024;46(4):446–70.
15. Aster A, Laupichler MC, Rockwell-Kollmann T, Masala G, Bala E, Raupach T, et al. ChatGPT and other large language models in medical education—scoping literature review. Med Sci Educ. 2025;35(1):555–67.
16. Danehy T, Hecht J, Kentis S, Schechter CB, Jariwala SP. ChatGPT performs worse on USMLE-style ethics questions compared to medical knowledge questions. Appl Clin Inform. 2024;15(05):1049–55.
17. Khan AA, Khan AR, Munshi S, et al. Assessing the performance of ChatGPT in medical ethical decision-making: a comparative study with USMLE-based scenarios. J Med Ethics. Published online first: 25 January 2025. 10.1136/jme-2024-110240.
18. Balas M, Wadden JJ, Hébert PC, et al. Exploring the potential utility of AI large language models for medical ethics: an expert panel evaluation of GPT-4. J Med Ethics. 2024;50:90–6.
19. Kunze KN, Nwachukwu BU, Cote MP, Ramkumar PN. Large language models applied to health care tasks may improve clinical efficiency, value of care rendered, research, and medical education. Arthroscopy. 2025;41(3):547–56.
20. Kıyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024;100(1189):858–65.
21. Lo B. Resolving ethical dilemmas: a guide for clinicians. Lippincott Williams & Wilkins; 2012.
22. Fischer C. Medical ethics for the boards. 3rd ed. McGraw Hill; 2016.
23. Fischer C. Master the boards USMLE step 2 CK. Simon and Schuster; 2019.
24. Toy E, Raine S, Cochrane T. Case files: medical ethics & professionalism. McGraw Hill; 2015.
25. American Medical Association. Code of medical ethics. AMA; 2016.
26. Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354.
27. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J Med Internet Res. 2024;26:e60807.
28. Xu X, Chen Y, Miao J. Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review. J Educ Eval Health Prof. 2024;21.
