Abstract
This Diagnostic/Prognostic Study evaluates the performance of a large language model in generating answers to practice questions for the neonatal-perinatal board examination.
Large language models (LLMs) have demonstrated an impressive mastery of clinical knowledge, including diagnostic medicine,1,2 and have shown good performance on medical licensing examination questions.3 We assessed the performance of a popular LLM on practice questions for the neonatal-perinatal board examination.4
Methods
The institutional review board at Harvard Medical School determined that this study was not human subjects research; therefore, institutional review board approval and informed consent were not required. We extracted all multiple-choice questions from a neonatal-perinatal medicine board examination preparation book.4 Of the 986 extracted questions, 936 were eligible after excluding those containing images or equations, and each eligible question was categorized according to the Bloom taxonomy as requiring knowledge recall, simple reasoning, or multilogical reasoning.
Each eligible question was entered into ChatGPT (version 3.5) by 2 independent raters between February 1 and 14, 2023, to estimate the accuracy and self-agreement of the LLM’s responses. We prompted the LLM to “Please select the most appropriate answer and explain your reasoning step-by-step,” yielding a selected answer and a short passage of text justifying that answer. We calculated the percentage of correct answers for each medical topic as well as overall accuracy. Two board-certified neonatologists (D.B. and C.M.) with extensive board question-writing experience, who are also coeditors of Neonatology Review, evaluated a subset of 305 justifications generated by the LLM (17-28 per topic; Table 2) using a standardized rubric previously introduced to evaluate LLMs.5 The rubric assesses alignment with scientific consensus, evidence of correct comprehension and reasoning, incorrect reasoning, inappropriate or missing information, and possible bias or harm.
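To make the scoring arithmetic concrete, the following is a minimal Python sketch of how per-rater accuracy and self-agreement could be computed from 2 independent runs; it is an illustration under our own assumptions, not the study’s code, and the records, topic names, and field layout are hypothetical.

```python
# Illustrative sketch (not the study's code): per-topic accuracy for each of
# the 2 runs and the LLM's self-agreement (how often the 2 runs selected the
# same answer choice). All records below are hypothetical.
from collections import defaultdict

# Each record: (topic, correct_choice, run1_choice, run2_choice)
responses = [
    ("cardiology", "B", "B", "C"),
    ("cardiology", "D", "A", "A"),
    ("ethics", "C", "C", "C"),
]

by_topic = defaultdict(list)
for topic, key, run1, run2 in responses:
    by_topic[topic].append((key, run1, run2))

for topic, rows in sorted(by_topic.items()):
    n = len(rows)
    acc1 = sum(key == r1 for key, r1, _ in rows) / n     # run 1 accuracy
    acc2 = sum(key == r2 for key, _, r2 in rows) / n     # run 2 accuracy
    agree = sum(r1 == r2 for _, r1, r2 in rows) / n      # self-agreement
    print(f"{topic}: run 1 {acc1:.1%}, run 2 {acc2:.1%}, agreement {agree:.1%}")
```

In this formulation, self-agreement is simply the proportion of questions for which the 2 runs selected the same answer choice, regardless of whether that answer was correct.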
Results
The LLM generated correct answers for 424 of 936 questions (45.3%) in the first rater’s run and 428 of 936 (45.7%) in the second rater’s run (Table 1). Accuracy varied considerably across topics, with mean accuracy ranging from 38.8% in gastroenterology to 78.5% in ethics (11 of 14 for both raters). The LLM’s self-agreement was 86.2% overall, ranging from 80.4% in fluids, electrolytes, and nutrition to 100% in ethics. The LLM performed worse on multilogical reasoning questions (23 of 68 [33.8%] correct) than on knowledge recall (56 of 127 [44.1%] correct) and simple reasoning questions (61 of 110 [55.5%] correct). The LLM’s answer justifications showed evidence of correct comprehension for 299 of 305 questions (98.0%) and correct reasoning for 292 of 305 (95.7%), although the justifications were fully aligned with the scientific consensus for only 165 of 305 (54.1%) (Table 2). The LLM provided factually incorrect information in 96 of 305 justifications (31.5%) and made errors of omission in 97 of 305 (31.8%). There was little evidence of harm or bias in the justifications (6 and 4 of 305, respectively).
Table 1. Summary of Large Language Model (LLM) Accuracy and Agreement on Neonatal-Perinatal Medicine Board Examination Questions by Topic.
| Subject | Total No. of questions | Eligible questions after exclusion | Correct answers, mean of 2 raters, % | Correct answers, first rater, No. (%) | Correct answers, second rater, No. (%) | LLM self-agreement, % |
|---|---|---|---|---|---|---|
| Cardiology | 57 | 51 | 42.2 | 23 (45.0) | 20 (39.2) | 86.3 |
| Endocrinology | 76 | 76 | 42.8 | 32 (42.1) | 33 (43.4) | 88.0 |
| Ethics | 14 | 14 | 78.5 | 11 (78.5) | 11 (78.5) | 100 |
| Fluids, electrolytes, and nutrition | 107 | 102 | 45.1 | 48 (47.1) | 44 (43.1) | 80.4 |
| Genetics | 86 | 82 | 45.1 | 37 (45.1) | 37 (45.1) | 85.4 |
| Gastroenterology | 45 | 40 | 38.8 | 16 (40.0) | 15 (37.5) | 92.5 |
| Hematology | 64 | 61 | 40.9 | 24 (39.3) | 26 (42.6) | 86.9 |
| Infectious diseases | 110 | 104 | 43.3 | 42 (40.4) | 48 (46.2) | 82.7 |
| Metabolism | 33 | 30 | 43.3 | 13 (43.3) | 13 (43.3) | 86.7 |
| Maternal-fetal medicine | 139 | 136 | 47.1 | 66 (48.5) | 62 (45.6) | 88.2 |
| Neurology | 97 | 93 | 52.2 | 44 (47.3) | 53 (57.0) | 84.9 |
| Pharmacology | 71 | 68 | 51.5 | 36 (52.9) | 34 (50.0) | 85.3 |
| Respiratory | 87 | 81 | 39.5 | 32 (39.5) | 32 (39.5) | 90.1 |
| Overall | 986 | 936 | 45.4 | 424 (45.3) | 428 (45.7) | 86.2 |
Table 2. Correctness, Completeness, and Potential for Harm and Bias in the Text Produced by the Large Language Model to Justify Answer Selection.

| Subject | Scientific consensus | Evidence of correct comprehension | Correct reasoning | Incorrect reasoning | Inappropriate information | Missing information | Possible bias | Possible harm |
|---|---|---|---|---|---|---|---|---|
| Cardiology (N = 24) | 50.0 | 91.7 | 91.7 | 50.0 | 20.8 | 33.3 | 0 | 4.2 |
| Endocrinology (N = 23) | 87.0 | 100 | 100 | 30.4 | 34.8 | 39.1 | 4.3 | 0 |
| Ethics (N = 28) | 82.1 | 100 | 100 | 10.7 | 7.1 | 14.3 | 3.6 | 0 |
| Fluids, electrolytes, and nutrition (N = 26) | 50.0 | 100 | 100 | 23.1 | 46.2 | 34.6 | 0 | 3.8 |
| Gastroenterology (N = 22) | 63.6 | 100 | 90.9 | 59.1 | 40.9 | 31.8 | 9.1 | 9.1 |
| Genetics (N = 28) | 35.7 | 100 | 85.7 | 46.4 | 39.3 | 32.1 | 0 | 0 |
| Hematology (N = 24) | 41.7 | 87.5 | 95.8 | 58.3 | 50.0 | 50.0 | 0 | 0 |
| Infectious diseases (N = 23) | 60.9 | 100 | 95.7 | 26.1 | 26.1 | 34.8 | 0 | 4.3 |
| Maternal-fetal medicine (N = 21) | 52.4 | 100 | 100 | 33.3 | 23.8 | 14.3 | 0 | 0 |
| Metabolism (N = 22) | 36.4 | 100 | 95.5 | 50.0 | 40.9 | 36.4 | 0 | 0 |
| Neurology (N = 23) | 52.2 | 100 | 95.7 | 17.4 | 30.4 | 34.8 | 0 | 4.3 |
| Pharmacology (N = 17) | 58.8 | 94.1 | 94.1 | 41.2 | 35.3 | 47.1 | 0 | 0 |
| Respiratory (N = 24) | 33.3 | 100 | 100 | 45.8 | 16.7 | 16.7 | 0 | 0 |
| Overall | 54.2 | 97.9 | 95.8 | 37.8 | 31.7 | 32.3 | 1.3 | 2.0 |
| Knowledge recall (N = 127/305) | 66.9 | 99.2 | 98.4 | 39.4 | 28.3 | 29.9 | 0.8 | 3.1 |
| Simple reasoning (N = 110/305) | 50.9 | 98.2 | 96.4 | 20.9 | 23.6 | 16.4 | 1.8 | 0 |
| Multilogical reasoning (N = 68/305) | 35.3 | 95.6 | 89.7 | 60.3 | 50.0 | 60.3 | 1.5 | 2.9 |
All values are percentages of the justifications evaluated in each row. Two board-certified neonatologists (D.B. and C.M.) with board question-writing experience rated the percentage of the language model’s justifications that aligned with each category shown.
Discussion
LLMs represent a new paradigm for medical artificial intelligence.6 Clinicians, educators, and trainees are eager to integrate these tools into clinical workflows to improve the delivery of care. However, our results suggest that LLMs currently lack the ability to reliably answer neonatal-perinatal board examination questions and do not consistently provide explanations that are aligned with the scientific and medical consensus. Given that the LLM generated correct answers for only 46% of the questions in our study, it is unlikely that it would pass the actual neonatal-perinatal medicine board examination, and it should not be considered a reliable resource for board examination or maintenance of certification preparation. We acknowledge a potential bias in our evaluation: the authors who evaluated the LLM’s generated responses also coedited the practice questions. The pace of progress in artificial intelligence systems is rapid, and as LLM capabilities evolve, their proficiency in neonatal-perinatal medicine will need to be reevaluated.
References
- 1. Schulman J, Zoph B, Kim C, et al. ChatGPT: optimizing language models for dialogue. OpenAI. Published November 30, 2022. Accessed February 28, 2023. https://chat.openai.com
- 2. Levine DM, Tuwani R, Kompa B, et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. medRxiv. Published online February 1, 2023. doi:10.1101/2023.01.30.23285067
- 3. Kung TH, Cheatham M, Medinilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198
- 4. Morton S, Ehret D, Ghanta S, Sajti E, Walsh B. In: Brodsky D, Martin CR, eds. Neonatology Review: Q & A. 3rd ed. Lulu; 2015.
- 5. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv. Published online December 26, 2022. https://arxiv.org/abs/2212.13138
- 6. Finlayson SG, Beam AL, van Smeden M. Machine learning and statistics in clinical research articles-moving past the false dichotomy. JAMA Pediatr. 2023;177(5):448-450. doi:10.1001/jamapediatrics.2023.0034