JAMA Pediatr. 2023 Jul 17;177(9):977–979. doi:10.1001/jamapediatrics.2023.2373

Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination

Kristyn Beam 1, Puneet Sharma 2, Bhawesh Kumar 3, Cindy Wang 4, Dara Brodsky 1, Camilia R Martin 5, Andrew Beam 6
PMCID: PMC10352922  PMID: 37459084

Abstract

This Diagnostic/Prognostic Study evaluates the performance of a large language model in generating answers to practice questions for the neonatal-perinatal board examination.


Large language models (LLMs) have demonstrated an impressive mastery of clinical knowledge, including diagnostic medicine,1,2 and have shown good performance on medical licensing examination questions.3 We assessed the performance of a popular LLM on practice questions for the neonatal-perinatal board examination.4

Methods

The institutional review board at Harvard Medical School determined that this study was not human subjects research; therefore, informed consent and institutional review board approval were not required. We extracted all multiple-choice questions from a neonatal-perinatal medicine board examination preparation book.4 After excluding questions containing images or equations, 936 eligible questions remained, and each was categorized according to Bloom's taxonomy as knowledge recall, simple reasoning, or multilogical reasoning.

Each eligible question was entered into ChatGPT version 3.5 by 2 independent raters between February 1 and 14, 2023, to estimate the accuracy and self-agreement of the LLM's responses. We instructed the LLM to "Please select the most appropriate answer and explain your reasoning step-by-step," yielding a selected answer and a short passage of text justifying it. We calculated the percentage of correct answers for each medical topic in addition to overall accuracy. Two board-certified neonatologists (D.B. and C.M.), coeditors of Neonatology Review with extensive experience writing board questions, evaluated a sample of the LLM's justifications from each topic using a standardized rubric previously introduced5 to evaluate LLMs. The rubric assesses alignment with scientific consensus, evidence of correct comprehension and reasoning, presence of incorrect or missing information, and possible bias or harm (Table 2).
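As a minimal sketch of the tabulation described above, the snippet below shows one way to compute per-topic accuracy for each rater and between-rater self-agreement, analogous to the columns of Table 1. The file name and column layout are hypothetical; the study does not describe how the transcribed responses were stored.

```python
import pandas as pd

# Hypothetical layout: one row per question, with the answer key and each
# rater's transcription of the LLM's selected answer.
df = pd.read_csv("llm_board_responses.csv")
# Expected columns: topic, answer_key, rater1_answer, rater2_answer

df["rater1_correct"] = df["rater1_answer"] == df["answer_key"]
df["rater2_correct"] = df["rater2_answer"] == df["answer_key"]
df["self_agreement"] = df["rater1_answer"] == df["rater2_answer"]

# Per-topic percentages, analogous to the topic rows of Table 1.
by_topic = (
    df.groupby("topic")[["rater1_correct", "rater2_correct", "self_agreement"]]
    .mean()
    .mul(100)
    .round(1)
)

# Overall percentages, analogous to the bottom row of Table 1.
overall = df[["rater1_correct", "rater2_correct", "self_agreement"]].mean().mul(100).round(1)

print(by_topic)
print(overall)
```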

Results

The LLM generated correct answers for 424 (first rater) and 428 (second rater) of 936 questions (Table 1), though accuracy varied considerably across topics, from a low of 37.5% in gastroenterology (15 of 40) to a high of 78.5% in ethics (11 of 14). The LLM's self-agreement was 86% overall, ranging from 80% in fluids, electrolytes, and nutrition to 100% in ethics. The LLM performed worse on multilogical questions (23 of 68 [33.8%] correct) than on knowledge recall (61 of 110 [55.5%] correct) and basic clinical reasoning questions (56 of 127 [44.1%] correct). The LLM's answer justifications showed evidence of correct comprehension and correct reasoning for 299 of 305 (98.0%) and 292 of 305 (95.7%) evaluated questions, respectively, although the justifications were fully aligned with scientific consensus for only 54% (Table 2). The LLM provided factually incorrect information in 96 of 305 (31.5%) justifications and had errors of omission in 97 of 305 (31.8%). There was little evidence of harm or bias in the justifications (6 and 4 of 305, respectively).

Table 1. Summary of Large Language Model (LLM) Accuracy and Agreement on Neonatal-Perinatal Medicine Board Examination Questions by Topic.

Subject | Total No. of questions | Eligible questions after exclusion | Correct answers, mean of 2 raters, % | Correct answers, first rater, No. (%) | Correct answers, second rater, No. (%) | LLM self-agreement, %
Cardiology | 57 | 51 | 42.2 | 23 (45.0) | 20 (39.2) | 86.3
Endocrinology | 76 | 76 | 42.8 | 32 (42.1) | 33 (43.4) | 88.0
Ethics | 14 | 14 | 78.5 | 11 (78.5) | 11 (78.5) | 100
Fluids | 107 | 102 | 45.1 | 48 (47.1) | 44 (43.1) | 80.4
Genetics | 86 | 82 | 45.1 | 37 (45.1) | 37 (45.1) | 85.4
Gastroenterology | 45 | 40 | 38.8 | 16 (40.0) | 15 (37.5) | 92.5
Hematology | 64 | 61 | 40.9 | 24 (39.3) | 26 (42.6) | 86.9
Infectious diseases | 110 | 104 | 43.3 | 42 (40.4) | 48 (46.2) | 82.7
Metabolism | 33 | 30 | 43.3 | 13 (43.3) | 13 (43.3) | 86.7
Maternal-fetal medicine | 139 | 136 | 47.1 | 66 (48.5) | 62 (45.6) | 88.2
Neurology | 97 | 93 | 52.2 | 44 (47.3) | 53 (57.0) | 84.9
Pharmacology | 71 | 68 | 51.5 | 36 (52.9) | 34 (50.0) | 85.3
Respiratory | 87 | 81 | 39.5 | 32 (39.5) | 32 (39.5) | 90.1
Overall | 986 | 936 | 45.4 | 424 (45.3) | 428 (45.7) | 86.2

Table 2. Correctness, Completeness, and Potential for Harm and Bias in the Text Produced by the Large Language Model to Justify Answer Selection^a

Subject | Scientific consensus | Evidence of correct comprehension | Correct reasoning | Incorrect reasoning | Inappropriate information | Missing information | Possible bias | Possible harm
Cardiology (N = 24) | 50.0 | 91.7 | 91.7 | 50.0 | 20.8 | 33.3 | 0 | 4.2
Endocrinology (N = 23) | 87.0 | 100 | 100 | 30.4 | 34.8 | 39.1 | 4.3 | 0
Ethics (N = 28) | 82.1 | 100 | 100 | 10.7 | 7.1 | 14.3 | 3.6 | 0
Fluids (N = 26) | 50.0 | 100 | 100 | 23.1 | 46.2 | 34.6 | 0 | 3.8
Gastroenterology (N = 22) | 63.6 | 100 | 90.9 | 59.1 | 40.9 | 31.8 | 9.1 | 9.1
Genetics (N = 28) | 35.7 | 100 | 85.7 | 46.4 | 39.3 | 32.1 | 0 | 0
Hematology (N = 24) | 41.7 | 87.5 | 95.8 | 58.3 | 50.0 | 50.0 | 0 | 0
Infectious diseases (N = 23) | 60.9 | 100 | 95.7 | 26.1 | 26.1 | 34.8 | 0 | 4.3
Maternal-fetal medicine (N = 21) | 52.4 | 100 | 100 | 33.3 | 23.8 | 14.3 | 0 | 0
Metabolism (N = 22) | 36.4 | 100 | 95.5 | 50.0 | 40.9 | 36.4 | 0 | 0
Neurology (N = 23) | 52.2 | 100 | 95.7 | 17.4 | 30.4 | 34.8 | 0 | 4.3
Pharmacology (N = 17) | 58.8 | 94.1 | 94.1 | 41.2 | 35.3 | 47.1 | 0 | 0
Respiratory (N = 24) | 33.3 | 100 | 100 | 45.8 | 16.7 | 16.7 | 0 | 0
Overall | 54.2 | 97.9 | 95.8 | 37.8 | 31.7 | 32.3 | 1.3 | 2.0
Knowledge recall (N = 127/305) | 66.9 | 99.2 | 98.4 | 39.4 | 28.3 | 29.9 | 0.8 | 3.1
Simple reasoning (N = 110/305) | 50.9 | 98.2 | 96.4 | 20.9 | 23.6 | 16.4 | 1.8 | 0
Multilogical reasoning (N = 68/305) | 35.3 | 95.6 | 89.7 | 60.3 | 50.0 | 60.3 | 1.5 | 2.9
^a Two board-certified neonatologists (D.B. and C.M.) with board question writing experience evaluated the percentage of the language model's justifications that aligned with each category shown; all values are percentages.

Discussion

LLMs represent a new paradigm for medical artificial intelligence.6 Clinicians, educators, and trainees are eager to integrate these tools into clinical workflows to improve the delivery of care. However, our results suggest that LLMs currently cannot reliably answer neonatal-perinatal board examination practice questions and do not consistently provide explanations that align with scientific and medical consensus. Given that the LLM generated correct answers for only approximately 46% of the questions in our study, it is unlikely that it would pass the actual neonatal-perinatal medicine board examination, and it should not be considered a reliable resource for maintenance of certification examination preparation. We acknowledge potential bias in our evaluation, as the authors who rated the LLM's generated responses also coedited the practice questions. The pace of progress in artificial intelligence systems is rapid; as LLM capabilities change, their proficiency in neonatal-perinatal medicine will need to be reevaluated.

Supplement.

Data sharing statement

References

1. Schulman J, Zoph B, Kim C, et al. ChatGPT: optimizing language models for dialogue. OpenAI. Published November 30, 2022. Accessed February 28, 2023. https://chat.openai.com
2. Levine DM, Tuwani R, Kompa B, et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. medRxiv. Published online February 1, 2023. doi:10.1101/2023.01.30.23285067
3. Kung TH, Cheatham M, Medinilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198
4. Morton S, Ehret D, Ghanta S, Sajti E, Walsh B. In: Brodsky D, Martin CR, eds. Neonatology Review: Q & A. 3rd ed. Lulu; 2015.
5. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv. Published online December 26, 2022. https://arxiv.org/abs/2212.13138
6. Finlayson SG, Beam AL, van Smeden M. Machine learning and statistics in clinical research articles-moving past the false dichotomy. JAMA Pediatr. 2023;177(5):448-450. doi:10.1001/jamapediatrics.2023.0034
