JAMA Pediatr. 2023 Jul 17;177(9):977–979. doi:10.1001/jamapediatrics.2023.2373

Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination

Kristyn Beam 1, Puneet Sharma 2, Bhawesh Kumar 3, Cindy Wang 4, Dara Brodsky 1, Camilia R Martin 5, Andrew Beam 6
PMCID: PMC10352922  PMID: 37459084

Abstract

This Diagnostic/Prognostic Study evaluates the performance of a large language model in generating answers to practice questions for the neonatal-perinatal board examination.


Large language models (LLMs) have demonstrated an impressive mastery of clinical knowledge, including diagnostic medicine,1,2 and have shown good performance on medical licensing examination questions.3 We assessed the performance of a popular LLM on practice questions for the neonatal-perinatal board examination.4

Methods

The institutional review board at Harvard Medical School determined that this study was not human subjects research; therefore, informed consent and institutional review board approval were not required. We extracted all multiple-choice questions from a neonatal-perinatal medicine board examination preparation book.4 After excluding questions containing images or equations, 936 eligible questions remained, and each was categorized according to Bloom's taxonomy as knowledge recall, simple reasoning, or multilogical reasoning.

Each eligible question was entered into ChatGPT version 3.5 by 2 independent raters between February 1 and 14, 2023, to estimate the accuracy and self-agreement of the LLM's responses. We instructed the LLM to "Please select the most appropriate answer and explain your reasoning step-by-step," yielding a selected answer and a short passage of text justifying it. We calculated the percentage of correct answers for each medical topic in addition to overall accuracy. Two board-certified neonatologists (D.B. and C.M.), coeditors of Neonatology Review with extensive experience writing board questions, evaluated a sample of the LLM's justifications from each topic using a standardized rubric previously introduced5 to evaluate LLMs. The rubric assesses alignment with scientific consensus, evidence of correct comprehension and reasoning, presence of incorrect or missing information, and possible bias or harm (Table 2).
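As a minimal sketch of the tabulation described above, the snippet below shows one way to compute per-topic accuracy for each rater and between-rater self-agreement, analogous to the columns of Table 1. The file name and column layout are hypothetical; the study does not describe how the transcribed responses were stored.

```python
import pandas as pd

# Hypothetical layout: one row per question, with the answer key and each
# rater's transcription of the LLM's selected answer.
df = pd.read_csv("llm_board_responses.csv")
# Expected columns: topic, answer_key, rater1_answer, rater2_answer

df["rater1_correct"] = df["rater1_answer"] == df["answer_key"]
df["rater2_correct"] = df["rater2_answer"] == df["answer_key"]
df["self_agreement"] = df["rater1_answer"] == df["rater2_answer"]

# Per-topic percentages, analogous to the topic rows of Table 1.
by_topic = (
    df.groupby("topic")[["rater1_correct", "rater2_correct", "self_agreement"]]
    .mean()
    .mul(100)
    .round(1)
)

# Overall percentages, analogous to the bottom row of Table 1.
overall = df[["rater1_correct", "rater2_correct", "self_agreement"]].mean().mul(100).round(1)

print(by_topic)
print(overall)
```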

Results

The LLM generated correct answers for 424 (first rater) and 428 (second rater) of 936 questions (Table 1), though accuracy varied considerably across topics, from a low of 37.5% in gastroenterology (15 of 40) to a high of 78.5% in ethics (11 of 14). The LLM's self-agreement was 86% overall, ranging from 80% in fluids, electrolytes, and nutrition to 100% in ethics. The LLM performed worse on multilogical questions (23 of 68 [33.8%] correct) than on knowledge recall (61 of 110 [55.5%] correct) and basic clinical reasoning questions (56 of 127 [44.1%] correct). The LLM's answer justifications showed evidence of correct comprehension and correct reasoning for 299 of 305 (98.0%) and 292 of 305 (95.7%) evaluated questions, respectively, although the justifications were fully aligned with scientific consensus for only 54% (Table 2). The LLM provided factually incorrect information in 96 of 305 (31.5%) justifications and had errors of omission in 97 of 305 (31.8%). There was little evidence of harm or bias in the justifications (6 and 4 of 305, respectively).

Table 1. Summary of Large Language Model (LLM) Accuracy and Agreement on Neonatal-Perinatal Medicine Board Examination Questions by Topic.

Subject | Total No. of questions | Eligible questions after exclusion | Correct answers, mean of 2 raters, % | Correct answers, first rater, No. (%) | Correct answers, second rater, No. (%) | LLM self-agreement, %
Cardiology | 57 | 51 | 42.2 | 23 (45.0) | 20 (39.2) | 86.3
Endocrinology | 76 | 76 | 42.8 | 32 (42.1) | 33 (43.4) | 88.0
Ethics | 14 | 14 | 78.5 | 11 (78.5) | 11 (78.5) | 100
Fluids | 107 | 102 | 45.1 | 48 (47.1) | 44 (43.1) | 80.4
Genetics | 86 | 82 | 45.1 | 37 (45.1) | 37 (45.1) | 85.4
Gastroenterology | 45 | 40 | 38.8 | 16 (40.0) | 15 (37.5) | 92.5
Hematology | 64 | 61 | 40.9 | 24 (39.3) | 26 (42.6) | 86.9
Infectious diseases | 110 | 104 | 43.3 | 42 (40.4) | 48 (46.2) | 82.7
Metabolism | 33 | 30 | 43.3 | 13 (43.3) | 13 (43.3) | 86.7
Maternal-fetal medicine | 139 | 136 | 47.1 | 66 (48.5) | 62 (45.6) | 88.2
Neurology | 97 | 93 | 52.2 | 44 (47.3) | 53 (57.0) | 84.9
Pharmacology | 71 | 68 | 51.5 | 36 (52.9) | 34 (50.0) | 85.3
Respiratory | 87 | 81 | 39.5 | 32 (39.5) | 32 (39.5) | 90.1
Overall | 986 | 936 | 45.4 | 424 (45.3) | 428 (45.7) | 86.2

Table 2. Correctness, Completeness, and Potential for Harm and Bias in the Text Produced by the Large Language Model to Justify Answer Selection^a

Subject | Scientific consensus | Evidence of correct comprehension | Correct reasoning | Incorrect reasoning | Inappropriate information | Missing information | Possible bias | Possible harm
Cardiology (N = 24) | 50.0 | 91.7 | 91.7 | 50.0 | 20.8 | 33.3 | 0 | 4.2
Endocrinology (N = 23) | 87.0 | 100 | 100 | 30.4 | 34.8 | 39.1 | 4.3 | 0
Ethics (N = 28) | 82.1 | 100 | 100 | 10.7 | 7.1 | 14.3 | 3.6 | 0
Fluids (N = 26) | 50.0 | 100 | 100 | 23.1 | 46.2 | 34.6 | 0 | 3.8
Gastroenterology (N = 22) | 63.6 | 100 | 90.9 | 59.1 | 40.9 | 31.8 | 9.1 | 9.1
Genetics (N = 28) | 35.7 | 100 | 85.7 | 46.4 | 39.3 | 32.1 | 0 | 0
Hematology (N = 24) | 41.7 | 87.5 | 95.8 | 58.3 | 50.0 | 50.0 | 0 | 0
Infectious diseases (N = 23) | 60.9 | 100 | 95.7 | 26.1 | 26.1 | 34.8 | 0 | 4.3
Maternal-fetal medicine (N = 21) | 52.4 | 100 | 100 | 33.3 | 23.8 | 14.3 | 0 | 0
Metabolism (N = 22) | 36.4 | 100 | 95.5 | 50.0 | 40.9 | 36.4 | 0 | 0
Neurology (N = 23) | 52.2 | 100 | 95.7 | 17.4 | 30.4 | 34.8 | 0 | 4.3
Pharmacology (N = 17) | 58.8 | 94.1 | 94.1 | 41.2 | 35.3 | 47.1 | 0 | 0
Respiratory (N = 24) | 33.3 | 100 | 100 | 45.8 | 16.7 | 16.7 | 0 | 0
Overall | 54.2 | 97.9 | 95.8 | 37.8 | 31.7 | 32.3 | 1.3 | 2.0
Knowledge recall (N = 127/305) | 66.9 | 99.2 | 98.4 | 39.4 | 28.3 | 29.9 | 0.8 | 3.1
Simple reasoning (N = 110/305) | 50.9 | 98.2 | 96.4 | 20.9 | 23.6 | 16.4 | 1.8 | 0
Multilogical reasoning (N = 68/305) | 35.3 | 95.6 | 89.7 | 60.3 | 50.0 | 60.3 | 1.5 | 2.9
^a Two board-certified neonatologists (D.B. and C.M.) with board question writing experience evaluated the percentage of the language model's justifications that aligned with each category shown; all values are percentages.

Discussion

LLMs represent a new paradigm for medical artificial intelligence.6 Clinicians, educators, and trainees are eager to integrate these tools into clinical workflows to improve the delivery of care. However, our results suggest that LLMs currently cannot reliably answer neonatal-perinatal board examination practice questions and do not consistently provide explanations that align with scientific and medical consensus. Given that the LLM generated correct answers for only approximately 46% of the questions in our study, it is unlikely that it would pass the actual neonatal-perinatal medicine board examination, and it should not be considered a reliable resource for maintenance of certification examination preparation. We acknowledge potential bias in our evaluation, as the authors who rated the LLM's generated responses also coedited the practice questions. The pace of progress in artificial intelligence systems is rapid; as LLM capabilities change, their proficiency in neonatal-perinatal medicine will need to be reevaluated.

Supplement.

Data sharing statement

References

1. Schulman J, Zoph B, Kim C, et al. ChatGPT: optimizing language models for dialogue. OpenAI. Published November 30, 2022. Accessed February 28, 2023. https://chat.openai.com
2. Levine DM, Tuwani R, Kompa B, et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. medRxiv. Published online February 1, 2023. doi:10.1101/2023.01.30.23285067
3. Kung TH, Cheatham M, Medinilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198
4. Morton S, Ehret D, Ghanta S, Sajti E, Walsh B. In: Brodsky D, Martin CR, eds. Neonatology Review: Q & A. 3rd ed. Lulu; 2015.
5. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv. Published online December 26, 2022. https://arxiv.org/abs/2212.13138
6. Finlayson SG, Beam AL, van Smeden M. Machine learning and statistics in clinical research articles-moving past the false dichotomy. JAMA Pediatr. 2023;177(5):448-450. doi:10.1001/jamapediatrics.2023.0034
