Large language models (LLMs) have demonstrated promising performance on clinical knowledge tasks, including the United States Medical Licensing Examination and subspecialty board examinations [1–4]. Previous work from our group assessed the performance of ChatGPT-3.5 on practice questions for the neonatal-perinatal medicine board examination and found that it scored below a passing level [3]. Recent results, however, have shown that a newer version, GPT-4, performs substantially better than GPT-3.5 on board questions in other medical fields [1]. We therefore conducted a comparative analysis of GPT-4 and GPT-3.5 on practice questions for the neonatal-perinatal medicine board examination.
We compiled questions from a neonatal-perinatal medicine board examination preparation book and excluded those that were not in multiple-choice format or that included figures, as GPT-3.5 does not support visual inputs [5]. This yielded 926 questions, sufficient to detect an effect size of 20% with a paired t-test (α = 0.05, β = 0.2). Each eligible question was entered into the application programming interface using the same prompt for both versions. We instructed the LLM that it was a neonatal expert in the specific domain of the question and asked it to “take a deep breath and solve the following question.” We used standard settings on both versions, including a temperature of zero.
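For illustration, the sketch below shows how each question might be submitted through the OpenAI chat completions API under the protocol described above (domain-specific expert instruction, the quoted prompt, and a temperature of zero). This is a minimal sketch, not the authors' actual code; the model identifiers, function name, and message structure are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_question(model: str, domain: str, question_text: str) -> str:
    """Submit one multiple-choice question and return the model's reply."""
    response = client.chat.completions.create(
        model=model,    # e.g., "gpt-3.5-turbo" or "gpt-4" (assumed identifiers)
        temperature=0,  # deterministic decoding, matching the stated settings
        messages=[
            # Expert framing in the specific domain of the question
            {"role": "system",
             "content": f"You are a neonatal expert in {domain}."},
            # The quoted instruction, followed by the question text
            {"role": "user",
             "content": "Take a deep breath and solve the following question.\n\n"
                        + question_text},
        ],
    )
    return response.choices[0].message.content
```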
Of the 926 eligible questions, GPT-3.5 answered 425 correctly and GPT-4 answered 651 correctly (46% vs. 70%, p < 0.01). GPT-4 outperformed GPT-3.5 across all domains, with the largest improvements in respiratory (33 percentage points) and infectious diseases (31 percentage points) (Table 1). While the exact passing threshold for the board examination is unknown, GPT-4's performance is likely far closer to it than that of GPT-3.5. This improvement is likely a result of GPT-4 having substantially more parameters and training data than its predecessor, enabling it to better grasp medical knowledge and perform complex tasks.
Table 1.
Summary of LLM accuracy on neonatal-perinatal medicine board examination questions by topic.
| Topic | Total Number of Questions | Total Correct (GPT-3.5) | Percentage Correct (GPT-3.5) | Total Correct (GPT-4) | Percentage Correct (GPT-4) | Improvement (GPT-4 − GPT-3.5) |
|---|---|---|---|---|---|---|
| Cardiology | 51 | 25 | 49% | 39 | 76% | 27% |
| Endocrinology | 75 | 39 | 52% | 58 | 77% | 25% |
| Ethics | 14 | 10 | 71% | 13 | 93% | 22% |
| Fluids | 99 | 41 | 41% | 64 | 65% | 24% |
| Genetics | 82 | 43 | 52% | 57 | 70% | 18% |
| Gastroenterology | 39 | 20 | 51% | 24 | 62% | 11% |
| Hematology | 59 | 27 | 46% | 35 | 59% | 13% |
| Infectious Diseases | 102 | 42 | 41% | 73 | 72% | 31% |
| Metabolism | 30 | 12 | 40% | 20 | 67% | 27% |
| Maternal-Fetal Medicine | 129 | 61 | 47% | 99 | 77% | 30% |
| Neurology | 94 | 37 | 39% | 64 | 68% | 29% |
| Pharmacology | 69 | 38 | 55% | 48 | 70% | 15% |
| Respiratory | 83 | 30 | 36% | 57 | 69% | 33% |
| Overall | 926 | 425 | 46% | 651 | 70% | 24% |
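The letter does not state which statistical test produced the p value. As an illustrative check only, a two-proportion z-test on the Table 1 totals, which ignores the pairing of responses by question and is therefore not the authors' paired analysis, reproduces a difference far beyond the p < 0.01 threshold. A minimal sketch using statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

correct = [651, 425]  # correct responses: GPT-4, GPT-3.5 (Table 1, Overall row)
total = [926, 926]    # eligible questions answered by each version

z, p = proportions_ztest(correct, total)
print(f"GPT-4: {correct[0] / total[0]:.1%}  GPT-3.5: {correct[1] / total[1]:.1%}")
print(f"z = {z:.1f}, p = {p:.1e}")  # z ≈ 10.6, p far below 0.01
```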
Since the introduction of LLMs, there has been interest in benchmarking their performance on standardized testing [1–4]. Our study demonstrates a significant improvement in accuracy within a single model generation, bringing performance closer to that of a neonatologist. However, there are other important metrics of performance that this study did not investigate. We queried each version only once per question, precluding any assessment of response concordance. Furthermore, we did not ask the LLM to explain its answers, limiting our ability to assess hallucinations or poor insight. Other investigators have demonstrated concordant and insightful responses on other medical tests, and our group previously assessed these factors in neonatal-perinatal medicine for GPT-3.5 [3, 4]. It would be worthwhile to assess GPT-4 in these domains as well. Nevertheless, as LLMs continue to iterate and learn, they will likely approximate the knowledge of a trained clinician and therefore play a growing role in modern medicine.
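A concordance assessment of the kind described above could, for example, resubmit each question several times at a nonzero sampling temperature and measure agreement with the modal answer. The sketch below is a hypothetical illustration, not a procedure used in this study; the function name, prompt, and agreement metric are all assumptions.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def concordance(model: str, question_text: str, n_runs: int = 5) -> float:
    """Ask the same question n_runs times; return the fraction of replies
    that match the most common answer (1.0 = fully concordant)."""
    answers = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model=model,
            temperature=1,  # nonzero temperature so repeated runs can differ
            messages=[{"role": "user",
                       "content": "Answer with the single best option letter.\n\n"
                                  + question_text}],
        )
        answers.append(response.choices[0].message.content.strip())
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / n_runs
```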
ACKNOWLEDGEMENTS
Use of LLMs: Given the aim of this investigation, we used LLMs, specifically ChatGPT-3.5 and ChatGPT-4, in the data collection and analysis. The details of their use are summarized in the text. However, no LLMs were used in the writing or editing of the report.
FUNDING
This work was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (5T32HD098061).
COMPETING INTERESTS
The authors declare no competing interests.
REFERENCES
1. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2023. https://doi.org/10.1080/0142159X.2023.2249588.
2. Beam K, Sharma P, Levy P, Beam AL. Artificial intelligence in the neonatal intensive care unit: the time is now. J Perinatol. 2022;42:1271–5.
3. Beam K, Sharma P, Kumar B, Wang C, Brodsky D, Martin CR, et al. Performance of a large language model on practice questions for the neonatal board examination. JAMA Pediatr. 2023;177:977–9.
4. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198.
5. Morton S, Ehret D, Ghanta S, Sajti E, Walsh B. Neonatology review: Q&A. 3rd ed. Morrisville (US): Lulu; 2015.
