Author manuscript; available in PMC: 2025 Mar 7.
Published in final edited form as: J Perinatol. 2024 Feb 24;44(9):1365–1366. doi: 10.1038/s41372-024-01912-8

Assessment of the clinical knowledge of ChatGPT-4 in neonatal-perinatal medicine: a comparative analysis with ChatGPT-3.5

Puneet Sharma 1,6, Guangze Luo 2,6, Cindy Wang 3, Dara Brodsky 4, Camilia R Martin 5, Andrew Beam 2,6, Kristyn Beam 4,6
PMCID: PMC11886979  NIHMSID: NIHMS2059031  PMID: 38402349

Large language models (LLMs) have demonstrated promising performance on clinical knowledge tasks, including the United States Medical Licensing Examination and subspecialty board examinations [1–4]. Previous work from our group assessed the performance of ChatGPT-3.5 on practice questions for the neonatal-perinatal medicine board examination and found that it scored below a passing threshold [3]. Recent results, however, have shown that a newer version, GPT-4, performs substantially better than GPT-3.5 on board questions in other medical fields [1]. We therefore conducted a comparative analysis of the performance of GPT-4 and GPT-3.5 on practice questions for the neonatal-perinatal medicine board examination.

We compiled questions from a neonatal-perinatal medicine board examination preparation book, excluding questions that were not in multiple-choice format or that contained figures, as GPT-3.5 does not support visual inputs [5]. This yielded 926 questions, sufficient to detect an effect size of 20% with a paired t-test (α = 0.05, β = 0.2). Each eligible question was entered into the application programming interface (API) using the same prompt for both versions. We instructed the LLM that it was a neonatal expert in the specific domain of the question and asked it to “take a deep breath and solve the following question.” We used standard settings on both versions, including a temperature of zero.
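For readers who wish to replicate this querying setup, the following is a minimal sketch using the current OpenAI Python SDK. The model identifiers, the system-prompt wording beyond the quoted instruction, and the example question are illustrative assumptions, not the authors' exact script.

```python
# Minimal sketch of the querying step described above, using the OpenAI
# Python SDK (v1). Model names and prompt details beyond the quoted
# instruction are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question: str, topic: str, model: str) -> str:
    """Send one board-style question to the given model at temperature 0."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding, as in the study
        messages=[
            {
                "role": "system",
                # Domain-specific expert framing, per the methods text.
                "content": f"You are a neonatal expert in {topic}.",
            },
            {
                "role": "user",
                "content": "Take a deep breath and solve the following "
                "question.\n\n" + question,
            },
        ],
    )
    return response.choices[0].message.content


# Query both versions with the same prompt (hypothetical question text).
for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, ask("A term neonate presents with ...", "cardiology", model))
```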

Of the 926 eligible questions, GPT-3.5 answered 425 correctly and GPT-4 answered 651 correctly (46% vs. 70%, p < 0.01). GPT-4 outperformed GPT-3.5 across all domains, with the largest improvements in respiratory (33 percentage points) and infectious diseases (31 percentage points) (Table 1). While the exact passing score on the board examination is unknown, the performance of GPT-4 is likely far closer to it than that of GPT-3.5. This improvement is likely a result of GPT-4 having substantially more parameters and training data than its predecessor, allowing an improved ability to grasp medical knowledge and perform complex tasks.
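The paired analysis can be illustrated on simulated data. The sketch below, which assumes hypothetical per-question correctness vectors (the item-level data are not reproduced here), runs the paired t-test described in the methods and also checks the stated sample-size requirement.

```python
# Sketch of the paired comparison on simulated correctness vectors; the
# study's actual item-level results are not reproduced here.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.power import TTestPower

rng = np.random.default_rng(0)
n = 926
gpt35 = rng.random(n) < 0.46  # ~46% correct, as reported
gpt4 = rng.random(n) < 0.70   # ~70% correct, as reported

t, p = ttest_rel(gpt4.astype(float), gpt35.astype(float))
print(f"GPT-3.5 {gpt35.mean():.0%}, GPT-4 {gpt4.mean():.0%}, p = {p:.2g}")

# Required sample size to detect an effect size of 0.2 at alpha = 0.05,
# power = 0.8 (~199 questions), well below the 926 questions used.
print(TTestPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8))
```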

Table 1.

Summary of LLM accuracy on neonatal-perinatal medicine board examination questions by topic.

| Topic | Questions | Correct (GPT-3.5) | % Correct (GPT-3.5) | Correct (GPT-4) | % Correct (GPT-4) | Improvement (percentage points) |
|---|---|---|---|---|---|---|
| Cardiology | 51 | 25 | 49% | 39 | 76% | 27 |
| Endocrinology | 75 | 39 | 52% | 58 | 77% | 25 |
| Ethics | 14 | 10 | 71% | 13 | 93% | 22 |
| Fluids | 99 | 41 | 41% | 64 | 65% | 24 |
| Genetics | 82 | 43 | 52% | 57 | 70% | 18 |
| Gastroenterology | 39 | 20 | 51% | 24 | 62% | 11 |
| Hematology | 59 | 27 | 46% | 35 | 59% | 13 |
| Infectious Diseases | 102 | 42 | 41% | 73 | 72% | 31 |
| Metabolism | 30 | 12 | 40% | 20 | 67% | 27 |
| Maternal-Fetal Medicine | 129 | 61 | 47% | 99 | 77% | 30 |
| Neurology | 94 | 37 | 39% | 64 | 68% | 29 |
| Pharmacology | 69 | 38 | 55% | 48 | 70% | 15 |
| Respiratory | 83 | 30 | 36% | 57 | 69% | 33 |
| Overall | 926 | 425 | 46% | 651 | 70% | 24 |
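A per-topic summary of this kind follows directly from item-level records. The sketch below assumes a hypothetical pandas DataFrame layout (column names are illustrative, not the authors' data format).

```python
# Sketch of the per-topic tabulation above from hypothetical item-level
# records; columns are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "topic": ["Cardiology", "Cardiology", "Respiratory"],
    "gpt35_correct": [True, False, True],
    "gpt4_correct": [True, True, True],
})

summary = df.groupby("topic").agg(
    n=("topic", "size"),
    gpt35_pct=("gpt35_correct", "mean"),
    gpt4_pct=("gpt4_correct", "mean"),
)
# Improvement in percentage points, as in the rightmost table column.
summary["improvement_pts"] = summary["gpt4_pct"] - summary["gpt35_pct"]
print(summary.round(2))
```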

Since the introduction of LLMs, there has been interest in benchmarking their performance on standardized tests [1–4]. Our study demonstrates a significant improvement in accuracy in just one generation, better approximating the performance of a neonatologist. However, there are other important metrics of performance that this study did not investigate. We queried each version only once per question, precluding any assessment of response concordance. Furthermore, we did not ask the LLM to explain its answers, limiting our ability to assess hallucinations or poor insight. Other investigators have demonstrated concordant and insightful responses on other medical tests, and our group previously assessed these factors in neonatal-perinatal medicine for GPT-3.5 [3, 4]. It would be worthwhile to assess GPT-4 in these domains as well. Nevertheless, as LLMs continue to iterate and learn, they will likely approximate the knowledge of a trained clinician and therefore play a growing role in modern medicine.

ACKNOWLEDGEMENTS

Use of LLMs: Given the aim of this investigation, we used LLMs, specifically ChatGPT-3.5 and ChatGPT-4, in the data collection and analysis. The details of their use are summarized in the text. However, no LLMs were used in the writing or editing of the report.

FUNDING

This work was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (5T32HD098061).

Footnotes

COMPETING INTERESTS

The authors declare no competing interests.

REFERENCES

1. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2023. doi: 10.1080/0142159X.2023.2249588.
2. Beam K, Sharma P, Levy P, Beam AL. Artificial intelligence in the neonatal intensive care unit: the time is now. J Perinatol.
3. Beam K, Sharma P, Kumar B, Wang C, Brodsky D, Martin CR, et al. Performance of a large language model on practice questions for the neonatal board examination. JAMA Pediatr. 2023;177:977–9.
4. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198.
5. Morton S, Ehret D, Ghanta S, Sajti E, Walsh B. Neonatology review: Q&A. 3rd ed. Morrisville (US): Lulu; 2015.
