Large language models (LLMs) have demonstrated promising performance on clinical knowledge tasks, including the United States Medical Licensing Examination and subspecialty board examinations [1–4]. Previous work from our group assessed the performance of ChatGPT-3.5 on practice questions for the neonatal-perinatal medicine board examination and found that it scored below a passing level [3]. Recent results, however, have shown that a newer version, GPT-4, performs substantially better than GPT-3.5 on board questions in other medical fields [1]. We therefore conducted a comparative analysis of GPT-4 and GPT-3.5 on practice questions for the neonatal-perinatal medicine board examination.
We compiled questions from a neonatal-perinatal medicine board examination preparation book and excluded those that were not in multiple-choice format or that included figures, as GPT-3.5 does not support visual inputs [5]. This yielded 926 questions, sufficient to detect an effect size of 20% with a paired t-test (α = 0.05, β = 0.2). Each eligible question was entered into the application programming interface using the same prompt for both versions. We instructed the LLM that it was a neonatal expert in the specific domain of the question and asked it to “take a deep breath and solve the following question.” We used standard settings on both versions, including a temperature of zero.
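For illustration, the sketch below shows how each question might be submitted through the OpenAI chat completions API under the protocol described above (domain-specific expert instruction, the quoted prompt, and a temperature of zero). This is a minimal sketch, not the authors' actual code; the model identifiers, function name, and message structure are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_question(model: str, domain: str, question_text: str) -> str:
    """Submit one multiple-choice question and return the model's reply."""
    response = client.chat.completions.create(
        model=model,    # e.g., "gpt-3.5-turbo" or "gpt-4" (assumed identifiers)
        temperature=0,  # deterministic decoding, matching the stated settings
        messages=[
            # Expert framing in the specific domain of the question
            {"role": "system",
             "content": f"You are a neonatal expert in {domain}."},
            # The quoted instruction, followed by the question text
            {"role": "user",
             "content": "Take a deep breath and solve the following question.\n\n"
                        + question_text},
        ],
    )
    return response.choices[0].message.content
```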
Of the 926 eligible questions, GPT-3.5 answered 425 correctly and GPT-4 answered 651 correctly (46% vs. 70%, p < 0.01). GPT-4 outperformed GPT-3.5 across all domains, with the largest improvements in respiratory (33 percentage points) and infectious diseases (31 percentage points) (Table 1). While the exact passing threshold for the board examination is unknown, GPT-4's performance is likely far closer to it than that of GPT-3.5. This improvement is likely a result of GPT-4 having substantially more parameters and training data than its predecessor, enabling it to better grasp medical knowledge and perform complex tasks.
Table 1.
Summary of LLM accuracy on neonatal-perinatal medicine board examination questions by topic.
| Topic | Total Number of Questions | Total Correct (GPT-3.5) | Percentage Correct (GPT-3.5) | Total Correct (GPT-4) | Percentage Correct (GPT-4) | Improvement (GPT-4 − GPT-3.5) |
|---|---|---|---|---|---|---|
| Cardiology | 51 | 25 | 49% | 39 | 76% | 27% |
| Endocrinology | 75 | 39 | 52% | 58 | 77% | 25% |
| Ethics | 14 | 10 | 71% | 13 | 93% | 22% |
| Fluids | 99 | 41 | 41% | 64 | 65% | 24% |
| Genetics | 82 | 43 | 52% | 57 | 70% | 18% |
| Gastroenterology | 39 | 20 | 51% | 24 | 62% | 11% |
| Hematology | 59 | 27 | 46% | 35 | 59% | 13% |
| Infectious Diseases | 102 | 42 | 41% | 73 | 72% | 31% |
| Metabolism | 30 | 12 | 40% | 20 | 67% | 27% |
| Maternal-Fetal Medicine | 129 | 61 | 47% | 99 | 77% | 30% |
| Neurology | 94 | 37 | 39% | 64 | 68% | 29% |
| Pharmacology | 69 | 38 | 55% | 48 | 70% | 15% |
| Respiratory | 83 | 30 | 36% | 57 | 69% | 33% |
| Overall | 926 | 425 | 46% | 651 | 70% | 24% |
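The letter does not state which statistical test produced the p value. As an illustrative check only, a two-proportion z-test on the Table 1 totals, which ignores the pairing of responses by question and is therefore not the authors' paired analysis, reproduces a difference far beyond the p < 0.01 threshold. A minimal sketch using statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

correct = [651, 425]  # correct responses: GPT-4, GPT-3.5 (Table 1, Overall row)
total = [926, 926]    # eligible questions answered by each version

z, p = proportions_ztest(correct, total)
print(f"GPT-4: {correct[0] / total[0]:.1%}  GPT-3.5: {correct[1] / total[1]:.1%}")
print(f"z = {z:.1f}, p = {p:.1e}")  # z ≈ 10.6, p far below 0.01
```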
Since the introduction of LLMs, there has been interest in benchmarking their performance on standardized testing [1–4]. Our study demonstrates a significant improvement in accuracy within a single model generation, bringing performance closer to that of a neonatologist. However, there are other important metrics of performance that this study did not investigate. We queried each version only once per question, precluding any assessment of response concordance. Furthermore, we did not ask the LLM to explain its answers, limiting our ability to assess hallucinations or poor insight. Other investigators have demonstrated concordant and insightful responses on other medical tests, and our group previously assessed these factors in neonatal-perinatal medicine for GPT-3.5 [3, 4]. It would be worthwhile to assess GPT-4 in these domains as well. Nevertheless, as LLMs continue to iterate and learn, they will likely approximate the knowledge of a trained clinician and therefore play a growing role in modern medicine.
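A concordance assessment of the kind described above could, for example, resubmit each question several times at a nonzero sampling temperature and measure agreement with the modal answer. The sketch below is a hypothetical illustration, not a procedure used in this study; the function name, prompt, and agreement metric are all assumptions.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def concordance(model: str, question_text: str, n_runs: int = 5) -> float:
    """Ask the same question n_runs times; return the fraction of replies
    that match the most common answer (1.0 = fully concordant)."""
    answers = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model=model,
            temperature=1,  # nonzero temperature so repeated runs can differ
            messages=[{"role": "user",
                       "content": "Answer with the single best option letter.\n\n"
                                  + question_text}],
        )
        answers.append(response.choices[0].message.content.strip())
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / n_runs
```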
ACKNOWLEDGEMENTS
Use of LLMs: Given the aim of this investigation, we used LLMs, specifically ChatGPT-3.5 and ChatGPT-4, in the data collection and analysis. The details of their use are summarized in the text. However, no LLMs were used in the writing or editing of the report.
FUNDING
This work was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (5T32HD098061).
COMPETING INTERESTS
The authors declare no competing interests.
REFERENCES
1. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2023. https://doi.org/10.1080/0142159X.2023.2249588.
2. Beam K, Sharma P, Levy P, Beam AL. Artificial intelligence in the neonatal intensive care unit: the time is now. J Perinatol. 2022;42:1271–5.
3. Beam K, Sharma P, Kumar B, Wang C, Brodsky D, Martin CR, et al. Performance of a large language model on practice questions for the neonatal board examination. JAMA Pediatr. 2023;177:977–9.
4. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198.
5. Morton S, Ehret D, Ghanta S, Sajti E, Walsh B. Neonatology review: Q&A. 3rd ed. Morrisville (US): Lulu; 2015.
