Abstract
Large language models (LLMs) have gained traction in medicine, but there is limited research comparing closed- and open-source models in subspecialty contexts. This study evaluated ChatGPT-4.0o and DeepSeek-R1 on a pediatric cardiology board-style examination to quantify their accuracy and discuss their educational and clinical utility. ChatGPT-4.0o and DeepSeek-R1 were used to answer 88 text-based multiple-choice questions across 11 pediatric cardiology subtopics from a Pediatric Cardiology Board Review textbook. DeepSeek-R1’s processing time per question was measured. ChatGPT-4.0o and DeepSeek-R1 achieved 70% (62/88) and 68% (60/88) accuracy, respectively (P = 0.53). Subtopic accuracy was equal in 5 of 11 chapters, with each model outperforming its counterpart in 3 of 11. DeepSeek-R1’s processing time negatively correlated with accuracy (r = −0.68, P = 0.02). The two models were comparable in accuracy, with ChatGPT-4.0o meeting and DeepSeek-R1 approaching the passing threshold on a pediatric cardiology board examination. While further development of LLMs is required for clinical integration into pediatric cardiology, these findings suggest the potential utility of these models as educational aids.
Keywords: Artificial intelligence, ChatGPT, DeepSeek, large language model, pediatric cardiology
INTRODUCTION
The integration of artificial intelligence (AI) into medicine has accelerated with the rise of large language models (LLMs).[1,2] LLMs perform complex reasoning tasks and generate contextually accurate responses.[2] Proprietary, closed-source models such as OpenAI’s Chat Generative Pre-Trained Transformer (GPT) have gained traction owing to their accuracy in radiology[2,3] and their superior performance compared to physicians in answering patient questions.[4] ChatGPT-3.5 and ChatGPT-4.0 were recently compared using a pediatric cardiology assessment, and despite failing scores, the successor model performed better.[5]
The training datasets of closed-source models are not publicly disclosed, owing to commercial competition and security concerns.[6] Proprietary LLMs have also been criticized for their limited accessibility and the economic resources required for training and commercial use.[1] Conversely, open-source and free alternatives such as DeepSeek’s DeepSeek-R1 aim to match proprietary LLM performance while offering accessibility, customization, and resource efficiency.[1,2] While open-source LLMs are less extensively studied, Temsah et al. highlight their capacity for discipline-specific adaptation.[1] This customizability places open-source LLMs in a unique position to serve as an adjunct to clinical care and education; however, they must first demonstrate high reliability and accuracy without hallucinations.[1]
There are limited comparative evaluations of proprietary and open-source LLMs in subspecialty medical disciplines, such as pediatric cardiology. To address this knowledge gap, the present study compared the performance of ChatGPT-4.0o and DeepSeek-R1 on a multiple-choice examination modeled after a pediatric cardiology board examination. The objectives of the present study were as follows: (1) to compare overall and topic-specific accuracy between models and (2) to determine whether closed- and/or open-source LLMs can meet the 70% passing standard for board certification. We hypothesized that ChatGPT-4.0o would outperform its open-source counterpart but that both models would fail to meet the passing threshold.
MATERIALS AND METHODS
Reproducing the methodology of a previous comparison between ChatGPT models,[5] we used a dataset of 88 multiple-choice questions taken verbatim from the Pediatric Cardiology Board Review by Eidem et al.[7] ChatGPT-4.0o and DeepSeek-R1 were used to answer the questions obtained from this question bank. DeepSeek-R1’s “DeepThink (R1)” feature was used to quantify processing time per question. We used the first eight “text-only” questions from each of the 11 eligible chapters in the textbook, for a total of 88 questions. Questions requiring images or other modalities, such as sound, were excluded.
Eight questions were drawn from each of the following chapter topics: cardiac anatomy and physiology, congenital cardiac malformations, diagnosis of congenital heart disease, cardiac catheterization and angiography, noninvasive cardiac imaging, electrophysiology questions for paediatrics, exercise physiology and testing, outpatient cardiology, cardiac intensive care and heart failure, cardiac pharmacology, and surgical palliation and repair of congenital heart disease.
Each question was entered as a separate prompt with identical wording between models, and previous conversations were cleared so that no prior information could affect the chatbot’s answers. Each response was manually reviewed by members of our team to ensure the question had been answered. If either LLM stated that none, multiple, or all of the answers were correct when this was not one of the available options, the response was scored as incorrect. The answer key provided by the textbook was used to verify the correct answers. Data normality was confirmed with a Shapiro–Wilk test. Statistical analyses were conducted in SPSS 20 (IBM Software, Armonk, New York, USA) using a paired t-test and Pearson’s r correlation. Significance was set at P < 0.05. Owing to the nature of this study, no research ethics approval was obtained from our institution.
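For readers who wish to re-run the comparison outside SPSS, the minimal sketch below, written in Python with SciPy rather than the SPSS 20 workflow actually used, applies the Shapiro–Wilk test to the paired per-chapter differences and then performs the paired t-test. The per-chapter correct counts are those reported later in Table 1; rounding and software differences may yield P values slightly different from those reported.

```python
# Minimal sketch (Python/SciPy), not the SPSS 20 workflow used in the study.
# Per-chapter correct counts (out of 8) are those reported in Table 1.
from scipy import stats

chatgpt = [4, 6, 3, 3, 7, 7, 8, 7, 6, 5, 6]    # ChatGPT-4.0o, Chapters 1-11
deepseek = [5, 6, 3, 3, 7, 7, 5, 6, 5, 6, 7]   # DeepSeek-R1, Chapters 1-11

# Shapiro-Wilk normality check on the paired per-chapter differences
diffs = [c - d for c, d in zip(chatgpt, deepseek)]
print("Shapiro-Wilk:", stats.shapiro(diffs))

# Paired t-test comparing the two models chapter by chapter
print("Paired t-test:", stats.ttest_rel(chatgpt, deepseek))

# Overall accuracy for each model
print(f"ChatGPT-4.0o: {sum(chatgpt)}/88 ({sum(chatgpt) / 88:.0%})")
print(f"DeepSeek-R1:  {sum(deepseek)}/88 ({sum(deepseek) / 88:.0%})")
```

The same arrays also reproduce the overall scores of 62/88 (70%) and 60/88 (68%) reported below.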
RESULTS
Overall model performance
Data are reported as the mean percentage (%) of total questions answered correctly. ChatGPT-4.0o answered 62 of 88 questions (70%) correctly, while DeepSeek-R1 answered 60 of 88 (68%) correctly. Model performance was not statistically different [Figure 1; P = 0.53].
Figure 1. ChatGPT-4.0o and DeepSeek-R1 were asked the same eight text-based multiple-choice questions from 11 chapters (n = 88 for both) in the Pediatric Cardiology Board Review.[7] “X” indicates mean scores. There was no statistically significant difference between models (P = 0.53).
Subtopic stratification
ChatGPT-4.0o
When stratified by subtopic, ChatGPT-4.0o answered 5.6 ± 1.7 of eight questions correctly per chapter [Table 1]. ChatGPT-4.0o performed best on Chapter 7: Exercise Physiology and Testing (8/8, 100%) and worst on Chapters 3 and 4: Diagnosis of Congenital Heart Disease and Cardiac Catheterization and Angiography (3/8, 38% each). Across chapters, ChatGPT-4.0o answered 70 ± 21% of questions correctly.
Table 1.
Number of correctly answered questions by ChatGPT-4.0o and DeepSeek-R1 across pediatric cardiology subtopics
| Chapter subtopic | ChatGPT-4.0o (/8) | DeepSeek-R1 (/8) |
|---|---|---|
| Chapter 1: Cardiac anatomy and physiology | 4 | 5 |
| Chapter 2: Congenital cardiac malformations | 6 | 6 |
| Chapter 3: Diagnosis of congenital heart disease | 3 | 3 |
| Chapter 4: Cardiac catheterization and angiography | 3 | 3 |
| Chapter 5: Noninvasive cardiac imaging | 7 | 7 |
| Chapter 6: Electrophysiology questions for paediatrics | 7 | 7 |
| Chapter 7: Exercise physiology and testing | 8 | 5 |
| Chapter 8: Outpatient cardiology | 7 | 6 |
| Chapter 9: Cardiac intensive care and heart failure | 6 | 5 |
| Chapter 10: Cardiac pharmacology | 5 | 6 |
| Chapter 11: Surgical palliation and repair of congenital heart disease | 6 | 7 |
| Total (/88) | 62 | 60 |
| Score on entire examination (%) | 70 | 68 |
Values for each respective large language model are out of a maximum of eight per chapter across 11 total chapters. Questions are derived from the Pediatric Cardiology Board Review textbook[7]
DeepSeek-R1
When stratified by subtopic, DeepSeek-R1 answered 5.5 ± 1.4 of eight questions correctly per chapter. DeepSeek-R1 performed best on Chapters 5, 6, and 11: Noninvasive Cardiac Imaging, Electrophysiology, and Surgical Palliation and Repair of Congenital Heart Disease [Table 1; 7/8, 88% each]. Across chapters, DeepSeek-R1 answered 68 ± 18% of questions correctly. Processing time averaged 51 ± 15 s per question (range: 12–218 s) and displayed a strong negative correlation with accuracy [Figure 2; r = −0.68, P = 0.02].
Figure 2. Relationship between average processing time (seconds) and percentage (%) of questions answered correctly by DeepSeek-R1 across 11 chapters in the Pediatric Cardiology Board Review.[7] Data display a statistically significant, strong negative correlation (r = −0.68, P = 0.02).
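Because the per-chapter processing times are not tabulated in this report, the sketch below uses placeholder timing values purely to illustrate how the correlation in Figure 2 could be computed. Only the chapter accuracies are derived from Table 1; the printed r and P will not match the reported values unless the measured times are substituted.

```python
# Illustrative only: the processing times below are placeholders, not measured values.
# DeepSeek-R1 per-chapter accuracies (%) are derived from Table 1.
from scipy import stats

deepseek_correct = [5, 6, 3, 3, 7, 7, 5, 6, 5, 6, 7]   # out of 8, Chapters 1-11
accuracy_pct = [100 * c / 8 for c in deepseek_correct]

# Hypothetical mean processing time (s) per chapter; substitute measured values here
mean_time_s = [45, 40, 70, 68, 38, 42, 60, 44, 55, 41, 39]

r, p = stats.pearsonr(mean_time_s, accuracy_pct)
print(f"Pearson r = {r:.2f}, P = {p:.3f}")
```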
ChatGPT-4.0o versus DeepSeek-R1
ChatGPT-4.0o and DeepSeek-R1 performed equally on Congenital Cardiac Malformations, Diagnosis of Congenital Heart Disease, Cardiac Catheterization and Angiography, Noninvasive Cardiac Imaging, and Electrophysiology. ChatGPT-4.0o outperformed DeepSeek-R1 in Exercise Physiology and Testing (100% vs. 63%), Outpatient Cardiology (88% vs. 75%), and Cardiac Intensive Care and Heart Failure (75% vs. 63%). Conversely, DeepSeek-R1 performed better in Cardiac Anatomy and Physiology (63% vs. 50%), Cardiac Pharmacology (75% vs. 63%), and Surgical Palliation and Repair of Congenital Heart Disease (88% vs. 75%) [Table 1].
Qualitative observations
Both models provided intelligible reasoning but occasionally presented hallucinations when justifying incorrect answers. These hallucinations were more frequent in application-based problems than in recall questions.
DISCUSSION
This study compared a closed-source LLM, ChatGPT-4.0o, to an open-source LLM, DeepSeek-R1, on a pediatric cardiology board-style examination. ChatGPT-4.0o achieved higher accuracy (70%) than DeepSeek-R1 (68%), although the difference was not statistically significant (P = 0.53). These findings offer novel evidence that proprietary and open-source LLMs can meet, or closely approach, competency thresholds in a specialized medical discipline. While promising, the ~30% error rate falls short of the standard for clinical decision-making. However, these results suggest potential educational utility for medical trainees as a study aid for recall-based questions. The hallucinatory tendencies highlight the need for further development before LLMs can be reliably used for nuanced, application-based questions.
ChatGPT-4.0o’s passing grade aligns with evidence showing that closed-source LLMs perform better on the AI benchmarking protocols used to compare models.[2,6] However, the present findings also align with data showing that open-source LLMs are bridging the performance divide through improvements in domain-specific fine-tuning,[2] as evidenced by DeepSeek-R1 performing equally on 5 of 11 subtopics and better on 3 of 11. A strength of the present study is the measurement of processing time to gain insight into LLM reasoning capacity. The inverse relationship between processing time and accuracy per subtopic [Figure 2] provides proof-of-concept that LLMs have difficulty with multi-step reasoning problems. Processing time in open-source LLMs may serve as a valuable diagnostic tool for identifying model deficiencies.
Open-source LLMs are advantageous for their customizability and smaller size. Despite their clinical utility, LLMs contribute to the climate crisis through carbon emissions.[8] The smaller size of open-source LLMs attenuates the environmental footprint associated with the energy-intensive requirements of their closed-source counterparts. Doo et al. found that smaller LLMs may achieve greater diagnostic accuracy while consuming seven times less energy than general-purpose models.[8] These findings suggest that domain-specific models should be developed in medical subspecialties to balance technological advancement with environmental sustainability.
Future research in open-source AI should focus on developing models tailored to the multimodal demands of pediatric cardiology[9,10] to advance the quality of care and medical education while minimizing environmental impact. The inter-subtopic variability, in which each model excelled in different domains, suggests differences in training data and provides an avenue for investigation. In addition, exploring expanded datasets with multimodal questions and comparing models against human benchmarks would contextualize results relative to physicians and trainees. While LLMs do not yet meet clinical reliability and accuracy demands, their near-threshold performance highlights rapid advancements and their potential usefulness as educational aids.
Conflicts of interest
There are no conflicts of interest.
Funding Statement
Nil.
REFERENCES
1. Temsah A, Alhasan K, Altamimi I, Jamal A, Al-Eyadhy A, Malki KH, et al. DeepSeek in healthcare: Revealing opportunities and steering challenges of a new open-source artificial intelligence frontier. Cureus. 2025;17:e79221. doi: 10.7759/cureus.79221.
2. Savage CH, Kanhere A, Parekh V, Langlotz CP, Joshi A, Huang H, et al. Open-source large language models in radiology: A review and tutorial for practical research and clinical deployment. Radiology. 2025;314:e241073. doi: 10.1148/radiol.241073.
3. Kim SH, Schramm S, Adams LC, Braren R, Bressem KK, Keicher M, et al. Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports. NPJ Digit Med. 2025;8:97. doi: 10.1038/s41746-025-01488-3.
4. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183:589–96. doi: 10.1001/jamainternmed.2023.1838.
5. Gritti MN, AlTurki H, Farid P, Morgan CT. Progression of an artificial intelligence chatbot (ChatGPT) for pediatric cardiology educational knowledge assessment. Pediatr Cardiol. 2024;45:309–13. doi: 10.1007/s00246-023-03385-6.
6. Khan WZ, Kibriya H, Siddiqa A, Khan MK. Privacy issues in large language models: A survey. Comput Electr Eng. 2024;120:109698.
7. Eidem B, Cannon BC, Johnson JN, Chang AC, Cetta F. Pediatric Cardiology Board Review. 2nd ed. Philadelphia, PA: Wolters Kluwer; 2023.
8. Doo FX, Savani D, Kanhere A, Carlos RC, Joshi A, Yi PH, et al. Optimal large language model characteristics to balance accuracy and energy use for sustainable medical applications. Radiology. 2024;312:e240320. doi: 10.1148/radiol.240320.
9. Gritti MN, Prajapati R, Yissar D, Morgan CT. Precision of artificial intelligence in paediatric cardiology multimodal image interpretation. Cardiol Young. 2024;34:2349–54. doi: 10.1017/S1047951124036035.
10. Gritti MN, Morgan CT. Rethinking paediatric cardiology training in Canada. CJC Pediatr Congenit Heart Dis. 2024;3:43–6. doi: 10.1016/j.cjcpc.2023.11.002.
