Abstract
Objective
Large language models (LLMs) have advanced rapidly, but their utility in pediatric surgery remains uncertain. This study assessed the performance of three AI models—DeepSeek, Microsoft Copilot (GPT-4) and Google Bard—on the European Pediatric Surgery In-Training Examination (EPSITE).
Methods
We evaluated model performance using 294 EPSITE questions from 2021 to 2023. Data for Copilot and Bard were collected in early 2024, while DeepSeek was assessed in 2025. Responses were compared to those of pediatric surgical trainees. Statistical analyses determined performance differences.
Results
DeepSeek achieved the highest accuracy (85.0%), followed by Copilot (55.4%) and Bard (48.0%). Pediatric surgical trainees averaged 60.1%. Overall performance differences were statistically significant (p < 0.0001), and DeepSeek significantly outperformed both human trainees and the other models (p < 0.0001). Copilot did not differ significantly from trainees at any training level, whereas sixth-year trainees performed better than Bard (p < 0.05). Copilot and Bard failed to answer a small proportion of questions (3.4% and 4.7%, respectively) due to ethical concerns or a perceived lack of correct choices. The time gap between model assessments reflects the rapid evolution of LLMs and likely contributed to the superior performance of newer models like DeepSeek.
Conclusion
LLMs show variable performance in pediatric surgery, with newer models like DeepSeek demonstrating marked improvement. These findings highlight the rapid progression of LLM capabilities and emphasize the need for ongoing evaluation before clinical integration, especially in high-stakes decision-making contexts.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00383-025-06104-9.
Keywords: Large language models, Pediatric surgery, Medical education, Surgical training
Introduction
The advent of large language models (LLMs) and their processing tools driven by artificial intelligence (AI) has revolutionized various sectors, including medicine [1–4]. The integration of AI into the medical field has opened up new avenues for understanding and addressing complex surgical challenges. Efforts have been made to benchmark the performance of LLMs in various areas [3, 4]. So far, little is known about the ability of LLMs to accurately answer questions or assess clinical situations in the field of pediatric surgery [1, 2, 5]. In this study, we evaluate three prominent AI models, DeepSeek, Microsoft Copilot (GPT-4) and Google Bard, in the context of the European Pediatric Surgery In-Training Examination (EPSITE) [6–8]. This annual online examination, initiated in 2019, was developed for pediatric surgical trainees by the European Paediatric Surgeons' Association (EUPSA). It was established for pediatric surgeons worldwide who wish to assess their level of knowledge in different areas of the pediatric surgical curriculum and consists of 120 multiple-choice questions in 12 categories, based on the European Board of Pediatric Surgery examination.
This study aimed to conduct a comprehensive comparison of these three AI models in the realm of pediatric surgery, specifically evaluating their ability to correctly answer the original EPSITE questions from the past three years. For DeepSeek, we used the most recent model version available, while the Copilot and Google Bard results reflect their performance at the time of their respective releases. This approach enabled a fair comparison of each model's initial capabilities and highlights how newer models like DeepSeek may benefit from architectural refinements and lower development costs. Additionally, we compared the AI scores with real-world data from human test takers, representing a wide array of international clinical experience from almost 40 countries of origin. Through rigorous evaluation, we sought to shed light on the potential role of AI models in pediatric surgical practice.
Material and methods
The performance of DeepSeek, Copilot powered by GPT-4 and Google Bard was evaluated using the EPSITE examinations from 2021 to 2023. All LLMs employed in this study are open access. The EPSITE is an online examination provided once a year by the EUPSA and available worldwide for pediatric surgeons. Participation was voluntary and registration required a fee (50€ for EUPSA members, 100€ for non-EUPSA members). EPSITE questions were designed by a panel of experts from the European Board of Pediatric Surgery (EBPS) and the EUPSA. The examination comprised questions requiring simple fact recall as well as questions requiring analytical reasoning over clinical vignettes to reach the correct answer. Two independent reviewers (RG, RW) classified questions according to these characteristics, and conflicts were resolved by discussion or consultation of a third reviewer (OA). Since Copilot only accepts text input, questions comprising image data were excluded (n = 29). All other questions were included. Thus, a total of 294 questions in a single-best-answer multiple-choice format were entered individually (Fig. 1). In an initial standardized prompt, the LLMs were informed that they would be presented with multiple-choice questions about pediatric surgical topics and asked to choose the best possible answer. Each model was then presented with the questions in the same sequence. Data from the AI models were collected between 20/01/24 and 25/01/24 for Copilot and Google Bard and between 16/02/25 and 23/02/25 for DeepSeek. Responses were compared to the official scoring grid and assessed for correctness. Questions where no definitive answer was given were considered incorrect. For participants, test scores were de-identified and stored anonymously. By participating, test takers agreed to the use of their data, and no ethics board approval was required.
Fig. 1.
Examples for the individual entry of multiple-choice questions into the LLMs DeepSeek (A), Copilot (B) and Google Bard (C)
Statistical analysis
All analyses were performed using GraphPad Prism version 8.0.2 (GraphPad Software, Boston, MA, USA). Descriptive statistics and frequencies were reported for all items. Subgroup analyses according to the participants' year of training were performed to assess differences between groups. The Chi-square test and Fisher's exact test were used to analyze differences in performance. The Holm method was used to correct for multiple comparisons in the pairwise analyses using Fisher's exact test. Odds ratios and confidence intervals were calculated based on total correct vs. incorrect responses at the group level. A p-value < 0.05 was considered statistically significant.
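The analyses above were run in GraphPad Prism. Purely as an illustration of the procedure (not the software used in the study), the pairwise Fisher's exact test and the Holm step-down correction can be sketched in standard-library Python; the function names here are our own:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    def prob(x):  # P(first cell == x) under fixed margins
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # small tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def holm_adjust(p_values):
    """Holm step-down adjustment: multiply the k-th smallest p-value by
    (m - k + 1), enforce monotonicity, and cap at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running)
    return adjusted

# DeepSeek (250 right / 44 wrong) vs. all trainees (177 right / 117 wrong)
p_deepseek = fisher_exact_two_sided(250, 44, 177, 117)
print(p_deepseek < 0.0001)  # True
```

Prism's implementation may differ in tie-handling details, but the logic of the pairwise comparisons and the Holm correction is as shown.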
Results
Across the three examinations, a total of 294 questions were included (Table 1). The EPSITE participants originated from 39 different countries, and the total number of participating pediatric surgical trainees was 232 (range: 67–90 participants/year). The distribution of participants by year of surgical training is provided in Table 1. The overall level of training did not differ between annual examinations.
Table 1.
Distribution of the number of questions and participants by year of surgical training
| Examination year | Total examination questions (n) | Year of surgical training, n (%) | |||||||
|---|---|---|---|---|---|---|---|---|---|
| 1st | 2nd | 3rd | 4th | 5th | 6th | > 6th | Total | ||
| 2023 | 120 | 11 (12.2) | 9 (10.0) | 10 (11.1) | 13 (14.4) | 12 (13.3) | 15 (16.7) | 20 (22.2) | 90 (100.0) |
| 2022 | 100 | 7 (10.4) | 6 (9.0) | 9 (13.4) | 8 (11.9) | 13 (19.4) | 5 (7.5) | 19 (28.4) | 67 (100.0) |
| 2021 | 74 | 7 (9.3) | 8 (10.7) | 11 (14.7) | 8 (10.7) | 13 (17.3) | 8 (10.7) | 20 (26.7) | 75 (100.0) |
| Total | 294 | 25 (10.8) | 23 (9.9) | 30 (12.9) | 29 (12.5) | 38 (16.4) | 28 (12.1) | 59 (25.4) | 232 (100.0) |
Among the AI models, DeepSeek achieved the highest score with 85.0% of correct answers, followed by Copilot (55.4%) and Google Bard (48.0%) (Table 2). On average, pediatric surgical trainees answered 60.1% of questions correctly. When considering all questions, the overall differences in performance between DeepSeek, Copilot, Google Bard and pediatric surgical trainees were statistically significant (p < 0.0001) (Table 2). This was also the case when assessing examinations individually by year (p < 0.01). To delineate the factors driving these differences, performances were evaluated pairwise using Fisher's exact test with the Holm method to correct for multiple comparisons. The differences in performance between Copilot or Google Bard and all pediatric surgical trainees were not statistically significant (p > 0.05) (Table 3). In contrast, DeepSeek scored significantly higher (p < 0.0001, odds ratio (OR) [95% confidence interval (CI)] 3.76 [2.54–5.55]).
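For reference, the reported odds ratio follows directly from the counts in Table 2 via the cross-product ratio. A stdlib-only sketch with a Wald-type 95% CI (the study used GraphPad Prism, whose CI method may differ marginally from the interval below):

```python
from math import exp, log, sqrt

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 table [[a, b], [c, d]] with a Wald 95% CI.

    a/b = right/wrong for group 1, c/d = right/wrong for group 2.
    """
    or_ = (a * d) / (b * c)               # cross-product ratio
    se = sqrt(1/a + 1/b + 1/c + 1/d)      # standard error of log(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# DeepSeek (250 right / 44 wrong) vs. all trainees (177 right / 117 wrong)
or_, lo, hi = odds_ratio_wald_ci(250, 44, 177, 117)
print(round(or_, 2))  # 3.76
```

The point estimate reproduces the reported OR of 3.76; the Wald interval (about 2.53 to 5.58) is close to, but not identical with, the published 2.54–5.55.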
Table 2.
Comparison of overall responses between LLMs and all pediatric surgical trainees. Chi-square test for multiple comparisons
| Copilot, n (%) | Google Bard, n (%) | DeepSeek, n (%) | Pediatric surgeons, n (%) | p-value | |||||
|---|---|---|---|---|---|---|---|---|---|
| Right | Wrong | Right | Wrong | Right | Wrong | Right | Wrong | ||
| Total responses | 163 (55.4) | 131 (44.6) | 141 (48.0) | 153 (52.0) | 250 (85.0) | 44 (15.0) | 177 (60.1) | 117 (39.9) | **** |
| Examination year | |||||||||
| 2023 | 61 (50.8) | 59 (49.2) | 56 (46.7) | 64 (53.3) | 103 (85.8) | 17 (14.2) | 70 (58.3) | 50 (41.7) | **** |
| 2022 | 59 (59.0) | 41 (41.0) | 51 (51.0) | 49 (49.0) | 90 (90.0) | 10 (10.0) | 63 (63.0) | 37 (37.0) | **** |
| 2021 | 43 (58.1) | 31 (41.9) | 34 (45.9) | 40 (54.1) | 57 (77.0) | 17 (23.0) | 44 (59.3) | 30 (40.7) | ** |
**p < 0.01, ****p < 0.0001
Table 3.
Comparison of the performance of LLMs and pediatric surgical trainees by the year of training. Fisher’s exact test. P-values were adjusted for multiple comparisons using the Holm method
| Year of surgical training | ||||||||
|---|---|---|---|---|---|---|---|---|
| vs | Pediatric surgeons | 1st | 2nd | 3rd | 4th | 5th | 6th | > 6th |
| DeepSeek | **** | **** | **** | **** | **** | **** | **** | **** |
| Copilot | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Google Bard | 0.126 | 0.098 | 0.098 | 0.063 | 0.063 | 0.063 | * | 0.063 |
*p < 0.05, ****p < 0.0001
To address whether differences exist between years of surgical training, we further performed a subgroup analysis. Among all participants, pediatric surgical trainees in their 6th year of training performed best, with 191 out of 294 correct answers (65.0%) (Table 4). Pediatric surgical trainees in their 6th year outperformed Google Bard (p < 0.05), and their answers were twice as likely to be correct compared to Google Bard's (OR [95% CI] 2.01 [1.45–2.81]) (Table 3). DeepSeek outperformed pediatric surgical trainees regardless of year of training (Table 3). No statistically significant differences were found between pediatric surgeons and Copilot (GPT-4), regardless of the year of training.
Table 4.
Comparison of the overall responses by year of surgical training. Chi-square test for multiple comparisons
| Examination year | Year of surgical training, n (%) | p-value | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1st | 2nd | 3rd | 4th | 5th | 6th | > 6th | |||||||||
| Right | Wrong | Right | Wrong | Right | Wrong | Right | Wrong | Right | Wrong | Right | Wrong | Right | Wrong | ||
| 2023 | 65 (54.2) | 55 (45.8) | 64 (53.3) | 56 (46.7) | 73 (60.8) | 47 (39.2) | 71 (59.2) | 49 (40.8) | 72 (60.0) | 48 (40.0) | 72 (60.0) | 48 (40.0) | 72 (60.0) | 48 (40.0) | 0.83 |
| 2022 | 64 (64.0) | 36 (36.0) | 65 (65.0) | 35 (35.0) | 64 (64.0) | 36 (36.0) | 62 (62.0) | 38 (38.0) | 65 (65.0) | 35 (35.0) | 70 (70.0) | 30 (30.0) | 60 (60.0) | 40 (40.0) | 0.88 |
| 2021 | 40 (54.7) | 34 (45.3) | 44 (59.8) | 30 (40.2) | 44 (58.9) | 30 (41.1) | 47 (63.1) | 27 (36.9) | 40 (54.6) | 34 (45.4) | 49 (66.5) | 25 (33.5) | 44 (59.6) | 30 (40.4) | 0.72 |
| Total | 169 (57.6) | 125 (42.4) | 173 (58.8) | 121 (41.2) | 181 (61.5) | 113 (38.5) | 180 (61.3) | 114 (38.7) | 177 (60.3) | 117 (39.7) | 191 (65.0) | 103 (35.0) | 176 (60.0) | 118 (40.0) | 0.65 |
To identify areas where LLMs performed poorly, we analyzed performance by topic (e.g., oncological surgery) and by type of question (simple fact recall vs. complex analytical questions). Statistically significant differences were found between the AI models for questions about pathology, clinical history/data, general, neonatal and urogenital surgery, pediatric oncology, trauma and burn, as well as critical care (Table 5). Similarly, statistically significant differences in performance between the LLMs were found for both simple fact recall and complex questions (Fig. 2). When comparing results by topic or type of question to elucidate reasons for the poor performance of Google Bard, no statistically significant differences were found between Google Bard and pediatric surgeons by question type or topic (Supplementary Table 1).
Table 5.
Comparison of topic-specific responses between LLMs. Chi-square test for multiple comparisons
| Question category | Copilot, n (%) | Google Bard, n (%) | DeepSeek, n (%) | p-value | |||
|---|---|---|---|---|---|---|---|
| Right | Wrong | Right | Wrong | Right | Wrong | ||
| Anatomy, physiology and embryology | 19 (76.0) | 6 (24.0) | 17 (68.0) | 8 (32.0) | 23 (92.0) | 2 (8.0) | 0.11 |
| Pathology, microbiology, genetics and immunology | 17 (58.6) | 12 (41.4) | 17 (58.6) | 12 (41.4) | 26 (89.7) | 3 (10.3) | * |
| Clinical history, examination and operative technique | 20 (54.1) | 17 (45.9) | 20 (54.1) | 17 (45.9) | 31 (83.8) | 6 (16.2) | ** |
| Clinical data interpretation | 7 (36.8) | 12 (63.2) | 9 (47.4) | 10 (52.6) | 15 (78.9) | 4 (21.1) | * |
| Radiological investigations | 9 (52.9) | 8 (47.1) | 7 (41.2) | 10 (58.8) | 13 (76.5) | 4 (23.5) | 0.11 |
| General pediatric surgery | 10 (47.6) | 21 (52.4) | 12 (38.7) | 19 (61.3) | 24 (77.4) | 7 (22.6) | *** |
| Pediatric oncology | 16 (61.5) | 10 (38.5) | 11 (42.3) | 15 (57.7) | 21 (80.8) | 5 (19.2) | * |
| Neonatal surgery | 15 (51.7) | 14 (48.3) | 13 (44.8) | 16 (55.2) | 25 (86.2) | 4 (13.8) | ** |
| Genitourinary surgery | 11 (52.4) | 10 (47.6) | 6 (28.6) | 15 (71.4) | 20 (95.2) | 1 (4.8) | **** |
| Pediatric trauma and burn care | 14 (53.8) | 12 (46.2) | 10 (38.5) | 16 (61.5) | 21 (80.8) | 5 (19.2) | ** |
| Pediatric critical care | 12 (60.0) | 8 (40.0) | 7 (35.0) | 13 (65.0) | 17 (85.0) | 3 (15.0) | ** |
| Statistics, research, audit, history and ethics | 10 (71.4) | 4 (28.6) | 9 (64.3) | 5 (35.7) | 14 (100) | 0 (0) | 0.05 |
| Total | 160 (54.4) | 134 (45.6) | 138 (46.9) | 156 (53.1) | 250 (85.0) | 44 (15.0) | **** |
*p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001
Fig. 2.

Bar graph showing the percentages of correct answers by DeepSeek, Copilot and Google Bard according to the type of question. No significant differences between LLMs and type of question were found
Lastly, both Copilot and Google Bard failed to answer a minority of questions (Copilot: n = 10, 3.4%; Google Bard: n = 14, 4.7%). For Copilot, failure to answer occurred only when the LLM estimated that the right answer was not among the choices given (Fig. 3). For Google Bard, failure to answer occurred for the same reason in five questions, and due to ethical reasons in the remaining nine (Fig. 4). In the latter case, questions were related to outcome predictions or management of trauma patients (n = 3), complex diseases or genetic syndromes (e.g., management of necrotizing enterocolitis, n = 4) or child abuse (n = 2). Finally, when entering questions into DeepSeek in rapid succession, rate-limiting issues occurred, with prompts advising to return later, which significantly delayed data entry.
Fig. 3.
Example of a question that the LLM Copilot refused to answer due to a lack of correct answer options
Fig. 4.
Example of a question that the LLM Google Bard refused to answer due to ethical reasons
Discussion
In this study, we evaluated the performance of DeepSeek, Copilot powered by GPT-4 and Google Bard on the pediatric surgical EPSITE examination. To our knowledge, this is the first study to assess AI performance in a standardized examination on pediatric surgery. DeepSeek significantly outperformed both Copilot and Google Bard, as well as pediatric surgical trainees. Scores obtained by Copilot were comparable to those of pediatric surgical trainees. Google Bard scored lower, although the difference reached statistical significance only for sixth-year trainees, who outperformed Google Bard but not DeepSeek. Sixth-year trainees are generally at a level of training right before or after their pediatric surgical board examination, which likely explains this result.
Based on our results, caution should be taken when using LLMs as assistants in clinical practice, and responses need to be appraised critically. Only one of the LLMs achieved the passing grade of 75% required by the American Board of Surgery or the 59% required by Part 1 of the European Board Examination. The stronger performance of DeepSeek may reflect rapid progress in LLM development, as well as the benefits of architectural refinements and improved training strategies in newer models. Notably, we identified significant differences in model performance both by question type and by topic, with DeepSeek consistently outperforming the other LLMs across multiple pediatric surgery domains and in both simple fact recall and complex analytical reasoning. Additionally, Copilot and Google Bard failed to answer several questions that required ethical reasoning or for which they considered that the right answer was not listed among the response choices. While refusals were treated as incorrect in our analysis, we acknowledge this may penalize models designed with safety constraints. However, only a small number of questions were affected and, under examination conditions, selecting the best available answer remains essential. Open-source LLMs are trained on publicly available data and may incorporate outdated information. High-quality, peer-reviewed content often requires access to paid journal subscriptions or non-public materials such as textbooks. However, in highly specialized fields such as pediatric surgery, access to relevant medical literature is needed for field-specific training and improvement [9, 10]. This may partly explain the limited performance of Copilot and Google Bard at the time. In contrast, DeepSeek's strong performance highlights how future LLMs may achieve high accuracy even with fewer resources, by leveraging accumulated experience and streamlined training pipelines. Nevertheless, improvements are still needed across the board for LLMs to serve as reliable decision-making aids in clinical practice.
A strength of our study is the incorporation of data from human test takers representing a wide array of clinical experience and different countries of origin. Many of the questions involved clinical vignettes mimicking real-world situations. Additionally, the questions addressed a wide variety of topics, from neonatal to trauma, genitourinary and oncological surgery, thereby assessing diverse aspects of pediatric surgical practice. Other areas of pediatric surgical care such as orthopedic, cardiac or neurosurgery were not evaluated; however, these typically fall within the expertise of the corresponding adult surgeons and are less relevant to pediatric surgeons. Furthermore, the multiple-choice format allowed for objective assessment of the LLMs' performance. On the other hand, the closed question format might not adequately capture the extent of the well-known phenomenon of hallucination and confabulation, in which LLMs manufacture data, and which may be more readily encountered with open-ended questions. Another limitation of our study was our inability to incorporate questions containing images, even though assessing radiological images or visible signs and symptoms is a key component of diagnostics. Given that LLMs are constantly evolving and being refined, this limitation will likely be overcome in the future.
As the use of AI becomes increasingly prevalent in everyday life, it is bound to gain importance in the medical field as well. LLMs are periodically updated, and with additional training and acquisition of field-specific content, AI-based systems may soon become a reliable and omnipresent tool in various medical specialties, especially in diagnostics. Several research groups have trained AI systems to process and interpret large data sets from electronic health records using up to billions of data points [11, 12]. In a previous study, the authors extracted data from electronic health records of over 500,000 patients and 1,300,000 outpatient visits to train an AI model [11]. The model was applied to a large pediatric population and achieved high diagnostic accuracy not only for various organ systems, but also for diseases with a high morbidity such as bacterial meningitis. Importantly, it outperformed junior physicians but scored lower than senior physicians, suggesting that AI models may play a role in assisting physicians during their training. Furthermore, in the field of pediatric surgery, where many parts of the world are underserved, it may help physicians in clinical decision-making. Another area where LLMs could be instrumental in pediatric surgery is in the assessment of radiological or histopathological examinations. AI-based systems are already able to detect fractures in infants with an accuracy comparable to pediatric radiologists, but so far applications are limited to specific types of fractures and not broadly applicable [13–16]. Similarly, a recent study showed promising results in the diagnosis of Hirschsprung disease from histopathological specimens using a deep-learning approach [17]. This could prove invaluable, as histological diagnosis of Hirschsprung disease is challenging and requires a high level of expertise that is not always available.
However, annotation and segmentation of samples for training and validating algorithms are currently very time-consuming and require qualified personnel. Additionally, while the diagnostic accuracy for Hirschsprung disease was very high (92%), misclassification occurred in a number of cases, which for Hirschsprung disease can lead to significant clinical complications (e.g., ileus).
AI will likely also play a role in surgical education and help examiners and test takers alike, as it can generate questions and support exam preparation by providing the rationale for the correct answer and, in some cases (e.g., Copilot), even supplying accurate citations [18]. It may not only aid in acquiring and testing knowledge, but also enhance interpersonal skills and provide a safe environment if used to simulate doctor–patient interactions. Physicians will also benefit from the use of AI copilots for more mundane tasks such as notetaking or filling out forms [19].
Lastly, it is important to validate LLMs before widespread use and our study serves as a benchmark of DeepSeek, Copilot and Google Bard performance in the field of pediatric surgery. Future studies will not only assess performance but also use broader metrics and evaluate linguistic parameters, perform text network analysis and assess the self-correction capabilities of LLMs [3, 4].
Conclusion
For the first time, a comparative performance analysis of DeepSeek, Microsoft Copilot and Google Bard was carried out in the field of pediatric surgery. Our study revealed substantial shortcomings of Copilot and Google Bard. In contrast, DeepSeek achieved high overall accuracy, surpassing both the other LLMs and human trainees, reflecting the rapid progress of newer models. LLM performance varies between models and over time when answering standardized questions on pediatric surgical knowledge. However, caution must be taken when incorporating artificial intelligence into pediatric surgical decision-making, especially in the clinical setting, where patient well-being is at stake.
Supplementary Information
Below is the link to the electronic supplementary material.
Author contribution
ML and OA designed the study. RG, RW and OA performed the experiments. OA performed statistical analysis. SC, OM, ML designed the questions. RG and OA wrote the manuscript. ML provided substantial scientific input. All authors critically reviewed the manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Conflict of interest
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Richard Gnatzy and Martin Lacher have contributed equally.
References
- 1.González R, Poenaru D, Woo R, Trappey AF 3rd, Carter S, Darcy D et al (2024) ChatGPT: what every pediatric surgeon should know about its potential uses and pitfalls. J Pediatr Surg 59(5):941 [DOI] [PubMed] [Google Scholar]
- 2.Xiao D, Meyers P, Upperman JS, Robinson JR (2023) Revolutionizing healthcare with ChatGPT: an early exploration of an AI language model’s impact on medicine at large and its role in pediatric surgery. J Pediatr Surg 58(12):2410–2415 [DOI] [PubMed] [Google Scholar]
- 3.Wu S, Koo M, Blum L, Black A, Kao L, Fei Z et al (2024) Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI 1(2):AIdbp2300092 [Google Scholar]
- 4.Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, et al. (2023) Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 95 [DOI] [PMC free article] [PubMed]
- 5.Gnatzy R, Lacher M, Berger M, Boettcher M, Deffaa OJ, Kübler J, et al. (2025) Solving complex pediatric surgical case studies: a comparative analysis of Copilot, ChatGPT-4, and experienced pediatric surgeons' performance. Eur J Pediatr Surg [DOI] [PubMed]
- 6.OpenAI (2023) GPT-4 technical report
- 7.Google (2023) An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf [accessed March 7th, 2024]
- 8.Microsoft (2023) Introducing Microsoft 365 Copilot – your copilot for work. https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work/ [accessed March 7th, 2024]
- 9.Lv K, Yang Y, Liu T, Gao Q, Guo Q, Qiu X (2023) Full parameter fine-tuning for large language models with limited resources
- 10.Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. (2023) A Survey of Large Language Models
- 11.Liang H, Tsui BY, Ni H, Valentim CCS, Baxter SL, Liu G et al (2019) Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence. Nat Med 25(3):433–438 [DOI] [PubMed] [Google Scholar]
- 12.Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C et al (2022) A large language model for electronic health records. npj Digital Med 5(1):194 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Zech JR, Carotenuto G, Igbinoba Z, Tran CV, Insley E, Baccarella A et al (2023) Detecting pediatric wrist fractures using deep-learning-based object detection. Pediatr Radiol 53(6):1125–1134 [DOI] [PubMed] [Google Scholar]
- 14.Zech JR, Jaramillo D, Altosaar J, Popkin CA, Wong TT (2023) Artificial intelligence to identify fractures on pediatric and young adult upper extremity radiographs. Pediatr Radiol 53(12):2386–2397 [DOI] [PubMed] [Google Scholar]
- 15.Gasmi I, Calinghen A, Parienti JJ, Belloy F, Fohlen A, Pelage JP (2023) Comparison of diagnostic performance of a deep learning algorithm, emergency physicians, junior radiologists and senior radiologists in the detection of appendicular fractures in children. Pediatr Radiol 53(8):1675–1684 [DOI] [PubMed] [Google Scholar]
- 16.Ghosh A, Bose S, Patton D, Kumar I, Khalkhali V, Henry MK et al (2023) Deep learning-based prediction of rib fracture presence in frontal radiographs of children under two years of age: a proof-of-concept study. British J Radiol 96:1145 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Duci M, Magoni A, Santoro L, Dei Tos AP, Gamba P, Uccheddu F et al (2023) Enhancing diagnosis of Hirschsprung’s disease using deep learning from histological sections of post pull-through specimens: preliminary results. Pediatr Surg Int 40(1):12 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Choudhary OP, Saini J, Challana A (2023) ChatGPT for veterinary anatomy education: an overview of the prospects and drawbacks. Int J Morphol 41(4):1198–1202 [Google Scholar]
- 19.Wang J, Yang J, Zhang H, Lu H, Skreta M, Husić M et al (2022) PhenoPad: Building AI enabled note-taking interfaces for patient encounters. npj Digital Med 5(1):12 [DOI] [PMC free article] [PubMed] [Google Scholar]