Journal of Translational Medicine
Letter. 2024 Jun 1;22:527. doi: 10.1186/s12967-024-05302-8

Exploring the performance of ChatGPT on acute pancreatitis-related questions

Ren-Chun Du 1,#, Xing Liu 1,#, Yong-Kang Lai 1,2,3,#, Yu-Xin Hu 1, Hao Deng 4, Hui-Qiao Zhou 5, Nong-Hua Lu 1, Yin Zhu 1,✉,#, Yi Hu 1,6,✉,#
PMCID: PMC11143553  PMID: 38822323

Letter to the editor:

Acute pancreatitis (AP) is a serious gastrointestinal disease with an incidence of approximately 34 cases per 100,000 individuals annually, and with an aging population the overall burden of AP remains high [1]. There is a notable trend among the public to seek AP-related information to improve awareness.

Large language models (LLMs), a branch of artificial intelligence (AI), can provide updated and useful information. The Chat Generative Pre-trained Transformer (ChatGPT, https://openai.com), developed by OpenAI and launched on November 30, 2022, stands out in this field, and various studies have explored its utility in answering medical questions. This study aimed to evaluate and compare the capabilities of ChatGPT-3.5 and ChatGPT-4.0 in answering test questions about AP, employing both subjective and objective metrics.

Methods

As shown in Table S1, we conducted our study using 18 subjective test questions derived from the Atlanta AP classification consensus and the American Gastroenterological Association (AGA) guidelines (strength of recommendation: strong) [2–4]. Additionally, we selected the 73 most frequently tested objective questions from the Chinese professional physician test database, categorizing them into four subfields (Table S2). These questions were submitted to ChatGPT in two separate sessions, on February 1, 2024, and February 8, 2024. Two independent reviewers evaluated the responses to the subjective questions using a 5-point Likert scale, and any discordance was resolved by a third author. The overall study design is presented in Figure S1. Response accuracy was analyzed using the Chi-squared and Mann–Whitney U tests, with a P-value of < 0.05 indicating statistical significance.
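As a rough illustration of the Chi-squared comparison of accuracy rates, the Python sketch below tests a 2×2 contingency table with only the standard library. The counts (roughly 100/146 vs. 114/146 correct responses over two runs) are back-calculated assumptions from the reported total accuracies, not the authors' exact tabulation, so this is not expected to reproduce the reported P = 0.01.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no Yates correction) for a 2x2 table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 df."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, r, col in ((a, row1, col1), (b, row1, col2),
                        (c, row2, col1), (d, row2, col2)):
        exp = r * col / n                  # expected count under independence
        stat += (obs - exp) ** 2 / exp
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical counts implied by the reported accuracies (146 responses per model):
# ChatGPT-3.5 ~100 correct vs. ChatGPT-4.0 ~114 correct.
stat, p = chi2_2x2(100, 46, 114, 32)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

In practice `scipy.stats.chi2_contingency` would be used for this; the hand-rolled version is shown only to keep the sketch dependency-free.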

Results

As shown in Table 1, ChatGPT-3.5 correctly answered 80% of subjective questions, while ChatGPT-4.0 achieved an accuracy rate of 94%. For objective questions, ChatGPT-4.0 outperformed ChatGPT-3.5, with a 78.1% accuracy rate compared to 68.5% (P = 0.01) (Figure S2A). Across all questions tested in the study, the concordance rates between the two runs were 80.8% for ChatGPT-3.5 and 83.6% for ChatGPT-4.0 (Figure S2B), and the mean number of words per response was 218.5 for ChatGPT-3.5 and 246.0 for ChatGPT-4.0 (Table 1). Notably, correct answers showed higher concordance rates than incorrect ones across both versions of ChatGPT (95.7% and 91.1% vs. 55.6% and 58.8%) (Table 2). In addition, both ChatGPT-3.5 and ChatGPT-4.0 demonstrated high accuracy rates in the etiology category.
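The concordance figures in Table 2 follow from simple counting. The sketch below reconstructs the per-question pattern for ChatGPT-3.5 from the reported counts (46/73 correct in the first run, 56/73 in the second, 44 concordant-correct and 15 concordant-incorrect) and recovers the per-run accuracies and the total concordance; the ordering of questions is an assumption, but the counts are the paper's.

```python
# Reconstructing the ChatGPT-3.5 run-to-run concordance arithmetic
# from the counts reported in Table 2.
N = 73
run1 = [True] * 46 + [False] * 27        # run 1: 46 correct, 27 incorrect
run2 = ([True] * 44 + [False] * 2        # 44 of run 1's 46 stay correct
        + [True] * 12 + [False] * 15)    # 12 of run 1's 27 flip to correct

concordant = sum(a == b for a, b in zip(run1, run2))
print(f"run 1 accuracy:    {sum(run1) / N:.1%}")   # 63.0%
print(f"run 2 accuracy:    {sum(run2) / N:.1%}")   # 76.7%
print(f"total concordance: {concordant / N:.1%}")  # 80.8%
```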

Table 1.

Quality indicators (scientific adequacy) for answers from ChatGPT version 3.5 and 4.0

Common questions | Source of answers | Words (response 1, response 2) | Mean grade ("The answers are scientifically adequate") | P-value
All (Mean) ChatGPT 3.5 255 237 4 < 0.01
ChatGPT 4.0 202 235 4.7
Basic knowledge
 What are the types of acute pancreatitis? ChatGPT 3.5 138 167 5 1
ChatGPT 4.0 165 164 5
 What is the difference between mild, moderate and severe acute pancreatitis? ChatGPT 3.5 218 225 5 1
ChatGPT 4.0 316 284 5
 How are the two distinct phases of acute pancreatitis identified? ChatGPT 3.5 213 172 4 0.029
ChatGPT 4.0 246 175 5
Diagnosis
 What are the diagnostic criteria for acute pancreatitis? ChatGPT 3.5 169 185 4 0.029
ChatGPT 4.0 123 146 5
 What are the common symptoms of acute pancreatitis? ChatGPT 3.5 190 156 5 1
ChatGPT 4.0 183 186 5
 What are the signs of systemic inflammatory response syndrome (SIRS) in acute pancreatitis? ChatGPT 3.5 191 165 3 0.029
ChatGPT 4.0 104 166 4.5
Treatment
 What is the initial management of acute pancreatitis? ChatGPT 3.5 317 276 4 0.686
ChatGPT 4.0 263 337 4.25
 When should patients with acute pancreatitis combined with acute cholangitis receive Endoscopic Retrograde Cholangiopancreatography (ERCP)? ChatGPT 3.5 286 299 3 0.343
ChatGPT 4.0 137 160 4
 What should be done for an acute pancreatitis extrapancreatic infection, such as cholangitis, catheter-acquired infections, bacteremia, urinary tract infections, and pneumonia? ChatGPT 3.5 319 269 4 0.686
ChatGPT 4.0 331 339 4.25
 Should prophylactic antibiotics be routinely used in patients with severe acute pancreatitis? ChatGPT 3.5 306 240 3.5 0.029
ChatGPT 4.0 139 203 5
 Should patients with sterile necrosis use antibiotics to prevent the development of infected necrosis? ChatGPT 3.5 322 225 2 0.029
ChatGPT 4.0 165 189 5
 Should patients with mild acute pancreatitis found to have gallstones in the gallbladder receive cholecystectomy before discharge? ChatGPT 3.5 255 286 3 0.029
ChatGPT 4.0 190 204 4
 Does the presence of asymptomatic pseudocysts and pancreatic and / or extrapancreatic necrosis require intervention? ChatGPT 3.5 284 261 3.5 0.343
ChatGPT 4.0 201 369 4
 Is maintaining enteral nutrition thought to be helpful for patients with acute pancreatitis? ChatGPT 3.5 318 300 5 1
ChatGPT 4.0 143 221 5
Prevention
 What are the most common causes of acute pancreatitis? ChatGPT 3.5 194 221 5 1
ChatGPT 4.0 171 250 5
 What are the well-studied interventions for patients undergoing a therapeutic ERCP to decrease the risk of post-ERCP pancreatitis, especially severe disease? ChatGPT 3.5 331 322 4 0.343
ChatGPT 4.0 220 342 4.5
Others
 What is the definition of idiopathic acute pancreatitis? ChatGPT 3.5 174 208 5 1
ChatGPT 4.0 190 170 5
 Do we need more evidence to optimize the management of acute pancreatitis? ChatGPT 3.5 371 286 4 1
ChatGPT 4.0 342 331 4

Table 2.

Performance of ChatGPT 3.5, ChatGPT 4.0 and medical college examinees on acute pancreatitis test questions and by different subfields

Test questions by subfields ChatGPT 3.5 ChatGPT 4.0 Examinees
Correct Incorrect Correct Incorrect
All test questions, No. 73 73
 1st run, No. (%) 46 (63.0) 27 (37.0) 56 (76.7) 17 (23.3)
 2nd run, No. (%) 56 (76.7) 17 (23.3) 58 (79.5) 15 (20.5)
 Concordance between 2 runs, No. (%) 44 (95.7) 15 (55.6) 51 (91.1) 10 (58.8)
 Total concordance, No. (%) 59 (80.8) 61 (83.6)
 Total accuracy (%) 68.5 78.1 72.4
Diagnosis, No. 43 43
 1st run, No. (%) 26 (60.5) 17 (39.5) 34 (79.1) 9 (20.9)
 2nd run, No. (%) 32 (74.4) 11 (25.6) 35 (81.4) 8 (18.6)
 Concordance between 2 runs, No. (%) 26 (100) 11 (64.7) 31 (91.2) 5 (66.7)
 Total concordance, No. (%) 37 (86.0) 36 (83.7)
 Total accuracy (%) 65.1 80.2 75.3
Clinical feature, No. 9 9
 1st run, No. (%) 4 (44.4) 5 (55.6) 7 (77.8) 2 (22.2)
 2nd run, No. (%) 6 (66.7) 3 (33.3) 6 (66.7) 3 (33.3)
 Concordance between 2 runs, No. (%) 3 (75.0) 2 (40.0) 6 (85.7) 2 (100.0)
 Total concordance, No. (%) 5 (55.6) 8 (88.9)
 Total accuracy (%) 55.6 72.2 60.0
Treatment, No. 12 12
 1st run, No. (%) 7 (58.3) 5 (41.7) 8 (58.3) 4 (41.7)
 2nd run, No. (%) 9 (75.0) 3 (25.0) 9 (75.0) 3 (25.0)
 Concordance between 2 runs, No. (%) 6 (85.7) 2 (40.0) 7 (87.5) 2 (50.0)
 Total concordance, No. (%) 8 (66.7) 9 (75.0)
 Total accuracy (%) 66.7 70.8 69.1
Etiology, No. 9 9
 1st run, No. (%) 9 (100.0) 0 7 (77.8) 2 (22.2)
 2nd run, No. (%) 9 (100.0) 0 8 (88.9) 1 (11.1)
 Concordance between 2 runs, No. (%) 9 (100.0) 0 7 (100.0) 1 (50.0)
 Total concordance, No. (%) 9 (100) 9 (100)
 Total accuracy (%) 100 83.3 75.2

Discussion

Our findings indicate that ChatGPT-4.0 outperformed ChatGPT-3.5 in answering both subjective and objective test questions related to AP, demonstrating superior total accuracy. The accuracy of both ChatGPT-3.5 and the examinees on clinical-feature test questions was generally low, suggesting that the clinical features associated with AP are complex and often involve numerous complications, which makes identifying the optimal answer challenging.

In addressing subjective questions, ChatGPT tends to provide a range of answers, mixing relevant with irrelevant information, which makes it challenging for healthcare professionals and patients to discern the most accurate answer. This may explain the lower accuracy rate for objective multiple-choice questions compared with subjective ones. ChatGPT-4.0, however, showed improvement in providing more precise, concise, and focused answers.

Although ChatGPT answered most subjective questions correctly, the reference answers were based on earlier guideline evidence. A significant limitation of artificial intelligence is its inability to update information in real time. Recent randomized controlled trials in AP have presented evidence that questions existing management strategies, such as the use of antibiotics, fluid resuscitation, the handling of infected necrosis, and the early application of ERCP [5]. It is imperative to reevaluate current management guidelines to ensure they reflect the latest evidence.

This study has several limitations. Firstly, although we conducted two separate evaluations, the results might be influenced by the timing of the ChatGPT assessments. Secondly, we did not incorporate patient perspectives, which are crucial, as patients are the ultimate recipients of AP-related information. Thirdly, the examinees were medical students, and we lacked data from practicing doctors.

In conclusion, ChatGPT-4.0 exhibited superior performance compared to ChatGPT-3.5. However, both versions tended to provide broad, generalized answers across various topics rather than offering optimal solutions. ChatGPT therefore excels at addressing subjective questions and presenting a wide range of options, but it is not yet suitable for providing optimal management strategies, nor can it adjust treatment plans based on the latest evidence; enhancements in training are required.

Supplementary Information

12967_2024_5302_MOESM1_ESM.jpg (1,021.5KB, jpg)

Supplementary Material 1: Figure S1: Flowchart of overall study design

12967_2024_5302_MOESM2_ESM.jpg (938.7KB, jpg)

Supplementary Material 2: Figure S2: Comparison of accuracy of ChatGPT-4.0, ChatGPT-3.5 and examinees on acute pancreatitis test objective questions (A); Comparison of concordance of ChatGPT-4.0, ChatGPT-3.5 on acute pancreatitis test objective questions (B)

Supplementary Material 3. (74.6KB, docx)
Supplementary Material 4. (38.6KB, docx)

Acknowledgements

Not applicable.

Author contributions

Ren-Chun Du, Xing Liu and Yong-Kang Lai performed the statistical analysis and wrote the manuscript. Yu-Xin Hu, Hao Deng and Hui-Qiao Zhou collected the data. Yin Zhu and Yi Hu designed the study. Nong-Hua Lu, Yin Zhu and Yi Hu revised the manuscript. All authors contributed to the article and approved the final manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (NO. 82000531, 82360118 and 82170580); the Project for Academic and Technical Leaders of Major Disciplines in Jiangxi Province (NO. 20212BCJL23065); the Key Research and Development Program of Jiangxi Province (NO. 20212BBG73018).

Availability of data and materials

Not applicable.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Ren-Chun Du, Xing Liu and Yong-Kang Lai contributed equally to this work and shared first authorship.

Yin Zhu and Yi Hu contributed equally to this work and shared last authorship.

Contributor Information

Yin Zhu, Email: ndyfy01977@ncu.edu.cn.

Yi Hu, Email: ndyfy06202@ncu.edu.cn.

References

  • 1. Petrov MS, Yadav D. Global epidemiology and holistic prevention of pancreatitis. Nat Rev Gastroenterol Hepatol. 2019;16(3):175–184. doi: 10.1038/s41575-018-0087-5.
  • 2. Banks PA, et al. Classification of acute pancreatitis–2012: revision of the Atlanta classification and definitions by international consensus. Gut. 2013;62(1):102–111. doi: 10.1136/gutjnl-2012-302779.
  • 3. Baron TH, et al. American Gastroenterological Association clinical practice update: management of pancreatic necrosis. Gastroenterology. 2020;158(1):67–75.e1. doi: 10.1053/j.gastro.2019.07.064.
  • 4. Crockett SD, et al. American Gastroenterological Association Institute guideline on initial management of acute pancreatitis. Gastroenterology. 2018;154(4):1096–1101. doi: 10.1053/j.gastro.2018.01.032.
  • 5. de Madaria E, Buxbaum JL. Advances in the management of acute pancreatitis. Nat Rev Gastroenterol Hepatol. 2023;20(11):691–692. doi: 10.1038/s41575-023-00808-w.


