Abstract
Objectives:
Every year, around 300 million surgeries are conducted worldwide, with an estimated 4.2 million deaths occurring within 30 days after surgery. Adequate patient education is crucial, but often falls short due to the stress patients experience before surgery. Large language models (LLMs) can significantly enhance this process by delivering thorough information and addressing patient concerns that might otherwise go unnoticed.
Material and methods:
This cross-sectional study evaluated Chat Generative Pretrained Transformer-4o’s audio-based responses to frequently asked questions (FAQs) regarding six general surgical procedures. Three experienced surgeons and two senior residents formulated seven general and three procedure-specific FAQs for both preoperative and postoperative situations, covering six surgical scenarios (major: pancreatic head resection, rectal resection, total gastrectomy; minor: cholecystectomy, Lichtenstein procedure, hemithyroidectomy). In total, 120 audio responses were generated, transcribed, and assessed by 11 surgeons from 6 different German university hospitals.
Results:
ChatGPT-4o demonstrated strong performance, achieving an average score of 4.12/5 for accuracy, 4.46/5 for relevance, and 0.22/4 for potential harm across 120 questions. Postoperative responses surpassed preoperative ones in both accuracy and relevance, while also exhibiting lower potential for harm. Additionally, responses related to minor surgeries were marginally yet significantly more accurate than those for major surgeries.
Conclusions:
This study underscores GPT-4o’s potential to enhance patient education both before and after surgery by delivering accurate and relevant responses to FAQs about various surgical procedures. Responses regarding the postoperative course proved to be more accurate and less harmful than those addressing the preoperative period. Although a few responses carried moderate risks, the overall performance was robust, indicating GPT-4o’s value in patient education. The study suggests the development of hospital-specific applications or the integration of GPT-4o into interactive robotic systems to provide patients with reliable, immediate answers, thereby improving patient satisfaction and informed decision-making.
Keywords: ChatGPT; AI; ChatGPT-4o; GPT-4o; large language models; LLM; patient education; surgery
Highlights
GPT-4o answered 96% of surgical FAQs accurately, with minimal risk of harm.
GPT-4o performed better on postoperative care FAQs than preoperative ones.
GPT-4o's accuracy was slightly higher for minor surgeries than for major surgeries.
GPT-4o could improve surgical patient education through integration in healthcare.
Introduction
Each year, approximately 300 million surgical procedures are performed worldwide, with an estimated 4.2 million people dying within 30 days post-surgery[1,2]. For example, despite nationwide centralization, the postoperative mortality rate following pancreatoduodenectomy remains at 6.3% in medium-volume hospitals and 3.3% in high-volume hospitals[2]. Although minor elective surgeries, such as groin hernia repairs, have very low in-hospital mortality rates, they can still result in significant short- and long-term postoperative complications, including chronic pain and hernia recurrence[3,4]. Therefore, obtaining informed consent from patients undergoing any surgical procedure is essential[5]. This process should not only serve as a legal protection for healthcare professionals but also as a comprehensive approach to informing and educating patients about the surgery, potential complications, and postoperative care.
One promising tool for enhancing patient education is the use of large language model (LLM)-based chatbots, such as the Chat Generative Pretrained Transformer (ChatGPT), developed by OpenAI[6]. Since its release in late 2022, ChatGPT has been extensively utilized for scientific writing[7,8]. Additionally, ChatGPT has demonstrated its potential in clinical decision-making by providing useful treatment recommendations for fictional breast cancer scenarios[9,10]. Recently, we assessed the clinical decision-making capabilities of ChatGPT-3.5 in multidisciplinary gastrointestinal tumor board cases[11]. While ChatGPT-3.5 encountered difficulties in accurately recommending individualized therapeutic concepts, it showed the greatest alignment with the tumor board’s recommendations in cases where surgery was advised[11]. Moreover, in a cross-sectional study conducted last year, responses of ChatGPT-3.5 to 195 patient questions from a social media forum were compared to verified physician recommendations[12]. Intriguingly, ChatGPT’s responses surpassed the physicians’ answers in terms of both quality and empathy[12]. Another study, which evaluated 115 radiation oncology questions sourced from professional society websites, found that ChatGPT’s responses were equally or more accurate than expert responses in 108 (94%) cases[13]. Furthermore, only 2 of the 115 responses were deemed to have potential for harm[13]. In a similar study, ChatGPT-3.5 was challenged with generating informed consent for six different surgical scenarios[14]. ChatGPT-3.5 received significantly higher ratings than surgeon-generated informed consent for its descriptions of the benefits of surgery and the available alternative options[14]. In addition, in terms of outlining the risks of surgery, ChatGPT-generated consents did not differ significantly from those produced by surgeons[14].
Despite these promising results from earlier versions of ChatGPT, there are currently no studies evaluating the utility of the latest version, GPT-4o (“o” for “omni”), in patient education. GPT-4o is notable for its ability to respond to audio inputs with a speed comparable to human conversational response times[6]. According to an analysis conducted by OpenAI itself, GPT-4o surpassed its predecessors, including GPT-4 and ChatGPT-3.5, in text evaluation. In automatic speech recognition, GPT-4o also outperformed Whisper v3, which is designed to transcribe and translate spoken language into text across a variety of languages. Therefore, we aimed to evaluate the accuracy, relevance, and potential harm of GPT-4o’s responses to audio inputs regarding six different surgical procedures.
Methods
This cross-sectional study evaluates the responses generated by ChatGPT-4o from audio input of FAQs about six different general surgery procedures (Fig. 1A). Because no patient data were used and there was no direct interaction with patients, an ethics application was not required. The study was conducted from May 2024 to August 2024 and has been reported in line with the Strengthening the Reporting of Cohort, Cross-sectional and Case-control Studies in Surgery (STROCSS) criteria[15].
Figure 1.
(A) Schematic representation of workflow. (B) Boxplot depicting the 11 experts’ mean ratings of ChatGPT’s responses for accuracy, relevance, and potential harm. (C) Comparison of ChatGPT’s pre- and post-surgery responses. (D) Comparison of ChatGPT’s responses in minor and major surgeries. (E) Comparison of ChatGPT’s responses to common and surgery-specific questions.
Three experienced surgeons (M.I., B.W.R., J.W.) and two senior residents (U.A., A.Z.) developed seven common frequently asked questions (FAQs) for both preoperative and postoperative scenarios, which were applied to six different surgical scenarios. Preoperative FAQs were defined as questions that patients might ask before undergoing surgery, focusing on the procedure itself and potential complications. In contrast, postoperative FAQs were categorized as questions related to postoperative care and recovery. Additionally, three preoperative and three postoperative surgery-specific FAQs were created for each surgical procedure. A total of six operations were examined and categorized into major and minor surgeries. Major surgeries included open pancreatic head resection, laparoscopic rectal resection, and open total gastrectomy. Minor surgeries included laparoscopic cholecystectomy, the Lichtenstein procedure, and hemithyroidectomy.
Each of the 20 FAQs (7 preoperative common, 7 postoperative common, 3 preoperative surgery-specific, and 3 postoperative surgery-specific) was entered into ChatGPT-4o via audio input for each of the six surgery scenarios, resulting in 120 responses. The audio responses were automatically transcribed into text by ChatGPT itself. The responses were then exported into a table format and reviewed by 11 experienced surgeons from six different university hospitals in Germany (Supplementary Material Table 1, http://links.lww.com/JS9/D796). The reviewers’ characteristics are provided in detail in Supplementary Material Table 2, http://links.lww.com/JS9/D796.
The responses were assessed based on their factual accuracy, relevance, and potential harm. A five-point Likert scale was employed to measure factual correctness and relevance, with ratings defined as follows: 1 (very inaccurate/irrelevant), 2 (inaccurate/irrelevant), 3 (somewhat accurate/somewhat relevant), 4 (accurate/relevant), and 5 (very accurate/very relevant). Potential harm was also assessed using a five-point Likert scale, with degrees defined as follows: 0 (“not at all”), 1 (“slightly”), 2 (“moderately”), 3 (“very”), and 4 (“extremely”).
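To illustrate how these per-reviewer Likert ratings aggregate into the mean scores reported in the Results, the following minimal Python sketch computes mean scores for a single response; the individual ratings shown are hypothetical and not taken from the study data.

```python
# Illustrative aggregation of Likert ratings across 11 reviewers for one response.
# All ratings are hypothetical placeholders, not study data.
from statistics import mean

ratings = {
    "accuracy":  [4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 5],  # 1-5 scale
    "relevance": [5, 4, 5, 5, 4, 5, 4, 5, 5, 4, 5],  # 1-5 scale
    "harm":      [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],  # 0-4 scale
}

# Mean score per dimension, rounded to two decimals as in the paper's figures.
mean_scores = {dim: round(mean(scores), 2) for dim, scores in ratings.items()}
print(mean_scores)
```

Repeating this aggregation for all 120 responses yields the per-question means visualized in the heatmap of Figure 3.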
We used the length of each answer as a metric for comprehensiveness, quantified as the number of words per response, and compared this metric across the defined groups of interest.
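The word count metric can be sketched as simple whitespace-based counting; the responses below are invented placeholders, not actual GPT-4o output, and the grouping keys are illustrative.

```python
# Hedged sketch: word count as a proxy for comprehensiveness.
# The response texts are invented examples, not GPT-4o output.
responses = {
    ("preoperative", "common"): "You will receive general anesthesia for this procedure.",
    ("postoperative", "common"): "After surgery you can usually drink the same day and eat light meals the next day.",
}

# Count whitespace-separated words for each response.
word_counts = {key: len(text.split()) for key, text in responses.items()}
print(word_counts)
```

Word counts computed this way per response can then be compared between groups (preoperative vs. postoperative, minor vs. major) as in Figure 5.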
Statistical analysis
All statistical analyses and visualizations were performed in RStudio with R 4.2.2. Statistical significance for multiple comparisons was determined with the Kruskal–Wallis test. Comparisons among groups were performed for the accuracy, relevance, and potential harm of ChatGPT-4o responses as evaluated by 11 experienced surgeons. Additional relevant information is provided in the figure legends. The heatmap was generated with the ComplexHeatmap package[16].
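For readers less familiar with the test, the Kruskal–Wallis H statistic underlying these group comparisons can be sketched in plain Python. The study's analysis was run in R, not with this code, and the scores below are hypothetical; this version omits the tie correction and assumes distinct values.

```python
# Hedged sketch of the Kruskal-Wallis H statistic (no tie correction).
# Input scores are hypothetical, not study data.
def kruskal_wallis_h(*groups):
    """H statistic for k independent groups, assuming all values are distinct."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # rank 1..N over pooled data
    n = len(pooled)
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), with R_i the rank sum of group i
    return 12 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

preop = [3.9, 4.1, 4.0, 4.2]   # hypothetical preoperative accuracy means
postop = [4.5, 4.6, 4.4, 4.7]  # hypothetical postoperative accuracy means
h = kruskal_wallis_h(preop, postop)
print(round(h, 2))
```

The resulting H statistic is compared against a chi-squared distribution with k − 1 degrees of freedom to obtain the P values reported in the Results; in practice R's `kruskal.test` handles ties and the P value computation directly.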
Results
Overall performance of GPT-4o
Overall, ChatGPT-4o demonstrated strong performance, achieving a mean Likert score of 4.12/5 for accuracy, 4.46/5 for relevance, and 0.22/4 for potential harm across all 120 questions (Fig. 1B). Responses to postoperative FAQs were rated significantly higher than preoperative ones in terms of accuracy (4.52 vs. 4.26, P < 0.001) and relevance (4.52 vs. 4.41, P = 0.009), while potential harm was significantly lower (0.18 vs. 0.23, P < 0.001) (Fig. 1C). Interestingly, responses concerning minor surgeries were rated slightly more accurate than those related to major surgeries (4.49 vs. 4.43, P = 0.03), although there were no significant differences in terms of relevance and potential harm (Fig. 1D). Additionally, no significant differences were observed between responses to common FAQs and surgery-specific FAQs (Fig. 1E).
Performance across surgical types
As illustrated in Figure 2, ChatGPT-4o’s performance across all six types of surgeries was consistently high, with mean accuracy scores exceeding 4.3, indicating responses that ranged from “accurate” to “very accurate.” The highest accuracy was achieved for responses related to hemithyroidectomy, with a mean score of 4.6. Conversely, the lowest accuracy was observed for laparoscopic rectal resection, with a mean score of 4.3 (Fig. 2). In terms of relevance, ChatGPT-4o also performed exceptionally well across all scenarios, with mean scores above 4.3, indicating responses rated between “relevant” and “very relevant.” The highest relevance was observed for open total gastrectomy, while the lowest was for laparoscopic rectal resection (Fig. 2). Potential harm scores were notably low, with all means well below 1, indicating “minimal potential harm.” The highest mean potential harm was recorded for laparoscopic cholecystectomy, whereas the lowest was observed for hemithyroidectomy (Fig. 2).
Figure 2.
ChatGPT’s response to questions in six different surgeries. Mean values are provided in the boxplots.
Analysis of individual questions
Next, we assessed ChatGPT’s performance for each individual question across surgeries. As seen in the heatmap (Fig. 3), the majority of responses (85%) received scores above 4.0 for accuracy. Only one response received a mean score below 3.0 (Supplementary Material Table 3, http://links.lww.com/JS9/D796). This question pertained to laparoscopic cholecystectomy: “Will the stent in the bile duct also be removed during the surgery?” and received a mean score of 2.18 for accuracy. The remaining responses received mean values between 3.0 and 4.0, indicating a range of accuracy from “somewhat accurate” to “accurate.” Intriguingly, of the 120 graded responses, 116 received a mean value equal to or above 4.0 for relevance (Fig. 3). The remaining four responses received mean values between 3.0 and 4.0, with the lowest observed at 3.27 for the same question on laparoscopic cholecystectomy, which also had the lowest mean for accuracy. Of the 120 responses analyzed, 115 produced mean potential harm scores below 1.0. Only 5 questions had mean scores exceeding 1.0, with the highest recorded at 1.73, which remains below the threshold for moderate harm.
Figure 3.
Heatmap displaying the mean score for each question in each surgery for accuracy, relevance, and potential harm (from left to right). Gray cells have no values for a given question. Bar plots display row-wise mean scores for the questions.
Subgroup analysis of FAQs
In the subgroup analysis, responses to postoperative FAQs showed significantly higher mean accuracy than preoperative responses in both the minor and major surgery subgroups (Fig. 4A). There were no significant differences in responses to preoperative questions between major and minor surgery types (minor preoperative vs. major preoperative). Similarly, comparison of responses to postoperative questions between major and minor surgery types showed no significant difference (minor postoperative vs. major postoperative) (Fig. 4A). Regarding the grading of relevance, the only significant difference was observed in the minor surgery group, where responses to postoperative FAQs had a slightly higher mean value than preoperative questions (Fig. 4B). The mean level of potential harm in responses to postoperative FAQs was significantly lower than in responses to preoperative questions (Fig. 4C). There were no significant differences between either the preoperative or postoperative FAQs of the minor and major subgroups.
Figure 4.
(A–C) Graphs displaying comparison of ChatGPT’s responses (in terms of the accuracy, relevance, and potential harm) in minor vs. major surgeries. (D–F) Comparison of ChatGPT’s response in minor vs. major surgeries for common vs. surgery-specific questions. (G–I) ChatGPT’s response to common and case-specific questions in preoperative and postoperative settings.
Interestingly, responses to common FAQs were graded significantly better in the minor surgery group than in the major surgery group (Fig. 4D). However, there were no significant differences between the surgery-specific responses of minor and major surgeries. Additionally, neither relevance nor potential harm differed significantly between responses to common and surgery-specific FAQs (Fig. 4E, F).
Accuracy ratings for responses to common FAQs were also significantly higher in the postoperative context (Fig. 4G). However, this was not reflected in the grading of relevance in this subgroup (Fig. 4H). Lastly, responses to common postoperative FAQs showed a significantly lower mean potential harm than common preoperative FAQs (Fig. 4I).
Word count analysis
The analysis of word count in GPT-4o-generated responses revealed significant variation between postoperative and preoperative FAQs. As depicted in Figure 5A, responses to postoperative FAQs were substantially longer than those to preoperative FAQs. Additionally, responses pertaining to major surgeries exhibited a significantly higher word count (Fig. 5A). A more detailed examination revealed that responses to common FAQs were notably more extensive in postoperative contexts (Fig. 5A), while the word count for case-specific responses did not differ significantly between preoperative and postoperative scenarios (Fig. 5B). Subgroup analysis further demonstrated that case-specific responses were significantly longer than responses to common FAQs, regardless of whether the context involved major or minor surgeries (Fig. 5C).
Figure 5.
(A) Boxplot showing word counts for preoperative and postoperative answers in minor and major surgeries. (B) Number of words for common and case-specific answers in preoperative and postoperative settings. (C) Number of words for common and case-specific answers in minor and major surgeries. Number of words per answer is provided in Supplementary Material Table 4, http://links.lww.com/JS9/D796.
Discussion
Patient education is a critical component of preoperative preparation, but it is frequently insufficient[17]. This problem is often attributed to emotional factors, as the stress related to surgery presents a major challenge during the preparatory phase[18,19]. Due to anxiety and the limited time available before surgery, patients may not have the opportunity to address all their concerns or ask every detailed question. LLMs have the potential to mitigate this issue by providing thorough information and addressing patient questions that might otherwise go unanswered[20–22]. To the best of our knowledge, this study is the first to evaluate GPT-4o’s responses to audio input of FAQs related to different abdominal and general surgical procedures. We demonstrate that GPT-4o can generate highly accurate and relevant answers to FAQs regarding various surgical scenarios, with minimal potential for harm.
One of the key findings of this study is that GPT-4o accurately answered 96% of surgical FAQs, with minimal risk of harm. Notably, GPT-4o demonstrated significantly higher accuracy and relevance in responses to postoperative FAQs compared to preoperative ones, with a considerably lower risk of potential harm. This improvement may be due to the greater complexity of preoperative FAQs, which frequently cover topics such as blood transfusion requirements, nasogastric tubes, and intricate surgical complications. Furthermore, the higher potential for harm in preoperative responses is linked to the sensitivity of these FAQs; misinformation during the preoperative phase can lead to more serious consequences than similar errors encountered postoperatively. A recent study highlighted ChatGPT-4’s limitations in delivering actionable postoperative instructions to patients following gynecological surgery[22]. We believe this limitation was likely due to the simplicity of the input, which involved a single question. Our latest research indicates that ChatGPT performs better in generating case-specific answers when provided with more detailed information[11]. Consequently, instead of soliciting general postoperative instructions with a single input, we recommend breaking this information into several more specific FAQs.
While postoperative responses did receive higher scores for accuracy and relevance, this does not suggest a fundamental limitation in the preoperative context. In our study, the accuracy of responses to preoperative FAQs received a mean rating of 4.26 on the Likert scale, indicating that the information provided was generally “accurate” to “very accurate.” Similarly, the relevance of preoperative responses was rated at an average of 4.41, signifying that the content was considered “relevant” to “very relevant.” Moreover, the potential harm for preoperative responses averaged below 1.0 on the Likert scale, meaning that the responses were considered to pose minimal risk, with no significant harm. This suggests that, although postoperative responses scored higher, the preoperative responses were still regarded as safe and effective.
Among the 120 answers to FAQs, only one response was rated as inaccurate. This involved a question regarding laparoscopic cholecystectomy: “Will the stent in the bile duct also be removed during the surgery?” This response also had the highest mean value for potential harm. Considering that one of the serious complications of laparoscopic cholecystectomy is leakage from the cystic duct stump, the bile duct stent should not be removed during the surgery, but rather after the patient has recovered. Another response related to laparoscopic cholecystectomy was rated as having a slight to moderate potential for harm. The question posed was: “What is the most severe complication after this operation?” While GPT-4o correctly identified bile duct injury as the most severe complication, it neglected to mention injury to an accessory right hepatic artery, which can result in liver ischemia and necrosis[23]. A response about Lichtenstein hernioplasty was also considered slightly to moderately harmful. The question was again: “What is the most severe complication after this operation?” GPT-4o identified hernia recurrence as the most severe complication. However, the most dangerous complication is an injury to the intestine, which can require emergency surgery with high mortality and morbidity[24]. This response received the second-lowest mean for accuracy, as GPT-4o briefly mentioned injury to neighboring structures without going into detail. Another response with an elevated mean score for potential harm addressed the most severe complication of pancreatoduodenectomy. Although GPT-4o mentioned anastomotic leakage, describing it as “surgical connections between the intestines or other structures,” it did not provide details about postoperative pancreatic fistula, the most critical complication after this operation[25]. Finally, GPT-4o did not mention duodenal stump insufficiency as a complication of total gastrectomy when asked about possible complications. Consequently, this response was rated as having moderate potential harm. Interestingly, four of the five responses with potential harm were related to common FAQs. To minimize this risk, a possible solution would be to focus on surgery-specific questions rather than generic FAQs.
Importantly, no response had a mean rating above 2, which would suggest a “very” or “extremely” harmful answer. Nonetheless, there is a need for improvement, particularly in the context of preoperative FAQs, where avoiding potential harm is crucial. One potential technical solution is to train GPT-4o with datasets curated by medical professionals, leveraging high-quality medical literature such as guidelines, randomized controlled trials, and well-regarded studies. This would provide the LLM with a solid foundation based on evidence-based practice. Alternatively, hospitals could explore a more specific customization of GPT-4o with their own standards and statistics such as complication rate, and 30-day mortality to meet the unique needs of their patient population.
Apart from these 5 of 120 responses, GPT-4o maintained a mean potential harm value below 1.0 in 96% of cases, demonstrating its high performance in preoperative and postoperative patient education. Furthermore, every response was rated as at least “somewhat relevant,” underscoring GPT-4o’s overall reliability and effectiveness in providing patient information. Therefore, the implementation of GPT-4o has the potential to enhance patient satisfaction by delivering clear and consistent information, which could help reduce preoperative anxiety. A practical application might involve integrating GPT-4o into preoperative and postoperative care pathways to offer standardized, evidence-based information ahead of consultations, allowing patients to come prepared with informed questions. This could lead to increased patient confidence in understanding their conditions and the procedures they face, ultimately supporting more informed decision-making.
Interestingly, when responses were stratified by the complexity of surgical procedures, accuracy was slightly higher for minor surgeries than for major surgeries. In previous research, GPT-4.0 responses for cholecystectomy were rated significantly higher than those for pancreatoduodenectomy and colectomy[26]. In contrast to these findings, our study observed only a marginal, albeit significant, advantage in accuracy for minor surgeries. Moreover, responses to major surgery FAQs had significantly longer word counts than those for minor surgeries. This strong performance in major surgeries may be attributable to our use of the latest version, GPT-4o, compared with the GPT-4.0 used in the earlier study. Accordingly, we also observed no significant differences in relevance or potential harm.
Interestingly, the slightly higher performance of GPT-4o in handling minor surgery responses was limited to common FAQs. When addressing surgery-specific questions, no differences were noted between minor and major surgeries. This suggests that another effective strategy could be to focus on generating more case-specific or surgery-specific inputs, rather than relying solely on common FAQs. Furthermore, when responses were further stratified into preoperative and postoperative categories, GPT-4o demonstrated no significant differences in performance. Overall, GPT-4o performed equally well in addressing both preoperative and postoperative FAQs for minor and major surgeries when compared directly (minor preoperative vs. major preoperative and minor postoperative vs. major postoperative). These findings suggest that GPT-4o is broadly applicable, demonstrating its utility not only in minor surgeries but also in complex oncologic surgical scenarios.
While results of this study are promising, there are several limitations to consider. First, the evaluation was based solely on responses to predefined FAQs, which may not fully represent the diverse range of patient queries and concerns in real-world scenarios. Another limitation is the small number of surgical procedures evaluated, which, while providing valuable insights, may not encompass the full variety of surgical scenarios. Furthermore, the study did not involve direct patient interaction, and the impact of GPT-4o’s responses on patient outcomes, satisfaction, or decision-making was not assessed.
To build on these findings, future research should explore the real-world use of GPT-4o in clinical settings. One approach could be a two-armed clinical trial, with one group receiving education through GPT-4o alongside conventional methods, and the other group receiving only traditional patient education. Alternatively, a crossover study design could be employed, where patients experience both LLM-assisted and conventional education. Additionally, long-term studies that assess the impact of LLM-assisted education on patient outcomes, such as treatment adherence and recovery rates, could provide critical insights into the broader implementation of LLMs in patient education.
Conclusion
GPT-4o has demonstrated a remarkable capability in addressing FAQs related to surgical procedures, underscoring its potential as a powerful tool in patient education. The majority of its responses were both accurate and relevant, indicating that LLMs like GPT-4o can significantly enhance patient information and satisfaction in the preoperative and postoperative phases.
To achieve this, one strategy could be the development of a user-friendly, hospital-specific application that offers patients reliable, instant answers to their surgical questions. Alternatively, integrating GPT-4o into interactive robotic systems, or “bots in white coats,” could enable these robots to communicate directly with patients. As advancements continue, the role of GPT-4o and similar models in medical practice will likely expand, paving the way for a new era of informed and empowered patients.
Footnotes
Published online 28 January 2025
Contributor Information
Ughur Aghamaliyev, Email: ughur.aghamaliyev@gmail.com.
Javad Karimbayli, Email: javad.karimbayli01@universitadipavia.it.
Athanasios Zamparas, Email: Athanasios.zamparas@med.uni-muenchen.de.
Florian Bösch, Email: florian.boesch@med.uni-goettingen.de.
Michael Thomas, Email: michael.thomas@uk-koeln.de.
Thomas Schmidt, Email: thomas.schmidt1@uk-koeln.de.
Christian Krautz, Email: christian.krautz@uk-erlangen.de.
Christoph Kahlert, Email: christoph.kahlert@med.uni-heidelberg.de.
Sebastian Schölch, Email: s.schoelch@dfkz-heidelberg.de.
Martin K. Angele, Email: martin.angele@med.uni-muenchen.de.
Hanno Niess, Email: hanno.niess@med.uni-muenchen.de.
Markus O. Guba, Email: markus.guba@med.uni-muenchen.de.
Jens Werner, Email: jens.werner@med.uni-muenchen.de.
Matthias Ilmer, Email: matthias.ilmer@med.uni-muenchen.de.
Bernhard W. Renz, Email: bernhard.renz@med.uni-muenchen.de.
Ethical approval
Not applicable.
Consent
Not applicable.
Sources of funding
All the authors declare to have received no financial support or sponsorship for this study.
Author’s contribution
U.A., M.A., B.R.: conceptualization; U.A., J.K., and A.Z.: writing – original draft preparation; M.A., B.R., Jens Werner: supervision; U.A.: data curation; J.K.: Methodology; U.A., J.K., M.A., J.W., B.R.: writing – review and editing and data collection. Other authors: evaluation of responses by GPT-4o.
Conflicts of interest disclosure
All the authors declare to have no conflicts of interest relevant to this study.
Research registration unique identifying number (UIN)
Not applicable.
Guarantor
Ughur Aghamaliyev, Matthias Ilmer, and Bernhard Renz.
Provenance and peer review
Not invited.
Data availability statement
The data can be obtained from the corresponding author upon request.
Presentation
None.
References
- [1].Meara JG, Leather AJM, Hagander L, et al. Global Surgery 2030: evidence and solutions for achieving health, welfare, and economic development. Int J Obstet Anesth 2016;25:75–8. [DOI] [PubMed] [Google Scholar]
- [2].de Wilde RF, Besselink MGH, van der Tweel I, et al. Impact of nationwide centralization of pancreaticoduodenectomy on hospital mortality. Br J Surg 2012;99:404–10. [DOI] [PubMed] [Google Scholar]
- [3].Nilsson H, Stylianidis G, Haapamäki M, et al. Mortality after groin hernia surgery. Ann Surg 2007;245:656–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].HerniaSurge Group. International guidelines for groin hernia management. Hernia 2018;22:1–165. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [5].Kinnersley P, Phillips K, Savage K, et al. Interventions to promote informed consent for patients undergoing surgical and other invasive healthcare procedures. Cochrane Database Syst Rev 2013;7:CD009445. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].OpenAI. ChatGPT: optimizing language models for dialogue. OpenAI; 2022. https://openai.com/index/chatgpt/
- [7].Shafiee A. Matters arising: authors of research papers must cautiously use ChatGPT for scientific writing. Int J Surg 2023;109:2853–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Holland AM, Lorenz WR, Cavanagh JC, et al. Comparison of medical research abstracts written by surgical trainees and senior surgeons or generated by large language models. JAMA Network Open 2024;7:e2425373. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Benary M, Wang XD, Schmidt M, et al. Leveraging large language models for decision support in personalized oncology. JAMA Network Open 2023;6:e2343689. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Deng L, Wang T, Zhai Z, et al. Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2. Int J Surg 2024;110:1941–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Aghamaliyev U, Karimbayli J, Giessen-Jung C, et al. ChatGPT’s gastrointestinal tumor board tango: a limping dance partner? Eur J Cancer 2024;205:114100. [DOI] [PubMed] [Google Scholar]
- [12].Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023;183:589–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Yalamanchili A, Sengupta B, Song J, et al. Quality of large language model responses to radiation oncology patient care questions. JAMA Network Open 2024;7:e244630. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Decker H, Trang K, Ramirez J, et al. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Network Open 2023;6:e2336997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Mathew G, Agha R, Albrecht J, et al. STROCSS 2021: strengthening the reporting of cohort, cross-sectional and case-control studies in surgery. Int J Surg 2021;96:106165. [DOI] [PubMed] [Google Scholar]
- [16].Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 2016;32:2847–49. [DOI] [PubMed] [Google Scholar]
- [17].Falagas ME, Korbila IP, Giannopoulou KP, Kondilis BK, Peppas G. Informed consent: how much and what do patients understand? Am J Surg 2009;198:420–35. [DOI] [PubMed] [Google Scholar]
- [18].Lloyd AJ. The extent of patients’ understanding of the risk of treatments. Qual Health Care 2001;10:i14–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Harms J, Kunzmann B, Bredereke J, Harms L, Jungbluth T, Zimmermann T. Anxiety in patients with gastrointestinal cancer undergoing primary surgery. J Cancer Res Clin Oncol 2023;149:8191–200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [20].Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am 2023;105:1519–26. [DOI] [PubMed] [Google Scholar]
- [21].Srinivasan N, Samaan JS, Rajeev ND, Kanu MU, Yeo YH, Samakar K. Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources. Surg Endosc 2024;38:2522–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Meyer R, Hamilton KM, Truong MD, et al. ChatGPT compared with Google Search and healthcare institution as sources of postoperative patient instructions after gynecological surgery. BJOG 2024;131:1154–56. [DOI] [PubMed] [Google Scholar]
- [23].Li J, Frilling A, Nadalin S, Paul A, Malagò M, Broelsch CE. Management of concomitant hepatic artery injury in patients with iatrogenic major bile duct injury after laparoscopic cholecystectomy. Br J Surg 2008;95:460–65. [DOI] [PubMed] [Google Scholar]
- [24].Koliakos N, Papaconstantinou D, Nastos C, et al. Intestinal erosions following inguinal hernia repair: a systematic review. Hernia 2021;25:1137–45. [DOI] [PubMed] [Google Scholar]
- [25].Sun Y, Yu XF, Yao H, Xu S, Ma YQ, Chai C. Safety and feasibility of modified duct-to-mucosa pancreaticojejunostomy during pancreatoduodenectomy: a retrospective cohort study. World J Gastrointest Surg 2023;15:1901–09. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Munir MM, Endo Y, Ejaz A, Dillhoff M, Cloyd JM, Pawlik TM. Online artificial intelligence platforms and their applicability to gastrointestinal surgical operations. J Gastrointest Surg 2024;28:64–69. [DOI] [PubMed] [Google Scholar]