Abstract
Background and Aims
We evaluated the precision, medical accuracy, superfluous content, and consistency of ChatGPT's responses to commonly asked questions about endoscopic procedures and its capability to provide emotional support, comparing its performance with the generative pretrained transformer 4 (GPT-4) model.
Methods
A set of 113 questions related to EGD, colonoscopy, EUS, and ERCP was curated from professional societies and institutional web pages. Responses from ChatGPT were generated and subsequently graded by board-certified gastroenterologists and advanced endoscopists. The emotional support efficacy of ChatGPT and GPT-4 was also assessed by a board-certified psychiatrist (L.S.-M.).
Results
ChatGPT exhibited moderate precision in answering questions about EGD (57.9% comprehensive), colonoscopy (47.6% comprehensive), EUS (48.1% comprehensive), and ERCP (44.4% comprehensive). Medical accuracy was highest for EGD (52.6% fully accurate) and lowest for EUS (40.7% fully accurate). Concerning superfluous content, responses were predominantly concise for EGD and colonoscopy, with ERCP and EUS showing increased extraneous content. Reproducibility scores varied across domains, ranging from 50.34% (for EUS) to 68.6% (for EGD). GPT-4 outperformed ChatGPT in emotional support, although both models exhibited satisfactory performance.
Conclusions
ChatGPT delivers moderately precise and medically accurate answers related to common endoscopic procedures with varying levels of extraneous content. It holds promise as a supplementary information resource for both patients and healthcare professionals.
Endoscopy remains vital in managing GI diseases by providing essential diagnostic and therapeutic interventions for innumerable GI conditions.1 In 2020, approximately 20 million endoscopic procedures were performed in the United States, highlighting the extensive reliance on these procedures in clinical settings.2
The most common endoscopic procedures are EGD, colonoscopy, EUS, and ERCP. Because of the widespread nature of GI disorders and the significance of endoscopic procedures in addressing these issues, patients will frequently have questions regarding these techniques.3
Artificial intelligence (AI) has made remarkable strides in natural language processing in recent years.4 Models such as ChatGPT (generative pretrained transformer) and GPT-4 (generative pretrained transformer 4), developed by OpenAI, an AI research organization based in San Francisco, California, USA, have demonstrated potential for various healthcare applications.5 These models have been used in such tasks as responding to medical student examination queries, creating basic medical reports, and offering information on various health-related subjects.6,7,8 ChatGPT could serve as a supplementary information resource for patients, improving patient education and outcomes.8 Nevertheless, concerns persist about ChatGPT's ability to provide accurate and comprehensive responses to detailed medical questions.9,10,11
Although a few pilot studies have examined ChatGPT's capabilities in gastroenterology, the comprehensive literature has yet to sufficiently explore its potential to address questions concerning common endoscopic procedures.10,11 In this study, we assessed the precision, comprehensiveness, and reliability of ChatGPT's answers to common queries about patient care and management regarding endoscopic procedures. Moreover, we compared the performance of ChatGPT (freely accessible) with GPT-4 (paid subscription and limited access) when responding to emotional questions posed by patients, because this comparison could reveal further insight into ChatGPT's potential role as a virtual assistant for patients and healthcare providers in the realm of endoscopic procedures.
Methods
Data source
We gathered frequently asked questions about endoscopic procedures from reputable professional societies and institutional web pages (Supplementary file, available online at www.igiejournal.org). Questions about endoscopic procedures, knowledge, and management were modified to ensure inclusiveness and representation of patients and caregivers. The criteria for exclusion consisted of questions that conveyed similar meanings, questions with ambiguous meanings (such as inquiries about how endoscopic procedures impact the body), queries that could differ from individual to individual (such as the likelihood of a person's condition worsening after the procedure), and questions not related to the medical aspects of the procedures. One hundred thirteen questions were selected for common endoscopic procedures (EGD, colonoscopy, EUS, and ERCP) (Supplementary file, available online at www.igiejournal.org). Furthermore, we evaluated the capability of ChatGPT and GPT-4 to function as a psychological support system for patients (Supplementary file, available online at www.igiejournal.org). Because there were no predefined benchmarks for the responses, the assessment of the model's effectiveness in responding to emotional support questions was performed by a board-certified psychiatrist (L.S.-M.).
Response generation
Launched in November 2022, ChatGPT is a refined version of the GPT-3.5 large language model, developed to advance natural language processing. It was trained using a rich compilation of text data from diverse online sources up to 2021. Key to its design is incorporating user feedback, a strategy known as “reinforcement learning from human feedback,” which allows it to generate more contextually apt and coherent responses. To mitigate abuse, ChatGPT developers have implemented multiple safeguards, such as preventing the generation of derogatory or harmful responses.
Questions were input into the March 23 version of ChatGPT. Each question was submitted as an independent prompt using the "New Chat" function. Every question was entered into ChatGPT twice, independently, by 2 authors, with both responses recorded to assess the reproducibility of ChatGPT's answers using a text-compare tool. Both authors also reviewed the content of the responses to investigate similarities, as in prior literature.9 We did not use GPT-4 for the primary analysis of this study because it requires a paid subscription, caps the number of responses per hour, and is not freely available to the public.5 In selecting ChatGPT (GPT-3.5 and GPT-4) for our study, we prioritized it over other models such as BLOOM (BigScience, various institutions), LaMDA/Bard (Google, Mountain View, Calif), and LLaMA (Meta AI, Menlo Park, Calif) because of its established reputation, vast training data, seamless integration, and widespread recognition.
Grading of questions
For questions related to EGD and colonoscopy, the review and grading processes were carried out by at least 2 board-certified/eligible gastroenterologists (B.P.M. and P.P.). Questions about EUS and ERCP were assessed by at least 2 board-certified/eligible advanced endoscopists (I.O. and J.J.E.). The overall responses were graded as follows:
• Comprehensive: The answer was thorough and all-encompassing.
• Correct but insufficient: Although the answer was accurate, it lacked depth or detailed information.
• Mixed with correct and incorrect or outdated data: The response comprised a combination of correct, outdated, or incorrect data.
• Completely incorrect: The entire content of the answer was inaccurate.
This grading approach was inspired by methodology from related literature.9
In cases where discrepancies in grading or reproducibility assessments emerged among the primary reviewers, a third blinded reviewer (K.L.) stepped in for an unbiased assessment and resolution. In addition to the primary grading, we evaluated the medical accuracy of each response:
• Fully accurate: The content aligns completely with established medical facts and knowledge.
• Generally accurate: Although the response might have minor inaccuracies or omissions, it remains largely correct.
• Predominantly or completely inaccurate: The answer deviates significantly or entirely from factual medical knowledge.
Responses were also examined for any unrelated or superfluous content:
• Concise: No presence of extraneous information.
• Moderate extraneous content: The answer contains a moderate amount of irrelevant information.
• Overloaded: The response is largely filled with unrelated content or is entirely off-topic.
Emotional support questions and responses
To address ChatGPT's performance on emotional support questions, 2 authors modified frequently asked questions posted on professional society and institutional web pages to address this need (Supplementary file, available online at www.igiejournal.org). Each emotional question was given a clear subject, relevant context, a personalized touch, and open-endedness to elicit a response from ChatGPT. ChatGPT's potential as a psychological support system for patients was compared with the latest GPT-4 model by OpenAI. Because there was no established standard for the answers, a board-certified psychiatrist (L.S.-M.) assessed the responses and gauged the model's effectiveness in providing emotional support. Questions were graded as follows:
• Not comprehensive: The answer provided minimal information or failed to address the question adequately.
• Somewhat comprehensive: The answer provided some information but left out important details.
• Moderately comprehensive: The answer covered most aspects of the question but may have lacked depth or specificity in certain areas.
• Very comprehensive: The answer addressed all aspects of the question and provided detailed information.
• Extremely comprehensive: The answer was exhaustive, covering all aspects of the question with great detail and specificity.
Statistical analysis
To assess reproducibility, we evaluated the consistency between 2 responses to each question, as previously described, using text similarity by means of an online tool (https://notes.xxi2.com/tools/similar/). This tool compared the content of 2 text documents, pinpointing differences and calculating a similarity percentage. A score ≥50% was indicative of similar text. Results are presented as percentage medians alongside the interquartile range (IQR). If both responses scored below 50% in similarity and both authors concurred with the online tool's assessment, the responses were deemed significantly different. In cases of dissimilarity, only the initial response was graded. For emotional support questions, ChatGPT's performance was juxtaposed against GPT-4, with the model identifiers anonymized to ensure objectivity. As previously outlined, the distribution of each grade across responses for different endoscopic procedure domains was calculated and is presented as percentages. All analyses were conducted using GPT-4 code interpreter.5
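The reproducibility check above relied on an online text-compare tool whose exact algorithm is not specified. As an illustrative approximation only, the ≥50% similarity cutoff can be sketched with Python's standard-library `difflib`; the function names `similarity_pct` and `is_reproducible` are hypothetical, not from the study:

```python
from difflib import SequenceMatcher

def similarity_pct(text_a: str, text_b: str) -> float:
    """Similarity between two response texts as a percentage (0-100)."""
    return SequenceMatcher(None, text_a, text_b).ratio() * 100

def is_reproducible(text_a: str, text_b: str, threshold: float = 50.0) -> bool:
    """Apply the study's >=50% similarity cutoff for 'similar text'."""
    return similarity_pct(text_a, text_b) >= threshold
```

Identical responses score 100%, and wholly dissimilar responses fall well below the 50% threshold, in which case (per the protocol above) only the initial response would be graded.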
Results
ChatGPT displayed moderate levels of precision when answering questions about EGD (n = 38), colonoscopy (n = 21), ERCP (n = 27), and EUS (n = 27) across the treatment, lifestyle/aftercare, basic knowledge, and other categories.
Frequently asked questions about EGD
The reproducibility of text for both responses was 68.6% (IQR, 62.1-75.0) across all categories, with individual percentages given in Table 1. The percentage of answers considered "comprehensive" was 57.9%, "correct but insufficient" was 28.9%, and "mixed with correct and incorrect/outdated data" was 13.2% for the basic knowledge, treatment, lifestyle/aftercare, and others categories (Fig. 1). For medical accuracy, the percentage of answers considered "fully accurate" was 52.6%, "generally accurate" was 44.7%, and "predominantly or completely inaccurate" was 2.6% for all categories overall (Fig. 2). For superfluous content, the percentage of answers considered "concise" was 55.3%, "moderate extraneous content" was 42.1%, and "overloaded" was 2.6% for all categories overall (Fig. 3).
Table 1.
Text reproducibility between the 2 responses of ChatGPT and GPT-4
| | Basic knowledge | Treatment | Lifestyle/aftercare | Other |
|---|---|---|---|---|
| ERCP | 51.5 (45.6-56.8) | 36.6 (32.8-44.0) | 65.4 (36.1-82.5) | 71.9 (71.8-72.4) |
| EUS | 47.4 (44.0-52.1) | 48.3 (44.6-51.5) | 48.3 (40.5-55.3) | 65.7 (53.8-71.8) |
| EGD | 64.5 (60.8-75.2) | 69.3 (65.6-79.7) | 74.0 (72.0-75.0) | 66.0 (44.5-69.5) |
| Colonoscopy | 72.0 (65.2-74.5) | 73.0 (71.5-75.2) | 54.0 (53.0-56.0) | 39.5 (33.5-48.5) |
Values are median percentages (interquartile range).
Figure 1.
Overall grades of responses by the ChatGPT language model to questions related to endoscopic procedures.
Figure 2.
Grading for medical accuracy of responses by the ChatGPT language model to questions related to endoscopic procedures.
Figure 3.
Grading for superfluous content of responses by the ChatGPT language model to questions related to endoscopic procedures.
Frequently asked questions about colonoscopy
The reproducibility of text for both responses was 62.0% (IQR, 53.0-73.0) across all categories, with individual percentages given in Table 1. The percentage of answers considered "comprehensive" was 47.6%, "correct but insufficient" was 33.3%, and "mixed with correct and incorrect/outdated data" was 19.0% (Fig. 1). For medical accuracy, the percentage of answers considered "fully accurate" was 38.1% and "generally accurate" was 61.9% (Fig. 2). For superfluous content, the percentage of answers considered "concise" was 55.3%, "moderate extraneous content" was 42.1%, and "overloaded" was 2.6% for all categories overall (Fig. 3).
Frequently asked questions about EUS
The reproducibility of text for both responses was 50.34% (IQR, 44.4-56.1) across all categories, with individual percentages given in Table 1. The percentage of answers considered "comprehensive" was 48.1%, "correct but insufficient" was 40.7%, and "mixed with correct and incorrect/outdated data" was 11.1% (Fig. 1). For medical accuracy, the percentage of answers considered "fully accurate" was 40.7%, "generally accurate" was 55.6%, and "predominantly or completely inaccurate" was 3.7% (Fig. 2). For superfluous content, the percentage of answers considered "concise" was 33.3%, "moderate extraneous content" was 51.9%, and "overloaded" was 14.8% for all categories overall (Fig. 3).
Frequently asked questions about ERCP
The reproducibility of text for both responses was 51.4% (IQR, 39.7-70.8) across all categories, with individual percentages given in Table 1. The percentage of answers considered "comprehensive" was 44.4%, "correct but insufficient" was 37.0%, and "mixed with correct and incorrect/outdated data" was 18.5% (Fig. 1). For medical accuracy, the percentage of answers considered "fully accurate" was 37.0%, "generally accurate" was 59.3%, and "predominantly or completely inaccurate" was 3.7% (Fig. 2). For superfluous content, the percentage of answers considered "concise" was 44.4%, "moderate extraneous content" was 51.9%, and "overloaded" was 3.7% for all categories overall (Fig. 3).
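The per-domain grade percentages reported in this section are simple tallies of how many responses received each grade, divided by the number of questions in that domain. A minimal sketch of that calculation (the function name `grade_distribution` is illustrative, not from the study):

```python
from collections import Counter

def grade_distribution(grades):
    """Percentage of responses receiving each grade, rounded to 1 decimal place."""
    counts = Counter(grades)
    total = len(grades)
    return {grade: round(100 * n / total, 1) for grade, n in counts.items()}
```

For example, with 27 ERCP responses of which 12 were graded "comprehensive," 10 "correct but insufficient," and 5 "mixed," the tallies reproduce the 44.4%, 37.0%, and 18.5% figures above.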
Emotional support questions about endoscopic procedures
The responses to emotional support questions were graded from 1 to 5 based on the level of comprehensiveness (Fig. 4). Both large language models (ChatGPT and GPT-4) performed adequately, with all responses rated moderately to extremely comprehensive. GPT-4 outperformed ChatGPT in responding to emotional questions. No answers from either model were deemed not comprehensive.
Figure 4.
Responses to emotional support statements by ChatGPT and GPT-4 language models.
Discussion
This study assessed the precision and consistency of ChatGPT in addressing patient inquiries about endoscopic procedures. ChatGPT generated accurate and relevant responses to these procedures and provided comprehensive information to patients, performing better with basic procedures like EGD and colonoscopy compared with EUS and ERCP. The variations in performance could be attributed to factors such as the complexity of the procedure, clarity of questions posed, and volume of training set data that ChatGPT has on these procedures. It is possible that basic procedures like EGD and colonoscopy have more training data available, leading to more accurate responses. ChatGPT also effectively addressed emotional concerns, showcasing empathy and understanding.12
Global differences in endoscopic guidelines were not evaluated, and GPT-4 was not examined in the primary analysis because of accessibility constraints. An intriguing aspect of AI models like ChatGPT and GPT-4 is their potential role in healthcare workflows. They could be integrated into applications based on the OpenAI application programming interface, allowing patients to access instant information. However, this integration also brings up potential challenges and ethical considerations, including data security, reliability of advice, and over-reliance on automated tools.
Health literacy is vital for patients undergoing GI endoscopic procedures.13,14 Despite the need for accessible and accurate information, obtaining easy-to-understand resources can be a struggle. ChatGPT can address this issue by delivering health information conversationally, simplifying complex medical jargon15,16 and potentially leading to better patient understanding.17 ChatGPT can support healthcare providers by generating responses to routine patient inquiries, potentially saving time for more complex cases. The accuracy of the responses varies, and with technologic improvements, this could increase, possibly boosting provider productivity.18
ChatGPT and GPT-4 showed empathetic responses to emotional questions. Interest is growing in the ability of AI models to provide emotional support, a pivotal aspect of patient care. Beyond merely understanding medical terminologies and procedures, AI models can significantly address patient fears, offer comfort, and understand their emotional needs. Although this study touches on these capabilities, a deeper exploration into the existing literature in fields beyond GI could give insights into the boundaries and potential of AI in this context.
Although ChatGPT demonstrated strengths in many areas of patient inquiries, a notable shortcoming was its performance in addressing lifestyle and aftercare questions related to EGDs and colonoscopies. It is also worth noting that ChatGPT can generate varying replies for the same query, which can be considered a significant limitation. These limitations underscore a potential distinction in the utility of such AI tools. Questions that delve into the specifics of a procedure or its general knowledge are more effectively answered by ChatGPT. However, the tool falls short on nuanced physician recommendations and situation-specific aftercare, likely because these topics often involve the subtle judgment calls made by physicians based on individual patient situations rather than standardized or universally applicable information. ChatGPT should be viewed as a complement to, rather than a replacement for, personalized advice and information from healthcare professionals.
This study's main strength is the comprehensive collection of inquiries from authoritative sources. However, ChatGPT has limitations. Although ChatGPT provides comprehensive answers to certain questions, it should be treated as a supplementary resource, not a standalone substitute for healthcare professionals. The discrepancies among reviewers, which accounted for less than 25% of gradings, hint at a subjectivity that might have influenced the grading of ChatGPT's responses and skewed our evaluation of the AI model's performance. ChatGPT's training data, limited to 2021, may lead to outdated responses, and the quality of that training data remains under review, affecting reliability. Furthermore, ChatGPT struggled with specifics such as laboratory cutoff values and treatment durations; such specifics can be crucial in real-world endoscopy, where accurate and precise information is imperative. Reviewers' awareness that they were grading ChatGPT might have led to stricter grading, potentially underestimating its performance. Finally, globally varying guidelines could lead to confusion or harm if not correctly specified. Further refinement is needed to enhance data reliability and specificity. Additionally, the rating system used, adapted from a previous study, has inherent subjectivity and might not capture every nuance of the responses, especially subtle inaccuracies.
ChatGPT can augment healthcare providers, assisting patients with pertinent questions. Our study examined the accuracy and reproducibility of ChatGPT's responses to common patient inquiries on GI endoscopic procedures. ChatGPT frequently provided accurate, albeit sometimes incomplete, responses. However, because the model's advice varies across regions and lacks regional or personalized contextualization, it could be misleading.
Disclosure
The following author disclosed financial relationships: D. G. Adler: Consultant for Boston Scientific and MicroTech. All other authors disclosed no financial relationships.
Footnotes
Supplementary data
References
1. ASGE Standards of Practice Committee; Early DS, Ben-Menachem T, Decker GA, et al. Appropriate use of GI endoscopy. Gastrointest Endosc. 2012;75:1127-1131. doi: 10.1016/j.gie.2012.01.011.
2. DeLegge MH. The difficult-to-sedate patient in the endoscopy suite. Gastrointest Endosc Clin North Am. 2008;18:679-693. doi: 10.1016/j.giec.2008.06.011.
3. Tierney M, Bevan R, Rees CJ, et al. What do patients want from their endoscopy experience? The importance of measuring and understanding patient attitudes to their care. Frontline Gastroenterol. 2016;7:191-198. doi: 10.1136/flgastro-2015-100574.
4. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18:544-551. doi: 10.1136/amiajnl-2011-000464.
5. OpenAI. OpenAI models: GPT-3.5. Available at: https://chat.openai.com/.
6. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs]. Published online March 4, 2022. Available at: https://arxiv.org/abs/2203.02155. Accessed February 23, 2023.
7. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9. doi: 10.2196/45312.
8. Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. arXiv:2212.14882 [cs]. Published online December 30, 2022. Available at: https://arxiv.org/abs/2212.14882. Accessed February 23, 2023.
9. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023;29:721-732. doi: 10.3350/cmh.2023.0089.
10. Lee TC, Staller K, Botoman V, et al. ChatGPT answers common patient questions about colonoscopy. Gastroenterology. 2023;165:509-511. doi: 10.1053/j.gastro.2023.04.033.
11. Suchman K, Garg S, Trindade AJ. Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology self-assessment test. Am J Gastroenterol. 2023;118:2280-2282. doi: 10.14309/ajg.0000000000002320.
12. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183:589-596. doi: 10.1001/jamainternmed.2023.1838.
13. Kolb JM, Chen M, Tavakkoli A, et al. Patient knowledge, risk perception, and barriers to Barrett's esophagus screening. Am J Gastroenterol. 2023;118:615-626. doi: 10.14309/ajg.0000000000002054.
14. Smith SG, von Wagner C, McGregor LM, et al. The influence of health literacy on comprehension of a colonoscopy preparation information leaflet. Dis Colon Rectum. 2012;55:1074-1080. doi: 10.1097/DCR.0b013e31826359ac.
15. Zeng QT, Kogan S, Plovnick RM, et al. Positive attitudes and failed queries: an exploration of the conundrums of consumer health information retrieval. Int J Med Inform. 2004;73:45-55. doi: 10.1016/j.ijmedinf.2003.12.015.
16. Morahan-Martin JM. How internet users find, evaluate, and use online health information: a cross-cultural review. Cyberpsychol Behav. 2004;7:497-510. doi: 10.1089/cpb.2004.7.497.
17. Miner AS, Laranjo L, Kocaballi AB. Chatbots in the fight against the COVID-19 pandemic. NPJ Digit Med. 2020;3:65. doi: 10.1038/s41746-020-0280-0.
18. Xu L, Sanders L, Li K, et al. Chatbot for health care and oncology applications using artificial intelligence and machine learning: systematic review. JMIR Cancer. 2021;7. doi: 10.2196/27850.