. Author manuscript; available in PMC: 2024 Dec 1.
Published in final edited form as: Am J Gastroenterol. 2023 Jul 10;118(12):2276–2279. doi: 10.14309/ajg.0000000000002397

Evaluation of the Potential Utility of an Artificial Intelligence ChatBot in GERD Management

Jacqueline B Henson 1, Jeremy R Glissen Brown 1, Joshua P Lee 1, Amit Patel 1,2, David A Leiman 1,3
PMCID: PMC10834834  NIHMSID: NIHMS1914808  PMID: 37410934

Abstract

Introduction:

Artificial intelligence chatbots could serve as an information resource for patients and a tool for clinicians. Their ability to respond appropriately to questions regarding gastroesophageal reflux disease (GERD) is unknown.

Methods:

Twenty-three prompts regarding GERD management were submitted to ChatGPT, and responses were rated by three gastroenterologists and eight patients.

Results:

ChatGPT provided largely appropriate responses (91.3%), though some responses were inappropriate (8.7%) or inconsistent. Most responses (78.3%) contained at least some specific guidance. All patients (100%) considered this a useful tool.

Conclusions:

ChatGPT’s performance demonstrates the potential of this technology in healthcare, as well as its limitations in its current state.

Keywords: artificial intelligence, GERD, chatbot, accuracy, large language models

Introduction

Gastroesophageal reflux disease (GERD) is a common medical condition that affects nearly 30% of adults in the United States.1 Due to evolving diagnostic criteria and novel therapies, management complexity is increasing.2 ChatGPT is a chat-based artificial intelligence large language model that has gained wide popularity since its release in November 2022.3 It has demonstrated sufficient medical knowledge to successfully pass the United States Medical Licensing Examination.4 Therefore, it could serve as an interactive information resource for patients with GERD symptoms, as well as a tool for clinicians, though the ability of ChatGPT to provide guidance for GERD management is uncertain.5 We performed this study to evaluate ChatGPT responses to questions regarding GERD diagnosis and treatment.

Methods

We generated 23 GERD management prompts based on decision points from published clinical guidelines and expert consensus recommendations across the domains of diagnosis (n=5), treatment (n=11), or diagnosis and treatment (n=7) (Table 1).2,6–9 Each prompt was submitted to ChatGPT (version 3/14/2023) three times during independent interactions without intervening feedback to assess consistency. Responses were rated by three board-certified gastroenterologists, including two esophagologists (AP and DAL), for appropriateness (“completely appropriate”, “mostly appropriate with some slight inaccuracies or irrelevant information”, “mostly inappropriate”) and specificity (“containing some specific guidance”, “containing only generic information or guidance”). Two reviewers independently rated each response, and disagreements were resolved by a third. Eight patients each assessed their comprehension (“completely understand”, “mostly understand”, “mostly do not understand”) of representative responses to half of the prompts, as well as their usefulness (“very useful”, “somewhat useful”, “not useful”). Patients were also asked whether they would consider the responses useful as a source of medical information and how the response format compared with a traditional search engine. The proportion of provider and patient ratings of responses and the consistency of appropriate responses to the same prompt were evaluated. A kappa statistic was used to assess inter-rater reliability (Supplemental Table 1). Analysis was performed using R 4.1.2. The Duke University Institutional Review Board considered this study exempt.
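For readers unfamiliar with the inter-rater reliability measure, Cohen’s kappa compares observed agreement between two raters against the agreement expected by chance from each rater’s marginal label frequencies. The analysis here was performed in R, but the arithmetic is the same in any language; the sketch below uses Python with entirely hypothetical ratings on the three-level appropriateness scale, purely to illustrate the calculation.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same set of items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example ratings (not the study data).
a = ["complete", "mostly", "complete", "mostly", "inappropriate", "complete"]
b = ["complete", "mostly", "mostly", "mostly", "inappropriate", "complete"]
print(round(cohens_kappa(a, b), 3))  # → 0.739
```

Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance; negative values indicate systematic disagreement.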

Table 1.

Prompts on GERD diagnosis and management submitted to ChatGPT.

Category Topic
Treatment GERD symptoms
Treatment GERD symptoms despite lifestyle modifications
Treatment GERD symptoms resolved on PPI
Treatment GERD symptoms despite lifestyle modifications and PPI
Treatment Recurrent GERD symptoms after PPI discontinuation
Treatment Safety of long-term PPI use
Treatment Long-term PPI use in GERD with Barrett's esophagus
Treatment Long-term PPI use in GERD with severe esophagitis
Treatment Long-term PPI use in GERD with peptic stricture
Diagnosis Need for further testing to diagnose GERD with Barrett's esophagus >3 cm
Diagnosis Need for further testing to diagnose GERD with severe esophagitis
Diagnosis Need for further testing to diagnose GERD with mild esophagitis
Diagnosis Need for further testing to diagnose GERD with peptic stricture
Diagnosis Need for further testing to diagnose GERD with normal EGD
Both Acid exposure time >6%
Both Acid exposure time >6% and body mass index of 40
Both Acid exposure time >6% and large hiatal hernia
Both Acid exposure time <4% and symptom index >50%
Both Acid exposure time <4%, and symptom index >50% with large hiatal hernia
Both Acid exposure time <4%, symptom index <50%, >80 reflux events, and PPI use
Both Acid exposure time <4%, symptom index <50%, <80 reflux events, and PPI use
Treatment Acid exposure time >6%, normal esophageal motility, and surgery
Treatment Acid exposure time >6%, ineffective esophageal motility, and surgery

Both indicates that the prompt included both interpretation of the diagnostic testing and treatment recommendations. Abbreviations: GERD, gastroesophageal reflux disease; PPI, proton pump inhibitor

Results

ChatGPT provided appropriate responses to 63/69 queries (91.3%), including 29.0% rated completely appropriate and 62.3% mostly appropriate (Figure 1). Responses to the same prompt were frequently inconsistent, with 16/23 (69.6%) prompts yielding variable appropriateness ratings, including three (13.0%) with both appropriate and inappropriate responses. Prompts regarding treatment received the highest proportion of completely appropriate responses (39.4%), while questions involving both diagnosis and management had the most inappropriate responses (14.3%) (Figure 1A). ChatGPT failed to recommend consideration of Roux-en-Y gastric bypass for ongoing GERD symptoms with pathologic acid exposure in the setting of obesity (Supplemental Material). Additionally, some potential PPI risks were stated as fact. The majority of responses (78.3%) contained some specific guidance, particularly for prompts addressing diagnosis (93.3%) (Figure 1B). Patients from a range of educational backgrounds considered the responses both generally understandable and useful (Figure 2A,B; Supplemental Table 2). All reported that they would find the tool a useful source of medical information and that the format of responses was more useful than a traditional search engine.

Figure 1.

Appropriateness (A) and specificity (B) of ChatGPT responses to prompts regarding GERD management, overall and by category.

Figure 2.

Patient comprehension (A) and perceived usefulness (B) of ChatGPT responses to prompts regarding GERD management, overall and by category.

Conclusions

We found that ChatGPT provided largely appropriate and at least some specific guidance for GERD management, highlighting the potential for this technology to serve as a source of information for patients as well as an aid for clinicians. Many individuals with GERD symptoms may seek medical advice from the internet, and ChatGPT demonstrated the ability to provide generally appropriate responses to both common and more complicated queries. In addition, patients found these responses understandable, and all reported that the format was more useful than a traditional search engine. Patients may also contact clinicians with questions, and ChatGPT’s conversational format may provide an ideal patient-facing medical information platform to perform initial triage of messages. This potential to improve clinical efficiency and reduce patient message and call volumes could diminish clinician burnout.10,11 However, given the inconsistency of responses observed, limited specific guidance, and content errors, such implementation is infeasible at this time without clinical oversight.

Overall, ChatGPT provided the most appropriate and understandable responses to straightforward questions regarding diagnosis and treatment options, while complicated prompts incorporating both diagnostic test interpretation and recommended management had a higher proportion of inappropriate responses. ChatGPT demonstrated an ability to accurately interpret diagnostic testing, such as determining that an individual with acid exposure time <4% and symptom index <50% is likely experiencing functional heartburn, though treatment recommendations were less proficient, including failing to consider Roux-en-Y gastric bypass in a GERD patient with body mass index of 40. Indeed, after providing the appropriate diagnosis, ChatGPT tended toward logorrhea-like responses, recounting associated information regarding GERD management but rarely specific treatment advice (e.g., “a PPI may be recommended” instead of “consider a PPI such as omeprazole at 40 mg in the morning 30 minutes prior to a meal.”). This circumlocution diluted the clinical impact of responses in many cases, though it is a recognized limitation of GPT-3.5 that future iterations may improve upon with ongoing model training.3,12

While the vast majority of ChatGPT’s responses were appropriate, the occurrence of inappropriate and inconsistent responses to the same prompt largely precludes its application within healthcare in its present state, at least for GERD. Additional areas of potential concern may arise when “discussing” areas of unsettled fact, such as the potential risks associated with PPI use. In its current form, ChatGPT may be susceptible to recounting all reported risks without adequately balancing benefits or contextualizing the tradeoffs.13 Notably, ChatGPT was not specifically trained on medical bodies of text, and we did not evaluate how the underlying model would respond to repeated requests for clarification. Indeed, in all of its responses ChatGPT suggested contacting a health professional for further advice.

In conclusion, ChatGPT demonstrated the ability to provide appropriate and specific recommendations for GERD management across a large number of clinical questions, and patients found these responses understandable and useful. This highlights its potential to improve information accessibility for patients and to serve as a tool to improve clinician workflow, though its limitations preclude uptake in routine clinical care at this time.12,14

Supplementary Material

Supplementary File

Acknowledgements:

The initial draft of this brief communication was generated using ChatGPT and then substantially edited for clarity, accuracy, and completeness.

Financial support:

JBH is supported by NIH grant T32DK007568.

Abbreviations:

EGD

esophagogastroduodenoscopy

GERD

gastroesophageal reflux disease

PPI

proton pump inhibitor

Footnotes

Potential competing interests: JRGB has served as a consultant for Medtronic. The other authors disclose no relevant conflicts of interest.

References

1. El-Serag HB, Sweet S, Winchester CC, Dent J. Update on the epidemiology of gastro-oesophageal reflux disease: a systematic review. Gut. 2014;63(6):871–880.
2. Gyawali CP, Kahrilas PJ, Savarino E, et al. Modern diagnosis of GERD: the Lyon Consensus. Gut. 2018;67(7):1351–1362.
3. Introducing ChatGPT. Accessed March 29, 2023. https://openai.com/blog/chatgpt
4. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
5. Rogers B, Samanta S, Ghobadi K, et al. Artificial intelligence automates and augments baseline impedance measurements from pH-impedance studies in gastroesophageal reflux disease. J Gastroenterol. 2021;56(1):34–41.
6. Yadlapati R, Gyawali CP, Pandolfino JE, CGIT GERD Consensus Conference Participants. AGA Clinical Practice Update on the Personalized Approach to the Evaluation and Management of GERD: Expert Review. Clin Gastroenterol Hepatol. 2022;20(5):984–994.e1.
7. Katz PO, Dunbar KB, Schnoll-Sussman FH, Greer KB, Yadlapati R, Spechler SJ. ACG Clinical Guideline for the Diagnosis and Management of Gastroesophageal Reflux Disease. Am J Gastroenterol. 2022;117(1):27–56.
8. Targownik LE, Fisher DA, Saini SD. AGA Clinical Practice Update on De-Prescribing of Proton Pump Inhibitors: Expert Review. Gastroenterology. 2022;162(4):1334–1342.
9. Gyawali CP, Carlson DA, Chen JW, Patel A, Wong RJ, Yadlapati RH. ACG Clinical Guidelines: Clinical Use of Esophageal Physiologic Testing. Am J Gastroenterol. 2020;115(9):1412–1428.
10. Tai-Seale M, Dillon EC, Yang Y, et al. Physicians’ Well-Being Linked To In-Basket Messages Generated By Algorithms In Electronic Health Records. Health Aff. 2019;38(7):1073–1078.
11. Yan Q, Jiang Z, Harbin Z, Tolbert PH, Davies MG. Exploring the relationship between electronic health records and provider burnout: A systematic review. J Am Med Inform Assoc. 2021;28(5):1009–1021.
12. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388(13):1233–1239.
13. Freedberg DE, Kim LS, Yang YX. The Risks and Benefits of Long-term Use of Proton Pump Inhibitors: Expert Review and Best Practice Advice From the American Gastroenterological Association. Gastroenterology. 2017;152(4):706–715.
14. Will ChatGPT transform healthcare? Nat Med. 2023;29(3):505–506.
