Clinical and Translational Gastroenterology
. 2024 Aug 30;15(11):e00765. doi: 10.14309/ctg.0000000000000765

Digesting Digital Health: A Study of Appropriateness and Readability of ChatGPT-Generated Gastroenterological Information

Avi Toiv 1, Zachary Saleh 2, Angela Ishak 1, Eva Alsheik 2, Deepak Venkat 2, Neilanjan Nandi 3, Tobias E Zuchelli 2
PMCID: PMC11596446  PMID: 39212302

Abstract

INTRODUCTION:

The advent of artificial intelligence–powered large language models capable of generating interactive responses to intricate queries marks a groundbreaking development in how patients access medical information. Our aim was to evaluate the appropriateness and readability of gastroenterological information generated by Chat Generative Pretrained Transformer (ChatGPT).

METHODS:

We analyzed responses generated by ChatGPT to 16 dialog-based queries assessing symptoms and treatments for gastrointestinal conditions and 13 definition-based queries on prevalent topics in gastroenterology. Three board-certified gastroenterologists evaluated output appropriateness with a 5-point Likert-scale proxy measurement of currency, relevance, accuracy, comprehensiveness, clarity, and urgency/next steps. Outputs with a score of 4 or 5 in all 6 categories were designated as “appropriate.” Output readability was assessed with the Flesch Reading Ease score, the Flesch-Kincaid Reading Grade Level, and the Simple Measure of Gobbledygook score.

RESULTS:

ChatGPT responses to 44% of the 16 dialog-based and 69% of the 13 definition-based questions were deemed appropriate, and the proportion of appropriate responses within the 2 groups of questions was not significantly different (P = 0.17). Notably, none of ChatGPT’s responses to questions related to gastrointestinal emergencies were designated appropriate. The mean readability scores showed that outputs were written at a college-level reading proficiency.

DISCUSSION:

ChatGPT can produce generally fitting responses to gastroenterological medical queries, but responses were constrained in appropriateness and readability, which limits the current utility of this large language model. Substantial development is essential before these models can be unequivocally endorsed as reliable sources of medical information.

KEYWORDS: natural language processing, AI, artificial intelligence, medical terminology, gastroenterology



INTRODUCTION

An astounding volume of health information is available online, yet no mechanisms exist for regulating its accuracy and validity. Compounding this problem is the recent arrival of artificial intelligence (AI)-based large language models (LLM), which use unaudited online data to generate answers to queries posed by the general public. With more than 70% of internet users consulting online health resources (1), empowering patients with the skills to discern between reliable and questionable online health information is of growing importance, and providing clinicians with tools for countering medical misinformation is critical. Thus, understanding the quality and characteristics of AI-generated responses to medical inquiries is needed before effective strategies can be developed to help patients and medical professionals cope with potentially false, inaccurate, or even dangerous information and to help LLM developers improve the reliability of their tools.

Chat Generative Pretrained Transformer (ChatGPT; OpenAI, San Francisco, CA) is a popular AI-powered LLM that engages in natural language conversations with users and generates human-like responses based on an evolving and nontransparent set of online training data (2). Notably, ChatGPT responses are dynamic, which means that responses to the same question can vary when posed by different users. These unique features make ChatGPT-generated healthcare responses difficult to appraise because traditional evaluation methods rely on authorship characteristics, such as attribution, authority, and references (3–5), none of which are applicable to ChatGPT. However, text readability and basic information quality, which are 2 key features that contribute to patients' ability to assess and understand online medical information, can be evaluated with existing methods. For instance, DISCERN (4) and QUality Evaluation Scoring Tool (5) are validated scoring frameworks for evaluating online health information quality. Numerous readability algorithms are available, such as the Flesch Reading Ease (FRE) (6) and Simple Measure of Gobbledygook (SMOG) scores (7). In general, because many people struggle to understand complex medical concepts and terminology (8–10), the LLMs that are fundamentally altering how individuals access online medical information need to be rigorously investigated to minimize dissemination of inappropriate and unreadable health information.

Therefore, in light of the explosive popularity of LLMs and the problems these platforms face in generating accurate, understandable medical information, we performed a study of the appropriateness and readability of ChatGPT-generated responses to questions regarding common gastroenterological topics. Our aim was to assess the suitability of ChatGPT responses to questions on a relevant medical topic that has an inherently high level of biological complexity and challenging technical terminology. We had an expert panel evaluate the appropriateness of ChatGPT responses to a range of gastroenterological inquiries and analyzed the responses for readability with 3 readability formulas. Understanding how LLMs such as ChatGPT may be helping—or hindering—patients’ understanding of important gastroenterological issues will be crucial for ensuring effective communication between caregivers and diverse patient populations and empowering individuals to make informed and safe health decisions.

METHODS

We performed an analysis of appropriateness and readability for ChatGPT responses to gastroenterology-focused inquiries. We established a ChatGPT-3 account on November 5, 2023, and systematically entered each question as a separate query. ChatGPT-3 was used because it is free and, therefore, the most accessible to patients. This study was deemed exempt from review by the Institutional Review Board at Henry Ford Hospital.

Question development

We developed 29 questions on gastroenterological topics of public interest to simulate queries that laypeople might pose to ChatGPT (see Supplementary Tables 1 and 2, http://links.lww.com/CTG/B201). To best address gastrointestinal (GI) inquiries of the highest public interest, we asked ChatGPT to report the most frequently requested GI-related inquiries that it had received over the past year; however, ChatGPT declined to provide this information, emphasizing its commitment to user privacy. Nonetheless, ChatGPT generated a list of 100 common GI-related questions, and these were then juxtaposed against a list of topics curated by the research team. Compiled topics were then assessed in Google Trends to gauge worldwide search volume over the preceding decade, which we considered a surrogate indicator of public interest.

We developed 2 sets of questions: dialog-based questions that included disease signs and symptoms and definition-based questions on GI-related diseases and topics. Two different questioning approaches were used to investigate whether ChatGPT's responses would differ between direct questions merely pursuing general information vs dialog-based inquiries that could require consideration of a differential diagnosis or other clinical information for providing an accurate response.

The first set of questions comprised 16 dialog-based inquiries concerning signs, symptoms, and disease management. These questions were designed to mimic the process of laypeople presenting disease signs or symptoms to ChatGPT for medical interpretation. Following methodology used in a previous study (11), questions were organized across 3 distinct clinical categories: malignancy, emergency, and benign gastroenterological conditions. All 3 categories encompassed conditions commonly addressed by gastroenterologists and hepatologists. The varied nature of these categories has significant implications for patient outcomes, because patient decisions and the urgency with which patients seek care are often influenced by the quality, accuracy, and readability of the information, or misinformation, obtained online (12).

The second set of questions comprised 13 definition-based questions designed to elicit straightforward informative responses regarding GI-related diseases and treatments, but all devoid of specific symptoms or signs. Colonoscopy-related questions were prioritized because colon cancer screening is a common way in which the public may come into contact with gastroenterology.

Determination of ChatGPT response appropriateness

All generated responses (see Supplementary Tables 1 and 2, http://links.lww.com/CTG/B201) were recorded for grading on medical appropriateness by an expert panel of 3 researchers who were all board-certified gastroenterologists and native English speakers. Each member had unique expertise and advanced training in various gastroenterology subspecialties, including advanced endoscopy, esophagus, hepatology, inflammatory bowel disease, and motility.

We identified 6 key categories for grading the appropriateness of online healthcare information: currency, relevance, accuracy, comprehensiveness, clarity, and urgency/next steps. Experts graded ChatGPT responses with a 5-point Likert scale (1: strongly disagree/very inaccurate; 2: disagree/inaccurate; 3: somewhat agree/accurate; 4: agree/accurate; and 5: strongly agree/very accurate). Outputs with a mean grade ≥ 4 in all 6 categories were deemed appropriate. Outputs with a mean grade < 4 in any of the 6 categories were classified as not appropriate.
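
To make this decision rule concrete, the sketch below (in R, the environment used for the study's statistical analyses) classifies a single response from the 3 raters' category scores. The example ratings and object names are hypothetical; the rule itself mirrors the description above, with a response deemed appropriate only when the mean rating is at least 4 in every one of the 6 categories.

# Hypothetical ratings for one ChatGPT response: 3 raters x 6 categories
ratings <- matrix(
  c(5, 5, 4,   # currency
    5, 4, 5,   # relevance
    4, 4, 3,   # accuracy
    5, 4, 4,   # comprehensiveness
    4, 3, 4,   # clarity
    4, 4, 5),  # urgency/next steps
  nrow = 3,
  dimnames = list(paste0("rater", 1:3),
                  c("currency", "relevance", "accuracy",
                    "comprehensiveness", "clarity", "urgency"))
)
category_means <- colMeans(ratings)        # mean score per category across raters
appropriate <- all(category_means >= 4)    # appropriate only if every category mean is >= 4
category_means
appropriate

With these illustrative scores, the accuracy and clarity means fall to 3.7, so the response would be classified as not appropriate, reflecting how a single low-scoring category disqualifies a response in Tables 1 and 2.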

Determination of ChatGPT response readability

Because no consensus has been reached on which method is best for evaluating readability of patient education materials, we used 3 of the most commonly used readability algorithms: the FRE scale (6), the Flesch-Kincaid Reading Grade Level (FKRGL) score (13), and the SMOG index (7). We derived FRE and FKRGL scores in Microsoft Word (Microsoft, Redmond, WA). The FRE score ranges from 0 to 100, where 0 signifies unreadable and 100 represents the highest readability. The FKRGL, the most widely used tool to assess readability (8), is a modified version of the FRE that produces an estimated grade level for text, in which higher scores mean lower readability (i.e., more difficult) and lower scores mean higher readability (i.e., easier) (13). We also used the SMOG index to calculate readability scores for all responses because it is uniquely recommended by the National Cancer Institute for developing health literature (14). SMOG scores were calculated with an available online readability calculator (15). Notably, SMOG scores often designate higher reading levels than other readability formulas because scores are based on an ideal of 100% comprehension for each grade level (7,8,16).
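
For reference, all 3 metrics are computed from simple text statistics (sentence count, word count, and syllable count). The sketch below is a minimal R implementation of the published Flesch, Flesch-Kincaid, and SMOG formulas; the syllable counter is a rough vowel-group heuristic and the example sentence is hypothetical, so scores will not exactly match those produced by Microsoft Word or the online SMOG calculator used in this study.

# Rough syllable estimate: count vowel groups in each word
count_syllables <- function(word) {
  max(1, length(gregexpr("[aeiouy]+", tolower(word))[[1]]))
}

readability <- function(text) {
  sentences <- unlist(strsplit(text, "[.!?]+\\s*"))
  sentences <- sentences[nzchar(sentences)]
  words <- unlist(strsplit(text, "[^A-Za-z]+"))
  words <- words[nzchar(words)]
  syllables <- sapply(words, count_syllables)

  n_sent <- length(sentences)
  n_word <- length(words)
  n_syll <- sum(syllables)
  n_poly <- sum(syllables >= 3)   # polysyllabic words (3 or more syllables)

  fre   <- 206.835 - 1.015 * (n_word / n_sent) - 84.6 * (n_syll / n_word)
  fkrgl <- 0.39 * (n_word / n_sent) + 11.8 * (n_syll / n_word) - 15.59
  # SMOG is intended for samples of 30+ sentences; shorter texts are extrapolated
  smog  <- 1.0430 * sqrt(n_poly * (30 / n_sent)) + 3.1291
  c(FRE = fre, FKRGL = fkrgl, SMOG = smog)
}

readability("Dysphagia with unintentional weight loss warrants prompt evaluation to exclude esophageal malignancy.")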

Response heterogeneity

To assess how input question heterogeneity may impact ChatGPT-generated responses, we performed a subanalysis of one dialog-based and one definition-based question. Variations of one question from each group were composed with the help of an editorial assistant, who is not a medical doctor, to help ensure that the inquiries reflected how a layperson might phrase them. Each variation was entered into a new ChatGPT session, and the responses were reviewed qualitatively and analyzed for FKRGL.

Statistical analysis

Descriptive statistics were used to characterize the appropriateness scores and readability results for all ChatGPT responses. Analyses were conducted in R version 4.2.2 (R Foundation, Vienna, Austria). Mean ± SD was used to describe continuous variables. Counts and percentages were used to describe categorical variables. One-way analysis of variance and t test were used to compare mean readability scores of responses to dialog-based vs definition-based questions. Fisher exact and χ2 tests were used to compare the proportions of responses that were deemed appropriate between dialog-based and definition-based questions and between malignant, benign, and emergent disease-related questions. All tests were 2-tailed, and P < 0.05 was the cutoff for statistical significance.
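
As an illustration, the comparisons described above can be reproduced with base R functions. The sketch below uses the appropriateness counts reported in the Results (7 of 16 dialog-based and 9 of 13 definition-based responses deemed appropriate) and small illustrative subsets of the FKRGL values from Tables 3 and 4; exact P values depend on the test variant and correction chosen, so outputs may differ from the values reported in the manuscript.

# Appropriate vs not appropriate, by question type (counts from the Results)
appropriateness <- matrix(c(7, 9,    # dialog-based
                            9, 4),   # definition-based
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("dialog", "definition"),
                                          c("appropriate", "not_appropriate")))
fisher.test(appropriateness)                    # Fisher exact test
chisq.test(appropriateness, correct = FALSE)    # chi-square test without continuity correction

# Two-sample t test on FKRGL (illustrative subsets of Tables 3 and 4)
fkrgl_dialog     <- c(15.5, 12.2, 15.5, 13.6, 13.4)
fkrgl_definition <- c(12.1, 13.5, 15.2, 14.1, 12.2)
t.test(fkrgl_dialog, fkrgl_definition)

# One-way ANOVA across disease categories (symptom-related FKRGL values from Table 3)
readability_by_category <- data.frame(
  fkrgl = c(15.5, 12.2, 15.5,   # malignancy
            13.3, 13.9, 8.3,    # emergency
            13.4, 13.7, 11.1),  # benign
  category = rep(c("malignancy", "emergency", "benign"), each = 3)
)
summary(aov(fkrgl ~ category, data = readability_by_category))

In the study itself, these tests were applied to the full sets of responses summarized in Tables 1–4.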

RESULTS

Expert assessment of appropriateness

For the 16 dialog-based questions, 7 (44%) responses were deemed appropriate, and 9 (56%) were deemed not appropriate. These responses had the highest mean scores for currency and relevance and had the lowest scores for clarity. The malignancy-related questions generated a significantly higher proportion of responses that were deemed appropriate (4 of 5; 80%) compared with 60% of responses for questions on benign conditions (3 of 5) and no responses (0 of 6) to questions on GI emergency topics (P < 0.05) (Table 1). For the dialog-based questions, the proportion of appropriate responses between the 9 symptom-related vs 7 treatment-related responses was not significantly different (56% vs 29%; P = 0.36) (Table 1).

Table 1.

Appropriateness of ChatGPT responses to dialog-based questions on gastrointestinal disease symptoms and treatments

Areas and question topics | Currency | Relevance | Accuracy | Comp | Clarity | Next steps/urgency | Appropriateb,c
Values are expert opinion appropriateness scores, mean (SD), N = 3.a
Malignancy signs and symptoms
 Dysphagia and weight loss 5.0 (0) 5.0 (0) 5.0 (0) 5.0 (0) 4.7 (0.58) 4.7 (0.58) Yes
 Thin stool 5.0 (0) 5.0 (0) 4.3 (0.58) 4.7 (0.58) 4.7 (0.58) 5.0 (1.15) Yes
 Obstipation 4.7 (0.58) 4.7 (0.58) 4.7 (0.58) 4.7 (0.58) 4.7 (0.58) 5.0 (0) Yes
Malignancy treatments
 Radiofrequency ablation 4.7 (0.58) 4.7 (0.58) 4.7 (0.58) 4.7 (0.58) 4.3 (0.58) 4.7 (0.58) Yes
 Colon cancer surgery 4.0 (1) 4.3 (1.15) 4.0 (1) 4.0 (1) 3.3 (2.08) 4.0 (1) No
Emergency signs and symptoms
 Hematemesis 5.0 (0) 4.7 (0.58) 4.3 (0.58) 4.3 (0.58) 3.7 (2.31) 3.7 (2.31) No
 Hematochezia 4.7 (0.58) 5.0 (0) 4.3 (0.58) 4.3 (0.58) 3.7 (1.53) 4.0 (1) No
 Food impaction 4.3 (0.58) 4.3 (0.58) 3.3 (1.53) 4.0 (1) 4.7 (0.58) 4.7 (0.58) No
Emergency treatments
 Hematochezia workup 5.0 (0) 5.0 (0) 4.3 (1.15) 5.0 (0) 3.7 (2.31) 3.7 (2.31) No
 Endoscopic variceal banding 5.0 (0) 5.0 (0) 5.0 (0) 4.3 (0.58) 3.7 (1.53) 3.7 (1.15) No
 Food impaction management 3.7 (1.53) 4.0 (1.73) 3.0 (1.73) 3.3 (1.53) 3.3 (2.08) 4.3 (1.15) No
Benign GI signs and symptoms
 Jaundice and abdominal distention 4.7 (0.58) 4.0 (1) 3.7 (1.53) 4.3 (1.15) 3.0 (2) 3.7 (1.15) No
 Confusion (hepatic encephalopathy) 4.7 (0.58) 5.0 (0) 4.7 (0.58) 4.7 (0.58) 4.0 (1) 4.3 (0.58) Yes
 Burning in throat 5.0 (0) 5.0 (0) 4.7 (0.58) 5.0 (0) 5.0 (0) 4.3 (0.58) Yes
Benign GI treatments
 Cirrhosis management 5.0 (0) 5.0 (0) 4.3 (1.15) 4.7 (0.58) 3.3 (1.15) 4.3 (1.15) No
 GERD treatment 4.3 (1.15) 4.7 (0.58) 4.7 (0.58) 4.7 (0.58) 5.0 (0) 4.7 (0.58) Yes
Overall mean (SD) 4.7 (0.40) 4.7 (0.36) 4.3 (0.56) 4.5 (0.44) 4.0 (0.67) 4.3 (0.47) N/A

ChatGPT, Chat Generative Pretrained Transformer; Comp, comprehensiveness; GERD, gastroesophageal reflux disease; GI, gastrointestinal; N/A, not applicable.

a Scores based on 5-point Likert scale: 1 = strongly disagree/very inaccurate; 2 = disagree/inaccurate; 3 = somewhat agree/accurate; 4 = agree/accurate; and 5 = strongly agree/very accurate.

b Appropriate responses were those with a mean score ≥ 4 in all 6 categories.

c Boldface text highlights responses that did not meet the appropriateness threshold criteria.

For the 13 definition-based questions, 9 (69%) responses were deemed appropriate, and 4 (31%) were deemed not appropriate. These responses had the highest mean scores for currency and relevance and the lowest mean scores for next steps/urgency (Table 2). The proportion of appropriate responses to dialog-based vs definition-based questions was not significantly different (44% vs 69%; P = 0.17). Overall, across both question types, only 55% of responses were deemed appropriate. Notably, 72% of all ChatGPT responses received at least 1 expert score of < 4 in at least 1 of the proxy appropriateness categories, indicating that a substantial proportion of ChatGPT responses had some deficiency (Tables 1 and 2).

Table 2.

Appropriateness of ChatGPT responses to definition-based questions on gastrointestinal disease symptoms and treatments

Areas and questions | Currency | Relevance | Accuracy | Comp | Clarity | Next steps/urgency | Appropriateb,c
Values are expert opinion appropriateness scores, mean (SD), N = 3.a
General gastrointestinal topics
 Acid reflux 4.3 (0.58) 5.0 (0) 4.3 (0.6) 4.3 (0.6) 4.7 (0.6) 4.0 (1) Yes
 Acid reflux treatment 5.0 (0) 5.0 (0) 5.0 (0) 5.0 (0) 5.0 (0) 4.7 (0.6) Yes
 Celiac disease 5.0 (0) 5.0 (0) 5.0 (0) 4.3 (0.6) 4.3 (1.2) 4.3 (0.6) Yes
 Celiac disease treatment 5.0 (0) 5.0 (0) 5.0 (0) 5.0 (0) 4.7 (0.58) 4.7 (0.58) Yes
 Dietary recommendations 5.0 (0) 5.0 (0) 4.0 (0) 4.7 (0.58) 4.7 (0.58) 4.3 (0.58) Yes
 Diverticulitis 5.0 (0) 4.7 (0.58) 5.0 (0) 5.0 (0) 4.0 (1.73) 4.3 (1.15) Yes
 Diverticulitis treatment 5.0 (0) 5.0 (0) 4.3 (0.58) 4.7 (0.58) 3.7 (2.31) 4.3 (1.15) No
 Hepatitis 5.0 (0) 5.0 (0) 5.0 (0) 4.3 (0.58) 3.7 (2.31) 3.3 (1.15) No
 Hepatitis treatment 5.0 (0) 5.0 (0) 5.0 (0) 4.7 (0.58) 4.7 (0.58) 4.3 (0.58) Yes
 IBD vs IBS 5.0 (0) 5.0 (0) 4.3 (1.15) 4.7 (0.58) 3.7 (2.31) 3.7 (1.53) No
Colonoscopy
 What is a colonoscopy? 5.0 (0) 5.0 (0) 5.0 (0) 4.7 (0.58) 4.7 (0.58) 3.3 (2.08) No
 Colonoscopy preparation 5.0 (0) 5.0 (0) 4.7 (0.58) 4.7 (0.58) 4.7 (0.58) 4.3 (1.15) Yes
 Age for colonoscopy 5.0 (0) 5.0 (0) 5.0 (0) 4.3 (0.58) 4.3 (1.15) 4.3 (1.15) Yes
Overall mean (SD) 4.9 (0.19) 5.0 (0.09) 4.7 (0.36) 4.6 (0.25) 4.4 (0.46) 4.2 (0.44) N/A

ChatGPT, Chat Generative Pretrained Transformer; IBD, inflammatory bowel disease; IBS, irritable bowel syndrome; Comp, comprehensiveness; N/A, not applicable.

a Scores based on 5-point Likert scale: 1 = strongly disagree/very inaccurate; 2 = disagree/inaccurate; 3 = somewhat agree/accurate; 4 = agree/accurate; and 5 = strongly agree/very accurate.

b Appropriate responses were those with a mean score ≥ 4 in all 6 categories.

c Boldface text highlights responses that did not meet the appropriateness threshold criteria.

Readability

For dialog-based questions, the mean FRE score was 32.6 ± 11.4, the mean FKRGL grade was 13.4 ± 1.9, and the mean SMOG index was 15.5 ± 1.6 (Table 3). Almost all scores (15 of 16) corresponded to an undergraduate to graduate reading level, with only 1 response at the recommended seventh-grade level.

Table 3.

Readability of ChatGPT responses to dialog-based questions on gastrointestinal conditions and treatments

Areas and question topics | Flesch Reading Easea | Flesch-Kincaid Grade Levelb | SMOGc
Malignancy signs and symptoms
 Dysphagia and weight loss 21.1 (very difficult) 15.5 (graduate) 17.6
 Thin stool 38.7 (difficult) 12.2 (college) 14.6
 Obstipation 16.4 (very difficult) 15.5 (graduate) 17.0
Malignancy treatments
 Radiofrequency ablation 32.7 (difficult) 13.6 (college) 15.8
 Colon cancer surgery 35.8 (difficult) 13.4 (college) 15.3
Emergency signs and symptoms
 Hematemesis 32.8 (difficult) 13.3 (college) 15.6
 Hematochezia 31.8 (difficult) 13.9 (college) 14.8
 Food impaction 63.6 (plain English) 8.3 (middle school) 11.8
Emergency treatments
 Hematochezia workup 24.4 (very difficult) 15.2 (graduate) 16.5
 Endoscopic variceal banding 28.3 (very difficult) 14.3 (college) 17.0
 Food impaction management 39.0 (difficult) 12.9 (college) 15.5
Benign GI signs and symptoms
 Jaundice and abdominal distention 33.3 (difficult) 13.4 (college) 15.5
 Confusion (hepatic encephalopathy) 28.2 (very difficult) 13.7 (college) 15.8
 Burning in throat 42.8 (difficult) 11.1 (high school) 13.0
Benign GI treatments
 Cirrhosis management 16.1 (very difficult) 15.7 (graduate) 17.8
 GERD treatment 35.8 (difficult) 12.1 (college) 13.8
Overall mean (SD) 32.6 (11.4) (difficult) 13.4 (1.9) (college) 15.5 (1.6)

ChatGPT, Chat Generative Pretrained Transformer; GERD, gastroesophageal reflux disease; GI, gastrointestinal; SMOG, Simple Measure of Gobbledygook Index.

a Flesch Reading Ease score ranges from 0 to 100 (higher score is easier to read).

b Flesch-Kincaid Grade Level ranges from 1 to 18 (lower grade is easier to read).

c SMOG formula estimates reading grade level (lower grade is easier to read).

Readability measures for responses to dialog-based symptom-related questions vs treatment-related questions did not differ significantly on any of the readability measures: FRE (P = 0.48), FKRGL (P = 0.33), and SMOG (P = 0.28). Similarly, readability did not differ significantly between responses on malignant, emergency, and benign conditions: FRE (P = 0.54), FKRGL (P = 0.66), and SMOG (P = 0.64) (Table 3).

For definition-based questions, the mean FRE score was 29.7 ± 6.6, the mean FKRGL grade was 14.0 ± 1.3, and the mean SMOG index was 15.7 ± 1.2. These readability scores correspond to undergraduate to graduate reading levels (Table 4). Overall, the mean readability scores for dialog-based vs definition-based questions were not significantly different: FRE score (P = 0.40), FKRGL (P = 0.33), and SMOG index (P = 0.71) (Tables 3 and 4).

Table 4.

Readability of ChatGPT responses to definition-based questions on gastrointestinal conditions and treatments

Areas and questions | Flesch Reading Easea | Flesch-Kincaid Grade Levelb | SMOGc
General gastrointestinal
 Acid reflux 36.6 (difficult) 12.1 (college) 13.6
 Acid reflux treatment 31.2 (difficult) 13.5 (college) 14.6
 Celiac disease 22.9 (very difficult) 15.2 (postgrad) 16.9
 Celiac disease treatment 29.2 (very difficult) 14.1 (college) 15.1
 Dietary recommendations 37.6 (difficult) 12.2 (college) 13.8
 Diverticulitis 19.5 (very difficult) 15.9 (postgrad) 16.4
 Diverticulitis treatment 21.5 (very difficult) 15.2 (postgrad) 16.7
 Hepatitis 32.4 (difficult) 13.6 (college) 15.8
 Hepatitis treatment 28.2 (very difficult) 13.0 (college) 15.5
 IBD vs IBS 26.4 (very difficult) 14.0 (college) 16.3
Colonoscopy
 What is a colonoscopy? 27.5 (very difficult) 15.7 (postgrad) 17.3
 Colonoscopy prep 42.5 (difficult) 12.4 (college) 14.9
 Age for colonoscopy 30.2 (difficult) 14.8 (college) 16.7
Overall mean (SD) 29.7 (6.6) (very difficult) 14.0 (1.3) (college) 15.7 (1.2)

ChatGPT, Chat Generative Pretrained Transformer; IBD, inflammatory bowel disease; IBS, irritable bowel syndrome; SMOG, Simple Measure of Gobbledygook Index.

a Flesch Reading Ease score ranges from 0 to 100 (higher score is easier to read).

b Flesch-Kincaid Grade Level ranges from 1 to 18 (lower grade is easier to read).

c SMOG formula estimates reading grade level (lower grade is easier to read).

Response heterogeneity

On qualitative review of the ChatGPT responses to heterogeneous inputs (see Supplementary Tables 3 and 4, http://links.lww.com/CTG/B201), we noted that despite variations in the questions posed to ChatGPT, the overall structure and content of the responses remained qualitatively consistent. For the hematochezia questions, ChatGPT typically generated a generalized statement indicating that there are various potential causes for blood in a bowel movement, followed by a list of several possible etiologies with brief explanations, and concluded with a recommendation to consult a healthcare provider. In 8 of 10 questions, ChatGPT used the same list of etiologies. The phrasing of the recommendation to consult a healthcare provider varied, with some responses suggesting more urgency than others. Similarly, responses to the variations of the definition-based question regarding hepatitis treatments were qualitatively consistent. The FKRGL of all responses to the dialog-based questions was at or above an undergraduate college reading level.

DISCUSSION

In this exploratory study of ChatGPT-generated responses to gastroenterological questions, an expert panel of gastroenterologists rated almost half of the responses as inappropriate on 1 or more of the 6 appropriateness measures. In particular, none of the responses to queries about symptoms and treatments for GI emergencies were designated appropriate. Although ChatGPT responses were generally rated by medical experts as current and relevant, they tended to fall short in clarity and in communicating urgency when clinically warranted. In addition, ChatGPT provided information that was consistently at the undergraduate college reading level or higher, which is well above the recommended sixth- to eighth-grade level for patient-focused materials (17–19). Our findings highlight that laypeople seeking GI-related medical information through online LLMs may receive responses that are not only difficult to understand but, more importantly, may be inappropriate or inaccurate.

Our results suggest that ChatGPT may be less able to generate appropriate responses for medical inquiries in gastroenterology than for other medical fields. For instance, one of the first studies to evaluate appropriateness of ChatGPT responses to cardiovascular disease inquiries found that 84% of the responses were appropriate, much higher than our rate of 55% (20). Similarly, a study that evaluated ChatGPT responses to urological questions found that 77.1% were appropriate (11). Although 2 studies assessing the accuracy of ChatGPT responses to gastroenterology questions reported high accuracy rates between 84% and 91% (21,22), these and other studies investigating ChatGPT responses to GI-related queries (23–27) lacked a formal grading rubric with clear criteria for evaluating appropriateness. Instead, they relied on general subjective determinations of appropriateness or comprehensiveness. Our study used a clear, structured, comprehensive, and reproducible evaluation that revealed more nuanced findings and highlighted specific deficiencies in ChatGPT responses. Indeed, when using a more granular grading system, the study by Kerbage et al (21) found that only 40% of responses were completely appropriate. Overall, given the nuanced nature of gastroenterological diseases, which are characterized by many nonspecific symptoms, successfully discerning an individual's experience and providing accurate medical information necessitates a thorough understanding of GI physiology and pathophysiology, which AI-powered LLMs cannot truly offer. This contrasts with other medical fields that may focus on smaller anatomical areas with more localized symptoms. Regardless, ChatGPT faces notable challenges in addressing medically related topics, which might partially stem from the program's design: responses are formulated by a prediction algorithm that uses statistical patterns to anticipate word combinations and generate expected responses, which may lead to mismatched and incomplete information (2,28,29). Moreover, the sources of information available to ChatGPT may not always be well-substantiated.

Patients regularly use the internet to obtain information about disease symptoms and treatments as well as general disease-specific information, with each type of query requiring a different approach to ensure accurate and appropriate responses. Although dialog-based questions about symptoms and treatments demand a synthesis of multifaceted medical information to generate a comprehensive differential diagnosis, definition-based questions typically require a simpler consolidation of current medical literature. Our study is unique in its comparative analysis of ChatGPT responses for both types of queries. Our findings suggest that basic definition-based questions may elicit more appropriate responses from ChatGPT than dialog-based questions about symptoms. Crucially, LLMs by design lack the capacity to comprehend text with true human discernment, so they may not interpret users' prompts as the users intend, which can result in confusing responses (2). Thus, users must formulate carefully worded prompts to elicit responses that align with their intent.

Previous studies of ChatGPT-generated GI medical information have typically assessed responses to a general list of questions (21) or have focused on specific topics such as individual diseases (22–25) or procedures (26,27). Our study, however, used a distinct comparative analysis of questions across different categories within gastroenterology (benign conditions, malignancies, and emergencies), yielding several noteworthy findings. Perhaps the most disturbing finding in our study was that experts deemed 0 of 6 responses to dialog-based questions on GI emergencies (e.g., hematemesis and food impaction) as meeting the appropriateness threshold. This is a critical finding because patients often consult the internet first when experiencing symptoms, and their subsequent health decision-making could be influenced by inaccurate information (12), potentially resulting in dire outcomes. It is crucial to note that ChatGPT was not designed to provide medical advice, so its responses do not necessarily align with accepted treatment guidelines. Overall, using ChatGPT to obtain gastroenterological information may be safer when users are seeking basic, definition-based information rather than diagnostic information, and better methods than simple disclaimers are critically needed to protect users from false information.

Although substantial efforts have been made to enhance ChatGPT's capacity to generate responses that align with user intent (2), the subtle nature of certain medical inquiries may remain particularly difficult for these platforms. Thus, evaluating the appropriateness of AI-generated responses is challenging, primarily because traditional measures rely on characteristics such as author features and cited reference sources (3–5), and LLMs cannot be held responsible for the quality of responses because they are not true “authors.” In addition, the algorithms that guide how LLMs generate responses are complex and not transparent, making it impossible to evaluate the quality of the source materials they use. Moreover, LLMs generate dynamic responses that differ based on user prompts and on evolving training data sets, making evaluation tools that can only evaluate static published medical content less valuable. Therefore, traditional methods are suboptimal for measuring the appropriateness of AI-generated medical information, and alternative methods are needed.

In addition to accuracy and appropriateness, the readability of LLM-generated medical information is also a concern. In our study, all 3 readability formulas indicated that ChatGPT-generated responses were written at a college reading level or higher for all but 1 question. This surpasses the current medical composition guidelines recommending that patient education materials should not exceed a sixth- to eighth-grade reading level (17–19). However, the discovery that ChatGPT generates medical information above the recommended reading levels is not surprising because even medical information created by human authors tends to be written at levels surpassing current recommended standards (30–32). This phenomenon can be attributed, in part, to the nature of readability formulas, which assess the number of syllables in words to gauge textual complexity. Medical terminology includes lengthy words with numerous syllables, such as “gastroenterology,” comprising 7 syllables, or “endoscopic retrograde cholangiopancreatography,” with a staggering 16 syllables. This complex terminology unavoidably contributes to the elevated reading grade level of medical texts. Nonetheless, clear and understandable gastroenterology medical information tailored to the appropriate patient reading level is critically needed.

Previous studies collecting ChatGPT-generated GI medical information have typically input the same question 2–3 times to review response consistency (22–27). By contrast, our study designed a novel qualitative analysis using heterogeneous inputs (differently worded questions focused on specific symptoms and definitions). This approach assesses the consistency of responses to various wordings of the same inquiry, providing a unique contribution to the literature on AI-generated medical information. Our assessment revealed that, despite the variation in the questions posed, ChatGPT generated largely similar responses. This consistency is likely due to the design of the language model and its utilization of a prediction algorithm that generates expected responses, as previously discussed (2,28,29). This finding underscores that although generating dynamic answers is possible, to do so ChatGPT requires careful prompt engineering to extract specific information, as the LLM tends to respond similarly to a broad set of questions unless precise terminology is used.

Medical providers and patients have adapted to technological advancements over time. The internet, in particular, has become a valuable resource for patients seeking medical information, and LLMs are now a novel tool at their disposal. In our study, we observed that although ChatGPT provided some reasonably suitable responses to gastroenterological medical queries, it fell short of meeting our standards for generating appropriate and readable healthcare information. As of now, we lack confidence in ChatGPT's ability to accurately comprehend natural language medical prompts and deliver contextually relevant information in gastroenterology. We advise that physicians caution patients about relying on LLMs for healthcare information at this time. With these caveats in mind, we are optimistic about the potential of LLMs to eventually assist users in navigating complex medical topics, which we hope will ultimately promote improved health literacy. For an LLM to perform this task successfully, its transformer neural network architecture would need to be trained on a well-substantiated body of evidence-based medical information, followed by a supervised learning phase in which clinician feedback refines the model's responses and improves accuracy. The LLM would also need to prompt users for additional, appropriate information to narrow a differential diagnosis. Finally, it must be able to quickly recognize GI emergencies and clearly convey the urgency for users to seek immediate medical attention.

Limitations

Our study had several limitations. We evaluated AI-generated responses to medical questions using only one AI-powered LLM on a limited subset of questions, which may limit the generalizability of our findings to other available LLMs. We had board-certified gastroenterologists rather than a blinded panel of independent experts assess the appropriateness of responses from ChatGPT, which may have introduced some bias. Finally, the rapidly evolving nature of AI technology means that newer versions of ChatGPT and other LLMs may offer different capabilities and performance, which could influence the outcomes of similar studies in the future.

AI-driven LLMs have a remarkable capacity for generating helpful responses to gastroenterological medical inquiries. However, we observed that AI-generated information may be inappropriate and inaccurate, which limits the current utility of these models. Moreover, the readability of responses usually surpassed the sixth- to eighth-grade reading level recommended for healthcare information. Significant improvements in reliability and accuracy are needed before LLMs can be unequivocally endorsed as dependable sources of medical information, especially for diagnostics and emergent health topics. To be considered for lay public use, LLMs must possess the capability to identify, triage, and provide clear information, including next steps and urgency, based on current evidence-based guidelines.

CONFLICTS OF INTEREST

Guarantor of the article: Tobias E. Zuchelli, MD.

Specific author contributions: A.T.: conceptualization, data curation, formal analysis, methodology, writing, reviewing, and editing. Z.S.: conceptualization, methodology, writing, reviewing, and editing. A.I.: formal analysis and writing. E.A.: data curation, formal analysis, reviewing, and editing. D.V.: data curation, formal analysis, reviewing, and editing. N.N.: data curation, formal analysis, writing, reviewing, and editing. T.E.Z.: conceptualization, data curation, writing, reviewing, and editing. All authors have approved the final version of this manuscript and concur with submission to Clinical and Translational Gastroenterology.

Financial support: None to report.

Potential competing interests: T.Z. is a consultant for Boston Scientific Corporation. N.N. has received consulting fees from Pfizer, Janssen, Boehringer Ingelheim, and Bristol Myers Squibb. The other authors have no conflicts of interest to disclose. Preliminary results of this study were presented at Digestive Disease Week 2024 as a poster presentation entitled Starting the Conversation: The Reliability of Artificial Intelligence Generated Medical Information in Gastroenterology.

Writing assistance: We thank Karla D. Passalacqua, PhD, and Stephanie Stebens, MLIS, at Henry Ford Health for their editorial assistance.

Data transparency statement: All analytical methods and data were included in the manuscript's methods and results sections.

Study Highlights.

WHAT IS KNOWN

  • ✓ The release of Chat Generative Pretrained Transformer (ChatGPT) marks a groundbreaking development in how patients access medical information.

  • ✓ ChatGPT generates human-like responses to prompts based on an evolving and nontransparent set of training data.

WHAT IS NEW HERE

  • ✓ Most ChatGPT-generated gastrointestinal (GI) medical information was deemed medically inappropriate by an expert panel of gastroenterologists.

  • ✓ All the ChatGPT responses to questions about GI emergencies were deemed inappropriate.

  • ✓ ChatGPT may generate less appropriate responses in gastroenterology compared with other medical fields.

  • ✓ The readability of the ChatGPT responses was higher than the National Institutes of Health-recommended reading level for healthcare information.

  • ✓ Significant improvements are needed before ChatGPT can be unequivocally endorsed as a dependable source of GI information.

Supplementary Material

ct9-15-e00765-s001.docx (88.5KB, docx)

Footnotes

SUPPLEMENTARY MATERIAL accompanies this paper at http://links.lww.com/CTG/B201.

Contributor Information

Zachary Saleh, Email: zsaleh3@hfhs.org.

Angela Ishak, Email: aishak1@hfhs.org.

Eva Alsheik, Email: ealshei1@hfhs.org.

Deepak Venkat, Email: dvenkat1@hfhs.org.

Neilanjan Nandi, Email: neilanjan.nandi@pennmedicine.upenn.edu.

Tobias E. Zuchelli, Email: tzuchel1@hfhs.org.

REFERENCES

1. Fox S, Duggan M. Health online 2013. Pew Research Center; January 15, 2013 (https://www.pewresearch.org/internet/2013/01/15/health-online-2013/). Accessed January 15, 2024.
2. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst 2022;35:27730–44.
3. Silberg WM, Lundberg GD, Musacchio RA. Assessing, controlling, and assuring the quality of medical information on the Internet: Caveant lector et viewor: Let the reader and viewer beware. JAMA 1997;277(15):1244–5.
4. Charnock D, Shepperd S, Needham G, et al. DISCERN: An instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health 1999;53(2):105–11.
5. Robillard JM, Jun JH, Lai JA, et al. The QUEST for quality online health information: Validation of a short quantitative tool. BMC Med Inform Decis Mak 2018;18(1):87.
6. Flesch R. A new readability yardstick. J Appl Psychol 1948;32(3):221–33.
7. Mc Laughlin GH. SMOG grading: A new readability formula. J Reading 1969;12:639–46.
8. Badarudeen S, Sabharwal S. Assessing readability of patient education materials: Current role in orthopaedics. Clin Orthop Relat Res 2010;468(10):2572–80.
9. Baker DW, Parker RM, Williams MV, et al. The relationship of patient reading ability to self-reported health and use of health services. Am J Public Health 1997;87(6):1027–30.
10. Johnson K, Weiss BD. How long does it take to assess literacy skills in clinical practice? J Am Board Fam Med 2008;21(3):211–4.
11. Davis R, Eppler M, Ayo-Ajibola O, et al. Evaluating the effectiveness of artificial intelligence-powered large language models application in disseminating appropriate and readable health information in urology. J Urol 2023;210(4):688–94.
12. Pourmand A, Sikka N. Online health information impacts patients' decisions to seek emergency department care. West J Emerg Med 2011;12(2):174–7.
13. Kincaid JP, Fishburne RP Jr, Rogers RL, et al. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Naval Technical Training Command, Research Branch; Millington, TN: Institute for Simulation and Training, University of Central Florida; 1975.
14. National Cancer Institute. Clear & Simple: Developing Effective Print Materials for Low-Literate Readers. National Cancer Institute: Bethesda, MD, 1994.
15. Özdemir B. SMOG readability (https://charactercalculator.com/smog-readability/). Accessed November 6, 2023.
16. Ley P, Florio T. The use of readability formulas in health care. Psychol Health Med 1996;1:7–28.
17. National Institutes of Health. Clear & simple. July 7, 2021 (https://www.nih.gov/institutes-nih/nih-office-director/office-communications-public-liaison/clear-communication/clear-simple/). Accessed January 15, 2024.
18. Nielsen-Bohlman L, Panzer AM, Kindig DA. Health Literacy: A Prescription to End Confusion. Institute of Medicine, National Academies Press: Washington, DC, 2004.
19. Weiss BD. Health Literacy: A Manual for Clinicians. American Medical Association Foundation, American Medical Association: Chicago, IL, 2003.
20. Sarraju A, Bruemmer D, Van Iterson E, et al. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA 2023;329(10):842–4.
21. Kerbage A, Kassab J, El Dahdah J, et al. Accuracy of ChatGPT in common gastrointestinal diseases: Impact for patients and providers. Clin Gastroenterol Hepatol 2024;22(6):1323–5.e3.
22. Henson JB, Glissen Brown JR, Lee JP, et al. Evaluation of the potential utility of an artificial intelligence chatbot in gastroesophageal reflux disease management. Am J Gastroenterol 2023;118(12):2276–9.
23. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023;29(3):721–32.
24. Pugliese N, Wai-Sun Wong V, Schattenberg JM, et al. Accuracy, reliability, and comprehensibility of ChatGPT-generated medical responses for patients with nonalcoholic fatty liver disease. Clin Gastroenterol Hepatol 2024;22(4):886–9.e5.
25. Lai Y, Liao F, Zhao J, et al. Exploring the capacities of ChatGPT: A comprehensive evaluation of its accuracy and repeatability in addressing Helicobacter pylori-related queries. Helicobacter 2024;29(3):e13078.
26. Lee TC, Staller K, Botoman V, et al. ChatGPT answers common patient questions about colonoscopy. Gastroenterology 2023;165(2):509–11.e7.
27. Ali H, Patel P, Obaitan I, et al. Evaluating the performance of ChatGPT in responding to questions about endoscopic procedures for patients. iGIE 2023;2(4):553–9.
28. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems 2017;arXiv:1706.03762.
29. Devlin J, Chang MW, Lee K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019;1:4171–86.
30. Rooney MK, Santiago G, Perni S, et al. Readability of patient education materials from high-impact medical journals: A 20-year analysis. J Patient Exp 2021;8:2374373521998847–9.
31. Stossel LM, Segar N, Gliatto P, et al. Readability of patient education materials available at the point of care. J Gen Intern Med 2012;27(9):1165–70.
32. Taylor-Clarke K, Henry-Okafor Q, Murphy C, et al. Assessment of commonly available education materials in heart failure clinics. J Cardiovasc Nurs 2012;27(6):485–94.
