Abstract
Artificial intelligence’s (AI) role in providing information on Celiac disease (CD) remains understudied. This study aimed to evaluate the accuracy and reliability of ChatGPT-3.5, the dominant publicly accessible version during the study period, in responding to 20 basic CD-related queries, thereby establishing a benchmark for AI-assisted CD education. The accuracy of ChatGPT’s responses to these twenty frequently asked questions (FAQs) was assessed by two independent experts using a 5-point Likert scale, and the questions were then categorized by CD management domain. Inter-rater reliability (agreement between experts) was determined through cross-tabulation, Cohen’s kappa, and Wilcoxon signed-rank tests. Intra-rater reliability (agreement within the same expert) was evaluated using the Friedman test with post hoc comparisons. ChatGPT demonstrated high accuracy in responding to CD FAQs, with expert ratings predominantly ranging from 4 to 5. While overall performance was strong, responses on management strategies scored higher than those on disease etiology. Inter-rater reliability analysis revealed fair agreement between the two experts (κ = 0.22, p-value = 0.026). Although both experts consistently assigned high scores across the CD management categories, subtle discrepancies emerged in specific instances. Intra-rater reliability analysis indicated consistent scoring for one expert (Friedman test, p-value = 0.113), whereas the other showed some variability (p-value < 0.001). ChatGPT shows potential as a reliable source of information for CD patients, particularly in the domain of disease management.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-15898-6.
Keywords: Celiac disease, Artificial intelligence, ChatGPT, Accuracy, Reliability
Subject terms: Machine learning, Immunology
Introduction
Artificial intelligence (AI) has emerged as a powerful tool in healthcare, with applications ranging from diagnostics to treatment optimization1. AI-powered chatbots, in particular, have gained prominence due to their ability to provide accessible and potentially personalized health information. These systems employ advanced deep learning algorithms to process vast amounts of data and generate human-like text responses2. Natural language processing (NLP), a subfield of AI, is essential for enabling chatbots to understand and respond to human language effectively. By developing algorithms that can process and analyze free text data, NLP has the potential to significantly improve access to medical information for both healthcare professionals and patients3,4. One prominent example of an AI chatbot incorporating NLP capabilities is ChatGPT, a large language model (LLM) capable of generating human-quality text based on the prompts it receives5.
The rise of the internet has transformed the way individuals seek information about their health. For various reasons, including limited access to healthcare providers, patients increasingly turn to online resources, including chatbots, to answer questions about their medical conditions. This trend is particularly evident in the realm of self-diagnosis and treatment seeking, where individuals often rely on online information to guide their healthcare decisions6. Celiac disease (CD) is a chronic autoimmune disorder7 triggered by the consumption of gluten in genetically predisposed individuals8–10. Due to its complex nature and often delayed diagnosis, many individuals with CD seek information and support online, often before or in conjunction with traditional healthcare settings11–13. The strict gluten-free diet (GFD) required for CD management can be challenging and overwhelming, further driving patients to seek additional guidance and resources14. In response to this need, AI chatbots like ChatGPT have emerged as a potential resource for individuals with CD15. However, given the sensitive nature of medical information and the potential for harm from inaccurate advice, it is crucial to evaluate the accuracy and reliability of AI chatbots in the context of CD management16.
ChatGPT has demonstrated potential applications across various medical domains, including rare and complex diseases17, dentistry18, radiology19, Parkinson’s disease20, pediatrics21,22, and hepatology. While recent studies have explored the use of chatbots in CD management23, these investigations often suffer from methodological limitations, such as the use of overly general queries and a lack of intra-rater reliability assessment. This study aims to rigorously evaluate ChatGPT’s capacity to provide accurate and consistent information to individuals with CD. By examining ChatGPT’s responses to frequently asked questions (FAQs) about CD, we seek to understand its potential as a valuable resource for patients and healthcare providers. The study’s findings will contribute to the growing body of knowledge on the application of AI in healthcare, particularly in the context of chronic disease management. By understanding the strengths and limitations of AI chatbots like ChatGPT, we can better inform their development and integration into healthcare settings, ultimately improving the quality of care and support for individuals with CD.
Materials and methods
Study design and setting
This study aimed to evaluate the accuracy and reliability of ChatGPT-3.5’s responses to FAQs about CD. ChatGPT-3.5 was selected due to its widespread use and unrestricted access, ensuring relevance to real-world patient interactions at the time of data collection. A total of 20 FAQs were submitted to the chatbot. To develop these questions, an initial list of 40 commonly asked CD-related questions was generated using ChatGPT-3.5, simulating real-world user interactions. This list was then reviewed and refined by clinical experts, who excluded questions that were redundant, overly simplistic, or closely aligned with textbook-style formulations. Questions were included based on their clinical relevance, diversity of content, and accessibility to the general public, with the goal of covering practical concerns commonly encountered in real-world patient interactions. The final set of 20 questions represented a diverse and balanced selection covering key domains of CD-related information, including disease definition and etiology, diagnostic procedures, and treatment and management strategies (Supplementary Table 1).
This study did not involve human participants directly but relied on expert evaluations of responses generated by ChatGPT-3.5. To evaluate the consistency of ChatGPT’s responses, each of the 20 FAQs was submitted to ChatGPT-3.5 three times, each in a new and independent session initiated at least 24 h apart. This ensured that the model did not retain any memory or contextual influence from previous interactions. All sessions were conducted using a clean interface with no prior conversation history, and the questions were randomized in order for each session to prevent any order bias. Two independent CD experts, each with over 10 years of clinical experience, rated each set of three responses on a 5-point Likert scale (1: Strongly Disagree, 5: Strongly Agree) based on accuracy and comprehensiveness. Higher scores indicated greater alignment with expert-defined correct information. Expert A is a PhD-level medical immunologist and director of a national research center for CD and gluten-related disorders. Expert B is a board-certified subspecialist in gastroenterology and hepatology, and director of a gastrointestinal research institute. Both experts have extensive clinical and research experience in the diagnosis, management, and patient education aspects of CD.
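To make the repeated-prompt design concrete, the sketch below shows one way the resulting ratings could be organized for analysis. It is a hypothetical layout (placeholder question labels, shuffled session order, empty score field), not the authors’ actual data-collection scripts.

```python
# Hypothetical layout of the rating data implied by the protocol above:
# 20 FAQs x 3 independent ChatGPT sessions x 2 expert raters, with the
# question order shuffled independently for each session to avoid order bias.
import random

import pandas as pd

QUESTIONS = [f"Q{i:02d}" for i in range(1, 21)]  # placeholders for the 20 FAQs
EXPERTS = ["A", "B"]
SESSIONS = [1, 2, 3]                             # each session yields Answer 1, 2, or 3

records = []
for session in SESSIONS:
    order = random.sample(QUESTIONS, k=len(QUESTIONS))  # new random order per session
    for question in order:
        for expert in EXPERTS:
            records.append(
                {
                    "question": question,
                    "answer": session,   # which of the three responses is being rated
                    "expert": expert,
                    "score": None,       # 1-5 Likert rating to be entered by the expert
                }
            )

ratings = pd.DataFrame(records)
print(ratings.head())
print(ratings.shape)  # (120, 4): 20 questions x 3 answers x 2 experts
```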
The scoring criteria were as follows: A score of 5 (Strongly Agree) indicates that the response is accurate and covers all the necessary information. A score of 4 (Agree) signifies that the response is correct and mostly comprehensive, but there may be some missing or incorrect details. A score of 3 (Neutral) reflects that the response is partly correct, but the details are primarily incorrect, missing, or not relevant; this score was applied when a response conveyed a recognizable core message but suffered from substantial limitations such as notable omissions, oversimplification, or partially inaccurate framing. A score of 2 (Disagree) denotes that the response is incorrect but contains some correct elements; in practice, this score indicated responses that were largely misleading or incorrect, with only minor correct elements that were insufficient to support user understanding or safe application. A score of 1 (Strongly Disagree) indicates that the response and its entire content are incorrect or irrelevant.
To further investigate the performance of ChatGPT-3.5 across different CD management groups, the 20 FAQs were categorized into three groups: Definition and Causes (FAQs related to the definition and underlying causes of CD), Diagnosis (FAQs pertaining to the diagnostic procedures and tools used for CD identification), and Treatment (FAQs concerning the various treatment options and management strategies for CD).
The study was approved by the ethical committee of the Shahid Beheshti University of Medical Sciences in Tehran, Iran (IR.SBMU.RETECH.REC.1403.166), and the research was performed in accordance with the Declaration of Helsinki. Both expert raters provided written informed consent prior to their participation in the study.
Statistical analysis
Accuracy assessment
The evaluation of ChatGPT’s response accuracy involved a three-step process. First, each expert’s scores for a given question were summarized as a median and range across the three responses; these scores were then combined into a single mean score with a corresponding standard deviation (SD) for each question. Second, the distribution of Likert-scale ratings was examined across the three repeated answers (Answers 1–3) for each expert (A and B). Third, the average scores were grouped and evaluated by CD management category. Responses with scores closer to 5 were considered more accurate.
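A minimal sketch of this aggregation is shown below, assuming a long-format table of ratings; the column names are illustrative, and the example scores are chosen to be consistent with the first row of Table 1 rather than taken from the raw study data.

```python
# Sketch of the accuracy aggregation: per-expert median (range) over the three
# answers, and a combined mean +/- SD over all six ratings for each question.
import pandas as pd

# Illustrative ratings for one question (consistent with Table 1, row 1).
ratings = pd.DataFrame(
    {
        "question": ["What is celiac disease?"] * 6,
        "expert": ["A", "A", "A", "B", "B", "B"],
        "answer": [1, 2, 3, 1, 2, 3],
        "score": [5, 5, 5, 3, 5, 4],
    }
)

# Per-expert summary: median and range (min-max) across the three answers.
per_expert = ratings.groupby(["question", "expert"])["score"].agg(["median", "min", "max"])
print(per_expert)

# Combined summary: mean +/- SD across all six ratings for the question.
combined = ratings.groupby("question")["score"].agg(["mean", "std"]).round(2)
print(combined)  # mean 4.50, SD 0.84 for this illustrative set of scores
```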
Reliability analysis
Inter-rater reliability was assessed using a multi-step analytical approach. First, a descriptive comparison was conducted by presenting the median and range of scores given by each expert for every individual ChatGPT response (as shown in Table 1). This allowed for a transparent, response-level examination of how closely or differently each expert scored the same outputs. Following this, a cross-tabulation analysis was performed to explore the frequency distribution of ratings assigned by each expert across the full set of responses. This step helped visualize patterns of agreement or disagreement between raters. Subsequently, Cohen’s kappa was calculated to measure the degree of categorical agreement beyond chance, and the Wilcoxon signed-rank test was used to assess whether there were statistically significant differences in the scoring tendencies between the two experts.
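The sketch below illustrates these three inter-rater analyses on hypothetical paired score vectors (not the study data), using pandas’ crosstab, scikit-learn’s cohen_kappa_score, and SciPy’s wilcoxon as stand-ins for the SPSS/R procedures used in the study.

```python
# Sketch of the inter-rater analyses: cross-tabulation of the two experts'
# ratings, Cohen's kappa for chance-corrected agreement, and a Wilcoxon
# signed-rank test for systematic differences in scoring tendency.
import numpy as np
import pandas as pd
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Hypothetical paired 1-5 ratings from Expert A and Expert B for 60 responses.
expert_a = rng.integers(3, 6, size=60)
expert_b = np.clip(expert_a + rng.integers(-1, 2, size=60), 1, 5)

# Frequency of agreement/disagreement across Likert categories (cf. Table 3).
print(pd.crosstab(expert_a, expert_b, rownames=["Expert A"], colnames=["Expert B"]))

# Chance-corrected categorical agreement between the raters (cf. Table 4).
print("Cohen's kappa:", round(cohen_kappa_score(expert_a, expert_b), 2))

# Paired, non-parametric test of whether one expert tends to score higher.
stat, p = wilcoxon(expert_a, expert_b)
print(f"Wilcoxon signed-rank: statistic = {stat}, p = {p:.3f}")
```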
Table 1.
Accuracy assessment of ChatGPT responses for celiac disease FAQ.
Questions | Expert A | Expert B | Combined Score |
---|---|---|---|
What is celiac disease? | 5 (5–5) | 4 (3–5) | 4.50 ± 0.84 |
What causes celiac disease? | 4 (2–5) | 4 (4–5) | 4.00 ± 1.10 |
What are the symptoms of celiac disease? | 4 (4–5) | 4 (4–5) | 4.33 ± 0.52 |
How is celiac disease diagnosed? | 5 (4–5) | 4 (4–5) | 4.50 ± 0.55 |
Can children develop celiac disease? | 4 (4–5) | 5 (4–5) | 4.50 ± 0.55 |
Is celiac disease hereditary? | 4 (4–5) | 4 (4–5) | 4.33 ± 0.52 |
What is gluten and why is it a problem for people with celiac disease? | 4 (4–5) | 4 (3–5) | 4.17 ± 0.75 |
What foods can people with celiac disease eat? | 4 (4–5) | 4 (4–5) | 4.33 ± 0.52 |
What foods should people with celiac disease avoid? | 5 (5–5) | 5 (4–5) | 4.83 ± 0.41 |
Is a gluten-free diet the only treatment for celiac disease? | 4 (2–5) | 5 (4–5) | 4.17 ± 1.17 |
Can people with celiac disease eat oats? | 5 (5–5) | 4 (4–5) | 4.67 ± 0.52 |
Can people with celiac disease eat wheat starch? | 5 (2–5) | 5 (4–5) | 4.33 ± 1.21 |
Are there any medications that can treat celiac disease? | 5 (4–5) | 4 (3–5) | 4.33 ± 0.82 |
How long does it take for someone with celiac disease to see improvements after starting a gluten-free diet? | 5 (4–5) | 4 (3–5) | 4.33 ± 0.82 |
Can someone with celiac disease have an occasional cheat meal with gluten? | 5 (5–5) | 4 (4–5) | 4.67 ± 0.52 |
What are the potential long-term complications of celiac disease? | 5 (5–5) | 5 (4–5) | 4.83 ± 0.41 |
How can people with celiac disease manage their symptoms while traveling or eating out? | 5 (5–5) | 5 (5–5) | 5.00 ± 0.00 |
Are there any support groups for people with celiac disease? | 5 (5–5) | 4 (4–5) | 4.67 ± 0.52 |
Can people with celiac disease still drink beer or other alcoholic beverages? | 5 (5–5) | 3 (3–5) | 4.33 ± 1.03 |
How often should someone with celiac disease get tested for other autoimmune disorders? | 4 (4–5) | 4 (4–5) | 4.33 ± 0.52 |
Scores for each of the three responses are aggregated for each question. Individual expert scores are presented as the median and range, while combined scores are depicted as the mean and standard deviation. Abbreviation: Frequently Asked Questions (FAQs).
To evaluate intra-rater reliability, the Friedman test with post hoc pairwise comparisons was used to assess whether each expert rated the three ChatGPT-generated responses to the same question (Answer 1, 2, and 3) consistently. Bonferroni-adjusted p-values were calculated to account for multiple comparisons in the post hoc analysis.
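A minimal sketch of this intra-rater analysis on hypothetical ratings follows; it assumes pairwise Wilcoxon signed-rank tests as the post hoc procedure (the specific pairwise test is not named above) and applies a simple Bonferroni correction for the three comparisons.

```python
# Sketch of the intra-rater analysis: an omnibus Friedman test across one
# expert's ratings of the three repeated answers, followed by Bonferroni-
# adjusted pairwise post hoc comparisons (Wilcoxon signed-rank assumed).
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(1)
# Hypothetical 1-5 ratings by one expert for Answers 1-3 across the 20 FAQs.
a1 = rng.integers(4, 6, size=20)
a2 = np.clip(a1 - rng.integers(0, 2, size=20), 1, 5)
a3 = np.clip(a1 - rng.integers(0, 2, size=20), 1, 5)

chi2, p = friedmanchisquare(a1, a2, a3)
print(f"Friedman test: chi-square = {chi2:.2f}, p = {p:.3f}")

# Post hoc pairwise comparisons with Bonferroni adjustment for three tests.
answers = {"A1": a1, "A2": a2, "A3": a3}
pairs = list(combinations(answers, 2))
for name_x, name_y in pairs:
    diffs = answers[name_x] - answers[name_y]
    if not diffs.any():  # identical ratings: nothing to test
        print(f"{name_x} vs {name_y}: ratings identical")
        continue
    _, p_raw = wilcoxon(answers[name_x], answers[name_y])
    p_adj = min(p_raw * len(pairs), 1.0)  # Bonferroni-adjusted p-value
    print(f"{name_x} vs {name_y}: p = {p_raw:.3f}, adjusted p = {p_adj:.3f}")
```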
All statistical analyses were performed using SPSS (version 27) and R (version 4.3.2). A p-value of less than 0.05 was considered statistically significant.
Results
Accuracy
Table 1 summarizes the accuracy assessment of ChatGPT’s responses to CD FAQs by two independent experts (Expert A and Expert B) and their combined scores. Across the various questions, combined scores ranged from 4.00 ± 1.10 to 5.00 ± 0.00, where values represent mean Likert score ± standard deviation (SD). A score of 5.00 ± 0.00 reflects perfect agreement and high accuracy, while 4.00 ± 1.10 indicates somewhat greater variability between raters. Notably, the question “How can people with CD manage their symptoms while traveling or eating out?” received a perfect score of 5.00 ± 0.00, indicating that the responses were accurate and covered all the necessary information. Conversely, the question “What causes CD?” received the lowest combined score of 4.00 ± 1.10, implying that the responses were correct and mostly comprehensive but may have contained some missing or incorrect details. For greater transparency, the complete set of individual ratings for all three ChatGPT responses per question (A1–A3 and B1–B3) is provided in Supplementary Table 2.
Table 2 presents the distribution of individual expert ratings for each of the three ChatGPT responses per question, categorized using a 5-point Likert scale. For this descriptive analysis, each response instance (i.e., Answer1, Answer2, and Answer3) was treated as an independent event, and ratings were pooled across all 20 questions to summarize the overall distribution of scores. Expert A’s ratings showed some variability, with the majority of responses falling within the categories of “accurate and comprehensive” (Score 5) or “mostly correct with some potential issues” (Score 4). The percentage of responses rated within these top two categories ranged from 90% (Answer 1) to 100% (Answer 2), yielding an overall mean of 95%. Expert B’s ratings were slightly more varied, though the majority were still within the same top categories. For Expert B, scores in these categories ranged from 80% (Answer 3) to 100% (Answer 1), with an overall mean of 90%. The full content of ChatGPT’s responses, which these ratings are based on, is available in Supplementary Table 1.
Table 2.
Distribution of expert ratings across the 5-Point Likert scale for each ChatGPT Response.
Likert Score Description | Expert A: Answer1 | Expert A: Answer2 | Expert A: Answer3 | Expert B: Answer1 | Expert B: Answer2 | Expert B: Answer3 |
---|---|---|---|---|---|---|
1. The response and its entire content are incorrect or irrelevant | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) |
2. The response is incorrect, but it does have some correct elements | 2 (10.00) | 0 (0.00) | 1 (5.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) |
3. The response is partly correct, but the details are primarily incorrect, missing, or not relevant | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 2 (10.00) | 4 (20.00) |
4. The response is correct and mostly comprehensive, but there may be some missing or incorrect details | 1 (5.00) | 8 (40.00) | 8 (40.00) | 3 (15.00) | 13 (65.00) | 11 (55.00) |
5. The response is accurate and covers all the necessary information | 17 (85.00) | 12 (60.00) | 11 (55.00) | 17 (85.00) | 5 (25.00) | 5 (25.00) |
Values in parentheses represent percentage scores. Ratings were assigned based on a 5-point Likert scale, where 1 = “The response and its entire content are incorrect or irrelevant” and 5 = “The response is accurate and covers all the necessary information”.
Figure 1A depicts the average scores for three CD management categories (“Definition and Causes,” “Diagnosis,” and “Treatment”). All categories received scores above 4.00 (mean ± SD), with values ranging from 4.30 ± 0.75 to 4.52 ± 0.75. These results suggest a generally favorable assessment of responses related to these management aspects.
Fig. 1.
Average Scores of Responses by CD Management Category and Expertise.
Reliability
Inter-rater reliability
Table 3 presents the agreement on scores between the two experts (Expert A and Expert B) for all ChatGPT responses. The table reveals the highest concordance (21 out of 60 responses) for the “strongly agree” category (Score 5). Partial agreement was observed for the “agree” category (Score 4), with 12 out of 60 responses showing agreement between both experts. Cohen’s kappa coefficient, reported in Table 4 (κ = 0.22, 95% CI: 0.04–0.39, p-value = 0.026), was calculated to assess agreement between the two independent experts and reflects fair inter-rater agreement. The Wilcoxon signed-ranks test did not detect significant differences between experts’ ratings for Answer B1 compared with A1 (Z = −0.743, p-value = 0.500) or B3 compared with A3 (Z = −1.812, p-value = 0.078). However, a significant difference was found for Answer B2 compared with A2 (Z = −2.179, p-value = 0.045).
Table 3.
Cross-Tabulation of agreement between experts on likert scale scores for ChatGPT responses (n = 60).
Expert A | Expert B: 1. The response and its entire content are incorrect or irrelevant | 2. The response is incorrect, but it does have some correct elements | 3. The response is partly correct, but the details are primarily incorrect, missing, or not relevant | 4. The response is correct and mostly comprehensive, but there may be some missing or incorrect details | 5. The response is accurate and covers all the necessary information |
---|---|---|---|---|---|
1. The response and its entire content are incorrect or irrelevant | 0 | 0 | 0 | 0 | 0 |
2. The response is incorrect, but it does have some correct elements | 0 | 0 | 0 | 0 | 3 |
3. The response is partly correct, but the details are primarily incorrect, missing, or not relevant | 0 | 0 | 0 | 0 | 0 |
4. The response is correct and mostly comprehensive, but there may be some missing or incorrect details | 0 | 0 | 2 | 12 | 3 |
5. The response is accurate and covers all the necessary information | 0 | 0 | 4 | 15 | 21 |
Responses were scored based on a 5-point Likert scale. Descriptive definitions for each score are shown in both rows and columns.
Table 4.
Inter-rater reliability and differences in scores between Experts.
Comparison | Wilcoxon signed-ranks test: Z | Wilcoxon p-value | Kappa (A vs. B): κ (95% CI) | Kappa p-value |
---|---|---|---|---|
Answer B1 – Answer A1 | –0.743 | 0.500 | 0.22 (0.04, 0.39) | 0.026 |
Answer B2 – Answer A2 | –2.179 | 0.045 | | |
Answer B3 – Answer A3 | –1.812 | 0.078 | | |
Inter-rater reliability was assessed by comparing Expert B’s scores for each of the three answers (B1 to B3) against the corresponding scores from Expert A (A1 to A3) using the Wilcoxon signed-ranks test. In addition, agreement between the two experts across all ChatGPT responses was quantified using Cohen’s kappa.
Figure 1B further illustrates inter-rater reliability within specific CD management categories. Both raters assigned generally high mean scores across all categories (“Definition and causes”: Rater A 4.33 ± 0.82, Rater B 4.27 ± 0.70; “Diagnosis”: Rater A 4.58 ± 0.51, Rater B 4.42 ± 0.51; “Treatment”: Rater A 4.67 ± 0.78, Rater B 4.36 ± 0.70), and the between-rater differences were non-significant in all categories.
Intra-rater reliability
Table 5 shows high intra-rater reliability for Expert A (χ² = 4.42, p-value = 0.113), with no significant differences detected between the scores assigned to the three repeated responses (A1, A2, A3). However, significant differences were observed for Expert B (χ² = 14.3, p-value < 0.001) between the ratings for B1, B2, and B3, suggesting some variability in scoring within this rater.
Table 5.
Intra-rater reliability and variability in responses within Raters.
Comparison | Friedman test: Chi-square | Friedman test: p-value | Post hoc: p-value | Post hoc: adjusted p-value |
---|---|---|---|---|
A1 vs. A2 | 4.42 | 0.113 | 0.705 | 1.000 |
A1 vs. A3 | | | 0.343 | 1.000 |
A2 vs. A3 | | | 0.345 | 1.000 |
B1 vs. B2 | 14.3 | < 0.001 | 0.005 | 0.014 |
B1 vs. B3 | | | 0.002 | 0.005 |
B2 vs. B3 | | | 0.594 | 1.000 |
Intra-rater reliability was assessed using the Friedman test to examine differences among each expert’s three repeated ratings (A1–A3 and B1–B3); the Friedman chi-square and p-value are reported once per expert, on the first row of that expert’s comparisons. Post hoc pairwise comparisons were conducted with Bonferroni adjustment.
Discussion
The current study aimed to evaluate the accuracy and reliability of ChatGPT’s responses to FAQs about CD. Two independent experts evaluated the information provided by ChatGPT in response to a variety of CD FAQs. Overall, the responses achieved high accuracy, with combined scores ranging from 4.00 ± 1.10 to a perfect score of 5.00 ± 0.00. This indicates that ChatGPT was generally successful in conveying truthful and complete information on CD. Notably, questions regarding management strategies, such as traveling or eating out while having CD, received perfect scores, highlighting exceptional clarity and practical usefulness in these areas. The study also assessed the consistency of ChatGPT’s responses through both inter-rater and intra-rater reliability analyses. Inter-rater reliability, measured using Cohen’s kappa coefficient, indicated fair agreement between the two experts. While some discrepancies existed in scoring specific responses, no significant difference was observed across the three main domains (Definition and Causes, Diagnosis, and Treatment), suggesting that ChatGPT’s performance was relatively stable across different raters. In contrast, intra-rater reliability analysis revealed variability between the experts. Expert A demonstrated consistent scoring across repeated responses to the same question, while Expert B showed greater variation. This inconsistency may reflect two underlying factors: (1) subjective interpretation by the rater when assessing responses with borderline or nuanced content, and (2) actual variability in ChatGPT’s outputs to identical prompts. As such, intra-rater variability in this context should be understood as a function of both human judgment and inherent fluctuations in AI-generated responses—each of which has implications for evaluating and applying LLM-based tools in clinical communication.
Our findings indicate that ChatGPT exhibits high accuracy in responding to FAQs about CD, providing comprehensive and truthful information. This aligns with previous research demonstrating ChatGPT’s capabilities in various medical domains. Jansson-Knodell et al. (2023) conducted a comparative analysis of ChatGPT, Google Bard, and Bing Chat in answering CD-related queries, using ten commonly asked questions about CD. ChatGPT achieved a 90% accuracy rate in providing clear and understandable information, surpassing the average accuracy of 72.5% across all chatbots. Furthermore, the same research group evaluated ChatGPT’s ability to generate GFD plans for CD patients, finding 100% accuracy in identifying gluten-free food items24. While no studies have specifically reported low accuracy rates for ChatGPT in addressing CD FAQs, existing research supports its overall performance. Sciberras et al. (2024) and Kerbage et al. (2023) demonstrated ChatGPT’s accuracy in handling inflammatory bowel disease (IBD) inquiries, with accuracy levels consistently above 75%25,26. Moreover, studies on diabetes, head and neck cancer, total knee replacement, and lung cancer have reported ChatGPT accuracy rates ranging from 70 to 80%, demonstrating its potential as a reliable source of medical information26–29. In conclusion, the present study, combined with previous research, strongly suggests that ChatGPT can serve as a valuable tool for providing accurate and informative responses to patients with CD and potentially other medical conditions.
To effectively assess the potential of AI in managing CD across the domains of definition, causes, diagnosis, and treatment, it is essential to consider both accuracy and reliability. In our study, we assessed inter-rater reliability between experts for AI-generated information on CD management. Our findings indicated fair chance-corrected agreement between the two experts (κ = 0.22), with closely aligned mean ratings and no significant between-rater differences across the evaluated categories. These results are broadly in line with previous research examining the consistency of AI-generated content in various medical domains. For instance, Jansson-Knodell et al. (2023) reported moderate inter-rater reliability among AI chatbots for CD-related FAQs, although their focus was on chatbot consistency rather than human evaluator agreement23. Conversely, Cankurtaran et al. (2023) and Walker et al. (2023) observed excellent and substantial inter-rater reliability, respectively, among human experts evaluating AI-generated content for IBD and hepato-pancreatico-biliary conditions30,31. Similarly, Zhang et al. (2024) found high agreement between orthopedic surgeons assessing AI-generated information on total knee replacement32. Collectively, these findings suggest that AI-generated content related to CD and other medical conditions can exhibit a considerable degree of consistency when evaluated by human experts.
To complement the assessment of reliability, intra-rater reliability was evaluated by administering each question to ChatGPT three times. Each expert independently rated the resulting responses on separate occasions. This analysis aimed to determine the consistency of the experts’ ratings across multiple evaluations. Findings indicated high consistency in ratings for Expert A, while Expert B exhibited some variability. This suggests potential fluctuations in ChatGPT’s responses to identical prompts, emphasizing the importance of considering such variability when relying on AI-generated content for medical information.
To the best of our knowledge, this study is the first to directly assess intra-rater reliability in the context of AI-generated responses about CD, specifically examining the consistency of expert ratings across repeated ChatGPT outputs. While this study evaluated ChatGPT-3.5, the release of ChatGPT-4.0 underscores the need for ongoing research to assess iterative improvements in AI performance; future studies should compare versions to quantify advancements in accuracy for medical queries. One limitation of this study is its reliance on only 20 FAQs, which may not fully capture the diversity and complexity of questions related to CD. However, these 20 questions were carefully selected from an initial list of 40, ensuring they encompass a broad range of CD-related topics. Additionally, the initial list of FAQs was generated using ChatGPT-3.5 itself, potentially introducing bias in question selection and possibly influencing the subsequent evaluation of its responses. Despite this, the selected questions covered the key dimensions of CD-related information, arguably more comprehensively than if they had been selected manually. The lack of direct input from patients or caregivers in the generation or validation of the FAQs may nevertheless limit the extent to which the selected questions reflect real-world patient concerns and informational needs. The study’s design also involved subjective assessments by only two experts, which, despite their extensive experience, may limit the generalizability of the findings due to individual biases and interpretations. Although both raters were highly experienced in CD, one a PhD-level medical immunologist and director of a national CD research center, and the other a subspecialist in gastroenterology and director of a gastrointestinal research institute, the absence of a registered dietitian may have limited the evaluation of nutrition-specific aspects such as gluten-free food safety, dietary adequacy, and regional variability in labeling standards. However, this method is common in similar studies and has been employed in recent research in this field23,30,33,34. Furthermore, while inter-rater reliability showed fair agreement, variability in intra-rater reliability, particularly for Expert B, suggests inconsistencies in scoring, which could affect the overall reliability of the results. While this study focused on basic FAQs to reflect common patient concerns, ChatGPT’s performance on more nuanced or critical-thinking-dependent questions remains unexplored and warrants further investigation. Finally, because the study did not involve human participants directly and relied on expert evaluation of ChatGPT’s responses, it did not capture real-world user interactions and feedback; this design was appropriate for the primary objective of assessing the chatbot’s accuracy and reliability, but it leaves the patient-facing impact of such tools to future work.
Conclusion
Our findings indicate that ChatGPT demonstrates high accuracy in conveying truthful and comprehensive information about CD, particularly in areas such as management strategies. While inter-rater reliability among experts was generally fair, intra-rater reliability varied, highlighting the potential for inconsistencies in ChatGPT’s responses. Despite limitations such as the relatively small sample size of FAQs and the reliance on expert judgment, this study contributes to the growing body of research on AI-generated medical information. The results suggest that ChatGPT has the potential to serve as a valuable tool for providing information to patients with CD, but further research is needed to address the identified limitations and to explore the impact of ChatGPT on patient outcomes.
Supplementary Information
Acknowledgements
This study was supported by the Celiac Disease and Gluten-Related Disorders Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran. We thank Professor Doruk Oner for his insightful comments and suggestions that improved the manuscript.
Abbreviations
- AI
Artificial Intelligence
- NLP
Natural Language Processing
- CD
Celiac disease
- FAQs
Frequently asked questions
- GFD
Gluten-free diet
- SD
Standard deviation
- IBD
Inflammatory bowel disease
Author contributions
Conceptualization: M.M.G., N.A., M.A.L., M.R.N. Data curation: M.M.G., N.A., M.A.L. Formal analysis: M.A.L. Methodology: M.M.G., N.A., M.A.L. Project administration: M.R.N., A.S. Software: M.A.L. Supervision: M.R.N. Validation: M.R.N. Visualization: M.A.L. Writing – original draft: M.M.G., N.A., M.A.L., C.C. Writing – review & editing: M.M.G., N.A., M.A.L., C.C.
Funding
Not applicable.
Data availability
All data generated or analyzed during this study are included in this published article (and its Supplementary Information files).
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Mohadeseh Mahmoudi Ghehsareh, Nastaran Asri and Mehdi Azizmohammad Looha.
References
- 1. Guillen-Grima, F. et al. Evaluating the efficacy of ChatGPT in navigating the Spanish medical residency entrance examination (MIR): promising horizons for AI in clinical medicine. Clin. Pract. 13, 1460–1487 (2023).
- 2. Mohammad-Rahimi, H. et al. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int. Endod. J. 57, 305–314. 10.1111/iej.14014 (2024).
- 3. Mahmoudi Ghehsareh, M. et al. Application of artificial intelligence in Celiac disease: from diagnosis to patient follow-up. Iran. J. Blood Cancer 15, 125–137. 10.61186/ijbc.15.3.125 (2023).
- 4. Jo, H. & Park, D. H. Effects of ChatGPT’s AI capabilities and human-like traits on spreading information in work environments. Sci. Rep. 14, 7806. 10.1038/s41598-024-57977-0 (2024).
- 5. Johnson, D. et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res. Sq. 10.21203/rs.3.rs-2566942/v1 (2023).
- 6. Shahsavar, Y. & Choudhury, A. User intentions to use ChatGPT for self-diagnosis and health-related purposes: cross-sectional survey study. JMIR Hum. Factors 10, e47564. 10.2196/47564 (2023).
- 7. Mahmoudi Ghehsareh, M. et al. The correlation between fecal microbiota profiles and intracellular junction genes expression in young Iranian patients with Celiac disease. Tissue Barriers 2347766. 10.1080/21688370.2024.2347766 (2024).
- 8. Taraz, T. et al. Overview of the compromised mucosal integrity in Celiac disease. J. Mol. Histol. 55, 15–24. 10.1007/s10735-023-10175-0 (2024).
- 9. Taraghikhah, N. et al. An updated overview of spectrum of gluten-related disorders: clinical and diagnostic aspects. BMC Gastroenterol. 20, 258. 10.1186/s12876-020-01390-0 (2020).
- 10. Ashtari, S. et al. Prevalence of gluten-related disorders in Asia-Pacific region: a systematic review. J. Gastrointest. Liver Dis. 28, 95–105. 10.15403/jgld.2014.1121.281.sys (2019).
- 11. McNally, S. L. et al. Can consumers trust web-based information about Celiac disease? Accuracy, comprehensiveness, transparency, and readability of information on the internet. Interact. J. Med. Res. 1, e1. 10.2196/ijmr.2010 (2012).
- 12. Rostami-Nejad, M. et al. Endoscopic and histological pitfalls in the diagnosis of celiac disease: a multicentre study assessing the current practice. Rev. Esp. Enferm. Dig. 105, 326–333. 10.4321/S1130-01082013000600003 (2013).
- 13. Khoshbaten, M. et al. Fertility disorder associated with Celiac disease in males and females: fact or fiction? J. Obstet. Gynaecol. Res. 37, 1308–1312. 10.1111/j.1447-0756.2010.01518.x (2011).
- 14. Alghamdi, E. & Alnanih, R. Chatbot design for a healthy life to Celiac patients: a study according to a new behavior change model. Int. J. Adv. Comput. Sci. Appl. 12. 10.14569/IJACSA.2021.0121077 (2021).
- 15. Altamimi, I., Altamimi, A., Alhumimidi, A. S., Altamimi, A. & Temsah, M. H. Artificial intelligence (AI) chatbots in medicine: a supplement, not a substitute. Cureus 15, e40922. 10.7759/cureus.40922 (2023).
- 16. Edwards, A., Elwyn, G., Smith, C., Williams, S. & Thornton, H. Consumers’ views of quality in the consultation and their relevance to ‘shared decision-making’ approaches. Health Expect. 4, 151–161. 10.1046/j.1369-6513.2001.00116.x (2001).
- 17. Bragazzi, N. L. et al. Toward clinical generative AI: conceptual framework. JMIR AI 3, e55957. 10.2196/55957 (2024).
- 18. Alhaidry, H. M., Fatani, B., Alrayes, J. O., Almana, A. M. & Alfhaed, N. K. ChatGPT in dentistry: a comprehensive review. Cureus. 10.7759/cureus.38317 (2023).
- 19. Zanon, C. ChatGPT goes to the radiology department: a pictorial review. Preprint at 10.20944/preprints202312.0714.v1 (2023).
- 20. Aggarwal, N. Contribution of ChatGPT in Parkinson’s disease detection. Nucl. Med. Mol. Imaging 58, 101–103. 10.1007/s13139-024-00857-2 (2024).
- 21. Wei, Q. Evaluation of ChatGPT’s performance in providing treatment recommendations for pediatric diseases. Pediatr. Discov. 1. 10.1002/pdi3.42 (2023).
- 22. Ying, L. Screening/diagnosis of pediatric endocrine disorders through the artificial intelligence model in different language settings. Eur. J. Pediatr. 183, 2655–2661. 10.1007/s00431-024-05527-1 (2024).
- 23. Jansson-Knodell, C. & Rubio-Tapia, A. Investigating accuracy and performance of artificial intelligence for Celiac disease information: a comparative study. Am. J. Gastroenterol. 118, S1829 (2023).
- 24. Jansson-Knodell, C., Gardinier, D., Weekley, K. & Rubio-Tapia, A. S1830 Artificial intelligence for gluten-free diet advice for Celiac disease. Am. J. Gastroenterol. 118. 10.14309/01.ajg.0000956960.96170.ab (2023).
- 25. Sciberras, M. et al. Accuracy of information given by ChatGPT for patients with inflammatory bowel disease in relation to ECCO guidelines. J. Crohns Colitis. 10.1093/ecco-jcc/jjae040 (2024).
- 26. Goodman, R. S. et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw. Open 6, e2336483. 10.1001/jamanetworkopen.2023.36483 (2023).
- 27. Nakhleh, A., Spitzer, S. & Shehadeh, N. ChatGPT’s response to the diabetes knowledge questionnaire: implications for diabetes education. Diabetes Technol. Ther. 25, 571–573. 10.1089/dia.2023.0134 (2023).
- 28. Kuşcu, O., Pamuk, A. E., Sütay Süslü, N. & Hosal, S. Is ChatGPT accurate and reliable in answering questions regarding head and neck cancer? Front. Oncol. 13, 1256459 (2023).
- 29. Rahsepar, A. A. et al. How AI responds to common lung cancer questions: ChatGPT vs Google Bard. Radiology 307, e230922. 10.1148/radiol.230922 (2023).
- 30. Cankurtaran, R. E., Polat, Y. H., Aydemir, N. G., Umay, E. & Yurekli, O. T. Reliability and usefulness of ChatGPT for inflammatory bowel diseases: an analysis for patients and healthcare professionals. Cureus 15, e46736. 10.7759/cureus.46736 (2023).
- 31. Walker, H. L. et al. Reliability of medical information provided by ChatGPT: assessment against clinical guidelines and patient information quality instrument. J. Med. Internet Res. 25, e47479. 10.2196/47479 (2023).
- 32. Zhang, S., Liau, Z. Q. G., Tan, K. L. M. & Chua, W. L. Evaluating the accuracy and relevance of ChatGPT responses to frequently asked questions regarding total knee replacement. Knee Surg. Relat. Res. 36, 15. 10.1186/s43019-024-00218-5 (2024).
- 33. Naqvi, H. A. et al. Evaluation of online chat-based artificial intelligence responses about inflammatory bowel disease and diet. Eur. J. Gastroenterol. Hepatol. 36, 1109–1112. 10.1097/meg.0000000000002815 (2024).
- 34. Gilson, A. et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 9, e45312. 10.2196/45312 (2023).