BMC Urology. 2026 Jan 8;26:31. doi: 10.1186/s12894-025-02033-w

Readability and quality assessment of AI-powered chatbot responses to overactive bladder patient questions: a comparative study

Mutlu Deger 1, Tunahan Ates 2, Ismail Onder Yilmaz 1, Mehmet Gurkan Arıkan 1, Ibrahim Halil Sukur 3, Fesih Ok 3, Nebil Akdogan 1, Volkan Izol 1
PMCID: PMC12869913  PMID: 41501759

Abstract

Introduction

Overactive bladder (OAB) is a common urological condition affecting millions of people worldwide, significantly reducing their quality of life. Patient compliance and active participation in disease management are critical to achieving successful outcomes. This study aims to understand the potential role of AI-assisted chatbots in educating patients with OAB and their impact on health literacy.

Methods

We compared responses from four AI chatbots (ChatGPT, DeepSeek, Claude, and Gemini) to 16 standardized questions from the AUA Overactive Bladder Patient Guide. Two board-certified urologists independently evaluated the responses using the Ensuring Quality Information for Patients (EQIP) tool and the Google E-E-A-T principles.

Results

Inter-rater reliability was excellent (ICC = 0.97 for EQIP, κ = 0.89 for E-E-A-T). A significant difference was found between chatbots in readability scores (Gunning Fog Index, p = 0.008; Flesch-Kincaid Grade Level, p < 0.001), with all responses requiring education levels above the recommended 6th-8th grade. Significant differences also emerged in information quality (EQIP, p < 0.001; E-E-A-T, p < 0.001). Gemini demonstrated superior performance in both EQIP (60.2 ± 6.92) and E-E-A-T scores (13.5) compared to all other chatbots.

Conclusion

AI chatbots show potential for patient education but produce content with readability levels too complex for general audiences. Significant quality variations exist between models. These findings emphasize the need for collaboration between healthcare professionals and AI developers to create more accessible, reliable health information systems.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12894-025-02033-w.

Keywords: AI, Large language model, Chatbot, Overactive bladder, Patient information

Introduction

Overactive bladder (OAB) is a common urological condition affecting millions of people worldwide, significantly reducing their quality of life [1]. Symptoms include urgency, frequency, and nocturia; in some cases, urgency incontinence may also occur [2]. Management of OAB encompasses a wide range of options, from lifestyle modifications to medical and surgical treatments [3]. Patient compliance and active participation in disease management are critical to achieving successful outcomes. In this context, accurate and comprehensible patient information materials (PIMs) are essential.

However, the complexity of health-related information and the abundance of medical jargon can make PIMs difficult for patients to understand [4]. Previous studies indicate that much online health information exceeds the reading level of the general adult population [5, 6]. Analyses of the readability of PIMs, particularly on urological topics, have revealed that these materials often require a high level of education, posing a barrier for individuals with low health literacy [7]. This can prevent patients from making informed decisions about their health and negatively impact treatment outcomes.

In recent years, artificial intelligence (AI)-powered chatbots have emerged as a novel channel for accessing health information [8]. These chatbots, which are based on large language models (LLMs) such as Claude, ChatGPT, Gemini, and DeepSeek, have the potential to deliver information quickly and conveniently. However, concerns remain regarding the accuracy, reliability, and quality of AI-generated health information [9, 10]. Inaccurate or misleading information, particularly on sensitive medical topics, can have serious consequences. Therefore, health information provided by chatbots, like that from traditional PIMs, must be rigorously evaluated for readability and quality.

This study compared the readability and quality scores of responses provided by four different chatbots (ChatGPT, DeepSeek, Claude, and Gemini) to 16 patient questions from the Overactive Bladder Patient Guide PDF, available through the AUA Free Patient Education Materials. Readability was analyzed with the Gunning Fog Index (GFI) [11] and the Flesch-Kincaid Grade Level (FKGL) [12]. Information quality was assessed using the Ensuring Quality Information for Patients (EQIP) tool [13] and the Google Experience, Expertise, Authoritativeness, and Trustworthiness (E-E-A-T) guidelines [14]. This study represents a significant step toward understanding the potential role of AI-enabled chatbots in patient education and their impact on health literacy. The findings will offer valuable insights for the development of future AI-based healthcare applications and the enhancement of patient education materials.

Materials and methods

Study design

During the data collection process, the questions posed to the AI chatbots were sourced from the official patient information guide prepared by the American Urological Association (AUA) to provide a scientific foundation and ensure standardization. All 16 questions from the guide were included in the study, and the quality of the chatbots' responses to these specific, standardized clinical questions was evaluated. Responses generated by four different AI chatbots (ChatGPT, DeepSeek, Claude, and Gemini) were analyzed using established readability formulas and information quality assessment tools. These chatbots were selected based on their user base, widespread use in research, and technological currency. Free versions of the chatbots were utilized. All prompts were submitted on August 8–12, 2025, via the models' web interfaces: ChatGPT-5 (OpenAI), DeepSeek (DeepSeek AI), Claude Sonnet 4 (Anthropic), and Gemini 2.5 Flash (Google). For each model, default settings in English were used; a fresh session with cleared history was initiated, and any custom instructions were disabled. The chatbots employed in this study represent current and widely used LLMs.

The study was based on the PDF document titled “Overactive Bladder Patient Guide,” a patient information resource published by the AUA Urology Care Foundation and available online [15] (Supplementary Material 1). To access this PDF, visit https://www.urologyhealth.org/educational-resources and apply the following filters: Language: English; Product Format: Download; Topic: Bladder Control; and Product Type: Brochure. A total of 16 patient questions were selected from two sections on page 15 of the guide: the four-item “Questions to Ask Your Health Care Team” and the twelve-item “Questions to Ask About Treatment” (Supplementary Material 1). These questions address OAB symptoms, their impact on quality of life, and patients’ general health concerns.

Data collection

Each of the 16 patient questions identified was posed individually to the four different AI chatbots mentioned above. The questions were presented to each chatbot in natural language format, without any additional context or prompting. To prevent potential bias, new user accounts were created for each chatbot, and previous search histories were cleared. The questions were presented to each chatbot in the same fixed order to avoid any order-related bias. The chatbot responses were recorded in text format for subsequent analysis. Only the initial response from each chatbot to each question was used as the basis for evaluation. No manual editing was applied to the length, format, or content of the responses.
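As a bookkeeping sketch, this protocol reduces to posing 16 questions in a fixed order to each of four chatbots and storing only the first, unedited reply. The file name, helper function, and placeholder question below are illustrative assumptions; the actual responses were collected manually from the web interfaces:

```python
# A minimal sketch of the collection protocol: 16 questions, 4 chatbots,
# fixed order, first responses recorded as text. `responses` is assembled
# by hand from the web interfaces; the structure here is an assumption.
import csv

CHATBOTS = ["ChatGPT-5", "DeepSeek", "Claude Sonnet 4", "Gemini 2.5 Flash"]
QUESTIONS = [
    "Question 1 from the AUA guide (placeholder)",
    # ... the remaining 15 questions from Supplementary Material 1
]

def save_first_responses(responses: dict[tuple[str, str], str],
                         path: str = "oab_first_responses.csv") -> None:
    """responses maps (chatbot, question) -> the first, unedited reply."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["chatbot", "question", "response"])
        for bot in CHATBOTS:
            for q in QUESTIONS:  # same fixed order for every chatbot
                writer.writerow([bot, q, responses[(bot, q)]])
```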

Assessment tools and methodology

Readability analysis

Readability measures how easily a text can be understood. In this study, two readability formulas were employed:

Gunning Fog Index (GFI)

GFI is a readability formula designed to estimate the number of years of formal education required to understand a text on first reading. It assesses the complexity of English texts by considering sentence length and the proportion of complex words (words of three or more syllables). The index is widely used in education, healthcare, and translation studies to assess the accessibility of written materials. A higher GFI score indicates a less readable text, meaning that more years of education are required to understand it [11]. The GFI is calculated using the following formula:

$$\text{GFI} = 0.4 \times \left[\frac{\text{total words}}{\text{total sentences}} + 100 \times \frac{\text{complex words}}{\text{total words}}\right]$$

Flesch-Kincaid Grade Level (FKGL)

The FKGL determines the text’s corresponding grade level in U.S. schools. A higher FKGL score indicates a less readable text, suggesting that more years of education are required to understand it. For example, a score of 8.0 indicates that the text is appropriate for eighth-grade students. Health materials for the general public are typically aimed at grades 6–8 [12]. The FKGL is calculated using the following formula:

$$\text{FKGL} = 0.39 \times \frac{\text{total words}}{\text{total sentences}} + 11.8 \times \frac{\text{total syllables}}{\text{total words}} - 15.59$$
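For illustration, both formulas can be implemented directly. The regex tokenizer and the vowel-group syllable heuristic below are simplifying assumptions, not the validated counting rules used by dedicated readability software:

```python
# A sketch of the two readability formulas above. Syllables and "complex
# words" (>= 3 syllables) are approximated with a naive vowel-group count.
import re

def _words(text: str) -> list[str]:
    return re.findall(r"[A-Za-z']+", text)

def _sentence_count(text: str) -> int:
    return max(1, len(re.findall(r"[.!?]+", text)))

def _syllable_count(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    words = _words(text)
    complex_words = sum(1 for w in words if _syllable_count(w) >= 3)
    return 0.4 * (len(words) / _sentence_count(text)
                  + 100 * complex_words / len(words))

def flesch_kincaid_grade(text: str) -> float:
    words = _words(text)
    syllables = sum(_syllable_count(w) for w in words)
    return (0.39 * len(words) / _sentence_count(text)
            + 11.8 * syllables / len(words)
            - 15.59)
```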

GFI and FKGL scores were automatically calculated for each chatbot response using the textstat Python library. All readability calculations were manually reviewed and corrected by an expert linguist in cases of discrepancies.
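Since the study names the textstat Python library, the per-response calculation can be reproduced as follows (the sample text is a placeholder, not an actual chatbot response from the study):

```python
# Reproducing the paper's stated method: per-response GFI and FKGL via
# the textstat library (pip install textstat).
import textstat

response = ("Overactive bladder causes a sudden urge to urinate that "
            "may be hard to control.")

print("GFI: ", textstat.gunning_fog(response))
print("FKGL:", textstat.flesch_kincaid_grade(response))
```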

Information quality analysis

Two different tools were used for information quality assessment:

Ensuring Quality Information for Patients (EQIP) score

EQIP is a 20-item instrument designed to assess the quality of written health information presentation [13]. This tool assesses various aspects of information, including its clarity, accuracy, comprehensiveness, timeliness, and potential to assist patients in decision-making. Each item is evaluated according to a specific scoring system, resulting in a total EQIP score. Higher EQIP scores indicate higher-quality information presentation. The formula is as follows:

$$\text{EQIP (\%)} = \frac{1 \times n_{\text{yes}} + 0.5 \times n_{\text{partly}}}{n_{\text{applicable}}} \times 100$$

where $n_{\text{applicable}}$ is the number of items not rated “N/A” (maximum 20).

The results were classified into four categories:

0–25%: Severe quality problems.

26–50%: Serious quality issues.

51–75%: Good quality with minor issues.

76–100%: Well written.
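A minimal scoring sketch, assuming the standard EQIP convention (“yes” = 1, “partly” = 0.5, “no” = 0, with non-applicable items excluded) together with the four categories above:

```python
# EQIP percentage score and the four-category classification used here.
def eqip_score(item_ratings: list[str]) -> float:
    """item_ratings: 'yes', 'partly', 'no', or 'n/a' per EQIP item (20 total)."""
    credit = {"yes": 1.0, "partly": 0.5, "no": 0.0}
    applicable = [r for r in item_ratings if r != "n/a"]
    return 100 * sum(credit[r] for r in applicable) / len(applicable)

def eqip_category(score: float) -> str:
    if score <= 25:
        return "Severe quality problems"
    if score <= 50:
        return "Serious quality issues"
    if score <= 75:
        return "Good quality with minor issues"
    return "Well written"

ratings = ["yes"] * 9 + ["partly"] * 6 + ["no"] * 4 + ["n/a"]
score = eqip_score(ratings)          # (9 + 3) / 19 * 100 ~= 63.2
print(score, eqip_category(score))   # -> Good quality with minor issues
```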

Google E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) guidelines

The E-E-A-T framework is used by Google to evaluate the quality and reliability of web content in its search engine rankings. Content on sensitive topics such as health and finance (referred to as Your Money or Your Life, YMYL) is expected to meet high E-E-A-T standards [14]. This study assessed the extent to which chatbot responses adhered to the E-E-A-T principles, specifically whether the content appeared to be created by experienced and expert individuals, whether those individuals were competent in their field, and whether the information provided was generally reliable. The assessment aimed to analyze the quality of healthcare information in chatbot responses from the perspective of Google’s search quality guidelines. A rubric was developed for the E-E-A-T evaluation (Table 1), with each guideline scored on a scale from 1 to 5, where 1 represents very poor and 5 represents excellent. The highest possible total score was 20. Higher E-E-A-T scores indicate greater reliability and quality of information.

Table 1. The rubric of Google E-E-A-T

| Criteria | 1 - Very Poor | 2 - Poor | 3 - Fair | 4 - Good | 5 - Excellent |
| --- | --- | --- | --- | --- | --- |
| Experience | No experience sharing, completely theoretical | Minimal sign of experience, theoretical approach | Limited sign of experience, practical info | Professional experience implied, knowledge of clinical processes | Explicit clinical experience sharing, clear personal approach |
| Expertise | Insufficient or incorrect medical knowledge | Limited medical knowledge, superficial | Basic medical information, lacking details | Accurate medical information, sufficient detail | Comprehensive, detailed, and correct medical knowledge |
| Authority | No expert identity, source missing | Limited expert identity | Moderate level of expert identity | Strong expert identity implied | Explicit expert identity, source references |
| Trustworthiness | Problematic reliability, lack of risk warnings | Limited indicators of reliability | Basic reliability, standard warnings | Reliable information, realistic expectations | Patient safety prioritized, transparent risk information |

*E-E-A-T: Google principles defined as Experience (first-hand experience), Expertise (specialized knowledge), Authoritativeness (source reputation), and Trustworthiness (accuracy and honesty).
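Computationally, the rubric reduces to four 1-5 ratings summed into a 4-20 total (maximum 20, as stated above). A minimal sketch, with criterion names taken from Table 1:

```python
# Total E-E-A-T score for one response: sum of four 1-5 rubric ratings.
EEAT_CRITERIA = ("experience", "expertise", "authority", "trustworthiness")

def eeat_total(ratings: dict[str, int]) -> int:
    """ratings maps each criterion in Table 1 to a 1-5 rubric score."""
    assert set(ratings) == set(EEAT_CRITERIA)
    assert all(1 <= v <= 5 for v in ratings.values())
    return sum(ratings.values())

print(eeat_total({"experience": 3, "expertise": 4,
                  "authority": 3, "trustworthiness": 4}))  # -> 14
```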

To assess quality and trustworthiness, each response was independently evaluated by two board-certified urologists (N.A. and I.O.Y.), each with over 10 years of experience, who were blinded to the chatbot source. To ensure inter-rater reliability, a calibration process was conducted beforehand: pilot scoring was performed on ten sample responses, and training continued until consensus was reached. During the formal evaluation, any disagreements between the two primary raters were resolved by a third senior urologist with equivalent credentials (M.D.), who served as an adjudicator to determine the final consensus score.

Ethical considerations

Since this study utilized publicly available data and did not involve human participants, approval from an ethics committee was not required.

Statistical analysis

The resulting GFI, FKGL, EQIP, and Google E-E-A-T scores were used to conduct comparative analyses between chatbots. Data were evaluated across 16 questions and 4 groups. Descriptive statistics for each group are presented as mean ± standard deviation for parametric values and median (interquartile range) for non-parametric values. The Shapiro-Wilk test was used to assess data normality. When normality was met, a one-way ANOVA was performed; if at least one group did not meet the normality assumption, the non-parametric Kruskal-Wallis test was used for group comparisons. If the test result was statistically significant (p < 0.05), post-hoc analyses were conducted to identify differences between groups: the Dunn test was used for non-parametric data and the Tukey test for parametric data. Statistical significance was accepted at p < 0.05.
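The decision logic of this pipeline can be sketched as follows. The analyses were run in GraphPad Prism 10 (see below), so the scipy and scikit-posthocs calls here are an assumed Python equivalent, not the original analysis code:

```python
# Normality check, then ANOVA + Tukey or Kruskal-Wallis + Dunn, per the
# pipeline described above.
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # pip install scikit-posthocs (Dunn test)

ALPHA = 0.05

def compare_chatbots(groups: dict[str, np.ndarray]):
    """groups maps chatbot name -> its 16 per-question scores."""
    samples = list(groups.values())
    # Per-group normality (Shapiro-Wilk)
    normal = all(stats.shapiro(s).pvalue > ALPHA for s in samples)
    if normal:
        p = stats.f_oneway(*samples).pvalue          # one-way ANOVA
        posthoc = stats.tukey_hsd(*samples) if p < ALPHA else None
    else:
        p = stats.kruskal(*samples).pvalue           # Kruskal-Wallis
        posthoc = sp.posthoc_dunn(samples) if p < ALPHA else None
    return p, posthoc
```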

Inter-rater reliability was assessed using ICC(2,1) (two-way random effects, absolute agreement) for EQIP and Cohen’s kappa for E-E-A-T, both with 95% confidence intervals. For the primary analyses, a single consensus score was created by averaging the two raters’ item-level scores; this mean score was then used in subsequent statistical analyses. All analyses were conducted using GraphPad Prism 10.
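These reliability coefficients could be reproduced in Python with pingouin (its “ICC2” row corresponds to the two-way random-effects, absolute-agreement, single-rater model) and scikit-learn; since the paper reports GraphPad Prism for its analyses, the library choice here is an assumption:

```python
# ICC(2,1) for EQIP, Cohen's kappa for E-E-A-T, and the consensus score
# used in the primary analyses (mean of the two raters, as described).
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

def eqip_icc_2_1(long_scores: pd.DataFrame) -> float:
    """long_scores columns: 'response_id', 'rater', 'eqip' (64 x 2 rows)."""
    icc = pg.intraclass_corr(data=long_scores, targets="response_id",
                             raters="rater", ratings="eqip")
    # 'ICC2': two-way random effects, absolute agreement, single rater
    return float(icc.loc[icc["Type"] == "ICC2", "ICC"].iloc[0])

def eeat_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa between the two raters' E-E-A-T scores."""
    return cohen_kappa_score(rater_a, rater_b)

def consensus(rater_a: pd.Series, rater_b: pd.Series) -> pd.Series:
    """Primary-analysis score: the mean of the two raters' item scores."""
    return (rater_a + rater_b) / 2
```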

Results

Inter-rater reliability

Inter-rater reliability was excellent for both assessment tools. For the EQIP evaluations (n = 64 responses: 16 questions × 4 chatbots), the intraclass correlation coefficient was ICC(2,1) = 0.97 (95% CI: 0.96–0.98), indicating near-perfect agreement. For the E-E-A-T assessments of the same 64 responses, Cohen’s kappa coefficient was κ = 0.89 (95% CI: 0.79–0.96), which, according to Landis and Koch’s criteria [16], represents “almost perfect agreement” between the two reviewers. These high reliability coefficients demonstrate consistent and reproducible scoring across both evaluation instruments.

The full responses generated by ChatGPT, DeepSeek, Claude, and Gemini are provided in Supplementary Materials 2, 3, 4, and 5, respectively.

Readability scores

A statistically significant difference was found between the chatbots in terms of GFI (p = 0.008). Among the GFI scores, Claude had the highest score (indicating the lowest readability) at 15.8 ± 4.24 (95% CI: 13.54–18.08), while DeepSeek had the lowest score (indicating the highest readability) at 11.13 ± 3.73 (95% CI: 9.15–13.11) (Table 2). These values indicate that the texts are generally difficult to read and require a higher level of education. The distribution of GFI scores is illustrated in Fig. 1A.

Table 2. Comparison of scores

| Score | ChatGPT-5 | DeepSeek | Claude Sonnet 4 | Gemini 2.5 Flash | p-value |
| --- | --- | --- | --- | --- | --- |
| GFI | 12.19 ± 3.76 (10.19–14.2) | 11.13 ± 3.73 (9.15–13.11) | 15.8 ± 4.24 (13.54–18.06) | 13.57 ± 3.87 (11.51–15.63) | 0.008 ^a |
| FKGL | 11.75 ± 3.29 (9.99–13.51) | 9.33 ± 3.69 (9.33–11.27) | 15.71 ± 5.12 (12.98–18.55) | 12.37 ± 4.52 (9.96–14.78) | < 0.001 ^a |
| EQIP | 50.35 ± 10.5 (44.7–55.99) | 43.69 ± 6.83 (40.05–47.34) | 41.43 ± 4.93 (38.8–44.06) | 60.2 ± 6.92 (56.61–63.88) | < 0.001 ^a |
| E-E-A-T | 10.5 (2.5) | 11 (0.75) | 10 (2.75) | 13.5 (3.75) | < 0.001 ^b |

Values are expressed as mean ± standard deviation (95% CI) or median (interquartile range).

GFI: Gunning Fog Index; FKGL: Flesch-Kincaid Grade Level; EQIP: Ensuring Quality Information for Patients; E-E-A-T: Google Experience, Expertise, Authoritativeness, Trustworthiness principles; CI: confidence interval.

^a One-way ANOVA. ^b Kruskal-Wallis test.

Fig. 1. A: Comparison of GFI scores. B: Comparison of FKGL scores. C: Comparison of EQIP scores. D: Comparison of Google E-E-A-T scores.

A statistically significant difference was also observed among the chatbots in terms of the Flesch-Kincaid Grade Level (FKGL) scores (p < 0.001). According to these scores, Claude required the highest reading grade level, with a mean score of 15.71 ± 5.12 (95% CI: 12.98–18.55), while DeepSeek required the lowest grade level, with a mean score of 9.33 ± 3.69 (95% CI: 9.33–11.27) (Table 2). The distribution of FKGL scores is illustrated in Fig. 1B. These results indicate that the readability of the health information generated by the chatbots is well above the recommended 6th to 8th grade level for the general public.

Information quality scores

Statistically significant differences were observed among the chatbots in terms of EQIP scores (p < 0.001). Gemini demonstrated the highest quality, with a mean score of 60.2 ± 6.92 (95% CI: 56.61–63.88), while Claude had the lowest, with a mean score of 41.43 ± 4.93 (95% CI: 38.8–44.06) (Table 2). The distribution of EQIP scores is illustrated in Fig. 1C.

A statistically significant difference was also observed between the chatbots in terms of E-E-A-T scores (p < 0.001). Gemini achieved the highest median score of 13.5 (3.75), while Claude had the lowest median score of 10 (2.75) (Table 2). The distribution of E-E-A-T scores is illustrated in Fig. 1D.

Among the chatbots, only Claude emphasized that its responses were informational and recommended consulting a healthcare provider. None of the chatbots provided references to sources. Overall, DeepSeek stood out among the chatbots in terms of readability, while Gemini outperformed the others in information quality according to the EQIP and E-E-A-T criteria. These findings suggest that there are quality differences in the delivery of health information by AI-powered chatbots and highlight the need for improvements, particularly in reliability and competence.

Discussion

This study compared the readability and information quality of responses provided by four different AI chatbots (ChatGPT, DeepSeek, Claude, and Gemini) to patient questions about overactive bladder. Our findings revealed several key insights. Readability varied across the chatbots, but all reading levels exceeded the targeted level, and the quality of the information provided was inconsistent. Notably, Gemini demonstrated superior performance, particularly on the EQIP and E-E-A-T metrics. This suggests that, although advancements in linguistic fluency have led to a degree of parity among AI models, the competitive advantage is increasingly shifting toward the quality, credibility, and perceived expertise of the content produced.

A statistically significant difference in readability was found between chatbots based on the GFI and FKGL analyses. However, scores for all chatbot-generated texts were significantly higher than the recommended 6th to 8th grade reading level for the general public, indicating that AI-generated health information may still be difficult for individuals with low health literacy to understand. This finding is consistent with a previous study by Skierkowski et al. [4], which showed that much online health information exceeds the reading level of the general adult population; analyses of urological patient materials have similarly revealed high required education levels [17]. Although some research suggests AI can improve readability through prompt engineering [18], our results highlight the need for developers to prioritize simplifying complex medical terminology and adapting language for broader accessibility. This is crucial for democratizing healthcare access, especially for patients with low literacy, enabling them to make informed decisions and engage more actively in treatment processes [19].

In terms of information quality, statistically significant differences were observed between chatbots in the Ensuring Quality Information for Patients (EQIP) and Google E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) scores. Notably, Gemini outperformed other chatbots in both metrics, which aligns with Google’s integration of its own E-E-A-T principles into the model’s development. This result underscores variations in AI-generated health information quality, with some models providing more reliable and authoritative content than others, consistent with concerns in prior literature about accuracy and reliability [9, 10]. Future research should explore underlying reasons for these differences and strategies to enhance all models, such as improving source transparency and verification systems [20]. Our findings align with and extend prior studies on AI-generated health content (e.g., Hershenhouse et al. [18] on readability in urology), demonstrating that while we are not the first to assess chatbot quality, this work uniquely applies EQIP and E-E-A-T metrics to OAB patient questions.

A notable finding regarding safety is that only Claude consistently included a “consult a healthcare professional” warning in its outputs. The absence of such protective disclaimers in AI chatbot content can encourage overreliance on unverified recommendations, potentially causing delays in diagnosis and treatment. Therefore, we recommend that developers implement a minimum “safety package” comprising:

  1. A clear statement that outputs do not replace professional medical diagnosis or treatment.

  2. Guidance to consult a licensed healthcare professional for personalized decisions.

  3. Transparent display of sources and timestamps.

  4. Brief triage guidance with prompt referral for red-flag symptoms.

  5. A clear statement acknowledging uncertainty regarding the evidence.

Standardizing these elements can mitigate potential harm by aligning user expectations across models and may also help limit the clinical impact of quality differences identified by EQIP and E-E-A-T frameworks.
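As a minimal illustration, such a safety package could be implemented as a post-processing wrapper around any chatbot output. The disclaimer wording and the red-flag keyword list below are illustrative assumptions, not validated clinical content:

```python
# A sketch of the proposed five-element safety package as a wrapper.
RED_FLAGS = ("blood in urine", "fever", "severe pain", "sudden worsening")

SAFETY_FOOTER = (
    "\n---\n"
    "This information is educational and does not replace professional "
    "medical diagnosis or treatment (element 1). For personalized "
    "decisions, consult a licensed healthcare professional (element 2). "
    "Sources and a timestamp should be displayed alongside this answer "
    "(element 3). The evidence behind some statements may be uncertain "
    "(element 5)."
)

def with_safety_package(response: str, user_query: str) -> str:
    """Append standard disclaimers; add triage guidance (element 4)."""
    out = response + SAFETY_FOOTER
    if any(flag in user_query.lower() for flag in RED_FLAGS):
        out += ("\nYour question mentions a possible red-flag symptom; "
                "please seek medical care promptly.")
    return out
```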

This study has significant implications for clinical practice, particularly in OAB management. AI-powered chatbots can enhance health literacy by providing fast, accessible information on OAB symptoms (e.g., urgency and nocturia), but their complex readability levels necessitate clinician oversight to supplement with personalized explanations. In OAB treatment, these tools could be integrated to improve patient adherence—for instance, by generating tailored reminders for lifestyle modifications (e.g., fluid intake management or bladder training exercises) or explaining medication side effects (e.g., antimuscarinics) in simpler terms, aligning with AUA guidelines. Clinically, this might reduce non-compliance rates, which are high in OAB due to symptom frustration, and facilitate early intervention for complications like incontinence. For example, chatbots could triage “red-flag” symptoms (e.g., sudden worsening urgency) and prompt timely consultations, potentially improving quality-of-life outcomes, as supported by systematic reviews on AI in patient education [21]. Healthcare professionals should collaborate with developers to customize these models for OAB-specific scenarios, such as integrating them into telehealth platforms for ongoing symptom tracking. Such applications could democratize OAB care, especially in underserved populations, but require validation through patient outcome studies.

It is also important to consider the methodology of our study. We deliberately employed a straightforward query approach to simulate typical user interactions, without utilizing advanced prompt engineering techniques. However, the quality of LLM outputs can be significantly influenced by the structure of the prompt. Future research should investigate how different prompting strategies—such as instructing the model to assume the persona of a urologist or tailoring its language for patients with low health literacy—can enhance the quality and usefulness of responses. Identifying optimal prompt strategies for healthcare-related queries is a crucial next step to maximize the potential of these tools for both patients and clinicians.

Limitations

This study includes only four different chatbot models (using free versions) and is limited to a specific health topic, with a sample size of 16 questions based on the AUA guideline. To address these limitations, future research should adopt a longitudinal design to monitor evolving model performance, incorporate a broader range of chatbots and questions, and evaluate responses across multiple languages to assess global applicability. Additionally, critical issues such as the ethical implications of chatbots, data privacy, and the dissemination of misinformation warrant further investigation. A key limitation inherent in studies of this nature is the rapidly evolving landscape of large language models. The models assessed in this research are not static; therefore, this study should be viewed as a snapshot of the models’ capabilities as of August 2025. The performance hierarchy and specific characteristics of ChatGPT, Claude, Gemini, and DeepSeek may have changed by the time of publication. Finally, we did not audit the models’ training data or tuning objectives; thus, any suggested links between specific companies’ quality frameworks and model performance should be interpreted as plausible hypotheses rather than demonstrated causal relationships.

Conclusion

AI-powered chatbots hold significant potential for enhancing patient information processes. However, their readability levels are often too complex for the general public, and both readability and information quality vary across different models. This indicates a need for improvements to facilitate broader adoption of these technologies in healthcare. Among AI models, DeepSeek excels in readability, whereas Gemini distinguishes itself by delivering higher information quality. These findings underscore the importance of collaboration between healthcare professionals and AI developers to deliver patient-centered, clear, and reliable health information.

Supplementary Information

Supplementary Material 2. (22.3KB, docx)
Supplementary Material 3. (20.4KB, docx)
Supplementary Material 4. (17.4KB, docx)
Supplementary Material 5. (21.2KB, docx)

Authors’ contributions

TA: Formal Analysis, Writing—Original Draft, Data Collection, Validation, Supervision, Resources. IOY: Visualization, Data Collection, Data Curation, Writing—Original Draft, Formal Analysis, Project Administration, Funding Acquisition. MGA: Formal Analysis, Writing—Original Draft, Writing—Review and Editing, Software, Validation, Supervision, Resources. IHS: Data Collection, Data Curation, Resources, Writing—Review and Editing. FO: Formal Analysis, Writing—Original Draft, Validation. VI: Formal Analysis, Writing—Original Draft, Validation, Supervision, Visualization.

Funding

Not applicable.

Data availability

The data supporting this study’s findings are available on request from the corresponding author.

Declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Irwin DE, Milsom I, Hunskaar S, et al. Population-based survey of urinary incontinence, overactive bladder, and other lower urinary tract symptoms in five countries: results of the EPIC study. Eur Urol. 2006;50(6):1306–14; discussion 1314–5.
2. Haylen BT, de Ridder D, Freeman RM, et al. An International Urogynecological Association (IUGA)/International Continence Society (ICS) joint report on the terminology for female pelvic floor dysfunction. Int Urogynecol J. 2010;21(1):5–26.
3. Harding CK, Lapitan MC, Arlandis S, et al. EAU guidelines on non-neurogenic female LUTS. Available at: https://uroweb.org/guidelines/non-neurogenic-female-luts.
4. Skierkowski DD, Florin P, Harlow LL, Machan J, Ye Y. A readability analysis of online mental health resources. Am Psychol. 2019;74(4):474–83.
5. Koo K, Shee K, Yap RL. Readability analysis of online health information about overactive bladder. Neurourol Urodyn. 2017;36(7):1782–7.
6. Clancy AA, Hickling D, Didomizio L, et al. Patient-targeted websites on overactive bladder: what are our patients reading? Neurourol Urodyn. 2018;37(2):832–41.
7. Du C, Lee W, Lucioni A, Kobashi K, Lee U. MP02-06 Readability of patient education materials on pelvic organ prolapse, overactive bladder, and stress urinary incontinence. J Urol. 2019;201:e12.
8. Büker M, Mercan G. Readability, accuracy, appropriateness, and quality of AI chatbot responses as a patient information source on root canal retreatment: a comparative assessment. Int J Med Inform. 2025;201:105948.
9. Kayra MV, Anil H, Ozdogan I, Baradia SMA, Toksoz S. Evaluating AI chatbots in penis enhancement information: a comparative analysis of readability, reliability and quality. Int J Impot Res. 2025;37(7):558–63.
10. Sorich MJ, Menz BD, Hopkins AM. Quality and safety of artificial intelligence generated health information. BMJ. 2024;384:q596.
11. Świeczkowski D, Kułacz S. The use of the Gunning Fog Index to evaluate the readability of Polish and English drug leaflets in the context of health literacy challenges in medical linguistics: an exploratory study. Cardiol J. 2021;28(4):627–31.
12. Brewer J. Measuring text readability using reading level. In: Khosrow-Pour M, editor. Advanced methodologies and technologies in modern education delivery. Hershey: IGI Global; 2019. p. 93–103.
13. Moult B, Franck LS, Brady H. Ensuring quality information for patients: development and preliminary validation of a new instrument to improve the quality of written health care information. Health Expect. 2004;7(2):165–75.
14. Google. Search Quality Rater Guidelines: an overview. https://services.google.com/fh/files/misc/hsw-sqrg.pdf.
15. Urology Care Foundation. Bladder Control: Overactive Bladder Patient Guide. Published online 2025. Accessed August 2, 2025. https://www.urologyhealth.org/educational-materials/overactive-bladder.
16. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.
17. Anıl H, Kayra MV. The digital dialogue on premature ejaculation: evaluating the efficacy of artificial intelligence-driven responses. Int Urol Nephrol. 2025;57(9):2829–36.
18. Hershenhouse JS, Mokhtar D, Eppler MB, et al. Accuracy, readability, and understandability of large language models for prostate cancer information to the public. Prostate Cancer Prostatic Dis. 2025;28(2):394–9.
19. Swisher AR, Wu AW, Liu GC, Lee MK, Carle TR, Tang DM. Enhancing health literacy: evaluating the readability of patient handouts revised by ChatGPT’s large language model. Otolaryngol Head Neck Surg. 2024;171(6):1751–7.
20. Yıldız HA, Söğütdelen E. AI chatbots as sources of STD information: a study on reliability and readability. J Med Syst. 2025;49(1):43.
21. Nasra M, Jaffri R, Pavlin-Premrl D, et al. Can artificial intelligence improve patient educational material readability? A systematic review and narrative synthesis. Intern Med J. 2025;55(1):20–34.


