Dear Editor,
The use of artificial intelligence (AI) to generate patient information has become increasingly common in medicine. In radiology, AI has demonstrated potential for improving patient education and care [1]. With the increasing use of AI and large language models (LLMs) in medicine, it is important that the readability of their outputs be evaluated. The Joint Commission states that patient education material should be written at or below a 5th grade reading level; thus, LLM outputs must meet established readability standards to support effective patient education [2]. In this study, we evaluated the readability of information generated by the LLMs ChatGPT, Gemini, and Copilot on chronic diseases such as diabetes, cancer, and heart disease.
We evaluated the readability of AI-generated responses to common patient questions about diabetes, cancer, and heart disease. Using standardized prompts, we collected 12 outputs (4 per disease) from ChatGPT, Gemini, and Copilot. Readability was assessed using three validated formulas: the Flesch-Kincaid Grade Level, the Simple Measure of Gobbledygook (SMOG) Index, and the Gunning Fog Index. A one-way analysis of variance (ANOVA) was performed to test for statistically significant differences in readability among the three LLMs for each readability formula.
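For readers who wish to reproduce this type of analysis, the sketch below shows one way the scoring and ANOVA could be implemented in Python, assuming the open-source `textstat` and SciPy packages. The response texts shown are placeholders, not our actual prompts or model outputs.

```python
# Minimal sketch of the readability analysis, assuming Python with the
# open-source `textstat` and `scipy` packages (pip install textstat scipy).
# The texts below are placeholders, not the study's actual LLM outputs
# (the study collected 12 outputs; only two per model are shown here).
import textstat
from scipy.stats import f_oneway

responses = {
    "ChatGPT": [
        "Diabetes means blood sugar is too high. It is managed with diet "
        "and medicine. Regular checkups help prevent complications.",
        "Heart disease affects blood flow to the heart. Exercise and a "
        "healthy diet lower risk. Doctors may also prescribe medication.",
    ],
    "Copilot": [
        "Cancer happens when cells grow out of control. Treatment depends "
        "on the type and stage. Early detection improves outcomes.",
        "High blood pressure strains the heart over time. Lifestyle "
        "changes can reduce it. Medication is sometimes required.",
    ],
    "Gemini": [
        "Type 2 diabetes develops gradually in many adults. Weight "
        "management is a key part of care. Monitoring glucose is important.",
        "Coronary artery disease narrows the heart's blood vessels. "
        "Symptoms may include chest pain. Treatment can involve surgery.",
    ],
}

# The three validated formulas, as implemented by textstat.
metrics = {
    "Flesch-Kincaid": textstat.flesch_kincaid_grade,
    "SMOG": textstat.smog_index,
    "Gunning Fog": textstat.gunning_fog,
}

for name, formula in metrics.items():
    # Score every output from every model with the current formula.
    scores = {model: [formula(text) for text in texts]
              for model, texts in responses.items()}
    # One-way ANOVA across the three models for this metric.
    f_stat, p_value = f_oneway(*scores.values())
    print(f"{name}: F = {f_stat:.2f}, p = {p_value:.3f}")
```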
Readability scores across the three LLMs exceeded recommended readability thresholds (Table 1). For the Flesch-Kincaid Grade Level, Gemini had the highest average score at 12.13, followed by Copilot at 11.87 and ChatGPT at 11.40. On the SMOG Index, Gemini again produced the highest score at 11.03, followed by Copilot at 10.75 and ChatGPT at 10.53. The Gunning Fog Index showed the same pattern: Gemini led at 12.97, followed by Copilot at 12.80, with ChatGPT lowest at 12.10. ANOVA results showed no statistically significant differences in readability among the LLMs for any of the three formulas, with p values of 0.604 for Flesch–Kincaid, 0.704 for SMOG, and 0.333 for Gunning Fog.
Table 1.
Mean ± SD and p values for Flesch–Kincaid, SMOG, and Gunning Fog readability metrics by AI platform.
| Metric | ChatGPT | Copilot | Gemini | p value |
|---|---|---|---|---|
| Flesch–Kincaid | 11.40 ± 1.56 | 11.87 ± 1.43 | 12.13 ± 2.27 | 0.604 |
| SMOG Index | 10.53 ± 1.14 | 10.75 ± 1.24 | 11.03 ± 1.91 | 0.704 |
| Gunning Fog Index | 12.10 ± 0.91 | 12.80 ± 1.61 | 12.97 ± 1.81 | 0.333 |
Each readability formula converts surface features of the text, such as average sentence length and syllables per word, into an approximate US school grade level, and every LLM in our sample produced an average reading level above the 10th grade. Information generated by AI does not account for patient literacy levels, which can affect the care that patients receive. Systematic reviews have shown that patient education materials often fail to meet national readability recommendations, and the growing use of AI without attention to literacy levels can further perpetuate these challenges [3]. These results underscore the importance of selecting AI tools that meet the suggested 5th–6th-grade readability levels for health education materials.
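For reference, these formulas are published linear functions of simple text statistics. The Flesch-Kincaid Grade Level, for example, combines average sentence length and average syllables per word, which is why its score reads directly as a US school grade:

$$\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$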
While our study focused on readability metrics derived from validated formulas (Flesch-Kincaid, SMOG, and Gunning Fog), we acknowledge that these tools do not fully capture patient comprehension. Readability scores estimate the expected reading level but do not assess whether patients can understand the materials presented to them and make informed decisions based on them. In future work, we aim to incorporate user comprehension testing to evaluate how patients perceive AI-generated content. A small pilot evaluation in which nonexpert participants rate clarity, perceived usefulness, and confidence in decision-making could offer insight into the practical utility of AI in patient education. Additionally, we plan to apply the validated Global Quality Score (GQS), a 5-point Likert scale, to assess the quality and reliability of the AI-generated information using physician graders. Finally, we intend to prompt the AI tools to generate information at a specific reading level and then assess the adherence of the outputs to that target.
Author Contributions
Faheed Shafau conceived the study idea, collected the data, and drafted the manuscript. Chase Wahl collected the data and performed the data analysis.
Ethics Statement
The authors have nothing to report.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This study involved the use of large language models (LLMs), specifically ChatGPT, Gemini, and Copilot, to generate patient education content on chronic diseases. These tools were used for the generation of data in the context of a readability analysis. No AI tools were used to write this manuscript.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
- 1. Rajpurkar P. and Lungren M. P., "The Current and Future State of AI Interpretation of Medical Images," New England Journal of Medicine 388, no. 21 (2023): 1981–1990, 10.1056/nejmra2301725.
- 2. Stossel L. M., Segar N., Gliatto P., Fallar R., and Karani R., "Readability of Patient Education Materials Available at the Point of Care," Journal of General Internal Medicine 27, no. 9 (2012): 1165–1170, 10.1007/s11606-012-2046-0.
- 3. Okuhara T., Furukawa E., Okada H., Yokota R., and Kiuchi T., "Readability of Written Information for Patients Across 30 Years: A Systematic Review of Systematic Reviews," Patient Education and Counseling 135 (2025): 108656, 10.1016/j.pec.2025.108656.
