Dear Editor,
The use of artificial intelligence (AI) to generate patient information has become increasingly common in medicine. In radiology, AI has demonstrated potential for improving patient education and care [1]. With the increasing use of AI and large language models (LLMs) in medicine, it is important that the readability of their outputs be evaluated. The Joint Commission states that patient education material should be written at or below a 5th grade reading level; thus, LLM outputs must meet established readability standards to support effective patient education [2]. In this study, we evaluated the readability of information generated by the LLMs ChatGPT, Gemini, and Copilot on chronic diseases such as diabetes, cancer, and heart disease.
We evaluated the readability of AI-generated responses to common patient questions about diabetes, cancer, and heart disease. Using standardized prompts, we collected 12 outputs (4 per disease) from ChatGPT, Gemini, and Copilot. Readability was assessed using three validated formulas: the Flesch-Kincaid Grade Level, the Simple Measure of Gobbledygook (SMOG) Index, and the Gunning Fog Index. A one-way analysis of variance (ANOVA) was performed to test for statistically significant differences in readability among the three LLMs for each readability formula.
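For readers who wish to reproduce this type of analysis, the sketch below shows one way the scoring and ANOVA could be implemented in Python, assuming the open-source `textstat` and SciPy packages. The response texts shown are placeholders, not our actual prompts or model outputs.

```python
# Minimal sketch of the readability analysis, assuming Python with the
# open-source `textstat` and `scipy` packages (pip install textstat scipy).
# The texts below are placeholders, not the study's actual LLM outputs
# (the study collected 12 outputs; only two per model are shown here).
import textstat
from scipy.stats import f_oneway

responses = {
    "ChatGPT": [
        "Diabetes means blood sugar is too high. It is managed with diet "
        "and medicine. Regular checkups help prevent complications.",
        "Heart disease affects blood flow to the heart. Exercise and a "
        "healthy diet lower risk. Doctors may also prescribe medication.",
    ],
    "Copilot": [
        "Cancer happens when cells grow out of control. Treatment depends "
        "on the type and stage. Early detection improves outcomes.",
        "High blood pressure strains the heart over time. Lifestyle "
        "changes can reduce it. Medication is sometimes required.",
    ],
    "Gemini": [
        "Type 2 diabetes develops gradually in many adults. Weight "
        "management is a key part of care. Monitoring glucose is important.",
        "Coronary artery disease narrows the heart's blood vessels. "
        "Symptoms may include chest pain. Treatment can involve surgery.",
    ],
}

# The three validated formulas, as implemented by textstat.
metrics = {
    "Flesch-Kincaid": textstat.flesch_kincaid_grade,
    "SMOG": textstat.smog_index,
    "Gunning Fog": textstat.gunning_fog,
}

for name, formula in metrics.items():
    # Score every output from every model with the current formula.
    scores = {model: [formula(text) for text in texts]
              for model, texts in responses.items()}
    # One-way ANOVA across the three models for this metric.
    f_stat, p_value = f_oneway(*scores.values())
    print(f"{name}: F = {f_stat:.2f}, p = {p_value:.3f}")
```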
Readability scores across the three LLMs exceeded recommended readability thresholds (Table 1). For the Flesch-Kincaid Grade Level, Gemini had the highest average score at 12.13, followed by Copilot at 11.87 and ChatGPT at 11.40. On the SMOG Index, Gemini again produced the highest score at 11.03, followed by Copilot at 10.75 and ChatGPT at 10.53. The Gunning Fog Index showed the same pattern: Gemini led at 12.97, followed by Copilot at 12.80, with ChatGPT lowest at 12.10. ANOVA results showed no statistically significant differences in readability among the LLMs for any of the three formulas, with p values of 0.604 for Flesch–Kincaid, 0.704 for SMOG, and 0.333 for Gunning Fog.
Table 1.
Mean ± SD and p values for Flesch–Kincaid, SMOG, and Gunning Fog readability metrics by AI platform.
| Metric | ChatGPT | Copilot | Gemini | p value |
|---|---|---|---|---|
| Flesch–Kincaid | 11.40 ± 1.56 | 11.87 ± 1.43 | 12.13 ± 2.27 | 0.604 |
| SMOG Index | 10.53 ± 1.14 | 10.75 ± 1.24 | 11.03 ± 1.91 | 0.704 |
| Gunning Fog Index | 12.10 ± 0.91 | 12.80 ± 1.61 | 12.97 ± 1.81 | 0.333 |
Each readability formula converts surface features of the text, such as average sentence length and syllables per word, into an approximate US school grade level, and every LLM in our sample produced an average reading level above the 10th grade. Information generated by AI does not account for patient literacy levels, which can affect the care that patients receive. Systematic reviews have shown that patient education materials often fail to meet national readability recommendations, and the growing use of AI without attention to literacy levels can further perpetuate these challenges [3]. These results underscore the importance of selecting AI tools that meet the suggested 5th–6th-grade readability levels for health education materials.
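For reference, these formulas are published linear functions of simple text statistics. The Flesch-Kincaid Grade Level, for example, combines average sentence length and average syllables per word, which is why its score reads directly as a US school grade:

$$\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$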
While our study focused on readability metrics derived from validated formulas (Flesch-Kincaid, SMOG, and Gunning Fog), we acknowledge that these tools do not fully capture patient comprehension. Readability scores estimate the expected reading level but do not assess whether patients can understand the materials presented to them and make informed decisions based on them. In future work, we aim to incorporate user comprehension testing to evaluate how patients perceive AI-generated content. A small pilot evaluation in which nonexpert participants rate clarity, perceived usefulness, and confidence in decision-making could offer insight into the practical utility of AI in patient education. Additionally, we plan to apply the validated Global Quality Score (GQS), a 5-point Likert scale, to assess the quality and reliability of the AI-generated information using physician graders. Finally, we intend to prompt the AI tools to generate information at a specific reading level and then assess the adherence of the outputs to that target.
Author Contributions
Faheed Shafau conceived the study idea, collected the data, and drafted the manuscript. Chase Wahl collected the data and performed the data analysis.
Ethics Statement
The authors have nothing to report.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This study involved the use of large language models (LLMs), specifically ChatGPT, Gemini, and Copilot, to generate patient education content on chronic diseases. These tools were used for the generation of data in the context of a readability analysis. No AI tools were used to write this manuscript.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
- 1. Rajpurkar P. and Lungren M. P., "The Current and Future State of AI Interpretation of Medical Images," New England Journal of Medicine 388, no. 21 (2023): 1981–1990, 10.1056/nejmra2301725.
- 2. Stossel L. M., Segar N., Gliatto P., Fallar R., and Karani R., "Readability of Patient Education Materials Available at the Point of Care," Journal of General Internal Medicine 27, no. 9 (2012): 1165–1170, 10.1007/s11606-012-2046-0.
- 3. Okuhara T., Furukawa E., Okada H., Yokota R., and Kiuchi T., "Readability of Written Information for Patients Across 30 Years: A Systematic Review of Systematic Reviews," Patient Education and Counseling 135 (2025): 108656, 10.1016/j.pec.2025.108656.
