Chronic Diseases and Translational Medicine
Letter. 2025 Dec 24;12(1):73–74. doi: 10.1002/cdt3.70031

Evaluating Large Language Model Adherence to Targeted Fifth‐Grade Readability Standards in Patient Education on Chronic Conditions

Faheed Shafau 1, Chase Wahl 1, Marcus Kado 1, Garrett Miedema 1
PMCID: PMC13097565  PMID: 42022584

Summary

  • Large Language Models (LLMs) frequently generate patient education materials (PEMs) that exceed recommended reading levels.

  • When prompted to produce PEMs at a fifth‐grade level, the LLMs differed significantly in readability: ChatGPT and Copilot approached the target, while Gemini exceeded it.

  • These findings suggest that simple prompt engineering can improve clarity and accessibility of LLM‐generated PEMs.


To the Editor,

Artificial intelligence (AI) use is growing rapidly, and many fields, medicine among them, have adopted AI and Large Language Models (LLMs). Patients, too, increasingly turn to AI, often asking it health questions before ever stepping into a physician's office. The open question is whether the information AI produces is reliable, concise, and easy to read across all patient populations. Previous research has evaluated the readability of LLM outputs on diseases such as ankylosing spondylitis and other chronic conditions [1]. However, whether LLMs adhere to a specifically requested reading level remains underexplored. The US Department of Health and Human Services recommends that patient education materials (PEMs) be written at a fourth‐ to sixth‐grade reading level [2]. In our study, we prompted LLMs to answer common patient questions about chronic conditions (cancer, diabetes, and heart disease) at a fifth‐grade reading level.

We prompted each LLM with four questions per condition (cancer, diabetes, and heart disease), instructing it within each prompt to respond at a fifth‐grade reading level. We then analyzed each output (four per disease) with three validated readability indices: the Gunning Fog Index, SMOG Index, and Flesch–Kincaid Grade Level, each of which maps a text to a US grade level. Finally, we conducted a one‐way analysis of variance (ANOVA) on the mean scores for ChatGPT, Copilot, and Gemini to determine whether the reading levels of their outputs differed significantly.

As shown in Table 1, mean Flesch–Kincaid scores were lowest for Copilot (5.27 ± 1.43) and ChatGPT (5.44 ± 1.50). Scores were significantly higher for Gemini (7.79 ± 1.14, p < 0.001). Similarly, Gemini produced higher SMOG (6.64 ± 1.10) and Gunning Fog Index scores (8.07 ± 0.99) compared to ChatGPT and Copilot (p = 0.005 and p = 0.001, respectively). A one‐way ANOVA revealed statistically significant differences in readability scores across all three metrics. Gemini outputs were consistently at a higher reading level than ChatGPT and Copilot.

Table 1.

Mean readability scores by Large Language Model using the Flesch–Kincaid, SMOG, and Gunning Fog indices.

Metric ChatGPT Copilot Gemini p value
Flesch–Kincaid 5.44 ± 1.50 5.27 ± 1.43 7.79 ± 1.14 < 0.001
SMOG Index 5.37 ± 1.04 5.30 ± 0.98 6.64 ± 1.10 0.005
Gunning Fog Index 6.43 ± 1.22 6.42 ± 1.19 8.07 ± 0.99 0.001
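One‐way ANOVA compares the variance between group means to the variance within groups via an F statistic. The following minimal pure‐Python sketch shows the calculation; the numbers in the test are hypothetical and are not the study's per‐output scores:

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups.

    F = MS_between / MS_within, with df = (k - 1, n - k).
    A p value would then come from the F distribution with those df.
    """
    k = len(groups)                       # number of groups (here: 3 LLMs)
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: group sizes times squared mean deviations
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: deviations of each score from its group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

A large F (relative to the F distribution with k − 1 and n − k degrees of freedom) indicates that at least one model's mean readability differs from the others, which is the pattern Table 1 reflects for Gemini.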

Among the LLMs evaluated, Gemini produced responses at a higher reading level than ChatGPT and Copilot, whose mean scores were closer to the fifth‐grade standard set in the prompts. These findings are promising: they show that ChatGPT and Copilot can meet standard reading level guidelines when prompted. In contrast, Gemini's higher reading level may make its health information less accessible to patients with low health literacy.

Our findings focused on the reading levels of LLM‐generated answers to questions patients commonly ask about chronic diseases. They underscore the need for evaluation that extends beyond readability alone; future work should incorporate measures of patient comprehension, cultural relevance, and practical utility. Additionally, we hope to compare outputs generated with a fifth‐grade reading‐level prompt against outputs generated without one, which would show how much a specific readability instruction actually changes LLM responses. We also recognize the need to refine our methods to better capture these dimensions. While informative, these results do not encompass all criteria necessary to determine whether information is suitable for patient use.

Author Contributions

Faheed Shafau: conceptualization, data extraction, data analysis, drafting. Chase Wahl: data extraction, data analysis, drafting. Garrett Miedema: drafting. Marcus Kado: drafting.

Funding

The authors received no specific funding for this work.

Ethics Statement

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Artificial Intelligence Generated Content Declaration

Artificial intelligence was used to generate patient educational materials. These tools were used for the generation of data in the context of calculating their readability levels. No AI was used to write this manuscript.

Data Availability Statement

All data analyzed are included in this published article and its supplementary information.

References

  • 1. Kara M., Ozduran E., Kara M. M., Özbek İ. C., and Hancı V., “Evaluating the Readability, Quality, and Reliability of Responses Generated by ChatGPT, Gemini, and Perplexity on the Most Commonly Asked Questions About Ankylosing Spondylitis,” PLoS One 20, no. 6 (2025): e0326351, 10.1371/journal.pone.0326351. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Pashkova A., Bangalore R., Tan C., et al., “Assessing the Readability of Anesthesia‐Related Patient Education Materials From Major Anesthesiology Organizations,” BioMed Research International 2022, no. 1 (2022): 3284199, 10.1155/2022/3284199. [DOI] [PMC free article] [PubMed] [Google Scholar]


Articles from Chronic Diseases and Translational Medicine are provided here courtesy of Chinese Medical Association
