Abstract
Readability formulas are prominent health communication assessment tools, but they can yield varying estimates. Such variation is often treated as error in computerized tools but can result from text preprocessing decisions in manual and computerized assessments alike. This study illustrates the effect of preprocessing on reading grade level estimates in short-form online content, thereby illustrating the importance of reporting these decisions and the limitations of these formulas.
We manually counted words, sentences, and syllables in a sample of 100 Tweets by U.S. state health agencies from 2012 through 2022. We applied the Simplified Measure of Gobbledygook and Flesch-Kincaid formulas under 7 inclusive preprocessing scenarios, differentially including URLs, hashtags, and/or numbers in word counts. We compared resulting estimates to those from a restrictive baseline that excluded these elements. Wilcoxon signed-rank tests revealed significant differences in median grade level estimates. No significant differences were found in the percentage of Tweets meeting an 8th-grade benchmark. Linear regression showed that baseline estimates did not adequately explain observed variation.
Despite the potential benefit of interpretability, we conclude that readability formulas are unreliable for short-form online content. Instead, we recommend directly using word, sentence, and syllable counts. We also recommend conducting sensitivity analyses for readability assessments.
Introduction
The US Agency for Healthcare Research and Quality lists understandable communication among the markers of health literate organizations (2019). Reading grade level estimates are quantitative metrics of readability that practitioners can use to assess their materials in pursuit of this goal. However, research has shown significant variation across estimator formulas and online calculators (Zhou et al., 2017). In health fields, this variation is typically framed as error. For example, health literacy research has discussed this variation as an issue of accuracy (Mac et al., 2022) and a reason to choose manual calculations over computerized ones (Grabeel et al., 2018).
While such variation might include some errors, variation is to be expected when quantitatively describing the readability of online text. This is due to preprocessing: the preparation of text for quantitative analysis (Nesca et al., 2022). Reading grade level estimates come from formulas using quantitative variables like word, syllable, and sentence counts. These variables may seem straightforward enough to not require much human decision making to calculate them. However, even common text features like addresses and acronyms require human judgement about whether to include them in calculations and how. Online texts include additional text features like hashtags, which might be treated as common words or ignored completely by readers depending on the context. Further, social media messages might use informal grammar and punctuation, complicating the idea of what a sentence is. It is technically possible to conduct audience research describing interactions with all possible text features across all online platforms to make these preprocessing decisions. However, the impracticality of this approach means readability assessments typically depend on preprocessing decisions that reflect assumptions and limited data about reading practices among target audience members.
Understanding the limits of reading grade level estimates is important considering the amount of effort focused on short-form online texts in health communication today. Previous research about readability of short-form social media emphasized the need for authoritative posters to share more readable health content (Hoedebecke et al., 2017; Morse et al., 2024). Building this capacity would align with recommendations from the Centers for Disease Control and Prevention to use Twitter and other social media platforms to reach news media with key messages during a public health emergency (2014). It would also help ensure public audiences are able to engage with large amounts of public health communication: the amount of public health communication from state agencies on Twitter alone more than doubled after the start of the COVID-19 pandemic (Mendez et al., 2025). Beyond the use of prominent social media platforms, understanding the limits of popular readability assessments for short-form texts is also relevant to clinicians’ use of online patient messaging systems and the growing interest in automated chatbots for health communication. However, previous research on preprocessing for readability estimates focused on the impact of sampling procedures within longer texts (Wang et al., 2013), leaving a research gap around the more fine-tuned aspects of preprocessing.
To address this gap, we present a study of the impact of preprocessing on reading grade level estimates for short-form social media posts. Our aim is to elucidate both the importance of reporting these decisions and the limitations of reading grade level estimates for short-form online texts. While preprocessing decisions are coded into readability software, manual calculations rely on the same kinds of decisions. Thus, this study contributes an increased understanding of the need for caution when using readability assessments, regardless of the method used.
Materials and Methods
We analyzed a random sample (n=100) from a complete dataset (n=425,694) of original Tweets by US state public health agencies published between January 1, 2012, and December 31, 2022 (Mendez, 2023). Appendix 1 lists the accounts included in the broader dataset. This study period allowed us to examine Tweets published before and after the start of the COVID-19 pandemic without pandemic Tweets dominating the sample.
We manually counted each Tweet’s words, syllables, sentences, and polysyllabic words (words with more than 2 syllables) under a restrictive baseline preprocessing scenario. To arrive at these restrictive baseline calculations, we completely excluded three text features from our calculations: URLs, hashtags, and numbers. We manually recorded additional counts attributed exclusively to the inclusion of either URLs, hashtags, or numbers. We treated URLs as 3-syllable words. We included hashtags only if they were used in place of plain-text words in a sentence. We included numbers as if they were spelled out. Appendix 2 further details our preprocessing decisions.
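The counting logic described above can be approximated in code. The sketch below is a simplified illustration of the restrictive baseline idea only, not our exact rules (which Appendix 2 details); the regular expressions and the sample Tweet are hypothetical.

```python
import re

def baseline_word_count(text):
    """Count words under a restrictive baseline that excludes URLs,
    hashtags, and numbers. A simplified illustration, not the
    authors' exact preprocessing rules."""
    text = re.sub(r"https?://\S+", " ", text)   # exclude URLs
    text = re.sub(r"#\w+", " ", text)           # exclude hashtags
    text = re.sub(r"\b\d[\d,.]*\b", " ", text)  # exclude numbers
    return len(re.findall(r"[A-Za-z']+", text))

tweet = "Get your flu shot today! 3 easy steps: https://example.org #FluSeason"
print(baseline_word_count(tweet))  # counts only the 7 plain-text words
```

Under more inclusive scenarios, the excluded features would instead be added back into the counts (e.g., a URL as a 3-syllable word, a number as if spelled out).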
In R version 4.2.3 (R Core Team, 2021) we used these counts to calculate reading grade level estimates for each Tweet using two formulas: the Simplified Measure of Gobbledygook (SMOG) (McLaughlin, 1969), adjusted for short texts (Scott, 2024), and the Flesch-Kincaid readability formula (FK) (Kincaid et al., 1975). We applied both formulas to each Tweet 8 times: once under the restrictive baseline scenario and once under each of 7 more inclusive preprocessing scenarios that included different combinations of URLs, hashtags, and/or numbers. Table 1 outlines all preprocessing scenarios we compared in this study.
Table 1:
Descriptions of the different preprocessing scenarios compared in this study, differentially including hashtags, numbers, and URLs
| Preprocessing Scenario | Hashtags Included | Numbers Included | URLs Included |
|---|---|---|---|
| Restrictive Baseline | No | No | No |
| +URLs | No | No | Yes |
| +Nums | No | Yes | No |
| +URLs +Nums | No | Yes | Yes |
| +Tags | Yes | No | No |
| +Tags +URLs | Yes | No | Yes |
| +Tags +Nums | Yes | Yes | No |
| +URLs +Tags +Nums | Yes | Yes | Yes |
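For reference, both formulas reduce to simple arithmetic on the counts. The sketch below uses the published FK coefficients and standard SMOG constants, with the polysyllable count normalized to a 30-sentence basis; that normalization is a common implementation of the short-text adjustment, not necessarily the exact variant used here. The inputs shown are the sample-mean baseline counts from Table 2.

```python
import math

def fk_grade(words, sentences, syllables):
    # Flesch-Kincaid grade level (Kincaid et al., 1975)
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_grade(polysyllables, sentences):
    # SMOG (McLaughlin, 1969); polysyllable count scaled to a
    # 30-sentence basis, a common short-text adjustment
    return 3.1291 + 1.0430 * math.sqrt(polysyllables * (30 / sentences))

# Sample-mean counts under the restrictive baseline (Table 2)
print(round(fk_grade(21.2, 2.0, 34.7), 1))  # -> 7.9
print(round(smog_grade(3.6, 2.0), 1))       # -> 10.8
```

Because each formula collapses several counts into one number, any preprocessing decision that changes a count propagates directly into the estimate.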
We used proportion tests with a threshold of p=0.05 to identify significant differences in the portion of Tweets meeting 8th- and 6th-grade reading level benchmarks due to more inclusive preprocessing.
We also used simple linear regression to model the reading grade level estimates under inclusive preprocessing scenarios as a function of the restrictive baseline estimate alone. We interpreted intercepts and beta coefficients under a threshold of p=0.05 to determine if the baseline restrictive estimate adequately predicted the estimates under more inclusive preprocessing. We interpreted R2 values of at least 0.8 as indicative of models that adequately explained the observed variation in estimates (Gao, 2024). We graphed the residuals to verify whether these linear models were appropriate.
We also used Wilcoxon signed-rank tests with a threshold of p=0.05 to identify significant differences in median reading grade level estimates as a result of more inclusive preprocessing.
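We ran these tests in R; the sketch below, in Python with SciPy and entirely synthetic data, illustrates the paired structure of the comparisons: a Wilcoxon signed-rank test on per-Tweet estimate pairs, and a simple regression of the inclusive-scenario estimate on the baseline estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.normal(11, 3, 100)               # synthetic baseline estimates
inclusive = baseline + rng.normal(1, 0.5, 100)  # synthetic inclusive-scenario shift

# Paired signed-rank test for a shift in the median estimate
w = stats.wilcoxon(inclusive, baseline)

# Simple linear regression: inclusive estimate as a function of the baseline
fit = stats.linregress(baseline, inclusive)

print(w.pvalue < 0.05, round(fit.rvalue ** 2, 2))
```

With a consistent upward shift like the one simulated here, the signed-rank test flags a median difference even though the two sets of estimates remain strongly correlated.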
We did not seek ethical review for this study because the study data comprise publicly available communications from government agencies, intended for mass audiences.
Results
Our sample contained at least 4 Tweets from each year in the study period and contained Tweets from 40 states. Tweets in our sample ranged in length from 31 to 305 characters, with a mean of 165.6, standard deviation of 70.0, and median of 140. (Note: Twitter’s character counts treat usernames and URLs differently from other words, unlike our raw character counts. Further, the maximum length of Tweets increased during the study period).
Under restrictive baseline preprocessing, the mean Tweet word count was 21.2, with a median of 19. The mean sentence count was 2.0, with a median of 2. The mean syllable count was 34.7, with a median of 29.5. The mean polysyllabic word count was 3.6, with a median of 3. On average, the inclusion of URLs contributed more additional words, syllables, and polysyllabic words than did the inclusion of numbers. The inclusion of hashtags contributed the lowest number of syllables, words, and polysyllabic words. Table 2 summarizes these counts.
Table 2:
Summary metrics of the distribution of sentence, word, syllable, and polysyllabic word counts attributed to the baseline preprocessing scenario or to the additional inclusion of URLs, hashtags, or numbers alone
| Count attributed to | Mean sentence Count (SD) | Mean word count (SD) | Mean syllable count (SD) | Mean polysyllabic word count (SD) |
|---|---|---|---|---|
| Restrictive baseline | 2.0 (1.1) | 21.2 (10.3) | 34.7 (17.1) | 3.6 (2.5) |
| +URLs | 0.4 (0.5) | 1.1 (0.7) | 3.1 (2.0) | 1.0 (0.7) |
| +Tags | <0.1 (0.1) | 0.4 (0.6) | 1.2 (2.3) | 0.2 (0.5) |
| +Nums | 0 (0) | 0.5 (1.3) | 2.6 (9.0) | 0.4 (1.0) |
SMOG reading grade level estimates ranged from 3.1 (elementary school education) to 21.2 (postgraduate education). The mean was 11.3. The median was 11.2. FK reading grade level estimates ranged from −0.5 to 34.1. We deemed these minimum and maximum values too extreme for interpretation. The mean was 9.4. The median was 9.0. Figure 1 depicts the distribution of SMOG and FK estimates.
Figure 1:

Distribution of reading grade level estimates per Tweet, by Formula. Mean size of range denoted by a dot. The 25th, 50th, and 75th percentile marked with horizontal lines. One observation, equal to 25.4, is cut off in this plot to better visualize the rest of the data
The baseline portion of Tweets at or below an 8th-grade reading level was 15% according to the SMOG and 44% according to the FK. The portion meeting this benchmark under inclusive preprocessing did not significantly differ under either formula. The baseline portion of Tweets at or below a 6th-grade level was 6% according to the SMOG and 31% according to the FK. Under the FK formula, the portion meeting this benchmark differed significantly in the scenario that included hashtags and URLs, as well as the scenario that included hashtags, URLs, and numbers alike. Table 3 summarizes these results.
Table 3:
Summary of proportion tests comparing the portion of Tweets meeting an 8th- or 6th-grade reading level benchmark across inclusive preprocessing scenarios to the portions under the restrictive baseline, using both SMOG and FK formulas
| Preprocessing Scenario | Formula | Portion, 8th grade reading level or lower | p-value, proportion test vs. baseline | Portion, 6th grade reading level or lower | p-value, proportion test vs. baseline |
|---|---|---|---|---|---|
| +URLs | SMOG | 9% | 0.28 | 1% | 0.12 |
| | FK | 44% | 1.00 | 20% | 0.10 |
| +Nums | SMOG | 15% | 1.00 | 6% | 1.00 |
| | FK | 40% | 0.67 | 28% | 0.76 |
| +URLs +Nums | SMOG | 9% | 0.28 | 1% | 0.12 |
| | FK | 40% | 0.67 | 19% | 0.07 |
| +Tags | SMOG | 12% | 0.68 | 5% | 1.00 |
| | FK | 39% | 0.57 | 28% | 0.76 |
| +Tags +URLs | SMOG | 7% | 0.11 | 1% | 0.12 |
| | FK | 40% | 0.67 | 17% | 0.03 |
| +Tags +Nums | SMOG | 12% | 0.68 | 5% | 1.00 |
| | FK | 36% | 0.31 | 25% | 0.43 |
| +URLs +Tags +Nums | SMOG | 7% | 0.11 | 1% | 0.12 |
| | FK | 37% | 0.39 | 16% | 0.02 |
Across both formulas, all regression models had beta coefficients with p-values meeting our significance threshold. Across both formulas, only the inclusive preprocessing scenario including numbers alone led to a model with a nonsignificant intercept. R2 values for 3 SMOG models and 2 FK models met our benchmark of 0.8. However, after graphically examining residual plots, only the FK estimate under the inclusive preprocessing scenario that included URLs alone remained adequately explained. Residual plots showed 3 preprocessing scenarios to have estimates with nonlinear relationships to the baseline. Table 4 summarizes the regression models. Figure 2 and Figure 3 depict residual plots.
Table 4:
Summary of simple linear regression models representing the reading grade level estimate under an inclusive preprocessing scenario as a function of the estimate under the restrictive baseline
| Inclusive Preprocessing Scenario | Formula | Intercept | Beta | Adjusted R2 |
|---|---|---|---|---|
| +URLs | SMOG | 3.52 | 0.73 | 0.74 |
| | FK | 1.79 | 0.83 | 0.83 |
| +Nums | SMOG | 0.23 | 1.01 | 0.95 |
| | FK | 0.84 | 1.00 | 0.65 |
| +URLs +Nums | SMOG | 3.73 | 0.74 | 0.70 |
| | FK | 2.58 | 0.83 | 0.51 |
| +Tags | SMOG | 0.91 | 0.94 | 0.94 |
| | FK | 0.74 | 0.97 | 0.91 |
| +Tags +URLs | SMOG | 3.98 | 0.71 | 0.67 |
| | FK | 2.41 | 0.81 | 0.71 |
| +Tags +Nums | SMOG | 1.13 | 0.95 | 0.88 |
| | FK | 1.55 | 0.97 | 0.60 |
| +URLs +Tags +Nums | SMOG | 4.18 | 0.71 | 0.63 |
| | FK | 3.18 | 0.81 | 0.45 |
Figure 2:

Plots of predicted reading grade level and the residuals for the simple linear regression models representing the inclusive SMOG estimate as a function of the restrictive baseline
Figure 3:

Plots of predicted reading grade level and the residuals for the simple linear regression models representing the inclusive FK estimate as a function of the restrictive baseline
Wilcoxon signed-rank tests all had p-values <0.05, indicating a change in median reading grade level estimates due to more inclusive preprocessing, under all scenarios, across both formulas. Table 5 summarizes these results.
Table 5:
Summary of Wilcoxon signed-rank tests comparing the distribution of reading grade estimates under each inclusive preprocessing scenario to the distribution of estimates under the restrictive baseline
| Inclusive Preprocessing Scenario | Formula | p-value, Wilcoxon signed-rank test |
|---|---|---|
| +URLs | SMOG | <0.001 |
| | FK | <0.01 |
| +Nums | SMOG | <0.001 |
| | FK | <0.001 |
| +URLs +Nums | SMOG | <0.001 |
| | FK | <0.001 |
| +Tags | SMOG | <0.001 |
| | FK | <0.001 |
| +Tags +URLs | SMOG | <0.001 |
| | FK | <0.001 |
| +Tags +Nums | SMOG | <0.001 |
| | FK | <0.001 |
| +URLs +Tags +Nums | SMOG | <0.001 |
| | FK | <0.001 |
We conducted a sensitivity analysis to determine whether our findings were due to outlier Tweets skewing our data. We repeated our significance tests after dropping Tweets with outlier reading grade level estimates, word counts, sentence counts, syllable counts, or polysyllabic word counts. Our conclusions did not change. Appendix 3 details our sensitivity analysis.
Discussion
We estimated the reading grade level of Tweets from state public health agencies using SMOG and FK formulas, under preprocessing scenarios that differentially included URLs, numbers, and/or hashtags. We used three analysis methods to examine the impact of these more inclusive preprocessing decisions compared to estimates from a restrictive baseline excluding these elements. We found significant differences between the baseline’s median estimate and those from each inclusive scenario, across both formulas. We found no significant differences in the portion of Tweets meeting an 8th-grade reading level benchmark. We found some differences under a 6th-grade benchmark. Simple linear regression revealed only one scenario in which the baseline reading grade level estimate adequately explained an estimate under more inclusive preprocessing. Our conclusions remained the same in a sensitivity analysis that excluded outlier Tweets.
Our findings suggest that researchers should not apply reading grade level formulas to Tweets and other short-form online text. Though their outputs appear interpretable, reading grade level formulas may be misleading considering our findings, which included nonsensical extreme estimates and significant differences due to preprocessing decisions. We attribute this in part to the very definition of these formulas, which collapse several data points into a summary estimate. Preprocessing decisions around even a single text feature can significantly impact results when these features account for a sizable portion of a message, as in the case of a short social media post with several hashtags. We also attribute this in part to the origin of these formulas. The SMOG and the FK were developed for long-form print materials which did not include many common features of today’s social media messages. Note that this is not just true of the formulas we used. We would expect similar issues for other popular formulas like the Gunning Fog Index (Gunning, 1952, p. 36), Coleman Liau Index (Coleman and Liau, 1975), Automated Readability Index (Kincaid et al., 1975), and Dale-Chall Readability Formula (Dale and Chall, 1948), which were also created for long-form print texts.
Instead of reading grade level formulas, we recommend using raw count data, which are more transparent. Though metrics like polysyllabic word frequency or average word length may not be as succinct, they change more predictably in response to text edits. However, using these metrics will not eliminate the need for human decision making: they still require preprocessing to turn text into quantitative data. Thus, we also recommend using sensitivity analyses to describe results under a range of reasonable preprocessing decisions rather than attempting to justify a single set of preprocessing decisions as a ground truth.
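A sensitivity analysis of this kind can be as simple as recomputing a raw metric under every preprocessing scenario and reporting the range. The helper and sample Tweet below are hypothetical and deliberately crude; the point is the loop over scenarios, not the tokenizer.

```python
import re

def word_count(text, keep_urls=False, keep_tags=False, keep_nums=False):
    # Hypothetical helper: count words under one preprocessing scenario
    if not keep_urls:
        text = re.sub(r"https?://\S+", " ", text)
    if not keep_tags:
        text = re.sub(r"#\w+", " ", text)
    if not keep_nums:
        text = re.sub(r"\b\d[\d,.]*\b", " ", text)
    return len(re.findall(r"[\w#/:.']+", text))

tweet = "Wash hands for 20 seconds. Details: https://example.org #HealthTips"
scenarios = [(u, t, n) for u in (0, 1) for t in (0, 1) for n in (0, 1)]
counts = [word_count(tweet, *s) for s in scenarios]
print(min(counts), max(counts))  # report the range, not a single "true" count
```

Reporting the minimum and maximum across scenarios makes the dependence on preprocessing visible instead of hiding it behind one arbitrary choice.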
Our findings align with long-standing calls for caution when using reading grade level estimates, even for longer print materials. Over the past several decades, the Centers for Disease Control and Prevention (1999, p. 23), the National Cancer Institute (1994, p. 15), and the Agency for Healthcare Research and Quality (2015) have recommended that reading grade level estimates be combined with audience testing for more accurate conclusions. These authoritative voices stress the fact that reading grade level formulas do not reflect important elements like graphic design or cultural relevance, which can also affect readability. Further, they do not necessarily reflect how readers may interact with texts in the real world. For this reason, the creators of the Suitability Assessment of Materials contextualized readability among other factors including graphic design, content organization, and cultural appropriateness (Doak et al., 1996, p. 49–60). Even still, they recommended revising content based on tests with target audience members (Doak et al., 1996, p. 167–170).
Our findings also have a key implication for practice: audience feedback is essential, even for short-form social media texts. Metrics like polysyllabic word count change more transparently under different preprocessing scenarios. However, they do not come with readability benchmarks. Further, there is no universal standard for social media text preprocessing decisions. Whether someone reads a number or a URL like any other word differs by audience, platform, and context. So, while practitioners can use quantitative data to help implement readability principles, they will still require direct audience feedback to assess whether messages meet audience needs. While researchers can use quantitative data to assess readability at scale, their conclusions will lack evidence of external validity without audience testing. Though there is a place for quick, scalable communication assessments in today’s crowded information environments, this does not diminish the need to improve communication through community engagement.
Limitations
Our findings are limited by the small set of text features we examined, a constraint we accepted to keep the study feasible in scope. Further, we treated URLs uniformly while making context-specific decisions about numbers and hashtags. We made this decision because we could not observe how URL previews originally appeared on the platform, and because our goal was to describe the importance of text preprocessing rather than to provide a definitive reference for variance in estimated reading grade levels.
Supplementary Material
Disclosure of Interest
This study was funded by the National Cancer Institute (grant 3U54CA156732-13S1). The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the funder. The authors report there are no competing interests to declare.
Data Availability
Data used in this analysis come from a dataset available at Harvard Dataverse (Mendez, 2023). Our R code and data for this analysis are available in an Open Science Framework repository (Mendez et al., 2024).
References
- Agency for Healthcare Research and Quality. (2015). Tip 6. Use caution with readability formulas for quality reports. https://www.ahrq.gov/talkingquality/resources/writing/tip6.html
- Agency for Healthcare Research and Quality. (2019). Ten attributes of health literate health care organizations. https://www.ahrq.gov/health-literacy/publications/ten-attributes.html#designs
- Centers for Disease Control and Prevention (U.S.). (1999). Scientific and technical information: Simply put (2nd ed.). CDC Office of Communication. https://stacks.cdc.gov/view/cdc/11353
- Centers for Disease Control and Prevention (U.S.). (2014). CERC: Working with the media. https://www.cdc.gov/cerc/media/pdfs/CERC_Working_with_the_Media.pdf
- Coleman M & Liau TL (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283–284.
- Dale E & Chall JS (1948). A formula for predicting readability. Educational Research Bulletin, 27(1), 11–20, 28.
- Doak CC, Doak LG, & Root LH (1996). Teaching patients with low literacy skills (2nd ed.). J.B. Lippincott Company.
- Gao J (2024). R-Squared (R2): How much variation is explained? Research Methods in Medicine & Health Sciences, 5(4), 104–109. 10.1177/26320843231186398
- Grabeel KL, Russomanno J, Oelschlegel S, Tester E, & Heidel RE (2018). Computerized versus hand-scored health literacy tools: A comparison of Simple Measure of Gobbledygook (SMOG) and Flesch-Kincaid in printed patient education materials. Journal of the Medical Library Association, 106(1), 38–45. 10.5195/jmla.2018.262
- Gunning R (1952). The technique of clear writing. McGraw-Hill.
- Hoedebecke K, Beaman L, Mugambi J, Shah S, Mohasseb M, Vetter C, Yu K, Gergianaki I, & Couvillon E (2017). Health care and social media: What patients really understand. F1000Research, 6, 118. 10.12688/f1000research.10637.1
- Kincaid JP, Fishburne RP Jr., Rogers RL, & Chissom BS (1975). Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for navy enlisted personnel (Research Branch Report 8–75). Institute for Simulation and Training, University of Central Florida.
- Mac O, Ayre J, Bell K, McCaffery K, & Muscat DM (2022). Comparison of readability scores for written health information across formulas using automated vs manual measures. JAMA Network Open, 5(12), e2246051. 10.1001/jamanetworkopen.2022.46051
- McLaughlin GH (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639–646.
- Mendez SR (2023). State public health agency, CDC, and FDA Tweets (2012 through 2022) [Data set]. Harvard Dataverse. V1. 10.7910/DVN/VX4HK8
- Mendez SR, Galvez SM-N, Emmons KM, & Viswanath K (2024). Describing the variation in reading grade level estimates of public health Tweets. osf.io/guk3e
- Mendez SR, Munoz-Najar S, Emmons KM, & Viswanath K (2025). US state public health agencies’ use of Twitter from 2012 to 2022: Observational study. Journal of Medical Internet Research, 27, e59786. 10.2196/59786
- Morse E, Odigie E, Gillespie H, & Rameau A (2024). The readability of patient-facing social media posts on common otolaryngologic diagnoses. Otolaryngology–Head and Neck Surgery, 170(4), 1051–1058. 10.1002/ohn.584
- National Cancer Institute. (1994). Clear & simple: Developing effective print materials for low-literate readers (NIH Publication No. 95–3594). https://eric.ed.gov/?id=ED381691
- Nesca M, Katz A, Leung CK, & Lix LM (2022). A scoping review of preprocessing methods for unstructured text data to assess data quality. International Journal of Population Data Science, 7(1), 1757. 10.23889/ijpds.v6i1.1757
- R Core Team (2021). R: A language and environment for statistical computing (Version 4.2.3) [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
- Scott B (2024, August 26). The SMOG readability formula, a simple measure of gobbledygook. Readability Formulas. https://readabilityformulas.com/the-smog-readability-formula/
- Wang LW, Miller MJ, Schmitt MR, & Wen FK (2013). Assessing readability formula differences with written health information materials: Application, results, and recommendations. Research in Social & Administrative Pharmacy, 9(5), 503–516. 10.1016/j.sapharm.2012.05.009
- Zhou S, Jeong H, & Green PA (2017). How consistent are the best-known readability equations in estimating the readability of design standards? IEEE Transactions on Professional Communication, 60(1), 97–111. 10.1109/TPC.2016.2635720