Journal of the American Medical Informatics Association (JAMIA)
2013 Oct 7;21(e1):e169–e172. doi: 10.1136/amiajnl-2013-002172

The effect of word familiarity on actual and perceived text difficulty

Gondy Leroy 1, David Kauchak 2
PMCID: PMC3957403  PMID: 24100710

Abstract

There is little evidence that readability formula outcomes relate to text understanding. The potential cause may lie in their strong reliance on word and sentence length. We evaluated word familiarity rather than word length as a stand-in for word difficulty. Word familiarity represents how well known a word is, and is estimated using word frequency in a large text corpus, in this work the Google web corpus. We conducted a study with 239 people, who provided 50 evaluations for each of 275 words. Our study is the first to focus on actual difficulty, measured with a multiple-choice task, in addition to perceived difficulty, measured with a Likert scale. Actual difficulty was correlated with word familiarity (r=0.219, p<0.001) but not with word length (r=−0.075, p=0.107). Perceived difficulty was correlated with both word familiarity (r=−0.397, p<0.001) and word length (r=0.254, p<0.001).

Keywords: Text Simplification, Health Literacy, User Study, Readability, Comprehension

Introduction

Many different readability formulae exist, some conceived decades ago. Their continued popularity is a testimony to the need for an efficient means of evaluating the difficulty of text for patients and consumers. Several formulae are provided in text-editing software (eg, in Microsoft Word) or made available in online tools (eg, http://www.readabilityformulas.com). They are used in numerous research projects and are recommended to help simplify medical information texts.1

Even though they are extremely popular, there is little evidence that simplifying text using these formulae is associated with increased understanding.2 Exceptions are Swanson and Fox,3 who in 1953 found that articles rated simpler by the formulae led to higher understanding but not higher retention, and Freed et al,4 who found higher recognition memory for experimental text with a decreased readability grade level, but also for text in which crucial information was presented in tables. However, neither study focused on using readability measures for text simplification. Work showing that direct application of the formulae leads to increased understanding, retention or learning is rare, possibly due to the difficulty of publishing non-significant results, especially for such a popular and accepted tool. In addition, studies often do not differentiate between actual and perceived difficulty, a distinction supported by both the health belief model (HBM)5 and the theory of planned behavior (TPB).6 In a review of 24 studies, the fourth dimension of the HBM, perceived barriers, was shown to be the most significant in explaining health behavior.5 Similarly, in the TPB, perceived difficulty, a factor of perceived behavioral control, has been found to be the stronger predictor of intentions and behavior.6 We are working towards a modern approach to text simplification with demonstrated impact. In this paper we address one component used by most existing readability formulae: the difficulty of individual words. Currently, word length, measured in characters or syllables, is used as a stand-in for word difficulty (eg, in the SMOG, Linsear, Lix, Coleman–Liau and Flesch grade level readability formulae).7 However, examples demonstrate that this is not always an accurate indicator of word difficulty: ‘disorientation’ or ‘diabetes’ would be considered more difficult than ‘apnea’ by most formulae, but in many cases people know the meaning of the former words but not the latter.

We propose a new measure for evaluating word difficulty based on word familiarity. Familiarity can be practically estimated by the frequency with which a word occurs in a large corpus of English text. Words with a low occurrence frequency are assumed to be less familiar and therefore more difficult, because a reader will not encounter them as often and is less likely to know their meaning. Similarly, text that uses more low-frequency words can be expected to be more difficult. In earlier work we found indirect evidence of this relationship: easy texts used more words with higher word frequencies.8 9 We also saw a positive effect on understanding and learning when words with low familiarity (ie, low frequency) were replaced with high-frequency equivalents.10 11 In contrast, Ryder and Hughes12 did not find such an effect in their studies with high school students.

To our knowledge, no one has directly measured, for individual readers, the difficulty of a large set of words with different occurrence frequencies in English texts. Given the lack of clear data, we conducted a user study to directly evaluate the impact of word familiarity on word difficulty. The dataset will be made available to the community.

Dataset construction

Word set

To generate a representative sample of words with different frequencies, we combined two resources: the Google web corpus13 and the Moby Words II list. The Google web corpus contains n-gram counts from a corpus of a trillion words from public webpages. It is made available by the Linguistic Data Consortium (http://www.ldc.upenn.edu/) for a small fee. There are 13 588 391 unigrams (single words) in the corpus, each with its frequency count. The Moby Words II list is a list of common English words and their definitions and is made available for free at Infochimps.com (http://www.infochimps.com/collections/moby-project-word-lists). We used the set containing 64 000 common English dictionary words.

To select a subset of words with sufficiently different word frequencies, we identified the top 1% most frequent words (1st percentile) in the Google web corpus, the 9–10% most frequent words (10th percentile), the 19–20% most frequent words (20th percentile) and so on until the 99–100% most frequent words (100th percentile). From each percentile we randomly selected 25 words for which there is also a definition available in the Moby word list. We excluded words that are formulae, html or internet-specific syntax, or number–letter combinations. We reviewed the word list to exclude proper names, for example, the word ‘Mendelsohn’ was excluded because it is the name of a German musician. Table 1 provides an overview of our word set characteristics. For example, ‘work’, ‘management’ and ‘power’ are included in the 1st percentile, ‘rancor’, ‘furan’ and ‘shorebird’ in the 50th percentile, and ‘shaggymane’, ‘dropkicker’ and ‘hiplength’ in the 100th percentile.
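As an illustration, the percentile-band sampling described above can be sketched in a few lines of Python. The function and variable names are ours, not the authors', and the sketch assumes the full unigram list fits in memory, a simplification for a 13-million-word vocabulary.

```python
import random

def sample_percentile_words(unigram_counts, dictionary_words, per_band=25, seed=0):
    """Sample words from frequency-percentile bands of a unigram corpus.

    unigram_counts: dict mapping word -> corpus frequency.
    dictionary_words: set of words that have a usable dictionary definition.
    Returns a dict mapping band label (1, 10, 20, ..., 100) to sampled words.
    """
    rng = random.Random(seed)
    # Rank words from most to least frequent.
    ranked = sorted(unigram_counts, key=unigram_counts.get, reverse=True)
    n = len(ranked)
    # Band 1 is the top 1%; band 10 covers the 9-10% range, and so on.
    bands = [(1, 0.00, 0.01)] + [(p, (p - 1) / 100, p / 100) for p in range(10, 101, 10)]
    sample = {}
    for label, lo, hi in bands:
        band = ranked[int(lo * n):int(hi * n)]
        # Keep only words that also have a dictionary definition,
        # mirroring the paper's requirement of a Moby word list entry.
        candidates = [w for w in band if w in dictionary_words]
        sample[label] = rng.sample(candidates, min(per_band, len(candidates)))
    return sample
```

The exclusion of proper names, formulae and web syntax was done by manual review in the study; a filter on `dictionary_words` approximates only the definition-availability constraint.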

Table 1.

Word set characteristics

Percentile | Word frequency (familiarity): min / max / average | Word length: min / max / average
1st | 216 988 964 / 719 988 717 / 313 239 137 | 3 / 10 / 5
10th | 6 008 690 / 6 424 555 / 6 193 930 | 2 / 15 / 7
20th | 1 420 722 / 1 463 432 / 1 444 546 | 3 / 13 / 7
30th | 523 999 / 536 215 / 530 586 | 5 / 13 / 8
40th | 226 266 / 232 082 / 229 647 | 4 / 19 / 8
50th | 101 654 / 103 462 / 102 575 | 5 / 12 / 8
60th | 44 554 / 45 376 / 44 976 | 5 / 13 / 8
70th | 18 051 / 18 533 / 18 324 | 4 / 14 / 9
80th | 6545 / 6724 / 6637 | 6 / 16 / 9
90th | 1902 / 1963 / 1931 | 6 / 13 / 9
100th | 252 / 272 / 262 | 7 / 15 / 10
Overall | 252 / 719 988 717 / 29 255 686 | 2 / 19 / 8

For each word, we selected the most common meaning and then shortened the definition by removing usage examples. If the word itself appeared in its definition, the definition was rephrased based on WordNet, an online lexical database (http://wordnet.princeton.edu/).14 15 For example, for ‘immorally’ (70th percentile), the original definition was

without regard for morality; “he acted immorally when his own interests were at stake”

which we shortened and rephrased to

without regard for traditionally held principles.

The final set consisted of 275 words, each with its correct definition.

Evaluating word difficulty

We examined two aspects of word difficulty: perceived and actual difficulty. To measure how difficult the words actually are, that is, how many words participants know, we asked participants to choose the correct definition from four options. For each word, the four options consisted of the correct definition along with three definitions randomly selected from other words in the set. To measure how difficult a word was perceived to be, we asked participants to rate its difficulty on a five-point Likert scale, with one indicating a very easy word and five a very difficult word. The specific question asked was: ‘How difficult would this word look in a text given to patients?’
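A minimal sketch of this multiple-choice item construction, under our own naming (the original items were generated for the MTurk interface, not with this code):

```python
import random

def build_item(word, definitions, rng):
    """Build one multiple-choice item: the word's correct definition plus
    three definitions drawn at random from other words in the set.

    definitions: dict mapping word -> its (shortened) definition.
    Returns (options, index_of_correct_option).
    """
    distractors = rng.sample([w for w in definitions if w != word], 3)
    options = [definitions[word]] + [definitions[w] for w in distractors]
    rng.shuffle(options)  # order of definitions randomized per word
    return options, options.index(definitions[word])
```

Because the distractors come from unrelated words, they are often easy to eliminate; the limitations section below discusses this design choice.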

Data collection

We recruited participants using Amazon's Mechanical Turk (MTurk, http://www.mturk.com/), an online crowdsourcing service where workers select small tasks to work on. Currently, there are over half a million workers and over 300 000 available tasks. MTurk has been used in a wide range of settings16 and attracts workers from all over the world with varied demographic characteristics.17 18 With precautions taken to filter out unproductive workers, the data quality is at least as good as that of more traditional approaches.18 19

MTurk workers select human intelligence tasks (HITs) based on their title, description, payment and/or personal interest. Only workers located in the USA with a 95% or better HIT approval rating were invited to our HIT. We required workers first to answer common demographic questions, including how often they use English at home. Then, they were asked to evaluate the words. Workers were paid 2 cents for each word they evaluated.

To get a reasonable sample, we designed our study to collect evaluations from 50 participants for each word. As is customary on MTurk, workers were at liberty to evaluate as many or as few words as they chose. For each word, participants labeled the perceived difficulty on a five-point Likert scale and selected their best guess for the definition from the four options. The order of definitions was randomized per word and the order of the words was randomized. MTurk ensured that no worker attempted the same word twice and that 50 different workers evaluated each word.

Evaluation

This large dataset allows us to calculate correlations for individual data points (N=13 750), per word (N=275) and per percentile (N=11) for both actual and perceived difficulty. For brevity, we include only the per-word analysis, in which the 50 evaluations are averaged for each word.
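The per-word analysis can be reproduced along the following lines; the flat record format and the pure-Python Pearson helper are our assumptions, not the authors' actual pipeline:

```python
from collections import defaultdict
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

def per_word_correlations(records):
    """Average all per-worker evaluations for each word, then correlate
    word frequency with the two difficulty measures.

    records: iterable of (word, frequency, correct, likert) tuples, one per
    worker evaluation; `correct` is 1/0, `likert` is 1 (very easy) to 5
    (very difficult). Returns (r_actual, r_perceived) over per-word means.
    """
    by_word = defaultdict(lambda: {"freq": 0, "correct": [], "likert": []})
    for word, freq, correct, likert in records:
        entry = by_word[word]
        entry["freq"] = freq
        entry["correct"].append(correct)
        entry["likert"].append(likert)
    freqs = [e["freq"] for e in by_word.values()]
    pct_correct = [mean(e["correct"]) for e in by_word.values()]
    perceived = [mean(e["likert"]) for e in by_word.values()]
    # Higher frequency should go with more correct answers (positive r)
    # and with lower perceived difficulty (negative r).
    return pearson_r(freqs, pct_correct), pearson_r(freqs, perceived)
```

A statistics package would also supply p values for the one-tailed tests reported below; the helper here computes only the coefficient.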

Demographic information and metadata

A total of 239 workers participated in the data collection (see table 2). Slightly more than half were men (59%). The majority identified themselves as white (82%), followed by Asian (14%) and black or African American (6%). Few identified as Hispanic or Latino (6%). Many had a high school diploma as their highest degree (39%), and a slightly larger group had a college degree (16% associate's degree and 32% bachelor's degree). Only a small minority did not have a high school diploma (1%) or had a graduate degree (11% with a master's degree and 2% with a doctorate).

Table 2.

Participant demographics

Characteristic N (%)
Gender
 Female 98 (41)
 Male 141 (59)
Race (multiple choices allowed)
 American Indian/native Alaskan 5 (2)
 Asian 34 (14)
 Black or African American 14 (6)
 Native Hawaiian or other Pacific islander 3 (1)
 White 195 (82)
Ethnicity
 Hispanic or Latino 15 (6)
 Not Hispanic or Latino 224 (94)
Education (highest completed)
 Less than high school 2 (1)
 High school diploma 94 (39)
 Associate's degree 37 (16)
 Bachelor's degree 77 (32)
 Master's degree 25 (11)
 Doctorate 4 (2)
Language skills (frequency of speaking English at home)
 Never English
 Rarely English
 Half English 2 (1)
 Mostly English 15 (5)
 Only English 225 (94)

On average, workers evaluated 75 words, with seven workers evaluating all 275 words and 11 workers evaluating only one word. The average time spent on a word was 12 s, with a minimum of 2 s and a maximum of 58 s.

Actual word difficulty

Figure 1 shows the average actual difficulty, that is, the percentage of words correctly defined, for the 25 words in each percentile. The results show that words with a lower frequency of occurrence (higher percentile) are more difficult and less often correctly defined by participants. We calculated a one-tailed Pearson correlation coefficient for the results. The average percentage correct correlated negatively with the percentile (N=275, r=−0.381, p<0.001), indicating that higher percentiles contain more difficult words. A complementary, more specific analysis using word frequencies instead of percentiles confirms this result, with a significant correlation showing that lower frequency is associated with fewer words correctly defined (N=275, r=0.219, p<0.001).

Figure 1. Average actual difficulty for words grouped by word frequency of occurrence.

Because word length is a crucial factor in most readability formulae, we also evaluated its relationship with actual difficulty. We found none (N=275, r=−0.075, p=0.107): the percentage of words correctly defined was unrelated to word length.

To complete our analysis, we also evaluated the time participants spent evaluating each word and found a significant negative correlation between actual difficulty and time spent (N=275, r=−0.674, p<0.001): more time was spent on words that were less often correctly defined.

Perceived word difficulty

Figure 2 shows the average perceived difficulty on a five-point Likert scale for the 25 words in each percentile. The results show that words with a higher frequency of occurrence (lower percentile) are consistently perceived as easier. We calculated a one-tailed Pearson correlation coefficient, which showed a significant positive correlation between perceived difficulty (lower is easier) and percentile (N=275, r=0.611, p<0.001). As with actual difficulty, the complementary analysis using word frequencies instead of percentiles confirms this relationship, with a significant negative correlation between perceived difficulty and word frequency (N=275, r=−0.397, p<0.001). Words with higher word familiarity are seen as easier.

Figure 2. Average perceived difficulty for words grouped by word frequency of occurrence.

As with the analysis for actual difficulty, we also evaluated the relationship between perceived difficulty and word length. In contrast to actual difficulty, the length of the word does have an effect on perceived difficulty. There was a positive, significant correlation between perceived difficulty and word length (N=275, r=0.254, p<0.001): longer words are seen as more difficult.

For time spent evaluating the word, there was a significant correlation between time and perceived difficulty (N=275, r=0.656, p<0.001), with more time being spent on words seen as more difficult.

Conclusions

We evaluated the use of a reader's familiarity with a word, estimated by the word's frequency in common English text, as a stand-in for word difficulty. We conducted a user study of 275 words with different frequencies and gathered, for each word, 50 evaluations of actual difficulty (how well people can choose the correct definition of the word) and of perceived difficulty (how difficult the word looks). Our results show that word frequency is strongly associated with both actual and perceived difficulty. Words with higher frequency were more often defined correctly and were labeled as appearing less difficult. Word length, frequently used in readability formulae as a stand-in for difficulty, did not relate to actual difficulty and was only weakly related to perceived difficulty. Based on these results, we argue that word frequency is a better metric than word length for estimating word difficulty. Further studies with complete texts instead of single words, and with different types of readers, are needed to evaluate how the metric relates to information understanding.

As with all practical studies, our evaluation has limitations. The first relates to workers on Amazon Mechanical Turk. It can be assumed that these workers are more computer literate than other Americans reading online text. Future work will evaluate whether the relationship is as strong in younger readers and in readers with lower reading skills. The second limitation relates to the general nature of the words. We did not limit our set to medical words. Furthermore, our procedure limits words to those found in the Moby word list, which may bias the word list towards common words. However, the list is large and our approach helps exclude technical or web-specific terms. While this makes the data less medically focused, it is still useful for the medical domain because all words, not only medical words, need to be sufficiently simple in patient materials. Similarly, word familiarity was estimated using the Google web corpus, a general corpus. A corpus more specific to a particular patient population, for example, for different ethnicities or age groups, may lead to more fine-tuned formulae. A third limitation is the selection of alternative definitions. Because the alternative definitions were assigned automatically and at random, participants could often easily eliminate one or more of them, resulting in a high percentage of correct answers. We conducted our study in this manner to start with the most conservative case. We expect that with alternative definitions more closely related to the word, the effect of word familiarity on difficulty will be even stronger. Finally, we point out that we worked with single words (unigrams); future work will include multiword phrases.

Acknowledgments

The authors would like to thank their study participants.

Footnotes

Funding: This work was supported by the US National Library of Medicine, NIH/NLM 1R03LM010902.

Competing interests: None.

Ethics approval: The study was reviewed by the institutional review board of Claremont Graduate University.

Provenance and peer review: Not commissioned; externally peer reviewed.

References

  • 1. Weiss BD. Health literacy and patient safety: help patients understand (manual for clinicians). American Medical Association, 2007
  • 2. Gemoets D, Rosemblat G, Tse T, et al. Assessing readability of consumer health information: an exploratory study. Stud Health Technol Inform 2004;107:869–73
  • 3. Swanson CE, Fox HG. Validity of readability formulas. J Appl Psychol 1953;37:114–18
  • 4. Freed E, Long D, Rodriguez T, et al. The effects of two health information texts on patient recognition memory: a randomized controlled trial. Patient Educ Couns 2013;92:260–5
  • 5. Janz NK, Becker MH. The health belief model: a decade later. Health Educ Q 1984;11:1–47
  • 6. Trafimow D, Sheeran P, Conner M, et al. Evidence that perceived behavioral control is a multidimensional construct: perceived control and perceived difficulty. Br J Soc Psychol 2002;41:101–21
  • 7. Ley P, Florio T. The use of readability formulas in health care. Psychol Health Med 1996;1:7–28
  • 8. Leroy G, Endicott JE. Combining NLP with evidence-based methods to find text metrics related to perceived and actual text difficulty. Presented at the 2nd ACM SIGHIT International Health Informatics Symposium (ACM IHI 2012); Miami, Florida, 2012
  • 9. Leroy G, Endicott JE. Term familiarity to indicate perceived and actual difficulty of text in medical digital libraries. Presented at the International Conference on Asia-Pacific Digital Libraries (ICADL 2011); Beijing, China, 2011
  • 10. Leroy G, Kauchak D, Mouradi O. A user-study measuring the effects of lexical simplification and coherence enhancement on perceived and actual text difficulty. Int J Med Inform 2013. [Epub ahead of print]
  • 11. Leroy G, Endicott JE, Kauchak D, et al. User evaluation of the effects of a text simplification algorithm using term familiarity on perception, understanding, learning, and information retention. J Med Internet Res 2013;15:e144
  • 12. Ryder RJ, Hughes M. The effect on text comprehension of word frequency. J Educ Res 1985;78:286–91
  • 13. Brants T, Franz A. Web 1T 5-gram Version 1. Philadelphia: Linguistic Data Consortium, 2006
  • 14. Miller GA. WordNet: a lexical database for English. Commun ACM 1995;38:39–41
  • 15. Fellbaum C. WordNet: an electronic lexical database. Cambridge, MA: MIT Press, 1998
  • 16. Kittur A, Chi EH, Suh B. Crowdsourcing user studies with Mechanical Turk. Presented at the SIGCHI Conference on Human Factors in Computing Systems; Florence, Italy, 2008
  • 17. Ross J, Irani L, Silberman MS, et al. Who are the crowdworkers?: shifting demographics in Mechanical Turk. Presented at the CHI '10 Extended Abstracts on Human Factors in Computing Systems; Atlanta, Georgia, USA, 2010
  • 18. Paolacci G, Chandler J, Ipeirotis PG. Running experiments on Amazon Mechanical Turk. Judgm Decis Mak 2010;5:411–19
  • 19. Buhrmester M, Kwang T, Gosling SD. Amazon's Mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspect Psychol Sci 2011;6:3–5

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
