The concerns expressed by Garcia et al. (1) are misplaced due to a range of misconceptions about word usage frequency, word rank, and expert-constructed word lists such as LIWC (Linguistic Inquiry and Word Count) (2). We provide a complete response in our paper's online appendices (3). Garcia et al. (1) suggest that the set of function words in the LIWC dataset (2) shows a wide spectrum of average happiness with positive skew (figure 1A in ref. 1) when, according to their interpretation, these words should exhibit a Dirac δ function located at neutral (havg = 5 on a 1–9 scale). However, many words tagged as function words in the LIWC dataset readily elicit an emotional response in raters, as exemplified by “greatest” (havg = 7.26), “best” (havg = 7.26), “negative” (havg = 2.42), and “worst” (havg = 2.10). In our study (3), basic function words that are expected to be neutral, such as “the” (havg = 4.98) and “to” (havg = 4.98), were appropriately scored as such. Moreover, no meaningful statement about biases can be made for sets of words chosen without properly incorporating frequency of use.
Garcia et al. (1) compare our work on English with a similarly sized survey by Warriner et al. (4). Warriner et al. generated a merged list of 13,915 English words, the bulk of which is a list of lemmas taken from movie subtitles, a mismatch with the corpora we used in creating our English word list labMT (language assessment by Mechanical Turk). In figure 1B of ref. 1, Garcia et al. make a flawed comparison between the two word lists because the words behind each histogram are not the same. For shared words, the minor difference in median havg of 0.07, much less than the observed positivity bias, cannot be attributed to our use of cartoon faces (emoticons). The earlier Affective Norms for English Words (ANEW) study upon which we modeled our work (5) also used cartoons and yet found a lower median for words shared with Warriner et al. (5.29 versus 5.44) (4). All three datasets agree well in more general statistical comparisons (4).
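The point about comparing only shared words can be illustrated with a minimal sketch. It is not the code behind any published figure: the function name `shared_word_medians`, the placeholder tokens, and every score below are invented for illustration.

```python
from statistics import median


def shared_word_medians(scores_a, scores_b):
    """Return (median_a, median_b, n_shared) computed over shared words only.

    scores_a and scores_b map word -> average happiness score.
    """
    shared = set(scores_a) & set(scores_b)
    if not shared:
        raise ValueError("the two word lists have no words in common")
    med_a = median(scores_a[w] for w in shared)
    med_b = median(scores_b[w] for w in shared)
    return med_a, med_b, len(shared)


# Invented toy data: placeholder tokens and made-up scores on a 1-9 scale.
toy_a = {"alpha": 8.0, "beta": 5.0, "gamma": 2.0, "delta": 7.5}
toy_b = {"alpha": 7.9, "beta": 5.2, "gamma": 2.1, "epsilon": 3.0}

median_a, median_b, n_shared = shared_word_medians(toy_a, toy_b)

# Medians over each full list mix in words the other list never scored,
# so they are not directly comparable with each other.
full_median_a = median(toy_a.values())
full_median_b = median(toy_b.values())
```

In this toy case the full-list medians differ far more than the shared-word medians, purely because each list contains a word the other lacks.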
To say anything about how a given quality of words relates to use frequency within a specific corpus, a complete census of words by frequency must be on hand; otherwise, uncontrolled sampling issues arise. In Fig. 1A, we plot average happiness as a function of frequency of use for the word list Garcia et al. (1) created from Google Books. The scatter plot is clearly unsuitable for linear regression. We show an estimate of cumulative coverage at the bottom, which crashes soon after reaching 5,000 words.
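One way to estimate cumulative coverage is sketched below. The definition used here is an assumption, since the text does not spell one out: coverage at rank r is taken to be the fraction of the r most frequent corpus words that appear in the scored word list. The function name and the toy inputs are hypothetical.

```python
def cumulative_coverage(ranked_words, scored_words):
    """Coverage at each rank r: share of the top-r corpus words that are scored.

    ranked_words: corpus words ordered from most to least frequent.
    scored_words: iterable of words that have a happiness score.
    """
    scored = set(scored_words)
    covered = 0
    coverage = []
    for rank, word in enumerate(ranked_words, start=1):
        if word in scored:
            covered += 1
        coverage.append(covered / rank)
    return coverage


# Invented toy example: the third-ranked word has no score.
toy_ranked = ["w1", "w2", "w3", "w4"]
toy_scored = {"w1", "w2", "w4"}
toy_coverage = cumulative_coverage(toy_ranked, toy_scored)
```

Under this definition, a curve that stays near 1 indicates a near-complete census down to that rank, while a sharp drop marks the rank beyond which the word list is only a sample.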
Sampling issues aside, Garcia et al. (1) state that regression against frequency f is a better choice than using rank r because information is lost in moving from f to r. However, the general adherence of natural language to Zipf’s law, f ∼ r^(−1), provides an immediate counterargument, even acknowledging the possibility of a scaling break (6). Fig. 1B shows how use rank is well suited for regression, and is the basis for the “jellyfish” plots we presented in our work (3). In Fig. 1C, we present how havg behaves as a function of 1/f, illustrating both the error in choosing log_10 f and that our results will be essentially unchanged if we regress against 1/f.
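The link between rank and 1/f can be checked numerically with a short sketch. It assumes an idealized Zipf law with exponent exactly 1 and no scaling break, f = C/r; the constant C, the number of ranks, and the helper `ols_slope_intercept` are all hypothetical choices made for illustration. Under that assumption 1/f = r/C, so regressing against 1/f is regressing against a rescaled rank, whereas log10 f is linear in log10 r rather than in r.

```python
import math


def ols_slope_intercept(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


C = 1.0e6                      # hypothetical normalization constant
ranks = list(range(1, 5001))   # hypothetical number of ranked words
freqs = [C / r for r in ranks]  # idealized Zipf law, f = C / r

# In log-log space an ideal Zipf law is a straight line with slope -1.
slope, intercept = ols_slope_intercept(
    [math.log10(r) for r in ranks],
    [math.log10(f) for f in freqs],
)

# 1/f is proportional to rank: C * (1/f) recovers r up to rounding error.
max_rel_dev = max(abs(C / f - r) / r for r, f in zip(ranks, freqs))
```

Real rank-frequency data deviate from this ideal, so the sketch only shows why rank and 1/f are near-interchangeable regressors when Zipf's law holds approximately.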
Acknowledgments
This work was supported in part by National Science Foundation Grant DMS-0940271 (to C.M.D.) and National Science Foundation CAREER Award 0846668 (to P.S.D.).
Footnotes
The authors declare no conflict of interest.
References
- 1. Garcia D, Garas A, Schweitzer F. The language-dependent relationship between word happiness and frequency. Proc Natl Acad Sci USA. 2015;112:E2983. doi: 10.1073/pnas.1502909112.
- 2. Pennebaker JW, Booth RJ, Francis ME. 2007. Linguistic Inquiry and Word Count: LIWC 2007. Available at homepage.psy.utexas.edu/HomePage/Faculty/Pennebaker/Reprints/LIWC2007_OperatorManual.pdf. Accessed May 15, 2014.
- 3. Dodds PS, et al. Human language reveals a universal positivity bias. Proc Natl Acad Sci USA. 2015;112(8):2389–2394. doi: 10.1073/pnas.1411678112.
- 4. Warriner AB, Kuperman V, Brysbaert M. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behav Res Methods. 2013;45(4):1191–1207. doi: 10.3758/s13428-012-0314-x.
- 5. Bradley MM, Lang PJ. Affective Norms for English Words (ANEW): Stimuli, Instruction Manual and Affective Ratings. Technical Report C-1. Univ of Florida; Gainesville, FL: 1999.
- 6. Williams JR, Bagrow JP, Danforth CM, Dodds PS. 2015. Text mixing shapes the anatomy of rank-frequency distributions: A modern Zipfian mechanics for natural language. arXiv:1409.3870.