AMIA Annual Symposium Proceedings. 2009 Nov 14;2009:426–430.

Perplexity Analysis of Obesity News Coverage

Delano J McFarlane 1, Noémie Elhadad 1, Rita Kukafka 1
PMCID: PMC2815490  PMID: 20351893

Abstract

An important task performed during the analysis of health news coverage is the identification of news articles that are related to a specific health topic (e.g. obesity). This is often done using a combination of keyword searching and manual encoding of news content. Statistical language models and their evaluation metric, perplexity, may help to automate this task. A perplexity study of obesity news was performed to evaluate perplexity as a measure of the similarity of news corpora to obesity news content. The results of this study showed that perplexity increased as news coverage became more general relative to obesity news (obesity news ≈ 187, general health news ≈ 278, general news ≈ 378, general news across multiple publishers ≈ 382). This indicates that language model perplexity can measure the similarity of news content to obesity news coverage, and could be used as the basis for an automated health news classifier.

Introduction

News coverage of health and medicine can have an impact on the public’s knowledge of health-related issues and can influence the public’s health-seeking behavior1. Studying media coverage of health topics, and using the results to try to improve news quality, can therefore have a positive impact on public health2,3. It is thus important that analyses and evaluations of media coverage of health topics be accurate, thorough and timely.

One task often performed when studying the media’s coverage of health is identifying news content related to particular health topics. This task is typically performed using a combination of keyword searching of news content databases like Lexis-Nexis and the manual reading and encoding of news content by researchers4,5. Keyword searches can be problematic because they produce false positives: articles that are incorrectly included because they contain seemingly relevant keywords. For example, if the keywords “diet*” and “weight*” (where * is a wildcard) were used in a keyword list for finding obesity news articles, the following news article would be incorrectly included.

“A Chinese company blamed for providing tainted dietary supplements that led to positive dope tests for 11 members of the Greek national weightlifting team has admitted to sending an apology letter to team officials” 6

To correct for these types of errors, news content is sometimes manually reviewed to ensure corpus validity5,7. Unfortunately, this task can be very time-consuming, requiring that multiple coders read and encode many news articles.
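The false positive described above can be reproduced with a small sketch. The helper names `wildcard_to_regex` and `matched_keywords` are hypothetical, and the matching rule (a trailing `*` matches any word that starts with the given prefix) is an assumption about how such a keyword search behaves, not a description of any particular news database.

```python
import re

def wildcard_to_regex(keyword):
    """Compile a keyword into a case-insensitive, word-anchored regex.

    A trailing '*' is treated as a prefix wildcard (assumed semantics).
    """
    if keyword.endswith("*"):
        pattern = r"\b" + re.escape(keyword[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.compile(pattern, re.IGNORECASE)

def matched_keywords(text, keywords):
    """Return the keywords that match somewhere in the text."""
    return [k for k in keywords if wildcard_to_regex(k).search(text)]

# The Reuters sentence quoted above: it is about doping, not obesity.
snippet = (
    "A Chinese company blamed for providing tainted dietary supplements "
    "that led to positive dope tests for 11 members of the Greek national "
    "weightlifting team has admitted to sending an apology letter to team officials"
)

hits = matched_keywords(snippet, ["diet*", "weight*", "obesity"])
print(hits)  # both wildcard keywords match, via "dietary" and "weightlifting"
```

The match is driven entirely by word prefixes, which is why manual review or a content-based measure is needed to exclude such articles.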

There are numerous informatics methods that may help in classifying health news coverage. The method explored in this paper is statistical language modeling, and its evaluation metric perplexity. Language models are probability distributions of words and sequences of words in a text. Language models for different documents or corpora can be compared using measures like perplexity and relative entropy, which are non-commutative measures that quantify a probability distribution’s ability to predict events in some other probability distribution. Examples of the use of these methods include Brown’s use of perplexity to estimate the entropy of English text8, Stetson’s use of relative entropy to show that medical sign-out notes constitute a sublanguage different from other types of medical notes9 and Elhadad’s use of perplexity to compare the linguistic complexity of technical and health consumer texts10.

The aim of this study is to evaluate the ability of statistical language modeling and perplexity to measure the similarity of news content to news coverage related to the topic of obesity. The expected results were that perplexity would increase as content became more general relative to obesity news coverage. This result would indicate that statistical language model perplexity can measure the similarity of news content to obesity news coverage, and can be used as the basis for an automated health news topic classifier.

Methods

Language Models and Perplexity

Statistical language modeling is used to estimate the probability of occurrence of word sequences in text. N-gram language modeling is the most basic and common statistical language modeling method used. These language models consider word sequences of length m to be Markov processes with probability

$$P(w_1, w_2, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$

where n refers to the order of the Markov process (typically 2 or 3, bigram or trigram language model respectively)11. To achieve more accurate language model probability estimates, various probability smoothing techniques, including Good-Turing estimation, Katz smoothing and Kneser-Ney smoothing, are often employed11.
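As a minimal sketch of the n-gram idea, the following code counts trigrams and estimates conditional probabilities. It uses add-one (Laplace) smoothing purely as a simple stand-in; it is not Good-Turing, Katz or Kneser-Ney smoothing, and the function names and the `<s>`/`</s>` padding symbols are assumptions made for illustration.

```python
from collections import Counter

def train_ngram_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word histories over tokenized sentences."""
    ngrams, histories, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        # Pad so that the first real word also has a full history.
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngrams[history + (padded[i],)] += 1
            histories[history] += 1
    return ngrams, histories, vocab

def cond_prob(word, history, ngrams, histories, vocab):
    """Add-one estimate of P(word | history); a stand-in for better smoothing."""
    history = tuple(history)
    return (ngrams[history + (word,)] + 1) / (histories[history] + len(vocab))

def sequence_prob(tokens, ngrams, histories, vocab, n=3):
    """Probability of a sentence as a product of conditional probabilities."""
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    prob = 1.0
    for i in range(n - 1, len(padded)):
        prob *= cond_prob(padded[i], padded[i - n + 1:i], ngrams, histories, vocab)
    return prob

# Toy training data, invented for this example.
toy = [["obesity", "rates", "rise"], ["obesity", "rates", "fall"]]
ngrams, histories, vocab = train_ngram_counts(toy)
p_rise = cond_prob("rise", ("obesity", "rates"), ngrams, histories, vocab)
```

Each factor in `sequence_prob` conditions only on the previous n-1 words, which is the Markov assumption in the formula above.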

Given a training corpus and its corresponding language model (for instance, a trigram model with Katz smoothing), one can estimate how well the language model predicts a new text composed of unseen sentences by computing its perplexity (PP). Perplexity of a text is defined in terms of the entropy of its sentences (H). The entropy, in turn, is defined in terms of the probability distribution defined by the language model.

$$PP(s_1, s_2, \ldots, s_k) = 2^{H(s_1, s_2, \ldots, s_k)}$$

$$H(s_1, s_2, \ldots, s_k) = -\sum_{i=1}^{k} P(s_i) \log_2 P(s_i)$$

The lower the perplexity, the better the unseen sentences fit the corpus underlying the given language model. Thus, we rely on perplexity to determine how close a set of sentences is to a given corpus.
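To illustrate why lower perplexity indicates a better fit, the sketch below computes perplexity from the probabilities a model assigns to each token of a held-out text. It uses the per-token cross-entropy form (average negative log2 probability), which is an assumption here: this normalization is common in language modeling practice but is not necessarily identical to the sentence-level entropy written above. The probability values are invented for the example.

```python
import math

def perplexity(token_probs):
    """Return 2**H, where H is the average negative log2 probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** entropy

# Hypothetical probabilities assigned by a reference model to two texts.
in_domain = [0.25, 0.5, 0.125, 0.25]        # text similar to the training corpus
out_of_domain = [0.03125, 0.0625, 0.03125]  # text the model predicts poorly

ppl_in = perplexity(in_domain)
ppl_out = perplexity(out_of_domain)
print(ppl_in, ppl_out)
```

The text whose tokens receive higher probabilities gets the lower perplexity, which is the sense in which it lies closer to the reference corpus.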

Data Collection and Processing

Four corpora were used in this study: an obesity news corpus, a general health news corpus, a general news corpus and a cross-publisher general news corpus. The first three corpora consist of news published online by Reuters between January 1, 2008 and June 1, 2008. Reuters news was collected using the SalientNews12 RSS aggregation system. The health news corpus consists of news from Reuters’ health news feed. The obesity news corpus was constructed by filtering Reuters health news for articles with titles or taglines containing any of the following keywords: “obesity”, “obese”, “diet*”, “weigh*”, “nutrition*”, “exercise*” (where “*” represents a wildcard search). No articles in the obesity news corpus were included in the health news corpus. The general news corpus included all Reuters news published online during that period, excluding all articles in either the Reuters health or obesity news corpora. A random sample of news articles from English Gigaword13 was used to construct the cross-publisher general news corpus. Gigaword is an archive of newswire text acquired over numerous years from multiple news sources, including the Associated Press and the New York Times.

Half of the Reuters obesity news articles were randomly selected to form the obesity news reference set; the rest of the obesity news articles were used as an obesity news test set. To control for possible differences in study results based on corpus size, news article test sets comparable in size to the obesity news reference set were constructed from the Reuters health news, Reuters general news and Gigaword corpora.
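The split and the size-matched sampling described above can be sketched as follows. The function names, the fixed seed and the use of integer article identifiers are assumptions for illustration; the paper does not state how the random selection was implemented.

```python
import random

def split_reference_test(articles, seed=0):
    """Randomly split a corpus into two halves: (reference, test)."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(articles)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def size_matched_sample(articles, size, seed=0):
    """Draw a random sample of `size` articles without replacement."""
    return random.Random(seed).sample(list(articles), size)

# Stand-in article identifiers; real corpora would hold article texts.
obesity_ids = list(range(288))
health_ids = list(range(2675))

reference, test = split_reference_test(obesity_ids)
health_small = size_matched_sample(health_ids, len(reference))
```

Sampling the comparison corpora down to the size of the reference set keeps corpus size from being confounded with topic when perplexities are compared.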

Language models (LMs) were generated for all corpora. Perplexity was measured for the Gigaword, Reuters general news, health news and obesity test LMs relative to the Reuters obesity news reference LM. Perplexity results where sentence boundaries are treated as words as well as results based solely on real word occurrences were calculated. Perplexities that include sentence boundaries tend to be higher, but provide insight into how sentence structure impacts perplexity. To provide further information on corpora differences the number of out-of-vocabulary words (words not found in the reference LM) was also calculated.
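A minimal sketch of the out-of-vocabulary count is shown below. Whether the study counted word occurrences (tokens) or distinct words (types) is not stated, so the hypothetical helper `count_oov` returns both; whitespace tokenization and lowercasing are likewise assumptions.

```python
def count_oov(reference_tokens, test_tokens):
    """Count test words missing from the reference vocabulary.

    Returns (oov_token_count, oov_type_count).
    """
    vocab = set(reference_tokens)
    oov_tokens = [t for t in test_tokens if t not in vocab]
    return len(oov_tokens), len(set(oov_tokens))

# Toy texts invented for the example.
reference_text = "obesity rates rise as diet quality falls"
test_text = "election rates rise as the election nears"

oov_tokens, oov_types = count_oov(
    reference_text.lower().split(), test_text.lower().split()
)
print(oov_tokens, oov_types)
```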

The SRILM14 toolkit was used to generate language models and calculate perplexities. Default modeling and perplexity settings were used (trigram models, Good-Turing discounting, Katz back-off smoothing). Language models for Reuters news articles were generated using the first page of the news text published online. The Lingpipe15 toolkit was used for text preprocessing (e.g. sentence chunking).

Results

Table 1 contains sizing information for the corpora used in the study. The Reuters general news corpus contained 25,027 news articles, the health corpus contained 2,675 articles and the complete obesity corpus contained 288 articles. The reference obesity news language model was trained using a random sample of 144 obesity news articles. Random samples of 121 and 19,825 news articles from Gigaword were used as the cross-publisher corpora.

Table 1.

Corpus Sizes

| Corpus | # News Articles | # Sentences | # Words |
| Reuters Obesity News (Reference) | 144 | 1,551 | 41,143 |
| Reuters Obesity News (Test) | 144 | 1,630 | 42,308 |
| Reuters Health News (small sample) | 144 | 1,398 | 36,234 |
| Reuters Health News | 2,675 | 32,206 | 730,285 |
| Reuters General News (small sample) | 144 | 1,398 | 34,614 |
| Reuters General News | 25,027 | 233,778 | 5,854,300 |
| Gigaword (small sample) | 121 | 2,681 | 56,233 |
| Gigaword | 19,825 | 394,849 | 8,460,935 |

Table 2 contains the perplexity results. Perplexity increased as content became more general relative to the obesity news content. This trend existed even for LMs generated using corpus sizes comparable to the reference corpus. For corpora of comparable size, relative to the obesity news reference LM, the obesity news test LM perplexity was 187, the general health news LM perplexity was 278, the general news LM perplexity was 378 and the Gigaword LM perplexity was 382. This is a 48% increase in perplexity for general health news, a 101% increase for general news and a 103% increase for Gigaword. The LMs for the larger general health, general news and Gigaword corpora did not produce appreciably higher perplexity results. Perplexities are even greater and increase even more rapidly when sentence boundaries are considered (54%, 129% and 149% increases in perplexity for Reuters general health, Reuters general news and Gigaword small corpora respectively). The number of out-of-vocabulary words relative to the obesity news reference corpus also increases as content becomes more general. As with perplexity, this trend in the number of out-of-vocabulary words exists even when LMs are constructed using corpora of the same size as the reference corpus.

Table 2.

Perplexity Study Results

| Corpus | Perplexity | Perplexity (w/ sentence boundaries) | # Out-of-Vocabulary Words |
| Reuters Obesity News | 187.945 | 236.747 | 5,335 |
| Reuters Health News (small sample) | 278.708 | 366.293 | 7,431 |
| Reuters Health News | 275.271 | 362.576 | 135,532 |
| Reuters General News (small sample) | 378.453 | 542.888 | 11,614 |
| Reuters General News | 372.451 | 531.705 | 1,966,504 |
| Gigaword (small sample) | 382.819 | 591.633 | 19,604 |
| Gigaword | 387.677 | 594.937 | 2,966,079 |
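The percentage increases quoted in the Results can be recomputed from the Table 2 values. The short sketch below does this for the small-sample corpora without sentence boundaries; the variable and function names are illustrative. The whole-number percentages in the text (48%, 101%, 103%) correspond to these ratios with the fractional part dropped.

```python
# Perplexities without sentence boundaries, copied from Table 2.
OBESITY_TEST = 187.945
SMALL_SAMPLES = {
    "Reuters Health News": 278.708,
    "Reuters General News": 378.453,
    "Gigaword": 382.819,
}

def percent_increase(value, baseline):
    """Relative increase of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0

increases = {
    name: percent_increase(ppl, OBESITY_TEST)
    for name, ppl in SMALL_SAMPLES.items()
}

for name, pct in increases.items():
    print(f"{name}: {pct:.1f}% higher than the obesity test set")
```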

Discussion

These results indicate that language model perplexity can measure the similarity of news content to obesity news coverage. As discussed earlier, perplexity is a measure of the ability of a probability distribution to predict events in another distribution. To better understand the meaning behind the perplexity results consider that if language model B has a perplexity of 50 relative to language model A, and tri-gram language models are used, this is equivalent to saying that given the knowledge of language model A, on average there are only 50 possible words that could follow any 2 word sequence in the content used to generate language model B. Therefore in this study a perplexity of 378 for general news relative to obesity news means that given a two-word sequence from the general news corpus, knowledge gained from modeling the obesity news corpus on average only limits the potential subsequent word in the general news content to 1 in 378 possibilities. Consequently an increase in perplexity of 101% when going from the obesity news LM to the general news LM represents noteworthy restrictions in the obesity news content’s vocabulary and language use compared to the general news content.
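The reading of perplexity as a number of equally likely choices follows from the definitions in the Methods. As a short worked case (an illustration, not part of the original analysis), suppose the model spreads its probability uniformly over k possible outcomes, so that each has probability 1/k:

```latex
H = -\sum_{i=1}^{k} \frac{1}{k} \log_2 \frac{1}{k} = \log_2 k,
\qquad
PP = 2^{H} = 2^{\log_2 k} = k
```

With k = 378 the perplexity is 378, so a perplexity of 378 describes a model that is, on average, as uncertain about the next word as if it were choosing uniformly among 378 candidates.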

Also noteworthy are the sentence boundary and Gigaword perplexity results. The perplexity results when sentence boundaries were considered were noticeably higher than when sentence boundaries were ignored. This is significant because it indicates that obesity news content possesses sentence structure characteristics that differ from more general news content. Since statistical language models can capture such differences more easily than simple keyword searches can, this suggests that statistical language modeling and perplexity may be better than keyword searching at creating a corpus of health news content.

The perplexity results for the Gigaword corpus were of interest because they were not considerably higher than the perplexities for the Reuters general news corpus. This suggests that language use across news publishers may not be substantially different and the results from this study may be generalizable across news sources.

One limitation of this study is the size of the obesity news corpus. Because of its limited size the obesity news reference corpus may be artificially restricted in its use of vocabulary and syntax. This could produce exaggerated perplexity results. This limitation may be mitigated somewhat by calculating perplexities using corpora with sizes similar to the reference corpus, but a larger reference set may produce more accurate, reliable and generalizable results. To address this limitation future studies should use a larger reference corpus, perhaps collected over a longer period of time. Also, news content of other health topics should be analyzed to identify if these findings are limited to obesity news coverage or if they also apply to other health news topics.

Conclusion

The study results were consistent with expectations. Despite the noted limitations the results suggest that statistical language modeling and perplexity can be used to measure the similarity of news coverage to obesity news and perhaps other news content specific to particular health topics. Future research should try to generalize the findings of this study to include health topics other than obesity. Also, an automated health news classifier based on statistical language modeling and perplexity should be developed and evaluated for use. An accurate and easy to use health news classification system would benefit health news researchers by facilitating the classification of health news content, and the public by providing more robust ways of acquiring news content that is relevant to their health concerns.

Acknowledgments

Special thanks to Stephen Johnson for providing much needed insight into this research. McFarlane is supported by NLM training grant LM007079-15.

References

  • 1. Iyengar S, Reeves R. Overview. In: Iyengar S, editor. Do the media govern? Politicians, voters, and reporters in America. Thousand Oaks, CA: Sage Publications; 1997. pp. 211–216.
  • 2. Angermeyer MC, Schulze B. Reinforcing stereotypes: How the focus on forensic cases in news reporting may influence public …. International Journal of Law and Psychiatry. 2001;24:469–486. doi: 10.1016/s0160-2527(01)00079-6.
  • 3. Au JS, Yip PS, Chan CL, Law YW. Newspaper reporting of suicide cases in Hong Kong. Crisis. 2004;25:161–8. doi: 10.1027/0227-5910.25.4.161.
  • 4. Milazzo S, Ernst E. Newspaper coverage of complementary and alternative therapies for cancer--UK 2002–2004. Supportive Care in Cancer. 2006;14:885–9. doi: 10.1007/s00520-006-0068-z.
  • 5. Corrigan PW, Watson AC, Gracia G, Slopen N, Rasinski K, Hall LL. Newspaper stories as measures of structural stigma. Psychiatric Services. 2005;56:551–6. doi: 10.1176/appi.ps.56.5.551.
  • 6. Ransom I. “Chinese company says sent apology letter to Greeks”. Reuters. April 10, 2008. Available from http://www.reuters.com/article/healthNews/idUSSP8979620080410 (Accessed July 31, 2009).
  • 7. Calloway C, Jorgensen CM, Saraiya M, Tsui J. A content analysis of news coverage of the HPV vaccine by U.S. newspapers, January 2002–June 2005. Journal of Women’s Health (2002). 2006;15(7):803–9. doi: 10.1089/jwh.2006.15.803.
  • 8. Brown PF, Pietra VJD, Mercer RL, Pietra SAD, Lai JC. An estimate of an upper bound for the entropy of English. Computational Linguistics. 1992;18
  • 9. Stetson PD, Johnson SB, Scotch M, Hripcsak G. The sublanguage of cross-coverage. AMIA Annual Symposium Proceedings. 2002:742–6.
  • 10. Elhadad N. User-Sensitive Text Summarization: Application to the Medical Domain. Ph.D. Thesis, 2006.
  • 11. Chen SF, Goodman J. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report, 1998.
  • 12. McFarlane DJ, Kukafka R. Using term frequency to identify trends in the media’s coverage of health. AMIA Annual Symposium Proceedings. 2007;1045
  • 13. Graff D. English Gigaword. Linguistic Data Consortium; Philadelphia: 2003.
  • 14. Stolcke A. SRILM-an Extensible Language Modeling Toolkit. Seventh International Conference on Spoken Language …. 2002
  • 15. Alias-i. LingPipe. http://alias-i.com/lingpipe (accessed in 2008).

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
