2020 Sep 16;26(1):89–113. doi: 10.1007/s11366-020-09697-1

Table 2.

The basic news data (January 1 to May 15, 2020)

| Corpus | Newspaper | Language | Number of articles | Tokens^a |
|---|---|---|---|---|
| Chinese Corpus | People’s Daily (人民日报) | Chinese | 126 | 663,329 |
| | Guangming Daily (光明日报) | Chinese | 103 | |
| | Xinhua Daily News (新华每日快讯) | Chinese | 118 | |
| English Corpus | China Daily | English | 506 | 395,336 |
| | Beijing Review | English | 263 | |

a. The tokenization process is explained in the KH Coder manual. Specifically, KH Coder uses the Stanford POS Tagger to lemmatize the separated words and the Snowball Stemmer for stemming. The Snowball Stemmer applies a different set of rules for each language. Details of the tokenization process can be found at: https://khcoder.net/en/manual_en_v3.pdf, p. 17
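
To make concrete what a token count such as those in Table 2 measures, the following is a minimal sketch that splits a few English sentences into word tokens and totals them. It does not reproduce KH Coder's pipeline: the regular-expression tokenizer, the function names `tokenize` and `count_tokens`, and the sample sentences are all assumptions made for illustration, and no lemmatization or stemming is applied.

```python
import re
from collections import Counter

# Hypothetical, deliberately crude tokenizer: alphabetic words with an
# optional internal apostrophe. This is NOT the KH Coder procedure.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphabetic word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_tokens(articles: list[str]) -> tuple[int, Counter]:
    """Return the total token count and the frequency of each word type."""
    freq: Counter = Counter()
    for article in articles:
        freq.update(tokenize(article))
    return sum(freq.values()), freq


# Invented sample sentences, not drawn from the corpora in Table 2.
sample = [
    "The virus spread quickly.",
    "Officials said the virus was contained.",
]
total, freq = count_tokens(sample)
print(total, freq["virus"])  # prints: 10 2
```

A tool that lemmatizes or stems before counting would merge inflected forms (for example "spread" and "spreads") into one word type, while the total number of tokens would stay the same under this simple scheme.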