2020 Sep 16;26(1):89–113. doi: 10.1007/s11366-020-09697-1

Table 2.

The basic news data (January 1 to May 15, 2020)

Corpus	Newspaper	Language	Number of articles	Tokens^a
Chinese Corpus	People’s Daily (人民日报)	Chinese	126	663,329
	Guangming Daily (光明日报)	Chinese	103
	Xinhua Daily News (新华每日快讯)	Chinese	118
English Corpus	China Daily	English	506	395,336
	Beijing Review	English	263

^aThe tokenization process is explained in the KH Coder manual. Specifically, KH Coder uses the Stanford POS Tagger to lemmatize the separated words and the Snowball Stemmer for stemming. The Snowball Stemmer applies a separate set of rules for each language. Details of the tokenization process can be found at: https://khcoder.net/en/manual_en_v3.pdf, p. 17
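
The per-language totals implied by Table 2 can be tallied with a short sketch. The rows are transcribed from the table and grouped by the Language column; the names `ARTICLES`, `TOKENS`, and `summarize` are hypothetical and introduced only for illustration. It assumes that each token count in the table is the total for all newspapers in that language, which the table layout suggests but does not state outright.

```python
from collections import defaultdict

# Rows transcribed from Table 2: (newspaper, language, number of articles).
ARTICLES = [
    ("People's Daily", "Chinese", 126),
    ("Guangming Daily", "Chinese", 103),
    ("Xinhua Daily News", "Chinese", 118),
    ("China Daily", "English", 506),
    ("Beijing Review", "English", 263),
]

# Assumption: each token count in Table 2 covers every newspaper
# published in that language, not only the first one listed.
TOKENS = {"Chinese": 663_329, "English": 395_336}


def summarize(articles, tokens):
    """Sum article counts per language and derive mean tokens per article."""
    totals = defaultdict(int)
    for _name, language, count in articles:
        totals[language] += count
    return {
        language: {
            "articles": n,
            "tokens": tokens[language],
            "tokens_per_article": round(tokens[language] / n, 1),
        }
        for language, n in totals.items()
    }


if __name__ == "__main__":
    for language, stats in summarize(ARTICLES, TOKENS).items():
        print(language, stats)
```

Under that assumption, the Chinese-language newspapers contribute 347 articles and the English-language newspapers 769.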