Table 2.
Basic news data (January 1 to May 15, 2020)
| Corpus | Newspaper | Language | Number of articles | Tokens^a |
|---|---|---|---|---|
| Chinese Corpus | People’s Daily (人民日报) | Chinese | 126 | 663,329 |
| | Guangming Daily (光明日报) | Chinese | 103 | |
| | Xinhua Daily News (新华每日快讯) | Chinese | 118 | |
| English Corpus | China Daily | English | 506 | 395,336 |
| | Beijing Review | English | 263 | |
^a The tokenization process is explained in the KH Coder manual. Specifically, KH Coder uses the Stanford POS Tagger to lemmatize the separated words and the Snowball Stemmer for stemming. The Snowball Stemmer applies a different set of rules for each language. Details of the tokenization process can be found at: https://khcoder.net/en/manual_en_v3.pdf, p. 17.