The Language Corpus System of Modern Chinese Study (LCSMCS) word frequencies, based on a corpus of 20 million characters, of which 2 million have been segmented into words and tagged with their parts of speech (PoS) [8] (available at http://www.dwhyyjzx.com/cgi-bin/yuliao/, checked on September 24, 2009).
|
|
The Lancaster Corpus of Mandarin Chinese (LCMC), based on a corpus of 73 million characters (50 million words; see http://www.lancs.ac.uk/fass/projects/corpus/LCMC/, checked on September 24, 2009). This is the corpus underlying A Frequency Dictionary of Mandarin Chinese: Core Vocabulary for Learners [6].
|
The Academia Sinica Balanced Corpus of Modern Chinese, based on 5 million characters and compiled by the Institute of Information Science and the CKIP group at Academia Sinica (http://www.sinica.edu.tw/SinicaCorpus/, checked on September 24, 2009).
|
Draft for modern Chinese word set for common use (2008), compiled by the State Language Commission of China [9]. This list contains 56,008 frequency-ranked words. The frequencies are based on three sources: a segmented part of 45 million characters from the Chinese (General) Balanced Corpus, a segmented corpus of 135 million characters based on People's Daily 2001-2005, and a modern Chinese literature corpus of 70 million characters constructed by Xiamen University. The word frequencies themselves, however, are not yet publicly available.
|
Word list YW2001 (92,843 words), reported by Sun et al. [10] as a major outcome of the national corpus project of the Chinese Information Processing Platform and based on a corpus of ca. 800 million characters. The list is not yet publicly accessible.
|