2010 Jun 2;5(6):e10729. doi: 10.1371/journal.pone.0010729

Table 1. Word frequency lists of Chinese.

  • The Language Corpus System of Modern Chinese Study (LCSMCS) word frequencies, based on a corpus of 20 million characters, of which 2 million have been segmented into words and assigned their parts-of-speech (PoS) [8] (available at http://www.dwhyyjzx.com/cgi-bin/yuliao/, checked on September 24, 2009).

  • The Lancaster Corpus of Mandarin Chinese (LCMC), based on a corpus of 73 million characters (50 million words; see http://www.lancs.ac.uk/fass/projects/corpus/LCMC/, checked on September 24, 2009). This is the corpus underlying A Frequency Dictionary of Mandarin Chinese: Core Vocabulary for Learners [6].

  • The Academia Sinica Balanced Corpus of Modern Chinese based on 5 million characters and compiled by the Institute of Information Science and the CKIP group in Academia Sinica (http://www.sinica.edu.tw/SinicaCorpus/, checked on September 24, 2009).

  • Draft for modern Chinese word set for common use (2008), compiled by the State Language Commission of China [9]. This list contains 56,008 frequency-ranked words. Their frequencies are based on three sources: a segmented part of 45 million characters from the Chinese (General) Balanced Corpus, a segmented corpus of 135 million characters based on People's Daily 2001-2005, and a modern Chinese literature corpus of 70 million characters constructed by Xiamen University. The word frequencies themselves, however, are not yet publicly available.

  • Word list YW2001 (92,843 words), reported by Sun et al. [10] as part of the major outcome of the national corpus project by the Chinese Information Processing Platform and based on a corpus of ca. 800 million characters. The list is not yet publicly accessible.