2011 Apr 12;366(1567):1101–1107. doi: 10.1098/rstb.2010.0315

Table 1.

Language corpora consulted, grouped by language family. Corpus sizes are taken from the documentation of each corpus. Language classification follows the online Ethnologue database [9].

language family         language         size (no. of words)
Indo-European           English          100 million
                        Russian          140 million
                        Greek            47 million
                        Portuguese       45 million
                        Spanish          1 million
                        Chilean Spanish  450 million
                        French           31 390 000
                        Czech            100 million
                        Polish           450 million
Sino-Tibetan            Chinese          1 million
Uralic                  Finnish          21 329 990
                        Estonian         1 million
Niger-Congo             Swahili          2 million
Altaic                  Turkish          2 million
Austronesian            Māori            1 million
unclassified languages  Basque           5 million
Creole                  Tok Pisin        864 900