PLoS One. 2015 Jul 9;10(7):e0129031. doi: 10.1371/journal.pone.0129031

Table 6. Size of the vocabulary V (i.e., number of types) when texts are decomposed into different kinds of types: word-lemma-tag (w-l-t), plain words, lemma-POS (l-pos), lemma-POS of words in the dictionary (l-pos dic), lemmas, and lemmas of words in the dictionary (lemma dic).

The last of these is the most radical transformation, as it yields the largest reduction in the resulting vocabulary.

| Text | w-l-t | word | l-pos | l-pos dic | lemma | lemma dic |
|---|---:|---:|---:|---:|---:|---:|
| Clarissa | 23624 | 20492 | 17058 | 10315 | 15356 | 9041 |
| Moby-Dick | 20777 | 18516 | 15774 | 10426 | 14226 | 9141 |
| Ulysses | 32952 | 29450 | 26412 | 14136 | 24089 | 12469 |
| Don Quijote | 23359 | 21180 | 11872 | 7906 | 11128 | 7432 |
| La Regenta | 24053 | 21871 | 12509 | 10500 | 11768 | 9900 |
| Artamène | 31574 | 25161 | 7605 | 5349 | 7177 | 5008 |
| Bragelonne | 28803 | 25775 | 12994 | 11342 | 12127 | 10744 |
| Seitsemän | 22851 | 22035 | 9749 | 7788 | 9607 | 7658 |
| Kevät ja | 26087 | 25071 | 9897 | 9054 | 9733 | 8898 |
| Vanhempieni | 37247 | 35931 | 14751 | 13678 | 14566 | 13510 |
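
To make the six columns concrete, the following minimal sketch counts types under each decomposition for a tiny hand-annotated token list, and then checks against the table values that "lemma dic" is the smallest vocabulary for every text. The `Token` fields, the toy tokens, and the function names are hypothetical and only illustrative. In particular, the sketch assumes that "tag" is a fine-grained morphological tag, that "POS" is a coarse part of speech, and that "dic" restricts counting to tokens whose word is found in the tagger's dictionary; the actual annotation scheme may differ.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated token (hypothetical schema, not the authors' format)."""
    word: str      # surface form as it appears in the text
    lemma: str     # dictionary form
    pos: str       # coarse part of speech (assumed meaning of "POS")
    tag: str       # fine-grained tag (assumed meaning of "tag")
    in_dict: bool  # True if the word was found in the tagger's dictionary


# Toy data, invented for illustration only.
TOKENS = [
    Token("the", "the", "D", "DT", True),
    Token("whales", "whale", "N", "NNS", True),
    Token("swim", "swim", "V", "VBP", True),
    Token("the", "the", "D", "DT", True),
    Token("whale", "whale", "N", "NN", True),
    Token("swims", "swim", "V", "VBZ", True),
    Token("swim", "swim", "N", "NN", True),
    Token("Queequeg", "queequeg", "N", "NNP", False),
]


def vocabulary_sizes(tokens):
    """Return the number of distinct types under each decomposition."""
    dic = [t for t in tokens if t.in_dict]
    return {
        "w-l-t": len({(t.word, t.lemma, t.tag) for t in tokens}),
        "word": len({t.word for t in tokens}),
        "l-pos": len({(t.lemma, t.pos) for t in tokens}),
        "l-pos dic": len({(t.lemma, t.pos) for t in dic}),
        "lemma": len({t.lemma for t in tokens}),
        "lemma dic": len({t.lemma for t in dic}),
    }


# Values copied from Table 6, in column order:
# (w-l-t, word, l-pos, l-pos dic, lemma, lemma dic)
TABLE6 = {
    "Clarissa": (23624, 20492, 17058, 10315, 15356, 9041),
    "Moby-Dick": (20777, 18516, 15774, 10426, 14226, 9141),
    "Ulysses": (32952, 29450, 26412, 14136, 24089, 12469),
    "Don Quijote": (23359, 21180, 11872, 7906, 11128, 7432),
    "La Regenta": (24053, 21871, 12509, 10500, 11768, 9900),
    "Artamène": (31574, 25161, 7605, 5349, 7177, 5008),
    "Bragelonne": (28803, 25775, 12994, 11342, 12127, 10744),
    "Seitsemän": (22851, 22035, 9749, 7788, 9607, 7658),
    "Kevät ja": (26087, 25071, 9897, 9054, 9733, 8898),
    "Vanhempieni": (37247, 35931, 14751, 13678, 14566, 13510),
}


def relative_reduction(row):
    """Fraction of the w-l-t vocabulary removed by the lemma-dic transformation."""
    return 1 - row[-1] / row[0]


if __name__ == "__main__":
    print(vocabulary_sizes(TOKENS))
    for title, row in TABLE6.items():
        print(f"{title}: {relative_reduction(row):.1%} fewer types")
```

Sets of tuples are used so that two tokens count as the same type only when every component of the chosen decomposition matches; dropping a component (for example the tag) can only merge types, which is why the counts cannot grow when moving from w-l-t to lemmas.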