Table 3.
Quantitative description of the different corpora used.
Corpus | Number of types | Number of tokens | Average document size | Number of documents |
---|---|---|---|---|
TASA | 57,800 | 5,285,933 | 140.41 | 37,600 |
Wikipedia | 66,035 | 7,015,782 | 175.39 | 40,000 |
Fiction | 66,632 | 3,964,482 | 101.56 | 40,000 |
Non-Fiction | 60,917 | 2,860,230 | 114.41 | 25,000 |
Mixed | 81,349 | 13,134,480 | 131.35 | 100,000 |
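
The columns above can be reproduced for any collection of documents with a few lines of code. The following is a minimal sketch, not the authors' pipeline: it assumes lowercased whitespace tokenization, and it assumes that average document size means tokens per document, a reading consistent with the Wikipedia row (7,015,782 / 40,000 ≈ 175.39). The function name `corpus_stats` is hypothetical.

```python
from collections import Counter


def corpus_stats(documents):
    """Compute type, token, and document counts for an iterable of strings.

    Assumption: tokens are lowercased, whitespace-separated strings.
    The source does not specify the tokenizer actually used.
    """
    counts = Counter()
    n_docs = 0
    for doc in documents:
        counts.update(doc.lower().split())  # count every token occurrence
        n_docs += 1

    n_tokens = sum(counts.values())
    return {
        "types": len(counts),        # distinct word forms
        "tokens": n_tokens,          # total word occurrences
        "documents": n_docs,
        # Assumed definition: mean number of tokens per document.
        "avg_doc_size": n_tokens / n_docs if n_docs else 0.0,
    }


if __name__ == "__main__":
    sample = ["the cat sat", "The dog sat down"]
    print(corpus_stats(sample))
```

A different tokenizer (for example, one that strips punctuation or removes stop words) would change the type and token counts, so figures computed this way are comparable to the table only under the same preprocessing.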