Table 6.
Preprocessing Steps | No. of Tokens | Vocabulary Size | Vocabulary Size as % of Tokens | TTR |
---|---|---|---|---|
After tokenization | 1,167,630 | 89,696 | 7.681885529 | 0.077 |
After stopwords removal | 870,521 | 89,003 | 10.22410717 | 0.102 |
After punctuation removal | 746,292 | 88,987 | 11.92388502 | 0.119 |
Alphanumeric to alphabetic word | 746,292 | 86,271 | 11.5599524 | 0.116 |
After single-letter word removal | 620,133 | 86,098 | 13.8837959 | 0.139 |
After lemmatization | 620,133 | 50,043 | 8.069720528 | 0.081 |
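
The last two columns of Table 6 can be reproduced from the first two. The following is a minimal sketch, assuming the type-token ratio (TTR) is computed as vocabulary size divided by number of tokens, which is consistent with every row of the table. The names `TABLE_6` and `type_token_ratio` are illustrative and do not come from the original work. The vocabulary size for the punctuation-removal row is taken as 88,987, the value consistent with the reported 11.92%.

```python
# (number of tokens, vocabulary size) per preprocessing step, copied from Table 6.
# The dictionary name and layout are assumptions made for this illustration.
TABLE_6 = {
    "After tokenization": (1_167_630, 89_696),
    "After stopwords removal": (870_521, 89_003),
    "After punctuation removal": (746_292, 88_987),
    "Alphanumeric to alphabetic word": (746_292, 86_271),
    "After single-letter word removal": (620_133, 86_098),
    "After lemmatization": (620_133, 50_043),
}


def type_token_ratio(num_tokens: int, vocab_size: int) -> float:
    """Return vocabulary size (types) divided by the number of tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return vocab_size / num_tokens


if __name__ == "__main__":
    for step, (tokens, vocab) in TABLE_6.items():
        ttr = type_token_ratio(tokens, vocab)
        # The percentage column is the same ratio scaled by 100.
        print(f"{step}: TTR = {ttr:.3f} ({100 * ttr:.2f}%)")
```

Read this way, stopword, punctuation, and single-letter removal raise the ratio because they shrink the token count much faster than the vocabulary, while lemmatization lowers it because it merges word forms without removing any tokens.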