Table 8.
Corpus | # Documents | # Words / # Unique Words |
---|---|---|
WSJ-400 | 400 | 214 K / 19 K |
WSJ-600 | 600 | 309 K / 23.5 K |
WSJ-1300 | 1,300 | 680 K / 36 K |
WSJx2 | 2,600 | 1.3 M / 36 K |
WSJx3 | 3,900 | 2.6 M / 36 K |
WSJs5 | 3,246 (±40) | 1.69 M (±42 K) / 36 K |
Synthetic corpora with various levels of redundancy. For WSJs5, we report averages and standard deviations based on 10 replications.