Table 8.
Descriptive statistics of the corpora
| Corpus | # Documents | # Words / # Unique Words |
|---|---|---|
| WSJ-400 | 400 | 214 K / 19 K |
| WSJ-600 | 600 | 309 K / 23.5 K |
| WSJ-1300 | 1,300 | 680 K / 36 K |
| WSJx2 | 2,600 | 1.3 M words / 36 K |
| WSJx3 | 3,900 | 2.6 M words / 36 K |
| WSJs5 | 3,246 (±40) | 1.69 M (±42 K) words / 36 K |
Synthetic corpora with various levels of redundancy. For WSJs5, we report averages and standard deviations based on 10 replications.