2013 Jan 16;14:10. doi: 10.1186/1471-2105-14-10

Table 8.

Corpora descriptive statistics

Corpus      # Documents     # Words / # Unique Words
WSJ-400     400             214 K / 19 K
WSJ-600     600             309 K / 23.5 K
WSJ-1300    1,300           680 K / 36 K
WSJx2       2,600           1.3 M / 36 K
WSJx3       3,900           2.6 M / 36 K
WSJs5       3,246 (±40)     1.69 M (±42 K) / 36 K

Synthetic corpora with various levels of redundancy. For WSJs5, we report averages and standard deviations based on 10 replications.