2013 Jan 16;14:10. doi: 10.1186/1471-2105-14-10

Table 4.

Comparison of extracted collocations

| Corpus name | Corpus type | Size of corpus (# words / # distinct words) | # extracted collocations (TMI / PMI) | Average # documents per collocation |
|---|---|---|---|---|
| WSJ-400 | Non-redundant | 214 K / 19 K | 551 / 565 | 20.2 / 19.9 |
| WSJ-600 | Non-redundant | 309 K / 23.5 K | 943 / 1,000 | 15.5 / 15.2 |
| WSJ-1300 | Non-redundant | 680 K / 36 K | 1,881 / 2,518 | 10.8 / 9.7 |
| WSJs5 | Synthetic Redundant | 1.69 M (±42 K) / 36 K | 3,035 (±63) / 17,015 (±950) | 7.4 (±0.11) / 2.8 (±0.09) |

Comparison of extracted collocations on synthetic redundant corpora and non-redundant corpora (WSJ – X words / Y distinct words). Collocations were extracted using True Mutual Information (TMI) and Pointwise Mutual Information (PMI), with cutoffs of 0.001 and 0.01, respectively.
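
To make the two scoring measures concrete, the following is a minimal sketch of how bigram collocations might be scored and filtered. It assumes the common textbook definitions, PMI(x, y) = log2(p(x, y) / (p(x) p(y))) and a per-pair TMI contribution of p(x, y) · PMI(x, y), with probabilities estimated from raw counts of adjacent word pairs. The function names `score_bigrams` and `extract_collocations` are hypothetical, and the exact estimator, log base, and tokenization behind the table may differ; only the cutoff values (0.001 for TMI, 0.01 for PMI) are taken from the caption.

```python
import math
from collections import Counter


def score_bigrams(tokens):
    """Return {bigram: (pmi, tmi)} for adjacent word pairs in a token list.

    Assumed definitions (not confirmed by the table caption):
      PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))
      TMI(x, y) = p(x, y) * PMI(x, y)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[x] / n_uni    # marginal probability of the first word
        p_y = unigrams[y] / n_uni    # marginal probability of the second word
        pmi = math.log2(p_xy / (p_x * p_y))
        tmi = p_xy * pmi             # PMI weighted by how often the pair occurs
        scores[(x, y)] = (pmi, tmi)
    return scores


def extract_collocations(tokens, pmi_cutoff=0.01, tmi_cutoff=0.001):
    """Return (pairs passing the PMI cutoff, pairs passing the TMI cutoff)."""
    scores = score_bigrams(tokens)
    by_pmi = {bg for bg, (pmi, _) in scores.items() if pmi > pmi_cutoff}
    by_tmi = {bg for bg, (_, tmi) in scores.items() if tmi > tmi_cutoff}
    return by_pmi, by_tmi


if __name__ == "__main__":
    toy = "new york is big and new york is old".split()
    by_pmi, by_tmi = extract_collocations(toy)
    print(sorted(by_pmi))
    print(sorted(by_tmi))
```

Because the assumed TMI multiplies PMI by the joint probability of the pair, it gives less weight to pairs that are strongly associated but rare, whereas PMI alone can rank such pairs highly. Under that assumption, the two measures can select quite different sets of collocations from the same corpus.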