2013 Jan 16;14:10. doi: 10.1186/1471-2105-14-10

Table 4.

Comparison of extracted collocations

Corpus name	Corpus type	Corpus size (# words / # distinct words)	# extracted collocations (TMI / PMI)	Average # documents per collocation (TMI / PMI)
WSJ-400	Non-redundant	214 K / 19 K	551/565	20.2/19.9
WSJ-600	Non-redundant	309 K / 23.5 K	943/1,000	15.5/15.2
WSJ-1300	Non-redundant	680 K / 36 K	1,881/2,518	10.8/9.7
WSJs5	Synthetic Redundant	1.69 M (±42 K) / 36 K	3,035 (±63) / 17,015 (±950)	7.4 (±0.11) / 2.8 (±0.09)

Comparison of collocations extracted from synthetic redundant and non-redundant corpora (WSJ – X words / Y distinct words). Collocations were extracted using True Mutual Information (TMI) and Pointwise Mutual Information (PMI), with cutoffs of 0.001 and 0.01 respectively.
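
As an illustration of the two measures named in the caption, the following minimal sketch scores adjacent word pairs and applies the two cutoffs. It is not the extraction pipeline behind the table. It assumes the common definitions PMI(x, y) = log2(p(x, y) / (p(x) p(y))) and TMI(x, y) = p(x, y) · PMI(x, y), a base-2 logarithm, and simple maximum-likelihood probability estimates; the function names `bigram_scores` and `extract_collocations` are hypothetical.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (tmi, pmi)} for every adjacent word pair.

    Assumed definitions (not confirmed by the table caption):
      PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))
      TMI(x, y) = p(x, y) * PMI(x, y)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[w1] / n_uni   # marginal probability of the first word
        p_y = unigrams[w2] / n_uni   # marginal probability of the second word
        pmi = math.log2(p_xy / (p_x * p_y))
        tmi = p_xy * pmi             # PMI weighted by the pair's own probability
        scores[(w1, w2)] = (tmi, pmi)
    return scores


def extract_collocations(tokens, tmi_cutoff=0.001, pmi_cutoff=0.01):
    """Return the sets of pairs whose TMI / PMI exceed the given cutoffs."""
    scores = bigram_scores(tokens)
    by_tmi = {pair for pair, (tmi, _) in scores.items() if tmi > tmi_cutoff}
    by_pmi = {pair for pair, (_, pmi) in scores.items() if pmi > pmi_cutoff}
    return by_tmi, by_pmi


if __name__ == "__main__":
    text = "new york is big and new york is old".split()
    tmi_set, pmi_set = extract_collocations(text)
    print(sorted(tmi_set))
    print(sorted(pmi_set))
```

Because PMI ignores how often a pair occurs, a pair of rare words that co-occur only a few times can clear the PMI cutoff, whereas the p(x, y) weighting in TMI favors frequent pairs. Under these assumed definitions, that difference is one plausible way to read the much larger PMI count on the redundant corpus in the table.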