Table 4.
Comparison of extracted collocations
| Corpus name | Corpus type | Corpus size (# words / # distinct words) | # extracted collocations (TMI / PMI) | Average # documents per collocation |
|---|---|---|---|---|
| WSJ-400 | Non-redundant | 214 K / 19 K | 551/565 | 20.2/19.9 |
| WSJ-600 | Non-redundant | 309 K / 23.5 K | 943/1,000 | 15.5/15.2 |
| WSJ-1300 | Non-redundant | 680 K / 36 K | 1,881/2,518 | 10.8/9.7 |
| WSJs5 | Synthetic redundant | 1.69 M (±42 K) / 36 K | 3,035 (±63) / 17,015 (±950) | 7.4 (±0.11) / 2.8 (±0.09) |
Comparison of collocations extracted from synthetic redundant and non-redundant corpora (WSJ – X words / Y distinct words). Collocations were extracted using True Mutual Information (TMI) and Pointwise Mutual Information (PMI), with cutoffs of 0.001 and 0.01, respectively.
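
To make the two scores in the caption concrete, the following is a minimal sketch of threshold-based collocation extraction. It is not the extraction pipeline used for Table 4: it assumes collocations are adjacent word pairs (bigrams), that PMI is defined as log2(p(x,y) / (p(x) p(y))), and that TMI is that value weighted by the joint probability p(x,y). The function name `extract_collocations` and its parameters are hypothetical; only the default cutoffs (0.001 for TMI, 0.01 for PMI) come from the caption.

```python
import math
from collections import Counter


def extract_collocations(tokens, tmi_cutoff=0.001, pmi_cutoff=0.01):
    """Score adjacent word pairs with PMI and TMI and keep those above a cutoff.

    Assumed definitions (not confirmed by the source):
      PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))
      TMI(x, y) = p(x, y) * PMI(x, y)
    Probabilities are maximum-likelihood estimates from raw counts.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)  # guard against very short inputs

    tmi_hits, pmi_hits = {}, {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        pmi = math.log2(p_xy / (p_x * p_y))
        tmi = p_xy * pmi
        if pmi > pmi_cutoff:
            pmi_hits[(x, y)] = pmi
        if tmi > tmi_cutoff:
            tmi_hits[(x, y)] = tmi
    return tmi_hits, pmi_hits


# Toy usage on a hand-made token list (illustrative data, not from the WSJ corpora).
tokens = "new york is big and new york is old and the city is new".split()
tmi_hits, pmi_hits = extract_collocations(tokens)
print(len(tmi_hits), len(pmi_hits))
```

Under these assumed definitions, TMI scales each pair's PMI by how often the pair occurs, whereas PMI depends only on the ratio of joint to independent probability. This difference in weighting may be relevant when comparing the TMI and PMI columns of the table.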