Table 5.
Performance of different clustering algorithms on the 20NG and RCV1 datasets. CHINC is our proposed method. BOW denotes bag-of-words features; FB (Freebase) and YG (YAGO2) denote the entity features generated by our world knowledge specification approach based on Freebase and YAGO2, respectively. We compare all HINC and CHINC results against CITCC, the strongest baseline; the percentages in parentheses are the relative changes with respect to CITCC. CITCC uses 250K constraints generated from ground-truth document labels.
| Data | Kmeans (BOW) | Kmeans (BOW+FB) | Kmeans (BOW+YG) | ITCC (BOW) | ITCC (BOW+FB) | ITCC (BOW+YG) | CITCC (BOW) | HINC (FB) | HINC (YG) | CHINC (FB) | CHINC (YG) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20NG | 0.429 | 0.447 | 0.437 | 0.501 | 0.525 | 0.513 | 0.569 | 0.571 (+0.4%) | 0.541 (−4.9%) | 0.631 (+10.9%) | 0.600 (+5.5%) |
| MCAT | 0.549 | 0.575 | 0.559 | 0.604 | 0.630 | 0.619 | 0.652 | 0.645 (−1.1%) | 0.625 (−4.1%) | 0.698 (+7.1%) | 0.685 (+5.1%) |
| CCAT | 0.403 | 0.419 | 0.410 | 0.472 | 0.494 | 0.481 | 0.535 | 0.542 (+1.3%) | 0.515 (−3.7%) | 0.606 (+13.3%) | 0.574 (+7.3%) |
| ECAT | 0.417 | 0.436 | 0.424 | 0.493 | 0.516 | 0.505 | 0.562 | 0.561 (−0.2%) | 0.530 (−5.7%) | 0.624 (+11.0%) | 0.588 (+4.6%) |
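The bracketed percentages above are plain relative differences against the CITCC (BOW) baseline for each dataset. A minimal sketch reproducing them (the dictionary names are illustrative, not from the paper; the scores are copied from the CHINC (FB) column of Table 5):

```python
# Reproduce the bracketed relative changes in Table 5.
# Each score is compared against the CITCC (BOW) result on the same dataset.
citcc = {"20NG": 0.569, "MCAT": 0.652, "CCAT": 0.535, "ECAT": 0.562}
chinc_fb = {"20NG": 0.631, "MCAT": 0.698, "CCAT": 0.606, "ECAT": 0.624}

def relative_change(score, baseline):
    """Relative difference in percent, as shown in parentheses in Table 5."""
    return round(100.0 * (score - baseline) / baseline, 1)

for data in citcc:
    print(f"{data}: {relative_change(chinc_fb[data], citcc[data]):+.1f}%")
```

For example, CHINC (FB) on 20NG gives 100 × (0.631 − 0.569) / 0.569 ≈ +10.9%, matching the table entry.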