Author manuscript; available in PMC: 2015 Dec 22.
Published in final edited form as: KDD. 2015 Aug;2015:1215–1224. doi: 10.1145/2783258.2783374

Table 5.

Performance of different clustering algorithms on the 20NG and RCV1 datasets. CHINC is our proposed method. BOW denotes bag-of-words features; FB (Freebase) and YG (YAGO2) denote the entities generated by our world knowledge specification approach based on Freebase and YAGO2, respectively. We compare all HINC and CHINC numbers against CITCC, the strongest baseline; the percentages in brackets are the relative changes with respect to CITCC. CITCC uses 250K constraints generated from the ground-truth labels of documents.

| Data | Kmeans (BOW) | Kmeans (BOW+FB) | Kmeans (BOW+YG) | ITCC (BOW) | ITCC (BOW+FB) | ITCC (BOW+YG) | CITCC (BOW) | HINC (FB) | HINC (YG) | CHINC (FB) | CHINC (YG) |
|------|--------------|-----------------|-----------------|------------|---------------|---------------|-------------|-----------|-----------|------------|------------|
| 20NG | 0.429 | 0.447 | 0.437 | 0.501 | 0.525 | 0.513 | 0.569 | 0.571 (+0.4%) | 0.541 (−4.9%) | 0.631 (+10.9%) | 0.600 (+5.5%) |
| MCAT | 0.549 | 0.575 | 0.559 | 0.604 | 0.630 | 0.619 | 0.652 | 0.645 (−1.1%) | 0.625 (−4.1%) | 0.698 (+7.1%) | 0.685 (+5.1%) |
| CCAT | 0.403 | 0.419 | 0.410 | 0.472 | 0.494 | 0.481 | 0.535 | 0.542 (+1.3%) | 0.515 (−3.7%) | 0.606 (+13.3%) | 0.574 (+7.3%) |
| ECAT | 0.417 | 0.436 | 0.424 | 0.493 | 0.516 | 0.505 | 0.562 | 0.561 (−0.2%) | 0.530 (−5.7%) | 0.624 (+11.0%) | 0.588 (+4.6%) |
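As a minimal sketch of how the bracketed percentages are obtained, the following computes the relative change of each HINC/CHINC score against the CITCC (BOW) baseline for the 20NG row; the scores are taken directly from the table, and any residual disagreement with the printed percentages (e.g. +5.4% vs. the table's +5.5% for CHINC YG) would come from the published scores themselves being rounded to three decimals.

```python
# Relative change over the CITCC (BOW) baseline, as reported in Table 5.
def relative_change(score: float, baseline: float) -> float:
    """Return the relative change of `score` vs. `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0

citcc_20ng = 0.569  # CITCC (BOW) on 20NG, from the table
scores = {
    "HINC FB": 0.571,
    "HINC YG": 0.541,
    "CHINC FB": 0.631,
    "CHINC YG": 0.600,
}

for name, s in scores.items():
    print(f"{name}: {relative_change(s, citcc_20ng):+.1f}%")
# prints +0.4%, -4.9%, +10.9%, +5.4% respectively
```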