Table 5.
Performance of different clustering algorithms on the 20NG and RCV1 datasets. CHINC is our proposed method. BOW denotes bag-of-words features; FB (Freebase) and YG (YAGO2) denote the entity features generated by our world knowledge specification approach based on Freebase and YAGO2, respectively. We compare all HINC and CHINC results against CITCC, the strongest baseline; the percentages in parentheses are the relative changes with respect to CITCC. CITCC uses 250K constraints generated from ground-truth document labels.
| Data | Kmeans (BOW) | Kmeans (BOW+FB) | Kmeans (BOW+YG) | ITCC (BOW) | ITCC (BOW+FB) | ITCC (BOW+YG) | CITCC (BOW) | HINC (FB) | HINC (YG) | CHINC (FB) | CHINC (YG) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20NG | 0.429 | 0.447 | 0.437 | 0.501 | 0.525 | 0.513 | 0.569 | 0.571 (+0.4%) | 0.541 (−4.9%) | 0.631 (+10.9%) | 0.600 (+5.5%) |
| MCAT | 0.549 | 0.575 | 0.559 | 0.604 | 0.630 | 0.619 | 0.652 | 0.645 (−1.1%) | 0.625 (−4.1%) | 0.698 (+7.1%) | 0.685 (+5.1%) |
| CCAT | 0.403 | 0.419 | 0.410 | 0.472 | 0.494 | 0.481 | 0.535 | 0.542 (+1.3%) | 0.515 (−3.7%) | 0.606 (+13.3%) | 0.574 (+7.3%) |
| ECAT | 0.417 | 0.436 | 0.424 | 0.493 | 0.516 | 0.505 | 0.562 | 0.561 (−0.2%) | 0.530 (−5.7%) | 0.624 (+11.0%) | 0.588 (+4.6%) |
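The bracketed percentages above are plain relative differences against the CITCC (BOW) baseline for each dataset. A minimal sketch reproducing them (the dictionary names are illustrative, not from the paper; the scores are copied from the CHINC (FB) column of Table 5):

```python
# Reproduce the bracketed relative changes in Table 5.
# Each score is compared against the CITCC (BOW) result on the same dataset.
citcc = {"20NG": 0.569, "MCAT": 0.652, "CCAT": 0.535, "ECAT": 0.562}
chinc_fb = {"20NG": 0.631, "MCAT": 0.698, "CCAT": 0.606, "ECAT": 0.624}

def relative_change(score, baseline):
    """Relative difference in percent, as shown in parentheses in Table 5."""
    return round(100.0 * (score - baseline) / baseline, 1)

for data in citcc:
    print(f"{data}: {relative_change(chinc_fb[data], citcc[data]):+.1f}%")
```

For example, CHINC (FB) on 20NG gives 100 × (0.631 − 0.569) / 0.569 ≈ +10.9%, matching the table entry.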