2016 Aug 4;11(8):e0159644. doi: 10.1371/journal.pone.0159644

Table 1. Representative recent supervised learning methods to detect duplicates in general domains.

Method  Domain              Expert curated set (DU + DI)  Technique(s)
[15]    Geospatial          1,927 + 1,927                 DT and SVM
[26]    Product matching    1,000 + 1,000                 SVM
[14]    Document Retrieval  2,500 + 2,500                 SVM
[27]    Bug report          534 + 534                     NB, DT and SVM
[28]    Spam check          1,750 + 2,000                 SVM
[29]    Web visitor         250,000 + 250,000             LR, RF, and SVM

DU: duplicate pairs; DI: distinct pairs; NB: Naïve Bayes; DT: Decision Tree; SVM: Support Vector Machine; LR: Logistic Regression; RF: Random Forest. The datasets listed here are those used for supervised learning; some studies may also use other datasets.