Table 1. Representative recent supervised learning methods to detect duplicates in general domains.
Method | Domain | Expert-curated set (DU + DI) | Technique(s) |
---|---|---|---|
[15] | Geospatial | 1,927 + 1,927 | DT and SVM |
[26] | Product matching | 1,000 + 1,000 | SVM |
[14] | Document Retrieval | 2,500 + 2,500 | SVM |
[27] | Bug report | 534 + 534 | NB, DT, and SVM |
[28] | Spam check | 1,750 + 2,000 | SVM |
[29] | Web visitor | 250,000 + 250,000 | LR, RF, and SVM |
DU: duplicate pairs; DI: distinct pairs; NB: Naïve Bayes; DT: Decision Tree; SVM: Support Vector Machine; LR: Logistic Regression; RF: Random Forest. The set sizes listed are those used for supervised learning; some studies also use additional datasets.