[Preprint]. 2024 Feb 5:arXiv:2402.03484v1. [Version 1]

Table 2:

Evaluation on subsets of the PubCLogs test set. Subsets are grouped by the combined coclick counts of article pairs. We compare standard baseline models with Highlight Similar Article Title (HSAT). R, P, and F1 denote recall, precision, and F1 score.

| Model | Top 0.1% R | Top 0.1% P | Top 0.1% F1 | Top third R | Top third P | Top third F1 | Middle third R | Middle third P | Middle third F1 | Bottom third R | Bottom third P | Bottom third F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HighlightAll | 100.00 | 24.37 | 39.19 | 100.00 | 21.88 | 35.90 | 100.00 | 22.06 | 36.15 | 100.00 | 20.40 | 33.88 |
| Overlapper | 69.65 | 33.47 | 45.21 | 63.75 | 25.75 | 36.68 | 61.73 | 25.00 | 35.59 | 60.23 | 23.37 | 33.68 |
| BM25 [31] | 69.57 | 61.70 | 65.40 | 69.79 | 70.26 | 70.02 | 69.04 | 75.55 | 72.14 | 63.00 | 77.11 | 69.34 |
| Word2Vec [23] | 67.84 | 56.03 | 61.37 | 54.56 | 53.56 | 54.06 | 50.39 | 54.27 | 52.26 | 40.72 | 48.94 | 44.45 |
| BioWord2Vec [41] | 62.09 | 51.77 | 56.47 | 43.88 | 42.67 | 43.27 | 40.28 | 43.07 | 41.63 | 32.99 | 39.60 | 36.00 |
| MPNet [35] | 83.16 | 68.09 | 74.87 | 71.62 | 69.56 | 70.58 | 66.18 | 70.90 | 68.46 | 56.68 | 68.20 | 61.91 |
| MedCPT [16] | 72.62 | 58.87 | 65.02 | 65.69 | 64.00 | 64.83 | 61.34 | 65.92 | 63.54 | 52.98 | 63.86 | 57.91 |
| GPT-3.5 [24] | 51.38 | 35.46 | 41.96 | 56.71 | 41.53 | 47.95 | 53.73 | 42.89 | 47.70 | 44.80 | 39.76 | 42.13 |
| GPT-4 [25] | 87.13 | 55.11 | 67.51 | 80.05 | 56.27 | 66.08 | 77.44 | 58.62 | 66.73 | 69.54 | 55.48 | 61.72 |
| HSAT (ours) | 97.62 | 97.48 | 97.55 | 92.57 | 93.12 | 92.84 | 92.18 | 93.09 | 92.64 | 89.38 | 90.34 | 89.86 |
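
As a reading aid, the F1 values in the table can be checked against the R and P values next to them. The sketch below assumes that F1 is the usual harmonic mean of precision and recall, computed from the aggregate percentages as printed; the table itself does not state this, and a per-example average would give slightly different numbers. The helper name `f1_score` and the selection of rows are illustrative and not taken from the paper.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F1) copied from the "Top 0.1%" columns.
top_subset = {
    "HighlightAll": (100.0, 24.37, 39.19),
    "BM25": (69.57, 61.70, 65.40),
    "HSAT": (97.62, 97.48, 97.55),
}

for model, (recall, precision, reported) in top_subset.items():
    computed = f1_score(recall, precision)
    # Compare the recomputed value with the reported one.
    print(f"{model}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

For these three rows the recomputed values agree with the reported F1 to two decimal places, which is consistent with the harmonic-mean assumption. The HighlightAll row also shows why recall alone is uninformative here: highlighting every token gives 100% recall, but precision stays near 24%.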