[Preprint]. 2024 Feb 5:arXiv:2402.03484v1. [Version 1]

Table 2:

Evaluation on subsets of the PubCLogs test set. Subsets are grouped by the combined coclick count of each article pair. We compare standard baseline models against Highlight Similar Article Title (HSAT). R, P, and F₁ denote recall, precision, and F₁ score, all in percent.

	Top 0.1%			Top third			Middle third			Bottom third

Model	R	P	F₁	R	P	F₁	R	P	F₁	R	P	F₁

HighlightAll	100.0	24.37	39.19	100.0	21.88	35.90	100.0	22.06	36.15	100.0	20.40	33.88
Overlapper	69.65	33.47	45.21	63.75	25.75	36.68	61.73	25.00	35.59	60.23	23.37	33.68
BM25 [31]	69.57	61.70	65.40	69.79	70.26	70.02	69.04	75.55	72.14	63.00	77.11	69.34

Word2Vec [23]	67.84	56.03	61.37	54.56	53.56	54.06	50.39	54.27	52.26	40.72	48.94	44.45
BioWord2Vec [41]	62.09	51.77	56.47	43.88	42.67	43.27	40.28	43.07	41.63	32.99	39.60	36.00
MPNet [35]	83.16	68.09	74.87	71.62	69.56	70.58	66.18	70.90	68.46	56.68	68.20	61.91
MedCPT [16]	72.62	58.87	65.02	65.69	64.00	64.83	61.34	65.92	63.54	52.98	63.86	57.91

GPT-3.5 [24]	51.38	35.46	41.96	56.71	41.53	47.95	53.73	42.89	47.70	44.80	39.76	42.13
GPT-4 [25]	87.13	55.11	67.51	80.05	56.27	66.08	77.44	58.62	66.73	69.54	55.48	61.72

HSAT (ours)	97.62	97.48	97.55	92.57	93.12	92.84	92.18	93.09	92.64	89.38	90.34	89.86
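
The F₁ columns above can be related to the R and P columns with a short calculation. The sketch below assumes F₁ is the standard harmonic mean of precision and recall, which the table does not state explicitly; the function name `f1_score` and the `rows` dictionary are illustrative, and the numbers are copied from the "Top 0.1%" columns of three rows.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent).

    Assumption: the table uses the standard F1 definition.
    """
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) from the "Top 0.1%" columns of Table 2.
rows = {
    "HighlightAll": (100.0, 24.37, 39.19),
    "BM25": (69.57, 61.70, 65.40),
    "HSAT": (97.62, 97.48, 97.55),
}

for model, (recall, precision, reported) in rows.items():
    computed = f1_score(recall, precision)
    print(f"{model}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Because the table entries are rounded to two decimals, a recomputed F₁ may differ from the reported one in the last digit. The HighlightAll row shows why recall alone is uninformative here: highlighting every token gives 100% recall, so its F₁ is determined entirely by its low precision.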