Table 4.

Retrospective results with respect to per-article AUC, NDCG@20, precision@3, precision@10 and precision@20. For each metric we report the mean and standard deviation over the 133 articles for which candidate sets were annotated for the respective domains. All sentences not in candidate sets are assumed to be irrelevant; these results are therefore noisy and likely pessimistic. We bold the cells corresponding to the best-performing method for each (metric, PICO element) pair.

Method	Mean AUC (SD)	Mean NDCG@20 (SD)	Mean Precision@3 (SD)	Mean Precision@10 (SD)	Mean Precision@20 (SD)
Population
Direct only	0.904 (0.106)	0.530 (0.270)	0.347 (0.298)	0.183 (0.126)	0.116 (0.070)
DS	0.941 (0.063)	0.484 (0.243)	0.256 (0.242)	0.202 (0.126)	0.129 (0.075)
Nguyen	0.917 (0.091)	0.537 (0.275)	0.328 (0.281)	0.189 (0.128)	0.117 (0.072)
SDS	0.947 (0.059)	0.548 (0.263)	0.336 (0.276)	0.212 (0.133)	0.132 (0.076)
Interventions
Direct only	0.893 (0.099)	0.493 (0.265)	0.397 (0.293)	0.216 (0.148)	0.139 (0.086)
DS	0.933 (0.068)	0.507 (0.239)	0.344 (0.295)	0.250 (0.164)	0.172 (0.099)
Nguyen	0.921 (0.073)	0.536 (0.254)	0.419 (0.300)	0.248 (0.162)	0.158 (0.097)
SDS	0.936 (0.063)	0.530 (0.249)	0.389 (0.323)	0.252 (0.164)	0.172 (0.099)
Outcomes
Direct only	0.837 (0.096)	0.261 (0.241)	0.180 (0.244)	0.114 (0.117)	0.080 (0.072)
DS	0.896 (0.078)	0.308 (0.223)	0.117 (0.203)	0.148 (0.133)	0.120 (0.091)
Nguyen	0.870 (0.085)	0.339 (0.256)	0.228 (0.268)	0.151 (0.137)	0.106 (0.084)
SDS	0.900 (0.079)	0.333 (0.233)	0.138 (0.212)	0.160 (0.134)	0.124 (0.092)
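
The per-article metrics summarised in Table 4 can be illustrated with a minimal sketch. It assumes binary relevance labels (1 = relevant, 0 = irrelevant, with sentences outside the candidate set labelled 0 as described in the caption), a log2 rank discount for NDCG, and the sample standard deviation; these choices, the function names, and the toy rankings are illustrative assumptions, not the exact evaluation code used to produce the table.

```python
import math
from statistics import mean, stdev


def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked sentences that are relevant."""
    return sum(ranked_labels[:k]) / k


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain with binary gains and a log2 discount."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(ranked_labels[:k]))


def ndcg_at_k(ranked_labels, k):
    """DCG normalised by the DCG of the ideal (relevant-first) ordering."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def auc(ranked_labels):
    """Share of (relevant, irrelevant) pairs with the relevant one ranked higher."""
    pos = sum(ranked_labels)
    neg = len(ranked_labels) - pos
    if pos == 0 or neg == 0:
        return float("nan")  # undefined when only one class is present
    correct, neg_below = 0, 0
    # Walk up from the bottom of the ranking, counting irrelevant items seen.
    for rel in reversed(ranked_labels):
        if rel:
            correct += neg_below
        else:
            neg_below += 1
    return correct / (pos * neg)


def summarise(per_article_values):
    """Mean and sample standard deviation over articles."""
    return mean(per_article_values), stdev(per_article_values)


# Hypothetical rankings for three articles: labels listed in ranked order.
articles = [
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
]
for name, metric in [("AUC", auc),
                     ("NDCG@3", lambda r: ndcg_at_k(r, 3)),
                     ("P@3", lambda r: precision_at_k(r, 3))]:
    m, sd = summarise([metric(r) for r in articles])
    print(f"{name}: {m:.3f} ({sd:.3f})")
```

Because every metric is computed per article and then averaged, an article with few relevant sentences caps its own precision@k well below 1, which is one reason the precision columns are low even where AUC is high.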