Table 1.
Model | Size (n) | Precision | Recall | F1 | AUROC | NDCG |
---|---|---|---|---|---|---|
HL | 106 | 10.0 [1.3, 18.7] | 20.0 [5.4, 34.6] | 12.8 [2.5, 23.1] | 85.4 [80.8, 90.0] | 40.6 [36.4, 44.9] |
HL + Aug. | 106 | 30.7 [20.8, 40.6] | 53.3 [38.7, 68.0] | 37.8 [27.7, 47.9] | 83.4 [79.5, 87.3] | 55.7 [51.5, 59.9] |
WS | 4239 | **83.3** [64.5, 100.0] | 53.3 [38.7, 68.0] | 60.8 [50.6, 71.0] | 91.4 [87.8, 95.0] | 84.5 [81.1, 88.0] |
WS + Aug. | 4239 | 70.0 [55.4, 84.6] | **60.0** [48.1, 72.0] | **61.4** [55.3, 67.5] | **94.4** [91.3, 97.6] | **87.3** [83.6, 91.0] |
WS denotes weak supervision models, HL denotes hand-labeled models, and Aug. denotes augmentation. Scores are reported with 95% confidence intervals (n given in the Size column), and bold text indicates the best score for each metric.
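The caption does not state how the 95% confidence intervals are computed; a common choice for per-model metric intervals is the percentile bootstrap, sketched below. The resampling procedure, function names, and example labels here are illustrative assumptions, not the paper's stated method.

```python
import numpy as np

def f1_score(y_true, y_pred):
    # F1 = 2 * precision * recall / (precision + recall),
    # computed from true positives, false positives, false negatives.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_ci(y_true, y_pred, metric, n_boot=10000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample (label, prediction) pairs with
    # replacement, recompute the metric on each resample, and take the
    # alpha/2 and 1 - alpha/2 quantiles as the interval endpoints.
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = [
        metric(y_true[idx], y_pred[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), lo, hi
```

Usage: `point, lo, hi = bootstrap_ci(y_true, y_pred, f1_score)` yields the point estimate and interval endpoints, which correspond to the `score [lo, hi]` entries in the table (after scaling to percentages).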