Table 2.
| Observer | Variant | Input | AUROC | Acc | Sen | Spe | SB | Log loss |
|---|---|---|---|---|---|---|---|---|
| AI | N1 | | 0.97 ± 0.02 | 0.97 | 0.93 | 1 | 0.07 | 0.42 |
| | | | 0.94 ± 0.05 | 0.91 | 1 | 0.84 | –0.16 | 0.39 |
| | | | 0.89 ± 0.06 | 0.91 | 1 | 0.84 | –0.16 | 0.44 |
| | N2 | | 0.99 ± 0.01 | 0.97 | 1 | 0.95 | –0.05 | 0.13 |
| | | | 0.98 ± 0.02 | 0.94 | 1 | 0.89 | –0.11 | 0.15 |
| | | | 0.93 ± 0.05 | 0.97 | 1 | 0.95 | –0.05 | 0.30 |
| | N3 | | 0.98 ± 0.02 | 0.94 | 0.93 | 0.95 | 0.01 | 0.15 |
| | | | 0.99 ± 0.01 | 0.97 | 1 | 0.95 | –0.05 | 0.14 |
| | | | 0.93 ± 0.05 | 0.85 | 1 | 0.74 | –0.26 | 0.36 |
| Human | – | | | 0.82 | 0.93 | 0.74 | –0.19 | |
AUROC = area under the ROC curve; Acc = accuracy (overall classification correctness); Sen = sensitivity (female classification accuracy); Spe = specificity (male classification accuracy); SB = sex bias (specificity – sensitivity). For DL networks, the AUROC is the average across all five models from the 5-fold cross-validation. The log loss is calculated between the true labels and the average probability of being female.
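The definitions above can be illustrated with a minimal sketch. It assumes labels coded as 1 = female and 0 = male, a decision threshold of 0.5 applied to the fold-averaged probability, and clipping of probabilities before taking logarithms; none of these choices is stated in the table, and the function name `sex_bias_metrics` and the toy data are hypothetical.

```python
import math


def sex_bias_metrics(y_true, p_female_per_model, threshold=0.5):
    """Compute Acc, Sen, Spe, SB and log loss as defined in the table notes.

    y_true: list of labels, 1 = female, 0 = male (assumed coding).
    p_female_per_model: one list of P(female) per cross-validation model.
    threshold: assumed decision threshold on the averaged probability.
    """
    n = len(y_true)
    n_models = len(p_female_per_model)

    # Average the probability of being female across the models.
    p_avg = [sum(m[i] for m in p_female_per_model) / n_models for i in range(n)]
    y_pred = [1 if p >= threshold else 0 for p in p_avg]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_female = sum(y_true)
    n_male = n - n_female

    sen = tp / n_female   # female classification accuracy
    spe = tn / n_male     # male classification accuracy
    acc = (tp + tn) / n   # overall classification correctness
    sb = spe - sen        # sex bias: specificity minus sensitivity

    # Binary log loss between true labels and the averaged probability.
    eps = 1e-15  # clipping to avoid log(0); an implementation assumption
    total = 0.0
    for t, p in zip(y_true, p_avg):
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    log_loss = -total / n

    return {"acc": acc, "sen": sen, "spe": spe, "sb": sb, "log_loss": log_loss}


# Toy example (made-up numbers, not data from the study).
y_true = [1, 1, 0, 0]
folds = [[0.9, 0.4, 0.2, 0.1], [0.7, 0.2, 0.4, 0.3]]
print(sex_bias_metrics(y_true, folds))
```

Under this sign convention, a negative SB means that female cases were classified more accurately than male cases, which is the pattern in most rows of the table.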