Table 2.
Performance metrics of all DL network configurations and human observer.
Observer | Variant | Input | AUROC | Acc | Sen | Spe | SB | Log loss |
---|---|---|---|---|---|---|---|---|
AI | N1 | (image) | 0.97 ± 0.02 | 0.97 | 0.93 | 1 | 0.07 | 0.42 |
AI | N1 | (image) | 0.94 ± 0.05 | 0.91 | 1 | 0.84 | –0.16 | 0.39 |
AI | N1 | (image) | 0.89 ± 0.06 | 0.91 | 1 | 0.84 | –0.16 | 0.44 |
AI | N2 | (image) | 0.99 ± 0.01 | 0.97 | 1 | 0.95 | –0.05 | 0.13 |
AI | N2 | (image) | 0.98 ± 0.02 | 0.94 | 1 | 0.89 | –0.11 | 0.15 |
AI | N2 | (image) | 0.93 ± 0.05 | 0.97 | 1 | 0.95 | –0.05 | 0.30 |
AI | N3 | (image) | 0.98 ± 0.02 | 0.94 | 0.93 | 0.95 | 0.01 | 0.15 |
AI | N3 | (image) | 0.99 ± 0.01 | 0.97 | 1 | 0.95 | –0.05 | 0.14 |
AI | N3 | (image) | 0.93 ± 0.05 | 0.85 | 1 | 0.74 | –0.26 | 0.36 |
Human | – | | | 0.82 | 0.93 | 0.74 | –0.19 | |
AUROC = area under the ROC curve; Acc = accuracy (overall classification correctness); Sen = sensitivity (female classification accuracy); Spe = specificity (male classification accuracy); SB = sex bias (specificity – sensitivity). For DL networks, the AUROC is the average across all five models from the 5-fold cross-validation. The log loss is calculated between the true labels and the average probability of being female.
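
The metric definitions in the footnote can be illustrated with a minimal sketch. It is not the authors' code: the function names `auroc` and `summarize`, the 0.5 decision threshold, the coding of female as the positive class (1), and the toy labels and probabilities are all assumptions made for illustration. Only the definitions themselves (SB = specificity – sensitivity, AUROC averaged over the fold models, log loss on the fold-averaged probability of being female) come from the footnote.

```python
import math


def auroc(y_true, scores):
    """Rank-based AUROC: share of (female, male) pairs ordered correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def summarize(y_true, fold_probs, threshold=0.5):
    """y_true: 1 = female, 0 = male (assumed coding).
    fold_probs: one list of P(female) per fold model, all on the same cases."""
    n = len(y_true)
    # Average the predicted probability of "female" across the fold models.
    avg = [sum(p[i] for p in fold_probs) / len(fold_probs) for i in range(n)]
    # Assumed decision rule: classify as female when the averaged probability >= threshold.
    pred = [1 if p >= threshold else 0 for p in avg]

    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    sen = tp / n_pos          # sensitivity: female classification accuracy
    spe = tn / (n - n_pos)    # specificity: male classification accuracy

    # Binary log loss between true labels and the averaged probability (clipped for safety).
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in avg]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
    ) / n

    return {
        "auroc": sum(auroc(y_true, p) for p in fold_probs) / len(fold_probs),
        "acc": (tp + tn) / n,
        "sen": sen,
        "spe": spe,
        "sb": spe - sen,      # sex bias as defined in the footnote
        "log_loss": log_loss,
    }


# Hypothetical toy data: 2 female and 3 male cases, two fold models.
y_true = [1, 1, 0, 0, 0]
fold_probs = [
    [0.9, 0.8, 0.7, 0.2, 0.1],
    [0.7, 0.4, 0.5, 0.4, 0.1],
]
metrics = summarize(y_true, fold_probs)
print(metrics)
```

Under this definition, a negative SB means male cases are misclassified more often than female cases. The human observer row is an example: 0.74 – 0.93 = –0.19.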