2024 Dec 3;14:30136. doi: 10.1038/s41598-024-81718-y

Table 2.

Performance metrics of all DL network configurations and the human observer.

Observer	Variant	AUROC	Acc	Sen	Spe	SB	Log loss
AI	N1	0.97 ± 0.02	0.97	0.93	1	0.07	0.42
		0.94 ± 0.05	0.91	1	0.84	–0.16	0.39
		0.89 ± 0.06	0.91	1	0.84	–0.16	0.44
	N2	0.99 ± 0.01	0.97	1	0.95	–0.05	0.13
		0.98 ± 0.02	0.94	1	0.89	–0.11	0.15
		0.93 ± 0.05	0.97	1	0.95	–0.05	0.30
	N3	0.98 ± 0.02	0.94	0.93	0.95	0.01	0.15
		0.99 ± 0.01	0.97	1	0.95	–0.05	0.14
		0.93 ± 0.05	0.85	1	0.74	–0.26	0.36
Human	–		0.82	0.93	0.74	–0.19

AUROC = area under the receiver operating characteristic (ROC) curve; Acc = accuracy (overall classification correctness); Sen = sensitivity (female classification accuracy); Spe = specificity (male classification accuracy); SB = sex bias (specificity – sensitivity). For DL networks, the AUROC is the average across all five models from the 5-fold cross-validation. The log loss is calculated between the true labels and the average predicted probability of being female.