Sci Rep. 2023 Dec 8;13:21772. doi: 10.1038/s41598-023-48721-1

Table 6.

Classification and repeatability metrics comparing binary with multiclass models, both with and without Monte Carlo (MC) dropout.

                       Classification                       Repeatability
Model                  % ext. mis.  % p as n   % n as p     % ext. dis.  QWK    95% LoA
Binary                 21.83%       32.16%     20.66%       12.50%       0.621  0.617
Binary-MC              25.74%       26.90%     25.61%       11.14%       0.704  0.366
Three-class            5.87%        8.77%      7.27%        0.95%        0.796  0.470
Three-class-MC (#36)   3.44%        5.85%      4.16%        0.69%        0.856  0.240

Comparison of binary and multiclass models on “Test Set 2”, highlighting relevant classification metrics (% p as n: % precancer+ classified as normal; % n as p: % normal classified as precancer+; % ext. mis.: % extreme misclassifications) and repeatability metrics (% ext. dis.: % extreme disagreement between image pairs across women; QWK: quadratic weighted kappa; 95% LoA: 95% limits of agreement on a Bland–Altman plot, reflecting the repeatability of the continuous score). All four models (binary, binary with Monte Carlo (MC) dropout, three-class, and three-class with MC dropout) share the configuration of the top-performing model (#36); they differ only in the presence or absence of MC dropout and in whether they output binary or three-class predictions, as indicated by their names. The three-class models were trained using the “3 level all patients” ground-truth mapping (normal, gray zone, precancer+), while the binary models were trained on binary (normal, precancer+) ground truths. These metrics indicate that the three-class models outperform the binary models on both classification and repeatability, and that MC dropout further improves repeatability.
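
The repeatability metrics reported in the table can be computed from paired model outputs on repeat images of the same woman. Below is a minimal sketch, not the authors' code: the arrays, class encoding, and values are illustrative placeholders; QWK is computed with scikit-learn's quadratic-weighted Cohen's kappa, and the Bland–Altman 95% limits of agreement are the mean paired difference ± 1.96 × SD of the differences.

```python
# Minimal sketch (illustrative data, not the study's) of the two repeatability
# metrics named in the table: QWK on categorical predictions and Bland-Altman
# 95% limits of agreement on continuous scores, computed across image pairs.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired outputs for repeat images of the same women:
# discrete classes (0 = normal, 1 = gray zone, 2 = precancer+) and
# continuous severity scores in [0, 1].
classes_img1 = np.array([0, 1, 2, 2, 0, 1])
classes_img2 = np.array([0, 1, 2, 1, 0, 2])
scores_img1 = np.array([0.05, 0.40, 0.90, 0.85, 0.10, 0.55])
scores_img2 = np.array([0.10, 0.45, 0.88, 0.60, 0.12, 0.70])

# Quadratic weighted kappa: agreement between the two sets of categorical
# predictions, penalizing larger disagreements more heavily.
qwk = cohen_kappa_score(classes_img1, classes_img2, weights="quadratic")

# Bland-Altman 95% limits of agreement for the continuous score:
# mean of the paired differences +/- 1.96 * SD of the paired differences.
diff = scores_img1 - scores_img2
half_width = 1.96 * diff.std(ddof=1)
loa_low, loa_high = diff.mean() - half_width, diff.mean() + half_width

print(f"QWK = {qwk:.3f}, 95% LoA = [{loa_low:.3f}, {loa_high:.3f}]")
```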
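
The “-MC” rows use Monte Carlo dropout at inference time. The sketch below shows the general technique only, under assumed settings: dropout layers are kept active at test time and the softmax outputs of several stochastic forward passes are averaged. The placeholder classifier, dropout rate, and number of passes are not the paper's configuration.

```python
# Minimal PyTorch-style sketch of Monte Carlo (MC) dropout inference.
# The network and hyperparameters are placeholders for illustration.
import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    def __init__(self, n_features: int = 128, n_classes: int = 3, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Dropout(p_drop),   # kept active at test time for MC dropout
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    # train() keeps dropout stochastic; no gradients are computed here.
    model.train()
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    )
    # Mean over passes gives the prediction; the spread reflects uncertainty.
    return probs.mean(dim=0), probs.std(dim=0)

# Example usage with random features standing in for image embeddings.
model = SmallClassifier()
x = torch.randn(4, 128)
mean_probs, std_probs = mc_dropout_predict(model, x)
```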