Table 6.
Classification and Repeatability metrics comparing binary with multiclass models, both with and without Monte Carlo (MC) dropout.
| Model | % ext. mis | % p as n | % n as p | % ext. dis | QWK | 95% LoA |
|---|---|---|---|---|---|---|
| Binary | 21.83% | 32.16% | 20.66% | 12.50% | 0.621 | 0.617 |
| Binary-MC | 25.74% | 26.90% | 25.61% | 11.14% | 0.704 | 0.366 |
| Three-class | 5.87% | 8.77% | 7.27% | 0.95% | 0.796 | 0.470 |
| Three-class-MC (#36) | 3.44% | 5.85% | 4.16% | 0.69% | 0.856 | 0.240 |
Comparison of binary and multiclass models on “Test Set 2”, highlighting relevant classification metrics (% ext. mis.: % extreme misclassifications; % p as n: % precancer+ classified as normal; % n as p: % normal classified as precancer+) and repeatability metrics (% ext. dis.: % extreme disagreement between image pairs across women; QWK: quadratic weighted kappa; 95% LoA: 95% limits of agreement on a Bland-Altman plot, characterizing continuous-score repeatability). All four models (binary, binary with Monte Carlo (MC) dropout, three-class, and three-class with MC dropout) use the same configuration as the top-performing model (#36), differing only in the presence or absence of MC dropout and in whether they output binary or three-class predictions, as indicated by each name. The three-class models were trained using the “3 level all patients” ground-truth mapping (normal, gray zone, precancer+), while the binary models were trained on binary (normal, precancer+) ground truths. These metrics indicate that three-class models outperform binary models on both classification and repeatability metrics, and that MC dropout improves repeatability.
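The two repeatability summaries defined in the caption (QWK on categorical outputs and 95% LoA on continuous scores) can be computed as sketched below. This is an illustrative NumPy implementation under standard definitions, not the authors' code; the function names and the 1.96 normal-approximation multiplier for the limits of agreement are conventional choices, not taken from the source.

```python
import numpy as np

def quadratic_weighted_kappa(y1, y2, n_classes):
    """Quadratic weighted kappa between two ratings coded 0..n_classes-1.

    1 = perfect agreement; 0 = chance-level agreement.
    """
    y1, y2 = np.asarray(y1), np.asarray(y2)
    # Observed agreement matrix (confusion matrix between the two ratings).
    O = np.zeros((n_classes, n_classes))
    for a, b in zip(y1, y2):
        O[a, b] += 1
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance.
    i, j = np.meshgrid(np.arange(n_classes), np.arange(n_classes), indexing="ij")
    W = (i - j) ** 2 / (n_classes - 1) ** 2
    # Expected matrix under independence of the two marginal histograms.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()

def bland_altman_loa(scores1, scores2):
    """95% limits of agreement: mean difference +/- 1.96 * SD of differences."""
    d = np.asarray(scores1, dtype=float) - np.asarray(scores2, dtype=float)
    mean_d, sd_d = d.mean(), d.std(ddof=1)
    return mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d
```

In the context of this table, `y1`/`y2` would be the model's class predictions on the two images of each pair, and `scores1`/`scores2` the corresponding continuous model scores; a narrower LoA interval (as for the MC-dropout models) indicates better continuous-score repeatability.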