Bioinformatics. 2021 Sep 29;38(2):513–519. doi: 10.1093/bioinformatics/btab670

Table 1.

Generalization accuracy of NuCLS models trained and evaluated on the corrected single-rater dataset using internal–external cross-validation

Fold       | N (det.) | AP@.5      | mAP@.5:.95 | N (seg.) | Median IOU | Median DICE | N (cls.) | Superclasses? | Accuracy   | MCC        | AUROC (micro) | AUROC (macro)
1 (Val.)   | 6102     | 75.3       | 34.4       | 1389     | 78.5       | 87.9        | 5351     | No            | 71.0       | 58.1       | 93.3          | 84.6
           |          |            |            |          |            |             |          | Yes           | 77.5       | 65.2       | 93.7          | 89.0
2          | 15442    | 74.9       | 33.2       | 3474     | 78.0       | 87.6        | 13597    | No            | 70.1       | 56.9       | 93.8          | 83.6
           |          |            |            |          |            |             |          | Yes           | 79.4       | 68.2       | 94.6          | 86.5
3          | 12672    | 74.0       | 33.8       | 1681     | 80.2       | 89.0        | 11176    | No            | 68.6       | 57.0       | 93.5          | 87.1
           |          |            |            |          |            |             |          | Yes           | 79.0       | 68.1       | 94.4          | 89.4
4          | 8260     | 75.3       | 33.5       | 1948     | 80.9       | 89.5        | 7288     | No            | 73.1       | 61.8       | 94.5          | 85.0
           |          |            |            |          |            |             |          | Yes           | 83.9       | 73.5       | 96.1          | 87.4
5          | 7295     | 74.9       | 31.5       | 1306     | 78.1       | 87.7        | 6294     | No            | 61.7       | 47.0       | 89.3          | 79.2
           |          |            |            |          |            |             |          | Yes           | 68.4       | 52.4       | 89.0          | 80.8
Mean (Std) |          | 74.8 (0.5) | 33.0 (0.9) |          | 79.3 (1.3) | 88.5 (0.8)  |          | No            | 68.4 (4.2) | 55.7 (5.4) | 92.8 (2.0)    | 83.7 (2.9)
           |          |            |            |          |            |             |          | Yes           | 77.7 (5.7) | 65.6 (7.9) | 93.5 (2.7)    | 86.0 (3.2)

Note: All accuracy values are percentages. Fold 1 acted as the validation set for hyperparameter tuning, so the bottom row reports the mean and standard deviation of the four remaining folds (2–5). The number of testing-set nuclei (N) varies by fold because the data were split at the level of hospitals, not nuclei. Classification accuracy is consistently higher when assessed at the level of superclasses. Abbreviations: AP@.5, average precision at a threshold of 0.5 for counting a detection as a true positive; mAP@.5:.95, mean average precision averaged over detection thresholds from 0.5 to 0.95.
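The hospital-level splitting described in the note can be expressed with a grouped cross-validation splitter. The sketch below is illustrative only (the array names such as `hospital_id` and `nucleus_features` are hypothetical, not from the paper's code); it shows why the testing-set size N differs across folds when whole hospitals, rather than individual nuclei, are held out.

```python
# A minimal sketch of hospital-level fold assignment (illustrative names,
# not the authors' code): nuclei from the same hospital never appear in both
# the training and testing portions of a fold.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_nuclei = 50_000
hospital_id = rng.integers(0, 10, size=n_nuclei)   # hypothetical hospital per nucleus
nucleus_features = np.zeros((n_nuclei, 1))          # stand-in for the nucleus data

for fold, (train_idx, test_idx) in enumerate(
    GroupKFold(n_splits=5).split(nucleus_features, groups=hospital_id), start=1
):
    # Sanity check: no hospital is shared between train and test,
    # so fold sizes vary with the hospitals assigned to each fold.
    assert not set(hospital_id[train_idx]) & set(hospital_id[test_idx])
    print(f"Fold {fold}: {len(test_idx)} test nuclei, "
          f"hospitals {sorted(set(hospital_id[test_idx]))}")
```

The classification columns (Accuracy, MCC, micro/macro AUROC) are standard multi-class metrics. A minimal per-fold sketch with scikit-learn follows, assuming hypothetical arrays `y_true` (integer class labels), `y_prob` (predicted class probabilities, one column per class) and `class_ids` (the class label for each probability column); values are scaled to percentages to match the table's convention.

```python
# A sketch of the per-fold classification metrics (not the authors' code).
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, roc_auc_score
from sklearn.preprocessing import label_binarize


def classification_metrics(y_true, y_prob, class_ids):
    """Accuracy, MCC and micro/macro AUROC, all reported as percentages."""
    y_pred = np.asarray(class_ids)[np.argmax(y_prob, axis=1)]
    # One-hot labels are needed for micro-averaged AUROC in the multi-class case.
    y_onehot = label_binarize(y_true, classes=class_ids)
    return {
        "accuracy": 100 * accuracy_score(y_true, y_pred),
        "mcc": 100 * matthews_corrcoef(y_true, y_pred),
        "auroc_micro": 100 * roc_auc_score(y_onehot, y_prob, average="micro"),
        # Macro AUROC on integer labels; y_prob rows should sum to 1.
        "auroc_macro": 100 * roc_auc_score(
            y_true, y_prob, multi_class="ovr", average="macro"
        ),
    }
```

Under these assumptions, the "Superclasses? = Yes" rows would correspond to applying the same function after mapping the labels to superclasses and summing the predicted probabilities within each superclass, and the "Mean (Std)" row would average folds 2–5 only, since fold 1 was used for tuning.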