Table 2. Diagnostic performance of the three pretrained deep learning models in the four classification tasks.
Tasks | Models | RadImageNet ACC | RadImageNet AUC (95% CI) | RadImageNet Sensitivity | RadImageNet Specificity | RadImageNet F1 | ImageNet ACC | ImageNet AUC (95% CI) | ImageNet Sensitivity | ImageNet Specificity | ImageNet F1
---|---|---|---|---|---|---|---|---|---|---|---
Nuclear grade | ResNet50 | 0.667 | 0.560 (0.469–0.571) | 0.400 | 0.720 | 0.286 | 0.610 | 0.510 (0.486–0.619) | 0.474 | 0.458 | 0.452
Nuclear grade | InceptionV3 | 0.828 | 0.510 (0.485–0.515) | 0.030 | 0.987 | 0.061 | 0.806 | 0.537 (0.465–0.563) | 0.531 | 0.500 | 0.513
Nuclear grade | DenseNet121 | 0.761 | 0.540 (0.474–0.547) | 0.200 | 0.873 | 0.218 | 0.650 | 0.563 (0.450–0.571) | 0.433 | 0.693 | 0.292
ER | ResNet50 | 0.558 | 0.574 (0.450–0.589) | 0.524 | 0.623 | 0.610 | 0.642 | 0.520 (0.417–0.548) | 0.903 | 0.151 | 0.772
ER | InceptionV3 | 0.532 | 0.480 (0.406–0.527) | 0.651 | 0.302 | 0.647 | 0.577 | 0.579 (0.448–0.586) | 0.573 | 0.585 | 0.641
ER | DenseNet121 | 0.513 | 0.460 (0.447–0.513) | 0.621 | 0.302 | 0.628 | 0.526 | 0.540 (0.467–0.550) | 0.553 | 0.472 | 0.606
PR | ResNet50 | 0.610 | 0.570 (0.496–0.587) | 0.920 | 0.220 | 0.730 | 0.526 | 0.493 (0.472–0.537) | 0.744 | 0.242 | 0.640
PR | InceptionV3 | 0.474 | 0.460 (0.433–0.491) | 0.558 | 0.364 | 0.546 | 0.513 | 0.400 (0.386–0.533) | 0.872 | 0.045 | 0.669
PR | DenseNet121 | 0.493 | 0.460 (0.453–0.521) | 0.698 | 0.227 | 0.609 | 0.552 | 0.530 (0.497–0.553) | 0.697 | 0.364 | 0.638
HER2 | ResNet50 | 0.649 | 0.583 (0.455–0.584) | 0.396 | 0.330 | 0.422 | 0.541 | 0.450 (0.416–0.566) | 0.563 | 0.530 | 0.442
HER2 | InceptionV3 | 0.541 | 0.573 (0.495–0.583) | 0.667 | 0.480 | 0.485 | 0.622 | 0.525 (0.489–0.568) | 0.250 | 0.800 | 0.300
HER2 | DenseNet121 | 0.642 | 0.530 (0.455–0.535) | 0.208 | 0.850 | 0.274 | 0.669 | 0.560 (0.468–0.566) | 0.250 | 0.870 | 0.329
ACC, accuracy; AUC, area under the curve; CI, confidence interval; ER, estrogen receptor; PR, progesterone receptor; HER2, human epidermal growth factor receptor 2.
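
As a reading aid for the threshold-based columns, the sketch below shows how ACC, sensitivity, specificity, and F1 are conventionally derived from the four counts of a binary confusion matrix. It is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study. AUC is omitted because it requires per-sample prediction scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical, imbalanced example (NOT data from the study):
# 20 positive cases and 80 negative cases.
metrics = binary_metrics(tp=2, fp=4, tn=76, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes such as these hypothetical counts, accuracy can remain fairly high while sensitivity and F1 are low, which is one reason the table reports all of these measures side by side.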