Table 2.
| | TP (n) | TN (n) | FP (n) | FN (n) | AUC, % [95% CI] | Sensitivity, % [95% CI] | Specificity, % [95% CI] | Accuracy, % [95% CI] | P† |
|---|---|---|---|---|---|---|---|---|---|
| Training | 1368 | 5051 | 113 | 696 | 82.05 [81.01-83.08] | 66.28 (1368/2064) [64.10-68.32] | 97.81 (5051/5164) [97.41-98.20] | 88.81 (6419/7228) [88.06-89.53] | |
| Internal Validation | | | | | | | | | |
| Deep-learning model | 356 | 1268 | 23 | 160 | 83.61 [81.58-85.64] | 68.99 (356/516) [65.12-73.06] | 98.22 (1268/1291) [97.44-98.92] | 89.87 (1624/1807) [88.39-91.23] | |
| Radiologist 1 | 181 | 1239 | 52 | 335 | 65.52 [63.40-67.65] | 35.08 (181/516) [30.81-39.34] | 95.97 (1239/1291) [94.89-96.98] | 78.58 (1420/1807) [76.62-80.45] | < 0.0001 |
| Radiologist 2 | 115 | 1248 | 43 | 401 | 59.48 [57.62-61.34] | 22.29 (115/516) [18.60-25.78] | 96.67 (1248/1291) [95.66-97.60] | 75.43 (1363/1807) [73.38-77.40] | < 0.0001 |
TP = true positive, TN = true negative, FP = false positive, FN = false negative
†: comparison between each radiologist and the deep-learning model.
DeLong's test was used to compare the AUCs.
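The sensitivity, specificity, and accuracy values in Table 2 follow directly from the TP/TN/FP/FN counts. The sketch below (not the authors' code) illustrates this arithmetic for the deep-learning model's internal-validation row; Wilson score intervals are used here for the 95% CIs as an assumption, so the bounds may differ slightly from the published ones if the paper used Clopper-Pearson or bootstrap intervals.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion, returned in %.

    Assumption: the paper's exact CI method is not stated here, so this is
    illustrative only.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return 100 * (center - half), 100 * (center + half)

def summarize(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, and accuracy (in %) from confusion-matrix counts."""
    pos, neg, total = tp + fn, tn + fp, tp + tn + fp + fn
    return {
        "sensitivity": (100 * tp / pos, wilson_ci(tp, pos)),
        "specificity": (100 * tn / neg, wilson_ci(tn, neg)),
        "accuracy": (100 * (tp + tn) / total, wilson_ci(tp + tn, total)),
    }

# Deep-learning model, internal validation (counts from Table 2)
for name, (value, (lo, hi)) in summarize(tp=356, tn=1268, fp=23, fn=160).items():
    print(f"{name}: {value:.2f} [{lo:.2f}-{hi:.2f}]")
# Point estimates reproduce the table (68.99, 98.22, 89.87); CI bounds may
# differ slightly depending on the interval method used in the paper.
```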