Table 2.
The performance of the DL model and readers in the test data set
| | | AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Image-based analysis | | | | | |
| DL model | | 0.95 | 0.92 (2876/3133) | 0.87 (213/245) | 0.92 (2663/2888) |
| Reader A | Without DL model | 0.96 | 0.96 (3001/3133) | 0.97 (238/245) | 0.96 (2763/2888) |
| | With DL model | 0.97 | 0.96 (3012/3133) | 0.98 (239/245) | 0.96 (2773/2888) |
| | Comparison | 0.100 | 0.091 | 1.000 | 0.123 |
| Reader B | Without DL model | 0.93 | 0.98 (3065/3133) | 0.88 | 0.99 (2850/2888) |
| | With DL model | 0.95 | 0.98 (3079/3133) | 0.90 (220/245) | 0.99 (2859/2888) |
| | Comparison | 0.006ᵃ | 0.001ᵃ | 0.074 | 0.015ᵃ |
| Reader C | Without DL model | 0.96 | 0.94 | 0.95 (233/245) | 0.94 (2723/2888) |
| | With DL model | 0.99 | 0.95 (2971/3133) | 1.00 (245/245) | 0.94 (2726/2888) |
| | Comparison | <0.001ᵃ | 0.077 | 0.002ᵃ | 0.877 |
| Reader D | Without DL model | 0.93 | 0.96 (3016/3133) | 0.90 (220/245) | 0.97 (2796/2888) |
| | With DL model | 0.96 | 0.95 (2973/3133) | 0.96 (235/245) | 0.95 (2738/2888) |
| | Comparison | <0.001ᵃ | <0.001ᵃ | 0.001ᵃ | <0.001ᵃ |
| Patient-based analysis | | | | | |
| DL model | | 0.98 | 0.96 (48/50) | 0.96 (24/25) | 0.96 (24/25) |
| Reader A | Without DL model | 0.98 | 0.98 (49/50) | 1.00 (25/25) | 0.96 (24/25) |
| | With DL model | 1.00 | 1.00 (50/50) | 1.00 (25/25) | 1.00 (25/25) |
| | Comparison | 0.317 | 1.000 | N/A | 1.000 |
| Reader B | Without DL model | 0.96 | 0.96 (48/50) | 0.92 (23/25) | 1.00 (25/25) |
| | With DL model | 1.00 | 1.00 (50/50) | 1.00 (25/25) | 1.00 (25/25) |
| | Comparison | 0.149 | 0.480 | 0.480 | N/A |
| Reader C | Without DL model | 0.98 | 0.98 (49/50) | 0.96 (24/25) | 1.00 (25/25) |
| | With DL model | 1.00 | 1.00 (50/50) | 1.00 (25/25) | 1.00 (25/25) |
| | Comparison | 0.317 | 1.000 | 1.000 | N/A |
| Reader D | Without DL model | 0.94 | 0.94 (47/50) | 0.88 (22/25) | 1.00 (25/25) |
| | With DL model | 1.00 | 0.98 (49/50) | 0.96 (24/25) | 1.00 (25/25) |
| | Comparison | 0.073 | 0.480 | 0.480 | N/A |
AUC, area under the curve; DL, deep learning; N/A, not applicable.
Accuracy, sensitivity, and specificity were calculated at the threshold that maximized the Youden index. AUC values were compared by using DeLong's test, and accuracy, sensitivity, and specificity were compared by using McNemar's test.
ᵃ Statistically significant difference.
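
The two procedures named in the table notes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The function names `youden_threshold` and `mcnemar_p` and the scores and labels are hypothetical. The notes do not say which variant of McNemar's test was used; the sketch assumes the continuity-corrected chi-square form. The discordant-pair counts in the usage lines follow from the patient-based sensitivity of Reader B: all 25 positive patients were classified correctly with the DL model and 23 without it, so two patients changed from incorrect to correct and none changed the other way. DeLong's test is omitted for brevity.

```python
from math import erfc, sqrt


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is called positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j


def mcnemar_p(b, c):
    """P value of McNemar's test (continuity-corrected chi-square, 1 df).

    b: cases correct without assistance and incorrect with it
    c: cases incorrect without assistance and correct with it
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    chi2 = (abs(b - c) - 1) ** 2 / n
    # Survival function of a chi-square variable with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2.0))


# Hypothetical model scores and reference labels (1 = positive).
scores = [0.10, 0.15, 0.20, 0.30, 0.45, 0.55, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
threshold, j = youden_threshold(scores, labels)
print(threshold, round(j, 2))  # 0.45 0.8

# Reader B, patient-based sensitivity: b = 0, c = 2 (derived from the table).
print(round(mcnemar_p(0, 2), 3))  # 0.48
```

Under this assumption, the second call gives the 0.480 listed for that comparison. An exact binomial version of the test would give 0.500 for the same counts, so the choice of variant matters when few pairs are discordant.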